Hi @Academy_Instructor ,
Thank you very much for your quick reply. Your suggestion worked for me!
Per your suggestion, I shut down the 3 monitor nodes and powered them back on one after the other, in each case waiting for the boot process to finish and for CPU usage to stabilize before starting the next “monitor” node.
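(In case it helps anyone else doing the same recovery: before starting the next monitor node, something along these lines can be used to confirm that the previous monitor is running and has rejoined its peers. I’m assuming here that the ceph-mon systemd units are named after the hostnames, as in my lab:)
ssh mon1 systemctl is-active ceph-mon@mon1.service
ssh mon1 ceph daemon mon.mon1 mon_status | grep '"state"'
The first command should print “active”, and the second should report a state of “leader” or “peon” once the monitor has joined the others (it shows “probing” or “electing” while it is still waiting for them).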
Now the “ceph -s” command on the “admin” node in my lab is no longer hanging and returns the following output:
admin:~ # ceph -s
cluster:
id: f2b0bde4-8ecc-4900-ab34-7d0234101292
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 142/645 objects degraded (22.016%), 27 pgs degraded, 317 pgs undersized
317 pgs not deep-scrubbed in time
317 pgs not scrubbed in time
services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 14m)
mgr: mon3(active, since 14m), standbys: mon2, mon1
mds: cephfs:1 {0=mon1=up:active}
osd: 9 osds: 7 up (since 10m), 9 in (since 10M)
data:
pools: 7 pools, 480 pgs
objects: 215 objects, 4.2 KiB
usage: 7.1 GiB used, 126 GiB / 133 GiB avail
pgs: 142/645 objects degraded (22.016%)
290 active+undersized
163 active+clean
27 active+undersized+degraded
io:
client: 851 B/s rd, 0 op/s rd, 0 op/s wr
admin:~ #
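(Side note, in case it is useful to someone else hitting the same warning: “ceph health detail” expands the HEALTH_WARN summary and lists exactly which OSDs are down and which PGs are degraded or behind on scrubbing, so it is a good first step before digging further. I did not capture its output here, but the command is simply:)
ceph health detail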
EDIT: Because I noticed the “2 osds down” in the output above, I then followed the instructions in the “Task 5: Are the OSDs up and running properly?” section of the PDF, namely:
1 - I ran the “ceph osd tree” command to find the culprits:
admin:~ # ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.16727 root default
-3 0.05576 host data1
0 hdd 0.01859 osd.0 down 1.00000 1.00000
3 hdd 0.01859 osd.3 up 1.00000 1.00000
7 hdd 0.01859 osd.7 down 1.00000 1.00000
-5 0.05576 host data2
2 hdd 0.01859 osd.2 up 1.00000 1.00000
5 hdd 0.01859 osd.5 up 1.00000 1.00000
8 hdd 0.01859 osd.8 up 1.00000 1.00000
-7 0.05576 host data3
1 hdd 0.01859 osd.1 up 1.00000 1.00000
4 hdd 0.01859 osd.4 up 1.00000 1.00000
6 hdd 0.01859 osd.6 up 1.00000 1.00000
2 - Based on the output above, the problem seemed to lie with “osd.0” and “osd.7”, both “down” on host “data1”. So I restarted those two services:
admin:~ # ssh data1 systemctl restart ceph-osd@0.service
admin:~ # ssh data1 systemctl restart ceph-osd@7.service
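(Before restarting them, it is also possible to check why the two daemons went down by looking at their journal on the OSD host. The unit names below match my lab; adjust them as needed:)
ssh data1 journalctl -u ceph-osd@0.service -n 50 --no-pager
ssh data1 journalctl -u ceph-osd@7.service -n 50 --no-pager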
3 - After doing this, the output of the “ceph osd tree” command looked good (everything was “up”):
admin:~ # ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.16727 root default
-3 0.05576 host data1
0 hdd 0.01859 osd.0 up 1.00000 1.00000
3 hdd 0.01859 osd.3 up 1.00000 1.00000
7 hdd 0.01859 osd.7 up 1.00000 1.00000
-5 0.05576 host data2
2 hdd 0.01859 osd.2 up 1.00000 1.00000
5 hdd 0.01859 osd.5 up 1.00000 1.00000
8 hdd 0.01859 osd.8 up 1.00000 1.00000
-7 0.05576 host data3
1 hdd 0.01859 osd.1 up 1.00000 1.00000
4 hdd 0.01859 osd.4 up 1.00000 1.00000
6 hdd 0.01859 osd.6 up 1.00000 1.00000
4 - At this point, “ceph -s” still reported “HEALTH_WARN”, but it no longer showed any “osds down”:
admin:~ # ceph -s
cluster:
id: f2b0bde4-8ecc-4900-ab34-7d0234101292
health: HEALTH_WARN
Degraded data redundancy: 59/645 objects degraded (9.147%), 7 pgs degraded
301 pgs not deep-scrubbed in time
301 pgs not scrubbed in time
services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 53m)
mgr: mon3(active, since 53m), standbys: mon2, mon1
mds: cephfs:1 {0=mon1=up:active}
osd: 9 osds: 9 up (since 9s), 9 in (since 10M); 3 remapped pgs
data:
pools: 7 pools, 480 pgs
objects: 215 objects, 4.2 KiB
usage: 9.1 GiB used, 162 GiB / 171 GiB avail
pgs: 59/645 objects degraded (9.147%)
473 active+clean
6 active+recovery_wait+undersized+degraded+remapped
1 active+recovering+undersized+degraded+remapped
io:
recovery: 0 B/s, 2 objects/s
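(While the recovery is running, progress can be followed either by re-running “ceph -s” every so often or by streaming the cluster log, for example:)
ceph -w              # streams cluster log / health and recovery events until interrupted
watch -n 5 ceph -s   # or simply refresh the status output every 5 seconds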
5 - And, after a few minutes, the health changed from “HEALTH_WARN” to “HEALTH_OK”:
admin:~ # ceph status
cluster:
id: f2b0bde4-8ecc-4900-ab34-7d0234101292
health: HEALTH_OK
services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 63m)
mgr: mon3(active, since 63m), standbys: mon2, mon1
mds: cephfs:1 {0=mon1=up:active}
osd: 9 osds: 9 up (since 10m), 9 in (since 10M)
data:
pools: 7 pools, 480 pgs
objects: 215 objects, 4.2 KiB
usage: 9.1 GiB used, 162 GiB / 171 GiB avail
pgs: 480 active+clean
io:
client: 852 B/s rd, 0 op/s rd, 0 op/s wr
admin:~ # ceph -s
cluster:
id: f2b0bde4-8ecc-4900-ab34-7d0234101292
health: HEALTH_OK
services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 64m)
mgr: mon3(active, since 63m), standbys: mon2, mon1
mds: cephfs:1 {0=mon1=up:active}
osd: 9 osds: 9 up (since 10m), 9 in (since 10M)
data:
pools: 7 pools, 480 pgs
objects: 215 objects, 4.2 KiB
usage: 9.1 GiB used, 162 GiB / 171 GiB avail
pgs: 480 active+clean
io:
client: 851 B/s rd, 0 op/s rd, 0 op/s wr
So, everything seems to be looking good!