"Lab 1.1 - Start the Lab Environment" - "ceph -s" shows no output (only cursor blinking)

Hi everyone,

I’m beginning to follow this new SUSE Academy SES201v6 (“SUSE Enterprise Storage 6 Basic Operations”) course.
Today (27th July 2020), I’ve watched the first few videos for Module 1 (“01 - Week One - SES201v6 Course Introduction and Overview”).

I’m now doing the first lab (“Lab 1.1 - Start the Lab Environment (10 mins)”). In the Lab Environment, I successfully started all three “Monitor” nodes (“mon1”, “mon2” and “mon3”), the first three “Data” nodes (“data1”, “data2” and “data3”) and also the “admin” node.

I’m also following the Lab Guide for this first lab (“lab_exercises_1.1v2.pdf”), including the last section, “Resolve SES Cluster Startup Issues”. I’m now at “Task 1: Check the cluster’s health” of that section. That task says to run the command “ceph -s” (as “root”) in a terminal session on the “admin” node and to evaluate the output. However, when I enter that command and press ENTER, I just get a blinking (block) cursor, and it has now been in that state for more than 10 minutes. Is that to be expected?
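For what it’s worth, I understand the ceph client accepts a connect timeout, so the command should return an error instead of blocking indefinitely when the monitors cannot be reached (a general CLI option, not something the lab guide mentions):

admin:~ # ceph --connect-timeout 10 -s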

Hello, that is a symptom of a communication problem with one or more of the monitor nodes. Please shut down and restart all 3 monitor nodes and try again.
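If rebooting the VMs is not convenient, restarting just the monitor daemons on each node may also be enough. This is a general Ceph approach rather than the lab guide’s procedure, and it assumes the standard ceph-mon@<hostname> unit names, e.g.:

admin:~ # ssh mon1 systemctl restart ceph-mon@mon1.service
admin:~ # ssh mon2 systemctl restart ceph-mon@mon2.service
admin:~ # ssh mon3 systemctl restart ceph-mon@mon3.service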

Hi @Academy_Instructor ,

Thank you very much for your quick reply. Your suggestion worked for me! :smile:
Per your suggestion, I shut down the 3 monitor nodes and powered them back on one after the other, in each case waiting for the boot process to finish and for CPU usage to stabilize before starting the next monitor node.
Now the “ceph -s” command on the “admin” node in my lab is no longer hanging and returns the following output:

admin:~ # ceph -s
  cluster:
    id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
    health: HEALTH_WARN
            2 osds down
            Degraded data redundancy: 142/645 objects degraded (22.016%), 27 pgs degraded, 317 pgs undersized
            317 pgs not deep-scrubbed in time
            317 pgs not scrubbed in time
 
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 14m)
    mgr: mon3(active, since 14m), standbys: mon2, mon1
    mds: cephfs:1 {0=mon1=up:active}
    osd: 9 osds: 7 up (since 10m), 9 in (since 10M)
 
  data:
    pools:   7 pools, 480 pgs
    objects: 215 objects, 4.2 KiB
    usage:   7.1 GiB used, 126 GiB / 133 GiB avail
    pgs:     142/645 objects degraded (22.016%)
             290 active+undersized
             163 active+clean
             27  active+undersized+degraded
 
  io:
    client:   851 B/s rd, 0 op/s rd, 0 op/s wr
 
admin:~ # 
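For anyone else who sees the “2 osds down” warning here: running “ceph health detail” expands each health warning and lists exactly which OSDs are down, which is a quick complement to the “ceph osd tree” approach used in the EDIT below:

admin:~ # ceph health detail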

EDIT: Because I noticed the “2 osds down” warning in the output above, I then followed the instructions in the “Task 5: Are the OSDs “up” and running properly?” section of the PDF, namely:

1 - I ran the “ceph osd tree” command to find the culprits:

admin:~ # ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
-1       0.16727 root default                           
-3       0.05576     host data1                         
 0   hdd 0.01859         osd.0    down  1.00000 1.00000 
 3   hdd 0.01859         osd.3      up  1.00000 1.00000 
 7   hdd 0.01859         osd.7    down  1.00000 1.00000 
-5       0.05576     host data2                         
 2   hdd 0.01859         osd.2      up  1.00000 1.00000 
 5   hdd 0.01859         osd.5      up  1.00000 1.00000 
 8   hdd 0.01859         osd.8      up  1.00000 1.00000 
-7       0.05576     host data3                         
 1   hdd 0.01859         osd.1      up  1.00000 1.00000 
 4   hdd 0.01859         osd.4      up  1.00000 1.00000 
 6   hdd 0.01859         osd.6      up  1.00000 1.00000 
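As a shortcut, I believe “ceph osd tree” on the Nautilus release that SES 6 is based on also accepts a status filter, so you can list only the down OSDs directly:

admin:~ # ceph osd tree down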

2 - Based on the output above, the problem seemed to lie with “osd.0” (“down”) and “osd.7” (also “down”) on “data1”. So, I restarted those two services:

admin:~ # ssh data1 systemctl restart ceph-osd@0.service
admin:~ # ssh data1 systemctl restart ceph-osd@7.service
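Alternatively, since both down OSDs were on the same host, all OSD daemons on “data1” can be restarted in one go via the standard systemd target that groups them (not what the lab guide asks for, but it should have the same effect here):

admin:~ # ssh data1 systemctl restart ceph-osd.target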

3 - After doing this, the output of the “ceph osd tree” command was looking good again (everything was “up”):

admin:~ # ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
-1       0.16727 root default                           
-3       0.05576     host data1                         
 0   hdd 0.01859         osd.0      up  1.00000 1.00000 
 3   hdd 0.01859         osd.3      up  1.00000 1.00000 
 7   hdd 0.01859         osd.7      up  1.00000 1.00000 
-5       0.05576     host data2                         
 2   hdd 0.01859         osd.2      up  1.00000 1.00000 
 5   hdd 0.01859         osd.5      up  1.00000 1.00000 
 8   hdd 0.01859         osd.8      up  1.00000 1.00000 
-7       0.05576     host data3                         
 1   hdd 0.01859         osd.1      up  1.00000 1.00000 
 4   hdd 0.01859         osd.4      up  1.00000 1.00000 
 6   hdd 0.01859         osd.6      up  1.00000 1.00000 

4 - By this time, “ceph -s” still reported “HEALTH_WARN”, but it no longer showed any “osds down”:

admin:~ # ceph -s
  cluster:
    id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
    health: HEALTH_WARN
            Degraded data redundancy: 59/645 objects degraded (9.147%), 7 pgs degraded
            301 pgs not deep-scrubbed in time
            301 pgs not scrubbed in time
 
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 53m)
    mgr: mon3(active, since 53m), standbys: mon2, mon1
    mds: cephfs:1 {0=mon1=up:active}
    osd: 9 osds: 9 up (since 9s), 9 in (since 10M); 3 remapped pgs
 
  data:
    pools:   7 pools, 480 pgs
    objects: 215 objects, 4.2 KiB
    usage:   9.1 GiB used, 162 GiB / 171 GiB avail
    pgs:     59/645 objects degraded (9.147%)
             473 active+clean
             6   active+recovery_wait+undersized+degraded+remapped
             1   active+recovering+undersized+degraded+remapped
 
  io:
    recovery: 0 B/s, 2 objects/s
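While recovery is in progress you can also follow it live instead of re-running the status command; “ceph -w” prints the status once and then streams cluster events until interrupted:

admin:~ # ceph -w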

5 - And, after a few minutes, the health changed from “HEALTH_WARN” to “HEALTH_OK”:

admin:~ # ceph status
  cluster:
    id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 63m)
    mgr: mon3(active, since 63m), standbys: mon2, mon1
    mds: cephfs:1 {0=mon1=up:active}
    osd: 9 osds: 9 up (since 10m), 9 in (since 10M)
 
  data:
    pools:   7 pools, 480 pgs
    objects: 215 objects, 4.2 KiB
    usage:   9.1 GiB used, 162 GiB / 171 GiB avail
    pgs:     480 active+clean
 
  io:
    client:   852 B/s rd, 0 op/s rd, 0 op/s wr
 
admin:~ # ceph -s
  cluster:
    id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 64m)
    mgr: mon3(active, since 63m), standbys: mon2, mon1
    mds: cephfs:1 {0=mon1=up:active}
    osd: 9 osds: 9 up (since 10m), 9 in (since 10M)
 
  data:
    pools:   7 pools, 480 pgs
    objects: 215 objects, 4.2 KiB
    usage:   9.1 GiB used, 162 GiB / 171 GiB avail
    pgs:     480 active+clean
 
  io:
    client:   851 B/s rd, 0 op/s rd, 0 op/s wr
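A quicker check at this point than reading the full status is “ceph health”, which should now print just “HEALTH_OK”, or “ceph pg stat” for a one-line PG summary:

admin:~ # ceph health
admin:~ # ceph pg stat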

So, everything seems to be looking good! :smiley:

The same happens when the lab environment is resumed. No action is required, just wait. “ceph -s -w” will tell you when the whole environment has resumed.

I am glad this resolved your issue.

Hi,
@voleg4u : Thanks for the information (“The same happens when the lab environment is resumed. No action is required, just wait. ‘ceph -s -w’ will tell you when the whole environment has resumed.”).
@Academy_Instructor : Thanks again!

Hi @Academy_Instructor

“Please shut down and restart all 3 monitor nodes and try again.”
May I know: when I restart all 3 monitor nodes, do the data and admin nodes need to be shut down as well, or can they remain on while I just restart the 3 monitor nodes?

Also, is it sufficient to wait until a monitor node reaches the login prompt before I power on the next monitor node?

No, just the monitor nodes.
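Once each monitor is back up, a quick way to confirm from the admin node that it has rejoined the quorum before you move on is “ceph mon stat”; in this lab the expected quorum is mon1, mon2 and mon3:

admin:~ # ceph mon stat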