SES5 Install not adding all OSDs

Hi,

I am doing a new install of a 6-node SES5 cluster, with 9 disks in each node. After running all the Ceph stages from 0 to 4, I was expecting all 54 disks to be in the cluster, but I found only 31. As per the documentation, I did a complete wipe of each of the 54 disks beforehand. I have run stages 1 to 3 multiple times, but the remaining disks are still not being added. Can anyone please suggest what the issue could be here and how to solve it?

For example, for node6 all 9 disks are listed in the proposal, but after the SES5 install I found that only 4 of them were added as OSDs.

salt '*' pillar.items

sosesn6.swlab.net:
----------
available_roles:

benchmark:
    ----------
    
ceph:
    ----------
    storage:
        ----------
        osds:
            ----------
            /dev/sdb:
                ----------
                format:
                    bluestore
            /dev/sdc:
                ----------
                format:
                    bluestore
            /dev/sdd:
                ----------
                format:
                    bluestore
            /dev/sde:
                ----------
                format:
                    bluestore
            /dev/sdf:
                ----------
                format:
                    bluestore
            /dev/sdg:
                ----------
                format:
                    bluestore
            /dev/sdh:
                ----------
                format:
                    bluestore
            /dev/sdi:
                ----------
                format:
                    bluestore
            /dev/sdj:
                ----------
                format:
                    bluestore

ceph-disk list

/dev/sda :
/dev/sda1 swap, swap
/dev/sda2 other, ext4, mounted on /
/dev/sdb other, unknown
/dev/sdc other, unknown
/dev/sdd other, unknown
/dev/sde :
/dev/sde1 ceph data, active, cluster ceph, osd.22, block /dev/sde2
/dev/sde2 ceph block, for /dev/sde1
/dev/sdf other, unknown
/dev/sdg :
/dev/sdg1 ceph data, active, cluster ceph, osd.16, block /dev/sdg2
/dev/sdg2 ceph block, for /dev/sdg1
/dev/sdh other, unknown
/dev/sdi :
/dev/sdi1 ceph data, active, cluster ceph, osd.11, block /dev/sdi2
/dev/sdi2 ceph block, for /dev/sdi1
/dev/sdj :
/dev/sdj1 ceph data, active, cluster ceph, osd.3, block /dev/sdj2
/dev/sdj2 ceph block, for /dev/sdj1

Thanks & Regards,
Shashi Kanth.

Hi,

First of all, check the ceph-osd.log files (in /var/log/ceph/) on one of the servers that doesn’t deploy all the OSDs it is supposed to. There should be something in there revealing the cause.
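For example, something along these lines should surface recent failures (a sketch; the exact log file names depend on which OSD IDs exist on that node):

# Scan all OSD logs on the node for errors/failures and show the latest hits
grep -iE 'error|fail' /var/log/ceph/ceph-osd.*.log | tail -n 50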

Also, the DeepSea monitor usually shows what’s going on (run it in a second session on the admin node), and even the execution of stage.X with salt should show error messages.
Sometimes after wiping the disks only a reboot helps to get rid of certain symptoms; after that you can retry the stages.
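As a sketch, assuming the standard DeepSea tooling on SES5 (deepsea-cli and salt-run), re-running stage 3 while watching the monitor could look like this:

# Session 1 on the admin node: watch the orchestration progress
deepsea monitor

# Session 2 on the admin node: re-run the deployment stage and look for failed states
salt-run state.orch ceph.stage.3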

Another option would be a manual deployment of a single OSD to see what exactly fails, in case the orchestrated deployment takes too long or its output isn’t clear enough. To split the creation into two steps:

ceph-disk prepare /dev/sdX
ceph-disk activate /dev/sdX
If “prepare” already fails, analyze the logs and find out what’s wrong.
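If it does fail, re-running it with verbose output often makes the failing step obvious (a sketch; -v is ceph-disk’s verbose switch):

# Verbose run of the prepare step on one disk
ceph-disk -v prepare /dev/sdX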

Thank you for the pointer.

“ceph-disk prepare” goes through, but “ceph-disk activate” gives the below error. Not sure what can be done now.

ceph-disk activate /dev/sdj

ceph-disk: Cannot discover filesystem type: device /dev/sdj: Line is truncated:

In the “activate” command you’re supposed to provide the respective partition of the prepared disk, so probably ceph-disk activate /dev/sdj1.
From the previous output I see that /dev/sdj had already been partitioned in a previous run. If the “activate” command still doesn’t work, I would wipe that disk and start over with no partitions on /dev/sdj, then run the manual steps for this disk again to see if that changes anything.
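A minimal sketch of that sequence for /dev/sdj, assuming the disk can be completely wiped (this destroys any data on it):

# Remove the old partition table and Ceph signatures from the disk
ceph-disk zap /dev/sdj

# Recreate the OSD from scratch: prepare the disk, then activate the data partition it creates
ceph-disk prepare /dev/sdj
ceph-disk activate /dev/sdj1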