SES6 - Stage.3 disks.deploy does not deploy all disks

Hi,

there is a problem with the deployment when using many disks.
I have 6 JBOD boxes with 60 disks each.
The deployment creates 10 OSDs on each server, but the 11th hangs, and after around 10 hours the deployment process finishes with this error:
Module function cephprocesses.wait executed

Deepsea version: deepsea 0.9.23+git.0.6a24f24a0

On all OSD servers I see the same messages for the 11th disk:

[2019-11-18 20:29:42,576][ceph_volume.process][INFO ] Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-61
[2019-11-18 20:29:42,580][ceph_volume.process][INFO ] Running command: /bin/systemctl enable ceph-volume@lvm-61-4bdfff79-105f-4ce8-ab96-f8866aa2021e
[2019-11-18 20:29:42,590][ceph_volume.process][INFO ] stderr Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-61-4bdfff79-105f-4ce8-ab96-f8866aa2021e.service → /usr/lib/systemd/system/ceph-volume@.service.
[2019-11-18 20:29:42,874][ceph_volume.process][INFO ] Running command: /bin/systemctl enable --runtime ceph-osd@61

These are the last log messages for the 11th disk.

I could start the service manually, but I have more than 300 disks. The deployment did not create all the OSDs; it stopped at the 11th.

It is strange that the message Unknown lvalue ‘LockPersonality’ in section ‘Service’ appears exactly 10 times in the messages log before the 11th disk is deployed:

message repeated 10 times: [ /usr/lib/systemd/system/ceph-osd@.service:15: Unknown lvalue ‘LockPersonality’ in section ‘Service’]

This is not limited to one server; all 6 OSD servers show the same behaviour.
It seems something is blocking the creation of more OSDs, but nothing about it is recorded in any log.

By the way, the environment is:
SUSE Linux Enterprise Server 15 SP1
SUSE Enterprise Storage 6
deepsea 0.9.23+git.0.6a24f24a0

6x OSD servers, each with a direct-attached JBOD (60 disks) = 360 disks
3x MON
1x Admin

salt-run disks.report correctly shows all the disks that need to be deployed.

The installation and deployment strictly followed SUSE's documentation.

For now, I will remove all OSDs and deploy again without the LockPersonality parameter in the service file.

Thanks in advance for helping.

P.S.: We have paid for 5 years of support, but unfortunately SUSE support is not treating this as a problem/bug. I hope someone in the community can help.

Kind regards

DeepSea certainly should create all OSDs that are defined by the DriveGroup configuration file. Can you send me the service request number via PM, please? I’ll have a look at it then.

Thanks,
Joerg

[QUOTE=jadergiacon;58937]Hi,
[…]
P.S.: We have paid for 5 years of support, but unfortunately SUSE support is not treating this as a problem/bug. I hope someone in the community can help.

[/QUOTE]

Sorry to hear that; there must be some kind of miscommunication or error here. Please reach out to Jörg with the service request number and we’ll dig into it. The situation you describe is definitely something to report to support, so let us investigate what went wrong.

Andreas

Hi Joerg,

The SR is 101266502271.
It was initially created due to a problem identifying the cluster vs. public network.
But the most important issue is the disk deployment hanging, which is the one I mentioned here.

I am performing many tests to try to work around this problem.
For example:
Commenting out the LockPersonality parameter in the systemd…ceph-… service files only stopped the error messages in the log.
Now I am trying with the tuned profiles off, as I have seen some related issues on the DeepSea GitHub page: https://github.com/SUSE/DeepSea/issues?utf8=%E2%9C%93&q=hang

Thanks for the help,
Jader

Dear Suse,

After many tests, my colleague and I were able to find the source of the bug.

The bug is in dg.deploy when it calls this command:
root 34361 33264 0 16:03 ? 00:00:03 /usr/bin/python3 /usr/sbin/ceph-volume lvm batch --no-auto /dev/nvme0n1 /dev/sda /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax /dev/sday /dev/sdaz /dev/sdb /dev/sdba /dev/sdbb /dev/sdbc /dev/sdbd /dev/sdbe /dev/sdbf /dev/sdbg /dev/sdbh /dev/sdbi /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz --yes

This is an example from my environment.

This process (in this case PID 34361) has pipe file descriptors in /proc/34361/fd.
At this step there are two pipes there, named 1 and 2.

I ran cat on the pipes and saw that pipe number 2 was holding a lot of data.
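
For reference, here is a minimal sketch of this check, assuming a Linux host with Python 3 and run as root; the PID 34361 is just the one from the listing above, and fds 1 and 2 are the child's stdout and stderr. It lists the pipe descriptors and queries their buffer capacity with the Linux-only F_GETPIPE_SZ fcntl (note that reading from the pipes, e.g. with cat, consumes data the parent would otherwise see):

# inspect_pipes.py -- query the pipe buffers of the hanging ceph-volume process
import fcntl
import os
import stat

PID = 34361                                           # PID of the ceph-volume process shown above
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)   # Linux-only fcntl; 1032 on older Pythons

fd_dir = "/proc/%d/fd" % PID
for name in sorted(os.listdir(fd_dir), key=int):
    path = os.path.join(fd_dir, name)
    try:
        if not stat.S_ISFIFO(os.stat(path).st_mode):  # only the pipes (fds 1 and 2 here)
            continue
        with open(path, "rb", buffering=0) as f:      # re-open the same pipe via /proc
            size = fcntl.fcntl(f.fileno(), F_GETPIPE_SZ)
        print("fd %s: pipe, buffer capacity %d bytes" % (name, size))
    except OSError:
        pass                                          # fd disappeared or permission denied

The default pipe capacity on Linux is 64 KiB, which is consistent with the deployment running for a while and then stopping once that buffer is full.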

So I cleaned my cluster and ran the deployment again, but on one node, as soon as the deployment called this process, I increased the pipe size to the maximum (1 MB) and SUCCESS!!
All disks on that node were deployed.
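
For anyone who wants to try the same workaround, here is a minimal sketch of what "increase the pipe size to the maximum" looks like, assuming a Linux host with Python 3 and root; the PID and the 1 MiB limit (the usual /proc/sys/fs/pipe-max-size) are taken from this example, and the Linux-only F_SETPIPE_SZ fcntl does the resize:

# grow_pipes.py -- raise the child's stdout/stderr pipe buffers to 1 MiB (run as root)
import fcntl
import os
import stat

PID = 34361                                           # ceph-volume PID on the stuck node
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)   # Linux-only fcntl; 1031 on older Pythons
NEW_SIZE = 1024 * 1024                                # 1 MiB, the usual /proc/sys/fs/pipe-max-size

for name in ("1", "2"):                               # stdout and stderr of the child
    path = "/proc/%d/fd/%s" % (PID, name)
    if not stat.S_ISFIFO(os.stat(path).st_mode):
        continue                                      # skip fds that are not pipes
    with open(path, "rb", buffering=0) as f:          # any fd on the same pipe object works
        actual = fcntl.fcntl(f.fileno(), F_SETPIPE_SZ, NEW_SIZE)
    print("fd %s: buffer grown to %d bytes" % (name, actual))

A bigger buffer only buys time, of course: if nothing ever reads the pipe, a sufficiently chatty ceph-volume run can still fill 1 MiB.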

Of course, the deployment hung again, because I did not do this on all the OSD servers.

Thus, there is a bug in the deployment process: it is not “paying attention” to the pipe size, or it is not draining the pipe, or it is not increasing its size.
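
To make the failure mode concrete, here is a generic Python illustration; this is not DeepSea's actual code, and the chatty child below just stands in for ceph-volume, but it shows why a parent that waits on a child without draining its stdout/stderr pipes hangs exactly like this:

# pipe_deadlock.py -- generic illustration, not DeepSea's actual code
import subprocess

# A chatty child that writes ~200 KB to stderr, standing in for ceph-volume here.
cmd = ["python3", "-c", "import sys; sys.stderr.write('x' * 200000)"]

# Correct pattern: communicate() keeps draining stdout/stderr while waiting,
# so the child can never block on a full pipe, however much it prints.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
print("drained %d bytes of stderr, child exited with %d" % (len(err), proc.returncode))

# Broken pattern: nobody reads the pipes.  After roughly one pipe buffer
# (64 KiB by default) the child's write() blocks, wait() never returns,
# and the run looks exactly like the hang described in this thread.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
proc.wait()                                           # deadlocks: child stuck writing, parent stuck waiting

However the fix ends up looking in DeepSea, the general cure is for the caller to keep reading the pipes (as communicate() does in the sketch) or to redirect the output elsewhere, rather than relying on the pipe buffer.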

I think this is a critical bug and it needs a fix as soon as possible.

Thanks again for the help,
Jader

Thanks, Jader.

A new Support ticket was created by the customer and SUSE Support & engineering are reviewing the situation.

Andreas

Just for the record and because of xkcd 979, the Pull Request https://github.com/SUSE/DeepSea/pull/1812 (commit 903e258) has a fix, currently waiting to be merged.