SES6 - Stage.3 disks.deploy does not deploy all disks

Hi,

there is a problem with the deployment when using many disks.
I have 6 JBOD boxes with 60 disks each.
The deployment creates 10 OSDs on each server, but the 11th hangs, and after around 10 hours the deployment process finishes with this error:
Module function cephprocesses.wait executed

Deepsea version: deepsea 0.9.23+git.0.6a24f24a0

On all OSD servers I see the same messages for the 11th disk:

[2019-11-18 20:29:42,576][ceph_volume.process][INFO ] Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-61
[2019-11-18 20:29:42,580][ceph_volume.process][INFO ] Running command: /bin/systemctl enable ceph-volume@lvm-61-4bdfff79-105f-4ce8-ab96-f8866aa2021e
[2019-11-18 20:29:42,590][ceph_volume.process][INFO ] stderr Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-61-4bdfff79-105f-4ce8-ab96-f8866aa2021e.service → /usr/lib/systemd/system/ceph-volume@.service.
[2019-11-18 20:29:42,874][ceph_volume.process][INFO ] Running command: /bin/systemctl enable --runtime ceph-osd@61

These are the last log messages for the 11th disk.

I could start the service manually, but I have more than 300 disks. The deployment did not create all the OSDs; it stopped at the 11th.

It is strange that the message Unknown lvalue ‘LockPersonality’ in section ‘Service’ appears exactly 10 times in the messages log before the 11th disk is deployed:

message repeated 10 times: [ /usr/lib/systemd/system/ceph-osd@.service:15: Unknown lvalue ‘LockPersonality’ in section ‘Service’]

This is not limited to one server; all 6 OSD servers show the same behaviour.
It seems something is blocking the creation of more OSDs, but nothing about it is recorded in any log.

By the way, the environment is:
SUSE Linux Enterprise Server 15 SP1
SUSE Enterprise Storage 6
deepsea 0.9.23+git.0.6a24f24a0

6x OSD servers, each with a direct-attached JBOD (60 disks) = 360 disks
3x MON
1x Admin

salt-run disks.report correctly shows all the disks that need to be deployed.

The installation and deployment strictly followed SUSE's documentation.

For now, I will remove all OSDs and deploy again without the LockPersonality parameter in the service file.

Thanks in advance for helping.

P.S.: We have paid for 5 years of support, but unfortunately SUSE support is not treating this as a problem/bug. I hope someone in the community can help.

Kind regards

DeepSea certainly should create all OSDs that are defined by the DriveGroup configuration file. Can you send me the service request number via PM, please? I’ll have a look at it then.

Thanks,
Joerg

[QUOTE=jadergiacon;58937]Hi,
[…]
P.S.: We have paid for 5 years of support, but unfortunately SUSE support is not treating this as a problem/bug. I hope someone in the community can help.

[/QUOTE]

Sorry to hear that; there must be some kind of miscommunication or error here. Please reach out to Jörg with the service request number and we’ll dig into it. The situation you describe is definitely something to report to support, so let us investigate what went wrong.

Andreas

Hi Joerg,

The SR is 101266502271.
It was initially created due to a problem identifying the cluster vs. public network.
But the most important issue is the disk deployment hanging, which is the one I mentioned here.

I am performing many tests to try to work around this problem.
For example:
Commenting out the LockPersonality parameter in the systemd…ceph-… service files only stopped the error messages in the log.
Now I am trying with the tuned profiles off, as I have seen some related issues on the DeepSea GitHub page: https://github.com/SUSE/DeepSea/issues?utf8=%E2%9C%93&q=hang

Thanks for the help,
Jader

Dear Suse,

After many tests, my colleague and I were able to find the source of the bug.

The bug is in dg.deploy when it calls this command:
root 34361 33264 0 16:03 ? 00:00:03 /usr/bin/python3 /usr/sbin/ceph-volume lvm batch --no-auto /dev/nvme0n1 /dev/sda /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax /dev/sday /dev/sdaz /dev/sdb /dev/sdba /dev/sdbb /dev/sdbc /dev/sdbd /dev/sdbe /dev/sdbf /dev/sdbg /dev/sdbh /dev/sdbi /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz --yes

This is an example from my environment.

This process (in this case PID 34361) has pipe file descriptors in /proc/34361/fd.
At this step there are two pipes there, named 1 and 2.

I ran cat on the pipes and saw that pipe number 2 was holding a lot of data.
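
For reference, here is a minimal sketch of this check, assuming a Linux host with Python 3 and run as root; the PID 34361 is just the one from the listing above, and fds 1 and 2 are the child's stdout and stderr. It lists the pipe descriptors and queries their buffer capacity with the Linux-only F_GETPIPE_SZ fcntl (note that reading from the pipes, e.g. with cat, consumes data the parent would otherwise see):

# inspect_pipes.py -- query the pipe buffers of the hanging ceph-volume process
import fcntl
import os
import stat

PID = 34361                                           # PID of the ceph-volume process shown above
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)   # Linux-only fcntl; 1032 on older Pythons

fd_dir = "/proc/%d/fd" % PID
for name in sorted(os.listdir(fd_dir), key=int):
    path = os.path.join(fd_dir, name)
    try:
        if not stat.S_ISFIFO(os.stat(path).st_mode):  # only the pipes (fds 1 and 2 here)
            continue
        with open(path, "rb", buffering=0) as f:      # re-open the same pipe via /proc
            size = fcntl.fcntl(f.fileno(), F_GETPIPE_SZ)
        print("fd %s: pipe, buffer capacity %d bytes" % (name, size))
    except OSError:
        pass                                          # fd disappeared or permission denied

The default pipe capacity on Linux is 64 KiB, which is consistent with the deployment running for a while and then stopping once that buffer is full.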

So I cleaned my cluster and ran the deployment again, but on one node, as soon as the deployment called this process, I increased the pipe size to the maximum (1 MB) and SUCCESS!!
All disks on that node were deployed.
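
For anyone who wants to try the same workaround, here is a minimal sketch of what "increase the pipe size to the maximum" looks like, assuming a Linux host with Python 3 and root; the PID and the 1 MiB limit (the usual /proc/sys/fs/pipe-max-size) are taken from this example, and the Linux-only F_SETPIPE_SZ fcntl does the resize:

# grow_pipes.py -- raise the child's stdout/stderr pipe buffers to 1 MiB (run as root)
import fcntl
import os
import stat

PID = 34361                                           # ceph-volume PID on the stuck node
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)   # Linux-only fcntl; 1031 on older Pythons
NEW_SIZE = 1024 * 1024                                # 1 MiB, the usual /proc/sys/fs/pipe-max-size

for name in ("1", "2"):                               # stdout and stderr of the child
    path = "/proc/%d/fd/%s" % (PID, name)
    if not stat.S_ISFIFO(os.stat(path).st_mode):
        continue                                      # skip fds that are not pipes
    with open(path, "rb", buffering=0) as f:          # any fd on the same pipe object works
        actual = fcntl.fcntl(f.fileno(), F_SETPIPE_SZ, NEW_SIZE)
    print("fd %s: buffer grown to %d bytes" % (name, actual))

A bigger buffer only buys time, of course: if nothing ever reads the pipe, a sufficiently chatty ceph-volume run can still fill 1 MiB.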

Of course, the deployment hung again, because I did not do this on all the OSD servers.

Thus, there is a bug in the deployment process: it is not “paying attention” to the pipe size, or it is not draining the pipe, or it is not increasing its size.
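
To make the failure mode concrete, here is a generic Python illustration; this is not DeepSea's actual code, and the chatty child below just stands in for ceph-volume, but it shows why a parent that waits on a child without draining its stdout/stderr pipes hangs exactly like this:

# pipe_deadlock.py -- generic illustration, not DeepSea's actual code
import subprocess

# A chatty child that writes ~200 KB to stderr, standing in for ceph-volume here.
cmd = ["python3", "-c", "import sys; sys.stderr.write('x' * 200000)"]

# Correct pattern: communicate() keeps draining stdout/stderr while waiting,
# so the child can never block on a full pipe, however much it prints.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
print("drained %d bytes of stderr, child exited with %d" % (len(err), proc.returncode))

# Broken pattern: nobody reads the pipes.  After roughly one pipe buffer
# (64 KiB by default) the child's write() blocks, wait() never returns,
# and the run looks exactly like the hang described in this thread.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
proc.wait()                                           # deadlocks: child stuck writing, parent stuck waiting

However the fix ends up looking in DeepSea, the general cure is for the caller to keep reading the pipes (as communicate() does in the sketch) or to redirect the output elsewhere, rather than relying on the pipe buffer.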

I think this is a critical bug and it needs a fix as soon as possible.

Thanks again for the help,
Jader

Thanks, Jader.

A new Support ticket was created by the customer and SUSE Support & engineering are reviewing the situation.

Andreas

Just for the record and because of xkcd 979, the Pull Request https://github.com/SUSE/DeepSea/pull/1812 (commit 903e258) has a fix, currently waiting to be merged.