HA on SLES 15 with lvmlockd


I have configured a 2-node cluster on SLES 15 with LVM in exclusive mode. The problem is that when I fence or reboot the active node, the resources do not move to the secondary node. Failover stops at the activation of the LVM volume group:

2019-01-25T14:20:33.996488+01:00 sles15cl2 pengine[1828]: notice: Watchdog will be used via SBD if fencing is required
2019-01-25T14:20:33.997068+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.997296+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.998199+01:00 sles15cl2 pengine[1828]: warning: Forcing vgcluster away from sles15cl2 after 1000000 failures (max=3)

Failed Actions:

  • lvmlockd_stop_0 on sles15cl1 ‘not installed’ (5): call=56, status=Not installed, exitreason=’’,
    last-rc-change=‘Fri Jan 25 14:41:07 2019’, queued=1ms, exec=1ms

Failed Actions:

  • vgcluster_start_0 on sles15cl2 ‘unknown error’ (1): call=82, status=Timed Out, exitreason=’’,
    last-rc-change=‘Fri Jan 25 13:49:36 2019’, queued=0ms, exec=90003ms
  • vgcluster_start_0 on sles15cl1 ‘not configured’ (6): call=39, status=complete, exitreason=‘lvmlockd daemon is not running!’,
    last-rc-change=‘Fri Jan 25 13:51:06 2019’, queued=0ms, exec=308ms

sles15cl2:~ # crm status
Stack: corosync
Current DC: sles15cl2 (version 1.1.18+20180430.b12c320f5-1.14-b12c320f5) - partition with quorum
Last updated: Fri Jan 25 14:00:59 2019
Last change: Fri Jan 25 14:00:55 2019 by root via cibadmin on sles15cl2

2 nodes configured
10 resources configured

Online: [ sles15cl1 sles15cl2 ]

Full list of resources:

admin-ip (ocf::heartbeat:IPaddr2): Started sles15cl2
stonith-sbd (stonith:external/sbd): Started sles15cl2
Clone Set: cl-storage [g-storage]
    Started: [ sles15cl1 sles15cl2 ]
Resource Group: apache-group
    ip-apache (ocf::heartbeat:IPaddr2): Started sles15cl1
    vgcluster (ocf::heartbeat:LVM-activate): Stopped
    clusterfs (ocf::heartbeat:Filesystem): Stopped
    service-apache (ocf::heartbeat:apache): Stopped

Failed Actions:

  • vgcluster_start_0 on sles15cl2 ‘unknown error’ (1): call=82, status=Timed Out, exitreason=’’,
    last-rc-change=‘Fri Jan 25 13:49:36 2019’, queued=0ms, exec=90003ms
  • vgcluster_start_0 on sles15cl1 ‘not configured’ (6): call=39, status=complete, exitreason=‘lvmlockd daemon is not running!’,
    last-rc-change=‘Fri Jan 25 13:51:06 2019’, queued=0ms, exec=308ms

It looks as if lvmlockd is not running, but it is running:

sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep dlm
root 2714 1 0 14:43 ? 00:00:00 dlm_controld -s 0
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 4040 2 0 14:45 ? 00:00:00 [dlm_scand]
root 4041 2 0 14:45 ? 00:00:00 [dlm_recv]
root 4042 2 0 14:45 ? 00:00:00 [dlm_send]
root 4043 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 4050 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 23871 2919 0 15:16 pts/0 00:00:00 grep --color=auto dlm

sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep lvm
root 381 1 0 14:42 ? 00:00:00 /usr/sbin/lvmetad -f
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 23957 2919 0 15:16 pts/0 00:00:00 grep --color=auto lvm
sles15cl2:/usr/lib/ocf/resource.d/heartbeat #

It looks like the bug described here:

The resources can only be started after I run: crm resource cleanup

Is there any other workaround? If not, then this is not really a cluster…





To find the cause, you need to debug it.
Power off one of the nodes and wait for the failure, but don’t clean it up.
Then run the resource in debug mode. You can check how to do it here: https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
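
The steps above could look roughly like the following sketch. It assumes the failing resource is vgcluster (taken from the status output earlier in the thread) and that it is run as root on the surviving node once the failure has appeared. The `run` and `debug_steps` helpers and the `RUN` and `RESOURCE` variables are made up for this example. By default the script only prints the commands; set RUN=1 to actually execute them.

```shell
#!/bin/sh
# Sketch only: RESOURCE, RUN, run() and debug_steps() are hypothetical
# names for this example, not part of Pacemaker or crmsh.
RESOURCE="${RESOURCE:-vgcluster}"

run() {
    # Print each command; execute it only when RUN=1 is set.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

debug_steps() {
    # 1. One-shot status, including inactive resources and fail counts,
    #    taken before anything is cleaned up.
    run crm_mon -1 -r -f

    # 2. Run the agent's start action by hand, outside cluster control,
    #    with verbose output so the agent's own messages are visible.
    run crm_resource --resource "$RESOURCE" --force-start -V

    # 3. Only after the output has been collected, clear the failure.
    run crm resource cleanup "$RESOURCE"
}

debug_steps
```

Saved under a name of your choice, the script can be reviewed with a plain `sh` run first and then executed for real with `RUN=1 sh` in front of the same file name. The output of step 2 is the part worth posting back here.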