HA in sles15 lvmlockd

Hi

I have configured a 2-node cluster on SLES 15 with LVM in exclusive mode. My problem is that when I fence or reboot the active node, the resources don't move to the secondary node. The failover stops at the activation of LVM:

2019-01-25T14:20:33.996488+01:00 sles15cl2 pengine[1828]: notice: Watchdog will be used via SBD if fencing is required
2019-01-25T14:20:33.997068+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.997296+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.998199+01:00 sles15cl2 pengine[1828]: warning: Forcing vgcluster away from sles15cl2 after 1000000 failures (max=3)

Failed Actions:

  * lvmlockd_stop_0 on sles15cl1 'not installed' (5): call=56, status=Not installed, exitreason='',
    last-rc-change='Fri Jan 25 14:41:07 2019', queued=1ms, exec=1ms

Failed Actions:

  * vgcluster_start_0 on sles15cl2 'unknown error' (1): call=82, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 25 13:49:36 2019', queued=0ms, exec=90003ms
  * vgcluster_start_0 on sles15cl1 'not configured' (6): call=39, status=complete, exitreason='lvmlockd daemon is not running!',
    last-rc-change='Fri Jan 25 13:51:06 2019', queued=0ms, exec=308ms

sles15cl2:~ # crm status
Stack: corosync
Current DC: sles15cl2 (version 1.1.18+20180430.b12c320f5-1.14-b12c320f5) - partition with quorum
Last updated: Fri Jan 25 14:00:59 2019
Last change: Fri Jan 25 14:00:55 2019 by root via cibadmin on sles15cl2

2 nodes configured
10 resources configured

Online: [ sles15cl1 sles15cl2 ]

Full list of resources:

admin-ip (ocf::heartbeat:IPaddr2): Started sles15cl2
stonith-sbd (stonith:external/sbd): Started sles15cl2
Clone Set: cl-storage [g-storage]
Started: [ sles15cl1 sles15cl2 ]
Resource Group: apache-group
ip-apache (ocf::heartbeat:IPaddr2): Started sles15cl1
vgcluster (ocf::heartbeat:LVM-activate): Stopped
clusterfs (ocf::heartbeat:Filesystem): Stopped
service-apache (ocf::heartbeat:apache): Stopped

Failed Actions:

  * vgcluster_start_0 on sles15cl2 'unknown error' (1): call=82, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 25 13:49:36 2019', queued=0ms, exec=90003ms
  * vgcluster_start_0 on sles15cl1 'not configured' (6): call=39, status=complete, exitreason='lvmlockd daemon is not running!',
    last-rc-change='Fri Jan 25 13:51:06 2019', queued=0ms, exec=308ms

According to the error, lvmlockd is not running, but it is running:

sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep dlm
root 2714 1 0 14:43 ? 00:00:00 dlm_controld -s 0
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 4040 2 0 14:45 ? 00:00:00 [dlm_scand]
root 4041 2 0 14:45 ? 00:00:00 [dlm_recv]
root 4042 2 0 14:45 ? 00:00:00 [dlm_send]
root 4043 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 4050 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 23871 2919 0 15:16 pts/0 00:00:00 grep --color=auto dlm

sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep lvm
root 381 1 0 14:42 ? 00:00:00 /usr/sbin/lvmetad -f
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 23957 2919 0 15:16 pts/0 00:00:00 grep --color=auto lvm
sles15cl2:/usr/lib/ocf/resource.d/heartbeat #
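
The `ps -ef | grep` check above also matches the grep itself and the kernel threads. A minimal sketch of an exact-name check follows, assuming `pgrep` from procps is installed; `check_daemon` is a hypothetical helper, not part of any cluster tool. Note that it only reports the state at the moment it runs. In the output above lvmlockd was started at 14:43, while the "lvmlockd daemon is not running!" failure is stamped 13:51, so the daemon may not have been up when the start was attempted.

```shell
#!/bin/sh
# Report whether a process with exactly this name is running.
# check_daemon is a hypothetical helper name used for illustration.
# If pgrep is missing, the daemon is reported as NOT running.
check_daemon() {
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: NOT running"
        return 1
    fi
}

# Daemon names taken from the ps output in this post.
for d in dlm_controld lvmlockd; do
    check_daemon "$d" || true
done
```

Running this on both nodes right after a failed failover, before any cleanup, shows whether the daemons were up at that point.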

It looks like the bug described here:
https://github.com/ClusterLabs/resource-agents/pull/1281/commits/848d62c32b355a03c2ad8d246eb3e34b04af07ca

The resources can only be started if I run: crm resource cleanup

Is there another workaround? If not, then this is not a cluster…

Thanks

Jost

To find the cause, you need to debug it.
Power off one of the nodes and wait for the failure, but don't clean it up.
Then run the resource in debug mode. You can check how to do it here: https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
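
As a rough sketch of those steps (not taken from the wiki page), something like the following could be run on the node where the start failed. It assumes the resource name `vgcluster` from the status output above and that `crm_mon` and `crm_resource` from Pacemaker are in the PATH. The `run` wrapper and the `DRY_RUN` and `RSC` variables are hypothetical helpers, so by default the script only prints the commands.

```shell
#!/bin/sh
# Resource name as shown in the crm status output in this thread.
RSC="${RSC:-vgcluster}"
# DRY_RUN=1 (the default) only prints the commands.
# Set DRY_RUN=0 on the failed node to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the cluster state once, including failed actions,
# without cleaning anything up.
run crm_mon -1

# Run the agent's start action outside of cluster control with
# verbose output, so its messages appear on the terminal.
run crm_resource --resource "$RSC" --force-start -V
```

The verbose output of the forced start should show where the LVM-activate agent stops, which is more specific than the "unknown error" in the cluster status.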