Hi,
I have configured a 2-node cluster on SLES 15 with LVM in exclusive mode. The problem is that when I fence or reboot the active node, the resources do not move to the secondary node; failover stops at the activation of the LVM volume group.
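For reference, the volume group resource is configured roughly like this (a minimal sketch only, assuming the ocf:heartbeat:LVM-activate agent with lvmlockd and exclusive activation; the VG name and timeouts are placeholders based on what the logs show):

# sketch only - vgname and timeouts are placeholders
crm configure primitive vgcluster ocf:heartbeat:LVM-activate \
    params vgname=vgcluster vg_access_mode=lvmlockd activation_mode=exclusive \
    op start timeout=90s op stop timeout=90s op monitor interval=30s timeout=90s

The relevant log entries look like this: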
2019-01-25T14:20:33.996488+01:00 sles15cl2 pengine[1828]: notice: Watchdog will be used via SBD if fencing is required
2019-01-25T14:20:33.997068+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.997296+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.998199+01:00 sles15cl2 pengine[1828]: warning: Forcing vgcluster away from sles15cl2 after 1000000 failures (max=3)
Failed Actions:
- lvmlockd_stop_0 on sles15cl1 'not installed' (5): call=56, status=Not installed, exitreason='',
    last-rc-change='Fri Jan 25 14:41:07 2019', queued=1ms, exec=1ms
Failed Actions:
- vgcluster_start_0 on sles15cl2 'unknown error' (1): call=82, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 25 13:49:36 2019', queued=0ms, exec=90003ms
- vgcluster_start_0 on sles15cl1 'not configured' (6): call=39, status=complete, exitreason='lvmlockd daemon is not running!',
    last-rc-change='Fri Jan 25 13:51:06 2019', queued=0ms, exec=308ms
sles15cl2:~ # crm status
Stack: corosync
Current DC: sles15cl2 (version 1.1.18+20180430.b12c320f5-1.14-b12c320f5) - partition with quorum
Last updated: Fri Jan 25 14:00:59 2019
Last change: Fri Jan 25 14:00:55 2019 by root via cibadmin on sles15cl2
2 nodes configured
10 resources configured
Online: [ sles15cl1 sles15cl2 ]
Full list of resources:
 admin-ip (ocf:IPaddr2): Started sles15cl2
 stonith-sbd (stonith:external/sbd): Started sles15cl2
 Clone Set: cl-storage [g-storage]
     Started: [ sles15cl1 sles15cl2 ]
 Resource Group: apache-group
     ip-apache (ocf:IPaddr2): Started sles15cl1
     vgcluster (ocf:LVM-activate): Stopped
     clusterfs (ocf:Filesystem): Stopped
     service-apache (ocf:apache): Stopped
Failed Actions:
- vgcluster_start_0 on sles15cl2 'unknown error' (1): call=82, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 25 13:49:36 2019', queued=0ms, exec=90003ms
- vgcluster_start_0 on sles15cl1 'not configured' (6): call=39, status=complete, exitreason='lvmlockd daemon is not running!',
    last-rc-change='Fri Jan 25 13:51:06 2019', queued=0ms, exec=308ms
It looks as if lvmlockd is not running, but it is running, as the process listing (and the extra check after it) shows:
sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep dlm
root 2714 1 0 14:43 ? 00:00:00 dlm_controld -s 0
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 4040 2 0 14:45 ? 00:00:00 [dlm_scand]
root 4041 2 0 14:45 ? 00:00:00 [dlm_recv]
root 4042 2 0 14:45 ? 00:00:00 [dlm_send]
root 4043 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 4050 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 23871 2919 0 15:16 pts/0 00:00:00 grep --color=auto dlm
sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep lvm
root 381 1 0 14:42 ? 00:00:00 /usr/sbin/lvmetad -f
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 23957 2919 0 15:16 pts/0 00:00:00 grep --color=auto lvm
sles15cl2:/usr/lib/ocf/resource.d/heartbeat #
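Beyond ps, a quick way to double-check the daemon is its pid file and lvmlockctl (a sketch; lvmlockctl is part of the lvm2 lockd tooling, assuming it is installed):

cat /run/lvmlockd.pid    # should match the lvmlockd PID (2792) in the ps output above
lvmlockctl --info        # prints lockspace/lock state if the daemon is responding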
It looks like the bug described here:
https://github.com/ClusterLabs/resource-agents/pull/1281/commits/848d62c32b355a03c2ad8d246eb3e34b04af07ca
The resources can only be started if I run crm resource cleanup.
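Concretely, the manual recovery after every failover looks roughly like this (a sketch; I am assuming the per-resource crmsh syntax, and the failcount line is only there to show how I check the state first):

crm resource failcount vgcluster show sles15cl2   # shows the failcount that keeps the resource away
crm resource cleanup vgcluster                    # clears the failed action, after which the group starts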
Is there some other workaround? If not, then this is not a cluster…
Thanks
Jost