Hi,
I have configured a 2-node cluster on SLES 15 with LVM in exclusive mode. The problem is that when I fence or reboot the active node, the resources do not move to the secondary node; failover stops at the activation of the LVM volume group.
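For reference, the volume group resource is configured roughly like this (a minimal sketch only, assuming the ocf:heartbeat:LVM-activate agent with lvmlockd and exclusive activation; the VG name and timeouts are placeholders based on what the logs show):

# sketch only - vgname and timeouts are placeholders
crm configure primitive vgcluster ocf:heartbeat:LVM-activate \
    params vgname=vgcluster vg_access_mode=lvmlockd activation_mode=exclusive \
    op start timeout=90s op stop timeout=90s op monitor interval=30s timeout=90s

The relevant log entries look like this: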
2019-01-25T14:20:33.996488+01:00 sles15cl2 pengine[1828]: notice: Watchdog will be used via SBD if fencing is required
2019-01-25T14:20:33.997068+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.997296+01:00 sles15cl2 pengine[1828]: warning: Processing failed op start for vgcluster on sles15cl2: unknown error (1)
2019-01-25T14:20:33.998199+01:00 sles15cl2 pengine[1828]: warning: Forcing vgcluster away from sles15cl2 after 1000000 failures (max=3)
Failed Actions:
- lvmlockd_stop_0 on sles15cl1 'not installed' (5): call=56, status=Not installed, exitreason='',
    last-rc-change='Fri Jan 25 14:41:07 2019', queued=1ms, exec=1ms
Failed Actions:
- vgcluster_start_0 on sles15cl2 'unknown error' (1): call=82, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 25 13:49:36 2019', queued=0ms, exec=90003ms
- vgcluster_start_0 on sles15cl1 'not configured' (6): call=39, status=complete, exitreason='lvmlockd daemon is not running!',
    last-rc-change='Fri Jan 25 13:51:06 2019', queued=0ms, exec=308ms
sles15cl2:~ # crm status
Stack: corosync
Current DC: sles15cl2 (version 1.1.18+20180430.b12c320f5-1.14-b12c320f5) - partition with quorum
Last updated: Fri Jan 25 14:00:59 2019
Last change: Fri Jan 25 14:00:55 2019 by root via cibadmin on sles15cl2
2 nodes configured
10 resources configured
Online: [ sles15cl1 sles15cl2 ]
Full list of resources:
 admin-ip (ocf:IPaddr2): Started sles15cl2
 stonith-sbd (stonith:external/sbd): Started sles15cl2
 Clone Set: cl-storage [g-storage]
     Started: [ sles15cl1 sles15cl2 ]
 Resource Group: apache-group
     ip-apache (ocf:IPaddr2): Started sles15cl1
     vgcluster (ocf:LVM-activate): Stopped
     clusterfs (ocf:Filesystem): Stopped
     service-apache (ocf:apache): Stopped
Failed Actions:
- vgcluster_start_0 on sles15cl2 'unknown error' (1): call=82, status=Timed Out, exitreason='',
    last-rc-change='Fri Jan 25 13:49:36 2019', queued=0ms, exec=90003ms
- vgcluster_start_0 on sles15cl1 'not configured' (6): call=39, status=complete, exitreason='lvmlockd daemon is not running!',
    last-rc-change='Fri Jan 25 13:51:06 2019', queued=0ms, exec=308ms
It looks as if lvmlockd is not running, but it is running, as the process listing (and the extra check after it) shows:
sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep dlm
root 2714 1 0 14:43 ? 00:00:00 dlm_controld -s 0
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 4040 2 0 14:45 ? 00:00:00 [dlm_scand]
root 4041 2 0 14:45 ? 00:00:00 [dlm_recv]
root 4042 2 0 14:45 ? 00:00:00 [dlm_send]
root 4043 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 4050 2 0 14:45 ? 00:00:00 [dlm_recoverd]
root 23871 2919 0 15:16 pts/0 00:00:00 grep --color=auto dlm
sles15cl2:/usr/lib/ocf/resource.d/heartbeat # ps -ef |grep lvm
root 381 1 0 14:42 ? 00:00:00 /usr/sbin/lvmetad -f
root 2792 1 0 14:43 ? 00:00:00 lvmlockd -p /run/lvmlockd.pid -A 1 -g dlm
root 23957 2919 0 15:16 pts/0 00:00:00 grep --color=auto lvm
sles15cl2:/usr/lib/ocf/resource.d/heartbeat #
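Beyond ps, a quick way to double-check the daemon is its pid file and lvmlockctl (a sketch; lvmlockctl is part of the lvm2 lockd tooling, assuming it is installed):

cat /run/lvmlockd.pid    # should match the lvmlockd PID (2792) in the ps output above
lvmlockctl --info        # prints lockspace/lock state if the daemon is responding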
It looks like the bug described here:
https://github.com/ClusterLabs/resource-agents/pull/1281/commits/848d62c32b355a03c2ad8d246eb3e34b04af07ca
The resources can only be started if I run crm resource cleanup.
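Concretely, the manual recovery after every failover looks roughly like this (a sketch; I am assuming the per-resource crmsh syntax, and the failcount line is only there to show how I check the state first):

crm resource failcount vgcluster show sles15cl2   # shows the failcount that keeps the resource away
crm resource cleanup vgcluster                    # clears the failed action, after which the group starts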
Is there some other workaround? If not, then this is not a cluster…
Thanks
Jost