GFS2 two node cluster

Hi,

I’m trying to configure a SLES 12 two-node GFS2 cluster. Everything works fine and the GFS2 volumes get mounted on both nodes. But if I fence one of the nodes, the other node also fences itself after a while. After executing stonith_admin -f suse2, I see the following entries in the logs of my first node, suse1, which is the one self-fencing / rebooting:

2016-06-28T22:23:57.967586+02:00 suse1 sbd: [3420]: info: off successfully delivered to suse2
2016-06-28T22:23:57.989952+02:00 suse1 sbd: [3419]: info: Message successfully delivered.
2016-06-28T22:23:58.994912+02:00 suse1 stonith-ng[1803]: notice: Operation ‘off’ [3400] (call 2 from stonith_admin.3288) for host ‘suse2’ with device ‘stonith-sbd’ returned: 0 (OK)
2016-06-28T22:23:59.003734+02:00 suse1 stonith-ng[1803]: notice: Operation off of suse2 by suse1 for stonith_admin.3288@suse1.36a60d0b: OK
2016-06-28T22:23:59.004218+02:00 suse1 crmd[1807]: notice: Peer suse2 was terminated (off) by suse1 for suse1: OK (ref=36a60d0b-d407-4422-a2c5-45ce64a037a6) by client stonith_admin.3288
2016-06-28T22:23:59.004529+02:00 suse1 crmd[1807]: notice: Transition aborted: External Fencing Operation (source=tengine_stonith_notify:339, 0)
2016-06-28T22:23:59.052980+02:00 suse1 sbd: [3772]: info: Watchdog enabled.
2016-06-28T22:23:59.080166+02:00 suse1 sbd: [3782]: info: Watchdog enabled.
2016-06-28T22:24:00.211586+02:00 suse1 stonith-ng[1803]: notice: watchdog can not fence (reboot) suse2: static-list
2016-06-28T22:24:00.287803+02:00 suse1 sbd: [3805]: info: Watchdog enabled.
2016-06-28T22:24:00.302747+02:00 suse1 sbd: [3809]: info: Watchdog enabled.
2016-06-28T22:24:00.332430+02:00 suse1 sbd: [3820]: info: Watchdog enabled.
2016-06-28T22:24:00.337599+02:00 suse1 sbd: [3819]: info: Watchdog enabled.
2016-06-28T22:24:01.454173+02:00 suse1 stonith-ng[1803]: notice: watchdog can not fence (poweroff) suse2: static-list
2016-06-28T22:24:01.496340+02:00 suse1 sbd: [3836]: info: Watchdog enabled.
2016-06-28T22:24:01.528702+02:00 suse1 sbd: [3846]: info: Watchdog enabled.
2016-06-28T22:24:02.634541+02:00 suse1 dlm_controld[2151]: 278 fence wait 2 pid 3421 running
2016-06-28T22:24:02.635128+02:00 suse1 dlm_controld[2151]: 278 mygfs2 wait for fencing
2016-06-28T22:24:02.656065+02:00 suse1 stonith-ng[1803]: notice: Delaying reboot on stonith-sbd for 19968ms (timeout=300s)
2016-06-28T22:24:02.695489+02:00 suse1 sbd: [3862]: info: Watchdog enabled.
2016-06-28T22:24:02.724094+02:00 suse1 sbd: [3872]: info: Watchdog enabled.
2016-06-28T22:24:06.627152+02:00 suse1 controld(dlm)[3922]: ERROR: DLM status is: wait fencing
2016-06-28T22:24:06.633082+02:00 suse1 controld(dlm)[3922]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing
2016-06-28T22:24:06.651833+02:00 suse1 stonith-ng[1803]: notice: Client stonith_admin.controld.3949.a7745819 wants to fence (reboot) ‘suse1’ with device ‘(any)’
2016-06-28T22:24:06.652073+02:00 suse1 stonith-ng[1803]: notice: Initiating remote operation reboot for suse1: 63a4e1b8-bc8a-4426-b998-80157e193cf2 (0)
2016-06-28T22:24:06.653058+02:00 suse1 stonith-ng[1803]: notice: watchdog can fence (reboot) suse1: static-list
2016-06-28T22:24:22.668377+02:00 suse1 sbd: [4041]: info: Watchdog enabled.
2016-06-28T22:24:22.700764+02:00 suse1 sbd: [4052]: info: Watchdog enabled.
2016-06-28T22:24:22.704832+02:00 suse1 sbd: [4053]: info: Delivery process handling /dev/disk/by-id/scsi-1ATA_VBOX_HARDDISK_VBfcc9b5de-d4e560d5
2016-06-28T22:24:22.706552+02:00 suse1 sbd: [4053]: info: Device UUID: a8ed63ee-ce0d-40db-963b-01d67208f75b
2016-06-28T22:24:22.706750+02:00 suse1 sbd: [4053]: info: Writing reset to node slot suse2
2016-06-28T22:24:22.707810+02:00 suse1 sbd: [4053]: info: Messaging delay: 50
2016-06-28T22:25:06.602606+02:00 suse1 lrmd[1804]: warning: dlm_monitor_60000 process (PID 3922) timed out
2016-06-28T22:25:06.607541+02:00 suse1 lrmd[1804]: warning: dlm_monitor_60000:3922 - timed out after 60000ms
2016-06-28T22:25:06.607828+02:00 suse1 crmd[1807]: error: Operation dlm_monitor_60000: Timed Out (node=suse1, call=24, timeout=60000ms)
2016-06-28T22:25:12.708604+02:00 suse1 sbd: [4053]: info: reset successfully delivered to suse2
2016-06-28T22:25:12.730007+02:00 suse1 sbd: [4052]: info: Message successfully delivered.
2016-06-28T22:25:13.732768+02:00 suse1 stonith-ng[1803]: notice: Operation ‘reboot’ [4033] (call 2 from stonith-api.3421) for host ‘suse2’ with device ‘stonith-sbd’ returned: 0 (OK)
2016-06-28T22:25:13.736734+02:00 suse1 stonith-ng[1803]: notice: Operation reboot of suse2 by suse1 for stonith-api.3421@suse1.ad0373ac: OK
2016-06-28T22:25:13.736971+02:00 suse1 stonith-api[3421]: stonith_api_kick: Node 2/(null) kicked: reboot
2016-06-28T22:25:13.737135+02:00 suse1 crmd[1807]: notice: Peer suse2 was terminated (reboot) by suse1 for suse1: OK (ref=ad0373ac-938c-4bee-9e07-0476c004f0b9) by client stonith-api.3421
2016-06-28T22:25:13.740752+02:00 suse1 stonith-api[3421]: stonith_api_time: Found 4 entries for 2/(null): 0 in progress, 3 completed
2016-06-28T22:25:13.741070+02:00 suse1 stonith-api[3421]: stonith_api_time: Node 2/(null) last kicked at: 1467145513
2016-06-28T22:25:13.784110+02:00 suse1 sbd: [4363]: info: Watchdog enabled.
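
To make the timing in the excerpt above easier to see, here is a minimal Python sketch that measures how long after "wait for fencing" the controld monitor decided on suicide fencing, and how long the reboot of suse2 actually took to be confirmed. The four log lines are copied verbatim from the excerpt; the helper and variable names are made up for illustration.

```python
from datetime import datetime

# Four lines copied verbatim from the log excerpt above.
LOG = """\
2016-06-28T22:24:02.635128+02:00 suse1 dlm_controld[2151]: 278 mygfs2 wait for fencing
2016-06-28T22:24:02.656065+02:00 suse1 stonith-ng[1803]: notice: Delaying reboot on stonith-sbd for 19968ms (timeout=300s)
2016-06-28T22:24:06.633082+02:00 suse1 controld(dlm)[3922]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing
2016-06-28T22:25:13.736734+02:00 suse1 stonith-ng[1803]: notice: Operation reboot of suse2 by suse1 for stonith-api.3421@suse1.ad0373ac: OK
"""


def stamp_of(needle):
    """Return the timestamp of the first log line containing `needle`."""
    for line in LOG.splitlines():
        if needle in line:
            # The timestamp is the first space-separated field (ISO 8601).
            return datetime.fromisoformat(line.split(" ", 1)[0])
    raise ValueError(f"no log line contains {needle!r}")


wait_start = stamp_of("wait for fencing")
suicide = stamp_of("Executing suicide fencing")
fence_done = stamp_of("Operation reboot of suse2")

# Seconds from "wait for fencing" to each later event.
suicide_after_s = (suicide - wait_start).total_seconds()
fence_done_after_s = (fence_done - wait_start).total_seconds()

print(f"suicide fencing requested after {suicide_after_s:.1f} s")
print(f"reboot of suse2 confirmed after {fence_done_after_s:.1f} s")
```

Read this way, the suicide fencing of suse1 is requested while the reboot of suse2 is still pending, roughly a minute before that reboot is reported as OK.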

My config looks like this:

suse1:~ # crm status detail
Last updated: Tue Jun 28 22:21:31 2016 Last change: Tue Jun 28 22:14:22 2016 by hacluster via crmd on suse1
Stack: corosync
Current DC: suse1 (1) (version 1.1.13-14.7-6f22ad7) - partition with quorum
2 nodes and 6 resources configured

Online: [ suse1 (1) suse2 (2) ]

stonith-sbd (stonith:external/sbd): Started suse1
admin_addr (ocf::heartbeat:IPaddr2): Started suse1
Clone Set: gfs2-clone [gfs2-group]
Resource Group: gfs2-group:0
dlm (ocf::pacemaker:controld): Started suse1
gfs2-01 (ocf::heartbeat:Filesystem): Started suse1
Resource Group: gfs2-group:1
dlm (ocf::pacemaker:controld): Started suse2
gfs2-01 (ocf::heartbeat:Filesystem): Started suse2
Started: [ suse1 suse2 ]

suse1:~ # crm configure show
node 1: suse1
node 2: suse2
primitive admin_addr IPaddr2 \
    params ip=172.16.1.22 \
    op monitor interval=10 timeout=20
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60s timeout=60s
primitive gfs2-01 Filesystem \
    params device="/dev/disk/by-id/scsi-SATA_VBOX_HARDDISK_VBe1e15cd7-8104f1f3" directory="/disklib/mp001" fstype=gfs2 \
    op monitor interval=20s timeout=40s
primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=30 \
    meta target-role=Started \
    op monitor interval=20s timeout=40s start-delay=20s
group gfs2-group dlm gfs2-01
clone gfs2-clone gfs2-group \
    meta interleave=true target-role=Started
property cib-bootstrap-options: \
    have-watchdog=true \
    dc-version=1.1.13-14.7-6f22ad7 \
    cluster-infrastructure=corosync \
    cluster-name=hacluster \
    stonith-enabled=true \
    no-quorum-policy=ignore \
    placement-strategy=balanced \
    stonith-timeout=72 \
    stonith-action=poweroff
rsc_defaults rsc-options: \
    resource-stickiness=1 \
    migration-threshold=3
op_defaults op-options: \
    timeout=600 \
    record-pending=true
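
As a rough cross-check of the timeouts in this configuration, the sketch below adds the random fencing delay to the sbd messaging delay and compares the sum with the dlm monitor timeout and stonith-timeout. All numbers are taken from the config and log in this post. Two things are assumptions: that the two delays simply add up one after the other, and that "Messaging delay: 50" is in seconds. Both fit the timestamps in the log but are not confirmed anywhere in this thread.

```python
# Values copied from the configuration and the log in this post.
pcmk_delay_max_s = 30          # stonith-sbd: params pcmk_delay_max=30
sbd_messaging_delay_s = 50     # sbd log: "Messaging delay: 50" (unit assumed: seconds)
dlm_monitor_timeout_s = 60     # dlm: op monitor interval=60s timeout=60s
stonith_timeout_s = 72         # property stonith-timeout=72
observed_delay_s = 19.968      # log: "Delaying reboot on stonith-sbd for 19968ms"

# Assumption: the random delay and the sbd messaging delay run back to back.
observed_fence_s = observed_delay_s + sbd_messaging_delay_s
worst_case_fence_s = pcmk_delay_max_s + sbd_messaging_delay_s

print(f"observed fencing time  : about {observed_fence_s:.1f} s")
print(f"worst-case fencing time: about {worst_case_fence_s} s")
print(f"exceeds dlm monitor timeout ({dlm_monitor_timeout_s} s): "
      f"{worst_case_fence_s > dlm_monitor_timeout_s}")
print(f"exceeds stonith-timeout ({stonith_timeout_s} s): "
      f"{worst_case_fence_s > stonith_timeout_s}")
```

Under these assumptions a fencing request can take longer than the dlm monitor timeout, which matches the dlm_monitor_60000 timeout seen in the log. Note that the log reports timeout=300s for the delayed reboot request itself, so whether stonith-timeout=72 plays any role here is not clear from this output alone.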

Thanks a lot for your help in advance.

Hi,

Do you see anything in the logs of suse1 right before it reboots itself? Is there anything in the logs of suse2 indicating a reason for the reboot of suse1?

Regards,
J

Please check if you encountered this issue:
https://github.com/ClusterLabs/pacemaker/pull/839

If so, upgrade your dlm and pacemaker packages, or just revert that patch to the controld RA from that Red Hat guy.