GFS2 two node cluster

bergest2 · June 28, 2016, 11:35pm

Hi,

I’m trying to configure a SLES 12 two-node GFS2 cluster. Everything works fine and the GFS2 volumes get mounted on both nodes. But if I fence one of the nodes, the other node also gets fenced after a while. After executing stonith_admin -f suse2, I see the following entries in the logs of my first node, suse1, which then self-fences and reboots:

2016-06-28T22:23:57.967586+02:00 suse1 sbd: [3420]: info: off successfully delivered to suse2
2016-06-28T22:23:57.989952+02:00 suse1 sbd: [3419]: info: Message successfully delivered.
2016-06-28T22:23:58.994912+02:00 suse1 stonith-ng[1803]: notice: Operation ‘off’ [3400] (call 2 from stonith_admin.3288) for host ‘suse2’ with device ‘stonith-sbd’ returned: 0 (OK)
2016-06-28T22:23:59.003734+02:00 suse1 stonith-ng[1803]: notice: Operation off of suse2 by suse1 for stonith_admin.3288@suse1.36a60d0b: OK
2016-06-28T22:23:59.004218+02:00 suse1 crmd[1807]: notice: Peer suse2 was terminated (off) by suse1 for suse1: OK (ref=36a60d0b-d407-4422-a2c5-45ce64a037a6) by client stonith_admin.3288
2016-06-28T22:23:59.004529+02:00 suse1 crmd[1807]: notice: Transition aborted: External Fencing Operation (source=tengine_stonith_notify:339, 0)
2016-06-28T22:23:59.052980+02:00 suse1 sbd: [3772]: info: Watchdog enabled.
2016-06-28T22:23:59.080166+02:00 suse1 sbd: [3782]: info: Watchdog enabled.
2016-06-28T22:24:00.211586+02:00 suse1 stonith-ng[1803]: notice: watchdog can not fence (reboot) suse2: static-list
2016-06-28T22:24:00.287803+02:00 suse1 sbd: [3805]: info: Watchdog enabled.
2016-06-28T22:24:00.302747+02:00 suse1 sbd: [3809]: info: Watchdog enabled.
2016-06-28T22:24:00.332430+02:00 suse1 sbd: [3820]: info: Watchdog enabled.
2016-06-28T22:24:00.337599+02:00 suse1 sbd: [3819]: info: Watchdog enabled.
2016-06-28T22:24:01.454173+02:00 suse1 stonith-ng[1803]: notice: watchdog can not fence (poweroff) suse2: static-list
2016-06-28T22:24:01.496340+02:00 suse1 sbd: [3836]: info: Watchdog enabled.
2016-06-28T22:24:01.528702+02:00 suse1 sbd: [3846]: info: Watchdog enabled.
2016-06-28T22:24:02.634541+02:00 suse1 dlm_controld[2151]: 278 fence wait 2 pid 3421 running
2016-06-28T22:24:02.635128+02:00 suse1 dlm_controld[2151]: 278 mygfs2 wait for fencing
2016-06-28T22:24:02.656065+02:00 suse1 stonith-ng[1803]: notice: Delaying reboot on stonith-sbd for 19968ms (timeout=300s)
2016-06-28T22:24:02.695489+02:00 suse1 sbd: [3862]: info: Watchdog enabled.
2016-06-28T22:24:02.724094+02:00 suse1 sbd: [3872]: info: Watchdog enabled.
2016-06-28T22:24:06.627152+02:00 suse1 controld(dlm)[3922]: ERROR: DLM status is: wait fencing
2016-06-28T22:24:06.633082+02:00 suse1 controld(dlm)[3922]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing
2016-06-28T22:24:06.651833+02:00 suse1 stonith-ng[1803]: notice: Client stonith_admin.controld.3949.a7745819 wants to fence (reboot) ‘suse1’ with device ‘(any)’
2016-06-28T22:24:06.652073+02:00 suse1 stonith-ng[1803]: notice: Initiating remote operation reboot for suse1: 63a4e1b8-bc8a-4426-b998-80157e193cf2 (0)
2016-06-28T22:24:06.653058+02:00 suse1 stonith-ng[1803]: notice: watchdog can fence (reboot) suse1: static-list
2016-06-28T22:24:22.668377+02:00 suse1 sbd: [4041]: info: Watchdog enabled.
2016-06-28T22:24:22.700764+02:00 suse1 sbd: [4052]: info: Watchdog enabled.
2016-06-28T22:24:22.704832+02:00 suse1 sbd: [4053]: info: Delivery process handling /dev/disk/by-id/scsi-1ATA_VBOX_HARDDISK_VBfcc9b5de-d4e560d5
2016-06-28T22:24:22.706552+02:00 suse1 sbd: [4053]: info: Device UUID: a8ed63ee-ce0d-40db-963b-01d67208f75b
2016-06-28T22:24:22.706750+02:00 suse1 sbd: [4053]: info: Writing reset to node slot suse2
2016-06-28T22:24:22.707810+02:00 suse1 sbd: [4053]: info: Messaging delay: 50
2016-06-28T22:25:06.602606+02:00 suse1 lrmd[1804]: warning: dlm_monitor_60000 process (PID 3922) timed out
2016-06-28T22:25:06.607541+02:00 suse1 lrmd[1804]: warning: dlm_monitor_60000:3922 - timed out after 60000ms
2016-06-28T22:25:06.607828+02:00 suse1 crmd[1807]: error: Operation dlm_monitor_60000: Timed Out (node=suse1, call=24, timeout=60000ms)
2016-06-28T22:25:12.708604+02:00 suse1 sbd: [4053]: info: reset successfully delivered to suse2
2016-06-28T22:25:12.730007+02:00 suse1 sbd: [4052]: info: Message successfully delivered.
2016-06-28T22:25:13.732768+02:00 suse1 stonith-ng[1803]: notice: Operation ‘reboot’ [4033] (call 2 from stonith-api.3421) for host ‘suse2’ with device ‘stonith-sbd’ returned: 0 (OK)
2016-06-28T22:25:13.736734+02:00 suse1 stonith-ng[1803]: notice: Operation reboot of suse2 by suse1 for stonith-api.3421@suse1.ad0373ac: OK
2016-06-28T22:25:13.736971+02:00 suse1 stonith-api[3421]: stonith_api_kick: Node 2/(null) kicked: reboot
2016-06-28T22:25:13.737135+02:00 suse1 crmd[1807]: notice: Peer suse2 was terminated (reboot) by suse1 for suse1: OK (ref=ad0373ac-938c-4bee-9e07-0476c004f0b9) by client stonith-api.3421
2016-06-28T22:25:13.740752+02:00 suse1 stonith-api[3421]: stonith_api_time: Found 4 entries for 2/(null): 0 in progress, 3 completed
2016-06-28T22:25:13.741070+02:00 suse1 stonith-api[3421]: stonith_api_time: Node 2/(null) last kicked at: 1467145513
2016-06-28T22:25:13.784110+02:00 suse1 sbd: [4363]: info: Watchdog enabled.

My config looks like this:

suse1:~ # crm status detail
Last updated: Tue Jun 28 22:21:31 2016 Last change: Tue Jun 28 22:14:22 2016 by hacluster via crmd on suse1
Stack: corosync
Current DC: suse1 (1) (version 1.1.13-14.7-6f22ad7) - partition with quorum
2 nodes and 6 resources configured

Online: [ suse1 (1) suse2 (2) ]

stonith-sbd (stonith:external/sbd): Started suse1
admin_addr (ocf:IPaddr2): Started suse1
Clone Set: gfs2-clone [gfs2-group]
Resource Group: gfs2-group:0
dlm (ocf::pacemaker:controld): Started suse1
gfs2-01 (ocf:Filesystem): Started suse1
Resource Group: gfs2-group:1
dlm (ocf::pacemaker:controld): Started suse2
gfs2-01 (ocf:Filesystem): Started suse2
Started: [ suse1 suse2 ]

suse1:~ # crm configure show
node 1: suse1
node 2: suse2
primitive admin_addr IPaddr2 \
    params ip=172.16.1.22 \
    op monitor interval=10 timeout=20
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60s timeout=60s
primitive gfs2-01 Filesystem \
    params device="/dev/disk/by-id/scsi-SATA_VBOX_HARDDISK_VBe1e15cd7-8104f1f3" directory="/disklib/mp001" fstype=gfs2 \
    op monitor interval=20s timeout=40s
primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=30 \
    meta target-role=Started \
    op monitor interval=20s timeout=40s start-delay=20s
group gfs2-group dlm gfs2-01
clone gfs2-clone gfs2-group \
    meta interleave=true target-role=Started
property cib-bootstrap-options: \
    have-watchdog=true \
    dc-version=1.1.13-14.7-6f22ad7 \
    cluster-infrastructure=corosync \
    cluster-name=hacluster \
    stonith-enabled=true \
    no-quorum-policy=ignore \
    placement-strategy=balanced \
    stonith-timeout=72 \
    stonith-action=poweroff
rsc_defaults rsc-options: \
    resource-stickiness=1 \
    migration-threshold=3
op_defaults op-options: \
    timeout=600 \
    record-pending=true

Thanks a lot for your help in advance.

Jens-U · July 5, 2016, 5:01pm

Hi,

Do you see anything in the logs of suse1 right before it reboots itself? Is there anything in the logs of suse2 indicating a reason for the reboot of suse1?
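
To make the ordering of events easier to read, here is a minimal Python sketch that turns a few of the suse1 log lines posted above into offsets from the first one. The helper names parse_line and timeline are made up for illustration, and the three lines in LOG are copied from the original post; swap in whichever lines you care about.

```python
from datetime import datetime

# Three lines copied verbatim from the suse1 log in the first post.
LOG = """\
2016-06-28T22:24:02.656065+02:00 suse1 stonith-ng[1803]: notice: Delaying reboot on stonith-sbd for 19968ms (timeout=300s)
2016-06-28T22:24:06.633082+02:00 suse1 controld(dlm)[3922]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing
2016-06-28T22:25:12.708604+02:00 suse1 sbd: [4053]: info: reset successfully delivered to suse2
"""


def parse_line(line):
    """Split one syslog line into (timestamp, message); hypothetical helper."""
    stamp, _host, message = line.split(" ", 2)
    return datetime.fromisoformat(stamp), message


def timeline(text):
    """Return (seconds since the first line, message) for each non-empty line."""
    events = [parse_line(l) for l in text.splitlines() if l.strip()]
    start = events[0][0]
    return [((ts - start).total_seconds(), msg) for ts, msg in events]


for offset, msg in timeline(LOG):
    print(f"+{offset:7.3f}s  {msg[:70]}")
```

On these lines, the dlm monitor reports "wait fencing" and starts the suicide fencing about 4 s after the delayed reboot request, while the reset for suse2 is only delivered about 70 s after it (roughly 20 s of delay plus the 50 s "Messaging delay" shown in the log).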

Regards,
J

ZRen · September 21, 2016, 4:56am

Please check whether you are hitting this issue:
https://github.com/ClusterLabs/pacemaker/pull/839

If so, upgrade your dlm and pacemaker packages, or just revert that patch to the controld RA (it came from a Red Hat developer).
