I'm trying to configure a SLES 12 two-node GFS2 cluster. Everything works fine and the GFS2 volumes get mounted on both nodes, but if I fence one of the nodes, the other node also fences itself after a while. After executing stonith_admin -f suse2, I see the following log entries on my first node suse1, which is self-fencing/rebooting:
2016-06-28T22:23:57.967586+02:00 suse1 sbd: [3420]: info: off successfully delivered to suse2
2016-06-28T22:23:57.989952+02:00 suse1 sbd: [3419]: info: Message successfully delivered.
2016-06-28T22:23:58.994912+02:00 suse1 stonith-ng[1803]: notice: Operation 'off' [3400] (call 2 from stonith_admin.3288) for host 'suse2' with device 'stonith-sbd' returned: 0 (OK)
2016-06-28T22:23:59.003734+02:00 suse1 stonith-ng[1803]: notice: Operation off of suse2 by suse1 for stonith_admin.3288@suse1.36a60d0b: OK
2016-06-28T22:23:59.004218+02:00 suse1 crmd[1807]: notice: Peer suse2 was terminated (off) by suse1 for suse1: OK (ref=36a60d0b-d407-4422-a2c5-45ce64a037a6) by client stonith_admin.3288
2016-06-28T22:23:59.004529+02:00 suse1 crmd[1807]: notice: Transition aborted: External Fencing Operation (source=tengine_stonith_notify:339, 0)
2016-06-28T22:23:59.052980+02:00 suse1 sbd: [3772]: info: Watchdog enabled.
2016-06-28T22:23:59.080166+02:00 suse1 sbd: [3782]: info: Watchdog enabled.
2016-06-28T22:24:00.211586+02:00 suse1 stonith-ng[1803]: notice: watchdog can not fence (reboot) suse2: static-list
2016-06-28T22:24:00.287803+02:00 suse1 sbd: [3805]: info: Watchdog enabled.
2016-06-28T22:24:00.302747+02:00 suse1 sbd: [3809]: info: Watchdog enabled.
2016-06-28T22:24:00.332430+02:00 suse1 sbd: [3820]: info: Watchdog enabled.
2016-06-28T22:24:00.337599+02:00 suse1 sbd: [3819]: info: Watchdog enabled.
2016-06-28T22:24:01.454173+02:00 suse1 stonith-ng[1803]: notice: watchdog can not fence (poweroff) suse2: static-list
2016-06-28T22:24:01.496340+02:00 suse1 sbd: [3836]: info: Watchdog enabled.
2016-06-28T22:24:01.528702+02:00 suse1 sbd: [3846]: info: Watchdog enabled.
2016-06-28T22:24:02.634541+02:00 suse1 dlm_controld[2151]: 278 fence wait 2 pid 3421 running
2016-06-28T22:24:02.635128+02:00 suse1 dlm_controld[2151]: 278 mygfs2 wait for fencing
2016-06-28T22:24:02.656065+02:00 suse1 stonith-ng[1803]: notice: Delaying reboot on stonith-sbd for 19968ms (timeout=300s)
2016-06-28T22:24:02.695489+02:00 suse1 sbd: [3862]: info: Watchdog enabled.
2016-06-28T22:24:02.724094+02:00 suse1 sbd: [3872]: info: Watchdog enabled.
2016-06-28T22:24:06.627152+02:00 suse1 controld(dlm)[3922]: ERROR: DLM status is: wait fencing
2016-06-28T22:24:06.633082+02:00 suse1 controld(dlm)[3922]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing
2016-06-28T22:24:06.651833+02:00 suse1 stonith-ng[1803]: notice: Client stonith_admin.controld.3949.a7745819 wants to fence (reboot) 'suse1' with device '(any)'
2016-06-28T22:24:06.652073+02:00 suse1 stonith-ng[1803]: notice: Initiating remote operation reboot for suse1: 63a4e1b8-bc8a-4426-b998-80157e193cf2 (0)
2016-06-28T22:24:06.653058+02:00 suse1 stonith-ng[1803]: notice: watchdog can fence (reboot) suse1: static-list
2016-06-28T22:24:22.668377+02:00 suse1 sbd: [4041]: info: Watchdog enabled.
2016-06-28T22:24:22.700764+02:00 suse1 sbd: [4052]: info: Watchdog enabled.
2016-06-28T22:24:22.704832+02:00 suse1 sbd: [4053]: info: Delivery process handling /dev/disk/by-id/scsi-1ATA_VBOX_HARDDISK_VBfcc9b5de-d4e560d5
2016-06-28T22:24:22.706552+02:00 suse1 sbd: [4053]: info: Device UUID: a8ed63ee-ce0d-40db-963b-01d67208f75b
2016-06-28T22:24:22.706750+02:00 suse1 sbd: [4053]: info: Writing reset to node slot suse2
2016-06-28T22:24:22.707810+02:00 suse1 sbd: [4053]: info: Messaging delay: 50
2016-06-28T22:25:06.602606+02:00 suse1 lrmd[1804]: warning: dlm_monitor_60000 process (PID 3922) timed out
2016-06-28T22:25:06.607541+02:00 suse1 lrmd[1804]: warning: dlm_monitor_60000:3922 - timed out after 60000ms
2016-06-28T22:25:06.607828+02:00 suse1 crmd[1807]: error: Operation dlm_monitor_60000: Timed Out (node=suse1, call=24, timeout=60000ms)
2016-06-28T22:25:12.708604+02:00 suse1 sbd: [4053]: info: reset successfully delivered to suse2
2016-06-28T22:25:12.730007+02:00 suse1 sbd: [4052]: info: Message successfully delivered.
2016-06-28T22:25:13.732768+02:00 suse1 stonith-ng[1803]: notice: Operation 'reboot' [4033] (call 2 from stonith-api.3421) for host 'suse2' with device 'stonith-sbd' returned: 0 (OK)
2016-06-28T22:25:13.736734+02:00 suse1 stonith-ng[1803]: notice: Operation reboot of suse2 by suse1 for stonith-api.3421@suse1.ad0373ac: OK
2016-06-28T22:25:13.736971+02:00 suse1 stonith-api[3421]: stonith_api_kick: Node 2/(null) kicked: reboot
2016-06-28T22:25:13.737135+02:00 suse1 crmd[1807]: notice: Peer suse2 was terminated (reboot) by suse1 for suse1: OK (ref=ad0373ac-938c-4bee-9e07-0476c004f0b9) by client stonith-api.3421
2016-06-28T22:25:13.740752+02:00 suse1 stonith-api[3421]: stonith_api_time: Found 4 entries for 2/(null): 0 in progress, 3 completed
2016-06-28T22:25:13.741070+02:00 suse1 stonith-api[3421]: stonith_api_time: Node 2/(null) last kicked at: 1467145513
2016-06-28T22:25:13.784110+02:00 suse1 sbd: [4363]: info: Watchdog enabled.
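In case it helps with interpreting the log, these are the checks I run on suse1 (just the commands, output omitted; the device path is the SBD disk that shows up in the log above):

suse1:~ # sbd -d /dev/disk/by-id/scsi-1ATA_VBOX_HARDDISK_VBfcc9b5de-d4e560d5 dump
suse1:~ # dlm_tool ls
suse1:~ # stonith_admin --history suse2

The sbd dump shows the watchdog and msgwait timeouts written to the device header, dlm_tool ls shows the state of the mygfs2 lockspace (which the log says is waiting for fencing), and stonith_admin --history lists the fencing actions recorded against suse2.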
My configuration looks like this:
suse1:~ # crm status detail
Last updated: Tue Jun 28 22:21:31 2016 Last change: Tue Jun 28 22:14:22 2016 by hacluster via crmd on suse1
Stack: corosync
Current DC: suse1 (1) (version 1.1.13-14.7-6f22ad7) - partition with quorum
2 nodes and 6 resources configured
Online: [ suse1 (1) suse2 (2) ]
 stonith-sbd	(stonith:external/sbd):	Started suse1
 admin_addr	(ocf:IPaddr2):	Started suse1
 Clone Set: gfs2-clone [gfs2-group]
     Resource Group: gfs2-group:0
         dlm	(ocf::pacemaker:controld):	Started suse1
         gfs2-01	(ocf:Filesystem):	Started suse1
     Resource Group: gfs2-group:1
         dlm	(ocf::pacemaker:controld):	Started suse2
         gfs2-01	(ocf:Filesystem):	Started suse2
     Started: [ suse1 suse2 ]
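The resources were created roughly like this (a reconstruction from memory rather than a verbatim dump; the values in angle brackets are placeholders, not my exact IP, device or mount point):

crm configure primitive stonith-sbd stonith:external/sbd
crm configure primitive admin_addr ocf:heartbeat:IPaddr2 params ip=<cluster-ip>
crm configure primitive dlm ocf:pacemaker:controld op monitor interval=60
crm configure primitive gfs2-01 ocf:heartbeat:Filesystem params device=<gfs2-device> directory=<mount-point> fstype=gfs2
crm configure group gfs2-group dlm gfs2-01
crm configure clone gfs2-clone gfs2-group meta interleave=true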