We have a two-node cluster, and when we try to simulate a failure by restarting the first node, the second node gets rebooted by the cluster the moment the cluster service comes back online on the first node.
Does anyone have an idea where to look for the cause of this?
Yes: in syslog - check the pacemaker messages to see who initiated the reboot.
I’ve seen this before, but don’t recall the actual cause - it may have been that both nodes decided to fence the other and the second fencing action got delayed until quorum was restored…
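Something along these lines usually shows who asked for the fence (assuming a default SLES setup where pacemaker logs to /var/log/messages; adjust the path if your syslog goes elsewhere, and note that the exact message wording varies a bit between pacemaker versions):
# Which daemon/node requested the reboot, and when:
grep -E 'stonith-ng|remote_op_done|fence' /var/log/messages
# The actual decision is normally logged by the policy engine,
# e.g. "Scheduling Node ... for STONITH":
grep -iE 'pengine.*stonith' /var/log/messages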
Dec 14 01:27:02 server1 SAPHana(rsc_SAPHana_BRP_HDB00)[8574]: INFO: RA ==== end action monitor_clone with rc=7 (0.149.4) (7s)====
Dec 14 01:27:02 server1 lrmd[8052]: notice: operation_finished: rsc_SAPHana_BRP_HDB00_monitor_0:8574 [ Error performing operation: No such device or address ]
Dec 14 01:27:02 server1 lrmd[8052]: notice: operation_finished: rsc_SAPHana_BRP_HDB00_monitor_0:8574 [ Error performing operation: No such device or address ]
Dec 14 01:27:02 server1 lrmd[8052]: notice: operation_finished: rsc_SAPHana_BRP_HDB00_monitor_0:8574 [ Could not map name=lpa_brp_lpt to a UUID ]
Dec 14 01:27:02 server1 lrmd[8052]: notice: operation_finished: rsc_SAPHana_BRP_HDB00_monitor_0:8574 [ Error performing operation: No such device or address ]
Dec 14 01:27:02 server1 lrmd[8052]: notice: operation_finished: rsc_SAPHana_BRP_HDB00_monitor_0:8574 [ Error performing operation: No such device or address ]
Dec 14 01:27:02 server1 crmd[8055]: notice: process_lrm_event: LRM operation rsc_SAPHana_BRP_HDB00_monitor_0 (call=11, rc=7, cib-update=30, confirmed=true) not running
Dec 14 01:27:02 server1 crmd[8055]: notice: process_lrm_event: server1-rsc_SAPHana_BRP_HDB00_monitor_0:11 [ 30 ]
Dec 14 01:27:02 server1 crmd[8055]: notice: te_rsc_command: Initiating action 3: probe_complete probe_complete on server1 (local) - no waiting
Dec 14 01:27:02 server1 attrd[8053]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 14 01:27:02 server1 attrd[8053]: notice: attrd_perform_update: Sent update 10: probe_complete=true
Dec 14 01:27:05 server1 sbd: [8713]: info: reset successfully delivered to server2
Dec 14 01:27:05 server1 sbd: [8706]: info: Message successfully delivered.
Dec 14 01:27:06 server1 stonith-ng[8051]: notice: log_operation: Operation 'reboot' [8684] (call 2 from crmd.8055) for host 'server2' with device 'stonith-sbd' returned: 0 (OK)
Dec 14 01:27:06 server1 stonith-ng[8051]: notice: remote_op_done: Operation reboot of server2 by server1 for crmd.8055@server1.01fea0a0: OK
Dec 14 01:27:06 server1 crmd[8055]: notice: tengine_stonith_callback: Stonith operation 2/44:0:0:de761f36-d72f-4ac1-aaa1-7c68150141b5: OK (0)
Dec 14 01:27:06 server1 crmd[8055]: notice: crm_update_peer_state: send_stonith_update: Node server2[0] - state is now lost (was (null))
Dec 14 01:27:06 server1 crmd[8055]: notice: tengine_stonith_notify: Peer server2 was terminated (st_notify_fence) by server1 for server1: OK (ref=01fea0a0-ab49-4c5f-a60c-66c9beaf0ce5) by client crmd.8055
The excerpt only covers a short time window, around when the sbd reset was delivered - check much earlier entries as well, around the time the other node was rebooted. I’d expect the reboot decision to have been made then.
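If the older entries are hard to dig out, something like this may help to pull the earlier decision out of the logs on both nodes (the date below is just a placeholder for the time of the first reboot, and the log path assumes the default /var/log/messages):
# Look around the time of the *first* reboot, not the sbd delivery:
grep '^Dec 13' /var/log/messages | grep -E 'pengine|stonith|fence'
# Or collect logs from both nodes for a whole time window in one go
# (hb_report on older SLES releases, crm_report on newer stacks):
hb_report -f "2015-12-13 23:00" -t "2015-12-14 02:00" /tmp/fence-report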
Is the above excerpt from after a server1 reboot (or a restart of the cluster services)? I see that it starts with server1 becoming a member. Considering your initial message, I’ll assume that server1 got fenced and that you then observed the above messages during its start-up phase, when server2 got fenced.
The ring status updates look fishy: it looks as if only one node is available (server1?, IP 172.21.1.242), which is likely why it is rebooting server2 to reach a clean state.
Quorum is not involved; I see that you have set the corresponding option to ignore a missing quorum.
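If you want to verify that while both nodes are up, this is roughly what I’d check (standard corosync/crmsh tools on SLES; run on both nodes):
# Ring status as corosync sees it - are the rings active, any faults?
corosync-cfgtool -s
# Node membership from pacemaker's point of view:
crm_mon -1
# Quorum/stonith options currently in effect:
crm configure show | grep -E 'no-quorum-policy|stonith'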
I think this is the problem. Can a corrupt /var/lib/pacemaker/cib/cib.xml cause this behaviour?
Dec 15 14:48:34 hnbrpdb1 mgmtd: [53915]: info: Pacemaker-mgmt Git Version:
Dec 15 14:48:34 hnbrpdb1 mgmtd: [53915]: WARN: Core dumps could be lost if multiple dumps occur.
Dec 15 14:48:34 hnbrpdb1 mgmtd: [53915]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
Dec 15 14:48:34 hnbrpdb1 mgmtd: [53915]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Dec 15 14:48:34 hnbrpdb1 mgmtd: [53915]: notice: connect to lrmd: 0, rc=-107
Dec 15 14:48:34 hnbrpdb1 corosync[53903]: [pcmk ] info: pcmk_ipc: Recorded connection 0x6a1170 for stonith-ng/53910
Dec 15 14:48:34 hnbrpdb1 cib[53909]: error: validate_cib_digest: Digest comparision failed: expected b37e2a4901e0e5b4b2b357199d9b8f8b (/var/lib/pacemaker/cib/cib.xml.sig), calculated 0a716e016aebb4e95e59e9d75dcdeccb
Dec 15 14:48:34 hnbrpdb1 cib[53909]: error: retrieveCib: Checksum of /var/lib/pacemaker/cib/cib.xml failed! Configuration contents ignored!
Dec 15 14:48:34 hnbrpdb1 cib[53909]: error: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Dec 15 14:48:34 hnbrpdb1 cib[53909]: warning: retrieveCib: Continuing but /var/lib/pacemaker/cib/cib.xml will NOT used.
Dec 15 14:48:34 hnbrpdb1 cib[53909]: error: cib_rename: Archiving corrupt or unusable file /var/lib/pacemaker/cib/cib.xml as /var/lib/pacemaker/cib/cib.auto.z2LFxS
Dec 15 14:48:34 hnbrpdb1 cib[53909]: error: cib_rename: Archiving corrupt or unusable file /var/lib/pacemaker/cib/cib.xml.sig as /var/lib/pacemaker/cib/cib.auto.rjWFhF
Dec 15 14:48:34 hnbrpdb1 cib[53909]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backup…
Dec 15 14:48:34 hnbrpdb1 cib[53909]: warning: readCibXmlFile: Attempting to load: /var/lib/pacemaker/cib/cib-23.raw
Dec 15 14:48:34 hnbrpdb1 cib[53909]: warning: validate_cib_digest: No on-disk digest present
Dec 15 14:48:34 hnbrpdb1 crmd[53914]: notice: main: CRM Git Version: 2db99f1
Dec 15 14:48:34 hnbrpdb1 corosync[53903]: [pcmk ] info: pcmk_ipc: Recorded connection 0x6a54d0 for attrd/53912
Dec 15 14:48:34 hnbrpdb1 attrd[53912]: notice: main: Starting mainloop…
That doesn’t sound too likely… if it rejects the CIB, it wouldn’t know about the other node and hence wouldn’t fence it.
Have you cleared the CIB situation by now, and did that solve the issue?
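In case it is not cleared yet: the usual way to get rid of a corrupt local CIB is to let the node resync it from the healthy peer, roughly like this (a sketch only - the service name depends on your SLES release, and the peer must be up and carrying the good configuration before you start):
# On the node with the corrupt CIB (here hnbrpdb1):
rcopenais stop                     # or "systemctl stop pacemaker" on newer releases
# keep a copy, then clear the local CIB files so they get resynced on rejoin:
mkdir /root/cib-backup && cp /var/lib/pacemaker/cib/* /root/cib-backup/
rm /var/lib/pacemaker/cib/cib*
rcopenais start                    # the node rejoins and receives the CIB from the peer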
If that does not solve it: when server1 comes up, what does server2 recognize? Server1 reports being alone on the ring - does server2 see anything from server1 at that moment? “Famous last words”: “There is no other node that might issue a fence opera… ”
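To answer that, it may be worth simply watching server2 while server1 boots, roughly like this (log path again assumed to be /var/log/messages):
# On server2, while server1 is coming back up:
corosync-cfgtool -s                                          # do the rings stay healthy?
crm_mon -1                                                   # does server1 ever show up as online?
tail -f /var/log/messages | grep -E 'corosync|crmd|stonith'  # membership and fencing messages as they happen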