Sudden split brain occurred. No reason found

A processor failed, forming new configuration.
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] CLM CONFIGURATION CHANGE
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] New Configuration:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Left:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Joined:
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 712: memb=1, new=0, lost=1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: memb: node1 1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: lost: node2 2
Jan 8 02:47:37 node1 corosync[12261]: [pcmk ] info: update_member: Node 2/node2 is now: lost
Jan 8 02:47:37 node1 crmd[12271]: notice: crm_update_peer_state: plugin_handle_membership: Node node2[2] - state is now lost (was member)
Jan 8 02:47:37 node1 sbd: [12249]: WARN: CIB: We do NOT have quorum!
Jan 8 02:47:37 node1 sbd: [12245]: WARN: Pacemaker health check: UNHEALTHY
Jan 8 02:47:37 node1 pengine[12270]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 8 02:47:37 node1 pengine[12270]: warning: pe_fence_node: Node node2 will be fenced because the node is no longer part of the cluster.
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: memb: node2 2
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: lost: node1 1
Jan 8 02:47:37 node2 pengine[26241]: warning: handle_startup_fencing: Blind faith: not fencing unseen nodes

Suddenly my cluster went into a split-brain situation.
We did not observe any abnormalities at the VMware level.
No latency or network disconnection was observed at the VM level at the time the split brain happened.
What can cause this sudden situation?

Hi bcunix,

[QUOTE=bcunix;31125]A processor failed, forming new configuration.
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] CLM CONFIGURATION CHANGE
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] New Configuration:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Left:
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] r(0) ip()
Jan 8 02:47:36 node1 corosync[12261]: [CLM ] Members Joined:
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 712: memb=1, new=0, lost=1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: memb: node1 1
Jan 8 02:47:36 node1 corosync[12261]: [pcmk ] info: pcmk_peer_update: lost: node2 2
Jan 8 02:47:37 node1 corosync[12261]: [pcmk ] info: update_member: Node 2/node2 is now: lost
Jan 8 02:47:37 node1 crmd[12271]: notice: crm_update_peer_state: plugin_handle_membership: Node node2[2] - state is now lost (was member)
Jan 8 02:47:37 node1 sbd: [12249]: WARN: CIB: We do NOT have quorum!
Jan 8 02:47:37 node1 sbd: [12245]: WARN: Pacemaker health check: UNHEALTHY
Jan 8 02:47:37 node1 pengine[12270]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 8 02:47:37 node1 pengine[12270]: warning: pe_fence_node: Node node2 will be fenced because the node is no longer part of the cluster.
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: memb: node2 2
Jan 8 02:47:36 node2 corosync[26232]: [pcmk ] info: pcmk_peer_update: lost: node1 1
Jan 8 02:47:37 node2 pengine[26241]: warning: handle_startup_fencing: Blind faith: not fencing unseen nodes

Suddenly my cluster went into a split-brain situation.
We did not observe any abnormalities at the VMware level.
No latency or network disconnection was observed at the VM level at the time the split brain happened.
What can cause this sudden situation?[/QUOTE]

Have you seen (or caused) any “high LRM workload” situation on node2, e.g. by running “crm_resource --cleanup” without specifying a specific resource to clean up?
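A bare cleanup re-probes every resource on every node and can put considerable load on the LRM; restricting it to one resource avoids that. A quick sketch (the resource name here is just a placeholder):

crm_resource --cleanup --resource my_resource

This re-probes only that one resource, whereas “crm_resource --cleanup” on its own touches everything.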

OTOH, both nodes declared their respective peer lost, so it does look like a ring interruption to me.
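If it happens again, checking the ring status on both nodes right away can help confirm this (standard corosync tooling):

corosync-cfgtool -s

This prints the state of each configured ring and will report a fault if a ring is down.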

Is the above all you have in syslog, especially on node2? If so, you might want to increase the debug level to better diagnose future events. Otherwise it might be helpful to have some context on what happened prior to the “split brain” situation.
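As a minimal sketch, assuming the stock file location, turning up corosync logging could look like this in /etc/corosync/corosync.conf:

logging {
        to_syslog: yes
        timestamp: on
        debug: on
}

Note that “debug: on” is very verbose, so you would normally turn it back off once the problem has been caught. The change takes effect once corosync re-reads its configuration (e.g. on restart).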

Regards,
Jens