Seeing TOTEM msgs (Totem is unable to form a cluster because of an operating system or network fault)

PRIYANKA · May 25, 2021, 5:18pm

Hello,
We have a use case where both SLES HA 15 and the strongswan service are configured. When we initially set up these services in a particular order, the HA services come up without any issues.
However, when we reboot one of our cluster nodes (say Node-1), crm status shows the FILE status as UNCLEAN, and we see the following messages repeatedly:

2021-05-19T07:47:02.908610+00:00 FILE-1 corosync[3610]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
2021-05-19T07:47:53.913841+00:00 FILE-1 corosync[3610]: message repeated 34 times: [ [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.]
2021-05-19T07:47:54.018432+00:00 FILE-1 systemd[1]: Started Service to post platform alerts.
.
.
.
This state persists until there is a ping or ssh to the other node (in a 2-node setup). As soon as there is a ping/ssh, we see the following logs:

2021-05-19T07:47:54.020228+00:00 FILE-1 charon-systemd[2864]: creating acquire job for policy 147.178.40.8/32[udp/34030] === 147.178.40.7/32[udp/blackjack] with reqid {145}
2021-05-19T07:47:54.020488+00:00 FILE-1 charon-systemd[2864]: initiating IKE_SA local-FILE-147178408-147178407[5] to 147.178.40.7
2021-05-19T07:47:54.021248+00:00 FILE-1 charon-systemd[2864]: generating IKE_SA_INIT request 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) N(FRAG_SUP) N(HASH_ALG) N(REDIR_SUP) ]
2021-05-19T07:47:54.021470+00:00 FILE-1 charon-systemd[2864]: sending packet: from 147.178.40.8[500] to 147.178.40.7[500] (332 bytes)
2021-05-19T07:47:54.024443+00:00 FILE-1 charon-systemd[2864]: received packet: from 147.178.40.7[500] to 147.178.40.8[500] (340 bytes)

Can you please help us understand what goes wrong once the node is rebooted, and why it is unable to rejoin the cluster on its own?
TIA!

hunter86_bg · July 2, 2021, 7:41pm

I am struggling to understand your situation, so I doubt anyone will be able to help.
Can you provide more details, such as the cluster setup, stonith availability, corosync options, links per host, and network topology?

Are you trying to tunnel corosync traffic over a VPN? I doubt it's supported, but to stay on track, you need your VPN service to start before corosync. Also, you will need the "two_node" option, so that quorum won't be lost when the other cluster peer is dead.
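
A minimal sketch of those two suggestions, as configuration fragments rather than a tested setup. The drop-in file name is hypothetical, and the unit name strongswan.service is an assumption (the logs above only show charon-systemd), so check the actual unit name on your nodes with systemctl first. The quorum section assumes corosync's votequorum provider is in use.

```
# /etc/corosync/corosync.conf (quorum section only)
quorum {
    provider: corosync_votequorum
    # Two-node mode: the surviving node keeps quorum when its peer is down
    two_node: 1
}

# /etc/systemd/system/corosync.service.d/after-vpn.conf (hypothetical file name)
[Unit]
# Assumed unit name; order corosync after the VPN service and pull it in at start
After=strongswan.service
Wants=strongswan.service
```

After adding the drop-in, run systemctl daemon-reload so systemd picks it up. Note that After= only orders unit startup; it does not wait for the IPsec tunnel itself to be established.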
