Cluster comms broken, or not

Hi all,

I had a weird situation last Friday and wanted to ask around for similar experiences and/or ideas:

  • three-node cluster (identical hardware) with SLES11SP3 + HAE Dom0s, mostly used to run Xen VMs

  • Pacemaker, Corosync etc.

  • corosync is set to use multicast, with all traffic going through the production switch via a non-exclusive two-port LACP bond carrying 802.1q VLANs (the VLAN interfaces on the Dom0s have 192.168.103.xx addresses)

[CODE]
totem {
        […]
        interface {
                bindnetaddr: 192.168.103.0
                mcastaddr: 239.39.103.1
                mcastport: 5405
                ttl: 1
        }
}
[/CODE]
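
For completeness, this is roughly how I checked that the Dom0s had actually joined the totem multicast group on the bond’s VLAN interface (the interface name bond0.103 is just an example, and I’m quoting the commands from memory):

[CODE]
# multicast group memberships on the VLAN interface
# (bond0.103 stands in for the actual 802.1q interface on the LACP bond)
ip maddr show dev bond0.103

# kernel IGMP view -- the totem group should show up here as well
cat /proc/net/igmp
[/CODE]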

I had a stale resource event that I wanted to clean up (“crm_resource --cleanup --resource somename”). When checking syslog after issuing the command, I noticed an ever-growing stream of errors like “corosync[19247]: [TOTEM ] Retransmit List: 9b0 9b1 9b2 9b3 9b4 9b5 9b6 9b7 9b8 9b9 9ba 9bb 9bc 9bd 9be 9bf 9c0 9c1 9c2 9c3”. The list itself never changed. According to syslog, the Retransmit messages had started about 25 minutes before I issued the crm_resource call, so the cleanup wasn’t what triggered them.
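
For reference, the exact cleanup call and the kind of syslog check I did afterwards (“somename” stands in for the real resource id):

[CODE]
# clean up the stale state of the resource (resource id is a placeholder)
crm_resource --cleanup --resource somename

# the messages that then kept piling up on two of the three Dom0s
grep "Retransmit List" /var/log/messages | tail -n 5
[/CODE]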

After a few minutes, probably due to the cleanup timing out, stonithing kicked in and rebooted two of the three machines.

Before the reboot, I had a chance to run “tcpdump” on all three machines to check for multicast problems. Interestingly, I could see the non-multicast traffic being exchanged between the nodes (on the port specified for multicast in corosync.conf) in a ring-like fashion (node1 to node2, node2 to node3, node3 to node1), but not a single multicast packet.
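
For the record, the tcpdump filters I used were along these lines (again, the interface name is just an example):

[CODE]
# unicast totem traffic on the configured port -- this showed the
# ring-like node1 -> node2 -> node3 -> node1 pattern
tcpdump -ni bond0.103 'udp port 5405 and not dst host 239.39.103.1'

# traffic to the totem multicast group -- this stayed completely silent
# until the nodes were rebooted
tcpdump -ni bond0.103 'udp port 5405 and dst host 239.39.103.1'
[/CODE]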

I had also checked the ring status via “corosync-cfgtool -s”, which reported “ring 0 active with no faults”, at least on the one node I could check… that’s when stonithing kicked in.

After the restart (and still today) I can see both the unicast and the multicast traffic (the latter sent by only one of the nodes and received by the other two).
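
If anyone wants to compare notes: a corosync-independent way to test multicast between the nodes would be omping, though I’m not sure it’s packaged for SLES 11 SP3, so take this as a sketch:

[CODE]
# run simultaneously on all three Dom0s: tests multicast and unicast
# delivery between the listed nodes (hostnames are placeholders);
# omping uses its own default group/port, so it won't clash with corosync
omping node1 node2 node3
[/CODE]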

I guess the multicast sender somehow died, but I cannot tell whether this was the cause of the TOTEM retransmits or a co-failure with a common cause. What puzzles me even more is that corosync reported the ring as active (which at first sight matches the tcpdump observation), yet the retransmit list never shrank… as if a node got confused with regard to the guaranteed-order delivery. The retransmit messages were reported in the syslogs of nodes 1 and 3, but not in node 2’s.

Has anybody seen something like this, and does anyone have an idea what to restart on the node(s) to reset the TOTEM delivery, without rebooting one or all of the nodes?
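
For context, what I’d be tempted to try on an affected node, assuming rcopenais is still the right service wrapper on SLES 11 SP3 HAE (corosync runs under the openais init script there), is roughly the following, but I’m unsure whether that actually resets the totem state or just provokes another fence:

[CODE]
# move resources off the node first, then restart the cluster stack
crm node standby node1      # node name is a placeholder
rcopenais restart           # restarts the corosync/openais stack on SLES 11 HAE
crm node online node1
[/CODE]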

Regards,
Jens