Both nodes in OCFS2 cluster keep rebooting

Hi there,

We created an OCFS2 cluster using the SUSE Linux High Availability Extension on SLES SP3. The cluster’s nodes are two Apache servers which share one disk. We have STONITH enabled with the SBD daemon. It works fine, but…

When one of the nodes is disconnected from the network (network card disconnected in VirtualBox), so that the two nodes can no longer communicate with each other, both servers are hard-rebooted 30 seconds later.

Once the nodes start again, one of them keeps rebooting the other, so service availability is lost completely. To recover, we reconnect the first failed node to the network (network card connected again in VirtualBox) and the problem goes away.

My questions are:

1. Why does this happen?
2. How can I avoid this behaviour?

What we expect is service-level availability: if a node is temporarily disconnected from the network, the other one should be able to continue serving and using the disk, even if the network connection between the nodes is lost.

If I either kill the corosync daemon (killall -9 corosync) on one node, or shut a node down normally, the remaining node keeps working fine. Why doesn’t this work when the network card is disconnected? :-/
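
In case it helps with the diagnosis, this is roughly how I look at the SBD device on the shared disk; the device path and node name below are just placeholders, not the real ones from our setup:

[CODE]
# List the SBD slots and any pending fence (reset/off) messages per node
sbd -d /dev/sdb1 list

# Show the SBD header, including the watchdog and msgwait timeouts
sbd -d /dev/sdb1 dump

# Clear a stale fence message so a previously fenced node can start cleanly again
sbd -d /dev/sdb1 message node1 clear
[/CODE]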

I’m providing the Cluster Configuration (crm configure show) here:


On 06/09/2015 06:54 PM, jonvargas wrote:[color=blue]

QUESTIONS ARE:

  • Why does this happen?
  • How can I avoid this behaviour?[/color]

Not much help, but it sounds like STONITH isn’t working. A failed node should
not come back up after any sort of reboot once it has been STONITH’d.

That’s the whole idea of STONITH.

STONITH is not meant to be a “dead until the next reboot” sort of thing. On physical
servers this is why STONITH usually requires integration with the local IPMI, to
keep the node from coming back to life even through some kind of hard power cycle.
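
With SBD the rough equivalent is the start mode: recent sbd versions can be told to
refuse to start while a fence message is still sitting in the node’s slot on the
shared disk, which keeps a fenced node from simply booting and rejoining. Something
along these lines in /etc/sysconfig/sbd (a sketch only; whether the option is
available depends on your sbd version, and the device path is just an example):

[CODE]
# /etc/sysconfig/sbd -- illustrative values only
SBD_DEVICE="/dev/sdb1"
# With "clean", sbd will not start (and the node will not rejoin the cluster)
# while a reset/off message is still present in this node's slot.
SBD_STARTMODE="clean"
[/CODE]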

With that said, it’s always possible to confuse a cluster by unplugging and
replugging network cables… but ideally speaking, even that should still end with at
least one node being STONITH’d (in most cases).

So… no help here, just observation…
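
One more thought, for what it’s worth, since your symptom is both sides fencing each
other at the same moment: depending on the Pacemaker version, the STONITH device can
be given a random delay so that one node wins the fencing race and only the other one
gets shot. A sketch in crm shell; the resource name and the 30-second value are made
up, not taken from your configuration:

[CODE]
# Random delay of up to 30s before the fence request is issued
primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max="30s"
[/CODE]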