SBD and multipath

gehorvath · January 24, 2013, 3:15pm

Hi,

I have 2 SLES 11 SP1 clusters (10 nodes in one blade center) with a strange issue. The nodes randomly lose communication with the sbd partition and reboot. The following warnings appear in /var/log/warn:
sbd: [7817]: WARN: Latency: No liveness for 59 s exceeds threshold of 3 s (healthy servants: 0)
sbd: [7817]: WARN: Latency: No liveness for 60 s exceeds threshold of 3 s (healthy servants: 0)

The timeout is 60 seconds, so sbd is doing its job correctly. But why is the communication failing? Why can’t the sbd watchdog access the sbd partition? That is the question. There is nothing in the logs indicating that any of the 4 paths went down, and nothing from multipathd, just the sbd messages. The polling interval in multipath.conf is set to 1 second. If I am right, when a path fails there should be something in the logs after 4 seconds showing that multipathd is trying to recover, shouldn’t there?

Or can a RAID group on the SAN stall for more than 60 seconds with all 4 paths up and running?
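
To check whether the sbd warnings really have no multipathd entries near them, the log can be scanned with a small script. This is only a minimal sketch: the regular expression is built from the two warning lines quoted above, the helper names are made up for illustration, and matching multipathd entries by the plain substring "multipathd" is an assumption about how those lines would look.

```python
import re

# Pattern derived from the two warning lines quoted in this post.
LATENCY_RE = re.compile(
    r"sbd: \[(?P<pid>\d+)\]: WARN: Latency: No liveness for (?P<gap>\d+) s "
    r"exceeds threshold of (?P<threshold>\d+) s "
    r"\(healthy servants: (?P<servants>\d+)\)"
)


def parse_sbd_latency(lines):
    """Return one dict of integer fields per sbd latency warning found."""
    events = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            events.append({key: int(val) for key, val in match.groupdict().items()})
    return events


def count_multipathd_lines(lines):
    """Count lines mentioning multipathd (assumption: plain substring match)."""
    return sum(1 for line in lines if "multipathd" in line)


# The two lines from /var/log/warn as posted.
SAMPLE = [
    "sbd: [7817]: WARN: Latency: No liveness for 59 s exceeds threshold of 3 s (healthy servants: 0)",
    "sbd: [7817]: WARN: Latency: No liveness for 60 s exceeds threshold of 3 s (healthy servants: 0)",
]

if __name__ == "__main__":
    for event in parse_sbd_latency(SAMPLE):
        print(event)
    print("multipathd lines:", count_multipathd_lines(SAMPLE))
```

On a real node the same functions could be fed with the lines of /var/log/warn instead of SAMPLE.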

Thanks,
Gellert

system · January 28, 2013, 2:27pm

Maybe the HBA settings (e.g. port down retry, link down retry and so on)
are too slow? Then multipathing will never detect the failure and switch
paths, because the HBA keeps (re)trying to recover, and in the meantime
SBD does its job. AFAIR SBD waits 8 seconds before sending a poison pill
(stonith, right?)

Tom

gehorvath · January 28, 2013, 3:02pm

/etc/modprobe.conf.local already contains the line options qla2xxx qlport_down_retry=1, which should mean that the HBA does not try to recover and propagates the error immediately. SBD waits 60 seconds before sending the poison pill. The multipath polling interval is set to 1s.
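
An entry in /etc/modprobe.conf.local only takes effect once the module has been loaded with it, so it may be worth confirming the value the running driver actually uses. The following is a rough sketch, assuming the qla2xxx driver exposes qlport_down_retry under /sys/module/qla2xxx/parameters/ (not verified on this system); the function name and the sysfs_root argument are illustrative only.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return the live value of a kernel module parameter as a string.

    Returns None if the module is not loaded or the parameter is not
    exposed in sysfs (assumption: readable parameters appear under
    <sysfs_root>/<module>/parameters/<param>).
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param("qla2xxx", "qlport_down_retry")
    if value is None:
        print("qlport_down_retry is not readable; is qla2xxx loaded?")
    else:
        print("qlport_down_retry =", value)
```

If the printed value differs from the one in the modprobe file, the option was probably not applied when the module was loaded, for example because the module sits in an initrd that was built before the change.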

system · January 28, 2013, 4:13pm

On 28.01.2013 14:04, gehorvath wrote:

> The /etc/modprobe.conf.local contains already the options qla2xxx
> qlport_down_retry=1, which should have the effect, that the HBA will not
> try to recover, it will propagate the error immediately. The SBD waits
> for 60 seconds till the poison pill. Multipath polling interval is set
> to 1s.

What does SANsurfer say about the settings? Under the advanced settings
there are two options: port down retry count and link down timeout.

Tom
