I have two SLES 11 SP1 clusters (10 nodes in one blade center) with a strange issue. The nodes randomly lose communication with the sbd partition and reboot. The following warnings appear in /var/log/warn:
sbd: [7817]: WARN: Latency: No liveness for 59 s exceeds threshold of 3 s (healthy servants: 0)
sbd: [7817]: WARN: Latency: No liveness for 60 s exceeds threshold of 3 s (healthy servants: 0)
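To see how often these stalls occur and how long they last, the warnings can be pulled out of the log with a small script. This is only a minimal Python sketch: it assumes the message format is exactly the one in the two lines above, and the function names are made up for illustration.

```python
import re

# Matches the sbd latency warning format quoted above. Any syslog prefix
# (timestamp, hostname) in front of "sbd:" is ignored by using search().
LATENCY_RE = re.compile(
    r"sbd: \[(?P<pid>\d+)\]: WARN: Latency: No liveness for (?P<secs>\d+) s "
    r"exceeds threshold of (?P<threshold>\d+) s "
    r"\(healthy servants: (?P<healthy>\d+)\)"
)


def parse_latency_warnings(lines):
    """Return one (pid, secs_without_liveness, threshold, healthy_servants)
    tuple for every line that matches the sbd latency warning format."""
    events = []
    for line in lines:
        m = LATENCY_RE.search(line)
        if m:
            events.append(
                (int(m["pid"]), int(m["secs"]), int(m["threshold"]), int(m["healthy"]))
            )
    return events


def longest_stall(events):
    """Longest reported gap in seconds, or None if there were no warnings."""
    return max((secs for _, secs, _, _ in events), default=None)
```

Feeding it the open log file, e.g. `parse_latency_warnings(open("/var/log/warn", errors="replace"))`, gives a list that can be compared against the times of the reboots and against anything multipathd or the HBA driver logged around the same moments.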
The timeout is 60 seconds, so sbd is doing its job correctly. But why is the communication failing? Why can't the sbd watchdog access the sbd partition? That is the question. There is nothing in the logs indicating that any of the 4 paths went down, and nothing from multipathd, just the sbd messages. The polling interval in multipath.conf is set to 1 second. If I am right, when a path fails there should be something in the logs after about 4 seconds showing that multipathd is trying to recover, shouldn't there?
Or can a RAID group on the SAN stall for more than 60 seconds with all 4 paths up and running?
Maybe the HBA settings (e.g. port down retry, link down retry and so on)
are too slow? Then multipathing will never detect the failure and switch
the path, because the HBA keeps (re)trying to recover, and in the meantime
SBD does its job. AFAIR SBD waits 8 seconds before sending a poison pill
(stonith, right?)
/etc/modprobe.conf.local already contains the line options qla2xxx qlport_down_retry=1, which should mean that the HBA does not try to recover but propagates the error immediately. SBD waits 60 seconds before the poison pill. The multipath polling interval is set to 1 s.
On 28.01.2013 at 14:04, gehorvath wrote:[color=blue]
/etc/modprobe.conf.local already contains the line options qla2xxx
qlport_down_retry=1, which should mean that the HBA does not try to
recover but propagates the error immediately. SBD waits 60 seconds
before the poison pill. The multipath polling interval is set to 1 s.
[/color]
What does SANsurfer say about the settings? Under the advanced
settings there are two options: port down retry count and link down timeout.
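Besides SANsurfer, it may be worth checking which value the loaded driver is actually using. A minimal Python sketch, assuming the qla2xxx driver exposes qlport_down_retry under /sys/module/qla2xxx/parameters/ (the usual sysfs location for readable module parameters); the helper name is hypothetical:

```python
from pathlib import Path


def read_module_params(module, names, sys_module_dir="/sys/module"):
    """Read the given parameters of a loaded kernel module from sysfs.

    Returns a dict mapping each name to its value as a string, or to None
    if the parameter file does not exist or cannot be read.
    """
    base = Path(sys_module_dir) / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (base / name).read_text().strip()
        except OSError:
            values[name] = None
    return values
```

Calling `read_module_params("qla2xxx", ["qlport_down_retry"])` on one of the nodes should return `{"qlport_down_retry": "1"}` if the option from /etc/modprobe.conf.local is in effect. If it shows something else, the option may not have been applied when the driver was loaded, for example if the module comes from the initrd.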