The servers in our datacentre (variously Dell PER710, PE2950 and HP DL380G5) are configured with 2 NICs bonded in an active-backup configuration (mode=active-backup miimon=100). The networking team have been complaining that the servers are causing ‘port flapping’ on their Cisco switches, so we have been comparing the switch and server logs to see whether we can detect a pattern. In each instance that we have found so far, when the flapping occurs, /var/log/messages has a warning about the lp driver:
Oct 22 09:03:56 <hostname> kernel: lp: driver loaded but no devices found
We have various OS versions, but in the instance here it’s SLES10SP3 (and OES2SP3):
# cat /etc/*release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 3
LSB_VERSION="core-2.0-noarch:core-3.0-noarch:core-2.0-x86_64:core-3.0-x86_64"
Novell Open Enterprise Server 2.0.3 (x86_64)
VERSION = 2.0.3
PATCHLEVEL = 3
BUILD
We don’t believe in coincidences but this really seems very strange indeed. Can anyone offer any insight as to what is going on here?
[QUOTE=kat_tyrie;24293]The servers in our datacentre (variously Dell PER710, PE2950 and HP DL380G5) are configured with 2 NICs bonded in an active-backup configuration (mode=active-backup miimon=100). The networking team have been complaining that the servers are causing ‘port flapping’ on their Cisco switches, so we have been comparing the switch and server logs to see whether we can detect a pattern. In each instance that we have found so far, when the flapping occurs, /var/log/messages has a warning about the lp driver:
Oct 22 09:03:56 <hostname> kernel: lp: driver loaded but no devices found
[…]We don’t believe in coincidences but this really seems very strange indeed. Can anyone offer any insight as to what is going on here?[/QUOTE]
I remember those messages (but not too well); iirc they are symptoms, not causes. If my crystal ball is tuned to the proper channel, then something else is going on on your server(s) and is probably being seen as some kind of hardware change. Do you have anything else in syslog for that time window, not necessarily error messages? Is there any correlation between the time windows when this occurs? Do you see those lp0 messages also when no port flapping is registered by your network team?
Are both ports flapping, only the inactive one or only the active port? And usually, you should see link status change messages in syslog. As you mentioned none - is it only the Cisco side that is registering the up/downs?
Sorry for the separate reply: if “flapping” actually means “MAC addresses moving from one port to the other” (rather than down/up cycles of a single port), you might look into what traffic is going across the inactive link. It might be the bonding driver checking the link status for whatever reason.
Are only servers with rather oldish software versions affected, or new ones (SLES11SP2/3), too? There may have been bugs in the older implementations, I’ve seen reports that not all options that were set were actually in effect… and you might try to set
mode=1 miimon=100 downdelay=200 updelay=200 primary=<yourpreferrednetdevice> to make miimon a little bit less sensitive to link status changes and to set a preferred interface.
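To check whether the options you set are really in effect, you can compare them with what the bonding driver itself reports in /proc/net/bonding/bond0. Below is a minimal Python sketch of that comparison. The bond name, the sample text and its values, and the function names are all my own assumptions for illustration; nothing here is taken from the servers in this thread.

```python
# Sample text laid out like /proc/net/bonding/bond0. The values are
# illustrative assumptions, not output from the servers discussed here.
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.0.1

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
"""


def parse_bond_status(text):
    """Collect the bond-level 'Key: value' pairs (stop at the first slave section)."""
    settings = {}
    for line in text.splitlines():
        if line.startswith("Slave Interface:"):
            break
        key, sep, value = line.partition(":")
        if sep and value.strip():
            settings[key.strip()] = value.strip()
    return settings


def check_options(settings, expected):
    """Return {option: (expected, actual)} for every option that does not match."""
    return {
        name: (want, settings.get(name))
        for name, want in expected.items()
        if settings.get(name) != want
    }


if __name__ == "__main__":
    wanted = {
        "MII Polling Interval (ms)": "100",
        "Up Delay (ms)": "200",
        "Down Delay (ms)": "200",
    }
    for name, (want, got) in check_options(parse_bond_status(SAMPLE), wanted).items():
        print(f"{name}: expected {want}, driver reports {got}")
```

On a real server you would pass the contents of /proc/net/bonding/bond0 instead of SAMPLE. Any line printed is an option that was configured but is not what the driver is actually using.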
Thanks Jens. It will take me a while to go through the switch logs… In the meantime, we do have some SLES11SP1 and SLES11SP2 servers and there are no lp0 messages in their logs. All of the SLES10SPx servers I have checked (clustered and standalone) which are configured with bonded NICs have the lp0 message. I did look for patterns in the logs and nothing stands out so far. Often there is nothing written to the log for several (10-30) minutes either side of the message, and there is certainly nothing I can see indicating that bond0 is polling the ports or that the NIC states are changing. I’ll let you know what I find.
When you come back, it’d help to also tell us what kind of “flapping” your Ciscos report: down/up of single ports, or MACs moving from one port to another. The term is used ambiguously, and obviously these symptoms differ, as do their potential causes.
Having said that there didn’t seem to be a pattern, if I look specifically at 2 nodes on one of the 8-node clusters, in the first example it’s changing port after a reasonably random time interval and in the second example it’s pretty much every 2 hours. Looking at a node on another cluster, while there are some “cl1agw kernel: lp: driver loaded but no devices found” messages in isolation, they are also appearing with the zmd: ShutdownManager and zmd: NetworkManagerModule warnings. That’s beginning to look like a big flashing neon sign…
In each case the lp0 message corresponds to the ‘flap’ in the switch log. Looks like it’s ‘MACs moving from one port to another’.
cheers
Kat
[CODE]Example 1
Switch:
2014 Aug 14 14:10:09.932 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po94 to Po93
2014 Aug 14 14:10:16.431 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po93 to Po94
[…]
2014 Aug 14 21:34:10.507 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po94 to Po93
2014 Aug 14 21:34:18.036 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po93 to Po94
[…]
2014 Aug 15 00:29:02.961 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po94 to Po93
2014 Aug 15 00:29:09.681 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po93 to Po94
And corresponding SuSE log entries:
Aug 14 13:34:29 syslog-ng[13762]: STATS: dropped 0
Aug 14 13:57:19 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 14:10:02 kernel: lp: driver loaded but no devices found
Aug 14 14:27:30 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 14:27:31 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 21:24:02
[…]
Aug 14 20:34:31 syslog-ng[13762]: STATS: dropped 0
Aug 14 21:24:06 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 21:34:02 kernel: lp: driver loaded but no devices found
Aug 14 21:34:31 syslog-ng[13762]: STATS: dropped 0
Aug 14 21:54:17 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 21:54:18 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 22:26:30
Aug 14 23:34:32 syslog-ng[13762]: STATS: dropped 0
[…]
Aug 15 00:00:57 volmnd[25804]: Pinging EFLs
Aug 15 00:11:54 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 15 00:28:55 kernel: lp: driver loaded but no devices found
Aug 15 00:34:32 syslog-ng[13762]: STATS: dropped 0
Aug 15 00:42:06 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 15 00:42:07 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/15/2014 14:00:19[/CODE]
[CODE]Example 2
Switch:
2014 Aug 14 13:27:24.277 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po94 to Po93
2014 Aug 14 13:27:30.524 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po93 to Po94
2014 Aug 14 13:27:30.524 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po93 to Po94
2014 Aug 14 13:59:12.468 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po94 to Po93
2014 Aug 14 13:59:20.216 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po93 to Po94
2014 Aug 14 15:27:24.467 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po94 to Po93
2014 Aug 14 15:27:25.809 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po93 to Po94
2014 Aug 14 17:27:24.606 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po94 to Po93
2014 Aug 14 17:27:31.293 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po93 to Po94
2014 Aug 14 19:27:24.089 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po94 to Po93
2014 Aug 14 19:27:27.748 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po93 to Po94
2014 Aug 14 21:27:22.647 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6a9a in vlan 100 has moved from Po94 to Po93
And corresponding SuSE log entries:
Aug 14 12:27:44 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 13:17:16 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 13:27:16 hostname2 kernel: lp: driver loaded but no devices found
Aug 14 13:27:44 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 13:32:28 hostname2 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 13:32:29 hostname2 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 13:44:20
Aug 14 13:44:24 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 13:54:35 hostname2 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 13:54:36 hostname2 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 13:55:04
Aug 14 13:55:04 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 13:59:05 hostname2 kernel: lp: driver loaded but no devices found
Aug 14 14:00:16 hostname2 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 14:00:17 hostname2 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 15:17:15
Aug 14 14:27:45 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 15:17:16 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 15:27:15 hostname2 kernel: lp: driver loaded but no devices found
Aug 14 15:27:46 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 15:32:29 hostname2 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 15:32:30 hostname2 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 17:17:15
Aug 14 16:27:46 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 17:17:19 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 17:27:15 hostname2 kernel: lp: driver loaded but no devices found
Aug 14 17:27:47 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 17:32:30 hostname2 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 17:32:31 hostname2 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 19:17:15
Aug 14 18:27:48 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 19:17:19 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 19:27:15 hostname2 kernel: lp: driver loaded but no devices found
Aug 14 19:27:49 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 19:32:36 hostname2 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 19:32:37 hostname2 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 21:17:15
Aug 14 20:27:49 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 21:17:16 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 21:27:15 hostname2 kernel: lp: driver loaded but no devices found
Aug 14 21:27:50 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 21:32:28 hostname2 zmd: ShutdownManager (WARN): Preparing to sleep…
Aug 14 21:32:29 hostname2 zmd: ShutdownManager (WARN): Going to sleep, waking up at 08/14/2014 23:17:15
Aug 14 22:27:50 hostname2 syslog-ng[13746]: STATS: dropped 0
Aug 14 23:17:19 hostname2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager[/CODE]
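The matching between the two logs above can also be done mechanically rather than by eye. Here is a rough Python sketch that pairs each lp message with the MAC moves the switch logged shortly afterwards. It uses a few lines copied from Example 1 as sample data. The 30-second window, the function names, and the assumption that switch and server clocks are in sync are mine, not something established in this thread.

```python
from datetime import datetime, timedelta

# A few lines copied from Example 1 above.
SWITCH_LOG = """\
2014 Aug 14 14:10:09.932 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po94 to Po93
2014 Aug 14 14:10:16.431 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po93 to Po94
2014 Aug 14 21:34:10.507 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po94 to Po93
2014 Aug 14 21:34:18.036 %L2FM-4-L2FM_MAC_MOVE: Mac xxxx.xxxx.6b98 in vlan 100 has moved from Po93 to Po94
"""

SERVER_LOG = """\
Aug 14 13:57:19 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Aug 14 14:10:02 kernel: lp: driver loaded but no devices found
Aug 14 14:27:30 zmd: ShutdownManager (WARN): Preparing to sleep
Aug 14 21:34:02 kernel: lp: driver loaded but no devices found
"""


def mac_move_times(text):
    """Timestamps of all L2FM_MAC_MOVE lines in the switch log."""
    times = []
    for line in text.splitlines():
        if "L2FM_MAC_MOVE" in line:
            stamp = " ".join(line.split()[:4])
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S.%f"))
    return times


def lp_message_times(text, year):
    """Timestamps of the lp driver messages; syslog omits the year, so pass it in."""
    times = []
    for line in text.splitlines():
        if "lp: driver loaded" in line:
            stamp = " ".join(line.split()[:3])
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def correlate(lp_times, move_times, window=timedelta(seconds=30)):
    """Map each lp message time to the MAC moves seen within `window` after it."""
    return {
        t: [m for m in move_times if timedelta(0) <= m - t <= window]
        for t in lp_times
    }


if __name__ == "__main__":
    pairs = correlate(lp_message_times(SERVER_LOG, 2014), mac_move_times(SWITCH_LOG))
    for lp_time, moves in pairs.items():
        print(lp_time, "->", len(moves), "MAC move(s) within 30 s")
```

An lp message with no moves after it, or a move with no lp message before it, would be the interesting cases to look at more closely.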
You definitely seem to be on the right track! And yes, it’s a MAC address moving from one port to the other (and back), which does look like a “link active” check (within the bonding code) to me.
BTW, from earlier messages I read that this happens both with SLES10 and SLES11 machines. While all traces I found on the net implied that earlier releases of the Linux bonding code may have issues that cause such behavior, it might well be that this check is done with current bonding code, too.
Are these “flip messages” more than annoying? I’d say they are simple results of “normal operation” of the bonding driver and could be ignored. So if your networking team dislikes them, they might want to set up a filter in their correlation engine so that flips every two hours are silently ignored and only “moves” (without going back to the original port) or “too many flips” are regarded as worth reporting to the upper layers?
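To make the filter idea a bit more concrete, here is a small Python sketch of one possible rule: a move that is reversed within a short time counts as a harmless “flip”, and only moves that never go back, or an unusually high number of flips, are worth reporting. The 30-second flip window, the flip limit and all names are assumptions of mine for illustration. This is not how any particular Cisco or monitoring product works.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Move(NamedTuple):
    when: datetime
    mac: str
    src: str
    dst: str


def classify(moves, flip_window=timedelta(seconds=30)):
    """Split moves into (flips, unmatched).

    A flip is a move that is reversed (same MAC, ports swapped) within
    flip_window; everything else is an unmatched, one-way move.
    """
    flips, unmatched = [], []
    pending = {}  # mac -> move still waiting for its reverse
    for mv in sorted(moves):
        prev = pending.get(mv.mac)
        if (prev and prev.src == mv.dst and prev.dst == mv.src
                and mv.when - prev.when <= flip_window):
            flips.append((prev, mv))
            del pending[mv.mac]
        else:
            if prev:
                unmatched.append(prev)
            pending[mv.mac] = mv
    unmatched.extend(pending.values())
    return flips, unmatched


def should_report(moves, max_flips=24):
    """Report only one-way moves or more than max_flips flips in the given batch."""
    flips, unmatched = classify(moves)
    return bool(unmatched) or len(flips) > max_flips


if __name__ == "__main__":
    def at(stamp):
        return datetime.fromisoformat(stamp)

    # Timestamps taken from Example 2 above, rounded down to whole seconds.
    batch = [
        Move(at("2014-08-14 13:27:24"), "xxxx.xxxx.6a9a", "Po94", "Po93"),
        Move(at("2014-08-14 13:27:30"), "xxxx.xxxx.6a9a", "Po93", "Po94"),
    ]
    print(should_report(batch))
```

The batch passed to should_report would be whatever reporting period the networking team prefers, for example one day of MAC move events per switch.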