Hello all, I’m wondering if anyone can help me out with an odd error we are getting before I burn a ticket with NTS. We are seeing the following block of text in the messages log file on one server per day (we have 8 servers at this site) over the last two weeks.
/etc/xen/scripts/vif-bridge: online type_if=vif XENBUS_PATH=backend/vif/146/0
device tap146.0 entered promiscuous mode
br0.1: port 5 (tap146.0 entering forwarding state
br0.1: port 5(tap146.0) entering forwarding state
device vif146.0 entered promiscuous mode
br0.1: port 5 (tap146.0 entering forwarding state
br0.1: port 5(tap146.0) entering forwarding state
/etc/xen/scripts/vif-bridge: Successful vif-bridge online for vif146.0, bridge br0.1
/etc/xen/scripts/vif-bridge: Successful vif-bridge add for tap146.0, bridge br0.1
br0.1: port 5 (tap146.0 entering forwarding state
br0.1: port 5(tap146.0) entering forwarding state
The ports and tap devices are random per server. Again, we have 8 servers at this site, and each day one of them at random gets into this weird loop for about 10 minutes. We’ve noted that some of our VMs then fail to reboot (love Patch Tuesday from Micro$oft) because the network bridge name changes from vlan8 to tap146, or something else along those lines. We’ve had to manually edit our config files to get those machines to boot. Oddly, rebooting the host seems to resolve this for the guests.
This is also randomly causing problems with our iSCSI-connected targets. When the above errors begin, we start seeing:
connection3:0 ping timeout of 5 sec expired, rec timeout 5, last rx 15629386836, last ping 15629388088, now 15629389340
connection3:0 detected conn error (1011)
Kernel reported iSCSI connection 3:0 error (1011 - ISCSI_ERR_CONN_FAILED: ISCSI connection failed) state (3)
connection3:0 is operational after recovery (4 attempts)
which again cycles through the ping timeout, the connection error, and then the recovery message for about 10 minutes.
The weird thing is that our switches are clean: they show zero errors during the time frames in which we see these errors. Some of the errors occur around midnight (i.e. when we don’t have users or IT staff moving huge files); others occur during peak business hours.
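For anyone who wants to check their own hosts for the same pattern, here is a rough Python sketch of how the log could be scanned for the messages quoted above. The regular expressions are built only from the excerpts in this post, and the category names and the `classify`/`summarize` helpers are made up for illustration, so treat it as a starting point rather than a tested tool.

```python
import re
import sys
from collections import Counter

# Patterns derived from the log excerpts quoted in this post.
# The category names are hypothetical labels chosen for this sketch.
PATTERNS = {
    "bridge_forwarding": re.compile(r"port \d+\s*\(\S+ entering forwarding state"),
    "promiscuous": re.compile(r"device \S+ entered promiscuous mode"),
    "iscsi_ping_timeout": re.compile(r"connection\d+:\d+ ping timeout"),
    "iscsi_conn_error": re.compile(
        r"conn error \(1011\)|error \(1011 - ISCSI_ERR_CONN_FAILED"
    ),
    "iscsi_recovered": re.compile(r"is operational after recovery"),
}


def classify(line):
    """Return the category name for a log line, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller; nothing is assumed about its location.
    with open(sys.argv[1], errors="replace") as handle:
        for kind, count in summarize(handle).most_common():
            print(f"{kind}: {count}")
```

Run against each host's log, the counts should make it easy to see which server had the bridge and iSCSI messages on a given day.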
I have 20 sites in total and this is the only site (currently) doing this. Servers are SLES11SP3 and have been running since about a week after SP3 was launched. We have not patched the hosts, and this issue started (we believe) about two weeks ago.
My ‘Google-fu’ has led me to this link:
However, the conversation around that troubleshooting guide focuses on SLES10, not SLES11, so I am hesitant to try the fixes therein (specifically the netloop one). That said, we have 8 NICs bonded into two bonds, with four VLANs presented to the four bridges, so it seems possible we are running into that issue. Then again, this setup has been live since about a week after SLES11SP3 became available, so one would think that if the configuration were bad we’d have noticed well before now.
Hopefully that’s not too bad a wall of text, and if any additional information would prove useful I’ll gladly get it.
Thanks for reading!