XEN network bridge (random) restarting

Hello all, I’m wondering if anyone can help me out with an odd error we are getting before I burn a ticket with NTS. We are seeing the following block of text in the message log file on one server per day (we have 8 servers at this site) over the last two weeks.

/etc/xen/scripts/vif-bridge: online type_if=vif XENBUS_PATH=backend/vif/146/0
device tap46.0 entered promiscuous mode
br0.1: port 5 (tap146.0 entering forwarding state
br0.1: port 5(tap146.0) entering forwarding state
device vif146.0 entered promiscuous mode
br0.1: port 5 (tap146.0 entering forwarding state
br0.1: port 5(tap146.0) entering forwarding state
/etc/xen/scripts/vif-bridge: Successful vif-bridge online for vif146.0, bridge br0.1
/etc/xen/scripts/vif-bridge: Successful vif-bridge add for tap146.0, bridge br0.1
br0.1: port 5 (tap146.0 entering forwarding state
br0.1: port 5(tap146.0) entering forwarding state

The ports and tap devices are random per server. Again, we have 8 servers at this site, and each day one of them, seemingly at random, gets into this weird loop for about 10 minutes. We’ve noted that some of our VMs then fail to reboot (love Patch Tuesday from Micro$oft) because the network bridge name changes from vlan8 to tap146, or something else along those lines. We’ve had to manually edit our config files to get those machines to boot. Oddly, rebooting the host seems to resolve this for the guests.

This issue is also intermittently affecting our iSCSI-connected targets. When the above errors begin, we start seeing:

connection3:0 ping timeout of 5 sec expired, rec timeout 5, last rx 15629386836, last ping 15629388088, now 15629389340
connection3:0 detected conn error (1011)
Kernel reported iSCSI connection 3:0 error (1011 - ISCSI_ERR_CONN_FAILED: ISCSI connection failed) state (3)
connection3:0 is operational after recovery (4 attempts)

which again cycles through the ping timeout, the connection error, and then the recovery message for about 10 minutes.

The weird thing is that our switches are clean: they show zero errors during the time frames in which we note these errors. Some of the errors occur at midnight (i.e. we don’t have users or IT staff moving huge files); others occur during peak business hours.
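To line up the bridge messages with the iSCSI errors and with the switch counters, it can help to count both kinds of events per minute. The following is only a rough sketch: the patterns are copied from the excerpts above, the `events_per_minute` name is made up for illustration, and it assumes the classic syslog prefix (`Mon DD HH:MM:SS host ...`), which may not match every logging setup.

```python
import re
from collections import Counter

# Patterns taken from the log excerpts in this thread; adjust them to your logs.
PATTERNS = {
    "bridge": re.compile(
        r"vif-bridge|entering forwarding state|entered promiscuous mode"
    ),
    "iscsi": re.compile(
        r"ping timeout|detected conn error|ISCSI_ERR_CONN_FAILED"
        r"|operational after recovery"
    ),
}

# Assumption: classic syslog timestamp prefix, e.g. "Mar  3 00:01:02 host1 ...".
# Group 1 keeps the timestamp truncated to the minute.
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s")


def events_per_minute(lines):
    """Count bridge and iSCSI events, keyed by (minute, kind)."""
    counts = Counter()
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a recognisable timestamp
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(match.group(1), kind)] += 1
    return counts
```

Feeding it an open log file (for example `events_per_minute(open(path))`, where `path` points at the message log) gives a table showing whether the iSCSI timeouts start in the same minute as the bridge messages or only afterwards.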

I have 20 sites in total and this is the only site (currently) doing this. Servers are SLES11SP3 and have been running since about a week after SP3 was launched. We have not patched the hosts, and this issue started (we believe) about two weeks ago.

My ‘Google-fu’ has led me to this link:

https://www.suse.com/communities/conversations/xen-network-bridges-explained-with-troubleshooting-notes/

However, the conversation around that troubleshooting guide focuses on SLES10, not SLES11, so I am hesitant to try the fixes therein (specifically the netloop one). That said, we have 8 NICs bonded into two bonds, with four VLANs presented to the four bridges, so it seems possible we are running into that issue. Then again, this setup has been live since about a week after SLES11SP3 was available, so one would think that if the configuration we have been running were bad, we’d have noticed well before now.

Hopefully that’s not too bad a wall of text, and if any additional information would prove useful I’ll gladly get it.

Thanks for reading!

Hi amginenigma,

how do you set up your networking (bonds, bridges) - is this via SLES or are you using the Xen scripts?

I strongly suggest building your server system so that all the basic networking, up to and including the bridge level, is set up via the SLES scripts. You can then leave any network-related scripting actions out of your xend configuration and simply reference the appropriate bridge name in the DomU VIF configuration. I’ve found that to work reliably, for years, and it would rule out xend as a source of trouble in your specific case ;)
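As a rough illustration of referencing the bridge by name, a DomU configuration file (xm/xend configuration files use Python syntax) might contain a line like the one below. The bridge name `br0.1` is taken from the log excerpt earlier in the thread; the MAC address is a made-up placeholder, and your actual configuration may carry additional VIF options.

```python
# Fragment of a DomU configuration file (xm/xend configs are Python syntax).
# "br0.1" is the bridge seen in the log excerpt; it must already exist on
# the host, created by the SLES network scripts rather than by xend.
# The MAC address is a hypothetical placeholder.
vif = ['mac=00:16:3e:00:00:01,bridge=br0.1']
```

With the bridge named explicitly like this, the guest attaches to whatever bridge the host already provides, so xend does not have to create or rename anything at guest start.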

Regards,
Jens

Hi Jens,

Thanks for the reply! I am using (and have been using) SLES to set up networking for years as well. In fact, I just looked at uptime on one of these boxes, and it’s been 500+ days since it was last rebooted. I note that because I had a DBA in my office complaining of performance issues and demanding an immediate reboot of the host; seeing as he had a manager in the office with him, I complied, and to my amazement it seems to have resolved the performance issue. Looking at the logs since Monday, I’ve not seen this server’s network stack ‘flap’ (or whatever term we’d like to use); however, if it is going to happen again, I don’t expect it to for another few days.

Hi amginenigma,

“in fact just looked at uptime on one of these boxes and it’s been 500+ days since it was last rebooted”

while that number looks “nice” at first glance, I wonder if you have been applying updates to your installation during that time… a lot has happened concerning Xen in SLES11SP3, not only the base OS.

“In looking at the logs since Monday I’ve not seen this server network stack ‘flap’ or whatever term we’d like to use, however I don’t expect it will if it is going to for another few days”

If/when you see this happen again, it’d be nice to see some syslog from (right) before the first instance of these messages. If you wouldn’t want these in public for security reasons, feel free to send me a private message. And if this happens even if you update to the latest patch level, opening an SR would be an option, too - this definitely isn’t a “typical error”.
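One way to grab the syslog lines leading up to the first occurrence is a small helper like the following. This is just a sketch: the function name `context_before_first` and the default of 50 lines are arbitrary choices for illustration, and the pattern you pass in would be whatever marks the start of the episode in your logs (for example `vif-bridge: online`).

```python
import re
from collections import deque


def context_before_first(lines, pattern, before=50):
    """Return up to `before` lines preceding the first match, plus the match.

    Returns an empty list if the pattern never matches.
    """
    regex = re.compile(pattern)
    window = deque(maxlen=before)  # rolling buffer of the most recent lines
    for line in lines:
        if regex.search(line):
            return list(window) + [line]
        window.append(line)
    return []
```

Because it keeps only a small rolling buffer, it can be run over a large log file without loading the whole file into memory.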

Regards,
Jens