What is it with SLES10 and Xen bridging failing suddenly?

While we’re mostly on SLES11, we’ve still got 3 older SLES10 boxes currently out in the field due to lack of personnel and money.

We’ve had 4 SLES10 SP4 servers over the past year and a half just suddenly stop talking on the network due to the Xen bridge failing. Interfaces had to be deleted and recreated. In one case there was no getting it working again, but fortunately that server could be replaced without losing much data.
The latest incident was this past Monday. The host and guests were all up and responsive to keyboard input (after an hour and a half drive to test it), but the network refused to work.
It’s like the bridging software in SLES10 Xen just disintegrates after about 5 years; everything else is fine. Sometimes it’s just the VMs that lose connectivity, though this last time even the physical host was inaccessible.
These are/were all Dell R710s.
SLES10 is now past EOL, I believe, so any pointers would be extremely appreciated, as I expect to run into this again before we get those servers migrated. I’ve noticed SLES11 does its bridging quite a bit differently.

Hi lpphiggp,

[… linux bridge suddenly failing …]

which debugging steps have you taken, with what results?

I’d specifically

  • look into syslog to see if anything special happened around the time connectivity dropped

  • run rpm -V to verify that the on-disk contents are left unmodified - especially the bridging setup scripts

  • run “brctl show” to check the setup once bridging failed

  • run network traces to see what actually happens, if everything else “looks good”

If you see any results you’d like to discuss, please include details on how your networking setup (system-, Xen- and DomU-wise) is configured on the affected server. If you use commands to retrieve such details (e.g. “ifconfig”, “brctl”), cut & paste of the command invocation and results is preferred, at least by me :wink:
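
To be concrete, the checks above could look roughly like this (package, bridge and interface names are only examples, so adjust them to your box):

# anything odd in syslog around the time connectivity dropped?
grep -iE 'eth|peth|vif|brid' /var/log/messages | less

# are the installed bridge/Xen scripts unmodified on disk?
rpm -V bridge-utils
rpm -V xen-tools

# how does the bridge look once it has failed?
brctl show
brctl showmacs eth1          # "eth1" = your bridge name

# what is actually on the wire, if everything else "looks good"?
tcpdump -ni peth1
tcpdump -ni eth1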

Regards,
Jens

[QUOTE=jmozdzen;30930]Hi lpphiggp,

[… linux bridge suddenly failing …]

which debugging steps have you taken, with what results?

I’d specifically

  • look into syslog to see if anything special happened around the time connectivity dropped

  • run rpm -V to verify that the on-disk contents are left unmodified - especially the bridging setup scripts

  • run “brctl show” to check the setup once bridging failed

  • run network traces to see what actually happens, if everything else “looks good”

If you see any results you’d like to discuss, please include details on how your networking setup (system-, Xen- and DomU-wise) is configured on the affected server. If you use commands to retrieve such details (e.g. “ifconfig”, “brctl”), cut & paste of the command invocation and results is preferred, at least by me :wink:

Regards,
Jens[/QUOTE]

Hi Jens,

I didn’t do a lot of troubleshooting at the time; it needed to get back up ASAP, so I mostly just blew out all the networking and recreated it.
Nothing interesting showed up in /var/log/messages on either the host or the guest, and the /var/log/xen logs on the host aren’t helpful in that regard either. It’s like the server got blindsided.

But here’s the current config, and so far it’s working: the physical host and the DHCP VM have been fine for over a week, and the data share VM has been okay since last Thursday.
(I just changed our IP here for anonymity)

xenbuena:~ # ifconfig
eth1      Link encap:Ethernet  HWaddr 84:2B:2B:1A:31:62
          inet addr:172.16.20.182  Bcast:172.16.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:52115496 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12928553 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:145149685141 (138425.5 Mb)  TX bytes:1504565208 (1434.8 Mb)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:220079 errors:0 dropped:0 overruns:0 frame:0
          TX packets:220079 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:672068331 (640.9 Mb)  TX bytes:672068331 (640.9 Mb)

peth1     Link encap:Ethernet  HWaddr 84:2B:2B:1A:31:62
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:121007130 errors:0 dropped:0 overruns:0 frame:0
          TX packets:131215180 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:39524193182 (37693.2 Mb)  TX bytes:25926839220 (24725.7 Mb)
          Interrupt:21 Memory:d8000000-d8012800

vif5.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4128276 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9430409 errors:0 dropped:15 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:358036010 (341.4 Mb)  TX bytes:6349218524 (6055.0 Mb)

vif8.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:84928345 errors:0 dropped:0 overruns:0 frame:0
          TX packets:52539983 errors:0 dropped:10057 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:146740457628 (139942.6 Mb)  TX bytes:10570369062 (10080.6 Mb)

xenbuena:~ #
xenbuena:~ # brctl show
bridge name     bridge id               STP enabled     interfaces
eth1            8000.842b2b1a3162       no              peth1
                                                        vif5.0
                                                        vif8.0
xenbuena:~ #

But this is with everything currently working.

Hi lpphiggp,

But this is with everything currently working

… which makes a good baseline for comparison when things go bad again :slight_smile:
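
For example, you could snapshot the working state right now, so there’s something to diff against when it breaks again (the file names are just a suggestion):

# snapshot of the currently working network state, for later comparison
ifconfig -a   > /root/net-baseline-ifconfig.txt
brctl show    > /root/net-baseline-brctl.txt
ip route show > /root/net-baseline-routes.txt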

BTW, do you set up bridging via Xen (so you have some script configured in xend’s config file)? Even in SLES10 days, I found it to be more “logical” (and thus more maintainable) to create all bridges via regular SLES, and then only have Xen connect the VIFs to the corresponding bridge.
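
As a rough sketch of what I mean (the keywords below follow the SLES11 sysconfig syntax, so double-check them against your SLES10 service pack; names, addresses and the MAC are placeholders):

# /etc/sysconfig/network/ifcfg-br0 -- bridge created by the OS, not by xend
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='172.16.20.182/24'
BRIDGE='yes'
BRIDGE_PORTS='eth1'
BRIDGE_STP='off'
BRIDGE_FORWARDDELAY='0'

# the DomU config then only tells Xen which bridge to attach the VIF to:
vif = [ 'mac=00:16:3e:xx:xx:xx,bridge=br0' ]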

Best regards,
Jens

[QUOTE=jmozdzen;30941]Hi lpphiggp,

But this is with everything currently working

… which makes a good baseline for comparison when things go bad again :slight_smile:

BTW, do you set up bridging via Xen (so you have some script configured in xend’s config file)? Even in SLES10 days, I found it to be more “logical” (and thus more maintainable) to create all bridges via regular SLES, and then only have Xen connect the VIFs to the corresponding bridge.

Best regards,
Jens[/QUOTE]

We’ve done both. This last time, though, I left things more or less as they were, allowing Xen to rename eth1 to peth1 and create bridge eth1 in its place.

On other boxes previously, another admin (sort of) manually created a br0 and pointed the actual interfaces (eth0, etc.) at it, and that’s worked all but once. This one was a bit different, though: this time even the physical NIC wasn’t communicating, and I couldn’t even get to the host box*.
Still, I was originally going to do that here too, but got flustered when I found that, unlike on the other servers with this issue, I couldn’t: besides the fact that the physical NIC was not communicating, the Xen script was preventing me from editing anything in the network setup in YaST2. Running the command to stop the script (/etc/xen/scripts/network-bridge stop netdev=eth0) didn’t help.
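
For reference, that renaming comes from the network-bridge script that xend is configured to run; the relevant lines in /etc/xen/xend-config.sxp look roughly like this (the netdev value is just an example):

# /etc/xen/xend-config.sxp (excerpt)
# xend builds the bridge itself and renames eth1 to peth1:
(network-script 'network-bridge netdev=eth1')
(vif-script vif-bridge)
#
# alternative, if the bridge is created by the OS instead:
# (network-script /bin/true)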

By the time I figured out that rebooting into the non-Xen kernel would let me edit the networking config in YaST, I just went with the standard setup first to see if that would at least give me back host access. I also switched the cable to eth1 in case I had a bad NIC port.
Anyway, once I rebooted into the Xen kernel again to see if I had connectivity, I did; at that point I just wanted to get the two VMs up and running ASAP, as they’d been down for hours. We’d had someone else onsite look at it first when it went down, and when that didn’t pan out it took me over an hour to drive there.
So after recreating the virtual NICs, and finding that it all in fact started working again, I just went with that. I didn’t edit any scripts or anything; it just picked it all up.

I did get a scare, though. That was on a Monday; that Thursday, one of the VMs went incommunicado again. However, a reboot of the VM fixed that too.
But now, just before the Christmas holiday, I’m worrying every day that the **** thing will go down again. The whole business just seems very unstable to me.

*Normally, this issue would only disconnect the VMs, and we now realize a simple “brctl addif <bridge> <vif>” command might’ve sufficed. If I only lose VM network connectivity again, I’ll try that, but this last one manifested itself in a particularly ugly manner.
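
If it’s only a VIF that has dropped off the bridge next time, the fix might be as simple as something like this (bridge and VIF names taken from the brctl output above; verify them before re-adding):

# which interfaces are still attached?
brctl show

# re-attach the missing VIF and bring it up
brctl addif eth1 vif5.0
ifconfig vif5.0 up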