As a starting point, please understand that what I’m about to describe makes no sense to me either.
I have a collection of SLES11 SP4 hosts that are being maintained from a SUSE Manager (SUMA) installation. They are all effectively the same: they’re VMs (vSphere), they’re patched following the same procedure from the same SUMA, and they run the same workload (eDirectory replica servers). The only real differences are the hostname and IP address. There are 22 of them, split logically into three tiers: dev, test, and production. Patches get rolled out to dev first, then to test, then to prod.
Previously, all of the dev and test hosts had been patched and rebooted, and came up fine. Most recently, prod was patched and rebooted. Of the 8 hosts in prod, 6 came up and ran fine. The remaining two did something I cannot explain.
After the reboots, SUMA showed everything was fine; all hosts checked in. We’re using the Salt stack for this, which turns out to be an important clue.
But eDirectory was reporting -625 errors in timesync and replica sync status checks. (Effectively, -625 is a transport failure: “can’t get there from here.”) At that point, I also found that I could not SSH to either of these two hosts. Nor could I ping them.
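For reference, the status checks in question were the usual eDirectory health checks; on these boxes that amounts to something like the following (the exact switches are from memory, so treat them as an assumption):

    ndsrepair -T    # time synchronization status across the replica ring
    ndsrepair -E    # replica synchronization status / errors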
Poking around some more, I found that SUMA could issue remote commands on these hosts. So I could run “uptime” and “ping” and other utilities from SUMA, on two hosts that I could no longer SSH to.
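I ran these through SUMA’s remote-command feature, but from the master’s shell it’s equivalent to something like this (minion names are hypothetical):

    salt 'failing-host' cmd.run 'uptime'
    salt 'failing-host' cmd.run 'ping -c 3 working-host'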
From a working host, “ping failing-host” does not work. But from a failing host, “ping working-host” does work.
By inference, outbound connections from the failing host work fine, but inbound connections fail. The Salt minion connects outbound to its master to pick up jobs, so it can still run remote command scripts and report the results back to the master.
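That would also explain why SUMA saw nothing wrong: with the default Salt transport, the minion opens outbound TCP connections to the master on ports 4505/4506 and everything flows over those. If anyone wants to confirm that on a host in this state, something like the following (run as a SUMA remote command) should show the minion’s established outbound sessions:

    netstat -tnp | grep salt-minion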
Next was attempting to fix this. The only way in was from the console, so I gained access to the VM through VMware’s console. Not having the root password, the next step was to break in (reboot via SUMA, set init=/bin/bash on the GRUB kernel line). I did that, set the root password (passwd, then sync), and rebooted again.
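For anyone who hasn’t done that particular dance, it went roughly like this; the remount step is my assumption, since / may come up read-only when booted this way:

    # At the GRUB menu, edit the kernel line and append: init=/bin/bash
    # Boot into the resulting root shell, then:
    mount -o remount,rw /    # assumption: remount root read-write if needed
    passwd root              # set the new root password
    sync                     # flush to disk, since there is no clean shutdown here
    # Reset the VM from vSphere to reboot normally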
At this point there was still no working communication with the affected host, so the reboot itself did not change anything that mattered. Then, from the console, I logged in as root, and communication magically started working. I found no problems on the affected host, and from the working side I suddenly found that SSH, ping, and everything else now worked correctly.
Yes, this sounds crazy. I didn’t believe it either. But I had only fixed one host at this point. I still had another broken one. I retraced my steps, repeating them exactly as above.
Confirmed outbound connections working, inbound not working.
Access to console via VMware, no change.
Break in / Set root password / Reboot. No change.
At this point I deviated slightly. I started ping against the host in one window and opened the VM console in another. With the console sitting at the “login:” prompt, and ping reporting the host unreachable, I typed “root” and hit return, then typed in the new root password and hit return.
As root was logging in and the profile scripts were running, ping started working. By the time I reached the bash prompt, everything was working fine.
I’m completely mystified by this. Any theories, good ideas, or anything to explain what I’ve described here would be appreciated. The only thing that I’m sure of is that I saw this happen, on two hosts.