As a starting point, please understand that what I’m about to describe makes no sense to me either.
I have a collection of SLES11 SP4 hosts that are being maintained from a SUSE Manager (SUMA) installation. They are all effectively the same: they’re VMs (vSphere), they’re patched following the same procedure from the same SUMA, and they run the same workload (eDirectory replica servers). The only real differences are the hostname and IP address. There are 22 of them, split logically into three tiers: dev, test, and production. Patches get rolled out to dev first, then to test, then to prod.
Previously, all of the dev and test hosts had been patched and rebooted, and came up fine. Most recently, prod was patched and rebooted. Of the 8 hosts in prod, 6 came up and ran fine. The remaining two did something I cannot explain.
After the reboots, SUMA showed everything was fine; all hosts checked in. We’re using the Salt stack for this, which turns out to be an important clue.
But eDirectory was reporting -625 errors in timesync and replica sync status checks. (Effectively, -625 is a transport failure: “can’t get there from here.”) At that point, I also found that I could not SSH to either of these two hosts. Nor could I ping them.
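For reference, the status checks in question were the usual eDirectory health checks; on these boxes that amounts to something like the following (the exact switches are from memory, so treat them as an assumption):

    ndsrepair -T    # time synchronization status across the replica ring
    ndsrepair -E    # replica synchronization status / errors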
Poking around some more, I found that SUMA could issue remote commands on these hosts. So I could run “uptime” and “ping” and other utilities from SUMA, on two hosts that I could no longer SSH to.
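I ran these through SUMA’s remote-command feature, but from the master’s shell it’s equivalent to something like this (minion names are hypothetical):

    salt 'failing-host' cmd.run 'uptime'
    salt 'failing-host' cmd.run 'ping -c 3 working-host'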
From a working host, “ping failing-host” does not work. But from a failing host, “ping working-host” does work.
By inference, outbound connections from the failing host work fine, but inbound connections fail. The Salt minion connects outbound to its master to pick up jobs, so it can still run remote command scripts and report the results back to the master.
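That would also explain why SUMA saw nothing wrong: with the default Salt transport, the minion opens outbound TCP connections to the master on ports 4505/4506 and everything flows over those. If anyone wants to confirm that on a host in this state, something like the following (run as a SUMA remote command) should show the minion’s established outbound sessions:

    netstat -tnp | grep salt-minion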
Next was attempting to fix this. The only way in was from the console, so I gained access to the VM through VMware’s console. Not having the root password, the next step was to break in (reboot via SUMA, set init=/bin/bash on the GRUB kernel line). I did that, set the root password (passwd, then sync), and rebooted again.
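For anyone who hasn’t done that particular dance, it went roughly like this; the remount step is my assumption, since / may come up read-only when booted this way:

    # At the GRUB menu, edit the kernel line and append: init=/bin/bash
    # Boot into the resulting root shell, then:
    mount -o remount,rw /    # assumption: remount root read-write if needed
    passwd root              # set the new root password
    sync                     # flush to disk, since there is no clean shutdown here
    # Reset the VM from vSphere to reboot normally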
At this point there was still no working communication with the affected host, so the reboot itself did not change anything that mattered. Then, from the console, I logged in as root, and communication magically started working. I found no problems on the affected host, and from the working side I suddenly found that SSH, ping, and everything else now worked correctly.
Yes, this sounds crazy. I didn’t believe it either. But I had only fixed one host at this point. I still had another broken one. I retraced my steps, repeating them exactly as above.
Confirmed outbound connections working, inbound not working.
Access to console via VMware, no change.
Break in / Set root password / Reboot. No change.
At this point I deviated slightly. I started ping against the host in one window and opened the VM console in another. With the console sitting at the “login:” prompt, and ping reporting the host unreachable, I typed “root” and hit return, then typed in the new root password and hit return.
As root was logging in and the profile scripts were running, ping started working. By the time I reached the bash prompt, everything was working fine.
I’m completely mystified by this. Any theories, good ideas, or anything to explain what I’ve described here would be appreciated. The only thing that I’m sure of is that I saw this happen, on two hosts.