SLES12 SP3 LTSS running on VMware stuck for 20 minutes before starting

Hello

I need some help: sometimes a VM installed with SLES12 SP3 (LTSS) gets stuck after a reboot. For around 20 minutes there is no reaction at all; after that, the kernel loads and the system starts.
I don't have access to the VMware infrastructure, and I need to prove that the issue could come from there…
I looked at boot.log, /var/log/messages and dmesg, but nothing seems wrong except the delay.
For example, here is an excerpt from /var/log/messages:
2021-01-05T07:00:01.584829+01:00 suse1 CRON[27153]: (root) CMD (/sbin/shutdown -r 2>&1 >/dev/null)
(…)
2021-01-05T07:01:01.597972+01:00 suse1 systemd[1]: network.target: Found ordering cycle on network.target/stop
2021-01-05T07:01:01.598592+01:00 suse1 systemd[1]: network.target: Found dependency on unmountnfs.service/stop
2021-01-05T07:01:01.602189+01:00 suse1 systemd[1]: network.target: Found dependency on sysinit.target/stop
2021-01-05T07:01:01.621744+01:00 suse1 systemd[1]: network.target: Found dependency on systemd-tmpfiles-setup.service/stop
2021-01-05T07:01:01.621844+01:00 suse1 systemd[1]: network.target: Found dependency on local-fs.target/stop
2021-01-05T07:01:01.621937+01:00 suse1 systemd[1]: network.target: Found dependency on var-backup.mount/stop
2021-01-05T07:01:01.622090+01:00 suse1 systemd[1]: network.target: Found dependency on network.target/stop
2021-01-05T07:01:01.622192+01:00 suse1 systemd[1]: network.target: Breaking ordering cycle by deleting job unmountnfs.service/stop
2021-01-05T07:01:01.622286+01:00 suse1 systemd[1]: unmountnfs.service: Job unmountnfs.service/stop deleted to break ordering cycle starting with network.target/stop
2021-01-05T07:01:01.622462+01:00 suse1 systemd[1]: wickedd.service: Found ordering cycle on wickedd.service/stop
2021-01-05T07:01:01.622557+01:00 suse1 systemd[1]: wickedd.service: Found dependency on local-fs.target/stop
2021-01-05T07:01:01.622643+01:00 suse1 su: pam_unix(su-l:session): session closed for user htuser
2021-01-05T07:31:28.344203+01:00 suse1 dmeventd[616]: dmeventd ready for processing.
2021-01-05T07:31:28.344231+01:00 suse1 kernel: [ 0.000000] Initializing cgroup subsys cpuset
2021-01-05T07:31:28.344566+01:00 suse1 kernel: [ 0.000000] Initializing cgroup subsys cpu
2021-01-05T07:31:28.344567+01:00 suse1 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
2021-01-05T07:31:28.344568+01:00 suse1 kernel: [ 0.000000] Linux version 4.4.180-94.113-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux) ) #1 SMP Fri Dec 13 14:20:57 UTC 2019 (c6649f6)
2021-01-05T07:31:28.344568+01:00 suse1 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.4.180-94.113-default root=/dev/mapper/vg_root-lv_root resume=/dev/sda2 splash=silent quiet showopts biosdevname=0 splash=verbose consoleblank=0 nomodeset

and the machine starts fine.
This machine is restarted every day (to work around a memory leak in a custom script) and it has got stuck twice in the last 3 weeks.
If there was an issue during the shutdown, I don't see any errors for it.
I suspect an issue with the VMware infrastructure, as we don't get any warnings about a lack of resources (CPU, disk, memory), but how could I prove it?
If this is not an issue with the infrastructure, where could I find some traces from before the kernel loads?

@Frederic Hi, I would not use shutdown from cron; look at using a systemd service and timer that runs systemctl reboot (or systemctl poweroff) instead.
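A minimal sketch of what that could look like, assuming a nightly reboot at 07:00; the unit names nightly-reboot.service and nightly-reboot.timer are just examples:

# Create a oneshot service that performs the reboot (unit names are examples only)
cat > /etc/systemd/system/nightly-reboot.service <<'EOF'
[Unit]
Description=Nightly reboot

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl reboot
EOF

# Create the matching timer that fires every day at 07:00
cat > /etc/systemd/system/nightly-reboot.timer <<'EOF'
[Unit]
Description=Reboot every day at 07:00

[Timer]
OnCalendar=*-*-* 07:00:00

[Install]
WantedBy=timers.target
EOF

# Reload systemd and activate the timer
systemctl daemon-reload
systemctl enable --now nightly-reboot.timer

You can check when it will next fire with systemctl list-timers.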

Thanks, I will try it (I haven't been able to reproduce the issue on my own VMware infrastructure so far).

@Frederic Hi, also look at the processes running for htuser when the system is up. Is this user trying to finish some work, or is it just hung because of the memory leak? If so, you could look at using pgrep to kill off the processes this user is running, e.g.:

for p in $(pgrep -u "htuser"); do kill -9 "$p"; done

The above is an aggressive option… maybe a different signal will suffice for a graceful termination of the processes.
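For example, a gentler sketch along those lines (assuming the processes handle SIGTERM; the 10-second grace period is an arbitrary choice):

# Ask htuser's processes to exit cleanly first
pkill -TERM -u htuser
sleep 10
# Force-kill anything still running after the grace period
pgrep -u htuser > /dev/null && pkill -KILL -u htuser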

Can you not do remote logging somewhere to see what’s in the logs? Do you do any sort of remote monitoring for system load etc?
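If forwarding logs off the box is allowed, a minimal rsyslog sketch (the receiver 192.0.2.10:514 is just a placeholder for whatever central syslog host you have):

# Forward all messages to a central syslog server (@@ = TCP, @ = UDP)
cat > /etc/rsyslog.d/remote.conf <<'EOF'
*.* @@192.0.2.10:514
EOF
systemctl restart rsyslog

That way the final shutdown messages are kept off the box even while the VM itself is unreachable.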

Hi @malcomlewis
I cannot get access to the server, that's the difficulty. However, as there is no error in the OS logs, they started to look at the VMware logs and found some warnings and strange behavior. The issue seems to come from the network device, which takes a long time to deactivate before the server stops.
We are digging deeper into the VMware logs, trying to understand why this kind of issue occurs (I still can't reproduce it on my mockup).

@Frederic Hi, routing or maybe even a hardware fault?