SLES12 SP3 LTSS running on VMware stuck for 20 minutes before starting

Hello

I need some help: sometimes a VM installed with SLES12 SP3 (LTSS) gets stuck after a reboot. For around 20 minutes there is no reaction at all; after that, the kernel loads and the system starts.
I don't have access to the VMware infrastructure, and I need to prove that the issue could come from there…
I looked at boot.log, /var/log/messages and dmesg, but nothing seems wrong except the delay.
For example, here is an excerpt from /var/log/messages:
2021-01-05T07:00:01.584829+01:00 suse1 CRON[27153]: (root) CMD (/sbin/shutdown -r 2>&1 >/dev/null)
(…)
2021-01-05T07:01:01.597972+01:00 suse1 systemd[1]: network.target: Found ordering cycle on network.target/stop
2021-01-05T07:01:01.598592+01:00 suse1 systemd[1]: network.target: Found dependency on unmountnfs.service/stop
2021-01-05T07:01:01.602189+01:00 suse1 systemd[1]: network.target: Found dependency on sysinit.target/stop
2021-01-05T07:01:01.621744+01:00 suse1 systemd[1]: network.target: Found dependency on systemd-tmpfiles-setup.service/stop
2021-01-05T07:01:01.621844+01:00 suse1 systemd[1]: network.target: Found dependency on local-fs.target/stop
2021-01-05T07:01:01.621937+01:00 suse1 systemd[1]: network.target: Found dependency on var-backup.mount/stop
2021-01-05T07:01:01.622090+01:00 suse1 systemd[1]: network.target: Found dependency on network.target/stop
2021-01-05T07:01:01.622192+01:00 suse1 systemd[1]: network.target: Breaking ordering cycle by deleting job unmountnfs.service/stop
2021-01-05T07:01:01.622286+01:00 suse1 systemd[1]: unmountnfs.service: Job unmountnfs.service/stop deleted to break ordering cycle starting with network.target/stop
2021-01-05T07:01:01.622462+01:00 suse1 systemd[1]: wickedd.service: Found ordering cycle on wickedd.service/stop
2021-01-05T07:01:01.622557+01:00 suse1 systemd[1]: wickedd.service: Found dependency on local-fs.target/stop
2021-01-05T07:01:01.622643+01:00 suse1 su: pam_unix(su-l:session): session closed for user htuser
2021-01-05T07:31:28.344203+01:00 suse1 dmeventd[616]: dmeventd ready for processing.
2021-01-05T07:31:28.344231+01:00 suse1 kernel: [ 0.000000] Initializing cgroup subsys cpuset
2021-01-05T07:31:28.344566+01:00 suse1 kernel: [ 0.000000] Initializing cgroup subsys cpu
2021-01-05T07:31:28.344567+01:00 suse1 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
2021-01-05T07:31:28.344568+01:00 suse1 kernel: [ 0.000000] Linux version 4.4.180-94.113-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux) ) #1 SMP Fri Dec 13 14:20:57 UTC 2019 (c6649f6)
2021-01-05T07:31:28.344568+01:00 suse1 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.4.180-94.113-default root=/dev/mapper/vg_root-lv_root resume=/dev/sda2 splash=silent quiet showopts biosdevname=0 splash=verbose consoleblank=0 nomodeset

and the machine starts fine.
This machine is restarted every day (to work around a memory leak in a custom script) and it has got stuck twice in the last 3 weeks.
If there was an issue during the shutdown, I don't see any errors for it.
I suspect an issue with the VMware infrastructure, as we don't get any warnings about a lack of resources (CPU, disk, memory), but how could I prove it?
If this is not an issue with the infrastructure, where could I find some traces from before the kernel loads?

@Frederic Hi, I would not use shutdown from cron; look at using a systemd service and timer that runs systemctl reboot (or systemctl poweroff) instead.
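A minimal sketch of what that could look like, assuming a nightly reboot at 07:00; the unit names nightly-reboot.service and nightly-reboot.timer are just examples:

# Create a oneshot service that performs the reboot (unit names are examples only)
cat > /etc/systemd/system/nightly-reboot.service <<'EOF'
[Unit]
Description=Nightly reboot

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl reboot
EOF

# Create the matching timer that fires every day at 07:00
cat > /etc/systemd/system/nightly-reboot.timer <<'EOF'
[Unit]
Description=Reboot every day at 07:00

[Timer]
OnCalendar=*-*-* 07:00:00

[Install]
WantedBy=timers.target
EOF

# Reload systemd and activate the timer
systemctl daemon-reload
systemctl enable --now nightly-reboot.timer

You can check when it will next fire with systemctl list-timers.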

Thanks, I will try it (I haven't been able to reproduce the issue on my own VMware infrastructure so far).

@Frederic Hi, also look at the processes running for htuser when the system is up. Is this user trying to finish some work, or is it just hung because of the memory leak? If so, you could look at using pgrep to kill off the processes this user is running, e.g.:

for p in $(pgrep -u "htuser"); do kill -9 "$p"; done

The above is an aggressive option… maybe a different signal will suffice for a graceful termination of the processes.
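For example, a gentler sketch along those lines (assuming the processes handle SIGTERM; the 10-second grace period is an arbitrary choice):

# Ask htuser's processes to exit cleanly first
pkill -TERM -u htuser
sleep 10
# Force-kill anything still running after the grace period
pgrep -u htuser > /dev/null && pkill -KILL -u htuser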

Can you not do remote logging somewhere to see what’s in the logs? Do you do any sort of remote monitoring for system load etc?
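If forwarding logs off the box is allowed, a minimal rsyslog sketch (the receiver 192.0.2.10:514 is just a placeholder for whatever central syslog host you have):

# Forward all messages to a central syslog server (@@ = TCP, @ = UDP)
cat > /etc/rsyslog.d/remote.conf <<'EOF'
*.* @@192.0.2.10:514
EOF
systemctl restart rsyslog

That way the final shutdown messages are kept off the box even while the VM itself is unreachable.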

Hi @malcomlewis
I cannot get access to the server, that's the difficulty. However, as there is no error in the OS logs, they started to look at the VMware logs and found some warnings and strange behavior. The issue seems to come from the network device, which takes a long time to deactivate before the server stops.
We are digging deeper into the VMware logs, trying to understand why this kind of issue occurs (I still can't reproduce it on my mockup).

@Frederic Hi, routing or maybe even a hardware fault?