Had an interesting case over the past couple of days where one of our Rancher 2.2.2 worker nodes keeps crashing. This is a custom cluster, provisioned with the commands copied/pasted from Rancher.
I deployed a cron script that logs the output of a few commands (ps, mpstat, iostat, vmstat) every minute. It caught the following set of processes running 5 minutes before the crash; after that, no more ps output came through (though the script kept producing mpstat and vmstat output):
root 116155 0.0 0.0 8064 3264 ? Sl 15:17 0:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b -address /run/containerd/containerd.sock -
containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
root 116163 0.0 0.0 133716 10536 ? Sl 15:17 0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b --pid-file /run/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b/init.pid 8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b
root 116170 0.0 0.0 23012 2824 ? S 15:17 0:00 runc init
root 116208 0.0 0.0 23012 4 ? R 15:17 0:00 runc init
root 116413 0.0 0.0 236564 5856 ? S 15:18 0:00 abrt-server -s
root 116414 0.0 0.0 226144 5848 ? S 15:18 0:00 /usr/libexec/abrt-handle-event -i -e post-create -- /var/spool/abrt/oops-2019-05-14-15:18:20-116412-0
root 116423 0.0 0.0 125596 1560 ? S 15:18 0:00 /bin/sh -c nice sosreport --tmp-dir "$DUMP_DIR" --batch \ --only=anaconda --only=boot --only=devicemapper \ --only=filesys --only=hardware --only=kernel --only=libraries \ --only=memory --only=networking --only=nfsserver --only=pam \ --only=process --only=rpm -k rpm.rpmva=off --only=ssh \ --only=services --only=yum --only=date --only=host --only=x11 \ --only=cups --only=logs --only=grub2 --only=cron --only=pci \ --only=auditd --only=selinux --only=lvm2 --only=sar \ --only=processor \ >sosreport.log 2>&1 \ && { rm sosreport.log rm sosreport*.md5 mv sosreport*.tar.bz2 sosreport.tar.bz2 mv sosreport*.tar.xz sosreport.tar.xz exit 0 } 2>/dev/null # Error in sosreport run. Let user see the problem. echo "sosreport run failed with exit code $?, log follows:" # sosreport prints many useless empty lines, nuke them: # it looks awful in syslog otherwise. cat sosreport.log | sed 's/ *$//' | grep -v '^$' rm sosreport.log exit 1
root 116424 0.8 0.4 1174060 76988 ? SNl 15:18 0:02 /usr/bin/python /usr/sbin/sosreport --tmp-dir /var/spool/abrt/oops-2019-05-14-15:18:20-116412-0 --batch --only=anaconda --only=boot --only=devicemapper --only=filesys --only=hardware --only=kernel --only=libraries --only=memory --only=networking --only=nfsserver --only=pam --only=process --only=rpm -k rpm.rpmva=off --only=ssh --only=services --only=yum --only=date --only=host --only=x11 --only=cups --only=logs --only=grub2 --only=cron --only=pci --only=auditd --only=selinux --only=lvm2 --only=sar --only=processor
root 116985 0.0 0.0 8628 672 ? SN 15:18 0:00 timeout 300s grub2-mkconfig
root 116987 0.0 0.0 22108 1544 ? SN 15:18 0:00 /bin/sh /usr/sbin/grub2-mkconfig
root 116991 0.0 0.0 22108 732 ? SN 15:18 0:00 /bin/sh /usr/sbin/grub2-mkconfig
root 116992 0.0 0.0 45188 3216 ? SN 15:18 0:00 /usr/sbin/grub2-probe --device /dev/mapper/vgos-lvos --target=fs_uuid
root 116993 0.0 0.0 60780 5836 ? DN 15:18 0:00 vgs --options vg_uuid,pv_name --noheadings --separator :
root 117134 0.0 0.0 236360 5648 ? S 15:18 0:00 abrt-server -s
splunk 117463 0.0 0.0 0 0 ? Z 15:19 0:00 [sadc]
splunk 117489 0.0 0.0 0 0 ? Z 15:19 0:00 [sadc]
root 118041 0.0 0.0 0 0 ? Zs 15:21 0:00 [salt-minion]
root 118042 0.0 0.0 0 0 ? Z 15:21 0:00 [salt-minion]
root 118047 0.6 0.2 338616 45548 ? Dl 15:21 0:00 puppet agent: applying configuration
root 118186 0.0 0.0 45260 1432 ? D 15:22 0:00 iptables-restore --noflush --counters
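In case the collection method matters, the cron script is basically just this (paths and the nodestats name are simplified for the post, and the log directory defaults to a temp location here rather than the real one under /var/log):

```shell
#!/bin/sh
# Per-minute stats collector -- illustrative reconstruction, not the exact
# production script. Installed via cron as:
#   * * * * * root /usr/local/bin/nodestats.sh
LOGDIR="${LOGDIR:-${TMPDIR:-/tmp}/nodestats}"   # production writes under /var/log
mkdir -p "$LOGDIR"
LOG="$LOGDIR/$(date +%F).log"
{
    echo "=== $(date '+%F %T') ==="
    ps auxww               # full process table
    mpstat -P ALL 1 1      # one 1-second per-CPU sample
    iostat -x 1 1          # extended per-device I/O stats
    vmstat 1 2             # second line is the live 1s interval
} >>"$LOG" 2>&1
```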
Obviously there's a Red Hat SOS report being generated (abrt reacting to a kernel oops, going by the /var/spool/abrt/oops-* dump directory), but that only kicked off a minute after the moby-related "runc create".
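Also worth noting from that listing: vgs, iptables-restore, and the puppet agent are all in uninterruptible sleep (D state). Next time it happens I'll try to grab the kernel function they're blocked in with something like this (illustrative one-liner, run while the hang is in progress):

```shell
# List D-state (uninterruptible) tasks plus the kernel symbol they sleep in;
# the wchan column often distinguishes a storage hang from a lock in netfilter.
ps -eo state,pid,wchan:32,comm | awk 'NR == 1 || $1 ~ /^D/'
```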
Around the same time, in "mpstat -P ALL", one of the CPUs suddenly pegged 100% in %sys:
03:18:44 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:19:43 PM all 0.52 0.02 12.92 0.03 0.00 0.02 0.00 0.00 0.00 86.49
03:19:43 PM 0 0.36 0.00 0.24 0.00 0.00 0.00 0.00 0.00 0.00 99.40
03:19:43 PM 1 1.06 0.00 0.60 0.02 0.00 0.02 0.00 0.00 0.00 98.31
03:19:43 PM 2 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:19:43 PM 3 0.43 0.00 0.24 0.00 0.00 0.02 0.00 0.00 0.00 99.32
03:19:43 PM 4 0.61 0.02 0.36 0.19 0.00 0.03 0.00 0.00 0.00 98.79
03:19:43 PM 5 0.43 0.02 0.24 0.00 0.00 0.05 0.00 0.00 0.00 99.26
03:19:43 PM 6 0.34 0.12 0.48 0.00 0.00 0.02 0.00 0.00 0.00 99.05
03:19:43 PM 7 0.89 0.05 0.61 0.03 0.00 0.02 0.00 0.00 0.00 98.39
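Since all of that saturation on CPU 2 is %sys, on the next occurrence I'm also planning to snapshot which threads last ran on that core, roughly like this (illustrative; the "2" is the pegged core from the sample above):

```shell
# Per-thread view filtered to CPU 2, highest CPU% first. psr = the processor
# the thread last ran on; a kernel thread stuck here would explain pure-%sys
# saturation with nothing obvious in userspace.
ps -eLo psr,pcpu,state,pid,comm --sort=-pcpu | awk 'NR == 1 || $1 == 2' | head -n 20
```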
vmstat isn't showing anything alarming to me around the time these were running:
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 2216 10451784 6468 3979668 0 0 0 129 3256 5662 1 1 98 0 0
0 0 2216 10449888 6468 3980968 0 0 0 61 2468 4616 1 0 99 0 0
0 0 2216 10463040 6468 3976696 70 526 70 807 4032 6841 2 1 97 0 0
4 0 5544 10228324 17396 4098564 24 311 1087 889 9552 6171 2 12 86 0 0
4 0 5544 10201928 17396 4113572 0 0 0 95 9258 4270 1 13 86 0 0
4 0 5544 10190928 17396 4126764 0 0 0 47 9223 4193 0 13 87 0 0
Any hints on where to look next?