Rancher 2.2.2, worker node hard-crashes

Had an interesting case the past couple days where one of our rancher 2.2.2 worker nodes keeps crashing. This is a custom cluster provisioned from the commands copied/pasted in Rancher.

I deployed a script in cron which logs output from a few commands (ps, mpstat, iostat, vmstat) every minute, and it caught the following set of commands running 5 minutes before the crash, after which no more ps output came through (but our script continued producing mpstat, vmstat output):

root 116155 0.0 0.0 8064 3264 ? Sl 15:17 0:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b -address /run/containerd/containerd.sock -
containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
root 116163 0.0 0.0 133716 10536 ? Sl 15:17 0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b --pid-file /run/containerd/io.containerd.runtime.v1.linux/moby/8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b/init.pid 8ec18f8a0f23ea03da833044864eea1378d0b0544e45897f0db7ae0bf7c8547b
root 116170 0.0 0.0 23012 2824 ? S 15:17 0:00 runc init
root 116208 0.0 0.0 23012 4 ? R 15:17 0:00 runc init
root 116413 0.0 0.0 236564 5856 ? S 15:18 0:00 abrt-server -s
root 116414 0.0 0.0 226144 5848 ? S 15:18 0:00 /usr/libexec/abrt-handle-event -i -e post-create – /var/spool/abrt/oops-2019-05-14-15:18:20-116412-0
root 116423 0.0 0.0 125596 1560 ? S 15:18 0:00 /bin/sh -c nice sosreport --tmp-dir “DUMP_DIR" --batch \ --only=anaconda --only=boot --only=devicemapper \ --only=filesys --only=hardware --only=kernel --only=libraries \ --only=memory --only=networking --only=nfsserver --only=pam \ --only=process --only=rpm -k rpm.rpmva=off --only=ssh \ --only=services --only=yum --only=date --only=host --only=x11 \ --only=cups --only=logs --only=grub2 --only=cron --only=pci \ --only=auditd --only=selinux --only=lvm2 --only=sar \ --only=processor \ >sosreport.log 2>&1 \ && { rm sosreport.log rm sosreport*.md5 mv sosreport*.tar.bz2 sosreport.tar.bz2 mv sosreport*.tar.xz sosreport.tar.xz exit 0 } 2>/dev/null # Error in sosreport run. Let user see the problem. echo "sosreport run failed with exit code ?, log follows:” # sosreport prints many useless empty lines, nuke them: # it looks awful in syslog otherwise. cat sosreport.log | sed ‘s/ *//' | grep -v '^’ rm sosreport.log exit 1
root 116424 0.8 0.4 1174060 76988 ? SNl 15:18 0:02 /usr/bin/python /usr/sbin/sosreport --tmp-dir /var/spool/abrt/oops-2019-05-14-15:18:20-116412-0 --batch --only=anaconda --only=boot --only=devicemapper --only=filesys --only=hardware --only=kernel --only=libraries --only=memory --only=networking --only=nfsserver --only=pam --only=process --only=rpm -k rpm.rpmva=off --only=ssh --only=services --only=yum --only=date --only=host --only=x11 --only=cups --only=logs --only=grub2 --only=cron --only=pci --only=auditd --only=selinux --only=lvm2 --only=sar --only=processor
root 116985 0.0 0.0 8628 672 ? SN 15:18 0:00 timeout 300s grub2-mkconfig
root 116987 0.0 0.0 22108 1544 ? SN 15:18 0:00 /bin/sh /usr/sbin/grub2-mkconfig
root 116991 0.0 0.0 22108 732 ? SN 15:18 0:00 /bin/sh /usr/sbin/grub2-mkconfig
root 116992 0.0 0.0 45188 3216 ? SN 15:18 0:00 /usr/sbin/grub2-probe --device /dev/mapper/vgos-lvos --target=fs_uuid
root 116993 0.0 0.0 60780 5836 ? DN 15:18 0:00 vgs --options vg_uuid,pv_name --noheadings --separator :
root 117134 0.0 0.0 236360 5648 ? S 15:18 0:00 abrt-server -s
splunk 117463 0.0 0.0 0 0 ? Z 15:19 0:00 [sadc]
splunk 117489 0.0 0.0 0 0 ? Z 15:19 0:00 [sadc]
root 118041 0.0 0.0 0 0 ? Zs 15:21 0:00 [salt-minion]
root 118042 0.0 0.0 0 0 ? Z 15:21 0:00 [salt-minion]
root 118047 0.6 0.2 338616 45548 ? Dl 15:21 0:00 puppet agent: applying configuration
root 118186 0.0 0.0 45260 1432 ? D 15:22 0:00 iptables-restore --noflush --counters

Obviously there’s a RedHat SOS report being generated, but that kicked off 1 minute after that “runc” related to moby.

This coincided with, in “mpstat -P ALL”, one of the CPUs suddenly pegging 100%:
03:18:44 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:19:43 PM all 0.52 0.02 12.92 0.03 0.00 0.02 0.00 0.00 0.00 86.49
03:19:43 PM 0 0.36 0.00 0.24 0.00 0.00 0.00 0.00 0.00 0.00 99.40
03:19:43 PM 1 1.06 0.00 0.60 0.02 0.00 0.02 0.00 0.00 0.00 98.31
03:19:43 PM 2 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:19:43 PM 3 0.43 0.00 0.24 0.00 0.00 0.02 0.00 0.00 0.00 99.32
03:19:43 PM 4 0.61 0.02 0.36 0.19 0.00 0.03 0.00 0.00 0.00 98.79
03:19:43 PM 5 0.43 0.02 0.24 0.00 0.00 0.05 0.00 0.00 0.00 99.26
03:19:43 PM 6 0.34 0.12 0.48 0.00 0.00 0.02 0.00 0.00 0.00 99.05
03:19:43 PM 7 0.89 0.05 0.61 0.03 0.00 0.02 0.00 0.00 0.00 98.39

Vmstat isn’t showing anything alarming to me during the general period these are running:
0 0 2216 10451784 6468 3979668 0 0 0 129 3256 5662 1 1 98 0 0
0 0 2216 10449888 6468 3980968 0 0 0 61 2468 4616 1 0 99 0 0
0 0 2216 10463040 6468 3976696 70 526 70 807 4032 6841 2 1 97 0 0
4 0 5544 10228324 17396 4098564 24 311 1087 889 9552 6171 2 12 86 0 0
4 0 5544 10201928 17396 4113572 0 0 0 95 9258 4270 1 13 86 0 0
4 0 5544 10190928 17396 4126764 0 0 0 47 9223 4193 0 13 87 0 0

Any hints on where to look next?

I woke up this AM and 1 out of 3 nodes was in state “Kubelet stopped posting node status” and many of our workloads were thrashing.

I could not ssh into the node.

Restarting the node (on AWS) didn’t recover the cluster.

Our logging system “logdna” stopped capturing logs.

It sucks that I cannot even do a postmortem.

I added a new node, deleted the broken node and 10 minutes later everything was working again.

Yikes!

1 Like

Having some serious concerns about stability here. Will reserve decisions until I’m able to discern the causes of this…

Question for @ryan - Are you running the prometheus/grafana monitoring suite on your cluster? (Rancher offers it with a simple click & form)

@spirilis - yes we are running the one-click prometheus/grafana
This AM, another node lost connectivity with the cluster. AND worse, workloads were not moved to the healthy nodes.
:frowning:

This is probably the issue that is fixed by 2.2.3 with monitoring continuously redeploying itself.

2 Likes

@vincent - a link to the issue? I just had my third crash in 2 days.

I’ve had three crashes in two days. :frowning:

I just finished deploying everything new, starting with a new Rancher server running 2.2.3

I built an all new k8s cluster, this time I have not enabled “one click” Prometheus. As far as I can tell, once enabling monitoring, there is no way to disable it. If I get no crashes for a few days, I might try enabling it again, but only after getting caught up on sleep.

edit:

Wrote a post saying everything looked good since the 2.2.3 upgrade, but now on second look, I was peeking at my “controlplane” dedicated nodes, not the worker nodes. Worker node #1 is crashed.

Might have to destroy the cluster (it’s just a testing environment) and try again…

Fwiw, the node that keeps crashing, keeps showing out of memory and one of the prometheus dockers was the only one running when it crashed just now (trying to decommission it - and couldn’t run my cleanup script because that prometheus docker was thrashing the box already, minutes after resetting the VM)

So I’m hopeful this is the issue :slight_smile:

PS- There is a way to disable it, go to the Global section for the cluster, then go to Tools > Monitoring. Same screen as you used to first enable it.

Whilst we haven’t seen this issue exactly, we have noted the load that turning on monitoring in Rancher causes.

What spec of nodes are you using (can’t check right now but I think Rancher monitoring is implemented with resource constraints so in theory they should not cause an OOM) ?

IIRC our controlplane/etcd nodes (3x) are using 4 vCPUs, 8GB RAM, and the worker nodes have 8 vCPU, 16GB RAM. This is a small internal “tools-dev” cluster and we may revise the hardware config at a later date (but for now it’s largely a playground for our eval of Rancher 2.2)

FYI- Gone several days with a rebuilt cluster with no monitoring enabled, no crashes. I had trouble deploying the monitoring when I rebuilt it- something wouldn’t complete and the monitoring API was unavailable. I’ll try again to deploy the monitoring system.

Hm, deploying monitoring worked great this time & completed within a minute or 2… although one difference here is I did have persistent storage available (storage class based on github.com/kubernetes-incubator/external-storage 's nfs-client example). Will report back if I have issues with this setup.

Are you using a small EBS volume on your EC2 instances by any chance?

EBS volumes under 1TB have a Burst Balance which is drawn down on when the disk i/o bursts above the number of IOPS that were provisioned. If you check the monitoring of the EBS volume you can see if the Burst Balance has ever been drawn down to 0. If it has, that’s often a cause of nodes seeming to “crash” because the disk i/o becomes very slow. Provisioning a larger EBS volume can work around that issue because the larger the volume the larger the base IOPS available to it.

I’m having the very issue with v.2.2.4

  • OS: CentOS Linux release 7.6.1810
  • k8s: v1.13.5
  • docker: 18.9.6
    custom deployment on bare-metal

I am not able to ssh into the machine to see the logs.