Nodes are flapping with error "PLEG is not healthy."

Hey everyone. Two of the six nodes in my cluster are reporting the error “PLEG is not healthy: pleg was last seen active 3m18.21025856s ago; threshold is 3m0s.” Right now they seem to hit this error every 5-10 minutes. Previously, node6 was the one reporting it. I redeployed yesterday, and afterwards none of the nodes showed the warning. This morning, Rancher is reporting the error on nodes 2 and 5. Any suggestions? Thanks.
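
In case it helps anyone tracking how far over the threshold their nodes are, here is a minimal Python sketch that pulls the two durations out of a message like the one above. The regular expression is an assumption based only on that single quoted message, and the function names are made up for illustration; other kubelet versions may word the message differently.

```python
import re

# Go-style duration such as "3m18.21025856s" or "3m0s", as printed in the error above.
_DURATION = r"(?:\d+(?:\.\d+)?(?:h|ms|us|µs|ns|m|s))+"
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(h|ms|us|µs|ns|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}

# Assumption: the pattern is modelled on the one message quoted in this thread.
_PLEG = re.compile(rf"pleg was last seen active ({_DURATION}) ago; threshold is ({_DURATION})")


def duration_seconds(text):
    """Convert a Go-style duration string to seconds."""
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in _TOKEN.findall(text))


def parse_pleg_error(message):
    """Return (last_seen_seconds, threshold_seconds), or None if the message does not match."""
    m = _PLEG.search(message)
    if m is None:
        return None
    return duration_seconds(m.group(1)), duration_seconds(m.group(2))


if __name__ == "__main__":
    msg = "PLEG is not healthy: pleg was last seen active 3m18.21025856s ago; threshold is 3m0s."
    last_seen, threshold = parse_pleg_error(msg)
    print(f"over threshold by {last_seen - threshold:.2f}s")
```

For the message above, that works out to roughly 18 seconds past the 3 minute threshold.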

From Google you will no doubt have found that there are a number of possible causes for PLEG errors. We see this too and are talking with Rancher engineers to help us identify the root cause. When we find it I’ll let you know, although be aware it’s possible (likely, even) that your cause is different. Out of interest: what version of Rancher are you using? What OS are you on? Do you have any AV, malware or IDS software running? Have you observed any particular circumstances when this starts happening? (In our case, when we cycle nodes the replacements tend to register correctly, but after a number of days, sometimes longer, PLEG errors appear.) Is this affecting nodes of any particular type (control plane, etcd, worker)? …

Regards

Fraser

I had Rancher v2.1.7 at the time. I drained the problem nodes and rebooted them, and I haven’t seen the issue since. The nodes are on Ubuntu 16.04.6. I don’t have any AV, malware, or IDS software running. I don’t remember any particular change that may have caused this; one day I just started seeing the PLEG errors. All the nodes are etcd, control plane, and worker nodes. I have also since upgraded to Rancher v2.2.7.
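
For anyone who wants to try the same drain-and-reboot workaround, here is a rough Python sketch that only prints the commands rather than running them. It assumes kubectl is pointed at the cluster and that the nodes can be reached over SSH with sudo rights; the node names and the `ssh_target` parameter are hypothetical, and the uncordon step is added because a drained node stays unschedulable until it is uncordoned.

```python
import shlex


def drain_reboot_commands(node, ssh_target=None):
    """Build (but do not run) the commands for draining, rebooting and uncordoning a node.

    ssh_target is a hypothetical SSH destination for the node; it defaults to the node name.
    """
    ssh_target = ssh_target or node
    return [
        # Cordon the node and evict its pods; DaemonSet pods are left in place.
        ["kubectl", "drain", node, "--ignore-daemonsets"],
        # Reboot over SSH (assumes SSH access and sudo rights on the node).
        ["ssh", ssh_target, "sudo", "reboot"],
        # Block until the node reports the Ready condition.
        ["kubectl", "wait", "--for=condition=Ready", f"node/{node}", "--timeout=10m"],
        # Draining leaves the node cordoned, so make it schedulable again.
        ["kubectl", "uncordon", node],
    ]


if __name__ == "__main__":
    # Node names are placeholders; substitute the names shown by `kubectl get nodes`.
    for name in ("node2", "node5"):
        for cmd in drain_reboot_commands(name):
            print(shlex.join(cmd))
```

Two caveats: kubectl drain may refuse to proceed if pods use local storage or are not managed by a controller, and the wait step can return immediately if it runs before the cluster has noticed the node went down, so check the node status by hand before uncordoning.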

I had this too, probably after I enabled Longhorn and forgot to install iSCSI on a new node that was added later to replace a dead one… but there were some other changes at that time, and I’m not sure what the exact cause was.
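
If you want to rule out the missing iSCSI piece on a node, a quick check along these lines might help. It is only a sketch that looks for the `iscsiadm` binary on the PATH, and it says nothing about whether that was actually the cause of the PLEG errors; the function name is made up for illustration.

```python
import shutil


def has_iscsiadm(path=None):
    """Return True if an iscsiadm executable is found on the given PATH (default: current PATH)."""
    return shutil.which("iscsiadm", path=path) is not None


if __name__ == "__main__":
    if has_iscsiadm():
        print("iscsiadm found")
    else:
        # On Ubuntu, iscsiadm is provided by the open-iscsi package.
        print("iscsiadm not found; on Ubuntu it comes from the open-iscsi package")
```

Run it on each node that Longhorn schedules onto, since the check only covers the machine it runs on.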

Sounds like a good reason to keep etcd, control plane and worker nodes separate.

We run separate nodes for all types in HA mode but still see this periodically. It’s really hard to pin down a root cause. We’ve been looking for months but still only have anecdotal evidence.

I have not seen the PLEG problem since draining and rebooting the affected nodes. I will update this if I run across it again.