Nodes are flapping with error "PLEG is not healthy."

Hey everyone. Two out of six nodes in my cluster are reporting the error “PLEG is not healthy: pleg was last seen active 3m18.21025856s ago; threshold is 3m0s.” Right now they seem to hit this error every 5-10 minutes. Previously it was node6 of my cluster that reported it. I redeployed yesterday, and afterwards none of the nodes showed this warning. This morning, Rancher is reporting the error on nodes 2 and 5. Any suggestions? Thanks.
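To make the numbers in that message concrete: the kubelet is saying its Pod Lifecycle Event Generator last completed a pass 3m18s ago, against a 3m0s threshold, so the node was about 18 seconds over the limit. Below is a minimal sketch for pulling those two durations out of the message text, which may help when comparing how far over the threshold each node goes. The function names `parse_go_duration` and `pleg_overshoot` are made up for illustration, and the message wording is assumed to match the one quoted above.

```python
import re

# Seconds per unit, for the units Go's duration strings use (e.g. "3m18.2s").
_UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9,
}

# Two-letter units are listed first so "ms" is not read as "m" followed by "s".
_UNITS = r"(?:ms|us|µs|ns|h|m|s)"
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(" + _UNITS[3:-1] + r")")
_DURATION = r"((?:\d+(?:\.\d+)?" + _UNITS + r")+)"

# Assumed message shape, taken from the error quoted in this thread.
_PLEG = re.compile(
    r"pleg was last seen active " + _DURATION + r" ago; threshold is " + _DURATION
)


def parse_go_duration(text):
    """Convert a Go-style duration string such as '3m18.21025856s' to seconds."""
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in _TOKEN.findall(text))


def pleg_overshoot(message):
    """Return seconds past the threshold, or None if the message does not match."""
    match = _PLEG.search(message)
    if match is None:
        return None
    last_seen, threshold = (parse_go_duration(g) for g in match.groups())
    return last_seen - threshold


if __name__ == "__main__":
    msg = ("PLEG is not healthy: pleg was last seen active "
           "3m18.21025856s ago; threshold is 3m0s.")
    print(f"over threshold by {pleg_overshoot(msg):.1f}s")
```

For the message above this reports an overshoot of roughly 18.2 seconds. It only reads the message; it says nothing about why the node exceeded the threshold.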

From Google you will no doubt have found that there are a number of possible causes for PLEG errors. We see this too and are talking with Rancher engineers to help us identify the root cause. When we find it I’ll let you know, although be aware it’s possible (likely, even) that your cause is different. Out of interest: what version of Rancher are you using? What OS are you on? Do you have any AV, malware or IDS software running? Have you observed any particular circumstances when this starts happening? (In our case, for example, when we cycle nodes the replacements tend to register correctly, but after a number of days, sometimes longer, PLEG errors appear.) Is this affecting nodes of any particular type (control-plane, etcd, worker) …