After Upgrade from 2.3.5 to 2.4.2: Failed to communicate with API Server

Hi everyone.

I have a setup with one master and one worker node (EC2 instances running RancherOS); Rancher itself runs on its own node, started with docker run.

I upgraded to 2.4.2 recently. Since then, my master node randomly goes AWOL with this error message:
rancher Cluster health check failed: Failed to communicate with API server: https://xxx.xxx.xxx.xxx:6443/api/v1/namespaces/kube-system?timeout=30s: dial tcp i/o timeout

RancherOS is at version 1.5.3.
Kubernetes version is 1.17.4-rancher1-3.

After rebooting the EC2 instance, everything works fine again for a while, but after some random amount of time it becomes unreachable again.
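In case it helps with diagnosis, a minimal reachability probe like the sketch below could show whether port 6443 itself stops accepting connections when the node goes AWOL. The IP is a placeholder for the master node, and it uses the same 30s timeout as the failing request; this only tests the TCP handshake, not TLS or auth.

```python
# Minimal TCP reachability probe for the kube-apiserver (sketch, not Rancher's own check).
import socket
import time

API_HOST = "xxx.xxx.xxx.xxx"  # placeholder: master node IP
API_PORT = 6443
TIMEOUT_S = 30                # same timeout as the failing ?timeout=30s request

while True:
    start = time.time()
    try:
        # create_connection only tests the TCP handshake, nothing above it
        with socket.create_connection((API_HOST, API_PORT), timeout=TIMEOUT_S):
            print(f"ok    {time.time() - start:.2f}s")
    except OSError as exc:
        print(f"FAIL  {time.time() - start:.2f}s  {exc}")
    time.sleep(10)
```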

I hope this is enough information. Thanks in advance.

We also see this issue intermittently on various clusters. Did you ever manage to find out what caused it?

Small update on what we did:
Rancher: 2.5.9
Kubernetes: v1.20.9

What we checked/adjusted:
MTU sizes: we adjusted to AWS jumbo frames (a quick MTU check sketch follows after this list)
Checked ARP tables
Upgraded Traefik on the Rancher control plane from v1 to v2
Disabled the Docker daemons on the Rancher control plane nodes
Upgraded the Rancher control plane k3s nodes to Kubernetes v1.21.4+k3s1
Disabled high availability on the Rancher control plane and now run a single Rancher instance
Scaled the Rancher control plane cluster down to 2 nodes
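For the MTU point above, a quick way to see what each interface is actually set to is to read sysfs on the node. A rough sketch (Linux only; interface names and values will vary per node):

```python
# Rough sketch: print the MTU of every network interface on a Linux node.
# AWS jumbo frames allow up to 9001; overlay/VXLAN interfaces need headroom below that.
from pathlib import Path

for iface in sorted(Path("/sys/class/net").iterdir()):
    mtu = (iface / "mtu").read_text().strip()
    print(f"{iface.name:15s} mtu={mtu}")
```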

We see the errors in the Traefik logs on the Rancher CP nodes.

The issues are intermittent. We have 4 clusters, and it affects them one by one at random, never at the same time. Usually restarting the Rancher pods resolves the issue immediately.
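For anyone hitting the same thing: assuming Rancher is installed as a Deployment named rancher in the cattle-system namespace (the default for a Helm install), the restart can be scripted as a rolling restart. A sketch with the kubernetes Python client, equivalent to kubectl rollout restart:

```python
# Sketch: rolling restart of the Rancher deployment by bumping the pod-template
# annotation that `kubectl rollout restart` also sets.
# Assumes the default Helm install: Deployment "rancher" in namespace "cattle-system".
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment(name="rancher", namespace="cattle-system", body=patch)
```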

Quick update:
We removed Traefik from the Rancher control plane and replaced it with Nginx. We still see sockets being closed and I/O errors, but recovery is much faster and barely noticeable. We went from 10-20+ complaints per day from our developers to 1 per week.