We have been running Rancher 2.5.11 on our clusters for some time now, and a couple of days ago we started having lots of problems with Rancher becoming unavailable and pods across the two-node k3s cluster restarting. It looks like the Rancher → k3s → MySQL path is flooding the hosts, either triggering the kernel's SYN flood detection or running out of file descriptors due to the massive number of open connections.
As part of troubleshooting, we upgraded Rancher to 2.5.15, but the problem persists.
The configuration is two VMs running k3s, with MySQL as the external datastore running on node n01.
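For context, both k3s servers point at that MySQL instance as their external datastore, roughly like this (user, password, and database name below are placeholders, not our real values):

# sanitized k3s server invocation on each node
k3s server \
  --datastore-endpoint="mysql://k3s:PASSWORD@tcp(cy2tk3s-n01:3306)/k3s"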
kubectl get nodes
NAME STATUS ROLES AGE VERSION
cy2tk3s-n01 Ready control-plane,master 125d v1.20.11+k3s2
cy2tk3s-n02 Ready control-plane,master 125d v1.20.11+k3s2
Checking the hosts, we found k3s-server using 400-600% CPU, MySQL using 40-50%, Traefik using 40-50%, and the three Rancher pods using 100-200% CPU.
We also noticed the kernel complaining about possible SYN flooding on ports 6443, 443, 80, and 3306 (it falls back to sending SYN cookies), which may be what is making those ports unresponsive. We also see several thousand sockets in TIME_WAIT against the MySQL port, and the TIME_WAIT count swings by several hundred per second.
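For reference, this is roughly how we are quantifying the churn on the MySQL host (the ss filter is just an example; adjust the port if your datastore listens elsewhere):

# sockets in TIME_WAIT involving the MySQL port
ss -tan state time-wait '( sport = :3306 or dport = :3306 )' | wc -l
# kernel counters for SYN cookies and listen-queue overflows
netstat -s | grep -iE 'syn|overflow'
# open file descriptors for k3s-server (PID taken from the top output below) vs. system-wide usage
ls /proc/2715847/fd | wc -l
cat /proc/sys/fs/file-nr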
Pods on the k3s cluster keep failing their liveness probes and restarting periodically, as the restart counts below show (an events check to confirm this follows the listing).
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system helm-install-traefik-wsq7l 0/1 Completed 0 125d
newrelic newrelic-bundle-nri-kube-events-6d4c4c7b8d-zwwdn 2/2 Running 9 2d1h
kube-system metrics-server-86cbb8457f-xt8kr 1/1 Running 5 125d
newrelic newrelic-bundle-nrk8s-controlplane-48w7q 2/2 Running 15 2d1h
newrelic newrelic-bundle-nrk8s-ksm-57cc7d475d-x7wk5 2/2 Running 14 2d1h
kube-system svclb-traefik-r4tlz 2/2 Running 6 2d
cattle-resources-system rancher-backup-594f9fdb79-pxwp9 1/1 Running 2 47h
cattle-system rancher-webhook-6d4f87bf67-r6hrm 1/1 Running 0 43h
kube-system svclb-traefik-vwh4j 2/2 Running 6 2d
newrelic newrelic-bundle-nrk8s-controlplane-rrlb6 2/2 Running 6 2d1h
newrelic newrelic-bundle-newrelic-logging-xt94c 1/1 Running 0 102m
newrelic newrelic-bundle-newrelic-logging-qrzp5 1/1 Running 0 102m
newrelic newrelic-bundle-nri-prometheus-5b4456cd55-bzc7b 1/1 Running 0 102m
default dnsutils 1/1 Running 44 44h
rancher-operator-system rancher-operator-68b685d797-vnwl4 1/1 Running 257 47h
kube-system coredns-57d6d69685-fjk9v 1/1 Running 56 47h
newrelic newrelic-bundle-nri-metadata-injection-68d94bd975-7htzw 1/1 Running 3 2d1h
fleet-system fleet-agent-96f6f455c-rb52g 1/1 Running 235 47h
kube-system local-path-provisioner-84454bff78-kkfks 1/1 Running 938 47h
fleet-system gitjob-54ff59776f-6z79g 1/1 Running 251 47h
kube-system traefik-6f9cbd9bd4-5bz2k 1/1 Running 23 125d
newrelic newrelic-bundle-kube-state-metrics-54dcf9f69b-hh8dw 1/1 Running 5 2d1h
newrelic newrelic-bundle-nrk8s-kubelet-dwzfx 2/2 Running 811 2d1h
fleet-system fleet-controller-6dd584d8c8-j4rbb 1/1 Running 273 47h
cattle-system rancher-7b87794f5d-z4cgr 0/1 Running 177 43h
cattle-system rancher-7b87794f5d-fgl7t 0/1 Running 171 43h
newrelic newrelic-bundle-nrk8s-kubelet-6tp4z 2/2 Running 518 2d1h
cattle-system rancher-7b87794f5d-ghx9q 1/1 Running 138 43h
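To confirm the restarts are probe failures rather than something like OOM kills, an events query along these lines works (the pod name is just one taken from the listing above):

kubectl get events --all-namespaces --field-selector type=Warning --sort-by=.lastTimestamp | grep -iE 'liveness|readiness|unhealthy'
kubectl -n cattle-system describe pod rancher-7b87794f5d-z4cgr | grep -iA3 'liveness\|last state'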
There is one 3-node bare-metal RKE cluster managed by this Rancher installation, so it is not that big of a site.
Currently all three Rancher pods are running on node2, but it does not matter which node the pods run on; the problem persists.
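For what it's worth, this is how we are checking where the Rancher pods land:

kubectl -n cattle-system get pods -o wide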
Node1:
Tasks: 342 total, 2 running, 340 sleeping, 0 stopped, 0 zombie
%Cpu(s): 77.2 us, 9.6 sy, 0.0 ni, 4.2 id, 0.0 wa, 2.4 hi, 6.6 si, 0.0 st
MiB Mem : 15830.6 total, 2269.7 free, 5082.7 used, 8478.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 10089.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2715847 root 20 0 2344268 1.6g 91500 S 415.0 10.4 9:06.70 k3s-server
1262 mysql 20 0 4993220 1.0g 37220 S 95.0 6.5 1844:07 mysqld
2144277 root 20 0 738180 638388 46928 S 50.0 3.9 72:00.27 traefik
[Sun Aug 28 02:15:32 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Sun Aug 28 03:33:17 2022] TCP: request_sock_TCP: Possible SYN flooding on port 3306. Sending cookies. Check SNMP counters.
[Sun Aug 28 05:30:51 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Sun Aug 28 21:47:48 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 04:01:45 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 04:54:23 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 05:42:11 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 09:25:29 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
Node2:
Tasks: 319 total, 3 running, 316 sleeping, 0 stopped, 0 zombie
%Cpu(s): 75.5 us, 8.9 sy, 0.0 ni, 7.7 id, 0.0 wa, 1.8 hi, 6.2 si, 0.0 st
MiB Mem : 15830.6 total, 3474.9 free, 6793.2 used, 5562.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 8185.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
927974 root 20 0 2134632 1.3g 90880 S 396.0 8.6 2:41.78 k3s-server
923805 root 20 0 1965216 1.1g 62984 S 99.7 7.3 7:16.99 rancher
919407 root 20 0 2245468 1.4g 68028 S 96.0 9.2 17:54.31 rancher
926155 root 20 0 1760660 977.9m 63428 S 77.7 6.2 4:27.21 rancher
[Mon Aug 29 10:30:46 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:02:54 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:46:22 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:54:48 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:59:05 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[root@cy2tk3s-n02 ~]# netstat -na | grep TIME_WAIT | wc
2248 13488 179840
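Grouping those TIME_WAIT sockets by peer address shows where the churn concentrates, and the relevant backlog settings are easy to dump alongside it. This is just how we are slicing the problem, not a fix:

# top peers holding TIME_WAIT sockets (column 5 of netstat -na is the foreign address)
netstat -na | grep TIME_WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head
# SYN-cookie and listen-backlog settings currently in effect
sysctl net.ipv4.tcp_syncookies net.core.somaxconn net.ipv4.tcp_max_syn_backlog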