Rancher 2.5.15 on a 2-node k3s cluster with MySQL: extremely high CPU use on the host and SYN flood alerts on ports

We have been running Rancher 2.5.11 on our clusters for some time now, and a couple of days ago we started having lots of problems with Rancher not being available and pods across the 2-node k3s cluster restarting. It seems like the Rancher → k3s → MySQL traffic is flooding the host, either triggering SYN flood detection or running out of file descriptors due to the massive number of open connections.

As part of troubleshooting the issue, we upgraded Rancher to 2.5.15, but the problem persists.

The configuration is two VMs running k3s, with MySQL as the external datastore running on node n01.
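Both k3s servers point at that MySQL instance as the external datastore; for reference, the wiring looks roughly like this (credentials and database name below are placeholders, not our real values):

# /etc/rancher/k3s/config.yaml on each server node (placeholder values)
# datastore-endpoint: "mysql://k3suser:k3spass@tcp(cy2tk3s-n01:3306)/k3s"

# or the equivalent command-line form:
k3s server --datastore-endpoint="mysql://k3suser:k3spass@tcp(cy2tk3s-n01:3306)/k3s"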

kubectl get nodes

NAME          STATUS   ROLES                  AGE    VERSION
cy2tk3s-n01   Ready    control-plane,master   125d   v1.20.11+k3s2
cy2tk3s-n02   Ready    control-plane,master   125d   v1.20.11+k3s2

Checking the hosts, we found that k3s-server is using 400-600% CPU, MySQL is using 40-50%, Traefik is using 40-50%, and the 3 Rancher pods are each using 100-200% CPU.

We also notice that the host is complaining about possible SYN flood attacks, which may be shutting down ports 6443, 443, 80, and 3306. We also see a ton (several thousand) of sockets in TIME_WAIT status connected to the MySQL port, and that count fluctuates by several hundred per second.
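To see that churn, a rough way to watch the TIME_WAIT count toward the MySQL port (a sketch, not our exact output):

# refresh every second; count TIME_WAIT sockets whose peer port is 3306 (header line excluded)
watch -n 1 "ss -tan state time-wait '( dport = :3306 )' | tail -n +2 | wc -l"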

The pods on the k3s cluster keep failing their liveness probes and restarting periodically.

kubectl get pods --all-namespaces

NAMESPACE                 NAME                                                       READY   STATUS      RESTARTS   AGE
kube-system               helm-install-traefik-wsq7l                                 0/1     Completed   0          125d
newrelic                  newrelic-bundle-nri-kube-events-6d4c4c7b8d-zwwdn           2/2     Running     9          2d1h
kube-system               metrics-server-86cbb8457f-xt8kr                            1/1     Running     5          125d
newrelic                  newrelic-bundle-nrk8s-controlplane-48w7q                   2/2     Running     15         2d1h
newrelic                  newrelic-bundle-nrk8s-ksm-57cc7d475d-x7wk5                 2/2     Running     14         2d1h
kube-system               svclb-traefik-r4tlz                                        2/2     Running     6          2d
cattle-resources-system   rancher-backup-594f9fdb79-pxwp9                            1/1     Running     2          47h
cattle-system             rancher-webhook-6d4f87bf67-r6hrm                           1/1     Running     0          43h
kube-system               svclb-traefik-vwh4j                                        2/2     Running     6          2d
newrelic                  newrelic-bundle-nrk8s-controlplane-rrlb6                   2/2     Running     6          2d1h
newrelic                  newrelic-bundle-newrelic-logging-xt94c                     1/1     Running     0          102m
newrelic                  newrelic-bundle-newrelic-logging-qrzp5                     1/1     Running     0          102m
newrelic                  newrelic-bundle-nri-prometheus-5b4456cd55-bzc7b            1/1     Running     0          102m
default                   dnsutils                                                   1/1     Running     44         44h
rancher-operator-system   rancher-operator-68b685d797-vnwl4                          1/1     Running     257        47h
kube-system               coredns-57d6d69685-fjk9v                                   1/1     Running     56         47h
newrelic                  newrelic-bundle-nri-metadata-injection-68d94bd975-7htzw    1/1     Running     3          2d1h
fleet-system              fleet-agent-96f6f455c-rb52g                                1/1     Running     235        47h
kube-system               local-path-provisioner-84454bff78-kkfks                    1/1     Running     938        47h
fleet-system              gitjob-54ff59776f-6z79g                                    1/1     Running     251        47h
kube-system               traefik-6f9cbd9bd4-5bz2k                                   1/1     Running     23         125d
newrelic                  newrelic-bundle-kube-state-metrics-54dcf9f69b-hh8dw        1/1     Running     5          2d1h
newrelic                  newrelic-bundle-nrk8s-kubelet-dwzfx                        2/2     Running     811        2d1h
fleet-system              fleet-controller-6dd584d8c8-j4rbb                          1/1     Running     273        47h
cattle-system             rancher-7b87794f5d-z4cgr                                   0/1     Running     177        43h
cattle-system             rancher-7b87794f5d-fgl7t                                   0/1     Running     171        43h
newrelic                  newrelic-bundle-nrk8s-kubelet-6tp4z                        2/2     Running     518        2d1h
cattle-system             rancher-7b87794f5d-ghx9q                                   1/1     Running     138        43h
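As a side note, the probe failures mentioned above also show up as events; a quick way to list them across namespaces:

# recent liveness/readiness probe failures, newest last
kubectl get events -A --field-selector reason=Unhealthy --sort-by=.lastTimestamp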

There is one 3-node bare-metal RKE cluster managed by this Rancher installation, so it's not that big of a site.

Currently all 3 Rancher pods are running on node 2, but it doesn't matter where the pods land; the problem persists.

Node1:
Tasks: 342 total, 2 running, 340 sleeping, 0 stopped, 0 zombie
%Cpu(s): 77.2 us, 9.6 sy, 0.0 ni, 4.2 id, 0.0 wa, 2.4 hi, 6.6 si, 0.0 st
MiB Mem : 15830.6 total, 2269.7 free, 5082.7 used, 8478.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 10089.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

2715847 root      20   0 2344268    1.6g  91500 S 415.0  10.4   9:06.70 k3s-server
   1262 mysql     20   0 4993220    1.0g  37220 S  95.0   6.5 1844:07   mysqld
2144277 root      20   0  738180  638388  46928 S  50.0   3.9  72:00.27 traefik

[Sun Aug 28 02:15:32 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Sun Aug 28 03:33:17 2022] TCP: request_sock_TCP: Possible SYN flooding on port 3306. Sending cookies. Check SNMP counters.
[Sun Aug 28 05:30:51 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Sun Aug 28 21:47:48 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 04:01:45 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 04:54:23 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 05:42:11 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 09:25:29 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
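Following the kernel's "Check SNMP counters" hint, this is one way to read the relevant SYN-cookie and listen-queue counters (a sketch, adjust to taste):

# cumulative SYN / listen-queue drop counters
netstat -s | grep -iE 'syn|listen'
# or, via iproute2:
nstat -az | grep -E 'ListenDrops|ListenOverflows|ReqQFull'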

Node2:
Tasks: 319 total, 3 running, 316 sleeping, 0 stopped, 0 zombie
%Cpu(s): 75.5 us, 8.9 sy, 0.0 ni, 7.7 id, 0.0 wa, 1.8 hi, 6.2 si, 0.0 st
MiB Mem : 15830.6 total, 3474.9 free, 6793.2 used, 5562.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 8185.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

 927974 root      20   0 2134632    1.3g  90880 S 396.0   8.6   2:41.78 k3s-server
 923805 root      20   0 1965216    1.1g  62984 S  99.7   7.3   7:16.99 rancher
 919407 root      20   0 2245468    1.4g  68028 S  96.0   9.2  17:54.31 rancher
 926155 root      20   0 1760660  977.9m  63428 S  77.7   6.2   4:27.21 rancher

[Mon Aug 29 10:30:46 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:02:54 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:46:22 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:54:48 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:59:05 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.

[root@cy2tk3s-n02 ~]# netstat -na | grep TIME_WAIT | wc
2248 13488 179840
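A rough breakdown of those TIME_WAIT sockets by peer port, to confirm the bulk of them point at MySQL on 3306 (a sketch, mirroring the netstat call above):

# count TIME_WAIT sockets grouped by peer port, highest first
netstat -na | awk '/TIME_WAIT/ {sub(/.*:/,"",$5); print $5}' | sort | uniq -c | sort -rn | head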

Were there any recent updates to your cluster?

No recent updates were made to the k3s cluster before the issue started.
We did upgrade from Rancher 2.5.11 to 2.5.15 late in the troubleshooting, just to rule out a couple of things.

There haven't been any cluster configuration changes on the RKE cluster that this Rancher installation manages in quite some time.

We've asked our dev teams what changes they may have made to deployments on the RKE cluster, but they claim nothing changed in the days leading up to the issue.

The number of file descriptors the k3s-server process has open is, at times, concerning.

[root@cy2tk3s-n02 ~]# ss -ntlp
State    Recv-Q   Send-Q   Local Address:Port    Peer Address:Port   Process
LISTEN   0        128      0.0.0.0:22            0.0.0.0:*           users:(("sshd",pid=1084,fd=5))
LISTEN   0        2048     0.0.0.0:32377         0.0.0.0:*           users:(("k3s-server",pid=961138,fd=306))
LISTEN   0        100      127.0.0.1:25          0.0.0.0:*           users:(("master",pid=1579,fd=16))
LISTEN   0        2048     127.0.0.1:10010       0.0.0.0:*           users:(("containerd",pid=961170,fd=150))
LISTEN   0        2048     127.0.0.1:8001        0.0.0.0:*           users:(("newrelic-infra",pid=2235,fd=15))
LISTEN   0        2048     0.0.0.0:32387         0.0.0.0:*           users:(("k3s-server",pid=961138,fd=1490))
LISTEN   0        128      0.0.0.0:2020          0.0.0.0:*           users:(("fluent-bit",pid=890254,fd=69))
LISTEN   0        2048     127.0.0.1:10248       0.0.0.0:*           users:(("k3s-server",pid=961138,fd=1248))
LISTEN   0        2048     127.0.0.1:10249       0.0.0.0:*           users:(("k3s-server",pid=961138,fd=1367))
LISTEN   0        2048     127.0.0.1:6444        0.0.0.0:*           users:(("k3s-server",pid=961138,fd=24))
LISTEN   0        2048     127.0.0.1:10256       0.0.0.0:*           users:(("k3s-server",pid=961138,fd=303))
LISTEN   0        128      [::]:22               [::]:*              users:(("sshd",pid=1084,fd=7))
LISTEN   0        100      [::1]:25              [::]:*              users:(("master",pid=1579,fd=17))
LISTEN   0        2048     *:10250               *:*                 users:(("k3s-server",pid=961138,fd=1225))
LISTEN   1        2048     *:6443                *:*                 users:(("k3s-server",pid=961138,fd=7))
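For what it's worth, a quick way to keep an eye on the fd count for that process (using the k3s-server pid from the listing above):

# file descriptors currently open by k3s-server (pid 961138 on node2)
ls /proc/961138/fd | wc -l
# compare against its per-process limit
grep 'open files' /proc/961138/limits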

I scaled the Rancher deployment to 0 and the problem goes away once Rancher is shut down.
When I scale back up to just 1 Rancher pod, it's immediately back: high CPU usage, thousands of TCP connections, etc.
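For reference, this is roughly what was run (deployment name and namespace assume a default Helm-installed Rancher):

kubectl -n cattle-system scale deployment rancher --replicas=0
# problem disappears, then:
kubectl -n cattle-system scale deployment rancher --replicas=1
# high CPU and connection churn return almost immediately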

I do notice several of these errors in the Rancher pod logs:

E0829 21:36:23.090421       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:26.260773       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:38.019125       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:40.229796       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:42.798130       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:59.982973       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:01.781206       9 request.go:1011] Unexpected error when reading response body: context canceled
W0829 21:37:01.892673       9 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
E0829 21:37:11.743868       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:12.635044       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:13.246894       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:14.023681       9 request.go:1011] Unexpected error when reading response body: context canceled
W0829 21:37:31.003529       9 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
I0829 21:37:31.003812       9 trace.go:205] Trace[891381185]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.20.0-rancher.1/tools/cache/reflector.go:168 (29-Aug-2022 21:37:03.332) (total time: 27671ms):
Trace[891381185]: ---"Objects listed" 27671ms (21:37:00.003)
Trace[891381185]: [27.671271617s] [27.671271617s] END
W0829 21:37:31.018288       9 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
E0829 21:37:34.446744       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:41.164710       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:43.513207       9 request.go:1011] Unexpected error when reading response body: context canceled
I0829 21:37:46.784618       9 trace.go:205] Trace[637310963]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.20.0-rancher.1/tools/cache/reflector.go:168 (29-Aug-2022 21:37:18.143) (total time: 28640ms):
Trace[637310963]: ---"Objects listed" 28640ms (21:37:00.784)
Trace[637310963]: [28.640624512s] [28.640624512s] END
E0829 21:38:00.970069       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:13.613221       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:18.889684       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:31.575837       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:34.164451       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:35.177839       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:42.141709       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:52.304389       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:02.935042       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:02.955438       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:13.197400       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:13.973229       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:21.685135       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:26.868389       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:28.051437       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:28.086862       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:28.636336       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:29.521167       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:40.224769       9 request.go:1011] Unexpected error when reading response body: context canceled

Tracked the issue down to a backup process (px-backup) on one of our other clusters that was hammering the Rancher k3s cluster with HTTPS requests.
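In case it helps anyone else, a rough way to see which peer is hammering 443/80 on the k3s nodes (a sketch; with svclb/Traefik in the path the peer may show up as a cluster-internal address rather than the real client):

# established connections hitting local port 443 or 80, grouped by peer address
netstat -tn | awk '$4 ~ /:(443|80)$/ {sub(/:[0-9]+$/,"",$5); print $5}' | sort | uniq -c | sort -rn | head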