Rancher 2.5.15 on a 2-node k3s cluster with MySQL: extremely high CPU use on the host and SYN flood alerts on ports

We have been running Rancher 2.5.11 on our clusters for some time now, and a couple of days ago we started having lots of problems with Rancher not being available and pods across the 2-node k3s cluster restarting. It seems like the Rancher → k3s → MySQL traffic is flooding the host, either triggering SYN flood detection or running out of file descriptors due to the massive number of open connections.

As part of troubleshooting the issue, we upgraded Rancher to 2.5.15, but the problem persists.

The configuration is two VMs running k3s, with MySQL as the external datastore running on node n01.
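Both k3s servers point at that MySQL instance as the external datastore; for reference, the wiring looks roughly like this (credentials and database name below are placeholders, not our real values):

# /etc/rancher/k3s/config.yaml on each server node (placeholder values)
# datastore-endpoint: "mysql://k3suser:k3spass@tcp(cy2tk3s-n01:3306)/k3s"

# or the equivalent command-line form:
k3s server --datastore-endpoint="mysql://k3suser:k3spass@tcp(cy2tk3s-n01:3306)/k3s"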

kubectl get nodes

NAME          STATUS   ROLES                  AGE    VERSION
cy2tk3s-n01   Ready    control-plane,master   125d   v1.20.11+k3s2
cy2tk3s-n02   Ready    control-plane,master   125d   v1.20.11+k3s2

Checking the hosts, we found that k3s-server is using 400-600% CPU, MySQL is using 40-50%, Traefik is using 40-50%, and the 3 Rancher pods are each using 100-200% CPU.

We also notice that the host is complaining about possible SYN flood attacks, which may be shutting down ports 6443, 443, 80, and 3306. We also see a ton (several thousand) of sockets in TIME_WAIT status connected to the MySQL port, and that count fluctuates by several hundred per second.
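To see that churn, a rough way to watch the TIME_WAIT count toward the MySQL port (a sketch, not our exact output):

# refresh every second; count TIME_WAIT sockets whose peer port is 3306 (header line excluded)
watch -n 1 "ss -tan state time-wait '( dport = :3306 )' | tail -n +2 | wc -l"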

The pods on the k3s cluster keep failing their liveness probes and restarting periodically.

kubectl get pods --all-namespaces

NAMESPACE                 NAME                                                       READY   STATUS      RESTARTS   AGE
kube-system               helm-install-traefik-wsq7l                                 0/1     Completed   0          125d
newrelic                  newrelic-bundle-nri-kube-events-6d4c4c7b8d-zwwdn           2/2     Running     9          2d1h
kube-system               metrics-server-86cbb8457f-xt8kr                            1/1     Running     5          125d
newrelic                  newrelic-bundle-nrk8s-controlplane-48w7q                   2/2     Running     15         2d1h
newrelic                  newrelic-bundle-nrk8s-ksm-57cc7d475d-x7wk5                 2/2     Running     14         2d1h
kube-system               svclb-traefik-r4tlz                                        2/2     Running     6          2d
cattle-resources-system   rancher-backup-594f9fdb79-pxwp9                            1/1     Running     2          47h
cattle-system             rancher-webhook-6d4f87bf67-r6hrm                           1/1     Running     0          43h
kube-system               svclb-traefik-vwh4j                                        2/2     Running     6          2d
newrelic                  newrelic-bundle-nrk8s-controlplane-rrlb6                   2/2     Running     6          2d1h
newrelic                  newrelic-bundle-newrelic-logging-xt94c                     1/1     Running     0          102m
newrelic                  newrelic-bundle-newrelic-logging-qrzp5                     1/1     Running     0          102m
newrelic                  newrelic-bundle-nri-prometheus-5b4456cd55-bzc7b            1/1     Running     0          102m
default                   dnsutils                                                   1/1     Running     44         44h
rancher-operator-system   rancher-operator-68b685d797-vnwl4                          1/1     Running     257        47h
kube-system               coredns-57d6d69685-fjk9v                                   1/1     Running     56         47h
newrelic                  newrelic-bundle-nri-metadata-injection-68d94bd975-7htzw    1/1     Running     3          2d1h
fleet-system              fleet-agent-96f6f455c-rb52g                                1/1     Running     235        47h
kube-system               local-path-provisioner-84454bff78-kkfks                    1/1     Running     938        47h
fleet-system              gitjob-54ff59776f-6z79g                                    1/1     Running     251        47h
kube-system               traefik-6f9cbd9bd4-5bz2k                                   1/1     Running     23         125d
newrelic                  newrelic-bundle-kube-state-metrics-54dcf9f69b-hh8dw        1/1     Running     5          2d1h
newrelic                  newrelic-bundle-nrk8s-kubelet-dwzfx                        2/2     Running     811        2d1h
fleet-system              fleet-controller-6dd584d8c8-j4rbb                          1/1     Running     273        47h
cattle-system             rancher-7b87794f5d-z4cgr                                   0/1     Running     177        43h
cattle-system             rancher-7b87794f5d-fgl7t                                   0/1     Running     171        43h
newrelic                  newrelic-bundle-nrk8s-kubelet-6tp4z                        2/2     Running     518        2d1h
cattle-system             rancher-7b87794f5d-ghx9q                                   1/1     Running     138        43h
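As a side note, the probe failures mentioned above also show up as events; a quick way to list them across namespaces:

# recent liveness/readiness probe failures, newest last
kubectl get events -A --field-selector reason=Unhealthy --sort-by=.lastTimestamp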

There is one 3-node bare-metal RKE cluster managed by this Rancher installation, so it's not that big of a site.

Currently all 3 Rancher pods are running on node 2, but it doesn't matter where the pods land; the problem persists.

Node1:
Tasks: 342 total, 2 running, 340 sleeping, 0 stopped, 0 zombie
%Cpu(s): 77.2 us, 9.6 sy, 0.0 ni, 4.2 id, 0.0 wa, 2.4 hi, 6.6 si, 0.0 st
MiB Mem : 15830.6 total, 2269.7 free, 5082.7 used, 8478.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 10089.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

2715847 root      20   0 2344268    1.6g  91500 S 415.0  10.4   9:06.70 k3s-server
   1262 mysql     20   0 4993220    1.0g  37220 S  95.0   6.5 1844:07   mysqld
2144277 root      20   0  738180  638388  46928 S  50.0   3.9  72:00.27 traefik

[Sun Aug 28 02:15:32 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Sun Aug 28 03:33:17 2022] TCP: request_sock_TCP: Possible SYN flooding on port 3306. Sending cookies. Check SNMP counters.
[Sun Aug 28 05:30:51 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Sun Aug 28 21:47:48 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 04:01:45 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 04:54:23 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 05:42:11 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
[Mon Aug 29 09:25:29 2022] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
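Following the kernel's "Check SNMP counters" hint, this is one way to read the relevant SYN-cookie and listen-queue counters (a sketch, adjust to taste):

# cumulative SYN / listen-queue drop counters
netstat -s | grep -iE 'syn|listen'
# or, via iproute2:
nstat -az | grep -E 'ListenDrops|ListenOverflows|ReqQFull'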

Node2:
Tasks: 319 total, 3 running, 316 sleeping, 0 stopped, 0 zombie
%Cpu(s): 75.5 us, 8.9 sy, 0.0 ni, 7.7 id, 0.0 wa, 1.8 hi, 6.2 si, 0.0 st
MiB Mem : 15830.6 total, 3474.9 free, 6793.2 used, 5562.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 8185.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

 927974 root      20   0 2134632    1.3g  90880 S 396.0   8.6   2:41.78 k3s-server
 923805 root      20   0 1965216    1.1g  62984 S  99.7   7.3   7:16.99 rancher
 919407 root      20   0 2245468    1.4g  68028 S  96.0   9.2  17:54.31 rancher
 926155 root      20   0 1760660  977.9m  63428 S  77.7   6.2   4:27.21 rancher

[Mon Aug 29 10:30:46 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:02:54 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:46:22 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:54:48 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
[Mon Aug 29 11:59:05 2022] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.

[root@cy2tk3s-n02 ~]# netstat -na | grep TIME_WAIT | wc
2248 13488 179840
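A rough breakdown of those TIME_WAIT sockets by peer port, to confirm the bulk of them point at MySQL on 3306 (a sketch, mirroring the netstat call above):

# count TIME_WAIT sockets grouped by peer port, highest first
netstat -na | awk '/TIME_WAIT/ {sub(/.*:/,"",$5); print $5}' | sort | uniq -c | sort -rn | head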

Were there any recent updates to your cluster?

No recent updates were made to the k3s cluster before the issue started.
We did upgrade from Rancher 2.5.11 to 2.5.15 late in the troubleshooting, just to rule out a couple of things.

There haven't been any cluster configuration changes on the RKE cluster that this Rancher installation manages in quite some time.

We've asked our dev teams what changes they may have made to deployments on the RKE cluster, but they claim nothing changed in the days leading up to the issue.

The number of file descriptors the k3s-server process has open is, at times, concerning.

[root@cy2tk3s-n02 ~]# ss -ntlp
State    Recv-Q   Send-Q   Local Address:Port    Peer Address:Port   Process
LISTEN   0        128      0.0.0.0:22            0.0.0.0:*           users:(("sshd",pid=1084,fd=5))
LISTEN   0        2048     0.0.0.0:32377         0.0.0.0:*           users:(("k3s-server",pid=961138,fd=306))
LISTEN   0        100      127.0.0.1:25          0.0.0.0:*           users:(("master",pid=1579,fd=16))
LISTEN   0        2048     127.0.0.1:10010       0.0.0.0:*           users:(("containerd",pid=961170,fd=150))
LISTEN   0        2048     127.0.0.1:8001        0.0.0.0:*           users:(("newrelic-infra",pid=2235,fd=15))
LISTEN   0        2048     0.0.0.0:32387         0.0.0.0:*           users:(("k3s-server",pid=961138,fd=1490))
LISTEN   0        128      0.0.0.0:2020          0.0.0.0:*           users:(("fluent-bit",pid=890254,fd=69))
LISTEN   0        2048     127.0.0.1:10248       0.0.0.0:*           users:(("k3s-server",pid=961138,fd=1248))
LISTEN   0        2048     127.0.0.1:10249       0.0.0.0:*           users:(("k3s-server",pid=961138,fd=1367))
LISTEN   0        2048     127.0.0.1:6444        0.0.0.0:*           users:(("k3s-server",pid=961138,fd=24))
LISTEN   0        2048     127.0.0.1:10256       0.0.0.0:*           users:(("k3s-server",pid=961138,fd=303))
LISTEN   0        128      [::]:22               [::]:*              users:(("sshd",pid=1084,fd=7))
LISTEN   0        100      [::1]:25              [::]:*              users:(("master",pid=1579,fd=17))
LISTEN   0        2048     *:10250               *:*                 users:(("k3s-server",pid=961138,fd=1225))
LISTEN   1        2048     *:6443                *:*                 users:(("k3s-server",pid=961138,fd=7))
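For what it's worth, a quick way to keep an eye on the fd count for that process (using the k3s-server pid from the listing above):

# file descriptors currently open by k3s-server (pid 961138 on node2)
ls /proc/961138/fd | wc -l
# compare against its per-process limit
grep 'open files' /proc/961138/limits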

I scaled the Rancher deployment to 0 and the problem goes away once Rancher is shut down.
When I scale back up to just 1 Rancher pod, it's immediately back: high CPU usage, thousands of TCP connections, etc.
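For reference, this is roughly what was run (deployment name and namespace assume a default Helm-installed Rancher):

kubectl -n cattle-system scale deployment rancher --replicas=0
# problem disappears, then:
kubectl -n cattle-system scale deployment rancher --replicas=1
# high CPU and connection churn return almost immediately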

I do notice several of these errors in the Rancher pod logs:

E0829 21:36:23.090421       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:26.260773       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:38.019125       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:40.229796       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:42.798130       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:36:59.982973       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:01.781206       9 request.go:1011] Unexpected error when reading response body: context canceled
W0829 21:37:01.892673       9 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
E0829 21:37:11.743868       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:12.635044       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:13.246894       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:14.023681       9 request.go:1011] Unexpected error when reading response body: context canceled
W0829 21:37:31.003529       9 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
I0829 21:37:31.003812       9 trace.go:205] Trace[891381185]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.20.0-rancher.1/tools/cache/reflector.go:168 (29-Aug-2022 21:37:03.332) (total time: 27671ms):
Trace[891381185]: ---"Objects listed" 27671ms (21:37:00.003)
Trace[891381185]: [27.671271617s] [27.671271617s] END
W0829 21:37:31.018288       9 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
E0829 21:37:34.446744       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:41.164710       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:37:43.513207       9 request.go:1011] Unexpected error when reading response body: context canceled
I0829 21:37:46.784618       9 trace.go:205] Trace[637310963]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.20.0-rancher.1/tools/cache/reflector.go:168 (29-Aug-2022 21:37:18.143) (total time: 28640ms):
Trace[637310963]: ---"Objects listed" 28640ms (21:37:00.784)
Trace[637310963]: [28.640624512s] [28.640624512s] END
E0829 21:38:00.970069       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:13.613221       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:18.889684       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:31.575837       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:34.164451       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:35.177839       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:42.141709       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:38:52.304389       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:02.935042       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:02.955438       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:13.197400       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:13.973229       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:21.685135       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:26.868389       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:28.051437       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:28.086862       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:28.636336       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:29.521167       9 request.go:1011] Unexpected error when reading response body: context canceled
E0829 21:39:40.224769       9 request.go:1011] Unexpected error when reading response body: context canceled

Tracked the issue down to a backup process (px-backup) on one of our other clusters that was hammering the Rancher k3s cluster with HTTPS requests.
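In case it helps anyone else, a rough way to see which peer is hammering 443/80 on the k3s nodes (a sketch; with svclb/Traefik in the path the peer may show up as a cluster-internal address rather than the real client):

# established connections hitting local port 443 or 80, grouped by peer address
netstat -tn | awk '$4 ~ /:(443|80)$/ {sub(/:[0-9]+$/,"",$5); print $5}' | sort | uniq -c | sort -rn | head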