Rancher agents not available after upgrade from 2.1.1 to 2.3.0

We upgraded Rancher from 2.1.1 to 2.3.0 following the upgrade instructions here: https://rancher.com/docs/rancher/v2.x/en/upgrades/upgrades/single-node/

I decided to move to Let’s Encrypt rather than a self-signed SSL certificate, so I added the ACME option when starting the server.
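
For reference, the upgrade itself was roughly the documented single-node flow, with the ACME domain flag added when starting the new container (container names here are placeholders):

# Stop the old container and preserve its data in a data container
docker stop rancher
docker create --volumes-from rancher --name rancher-data rancher/rancher:v2.1.1

# Start the new version from the same data, with Let's Encrypt enabled
docker pull rancher/rancher:v2.3.0
docker run -d --volumes-from rancher-data --restart=unless-stopped \
    -p 80:80 -p 443:443 \
    rancher/rancher:v2.3.0 --acme-domain rancher.DOMAIN.org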

When I look at the cluster, it seems to be stuck in “Provisioning”, with this error:

Failed to get Kubernetes server version: Get https://x.x.x.x:6443/version?timeout=32s: can not build dialer to [c-gcplt:m-7d4wx]; exit status 1, unable to recognize "management-statefile_path_redacted":

The 2.1.1 agents constantly restart, with this in their error log:

INFO: Using resolv.conf: nameserver 10.0.0.2 search eu-west-1.compute.internal
INFO: https://rancher.DOMAIN.org/ping is accessible
ERROR: Value from https://rancher.DOMAIN.org/v3/settings/cacerts does not look like an x509 certificate (unable to load certificate 140304558158272:error:0906D06C:PEM routines:PEM_read_bio:no start line:../crypto/pem/pem_lib.c:691:Expecting: TRUSTED CERTIFICATE)
ERROR: Retrieved cacerts:
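
For anyone else debugging this, the value the agent is complaining about can be inspected with something like the following (assuming jq is available); in my case it matches the empty “Retrieved cacerts:” line above, presumably because after switching to Let’s Encrypt there is no custom CA to serve:

# Fetch the cacerts setting the agent reads and try to parse it as a certificate
curl -sk https://rancher.DOMAIN.org/v3/settings/cacerts | jq -r '.value' | openssl x509 -noout -subject -dates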

I tried removing the SSL option by updating the agent to 2.3.0 and removing the --ca-checksum option from the command line, but then I get this error:

time="2019-10-10T16:00:19Z" level=info msg="Connecting to wss://rancher.DOMAIN.org/v3/connect with token f6dc2pshnxhd52rwcnwf8jh4758lkkpwzf82w6kzfdphb4xj66mjzs"
time="2019-10-10T16:00:19Z" level=info msg="Connecting to proxy" url="wss://rancher.traveldev.org/v3/connect"
time="2019-10-10T16:00:19Z" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
time="2019-10-10T16:00:19Z" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"

I tried adding a new agent in the interface, but it can’t connect to the other hosts, so it fails to launch.

Can you help me?
How can I bring the cluster up?
Can I somehow move everything to a new cluster despite not being able to connect to the cluster?
Can you link to something that helps me debug this better?

Did you ever figure this out? We have the same problem.

No, unfortunately, I haven’t figured it out.
I haven’t had time to work on this, so part of our infrastructure has been down.

I am wondering if I should downgrade and see if that will bring the cluster up.

@superseb, do you have any ideas about the next step in debugging/fixing?

When Rancher is upgraded, the agent version to run is set to match the new Rancher version, so the agents update themselves. But if you also changed Rancher’s certificates, the agents lose their connection and access to the cluster is lost.

There are steps in https://github.com/rancher/rancher/issues/14731 that let you regenerate the agent definitions and apply them to the cluster, to make sure the agents are actually configured correctly (given that Rancher itself is now properly configured, using the correct helm install command).
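
In short, the flow there is: create an API token, find the cluster ID, ask the API for the cluster’s registration token and re-apply the command/manifest it gives you, which re-deploys the agents with the current server URL and CA. A rough sketch (URL, token and cluster ID are placeholders to fill in):

# Placeholders: your Rancher URL, an admin API token, and the cluster ID (c-xxxxx, visible in the UI URL)
RANCHERURL="https://rancher.DOMAIN.org"
TOKEN="token-xxxxx:yyyyyyyyyyyy"
CLUSTERID="c-xxxxx"

# Print the registration command for the cluster; applying its manifest re-deploys the agents
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "${RANCHERURL}/v3/clusterregistrationtokens?clusterId=${CLUSTERID}" | \
  jq -r '.data[] | select(.name != "system") | .command'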

I see you are using EC2 nodes. Did you also configure the cluster with the Kubernetes cloud provider set to Amazon in the cluster settings?
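
(You can double-check on a controlplane node by looking at the flags the components were started with, e.g.:)

# Should print cloud-provider=aws if the cluster was built with the Amazon cloud provider
docker inspect kube-apiserver | grep -o 'cloud-provider=[a-z]*'
docker inspect kube-controller-manager | grep -o 'cloud-provider=[a-z]*'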

Thanks for the suggestion.
Yeah, the cluster is configured with Amazon as the provider.

I tried your suggestion from https://gist.github.com/superseb/076f20146e012f1d4e289f5bd1bd4971, but the only object in the data array is the system one, so I don’t get any commands.

curl -s -H "Authorization: Bearer ${TOKEN}" "${RANCHERURL}/v3/clusterregistrationtokens?clusterId=${CLUSTERID}" | jq -r '.data[] | select(.name != "system") | .command'

^ This results in nothing

curl -s -H "Authorization: Bearer ${TOKEN}" "${RANCHERURL}/v3/clusterregistrationtokens?clusterId=${CLUSTERID}" | jq -r '.data[] | select(.name == "system") | .command'

^ This results in a command I can run. I’m going to try it and see what happens

OK, I failed at that … I can’t seem to update the cluster.

  1. When I try to get the kubeconfig_admin.yaml, I get an error: Unable to connect to the server: net/http: TLS handshake timeout
  2. I found an old kubeconfig_admin.yaml file and I used that, but I still get the error: Unable to connect to the server: net/http: TLS handshake timeout

When I look at the Docker container status, I can see that kube-apiserver is constantly restarting:

ubuntu@ranchertest1:~$ sudo docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
d704073bc13b        rancher/rancher-agent:v2.1.1         "run.sh --server h..."   5 days ago          Up 5 days                               silly_haibt
8f9079771aa2        rancher/hyperkube:v1.11.2-rancher1   "/opt/rke-tools/en..."   12 months ago       Up 2 weeks                              kube-proxy
79d667615d77        rancher/hyperkube:v1.11.2-rancher1   "/opt/rke-tools/en..."   12 months ago       Up 2 weeks                              kubelet
c4b134475cc1        rancher/hyperkube:v1.11.2-rancher1   "/opt/rke-tools/en..."   12 months ago       Up 2 weeks                              kube-scheduler
0d06281ce7f4        rancher/hyperkube:v1.11.2-rancher1   "/opt/rke-tools/en..."   12 months ago       Up 2 weeks                              kube-controller-manager
e18b3ebbacc7        rancher/hyperkube:v1.11.2-rancher1   "/opt/rke-tools/en..."   12 months ago       Up 2 seconds                            kube-apiserver
f304f98d0270        rancher/coreos-etcd:v3.2.18          "/usr/local/bin/et..."   12 months ago       Up 2 weeks                              etcd

I took a look at the logs for the kube-apiserver and saw some nice errors…

{
  "log": "F1030 13:37:51.820860       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry [https://10.0.134.133:2379 https://10.0.141.59:2379 https://10.0.140.63:2379] /etc/kubernetes/ssl/kube-node-key.pem /etc/kubernetes/ssl/kube-node.pem /etc/kubernetes/ssl/kube-ca.pem true false 1000 0xc420261500 <nil> 5m0s 1m0s}), err (context deadline exceeded)\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:51.821152667Z"
}
{
  "log": "+ grep -q cloud-provider=azure\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.29488014Z"
}
{
  "log": "+ echo kube-apiserver --kubelet-client-certificate=/etc/kubernetes/ssl/kube-apiserver.pem --allow-privileged=true --service-cluster-ip-range=10.43.0.0/16 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --requestheader-allowed-names=kube-apiserver-proxy-client --tls-cert-file=/etc/kubernetes/ssl/kube-apiserver.pem --requestheader-username-headers=X-Remote-User --service-node-port-range=30000-32767 --etcd-prefix=/registry --admission-control=ServiceAccount,NamespaceLifecycle,LimitRanger,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds --requestheader-group-headers=X-Remote-Group --kubelet-client-key=/etc/kubernetes/ssl/kube-apiserver-key.pem --etcd-certfile=/etc/kubernetes/ssl/kube-node.pem --etcd-keyfile=/etc/kubernetes/ssl/kube-node-key.pem --requestheader-client-ca-file=/etc/kubernetes/ssl/kube-apiserver-requestheader-ca.pem --proxy-client-key-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client-key.pem --proxy-client-cert-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client.pem --secure-port=6443 --storage-backend=etcd3 --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --requestheader-extra-headers-prefix=X-Remote-Extra- --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --tls-private-key-file=/etc/kubernetes/ssl/kube-apiserver-key.pem --etcd-cafile=/etc/kubernetes/ssl/kube-ca.pem --bind-address=0.0.0.0 --insecure-port=0 --service-account-key-file=/etc/kubernetes/ssl/kube-apiserver-key.pem --authorization-mode=Node,RBAC --etcd-servers=https://10.0.134.133:2379,https://10.0.141.59:2379,https://10.0.140.63:2379 --insecure-bind-address=127.0.0.1 --cloud-provider=aws\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.29569754Z"
}
{
  "log": "+ '[' kube-apiserver = kubelet ']'\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.296391272Z"
}
{
  "log": "+ exec kube-apiserver --kubelet-client-certificate=/etc/kubernetes/ssl/kube-apiserver.pem --allow-privileged=true --service-cluster-ip-range=10.43.0.0/16 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --requestheader-allowed-names=kube-apiserver-proxy-client --tls-cert-file=/etc/kubernetes/ssl/kube-apiserver.pem --requestheader-username-headers=X-Remote-User --service-node-port-range=30000-32767 --etcd-prefix=/registry --admission-control=ServiceAccount,NamespaceLifecycle,LimitRanger,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds --requestheader-group-headers=X-Remote-Group --kubelet-client-key=/etc/kubernetes/ssl/kube-apiserver-key.pem --etcd-certfile=/etc/kubernetes/ssl/kube-node.pem --etcd-keyfile=/etc/kubernetes/ssl/kube-node-key.pem --requestheader-client-ca-file=/etc/kubernetes/ssl/kube-apiserver-requestheader-ca.pem --proxy-client-key-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client-key.pem --proxy-client-cert-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client.pem --secure-port=6443 --storage-backend=etcd3 --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --requestheader-extra-headers-prefix=X-Remote-Extra- --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --tls-private-key-file=/etc/kubernetes/ssl/kube-apiserver-key.pem --etcd-cafile=/etc/kubernetes/ssl/kube-ca.pem --bind-address=0.0.0.0 --insecure-port=0 --service-account-key-file=/etc/kubernetes/ssl/kube-apiserver-key.pem --authorization-mode=Node,RBAC --etcd-servers=https://10.0.134.133:2379,https://10.0.141.59:2379,https://10.0.140.63:2379 --insecure-bind-address=127.0.0.1 --cloud-provider=aws\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.296404885Z"
}
{
  "log": "Flag --admission-control has been deprecated, Use --enable-admission-plugins or --disable-admission-plugins instead. Will be removed in a future version.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.446063372Z"
}
{
  "log": "Flag --insecure-port has been deprecated, This flag will be removed in a future version.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.44609073Z"
}
{
  "log": "Flag --insecure-bind-address has been deprecated, This flag will be removed in a future version.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.44609495Z"
}
{
  "log": "I1030 13:37:52.446264       1 server.go:703] external host was not specified, using 10.0.141.59\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.446441805Z"
}
{
  "log": "I1030 13:37:52.446400       1 server.go:145] Version: v1.11.2\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:52.446449752Z"
}
{
  "log": "W1030 13:37:53.047054       1 admission.go:71] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:53.047375752Z"
}
{
  "log": "I1030 13:37:53.047307       1 plugins.go:158] Loaded 6 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:53.047400206Z"
}
{
  "log": "I1030 13:37:53.047326       1 plugins.go:161] Loaded 3 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,ResourceQuota.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:53.047404911Z"
}
{
  "log": "W1030 13:37:53.048803       1 admission.go:71] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:53.049040344Z"
}
{
  "log": "I1030 13:37:53.048975       1 plugins.go:158] Loaded 6 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:53.049051415Z"
}
{
  "log": "I1030 13:37:53.048989       1 plugins.go:161] Loaded 3 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,ResourceQuota.\n",
  "stream": "stderr",
  "time": "2019-10-30T13:37:53.049055263Z"
}

So, that makes sense.
The kube-apiserver is what serves port 6443, but it keeps crashing for some reason … maybe it’s the certificates.

@superseb, can I replace the certificates directly and stop it from crashing?

Or can I recreate the cluster somehow with a hacky method?

The kube-apiserver connects to etcd as its backend, and that connection is failing. The first step is to check the status of etcd using https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/ and get that sorted out; the other Kubernetes components can’t run until etcd is healthy.
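
The usual first checks from that page are along these lines (this assumes the RKE-deployed etcd container is named etcd; depending on the etcd version you may have to pass endpoint and certificate flags to etcdctl explicitly):

# Look for obvious errors, then check cluster membership and health
docker logs --tail 100 etcd
docker exec etcd etcdctl member list
docker exec etcd etcdctl endpoint health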

The etcd error (from docker logs) is:

etcdmain: rejected connection from \"10.0.134.133:41698\" (error \"remote error: tls: bad certificate\", ServerName \"\")

So I presume I have to replace some certificates somewhere … I’m digging more into this
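
First thing I’m checking is whether the certificates under /etc/kubernetes/ssl (the paths from the kube-apiserver flags above) have simply expired; roughly:

# Print the expiry date of every certificate the components use (skip the private keys)
for f in /etc/kubernetes/ssl/*.pem; do
    case "$f" in *-key.pem) continue ;; esac
    echo "$f: $(openssl x509 -in "$f" -noout -enddate)"
done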

A small status update.
@superseb has been really helpful in the research over Slack - thanks :slight_smile:

The cause of the initial cluster failure is that the certificates on the cluster expired, which caused massive log spam.
The log spam filled up the disks, and the cluster failed and became unresponsive.

The initial fix was to add more disk space or truncate the logs. I added more disk space.
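
(If you go the truncation route instead, something like this frees the space used by the container logs, assuming the default json-file log driver:)

# See what is full, then empty the per-container JSON logs in place
df -h /var/lib/docker
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'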

This script was really useful to identify broken certificates - https://gist.github.com/superseb/e605f94b6db76d1c0a818f048b8e9295

But knowing that the certificates are expired doesn’t give me anything to act on yet, since I am not certain how to rotate them manually.

Since I can’t rotate the certs manually yet, I decided to get the cluster up and running.

The problem was that etcd was failing, and the cluster didn’t want to boot.
I managed to get etcd to stop failing by changing the system clock on the cluster nodes to a time within the certificates’ validity period (just replace 2019-09-30 with whatever date falls inside your certificates’ validity):

timedatectl set-ntp 0
timedatectl set-time "2019-09-30 $(date +%H:%M:%S)"
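
(Before restarting anything, you can sanity-check that the clock is now inside the validity window of whichever certificate had expired, for example:)

# "Certificate will not expire" / exit code 0 means the cert is valid at the current (faked) time
date
openssl x509 -in /etc/kubernetes/ssl/kube-apiserver.pem -noout -dates
openssl x509 -in /etc/kubernetes/ssl/kube-apiserver.pem -noout -checkend 0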

This got etcd running, but kube-controller-manager was now failing. Its logs revealed that AWS rejects requests signed with its credentials when the clock is that far off.

To fix that, I had to turn NTP back on:

timedatectl set-ntp 1

This got the cluster up and running and I could see the pods starting to come up, but with the real time restored, kubectl reported errors again (certificate expired), so I had to turn the system clock back once more:

timedatectl set-ntp 0
timedatectl set-time "2019-09-30 $(date +%H:%M:%S)"

NOW we have something … the cluster is up and running and kubectl works properly.

Unfortunately, I’m still stuck - Rancher can’t connect.
When I look at the newly applied definitions, I can see that cattle-cluster-agent can’t start properly.

root@ranchertest1:~# docker run --rm --net=host -v $PWD/kubeconfig_admin.yaml:/root/.kube/config --entrypoint bash $(docker inspect $(docker images -q --filter=label=io.cattle.agent=true) --format='{{index .RepoTags 0}}' | tail -1) -c 'kubectl -n cattle-system get pods -o wide'
NAME                                    READY     STATUS    RESTARTS   AGE       IP             NODE
cattle-cluster-agent-7d74545487-ft68q   0/1       Error     3288       1d        10.42.2.94     ip-10-0-141-59.eu-west-1.compute.internal
cattle-cluster-agent-7d74545487-fxcsm   0/1       Evicted   0          66d       <none>         ip-10-0-134-133.eu-west-1.compute.internal
cattle-node-agent-9bzdd                 0/1       Error     32         3d        10.0.134.133   ip-10-0-134-133.eu-west-1.compute.internal
cattle-node-agent-bxzq6                 0/1       Error     22         311d      10.0.141.59    ip-10-0-141-59.eu-west-1.compute.internal
cattle-node-agent-r68xk                 0/1       Error     28         311d      10.0.140.63    ip-10-0-140-63.eu-west-1.compute.internal

The logs reveal that the cluster-register container is not available:

root@ranchertest1:~# docker run --rm --net=host -v $PWD/kubeconfig_admin.yaml:/root/.kube/config --entrypoint bash $(docker inspect $(docker images -q --filter=label=io.cattle.agent=true) --format='{{index .RepoTags 0}}' | tail -1) -c 'kubectl -n cattle-system logs -l app=cattle-cluster-agent'
INFO: Environment: CATTLE_ADDRESS=10.42.2.94 CATTLE_CA_CHECKSUM=REDACTED CATTLE_CLUSTER=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-7d74545487-ft68q CATTLE_SERVER=https://rancher.DOMAIN.COM
INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal options ndots:5
ERROR: https://rancher.DOMAIN.COM/ping is not accessible (Could not resolve host: rancher.DOMAIN.COM)
Error from server (BadRequest): container "cluster-register" in pod "cattle-cluster-agent-7d74545487-fxcsm" is not available
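
So in-cluster DNS can’t resolve the Rancher hostname. To see whether that’s a general cluster DNS problem rather than something specific to the agent pod, I’m checking roughly this via the same docker-run kubectl wrapper as above (labels and image are the usual defaults, adjust as needed):

# Is kube-dns itself running?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Can an arbitrary pod resolve the Rancher hostname? (busybox:1.28 has a working nslookup)
kubectl run dnstest --rm -it --restart=Never --image=busybox:1.28 -- nslookup rancher.DOMAIN.COM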

And that’s where I’m at right now … I’m looking into whether I can replace the certificates while keeping the existing keys - hopefully that’s something I can do, and then set the clock to the correct date again.

Sorry for the long post, but perhaps this saves someone else some time

OK, I tried replacing the certificates by generating new ones from each certificate’s existing key and contents:

for file in $(ls *.pem | grep -v "\-key\.pem$" | grep -v kube-admin | grep -v kube-apiserver-proxy-client.pem | grep -v kube-apiserver-requestheader-ca.pem | grep -v kube-ca.pem | sort); do
    openssl verify -CAfile kube-ca.pem $file
    if [[ "$?" != "0" ]]; then
        # The certificate is outdated - renew
        BASENAME=$(echo $file | sed -e "s/\.pem//")
        
        # Create CSR
        openssl x509 -in $BASENAME.pem \
            -signkey $BASENAME-key.pem \
            -x509toreq -out $BASENAME.csr
            
        # Create and sign certificate
        openssl x509 -signkey $BASENAME-key.pem \
            -CA kube-ca.pem \
            -CAkey kube-ca-key.pem \
            -CAcreateserial \
            -in $BASENAME.csr \
            -req -days 365 \
            -out $BASENAME.pem
    fi
done

This created valid certificates, signed with the kube root ca.
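
(A quick check for each renewed file that the signature and dates are what I expect, e.g. for the apiserver certificate:)

# Should print OK and an expiry date roughly a year out
openssl verify -CAfile kube-ca.pem kube-apiserver.pem
openssl x509 -in kube-apiserver.pem -noout -issuer -dates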

Then I restarted Docker, but the cluster didn’t come up :confused:
Now I’m getting the certificate errors in etcd and kube-controller-manager again.