Hello.
I run a single-node Rancher install with Docker, like this:
$ cat /opt/rancher/docker-compose.yml
version: "3.7"
services:
  rancher:
    image: rancher/rancher:v2.5.7
    restart: unless-stopped
    container_name: rancher
    command: ["--no-cacerts"]
    ports:
      - 80:80
      - 443:443
    volumes:
      - "./data:/var/lib/rancher"
      - "/opt/rancher/ssl/wildcard.rancher.url-chain.pem:/etc/rancher/ssl/cert.pem"
      - "/opt/rancher/ssl/wildcard.rancher.url.key:/etc/rancher/ssl/key.pem"
    privileged: true
$ kubectl get nodes
NAME       STATUS   ROLES               AGE    VERSION
master01   Ready    controlplane,etcd   338d   v1.20.5
master02   Ready    controlplane,etcd   338d   v1.20.5
master03   Ready    controlplane,etcd   338d   v1.20.5
worker01   Ready    worker              338d   v1.20.5
worker02   Ready    worker              338d   v1.20.5
worker03   Ready    worker              338d   v1.20.5
The cluster was set up about a year ago. The Rancher URL was originally configured as https://old.rancher.url and changed to https://new.rancher.url a bit later.
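As far as I understand, the URL Rancher advertises to agents is the server-url setting, which should be checkable through the Rancher API roughly like this (a sketch; token-xxxxx:yyyyy is a placeholder API token and the jq filter assumes the usual JSON shape of a settings response):
$ curl -sk -H "Authorization: Bearer token-xxxxx:yyyyy" https://new.rancher.url/v3/settings/server-url | jq -r '.value'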
While attempting to upgrade Rancher, I ran into errors and Rancher did not start up. I don't have the exact messages at hand, but Rancher complained about handshake errors.
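If it helps, the handshake errors should be recoverable from the container logs, and the certificate actually served for the new URL can be checked roughly like this (a sketch; the grep pattern is just a guess at the wording of the messages):
$ docker logs rancher 2>&1 | grep -iE 'handshake|tls|x509'
$ openssl s_client -connect new.rancher.url:443 -servername new.rancher.url </dev/null 2>/dev/null | openssl x509 -noout -subject -dates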
I rolled back to v2.5.7 and took a closer look. I noticed that the cattle-cluster-agent pod was red. There were also two of them, both in a not-ready state. Their logs showed messages like this:
time="2022-03-25T07:22:10Z" level=info msg="Connecting to proxy" url="wss://old.rancher.url/v3/connect"
time="2022-03-25T07:22:10Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp xxx.xx.xx.xx:443: connect: connection refused"
time="2022-03-25T07:22:10Z" level=error msg="Remotedialer proxy error" error="dial tcp xxx.xx.xx.xx: connect: connection refused"
time="2022-03-25T07:22:20Z" level=info msg="Connecting to wss://old.rancher.url/v3/connect with token jagbdajbdajsbvdajvdajsbdaisdbajksbdaksbdkasbdkabsdkabd"
time="2022-03-25T07:22:20Z" level=info msg="Connecting to proxy" url="wss://old.rancher.url/v3/connect"
time="2022-03-25T07:22:20Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp xxx.xx.xx.xx:443: connect: connection refused"
In the deployment config I set the environment variable CATTLE_SERVER like this:
$ kubectl get pod cattle-cluster-agent-2323c46aa5-xzkbp -n cattle-system -o json | jq '.spec.containers[].env[6]'
{
  "name": "CATTLE_SERVER",
  "value": "https://new.rancher.url"
}
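For reference, pushing that value and redeploying the agent can be done roughly like this (a sketch; the deployment name matches the pod shown above):
$ kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_SERVER=https://new.rancher.url
$ kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
$ kubectl -n cattle-system rollout status deployment/cattle-cluster-agent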
That did not help, though, even after a redeploy; I still had two pods in a not-ready state. Then I added an entry in /etc/hosts pointing old.rancher.url to the Rancher IP, and suddenly one pod went green and the other disappeared, but the error messages are still the same and upgrading Rancher is still not possible. How do I get rid of that old.rancher.url? Is my etcd corrupted?
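If it helps narrow things down, this is what I plan to check next (a sketch; I am assuming the cattle-credentials-* secret in cattle-system is where the agent reads its url and token from, and the secret name below is a placeholder):
$ kubectl -n cattle-system get secrets | grep cattle-credentials
$ kubectl -n cattle-system get secret cattle-credentials-xxxxx -o jsonpath='{.data.url}' | base64 -d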