Hello.
I run a single-node Rancher install with Docker, like this:
$ cat /opt/rancher/docker-compose.yml
version: "3.7"
services:
  rancher:
    image: rancher/rancher:v2.5.7
    restart: unless-stopped
    container_name: rancher
    command: ["--no-cacerts"]
    ports:
      - 80:80
      - 443:443
    volumes:
      - "./data:/var/lib/rancher"
      - "/opt/rancher/ssl/wildcard.rancher.url-chain.pem:/etc/rancher/ssl/cert.pem"
      - "/opt/rancher/ssl/wildcard.rancher.url.key:/etc/rancher/ssl/key.pem"
    privileged: true
$ kubectl get nodes
NAME       STATUS   ROLES               AGE    VERSION
master01   Ready    controlplane,etcd   338d   v1.20.5
master02   Ready    controlplane,etcd   338d   v1.20.5
master03   Ready    controlplane,etcd   338d   v1.20.5
worker01   Ready    worker              338d   v1.20.5
worker02   Ready    worker              338d   v1.20.5
worker03   Ready    worker              338d   v1.20.5
The cluster was set up about a year ago. The Rancher URL was originally configured as https://old.rancher.url and changed to https://new.rancher.url a bit later.
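As far as I understand, the URL Rancher advertises to agents is the server-url setting, which should be checkable through the Rancher API roughly like this (a sketch; token-xxxxx:yyyyy is a placeholder API token and the jq filter assumes the usual JSON shape of a settings response):
$ curl -sk -H "Authorization: Bearer token-xxxxx:yyyyy" https://new.rancher.url/v3/settings/server-url | jq -r '.value'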
While attempting to upgrade Rancher, I ran into errors and Rancher did not start up. I don't have the exact messages at hand, but Rancher complained about handshake errors.
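If it helps, the handshake errors should be recoverable from the container logs, and the certificate actually served for the new URL can be checked roughly like this (a sketch; the grep pattern is just a guess at the wording of the messages):
$ docker logs rancher 2>&1 | grep -iE 'handshake|tls|x509'
$ openssl s_client -connect new.rancher.url:443 -servername new.rancher.url </dev/null 2>/dev/null | openssl x509 -noout -subject -dates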
I rolled back to v2.5.7 and took a closer look. I noticed that the cattle-cluster-agent pod was red. There were also two of them, both in a not-ready state. Their logs showed messages like this:
time="2022-03-25T07:22:10Z" level=info msg="Connecting to proxy" url="wss://old.rancher.url/v3/connect"
time="2022-03-25T07:22:10Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp xxx.xx.xx.xx:443: connect: connection refused"
time="2022-03-25T07:22:10Z" level=error msg="Remotedialer proxy error" error="dial tcp xxx.xx.xx.xx: connect: connection refused"
time="2022-03-25T07:22:20Z" level=info msg="Connecting to wss://old.rancher.url/v3/connect with token jagbdajbdajsbvdajvdajsbdaisdbajksbdaksbdkasbdkabsdkabd"
time="2022-03-25T07:22:20Z" level=info msg="Connecting to proxy" url="wss://old.rancher.url/v3/connect"
time="2022-03-25T07:22:20Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp xxx.xx.xx.xx:443: connect: connection refused"
In the deployment config I set the environment variable CATTLE_SERVER like this:
$ kubectl get pod cattle-cluster-agent-2323c46aa5-xzkbp -n cattle-system -o json | jq '.spec.containers[].env[6]'
{
  "name": "CATTLE_SERVER",
  "value": "https://new.rancher.url"
}
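For reference, pushing that value and redeploying the agent can be done roughly like this (a sketch; the deployment name matches the pod shown above):
$ kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_SERVER=https://new.rancher.url
$ kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
$ kubectl -n cattle-system rollout status deployment/cattle-cluster-agent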
That did not help, though, even after a redeploy; I still had two pods in a not-ready state. Then I added an entry in /etc/hosts pointing old.rancher.url to the Rancher IP, and suddenly one pod went green and the other disappeared, but the error messages are still the same and upgrading Rancher is still not possible. How do I get rid of that old.rancher.url? Is my etcd corrupted?
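If it helps narrow things down, this is what I plan to check next (a sketch; I am assuming the cattle-credentials-* secret in cattle-system is where the agent reads its url and token from, and the secret name below is a placeholder):
$ kubectl -n cattle-system get secrets | grep cattle-credentials
$ kubectl -n cattle-system get secret cattle-credentials-xxxxx -o jsonpath='{.data.url}' | base64 -d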