Rancher Crashes With "Invalid Configuration" Error After Upgrade

We are running Rancher as a standalone Docker image. Yesterday we successfully upgraded one of our installations from v2.5.3 to v2.5.11 by following the Upgrading Rancher Installed with Docker documentation. Today we followed the same process on a separate installation, but the new version of Rancher failed to start, throwing an “invalid configuration” error.
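For reference, the documented Docker upgrade we followed looks roughly like this (container names and the backup filename are ours; treat this as a sketch rather than the exact commands):

```shell
# Stop the running Rancher container (assumed to be named "rancher-server")
docker stop rancher-server

# Create a data container that references the old container's volumes
docker create --volumes-from rancher-server --name rancher-data rancher/rancher:v2.5.3

# Back up the Rancher data to a tarball on the host before upgrading
docker run --volumes-from rancher-data -v "$PWD:/backup" busybox \
    tar zcvf /backup/rancher-data-backup-v2.5.3.tar.gz /var/lib/rancher

# Pull the new image and start it against the same volumes
docker pull rancher/rancher:v2.5.11
docker run -d --volumes-from rancher-data --restart=unless-stopped \
    -p 80:80 -p 443:443 rancher/rancher:v2.5.11
```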

The only notable difference between the two upgrades is that I was forcefully disconnected from the host while watching the startup logs during the first startup attempt after upgrading. I don’t see how this could be related, but I am including it just in case.

Here is the full startup log from the final attempt before I rolled back:

2021/12/14 18:47:32 [INFO] Rancher version v2.5.11 (49de3fb27) is starting
2021/12/14 18:47:32 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Agent:false Features: ClusterRegistry:}
2021/12/14 18:47:32 [INFO] Listening on /tmp/log.sock
2021/12/14 18:47:32 [INFO] Running etcd --data-dir=management-state/etcd --heartbeat-interval=500 --election-timeout=5000
2021-12-14 18:47:32.984266 W | pkg/flags: unrecognized environment variable ETCD_URL_arm64=https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-arm64.tar.gz
2021-12-14 18:47:32.984413 W | pkg/flags: unrecognized environment variable ETCD_URL_amd64=https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-amd64.tar.gz
2021-12-14 18:47:32.984485 W | pkg/flags: unrecognized environment variable ETCD_UNSUPPORTED_ARCH=amd64
2021-12-14 18:47:32.984595 W | pkg/flags: unrecognized environment variable ETCD_URL=ETCD_URL_amd64
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-12-14 18:47:32.984765 I | etcdmain: etcd Version: 3.4.3
2021-12-14 18:47:32.984825 I | etcdmain: Git SHA: 3cf2f69b5
2021-12-14 18:47:32.984904 I | etcdmain: Go Version: go1.12.12
2021-12-14 18:47:32.984977 I | etcdmain: Go OS/Arch: linux/amd64
2021-12-14 18:47:32.985037 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2021-12-14 18:47:32.985201 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-12-14 18:47:32.985859 I | embed: name = default
2021-12-14 18:47:32.985995 I | embed: data dir = management-state/etcd
2021-12-14 18:47:32.986057 I | embed: member dir = management-state/etcd/member
2021-12-14 18:47:32.986137 I | embed: heartbeat = 500ms
2021-12-14 18:47:32.986211 I | embed: election = 5000ms
2021-12-14 18:47:32.986269 I | embed: snapshot count = 100000
2021-12-14 18:47:32.986360 I | embed: advertise client URLs = http://localhost:2379
2021-12-14 18:47:32.986416 I | embed: initial advertise peer URLs = http://localhost:2380
2021-12-14 18:47:32.986498 I | embed: initial cluster =
2021-12-14 18:47:33.039879 I | etcdserver: recovered store from snapshot at index 98200992
2021-12-14 18:47:33.040884 I | mvcc: restore compact to 93339533
2021-12-14 18:47:33.062447 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 98201689
raft2021/12/14 18:47:33 INFO: 8e9e05c52164694d switched to configuration voters=(10276657743932975437)
raft2021/12/14 18:47:33 INFO: 8e9e05c52164694d became follower at term 37235
raft2021/12/14 18:47:33 INFO: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 37235, commit: 98201689, applied: 98200992, lastindex: 98201689, lastterm: 37235]
2021-12-14 18:47:33.062954 I | etcdserver/api: enabled capabilities for version 3.4
2021-12-14 18:47:33.062972 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2021-12-14 18:47:33.063073 I | etcdserver/membership: set the cluster version to 3.4 from store
2021-12-14 18:47:33.064448 I | mvcc: restore compact to 93339533
2021-12-14 18:47:33.076566 W | auth: simple token is not cryptographically signed
2021-12-14 18:47:33.077616 I | etcdserver: starting server... [version: 3.4.3, cluster version: 3.4]
2021-12-14 18:47:33.077967 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2021-12-14 18:47:33.079984 I | embed: listening for peers on 127.0.0.1:2380
raft2021/12/14 18:47:38 INFO: 8e9e05c52164694d is starting a new election at term 37235
raft2021/12/14 18:47:38 INFO: 8e9e05c52164694d became candidate at term 37236
raft2021/12/14 18:47:38 INFO: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 37236
raft2021/12/14 18:47:38 INFO: 8e9e05c52164694d became leader at term 37236
raft2021/12/14 18:47:38 INFO: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 37236
2021-12-14 18:47:38.064563 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8c32
2021-12-14 18:47:38.064721 I | embed: ready to serve client requests
2021-12-14 18:47:38.065531 N | embed: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2021/12/14 18:47:38 [FATAL] invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

I rolled back using the Rolling Back Rancher Installed with Docker documentation and v2.5.3 started normally.

I plan to try the upgrade again tomorrow, but is there anything obviously wrong with my environment based on the v2.5.11 startup logs shown above?

Any feedback or guidance is appreciated.

In case it is relevant, here are the startup logs up to the point where I was forcefully disconnected from the server:

2021/12/14 18:36:47 [INFO] Rancher version v2.5.11 (49de3fb27) is starting
2021/12/14 18:36:47 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Agent:false Features: ClusterRegistry:}
2021/12/14 18:36:47 [INFO] Listening on /tmp/log.sock
2021/12/14 18:36:47 [INFO] Running etcd --data-dir=management-state/etcd --heartbeat-interval=500 --election-timeout=5000
2021-12-14 18:36:48.136055 W | pkg/flags: unrecognized environment variable ETCD_URL_arm64=https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-arm64.tar.gz
2021-12-14 18:36:48.136105 W | pkg/flags: unrecognized environment variable ETCD_URL_amd64=https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-amd64.tar.gz
2021-12-14 18:36:48.136111 W | pkg/flags: unrecognized environment variable ETCD_UNSUPPORTED_ARCH=amd64
2021-12-14 18:36:48.136117 W | pkg/flags: unrecognized environment variable ETCD_URL=ETCD_URL_amd64
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-12-14 18:36:48.136182 I | etcdmain: etcd Version: 3.4.3
2021-12-14 18:36:48.136188 I | etcdmain: Git SHA: 3cf2f69b5
2021-12-14 18:36:48.136192 I | etcdmain: Go Version: go1.12.12
2021-12-14 18:36:48.136197 I | etcdmain: Go OS/Arch: linux/amd64
2021-12-14 18:36:48.136202 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2021-12-14 18:36:48.136302 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-12-14 18:36:48.137081 I | embed: name = default
2021-12-14 18:36:48.137096 I | embed: data dir = management-state/etcd
2021-12-14 18:36:48.137122 I | embed: member dir = management-state/etcd/member
2021-12-14 18:36:48.137129 I | embed: heartbeat = 500ms
2021-12-14 18:36:48.137133 I | embed: election = 5000ms
2021-12-14 18:36:48.137139 I | embed: snapshot count = 100000
2021-12-14 18:36:48.137149 I | embed: advertise client URLs = http://localhost:2379
2021-12-14 18:36:48.137155 I | embed: initial advertise peer URLs = http://localhost:2380
2021-12-14 18:36:48.137161 I | embed: initial cluster =
2021-12-14 18:36:55.495218 I | etcdserver: recovered store from snapshot at index 98200992
2021-12-14 18:36:55.500736 I | mvcc: restore compact to 93339533
2021-12-14 18:36:55.583298 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 98201621
raft2021/12/14 18:36:55 INFO: 8e9e05c52164694d switched to configuration voters=(10276657743932975437)
raft2021/12/14 18:36:55 INFO: 8e9e05c52164694d became follower at term 37218
raft2021/12/14 18:36:55 INFO: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 37218, commit: 98201621, applied: 98200992, lastindex: 98201621, lastterm: 37218]
2021-12-14 18:36:55.583757 I | etcdserver/api: enabled capabilities for version 3.4
2021-12-14 18:36:55.583799 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2021-12-14 18:36:55.583811 I | etcdserver/membership: set the cluster version to 3.4 from store
2021-12-14 18:36:55.585411 I | mvcc: restore compact to 93339533
2021-12-14 18:36:55.596442 W | auth: simple token is not cryptographically signed
2021-12-14 18:36:55.597575 I | etcdserver: starting server... [version: 3.4.3, cluster version: 3.4]
2021-12-14 18:36:55.597849 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2021-12-14 18:36:55.600324 I | embed: listening for peers on 127.0.0.1:2380
raft2021/12/14 18:36:58 INFO: 8e9e05c52164694d is starting a new election at term 37218
raft2021/12/14 18:36:58 INFO: 8e9e05c52164694d became candidate at term 37219
raft2021/12/14 18:36:58 INFO: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 37219
raft2021/12/14 18:36:58 INFO: 8e9e05c52164694d became leader at term 37219
raft2021/12/14 18:36:58 INFO: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 37219
2021-12-14 18:36:58.084980 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8c32
2021-12-14 18:36:58.085436 I | embed: ready to serve client requests
2021-12-14 18:36:58.086350 N | embed: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2021/12/14 18:36:58 [INFO] Waiting for k3s to start
time="2021-12-14T18:36:58Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
time="2021-12-14T18:36:58Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/2d753699589478b1821bd86b3efed6baafd0388c616e89c9d32f1842d4f31eb6"
2021/12/14 18:36:59 [INFO] Waiting for k3s to start
2021/12/14 18:37:00 [INFO] Waiting for k3s to start
2021/12/14 18:37:01 [INFO] Waiting for k3s to start
2021/12/14 18:37:02 [INFO] Waiting for k3s to start
2021/12/14 18:37:03 [INFO] Waiting for k3s to start
2021/12/14 18:37:04 [INFO] Waiting for k3s to start
W1214 18:37:05.159756       9 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1214 18:37:05.241177       9 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1214 18:37:05.319937       9 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
2021-12-14 18:37:08.941325 W | etcdserver: read-only range request "key:\"/registry/networkpolicies/\" range_end:\"/registry/networkpolicies0\" " with result "range_response_count:1 size:1443" took too long (106.942937ms) to execute

A forceful disconnect could plausibly cause problems: it kills the shell session (typically sending your foreground process a SIGHUP) and can leave things in a half-configured state, assuming you weren’t running under a daemon, nohup, a screen session, or similar.

That said, it wouldn’t be my first guess here. I’d assume you ran the container in detached mode and were just following the logs, in which case losing the SSH session wouldn’t affect the container at all.
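For what it’s worth, the usual pattern that makes a dropped SSH session harmless looks something like this (container name assumed):

```shell
# Start Rancher detached; the container is owned by the Docker daemon,
# not by the SSH session, so a disconnect cannot signal it
docker run -d --name rancher-server --restart=unless-stopped \
    -p 80:80 -p 443:443 rancher/rancher:v2.5.11

# Follow the logs from a disposable session; killing this command
# (or losing the connection) only stops the log stream
docker logs -f rancher-server

# Or watch from inside tmux so the log view itself survives a disconnect
tmux new -s rancher-logs 'docker logs -f rancher-server'
```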

Thanks for the feedback, @wcoateRR.

You are correct that Docker was running daemonized and I was just watching the logs.

The last line of the pre-disconnect logs might be a clue that something went wrong during the upgrade process:

2021-12-14 18:37:08.941325 W | etcdserver: read-only range request "key:\"/registry/networkpolicies/\" range_end:\"/registry/networkpolicies0\" " with result "range_response_count:1 size:1443" took too long (106.942937ms) to execute

It seems like a strange coincidence that the logs show a network-policy-related request taking too long and then, immediately afterwards, my remote SSH session was forcefully disconnected.

Could be. It’s feasible that the same event caused both, even if the session ending itself wouldn’t have impacted the upgrade.

Though if it were something fragile that breaking off partway through would hose, you’d expect retries. The Docker install may just have less error checking: Rancher positions it as a quick way to try things out, with users who care more expected to use Helm or another install method, so there’s a bit of a “they can just restart the whole thing easily if this error hits” mindset.

Just to close the loop on this, I tried the upgrade again today and this time it completed successfully.

I was forcefully disconnected again at roughly the same point, but the container restarted (I’m not sure whether the two are related, but it seems likely) and came up normally on what I believe was its second attempt.
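In case anyone else hits this, a quick way to confirm whether the Docker daemon restarted the container (container name assumed) is:

```shell
# RestartCount is tracked per container by the Docker daemon
docker inspect --format '{{.RestartCount}}' rancher-server

# The daemon event log also shows die/start pairs around a restart
docker events --since 1h --filter container=rancher-server \
    --filter event=die --filter event=start
```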

I’m never a fan of “just try it again and see what happens” solutions, but that seems to be the way to go here.

And I fully recognize that we probably shouldn’t be running the Docker-based version of Rancher in any kind of production capacity, so any risks because of that choice are definitely on us.