Can't join additional master nodes to single-node cluster with embedded etcd

I have a single-node cluster that I was always planning on upgrading to a 3-node cluster once I got it all set up and could migrate existing workloads into the cluster.

To bring up the cluster originally, I used the following command:

export K3S_TOKEN=<redacted>
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

I have confirmed that the original server node has --cluster-init in its systemd service file.
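For reference, this is roughly how I checked (assuming the default unit file the install script drops at /etc/systemd/system/k3s.service):

systemctl cat k3s.service | grep cluster-init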

I am now attempting to join an additional server node to the cluster like so:

export K3S_TOKEN=<redacted>
curl -sfL https://get.k3s.io | sh -s - server --server https://192.168.50.9:6443

where 192.168.50.9 is the LAN IP of the existing node in the cluster.
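After the installer finishes on the new machine, I've been checking from the original server whether the new node ever shows up:

sudo k3s kubectl get nodes -o wide

So far it never has, which is the behavior described below.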

I have also made some customizations to the default configuration, which I manage with a file at /etc/rancher/k3s/config.yaml. That file looks like this (with a few details redacted):

datastore-endpoint: etcd
write-kubeconfig-mode: 660
secrets-encryption: true
cluster-domain: beleriand
flannel-backend: wireguard-native

tls-san:
  - beleriand.<redacted>.net
  - beleriand.<redacted>.ts.net
  - 100.125.27.105

cluster-cidr:
  - 10.42.0.0/16
  - <redacted>::/56

service-cidr:
  - 10.43.0.0/16
  - <redacted>::/112

disable:
  - traefik
  - servicelb

When I do this, the second server doesn't fail to come up, but it doesn't join the cluster either. Instead, I get what appears to be a completely independent single-node cluster, running nothing but the default pods that k3s runs (coredns, local-path-provisioner, etc.).
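In case it's a useful data point: the quickest way I know to see which datastore a server actually brought up is to look in the default data directory (path assumed from a standard install); an etcd/ directory there means embedded etcd, while a state.db file means the default sqlite/kine store:

sudo ls /var/lib/rancher/k3s/server/db/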

At first I thought I must have the token wrong, and indeed this is exactly the behavior I get when I install K3s on the second server with a missing or invalid token. However, I have double- and triple-checked that the token I'm using is correct, comparing it to /var/lib/rancher/k3s/server/token on the original server, and it is. I've tried both the secure format and the short format, as described here, with no success.
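For the record, the comparison amounts to nothing more than reading the token on the original server and exporting exactly that string on the new one before running the installer:

# on the original server
sudo cat /var/lib/rancher/k3s/server/token

# on the new server, before running the install script
export K3S_TOKEN=<redacted>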

Finally, I know for sure that at some point the new server is talking to the old server, because if I fail to put the correct config at /etc/rancher/k3s/config.yaml (which I have done a few times by accident, and a few more on purpose) then the new server fails to come up and complains about mismatched config values:

May 14 19:32:46 mandos sh[442104]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=info msg="Starting k3s v1.30.0+k3s1 (14549535)"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=info msg="Managed etcd cluster not yet initialized"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=warning msg="critical configuration mismatched: ClusterDNSs.slice[1]"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=warning msg="critical configuration mismatched: ClusterIPRanges.slice[1]"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=warning msg="critical configuration mismatched: cluster-domain"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=warning msg="critical configuration mismatched: secrets-encryption"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=warning msg="critical configuration mismatched: flannel-backend"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=warning msg="critical configuration mismatched: ServiceIPRanges.slice[1]"
May 14 19:32:46 mandos k3s[442109]: time="2024-05-14T19:32:46Z" level=fatal msg="starting kubernetes: preparing server: failed to validate server configuration: critical configuration value mismatch between servers"
May 14 19:32:46 mandos systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
May 14 19:32:46 mandos systemd[1]: k3s.service: Failed with result 'exit-code'.
May 14 19:32:46 mandos systemd[1]: Failed to start Lightweight Kubernetes.
May 14 19:32:51 mandos systemd[1]: k3s.service: Scheduled restart job, restart counter is at 3.
May 14 19:32:51 mandos systemd[1]: Stopped Lightweight Kubernetes.
May 14 19:32:51 mandos systemd[1]: Starting Lightweight Kubernetes...
May 14 19:32:51 mandos sh[442136]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
May 14 19:32:51 mandos k3s[442141]: time="2024-05-14T19:32:51Z" level=info msg="Starting k3s v1.30.0+k3s1 (14549535)"
May 14 19:32:51 mandos k3s[442141]: time="2024-05-14T19:32:51Z" level=info msg="Managed etcd cluster not yet initialized"
May 14 19:32:52 mandos k3s[442141]: time="2024-05-14T19:32:52Z" level=warning msg="critical configuration mismatched: ClusterDNSs.slice[1]"
May 14 19:32:52 mandos k3s[442141]: time="2024-05-14T19:32:52Z" level=warning msg="critical configuration mismatched: ClusterIPRanges.slice[1]"
May 14 19:32:52 mandos k3s[442141]: time="2024-05-14T19:32:52Z" level=warning msg="critical configuration mismatched: cluster-domain"
May 14 19:32:52 mandos k3s[442141]: time="2024-05-14T19:32:52Z" level=warning msg="critical configuration mismatched: secrets-encryption"
May 14 19:32:52 mandos k3s[442141]: time="2024-05-14T19:32:52Z" level=warning msg="critical configuration mismatched: flannel-backend"
May 14 19:32:52 mandos k3s[442141]: time="2024-05-14T19:32:52Z" level=warning msg="critical configuration mismatched: ServiceIPRanges.slice[1]"
May 14 19:32:52 mandos k3s[442141]: time="2024-05-14T19:32:52Z" level=fatal msg="starting kubernetes: preparing server: failed to validate server configuration: critical configuration value mismatch between servers"

One more wrinkle: the second server, the one I'm trying to add to the cluster, was previously running its own single-node cluster. At one point I thought there must be lingering state on it that was somehow preventing it from joining. However, I have run k3s-uninstall.sh any number of times now; I've checked for stale data in /var/lib/rancher and /etc/rancher and found nothing; and I've even tried joining a third server (which has otherwise never run k3s) to the cluster, with the same result.
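For completeness, this is roughly the reset I've been doing between attempts (the uninstall script path is the default one the installer creates; an error from ls just confirms the directories are gone):

/usr/local/bin/k3s-uninstall.sh
ls /var/lib/rancher /etc/rancher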

I’m completely at a loss. I know the servers can talk to each other (again, because of the mismatched-config errors). I know the token is correct: if I start up without the appropriate config file using an invalid or missing token, I don’t get those mismatched-config errors at all, but with my token I do. I’ve tried several different versions (1.28.something, 1.29.4, and finally 1.30.0, to see if the latest and greatest would help). What on earth is going on?

Update: I figured this out. Removing datastore-endpoint: etcd from the config file solved the issue. My guess is that specifying it that way means "use etcd locally, standalone" rather than "join the existing cluster", and apparently that takes priority over the clustering flags (--cluster-init / --server). Oh well. Lesson learned.
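Concretely, the working sequence on the joining server ends up being the same as before, just with that one line deleted from /etc/rancher/k3s/config.yaml first:

# remove this line from /etc/rancher/k3s/config.yaml
datastore-endpoint: etcd

# then re-run the same join as before
export K3S_TOKEN=<redacted>
curl -sfL https://get.k3s.io | sh -s - server --server https://192.168.50.9:6443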
