Running v1.20.7+k3s1 in HA mode with an external Postgres DB.
Nodes are running Ubuntu 20.04 LTS.
Issue:
We're retiring some of our nodes due to aging hardware and replacing them with newer machines.
To maintain HA, this also means adding new server (control-plane) nodes.
When we add a new server node to the cluster, the existing nodes report certificate verification errors, resulting in downtime.
The new server has a valid kubeconfig and kubectl commands work as expected, but all other nodes show a NotReady state.
We've narrowed it down to the CA hash in the K3S_TOKEN changing after the new server is added. Updating this value in the systemd environment files (/etc/systemd/system/k3s.service.env or /etc/systemd/system/k3s-agent.service.env) on the other nodes restores their connection to the cluster, but isn't practical.
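For reference, the per-node workaround looks roughly like this (a minimal sketch; NEW_TOKEN stands in for the updated token read from /var/lib/rancher/k3s/server/node-token on a server, and on server nodes the file is k3s.service.env with a restart of the k3s service instead):
# Sketch of the manual workaround applied on each affected node.
# NEW_TOKEN is assumed to hold the updated token from a server's node-token file.
sudo sed -i "s|^K3S_TOKEN=.*|K3S_TOKEN='${NEW_TOKEN}'|" /etc/systemd/system/k3s-agent.service.env
sudo systemctl restart k3s-agent   # server nodes: edit k3s.service.env and restart k3s instead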
A number of apps, including CoreDNS, also fail because their service account tokens are no longer valid. This is easily resolved by deleting the tokens and then the pods so that new tokens are generated, but again, it isn't practical.
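The clean-up for the affected apps is roughly the following (a sketch assuming a default k3s install, with CoreDNS as the example; secret names and pod labels may differ for other apps):
# Delete the stale CoreDNS service account token secret so a new one is issued ...
kubectl -n kube-system get secret -o name | grep coredns-token | xargs kubectl -n kube-system delete
# ... then delete the CoreDNS pods so they restart and mount the regenerated token.
kubectl -n kube-system delete pod -l k8s-app=kube-dns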
Is there a reason the K3S_TOKEN changes with each additional server, or is there a specific process we should follow when adding new server nodes to an HA setup?
Script used for adding new server nodes:
#!/bin/bash
TOKEN="<token>"                   # Token from the initial server.
REG_URL="https://hostname:6443"   # URL of an existing server node.
K3S_VERSION="v1.20.7+k3s1"        # Must match the existing servers.

if [[ -z $REG_URL || -z $K3S_VERSION || -z $TOKEN ]]; then
    echo "Error: One or more variables are undefined."
    exit 1
fi

curl -sfL https://get.k3s.io | K3S_URL="$REG_URL" K3S_TOKEN="$TOKEN" INSTALL_K3S_VERSION="$K3S_VERSION" sh -