This happened again, and there is no way to bring it back. The only option is to reinstall the Rancher server and register the existing k8s cluster with it. The downside is that the k8s cluster can then no longer be upgraded.
This is annoying, and similar cases in the forum seem to have no updates.
2023/03/20 02:26:23 [INFO] Stopping cluster agent for c-4jnqb
2023/03/20 02:26:23 [ERROR] failed to start cluster controllers c-4jnqb: context canceled
Just in case anyone else hits this problem.
The Rancher server can no longer access the downstream clusters: in the Rancher GUI you’ll see an HTTP 500 error, and in the rancher pod logs (Rancher cluster, namespace cattle-system) you’ll see messages like this:
2023/06/20 12:02:51 [INFO] Stopping cluster agent for
2023/06/20 12:02:51 [ERROR] failed to start cluster controllers : context canceled
2023/06/20 12:02:52 [ERROR] Failed to connect to peer wss:///v3/connect [local ID=]: websocket: bad handshake
2023/06/20 12:02:53 [ERROR] Failed to handle tunnel request from remote address : response 400: cluster not found
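If you want to check quickly whether your own logs show the same pattern, something like the following might help. It is only a rough sketch: the three search strings are taken from the log lines above, while the function name `count_signatures` and the idea of saving the pod logs to a file first are my own assumptions.

```python
import sys

# Substrings copied from the error messages quoted above.
SIGNATURES = (
    "failed to start cluster controllers",
    "websocket: bad handshake",
    "cluster not found",
)


def count_signatures(lines):
    """Count how many lines contain each known error signature."""
    counts = {sig: 0 for sig in SIGNATURES}
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan.py rancher.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for sig, n in count_signatures(fh).items():
            print(f"{n:5d}  {sig}")
```

Non-zero counts for all three only suggest you are looking at the same symptom; they do not prove the same cause.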
What worked for me was deleting all rancher pods in the Rancher cluster, in namespace cattle-system. I also deleted the rancher-webhook, but that was before I restarted the rancher pods, so maybe it’s not necessary.
Afterwards the downstream clusters were accessible again.
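For anyone who wants to script the workaround, here is a minimal sketch. It assumes the pods carry the labels `app=rancher` and `app=rancher-webhook` (please verify with `kubectl get pods -n cattle-system --show-labels` first, since I have not confirmed the labels for every Rancher version), and the helper names are made up for illustration. By default it only prints the kubectl commands and runs nothing.

```python
import shlex
import subprocess

NAMESPACE = "cattle-system"  # namespace mentioned in the post above


def delete_pods_cmd(label, namespace=NAMESPACE, context=None):
    """Build the kubectl argv that deletes all pods matching a label."""
    cmd = ["kubectl"]
    if context:
        # Optional kubeconfig context pointing at the Rancher (local) cluster.
        cmd += ["--context", context]
    cmd += ["delete", "pods", "-n", namespace, "-l", label]
    return cmd


def restart_rancher(context=None, include_webhook=False, execute=False):
    """Print (and optionally run) the pod deletions; the Deployments recreate the pods."""
    labels = ["app=rancher"]  # assumed label, check your installation
    if include_webhook:
        # Assumed label; the webhook is deleted first, as in the post.
        labels.insert(0, "app=rancher-webhook")
    cmds = [delete_pods_cmd(label, context=context) for label in labels]
    for cmd in cmds:
        print(shlex.join(cmd))
        if execute:
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # Dry run: shows the commands without touching the cluster.
    restart_rancher()
```

Call `restart_rancher(execute=True)` only after checking the printed commands against your cluster. Whether deleting the webhook pod is needed at all is still unclear, so it is off by default.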