I started a rancher server and node on a spare machine of mine in my home, and at first it was working just fine!
Then I ran into a series of ridiculous problems (you can skip this section): Pipelines failed to clone their repositories because github.com wasn’t resolvable, even though the parts of rancher that use the GitHub API worked fine! So I toyed around with my node a bit, and it turns out /etc/resolv.conf was a bit screwy because of how systemd-resolved populates it. So I fixed that, then rebooted, and it turns out my server didn’t correctly configure my network interface on boot because I didn’t have it hooked up to anything during installation! After populating /etc/network/interfaces, having it still not work, then populating systemd-networkd's config, I got it to actually use my interface on reboot. That brings me to the current stage of my problem.
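For anyone retracing the resolv.conf part: systemd-resolved normally points /etc/resolv.conf at its local stub listener on 127.0.0.53, and a loopback address like that means something different inside a container's network namespace than it does on the host. Below is a minimal Python sketch that lists the nameservers in resolv.conf-style text and flags the loopback ones. The function names and the sample text are made up for illustration, and this is only a rough check, not a diagnosis of what rancher does with the file.

```python
import ipaddress


def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # resolv.conf comments start with '#' or ';'
        if not line or line.startswith(("#", ";")):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def loopback_nameservers(servers):
    """Return the subset of addresses that are loopback (e.g. 127.0.0.53)."""
    flagged = []
    for server in servers:
        try:
            if ipaddress.ip_address(server).is_loopback:
                flagged.append(server)
        except ValueError:
            # Not a plain IP literal; skip it rather than guess.
            pass
    return flagged


# Hypothetical sample; on a real node you would read /etc/resolv.conf instead.
SAMPLE = """\
# This file is managed by systemd-resolved
nameserver 127.0.0.53
nameserver 192.168.2.1
options edns0
"""

servers = parse_nameservers(SAMPLE)
print("nameservers:", servers)
print("loopback entries:", loopback_nameservers(servers))
```

On a real node, passing the contents of /etc/resolv.conf instead of the sample shows whether the only resolver listed is the local stub.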
First, Jenkins hung indefinitely during a build of my pipeline, so I tried deleting the workload. Turns out the pipeline doesn’t bring it back up, and the solution was to kill the pipeline and its namespace and make a new one! I tried, but now the namespace is stuck Terminating (through several reboots, etc.) despite having no resources! I got an error about contacting the metrics server, so I went and checked out the system project… Lo and behold, everything is fucked!
cattle-node-agent is stuck “Updating Workload”, as are nginx-ingress-controller and canal! Even after rebooting everything!
metrics-server and cattle-cluster-agent are in “Deployment does not have minimum availability” state due to CrashLoopBackOff on both of them!
I checked in, and here’s what I saw:
cattle-cluster-agent is trying to ping my old server-url (I originally configured rancher to host as https://my.domain.here, then reconfigured it with my local IP because I forgot that everything runs in my home network and my public IP isn’t resolvable from inside the network!)
metrics-server is giving me this shit:
extension-apiserver-authentication: dial tcp 10.43.0.1:443: connect: no route to host
(10.x.x.x isn’t my LAN’s subnet, so I assume this is internal to rancher.)
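A quick way to sanity-check the "internal to rancher" guess is to compare the address against the cluster's internal ranges. The sketch below assumes the RKE defaults as I understand them (10.43.0.0/16 for services, 10.42.0.0/16 for pods) and guesses a /24 LAN from the 192.168.2.17 address in this post. All three ranges are assumptions, and a cluster configured differently would need other values.

```python
import ipaddress

# Assumed ranges: the first two are believed to be RKE's defaults, the third
# is a guess at the home LAN based on 192.168.2.17. Adjust for a real setup.
SERVICE_CIDR = ipaddress.ip_network("10.43.0.0/16")
POD_CIDR = ipaddress.ip_network("10.42.0.0/16")
LAN_CIDR = ipaddress.ip_network("192.168.2.0/24")


def classify(address):
    """Label an IP as belonging to the service, pod, or LAN range, else 'other'."""
    ip = ipaddress.ip_address(address)
    if ip in SERVICE_CIDR:
        return "service"
    if ip in POD_CIDR:
        return "pod"
    if ip in LAN_CIDR:
        return "lan"
    return "other"


for addr in ("10.43.0.1", "192.168.2.17"):
    print(addr, "->", classify(addr))
```

Under those assumptions 10.43.0.1 falls in the service range, which would make it a virtual cluster address (typically the in-cluster address of the Kubernetes API) rather than anything on the physical network.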
How do I unfuck this? Do I need to totally purge the cluster and start over so that everything uses the right host? I can’t find any remnants of the old host info in anything’s configuration.
Edit: I found and updated the server-url in cattle-cluster-agent, but now it’s having difficulty reaching the rancher server via the proper IP!
https://192.168.2.17/ping is not accessible (Failed to connect to 192.168.2.17 port 443: No route to host)
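"No route to host" is a different failure from "connection refused": refused means the host answered but nothing was listening on the port, while no route means the connection attempt never got that far. A minimal Python sketch for telling the outcomes apart is below. The `probe` helper is a hypothetical name, and it only reports what the socket layer saw from wherever it is run.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connect and return a short label for the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but nothing is accepting connections on that port.
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # Same condition curl reports as "No route to host".
            return "no route"
        return "error: %s" % exc
```

Running `probe("192.168.2.17", 443)` once on the host itself and once from inside a container would show whether the two see the rancher server differently.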
Edit 2: I removed the node, killed all node-related docker images, rebooted, and tried to bring up the node again. The new one is doing the exact same thing!
Edit 3: I removed everything and tried to set up a new rancher server. First I dealt with
read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:536" took too long (201.206879ms) to execute
and now it’s
unable to access 'https://git.rancher.io/charts/': Could not resolve host: git.rancher.io
I can ping
Edit 4: It starts finally, but etcd is fucked:
rejected connection from "192.168.2.17:38656" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")