Home rancher server can't bring cattle-system or metrics-server up!

Hi everyone,

I set up a Rancher server and node on a spare machine at home, and at first it worked just fine!

Then I ran into a series of ridiculous problems (you can skip this section): Pipelines failed to clone their repositories because github.com wasn’t resolvable, even though the parts of Rancher that use the GitHub API worked fine! So I poked around on my node, and it turns out /etc/resolv.conf was a bit screwy because of how systemd-resolved populates it. I fixed that and rebooted, and it turns out my server didn’t configure my network interface correctly on boot because I didn’t have it hooked up to anything during installation! After populating /etc/network/interfaces (which still didn’t work) and then populating systemd-networkd's config, I got it to actually use my interface on reboot. That brings me to the current stage of my problem:

First, Jenkins hung indefinitely during a build of my pipeline, so I tried deleting the workload. Turns out the pipeline doesn’t bring it back up, and the solution was to kill the pipeline and its namespace and make a new one! I tried, but now the namespace is stuck in Terminating (even after several reboots) despite having no resources! I got an error about contacting the metrics server, so I went and checked out the System project… Lo and behold, everything is fucked!

  • cattle-node-agent is stuck “Updating Workload”, as are nginx-ingress-controller and canal! Even after rebooting everything!

  • metrics-server and cattle-cluster-agent are in “Deployment does not have minimum availability” state due to CrashLoopBackOff on both of them!

I checked in, and here’s what I saw:

  • cattle-cluster-agent is trying to ping my old server-url (I originally configured rancher to host as https://my.domain.here, then reconfigured it with my local IP because I forgot that everything runs in my home network and my public IP isn’t resolvable from inside the network!)

  • metrics-server is giving me this shit: extension-apiserver-authentication: dial tcp 10.43.0.1:443: connect: no route to host (10.x.x.x isn’t my LAN’s subnet so I assume this is internal to rancher.)

How do I unfuck this? Do I need to totally purge the cluster and start over so that everything uses the right host? I can’t find any remnants of the old host info in anything’s configuration.

Edit: I found and updated the server-url in cattle-cluster-agent, but now it’s having difficulty reaching the rancher server via the proper IP! https://192.168.2.17/ping is not accessible (Failed to connect to 192.168.2.17 port 443: No route to host)
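
In case anyone wants to reproduce the connectivity check outside of the agent, here is a minimal Python sketch of a plain TCP probe. It is not part of Rancher; the function names check_tcp and classify are made up for illustration, and the two addresses are just the ones from my error messages above. It only tells apart "open", "refused", "timeout" and "no route to host", which is roughly what the agent logs are reporting.

```python
import errno
import socket


def classify(exc: OSError) -> str:
    """Map a socket error to a short, human-readable label."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    if exc.errno == errno.EHOSTUNREACH:
        return "no route to host"
    if exc.errno == errno.ECONNREFUSED:
        return "refused"
    return f"error: {exc}"


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a plain TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify(exc)


if __name__ == "__main__":
    # Addresses taken from the error messages in this post.
    for host, port in [("192.168.2.17", 443), ("10.43.0.1", 443)]:
        print(f"{host}:{port} -> {check_tcp(host, port)}")
```

Running it both on the host and from inside a container should show whether the two see the network differently (assuming Python is available in both places, which I haven't verified for the Rancher images).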

Edit 2: I removed the node, killed all node-related docker images, rebooted, and tried to bring up the node again. The new one is doing the exact same thing!

Edit 3: I removed everything and tried to set up a new Rancher server. First I got this warning: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:536" took too long (201.206879ms) to execute. Now it’s failing with: unable to access 'https://git.rancher.io/charts/': Could not resolve host: git.rancher.io. I can ping git.rancher.io though…
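
Since ping resolves the name but the chart sync doesn't, one way to compare the two environments is a tiny resolver check. This is only a sketch under my own assumptions: the resolve helper is a made-up name, and it relies on Python being installed wherever you run it. It asks the system resolver for the addresses of a hostname and returns an empty list if the lookup fails.

```python
import socket
from typing import List


def resolve(host: str, port: int = 443) -> List[str]:
    """Return the sorted, de-duplicated addresses the system resolver gives for host."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Lookup failed, which corresponds to "Could not resolve host".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addresses = resolve("git.rancher.io")
    print(addresses if addresses else "could not resolve git.rancher.io")
```

If this prints addresses on the host but fails inside the container, the two are presumably using different resolver settings.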

Edit 4: It finally starts, but etcd is fucked: rejected connection from "192.168.2.17:38656" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")