Answered it myself; here’s what I did:
Copied out /opt/rke/etcd-snapshots somewhere safe.
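That backup step was just a straight copy; a minimal sketch, assuming a hypothetical destination path (anything off of /opt/rke works):

```shell
#!/bin/sh
# Hypothetical helper: copy the RKE etcd snapshot dir to a safe location.
# Usage: backup_snapshots /opt/rke/etcd-snapshots /root/etcd-snapshots-backup
backup_snapshots() {
  src=$1
  dst=$2
  mkdir -p "$dst"
  # -a preserves permissions and timestamps on the snapshot files
  cp -a "$src/." "$dst/"
}
```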
Ran my “cleanup.sh”, which blows away all the Rancher/RKE Docker containers and cleans up the leftover directories to bring the server back to a clean slate. Rancher’s access to the cluster was a hot mess of timeouts at this point.
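For reference, my cleanup.sh is essentially the standard RKE node cleanup: remove all containers, then wipe the state directories RKE leaves behind. A hedged sketch, not my exact script — the directory list follows Rancher’s node-cleanup docs (there are a few more, e.g. CNI-specific ones), and I’ve added an optional root prefix so the directory wipe can be exercised against a scratch tree instead of the real filesystem:

```shell
#!/bin/sh
# Sketch of a cleanup.sh: wipe Rancher/RKE state from a node.
# Optional $1 is a root prefix; container removal only happens on a
# real run (empty prefix), so tests against a temp dir are safe.
cleanup_node() {
  prefix=${1:-}
  if [ -z "$prefix" ] && command -v docker >/dev/null 2>&1; then
    # Remove every container on the node
    docker ps -aq | xargs -r docker rm -f
  fi
  # State directories RKE/Rancher leave behind (subset; see Rancher docs)
  for d in /etc/kubernetes /etc/cni /opt/cni /opt/rke \
           /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher; do
    rm -rf "${prefix}${d}"
  done
}
```

Note that /opt/rke holds the etcd-snapshots folder, which is exactly why the copy in the previous step has to happen first.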
Eventually the Rancher API server calmed down. The cluster was still unavailable, but I was able to go to “Edit Cluster” and find the command to create a new etcd node. I created said node and tried to restore from snapshots. Rancher apparently still had the original list of snapshots despite /opt/rke/etcd-snapshots being empty, but when restoring from the latest one it complained: “Failed to start backup-container … stat /backup/ failed”.
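For anyone following along: the command Rancher shows under “Edit Cluster” is a docker run of the rancher-agent image with a role flag, roughly of the shape below. The version, URL, token, and checksum here are placeholders — copy the actual command from your own Rancher UI, since the token is cluster-specific.

```shell
# Placeholder values throughout; the real command comes from the
# "Edit Cluster" registration screen in the Rancher UI.
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:<version> \
  --server https://<rancher-url> --token <token> --ca-checksum <checksum> \
  --etcd
```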
I then re-populated /opt/rke/etcd-snapshots with the data from my safe location. The restore took a few minutes, with the controlplane becoming unavailable in the midst of it, but the whole process completed successfully in the background. I was able to view my workloads, showing the exact same ones as before, and hit their services successfully… that was cool.
I would suggest pre-populating /opt/rke/etcd-snapshots on your server ahead of time to shortcut that process. I’m going to try this again to validate that it goes smoother with a pre-populated etcd-snapshots folder.