Rancher keeps recreating scheduler and healthcheck


#1

Hi,
I migrated the Rancher server to another machine running Ubuntu 18.04 by simply copying the MySQL database: I stopped Rancher, installed it fresh on the new server, copied the database over, and started the Rancher server there. I’m using Rancher 1.6.25.
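For context, the migration I did can be sketched roughly like this. All hostnames, credentials, and the container name `rancher-server` are placeholders, and whether you dump MySQL or copy its data directory depends on how your database is set up; the `run` wrapper only prints each command, so the sketch is safe to paste and inspect first:

```shell
#!/bin/sh
# Sketch of the migration steps; every name below is a placeholder.
OLD_DB_HOST=old-db.example.com     # hypothetical old database host
NEW_DB_HOST=new-db.example.com     # hypothetical new database host
DB_USER=cattle
DB_NAME=cattle

# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# 1. Stop the old Rancher server so the database is quiescent.
run docker stop rancher-server

# 2. Dump the cattle database and restore it on the new server.
run mysqldump -h "$OLD_DB_HOST" -u "$DB_USER" -p "$DB_NAME" \> cattle.sql
run mysql -h "$NEW_DB_HOST" -u "$DB_USER" -p "$DB_NAME" \< cattle.sql

# 3. Start Rancher 1.6 on the new server pointed at the restored database.
run docker run -d --restart=unless-stopped -p 8080:8080 \
    rancher/server:v1.6.25 \
    --db-host "$NEW_DB_HOST" --db-user "$DB_USER" --db-name "$DB_NAME"
```

After this, the Host Registration URL under Admin > Settings also has to be updated to the new server’s address, since the agents report back to it.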

Then I reran the rancher-agent (docker run etc.) on one of the hosts.

That also seemed fine at first, but Rancher kept recreating the scheduler container. I thought the initial scheduler container was ‘standing in the way’, since it was now identified as a standalone container, so I deleted it through the Rancher interface. Rancher still kept recreating the new scheduler container. I then restarted the host, and now the healthcheck container isn’t working either, in addition to the scheduler.
Any ideas how I can get past this?
This is what I keep getting in the healthcheck container logs:

time="2019-01-24T12:03:36Z" level=error msg="Failed to report status 6573f81c-ebfc-48f8-8aa8-81a73fde790c_1a9f83c1-6b68-4bc8-bcb4-41f8f255e9bc_2=DOWN: Bad response from [http://207.154.200.246:8080/v1/serviceevents], statusCode [403]. Status [403 Forbidden]. Body: [{\"id\":\"449637b7-265f-4238-a955-75e830645649\",\"type\":\"error\",\"links\":{},\"actions\":{},\"status\":403,\"code\":\"CantVerifyHealthcheck\",\"message\":\"CantVerifyHealthcheck\",\"detail\":null,\"baseType\":\"error\"}]"

And this is what I get in the scheduler container logs:

time="2019-01-24T12:05:29Z" level=info msg="Listening on /tmp/log.sock"
time="2019-01-24T12:05:29Z" level=info msg="Connecting to cattle event stream."
time="2019-01-24T12:05:29Z" level=info msg="Subscribing to metadata changes."
time="2019-01-24T12:05:29Z" level=info msg="Listening for health checks on 0.0.0.0:80/healthcheck"
time="2019-01-24T12:05:29Z" level=info msg="Initializing event router" workerCount=100
time="2019-01-24T12:05:29Z" level=info msg="Connection established"
time="2019-01-24T12:05:29Z" level=info msg="Starting websocket pings"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [instanceReservation] with total 1000000 and used 15 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [cpuReservation] with total 2000 and used 0 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [memoryReservation] with total 4135583744 and used 0 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [storageSize] with total 81032015 and used 0 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [portReservation], ip set [0.0.0.0], ports map tcp map[0.0.0.0:map[80:30eee514-b6ac-491b-be8e-5fb8dfec2754]], ports map udp map[0.0.0.0:map[500:a760f279-482d-4880-8bd3-3b1ee7800912 4500:a760f279-482d-4880-8bd3-3b1ee7800912]] for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [hostLabels] with label map [map[io.rancher.host.agent_image:rancher/agent:v1.2.11 io.rancher.host.docker_version:18.06 io.rancher.host.kvm:true io.rancher.host.linux_kernel_version:4.15 io.rancher.host.os:linux]]"

Am I missing something regarding the migration of the Rancher server? I don’t understand why it doesn’t simply work as it should. I can’t find this information anywhere in the documentation; I’d gladly read it if someone could point me to it.


#2

The solution seems to be here:

In my case there weren’t any load balancers (so I’m not sure how smoothly this would go if those had to be deleted too), but the idea is as follows, for anyone else who’s been struggling like me to find a solution:

After moving the Rancher server’s MySQL database to the new server (and setting the new URL of the Rancher server, plus whatever else needs to be done there), first stop the rancher-agent on the host. If you don’t, you cannot manually delete the other Rancher containers, as Rancher will keep respawning them indefinitely. Then you can remove the healthcheck and the metadata containers. Normally that should be enough.
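On the host, that cleanup might look like the following. The container names are only examples of what Rancher 1.6 typically generates; check `docker ps -a` on your own host for the real ones. The `run` wrapper just prints the commands, so nothing is deleted by accident:

```shell
#!/bin/sh
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# 1. Stop the agent first; otherwise Rancher respawns the
#    infrastructure containers as fast as you delete them.
run docker stop rancher-agent

# 2. Find the stale healthcheck and metadata containers by name.
run docker ps -a --filter "name=healthcheck"
run docker ps -a --filter "name=metadata"

# 3. Remove them (the exact names below are hypothetical examples).
run docker rm -f r-healthcheck-healthcheck-1-0000000   # hypothetical name
run docker rm -f r-network-services-metadata-1-0000000 # hypothetical name
```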
Then copy the command from the “Add Host” section of the Rancher server to run the new rancher-agent. That should be enough; you don’t need to delete the old volumes. The scheduler may take some time to stabilize, but it eventually works. At least it did in my case: I tested it several times on simple hosts with only a Nextcloud stack from the community catalogue, and the connection between Apache and MariaDB stayed intact.
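The re-registration command should be copied verbatim from the new server’s UI (Infrastructure > Hosts > Add Host), since the registration token at the end is unique to your installation. For reference, it has roughly this shape (server address and token below are placeholders, and the `run` wrapper only prints the command):

```shell
#!/bin/sh
# Placeholders: substitute your new server's address and the token
# shown on the "Add Host" screen; never type the token by hand.
SERVER_URL=http://rancher.example.com:8080   # hypothetical
TOKEN=REGISTRATION_TOKEN                     # copy from the UI

# Dry-run helper: print the command instead of executing it.
run() { printf '+ %s\n' "$*"; }

run docker run --rm --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher \
    rancher/agent:v1.2.11 \
    "$SERVER_URL/v1/scripts/$TOKEN"
```

Because the token embeds the server URL and credentials, re-running this against the migrated server is what reconciles the host with the new installation.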