Rancher keeps recreating scheduler and healthcheck


#1

Hi,
I migrated the Rancher server to another machine running Ubuntu 18.04 by simply copying the MySQL database: I stopped Rancher, installed it fresh on the new server, copied the database over, and started the Rancher server there. I’m using Rancher 1.6.25.
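For context, the migration I did can be sketched roughly like this. All hostnames, credentials, and the container name `rancher-server` are placeholders, and whether you dump MySQL or copy its data directory depends on how your database is set up; the `run` wrapper only prints each command, so the sketch is safe to paste and inspect first:

```shell
#!/bin/sh
# Sketch of the migration steps; every name below is a placeholder.
OLD_DB_HOST=old-db.example.com     # hypothetical old database host
NEW_DB_HOST=new-db.example.com     # hypothetical new database host
DB_USER=cattle
DB_NAME=cattle

# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# 1. Stop the old Rancher server so the database is quiescent.
run docker stop rancher-server

# 2. Dump the cattle database and restore it on the new server.
run mysqldump -h "$OLD_DB_HOST" -u "$DB_USER" -p "$DB_NAME" \> cattle.sql
run mysql -h "$NEW_DB_HOST" -u "$DB_USER" -p "$DB_NAME" \< cattle.sql

# 3. Start Rancher 1.6 on the new server pointed at the restored database.
run docker run -d --restart=unless-stopped -p 8080:8080 \
    rancher/server:v1.6.25 \
    --db-host "$NEW_DB_HOST" --db-user "$DB_USER" --db-name "$DB_NAME"
```

After this, the Host Registration URL under Admin > Settings also has to be updated to the new server’s address, since the agents report back to it.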

Then I reran the rancher-agent (docker run etc.) on one of the hosts.

That also seemed fine at first, but Rancher kept recreating the scheduler container. I thought the initial scheduler container was ‘standing in the way’, since it was now identified as a standalone container, so I deleted it through the Rancher interface. Rancher still kept recreating the new scheduler container. I then restarted the host, and now the healthcheck container isn’t working either, in addition to the scheduler.
Any ideas how I can get past this?
This is what I keep getting in the healthcheck container logs:

time="2019-01-24T12:03:36Z" level=error msg="Failed to report status 6573f81c-ebfc-48f8-8aa8-81a73fde790c_1a9f83c1-6b68-4bc8-bcb4-41f8f255e9bc_2=DOWN: Bad response from [http://207.154.200.246:8080/v1/serviceevents], statusCode [403]. Status [403 Forbidden]. Body: [{\"id\":\"449637b7-265f-4238-a955-75e830645649\",\"type\":\"error\",\"links\":{},\"actions\":{},\"status\":403,\"code\":\"CantVerifyHealthcheck\",\"message\":\"CantVerifyHealthcheck\",\"detail\":null,\"baseType\":\"error\"}]"

And this is what I get in the scheduler container logs:

time="2019-01-24T12:05:29Z" level=info msg="Listening on /tmp/log.sock"
time="2019-01-24T12:05:29Z" level=info msg="Connecting to cattle event stream."
time="2019-01-24T12:05:29Z" level=info msg="Subscribing to metadata changes."
time="2019-01-24T12:05:29Z" level=info msg="Listening for health checks on 0.0.0.0:80/healthcheck"
time="2019-01-24T12:05:29Z" level=info msg="Initializing event router" workerCount=100
time="2019-01-24T12:05:29Z" level=info msg="Connection established"
time="2019-01-24T12:05:29Z" level=info msg="Starting websocket pings"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [instanceReservation] with total 1000000 and used 15 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [cpuReservation] with total 2000 and used 0 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [memoryReservation] with total 4135583744 and used 0 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [storageSize] with total 81032015 and used 0 for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [portReservation], ip set [0.0.0.0], ports map tcp map[0.0.0.0:map[80:30eee514-b6ac-491b-be8e-5fb8dfec2754]], ports map udp map[0.0.0.0:map[500:a760f279-482d-4880-8bd3-3b1ee7800912 4500:a760f279-482d-4880-8bd3-3b1ee7800912]] for host 6573f81c-ebfc-48f8-8aa8-81a73fde790c"
time="2019-01-24T12:05:30Z" level=info msg="Adding resource pool [hostLabels] with label map [map[io.rancher.host.agent_image:rancher/agent:v1.2.11 io.rancher.host.docker_version:18.06 io.rancher.host.kvm:true io.rancher.host.linux_kernel_version:4.15 io.rancher.host.os:linux]]"

Am I missing something regarding the migration of the Rancher server? I don’t understand why it doesn’t simply work as it should. I can’t find this information anywhere in the documentation; I’d gladly read it if someone could point me to it.


#2

The solution seems to be here:

In my case there weren’t any load balancers (so I’m not sure how smoothly this would go if those had to be deleted too), but the idea is as follows, for anyone else who’s been struggling like me to find a solution:

After moving the Rancher server’s MySQL database to the new server (and setting the new URL of the Rancher server, plus whatever else needs to be done there), first stop the rancher-agent on the host. If you don’t, you cannot manually delete the other Rancher containers, as Rancher will keep respawning them indefinitely. Then you can remove the healthcheck and the metadata containers. Normally that should be enough.
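On the host, that cleanup might look like the following. The container names are only examples of what Rancher 1.6 typically generates; check `docker ps -a` on your own host for the real ones. The `run` wrapper just prints the commands, so nothing is deleted by accident:

```shell
#!/bin/sh
# Dry-run helper: print each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# 1. Stop the agent first; otherwise Rancher respawns the
#    infrastructure containers as fast as you delete them.
run docker stop rancher-agent

# 2. Find the stale healthcheck and metadata containers by name.
run docker ps -a --filter "name=healthcheck"
run docker ps -a --filter "name=metadata"

# 3. Remove them (the exact names below are hypothetical examples).
run docker rm -f r-healthcheck-healthcheck-1-0000000   # hypothetical name
run docker rm -f r-network-services-metadata-1-0000000 # hypothetical name
```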
Then copy the command from the “Add Host” section of the Rancher server to run the new rancher-agent. That should be enough; you don’t need to delete the old volumes. The scheduler may take some time to stabilize, but it eventually works. At least it did in my case: I tested it several times on simple hosts with only a Nextcloud stack from the community catalogue, and the connection between Apache and MariaDB stayed intact.
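The re-registration command should be copied verbatim from the new server’s UI (Infrastructure > Hosts > Add Host), since the registration token at the end is unique to your installation. For reference, it has roughly this shape (server address and token below are placeholders, and the `run` wrapper only prints the command):

```shell
#!/bin/sh
# Placeholders: substitute your new server's address and the token
# shown on the "Add Host" screen; never type the token by hand.
SERVER_URL=http://rancher.example.com:8080   # hypothetical
TOKEN=REGISTRATION_TOKEN                     # copy from the UI

# Dry-run helper: print the command instead of executing it.
run() { printf '+ %s\n' "$*"; }

run docker run --rm --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher \
    rancher/agent:v1.2.11 \
    "$SERVER_URL/v1/scripts/$TOKEN"
```

Because the token embeds the server URL and credentials, re-running this against the migrated server is what reconciles the host with the new installation.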