This is a bit of a corner case, but it is a bug that should be addressed anyway. Rancher-server does not synchronise the running state of a container after rancher-server itself is rebuilt/restarted.
Sit down, this will be a long one…
- We have a particular container that is a one-shot to download data from an external API. It simply runs, grabs the data and exits. It is started with `start_once = True`.
- This container is restarted every 30m by socialengine/rancher-cron (excellent tool, by the way).
- The Rancher server crashed hard due to disk space issues. The cattle DB was corrupt.
- I restored the Cattle DB from backup and restarted rancher-server. Everything came back up and the hosts all reconnected. THANKS! Nice to see things recover from hard situations like that with no issues.
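For reference, the one-shot service is configured roughly like this (a sketch only: the service name, image, and the `cron.schedule` label format used by rancher-cron are illustrative, not copied from our actual config):

```yaml
# docker-compose sketch for the one-shot downloader (names are placeholders)
api-fetcher:
  image: example/api-fetcher:latest
  labels:
    # Rancher: run to completion once, do not auto-restart on exit
    io.rancher.container.start_once: 'true'
    # rancher-cron: restart the container every 30 minutes (assumed label syntax)
    cron.schedule: '0 */30 * * * *'
```

rancher-cron decides whether it can (re)start the container based on the state that the Rancher API reports, which is exactly where the stale-state problem below bites.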
The one slight glitch is that the one-shot container was running at the time of the backup. After the restore, rancher-server still thought this container was running even though it had actually stopped, and rancher-cron could therefore not restart it, since it relies on the container state reported by the Rancher API.
Could you get rancher-server to synchronise its recorded running state with the actual container state, either at host reconnect time or periodically, so that this won't cause issues in the future?
Thanks. I'll report this on GitHub as well, since it's definitely a corner-case bug.