I have some containers that run long jobs. When I do a rolling upgrade, the containers get forcibly killed if they don't shut down within a certain time frame. I would like the containers to finish all in-progress jobs (which can sometimes take 2-4 minutes) on shutdown. This works fine locally, or if I use docker stop -t 300. Is there any way to set this timeout in Rancher so that my containers get a longer window to shut down gracefully before being killed?
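For reference, the same stop timeout can also be expressed through the Docker SDK for Python. A minimal sketch (the container name worker-1 is just a placeholder), mirroring docker stop -t 300:

```python
# Ask the daemon to wait up to 300 seconds for a clean exit
# before sending SIGKILL, same as `docker stop -t 300`.
import docker

client = docker.from_env()
container = client.containers.get("worker-1")  # hypothetical container name
container.stop(timeout=300)
```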
Same need here. I need to be sure every running query is drained before the container is killed.
Same here. We need this too.
While this could probably be implemented and might be useful, I'm not sure it actually solves the problem.
I don't know your apps/infrastructure/etc., but if you have specific needs around long-running jobs, fully automated processes may not be the right fit for your scenario; you may need some sort of job manager that knows about in-flight jobs and can automatically or manually queue/re-route them to other available containers.
I'm trying to think of other scenarios in the container/microservices realm where a configurable pre-shutdown timeout would be needed, and I really can't think of any. I think this is more of an app-design issue (immutable containers / 12-factor development) than an issue with Rancher.
OK, how about I give you some use cases.
Option 1: I have a RabbitMQ event queue, with web tiers and worker tiers. The worker tiers can run some long-running processes. Using Rancher, I can do a rolling upgrade whenever I want to deploy my code. This means I can tell one container to shut down: it finishes the jobs it's doing but doesn't accept any new ones, then it shuts down, gets replaced, and we continue one container at a time. This is a very standard pattern (a rough sketch of the drain-on-shutdown worker side follows the three options below).
Option 2: I have a service with clients connected via WebSockets, and I want to properly drain and shut down every client connection before the service gets terminated.
Option 3: My service uses a gossip algorithm for service discovery (such as Hazelcast), so when an instance shuts down it needs to update the cluster. If multiple instances shut down at the same time, this can take longer than 10 seconds, and when an instance dies before updating the cluster, things get problematic.
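To make option 1 concrete, here is a minimal sketch of the drain-on-shutdown pattern in plain Python. The fetch_job and process_job helpers are hypothetical stand-ins for the real RabbitMQ consumer; the point is only how SIGTERM is handled:

```python
# Sketch of "finish in-flight jobs on SIGTERM": docker stop / a rolling
# upgrade sends SIGTERM first, so the worker stops taking new jobs but
# lets the current one run to completion before exiting.
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Flag the main loop to stop picking up new jobs instead of
    # exiting immediately.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def fetch_job():
    # Hypothetical placeholder for "get the next job from the queue".
    return {"id": int(time.time())}

def process_job(job):
    # Hypothetical placeholder for the long-running work (2-4 minutes).
    time.sleep(5)

while not shutting_down:
    job = fetch_job()
    if job is None:
        time.sleep(1)
        continue
    # A job that started before shutdown was requested always runs to
    # completion; only the *next* iteration checks the flag.
    process_job(job)

# All in-flight work is done; exit cleanly.
print("drained, exiting cleanly")
```

This only works in practice if the orchestrator waits long enough between SIGTERM and SIGKILL, which is exactly the timeout being asked for.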
" some sort of job manager that knows of in-flight jobs and can either
automatically or manually queue/re-route those jobs to other available
containers"
Just because I have something like this in place doesn't mean I can't do automated deploys; in fact, quite the opposite. Yes, I can have such a manager in place, but it would still be nice to let jobs finish.
One of the factors of the 12-factor app is graceful shutdown, which means having time to clean up. A fixed 10-second timeout does not guarantee this. Sure, apps should handle sudden-death situations, but that should be the exception, not the rule.
This seems to be a related issue: https://github.com/rancher/rancher/issues/6214. Please upvote it if you need this. (I know we need it because our workers do 60-second long polling, so we need to wait 60 seconds before stopping.)