Spotinst integration live container migration not working

I have been attempting to replicate the behavior advertised in this blog article.

In short, the idea is being able to utilize Spot Instances for your Rancher cluster. When a Spot node in the cluster is terminated, a new Spot node is provisioned and the existing containers on the terminating instance are live migrated to the newly provisioned Spot node using CRIU functionality, before the terminating Spot node is removed from the cluster and shut down.

I’ve been trying to trace the API calls to narrow down the cause of the issue. In so doing I have found an API call to the following:

I’m assuming that this is the call that is supposed to drain the containers, but what inevitably happens is that the host is deleted. The containers then move to another host, but they are not live migrated; they start fresh on the new host.

I’m trying to understand how to tie this audit log event to the API and work out which API call is being initiated. The hope is that by understanding the call, as well as its intent, I can help troubleshoot. I have already engaged Spotinst and am waiting on a response from them as well, but I am still confused about how they expect these containers to live migrate.

Any help on translating this audit log event to a specific API call would be appreciated.
Also, if anybody else has been able to successfully implement this functionality, I would love to hear how it was done and lessons learned.

Found some helpful information today along with a solution architect from Spotinst. In short, I think the root cause may be attributed to the fact that the CRIU functionality is not yet native in the version of Docker Engine I am running. Based on the information shown on CRIU.org, the production release of Docker does not contain the CRIU functionality. You have to either compile your own version or use a pre-compiled version that contains it.

My assumptions at this point are:

  • Given that the original integration article was written well before Docker 1.12 was released, the only CRIU-enabled version of Docker that could have been in use was the 1.10 build that is explained at the CRIU website referenced above.

  • If the integration truly does exist, I may be able to get this working by providing my hosts with the appropriate 1.10 build of Docker that has CRIU compiled in.

What I don’t necessarily know is how Rancher would manage such calls as docker checkpoint as opposed to a simple removal and restart of a fresh container on the new host. Questions I still have:

  • Is the CRIU Docker only necessary on the hosts in the Rancher cluster, or does the Rancher Server host also need this version?

  • Is there a specific version of Rancher or the Rancher API that is necessary to accommodate this?

Making progress. I have built an AMI in AWS based on Ubuntu with CRIU installed and the compiled Docker binary version 1.10.0-dev as outlined in the article here.

Can anybody at Rancher tell me whether this setup will enable Spotinst to live migrate the containers from Host A to Host B? I’m going to run some tests, but would like some input from the Rancher team if possible.

So my testing with the Ubuntu image with the following components failed.

  • Docker 1.10.0-dev
  • Experimental criu build / Also tried with the latest criu from Ubuntu repo
  • Dependencies to support criu

Some components above provided by boucher’s experimental build found here.

What failed?

Docker installed correctly, as well as criu and the dependencies. I was able to run Docker, but upon attempting a checkpoint I received a criu error about not being able to infect with parasite, blah blah blah.
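One way to narrow down an error like this is to ask CRIU itself whether the kernel supports what it needs, before involving Docker at all. The snippet below is only a minimal sketch: it assumes the `criu` binary is on the PATH and wraps the `criu check` subcommand. The function name `criu_check` is mine, not something from the experimental build.

```python
import shutil
import subprocess


def criu_check(criu_binary="criu"):
    """Run `criu check` and return a (passed, output) tuple.

    Returns (False, reason) without running anything when the
    binary cannot be found on PATH.
    """
    path = shutil.which(criu_binary)
    if path is None:
        return False, f"{criu_binary} not found on PATH"
    # `criu check` tests whether the running kernel provides the
    # features CRIU needs; it usually has to be run as root.
    result = subprocess.run([path, "check"], capture_output=True, text=True)
    return result.returncode == 0, (result.stdout + result.stderr).strip()
```

Calling `criu_check()` on the host gives a quick yes/no plus CRIU's own message. A failed check would point at the kernel or CRIU build rather than at Docker, though a passing check does not guarantee that a checkpoint of a real container will succeed.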

Next steps

Focus efforts on contributing to the community and helping to enable this functionality on the latest releases as opposed to trying to make this work on a DEV release of Docker 1.10. Boucher has made some progress in implementing criu into the 1.13 experimental release of Docker. Our intent is to assist in any way possible to…

  • Enable the CRIU feature of Docker for production use
  • Enable live migration calls in Rancher using CRIU enabled Docker

Why?

From our point of view, this is the single most important gap that needs filling when it comes to container management and orchestration. This functionality is most definitely possible, and I often wonder why there aren’t more people focusing on making this happen. So we are going to.

Here is an example of a way to use CRIU by building the containers with CRIU in them. You could then use a Rancher NFS volume for the checkpoint data.
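To make the idea a bit more concrete, here is a rough sketch of what the in-container approach boils down to: CRIU dumps the application's process tree into an images directory, and that directory lives on the shared volume so another host can restore from it. The helper names and the `/checkpoints/myapp` path are hypothetical; the flags (`--tree`, `--images-dir`, `--leave-running`, `--shell-job`, `--restore-detached`) are standard CRIU options, but whether they are sufficient for a given application is an assumption.

```python
def criu_dump_cmd(pid, images_dir, leave_running=False, shell_job=False):
    """Build the argv for dumping the process tree rooted at `pid`."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the process alive after the dump instead of killing it.
        cmd.append("--leave-running")
    if shell_job:
        # Needed when the process is attached to a terminal.
        cmd.append("--shell-job")
    return cmd


def criu_restore_cmd(images_dir, shell_job=False):
    """Build the argv for restoring from a previously written dump."""
    cmd = ["criu", "restore", "--images-dir", images_dir, "--restore-detached"]
    if shell_job:
        cmd.append("--shell-job")
    return cmd


# Hypothetical mount point of the shared (e.g. NFS) volume.
CHECKPOINT_DIR = "/checkpoints/myapp"
```

The old container would run something like `subprocess.run(criu_dump_cmd(app_pid, CHECKPOINT_DIR), check=True)` before shutting down, and the replacement container would run the restore command against the same directory on startup.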

And there is a little bit of information on the new docker checkpoint command here:

This still isn’t integrated with Rancher though. :slightly_frowning_face: Still, while Rancher integration would be awesome, I don’t think it is all that bad to put CRIU in the container if necessary.
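For anyone who wants to try the `docker checkpoint` command by hand, here is a minimal sketch of the two steps involved, written as small Python helpers that only build the command lines. It assumes a Docker daemon with experimental features enabled and CRIU installed on the host; the helper names and the example container, checkpoint, and directory names are hypothetical.

```python
def checkpoint_create_cmd(container, checkpoint, checkpoint_dir=None,
                          leave_running=False):
    """Build `docker checkpoint create [OPTIONS] CONTAINER CHECKPOINT`."""
    cmd = ["docker", "checkpoint", "create"]
    if checkpoint_dir:
        # Store the checkpoint outside Docker's default location,
        # for example on a volume shared between hosts.
        cmd += ["--checkpoint-dir", checkpoint_dir]
    if leave_running:
        cmd.append("--leave-running")
    cmd += [container, checkpoint]
    return cmd


def start_from_checkpoint_cmd(container, checkpoint, checkpoint_dir=None):
    """Build `docker start --checkpoint CHECKPOINT CONTAINER`."""
    cmd = ["docker", "start", "--checkpoint", checkpoint]
    if checkpoint_dir:
        cmd += ["--checkpoint-dir", checkpoint_dir]
    cmd.append(container)
    return cmd
```

On the old host you would run the first command, e.g. `checkpoint_create_cmd("web", "cp1", "/mnt/checkpoints")`, and on the new host you would create a container from the same image and then run the second command against it. Rancher does none of this today, so something outside Rancher would have to drive both steps.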

After thinking about it, I’m actually starting to think that it makes more sense to do CRIU inside the container as opposed to through a Rancher integration. So far the only infrastructure limitation to doing CRIU inside the container is that you can’t do it on top of an AUFS Docker storage driver. On the other hand, if you wanted to do it through a Rancher integration it would require a lot more work and more system requirements (CRIU installed in addition to Docker on every host, a central CRIU image store of some sort, etc.).

In the end I think it actually makes more sense to do it inside the container. :smiley:

Maybe it could get into Rancher someday, but, for now, IMHO I don’t think it is actually necessary nor worth it, Dad (@opax). :wink: