Service/Stack host-to-host migration

TL;DR question: How do you migrate stateful services between hosts, and what do you do about their volumes?

We are moving from Proxmox KVM-based service runtimes to Rancher/Docker-based stacks, and I am currently looking for a solution/strategy to migrate a service (stack) from one host to another.

I have been thinking about this for months already. I have been running Rancher/RancherOS for more than a year now for different purposes, but now I am moving our core infrastructure, and that infrastructure has frequent service movement (maintenance, server capacity/load).

Obviously the core issue when migrating stacks/services is the volumes. I am not really aiming for hot migration yet.

As far as I understand, there are two main strategies (correct me if I am wrong):

  1. Completely shared storage: cluster filesystems (GlusterFS and the like) sync a whole block device/partition across all hosts, replicating everything all the time and making all volumes accessible on all hosts at any given time.
    Pros: makes horizontal scaling of stateful apps even easier, since the whole storage is shared across hosts anyway
    Cons: data duplication (disk space) and probably some general network load

  2. On-demand volume migration: if I tell Rancher to move a service from host A to B, the volumes get copied “automatically” on demand
    Pros: no duplicates, network traffic only on demand
    Cons: no horizontal scaling of stateful apps out of the box
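To make the space/network trade-off a bit more concrete, here is a rough back-of-the-envelope model in Python. It is a simplification under my own assumptions: strategy 1 keeps a full copy of every volume on `replicas` hosts, strategy 2 keeps exactly one copy and transfers a volume once per host move, and the ongoing replication of writes in strategy 1 is not modeled at all. `Volume` and both functions are illustrative names, not anything Rancher provides.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    size_gb: float


def shared_storage_cost(volumes, replicas):
    """Strategy 1: every volume is kept on `replicas` hosts at all times.
    Returns (total disk GB used, GB transferred for the initial sync)."""
    total = sum(v.size_gb for v in volumes)
    disk = total * replicas
    network = total * (replicas - 1)  # every replica except the first is copied once
    return disk, network


def on_demand_cost(volumes, moves):
    """Strategy 2: each volume lives on one host and is copied only when its service moves.
    `moves` maps a volume name to the number of host migrations."""
    disk = sum(v.size_gb for v in volumes)
    network = sum(v.size_gb * moves.get(v.name, 0) for v in volumes)
    return disk, network


vols = [Volume("db", 200.0), Volume("uploads", 50.0)]
print(shared_storage_cost(vols, replicas=3))  # (750.0, 500.0)
print(on_demand_cost(vols, {"db": 2}))        # (250.0, 400.0)
```

So with three replicas the shared approach pays three times the disk up front, while the on-demand approach only pays network traffic when something actually moves.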


Solutions:

For strategy 1:

  • Convoy + NFS
  • Portworx (block-device-based syncing)
  • Longhorn (soon)

For strategy 2:
AFAICS nothing is close to being automated, i.e. something where Rancher takes care of the volumes as a side task when it migrates a service to another host. So the remaining option is doing it manually, as "elegantly as possible":
a) Docker checkpoints, probably using CRIU, somewhat like this article describes
b) Docker-based volume backups

a) is not really ready yet, and b) is horrible manual crafting; you cannot do it at scale at all. It works (I have done it already), but the process is far too time-consuming and partially error-prone.
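For b), this is roughly the manual procedure I mean, sketched in Python so it could at least be scripted. It assumes Docker 18.09+ (for `docker -H ssh://...`), SSH access to both hosts, that the service is already stopped on host A, and a small `alpine` helper image for `tar`; the function names are my own and hypothetical. It streams the named volume as a tar archive from host A straight into a volume of the same name on host B, without an intermediate file.

```python
import subprocess

# Assumption: any small image that ships `tar`; alpine's busybox tar is enough.
HELPER_IMAGE = "alpine"


def export_cmd(host, volume):
    """argv that writes the contents of a named volume on `host` to stdout as tar."""
    return ["docker", "-H", f"ssh://{host}", "run", "--rm",
            "-v", f"{volume}:/data:ro", HELPER_IMAGE,
            "tar", "cf", "-", "-C", "/data", "."]


def import_cmd(host, volume):
    """argv that unpacks a tar stream from stdin into a named volume on `host`
    (Docker creates the volume with the local driver if it does not exist)."""
    return ["docker", "-H", f"ssh://{host}", "run", "--rm", "-i",
            "-v", f"{volume}:/data", HELPER_IMAGE,
            "tar", "xf", "-", "-C", "/data"]


def copy_volume(volume, src_host, dst_host):
    """Stream `volume` from src_host to dst_host. Stop the service on src_host first."""
    exporter = subprocess.Popen(export_cmd(src_host, volume), stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd(dst_host, volume), stdin=exporter.stdout)
    exporter.stdout.close()  # so the exporter sees a broken pipe if the importer dies
    import_rc = importer.wait()
    export_rc = exporter.wait()
    if export_rc != 0 or import_rc != 0:
        raise RuntimeError(
            f"copy of {volume!r} failed (export={export_rc}, import={import_rc})")
```

Usage would be something like `copy_volume("myapp_db", "root@host-a", "root@host-b")`, then starting the service on host B. How ownership and permissions survive depends on the `tar` inside the helper image, so I would check that on a test volume first. And it still has to be repeated per volume and per service, which is exactly the scaling problem.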

So finally, my question is:

What are you guys using, suggesting or looking for? What is the general view of the Rancher developers/team on this topic? Maybe I missed it, but I have not seen much focused discussion about it yet.

i) Is the only option to run cluster-filesystem-like storage, no matter how big the data becomes (which is a pain point here)?

ii) Do I have the wrong point of view on infrastructure right now, i.e. am I trying to force the old way of running and placing services onto a more cloud-based, agile one? To explain: I am used to running a stack (KVM-based Linux host plus app + its DB + whatever) on one host (KVM), and if it needs more performance or the host is going to be maintained, I move the whole KVM somewhere else. With Rancher/Docker-based stacks, one could argue that the database and the app can be separated, run on different hosts, and have scale > 1 at any given time. Besides the extra resources consumed by running 2+ instances and the extra complexity of HA for databases and apps, that might work, but it is probably not a solution for every single stack I have in mind; for some it is just way over the top.

iii) Are there any solutions available or planned for strategy 2 yet?

iv) How do you guys handle storage with Portworx? Do you just add a second drive to RancherOS, add it to the service as the “synced one”, and that's it? So every volume created with that storage driver is shared? Is there a way to migrate from one storage driver to another? If yes, this could be used as a transition/transfer layer for on-demand volume migration: local storage HOST A -> Portworx device HOST A -> Portworx device HOST B -> local storage HOST B, then delete on Portworx (i.e. a move).
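The transit idea from iv) could look like this as a list of Docker commands, again only a sketch. `shared_driver` stands for whatever volume driver name Portworx (or another replicated driver) registers, which I am not asserting here; the sketch assumes that driver is installed on both hosts and exposes the transit volume under the same name on host B. The `-transit` suffix, the helper functions and the `alpine` image are my own hypothetical choices, and nothing is executed: the function only builds the argv lists.

```python
# Sketch only: builds docker CLI argv lists for the "transit volume" move
# local HOST A -> shared driver -> local HOST B; nothing is executed here.
HELPER_IMAGE = "alpine"  # assumption: any image whose `cp` supports -a


def docker(host, *args):
    """Prefix a docker CLI call so it targets `host` over SSH (Docker 18.09+)."""
    return ["docker", "-H", f"ssh://{host}", *args]


def copy_between_volumes(host, src_vol, dst_vol):
    """Copy the contents of one named volume into another on the same host."""
    return docker(host, "run", "--rm",
                  "-v", f"{src_vol}:/from:ro", "-v", f"{dst_vol}:/to",
                  HELPER_IMAGE, "cp", "-a", "/from/.", "/to/")


def transit_plan(volume, src_host, dst_host, shared_driver):
    """Return the ordered commands for moving `volume` via a shared transit volume."""
    transit = f"{volume}-transit"  # hypothetical naming scheme
    return [
        # 1. create the transit volume on the shared (replicated) driver
        docker(src_host, "volume", "create", "--driver", shared_driver, transit),
        # 2. local storage HOST A -> shared volume (seen from HOST A)
        copy_between_volumes(src_host, volume, transit),
        # 3. HOST A -> HOST B is the storage layer's job; just create the
        #    target volume explicitly on the local driver of HOST B
        docker(dst_host, "volume", "create", "--driver", "local", volume),
        # 4. shared volume (seen from HOST B) -> local storage HOST B
        copy_between_volumes(dst_host, transit, volume),
        # 5. drop the transit copy so the whole thing is a move, not a copy
        docker(dst_host, "volume", "rm", transit),
    ]
```

The appeal is that only steps 2 and 4 are local disk copies; the cross-host transfer is left to the replicated storage, which is exactly what it is built for.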

Sorry for this really long post, but I thought I would try to start a more general, wide discussion about storage, since I think this question could interest a broader audience.

Any chance of getting some insight from the Rancher team, sharing their plans on this?