Disaster recovery?

I continue to follow Rancher, waiting for it to reach the tipping point where I can put it in production. However, there are still a few gaps that the entire container ecosystem suffers from; in the end it comes down to disaster recovery, or being able to reproduce the configuration.

A basic feature of orchestration systems is the ability to react to a failed node, restart its containers on the remaining nodes, and reconnect the services through service discovery. But what happens when (a) the entire framework fails, or (b) one needs to test a feature or service outside the production environment? (Many times a customer needs to perform an offline audit or validation.)

Much of the CI/CD universe believes that CI/CD applies to the environment too. While Rancher lets you add a web server from the command line and the UI, how do I recreate the environment? This gets a bit wonky when I edit Chef or Puppet files, because I do not want to redeploy the entire ecosystem just for one new service.

This is just one thread of a much deeper topic.

@rbucker, this is a really deep topic!! But certainly one that needs to be discussed.

Disaster recovery still comes down to the way an environment is deployed, backed up and managed. Managing failure boundaries and being able to isolate failures is something all of these systems need to be able to address. It looks like the container orchestration systems are focusing on managing multiple control planes that share nothing, similar to the way IaaS systems are deployed today. Rancher would be deployed per geographic region, and it sounds like Kubernetes clusters will be deployed the same way and managed via something like Ubernetes. This is still evolving in both camps. Is there something that you’re looking to see in a product, or something missing from the current landscape?

Environment is a bit of a tricky term; from the context of your question I’ll assume you mean a user application running on top of Rancher. This could be a lot broader… In the context of a user application, CI/CD says you should be pinning your dependencies, and with containers you should do that as well. It is a good practice to version your containers. Always deploying latest is challenging, though they have even added a SHA to the latest tag. If you treat the container as a static thing that is versionable, you should be able to pull it from a registry when you need it. The configuration data that is environment specific should also be version controlled, to the extent that it’s not dynamic. (If you rely on service discovery to auto-update a load balancer and that varies while running, you can’t really lock that.)

Rancher allows you to describe an application stack in compose syntax, which lets you specify containers, configuration and data (through env vars). These files can be versioned and stored somewhere persistent. Rancher also has the rancher-compose.yml file, which allows additional data to be configured in Rancher, like load balancers, metadata (coming soon!!) and initial scaling size. With these files you can describe an application and reproduce it at different scales in different environments. We are building out a compose-template library that currently has an Elasticsearch, Logstash, and Kibana stack along with Zookeeper, with more to come. With these, users get a standardized way of describing an application and running it in their own environment.
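To make the “pin your containers” point concrete, here is a minimal sketch of a check you could run against a versioned stack description before deploying. It is not part of Rancher or rancher-compose: the function names and the example stack are hypothetical, and it assumes the compose file has already been parsed into a dict with service names at the top level.

```python
def is_pinned(image: str) -> bool:
    """Return True if the image reference names a fixed version."""
    if "@" in image:
        # Digest form (name@sha256:...) identifies one exact image.
        return True
    # Only look for a tag after the last "/", so a registry port
    # (registry.example:5000/app) is not mistaken for a tag.
    last_part = image.rsplit("/", 1)[-1]
    if ":" not in last_part:
        return False  # no tag means Docker falls back to "latest"
    tag = last_part.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def unpinned_services(stack: dict) -> list:
    """List services whose image has no tag or uses the "latest" tag.

    Services without an "image" key (for example, ones built locally)
    are skipped.
    """
    return sorted(
        name
        for name, service in stack.items()
        if "image" in service and not is_pinned(service["image"])
    )


# Hypothetical stack, shaped like a parsed compose file.
example_stack = {
    "elasticsearch": {"image": "elasticsearch:1.7.1"},
    "kibana": {"image": "kibana"},
    "lb": {"image": "registry.example:5000/lb:latest"},
    "zookeeper": {"image": "zookeeper@sha256:0123abcd"},
}

if __name__ == "__main__":
    print(unpinned_services(example_stack))
```

A check like this could run in CI against the compose files in version control, so an unpinned image is caught before it reaches an environment you later need to reproduce.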

Certainly a deep topic, and I’m curious to hear more about your thoughts and concerns in this space.

I think I’m masking a much bigger question.

  1. In the world of audit and compliance there are no “shared” credentials, so as admins join the company they need to be added to the authentication system. In the case of CoreOS, one needs to bootstrap userspace with SSH keys or user records with pre-computed hash values. As admins join and leave the organization they need to be added to or removed from the system. In the meantime CoreOS, and likely RancherOS too, needs those credentials so the admins can bootstrap the whole system.

  2. Backing up and restoring the SQL database may or may not be useful (I’m not sure what’s in the DB). But Rancher assigns containers to nodes manually; it’s not elastic or self-correcting. As I mentioned to David this afternoon: if I had 1000 machines in production and there was a failure, I might only have 50 started now, but once all of them are ready, how would I rebalance the containers?

**David mentioned Kubernetes and that it had a role in Rancher. This had me thinking about the many different layers between systems: hardware->OS->Rancher (server, worker)->Kubernetes->containers->apps. Each of these layers might be tightly coupled with the layer above and below. While I can deploy containers manually in Rancher, Kubernetes could/would do it for me. This might be an important fact, since Rancher does not currently support self-correction. So now the story might be about doing enough that the layers self-correct sufficiently.

**Right now, if my cluster breaks then I have to restore it manually. There is no master script like the Rancher installer that I can depend on. Also (not discussed), Hightower did a lights-out presentation. That is key.

Thank you for the reply. And thank you for David’s time.

Yeah, we are really at the tip of the iceberg =D

RancherOS and CoreOS are driven by the cloud configurations passed in at runtime. CoreOS currently does support multiple users, added via cloud-config or manually on the host. RancherOS still needs a multi-user feature, as everything today is shared under the single rancher user. These types of OSes simplify management in that there isn’t much to do on the host, and they can be provisioned quickly at boot. Since things have been moving towards the cattle-not-pets trend, having systems that can be provisioned easily and are ephemeral in nature makes things easier. Outside of user management, the environments need to provide audit logs for provisioning and for access to a host. These services can be provisioned via the cloud-config files that these OSes consume at initialization.
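As a rough illustration of keeping per-admin credentials out of “shared” territory, the sketch below renders a cloud-config document from a roster that could live in version control, so adding or removing an admin is a reviewed change. The roster, the function name and the key strings are hypothetical placeholders, and the `users` / `ssh-authorized-keys` field names follow CoreOS cloud-config as I understand it; check them against the OS version you actually run.

```python
def render_cloud_config(admins: dict) -> str:
    """Render a cloud-config document with one user entry per admin.

    `admins` maps a login name to a list of public SSH keys.
    Field names are an assumption based on CoreOS cloud-config.
    """
    lines = ["#cloud-config", "users:"]
    for name in sorted(admins):  # sorted so the output is stable in diffs
        lines.append(f"  - name: {name}")
        lines.append("    ssh-authorized-keys:")
        for key in admins[name]:
            lines.append(f'      - "{key}"')
    return "\n".join(lines) + "\n"


# Hypothetical roster; the key strings are placeholders, not real keys.
admins = {
    "alice": ["ssh-ed25519 AAAA...alice alice@example.com"],
    "bob": ["ssh-ed25519 AAAA...bob bob@example.com"],
}

if __name__ == "__main__":
    print(render_cloud_config(admins))
```

When someone leaves, dropping their entry from the roster and re-rendering means newly provisioned hosts never receive their key; hosts that are already running would still need to be reprovisioned or cleaned up separately.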

  1. Rancher’s database stores the state, so everything lives in there. In an HA setup you do not need to persist data in Zookeeper or Redis, but the DB needs to be managed and cared for. I’m not sure what you mean by self-correction; Rancher has a concept of stacks and services. Stacks are a collection of services that make up an application. For instance, Elasticsearch might be a stack with elasticsearch-masters, elasticsearch-data and elasticsearch-client services. Each service can be set with a scale and placement rules. If a host or container drops out and Rancher can place the containers elsewhere to hit the scale number, it will. It also supports global services, where it will put a container on every host that the scheduling rules allow. That said, we really see Kubernetes as another application/workload to deploy on top of Rancher. If you want to run Kubernetes and Rancher, you could.
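The “hit the scale number” behaviour can be pictured with a small reconciliation sketch. This is not Rancher’s scheduler; it is only an illustration of the idea under simple assumptions: no placement rules, new containers go to the least-loaded healthy host, and all names here are hypothetical.

```python
def reconcile(desired_scale: int, placements: dict, healthy_hosts: set) -> dict:
    """Return a new container -> host mapping that tries to meet desired_scale.

    Containers on hosts that are no longer healthy are dropped, then
    replacements are placed on the least-loaded healthy host until the
    scale is met. With no healthy hosts nothing can be placed. This
    sketch only scales up; it never removes surplus containers.
    """
    # Keep only containers whose host is still healthy.
    result = {c: h for c, h in placements.items() if h in healthy_hosts}
    if not healthy_hosts:
        return result

    # Count how many containers each healthy host currently runs.
    load = {h: 0 for h in healthy_hosts}
    for host in result.values():
        load[host] += 1

    counter = 1
    while len(result) < desired_scale:
        name = f"replacement-{counter}"
        counter += 1
        if name in result:
            continue  # name already taken, try the next one
        # Least-loaded host; ties are broken by host name for determinism.
        target = min(load, key=lambda h: (load[h], h))
        result[name] = target
        load[target] += 1
    return result


if __name__ == "__main__":
    before = {"es-1": "host-a", "es-2": "host-b", "es-3": "host-c"}
    # host-c has dropped out; the service should still run three containers.
    print(reconcile(3, before, {"host-a", "host-b"}))
```

Running the same function again after more hosts come back does not move existing containers, which is the rebalancing gap raised earlier in the thread.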

What are you envisioning as far as building a cluster back up? Something where it is brought back up in a cloud through a CloudFormation/Terraform/Heat-like solution? Something more like Ansible/Chef/Puppet scripts? Or some combination of the two? Also, how much control would you want to have over it? ECS and GKE offer the orchestration platforms without the user having to manage the underlying infrastructure. There is no Chef to install the platform, but the tradeoff is that you lose the ability to customize at some levels and are locked into the provider. The applications on the cluster, in either Rancher or Kubernetes, could be described in the native way and reproduced. Rancher uses docker-compose.yml syntax plus rancher-compose additions for scale, load balancing, etc. It supports deploying to multiple hosts, and it’s file-based, making it the new manifest/cookbook/playbook that can be version controlled. Container versions, scale and configurations could be pulled from it. Kubernetes has its own configuration files, which accomplish the same thing.

I’m not familiar with that talk from Kelsey Hightower, do you have a link?

If you want to bounce some ideas around feel free to reach out. We are also in the #rancher channel on IRC.

I’ll admit I’m walking, chewing gum, while eating a PB&J at the same time. So I have not considered the best thru-line. But maybe this will help as I reconstruct it for myself.

How to idiomatically bootstrap an environment with the least amount of risk? Something an enterprise would use to keep the lights on.

Careful :)

I think that gets to the core of it, need to think about it some.