Thoughts / Questions after bringing Rancher to Production on AWS

After getting our setup into production, I had some thoughts / questions on Rancher, and instead of flooding GitHub I figured here might be the best place to post them for feedback. If any of these should be filed as an issue, let me know and I’ll post it in GH.

So far we’re pretty happy with everything. Thanks for all of your hard work!

Automatic Host Deletion / Automatic Rescheduling

We run on AWS and each application gets its own ASG per environment. Everything works fine when new instances come online; however, when an instance terminates, Rancher doesn’t reschedule the containers that were on the terminated host.

I think a couple of enhancements here would help:

  • Have an option for automatic host deletion w/ a Grace Period. For example we could configure Rancher to automatically remove a host if it is in Reconnecting for 60s.

  • Have an option for automatic container rescheduling if a container’s host is in Reconnecting for a period of time.
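In the meantime, a small external “reaper” could serve as a stop-gap. Below is a rough sketch of the idea: poll the Rancher API and remove any host that has sat in Reconnecting past a grace period. The project URL, keys and the deactivate-then-delete sequence are assumptions on my part, so verify them against your own Rancher version before running anything like this.

```python
#!/usr/bin/env python
"""Rough sketch of a grace-period host reaper (not a built-in Rancher feature)."""
import time

import requests

RANCHER_URL = "https://rancher.example.com/v2-beta/projects/1a5"  # placeholder project endpoint
AUTH = ("ACCESS_KEY", "SECRET_KEY")  # environment API keypair
GRACE_SECONDS = 60
POLL_SECONDS = 15

first_seen = {}  # host id -> when we first saw it in 'reconnecting'

while True:
    hosts = requests.get(RANCHER_URL + "/hosts", auth=AUTH).json()["data"]
    now = time.time()
    for host in hosts:
        if host["state"] != "reconnecting":
            first_seen.pop(host["id"], None)
            continue
        since = first_seen.setdefault(host["id"], now)
        if now - since < GRACE_SECONDS:
            continue
        # Past the grace period: deactivate, then delete. Deactivation is
        # asynchronous and the exact action sequence can differ between Rancher
        # versions, so double-check this against your own API.
        requests.post(RANCHER_URL + "/hosts/" + host["id"] + "?action=deactivate", auth=AUTH)
        requests.delete(RANCHER_URL + "/hosts/" + host["id"], auth=AUTH)
        first_seen.pop(host["id"], None)
    time.sleep(POLL_SECONDS)
```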

Allow Services in a Stack to be transient

In our use case we scheduled a service to be global and, using labels, assigned it to an ASG that scales up to 3 instances on a schedule (and then back down to zero). This worked, but the end result was:

  • “Reconnecting” hosts sitting in the hosts screen
  • An ever increasing number of containers for the service
  • The stack always showing degraded.

The first two items are due to the host problem I mentioned earlier, but it would be nice to allow a service to be flagged as transient so the stack doesn’t show as degraded when the service isn’t running.

Multiple environments for Rancher Compose

It would be nice if Rancher Compose supported multiple “profiles”, similar to kubectl contexts. We have 3 environments and switching between them is frustrating and error-prone.
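As a stop-gap, a tiny wrapper could do the switching. The sketch below keeps one API endpoint and keypair per environment in a local file (the file name and layout here are made up) and execs rancher-compose with the RANCHER_URL / RANCHER_ACCESS_KEY / RANCHER_SECRET_KEY variables that rancher-compose reads:

```python
#!/usr/bin/env python
"""Tiny profile switcher for rancher-compose (a sketch, not a real feature)."""
import json
import os
import sys

PROFILES_FILE = os.path.expanduser("~/.rancher-profiles.json")  # hypothetical location


def main():
    if len(sys.argv) < 2:
        sys.exit("usage: rc.py <profile> [rancher-compose args...]")
    profile_name, compose_args = sys.argv[1], sys.argv[2:]

    # The profiles file maps a name to a Rancher URL and keypair, e.g.
    # {"prod": {"url": "https://rancher.example.com",
    #           "access_key": "...", "secret_key": "..."}}
    with open(PROFILES_FILE) as f:
        profile = json.load(f)[profile_name]

    env = dict(os.environ,
               RANCHER_URL=profile["url"],
               RANCHER_ACCESS_KEY=profile["access_key"],
               RANCHER_SECRET_KEY=profile["secret_key"])

    # Replace this process with rancher-compose so exit codes pass straight through.
    os.execvpe("rancher-compose", ["rancher-compose"] + compose_args, env)


if __name__ == "__main__":
    main()
```

Then something like ./rc.py prod up -d targets production and ./rc.py staging up -d targets staging, without juggling flags each time.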

Load Balancer w/ Let’s Encrypt SSL built in

I saw the catalog entries for this, but it would be great if the Load Balancer service just did this automatically and globally for an environment. The catalog service also appears to use a library whose documentation says not to use it in production.

An ideal component would allow a user to simply select an available service in the environment and enter a domain name.

Notifications

Are there any plans to have Rancher push webhooks or emails when environment events occur?

Future of Cattle?

Personally I really like Cattle over Kubernetes (I have not used Swarm). I think aligning deployment with docker-compose is ideal for the development workflow. Is the plan to keep Cattle a priority, or will the focus switch to supporting the other schedulers?

Enhancements for the Hosts page

It would be nice if the hosts page adopted the same listing view as the stacks page. Additionally, if it allowed a user to sort / filter by host labels, that would be great too.


+1 to all of these; they mirror many of my own thoughts.

I think Rancher are planning some improvements around automatic redistribution of containers when hosts are replaced. What we do about it at present is a form of rolling replacement: first add one or more new hosts (via an ASG if you want) and auto-register them with Rancher, then de-register and remove the existing hosts. This causes Rancher to reschedule containers onto the new hosts (although that does depend somewhat on the scheduling strategy).

The biggest issue we have today is that Rancher often leaves hosts in a ‘Reconnecting’ state and stacks continuously degraded (as you mentioned). I have raised questions here and directly with Rancher about this but haven’t really got to the bottom of it yet, and it does somewhat erode confidence in the platform despite all of the things it does do very well indeed.
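In pseudo-script form, the replacement dance above is roughly the following. This is just a sketch rather than our actual tooling; the ASG name, Rancher URL, keys and the deactivate-then-delete host actions are placeholders, and the exact actions can differ between Rancher versions:

```python
#!/usr/bin/env python
"""Sketch of rolling host replacement: scale out the ASG, wait, retire old hosts."""
import time

import boto3
import requests

ASG_NAME = "my-app-asg"  # placeholder ASG name
RANCHER_URL = "https://rancher.example.com/v2-beta/projects/1a5"  # placeholder project endpoint
AUTH = ("ACCESS_KEY", "SECRET_KEY")  # environment API keypair


def rancher_hosts():
    return requests.get(RANCHER_URL + "/hosts", auth=AUTH).json()["data"]


old_host_ids = {h["id"] for h in rancher_hosts()}

# 1. Scale the ASG out so replacement hosts come up and auto-register with Rancher
#    (assumes the group's MaxSize allows doubling).
asg = boto3.client("autoscaling")
group = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
desired = group["DesiredCapacity"]
asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired * 2)

# 2. Wait until the same number of *new* hosts show up as active in Rancher.
while True:
    new_active = [h for h in rancher_hosts()
                  if h["id"] not in old_host_ids and h["state"] == "active"]
    if len(new_active) >= desired:
        break
    time.sleep(30)

# 3. Retire the old hosts so Rancher reschedules their containers elsewhere
#    (how well that works depends on each service's scheduling rules).
for host_id in old_host_ids:
    requests.post(RANCHER_URL + "/hosts/" + host_id + "?action=deactivate", auth=AUTH)
    requests.delete(RANCHER_URL + "/hosts/" + host_id, auth=AUTH)

# Afterwards, scale the ASG back down (or terminate the old instances directly);
# that part is left out here for brevity.
```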

We also currently prefer ‘cattle’ over the other schedulers, but this might very well change when Docker 1.12.x is supported. That also leads to some thoughts about whether docker-compose has a future.

Regards

Fraser.

I’ve also just rolled out Rancher in AWS, and echo your sentiments on many of these points.

Regarding your first point, I’m not sure that automatic host deletion is the right thing for Rancher to be doing. As far as Rancher is concerned, all it knows is that it has lost comms with the agent. Not being AWS-aware, it can’t determine whether this is because the host is gone or for some other reason (say, a network partition).

For example, in our deployment we have a Rancher environment in a different AWS region connecting in over an IPsec tunnel. On occasion this tunnel goes down and the hosts in Rancher go into “reconnecting” mode. The hosts and the containers on them are fine, and the apps keep running correctly - it’s just that Rancher can’t talk to them. In this scenario Rancher does exactly the right thing by simply leaving the hosts in the “reconnecting” state; once the tunnel is back up, the hosts return to active and everything is fine. I would not want hosts to be automatically deleted in this scenario.

That said, I feel your pain when hosts really do terminate, leaving a UI strewn with zombie hosts and any containers without health checks stuck on hosts that no longer exist. So I just recently wrote a small container to automatically reap any Rancher hosts that have been terminated in AWS: https://github.com/ampedandwired/rancher-reaper

Feel free to have a play with it. It’s pretty new so there may be some bugs, but I would appreciate any feedback.
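For anyone curious, the core idea is roughly the sketch below (not the actual rancher-reaper code; the endpoints, field names and the hostname-to-instance mapping are simplified assumptions). Only hosts whose backing EC2 instance is genuinely gone get removed, so a partitioned-but-alive host is left alone:

```python
#!/usr/bin/env python
"""Minimal sketch of AWS-aware reaping of dead Rancher hosts."""
import boto3
import requests

RANCHER_URL = "https://rancher.example.com/v2-beta/projects/1a5"  # placeholder project endpoint
AUTH = ("ACCESS_KEY", "SECRET_KEY")  # environment API keypair
ec2 = boto3.client("ec2")


def instance_is_gone(private_dns_name):
    """True if no EC2 instance still matches the host. Terminated instances
    usually lose their private DNS name, so no match also counts as gone."""
    resp = ec2.describe_instances(
        Filters=[{"Name": "private-dns-name", "Values": [private_dns_name]}])
    instances = [i for r in resp["Reservations"] for i in r["Instances"]]
    if not instances:
        return True
    return all(i["State"]["Name"] in ("terminated", "shutting-down") for i in instances)


hosts = requests.get(RANCHER_URL + "/hosts", auth=AUTH).json()["data"]
for host in hosts:
    if host["state"] != "reconnecting":
        continue
    # Assumes the Rancher hostname is the EC2 private DNS name; adjust the
    # mapping for your own setup.
    if not instance_is_gone(host["hostname"]):
        continue  # instance still exists, probably a network issue: leave it be
    # The instance really is gone, so clean the zombie host out of Rancher.
    requests.post(RANCHER_URL + "/hosts/" + host["id"] + "?action=deactivate", auth=AUTH)
    requests.delete(RANCHER_URL + "/hosts/" + host["id"], auth=AUTH)
```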