Tips on Preparing your Docker Hosts for Production?

Hi all,

We just ran an experiment with our first set of Rancher hosts to test a staging deployment on AWS, but we had to pull out due to multiple issues with managing the underlying hosts that Rancher uses.

2 hosts, Ubuntu 16.04 LTS, Docker

m3.medium (Rancher Server - single node)  # I'm aware 1 node is bad, but HA doesn't make child nodes more reliable
- Web UI
- Rancher Server
- ECR credentials puller

m4.large (Node) 
- Elasticsearch Host mounted data volume
- Java App Host mounted
- ECR credentials puller

Issues

  • Docker hosts ran out of disk space, so we mounted a 100 GB EBS volume at /mnt and set DOCKER_OPTS='-g /mnt'
  • The node host defaulted to DeviceMapper, which died 3 days into operation
  • We are running 2 hosts to cut costs but are having issues. Do we need to run at least x hosts for good reliability?
  • People lost trust and just want to go back to the old way that works; we are also having trouble debugging.
  • The ECR credentials puller does not work on the remote host and fails to pull containers.

Questions

Some questions for people with a bit more experience operating these container deployments.

  • Do you run heterogeneous or homogeneous clusters? I see a potential use case where we would want certain hosts to be dedicated database hosts (EBS attached) or load balancer hosts, though not sure if this is best practice.
  • What distro is the best? I have some users of CoreOS, others of RancherOS; just wondering if there are any tidbits of information on which has the best out-of-the-box configuration for Docker?
  • For routing, is it always best to use the Rancher LBs, or do you have good experience with DNS solutions? Any examples here?
  • Any pitfalls to watch for when protecting Docker hosts from random disk space outages?
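
On the disk space question, one low-tech guard is a periodic check that warns before the Docker data directory fills up. The following is a minimal sketch, assuming a POSIX shell and `df`; the function names `disk_used_pct` and `check_disk` and the 80% threshold are illustrative assumptions, not something taken from this thread.

```shell
#!/bin/sh
# Hypothetical helpers; names and threshold are illustrative assumptions.

# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given path.
disk_used_pct() {
    # With -P, df prints one header line plus one data line;
    # column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is below the limit, 1 (with a warning) if it is at or
# above the limit, and 2 if the usage could not be determined.
check_disk() {
    path=$1
    limit=$2
    pct=$(disk_used_pct "$path")
    [ -n "$pct" ] || return 2
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is ${pct}% full (limit ${limit}%)" >&2
        return 1
    fi
    return 0
}

# Example (e.g. from cron), assuming Docker data lives under /mnt:
#   check_disk /mnt 80
```

A check like this only reports the problem; it does not free any space, so it would need to be paired with some form of image cleanup.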

Currently we’re testing whether the org can shift from using traditional cloud provider services such as ELBs, Auto Scaling groups, and AMIs to a more containerized solution for cost savings.

From my experience so far, though, operating a clustering solution with a traditional mindset is a tad awkward, and there is a lot of pushback from management to just stop and run all ASGs at 1 instance for cost savings …

On Ubuntu, I install the linux-image-extra-virtual and linux-image-extra-$(uname -r) packages so that AUFS is available instead of DeviceMapper.
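
As a rough sketch of that step, assuming an Ubuntu host with `apt-get` and root access (the function names `aufs_packages` and `install_aufs_support` are made up for illustration):

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative assumptions.

# Print the two package names mentioned above for the running kernel.
aufs_packages() {
    echo "linux-image-extra-$(uname -r) linux-image-extra-virtual"
}

# Install them. Needs root; defined here but not called automatically.
install_aufs_support() {
    # Word splitting on the package list is intended here.
    apt-get update && apt-get install -y $(aufs_packages)
}

# Usage on the host, as root:
#   install_aufs_support
```

Because the first package name depends on `uname -r`, this would have to be repeated after a kernel upgrade.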
For the ECR credential puller, which image are you talking about? The one in the Rancher catalog? That only works when deployed as part of a Rancher stack. You shouldn’t be running it directly on the machine.

Filling up a 100 GB disk in 3 days seems a bit crazy, even when using devicemapper. Were you churning a bunch of images through the system? You’ll need to make sure to run something like janitor to remove old images from your hosts.
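
For illustration, a much narrower stand-in for a janitor-style tool is to remove only dangling (untagged) images with the stock Docker CLI. This is a sketch under assumptions, not what any janitor image actually does: the function name `remove_dangling_images` and the `DOCKER` override variable are hypothetical, and tagged but unused images are left alone.

```shell
#!/bin/sh
# Hypothetical helper; the name and the DOCKER override are assumptions.
# DOCKER can point at a different binary or a stub; it defaults to `docker`.

# Remove dangling (untagged) images only; tagged images are not touched.
remove_dangling_images() {
    ids=$("${DOCKER:-docker}" images -q -f dangling=true)
    if [ -z "$ids" ]; then
        return 0    # nothing to clean up
    fi
    # Word splitting on the ID list is intended here.
    "${DOCKER:-docker}" rmi $ids
}

# Usage on a host, e.g. from a daily cron job:
#   remove_dangling_images
```

On hosts that churn through many tagged builds this will not reclaim much, which is where a dedicated cleanup container earns its keep.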
Also, did you restart the docker service after making the config change to DOCKER_OPTS? If you didn’t, then docker won’t be using the mounted volume.
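
One way to confirm that the daemon actually picked up the change is to read the effective settings back from `docker info` after the restart, rather than trusting the config file. A minimal sketch, assuming the `Docker Root Dir` and `Storage Driver` lines that `docker info` prints; the helper name `docker_info_field` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper; the function name is an illustrative assumption.

# Read `docker info` style text on stdin and print the value of one
# "Key: value" field (first match only).
docker_info_field() {
    # $1 = field name, e.g. "Docker Root Dir" or "Storage Driver"
    sed -n "s/^ *$1: *//p" | head -n 1
}

# Usage on a host with a running daemon:
#   sudo service docker restart
#   docker info 2>/dev/null | docker_info_field "Docker Root Dir"
#   docker info 2>/dev/null | docker_info_field "Storage Driver"
```

If the root directory still shows the default location after a restart, it may be that the file holding DOCKER_OPTS is not being read by the init system on that host; that is a guess worth checking, not something confirmed in this thread.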

Oh no, it was the default 8 GB instance volume that we filled up (first attempt). Our devicemapper was using loopback by default, and it just broke after a few days of operation; we noticed and switched to AUFS.

Had to run with DOCKER_OPTS="-g /mnt --storage-driver=aufs"

Oh just the Rancher catalogue one,
objectpartners/rancher-ecr-credentials

Same config that we used on our in-house Ubuntu 16.04 hosts. It’s reporting in the logs that credentials were updated successfully, but the Rancher agents on the hosts can’t seem to authenticate to pull images.
