Hi all,
We just ran an experiment with our first set of Rancher hosts to test a staging deployment on AWS, but we had to pull out because of multiple issues with managing the underlying hosts that Rancher uses.
2 hosts, Ubuntu 16.04 LTS, Docker
m3.medium (Rancher Server - single node) # I'm aware 1 node is bad, but HA doesn't make child nodes more reliable
- Web UI
- Rancher Server
- ECR credentials puller
m4.large (Node)
- Elasticsearch (host-mounted data volume)
- Java app (host-mounted)
- ECR credentials puller
Issues
- Docker hosts running out of disk space; we mounted a 100 GB EBS volume at /mnt and set DOCKER_OPTS='-g /mnt'
- The node host defaults to the devicemapper storage driver, which died 3 days into operations
- We're running 2 hosts to cut costs but are having issues. Do we need to run at least x hosts for good reliability?
- People lost trust and are having trouble debugging, so they just want to use the old way that works.
- ECR credentials puller does not work on the remote host and fails to pull containers.
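
For the disk space and storage driver issues above, here is a minimal sketch of what an equivalent daemon.json setup might look like, written as a small Python helper that prints the JSON. The path /mnt/docker, the overlay2 driver, and the log size limits are illustrative assumptions, not what we actually ran. Note that the `-g` flag corresponds to the `graph` key on older Docker engines and to `data-root` on 17.05 and later.

```python
import json


def build_daemon_config(data_root="/mnt/docker", storage_driver="overlay2",
                        log_max_size="50m", log_max_files=3):
    """Return a dict suitable for /etc/docker/daemon.json.

    All defaults are illustrative assumptions, not tested recommendations.
    """
    return {
        # Where Docker keeps images, containers and volumes.
        # On engines older than 17.05 this key is "graph" (same as -g).
        "data-root": data_root,
        # Explicitly pick a storage driver instead of relying on the default.
        "storage-driver": storage_driver,
        # Cap container logs so they cannot fill the disk on their own.
        "log-driver": "json-file",
        "log-opts": {
            "max-size": log_max_size,
            # log-opts values must be strings in daemon.json.
            "max-file": str(log_max_files),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_daemon_config(), indent=2))
```

The printed JSON would go into /etc/docker/daemon.json, followed by a daemon restart. Whether overlay2 is appropriate depends on the kernel and Docker version on the host.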
Questions
Some questions for people with more experience operating these container deployments.
- Do you run heterogeneous or homogeneous clusters? I see a potential use case where we would want certain hosts to be dedicated database hosts (EBS attached) or load balancer hosts, though not sure if this is best practice.
- Which distro is best? I know of some users on CoreOS and others on RancherOS; are there any tidbits on which has the best out-of-the-box configuration for Docker?
- For routing, is it always best to use the Rancher LBs, or do you have good experience with DNS solutions? Any examples here?
- Any pitfalls to watch for when protecting Docker hosts against sudden disk space outages?
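
To make the disk space question concrete, this is roughly the kind of check I have in mind: a minimal sketch that warns when a mount point crosses a usage threshold. The function names, the 80% threshold, and the list of paths are hypothetical; it only looks at filesystem usage and is not something we have in production.

```python
import os
import shutil


def disk_usage_percent(path):
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def check_paths(paths, threshold_percent=80.0):
    """Return (path, percent) pairs for existing paths at or above the threshold."""
    alerts = []
    for path in paths:
        if not os.path.exists(path):
            # Skip mount points that are missing on this host.
            continue
        pct = disk_usage_percent(path)
        if pct >= threshold_percent:
            alerts.append((path, round(pct, 1)))
    return alerts


if __name__ == "__main__":
    # "/" and "/mnt" mirror the layout described above; adjust as needed.
    for path, pct in check_paths(["/", "/mnt"]):
        print(f"WARNING: {path} is {pct}% full")
```

Something like this could run from cron on each host and feed whatever alerting is already in place.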
Currently we’re testing whether the org can shift from traditional cloud provider services such as ELBs, Auto Scaling groups, and AMIs to a more containerized solution for cost savings.
From my experience so far, though, operating a clustering solution with a traditional mindset is a tad awkward; there's a lot of pushback from management to just stop and run all ASGs at 1 instance for cost savings …