This is how I have deployed Rancher

My goal was to build a resilient Rancher environment that is self-healing.
I have:

  • Rancher Server: A single server with a backup AMI I can use to replace it if needed. The database is on RDS with multi-zone replication. (I tried full HA, but it would not remain stable; I kept hitting certificate issues, among other problems.)

  • AMI for my base image: This image has my preferred version of Docker, my registry certificate, the Gluster driver, and the AWS CLI tools.

  • Launch Configurations (2): one for my log server cluster and one for what I am calling my Rancher Pool. (This is where the majority of my containers live.) These are configured to mount the Gluster volume, change the SSH port to 2222, and install the Rancher agent with the correct host label for each group. (A User Data sketch follows this list.)

  • AutoScaling Groups (2): one for my log server cluster and one for the RancherPool. My RancherPool ASG starts with 3 instances and scales to 5 according to CPU load. The LogServer ASG is fixed at 3.
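
To show what the Launch Configuration does at boot, here is a minimal User Data sketch; the hostnames, agent version, registration URL/token, and label value are placeholders rather than my real values:

    #!/bin/bash
    # Mount the shared Gluster volume on every host
    mkdir -p /app
    mount -t glusterfs glusterfs1.domain.com:vol1 /app
    # Move SSH to port 2222
    sed -i 's/^#\?Port .*/Port 2222/' /etc/ssh/sshd_config
    service ssh restart
    # Register the host with Rancher and apply the group's host label
    docker run -d --privileged \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /var/lib/rancher:/var/lib/rancher \
      -e CATTLE_HOST_LABELS='Service=RancherPool' \
      rancher/agent:v1.0.2 https://rancher.domain.com/v1/scripts/REGISTRATION_TOKEN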

On the Rancher side, I have my application stacks, which are configured to run on the appropriate hosts.

  • ELK stack: (Elasticsearch, Logstash, Kibana) all configured to run on my LogServer hosts.

  • Jobber: runs on one of my Gluster servers. (This backs up my Gluster volume to S3 every hour.)

  • Registry: (from the catalog, but customized) runs on the RancherPool instances and stores its data on the Gluster volume.

  • A load balancer: I am using the advanced routing options so that I can route based on DNS names. (See the sketch after this list.)

  • Route53: This provides me with DNS names for any container that exposes a port. I set up a Route53 zone just for Rancher; I can then use CNAMEs in my production zone to keep the URLs nice and simple.

  • Logspout: gathers the logs and sends them to Logstash. (Runs on all hosts.)

  • Prometheus: Produces some pretty graphs. (Runs on any/all instances.)
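
For the load balancer's hostname routing, the docker-compose entry for the LB service ends up looking roughly like this sketch; the service names, hostnames, and ports are placeholders rather than my actual configuration:

    lb:
      image: rancher/load-balancer-service
      ports:
        - 80
      links:
        - registry
        - kibana
      labels:
        # Route by request host name to the right target service and port
        io.rancher.loadbalancer.target.registry: reg.domain.com:80=5000
        io.rancher.loadbalancer.target.kibana: kibana.domain.com:80=5601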

Current Challenges:

  • Getting logs from outside of my rancher environment: I need to figure out a way to add inputs to the Logstash environment.
  • Deploy GitLab: I have tried several times, but the build seems to work for a few days and then something goes wrong with the file permissions, or the container filesystem becomes read-only. I am not using the GitLab from the catalog because it uses a sidekick instead of mounting a volume, and it's an older version with a security issue. I am considering downloading the compose files and customizing them to use the latest version and my Gluster volume. Now that my environment is more stable, my original files might work.
  • Deploy Jenkins: Using my own compose files. I also had problems with file permissions in the past. I will try again now that the environment is stable.

So that is where I am at today.

10 Likes

Thanks for this, it’s really good to see how others are getting on. My setup is smaller and isn’t quite as refined. I seem to be struggling with some issues you have already dealt with.

Here are my main challenges:

Replicated storage: I tried using the Gluster catalog a while back but it failed miserably. How have you found it to work and can you offer any tips on how to set it up? Do you feel like it’s production ready?

Backups: Been pondering for a while how to set up a backup-to-S3 solution. I’ve considered building on dockup https://github.com/tutumcloud/dockup and doing something with the rancher metadata but I’m not sure what. What is Jobber and can you give any more info about it?

Route 53: I have this set up, but since you can’t have a domain apex pointing to a CNAME, I am struggling to see the utility. For example, I was hoping I could just do mysite.com CNAME service.stack.env.rancher.mycompany.com, but due to the apex CNAME rule I’m forced to use A records with the IP of each host.

Load Balancer: I’m using a global load balancer with advanced L7 settings to route based on host name. Load balancing with the Traefik load balancer will be perfect once it supports HTTPS: http://rancher.com/traefik-active-load-balancer-on-rancher/. Very much looking forward to this!

Gitlab / CI / Registry

I have had great success with this gitlab stack https://github.com/sameersbn/docker-gitlab and have been running it for about a year now. As for CI, my workflow is simply to build a Docker image and push it to the registry. GitLab CI is more than capable of this, so I don’t have the need for a separate Jenkins.
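
For reference, that build-and-push workflow is little more than something along these lines in .gitlab-ci.yml (the registry address and image name are placeholders, not my actual setup):

    stages:
      - build

    build_image:
      stage: build
      script:
        # Build the image and push it to the private registry
        - docker build -t reg.mycompany.com:5000/myapp:$CI_BUILD_REF_NAME .
        - docker push reg.mycompany.com:5000/myapp:$CI_BUILD_REF_NAME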

And finally, GitLab has just released 8.8, which has a container registry built in: https://about.gitlab.com/2016/05/23/gitlab-container-registry/. For me this will be a massive improvement, and I’m looking forward to using GitLab for all my SCM / CI and registry needs.

If there is anything that’s really holding me back right now it is getting a good distributed storage and backup setup in place.

2 Likes

Thanks for the feedback, djskinner! I will definitely check out that gitlab stack.

Here is more detail about how I have set up my Gluster. I built my servers outside of my Rancher environment. I did not use the one from the catalog because, like you, after several tries it just kept failing. It’s really meant for testing anyway and not for a production build. Important note: I have a base image that I use for all my Rancher hosts. As mentioned in my original post, this base image has the AWS CLI installed; that will be important later. I call this image Docker GOLD.

  • I started here : http://gluster.readthedocs.io/en/latest/Install-Guide/Install/

  • I installed this on two t2.small instances with 16 GB OS drives and 50 GB storage drives. I used my Docker GOLD image.

  • Following the instructions, I created a volume and named it vol1. (See the command sketch after this list.)

  • I modified my Docker GOLD image by installing the Gluster driver.

  • I added the mount command to the User Data of my Launch config, so now every host in my system mounts that volume at /app:

      sudo mkdir /app
      sudo mount -t glusterfs glusterfs1.domain.com:vol1 /app
    
  • Inside that volume I created a folder called data. (It’s important that you don’t create your directories at the root of the volume.) Under data are the individual folders for each app. For example, there is one for the Registry, called registry.

  • I can then modify the “volumes” section of the docker-compose as appropriate. For example:

      volumes:
        - /app/data/registry/certs:/etc/nginx/certs:ro
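
As promised above, the volume creation on the two Gluster servers was roughly the standard sequence from the install guide; the hostnames and brick path are illustrative (they match the layout I describe in my Jobber post), so treat this as a sketch rather than my exact commands:

    # Run on glusterfs1, after formatting and mounting the storage drive on both servers
    gluster peer probe glusterfs2.domain.com
    gluster volume create vol1 replica 2 \
      glusterfs1.domain.com:/export/xvdb1/brick \
      glusterfs2.domain.com:/export/xvdb1/brick
    gluster volume start vol1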

So far it is working like a charm. I did have to trigger a self-heal once because I had deleted something directly off the server. If you need to delete something, do it from the mounted volume.

I will explain Jobber in my next post.

1 Like

How I back up my Gluster volume to S3 with Jobber
Information on Jobber can be found here: http://dshearer.github.io/jobber/#intro
The Docker image I used is here: https://hub.docker.com/r/blacklabelops/jobber/

  • I pulled the image and pushed it to my registry.
  • I copied the docker-compose.yml to my local computer where I store all my templates
  • I created an S3 bucket and enabled versioning and lifecycle rules
  • I modified the Jobber docker compose as follows:

    JobberAWSBackup:
      environment:
        AWS_ACCESS_KEY_ID: MYACCESS KEY
        AWS_SECRET_ACCESS_KEY: MY ACCESS KEY PASSWORD
        JOB_NAME1: CopyExportstoS3
        JOB_COMMAND1: sudo aws s3 sync /data s3://MYBUCKET/GlusterFS/data --region us-west-1 --delete
        JOB_TIME1: 1 1
        JOB_ON_ERROR1: Backoff
      log_driver: ''
      labels:
        io.rancher.container.pull_image: always
        io.rancher.scheduler.affinity:host_label: Service=FileServer
      log_opt: {}
      image: reg.domain.com:5000/jobber/jobber:test
      volumes:
        - /export/xvdb1/brick/data:/data:ro

What this does:

  • io.rancher.scheduler.affinity:host_label: Service=FileServer — the container runs on only one of the two file servers; it does not matter which one.
  • /export/xvdb1/brick/data:/data:ro — mounts the data folder to /data as read-only.
  • JOB_TIME1: 1 1 — runs at the first second of the first minute of every hour.
  • JOB_COMMAND1: sudo aws s3 sync /data s3://MYBUCKET/GlusterFS/data --region us-west-1 --delete — copies all the files and folders in the data drive to my S3 folder and deletes any files in the bucket that have been deleted locally. (This is okay because I have versioning turned on and deleted files are recoverable if I need them.)
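
If I ever need to pull the data back out of S3, the restore is just the sync in reverse. A rough sketch (the bucket name and paths are the same placeholders as above):

    aws s3 sync s3://MYBUCKET/GlusterFS/data /app/data --region us-west-1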

Here is why it was so important not to make your folder structure at the root of the volume. There is a hidden folder called .glusterfs which resides at the root of the created volume. This folder drives the s3 sync command CRAZY. You can’t exclude it, and it’s full of symlinks that S3 can’t follow, so it just gives up and never reads the other folders.

Important to note: Jobber comes in several versions. There is a special one for AWS, one for Docker tools, and one for Google, and there is an all-in-one version as well. You can use it to run multiple jobs; it is a good tool for any cron jobs you need to run.

2 Likes

@djskinner I wanted you to know that I used the gitlab install that you suggested. Slightly modified, but works like a charm!!!

@cloudlady911 I’m interested in how you are monitoring your application (not the infrastructure) with Prometheus: do you use any service discovery method to find the scraping targets (like Consul or DNS)?

Nothing so fancy at this point @brutus333. Just running it with default out of the box settings. The main purpose for it right now is to show something pretty to the boss.

I just finished building out my Graylog server. It is fully HA and uses the Elasticsearch 2 from the Rancher Catalog. This is an overview of how I did it.

2 Likes

My Rancher setup is based around a scalable, resilient, micro-services architecture. Our goal with Rancher is to have a core, container-based system for all stateless functionality; almost everything else leverages AWS-provided services.

  • Rancher Server: A single Rancher server for now, connected to an AWS Aurora RDS instance with multi-zone replication.

  • Environments: Dev/QA/Staging/Prod, users are assigned to the appropriate environment.

  • Host OS: This one took a while to decide on, and produced some interesting findings along the way. We tested CentOS (major issues related to Docker storage/devicemapper; do not use it as a Docker host OS, though it works fine as a container OS), Ubuntu (no major issues, could live with it), and RancherOS (keep it in the family). We were previously using CentOS exclusively on all of our AWS instances and had a high degree of comfort with it. The decision came down to re-thinking the intended purpose of the Rancher implementation: we decided it was more important to use something more in line with Rancher, and less about us being able to access and run things on the boxes like we were used to. We decided to go with RancherOS for all hosts.

  • Logging: We spent a good deal of time on this as well, as there are many options to choose from. We again let the intended purpose of the platform direct our decision (containers are immutable; do not store logs there), and we also did not want to add to the complexity of what is supposed to be a resilient micro-services platform (do not create dependencies on Rancher/hosts by adding storage volumes/file systems for logging). We were already using SumoLogic for all of our logging, so for our Rancher setup we implemented the “direct HTTPS end-point” method. Each container is configured to send the OS and app logs directly to an HTTPS end-point at SumoLogic; no log files are ever created on the containers or hosts.

  • Monitoring: Again, this took some time to work out and was heavily influenced by the purpose of the platform. We were New Relic customers, but the host OS decision (and some other factors) had us looking to change. I was already familiar with Datadog, and when I installed the catalog item and had host and Docker info in Datadog in under a minute, I was sold.

  • Deployment: We have used Atlassian Bamboo for a long time, and we still use it for our legacy apps. We are still testing Bamboo with Rancher and think we have come up with a solution that works for us. All initial stack setup is done via a Bamboo deploy (stable, repeatable) with two files (stored in Bitbucket): docker-compose and rancher-compose. The stack is deployed with a single instance of the service(s) and an LB. All subsequent deployments to existing stacks happen through the service “upgrade” process in Rancher, or via the CLI. (A command sketch follows this list.)

  • AWS ELB: All “external” application traffic is routed through AWS ELBs to the specified hosts’ LBs. We wanted the external connectivity to be handled/gated by the AWS ELBs. All of our containers are micro-service APIs or back-end processes (Celery/Django/custom Python); no “web” content is delivered from the micro-services, we do that all through AWS CloudFront CDNs with S3 buckets as origins.

  • Route 53: Used for all external domains; Rancher is set up to create entries in a private hosted zone through the Route 53 stack.

  • Registry: We use the AWS ECR service.

  • Things we do not use Rancher for: Elasticsearch (we use the ES cloud service), message queue (we currently use RabbitMQ but will be moving to either SQS or Redis through ElastiCache), Cassandra (primary storage of ingested data, maybe moving to Dynamo), web content (all served from S3 buckets via CloudFront CDN), RDBMS (we use RDS for all RDBMSs).
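
To give an idea of what the Bamboo deploy step runs (mentioned in the Deployment item above), it boils down to something like the following sketch; the stack name and file names are placeholders rather than our exact plan configuration:

    # Assumes RANCHER_URL / RANCHER_ACCESS_KEY / RANCHER_SECRET_KEY are set in the environment.
    # Initial stack creation from the two files stored in Bitbucket:
    rancher-compose --project-name my-stack \
      --file docker-compose.yml --rancher-file rancher-compose.yml up -d

    # Subsequent deployments upgrade the running services in place:
    rancher-compose --project-name my-stack up -d --upgrade --confirm-upgrade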

That’s where we are at for now, I’ll try to update this as we find new things or make changes to the setup.

Phillip

5 Likes

We are going to deploy Graylog2 on Rancher as you did; could you share your stack’s docker-compose and rancher-compose files?
Thanks

I want to share my Graylog setup notes, but the format limitations are a problem. Tried to upload a PDF but that is not allowed.

So I am sharing it from my Google Docs for now. Here

1 Like

I fine-tuned the instructions and they are now visible here.

4 Likes

Thank you for sharing this info!