Suggestion: Autoscale for a Service

A suggestion for the backlog… adding an autoscale feature to a service. In terms of UI, I would imagine this as an additional tab in the advanced options, comparable to the scheduling one. In terms of the input for the decision making, I’m still pondering a bit what the most logical solution would be: create a kind of “helper service” that generates an output that can be interpreted… or something that should reside within the container itself, though that could conflict with the idempotency principle of a container.

Any thoughts?

Personally I think this makes most sense to integrate into the healthchecks. At the moment that process can detect alive and dead services and react accordingly. If it could be improved to detect idle and overloaded processes, possibly with the help of the container itself, then it could drive the scaling process. At least that’s what I would do.

In fact we are already on the road to this. All of our microservices provide a /health endpoint which is intended for the healthcheck to use. This endpoint includes idle/busy/overload information as well.
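As a sketch of what such an endpoint might report (the utilization thresholds here are illustrative assumptions, not what any of our services actually use):

```python
def health_status(utilization: float) -> str:
    """Map a 0.0-1.0 utilization figure to the coarse idle/busy/overload
    state a /health endpoint could expose. Thresholds are made up for
    illustration."""
    if utilization < 0.2:
        return "idle"
    if utilization < 0.8:
        return "busy"
    return "overload"
```

The healthcheck process (or a scaler) would then only need to act on three coarse states instead of raw metrics.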

I think this should probably be linked to “container utilization” or something similar, which is still not available in Rancher… I can see this as a more advanced form of scaling, but would argue that resource utilization would be a logical first step…

I’d like to see something along the lines of “configurable” scaling (such as by use of an endpoint and certain conditions like @kiboro mentioned), but I think implementing this without resource-based auto-scaling would be an overstep; the “more common” scaling rules should come first… but just my opinion :wink:

One interesting and potentially cool/unique way to implement this could be to scale based on traffic information from a Rancher LB… This would provide static or, ideally, dynamic rules to scale services based on response time (e.g. scale the webserver when response time reaches X% over the moving average, or a fixed Xms), traffic (scale when RPM or connections reach X over the moving average, etc.), and potentially other things such as time-of-day scheduling (this last one wouldn’t necessarily be tied to an LB service, but the LB service may be a logical place to support something of a similar nature)…
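A toy version of the “X% over moving average” response-time rule could look like this (the windowing and threshold handling are assumptions for illustration, not anything Rancher ships):

```python
def over_moving_average(samples, threshold_pct):
    """Return True when the latest response time exceeds the moving
    average of all previous samples by more than threshold_pct percent.
    samples: list of response times (ms), oldest first."""
    if len(samples) < 2:
        return False  # not enough history to form an average
    *history, latest = samples
    avg = sum(history) / len(history)
    return latest > avg * (1 + threshold_pct / 100)
```

An LB-driven scaler would feed this from its access metrics and trigger a scale-up when it returns True.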

Here is some current related work on this:

  1. Cowbell: an experiment to add a Rancher service scale-up webhook trigger. Cowbell is launched inside a Rancher stack and is configured to listen for events and react accordingly.

  2. A proposal on GitHub: auto-scaling capabilities based on container metrics.

The auto-scaling based upon container metrics looks like a very “infra”/“ops” approach… A use case I currently have is based upon launching more consumers depending on the number of messages in a queue… Here I would see Chronos as a possible helper in the matter, triggering the scale function via the API.

Cowbell looks promising imho. That is the suggestion I can relate to most easily (in an abstract sense).

Scaling when all hosts get to 100% CPU is too late. The simplest way to determine when to scale a web server could be:

  1. Determine how much load each server can take (requests, clients, large POSTs, etc.), either by using a synthetic benchmark like ApacheBench or by collecting performance data from live servers. Implement a /health endpoint that reports at least ok/load/overload based on that performance data.

  2. Create a service within the stack that checks the /health endpoint in each container, and if some value is above a threshold, scale that service by +1 or more based on severity.

  3. Create a service that checks scheduling data and, if there are not enough servers, spawns some VPS (AWS, DO, etc.) or alerts an admin.

  4. Don’t forget to set rules for downscaling and number of stand-bys.
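The decision logic of step 2 could be sketched roughly like this (the severity rule, scaling harder when more than half the containers are overloaded, is an illustrative assumption):

```python
def scale_delta(states):
    """Decide how many replicas to add from the per-container health
    states (e.g. "idle"/"busy"/"overload" reported by /health).
    Returns 0 when nothing is overloaded; +1 when some containers are;
    +2 when more than half are (assumed severity rule)."""
    overloaded = sum(1 for s in states if s == "overload")
    if overloaded == 0:
        return 0
    return 2 if overloaded > len(states) / 2 else 1
```

A companion rule with a lower watermark on "idle" states would handle the downscaling side mentioned in step 4.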

Yes, scaling when a container gets to 100% is too late… but scaling when it gets to 80% (where normal load is 70%, for example), or when it gets to 90% (and normal load is 80%), is still useful… Even if for some particular reason the container may be failing, the app leaking, etc., ignoring resource allocation vs. resource use is a bit short-sighted imo…

Don’t mean to sound like a broken record, but I’d strongly favor not re-inventing the wheel and instead using a plugin architecture that would at least support standard tools like Swarm or other schedulers…

I came up with something different… maybe simpler to implement and understand.

Why not scaling rules based on recurring time ranges? Like: run 8 instances of this service Mon-Fri 9-17, otherwise default to 4.

This should be simpler to implement, and one can adjust the rules over time.

Most applications are used by humans and have a load pattern that repeats itself following human activity, e.g. higher load at certain hours, on certain days of the week, and in certain months.

I mean, I understand the beauty of really dynamically scaling a service based on real measures like network traffic or CPU load… but I kind of think all that dynamism is overkill; it may just help some extreme use cases with random load patterns. Rancher already provides an API, so one can easily have an external cron job that scales up and down based on time logic. The problem is kind of solved for me this way.
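The time-range rule from earlier (“8 instances Mon-Fri 9-17, otherwise 4”) is small enough to sketch; a cron job could compute the desired scale like this and then push it to the service via the Rancher API:

```python
from datetime import datetime

def desired_scale(now: datetime) -> int:
    """Return the desired instance count for a given moment:
    8 during Mon-Fri 9:00-17:00, otherwise the default of 4.
    The numbers mirror the example rule from the thread."""
    workday = now.weekday() < 5          # Mon=0 .. Fri=4
    office_hours = 9 <= now.hour < 17
    return 8 if (workday and office_hours) else 4
```

The cron job would call this once a minute (or hour) and issue an API update only when the result differs from the current scale.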

I think this discussion pretty accurately demonstrates how people have wildly varying definitions of what autoscaling means…

The sensible thing to do seems to be to provide info in the API/metadata to allow you to make a decision (and we probably need more of this) and then a hook to allow you to act on it (which you can do with the API today, or something more convenient like cowbell).

Especially because the thing that really costs money is the hosts running the containers, which sit below the level of the services you actually want to auto-scale, with varying support from hosting providers. One thing people do today is use e.g. EC2 autoscaling with a startup script that registers the host, plus global services, so that containers get automatically deployed to the host when it shows up (and we can make this easier by auto-removing hosts when they are killed off).

@vincent I totally agree, thanks for joining the conversation :slightly_smiling: The scaling of the hosts is the most important and complicated part, and what actually costs the most.

For us that do not use Amazon EC2 or other advanced IaaS providers and are on smaller providers or on private cloud, the auto-scaling of Hosts is a bit tricky.

Since Rancher already uses docker-machine and acts as a kind of “IaaS proxy orchestrator”… would it make sense to add some more automation for adding predefined hosts to an environment when the environment is “full”? Just brainstorming here…

@RVN_BR I think we have to reinvent the wheel. Scaling spans several interfaces (Rancher, the server provider, services reporting health) and must reflect different factors like usage prediction, physical location, time-of-day planning, lifetime and instance limits, etc. If you know of some open-source solution, please let me know.

@demarant I agree scaling actual hosts is more difficult than scaling services, especially when it comes to host maintenance or downscaling (removal from the cluster). It can allow for significant savings in hardware cost by spawning VPSes at peak while the base load is supplied by bare metal.

@vincent AFAIK the current API provides all the necessary information and tools for building your own auto-scaling service. It is a bit trickier to tell Rancher that a host should not be scheduled on.
I would be in favor of solutions as catalog templates to prove usefulness first, and then implementing them into Rancher.
Maybe it would be useful to have a label that allows a service to manage the environment and add hosts, so it is not necessary to create API credentials manually.

For me there are two parts to your scaling: the “platform” and the “service”. The first objective should be how you can scale the service; here we should make the assumption that the platform has sufficient resources. Next up, the same triggering mechanism (or at least the logic behind it) can be used to do the same for the hosts (for organisations where this applies).

To me the service load goes beyond the “100%” (CPU metric). A possible implementation could be to select a “helper” container as a load checker. This could then return the actions needed: “upscale/downscale” and “amount to scale”. The “logic” would then be contained in the container, and it could be easily referenced by Rancher without interfering with the base principle of a non-intrusive integration with Docker.
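The contract between Rancher and such a helper container could be as small as a JSON document; the field names below are illustrative assumptions, not an existing Rancher interface:

```python
import json

def load_check_response(action: str, amount: int) -> str:
    """Build the JSON a hypothetical "helper" load-checker container
    might return: an action and how many instances to add or remove.
    Field names ("action", "amount") are made up for illustration."""
    if action not in ("upscale", "downscale", "none"):
        raise ValueError("unknown action: " + action)
    return json.dumps({"action": action, "amount": amount})
```

Rancher (or a scaler service) would only need to parse this one document, leaving all app-specific logic inside the container.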

If you create a container with these labels it will get credentials as environment variables that can communicate with the API, which would include adding/removing hosts or scaling services:

io.rancher.container.create_agent: "true"
io.rancher.container.agent.role: "environment"

Take it one more step and make an API call after you make a decision and you don’t need any integration from us at all :smile:.
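Putting those two pieces together, a container launched with the agent labels above could read the injected credentials from its environment and build the scale-update request; the `CATTLE_URL` variable name follows Rancher 1.x agent conventions, but the exact update path shown here is an assumption:

```python
import os

def scale_request(service_id: str, scale: int):
    """Build the (url, payload) pair for updating a service's scale via
    the Rancher API, using the base URL the agent credentials inject as
    CATTLE_URL. The /services/<id> path is an assumption for this sketch;
    check the API schema of your Rancher version."""
    base = os.environ["CATTLE_URL"].rstrip("/")
    url = f"{base}/services/{service_id}"
    payload = {"scale": scale}
    return url, payload
```

An actual scaler would PUT that payload with the injected access/secret keys, closing the loop without any extra integration.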


Hi Vincent, do you know if the auto-removing hosts feature has been implemented, or when it will be? I feel like I am spending way too much time removing dead hosts when spot instance prices change.


  • Trevor