Remove containers from load balancer targets while upgrading?

I have noticed that while upgrading, all containers still seem to remain in the load balancer's target list.
For example, if I spam HTTP GET requests at my load-balanced service, there is sometimes huge latency when the load balancer hits a container that is currently upgrading.
This happens even with a large batch interval for the upgrade: when a specific instance is being upgraded and gets hit by the balancer, there is noticeable latency.

Wouldn’t it be possible to have Rancher notify the load balancer before it upgrades a container?
For example, with 3 instances of a container, if instance #1 is about to be upgraded, the load balancer could first remove it from its target list, eliminating any quirky latency or failed requests, and add it back to the target list once the container is healthy again.

I might have misinterpreted some of the behavior, but it sure looks like what I observed above is how things work.

Even if you add health checks to a service, those checks are not fine-grained enough to guarantee complete responsiveness during an upgrade.

Is this the current state or am I misunderstanding something?


Did you use “Start before Stopping” when upgrading the targets or not?

Hi Denise,
I have tried both options.

I have done some benchmarking on this topic.
I used an HTTP service running at scale 3 with a load balancer in front of it, then tried upgrading it with different settings while simultaneously hitting it with a tool that logs request times:

Here is a latency graph when using “start before stopping”:

In the first case, everything looked fine until just before I clicked the “finish upgrade” icon in Rancher.
Then there was a 16,000 millisecond spike.

And here is the graph when not using “start before stopping”:

In the latter case, there are 3 spikes of 5,000 milliseconds each while upgrading.
This behavior is consistent between runs, so these are not one-off issues; each scenario behaves the same way every time I run it.

Is this due to startup overhead in my services, or is the load balancer hitting containers that are being shut down?
Is there anything I can do to mitigate this?

@rogeralsing - I had something similar - is your service actually ready to process requests when it shows as Running in Rancher? In my case, it was a Spring-based Java application. The container was up and running, but Tomcat was still starting up, deploying the Spring application, doing the initial Spring bootstrap, etc. That would take 15 seconds after the container was marked as running, so the load balancer would start sending traffic to it too early. I solved that by adding a health check. Now the container stays in Initializing mode until the web application has actually fully started and responds to web requests, and the load balancer only sends traffic once it’s done initializing.
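For reference, a health check along those lines can be declared in rancher-compose.yml. This is only a sketch; the service name, port, and request path are my assumptions for a typical web app, not anything from the posts above:

```yaml
# rancher-compose.yml (sketch; service name, port and path are illustrative)
myapp:
  scale: 3
  health_check:
    port: 8080
    # Only mark the container healthy once the app itself answers HTTP,
    # not merely once the container process has started:
    request_line: GET /health HTTP/1.0
    interval: 2000           # ms between checks
    response_timeout: 2000   # ms to wait for a response
    healthy_threshold: 2     # consecutive successes before "healthy"
    unhealthy_threshold: 3   # consecutive failures before "unhealthy"
```

With `request_line` set, the check is an HTTP request rather than a plain TCP connect, so a Tomcat that is up but still deploying the application keeps the container in Initializing until it can actually serve the path.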

I have tested this using a simple nginx & php-fpm stack (4 containers - 2 per host) behind Rancher’s internal load balancer. If I run something like ApacheBench (ab) and perform a batched upgrade (1 at a time), it will indeed close already-open connections (i.e. it does not seem to inform HAProxy of a container shutting down)…

Benchmarking [redacted] (be patient)
Completed 1500 requests
Completed 3000 requests
SSL read failed (5) - closing connection
Completed 4500 requests
Completed 6000 requests
Completed 7500 requests
Completed 9000 requests
Completed 10500 requests
Completed 12000 requests
Completed 13500 requests
Completed 15000 requests
Finished 15000 requests

Non-2xx responses: 16 :frowning:

… This is with health checks on the php container to ensure a 200 is returned as well.

Edit: I’m assuming they are not utilizing the “DRAIN” feature of HAProxy, but rather removing the server from the config and reloading HAProxy. Is this correct?

Edit 2: Seems like I’m correct, per this GitHub issue: https://github.com/rancher/rancher/issues/2777 … Looks like it never made it into the GA release.
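For anyone curious what the DRAIN approach would look like, here is a minimal sketch of driving HAProxy’s runtime API (available since HAProxy 1.6) from Python. The socket path and the backend/server names are illustrative assumptions, not anything Rancher actually configures; this is not Rancher’s implementation, just the mechanism it could have used instead of a config rewrite plus reload:

```python
import socket

def build_state_command(backend: str, server: str, state: str) -> str:
    """Build an HAProxy runtime API command, e.g.
    'set server web/web_1 state drain'.

    'drain' stops new connections to the server while letting in-flight
    requests finish; 'ready' puts it back into rotation."""
    if state not in ("drain", "ready", "maint"):
        raise ValueError("state must be 'drain', 'ready' or 'maint'")
    return f"set server {backend}/{server} state {state}\n"

def send_command(socket_path: str, command: str) -> str:
    """Send a command to HAProxy's admin socket and return its reply.
    Equivalent to:  echo "<command>" | socat stdio <socket_path>"""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(command.encode())
        return s.recv(4096).decode()

# A graceful upgrade of one instance would then be:
#   send_command("/var/run/haproxy.sock",
#                build_state_command("web", "web_1", "drain"))
#   ... upgrade the container, wait for its health check to pass ...
#   send_command("/var/run/haproxy.sock",
#                build_state_command("web", "web_1", "ready"))
```

Draining first would avoid exactly the symptom in the ab run above: no new requests land on the instance being upgraded, so no connections get cut mid-flight.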