What on-prem load balancing / virtual-ip implementation did you use?

@ml2,

In my case, OpenStack doesn’t release the IPs dynamically associated with the instances until I ask for it or terminate the instance.

I use Terraform to create all the basics of my OpenStack deployment and it works like a charm.

Regards,

Ml2,

I have had success using MetalLB (an alpha project from Google - https://github.com/google/metallb) to assign external IPs via spec.loadBalancerIP, and using external-dns with a subzone in Infoblox for external DNS (https://github.com/kubernetes-incubator/external-dns). The combination allows me to resolve service clusterIPs externally without needing a wildcard DNS record. It has been pretty reliable.
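
For readers unfamiliar with this combination, here is a minimal sketch of what such a Service could look like. The service name, hostname, IP address and ports are placeholders, not values from the poster’s actual setup. It assumes the address belongs to a pool already configured in MetalLB and that external-dns is watching Services.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                      # hypothetical name
  annotations:
    # external-dns creates a record for this hostname (placeholder value)
    external-dns.alpha.kubernetes.io/hostname: my-service.k8s.example.com
spec:
  type: LoadBalancer
  loadBalancerIP: 192.0.2.10            # placeholder; must be in a MetalLB address pool
  selector:
    app: my-service                     # hypothetical pod label
  ports:
    - port: 80
      targetPort: 8080
```

With something like this in place, the hostname would resolve to the address MetalLB announces, which is what removes the need for a wildcard record.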

@cwade

Would you recommend it for production?

Since there’s no major version?

MetalLB seems like a promising option. Might be sufficiently mature for my use case.

From the site:

MetalLB is being used in several production and non-production clusters, by several people and companies. So far, it appears to be very stable in those deployments, but in the grand scheme of things, MetalLB is still not hugely “battle tested.”
MetalLB, bare metal load-balancer for Kubernetes

@etlweather
I found a thread on the project website where people explain their use cases for it.

I’m implementing the Istio platform. I’m adapting the official Helm chart, based on the example in istio-ingress-tutorial, to deploy istio-ingress to each node in a dedicated node pool.

Let us know how it goes!

By the way, if you don’t want the NGINX Ingress, it looks like you can disable it by editing your cluster as YAML.

Presumably, you can also specify a node selector so it does not run on every node.
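
A sketch of what that could look like in the cluster YAML, assuming the RKE-style `ingress` section. The label name is a placeholder, and the exact nesting may differ depending on the Rancher version, so treat this as a starting point rather than a verified configuration.

```yaml
# RKE-style cluster configuration fragment (sketch only)
ingress:
  provider: nginx          # use "none" instead to not deploy the NGINX Ingress at all
  node_selector:
    run-ingress: "true"    # placeholder label; only nodes carrying it run the ingress controller
```

The nodes meant to run the ingress would need to carry the matching label before the change is applied, otherwise no ingress pod can be scheduled.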

@etlweather, have you made any progress on this topic? (trying MetalLB, NGINX or something else)

I’m just back from vacation, so I haven’t made any progress on my side.

Do you have any useful conclusions that I could start building from?

@amioranza, have any users started using your setup? Is it working fine?
Thanks

@ml2 I have been continuing my research / testing.

Ingress

I tried 3 forms of ingress / load balancers: the default NGINX Ingress, Ambassador and Traefik.

In the end, I had the best experience with the default NGINX Ingress. Using the documentation, I was able to add configuration to it and resolve the defects I was running into.

Specifically, the main problem I ran into overall is that if I killed a worker node, it took 5 minutes for Kubernetes to officially consider the node broken. During that time, traffic was being sent to the broken node and would just “hang”. NGINX would continue to send traffic and would wait 30 seconds or something like that before returning a 503. This was unacceptable. This is what pushed me to look at other options - the lack of a circuit breaker. And also because I had a few requirements that I was trying to meet such as:

  • Restricting which network can access an HTTP resource
  • Exposing non-HTTP services
  • A desire to possibly use rate-limiting features
  • Monitoring of the Ingress
  • Sticky sessions (optionally)

There may be more, but these are the key things I was looking for and which, at the outset, I didn’t see in the Ingress provided by Rancher.

Circuit breaker

This is the main feature I tested all 3 Ingresses against because it is what made the switch to Rancher 2.0 impossible. Right now, if a node goes down, the Rancher 1.6 load balancer with HAProxy detects that the node is down and stops sending traffic to it. HAProxy does health checks against the containers.

In Rancher 2.0 / Kubernetes, the health checks you define on a pod are executed by the kubelet, running on the same host as the container. That has value, such as detecting whether the container is ready to receive traffic (readiness probe) and whether it is still alive (so as to restart it), but it does nothing for your end users or your service uptime when the host itself goes down, because the component running the checks goes down with it. Your service is effectively unavailable if users get 30s waits followed by a 503 every X requests (X being the number of different hosts running your pod).

My post “Health check - does it even work?” covers that in more detail, and more accurately as to the exact timeout values and behavior. It also covers that you can set some options for the Kubernetes controller, but I have yet to find out how to set those using Rancher/RKE.

But by going through the NGINX Ingress documentation, making sure to use Rancher’s fork of the project to avoid finding documentation for a newer version (if there is even such a thing), I found that I could set some options in NGINX Ingress via annotations that would resolve this problem.

Specifically, I can set the connection timeout to 1 second, which for an on-premise system used on the LAN, is totally sufficient. If NGINX can’t connect to a pod after 1 second, it goes to the next one and the user gets an answer. 1 second is still a bit long so more options are needed so that the one second isn’t repeated every 3 requests (assuming you are running 3 instances of the pod).

That is where max_fails and fail_timeout come into play. These are documented as:

max_fails=number
sets the number of unsuccessful attempts to communicate with the server that should happen in the duration set by the fail_timeout parameter to consider the server unavailable for a duration also set by the fail_timeout parameter. By default, the number of unsuccessful attempts is set to 1. The zero value disables the accounting of attempts. What is considered an unsuccessful attempt is defined by the proxy_next_upstream, fastcgi_next_upstream, uwsgi_next_upstream, scgi_next_upstream, memcached_next_upstream, and grpc_next_upstream directives.

fail_timeout=time
sets the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable; and the period of time the server will be considered unavailable. By default, the parameter is set to 10 seconds.

Module ngx_http_upstream_module

This satisfied my requirements. I still have to work out the best values for these options.
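
As a sketch of how these settings could be expressed, the Ingress below sets a 1-second connect timeout plus the two upstream parameters through annotations. The host, names and values are placeholders. The upstream-max-fails and upstream-fail-timeout annotations are an assumption based on older ingress-nginx releases (as far as I know they are not available in newer ones), so check the annotation list for the version your Rancher install ships.

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myhttp-service                  # placeholder name
  annotations:
    # Give up connecting to a pod after 1 second and try the next one
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "1"
    # Assumed annotations (older ingress-nginx): map to max_fails / fail_timeout
    nginx.ingress.kubernetes.io/upstream-max-fails: "1"
    nginx.ingress.kubernetes.io/upstream-fail-timeout: "30"
spec:
  rules:
    - host: myhttp.example.com          # placeholder host
      http:
        paths:
          - path: /
            backend:
              serviceName: myhttp-service
              servicePort: 80
```

With values like these, one failed connection would take a pod out of rotation for 30 seconds, so the 1-second penalty is paid once per window rather than on every third request.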

Traefik

That’s a workable alternative. I found how to turn off the NGINX Ingress (mentioned in a post above) and so was able to make Traefik listen on port 80.

I stopped playing with it when I couldn’t make its circuit breaker work. It’s probably something I’m doing wrong - I’m sure the feature does work. But by then, I had the circuit breaker working with NGINX so my patience was limited.

When I killed a worker node, Traefik was still sending traffic to the pod, until the Kubernetes controller detected the node had vanished. So I added an annotation to my service:

traefik.backend.circuitbreaker: "NetworkErrorRatio() > 0.0000001"

I tried various values, this still didn’t do it.

I tried adding the parameter forwardingtimeouts.dialtimeout=2s both as a ConfigMap and as an argument to the program. Didn’t do it.

So I tried Ambassador.

Ambassador

On my first attempt to use it (which was actually before trying Traefik), I did not like it because the configuration isn’t a first-class citizen - it’s YAML inside YAML.

annotations:
  getambassador.io/config: |
    ---
    apiVersion: ambassador/v0
    kind: Module
    name: ambassador
    config:
        use_remote_address: true
    ---
    apiVersion: ambassador/v0
    kind: Mapping
    name: myhttp-service
    prefix: /
    service: myhttp-service

OK, I could live with that, but at first it pushed me away. But I came back to it. Worked OK. It got a bit better when I found I did not need to put the annotation on each service definition I wanted to expose, but could rather put them all in one service definition for Ambassador itself. The documentation at first leads you to think you have to put the Mappings as annotations on the service you are mapping to. That didn’t sit right with how I map my services. But with that clarified, I wanted to test it further.

The first thing I observed is that all my requests were going to the same backend (I had 3 backends running). Presumably, if another computer had made requests, they would have gone to a different backend - i.e. maybe it has sticky sessions enabled by default? But that stickiness was not cookie-based, so it could have been based on the source IP? I did not look further into it yet. But there was definitely no “load balancing” with a single client. All the other Ingresses would send request 1 to backend 1, request 2 to backend 2 and so forth and cycle back to 1.

So I killed the worker node that was running the pod which was receiving the requests. No retries. After a delay, I got an error (503) until kubernetes controller marked the pod as “unknown” and removed it from the service discovery. At which point Ambassador routed the requests to the next pod.

OK, not quite what I wanted. I spent a bit of time trying to see how I would configure this. Haven’t solved it yet. Too bad. There are some features of Ambassador which were interesting (many also available in Traefik) like the Canary release - being able to give some weight to a service handling a resource. The shadow feature is particularly interesting.

Summary

But the key thing I’m looking for in a frontend load balancer / Ingress is to keep my service available, at all times, whatever happens with the backends. If it can’t detect that a pod/container is down and continues sending traffic to it because the service discovery tells it that it’s still there, then it’s pointless. It is inconsequential that the service discovery makes a mention of a backend, if the frontend can’t reach it.

So for now, I’m settling with NGINX Ingress that comes with Rancher. I may use Traefik or something else for non-HTTP services.

NGINX Ingress integrates nicely in Rancher 2.0 / Kubernetes. It uses “standard” Ingress specs for Kubernetes and while I don’t use the Rancher UI to define workloads and stuff, the Ingress definitions do show in the UI (unlike Ambassador). I want the UI to visualize my set up, I don’t want to be reading YAML only.

Virtual IP

That is what I now have to finalize. I haven’t done much more research into this yet.

Wow!!!

Very detailed information. Thanks for the reply, very appreciated. I thought k8s was supposed to handle a pod that is down or a node. Now it defeats the purpose if it can’t be done.

I assume the virtual IP will be used to point to the NGINX Ingress, with a failover node?

Yes, I was also a bit baffled by the slow response to a down node. The default is something like 40s to mark it “unknown status” and 5 minutes to reschedule the pod. Not a big deal if you’re running your service with a scale of 100 and you lose 1… but if you run your service with a scale of 2 and you lose one, you’re no longer redundant for 5 minutes. But of course, all that can be changed (but as stated, I’m still trying to figure out how to change it fully).

I suppose these default delays may make sense in a cloud hosting environment.

The NGINX Ingress not having a default that properly discards a pod it can’t connect to is, however, a bit strange. I would think the default should be to respond properly and favor response time, as uptime for the user is the reason we use these tools in the first place.

But it is possible to achieve it as described in my post so it’s somewhat of a moot point for the initiated, troublesome for the newbies because the defaults are not logical (in my opinion).


Regarding the virtual IP, my idea is to dedicate two virtual machines to handle the ingress traffic. I tried changing the cluster.yaml to put a node selector for ingress but all I got is that no ingress started. So in the end, I don’t really care if the ingress service is running on all worker nodes, it does not really cost me anything.

So two worker nodes will not run any other workload than the ingress. And the idea is a virtual IP will be floating between the two nodes and the DNS record will point to the virtual IP.

This is how I do things basically in Rancher 1.6.

The advantage of having Ingress run on all nodes (but not used) is that if I have a problem with the two ingress virtual machines for example, I can at least recover by changing the DNS to point to another worker node and thus rapidly mitigate the outage and restore service that way (provided network ACL rules permit the traffic obviously).

Alright, I solved one more problem - the time it takes for the kube controller to detect a down node and how fast it evicts the pods and creates new ones to replace those from the down node.

This is covered in Health check - does it even work?
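
For reference, here is a sketch of the kind of settings involved, written as RKE `extra_args`. This is an assumption about where the flags would go, not a copy of the configuration from that post, and the values are illustrative only. The defaults they override are the 40s grace period and 5 minute eviction timeout mentioned earlier in this thread.

```yaml
# RKE-style cluster configuration fragment (sketch, illustrative values)
services:
  kubelet:
    extra_args:
      node-status-update-frequency: 5s    # how often the kubelet reports node status
  kube-controller:
    extra_args:
      node-monitor-period: 2s             # how often the controller checks node status
      node-monitor-grace-period: 20s      # time without status before a node is marked unhealthy
      pod-eviction-timeout: 30s           # wait before evicting pods from an unhealthy node
```

Shorter values detect a dead node faster but make the cluster more sensitive to brief network hiccups, so they are worth testing against your own environment.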

Nice!!, thanks for the update.

@ml2 OK - so here is my write up on virtual IP…

Virtual IPs for Kubernetes Ingress

To have a resilient load-balancing service in Rancher 1.6, we used Keepalived to implement VRRP. A floating virtual IP address was shared between the two VMs running the load-balancing containers. Keepalived would keep a heartbeat going between both and, if one noticed the other was gone, the virtual IP address was switched using a gratuitous ARP push. Keepalived was also performing some kind of checks to see that the load balancer was still there and, if it wasn’t, would also do a switch. This also worked relatively well.

Scenarios where it did not work well were when the Rancher IPSec networking between hosts would fail. If IPSec failed on the load-balancer hosts, then nothing would detect that and the HAProxy container wasn’t able to send traffic to any backend services. If IPSec failed on one of the workload hosts, then HAProxy would ignore the backend containers it couldn’t reach, but only for so long, then would try them again, etc. This resulted in “slow” webapps at times. So that worked relatively well also but wasn’t perfect and here and there it caused problems.

Here and there, the live host would just not let go of the virtual IP once it detected a failure and needed a manual reboot.

In Rancher 2.0, we wanted to improve this setup, starting with whatever is standard in Rancher 2.0 and Kubernetes and improving from there.


Keepalived on Kubernetes

There is a Keepalived Kubernetes contrib project (contrib/keepalived-vip). I attempted to use this to set up my Ingress for high availability but it did not work well for me.

To start with, its purpose is not just to do VRRP (Virtual Router Redundancy Protocol); it is set up to also receive the traffic and pass it on to some backend - presumably the Ingress. It connects to the Kubernetes API to find which Ingresses exist and sends the traffic to those, in load-balance mode or otherwise.

It is possible to configure it to only do VRRP however. But I still ran into some issues and the virtual IP would get assigned, etc.

So I went the route of creating my own container image for Keepalived (based on work I had done previously for Rancher 1.x) and using that instead. After some basic tests, this is working well. I haven’t tested it thoroughly and it’s not in production, but the initial smoke tests show that it responds properly to the various failure scenarios I could think of.

Also the contrib project seems to be using a relatively old version of Keepalived.

Keepalived container image

Note: the latest Keepalived wants Kernel 4.15.18 or newer. I successfully ran it on 4.15.0 but failed to run it on 4.4.

I built my container image with this:

FROM ubuntu:18.04

ARG KEEPALIVED_VERSION=2.0.6

RUN apt-get update \
 && apt-get install -y build-essential libssl-dev curl \
 && curl -o /tmp/keepalived.tar.gz -SL http://www.keepalived.org/software/keepalived-${KEEPALIVED_VERSION}.tar.gz \
 && cd /tmp \
 && tar -xzf keepalived.tar.gz \
 && cd /tmp/keepalived* \
 && ./configure \
 && make \
 && make install \
 && apt-get remove build-essential libssl-dev -y \
 && apt-get autoremove -y \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* /var/log/apt

COPY resources/startup.sh /usr/bin/startup.sh
CMD ["/usr/bin/startup.sh"]

The startup script basically modifies some variables in the configuration file (which will be provided as a ConfigMap) and starts Keepalived.

#!/bin/bash

pkill keepalived

## Set node priority based on env variable
if [ -z "$NODE_PRIORITY" ]; then
   NODE_PRIORITY=100
fi
mkdir -p /etc/keepalived/
cp /tmp/config/keepalived.conf /etc/keepalived/keepalived.conf

echo "Setting priority to: $NODE_PRIORITY"
sed -i "s/{{NODE_PRIORITY}}/${NODE_PRIORITY}/" /etc/keepalived/keepalived.conf

ROUTER_ID=$(hostname)
echo "Setting Router ID to: $ROUTER_ID"
sed -i "s/{{ROUTER_ID}}/${ROUTER_ID}/" /etc/keepalived/keepalived.conf

# Make sure we react to these signals by running stop() when we see them - for clean shutdown
# And then exiting
trap "stop; exit 0;" SIGTERM SIGINT

stop()
{
  # We're here because we've seen SIGTERM, likely via a Docker stop command or similar
  # Let's shutdown cleanly
  echo "SIGTERM caught, terminating keepalived process..."
  # Record PIDs
  pid=$(pidof keepalived)
  # Kill them
  kill -TERM $pid > /dev/null 2>&1
  # Wait till they have been killed
  wait $pid
  echo "Terminated."
  exit 0
}

# This loop runs until we've started up successfully
while true; do

  # Check if Keepalived is running by recording its PID (if it's not running $pid will be null):
  pid=$(pidof keepalived)

  # If $pid is null, do this to start or restart Keepalived:
  while [ -z "$pid" ]; do
    echo "Displaying resulting /etc/keepalived/keepalived.conf contents..."
    cat /etc/keepalived/keepalived.conf
    echo "Starting Keepalived in the background..."
    /usr/local/sbin/keepalived --dont-fork --dump-conf --log-console --log-detail --vrrp &
    # Check if Keepalived is now running by recording its PID (if it's not running $pid will be null):
    pid=$(pidof keepalived)

    # If $pid is null, startup failed; log the fact and sleep for 2s
    # We'll then automatically loop through and try again
    if [ -z "$pid" ]; then
      echo "Startup of Keepalived failed, sleeping for 2s, then retrying..."
      sleep 2
    fi

  done

  # Break this outer loop once we've started up successfully
  # Otherwise, we'll silently restart and Rancher won't know
  break

done

# Wait until the Keepalived processes stop (for some reason)
wait $pid
echo "The Keepalived process is no longer running, exiting..."
# Exit with an error
exit 1

Disclaimer: The startup script is not totally my own but something which was done by NeoAssist/docker-keepalived

Keepalived Kubernetes configuration

I have two nodes in the Kubernetes cluster which have a label of run-ingress=true. I create the Keepalived workload as a DaemonSet with a node selector. The keepalived.conf comes from a ConfigMap.

So here is the configuration:

apiVersion: v1
data:
  keepalived.conf: |-
    global_defs {
        router_id {{ROUTER_ID}}
        vrrp_garp_master_delay 1
        vrrp_garp_master_refresh 30
        notification_email {
           keepalived@example.com
        }
        notification_email_from keepalived@example.com
        smtp_server x.x.x.x
    }
    vrrp_script chk_port {
        script "curl http://127.0.0.1/healthz"
        timeout 3
        interval 2
        fall 2
        rise 2
    }
    vrrp_instance k8s-vips {
        state BACKUP
        interface eth0
        virtual_router_id 22
        priority {{NODE_PRIORITY}}
        advert_int 1
        nopreempt
        dont_track_primary
        track_script {
            chk_port
        }
        authentication {
            auth_type PASS
            auth_pass k8s-vips
        }
        virtual_ipaddress {
            x.x.x.x/24 dev eth0
        }
        smtp_alert
    }
kind: ConfigMap
metadata:
  name: keepalived-conf

And here is the DaemonSet configuration.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: keepalived-vip
spec:
  template:
    metadata:
      labels:
        app: keepalived-vip
        env: prod
    spec:
      hostNetwork: true
      volumes:
        - name: config-volume
          configMap:
            name: keepalived-conf
      containers:
        - name: keepalived
          image: registry.example.com/keepalived:2.0.6
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          volumeMounts:
            - name: config-volume
              mountPath: /tmp/config/
      nodeSelector:
        run-ingress: "true"

Initial tests showed that this should work. I tried the following failure scenarios:

  • Killed the MASTER node VM.
  • Killed the pod which is currently MASTER.
  • Killed the containers related to the Ingress on the current MASTER node.

@etlweather Really cool write up. I think you should become technical writer :slight_smile:

Thanks, I will look into it :slight_smile:

@etlweather I found a question on Stack Overflow with an interesting answer. It looks a lot like the comment from @cwade

I’m not sure whether I would have to point each service at MetalLB or whether my Ingress controller (NGINX) could use it.

MetalLB looked like a good option, but I didn’t try it because its Layer 2 mode (I wouldn’t be using BGP in my case) is not much different from Keepalived, and since I was already familiar and experienced with Keepalived, I preferred sticking with it rather than bringing yet another technology to my growing list.

But from what I read in its docs, it seems to be a very good solution.

I think I will try it. What I’m wondering is whether I can make it work with the NGINX controller so that I can use the UI, or whether I always have to configure it in .yml. I’m not a Rancher/Kubernetes expert (first time deploying it), so I don’t understand everything that is going on.

Also, since I’m more of a dev than an ops person, your Keepalived configuration files scared me, but maybe I’ll try it.

@etlweather Thanks for the writeup! I work on Ambassador, so your feedback is greatly appreciated.

  • Right now, Ambassador only routes to Kubernetes services, not pods. Thus, the load balancing is limited to round robin, and we rely on Kubernetes to remove pods from the round robin.
  • We are working on changing Ambassador to route dynamically to pods, which will enable more advanced behavior (e.g., circuit breaking, smarter load balancing, etc.). This work is very much in progress at the moment, hopefully in the next month or so. :slight_smile:
  • The reason why the documentation suggests you create the annotations on a per-service basis is because we typically see different engineers/teams responsible for different services, and in this model, each service team can control / manage their own routes. As you discovered, you can of course create a central configuration, but this can get big over time.

Thanks again, hope this is useful info!

Thanks @richarddli - that explains the results I got.