@ml2 I have been continuing my research / testing.
Ingress
I tried 3 forms of ingress / load balancers: the default NGINX Ingress, Ambassador and Traefik.
In the end, I had the best experience with the default NGINX Ingress. Using the documentation, I was able to add configuration to it and resolve the defects I was running into.
The main problem I ran into is that if I killed a worker node, it took 5 minutes for Kubernetes to officially consider the node broken. During that time, traffic was still being sent to the broken node and would just “hang”: NGINX kept forwarding requests and waited something like 30 seconds before returning a 503. This was unacceptable, and this lack of a circuit breaker is what pushed me to look at other options. I also had a few requirements I was trying to meet, such as:
- Restricting which network can access an HTTP resource
- Exposing non-HTTP services
- A desire to possibly use rate-limiting features
- Monitoring of the Ingress
- Sticky sessions (optionally)
There may be more, but these are the key things I was looking for, and at the outset I didn’t see them available in the Ingress provided by Rancher. (At least the first one turns out to be doable with an annotation; see the sketch right below.)
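For example, if I read the ingress-nginx docs correctly, restricting which network can access an HTTP resource is a single annotation on the Ingress (the CIDR here is just an example):

metadata:
  annotations:
    # Only clients from this network may reach the resource; others get a 403
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"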
Circuit breaker
This is the main feature I tested all 3 Ingresses against, because its absence is what made the switch to Rancher 2.0 not possible. Right now, if a node goes down, the Rancher 1.6 load balancer with HAProxy detects that the node is down and stops sending traffic to it: HAProxy does health checks against the containers themselves.
In Rancher 2.0 / Kubernetes, the health checks you define on a pod are executed by the kubelet, running on the same host as the container. While that has value - detecting whether the container is ready to receive traffic (readiness probe) and whether it is still alive (liveness probe, so it can be restarted) - it does nothing for giving your end users a nice experience and keeping your service up. Your service is factually unavailable if users get a 30s wait and then a 503 every X requests (X being the number of different hosts running your pod).
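For reference, these are the probes I mean, defined on the pod (a minimal sketch; the image, path and timings are made-up examples):

apiVersion: v1
kind: Pod
metadata:
  name: myhttp-pod
spec:
  containers:
  - name: myhttp
    image: myhttp:latest
    readinessProbe:
      # When this fails, the pod is removed from the Service endpoints
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 5
    livenessProbe:
      # When this fails, the kubelet restarts the container
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 10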
My post “Health check - does it even work?” covers that in more detail, with the exact timeout values and behavior. It also covers the options you can set on the Kubernetes controller, but I have yet to find out how to set those using Rancher/RKE.
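If RKE passes arguments through to the controller manager the way I expect (untested, so treat this as an assumption; the values are just examples), it should be something like this in cluster.yml:

services:
  kube-controller:
    extra_args:
      # How long a node can be unresponsive before being marked NotReady
      node-monitor-grace-period: "20s"
      # How long pods on a dead node linger before being evicted
      pod-eviction-timeout: "30s"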
But by going through the NGINX Ingress documentation - making sure to use Rancher’s fork of the project to avoid finding documentation for a newer version (if there even is such a thing) - I found that I could set some options in the NGINX Ingress via annotations that would resolve this problem.
Specifically, I can set the connection timeout to 1 second, which for an on-premise system used on the LAN is totally sufficient. If NGINX can’t connect to a pod within 1 second, it goes to the next one and the user gets an answer. One second is still a bit long, so more options are needed so that the one-second wait isn’t repeated every 3 requests (assuming you are running 3 instances of the pod).
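The connection timeout itself is one annotation on the Ingress (value in seconds; the annotation name is from the ingress-nginx docs, so double-check it against Rancher’s fork):

metadata:
  annotations:
    # Give up on a backend if a TCP connection isn't established within 1s
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "1"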
That is where max_fails and fail_timeout come into play. These are documented as:
max_fails=number
sets the number of unsuccessful attempts to communicate with the server that should happen in the duration set by the fail_timeout parameter to consider the server unavailable for a duration also set by the fail_timeout parameter. By default, the number of unsuccessful attempts is set to 1. The zero value disables the accounting of attempts. What is considered an unsuccessful attempt is defined by the proxy_next_upstream, fastcgi_next_upstream, uwsgi_next_upstream, scgi_next_upstream, memcached_next_upstream, and grpc_next_upstream directives.
fail_timeout=time
sets the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable; and the period of time the server will be considered unavailable. By default, the parameter is set to 10 seconds.
This satisfied my requirements. I still have to come up with the best values for these options.
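As annotations, this should look roughly like the following (these upstream-* annotations existed in ingress-nginx versions of that era and map directly onto the nginx directives quoted above - again, verify against Rancher’s fork):

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "1"
    # 2 failed attempts within 10s marks the backend unavailable for 10s
    nginx.ingress.kubernetes.io/upstream-max-fails: "2"
    nginx.ingress.kubernetes.io/upstream-fail-timeout: "10"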
Traefik
That’s a workable alternative. I found out how to turn off the NGINX Ingress (mentioned in a post above) and so was able to make Traefik listen on port 80.
I stopped playing with it when I couldn’t make its circuit breaker work. It’s probably something I’m doing wrong - I’m sure the feature does work. But by then, I had the circuit breaker working with NGINX so my patience was limited.
When I killed a worker node, Traefik was still sending traffic to the pod until the Kubernetes controller detected the node had vanished. So I added an annotation to my service:
traefik.backend.circuitbreaker: "NetworkErrorRatio() > 0.0000001"
I tried various values; this still didn’t do it. I also tried adding the parameter forwardingtimeouts.dialtimeout=2s, both in a ConfigMap and as an argument to the program. That didn’t do it either.
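A sketch of the ConfigMap variant, for reference (Traefik 1.x TOML; the forwardingTimeouts section is from the Traefik docs, the ConfigMap name is my own):

apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-conf
data:
  traefik.toml: |
    [forwardingTimeouts]
      # Maximum time allowed to establish a connection to a backend
      dialTimeout = "2s"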
So I tried Ambassador.
Ambassador
On my first attempt to use it (which was actually before trying Traefik), I did not like it because the configuration isn’t a first-class citizen - it’s YAML inside YAML:
annotations:
  getambassador.io/config: |
    ---
    apiVersion: ambassador/v0
    kind: Module
    name: ambassador
    config:
      use_remote_address: true
    ---
    apiVersion: ambassador/v0
    kind: Mapping
    name: myhttp-service
    prefix: /
    service: myhttp-service
OK, I could live with that, but at first it pushed me away. Still, I came back to it, and it worked OK. It got a bit better when I found that I did not need to put the annotation on each service definition I wanted to expose, but could instead put them all on the one service definition for Ambassador itself (sketched below). The documentation at first leads you to think you have to put the Mappings as annotations on the service you are mapping to, which didn’t sit right with how I map my services. With that clarified, I wanted to test it further.
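A minimal sketch of that pattern, with all the Mappings on Ambassador’s own service (the service names and port setup are my own examples):

apiVersion: v1
kind: Service
metadata:
  name: ambassador
  annotations:
    getambassador.io/config: |
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: myhttp-mapping
      prefix: /
      service: myhttp-service
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: other-mapping
      prefix: /other/
      service: other-service
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
  selector:
    service: ambassador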
The first thing I observed is that all my requests were going to the same backend (I had 3 backends running). Presumably, requests from another computer would have gone to a different backend - i.e. maybe it has sticky sessions enabled by default? That stickiness was not via a cookie, so it could have been based on the source IP? I have not looked further into it yet. But there was definitely no “load balancing” with a single client, whereas all the other Ingresses would send request 1 to backend 1, request 2 to backend 2, and so forth, cycling back to 1.
So I killed the worker node that was running the pod receiving the requests. No retries. After a delay, I got errors (503) until the Kubernetes controller marked the pod as “unknown” and removed it from service discovery. At that point, Ambassador routed the requests to the next pod.
OK, not quite what I wanted. I spent a bit of time trying to see how I would configure this. Haven’t solved it yet. Too bad, because Ambassador has some interesting features (many also available in Traefik), like canary releases - being able to give some weight to a service handling a resource. The shadow feature is particularly interesting.
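For illustration, a canary Mapping looks something like this (weight is from the Ambassador Mapping docs; the v2 service name is made up):

getambassador.io/config: |
  ---
  apiVersion: ambassador/v0
  kind: Mapping
  name: myhttp-canary
  prefix: /
  service: myhttp-service-v2
  # Send roughly 10% of the traffic for / to the v2 service
  weight: 10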
Summary
But the key thing I’m looking for in a frontend load balancer / Ingress is to keep my service available, at all times, whatever happens to the backends. If it can’t detect that a pod/container is down and continues sending traffic to it because service discovery says it’s still there, then it’s pointless. It is inconsequential that service discovery mentions a backend if the frontend can’t reach it.
So for now, I’m settling on the NGINX Ingress that comes with Rancher. I may use Traefik or something else for non-HTTP services.
The NGINX Ingress integrates nicely into Rancher 2.0 / Kubernetes. It uses “standard” Kubernetes Ingress specs, and while I don’t use the Rancher UI to define workloads and such, the Ingress definitions do show up in the UI (unlike Ambassador’s). I want the UI to visualize my setup; I don’t want to be reading YAML only.
Virtual IP
That is what I now have to finalize. I haven’t done much more research into this yet.