I created a workload with a readiness and liveness check. When I kill the host it’s running on, no new pod gets created.
When the pod is running, if I look at the logs of the web server, I see requests coming from 10.42.1.1, 2 requests every 2 seconds.
- It does not make sense that the health check only comes from one place. To be robust, it should come from 3 different places (I have 3 nodes in my cluster); then the pod can be failed if 2 of the 3 health checks report failure.
- I can’t find any pod with that IP address. Unsure what it is.
- Obviously, if a worker node running a pod under health checking is killed, the health check should detect that and start a new pod somewhere else.
Does anybody have health checks working?
Version 2.0.5
The answer to my question of what does the checking can probably be found in the Kubernetes documentation:
For an HTTP probe, the kubelet sends an HTTP request to the specified path and port to perform the check.
Configure Liveness, Readiness and Startup Probes | Kubernetes
Well, that’s nice - the check is done from the same host… It serves a basic purpose: detecting that the service inside the container is fubar.
So how does the system detect that the host has vanished and that it’s time to start more pods on other hosts to meet the scale requirement?
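For context, the probes on my workload look roughly like this (container name, image, path and port are placeholders, not my actual app):
- name: web                  # placeholder container, under spec.template.spec.containers
  image: nginx               # placeholder image
  ports:
    - containerPort: 80
  readinessProbe:
    httpGet:
      path: /                # placeholder path
      port: 80
    periodSeconds: 2         # readiness + liveness both at 2s lines up with the 2 requests every 2 seconds in my logs
  livenessProbe:
    httpGet:
      path: /
      port: 80
    periodSeconds: 2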
As you have surmised, unlike in Cattle/1.x, in Kubernetes health checks are done only from the host running the pod. By default each node gets a 10.42.x.0/24, so 10.42.x.1 is “the node”.
Node health is managed by the node controller, which does heartbeats and has timeouts for marking a node unavailable and then rescheduling its pods…
Thanks for confirming and the link.
The node controller checks the state of each node every --node-monitor-period seconds. (The default timeouts are 40s to start reporting ConditionUnknown and 5m after that to start evicting pods.)
Now at least what I’m observing makes sense.
The only problem I’m stuck with at the moment is that the NGINX Ingress still tries to send traffic to the failed node for quite some time. The ingress logs a 499 status every so often (I have 3 nodes in my test, so it’s 200, 200, 499, 200, 200, 499, 200, 200, 499, etc.). Well, I guess that’s because I was refreshing the page every second, so the browser wouldn’t wait long enough to get a 503 from NGINX.
But in any case, that’s not really acceptable for an on-prem system. Maybe not even for the cloud, but I won’t speak for that…
With Rancher 1.6, since multiple worker nodes were doing the health check, such a condition was detected rapidly and the load balancer was updated immediately to remove the containers running on the failed node.
I would like to emulate that. Perhaps I can reduce --node-monitor-period and pod-eviction-timeout; I’m not quite sure yet where that could be changed.
The NGINX generated config for the Ingress has this:
upstream default-mywebsite-80 {
    # Load balance algorithm; empty for round robin, which is the default
    least_conn;
    keepalive 32;
    server 10.42.2.13:80 max_fails=0 fail_timeout=0;
    server 10.42.2.10:80 max_fails=0 fail_timeout=0;
    server 10.42.1.20:80 max_fails=0 fail_timeout=0;
}
Which means it will never consider a backend server failed, even if it cannot connect. I think changing this would resolve my issue where users are affected when a worker node vanishes…
But a bit later in the conf we have:
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
# In case of errors try the next upstream server before returning an error
proxy_next_upstream error timeout invalid_header http_502 http_503 http_504;
So that looks fine. I’m going to retest with tries every 8 seconds instead of 1 second…
Frankly, a connect timeout of 5s seems a bit high to me. I reduced it to 1s for testing and it makes much more sense, at least for an on-premises deployment.
By setting the annotation nginx.ingress.kubernetes.io/proxy-connect-timeout to 1, the NGINX Ingress attempted to connect to the pod on the downed node, bailed out after 1 second, and tried the next backend server; the user got the response, but with a 1s delay.
This went on a few times until the K8S node controller failed the node, at which point NGINX refreshed the configuration and stopped attempting the failed node.
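In case it helps anyone, this is roughly how I set it (the Ingress name, host and service are placeholders for my test workload, and I’m assuming the extensions/v1beta1 Ingress API of this era):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: mywebsite                          # placeholder
  annotations:
    # give up on an unreachable backend after 1s instead of the 5s default
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "1"
spec:
  rules:
    - host: mywebsite.example.com          # placeholder
      http:
        paths:
          - path: /
            backend:
              serviceName: mywebsite       # placeholder service
              servicePort: 80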
Not sure how I missed that earlier, but we can also customize the max_fails and fail_timeout values.
Custom NGINX upstream checks
NGINX exposes some flags in the upstream configuration that enable the configuration of each server in the upstream. The Ingress controller allows custom max_fails and fail_timeout parameters in a global context using upstream-max-fails and upstream-fail-timeout in the NGINX ConfigMap or in a particular Ingress rule. upstream-max-fails defaults to 0. This means NGINX will respect the container’s readinessProbe if it is defined. If there is no probe and no values for upstream-max-fails NGINX will continue to send traffic to the container.
Tip: With the default configuration NGINX will not health check your backends. Whenever the endpoints controller notices a readiness probe failure, that pod’s IP will be removed from the list of endpoints. This will trigger the NGINX controller to also remove it from the upstreams.
https://github.com/rancher/ingress-nginx/blob/master/docs/user-guide/nginx-configuration/annotations.md#custom-nginx-upstream-checks
So a mix of both satisfies my requirement.
- Lowering the connect timeout from the 5s default to a value more reasonable for an on-prem setup, e.g. 1s (or so, to be determined with more experience).
- Setting max_fails and fail_timeout values for the upstream so that a backend isn’t constantly retried when the Ingress can’t connect to it. For an on-prem setup, I feel like a max_fails of 2 or 3 is sufficient and a fail_timeout of 5s will do it; with any significant amount of traffic, tripping that threshold won’t take long. And since the next-upstream feature will kick in after 1s based on the proxy-connect-timeout, the user won’t see much of a delay (1s + normal response time). A sketch of the ConfigMap version follows this list.
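Something like this in the controller’s ConfigMap should set it globally (the ConfigMap name and namespace are my guess for a stock nginx-ingress deployment; a Rancher-deployed controller may name them differently):
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration      # name/namespace are assumptions
  namespace: ingress-nginx
data:
  # consider a backend failed after 2 failed attempts...
  upstream-max-fails: "2"
  # ...and skip it for 5 seconds before retrying
  upstream-fail-timeout: "5"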
That leaves me with the 40s it takes for the K8S controller to notice the node is gone; I would like to bring that down, along with the 5m before it reschedules the pods.
Does anyone have data on this?
@vincent Do you have any insight on how to change the following values for the cluster?
- kubelet: node-status-update-frequency
- controller-manager: node-monitor-period
- controller-manager: node-monitor-grace-period
- controller-manager: pod-eviction-timeout
I tried creating a new cluster and editing the YAML file before starting it. I added:
services:
  kubelet:
    extra_args:
      node-status-update-frequency: "5s"
  kube-controller:
    extra_args:
      node-monitor-period: "2s"
      node-monitor-grace-period: "16s"
      pod-eviction-timeout: "30s"
After the cluster creation, these changes disappeared from the cluster config YAML except for node-status-update-frequency. So this did not result in marking the node as unhealthy faster when I killed it, and the pods were re-created only 5 minutes after the node was detected as unhealthy.
Does anyone know how to change these configurations?
Got this working thanks to @superseb on GitHub.
The RKE documentation shows kube-controller, but, at least in Rancher’s cluster.yaml, it should be kube_controller (use an underscore).
services:
  kubelet:
    extra_args:
      node-status-update-frequency: "5s"
  kube_controller:
    extra_args:
      node-monitor-period: "2s"
      node-monitor-grace-period: "16s"
      pod-eviction-timeout: "30s"
This works, and now the kube controller rapidly detects a down node and rapidly starts new pods to replace those that were on it.