Rescheduling PODs after RKE worker node failure in less than 5 minutes

anunez · July 29, 2020, 10:03am

Hello all,

I reconfigured my RKE cluster on Rancher 2.4 so that pods are rescheduled in less than 5 minutes after a node failure (I test this by shutting down the worker node). However, it does not work: the pods are still rescheduled only after the default 300 seconds.

I followed superseb's instructions to change this default behaviour:

https://gist.github.com/superseb/a9925c465b42bc5001b94c4ec241265a

cluster.yml

services:
  kubelet:
    extra_args:
      node-status-update-frequency: 4s
  kube-api:
    extra_args:
      default-not-ready-toleration-seconds: 30
      default-unreachable-toleration-seconds: 30
  kube-controller:
    extra_args:

This file has been truncated.

This is the configuration I set up for my cluster:

kube-api:
  always_pull_images: false
  extra_args:
    default-not-ready-toleration-seconds: '30'
    default-unreachable-toleration-seconds: '30'
  pod_security_policy: false
  service_node_port_range: 30000-32767
kube-controller:
  extra_args:
    node-monitor-grace-period: 16s
    node-monitor-period: 2s
    pod-eviction-timeout: 30s
kubelet:
  extra_args:
    node-status-update-frequency: 4s
  fail_swap_on: false
  generate_serving_certificate: false
kubeproxy: {}
scheduler: {}

What am I missing?

Best Regards,
Alex.

mattmattox · July 29, 2020, 5:17pm

This is a known issue with upstream Kubernetes: https://github.com/kubernetes/kubernetes/issues/55713

I created a workaround for this issue: https://github.com/mattmattox/drain-node-on-crash

anunez · July 30, 2020, 11:15am

Hi,

After doing some further testing, the configuration I specified is actually working.
Maybe it took some time to apply the changes.

Thank you anyway!

b3tts32 · September 14, 2020, 6:46pm

How much time passed before it started working? I’m experiencing the same thing.

anunez · September 15, 2020, 7:53am

I think I tried it the next day and then I realized it was actually working.
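
One possible explanation for the delay (an assumption, not something confirmed in this thread): the two default-*-toleration-seconds values on kube-api are injected into a pod's spec when the pod is created, so pods that already existed before the change would keep the old 300-second tolerations until they are recreated. A rough sketch of what to look for in the output of kubectl get pod <pod-name> -o yaml, where <pod-name> is a placeholder:

```yaml
# Excerpt of a pod spec, illustrative only.
# Pods created after the kube-api change should show the new value (30 here).
# A value of 300 would suggest the pod was created before the change.
spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30
```

If existing pods still show 300, recreating them (for example by redeploying the workload) should let them pick up the new defaults.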
