Rescheduling pods after RKE worker node failure in less than 5 minutes

Hello all,

I reconfigured my RKE cluster on Rancher 2.4 so that pods are rescheduled in less than 5 minutes after a node failure (I test this by shutting down the worker node). However, it does not work: the pods are still rescheduled only after the default 300 seconds.

I followed Superseb's instructions to change this default behaviour.

This is the configuration I set up for my cluster:

kube-api:
  always_pull_images: false
  extra_args:
    default-not-ready-toleration-seconds: '30'
    default-unreachable-toleration-seconds: '30'
  pod_security_policy: false
  service_node_port_range: 30000-32767
kube-controller:
  extra_args:
    node-monitor-grace-period: 16s
    node-monitor-period: 2s
    pod-eviction-timeout: 30s
kubelet:
  extra_args:
    node-status-update-frequency: 4s
  fail_swap_on: false
  generate_serving_certificate: false
kubeproxy: {}
scheduler: {}
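
As a rough sanity check on these values, the expected delay before pods are evicted can be estimated by adding the node controller's grace period to the toleration seconds. This is only a minimal sketch, assuming taint-based eviction is in effect (in which case the toleration seconds, not `pod-eviction-timeout`, drive the delay). The function name and the simple additive model are illustrative assumptions, the 40s grace period used for comparison is assumed to be the upstream default, and real timings also include scheduling and container start-up.

```python
def estimate_eviction_delay(node_monitor_grace_period_s, toleration_seconds):
    """Approximate seconds from node failure until pods are evicted.

    Assumed model: the node is marked unhealthy once the grace period
    passes without a status update, then pods are evicted once their
    toleration for the resulting NoExecute taint expires.
    """
    return node_monitor_grace_period_s + toleration_seconds


# Values from the configuration above: 16s grace period, 30s tolerations.
configured = estimate_eviction_delay(16, 30)

# Assumed upstream defaults for comparison: 40s grace period, 300s tolerations.
default = estimate_eviction_delay(40, 300)

print(f"configured: ~{configured}s, default: ~{default}s")
```

If the observed delay is still around 300 seconds or more, the pods are probably not using the new toleration values yet.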

What am I missing?

Best Regards,
Alex.

This is a known issue with upstream Kubernetes: https://github.com/kubernetes/kubernetes/issues/55713

I created a workaround for this issue: https://github.com/mattmattox/drain-node-on-crash

Hi,

After doing some further testing, the configuration I specified is actually working.
Maybe it took some time to apply the changes.

Thank you anyway!

How much time passed before it started working? I’m experiencing the same thing.

I think I tried it the next day and then I realized it was actually working.
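
One possible explanation for the delay, offered as an assumption and not a confirmed diagnosis: the default toleration seconds are added to a pod when it is created, so pods that already existed before the change may keep their 300-second tolerations until they are recreated. Below is a minimal sketch for checking what a given pod actually carries, assuming the pod JSON comes from `kubectl get pod <name> -o json`. The helper name `eviction_tolerations` and the sample pod data are hypothetical.

```python
import json

# Taints applied to a node that is not ready or unreachable.
NODE_TAINT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def eviction_tolerations(pod):
    """Return {taint key: tolerationSeconds} for the node-failure taints."""
    result = {}
    for tol in pod.get("spec", {}).get("tolerations", []):
        if tol.get("key") in NODE_TAINT_KEYS and tol.get("effect") == "NoExecute":
            result[tol["key"]] = tol.get("tolerationSeconds")
    return result


# Hypothetical, trimmed-down pod as it might look after the new defaults apply.
sample_pod = json.loads("""
{
  "spec": {
    "tolerations": [
      {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
       "effect": "NoExecute", "tolerationSeconds": 30},
      {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
       "effect": "NoExecute", "tolerationSeconds": 30}
    ]
  }
}
""")

print(eviction_tolerations(sample_pod))
```

A pod that still reports 300 for both keys was most likely created before the new settings took effect; deleting it so that its controller recreates it should pick up the new values.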