Node Allocatable - Limit Pods without Limiting System


We are having an issue with nodes going into the NotReady state.
We deployed Metricbeat to watch per-process CPU/memory usage and Filebeat to collect kubelet logs. At the point of the crash we see:

  • kswapd CPU usage spikes
  • networking and basic system services seem to drop out
  • sshd is still listening but doesn't process requests

It looks like we are entering swap hell: the server evicts the mmap'd pages of executables from memory, and they then have to be read back in from disk the next time the process runs.

I’ve had a good look at the Node Allocatable feature.
The annoying thing about it is that you need to specify SystemReserved, which, as the docs warn, acts as an upper limit on memory for system services. This is really not ideal.

I can see that we have the --cgroups-per-qos flag set, and that the kubepods cgroup has been created with all the pods in it.

I assume it would be easy to apply a memory limit to the kubepods cgroup that’s e.g. 2GB less than the total RAM on the machine.
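To make the idea concrete, here is a minimal sketch of how that limit could be computed. It assumes a Linux node and the 2GB headroom figure from above; the function names are made up for illustration, and the cgroup path in the closing comment assumes cgroup v1 mounted in the usual place. The kubelet manages the kubepods cgroup itself, so it may well reset a value written by hand.

```python
import re

GIB = 1024 ** 3


def parse_mem_total_bytes(meminfo_text: str) -> int:
    """Return the MemTotal value from /proc/meminfo content, in bytes."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    # /proc/meminfo reports MemTotal in kB (units of 1024 bytes).
    return int(match.group(1)) * 1024


def kubepods_limit_bytes(meminfo_text: str, headroom_bytes: int = 2 * GIB) -> int:
    """Total RAM minus the headroom we want to keep for system services."""
    total = parse_mem_total_bytes(meminfo_text)
    if headroom_bytes >= total:
        raise ValueError("headroom is not smaller than total memory")
    return total - headroom_bytes


# Hypothetical usage on a node (not executed here):
#   with open("/proc/meminfo") as f:
#       limit = kubepods_limit_bytes(f.read())
#   # Assumed cgroup v1 path; cgroup v2 uses memory.max instead.
#   with open("/sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes", "w") as f:
#       f.write(str(limit))
```
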

The problem I have is that kubelet’s Available Memory calculation is still going to look at the memory on the entire node, and not just what’s in the kubepods cgroup.
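For reference, my understanding is that the kubelet derives memory.available as capacity minus working set, where the working set is cgroup usage minus inactive file pages. The sketch below reproduces that arithmetic so the node-wide figure can be compared with the same calculation done against the kubepods cgroup. The function names are hypothetical, and the `total_inactive_file` key assumes the cgroup v1 `memory.stat` format.

```python
GIB = 1024 ** 3


def parse_memory_stat(stat_text: str) -> dict:
    """Parse 'key value' lines from a cgroup memory.stat file into a dict."""
    stats = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Usage minus inactive file pages, floored at zero."""
    return max(usage_bytes - inactive_file_bytes, 0)


def available_bytes(capacity_bytes: int, usage_bytes: int,
                    inactive_file_bytes: int) -> int:
    """Capacity minus working set, floored at zero."""
    return max(capacity_bytes - working_set_bytes(usage_bytes, inactive_file_bytes), 0)


# Node-wide view: capacity is MemTotal, usage comes from the root memory cgroup.
node_available = available_bytes(8 * GIB, 6 * GIB, 1 * GIB)

# kubepods view (the calculation I would like): capacity is the cgroup's
# limit, usage and inactive_file come from the kubepods cgroup.
pods_available = available_bytes(6 * GIB, 5 * GIB, 1 * GIB)
```

With these made-up numbers the node-wide view reports 3 GiB available while the kubepods view reports only 2 GiB, which is the gap I am worried about.
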

Has anyone found a workaround to get kubelet to calculate Available Memory based on the kubepods cgroup, without needing to turn on Node Allocatable?