Adding cpu-manager-policy argument to kubelet?

Hi All,

I’m looking for a way to set the CPU management policy to ‘static’, as outlined in “Control CPU Management Policies on the Node” in the Kubernetes docs, for a Rancher-built/managed on-premise cluster.

I’ve tried editing the yaml live in the web UI for the cluster itself, but I can’t find a syntax that works in the kubelet stanza. The cluster either flails for a while, trying and failing to apply the change before rolling back, or it just ignores the changes and they disappear.

I’m guessing I want to be using an “extra_args” flag in the yaml and then kubelet’s native “--cpu-manager-policy=static”? Or is there some rancher/yaml syntax along the lines of cpu_manager_policy: static? I’ve tried a slew of variations on a test cluster with no luck, and I confess this feels like a really sloppy way to figure this out. But I can’t seem to find any examples online.

Any help would be hugely appreciated.

Some Progress:
By experimenting with slight changes to the values of existing extra_args statements in the etcd service and checking the results of “ps aux” inside the container (first I had to install ps), I can see how rancher parses its yaml syntax and presents it to the binary call as an argument.

I also found information on the actual thing I’m trying to change (the CPU manager policy) suggesting that if I set this I also need to reserve at least one CPU for kubernetes use.

So I’m fairly sure the correct syntax for the kubelet stanza in the “live” config.yaml is:

  fail_swap_on: false
  generate_serving_certificate: false
  extra_args:
    cpu-manager-policy: 'static'
    reserved-cpus: 0

Note those first two lines in the stanza were there already.
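For anyone landing here later, this is the full shape I’m testing, assuming the standard RKE cluster.yml layout where the kubelet stanza sits under services (the surrounding keys besides extra_args are just what already existed in my cluster; extra_args is RKE’s generic pass-through that turns each key: value pair into a --key=value flag on the kubelet binary):

```yaml
services:
  kubelet:
    fail_swap_on: false
    generate_serving_certificate: false
    # Each entry below becomes a CLI flag on the kubelet process,
    # e.g. --cpu-manager-policy=static
    extra_args:
      cpu-manager-policy: "static"
      reserved-cpus: "0"
```

That matches what I observed with “ps aux” for the etcd service: extra_args entries show up verbatim as flags on the process command line.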

‘docker ps’ on a controller node that’s failing to accept the changed config shows that the kubelet container is in a restart loop. ‘docker logs’ shows kubelet complaining that both of my added arguments are deprecated and should be passed in a config file instead. But the usage statement lists a slew of CLI arguments as similarly deprecated, so presumably at some point rancher will need to re-engineer their kubelet container to inject and call a config file.

The current meaningful error is the last line of the attempted start:
F0518 16:36:35.790721 6150 server.go:273] failed to run Kubelet: could not initialize checkpoint manager: could not restore state from checkpoint: configured policy "static" differs from state checkpoint policy "none"

I suspect this is a kubernetes issue, not a rancher one. But would still be grateful for any input.

The output also says “Please drain this node and delete the CPU manager checkpoint file ‘/var/lib/kubelet/cpu_manager_state’ before restarting Kubelet.” But how do I do that in a docker container that I can’t shell into while it’s in a restart loop?

So this does still appear to be a rancher issue: if this container is fungible, wouldn’t the previous checkpoint come with the container image pulled from rancher’s repo? Even if I could delete that file, wouldn’t the checkpoint state just revert on the next restart? And wouldn’t this be a problem for many other settings that might change?

I’ve also posted this as an issue in the rancher/rancher GitHub repo.

Five months later I’ve had no luck solving this problem. My opened issue on GitHub got no reply and was eventually auto-closed. Anyone? Bueller? Bueller?

SOLVED. Unable to shell into the rebooting container, I never got far enough to figure out that /var/lib/kubelet/cpu_manager_state is not inside the container; it’s bind-mounted from the node’s file system. The fix is to apply the extra_args setting in the cluster’s yaml config and then delete /var/lib/kubelet/cpu_manager_state from every controller and worker node.
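To spell out the per-node steps as I understand them (a sketch, not a recipe: <node-name> is a placeholder, the drain flags may need adjusting for your workloads, and “kubelet” as the docker container name is what rancher uses on my nodes; the drain step is what the kubelet error message itself asks for):

```shell
# 1. Drain the node so no running pods hold CPU-pinning state
#    (run from anywhere with kubectl access to the cluster)
kubectl drain <node-name> --ignore-daemonsets

# 2. SSH to the node and delete the stale checkpoint. It lives on the
#    host filesystem, not inside the kubelet container.
sudo rm /var/lib/kubelet/cpu_manager_state

# 3. Restart the kubelet container so it writes a fresh checkpoint
#    under the new "static" policy, then let workloads back on.
docker restart kubelet
kubectl uncordon <node-name>
```

Repeat for every controller and worker node. Since the checkpoint is on the host, it does not revert when the container image is re-pulled, which resolves my earlier worry about the state coming back.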