Hi,
I am exclusively using Ubuntu 18.04 on hosts for a cluster. On occasion, I see certain hosts that have chronic DNS resolution issues. I’ve poked around quite a bit and it sounds like it may be related to the kubelet/systemd issue mentioned here.
If I understand this issue (and the relevant commit) correctly, it intends to solve the problem as follows:
- if systemd-resolved is detected, it adds --resolv-conf=/run/systemd/resolve/resolv.conf to kubelet.
@superseb Based on that fix, should I reasonably expect to always see --resolv-conf=/run/systemd/resolve/resolv.conf as a kubelet arg on all my Ubuntu hosts?
I’ve taken a look at the kubelet args on several hosts with docker inspect, and in all cases I’m seeing:
"--resolv-conf=/etc/resolv.conf",
Does this mean the fix isn’t behaving as expected, or is there something else here I’m not considering?
To handle this seamlessly, the flag is appended in the entrypoint of the container, here: https://github.com/rancher/rke-tools/blob/master/entrypoint.sh#L53
So checking the running kubelet process, rather than the args the container was created with, should be enough to validate it.
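For illustration, the detection amounts to something along these lines (a simplified sketch, not the actual rke-tools code):
# Simplified sketch (not the actual entrypoint.sh): if /etc/resolv.conf points
# at the systemd-resolved stub resolver (127.0.0.53), append the path to the
# real upstream resolv.conf to the kubelet arguments before starting it.
if grep -q "nameserver 127.0.0.53" /etc/resolv.conf; then
    set -- "$@" --resolv-conf=/run/systemd/resolve/resolv.conf
fi
exec "$@"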
Also, if this were the issue, DNS would not work at all rather than fail sporadically. If you have DNS issues, please create an issue at https://github.com/rancher/rancher/issues/new and describe your setup and exactly what errors you see. We have troubleshooting steps available at https://rancher.com/docs/rancher/v2.x/en/troubleshooting/ that might help as a first attempt to narrow down the cause.
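As a quick first check (a generic example, not specific to your setup), running a lookup from inside a pod usually tells you whether cluster DNS works at all:
# Run a throwaway pod and resolve the in-cluster kubernetes service;
# busybox:1.28 ships an nslookup that behaves well with cluster DNS.
kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default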
Thanks for the reply.
Are you saying my assumptions are correct, namely that if I see the kubelet arg "--resolv-conf=/etc/resolv.conf" on an Ubuntu 18.04 host that’s a member of a cluster, then Rancher isn’t properly detecting and/or accommodating systemd-resolved when it launches kubelet?
Your point that if this were the issue, DNS would be failing consistently is an interesting one; I had the same thought. The challenge has been that I’m not seeing this problem consistently: most nodes behave properly, but on occasion I get nodes with chronic DNS issues from within Kubernetes deployments (the containers within the pods). I’ll try to pull together good data and repro steps to write up an issue.
The detection can be checked using:
ps -ef | grep kubelet | grep resolv-conf
which should result in a line ending in:
--resolv-conf=/etc/resolv.conf --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cgroup-driver=cgroupfs --resolv-conf=/run/systemd/resolve/resolv.conf
where you can see the added --resolv-conf=/run/systemd/resolve/resolv.conf at the end, which is appended when systemd-resolved is detected (the later --resolv-conf value is the one kubelet uses).
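To confirm on the host side whether systemd-resolved is actually in play (a generic check, not specific to RKE), you can compare the stub resolv.conf with the upstream one:
# On Ubuntu 18.04, systemd-resolved is normally active and /etc/resolv.conf
# points at the 127.0.0.53 stub; the real upstream nameservers live in
# /run/systemd/resolve/resolv.conf, which is what kubelet should be handed.
systemctl is-active systemd-resolved
grep nameserver /etc/resolv.conf
grep nameserver /run/systemd/resolve/resolv.conf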