Occasional Ubuntu 18.04 host with chronic DNS failures

Hi,
I am using Ubuntu 18.04 exclusively on the hosts in a cluster. On occasion, certain hosts develop chronic DNS resolution issues. I’ve poked around quite a bit, and it sounds like it may be related to the kubelet/systemd-resolved issue mentioned here.

If I understand this issue (and the relevant commit) correctly, it intends to solve the problem by:

  1. if systemd-resolved is detected, adding --resolv-conf=/run/systemd/resolve/resolv.conf to the kubelet args.

@superseb Based on that fix, should I reasonably expect to always see --resolv-conf=/run/systemd/resolve/resolv.conf as a kubelet arg on all my Ubuntu hosts?

I’ve taken a look at kubelet args on several hosts with docker inspect and in all cases I’m seeing:

"--resolv-conf=/etc/resolv.conf",

Does this mean the fix isn’t behaving as expected, or is there something else I’m not considering?

To handle this seamlessly, the detection is done in the entrypoint of the container here: https://github.com/rancher/rke-tools/blob/master/entrypoint.sh#L53
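For reference, the detection amounts to something like the following sketch. This mirrors the intent of the fix described above (prefer the systemd-resolved upstream file when it exists), not the exact code in the rke-tools entrypoint:

```shell
# Sketch: choose which resolv.conf to pass to kubelet.
# Default to the regular file; on hosts running systemd-resolved,
# /etc/resolv.conf points at the 127.0.0.53 stub resolver, which is
# unreachable from inside containers, so the upstream file is used instead.
RESOLV_CONF=/etc/resolv.conf
if [ -e /run/systemd/resolve/resolv.conf ]; then
    # systemd-resolved is active; use the file listing the real upstream servers
    RESOLV_CONF=/run/systemd/resolve/resolv.conf
fi
echo "--resolv-conf=${RESOLV_CONF}"
```

On a host without systemd-resolved this prints `--resolv-conf=/etc/resolv.conf`; on an Ubuntu 18.04 host with systemd-resolved running it prints `--resolv-conf=/run/systemd/resolve/resolv.conf`.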

So checking the kubelet process should be enough to validate it.

Also, if this were the issue, DNS would not work at all rather than fail sporadically. If you have DNS issues, please create an issue here https://github.com/rancher/rancher/issues/new and describe your setup and exactly what errors you see. We have troubleshooting steps available here https://rancher.com/docs/rancher/v2.x/en/troubleshooting/ that might help as a first attempt to nail down the cause.

Thanks for the reply.

Are you saying my assumption is correct, namely that if I see the kubelet arg "--resolv-conf=/etc/resolv.conf" on an Ubuntu 18.04 host that’s a member of a cluster, then Rancher isn’t properly detecting and/or accommodating systemd-resolved when it launches kubelet?

Your point that if this were the issue, DNS would be failing consistently is an interesting one; I had the same thought. The challenge is that I haven’t been seeing this problem consistently: most nodes behave properly, but on occasion I get nodes where the containers in my Kubernetes deployments have chronic DNS issues. I’ll try to pull together good data and repro steps and write up an issue.

The detection can be checked using:

ps -ef | grep kubelet | grep resolv-conf

which should result in a line ending in:

--resolv-conf=/etc/resolv.conf --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cgroup-driver=cgroupfs --resolv-conf=/run/systemd/resolve/resolv.conf

where you can see the added --resolv-conf=/run/systemd/resolve/resolv.conf at the end, which is appended if systemd-resolved is detected.
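Because the flag appears twice, it is worth noting that kubelet, like most Go flag parsers, takes the last occurrence of a repeated scalar flag, so the appended value is the one that takes effect. A quick way to extract the effective value from a command line like the one above (the sample string here is illustrative):

```shell
# Sample kubelet arg string, as it might appear in ps output (illustrative)
ARGS='--resolv-conf=/etc/resolv.conf --cgroup-driver=cgroupfs --resolv-conf=/run/systemd/resolve/resolv.conf'

# Split the args onto separate lines, keep only --resolv-conf flags,
# and take the last one, since the last occurrence wins
EFFECTIVE=$(printf '%s\n' $ARGS | grep '^--resolv-conf=' | tail -n 1)
echo "$EFFECTIVE"   # prints --resolv-conf=/run/systemd/resolve/resolv.conf
```

If this prints /etc/resolv.conf instead, the entrypoint did not append the systemd-resolved override.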