Kube-dns could not resolve anything (RKE Cluster)

It seems I'm hitting either "Rancher Cattle Cluster Agent Could not Resolve Host" (rancher/rancher issue #16454 on GitHub) or "musl dns client stop further search domain when one search domain return something unexpected" (alpine/aports issue #9017), the latter only if cattle-cluster-agent is based on an Alpine image.


[user@myhost ~]$ k logs -f cattle-cluster-agent-6d6cfcdd87-znrxb -n=cattle-system


INFO: Using resolv.conf: nameserver search cattle-system.svc.cluster.local svc.cluster.local cluster.local lan netgear.com options ndots:5
ERROR: https://some.host.com/ping is not accessible (Could not resolve host: some.host.com)

[user@myhost ~]$ ping some.host.com
PING some.host.com (xx.xxx.x.xxx) 56(84) bytes of data.
64 bytes from xx.xxx.x.xxx.bc.googleusercontent.com (xx.xxx.x.xxx): icmp_seq=1 ttl=59 time=17.8 ms
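
To see why the search list in that resolv.conf matters, here is a minimal sketch of how a glibc-style resolver orders its lookups, assuming the behaviour described in resolv.conf(5) for ndots. The function name candidates is made up for illustration, and musl (used by Alpine) does not follow exactly the same rules, so treat this as an approximation rather than what the agent actually does.

```shell
# candidates NAME NDOTS SEARCH_DOMAIN...
# Prints the fully qualified names a glibc-style resolver would try, in order.
# (Hypothetical helper for illustration; not part of any tool mentioned here.)
candidates() {
  name=$1
  ndots=$2
  shift 2

  # A trailing dot marks the name as already fully qualified.
  case $name in
    *.) printf '%s\n' "$name"; return 0 ;;
  esac

  # Count the dots in the name by deleting every other character.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}

  # Enough dots: the name is tried as-is before the search list.
  if [ "$dots" -ge "$ndots" ]; then
    printf '%s\n' "$name"
  fi

  # Each search domain is appended in the order it appears in resolv.conf.
  for domain in "$@"; do
    printf '%s.%s\n' "$name" "$domain"
  done

  # Too few dots: the name is tried as-is only after the search list.
  if [ "$dots" -lt "$ndots" ]; then
    printf '%s\n' "$name"
  fi
}
```

Running `candidates some.host.com 5 cattle-system.svc.cluster.local svc.cluster.local cluster.local lan netgear.com` lists the five search-domain variants first and the bare some.host.com last. Going by the title of the Alpine issue, an unexpected answer for one of the earlier candidates could stop the lookup before the bare name is ever tried.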

RKE version v0.2.2

UPDATE: my kube-dns doesn’t seem to be accessible:

$ kubectl run busybox --image=busybox:1.28 --rm -ti --restart=Never -- nslookup kubernetes.default
If you don’t see a command prompt, try pressing enter.

Address 1:

nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)
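
When the busybox lookup fails like this, one more thing worth checking is whether the DNS service has any ready endpoints behind it. The sketch below assumes the service is called kube-dns and lives in kube-system (the usual naming, but not confirmed for this cluster); the function name dns_endpoints_ready is hypothetical.

```shell
# Reports whether the kube-dns service has any ready pod IPs behind it.
# Assumption: the service is named "kube-dns" in the "kube-system" namespace.
dns_endpoints_ready() {
  # Ask the API server for the IPs of all ready endpoints of the service.
  ips=$(kubectl get endpoints kube-dns -n kube-system \
        -o jsonpath='{.subsets[*].addresses[*].ip}') || return 2

  if [ -z "$ips" ]; then
    echo "kube-dns has no ready endpoints"
    return 1
  fi

  echo "kube-dns endpoints: $ips"
}
```

An empty list points at the DNS pods themselves. A non-empty list combined with failing lookups suggests the pods are up but traffic between pods is not getting through, which would be a pod network problem rather than a DNS one.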


$ k exec -it kube-dns-58bd5b8dd7-26xdf /bin/sh -n kube-system
Defaulting container name to kubedns.
Use 'kubectl describe pod/kube-dns-58bd5b8dd7-26xdf -n kube-system' to see all of the containers in this pod.
/ # cat /etc/resolv.conf
search netgear.com

/ # ping google.com
PING google.com ( 56 data bytes
64 bytes from seq=0 ttl=56 time=12.944 ms

The host OS is CentOS 7, and the deployed network plugin is Flannel.

UPDATE: Looks like flannel is not working properly, here is some info:

$ ip route
default via dev enp0s25 proto dhcp metric 100
via dev flannel.1 onlink
via dev flannel.1 onlink
via dev flannel.1 onlink
dev cni0 proto kernel scope link src
dev docker0 proto kernel scope link src
dev enp0s25 proto kernel scope link src metric 100

Not sure if it's relevant, but the Docker subnet is completely different from the flannel one. Still puzzled; I may have to reinstall the cluster with a different network plugin.

For anybody who bumps into the same network-related problem, here is a quick way to check whether your k8s network is down:

$ nmcli device status
enp0s25 ethernet connected enp0s25
docker0 bridge connected docker0
flannel.1 vxlan disconnected --
lo loopback unmanaged --

When I try to connect the device, which is disconnected for reasons I don't understand:

$ nmcli device connect flannel.1
Error: Failed to add/activate new connection: A 'vxlan' setting is required.
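
The nmcli check above can be scripted so it only prints the overlay devices that look unhealthy. This is a rough sketch: the function name down_overlay_devices is made up, and it expects nmcli's terse output (`-t -f DEVICE,TYPE,STATE`) on stdin. NetworkManager's view of a device it did not create can differ from the kernel's, so a hit here is a hint to look closer, not proof that the overlay is down.

```shell
# Reads colon-separated DEVICE:TYPE:STATE lines on stdin and prints the name
# of every vxlan device whose state does not start with "connected".
# (Hypothetical helper; the input format is nmcli's terse mode.)
down_overlay_devices() {
  awk -F: '$2 == "vxlan" && $3 !~ /^connected/ { print $1 }'
}

# Usage on a cluster node:
#   nmcli -t -f DEVICE,TYPE,STATE device status | down_overlay_devices
```

On the node from this thread it would print flannel.1, since that is the only vxlan device and it is reported as disconnected.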

UPDATE: I redeployed the cluster with the Weave network plugin (and disabled SELinux just in case), and everything seems to be working now.