Kube-dns could not resolve anything (RKE Cluster)

It seems I’m hitting either rancher/rancher issue #16454 ("Rancher Cattle Cluster Agent Could not Resolve Host") or alpine/aports issue #9017 ("musl dns client stop further search domain when one search domain return something unexpected"). The second one applies only if cattle-cluster-agent is based on an Alpine image.

Details:

[user@myhost ~]$ k logs -f cattle-cluster-agent-6d6cfcdd87-znrxb -n=cattle-system

INFO: Environment: CATTLE_ADDRESS=10.42.1.9 CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-6d6cfcdd87-znrxb CATTLE_SERVER=https://some.host.com

INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local lan netgear.com options ndots:5
ERROR: https://some.host.com/ping is not accessible (Could not resolve host: some.host.com)

[user@myhost ~]$ ping some.host.com
PING some.host.com (xx.xxx.x.xxx) 56(84) bytes of data.
64 bytes from xx.xxx.x.xxx.bc.googleusercontent.com (xx.xxx.x.xxx): icmp_seq=1 ttl=59 time=17.8 ms

RKE version v0.2.2
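
To make the resolv.conf from the agent log easier to reason about, here is a minimal sketch of how a search list combined with `ndots:5` expands a lookup. The function name `expand_query` is hypothetical, and the model is deliberately simplified: it covers only the common case where a name with fewer dots than `ndots` is tried against each search domain before being tried as-is. Real resolvers (glibc, musl) differ in their fallback behavior, which is not modeled here.

```shell
# Print, in order, the names a resolver would try for a query.
# Simplified model; real resolver fallbacks are not covered.
# Usage: expand_query NAME NDOTS SEARCH_DOMAIN...
expand_query() {
  name=$1
  ndots=$2
  shift 2

  # A trailing dot marks the name as fully qualified: no search list.
  case "$name" in
    *.) printf '%s\n' "$name"; return 0 ;;
  esac

  # Count the dots by deleting every other character.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}

  # Enough dots: the name is tried as-is first. What happens after a
  # failure differs between resolvers and is not modeled here.
  if [ "$dots" -ge "$ndots" ]; then
    printf '%s\n' "$name"
    return 0
  fi

  # Fewer dots than ndots: each search domain is appended first,
  # and the bare name comes last.
  for domain in "$@"; do
    printf '%s.%s\n' "$name" "$domain"
  done
  printf '%s\n' "$name"
}

expand_query some.host.com 5 \
  cattle-system.svc.cluster.local svc.cluster.local cluster.local lan netgear.com
```

With the values from the log, `some.host.com` has only two dots, so the five search-domain variants (ending with `some.host.com.netgear.com`) are listed before the bare `some.host.com`. If the resolver gives up on one of the earlier variants, as the Alpine issue describes, the bare name would never be queried.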

UPDATE: my kube-dns doesn’t seem to be accessible:

$ kubectl run busybox --image=busybox:1.28 --rm -ti --restart=Never -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.

Address 1: 10.43.0.10

nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)

However:

$ k exec -it kube-dns-58bd5b8dd7-26xdf /bin/sh -n kube-system
Defaulting container name to kubedns.
Use 'kubectl describe pod/kube-dns-58bd5b8dd7-26xdf -n kube-system' to see all of the containers in this pod.
/ # cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
search netgear.com

/ # ping google.com
PING google.com (172.217.14.238): 56 data bytes
64 bytes from 172.217.14.238: seq=0 ttl=56 time=12.944 ms

The host OS is CentOS 7, and the deployed network plugin is Flannel.

UPDATE: It looks like Flannel is not working properly. Here is some info:

$ ip route
default via 192.168.2.1 dev enp0s25 proto dhcp metric 100
10.42.0.0/24 via 10.42.0.0 dev flannel.1 onlink
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
10.42.2.0/24 via 10.42.2.0 dev flannel.1 onlink
10.42.3.0/24 dev cni0 proto kernel scope link src 10.42.3.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.2.0/24 dev enp0s25 proto kernel scope link src 192.168.2.229 metric 100

Not sure if it’s relevant, but the Docker subnet (172.17.0.0/16) is completely different from the Flannel subnets (10.42.x.0/24). I’m still puzzled and may have to reinstall the cluster with a different network plugin.

For anybody who runs into the same network problem, here is a quick way to check whether your Kubernetes network is down:

$ nmcli device status
DEVICE TYPE STATE CONNECTION
enp0s25 ethernet connected enp0s25
docker0 bridge connected docker0
flannel.1 vxlan disconnected --
lo loopback unmanaged --

When I try to connect the disconnected device (not sure why it is disconnected):

$ nmcli device connect flannel.1
Error: Failed to add/activate new connection: A 'vxlan' setting is required.
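
The nmcli check above can be scripted so it is easy to repeat on every node. This is only a sketch: the function name `unconnected_vxlan` is hypothetical, and it assumes the default tabular layout of `nmcli device status` (DEVICE, TYPE, STATE, CONNECTION columns, as in the output shown earlier). It reports what NetworkManager thinks of the device, which is one signal among several; `ip -d link show flannel.1` shows the kernel's view of the same interface.

```shell
# Read `nmcli device status` output on stdin and print the name of every
# vxlan device whose STATE column is not "connected".
# The first line is skipped because it is the column header.
unconnected_vxlan() {
  awk 'NR > 1 && $2 == "vxlan" && $3 != "connected" { print $1 }'
}

# Typical use on a node (requires NetworkManager):
#   nmcli device status | unconnected_vxlan
```

Fed the output shown above, the function prints `flannel.1`. It prints nothing when every vxlan device is reported as connected.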

UPDATE: I redeployed the cluster with the Weave network plugin (and disabled SELinux, just in case), and everything seems to be working now.