Kube-dns could not resolve anything (RKE Cluster)

Dmitry_Shultz · May 22, 2019, 5:58am

It seems I'm hitting either this issue, https://github.com/rancher/rancher/issues/16454, or this one: "musl dns client stop further search domain when one search domain return something unexpected" (alpine/aports issue #9017 on GitLab). The second would apply only if cattle-cluster-agent is based on an Alpine image.

details:

[user@myhost ~]$ k logs -f cattle-cluster-agent-6d6cfcdd87-znrxb -n=cattle-system

INFO: Environment: CATTLE_ADDRESS=10.42.1.9 CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-6d6cfcdd87-znrxb CATTLE_SERVER=https://some.host.com

INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local lan netgear.com options ndots:5
ERROR: https://some.host.com/ping is not accessible (Could not resolve host: some.host.com)

[user@myhost ~]$ ping some.host.com
PING some.host.com (xx.xxx.x.xxx) 56(84) bytes of data.
64 bytes from xx.xxx.x.xxx.bc.googleusercontent.com (xx.xxx.x.xxx): icmp_seq=1 ttl=59 time=17.8 ms
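To make the resolv.conf line above easier to read, here is a rough sketch of the lookup order that the search list and ndots:5 imply for some.host.com. It follows the glibc rules described in resolv.conf(5); musl (Alpine) handles search domains differently, so treat this as an illustration rather than what the agent actually does. The function name `candidates` is made up for this example.

```shell
#!/bin/sh
# candidates NAME NDOTS SEARCH_DOMAIN...
# Print, in order, the names a glibc-style resolver would try.
# Hypothetical helper for illustration; it performs no DNS queries.
candidates() {
    name=$1
    ndots=$2
    shift 2
    # Count the dots in the name; $(( )) strips any padding from wc.
    dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))
    # At least ndots dots: the name is tried as-is before the search list.
    if [ "$dots" -ge "$ndots" ]; then
        printf '%s\n' "$name"
    fi
    for domain in "$@"; do
        printf '%s.%s\n' "$name" "$domain"
    done
    # Fewer than ndots dots: the name as-is is tried last.
    if [ "$dots" -lt "$ndots" ]; then
        printf '%s\n' "$name"
    fi
}

# Values copied from the resolv.conf shown in the agent log.
candidates some.host.com 5 \
    cattle-system.svc.cluster.local svc.cluster.local cluster.local lan netgear.com
```

some.host.com has only two dots, so under these rules all five search domains (including lan and netgear.com) are tried before the bare name. That is the situation the Alpine issue describes: an unexpected answer for one of the suffixed names can end the search early.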

RKE version v0.2.2

UPDATE: my kube-dns doesn’t seem to be accessible:

$ kubectl run busybox --image=busybox:1.28 --rm -ti --restart=Never -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.

Address 1: 10.43.0.10

nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)

However:

$ k exec -it kube-dns-58bd5b8dd7-26xdf /bin/sh -n kube-system
Defaulting container name to kubedns.
Use 'kubectl describe pod/kube-dns-58bd5b8dd7-26xdf -n kube-system' to see all of the containers in this pod.
/ # cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
search netgear.com

/ # ping google.com
PING google.com (172.217.14.238): 56 data bytes
64 bytes from 172.217.14.238: seq=0 ttl=56 time=12.944 ms

The host OS is CentOS 7, and the network plugin deployed is Flannel.

UPDATE: It looks like flannel is not working properly. Here is some info:

$ ip route
default via 192.168.2.1 dev enp0s25 proto dhcp metric 100
10.42.0.0/24 via 10.42.0.0 dev flannel.1 onlink
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
10.42.2.0/24 via 10.42.2.0 dev flannel.1 onlink
10.42.3.0/24 dev cni0 proto kernel scope link src 10.42.3.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.2.0/24 dev enp0s25 proto kernel scope link src 192.168.2.229 metric 100
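For anyone comparing their own routing table with the one above, here is a small sketch that pulls out just the subnets routed through the overlay device. The helper name `overlay_subnets` is made up for this example, and it only parses `ip route` text; it does not test whether traffic actually passes.

```shell
#!/bin/sh
# overlay_subnets DEV
# Read `ip route` output on stdin and print the destination of every
# route that leaves through device DEV. Hypothetical helper for illustration.
overlay_subnets() {
    awk -v dev="$1" '{
        # Look for the token pair "dev <DEV>" anywhere on the line.
        for (i = 1; i < NF; i++)
            if ($i == "dev" && $(i + 1) == dev) print $1
    }'
}

# On a live node (device name taken from the output above):
#   ip route | overlay_subnets flannel.1
```

Fed the table above, this prints 10.42.0.0/24, 10.42.1.0/24 and 10.42.2.0/24, which I assume are the other nodes' pod subnets, while 10.42.3.0/24 stays local on cni0. So the routes themselves are present, and the routing table alone does not show what is broken.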

Not sure if it's relevant, but the Docker subnet (172.17.0.0/16) is completely different from the flannel subnets (10.42.x.0/24). Still puzzled; I may have to reinstall the cluster with a different network plugin.

For anybody who bumps into the same network-related complication, here is a fast way to check whether your Kubernetes network is down:

$ nmcli device status
DEVICE TYPE STATE CONNECTION
enp0s25 ethernet connected enp0s25
docker0 bridge connected docker0
flannel.1 vxlan disconnected --
lo loopback unmanaged --

When I try to connect the device that is disconnected (not sure why):

$ nmcli device connect flannel.1
Error: Failed to add/activate new connection: A 'vxlan' setting is required.

UPDATE: I redeployed the cluster with the Weave network plugin (and disabled SELinux just in case), and everything seems to be good now.
