Since I upgraded a cluster from Kubernetes 1.13 to 1.14 (via Global → Clusters → Edit Cluster → Kubernetes version), the containers in this cluster cannot do any DNS resolution anymore (external, internal, and DNS entries defined in service discovery).
I also no longer see any DNS containers in the kube-system namespace of this cluster:
Please describe what Rancher version the cluster was created on (and with what exact Kubernetes version), and to what exact Kubernetes version it was upgraded.
The provisioning log from the rancher/rancher container where you performed the upgrade would show what happened during provisioning. The number of nodes and the roles per node are also helpful here.
Some more info about the setup:
Rancher 2.2.8 HA with 3 nodes (all roles) making up the “local” cluster with Kubernetes
Cluster 1 (4 nodes: 3 nodes with role “All”, 1 worker node) with Kubernetes 1.13.10 → everything is still working here
Cluster 2 (4 nodes: 3 nodes with role “All”, 1 worker node) with Kubernetes 1.14.6 → that’s the affected cluster
Cluster 3 (3 nodes, 3 nodes with role “All”) with Kubernetes 1.13.10 → everything is still working here
The affected cluster (Cluster 2) was initially created in Rancher v2.1.x (2.1.6 I think, but I'm not certain!) with Kubernetes 1.11.3.
Rancher was upgraded to 2.2.2 a while ago, without changing the Kubernetes versions.
Rancher was upgraded to 2.2.8 recently. Afterwards, the Kubernetes version of that cluster was upgraded from 1.11.3 to 1.13.10. Everything still worked.
Today this cluster’s Kubernetes version was upgraded from 1.13.10 to 1.14.6, followed by a reboot of every cluster node, one after another.
Since then the workloads are failing and crashing (CrashLoopBackOff: Back-off 1m20s restarting failed container=service2 pod=service2-qw7kz_gamma(c57d3534-d3c9-11e9-823c-0050568d2805)). After debugging we found out that this happens because DNS resolution no longer works at all (tested from within a container).
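For anyone hitting the same symptoms, the usual checks look roughly like this (pod and namespace names taken from the error above; exact output will vary):
kubectl -n gamma describe pod service2-qw7kz                  # shows the back-off events
kubectl -n gamma logs service2-qw7kz -c service2 --previous   # logs of the last crashed container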
I just noticed that the namespace cattle-system seems to have problems, too:
cattle-cluster-agent reports “Containers with unready status: [cluster-register]” with 86 restarts so far.
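A quick way to inspect that agent (a sketch; assumes the default deployment name cattle-cluster-agent in the cattle-system namespace):
kubectl -n cattle-system get pods
kubectl -n cattle-system logs deploy/cattle-cluster-agent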
As I didn’t have anything to lose anymore (I had decided to rebuild the whole cluster if there was no foreseeable solution by tomorrow noon), I upgraded the cluster to Kubernetes 1.15.3 (marked as experimental) through the UI.
And to my big surprise, once the cluster was upgraded, the coredns workload appeared in the kube-system namespace!
Cluster provisioning is logged in the rancher/rancher container; if you can supply that log, we can find out in which part of the provisioning things went wrong. Every upgrade is tested before it is released, and the provisioning logic that deploys coredns on a new cluster is the same as the one used during an upgrade.
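In case it helps, a sketch of how that log can be pulled (the app=rancher label is an assumption based on the standard Helm chart; a single-node Docker install would use docker logs on the rancher container instead):
kubectl -n cattle-system logs -l app=rancher --timestamps     # run against the local (Rancher) cluster
docker logs <rancher-container-id> 2>&1 | less                # single-node Docker install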
The cluster with Kubernetes 1.15 turned out to have problems because Ingress Rules did not create service entries in service discovery.
I created a completely new cluster with Kubernetes 1.14 through the UI, and coredns seems to have been deployed correctly; at least the container is running. However, after a few moments DNS resolution stops working. I followed the steps on https://rancher.com/docs/rancher/v2.x/en/troubleshooting/dns/ and this is the result:
ckadm@mintp ~ $ kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-bdffbc666-87l5r 1/1 Running 0 9m29s
ckadm@mintp ~ $ kubectl -n kube-system get svc -l k8s-app=kube-dns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 43h
ckadm@mintp ~ $ kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Address 1: 10.43.0.10
pod "busybox" deleted
pod default/busybox terminated (Error)
ckadm@mintp ~ $ kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com
If you don't see a command prompt, try pressing enter.
Address 1: 10.43.0.10
nslookup: can't resolve 'www.google.com'
pod "busybox" deleted
pod default/busybox terminated (Error)
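The follow-up checks I would look at next (same k8s-app=kube-dns label as above; a sketch, not definitive):
kubectl -n kube-system get endpoints kube-dns       # should list the coredns pod IP(s)
kubectl -n kube-system logs -l k8s-app=kube-dns     # coredns logs, look for errors or upstream timeouts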
@superseb I was able to reproduce this DNS problem with a completely new cluster.
Created a new Rancher managed cluster using default settings
Added the first node (role: all) and waited for the provisioning to finish
Deployed some workloads and tested DNS -> all working
Redeployed workloads and tested DNS again -> all working
Added a second node (role: all) and waited for the provisioning to finish
Redeployed workloads (some spread across all worker nodes) and tested DNS again -> all working
Added a third node (role: etcd, control plane) and waited for the provisioning to finish
Redeployed workloads (some spread across all worker nodes) and tested DNS again -> fail
The third node is located at another site, connected through a VPN tunnel. It basically serves as a tie-breaker in split-brain situations between nodes 1 and 2.
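To see whether the failures correlate with where the test pod lands, a pod can be pinned to a specific node (a sketch; <node-name> is a placeholder for a name from kubectl get nodes):
kubectl run -it --rm --restart=Never busybox-pinned --image=busybox:1.28 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<node-name>"}}' \
  -- nslookup www.google.com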
I am currently checking with the firewall team whether something is blocked between nodes 1/2 and the third node. I saw that coredns exposes an additional port compared to kube-dns (tcp/9153). Maybe this port is blocked (to be verified). I will report back as soon as I know more.
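A generic way to check whether a given TCP port is open between two nodes (a sketch; <third-node-ip> is a placeholder, and if the cluster uses the default canal/flannel VXLAN overlay, UDP 8472 between the nodes needs to be open as well):
nc -zv -w 3 <third-node-ip> 9153    # coredns metrics port
nc -zv -w 3 <third-node-ip> 10250   # kubelet port, also used node-to-node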
Update: It really does look like additional ports are required since Kubernetes 1.14.
We had the following incoming rules allowed on the third (remote) node, and the cluster worked perfectly fine with 1.13: