[Solved] No DNS resolution after upgrading Kubernetes 1.13 to 1.14

Since I upgraded a cluster from Kubernetes 1.13 to 1.14 (via Global → Clusters → Edit Cluster → Kubernetes version), the containers in this cluster cannot do any DNS resolution anymore (external, internal, and DNS entries defined in service discovery).

I also see no DNS containers anymore in the kube-system namespace of this cluster.


(There should be a “kube-dns” service or, according to https://rancher.com/docs/rancher/v2.5/en/cluster-provisioning/rke-clusters/options/, a CoreDNS deployment.)

A downgrade to the previous Kubernetes version is no longer possible, at least not through the UI (unsupported).

I went through the DNS troubleshooting guide (https://rancher.com/docs/rancher/v2.5/en/troubleshooting/dns/), but it only applies when there are actually DNS containers in the kube-system namespace. As mentioned, these are completely gone.
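
For reference, the checks boil down to something like this (names and labels assume the RKE defaults, i.e. the k8s-app=kube-dns label and a ConfigMap called coredns):

# Any DNS pods or services left at all?
kubectl -n kube-system get pods,svc -l k8s-app=kube-dns
# The CoreDNS config and the RKE addon deploy jobs
kubectl -n kube-system get configmap coredns
kubectl -n kube-system get jobs | grep dns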

This is on Rancher 2.2.8, HA.
Any ideas?

Update: additional info using kubectl. It seems the coredns ConfigMap does not exist in that cluster:

ckadm@mintp ~/.kube $ kubectl get pods --all-namespaces | grep kube-system
kube-system      canal-fklc2                               2/2     Running            4          13d
kube-system      canal-fzbrp                               2/2     Running            4          13d
kube-system      canal-q2qgf                               2/2     Running            4          13d
kube-system      canal-v89g2                               2/2     Running            4          13d
kube-system      metrics-server-58bd5dd8d7-hc9rh           1/1     Running            0          72m

ckadm@mintp ~/.kube $ kubectl -n kube-system get configmap coredns -o go-template={{.data.Corefile}}
Error from server (NotFound): configmaps "coredns" not found

Please describe on which Rancher version the cluster was created (and with which exact Kubernetes version), and to which exact Kubernetes version it was upgraded.

The provisioning log from the rancher/rancher container where you ran the upgrade would show what happened during provisioning. The number of nodes and the roles per node are also helpful here.

Some more info about the setup:

  • Rancher 2.2.8 HA with 3 nodes (all roles) making up the “local” cluster
  • Cluster 1 (4 nodes, 3 with role “All”, 1 worker) with Kubernetes 1.13.10 → everything is still working here
  • Cluster 2 (4 nodes, 3 with role “All”, 1 worker) with Kubernetes 1.14.6 → that’s the affected cluster
  • Cluster 3 (3 nodes, all with role “All”) with Kubernetes 1.13.10 → everything is still working here

The affected cluster (Cluster 2) was initially created in Rancher v 2.1.x (2.1.6 I think but not certain!) with Kubernetes 1.11.3.
Rancher was upgraded to 2.2.2 a while ago, without changing the Kubernetes versions.
Rancher was upgraded to 2.2.8 recently. Afterwards, the Kubernetes version of that cluster was upgraded from 1.11.3 to 1.13.10. Everything still worked.
Today this cluster’s Kubernetes version was upgraded from 1.13.10 to 1.14.6, followed by a reboot of every cluster node, one after another.
Since then the workloads are failing and crashing (CrashLoopBackOff: Back-off 1m20s restarting failed container=service2 pod=service2-qw7kz_gamma(c57d3534-d3c9-11e9-823c-0050568d2805)). After debugging, we found out that this happens because DNS resolution no longer works (tested from within a container).
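
The in-container test was essentially the nslookup check from the Rancher DNS troubleshooting page; a minimal version with a throwaway pod (the pod name dnstest is arbitrary):

# busybox:1.28 still ships a working nslookup
kubectl run -it --rm --restart=Never dnstest --image=busybox:1.28 -- nslookup kubernetes.default
kubectl run -it --rm --restart=Never dnstest --image=busybox:1.28 -- nslookup www.google.com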

I just noticed that the namespace cattle-system seems to have problems, too:

cattle-cluster-agent reports “Containers with unready status: [cluster-register]” with 86 restarts so far.

cattle-node-agent containers are started.

There is no kube-api-auth workload (which exists in the other working clusters).
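
For anyone wanting to dig into the cattle-cluster-agent restarts, something like this should show the reason (assuming the default app=cattle-cluster-agent label in cattle-system):

# Why does cluster-register keep restarting?
kubectl -n cattle-system get pods
kubectl -n cattle-system logs -l app=cattle-cluster-agent --tail=50
kubectl -n cattle-system describe pod -l app=cattle-cluster-agent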

Since the upgrade this morning, a lot of errors are being logged, most of them from kubelet, indicating container restarts due to crashes.
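
For reference, the same kind of crash/back-off errors can also be listed with kubectl:

# Recent warning events across all namespaces, oldest first
kubectl get events --all-namespaces --field-selector type=Warning --sort-by=.lastTimestamp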

As I didn’t have anything to lose anymore (I had decided to rebuild the whole cluster if there was no foreseeable solution by tomorrow noon), I upgraded the cluster to Kubernetes 1.15.3 (marked as experimental) through the UI.

And to my big surprise, once the cluster was upgraded, the coredns workload appeared in the kube-system namespace!

ckadm@mintp ~/.kube $ kubectl get pods --all-namespaces | grep kube-system
kube-system      canal-2jf4g                               2/2     Running            0          2m56s
kube-system      canal-l8xvm                               2/2     Running            0          3m12s
kube-system      canal-rz972                               2/2     Running            2          3m35s
kube-system      canal-st25h                               2/2     Running            0          2m40s
kube-system      coredns-5678df9bcc-99xch                  1/1     Running            1          2m50s
kube-system      coredns-autoscaler-57bc9c9bd-c8sqw        1/1     Running            0          2m50s
kube-system      metrics-server-784769f887-6zz2s           1/1     Running            0          2m15s
kube-system      rke-coredns-addon-deploy-job-xbqjq        0/1     Completed          0          3m21s
kube-system      rke-metrics-addon-deploy-job-cmv8m        0/1     Completed          0          2m46s
kube-system      rke-network-plugin-deploy-job-297lv       0/1     Completed          0          3m37s

And I also saw that this time the service accounts were created (they did not exist in the cluster on Kubernetes 1.14):

ckadm@mintp ~/.kube $ kubectl get serviceAccounts --all-namespaces | grep dns
kube-system       coredns                              1         88s
kube-system       coredns-autoscaler                   1         88s

I went into the application containers and DNS resolution works now!
The logged errors have dropped to zero since the upgrade to Kubernetes 1.15.

So either there’s a bug in Kubernetes 1.14, or something inside Rancher 2.2.8 that drives the Kubernetes upgrade failed.

The cluster provisioning is logged in the rancher/rancher container; if you can supply that log, we can find out in which part of the provisioning things went wrong. Every upgrade is tested before it is released, and the provisioning logic that deploys coredns in a new cluster is the same as during an upgrade.

Can you please specify the exact container or image name? rancher/rancher does not exist in the cluster; there’s only rancher/rancher-agent.

If you mean rancher/rancher in the “local” (Rancher 2 itself) cluster, you can find the logs here:
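
(For completeness, the same log can also be pulled directly with kubectl against the local cluster; a sketch assuming the default Helm install, i.e. namespace cattle-system and label app=rancher:)

# Run against the local (Rancher) cluster, not the downstream cluster
kubectl -n cattle-system get pods -l app=rancher
# c-hpb7s is the cluster ID of the affected downstream cluster
kubectl -n cattle-system logs -l app=rancher --tail=1000 | grep 'cluster \[c-hpb7s\]'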

It looks like the provisioning got an error during the DNS task, but then didn’t fix it and just continued with the next task (metrics):

"September 10th 2019, 08:53:29.104","2019/09/10 06:53:29 [INFO] cluster [c-hpb7s] provisioning: [addons] Executing deploy job rke-network-plugin"
"September 10th 2019, 08:53:29.216","2019/09/10 06:53:29 [INFO] cluster [c-hpb7s] provisioning: [dns] removing DNS provider kube-dns"
"September 10th 2019, 08:54:29.562","2019/09/10 06:54:29 [ERROR] cluster [c-hpb7s] provisioning: Failed to deploy DNS addon execute job for provider coredns: Failed to get job complete status for job rke-kube-dns-addon-delete-job in namespace kube-system"
"September 10th 2019, 08:54:29.576","2019/09/10 06:54:29 [INFO] cluster [c-hpb7s] provisioning: [addons] Setting up Metrics Server"
"September 10th 2019, 08:54:29.592","2019/09/10 06:54:29 [INFO] cluster [c-hpb7s] provisioning: [addons] Saving ConfigMap for addon rke-metrics-addon to Kubernetes"
"September 10th 2019, 08:54:29.612","2019/09/10 06:54:29 [INFO] cluster [c-hpb7s] provisioning: [addons] Successfully saved ConfigMap for addon rke-metrics-addon to Kubernetes"

The cluster with Kubernetes 1.15 turned out to have problems because Ingress Rules did not create service entries in service discovery.

I created a completely new cluster with Kubernetes 1.14 through the UI, and coredns seems to have been deployed correctly; at least the container is running. However, after a few moments DNS resolution stops working. I followed the steps on https://rancher.com/docs/rancher/v2.x/en/troubleshooting/dns/ and this is the result:

ckadm@mintp ~ $ kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-bdffbc666-87l5r   1/1     Running   0          9m29s

ckadm@mintp ~ $ kubectl -n kube-system get svc -l k8s-app=kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP,9153/TCP   43h

ckadm@mintp ~ $ kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Address 1: 10.43.0.10

pod "busybox" deleted
pod default/busybox terminated (Error)

ckadm@mintp ~ $ kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com
If you don't see a command prompt, try pressing enter.
Address 1: 10.43.0.10

nslookup: can't resolve 'www.google.com'
pod "busybox" deleted
pod default/busybox terminated (Error)
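
The next steps from the same troubleshooting doc would be the CoreDNS logs and the kube-dns endpoints:

# Check CoreDNS logging and whether the kube-dns service has endpoints
kubectl -n kube-system logs -l k8s-app=kube-dns
kubectl -n kube-system get endpoints kube-dns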

@superseb I was able to reproduce this DNS problem with a completely new cluster.

  • Created a new Rancher managed cluster using default settings
  • Added the first node (role: all) and waited for the provisioning to finish
  • Deployed some workloads and tested DNS -> all working
  • Tested redeploying workloads, testing DNS again -> all working
  • Added a second node (role: all) and waited for the provisioning to finish
  • Tested redeploying workloads (some across all worker nodes), testing DNS again -> all working
  • Added a third node (role: etcd, control plane) and waited for the provisioning to finish
  • Tested redeploying workloads (some across all worker nodes), testing DNS again -> fail
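
To narrow down which node breaks resolution, the busybox test can be pinned to a specific node (the node name "node3" is a placeholder):

# Run the DNS test on one specific node only
kubectl run -it --rm --restart=Never dnstest --image=busybox:1.28 \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "node3"}}' -- nslookup www.google.com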

The third node is located in another location, connecting through a VPN tunnel. It basically serves as a tie-breaker in a split-brain situation between nodes 1 and 2.

I am currently checking with the firewall team whether something is blocked between nodes 1/2 and the third node. I saw that coredns added another port compared to kube-dns (tcp/9153). Maybe this port is blocked; to be verified. I will report back as soon as I know more.
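
A quick way to probe the known node-to-node TCP ports from nodes 1/2 towards the remote node (203.0.113.30 is a placeholder for node 3; udp/8472, the VXLAN overlay itself, cannot be verified with a plain TCP probe and needs something like the overlay network test from the Rancher troubleshooting docs):

# Check reachability of the documented TCP ports on the remote node
for port in 2379 2380 6443 10250; do nc -zv -w 3 203.0.113.30 $port; done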

Update: it really does look like additional ports are required since Kubernetes 1.14.
We had the following incoming rules allowed on the third (remote) node, and the cluster worked perfectly fine with 1.13:

  • tcp/2379
  • tcp/2380
  • tcp/6443
  • udp/8472
  • tcp/10250

I saw that on https://rancher.com/docs/rancher/v2.x/en/installation/references/ a lot of ports have been added recently. We haven’t kept our rules up to date.

For now we have allowed all ports bi-directionally between the Kubernetes cluster nodes, and since then the cluster works correctly, including DNS.
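
If we tighten the rules again later, the per-port variant would look roughly like this (firewalld as an example; the definitive list has to come from the port-requirements page linked above, and tcp/9153 for CoreDNS metrics is only my assumption from the earlier post):

# Example only: re-open the node-to-node ports individually instead of allowing everything
firewall-cmd --permanent --add-port=2379-2380/tcp   # etcd
firewall-cmd --permanent --add-port=6443/tcp        # kube-apiserver
firewall-cmd --permanent --add-port=8472/udp        # Canal/Flannel VXLAN overlay
firewall-cmd --permanent --add-port=9099/tcp        # Canal/Flannel health checks
firewall-cmd --permanent --add-port=10250/tcp       # kubelet
firewall-cmd --permanent --add-port=9153/tcp        # CoreDNS metrics (assumption)
firewall-cmd --reload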

This might also have been the cause of the failed cluster upgrade (1.13 to 1.14).