Cluster communication fails when nodes span class B and class C networks

Our production cluster runs fine on k8s 1.12.3-rancher1-1, with nodes in two different networks: 192.168.225.0/24 (2 nodes) and 172.30.0.0/24 (6 nodes).
When upgrading the cluster to any newer version of k8s (verified with 1.16.4-rancher1-1 and 1.17.5-rancher1-1), communication between nodes in these two networks fails.

To reproduce the issue, set up the following environment. It is not necessary to upgrade from 1.12.3; a clean install of any newer version produces the same result:

  • 3 VMs using “Ubuntu 16.04 LTS”
    • one VM: GATEWAY (172.30.0.1; 192.168.225.1) forwarding packets between the two networks and providing internet access
    • one VM: CORE01 (172.30.0.2) as etcd, controlplane and worker
    • one VM: FRONTEND01 (192.168.225.2) as worker

<cluster.yml>

nodes:

# frontend nodes
  - address: 192.168.225.2
    role:
      - worker
    hostname_override: frontend01
    labels:
      tier: frontend
      environment: Production
    user: deployuser
    ssh_key_path: ./frontend.key
    # note: for support of a key with a passphrase see https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent

# core nodes
  - address: 172.30.0.2
    role:
      - controlplane
      - etcd
      - worker
    hostname_override: core01
    labels:
      tier: core
      environment: Production
    user: deployuser
    ssh_key_path: ./backend.key
    # note: for support of a key with a passphrase see https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent

# Cluster Level Options
cluster_name: production
ignore_docker_version: false
kubernetes_version: "v1.16.4-rancher1-1"

# SSH Agent
ssh_agent_auth: false # use the rke built-in agent

# deploy an ingress controller on all nodes
ingress:
    provider: nginx
    options:
      server-tokens: false
      ssl-redirect: false

Firewall rules

host        rule
FRONTEND01  allow 8472/udp from 172.30.0.2
FRONTEND01  allow 10250/tcp from 172.30.0.2
FRONTEND01  allow ssh
CORE01      allow 6443/tcp from 192.168.225.2
CORE01      allow 8472/udp from 192.168.225.2
CORE01      allow ssh
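On Ubuntu the rules above could be realized with ufw, for example (a sketch; the hosts and addresses are the ones from this setup):

```shell
# On FRONTEND01 (192.168.225.2):
ufw allow proto udp from 172.30.0.2 to any port 8472    # VXLAN overlay (canal/flannel)
ufw allow proto tcp from 172.30.0.2 to any port 10250   # kubelet API
ufw allow ssh
ufw enable

# On CORE01 (172.30.0.2):
ufw allow proto tcp from 192.168.225.2 to any port 6443  # kube-apiserver
ufw allow proto udp from 192.168.225.2 to any port 8472  # VXLAN overlay (canal/flannel)
ufw allow ssh
ufw enable
```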
  • Deploy the cluster using rke (v1.0.8) and wait for it to be ready.
  • Launch a centos-pod on one of the nodes, e.g. CORE01
    kubectl run -it centos1 --rm --image=centos --restart=Never --overrides='{"apiVersion":"v1","spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchFields":[{"key":"metadata.name","operator":"In","values":["core01"]}]}]}}}}}' --kubeconfig kube_config_cluster.yml -- /bin/bash
  • ping your favourite external site
    for i in {1..100}; do ping -c 1 wikipedia.com; done

Notice that name resolution is very slow and often fails completely.
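The failure rate can also be counted rather than eyeballed, with a small variation of the loop above (run inside the centos1 pod; wikipedia.com is just the example target from the text):

```shell
# Probe the example target repeatedly and report the success rate.
# A failed name resolution makes ping exit non-zero, so it is counted as a failure.
TARGET=${TARGET:-wikipedia.com}
COUNT=${COUNT:-100}
ok=0
i=1
while [ "$i" -le "$COUNT" ]; do
  # -W 2: wait at most 2 seconds for a reply before giving up on this attempt
  ping -c 1 -W 2 "$TARGET" >/dev/null 2>&1 && ok=$((ok + 1))
  i=$((i + 1))
done
echo "$ok/$COUNT pings succeeded"
```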

  • Stop FRONTEND01 and wait for the cluster to recognize the lost node
  • ping again

Name resolution works fast and ping succeeds every time.

  • Reset all VMs and change the network configuration of GATEWAY and CORE01: put them in a 192.168.0.0/16 network segment (but not the same subnet as FRONTEND01!)
  • Deploy the cluster
  • ping some external site

Name resolution works fast and ping succeeds every time.

component version
OS Ubuntu 16.04.6
docker 19.03.1 (docker-ce, docker-ce-cli)
k8s 1.12.3-rancher1-1 (ok); 1.16.4-rancher1-1 (failed), 1.17.5-rancher1-1 (failed)
rke 1.0.8
kubectl 1.16.1

We face a similar problem. Some nodes of our cluster are located in a different network segment. Because of this issue we are stuck on k8s versions < “1.13.x-rancher1-1”.

Can anybody help?

If this is consistently breaking in a new k8s release, the auto-detection on kubelet start is probably different, or something in the CNI has changed. Please share the kubelet and CNI pod container logs from a working and a non-working version; that’s probably the fastest way to diagnose.

Hi Superseb,

thanks for the reply.

I couldn’t find any upload function here, and posting about 1.3 MB of logs directly is rather messy. Therefore you can find all logs here: https://www.magentacloud.de/share/o.h34r1izi

If you need more logs and information, just let me know.

Did you make any progress?

Quick update: I tested the scenario using the network providers flannel and canal. The behaviour doesn’t change - I’m still running into this issue.

The issue still exists on Ubuntu 18.04.4 with kernel 4.18 or 5.4, even without any firewall rules in place.
However, a default installation of CentOS 7 with firewall rules in place works perfectly fine.

The question is: what does CentOS do differently from Ubuntu LTS?
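One systematic way to hunt for that difference is to dump the network-related kernel settings on an Ubuntu host and a CentOS host and diff them (a diagnostic sketch, not a diagnosis; the per-distro file names are placeholders):

```shell
# On each host: dump and sort all net.* sysctls into a per-host file.
sysctl -a 2>/dev/null | grep '^net\.' | sort > "net-sysctls-$(hostname).txt"

# Copy both files to one machine and compare, e.g.:
#   diff net-sysctls-ubuntu.txt net-sysctls-centos.txt
# Distro defaults such as net.ipv4.conf.*.rp_filter commonly differ.
```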

@superseb: Do you have any thoughts on that?

So it doesn’t work on Ubuntu 16 and Ubuntu 18, but it works on CentOS 7? Can you share the output of docker info and lsmod from one of the Ubuntu hosts? And kubectl get pods -n kube-system from a working and a non-working cluster? The requested logs are not all CNI-related pod/container logs, but if all the pods/containers are Running and marked as Ready, it doesn’t matter as much.

Is there a cloud image you are using which I can use to reproduce?

Hi @superseb,

thanks for your response!

I’m afraid I don’t use cloud images. My VMs are minimal server installations to which I just added Docker and configured iptables rules. The iptables rules don’t matter anyway, as my test case also fails without them in place.

Please find the requested outputs below:

Ubuntu

kubectl get pods -n kube-system --kubeconfig kube_config_cluster.yml

NAME                                      READY   STATUS      RESTARTS   AGE
canal-klzfl                               1/2     Running     0          2m3s
canal-s5cbs                               2/2     Running     0          2m3s
coredns-7c5566588d-j8p6d                  0/1     Running     0          119s
coredns-7c5566588d-xsmnz                  1/1     Running     0          27s
coredns-autoscaler-65bfc8d47d-ndm62       1/1     Running     0          118s
metrics-server-6b55c64f86-xk5g4           1/1     Running     0          114s
rke-coredns-addon-deploy-job-z52fz        0/1     Completed   0          2m1s
rke-ingress-controller-deploy-job-k4xcv   0/1     Completed   0          111s
rke-metrics-addon-deploy-job-r22rd        0/1     Completed   0          116s
rke-network-plugin-deploy-job-sxbvv       0/1     Completed   0          2m11s

As you can see all pods are running just fine.

lsmod

Module                  Size  Used by
xt_statistic           16384  3
xt_set                 20480  2
ipt_rpfilter           16384  1
xt_multiport           16384  52
iptable_raw            16384  1
ip_set_hash_ip         32768  1
ip_set_hash_net        32768  2
ip_set                 40960  3 ip_set_hash_ip,xt_set,ip_set_hash_net
ip6table_nat           16384  0
nf_nat_ipv6            16384  1 ip6table_nat
ip_vs_sh               16384  0
ip_vs_wrr              16384  0
ip_vs_rr               16384  0
ip_vs                 151552  6 ip_vs_rr,ip_vs_sh,ip_vs_wrr
xt_comment             16384  222
veth                   16384  0
iptable_nat            16384  1
bridge                159744  0
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
nf_conntrack_netlink    40960  0
nfnetlink              16384  3 nf_conntrack_netlink,ip_set
xfrm_user              32768  1
xfrm_algo              16384  1 xfrm_user
aufs                  245760  0
overlay                94208  18
nls_iso8859_1          16384  1
ip6t_REJECT            16384  1
nf_reject_ipv6         16384  1 ip6t_REJECT
nf_log_ipv6            16384  5
xt_hl                  16384  22
ip6t_rt                16384  3
nf_conntrack_ipv6      20480  9
nf_defrag_ipv6         20480  1 nf_conntrack_ipv6
ipt_REJECT             16384  1
nf_reject_ipv4         16384  1 ipt_REJECT
nf_log_ipv4            16384  5
nf_log_common          16384  2 nf_log_ipv4,nf_log_ipv6
xt_LOG                 16384  10
hyperv_fb              20480  1
serio_raw              16384  0
xt_limit               16384  13
xt_tcpudp              16384  45
hv_balloon             24576  0
xt_addrtype            16384  5
joydev                 24576  0
xt_conntrack           16384  36
ip6table_filter        16384  1
ip6_tables             28672  54 ip6table_filter,ip6table_nat
nf_conntrack_netbios_ns    16384  0
nf_conntrack_broadcast    16384  1 nf_conntrack_netbios_ns
sch_fq_codel           20480  3
nf_nat_ftp             16384  0
nf_conntrack_ftp       20480  1 nf_nat_ftp
ib_iser                49152  0
rdma_cm                61440  1 ib_iser
iw_cm                  45056  1 rdma_cm
iptable_filter         16384  1
bpfilter               16384  0
ib_cm                  53248  1 rdma_cm
ib_core               233472  4 rdma_cm,iw_cm,ib_iser,ib_cm
iscsi_tcp              20480  0
libiscsi_tcp           20480  1 iscsi_tcp
libiscsi               53248  3 libiscsi_tcp,iscsi_tcp,ib_iser
scsi_transport_iscsi    98304  3 iscsi_tcp,ib_iser,libiscsi
xt_nat                 16384  9
xt_mark                16384  45
iptable_mangle         16384  1
ipt_MASQUERADE         16384  4
nf_conntrack_ipv4      16384  42
nf_defrag_ipv4         16384  1 nf_conntrack_ipv4
nf_nat_ipv4            16384  2 ipt_MASQUERADE,iptable_nat
nf_nat                 32768  4 nf_nat_ftp,nf_nat_ipv6,nf_nat_ipv4,xt_nat
nf_conntrack          131072  14 xt_conntrack,nf_conntrack_ipv6,nf_conntrack_ipv4,nf_nat,nf_nat_ftp,nf_nat_ipv6,ipt_MASQUERADE,nf_conntrack_netbios_ns,nf_nat_ipv4,xt_nat,nf_conntrack_broadcast,nf_conntrack_netlink,nf_conntrack_ftp,ip_vs
vxlan                  57344  0
ip6_udp_tunnel         16384  1 vxlan
udp_tunnel             16384  1 vxlan
ip_tables              28672  12 iptable_filter,iptable_raw,iptable_nat,iptable_mangle
x_tables               40960  23 ip6table_filter,xt_conntrack,xt_statistic,iptable_filter,xt_LOG,xt_multiport,xt_tcpudp,ipt_MASQUERADE,xt_addrtype,xt_nat,ip6t_rt,xt_comment,xt_set,ip6_tables,ipt_REJECT,ipt_rpfilter,iptable_raw,ip_tables,xt_limit,xt_hl,ip6t_REJECT,iptable_mangle,xt_mark
autofs4                40960  2
btrfs                1163264  0
zstd_compress         163840  1 btrfs
raid10                 53248  0
raid456               151552  0
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
xor                    24576  2 async_xor,btrfs
raid6_pq              114688  4 async_pq,btrfs,raid456,async_raid6_recov
libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,raid456,ip_vs
raid1                  40960  0
raid0                  20480  0
multipath              16384  0
linear                 16384  0
crct10dif_pclmul       16384  0
crc32_pclmul           16384  0
hid_generic            16384  0
ghash_clmulni_intel    16384  0
pcbc                   16384  0
hv_netvsc              73728  0
hyperv_keyboard        16384  0
hid_hyperv             16384  0
hv_storvsc             20480  3
hid                   122880  2 hid_hyperv,hid_generic
scsi_transport_fc      57344  1 hv_storvsc
hv_utils               28672  0
hv_vmbus               90112  7 hv_balloon,hv_utils,hv_netvsc,hid_hyperv,hv_storvsc,hyperv_keyboard,hyperv_fb
aesni_intel           200704  0
aes_x86_64             20480  1 aesni_intel
crypto_simd            16384  1 aesni_intel
cryptd                 24576  3 crypto_simd,ghash_clmulni_intel,aesni_intel
glue_helper            16384  1 aesni_intel

docker info

Client:
 Debug Mode: false

Server:
 Containers: 30
  Running: 18
  Paused: 0
  Stopped: 12
 Images: 12
 Server Version: 19.03.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.18.0-25-generic
 Operating System: Ubuntu 18.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 6.317GiB
 Name: backend
 ID: KIU5:WIM7:GKYD:4ETC:33Z7:YBNI:CGH2:KHBL:POXE:Y23N:3YCL:HV46
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

CentOS7

kubectl get pods -n kube-system --kubeconfig kube_config_cluster.yml

NAME                                      READY   STATUS      RESTARTS   AGE
canal-dpdfk                               2/2     Running     0          2m35s
canal-wdkz2                               2/2     Running     0          2m35s
coredns-7c5566588d-hjjlh                  1/1     Running     0          2m30s
coredns-7c5566588d-r9r7s                  1/1     Running     0          54s
coredns-autoscaler-65bfc8d47d-hxhj4       1/1     Running     0          2m29s
metrics-server-6b55c64f86-42krb           1/1     Running     0          2m25s
rke-coredns-addon-deploy-job-5ws7d        0/1     Completed   0          2m31s
rke-ingress-controller-deploy-job-28l7z   0/1     Completed   0          2m21s
rke-metrics-addon-deploy-job-g24c7        0/1     Completed   0          2m26s
rke-network-plugin-deploy-job-zxp6g       0/1     Completed   0          2m41s

lsmod

Module                  Size  Used by
xt_statistic           12601  3
xt_set                 18141  2
ipt_rpfilter           12606  1
xt_multiport           12798  53
iptable_raw            12678  1
ip_set_hash_ip         31658  1
ip_set_hash_net        36021  2
ip_set                 45799  3 ip_set_hash_net,ip_set_hash_ip,xt_set
ip6table_nat           12864  0
nf_nat_ipv6            14131  1 ip6table_nat
xt_nat                 12681  9
iptable_mangle         12695  1
ipt_MASQUERADE         12678  4
nf_nat_masquerade_ipv4    13463  1 ipt_MASQUERADE
xt_comment             12504  254
xt_mark                12563  55
veth                   13458  0
iptable_nat            12875  1
nf_nat_ipv4            14115  1 iptable_nat
bridge                151336  0
stp                    12976  1 bridge
llc                    14552  2 stp,bridge
nf_conntrack_netlink    36396  0
nfnetlink              14519  3 ip_set,nf_conntrack_netlink
overlay                91659  20
ip6t_REJECT            12625  1
nf_reject_ipv6         13717  1 ip6t_REJECT
nf_log_ipv6            12726  5
xt_hl                  12521  22
ip6t_rt                13537  3
nf_conntrack_ipv6      18935  9
nf_defrag_ipv6         35104  1 nf_conntrack_ipv6
ipt_REJECT             12541  1
nf_reject_ipv4         13373  1 ipt_REJECT
nf_log_ipv4            12767  5
nf_log_common          13317  2 nf_log_ipv4,nf_log_ipv6
xt_LOG                 12690  10
xt_limit               12711  13
xt_addrtype            12676  5
nf_conntrack_ipv4      15053  33
nf_defrag_ipv4         12729  1 nf_conntrack_ipv4
xt_conntrack           12760  40
ip6table_filter        12815  1
ip6_tables             26912  2 ip6table_filter,ip6table_nat
nf_conntrack_netbios_ns    12665  0
nf_conntrack_broadcast    12589  1 nf_conntrack_netbios_ns
nf_nat_ftp             12809  0
nf_nat                 26583  5 nf_nat_ftp,nf_nat_ipv4,nf_nat_ipv6,xt_nat,nf_nat_masquerade_ipv4
nf_conntrack_ftp       18478  1 nf_nat_ftp
nf_conntrack          139264  12 nf_nat_ftp,nf_conntrack_netbios_ns,nf_nat,nf_nat_ipv4,nf_nat_ipv6,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_netlink,nf_conntrack_broadcast,nf_conntrack_ftp,nf_conntrack_ipv4,nf_conntrack_ipv6
iptable_filter         12810  1
vfat                   17461  1
fat                    65950  1 vfat
crc32_pclmul           13133  0
ghash_clmulni_intel    13273  0
sg                     40719  0
joydev                 17389  0
aesni_intel           189456  0
hv_utils               25808  2
lrw                    13286  1 aesni_intel
ptp                    19231  1 hv_utils
gf128mul               15139  1 lrw
glue_helper            13990  1 aesni_intel
pps_core               19057  1 ptp
hv_balloon             22858  0
ablk_helper            13597  1 aesni_intel
cryptd                 21190  3 ghash_clmulni_intel,aesni_intel,ablk_helper
pcspkr                 12718  0
vxlan                  53857  0
ip6_udp_tunnel         12755  1 vxlan
udp_tunnel             14423  1 vxlan
ip_tables              27126  4 iptable_filter,iptable_mangle,iptable_nat,iptable_raw
xfs                   997681  2
libcrc32c              12644  3 xfs,nf_nat,nf_conntrack
sd_mod                 46281  4
crc_t10dif             12912  1 sd_mod
sr_mod                 22416  0
cdrom                  42600  1 sr_mod
crct10dif_generic      12647  0
crct10dif_pclmul       14307  1
crct10dif_common       12595  3 crct10dif_pclmul,crct10dif_generic,crc_t10dif
hv_storvsc             22546  3
hyperv_fb              17798  1
serio_raw              13434  0
scsi_transport_fc      64007  1 hv_storvsc
hv_netvsc              50527  0
hyperv_keyboard        12787  0
hid_hyperv             13118  0
crc32c_intel           22094  1
scsi_tgt               20027  1 scsi_transport_fc
hv_vmbus               96714  7 hv_balloon,hyperv_keyboard,hv_netvsc,hid_hyperv,hv_utils,hyperv_fb,hv_storvsc
dm_mirror              22289  0
dm_region_hash         20813  1 dm_mirror
dm_log                 18411  2 dm_region_hash,dm_mirror
dm_mod                124501  8 dm_log,dm_mirror

docker info

Client:
 Debug Mode: false

Server:
 Containers: 32
  Running: 20
  Paused: 0
  Stopped: 12
 Images: 13
 Server Version: 19.03.12
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1127.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 6.395GiB
 Name: backend
 ID: P742:4VA2:GDC4:P3NT:HEZB:GLKK:ML7F:6NMU:LELL:MCTH:L3P3:DFGD
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
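As an aside, both docker info outputs above warn that bridge-nf-call-iptables is disabled. Since the CentOS host works anyway, this is probably not the culprit here, but Kubernetes generally expects these sysctls to be enabled; a sketch for checking and enabling them persistently:

```shell
# Load br_netfilter (it provides the bridge-nf-call-* sysctls) and enable
# iptables processing for bridged traffic, persisting across reboots.
modprobe br_netfilter
echo br_netfilter > /etc/modules-load.d/br_netfilter.conf
cat <<'EOF' > /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
sysctl --system
```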

Hi @superseb,

instead of pinging some external site, I changed my test to verify functionality: curl -k https://kubernetes sometimes succeeds and sometimes fails with curl: (6) Could not resolve host: kubernetes, depending on which CoreDNS instance is queried. If it’s the local one, name resolution succeeds; if it’s the one running on the other node, it fails.
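One way to make the local-vs-remote distinction explicit is to bypass the service VIP and query each CoreDNS pod directly (a sketch; the `k8s-app=kube-dns` label is assumed from the default CoreDNS manifest RKE deploys):

```shell
# List the CoreDNS pod IPs together with the nodes they run on.
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide \
  --kubeconfig kube_config_cluster.yml

# Then, from inside the test pod, query each instance by its pod IP:
#   nslookup kubernetes.default.svc.cluster.local <pod-ip>
# The instance running on the remote node should be the one that times out.
```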

Please find below some more CoreDNS logs and a tcpdump trace of the communication.

BACKEND (core01)

10:27:29.837759 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 10649+ A? kubernetes.default.svc.cluster.local. (54)
10:27:29.837772 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 59807+ AAAA? kubernetes.default.svc.cluster.local. (54)
10:27:29.838427 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 10649*- 1/0/0 A 10.43.0.1 (106)
10:27:29.838445 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 59807*- 0/1/0 (147)
10:27:34.840202 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 10649+ A? kubernetes.default.svc.cluster.local. (54)
10:27:34.840243 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 59807+ AAAA? kubernetes.default.svc.cluster.local. (54)
10:27:34.840894 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 10649*- 1/0/0 A 10.43.0.1 (106)
10:27:34.841078 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 59807*- 0/1/0 (147)
10:27:39.845369 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 23883+ A? kubernetes. (28)
10:27:39.845404 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 25938+ AAAA? kubernetes. (28)
10:27:39.861660 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 23883 NXDomain 0/1/0 (103)
10:27:39.865593 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 25938 NXDomain 0/1/0 (103)
10:27:44.850484 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 23883+ A? kubernetes. (28)
10:27:44.850517 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 25938+ AAAA? kubernetes. (28)
10:27:44.851205 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 23883 NXDomain* 0/1/0 (103)
10:27:44.851232 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 25938 NXDomain* 0/1/0 (103)

FRONTEND

10:27:29.839175 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 10649+ A? kubernetes.default.svc.cluster.local. (54)
10:27:29.839221 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 59807+ AAAA? kubernetes.default.svc.cluster.local. (54)
10:27:29.839453 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 10649*- 1/0/0 A 10.43.0.1 (106)
10:27:29.839549 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 59807*- 0/1/0 (147)
10:27:34.841637 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 10649+ A? kubernetes.default.svc.cluster.local. (54)
10:27:34.841680 IP 172.30.0.2.47905 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.40100 > 10.42.1.2.domain: 59807+ AAAA? kubernetes.default.svc.cluster.local. (54)
10:27:34.841963 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 10649*- 1/0/0 A 10.43.0.1 (106)
10:27:34.842147 IP 192.168.225.2.34713 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.40100: 59807*- 0/1/0 (147)
10:27:39.846767 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 23883+ A? kubernetes. (28)
10:27:39.846811 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 25938+ AAAA? kubernetes. (28)
10:27:39.862540 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 23883 NXDomain 0/1/0 (103)
10:27:39.866623 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 25938 NXDomain 0/1/0 (103)
10:27:44.851909 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 23883+ A? kubernetes. (28)
10:27:44.851953 IP 172.30.0.2.34582 > 192.168.225.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.0.7.44551 > 10.42.1.2.domain: 25938+ AAAA? kubernetes. (28)
10:27:44.852138 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 23883 NXDomain* 0/1/0 (103)
10:27:44.852219 IP 192.168.225.2.24000 > 172.30.0.2.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.1.2.domain > 10.42.0.7.44551: 25938 NXDomain* 0/1/0 (103)
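Note that identical query IDs repeat at 5-second intervals in the traces above, which matches the resolver's default retransmission timeout; this suggests the replies never reach the pod even though they show up on the host interfaces. The interval can be read straight off the capture timestamps, e.g.:

```shell
# Two capture timestamps of the same query id (10649), taken from the
# BACKEND trace above; compute the interval between the retransmissions.
trace='10:27:29.837759 10649
10:27:34.840202 10649'
echo "$trace" | awk '{
  split($1, t, ":")                      # hh:mm:ss.frac -> seconds since midnight
  s = t[1] * 3600 + t[2] * 60 + t[3]
  if (n++) printf "retry after %.1fs\n", s - prev
  prev = s
}'
# → retry after 5.0s
```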

coreDNS

[INFO] 10.42.0.7:40100 - 10649 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000162801s
[INFO] 10.42.0.7:40100 - 59807 "AAAA IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 147 0.000258501s
[INFO] 10.42.0.7:40100 - 10649 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.0002726s
[INFO] 10.42.0.7:40100 - 59807 "AAAA IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 147 0.000397001s
[INFO] 10.42.0.7:44551 - 23883 "A IN kubernetes. udp 28 false 512" NXDOMAIN qr,rd,ra 103 0.015610524s
[INFO] 10.42.0.7:44551 - 25938 "AAAA IN kubernetes. udp 28 false 512" NXDOMAIN qr,rd,ra 103 0.019523331s
[INFO] 10.42.0.7:44551 - 23883 "A IN kubernetes. udp 28 false 512" NXDOMAIN qr,aa,rd,ra 103 0.0001025s
[INFO] 10.42.0.7:44551 - 25938 "AAAA IN kubernetes. udp 28 false 512" NXDOMAIN qr,aa,rd,ra 103 0.0001603s

coreDNS (successfully querying the local instance)

[INFO] 10.42.0.7:60763 - 55668 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000145701s
[INFO] 10.42.0.7:60763 - 2427 "AAAA IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 147 0.0000924s

coreDNS configmap

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . "/etc/resolv.conf"
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"Corefile":".:53 {\n    errors\n    health {\n      lameduck 5s\n    }\n    ready\n    kubernetes cluster.local in-addr.arpa ip6.arpa {\n      pods insecure\n      fallthrough in-addr.arpa ip6.arpa\n    }\n    prometheus :9153\n    forward . \"/etc/resolv.conf\"\n    cache 30\n    loop\n    reload\n    loadbalance\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"coredns","namespace":"kube-system"}}
  creationTimestamp: "2020-08-05T07:38:55Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "7177"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 63cbe102-c521-433f-81b2-3627b4e7a36e

Using kube-dns instead of coreDNS doesn’t make a difference. The issue remains.

READY 1/2 is not running fine; that pod is failing to become ready. Please share a kubectl describe pod and the logs from both containers. As a quick test, can you allow TCP port 9099 on the firewall? (Not sure if this changed between the versions mentioned, but it’s something to try.)

Sorry, I screwed up there and didn’t wait long enough for all services to start correctly. My bad.

Now all services are running fine but curl -k https://kubernetes keeps failing when the remote DNS is queried.

NAME                                      READY   STATUS      RESTARTS   AGE
canal-n86tr                               2/2     Running     0          7m26s
canal-x942n                               2/2     Running     0          7m26s
coredns-7c5566588d-5xhcb                  1/1     Running     0          7m22s
coredns-7c5566588d-fn5mp                  1/1     Running     0          5m32s
coredns-autoscaler-65bfc8d47d-68hwj       1/1     Running     0          7m21s
metrics-server-6b55c64f86-dw8w7           1/1     Running     0          7m17s
rke-coredns-addon-deploy-job-l4c2b        0/1     Completed   0          7m23s
rke-ingress-controller-deploy-job-wvsms   0/1     Completed   0          7m13s
rke-metrics-addon-deploy-job-bbpc6        0/1     Completed   0          7m18s
rke-network-plugin-deploy-job-jt2hx       0/1     Completed   0          7m33s