Etcd cluster becomes unhealthy


I have deployed a Kubernetes cluster using Rancher 1.6.25.

Everything comes up fine, and the etcd cluster (a 3-node cluster) is healthy:

root@e46cc2c6d07d:/opt/rancher# etcdctl cluster-health
member 4d86f9c7df30ee86 is healthy: got healthy result from https://kubernetes-etcd-1:2379
member 83cf4246c8e69b21 is healthy: got healthy result from https://kubernetes-etcd-3:2379
member d5dbb2d2eacdec65 is healthy: got healthy result from https://kubernetes-etcd-2:2379
cluster is healthy
root@e46cc2c6d07d:/opt/rancher#
root@e46cc2c6d07d:/opt/rancher#
root@e46cc2c6d07d:/opt/rancher# ping kubernetes-etcd-1
PING kubernetes-etcd-1.rancher.internal (10.42.198.89) 56(84) bytes of data.
64 bytes from e46cc2c6d07d (10.42.198.89): icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from e46cc2c6d07d (10.42.198.89): icmp_seq=2 ttl=64 time=0.025 ms
^C
--- kubernetes-etcd-1.rancher.internal ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.022/0.023/0.025/0.005 ms
root@e46cc2c6d07d:/opt/rancher# ping kubernetes-etcd-2
PING kubernetes-etcd-2.rancher.internal (10.42.19.155) 56(84) bytes of data.
64 bytes from 10.42.19.155: icmp_seq=1 ttl=62 time=0.512 ms
64 bytes from 10.42.19.155: icmp_seq=2 ttl=62 time=0.445 ms
^C
--- kubernetes-etcd-2.rancher.internal ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.445/0.478/0.512/0.040 ms
root@e46cc2c6d07d:/opt/rancher# ping kubernetes-etcd-3
PING kubernetes-etcd-3.rancher.internal (10.42.250.133) 56(84) bytes of data.
64 bytes from 10.42.250.133: icmp_seq=1 ttl=62 time=0.282 ms
64 bytes from 10.42.250.133: icmp_seq=2 ttl=62 time=0.453 ms
64 bytes from 10.42.250.133: icmp_seq=3 ttl=62 time=0.285 ms
^C
--- kubernetes-etcd-3.rancher.internal ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.282/0.340/0.453/0.079 ms
root@e46cc2c6d07d:/opt/rancher#

However, over time the etcd cluster becomes unhealthy: the nodes are no longer able to communicate with each other. From node 1, I cannot ping node 2 or node 3.

root@e46cc2c6d07d:/opt/rancher# etcdctl cluster-health
member 4d86f9c7df30ee86 is unhealthy: got unhealthy result from https://kubernetes-etcd-1:2379
failed to check the health of member 83cf4246c8e69b21 on https://kubernetes-etcd-3:2379: Get https://kubernetes-etcd-3:2379/health: dial tcp 10.42.250.133:2379: i/o timeout
member 83cf4246c8e69b21 is unreachable: [https://kubernetes-etcd-3:2379] are all unreachable
failed to check the health of member d5dbb2d2eacdec65 on https://kubernetes-etcd-2:2379: Get https://kubernetes-etcd-2:2379/health: dial tcp 10.42.19.155:2379: i/o timeout
member d5dbb2d2eacdec65 is unreachable: [https://kubernetes-etcd-2:2379] are all unreachable
cluster is unhealthy
root@e46cc2c6d07d:/opt/rancher# ping kubernetes-etcd-1
PING kubernetes-etcd-1.rancher.internal (10.42.198.89) 56(84) bytes of data.
64 bytes from e46cc2c6d07d (10.42.198.89): icmp_seq=1 ttl=64 time=0.039 ms
64 bytes from e46cc2c6d07d (10.42.198.89): icmp_seq=2 ttl=64 time=0.020 ms
^C
--- kubernetes-etcd-1.rancher.internal ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.020/0.029/0.039/0.010 ms
root@e46cc2c6d07d:/opt/rancher# ping kubernetes-etcd-2
PING kubernetes-etcd-2.rancher.internal (10.42.19.155) 56(84) bytes of data.
^C
--- kubernetes-etcd-2.rancher.internal ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6047ms

root@e46cc2c6d07d:/opt/rancher# ping kubernetes-etcd-3
PING kubernetes-etcd-3.rancher.internal (10.42.250.133) 56(84) bytes of data.
^C
--- kubernetes-etcd-3.rancher.internal ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 5999ms

root@e46cc2c6d07d:/opt/rancher#

If I reboot the host VMs, the etcd cluster comes back healthy and the nodes are able to communicate with each other. But over time (after a few hours) it becomes unhealthy again. Any suggestions on what I can look into to try and figure out why this is happening?
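One thing that might help narrow down the failure window is logging cluster health on a schedule so the first failure can be correlated with host or overlay-network events. A minimal sketch, assuming a POSIX shell inside the etcd container; `count_bad_members` is an illustrative helper, not an etcdctl feature, and the sample text is copied from the failing output above:

```shell
#!/bin/sh
# Count unhealthy/unreachable members in `etcdctl cluster-health` output
# so a periodic job (e.g. cron) can timestamp the first failure.
count_bad_members() {
  # Member lines start with "member"; this skips the
  # "cluster is unhealthy" summary line.
  grep -cE '^member .* is (unhealthy|unreachable)' || true
}

# Sample copied from the failing cluster-health output above.
sample='member 4d86f9c7df30ee86 is unhealthy: got unhealthy result from https://kubernetes-etcd-1:2379
member 83cf4246c8e69b21 is unreachable: [https://kubernetes-etcd-3:2379] are all unreachable
cluster is unhealthy'

bad=$(printf '%s\n' "$sample" | count_bad_members)
echo "unhealthy or unreachable members: $bad"

# Against the live cluster, something like this once a minute would
# record when the count first goes above zero:
#   etcdctl cluster-health 2>&1 | count_bad_members >> /tmp/etcd-health.log
```

Running that every minute (and doing the same with ping to the other two members) would show whether the overlay network or etcd itself fails first.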

Here are some logs from around the time the cluster goes unhealthy:

{"log":"2019-02-09 01:59:09.163771 I | etcdserver: start to snapshot (applied: 390042, lastsnap: 380041)\n","stream":"stderr","time":"2019-02-09T01:59:09.163985613Z"}
{"log":"2019-02-09 01:59:09.215785 I | etcdserver: saved snapshot at index 390042\n","stream":"stderr","time":"2019-02-09T01:59:09.215947182Z"}
{"log":"2019-02-09 01:59:09.216035 I | etcdserver: compacted raft log at 385042\n","stream":"stderr","time":"2019-02-09T01:59:09.21616145Z"}
{"log":"2019-02-09 01:59:23.952790 I | fileutil: purged file /pdata/data.current/member/snap/0000000000000002-0000000000053044.snap successfully\n","stream":"stderr","time":"2019-02-09T01:59:23.953087278Z"}
{"log":"time=\"2019-02-09T02:09:55Z\" level=info msg=\"Created backup\" name=\"2019-02-09T02:09:54Z_etcd_1\" runtime=354.277648ms \n","stream":"stderr","time":"2019-02-09T02:09:55.032899272Z"}
{"log":"2019-02-09 02:16:53.913341 I | etcdserver: start to snapshot (applied: 400043, lastsnap: 390042)\n","stream":"stderr","time":"2019-02-09T02:16:53.913573568Z"}
{"log":"2019-02-09 02:16:53.973500 I | etcdserver: saved snapshot at index 400043\n","stream":"stderr","time":"2019-02-09T02:16:53.973681801Z"}
{"log":"2019-02-09 02:16:53.973829 I | etcdserver: compacted raft log at 395043\n","stream":"stderr","time":"2019-02-09T02:16:53.973925027Z"}
{"log":"2019-02-09 02:17:23.965983 I | fileutil: purged file /pdata/data.current/member/snap/0000000000000002-0000000000055755.snap successfully\n","stream":"stderr","time":"2019-02-09T02:17:23.966242969Z"}
{"log":"2019-02-09 02:24:40.379842 E | rafthttp: failed to read 83cf4246c8e69b21 on stream MsgApp v2 (read tcp 10.42.198.89:34896-\u003e10.42.250.133:2380: i/o timeout)\n","stream":"stderr","time":"2019-02-09T02:24:40.380717138Z"}
{"log":"2019-02-09 02:24:40.379874 I | rafthttp: the connection with 83cf4246c8e69b21 became inactive\n","stream":"stderr","time":"2019-02-09T02:24:40.380776793Z"}
{"log":"2019-02-09 02:24:41.642558 E | rafthttp: failed to read d5dbb2d2eacdec65 on stream Message (read tcp 10.42.198.89:39766-\u003e10.42.19.155:2380: i/o timeout)\n","stream":"stderr","time":"2019-02-09T02:24:41.642735007Z"}
{"log":"2019-02-09 02:24:41.642578 I | rafthttp: the connection with d5dbb2d2eacdec65 became inactive\n","stream":"stderr","time":"2019-02-09T02:24:41.642767582Z"}
{"log":"2019-02-09 02:24:52.383731 E | etcdhttp: etcdserver: request timed out, possibly due to connection lost\n","stream":"stderr","time":"2019-02-09T02:24:52.383929375Z"}
{"log":"2019-02-09 02:24:53.510742 E | etcdhttp: etcdserver: request timed out, possibly due to connection lost [merged 2 repeated lines in 1.13s]\n","stream":"stderr","time":"2019-02-09T02:24:53.51090796Z"}
{"log":"time=\"2019-02-09T02:24:55Z\" level=info msg=\"Created backup\" name=\"2019-02-09T02:24:54Z_etcd_1\" runtime=380.515544ms \n","stream":"stderr","time":"2019-02-09T02:24:55.058320248Z"}
{"log":"2019-02-09 02:24:56.357419 E | etcdhttp: etcdserver: request timed out, possibly due to connection lost\n","stream":"stderr","time":"2019-02-09T02:24:56.357583998Z"}
{"log":"2019-02-09 02:24:57.511468 E | etcdhttp: etcdserver: request timed out, possibly due to connection lost [merged 2 repeated lines in 1.15s]\n","stream":"stderr","time":"2019-02-09T02:24:57.511649576Z"}
{"log":"2019-02-09 02:24:58.925612 E | etcdhttp: etcdserver: request timed out, possibly due to connection lost\n","stream":"stderr","time":"2019-02-09T02:24:58.925902022Z"}
{"log":"2019-02-09 02:25:00.510716 E | etcdhttp: etcdserver: request timed out, possibly due to connection lost [merged 7 repeated lines in 1.58s]\n","stream":"stderr","time":"2019-02-09T02:25:00.510897523Z"}
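The json-file log can also be filtered down to just the error-level lines (" E | ...") to pinpoint the exact moment the peer streams went inactive. A small sketch, where the sample lines are copied from the logs above and the container log path in the comment is illustrative:

```shell
#!/bin/sh
# Filter Docker json-file log lines for etcd error-level entries
# (rafthttp/etcdhttp), which mark when the peer connections dropped.
sample='{"log":"2019-02-09 02:24:40.379842 E | rafthttp: failed to read 83cf4246c8e69b21 on stream MsgApp v2\n","stream":"stderr","time":"2019-02-09T02:24:40.380717138Z"}
{"log":"2019-02-09 02:24:40.379874 I | rafthttp: the connection with 83cf4246c8e69b21 became inactive\n","stream":"stderr","time":"2019-02-09T02:24:40.380776793Z"}'

# Count and show only the error-level lines.
nerrs=$(printf '%s\n' "$sample" | grep -cE ' E \| (rafthttp|etcdhttp):')
printf '%s\n' "$sample" | grep -E ' E \| (rafthttp|etcdhttp):'
echo "error lines: $nerrs"

# Against the real container log it would be something like:
#   grep -E ' E \| (rafthttp|etcdhttp):' /var/lib/docker/containers/<id>/<id>-json.log
```

If the first rafthttp i/o timeout consistently lands a few hours after boot, that points at something periodic on the hosts or the overlay network rather than etcd itself.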