Can't initialize etcd in a Kubernetes environment (AWS)

Hello everyone,

I have been trying to build a basic Kubernetes cluster using Rancher on AWS.

I can deploy on one host fine, but as soon as I add a second host I seem to have trouble with etcd, and the system containers keep restarting. I'm not quite sure which logs you would need to give me a hand, but this is what I find for etcd:

20/09/2016 12:13:09Get http://10.42.239.235:2379/health: dial tcp 10.42.239.235:2379: getsockopt: no route to host
20/09/2016 12:13:27Get http://10.42.239.235:2379/health: dial tcp 10.42.239.235:2379: getsockopt: no route to host
20/09/2016 12:13:45Get http://10.42.239.235:2379/health: dial tcp 10.42.239.235:2379: getsockopt: no route to host
20/09/2016 12:14:03Get http://10.42.239.235:2379/health: dial tcp 10.42.239.235:2379: getsockopt: no route to host
20/09/2016 12:14:21Get http://10.42.239.235:2379/health: dial tcp 10.42.239.235:2379: getsockopt: no route to host 
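Each of these lines means the healthcheck proxy cannot even open a TCP connection to the member's client port. As a quick sketch for reproducing the check by hand from another host (etcd v2 exposes a `/health` endpoint on the client port; the IP and port are taken from the failing healthcheck above):

```shell
# probe_endpoint HOST PORT — prints "reachable" or "unreachable" depending on
# whether an HTTP connection to the etcd client port succeeds within 3 seconds.
probe_endpoint() {
    host="$1"; port="$2"
    if curl -s -o /dev/null --connect-timeout 3 "http://${host}:${port}/health"; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}

# The member the healthcheck proxy cannot reach (from the log above):
probe_endpoint 10.42.239.235 2379
```

If this prints "unreachable" from one host but "reachable" on the member itself, the problem is in the network between hosts rather than in etcd.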

This is what I have in the logs on 10.42.239.235:

20/09/2016 12:02:47++ giddyup service scale etcd
20/09/2016 12:02:47+ SCALE=3
20/09/2016 12:02:47++ giddyup ip myip
20/09/2016 12:02:47+ IP=10.42.239.235
20/09/2016 12:02:47+ META_URL=http://rancher-metadata.rancher.internal/2015-12-19
20/09/2016 12:02:47++ wget -q -O - http://rancher-metadata.rancher.internal/2015-12-19/self/stack/name
20/09/2016 12:02:47+ STACK_NAME=Kubernetes
20/09/2016 12:02:47++ wget -q -O - http://rancher-metadata.rancher.internal/2015-12-19/self/container/create_index
20/09/2016 12:02:47+ CREATE_INDEX=34
20/09/2016 12:02:47++ wget -q -O - http://rancher-metadata.rancher.internal/2015-12-19/self/container/service_index
20/09/2016 12:02:47+ SERVICE_INDEX=1
20/09/2016 12:02:47++ wget -q -O - http://rancher-metadata.rancher.internal/2015-12-19/self/host/uuid
20/09/2016 12:02:47+ HOST_UUID=ec5e47d2-9345-44c9-a863-83849ae01dcb
20/09/2016 12:02:47+ LEGACY_DATA_DIR=/data
20/09/2016 12:02:47+ DATA_DIR=/pdata
20/09/2016 12:02:47+ DR_FLAG=/pdata/DR
20/09/2016 12:02:47+ export ETCD_DATA_DIR=/pdata/data.current
20/09/2016 12:02:47+ ETCD_DATA_DIR=/pdata/data.current
20/09/2016 12:02:47+ export ETCDCTL_ENDPOINT=http://etcd.Kubernetes:2379
20/09/2016 12:02:47+ ETCDCTL_ENDPOINT=http://etcd.Kubernetes:2379
20/09/2016 12:02:47++ tr . -
20/09/2016 12:02:47++ echo 10.42.239.235
20/09/2016 12:02:47+ NAME=10-42-239-235
20/09/2016 12:02:47+ '[' 1 -eq 0 ']'
20/09/2016 12:02:47+ eval node
20/09/2016 12:02:47++ node
20/09/2016 12:02:47++ mkdir -p /pdata/data.current
20/09/2016 12:02:47++ '[' -d /data/member ']'
20/09/2016 12:02:47++ '[' -d /data/data.current ']'
20/09/2016 12:02:47++ '[' -f /pdata/DR ']'
20/09/2016 12:02:47++ '[' -d /pdata/data.current/member ']'
20/09/2016 12:02:47+++ cat /pdata/data.current/ip
20/09/2016 12:02:47++ '[' 10.42.239.235 == 10.42.239.235 ']'
20/09/2016 12:02:47++ restart_node
20/09/2016 12:02:47++ ++ healthcheck_proxyrolling_backup
20/09/2016 12:02:47
20/09/2016 12:02:47++ ++ etcd --name 10-42-239-235 --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.42.239.235:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://10.42.239.235:2380 --initial-cluster-state existing
20/09/2016 12:02:47++ WAIT=60s
20/09/2016 12:02:47++ etcdwrapper healthcheck-proxy --port=:2378 --wait=60s --debug=false
20/09/2016 12:02:47EMBEDDED_BACKUPS=true
20/09/2016 12:02:47++ '[' true == true ']'
20/09/2016 12:02:47++ BACKUP_PERIOD=15m
20/09/2016 12:02:47++ BACKUP_RETENTION=24h
20/09/2016 12:02:47++ giddyup leader elect --proxy-tcp-port=2160 etcdwrapper rolling-backup --period=15m --retention=24h --index=1
20/09/2016 12:02:47time="2016-09-20T10:02:47Z" level=info msg="Listening on 0.0.0.0:2160"
20/09/2016 12:02:47time="2016-09-20T10:02:47Z" level=info msg="Forwarding setup to: :2160"
20/09/2016 12:02:482016-09-20 10:02:48.049350 I | flags: recognized and used environment variable ETCD_DATA_DIR=/pdata/data.current
20/09/2016 12:02:482016-09-20 10:02:48.049574 I | etcdmain: etcd Version: 2.3.7
20/09/2016 12:02:482016-09-20 10:02:48.049632 I | etcdmain: Git SHA: fd17c91
20/09/2016 12:02:482016-09-20 10:02:48.049652 I | etcdmain: Go Version: go1.6.2
20/09/2016 12:02:482016-09-20 10:02:48.049686 I | etcdmain: Go OS/Arch: linux/amd64
20/09/2016 12:02:482016-09-20 10:02:48.049698 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
20/09/2016 12:02:482016-09-20 10:02:48.049738 W | etcdmain: found invalid file/dir ip under data dir /pdata/data.current (Ignore this if you are upgrading etcd)
20/09/2016 12:02:482016-09-20 10:02:48.049753 N | etcdmain: the server is already initialized as member before, starting as etcd member...
20/09/2016 12:02:482016-09-20 10:02:48.049845 I | etcdmain: listening for peers on http://0.0.0.0:2380
20/09/2016 12:02:482016-09-20 10:02:48.049901 I | etcdmain: listening for client requests on http://0.0.0.0:2379
20/09/2016 12:02:482016-09-20 10:02:48.207867 I | etcdserver: recovered store from snapshot at index 50005
20/09/2016 12:02:482016-09-20 10:02:48.207893 I | etcdserver: name = 10-42-239-235
20/09/2016 12:02:482016-09-20 10:02:48.207899 I | etcdserver: data dir = /pdata/data.current
20/09/2016 12:02:482016-09-20 10:02:48.207905 I | etcdserver: member dir = /pdata/data.current/member
20/09/2016 12:02:482016-09-20 10:02:48.207909 I | etcdserver: heartbeat = 100ms
20/09/2016 12:02:482016-09-20 10:02:48.207913 I | etcdserver: election = 1000ms
20/09/2016 12:02:482016-09-20 10:02:48.207916 I | etcdserver: snapshot count = 10000
20/09/2016 12:02:482016-09-20 10:02:48.207930 I | etcdserver: advertise client URLs = http://10.42.239.235:2379
20/09/2016 12:02:482016-09-20 10:02:48.435244 I | etcdserver: restarting member a113af6263612296 in cluster 758a82db1924ffd2 at commit index 57370
20/09/2016 12:02:482016-09-20 10:02:48.435534 I | raft: a113af6263612296 became follower at term 2
20/09/2016 12:02:482016-09-20 10:02:48.435556 I | raft: newRaft a113af6263612296 [peers: [a113af6263612296], term: 2, commit: 57370, applied: 50005, lastindex: 57370, lastterm: 2]
20/09/2016 12:02:482016-09-20 10:02:48.438018 I | etcdserver: added member a113af6263612296 [http://10.42.239.235:2380] to cluster 758a82db1924ffd2 from store
20/09/2016 12:02:482016-09-20 10:02:48.438040 I | etcdserver: set the cluster version to 2.3 from store
20/09/2016 12:02:482016-09-20 10:02:48.438218 I | etcdserver: starting server... [version: 2.3.7, cluster version: 2.3]
20/09/2016 12:02:48time="2016-09-20T10:02:48Z" level=info msg="Initializing Rolling Backups" period=15m0s retention=24h0m0s
20/09/2016 12:02:492016-09-20 10:02:49.138421 I | raft: a113af6263612296 is starting a new election at term 2
20/09/2016 12:02:492016-09-20 10:02:49.138458 I | raft: a113af6263612296 became candidate at term 3
20/09/2016 12:02:492016-09-20 10:02:49.138465 I | raft: a113af6263612296 received vote from a113af6263612296 at term 3
20/09/2016 12:02:492016-09-20 10:02:49.138599 I | raft: a113af6263612296 became leader at term 3
20/09/2016 12:02:492016-09-20 10:02:49.138618 I | raft: raft.node: a113af6263612296 elected leader a113af6263612296 at term 3
20/09/2016 12:02:492016-09-20 10:02:49.139075 I | etcdserver: published {Name:10-42-239-235 ClientURLs:[http://10.42.239.235:2379]} to cluster 758a82db1924ffd2
20/09/2016 12:17:49time="2016-09-20T10:17:49Z" level=info msg="Created backup" name="2016-09-20T10:17:48Z_etcd_1" runtime=426.536346ms
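As an aside, the trace above shows how the startup script derives the etcd member name from the container IP by replacing dots with dashes, which is why the member shows up as `10-42-239-235`:

```shell
# Same transformation the startup script performs (see the trace above).
IP=10.42.239.235
NAME=$(echo "$IP" | tr . -)
echo "$NAME"   # prints 10-42-239-235
```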

The hosts are in the Rancher-created security groups; I even tried manually adding them to the default AWS security group, but no luck.

Thanks a lot!

Hi again,

For testing purposes I tried with the Cattle orchestrator, and I seem to have the same problem:

The EC2 instances can ping each other without any problem, and containers on the same host can ping each other fine, but containers on different hosts cannot reach each other.

Is this expected behavior?

Cheers!

It sounds like your cross-host communication isn't working. Have you looked at the FAQs for this issue?

http://docs.rancher.com/rancher/v1.2/en/faqs/troubleshooting/#cross-host-communication
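For what it's worth, Rancher's managed overlay network in this release tunnels cross-host traffic over IPsec, so on AWS the hosts' security group needs roughly the following ingress rules open between all Rancher hosts (a sketch; see the FAQ above for the authoritative list):

```
UDP 500   (IKE)           host <-> host
UDP 4500  (IPsec NAT-T)   host <-> host
```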

Also, you can confirm whether or not cross-host communication is working by exec-ing into the network agent on one host and pinging the IP (10.42.x.x) of a different network agent. Depending on which version you are running, those containers may be hidden in the UI on the hosts page; they can be shown by checking “Show System” in the upper right-hand corner of the hosts page.

Thank you Denise, that was it! My bad for skipping the FAQ… On the hosts page it was indeed showing the Docker bridge IP instead of the host's IP.

I manually deployed some Debian hosts with that environment variable set, and now it works.

Since this problem also appeared when using the auto provision plugin and the RancherOS AMI, I am wondering: is there a way to reliably use automatic host provisioning, or is it still a work in progress?

Thanks!

When you say “auto provision plugin”, do you mean using the UI?

Sorry for the late reply. Yes, that's what I meant.

Thanks!