They never reach Running status; they go Initializing > Reconciling > Stopping, so I'm not sure how to find the logs for the failing instance.
The last log in docker logs -f rancher-ha is the following on all three nodes:
time="2016-04-15T09:50:02Z" level=info msg="Container agent is not running in state &types.ContainerState{Status:\"exited\", Running:false, Paused:false, Restarting:false, OOMKilled:false, Dead:false, Pid:0, ExitCode:0, Error:\"\", StartedAt:\"2016-04-15T09:46:52.499514335Z\", FinishedAt:\"2016-04-15T09:47:02.105292501Z\"}" component=docker
time="2016-04-15T09:50:02Z" level=info msg="Deleting container 89f5a53bd7ce3b263891f9d303fadd052b29edd048c9fe6af269151035893b06" component=docker
/var/log/docker.log is spamming:
time="2016-04-15T13:53:46.124242130Z" level=error msg="Handler for GET /v1.22/containers/rancher-ha-agent/json returned error: No such container: rancher-ha-agent"
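In case it helps anyone else hunting for the logs: the failing containers usually show up as exited rather than gone, so you can pull their logs directly. The container name below is just an example:

```
# List recently exited containers
docker ps -a --filter "status=exited" --format "{{.ID}}  {{.Names}}  {{.Status}}"

# Dump the last log lines from one of them
docker logs --tail 100 rancher-ha-agent

# See why it exited (exit code, OOM kill, error message)
docker inspect --format '{{json .State}}' rancher-ha-agent
```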
So that issue is fixed now; it turned out to be some load balancer problems. But I've hit another snag.
The management stack is working great now, but there's a networking problem on the hosts that I'm adding. Containers are not able to communicate with each other; basically, I cannot ping a container from any other worker node. Is there a different method for adding worker hosts to the HA stack?
The Network Agent has the following in STDERR:
RTNETLINK answers: No such file or directory
SIOCSARP: Invalid argument
arp: cannot set entry on line 2 of etherfile content-home/etc/cattle/ethers !
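For anyone debugging the same thing, this is roughly how I'm testing cross-host connectivity by hand. The IPs and the container name filter are just examples; Rancher's default IPsec overlay needs UDP 500 and 4500 open between hosts, and containers on the managed network get 10.42.x.x addresses:

```
# From host A, check the IPsec ports on host B (UDP checks with nc are
# not definitive, but a hard failure here is telling)
nc -u -z -v 10.0.0.12 500
nc -u -z -v 10.0.0.12 4500

# Ping a container on the other host from inside the Network Agent
# (the "NetworkAgent" name and the 10.42.183.2 address are examples)
NA=$(docker ps -q --filter "name=NetworkAgent")
docker exec -it "$NA" ping -c 3 10.42.183.2
```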
Also, one more weird thing: host IPs in the UI are not correct. The nodes on which HA is running have the correct IPs, but any host that I add gets an incorrect one.
Good find. So Rancher is possibly doing a reverse DNS lookup on the Callback URL, and that value is somehow being translated into the host IP address. Seems strange, as it works on a single server without an ELB.
So, I'm led to believe that enabling Proxy Protocol on the ELB might fix this issue. I've not had success with it personally, but it's worth a shot if you get a chance.
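For reference, enabling it with the AWS CLI looks roughly like this. The load balancer name and instance port are examples, not values from this thread:

```
# Create a ProxyProtocol policy on the classic ELB
aws elb create-load-balancer-policy \
  --load-balancer-name rancher-elb \
  --policy-name EnableProxyProtocol \
  --policy-type-name ProxyProtocolPolicyType \
  --policy-attributes AttributeName=ProxyProtocol,AttributeValue=true

# Attach it to the backend port the Rancher servers listen on
aws elb set-load-balancer-policies-for-backend-server \
  --load-balancer-name rancher-elb \
  --instance-port 8080 \
  --policy-names EnableProxyProtocol
```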
I'm having the same problem … some containers just do not start up, and I can't get a shell on the "network agent" to ping between them … Is it a bug for you too? Do you have a fix?
Check the logs of the exited containers. I had two main issues: HTTPS certificates not matching the URL, and the ELB listeners being set to HTTP/HTTPS when they should be TCP and SSL.
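A quick way to check both; the hostname and ELB name below are examples:

```
# Confirm the listeners are TCP/SSL, not HTTP/HTTPS
aws elb describe-load-balancers \
  --load-balancer-names rancher-elb \
  --query 'LoadBalancerDescriptions[].ListenerDescriptions[].Listener'

# Confirm the certificate presented actually matches the Rancher URL
openssl s_client -connect rancher.example.com:443 \
  -servername rancher.example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject
```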
When adding the host, you set it to the IP of the host. These docs explain what you are doing when setting CATTLE_AGENT_IP; it's just a different known use case for when it needs to be set.
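So the registration command ends up looking something like this. The agent version, server URL, token, and IP are placeholders; copy the real command from your own Add Host screen and just add the CATTLE_AGENT_IP variable:

```
sudo docker run -d --privileged \
  -e CATTLE_AGENT_IP="10.0.0.12" \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.0.1 \
  https://rancher.example.com/v1/scripts/YOUR_REGISTRATION_TOKEN
```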