All AWS Hosts Disconnected on Rancher 1.3?

Shortly after deploying our first stack on Rancher 1.3, running a Cattle environment with all defaults (Rancher IPsec, etc.), all AWS hosts went to “Disconnected” and I can’t get them to reconnect. The hosts were happily connected for a few days up until now. Any thoughts?

This is an issue we haven’t seen on Rancher 1.1 with the same host instance types, security groups, etc.

We just started getting the same thing. Two AWS hosts in one of our environments went into Disconnected mode within 5 minutes of each other; the other two are still connected. Nothing was changed networking- or Rancher-wise, and our other two environments have 10+ Rancher hosts in AWS running okay.

The major difference is that the environment that lost hosts is in us-east-1 and the others are in us-west-1. The EC2 instances themselves are still online, so I’m not sure where the network disconnection happened. Looking over the logs now to see if there is any information.
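In case it’s useful to anyone else, this is roughly where I’m looking. The rancher-agent container is what Rancher 1.x runs on each host; the server container name below is a placeholder for however you run the Rancher server:

# On an affected host: is the agent container still up, and what is it logging?
docker ps -a | grep rancher-agent
docker logs --tail 100 rancher-agent

# On the Rancher server host: look for recent server-side errors
docker logs --tail 100 <rancher-server-container> 2>&1 | grep -i error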

This has been happening more and more frequently with our environments since upgrading to Rancher 1.2.2. Haven’t tried 1.3 yet.

Found the issue on my end, posting in case someone else makes the same dumb mistake as me:
The Host Registration URL (under Settings) was a temporary URL, since we were migrating stacks from a Rancher 1.1 cluster.

As soon as I flipped the manager’s DNS to a new URL, the nodes couldn’t find the manager anymore.

Lesson learned: make that setting permanent before creating any hosts.
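For anyone hitting the same thing, the fix (as far as I know) is to update the Host Registration URL under Settings and then re-run the registration command from Infrastructure > Hosts > Add Host on each affected host. It looks roughly like the sketch below; the agent tag, hostname, and token are placeholders, so copy the exact command from your own UI:

# Re-register an agent against the new URL (placeholders: agent tag, hostname, token)
sudo docker run -d --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.0 https://rancher.example.com/v1/scripts/<registration-token>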

Good to know.

I tried logging into the two EC2 instances that we lost, and they are both down hard. There’s no information in the system log and SSH is completely unresponsive, so somehow they crashed.

We were re-deploying some high-memory containers (Artifactory) and they were trying to grab a lot of memory, which may have caused the crash since we run with 0 swap, but right now that’s just a correlation.
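If it turns out to be the kernel OOM killer, it usually leaves traces. dmesg only covers the current boot, so for a crashed or rebooted instance the persistent journal or the EC2 console output is more useful (the instance ID below is a placeholder, and journalctl isn’t on every AMI):

# OOM killer traces in the kernel log (current boot only)
dmesg -T | grep -i -E 'out of memory|oom-killer'

# Previous boot, if the journal is persistent
journalctl -k -b -1 | grep -i oom

# If the instance is unreachable, pull the console output via the AWS CLI
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text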

@ecliptik are you running a cleanup/janitor service? I’ve had that happen before when storage filled up.

@willrstern yes, janitor is running and the hosts are running Docker v1.12.6 with overlay2 as the storage driver to avoid possible inode exhaustion.

These hosts are re-rolled quite frequently, so they usually have fresh disks.
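For the record, these are the quick checks I use to rule out disk and inode exhaustion; adjust the path if your Docker root isn’t /var/lib/docker:

# Disk space and inode usage on the Docker root
df -h /var/lib/docker
df -i /var/lib/docker

# Confirm which storage driver Docker is actually using
docker info 2>/dev/null | grep -i 'storage driver'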

I’ve added swap and will see if that makes a difference.

I’m experiencing this as well. Any guides on enabling swap on freshly created EC2 hosts with Docker?

Or do you just SSH in and run swapon?

@sumobob, in the user-data that runs when an AWS EC2 instance comes up and sets up Rancher, we have a setup_swap function. It should work on any RHEL- or Ubuntu-based AMI.

function setup_swap () {
  # Create a swap file to avoid running out of memory
  SWAP_FILE="/.swapfile"
  fallocate --length 32GiB ${SWAP_FILE}
  mkswap ${SWAP_FILE}
  chmod 0600 ${SWAP_FILE}
  swapon ${SWAP_FILE}
  # Make the swap file persistent across reboots
  echo "${SWAP_FILE} none swap defaults 0 0" >> /etc/fstab
}

setup_swap
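After it runs you can sanity-check it with the usual tools; swapon --show needs a reasonably recent util-linux, and swapon -s works on older boxes. One caveat I’ve seen: swapon can reject fallocate-created files on some filesystems (XFS before kernel 4.18), in which case creating the file with dd is the portable fallback.

# Verify the swap file is active
swapon --show    # or: swapon -s on older util-linux
free -h

# Portable fallback if swapon rejects the fallocate-created file:
# dd if=/dev/zero of=/.swapfile bs=1M count=32768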