All AWS Hosts Disconnected on Rancher 1.3?

Shortly after deploying our first stack on Rancher 1.3, running a Cattle environment with all defaults (Rancher IPsec, etc.), all AWS hosts went to “Disconnected” and I can’t get them to reconnect. The hosts were happily connected for a few days up until now. Any thoughts?

This is an issue we haven’t seen on Rancher 1.1 with the same host instance types, security groups, etc.

We just started getting the same thing. Two AWS hosts in one of our environments went into Disconnected mode within 5 minutes of each other; the other two are still connected. Nothing was changed networking- or Rancher-wise, and our other two environments have 10+ Rancher hosts in AWS running okay.

The major difference is that the environment that lost hosts is in us-east-1 and the others are in us-west-1. The EC2 instances themselves are still online, so I’m not sure where the network disconnection happened. Looking over the logs now to see if there is any information.
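In case it’s useful to anyone else, this is roughly where I’m looking. The rancher-agent container is what Rancher 1.x runs on each host; the server container name below is a placeholder for however you run the Rancher server:

# On an affected host: is the agent container still up, and what is it logging?
docker ps -a | grep rancher-agent
docker logs --tail 100 rancher-agent

# On the Rancher server host: look for recent server-side errors
docker logs --tail 100 <rancher-server-container> 2>&1 | grep -i error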

This has been happening more and more frequently with our environments since upgrading to Rancher 1.2.2. Haven’t tried 1.3 yet.

Found the issue on my end, posting in case someone else makes the same dumb mistake as me:
The Host Registration URL (under Settings) was a temporary URL, since we were migrating stacks from a Rancher 1.1 cluster.

As soon as I flipped the manager’s DNS to a new URL, the nodes couldn’t find the manager anymore.

Lesson learned: make that setting permanent before creating any hosts.
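For anyone hitting the same thing, the fix (as far as I know) is to update the Host Registration URL under Settings and then re-run the registration command from Infrastructure > Hosts > Add Host on each affected host. It looks roughly like the sketch below; the agent tag, hostname, and token are placeholders, so copy the exact command from your own UI:

# Re-register an agent against the new URL (placeholders: agent tag, hostname, token)
sudo docker run -d --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.0 https://rancher.example.com/v1/scripts/<registration-token>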

Good to know.

I tried logging into the two EC2 instances that we lost, and they are both down hard. There’s no information in the system log and SSH is completely unresponsive, so somehow they crashed.

We were re-deploying some high-memory containers (Artifactory) and they were trying to grab a lot of memory, which may have caused the crash since we run with 0 swap, but right now that’s just a correlation.
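If it turns out to be the kernel OOM killer, it usually leaves traces. dmesg only covers the current boot, so for a crashed or rebooted instance the persistent journal or the EC2 console output is more useful (the instance ID below is a placeholder, and journalctl isn’t on every AMI):

# OOM killer traces in the kernel log (current boot only)
dmesg -T | grep -i -E 'out of memory|oom-killer'

# Previous boot, if the journal is persistent
journalctl -k -b -1 | grep -i oom

# If the instance is unreachable, pull the console output via the AWS CLI
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text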

@ecliptik are you running a cleanup/janitor service? I’ve had that happen before when storage filled up.

@willrstern yes, janitor is running and the hosts are running Docker v1.12.6 with overlay2 as the storage driver to avoid possible inode exhaustion.

These hosts are re-rolled quite frequently, so they usually have fresh disks.
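For the record, these are the quick checks I use to rule out disk and inode exhaustion; adjust the path if your Docker root isn’t /var/lib/docker:

# Disk space and inode usage on the Docker root
df -h /var/lib/docker
df -i /var/lib/docker

# Confirm which storage driver Docker is actually using
docker info 2>/dev/null | grep -i 'storage driver'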

I’ve added swap and will see if that makes a difference.

I’m experiencing this as well. Any guides on enabling swap on freshly created EC2 hosts with Docker?

Or do you just SSH in and run swapon?

@sumobob, in the user-data that runs when an AWS EC2 instance comes up and sets up Rancher, we have a setup_swap function. It should work on any RHEL- or Ubuntu-based AMI.

function setup_swap () {
  # Create a swap file to avoid running out of memory
  SWAP_FILE="/.swapfile"
  fallocate --length 32GiB ${SWAP_FILE}
  mkswap ${SWAP_FILE}
  chmod 0600 ${SWAP_FILE}
  swapon ${SWAP_FILE}
  # Make the swap file persistent across reboots
  echo "${SWAP_FILE} none swap defaults 0 0" >> /etc/fstab
}

setup_swap
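After it runs you can sanity-check it with the usual tools; swapon --show needs a reasonably recent util-linux, and swapon -s works on older boxes. One caveat I’ve seen: swapon can reject fallocate-created files on some filesystems (XFS before kernel 4.18), in which case creating the file with dd is the portable fallback.

# Verify the swap file is active
swapon --show    # or: swapon -s on older util-linux
free -h

# Portable fallback if swapon rejects the fallocate-created file:
# dd if=/dev/zero of=/.swapfile bs=1M count=32768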