EC2 host randomly disconnects from Rancher, then gets "Credentials are no longer valid, please re-register this agent" error when attempting to reconnect

Hello,

I’m having issues with hosts occasionally disconnecting from Rancher and then being rejected as unauthorized when they try to reconnect. Versions:
Rancher v1.6.14
Cattle v0.183.37
User Interface v1.6.37
Rancher CLI v0.6.7
Rancher Compose v0.12.5

I have some EC2 instances that spin up in my auto scaling groups and register with Rancher through their user_data script, which runs the add-host command that Rancher generates, like so:

docker run --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.9 http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1/scripts/x:y:z

They stay connected for a while, and then, rarely and at no predictable interval, a host disconnects and is unable to reconnect.

I checked the rancher agent logs, and see this happening:

time="2018-10-02T14:46:30Z" level=info msg="Reply: dec824e6-b945-43ec-bb91-4f1da33948a9, compute.instance.activate, 1ihm4215085:instanceHostMap" 
time="2018-10-02T14:46:30Z" level=info msg="Reply: 50237fa1-e7ae-4805-8da4-bf3a300e0b40, compute.instance.activate, 1ihm4215086:instanceHostMap" 
time="2018-10-02T18:18:34Z" level=error msg="Received error reading from socket. Exiting." error="write tcp 10.0.x.x:34646->10.0.x.x:8080: i/o timeout" 
time="2018-10-02T18:18:39Z" level=warning msg="websocket closed: write tcp 10.0.x.x:48222->10.0.x.x:8080: i/o timeout" 
time="2018-10-02T18:18:43Z" level=warning msg="Hit websocket pong timeout. Last websocket ping received at 2018-10-02 18:18:20.694883542 +0000 UTC. Closing connection." 
time="2018-10-02T18:18:48Z" level=error msg="Failed to connect to websocket proxy: %vwrite tcp 10.0.x.x:34646->10.0.6.194:8080: i/o timeout" 
time="2018-10-02T18:19:37Z" level=error msg="Failed to get rancher client for host-api startup: Get http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" 
time="2018-10-02T18:19:53Z" level=info msg="Hit websocket pong timeout. Last websocket ping received at 2018-10-02 18:18:46.472304538 +0000 UTC. Closing connection." 
INFO: Starting agent for 0FB398F1E541B7x
INFO: Access Key: 0FB398F1E541B7Dx
INFO: Config URL: http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1
INFO: Storage URL: http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1
INFO: API URL: http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1
INFO: IP: 10.0.x.x
INFO: Port:
INFO: Required Image: rancher/agent:v1.2.9
INFO: Current Image: rancher/agent:v1.2.9
INFO: Using image rancher/agent:v1.2.9
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
INFO: Downloading agent http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1/configcontent/configscripts
[binary output omitted]
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
{"id":"8465ecef-7743-4e9f-82f1-468f1c6789f2","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null,"baseType":"error"}ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
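
To spot this state on a host without reading the logs by hand, the log text can be scanned for the error message shown above. This is only a minimal sketch: the function name `needs_reregister` is made up for illustration, and the container name `rancher-agent` in the usage comment is an assumption about how the agent container is named on the host.

```shell
#!/bin/sh
# Hypothetical helper: reads agent log text on stdin and prints "reregister"
# if the credentials error from the log above appears, otherwise "ok".
needs_reregister() {
    if grep -q "Credentials are no longer valid"; then
        echo "reregister"
    else
        echo "ok"
    fi
}

# Example usage (container name is an assumption):
#   docker logs --tail 200 rancher-agent 2>&1 | needs_reregister
```

Something like this could run from cron or a health check so that a host stuck in the re-register loop is flagged instead of sitting disconnected.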

I am not sure why it disconnects, and I'm also not sure why the credentials become invalid or why the agent download comes back in a non-gzip format. I suspected that the hosts might be running out of memory, so I set memory reservations and limits on the containers to prevent that from happening.
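
One way to see what the server actually returned for the agent download is to save the response to a file and check whether it starts with the gzip magic bytes before handing it to tar. The sketch below assumes the response has already been saved to a local file; the function name `is_gzip` and the file path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the file starts with the
# gzip magic bytes 0x1f 0x8b, fails otherwise.
is_gzip() {
    # od prints the first two bytes as hex, e.g. " 1f 8b"; strip whitespace.
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Example usage (path is a placeholder):
#   if is_gzip /tmp/configscripts; then
#       tar -tzf /tmp/configscripts
#   else
#       cat /tmp/configscripts    # likely an error body such as the 401 JSON
#   fi
```

If the check fails, printing the file should show whether the body is an error response rather than an archive.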

I am also not sure why it cannot reconnect. If I remove all the rancher services on the host and grab a script from rancher to add the host back, it seems to work fine. Scaling up the auto scale group will bring new hosts in, so it seems like the user_data script is still valid as well.
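
The manual workaround described above can be sketched roughly as follows. The function only prints the commands so they can be reviewed first. The function name `reregister_cmds` is made up, the container name `rancher-agent` is an assumption, and the registration URL has to be a fresh one copied from the Rancher add-host screen. Whether anything under /var/lib/rancher also needs to be cleared is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: prints the commands for removing the old agent
# container and registering the host again with the given URL.
reregister_cmds() {
    url=$1
    # Remove the existing agent container (name is an assumption).
    echo "docker rm -f rancher-agent"
    # Same add-host command as in the user_data script, with the new URL.
    echo "docker run --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.9 $url"
}

# Example usage: review the output, then pipe it to sh to run it.
#   reregister_cmds "http://<rancher-server>:8080/v1/scripts/x:y:z"
```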

Any ideas of where I can check for more errors, or a possible fix?
Thank you!

This still happens to me sometimes. Did you ever figure out why?