EC2 host gets disconnected from rancher randomly, and gets "Credentials are no longer valid, please re-register this agent" error when attempting to reconnect

gsagen · October 3, 2018, 6:08pm

Hello,

I’m having issues with hosts occasionally disconnecting from Rancher and then being rejected as unauthorized when they try to reconnect.
Rancher v1.6.14
Cattle v0.183.37
User Interface v1.6.37
Rancher CLI v0.6.7
Rancher Compose v0.12.5

I have some EC2 instances that spin up in my Auto Scaling groups and connect to Rancher through a user_data script that runs the add-host command Rancher generates, like so:

docker run --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.9 http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1/scripts/x:y:z

They stay up for a while and then, rarely and seemingly at random, disconnect and are unable to reconnect.

I checked the rancher agent logs, and see this happening:

time="2018-10-02T14:46:30Z" level=info msg="Reply: dec824e6-b945-43ec-bb91-4f1da33948a9, compute.instance.activate, 1ihm4215085:instanceHostMap" 
time="2018-10-02T14:46:30Z" level=info msg="Reply: 50237fa1-e7ae-4805-8da4-bf3a300e0b40, compute.instance.activate, 1ihm4215086:instanceHostMap" 
time="2018-10-02T18:18:34Z" level=error msg="Received error reading from socket. Exiting." error="write tcp 10.0.x.x:34646->10.0.x.x:8080: i/o timeout" 
time="2018-10-02T18:18:39Z" level=warning msg="websocket closed: write tcp 10.0.x.x:48222->10.0.x.x:8080: i/o timeout" 
time="2018-10-02T18:18:43Z" level=warning msg="Hit websocket pong timeout. Last websocket ping received at 2018-10-02 18:18:20.694883542 +0000 UTC. Closing connection." 
time="2018-10-02T18:18:48Z" level=error msg="Failed to connect to websocket proxy: %vwrite tcp 10.0.x.x:34646->10.0.6.194:8080: i/o timeout" 
time="2018-10-02T18:19:37Z" level=error msg="Failed to get rancher client for host-api startup: Get http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" 
time="2018-10-02T18:19:53Z" level=info msg="Hit websocket pong timeout. Last websocket ping received at 2018-10-02 18:18:46.472304538 +0000 UTC. Closing connection." 
INFO: Starting agent for 0FB398F1E541B7x
INFO: Access Key: 0FB398F1E541B7Dx
INFO: Config URL: http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1
INFO: Storage URL: http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1
INFO: API URL: http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1
INFO: IP: 10.0.x.x
INFO: Port:
INFO: Required Image: rancher/agent:v1.2.9
INFO: Current Image: rancher/agent:v1.2.9
INFO: Using image rancher/agent:v1.2.9
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
INFO: Downloading agent http://internal-Rancher-Engines-Internal-x.us-west-2.elb.amazonaws.com:8080/v1/configcontent/configscripts
[binary data printed to the terminal, omitted]
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
{"id":"8465ecef-7743-4e9f-82f1-468f1c6789f2","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null,"baseType":"error"}ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
ERROR: Credentials are no longer valid, please re-register this agent
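
For anyone who wants to catch this state automatically, here is a minimal sketch that scans agent output for the error string shown in the log above. The function name `needs_reregistration` is made up for illustration, and the container name `rancher-agent` in the usage comment is an assumption, so adjust it to whatever `docker ps` shows on your host.

```shell
#!/usr/bin/env bash
# Sketch: detect the "credentials no longer valid" state from agent log text.
# Reads log lines on stdin and exits 0 if the error string appears, 1 otherwise.
# needs_reregistration is a hypothetical helper name, not part of Rancher.
needs_reregistration() {
  grep -qF 'Credentials are no longer valid, please re-register this agent'
}

# Hypothetical usage (the container name "rancher-agent" is an assumption):
#   docker logs --tail 200 rancher-agent 2>&1 | needs_reregistration \
#     && echo "agent needs re-registration"
```

This only reports the symptom; it says nothing about why the credentials were rejected.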

I am not sure why it disconnects, and I'm also not sure why the credentials become invalid or why the agent download comes back in a non-gzip format. My first guess was that the hosts were running out of memory, so I set a lot of reservations and limits to prevent that from happening.
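
On the "not in gzip format" message: gzip reports this when its input does not begin with the gzip magic bytes 0x1f 0x8b. A small sketch for checking a saved copy of a download before handing it to tar is below. The helper name `is_gzip` is hypothetical, and this only confirms what kind of body came back, not why the server sent it.

```shell
#!/usr/bin/env bash
# Sketch: check whether a file looks like a gzip stream before extracting it.
# is_gzip is a hypothetical helper; it only inspects the first two bytes.
is_gzip() {
  # gzip streams begin with the two magic bytes 0x1f 0x8b.
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Hypothetical usage on a saved download:
#   if is_gzip ./configscripts; then tar -xzf ./configscripts; else head -c 300 ./configscripts; fi
```

If the check fails, printing the first few hundred bytes often shows whether the body is an error document (such as the 401 JSON seen in the log) instead of an archive.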

I am also not sure why it cannot reconnect. If I remove all the rancher services on the host and grab a script from rancher to add the host back, it seems to work fine. Scaling up the auto scale group will bring new hosts in, so it seems like the user_data script is still valid as well.
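
The manual recovery described above could be scripted roughly as follows. This is a sketch under assumptions that are not from my setup notes: the agent container is named `rancher-agent`, and stale agent state lives in `/var/lib/rancher/state`. The `run` and `reregister_agent` helpers are made-up names, and the registration URL is whatever the Add Host screen gives you (the `x:y:z` token is left as a placeholder). Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch of the manual recovery: remove the old agent, clear its state, re-register.
# Assumptions (not confirmed by the post): container name "rancher-agent",
# state directory /var/lib/rancher/state.

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reregister_agent() {
  local registration_url="$1"                    # the .../v1/scripts/x:y:z URL from Add Host
  local agent_image="${2:-rancher/agent:v1.2.9}" # image tag taken from the command in this post

  run docker rm -f rancher-agent
  run rm -rf /var/lib/rancher/state
  run docker run --rm --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher \
    "$agent_image" "$registration_url"
}

# Hypothetical usage:
#   DRY_RUN=1 reregister_agent 'http://<rancher-server>:8080/v1/scripts/x:y:z'
```

Running it with `DRY_RUN=1` first is a cheap way to confirm the paths and names match the host before anything is deleted.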

Any ideas on where I can check for more errors, or on a possible fix?
Thank you!

cloudlady911 · March 16, 2020, 7:27pm

This still happens to me sometimes. Did you ever figure out why?
