Azure hosts disconnecting regularly

I’m experiencing Rancher hosts disconnecting every few days. With about 20 of them across 4 environments, that means I can expect one or more to drop almost every day. :unamused:

Restarting the agent always rectifies it.

Here are some errors that occur in the agent logs prior to the disconnect:

time="2018-03-19T15:27:44Z" level=error msg="Received error reading from socket. Exiting." error="read tcp 10.33.46.4:57782->10.33.250.119:80: read: connection reset by peer" 
time="2018-03-19T15:27:44Z" level=error msg="Failed to connect to websocket proxy: %vread tcp 10.33.46.4:57782->10.33.250.119:80: read: connection reset by peer" 
time="2018-03-19T15:27:49Z" level=error msg="Failed to get rancher client for host-api startup: Bad response statusCode [502]. Status [502 Bad Gateway]. Body: [<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\"/>\r\n<title>502 - Web server received an invalid response while acting as a gateway or proxy server.</title>\r\n<style type=\"text/css\">\r\n<!--\r\nbody{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}\r\nfieldset{padding:0 15px 10px 15px;} \r\nh1{font-size:2.4em;margin:0;color:#FFF;}\r\nh2{font-size:1.7em;margin:0;color:#CC0000;} \r\nh3{font-size:1.2em;margin:10px 0 0 0;color:#000000;} \r\n#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:\"trebuchet MS\", Verdana, sans-serif;color:#FFF;\r\nbackground-color:#555555;}\r\n#content{margin:0 0 0 2%;position:relative;}\r\n.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}\r\n-->\r\n</style>\r\n</head>\r\n<body>\r\n<div id=\"header\"><h1>Server Error</h1></div>\r\n<div id=\"content\">\r\n <div class=\"content-container\"><fieldset>\r\n  <h2>502 - Web server received an invalid response while acting as a gateway or proxy server.</h2>\r\n  <h3>There is a problem with the page you are looking for, and it cannot be displayed. When the Web server (while acting as a gateway or proxy) contacted the upstream content server, it received an invalid response from the content server.</h3>\r\n </fieldset></div>\r\n</div>\r\n</body>\r\n</html>\r\n] from [http://rancher.xxx.xxx.xxx/v1]"

I can understand that this would cause the agent grief, but it doesn’t appear to keep retrying afterwards.
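
To be clear about what I mean by retrying, here is a minimal sketch of the behaviour I would expect after a connection reset or a 502. This is not the agent's actual code; `connect_with_retry`, its parameters, and the delay values are hypothetical names and numbers of my own, used only for illustration.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between failures.

    Hypothetical helper: `connect` is any callable that raises
    ConnectionError (or a subclass such as ConnectionResetError) on failure.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as err:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            print(f"attempt {attempt} failed: {err}; retrying in {delay:.0f}s")
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With something like this, a transient 502 from the proxy in front of the server would only delay the reconnect rather than leave the host disconnected until someone restarts the agent.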

And a different example:

time="2018-03-19T15:42:11Z" level=error msg="checking aws cloud provider error after 6 attempts, last error: EC2MetadataRequestError: failed to get EC2 instance identity document\ncaused by: EC2MetadataError: failed to make EC2Metadata request\ncaused by: <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\"/>\r\n<title>404 - File or directory not found.</title>\r\n<style type=\"text/css\">\r\n<!--\r\nbody{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}\r\nfieldset{padding:0 15px 10px 15px;} \r\nh1{font-size:2.4em;margin:0;color:#FFF;}\r\nh2{font-size:1.7em;margin:0;color:#CC0000;} \r\nh3{font-size:1.2em;margin:10px 0 0 0;color:#000000;} \r\n#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:\"trebuchet MS\", Verdana, sans-serif;color:#FFF;\r\nbackground-color:#555555;}\r\n#content{margin:0 0 0 2%;position:relative;}\r\n.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}\r\n-->\r\n</style>\r\n</head>\r\n<body>\r\n<div id=\"header\"><h1>Server Error</h1></div>\r\n<div id=\"content\">\r\n <div class=\"content-container\"><fieldset>\r\n  <h2>404 - File or directory not found.</h2>\r\n  <h3>The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable.</h3>\r\n </fieldset></div>\r\n</div>\r\n</body>\r\n</html>\r\n" 
time="2018-03-19T15:42:41Z" level=warning msg="stat /var/lib/rancher/state/info.json: no such file or directory" 
time="2018-03-19T15:45:11Z" level=error msg="checking aliyun cloud provider error after 2 attempts, last error: Get http://100.100.100.200/latest/meta-data/zone-id: dial tcp 100.100.100.200:80: i/o timeout"

The first error is odd in that I’m not aware of configuring anything for AWS. It’s not surprising that it fails, since this is running in Azure. :stuck_out_tongue:

The second error I would expect if the agent version were a mismatch for the supported Docker version, but it isn’t. And if it were a real issue, surely the agent would never start?

The last error is also odd. Aliyun cloud provider? Huh?

Here’s the environment that is running:

  • Rancher - 1.6.12
  • Rancher Agent - 1.2.7
  • Rancher Container Orchestration - Cattle
  • Docker - 17.06.2-ce
  • Host/Server OS - Ubuntu 16.04.3 LTS
  • Cloud - Azure