I’m using Rancher 1.3.4 in a dev cattle environment (One server with rancher/server, another with agent/services), and everything was fine.
When suddenly, a wild behavior appeared: Once I upgrade a service (through docker-compose our through the webUI), not every time but almost, the “scheduler/scheduler” and the updated instance throw me a “Timeout getting IP address” and start an infinite loop. Here is the log from the rancher/server :
ERROR [d75237c2-d871-4878-9a72-e6349c9a0f76:34875] [instance:4096->instanceHostMap:4018] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-8] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=22]: Timeout getting IP address
ERROR [d75237c2-d871-4878-9a72-e6349c9a0f76:34875] [instance:4096] [instance.start->(InstanceStart)] [] [ecutorService-8] [i.c.p.process.instance.InstanceStart] Failed [2/3] to Starting for instance [4096]
I don’t know where is the problem and how to resolve it. I have to clean up the host and reconfigure it every day, since it bug again.
Can someone tell me where should I look, what should I do ?
Thanks
EDIT: I found that “network-services/metadata” is throwing http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 1s. rancher-metadata process opens more than 1024 files at the time.
The problem with rancher-metadata opening loads of TCP connections should be fixed in Rancher v1.4.
The root cause was a bug in go-rancher-metadata library, that used to open a TCP (keep alive) connections and never bothered to close. The fix made it into Rancher v1.4: all Rancher services using the library should have received the fix.
Make sure to update to the latest master of go-rancher-metadata if you use it in your code.
We experience the exact same problem on an AWS EC2 instance running the default AWS AMI.
I’ve upgraded to 1.4, upgraded the infrastructure stack but keep having this issue.
A snippet from the metadata container logs:
16/02/2017 15:29:002017/02/16 14:29:00 http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 1s
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=info msg=“Initializing event router” workerCount=3
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=error msg=“Error subscribing to events: dial tcp: lookup rancher..com on 192.168.104.2:53: dial udp 192.168.104.2:53: socket: too many open files” subscribeUrl="wss://rancher..com/v1/subscribe?eventNames=ping&eventNames=config.update"
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=error msg="Exiting subscriber: dial tcp: lookup rancher..com on 192.168.104.2:53: dial udp 192.168.104.2:53: socket: too many open files"
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=error msg=“Failed to download and reload metadata: Get https://rancher..com/v1/configcontent/metadata-answers: dial tcp: lookup rancher..com on 192.168.104.2:53: dial udp 192.168.104.2:53: socket: too many open files”
And here a listing of the ulimits:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31861
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 20000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 31861
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Hi,
is there any news regarding this topic?
I need to restart the metadata service every day now.
In my environments (1.4.1) I only have issues with ubuntu 16.04 hosts.
RancherOS 0.8.0 seems to be working alright.
21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=info msg=“Initializing event router” workerCount=3
21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=error msg=“Error subscribing to events: dial tcp rancher:8080: socket: too many open files” subscribeUrl=“ws://rancher:8080/v1/subscribe?eventNames=ping&eventNames=config.update”
21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=error msg=“Exiting subscriber: dial tcp rancher:8080: socket: too many open files”
21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=error msg=“Failed to download and reload metadata: Get http://rancher:8080/v1/configcontent/metadata-answers: dial tcp rancher:8080: socket: too many open files”
21.2.2017 08:34:012017/02/21 07:34:01 http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 1s
The accept tcp [::]:80 seems to be related to my load balancer (which is load-balancing 80 → 8080)
Into Stacks > Infrastructure > Network-services > Metadata
Just go into the upgrade page, healthcheck tab, and set HTTP to port 80, path /
It restart the service when it start to fails
Thanks. Will give the healthcheck a try. We are maxing out several times a day. We are using Amazon linux that doesn’t have a 1000000 ulimit. I guess I should mention I am using Rancher 1.4.1.