"rancher-metadata" is opening too many files (actually sockets)

Hi,
First of all sorry tho bother you.

I’m using Rancher 1.3.4 in a dev cattle environment (One server with rancher/server, another with agent/services), and everything was fine.

When suddenly, a wild behavior appeared: Once I upgrade a service (through docker-compose our through the webUI), not every time but almost, the “scheduler/scheduler” and the updated instance throw me a “Timeout getting IP address” and start an infinite loop. Here is the log from the rancher/server :

ERROR [d75237c2-d871-4878-9a72-e6349c9a0f76:34875] [instance:4096->instanceHostMap:4018] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-8] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=22]: Timeout getting IP address 
ERROR [d75237c2-d871-4878-9a72-e6349c9a0f76:34875] [instance:4096] [instance.start->(InstanceStart)] [] [ecutorService-8] [i.c.p.process.instance.InstanceStart] Failed [2/3] to Starting for instance [4096] 

I don’t know where is the problem and how to resolve it. I have to clean up the host and reconfigure it every day, since it bug again.

Can someone tell me where should I look, what should I do ?

Thanks

EDIT: I found that “network-services/metadata” is throwing http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 1s. rancher-metadata process opens more than 1024 files at the time.

1 Like

Just had the same problem on live production and CI…
3 environments, 2 were buggy, the only solution I found : delete environment, then recreate it…

Still having the problem after recreated everything… :frowning:
Rebooting the metadata service seems the only way for me to temporary solve the problem

Hi all,

The problem with rancher-metadata opening loads of TCP connections should be fixed in Rancher v1.4.

The root cause was a bug in go-rancher-metadata library, that used to open a TCP (keep alive) connections and never bothered to close. The fix made it into Rancher v1.4: all Rancher services using the library should have received the fix.

Make sure to update to the latest master of go-rancher-metadata if you use it in your code.

Seems that the “rancher/scheduler” is still using an old version of go-rancher-metadata.
Do i have something to do to update this ?

@miccarr Have you upgraded to 1.4? If so, you can go to the infrastructure stacks and upgrade to the latest scheduler template.

Already done… here are my infrastructure stacks :

Still stuck with the problem, while everything still up to date.
I added an healthcheck to restart the service when it fails.

Did someone have an idea how to resolve this ?
I also tried to reinstall on a new host, or a new env, same result.

We experience the exact same problem on an AWS EC2 instance running the default AWS AMI.
I’ve upgraded to 1.4, upgraded the infrastructure stack but keep having this issue.

A snippet from the metadata container logs:
16/02/2017 15:29:002017/02/16 14:29:00 http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 1s
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=info msg=“Initializing event router” workerCount=3
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=error msg=“Error subscribing to events: dial tcp: lookup rancher..com on 192.168.104.2:53: dial udp 192.168.104.2:53: socket: too many open files” subscribeUrl="wss://rancher..com/v1/subscribe?eventNames=ping&eventNames=config.update"
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=error msg="Exiting subscriber: dial tcp: lookup rancher..com on 192.168.104.2:53: dial udp 192.168.104.2:53: socket: too many open files"
16/02/2017 15:29:00time=“2017-02-16T14:29:00Z” level=error msg=“Failed to download and reload metadata: Get https://rancher..com/v1/configcontent/metadata-answers: dial tcp: lookup rancher..com on 192.168.104.2:53: dial udp 192.168.104.2:53: socket: too many open files”

And here a listing of the ulimits:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31861
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 20000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 31861
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I’m on rancher 1.4.1 and there’s still the same error on my stack. One of the rancher-metadata containers is opening too many connections.

The other two containers (I’ve got 3 nodes) are at some 300% CPU while they are loading the metadata.

2 Likes

Hi,
is there any news regarding this topic?
I need to restart the metadata service every day now.

In my environments (1.4.1) I only have issues with ubuntu 16.04 hosts.
RancherOS 0.8.0 seems to be working alright.

21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=info msg=“Initializing event router” workerCount=3
21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=error msg=“Error subscribing to events: dial tcp rancher:8080: socket: too many open files” subscribeUrl="ws://rancher:8080/v1/subscribe?eventNames=ping&eventNames=config.update"
21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=error msg="Exiting subscriber: dial tcp rancher:8080: socket: too many open files"
21.2.2017 08:34:01time=“2017-02-21T07:34:01Z” level=error msg="Failed to download and reload metadata: Get http://rancher:8080/v1/configcontent/metadata-answers: dial tcp rancher:8080: socket: too many open files"
21.2.2017 08:34:012017/02/21 07:34:01 http: Accept error: accept tcp [::]:80: accept4: too many open files; retrying in 1s

The accept tcp [::]:80 seems to be related to my load balancer (which is load-balancing 80 -> 8080)

/hw

1 Like

Adding a health check in the metadata service worked for me… not a solution, but a workaround.

Hi,
not sure what you mean, but adding a healtcheck to my services did not help me.

Now I’ve seen there’s been an update to the metaservice (0.7.2). Have’nt read any changelog, but one can hope it improves this issue.

/hw

@denise,

There is still issues with the metaservice. This is Rancher 1.4.1.

doing shell on meta-service

ulimit -n
4096
lsof -i tcp|wc -l
175

ls -l /proc/1/fd/ | wc -l

This value increases until 4097 and then crashing.

I can do some more debugging, if needed.

I did think my RancherOS (0.8.1) was not affected, but ulimit -n on these are: 1000000
after 1 day:
ls -l /proc/1/fd/ | wc -l
17547

/hw

@miccarr What health check did you add?

Thanks
Mark

Into Stacks > Infrastructure > Network-services > Metadata
Just go into the upgrade page, healthcheck tab, and set HTTP to port 80, path /
It restart the service when it start to fails

Thanks! :smiley:

That helps, at least until they address this as a fix of their own.

Same problem here. This is really a pretty major issue. It is keeping us from going production right now.

Thanks. Will give the healthcheck a try. We are maxing out several times a day. We are using Amazon linux that doesn’t have a 1000000 ulimit. I guess I should mention I am using Rancher 1.4.1.

This is a major problem. Rancher 1.4.1