Issue with network agents

jgmartin · August 18, 2015, 8:58pm

Hello all. We’ve run into an issue where a container was pushed that accidentally exposed host ports that another container had already exposed. After this, stacks deployed afterwards failed to connect to one another (they just loop in ‘networking…’ forever and fail to resolve linked host names). We tried restarting the networking agent on the hosts, but one of them is stuck trying to stop (it just says ‘stopping’ forever). Deleting stacks also appears to be locked up: the containers on the hosts loop endlessly between stopping and networking. Any ideas how to recover from this?
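
For anyone who wants to catch this kind of clash before deploying, one option is to compare the host ports each container publishes. This is only a minimal sketch: it assumes you have already collected the published ports per container somehow, and the function name, container names, and ports below are hypothetical, not taken from our setup.

```python
from collections import defaultdict


def find_host_port_conflicts(published):
    """Return {host_port: [container names]} for ports claimed more than once.

    `published` maps a container name to the host ports it publishes.
    """
    owners = defaultdict(list)
    for name, ports in published.items():
        # set() so a port listed twice by one container is not a "conflict".
        for port in set(ports):
            owners[port].append(name)
    return {port: sorted(names) for port, names in owners.items() if len(names) > 1}


# Hypothetical example data for illustration only.
example = {"web": [80, 443], "api": [8080], "new-stack": [80]}
print(find_host_port_conflicts(example))  # {80: ['new-stack', 'web']}
```

An empty result means no two containers in the input claim the same host port.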

tobowers · August 19, 2015, 12:09am

To add a little more information… we’re seeing a lot of this in the logs:

https://gist.github.com/tobowers/f506b96d64d2d6eabfe8

[tbowers@ip-10-53-5-199 ~]$ sudo docker logs --tail=500 rancher-server
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]

time="2015-08-18T20:14:26Z" level=info msg="Invalid backend host requested." hostUuid=32df8cde-5d5a-41f1-85c2-3ab8f1aca5f5 
time="2015-08-18T20:20:10Z" level=info msg="Invalid backend host requested." hostUuid=32df8cde-5d5a-41f1-85c2-3ab8f1aca5f5 
2015-08-18 20:20:27,551 ERROR [:] [] [] [] [862633345-32242] [i.g.i.g.r.handler.ExceptionHandler  ] Exception in API for request [io.github.ibuildthecloud.gdapi.request.ApiRequest@396bdb49] io.cattle.platform.engine.process.impl.ProcessCancelException: State [purged] is not valid
	at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.preRunStateCheck(DefaultProcessInstanceImpl.java:282) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:186) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:161) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]

This file has been truncated.

We’re currently on: v0.31.0 with agents .

I restarted docker on the one host where networking was spinning, removed all stopped containers, and then put the agent back on the box. The UI still shows all the old containers on that host.

Now seeing the following in the logs:

2015-08-19 00:07:37,332 ERROR [:] [orService-30615] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed update [configscripts, services, agent-instance-scripts, monit, agent-instance-startup, node-services, ipsec, healthcheck, ipsec-hosts, hosts, iptables, healthcheck] on [agent:3], exit code [1] output [nsenter: failed to execute /var/lib/cattle/events/config.update: No such file or directory

tobowers · August 19, 2015, 2:37pm

We have a deployment going on that’s taking a while… one of the hosts keeps saying ‘reconnecting’ and the python agent is using a ton of CPU:

root 10895 46.6 0.7 71996 27380 ? R 14:36 0:01 python /var/lib/cattle/pyagent/main.py

85-96% of CPU

tobowers · August 19, 2015, 3:26pm

New information: docker itself appears to be hung on our agent host. docker ps just hangs, and there is nothing in the journalctl logs for docker in the last 2 minutes.
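
A hang like this could be detected automatically by running the command with a timeout. The following is a rough sketch, not something we run today: `responds_within` is a hypothetical helper name, and the timeout value is an arbitrary assumption.

```python
import subprocess


def responds_within(cmd, timeout_s):
    """Return True if `cmd` exits with status 0 within `timeout_s` seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return False
    return result.returncode == 0
```

For example, `responds_within(["docker", "ps"], 10.0)` should come back False when the daemon is hung the way it is here, instead of blocking the shell.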

denise · August 19, 2015, 4:33pm

Would you be able to restart docker to see if that helps? When docker hangs, we won’t be able to stop/start containers.

Can you provide the rancher agent logs as well?

tobowers · August 19, 2015, 5:08pm

Restarting docker fixed the problem (restarted docker, then started a new agent).

This doesn’t seem like a real option for production though as it means taking down a whole host.

Our current working theory is that a docker-compose with a “build” caused the agent to consume tons of resources, which brought the agent down… but we don’t have evidence.

Unfortunately, I think those agent logs are gone now.
