Issue with network agents

Hello all. We’ve run into an issue where a container was pushed that accidentally exposed host ports another container had already exposed. After that, stacks deployed afterwards were failing to connect to one another (just looping in ‘networking…’ forever and failing to resolve linked host names). We tried restarting the networking agent on the hosts, but one of them is stuck trying to stop (it just says ‘stopping’ forever). Deleting stacks also appears to be locked up; the containers on the hosts just loop endlessly between ‘stopping’ and ‘networking’. Any ideas how to recover from this?
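For reference, the port collision itself is easy to reproduce outside of Rancher; a minimal sketch (image and container names here are just placeholders, not what we actually deployed):

# first container binds host port 8080
docker run -d --name web-a -p 8080:80 nginx

# second container asks for the same host port; the daemon rejects it
# with an error along the lines of "port is already allocated"
docker run -d --name web-b -p 8080:80 nginx

In our case the colliding container went out through a Rancher stack rather than plain docker run, and that is when the networking agents started misbehaving.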

To add a little more information… we’re seeing a lot of this in the logs:

We’re currently on: v0.31.0 with agents .

I restarted docker on the one host where networking was spinning, removed all stopped containers, then put the agent back on the box. The UI still shows all the old containers on that host.
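For anyone hitting the same thing, the recovery steps on that host were roughly the following (the agent version, server URL, and token below are placeholders; the real re-registration command is the one from the “Add Host” screen in the UI):

# restart the docker daemon on the affected host
sudo systemctl restart docker

# clear out the stopped containers it left behind
docker rm $(docker ps -aq -f status=exited)

# re-register the host by running the rancher/agent container again
sudo docker run -d --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  rancher/agent:<AGENT_VERSION> http://<RANCHER_SERVER>:8080/v1/scripts/<REGISTRATION_TOKEN>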

Now seeing the following in the logs:

2015-08-19 00:07:37,332 ERROR [:] [orService-30615] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed update [configscripts, services, agent-instance-scripts, monit, agent-instance-startup, node-services, ipsec, healthcheck, ipsec-hosts, hosts, iptables, healthcheck] on [agent:3], exit code [1] output [nsenter: failed to execute /var/lib/cattle/events/config.update: No such file or directory

We have a deployment going on that’s taking a while… one of the hosts keeps saying ‘reconnecting’ and the python agent is using a ton of CPU:

root 10895 46.6 0.7 71996 27380 ? R 14:36 0:01 python /var/lib/cattle/pyagent/main.py

85-96% of CPU
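For reference, this is roughly how we were spotting it (nothing fancy):

ps aux | grep '[p]yagent/main.py'   # find the agent PID and its current CPU usage
top -p <PID>                        # then watch it live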

New information: docker itself appears to be hung on our agent host. docker ps just hangs, and there is nothing in the journalctl logs for docker in the last 2 minutes.
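A quick way to confirm the daemon itself is wedged, rather than just one container (the timeout value is arbitrary):

timeout 10 docker ps                          # if this never returns, the daemon is hung
journalctl -u docker --since "5 minutes ago"  # and check whether it has logged anything recently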

Would you be able to restart docker to see if that helps? When docker hangs, we won’t be able to stop/start containers.

Can you provide the rancher agent logs as well?

Restarting docker fixed the problem (restarted docker, then started a new agent).

This doesn’t seem like a real option for production though, as it means taking down a whole host.

Our current working theory is that a docker-compose with a “build” caused the agent to consume tons of resources and brought it down… but we don’t have evidence.
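If that theory holds, one workaround would be to build images somewhere else (a CI box or workstation) and only reference prebuilt images from the compose file, so the agent host never runs a build itself. A rough sketch (registry and image names are placeholders):

docker build -t registry.example.com/myapp:1.0 .
docker push registry.example.com/myapp:1.0
# then point docker-compose.yml at image: registry.example.com/myapp:1.0 instead of using build: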

Those agent logs are gone now, unfortunately, I think?