Urgent help needed with rancher server wonkiness

On Friday we tried to upgrade one of our stacks and saw the network-services containers start to flap. We spent several hours trying to get things back to a normal state, to no avail. We ended up trying to recreate the whole environment, but cannot get a new environment stack to come online in a clean state. Network services are always the issue: network-manager flaps because metadata never activates.

Currently we have three environments that refuse to deactivate.

In our attempts to fix the issue with the brand-new dev2 environment, we tried to remove some of the underlying infrastructure services so we could recreate them.

It took overnight before the services were removed, but I was eventually able to remove all of the core Rancher infrastructure stacks. However, the DNS container is not coming up.

I also see a very large number of delayed processes, mostly volumestoragepoolmap.remove.

I have seen issues referencing delayed processes that lead me to believe this could be an underlying cause of the other problems, but I haven't really found anything that explains how to clear these out safely. I did find a reference to a rancher/clean-up image that does database manipulation to prune cruft, but I don't know whether it will help or hinder at this point. I can't find any Rancher documentation on this tool, so it's a bit of a mystery.
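
For reference, this is roughly how I have been sizing up the backlog, with a quick query against the cattle database (just a sketch; I'm assuming the usual cattle schema with a process_instance table and process_name/end_time columns, and the DB endpoint and user below are placeholders):

# connect to the cattle database and count unfinished process instances by
# process name (e.g. volumestoragepoolmap.remove)
mysql -h rancher-db.example.com -u cattle -p cattle -e "
  SELECT process_name, COUNT(*) AS pending
    FROM process_instance
   WHERE end_time IS NULL
   GROUP BY process_name
   ORDER BY pending DESC;"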

Typically we run 4-5 environments with approximately 15 hosts across the board. Each environment typically runs about 10 containers, with one environment having the lion's share of maybe 50 application containers across 6 hosts. We upgraded to server 1.6.18 during this mess and haven't upgraded the infrastructure stacks in the other environments yet.

If anyone can help with this craziness, it would be appreciated. We would also consider opening a Rancher support plan for this, so if anyone is listening, please let me know.

There is no need to delete any of the infrastructure stacks to activate them. Adding a host to an environment will launch all the containers necessary to set up the environment; the same applies to every other added host. To determine the root cause we will need the logs of the network-manager, metadata, and metadata-dns containers to see why they refuse to keep running.

In these cases it also helps when you post host OS/kernel/Docker info.
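
Something along these lines, run on one of the affected hosts, should capture what we need (the exact container names vary per host, so I'm just grepping for them; adjust as needed):

# host OS / kernel / Docker versions
cat /etc/os-release
uname -r
docker version
docker info

# recent logs from the flapping infrastructure containers
docker ps -a | grep -E 'network-manager|metadata|ipsec'
for c in $(docker ps -a --format '{{.Names}}' | grep -E 'network-manager|metadata'); do
  echo "===== $c ====="
  docker logs --tail 200 "$c" 2>&1
done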

I work with Jeremy, here is the requested info.

OS: Amazon Linux AMI 2018.03 (Linux kernel 4.14)
Docker: 18.03.1-ce

Note: network-manager eventually came up, but ipsec continues to go up and down.

Here are the logs:
From the ipsec log:
6/24/2018 8:55:10 AMRefer to router sidekick for logs
6/24/2018 10:32:59 AMRefer to router sidekick for logs
6/24/2018 10:32:59 AMmkfifo: cannot create fifo ‘f’: File exists

From network-services-metadata-dns:
6/24/2018 10:33:03 AMtime=“2018-06-24T14:33:03Z” level=info msg=“Starting rancher-dns v0.17.3”
6/24/2018 10:33:03 AMtime=“2018-06-24T14:33:03Z” level=info msg=“Loaded answers”
6/24/2018 10:33:03 AMtime=“2018-06-24T14:33:03Z” level=info msg=“Listening for Reload on 127.0.0.1:8113”
6/24/2018 10:33:03 AMtime=“2018-06-24T14:33:03Z” level=info msg=“Listening on 169.254.169.250:53”
6/24/2018 10:33:03 AMtime=“2018-06-24T14:33:03Z” level=info msg=“Reloading answers”
6/24/2018 10:33:03 AMtime=“2018-06-24T14:33:03Z” level=info msg=“Reloaded answers”
6/24/2018 8:17:46 PMtime=“2018-06-25T00:17:46Z” level=info msg=“Reloading answers”
6/24/2018 8:17:46 PMtime=“2018-06-25T00:17:46Z” level=info msg=“Reloaded answers”
6/24/2018 8:18:36 PMtime=“2018-06-25T00:18:36Z” level=info msg=“Reloading answers”
6/24/2018 8:18:36 PMtime=“2018-06-25T00:18:36Z” level=info msg=“Reloaded answers”

From network-services-metadata:
6/24/2018 8:18:02 PMtime=“2018-06-25T00:18:02Z” level=info msg=“Downloaded in 98.16344ms”
6/24/2018 8:18:02 PMtime=“2018-06-25T00:18:02Z” level=info msg=“Generating and reloading answers”
6/24/2018 8:18:02 PMtime=“2018-06-25T00:18:02Z” level=info msg=“Generating answers”
6/24/2018 8:18:02 PMtime=“2018-06-25T00:18:02Z” level=info msg=“Generated and reloaded answers”
6/24/2018 8:18:02 PMtime=“2018-06-25T00:18:02Z” level=info msg=“Applied http://…/v1/configcontent/metadata-answers?client=v2&requestedVersion=8020?version=8020-25f88b208a323837e5062060b101a217”
6/24/2018 8:18:02 PMtime=“2018-06-25T00:18:02Z” level=info msg=“Download and reload in: 462.928311ms”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Update requested for version: 8021” access_key=… url=“http://…/v1”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Downloading metadata”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Downloaded in 87.544977ms”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Generating and reloading answers”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Generating answers”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Generated and reloaded answers”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Applied http://…/v1/configcontent/metadata-answers?client=v2&requestedVersion=8021?version=8021-25f88b208a323837e5062060b101a217”
6/24/2018 8:18:18 PMtime=“2018-06-25T00:18:18Z” level=info msg=“Download and reload in: 363.629887ms”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Update requested for version: 8022” access_key=… url=“http://…/v1”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Downloading metadata”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Downloaded in 80.420017ms”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Generating and reloading answers”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Generating answers”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Generated and reloaded answers”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Applied http://…/v1/configcontent/metadata-answers?client=v2&requestedVersion=8022?version=8022-25f88b208a323837e5062060b101a217”
6/24/2018 8:18:19 PMtime=“2018-06-25T00:18:19Z” level=info msg=“Download and reload in: 403.31707ms”
6/24/2018 8:18:21 PMtime=“2018-06-25T00:18:21Z” level=info msg=“Update requested for version: 8023” access_key=… url=“http://…/v1”
6/24/2018 8:18:21 PMtime=“2018-06-25T00:18:21Z” level=info msg=“Downloading metadata”
6/24/2018 8:18:21 PMtime=“2018-06-25T00:18:21Z” level=info msg=“Downloaded in 86.938687ms”

6/25/2018 10:38:38 AMtime=“2018-06-25T14:38:38Z” level=warning msg=“websocket closed: websocket: close sent”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=warning msg=“websocket closed: websocket: close sent”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Initializing event router” workerCount=3
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Downloading metadata”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Connection established”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Downloaded in 161.301058ms”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Generating and reloading answers”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Starting websocket pings”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Generating answers”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Generated and reloaded answers”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Applied http://…/v1/configcontent/metadata-answers?client=v2&requestedVersion=8029?version=8029-25f88b208a323837e5062060b101a217”
6/25/2018 10:38:39 AMtime=“2018-06-25T14:38:39Z” level=info msg=“Download and reload in: 609.613631ms”

In general our Rancher environments are unstable, and we have a few environments that we were trying to remove which seem stuck deactivating and/or purging.

Additionally, we still face the issues Jeremy described, with many processes running and delayed.

The logs look fine; reload times are a bit high, but nothing disturbing. If you can share the network-manager logs as well, that would help.

Also, some generic info on the setup would help. If actions in environments take such a long time, it can also depend on:

  • Number of hosts running the rancher/server container (single or HA, and how many)
  • Specs of the hosts running the rancher/server container
  • What database is being used, what specs it has, etc.

Also, rancher/server logs could show some indication on the issue.
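
For the server and database part, something like this gives a quick picture (standard docker and MySQL commands; the container name and DB endpoint below are placeholders for your own):

# rancher/server logs and resource usage on the server host
docker ps | grep rancher/server
docker logs --tail 500 <rancher-server-container> 2>&1
docker stats --no-stream

# quick look at the MySQL side: connection pressure and buffer pool size
mysql -h rancher-db.example.com -u cattle -p -e "
  SHOW GLOBAL VARIABLES LIKE 'max_connections';
  SHOW GLOBAL STATUS LIKE 'Threads_connected';
  SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';"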

Here are the network-manager logs:

6/24/2018 8:54:21 AMtime="2018-06-24T12:54:21Z" level=info msg="hostnat: Applying new nat rules"
6/24/2018 8:55:02 AMtime="2018-06-24T12:55:02Z" level=info msg="arpsync: Network router changed, syncing ARP tables 1/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:55:02 AMtime="2018-06-24T12:55:02Z" level=info msg="hostports: : Applying new rules"
6/24/2018 8:55:04 AMtime="2018-06-24T12:55:04Z" level=info msg="Setting up resolv.conf for ContainerId [91186d04f8fbcf5a8d577cc5f6f8d8e3acc03116cc707397b36791a0ab606d0d]"
6/24/2018 8:55:07 AMtime="2018-06-24T12:55:07Z" level=info msg="arpsync: Network router changed, syncing ARP tables 2/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:55:09 AMtime="2018-06-24T12:55:09Z" level=info msg="Setting up resolv.conf for ContainerId [0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39]"
6/24/2018 8:55:09 AMtime="2018-06-24T12:55:09Z" level=info msg="CNI up: cid=0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39 networkMode=ipsec"
6/24/2018 8:55:10 AMtime="2018-06-24T12:55:10Z" level=info msg="CNI up done: result=IP4:{IP:{IP:10.42.8.127 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]} cid=0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39 networkMode=ipsec"
6/24/2018 8:55:10 AMtime="2018-06-24T12:55:10Z" level=info msg="hostports: : Applying new rules"
6/24/2018 8:55:12 AMtime="2018-06-24T12:55:12Z" level=info msg="arpsync: Network router changed, syncing ARP tables 3/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:55:17 AMtime="2018-06-24T12:55:17Z" level=info msg="arpsync: Network router changed, syncing ARP tables 4/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:55:23 AMtime="2018-06-24T12:55:23Z" level=info msg="arpsync: Network router changed, syncing ARP tables 5/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:55:28 AMtime="2018-06-24T12:55:28Z" level=info msg="arpsync: Network router changed, syncing ARP tables 6/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:55:34 AMtime="2018-06-24T12:55:34Z" level=info msg="iptablessync: need to hook CATTLE_NETWORK_POLICY chain"
6/24/2018 8:55:34 AMtime="2018-06-24T12:55:34Z" level=info msg="iptablessync: installing cmd: iptables -w -t filter -I FORWARD 1 -s 10.42.0.0/16 -d 10.42.0.0/16 -j CATTLE_NETWORK_POLICY"
6/24/2018 8:55:34 AMtime="2018-06-24T12:55:34Z" level=info msg="iptablessync: fixing order for CATTLE_FORWARD chain"
6/24/2018 8:55:34 AMtime="2018-06-24T12:55:34Z" level=info msg="iptablessync: installing cmd: iptables -w -t filter -I FORWARD 2 -j CATTLE_FORWARD"
6/24/2018 9:10:37 AMtime="2018-06-24T13:10:37Z" level=info msg="arpsync: Network router changed, syncing ARP tables 7/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 9:10:46 AMtime="2018-06-24T13:10:46Z" level=info msg="arpsync: Network router changed, syncing ARP tables 8/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 9:10:51 AMtime="2018-06-24T13:10:51Z" level=info msg="arpsync: Network router changed, syncing ARP tables 9/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 9:10:56 AMtime="2018-06-24T13:10:56Z" level=info msg="arpsync: Network router changed, syncing ARP tables 10/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 10:10:37 AMtime="2018-06-24T14:10:37Z" level=info msg="CNI down: cid=0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39 networkMode=ipsec"
6/24/2018 10:11:17 AMtime="2018-06-24T14:11:17Z" level=info msg="hostports: : Applying new rules"
6/24/2018 10:32:53 AMtime="2018-06-24T14:32:53Z" level=info msg="routesync: DOCKER_BRIDGE=docker0, METADATA_IP=169.254.169.250"
6/24/2018 10:32:53 AMtime="2018-06-24T14:32:53Z" level=info msg="Listening on /tmp/log.sock"
6/24/2018 10:32:53 AMtime="2018-06-24T14:32:53Z" level=info msg="routesync: starting monitoring on bridge: docker0, for metadataIP: 169.254.169.250 every 60 seconds"
6/24/2018 10:32:53 AMtime="2018-06-24T14:32:53Z" level=info msg="Waiting for metadata"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Inspecting on start: startedAt=2018-06-24T14:07:44.232225172Z running=false cid=31873dae020f4ce4009eeccc253f54b08e505dc200aa06e9a3843d40bb07c095"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Inspecting on start: startedAt=2018-06-24T14:29:55.194730429Z running=false cid=0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Inspecting on start: startedAt=2018-06-24T14:32:48.845135292Z running=true cid=91186d04f8fbcf5a8d577cc5f6f8d8e3acc03116cc707397b36791a0ab606d0d"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Recording previously started: startedAt=2018-06-24T14:32:48.845135292Z cid=91186d04f8fbcf5a8d577cc5f6f8d8e3acc03116cc707397b36791a0ab606d0d"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Inspecting on start: startedAt=2018-06-24T14:32:54.165369795Z running=true cid=35878070ddfa1a4de5b31020094f8eb2d98e87371bf0a17433746e236c275ebe"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Recording previously started: startedAt=2018-06-24T14:32:54.165369795Z cid=35878070ddfa1a4de5b31020094f8eb2d98e87371bf0a17433746e236c275ebe"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Inspecting on start: startedAt=2018-06-24T14:32:52.909784567Z running=true cid=fe323674e5bc840eef2abebb4df663e13183c89399f5ad1540e7f4af108dc5ea"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Recording previously started: startedAt=2018-06-24T14:32:52.909784567Z cid=fe323674e5bc840eef2abebb4df663e13183c89399f5ad1540e7f4af108dc5ea"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Inspecting on start: startedAt=2018-06-24T11:47:07.026643723Z running=true cid=e2969f5f0b767409c9b837220cf5a364eeaed91fd2e80f67e2026b8d25c99613"
6/24/2018 10:32:55 AMtime="2018-06-24T14:32:55Z" level=info msg="Recording previously started: startedAt=2018-06-24T11:47:07.026643723Z cid=e2969f5f0b767409c9b837220cf5a364eeaed91fd2e80f67e2026b8d25c99613"
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="enabling dns setup"
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="Starting event router."
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="Container [0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39] not running. Can't setup resolv.conf."
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="Container [31873dae020f4ce4009eeccc253f54b08e505dc200aa06e9a3843d40bb07c095] not running. Can't setup resolv.conf."
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="Setting up resolv.conf for ContainerId [91186d04f8fbcf5a8d577cc5f6f8d8e3acc03116cc707397b36791a0ab606d0d]"
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="hostnat: Applying new nat rules"
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="hostports: : Applying new rules"
6/24/2018 10:32:56 AMtime="2018-06-24T14:32:56Z" level=info msg="arpsync: Network router changed, syncing ARP tables 1/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 10:32:59 AMtime="2018-06-24T14:32:59Z" level=info msg="Setting up resolv.conf for ContainerId [0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39]"
6/24/2018 10:32:59 AMtime="2018-06-24T14:32:59Z" level=info msg="CNI up: cid=0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39 networkMode=ipsec"
6/24/2018 10:32:59 AMtime="2018-06-24T14:32:59Z" level=info msg="CNI up done: result=IP4:{IP:{IP:10.42.8.127 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]} cid=0a22dc9f315fca0b3f4b6006403f19c37c63a09ad289ff124f9f2b38c2b70e39 networkMode=ipsec"
6/24/2018 10:33:01 AMtime="2018-06-24T14:33:01Z" level=info msg="arpsync: Network router changed, syncing ARP tables 2/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 10:33:06 AMtime="2018-06-24T14:33:06Z" level=info msg="arpsync: Network router changed, syncing ARP tables 3/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 10:33:12 AMtime="2018-06-24T14:33:12Z" level=info msg="arpsync: Network router changed, syncing ARP tables 4/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 10:33:17 AMtime="2018-06-24T14:33:17Z" level=info msg="arpsync: Network router changed, syncing ARP tables 5/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 4:39:12 PMtime="2018-06-24T20:39:12Z" level=info msg="arpsync: Network router changed, syncing ARP tables 6/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:17:46 PMtime="2018-06-25T00:17:46Z" level=info msg="arpsync: Network router changed, syncing ARP tables 7/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:17:52 PMtime="2018-06-25T00:17:52Z" level=info msg="Setting up resolv.conf for ContainerId [0e8515bf4e87918a71b28a26e54bb652bf371194457c282aa478cb5d60497687]"
6/24/2018 8:17:52 PMtime="2018-06-25T00:17:52Z" level=info msg="CNI up: cid=0e8515bf4e87918a71b28a26e54bb652bf371194457c282aa478cb5d60497687 networkMode=ipsec"
6/24/2018 8:17:52 PMtime="2018-06-25T00:17:52Z" level=info msg="CNI up done: result=IP4:{IP:{IP:10.42.212.154 Mask:ffff0000} Gateway:10.42.0.1 Routes:[{Dst:{IP:0.0.0.0 Mask:00000000} GW:10.42.0.1}]}, DNS:{Nameservers:[] Domain: Search:[] Options:[]} cid=0e8515bf4e87918a71b28a26e54bb652bf371194457c282aa478cb5d60497687 networkMode=ipsec"
6/24/2018 8:17:57 PMtime="2018-06-25T00:17:57Z" level=info msg="arpsync: Network router changed, syncing ARP tables 8/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:18:02 PMtime="2018-06-25T00:18:02Z" level=info msg="arpsync: Network router changed, syncing ARP tables 9/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:18:07 PMtime="2018-06-25T00:18:07Z" level=info msg="arpsync: Network router changed, syncing ARP tables 10/10 in containers, new MAC: 02:c4:2a:e7:ad:ee"
6/24/2018 8:18:17 PMtime="2018-06-25T00:18:17Z" level=info msg="CNI down: cid=0e8515bf4e87918a71b28a26e54bb652bf371194457c282aa478cb5d60497687 networkMode=ipsec"

At this point IPSec has been trying to activate since this morning.

We have one Rancher server running 4 different environments (occasionally 5). The Rancher server runs on a t2.xlarge AWS instance (4 CPU / 16 GB RAM).

Sample

1 host of t2.medium
3 containers

Staging

1 host of t2.large
2 hosts of t2.medium
16 containers

Production

6 hosts of t2.large
46 containers

The container count above is just our applications; it doesn't include the Rancher infrastructure stack containers. Additionally, each environment runs the Janitor stack installed from the catalog.

The dev environment is currently the problematic one, but it typically runs in an operating environment similar to staging. The Rancher server database is MySQL in RDS on a db.t2.medium instance. The RDS activity graphs look okay.

The Rancher server log is filled with exceptions, some of which seem to coincide with the delayed processes mentioned above.

The file is large, so I will just post snippets that seem relevant:

2018-06-23 17:51:42,521 ERROR [634250ec-93ca-4a53-b115-e296c18b9ec4:2125710] [port:36047] [port.purge] [] [ecutorService-5] [.p.p.c.g.GenericResourceProcessState] Resource is null, can't find resource id [36047] 
2018-06-23 17:51:56,858 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [595] count [3] 
2018-06-23 17:51:56,858 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [702] count [3] 
2018-06-23 17:51:56,858 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [4130] count [3] 
2018-06-23 17:51:59,263 ERROR [c97eebe9-8b5f-4eaa-a112-770875ed3d6c:2155985] [instance:73773] [instance.start->(InstanceStart)] [] [ecutorService-6] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance hasn't gotten passed creating state] for instance [73773] 
2018-06-23 17:51:59,357 ERROR [1608cd10-05ea-41b7-a2a8-8f65eab8566e:2178577] [instance:74247] [instance.start->(InstanceStart)] [] [cutorService-14] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance hasn't gotten passed creating state] for instance [74247] 
2018-06-23 17:51:59,505 ERROR [80895b57-46f2-42de-adc7-5a7dbd80bb34:2151718] [instance:73578] [instance.start->(InstanceStart)] [] [cutorService-33] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance hasn't gotten passed creating state] for instance [73578] 
2018-06-23 17:52:01,865 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [595] count [4] 
2018-06-23 17:52:01,865 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [702] count [4] 
2018-06-23 17:52:01,866 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [4130] count [4] 
2018-06-23 17:52:06,890 ERROR [:] [] [] [] [TaskScheduler-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [595] count [5] 
2018-06-23 17:52:06,891 ERROR [:] [] [] [] [TaskScheduler-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [702] count [5] 
2018-06-23 17:52:06,891 ERROR [:] [] [] [] [TaskScheduler-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [4130] count [5] 
2018-06-23 17:52:11,906 ERROR [:] [] [] [] [TaskScheduler-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [595] count [6] 
2018-06-23 17:52:11,908 ERROR [:] [] [] [] [TaskScheduler-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for agent [595] host [70] count [6] 
2018-06-23 17:52:13,865 ERROR [:] [] [] [] [TaskScheduler-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [4130] count [6] 
2018-06-23 17:52:13,867 ERROR [:] [] [] [] [TaskScheduler-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for agent [4130] host [263] count [6] 

2018-06-23 18:00:25,697 ERROR [:] [] [] [] [cutorService-12] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [mount.create:2193964] on [40465], canceled by [State [deactivating] is not valid for process [volume.activate:2193975] on resource [29570]] 
2018-06-23 18:06:29,134 ERROR [1f96e72c-e539-4aca-8403-304cd6a5a1f0:2135483] [instance:72853] [instance.start->(InstanceStart)] [] [cutorService-17] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [72853] 
2018-06-23 18:06:46,179 ERROR [1f50fd80-e499-46df-8e49-9572a2e7b77c:2099974] [instance:71475] [instance.start->(InstanceStart)] [] [cutorService-20] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [71475] 
2018-06-23 18:10:16,888 ERROR [7543cafd-424d-4957-b520-44e3d83ace97:2078612] [containerEvent:254621] [containerevent.create] [] [cutorService-25] [.p.p.c.g.GenericResourceProcessState] Resource is null, can't find resource id [254621] 
2018-06-23 18:10:18,072 ERROR [f6e8558b-693f-48b3-8c82-c6eac9904a96:2174538] [instance:74173] [instance.start->(InstanceStart)] [] [cutorService-27] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance hasn't gotten passed creating state] for instance [74173] 
2018-06-23 18:10:33,907 ERROR [:] [] [] [] [cutorService-10] [o.a.c.m.context.NoExceptionRunnable ] Uncaught exception org.jooq.exception.DataChangedException: Database record has been changed
	at org.jooq.impl.UpdatableRecordImpl.checkIfChanged(UpdatableRecordImpl.java:550)
	at org.jooq.impl.UpdatableRecordImpl.storeUpdate0(UpdatableRecordImpl.java:291)
	at org.jooq.impl.UpdatableRecordImpl.access$200(UpdatableRecordImpl.java:90)
	at org.jooq.impl.UpdatableRecordImpl$3.operate(UpdatableRecordImpl.java:260)
	at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123)
	at org.jooq.impl.UpdatableRecordImpl.storeUpdate(UpdatableRecordImpl.java:255)
	at org.jooq.impl.UpdatableRecordImpl.update(UpdatableRecordImpl.java:149)
	at io.cattle.platform.object.impl.JooqObjectManager.persistRecord(JooqObjectManager.java:228)
	at io.cattle.platform.object.impl.JooqObjectManager.setFieldsInternal(JooqObjectManager.java:133)
	at io.cattle.platform.object.impl.JooqObjectManager$3.execute(JooqObjectManager.java:121)
	at io.cattle.platform.engine.idempotent.Idempotent.change(Idempotent.java:88)
	at io.cattle.platform.object.impl.JooqObjectManager.setFields(JooqObjectManager.java:118)
	at io.cattle.platform.object.impl.JooqObjectManager.setFields(JooqObjectManager.java:113)
	at io.cattle.platform.object.impl.AbstractObjectManager.setFields(AbstractObjectManager.java:139)
	at io.cattle.platform.process.progress.ProcessProgressContext.update(ProcessProgressContext.java:73)
	at io.cattle.platform.process.progress.ProcessProgressContext.messsage(ProcessProgressContext.java:63)
	at io.cattle.platform.process.common.handler.AgentBasedProcessLogic$1.progress(AgentBasedProcessLogic.java:167)
	at io.cattle.platform.agent.impl.EventCallProgressHelper$1.progress(EventCallProgressHelper.java:27)
	at io.cattle.platform.agent.impl.EventCallProgressHelper$1.progress(EventCallProgressHelper.java:27)
	at io.cattle.platform.eventing.impl.FutureEventListener.onEvent(FutureEventListener.java:48)
	at io.cattle.platform.eventing.impl.AbstractThreadPoolingEventService$2.doRun(AbstractThreadPoolingEventService.java:141)
	at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15)
	at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108)
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
	at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

2018-06-23 18:12:22,126 ERROR [:] [] [] [] [ecutorService-2] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [instance.start:2151964] on [73587] io.cattle.platform.util.exception.ExecutionException: Dependencies readiness error
        at io.cattle.platform.process.instance.InstanceStart.waitForDeploymentUnitCreate(InstanceStart.java:233)
        at io.cattle.platform.process.instance.InstanceStart.handle(InstanceStart.java:122)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:470)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:421)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:415)
        at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:415)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:517)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:345)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:245)
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7)
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40)
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:242)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:184)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:157)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:116)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:113)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40)
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37)
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:113)
        at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl.processExecuteWithLock(ProcessInstanceDispatcherImpl.java:136)
        at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1.doWithLock(ProcessInstanceDispatcherImpl.java:123)
        at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1.doWithLock(ProcessInstanceDispatcherImpl.java:120)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$4.doWithLock(AbstractLockManagerImpl.java:50)
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:25)
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:47)
        at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl.execute(ProcessInstanceDispatcherImpl.java:120)
        at io.cattle.platform.engine.server.ProcessInstanceReference.doRun(ProcessInstanceReference.java:59)
        at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
time="2018-06-23T18:14:26Z" level=info msg="Rollback Stack Event Received" eventId=85f9df2d-2da4-4ea4-b9c1-0d15ee5880e6 resourceId=1st627
time="2018-06-23T18:14:29Z" level=info msg="Timeout waiting for network-manager to finish"
time="2018-06-23T18:14:29Z" level=info msg="Timeout processing stack:1st627"

 Failed [Dependencies readiness error instance hasn't gotten passed creating state] for instance [73059]
2018-06-23 19:46:42,928 ERROR [:] [] [] [] [cutorService-60] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [mount.create:2197811] on [40489], canceled by [State [deactivating] is not valid for process [volume.activate:2197816] on resource [25867]]
2018-06-23 19:46:42,928 ERROR [:] [] [] [] [cutorService-52] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [mount.create:2197810] on [40488], canceled by [State [deactivating] is not valid for process [volume.activate:2197815] on resource [25880]]
2018-06-23 19:46:42,977 ERROR [:] [] [] [] [cutorService-63] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [mount.create:2197813] on [40491], canceled by [State [deactivating] is not valid for process [volume.activate:2197817] on resource [25882]]
2018-06-23 19:47:43,541 ERROR [:] [] [] [] [cutorService-52] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [mount.create:2197845] on [40491], canceled by [State [deactivating] is not valid for process [volume.activate:2197848] on resource [25882]]

While the logs seem to indicate issues, it's not very clear how to proceed at this point. Please advise. Also, if possible, can you have someone reach out to me at jweber@edge2web.com to discuss support options? I haven't received anything after filling out my request online.

Also, if there is any chance for someone to diagnose these issues in real time, I am sure we can arrange it on our end.

Thanks

J

This morning I ran the rancher/cleanup-1-1 container and it seems to have reduced the number of delayed processes, which have shrunk from 3K to just about 100. However, we still have problems activating IPSec in our dev environment and also have 3 environments stuck in deactivating.

The rancher server log is filled with errors such as:

2018-06-26 11:35:49,785 ERROR [c07639a0-4d1e-4b62-a220-baaf5d067609:1891305] [volumeStoragePoolMap:53482] [volumestoragepoolmap.remove] [] [cutorService-11] [.p.p.c.g.GenericResourceProcessState] Resource is null, can't find resource id [53482]
2018-06-26 11:35:49,797 ERROR [dcc3af82-0de8-4e6d-a1e1-059dd0b12ed0:1891307] [volumeStoragePoolMap:53509] [volumestoragepoolmap.remove] [] [cutorService-23] [.p.p.c.g.GenericResourceProcessState] Resource is null, can't find resource id [53509]

and

2018-06-26 11:42:07,047 ERROR [48a683d9-ce70-40fd-9db3-8b71508a2281:2054438] [instance:69279] [instance.start->(InstanceStart)] [] [cutorService-12] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [69279]
2018-06-26 11:42:09,047 ERROR [44539bfa-3920-44ec-81d1-5e3e2b92c2e1:2054460] [instance:69282] [instance.start->(InstanceStart)] [] [cutorService-24] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [69282]
2018-06-26 11:42:14,058 ERROR [e9dcaac1-68f2-4faf-a805-e99c8795a03b:2054698] [instance:69295] [instance.start->(InstanceStart)] [] [ecutorService-5] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [69295]
2018-06-26 11:42:20,054 ERROR [16b5c811-2ce7-40a4-a0e7-5f1150bfa932:2054915] [instance:69307] [instance.start->(InstanceStart)] [] [cutorService-25] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [69307]

Please advise; we would really like to get rid of these stuck environments before creating another environment to see if the Rancher infrastructure stacks come up cleanly.

Thanks

J

Seeing so many issues in the logs, and given what looks like a beefy host running rancher/server (I assume it's not under heavy load CPU/memory wise), all signs point to the database. The database type you have picked is not a particularly performant one. Since when have you been seeing issues? If this started happening as you were adding environments/hosts/stacks/services/containers, that would confirm the DB getting more and more overloaded.

A database bottleneck seems unlikely to us. The issues started during a normal upgrade of our user stacks in our dev environment.

Before I ran the cleanup tool, the database dump was only 800 MB. That said, what would you recommend?

Failed to get ping from agent

This indicates a failed network connection between rancher/server and rancher/agent. As you are running on AWS I assume this should work fine, unless either of those gets hammered with network traffic from your apps/workloads.
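
A quick way to verify that from one of the hosts whose agent is failing pings (the server URL is a placeholder; the agent container is normally just named rancher-agent):

# from the affected host: can we reach rancher/server at all?
curl -sk https://rancher.example.com/ping    # should return "pong"

# what does the agent itself report?
docker logs --tail 100 rancher-agent 2>&1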

Secondly, if the DB and rancher/server are all fine, what is the status of the processes under Admin -> Processes? If there are environments continuously flapping, the processes will start to build up, and this will slow things down as well. But for that to go away, you'll need to manually clean up all not-active hosts/stacks/services in the environments.
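
If the UI is too slow for that cleanup, the same can be done through the API, roughly like this (a sketch against the v2-beta API; the server URL, environment/project id and API keys are placeholders you'd take from your own setup):

RANCHER_URL=https://rancher.example.com
PROJECT=1a5                      # environment (project) id from the UI/API
AUTH="$ACCESS_KEY:$SECRET_KEY"   # an environment or account API key pair

# list hosts with their current state
curl -s -u "$AUTH" "$RANCHER_URL/v2-beta/projects/$PROJECT/hosts" \
  | jq -r '.data[] | "\(.id) \(.hostname) \(.state)"'

# deactivate and then remove a host stuck in a non-active state
curl -s -u "$AUTH" -X POST "$RANCHER_URL/v2-beta/projects/$PROJECT/hosts/<host-id>/?action=deactivate"
curl -s -u "$AUTH" -X POST "$RANCHER_URL/v2-beta/projects/$PROJECT/hosts/<host-id>/?action=remove"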

If you ping me on Rancher Users Slack and give me access I can take a look.

We started and finished migrating to a new Rancher server last night, so the bad operating environment and DB are no longer in use. We would still like to get to the underlying root cause here, both for understanding and so we can be better prepared, but the chances of doing so are probably pretty diminished at this point.

A couple of closing questions:

Do you have a recommendation on the DB instance type?
Is a large number of continually recurring delayed processes indicative of a problem? It seems so, and if so, how do we recover, specifically with regard to the large number of volumestoragepoolmap.remove delayed processes?
Would multiple Rancher servers set up for HA help alleviate some of these issues, or exacerbate them due to the overhead? (Rough sketch of what I mean below.)
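
For the HA question, my understanding from the 1.6 docs is that each server node would point at the same external DB and advertise its own address, roughly like this (just a sketch; the DB host, password and node IP are placeholders, so corrections welcome):

docker run -d --restart=unless-stopped \
  -p 8080:8080 -p 9345:9345 \
  rancher/server:v1.6.18 \
  --db-host rancher-db.example.com --db-port 3306 \
  --db-user cattle --db-pass <password> --db-name cattle \
  --advertise-address <this-node-ip>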

Also:

While your support has been great, debugging this and migrating to a new server was non-trivial, and although I've repeatedly requested to be contacted about paid support options, there has unfortunately been no response. That's been disheartening.

Thanks

J