ISP shutting down my Rancher servers due to high disk I/O

Hi,

I’m running three Rancher Server deployments, each with a handful of environments: two for clients, and one for my own use. To start with, I was hosting the servers on Amazon AWS, with a few environments hosted on in-house equipment.

I found that the Rancher Server and its database would run into swap on a 2GB (t2.small) instance, and after a while the 4GB (t2.medium) instances were getting a bit expensive. They were also constantly knocking up against the CPU credit limit, and container upgrades started to take forever.

I moved my own Rancher Server (1.2.0-rc1-pre2) to another, cheaper ISP - a 4GB instance. It ran happily there for some weeks, so I moved the other Rancher Server (1.1.2) over too. Things have been better since, but yesterday the ISP shut down the Rancher servers, and one or two of the node hosts too. This is what they do when a VM starts to use more than 10x the ‘average’ VM’s disk I/O and is considered to be hogging the shared resources.

On firing these servers back up, ‘docker logs --tail=100 -f rancher-server’ spews a huge amount of unnecessary information, and I’m guessing this is the cause of all the I/O. It seems to repeat itself rapidly. A sample:

2016-09-09 09:37:18,612 INFO  [:] [] [] [] [qtp331954843-13] [i.c.p.c.s.impl.ConfigItemServerImpl ] Setting item [hosts] to version [24522-257aff5cc23042701e4b6e49146ac4f33de95ee93d14f3484f87f4e23b378943] for [agent:30] 
2016-09-09 09:37:18,801 INFO  [:] [] [] [] [qtp331954843-16] [i.c.p.c.s.impl.ConfigItemServerImpl ] Processing [iptables] for client [agent:30] 
2016-09-09 09:37:19,004 INFO  [:] [] [] [] [qtp331954843-15] [i.c.p.c.s.impl.ConfigItemServerImpl ] Setting item [iptables] to version [8497-7931ce6fa4d4deb51a429cdea62546151d044b616a3a35a5b4011ac102187cd2] for [agent:30] 
2016-09-09 09:37:19,148 INFO  [:] [] [] [] [qtp331954843-14] [i.c.p.c.s.impl.ConfigItemServerImpl ] Processing [healthcheck] for client [agent:30] 
2016-09-09 09:37:19,309 INFO  [:] [] [] [] [tp331954843-110] [i.c.p.c.s.impl.ConfigItemServerImpl ] Setting item [healthcheck] to version [34063-65d5d9de1a6d13e42cd660a6a178baf58129b2dec06f43aa69e0603c72ac427e] for [agent:30] 
2016-09-09 09:37:19,432 INFO  [:] [] [] [] [qtp331954843-88] [i.c.p.c.s.impl.ConfigItemServerImpl ] Client [agent:30] requested unknown item [rancher-metadata] 
2016-09-09 09:37:19,918 INFO  [:] [] [] [] [cutorService-36] [i.c.p.e.a.MethodInvokingListener    ] Timeout: container starting [container:1i432067] for event [EventVO [id=d5506766-0ed7-4a55-98ee-f170dcec895c, name=service.update, previousNames=null, replyTo=reply.4430228278425614835, resourceId=172, resourceType=service, publisher=null, transitioning=null, transitioningMessage=null, transitioningInternalMessage=null, previousIds=null, data={configUrl=https://ranch.golder.org/v1, items=[{name=reconcile, requestedVersion=270, apply=true, increment=true, checkInSyncOnly=false}]}, time=Fri Sep 09 09:36:54 UTC 2016, listenerKey=null, transitioningProgress=null]] 
2016-09-09 09:37:20,094 ERROR [:] [] [] [] [cutorService-10] [i.c.p.agent.impl.AgentLocatorImpl   ] Failed to find instance to delegate to for agent [6148] uri [delegate:///?vnetId=26&networkServiceProviderId=3] 
2016-09-09 09:37:20,096 ERROR [:] [] [] [] [cutorService-10] [o.a.c.m.context.NoExceptionRunnable ] Uncaught exception com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: Delegate [delegate:///?vnetId=26&networkServiceProviderId=3] has no instance associated
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2199) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache.get(LocalCache.java:3932) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4812) ~[guava-15.0.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl.lookupAgent(AgentLocatorImpl.java:85) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.call(ConfigUpdatePublisher.java:321) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.publishUpdate(ConfigUpdatePublisher.java:229) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.publish(ConfigUpdatePublisher.java:200) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.loopBody(ConfigUpdatePublisher.java:126) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.doRun(ConfigUpdatePublisher.java:99) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15) ~[cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_111]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_111]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_111]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_111]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
Caused by: java.lang.IllegalStateException: Delegate [delegate:///?vnetId=26&networkServiceProviderId=3] has no instance associated
	at io.cattle.platform.agent.impl.AgentLocatorImpl.getAgentForDelegate(AgentLocatorImpl.java:132) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl.buildDelegate(AgentLocatorImpl.java:104) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl$1.load(AgentLocatorImpl.java:52) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl$1.load(AgentLocatorImpl.java:47) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) ~[guava-15.0.jar:na]
	... 23 common frames omitted

2016-09-09 09:37:21,140 INFO  [:] [] [] [] [cutorService-46] [i.c.p.e.a.MethodInvokingListener    ] Timeout: container starting [container:1i432066] for event [EventVO [id=403665c3-8eb4-49c8-aff1-19493b2a6029, name=service.update, previousNames=null, replyTo=reply.243803981802626500, resourceId=108, resourceType=service, publisher=null, transitioning=null, transitioningMessage=null, transitioningInternalMessage=null, previousIds=null, data={configUrl=https://ranch.golder.org/v1, items=[{name=reconcile, requestedVersion=8601, apply=true, increment=true, checkInSyncOnly=false}]}, time=Fri Sep 09 09:36:54 UTC 2016, listenerKey=null, transitioningProgress=null]] 
2016-09-09 09:37:22,124 ERROR [:] [] [] [] [ecutorService-2] [i.c.p.agent.impl.AgentLocatorImpl   ] Failed to find instance to delegate to for agent [19] uri [delegate:///?vnetId=6&networkServiceProviderId=1] 
2016-09-09 09:37:22,129 ERROR [:] [] [] [] [ecutorService-2] [o.a.c.m.context.NoExceptionRunnable ] Uncaught exception com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: Delegate [delegate:///?vnetId=6&networkServiceProviderId=1] has no instance associated
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2199) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache.get(LocalCache.java:3932) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4812) ~[guava-15.0.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl.lookupAgent(AgentLocatorImpl.java:85) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.call(ConfigUpdatePublisher.java:321) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.publishUpdate(ConfigUpdatePublisher.java:229) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.publish(ConfigUpdatePublisher.java:200) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.loopBody(ConfigUpdatePublisher.java:126) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.configitem.version.impl.ConfigUpdatePublisher.doRun(ConfigUpdatePublisher.java:99) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15) ~[cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_111]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_111]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_111]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_111]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
Caused by: java.lang.IllegalStateException: Delegate [delegate:///?vnetId=6&networkServiceProviderId=1] has no instance associated
	at io.cattle.platform.agent.impl.AgentLocatorImpl.getAgentForDelegate(AgentLocatorImpl.java:132) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl.buildDelegate(AgentLocatorImpl.java:104) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl$1.load(AgentLocatorImpl.java:52) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.impl.AgentLocatorImpl$1.load(AgentLocatorImpl.java:47) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) ~[guava-15.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) ~[guava-15.0.jar:na]
	... 23 common frames omitted

2016-09-09 09:37:22,193 INFO  [:] [] [] [] [ecutorService-4] [i.c.p.a.server.ping.impl.PingStatus ] Recieved ping for [23514] 
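As a stopgap while I chase the root cause, I'm considering capping Docker's json-file logging so the log churn at least can't turn into unbounded disk writes. A minimal sketch for /etc/docker/daemon.json - the size and file-count values here are arbitrary examples, and as far as I know it only applies to containers created after the daemon picks up the change:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

The same effect should be possible per-container with `--log-opt max-size=10m --log-opt max-file=3` when recreating the rancher-server container, without touching the daemon defaults.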

The I/O columns of a ‘vmstat 2 10’ with rancher-server stopped versus started show a massive difference.

root@rancher01:~# vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 1003772 217264 1663868    0    0    29    69  308  504  3  1 95  1  0
 0  0      0 1003724 217264 1663868    0    0     0     0  110  146  0  0 99  0  0
 0  0      0 1003692 217264 1663872    0    0     0     0  212  325  1  0 99  0  0
 0  0      0 1003660 217264 1663880    0    0     0     6  286  407  2  0 98  1  0
 0  0      0 1003628 217264 1663880    0    0     0    62  207  258  1  0 99  0  0
 0  0      0 1003596 217264 1663880    0    0     0     0  151  234  0  0 100  0  0
 0  0      0 999408 217264 1663892    0    0     0    18  292  394  4  1 95  0  0
 0  1      0 999376 217264 1663892    0    0     0   120  190  255  1  0 96  3  0
 0  0      0 998808 217264 1663916    0    0     0     0  324  414  1  1 98  0  0
 0  0      0 998808 217264 1663916    0    0     0     0  240  364  0  0 99  0  0
root@rancher01:~# 
root@rancher01:~# docker start rancher-server
rancher-server
root@rancher01:~# vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 1094588 217264 1549296    0    0    29    69  307  504  3  1 95  1  0
 4  0      0 1054000 217264 1547528    0    0     0  1324 2175 2657 66  8 26  0  0
 3  0      0 1036504 217264 1549100    0    0     0  6832 1493 2039 79  7 14  0  0
 3  0      0 1003012 217264 1551052    0    0     0    44 1616 3628 83  8  9  0  0
 2  0      0 966876 217264 1550440    0    0     0     0 1449 1930 73  8 18  0  0
 3  0      0 963688 217264 1550000    0    0     0     6 1810 4540 81  7 13  0  0
 2  0      0 938012 217264 1554148    0    0     0   150 2165 3188 66 11 23  0  0
 3  0      0 931572 217264 1556896    0    0     0    50 2193 4034 69 17 14  0  0
 1  1      0 889068 217264 1563480    0    0     0   718 1840 1717 82  8 10  0  0
 2  0      0 861024 217268 1558480    0    0     0 18806 1516 1900 89  8  3  0  0
root@rancher01:~# vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 818388 217268 1558704    0    0    29    70  308  504  3  1 95  1  0
 3  0      0 850736 217272 1559044    0    0     0   168 1994 3462 65  9 25  1  0
 3  0      0 839400 217276 1559612    0    0     0   140 1968 3793 77 12 11  0  0
 3  0      0 827856 217276 1559940    0    0     0   364 1579 2765 68 10 23  0  0
 1  0      0 819400 217288 1560580    0    0     0   228 1577 2689 65 10 25  0  0
 1  0      0 828240 217288 1560948    0    0     0    98 1971 3350 56 10 33  0  0
 2  1      0 819516 217288 1561368    0    0     0   388 1805 3272 66 11 22  0  0
 2  0      0 737456 217288 1562112    0    0     0  6850 1511 2154 81  8 10  0  0
 3  0      0 732900 217288 1562676    0    0     0   300 1399 2095 81  6 13  0  0
 3  0      0 685496 217288 1563524    0    0     0  7384 1696 2889 78  9 13  0  0
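To confirm it really is the rancher-server process doing the writing (rather than the database or something else on the host), I've been checking the kernel's per-process I/O counters. A rough sketch - the `docker inspect` line in the comment assumes the container is named rancher-server, and the demo below just reads the current shell's own counters:

```shell
# Cumulative bytes written by a process, from Linux per-process I/O accounting.
# For the container's main process, use something like:
#   pid=$(docker inspect --format '{{.State.Pid}}' rancher-server)
pid=$$   # demo: this shell itself
awk '/^write_bytes/ {print $2}' "/proc/$pid/io"
```

Sampling that value twice a few seconds apart gives a write rate attributable to that one process, which should line up with (or rule out) the vmstat `bo` numbers above.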

Other than that, when it is running, it runs fine. Can anyone help me identify the cause of the excessive logging, or even just help me hush it to prevent the I/O it’s causing? Otherwise, my ISP will just keep shutting me down :frowning:

Thanks,


Ross