Rancher DB connection errors when accessing deployed service via the web

Hey, I’m not sure if this is covered anywhere, but I didn’t see it in the first few pages of posts.

So I have my Rancher master set up with a single host provisioned. On the host I have deployed a Redis container, a memcache container, and a PHP application container with service links to those two.

Lastly, I have a Rancher load balancer container in front of the PHP application to assist with rolling service upgrades.

I have noticed that every time I visit the web URL of my PHP application, the Rancher master server is spammed with DB connection errors. The logs suggest it is failing to connect to the same DB instance I am using for the Rancher DB itself, which seems strange if it is getting the connections from the same pool.

To be clear, this is in the logs on the Rancher master, not on the container host that I have provisioned. This could be a huge limiting factor in using Rancher if, under load, it is constantly communicating with the master and incurring database queries.

I hope I have just set something up poorly.

Here is an example from the log files on the Rancher master; this happens on every single page view of my deployed PHP application:

2015-08-26 17:12:55,001 ERROR [:] [] [] [] [cutorService-18] [o.a.c.m.context.NoExceptionRunnable ] Uncaught exception org.jooq.exception.DataAccessException: SQL [select process_instance.id, process_instance.start_time, process_instance.end_time, process_instance.data, process_instance.priority, process_instance.process_name, process_instance.resource_type, process_instance.resource_id, process_instance.result, process_instance.exit_reason, process_instance.phase, process_instance.start_process_server_id, process_instance.running_process_server_id from process_instance where process_instance.id = ?]; Could not connect to ################.us-east-1.rds.amazonaws.com:3306 : Connection timed out
at org.jooq.impl.Utils.translate(Utils.java:1287) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.DefaultExecuteContext.sqlException(DefaultExecuteContext.java:495) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:326) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractResultQuery.fetchLazy(AbstractResultQuery.java:351) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractResultQuery.fetchLazy(AbstractResultQuery.java:341) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractResultQuery.fetchOne(AbstractResultQuery.java:470) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.SelectImpl.fetchOne(SelectImpl.java:2372) ~[jooq-3.3.0.jar:na]
at io.cattle.platform.engine.manager.impl.jooq.JooqProcessRecordDao.getRecord(JooqProcessRecordDao.java:91) ~[cattle-iaas-engine-jooq-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.shouldWaitLonger(ProcessEventListenerImpl.java:98) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:56) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at sun.reflect.GeneratedMethodAccessor770.invoke(Unknown Source) ~[na:na]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_79]
at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_79]
at io.cattle.platform.eventing.annotation.MethodInvokingListener$1.doWithLockNoResult(MethodInvokingListener.java:76) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.eventing.annotation.MethodInvokingListener.onEvent(MethodInvokingListener.java:72) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.eventing.impl.AbstractThreadPoolingEventService$2.doRun(AbstractThreadPoolingEventService.java:135) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:13) ~[cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
Caused by: java.sql.SQLNonTransientConnectionException: Could not connect to ##############.us-east-1.rds.amazonaws.com:3306 : Connection timed out
at org.mariadb.jdbc.internal.SQLExceptionMapper.get(SQLExceptionMapper.java:136) ~[mariadb-java-client-1.1.7.jar:na]
at org.mariadb.jdbc.internal.SQLExceptionMapper.throwException(SQLExceptionMapper.java:106) ~[mariadb-java-client-1.1.7.jar:na]
at org.mariadb.jdbc.Driver.connect(Driver.java:114) ~[mariadb-java-client-1.1.7.jar:na]
at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38) ~[commons-dbcp-1.4.jar:1.4]
at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582) ~[commons-dbcp-1.4.jar:1.4]
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1188) ~[commons-pool-1.6.jar:1.6]
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106) ~[commons-dbcp-1.4.jar:1.4]
at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044) ~[commons-dbcp-1.4.jar:1.4]
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111) ~[spring-jdbc-3.2.4.RELEASE.jar:3.2.4.RELEASE]
at org.springframework.jdbc.datasource.TransactionAwareDataSourceProxy$TransactionAwareInvocationHandler.invoke(TransactionAwareDataSourceProxy.java:224) ~[spring-jdbc-3.2.4.RELEASE.jar:3.2.4.RELEASE]
at com.sun.proxy.$Proxy55.prepareStatement(Unknown Source) ~[na:na]
at org.jooq.impl.ProviderEnabledConnection.prepareStatement(ProviderEnabledConnection.java:112) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.SettingsEnabledConnection.prepareStatement(SettingsEnabledConnection.java:76) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractResultQuery.prepare(AbstractResultQuery.java:230) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:296) ~[jooq-3.3.0.jar:na]
... 28 common frames omitted
Caused by: org.mariadb.jdbc.internal.common.QueryException: Could not connect to ##############.us-east-1.rds.amazonaws.com:3306 : Connection timed out
at org.mariadb.jdbc.internal.mysql.MySQLProtocol.connect(MySQLProtocol.java:677) ~[mariadb-java-client-1.1.7.jar:na]
at org.mariadb.jdbc.internal.mysql.MySQLProtocol.&lt;init&gt;(MySQLProtocol.java:266) ~[mariadb-java-client-1.1.7.jar:na]
at org.mariadb.jdbc.Driver.connect(Driver.java:110) ~[mariadb-java-client-1.1.7.jar:na]
... 40 common frames omitted
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.7.0_79]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) ~[na:1.7.0_79]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) ~[na:1.7.0_79]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) ~[na:1.7.0_79]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.7.0_79]
at java.net.Socket.connect(Socket.java:579) ~[na:1.7.0_79]
at java.net.Socket.connect(Socket.java:528) ~[na:1.7.0_79]
at org.mariadb.jdbc.internal.mysql.MySQLProtocol.connect(MySQLProtocol.java:372) ~[mariadb-java-client-1.1.7.jar:na]
at org.mariadb.jdbc.internal.mysql.MySQLProtocol.connect(MySQLProtocol.java:673) ~[mariadb-java-client-1.1.7.jar:na]
... 42 common frames omitted

There is definitely no direct correlation between making a request to a container and causing a request from the server to the database (or at least there isn’t supposed to be).

My guess is that either they are unrelated and the timing is coincidental (that error is repeating all the time anyway), or your request is causing the web container to stop/crash/restart, which then gets noticed by the rancher/agent on the host and reported up to rancher/server, which would cause queries.

But how is anything working in rancher/server if it can’t connect to the database…

I would have expected that to be the case, which is why this is so concerning.

Rancher server works great otherwise; I have been using the Rancher UI to set up my stack, provision instances, open terminals, view logs, etc.

From watching the logs, there seems to be a direct correlation between a web request on the slave and the errors in the logs on the master, though I can’t prove that empirically.
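
For reference, this is roughly how I’ve been watching for the correlation (a rough sketch; the server container name and app URL are placeholders for our actual values):

# Tail the rancher/server logs in one terminal, filtered to the DB errors
# ("rancher-server" is a placeholder for whatever your server container is named):
docker logs -f rancher-server 2>&1 | grep -i DataAccessException

# ...and hit the app in a loop from another terminal; each request lines up
# with a fresh batch of errors on the master:
while true; do curl -s -o /dev/null -w '%{http_code}\n' http://my-app-url/; sleep 1; done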

My current OpsWorks-based production infrastructure serves hundreds of thousands of page views per day and has over 5k constant concurrent users, so this would be a particular blocker to using Rancher as our main infrastructure management engine.

Quick update: after about half an hour of this behaviour, the logs on the Rancher master show the following:

time="2015-08-26T17:43:27Z" level=info msg="Exiting go-machine-service..."
time="2015-08-26T17:43:28Z" level=info msg="Setting log level" logLevel=info
time="2015-08-26T17:43:28Z" level=info msg="Starting go-machine-service..." gitcommit=102d311
time="2015-08-26T17:43:28Z" level=info msg="Initializing event router" workerCount=10
time="2015-08-26T17:43:28Z" level=info msg="Connection established"
time="2015-08-26T17:44:17Z" level=info msg="Registering backend for host [4f3a9946-7989-47eb-8161-773f9b831b88]"

After this, page loads don’t seem to incur these errors. I believe this started after kicking off a service upgrade from our Jenkins server, which seemed to fail. I am going to try it again to see if it reproduces.

Confirmed. When I run a service upgrade, the upgrade fails with a bunch of the same DB connect timeout errors. From that point until Rancher restarts the go service, every page request to the slave container incurs DB errors on the master.

rancher-compose -f /srv/######/docker/deploy-dev/config/docker-compose.yml -p dev-######-main upgrade dev-#######-ui dev-#######-ui-temp --wait

My YAML file:

dev-######-ui:
  #noop
dev-######-ui-temp:
  restart: always
  tty: true
  image: {DOCKER_IMAGE}
  links:
    - dev-redis:redis
    - dev-memcache:memcache
  stdin_open: true

{DOCKER_IMAGE} is replaced at build time with the image from our private registry.
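
For reference, the substitution is nothing fancy; the build step does something along these lines (a sketch with placeholder names, not our exact Jenkins script):

# Replace the {DOCKER_IMAGE} token with the image tag pushed to our private
# registry ("docker.example.com/myorg/myapp-ui" is a placeholder):
DOCKER_IMAGE="docker.example.com/myorg/myapp-ui:${GIT_COMMIT}"
sed "s|{DOCKER_IMAGE}|${DOCKER_IMAGE}|g" docker-compose.yml.tpl > docker-compose.yml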

The logs from the service upgrade:

time="2015-08-26T18:08:47Z" level=info msg="Creating service dev-#####-ui-temp"
time="2015-08-26T18:08:47Z" level=info msg="Upgrading dev-#####-ui to dev-#####-ui-temp, scale=1"
time="2015-08-26T18:10:55Z" level=fatal msg="Bad response from [http://############:8080/v1/services/1s16], statusCode [500]. Status [500 Internal Server Error]. Body: [{"id":"5046abb1-a658-41b8-84df-856c56ee8f74","type":"error","links":{},"actions":{},"status":500,"code":"Internal Server Error","message":"Internal Server Error","detail":null}]"

I’m still sticking with: there’s no way these are actually causally related unless your container is dying :smile:, and wondering how anything works at all if the server can’t reach the database consistently. The fact that the errors stopped when go-machine-service restarted and connected successfully sort of affirms this.

Is there anything interesting in the logs of the rancher/agent container on the host?
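
Something like this should dump them, assuming the agent is running from the stock rancher/agent image (adjust the container ID to whatever docker ps shows):

# On the host, find the agent container and tail its logs:
docker ps --filter ancestor=rancher/agent
docker logs --tail 200 <agent-container-id>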

There appear to be several errors; I’m not sure what each one means. Also, further to your comment, it works perfectly right up until I run a service upgrade, which (apparently) fails catastrophically and puts the cluster into an unstable state until the go service gets restarted.

8/26/2015 1:44:09 PM INFO: https://github.com/rancher/cadvisor-package/releases/download/v0.3.0/cadvisor.tar.gz already downloaded
8/26/2015 1:44:09 PM INFO: Updating volmgr
8/26/2015 1:44:09 PM INFO: https://github.com/rancher/rancher-volume/releases/download/v0.1.5/volmgr.tar.gz already downloaded
8/26/2015 1:44:09 PM INFO: Sending pyagent applied 1-9e3dcab12585149a65d818082279708965cd826df1937a55b24f48428f8f0c068f62c3b4a3262b4a623032732e29d801
8/26/2015 1:44:09 PM INFO: Starting agent /var/lib/cattle/pyagent/apply.sh
8/26/2015 1:44:09 PM INFO: Executing /var/lib/cattle/pyagent/main.py
8/26/2015 1:44:09 PM INFO: Updating host-config
8/26/2015 1:44:09 PM INFO: Downloading http://#######:8080/v1//configcontent//host-config current=host-config-1-4cba2e995ecc671cfd5475a2c52fd6bd65449a68287b94ae337dbee1bf23f874
8/26/2015 1:44:09 PM INFO: Already up to date
8/26/2015 1:44:09 PM INFO: Sending host-config applied 1-4cba2e995ecc671cfd5475a2c52fd6bd65449a68287b94ae337dbee1bf23f874
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Starting event router."
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Watching state directory: /var/lib/rancher/state/containers"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"f87dc4b4aeaea981734400ace5afdbfd90531ce1e7792ab522687b22f9e357bc", From:"-simulated-", Time:0}"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"27b4a41f22863f4f6e92f119aaa92cd3b9ca7bc4ef2b5a6ae2af5b6bd8bebb38", From:"-simulated-", Time:0}"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"b6170404c213d6f83c2584a540b1281c047055e64484232da6cca4289a812963", From:"-simulated-", Time:0}"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"b0ae6db3bef2e7e8d86c7cd13bfdb66b61430661d0a5dab6d4e8fe03a985d56b", From:"-simulated-", Time:0}"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"4d363c2f963982b43db76ffc3093c246f21af8bca7e2c7d222286e67c9aeb1fc", From:"-simulated-", Time:0}"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"0e376b36fa5f403e61d065e2613dacec6d1f775ab800a1346742da64f24cfb8a", From:"-simulated-", Time:0}"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Assigning IP [10.42.78.27/16], ContainerId [f87dc4b4aeaea981734400ace5afdbfd90531ce1e7792ab522687b22f9e357bc], Pid [21853]"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Assigning IP [10.42.220.222/16], ContainerId [4d363c2f963982b43db76ffc3093c246f21af8bca7e2c7d222286e67c9aeb1fc], Pid [4788]"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Assigning IP [10.42.49.190/16], ContainerId [b0ae6db3bef2e7e8d86c7cd13bfdb66b61430661d0a5dab6d4e8fe03a985d56b], Pid [6533]"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Assigning IP [10.42.106.176/16], ContainerId [b6170404c213d6f83c2584a540b1281c047055e64484232da6cca4289a812963], Pid [7888]"
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Assigning IP [10.42.48.206/16], ContainerId [27b4a41f22863f4f6e92f119aaa92cd3b9ca7bc4ef2b5a6ae2af5b6bd8bebb38], Pid [16154]"
8/26/2015 1:44:10 PM I0826 17:44:10.480127 18860 storagedriver.go:92] No backend storage selected
8/26/2015 1:44:10 PM I0826 17:44:10.480178 18860 storagedriver.go:94] Caching 60 stats in memory
8/26/2015 1:44:10 PM I0826 17:44:10.480240 18860 manager.go:106] cAdvisor running in container: "/docker/0e376b36fa5f403e61d065e2613dacec6d1f775ab800a1346742da64f24cfb8a"
8/26/2015 1:44:10 PM I0826 17:44:10.480730 18860 fs.go:87] Filesystem partitions: map[/dev/disk/by-uuid/67825988-18df-4b3b-9610-9fe4e0d71e78:{mountpoint:/ major:202 minor:1}]
8/26/2015 1:44:10 PM I0826 17:44:10.481948 18860 manager.go:127] Machine: {NumCores:2 CpuFrequency:2900066 MemoryCapacity:3947413504 MachineID:f60565d29552f365e7380162556f6674 SystemUUID:EC2313A9-E40F-C60F-C788-4BD72CAE0A48 BootID:3ec94b70-d1da-4eb4-a3b3-1b39269ede18 Filesystems:[{Device:/dev/disk/by-uuid/67825988-18df-4b3b-9610-9fe4e0d71e78 Capacity:211233755136}] DiskMap:map[202:0:{Name:xvda Major:202 Minor:0 Size:214748364800 Scheduler:deadline}] NetworkDevices:[{Name:dummy0 MacAddress:7a:b7:18:72:5b:48 Speed:0 Mtu:1500} {Name:eth0 MacAddress:12:7a:e2:05:45:f1 Speed:10000 Mtu:9001}] Topology:[{Id:0 Memory:3947413504 Cores:[{Id:0 Threads:[0 1] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:26214400 Type:Unified Level:3}]}]}
8/26/2015 1:44:10 PM I0826 17:44:10.482739 18860 manager.go:134] Version: {KernelVersion:3.13.0-53-generic ContainerOsVersion:Ubuntu 14.04.2 LTS DockerVersion:1.8.1 CadvisorVersion:0.12.0}
8/26/2015 1:44:10 PM time="2015-08-26T17:44:10Z" level="info" msg="Connecting to proxy." url="ws://#######:8080/v1/connectbackend?token=eyJhbGciOiJSUzI1NiJ9.eyJleHAiOjE0NDA2MTEzNTcsInN1YiI6ImNhdHRsZSIsImlzcyI6Imh0dHA6XC9cL2NhdHRsZS5pbyIsInJlcG9ydGVkVXVpZCI6IjRmM2E5OTQ2LTc5ODktNDdlYi04MTYxLTc3M2Y5YjgzMWI4OCIsImlhdCI6MTQ0MDYxMTA1Nywia2lkIjoiZGVmYXVsdCJ9.D-v1gOMVmHEauhA1aCOTBs0hPr-2IRUnyzLStv7PE1dbbOXrGHGzdki0hQe2D9RQTWWADOdA9FVES1R1wjs3TWCR2s2OlHFfHmmwaw4ofSIYHRvNXpfP_L9YBC34QdlLN3b5D7OKm-S2a1uIVhbuFM9nFOZXEU0lpK4KCU7PyzE7Pd6XizaFaPcbq_sxxZZrk2uKdUmk2BkU1AfaGq8ze1X7WGpngBM-5ncCQuCD8ezI8oxo2mwHI45Scr1HZd1KpOGWC31wbWFh0BAqMD9qgiTnvtzWw-r0X8HZOLnyCduqcFbC-h4GQpLRsXxeT6iMDP5CsNqin99zvsDmkxbqVg"
8/26/2015 1:44:10 PM I0826 17:44:10.579642 18860 factory.go:226] Registering Docker factory
8/26/2015 1:44:10 PM I0826 17:44:10.580178 18860 factory.go:60] Registering Raw factory
8/26/2015 1:44:10 PM I0826 17:44:10.580888 18860 manager.go:872] Started watching for new ooms in manager
8/26/2015 1:44:10 PM I0826 17:44:10.594306 18860 manager.go:212] Starting recovery of all containers
8/26/2015 1:44:10 PM I0826 17:44:10.630799 18860 manager.go:217] Recovery completed
8/26/2015 1:44:10 PM I0826 17:44:10.637449 18860 cadvisor.go:90] Starting cAdvisor version: "0.12.0" on port 9344
8/26/2015 2:18:17 PM time="2015-08-26T18:18:17Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613097}"
8/26/2015 2:18:17 PM time="2015-08-26T18:18:17Z" level="info" msg="Assigning IP [10.42.10.231/16], ContainerId [de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996], Pid [13854]"
8/26/2015 2:18:17 PM time="2015-08-26T18:18:17Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"-simulated-", Time:0}"
8/26/2015 2:18:17 PM time="2015-08-26T18:18:17Z" level="info" msg="Container locked. Can't run StartHandler. ID: [de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996]"
8/26/2015 2:20:24 PM 2015-08-26 18:20:24,995 ERROR cattle [140132981029936] [utils.py:278] Failed to call (['/var/lib/cattle/config.sh', u'host-iptables', u'host-routes'],) {'stderr': -2, 'cwd': '/var/lib/cattle', 'env': {'AGENT_PARENT_PID': '30439', 'CATTLE_CONFIG_URL': 'http://#######:8080/v1', 'CATTLE_AGENT_PIDNS': 'host', 'CATTLE_URL': 'http://#######:8080/v1', 'SHLVL': '1', 'OLDPWD': '/tmp', 'HOSTNAME': 'dev-#######-docker-1', 'PWD': '/var/lib/cattle/pyagent', 'CATTLE_STORAGE_URL': 'http://#######:8080/v1', 'CATTLE_ACCESS_KEY': 'D4A5D8C55897F1EC569E', 'CATTLE_AGENT_LOG_FILE': '/var/log/rancher/agent.log', 'CATTLE_AGENT_IP': '172.30.100.121', 'CATTLE_STATE_DIR': '/var/lib/rancher/state', 'CATTLE_SECRET_KEY': '8m6b3DJ6sHgbMwS2sJDPNCAaavaWy8whnwk1KhJZ', 'CATTLE_SYSTEMD': 'false', 'RANCHER_AGENT_IMAGE': 'rancher/agent:v0.7.11', 'HOME': '/root', 'CATTLE_CADVISOR_WRAPPER': 'cadvisor.sh', 'CATTLE_PHYSICAL_HOST_UUID': '765f4d02-f745-4c1a-be0f-d6dc5f04c83f', 'CATTLE_HOME': '/var/lib/cattle', 'PATH': '/var/lib/cattle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'}}, exit [1], output :
8/26/2015 2:20:24 PM INFO: Updating host-iptables
8/26/2015 2:20:24 PM INFO: Removing host-iptables-343-d049e47d422e7c2556577ff35624f118e182ec70793d23712724afa2f3165683
8/26/2015 2:20:24 PM INFO: Downloading http://#######:8080/v1//configcontent//host-iptables current=host-iptables-354-d049e47d422e7c2556577ff35624f118e182ec70793d23712724afa2f3165683
8/26/2015 2:20:24 PM basename: missing operand
8/26/2015 2:20:24 PM Try 'basename --help' for more information.
8/26/2015 2:20:24 PM ERROR: Missing SHA1SUMS files, invalid download
8/26/2015 2:20:24 PM
8/26/2015 2:20:24 PM Traceback (most recent call last):
8/26/2015 2:20:24 PM   File "/var/lib/cattle/pyagent/cattle/utils.py", line 274, in get_command_output
8/26/2015 2:20:24 PM     return check_output(*args, **kw)
8/26/2015 2:20:24 PM   File "/var/lib/cattle/pyagent/cattle/utils.py", line 331, in check_output
8/26/2015 2:20:24 PM     raise e1
8/26/2015 2:20:24 PM CalledProcessError: Command '['/var/lib/cattle/config.sh', u'host-iptables', u'host-routes']' returned non-zero exit status 1
8/26/2015 2:22:36 PM time="2015-08-26T18:22:36Z" level="info" msg="Processing event: &docker.APIEvents{Status:"kill", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613356}"
8/26/2015 2:22:46 PM time="2015-08-26T18:22:46Z" level="info" msg="Processing event: &docker.APIEvents{Status:"kill", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613366}"
8/26/2015 2:22:46 PM time="2015-08-26T18:22:46Z" level="info" msg="Processing event: &docker.APIEvents{Status:"die", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613366}"
8/26/2015 2:22:46 PM time="2015-08-26T18:22:46Z" level="info" msg="Processing event: &docker.APIEvents{Status:"stop", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613366}"
8/26/2015 2:22:51 PM 2015-08-26 18:22:51,962 ERROR agent [140132981028016] [event.py:111] 4372f588-a6cb-4a5a-b7db-7313570c85e9 : Unknown error
8/26/2015 2:22:51 PM Traceback (most recent call last):
8/26/2015 2:22:51 PM   File "/var/lib/cattle/pyagent/cattle/agent/event.py", line 96, in _worker_main
8/26/2015 2:22:51 PM     publisher.publish(resp)
8/26/2015 2:22:51 PM   File "/var/lib/cattle/pyagent/cattle/plugins/core/publisher.py", line 24, in publish
8/26/2015 2:22:51 PM     timeout=5)
8/26/2015 2:22:51 PM   File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 508, in post
8/26/2015 2:22:51 PM     return self.request('POST', url, data=data, json=json, **kwargs)
8/26/2015 2:22:51 PM   File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 465, in request
8/26/2015 2:22:51 PM     resp = self.send(prep, **send_kwargs)
8/26/2015 2:22:51 PM   File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 573, in send
8/26/2015 2:22:51 PM     r = adapter.send(request, **kwargs)
8/26/2015 2:22:51 PM   File "/var/lib/cattle/pyagent/dist/requests/adapters.py", line 433, in send
8/26/2015 2:22:51 PM     raise ReadTimeout(e, request=request)
8/26/2015 2:22:51 PM ReadTimeout: HTTPConnectionPool(host='#######', port=8080): Read timed out. (read timeout=5)
8/26/2015 2:24:54 PM time="2015-08-26T18:24:54Z" level="error" msg="Error processing event &docker.APIEvents{Status:"stop", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613366}. Error: Bad response from [http://#######:8080/v1/containerevents], statusCode [500]. Status [500 Internal Server Error]. Body: [{"id":"e35ebdb4-9d61-4427-8055-0193e8da78a7","type":"error","links":{},"actions":{},"status":500,"code":"Internal Server Error","message":"Internal Server Error","detail":null}]"
8/26/2015 2:32:16 PM time="2015-08-26T18:32:16Z" level="info" msg="Processing event: &docker.APIEvents{Status:"destroy", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613936}"
8/26/2015 2:32:31 PM 2015-08-26 18:32:31,380 ERROR cattle [140132980674640] [utils.py:278] Failed to call (['/var/lib/cattle/config.sh', u'host-iptables', u'host-routes'],) {'stderr': -2, 'cwd': '/var/lib/cattle', 'env': {'AGENT_PARENT_PID': '30439', 'CATTLE_CONFIG_URL': 'http://#######:8080/v1', 'CATTLE_AGENT_PIDNS': 'host', 'CATTLE_URL': 'http://#######:8080/v1', 'SHLVL': '1', 'OLDPWD': '/tmp', 'HOSTNAME': 'dev-#######-docker-1', 'PWD': '/var/lib/cattle/pyagent', 'CATTLE_STORAGE_URL': 'http://#######:8080/v1', 'CATTLE_ACCESS_KEY': 'D4A5D8C55897F1EC569E', 'CATTLE_AGENT_LOG_FILE': '/var/log/rancher/agent.log', 'CATTLE_AGENT_IP': '172.30.100.121', 'CATTLE_STATE_DIR': '/var/lib/rancher/state', 'CATTLE_SECRET_KEY': '8m6b3DJ6sHgbMwS2sJDPNCAaavaWy8whnwk1KhJZ', 'CATTLE_SYSTEMD': 'false', 'RANCHER_AGENT_IMAGE': 'rancher/agent:v0.7.11', 'HOME': '/root', 'CATTLE_CADVISOR_WRAPPER': 'cadvisor.sh', 'CATTLE_PHYSICAL_HOST_UUID': '765f4d02-f745-4c1a-be0f-d6dc5f04c83f', 'CATTLE_HOME': '/var/lib/cattle', 'PATH': '/var/lib/cattle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'}}, exit [2], output :
8/26/2015 2:32:31 PM INFO: Updating host-iptables
8/26/2015 2:32:31 PM INFO: Removing host-iptables-344-d049e47d422e7c2556577ff35624f118e182ec70793d23712724afa2f3165683
8/26/2015 2:32:31 PM INFO: Downloading http://#######:8080/v1//configcontent//host-iptables current=host-iptables-355-d049e47d422e7c2556577ff35624f118e182ec70793d23712724afa2f3165683
8/26/2015 2:32:31 PM INFO: Running /var/lib/cattle/download/host-iptables/host-iptables-356-d049e47d422e7c2556577ff35624f118e182ec70793d23712724afa2f3165683/apply.sh
8/26/2015 2:32:31 PM INFO: HOME -> ./
8/26/2015 2:32:31 PM INFO: HOME -> ./etc/
8/26/2015 2:32:31 PM INFO: HOME -> ./etc/cattle/
8/26/2015 2:32:31 PM INFO: HOME -> ./etc/cattle/host-iptables
8/26/2015 2:32:31 PM INFO: Sending host-iptables applied 356-d049e47d422e7c2556577ff35624f118e182ec70793d23712724afa2f3165683
8/26/2015 2:32:31 PM INFO: Updating host-routes
8/26/2015 2:32:31 PM INFO: Removing host-routes-167-da1d80e38fde9e820c70acdfe4f806f9c860539e2168600405b9d1aca8a9541b
8/26/2015 2:32:31 PM INFO: Downloading http://#######:8080/v1//configcontent//host-routes current=host-routes-178-da1d80e38fde9e820c70acdfe4f806f9c860539e2168600405b9d1aca8a9541b
8/26/2015 2:32:31 PM
8/26/2015 2:32:31 PM gzip: stdin: not in gzip format
8/26/2015 2:32:31 PM tar: Child returned status 1
8/26/2015 2:32:31 PM tar: Error is not recoverable: exiting now
8/26/2015 2:32:31 PM
8/26/2015 2:32:31 PM Traceback (most recent call last):
8/26/2015 2:32:31 PM   File "/var/lib/cattle/pyagent/cattle/utils.py", line 274, in get_command_output
8/26/2015 2:32:31 PM     return check_output(*args, **kw)
8/26/2015 2:32:31 PM   File "/var/lib/cattle/pyagent/cattle/utils.py", line 331, in check_output
8/26/2015 2:32:31 PM     raise e1
8/26/2015 2:32:31 PM CalledProcessError: Command '['/var/lib/cattle/config.sh', u'host-iptables', u'host-routes']' returned non-zero exit status 2
8/26/2015 2:34:24 PM time="2015-08-26T18:34:24Z" level="error" msg="Error processing event &docker.APIEvents{Status:"destroy", ID:"de95648c578ab227c3b3319b7a54e28d954e72676b46a7c3d239545a417cc996", From:"docker.#######.com/#######/#######-ui:developmentb8d19a82d4833c8249966e422a8f332779d97968", Time:1440613936}. Error: Bad response from [http://#######:8080/v1/containerevents], statusCode [500]. Status [500 Internal Server Error]. Body: [{"id":"feba6865-13ee-441f-bdc6-f8149853939b","type":"error","links":{},"actions":{},"status":500,"code":"Internal Server Error","message":"Internal Server Error","detail":null}]"

So on a service upgrade I get the following:

[ec2-user@ip-172-30-0-233 matt]$ rancher-compose -f docker-compose.yml -p dev-######-main upgrade dev-#####-ui dev-######-ui-temp --wait
INFO[0000] Creating service dev-#####-ui-temp
INFO[0000] Upgrading dev-#####-ui to dev-#####-ui-temp, scale=1
FATA[0128] Bad response from [http://172.30.0.118:8080/v1/services/1s19], statusCode [500]. Status [500 Internal Server Error]. Body: [{"id":"d1c8a821-1c3d-437f-b586-f8c5beb68984","type":"error","links":{},"actions":{},"status":500,"code":"Internal Server Error","message":"Internal Server Error","detail":null}]
[ec2-user@ip-172-30-0-233 matt]$

And the Rancher server spams DB connection errors. I don’t understand this; the database is definitely configured properly, and it is used from the Rancher server UI constantly without issue.
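
Next I’m going to test whether a brand-new connection (as opposed to one already sitting in the pool) can be opened from the Rancher server host, roughly like this (endpoint redacted and the DB user is a placeholder; assumes nc and the mysql client are installed):

# A timeout here, even while the UI keeps working on pooled connections,
# would point at new connections being blocked somewhere:
nc -zv -w 5 ################.us-east-1.rds.amazonaws.com 3306
mysql -h ################.us-east-1.rds.amazonaws.com -P 3306 -u cattle -p -e 'SELECT 1;'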

In the Rancher UI I see the new service starting; the old one goes into status ‘Upgrading’ and the new one into status ‘Updating-Active’, but these operations never complete.

UPDATE: It looks like after 10 minutes or so the command finally completes, but that is far too long for an automated deployment. I must have something set up incorrectly.

Last update: Well, don’t I feel stupid. Yesterday one of my devops guys made some changes to our core infrastructure security group, which effectively prevented the Rancher server from accessing RDS. The strange part is that the change did not disconnect any of the existing connections. So when rancher-server tried to access the DB, if it could reuse an existing connection from the pool it worked fine, but if it needed to create a new connection it would fail.
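
For anyone who hits something similar: a query like the one below can confirm what the security group in front of RDS actually allows on 3306, so you can compare it against the Rancher server’s address (the group ID is a placeholder for our real one):

# Show the security group's ingress rules for port 3306 via the AWS CLI:
aws ec2 describe-security-groups --group-ids sg-01234567 \
  --query 'SecurityGroups[0].IpPermissions[?ToPort==`3306`]'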

Thanks for your help and sorry for the noise on the forum here.