Rancher upgrade to 1.2.0 blocked in "upgrading environment" and host disconnected

Hi,

I’ve decided to jump from v1.1.3 to v1.2.0 and it’s not working as expected :).

My Rancher setup is limited to one host for the rancher-ui and a single host for the default environment.

The current state is that the rancher-ui is still showing the “upgrading environment” message and that my single host is in state disconnected.

Full log available at: https://gist.github.com/looztra/2702655fa7afb1496f726da007ff94bd

Any suggestion?

I did not find anything valuable in the logs, so I gave up and started from a fresh install.

This happened to me also. I installed a simple v1.1.4 Rancher, added a few hosts, and just started up some catalog services. Then I upgraded to 1.2, and it sort of hung at “Upgrading Environment…”

I found the following in /var/log/rancher/agent.log, and the log was filled with a bunch of JSON/Dict data.

Please advise.

2016-12-07 19:20:52,816 ERROR agent [139853967076560] [event.py:112] Error in request : 1ca86a17-d59e-4a57-9aaf-5e3b8970d31c
Traceback (most recent call last):
  File "/var/lib/cattle/pyagent/cattle/agent/event.py", line 95, in _worker_main
    resp = agent.execute(req)
  File "/var/lib/cattle/pyagent/cattle/agent/__init__.py", line 15, in execute
    return self._router.route(req)
  File "/var/lib/cattle/pyagent/cattle/plugins/core/event_router.py", line 13, in route
    resp = handler.execute(req)
  File "/var/lib/cattle/pyagent/cattle/plugins/core/event_handlers.py", line 32, in execute
    type.on_ping(event, resp)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 126, in on_ping
    self._add_instances(ping, pong)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 138, in _add_instances
    running, nonrunning = self._get_all_containers_by_state()
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 171, in _get_all_containers_by_state
    for c in client.containers(all=True):
  File "/var/lib/cattle/pyagent/dist/docker/api/container.py", line 69, in containers
    res = self._result(self._get(u, params=params), True)
  File "/var/lib/cattle/pyagent/dist/docker/utils/decorators.py", line 47, in inner
    return f(self, *args, **kwargs)
  File "/var/lib/cattle/pyagent/dist/docker/client.py", line 112, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 487, in get
    return self.request('GET', url, **kwargs)
  File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/cattle/pyagent/dist/requests/adapters.py", line 453, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(111, 'ECONNREFUSED'))
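The traceback ends in ECONNREFUSED while the agent is listing containers through its Docker client, which suggests (but does not prove) that the agent could not reach the Docker daemon on that host at that moment. A minimal check along those lines, assuming the docker CLI is installed on the host; the function name and the `DOCKER_BIN` override are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reports whether the Docker daemon answers requests.
# DOCKER_BIN is an illustrative override, not a Rancher or Docker setting.
DOCKER_BIN="${DOCKER_BIN:-docker}"

docker_daemon_reachable() {
    # `docker version` queries the daemon and exits non-zero
    # when the daemon cannot be reached.
    if "$DOCKER_BIN" version >/dev/null 2>&1; then
        echo "docker daemon reachable"
        return 0
    fi
    echo "docker daemon NOT reachable"
    return 1
}
```

Calling `docker_daemon_reachable` on the disconnected host would at least tell you whether the problem is on the Docker side or somewhere between the agent and the Rancher server.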

Just in case it helps. I added three hosts running Ubuntu 14, and started Weavescope, Ghost and Wordpress from the catalogs. Nothing else is running.

Warning: this is a long one:

Rancher did have a red exclamation badge for not having access control configured. This was just a test of the install process on a temporary Rancher, so I didn’t feel it was necessary. But after poking around under the Admin menu, I found that the account.upgrade process was yellow and reporting an exception.

		I clicked on the account upgrade process and found some of these messages:
		
		
		So I clicked on "View in API" and found a bunch of these for the account upgrade process. 
		• "children": [ ],
		• "name": "EnvironmentUpgrade",
		• "startTime": 1481141854206,
		• "stopTime": 1481141854209,
		• "exception": {
		• "message": "Failed to find default template for upgrade",
		• "clz": "java.lang.IllegalStateException",
		• "cause": null,
		• "stackTrace": "java.lang.IllegalStateException: Failed to find default template for upgrade\n\tat io.cattle.platform.systemstack.process.EnvironmentUpgrade.assignTemplate(EnvironmentUpgrade.java:156)\n\tat io.cattle.platform.systemstack.process.EnvironmentUpgrade.handle(EnvironmentUpgrade.java:69)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:448)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:399)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:393)\n\tat io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:393)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:492)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:326)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:243)\n\tat io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7)\n\tat io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40)\n\tat io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:240)\n\tat 
io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:182)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:155)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:114)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:111)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40)\n\tat io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37)\n\tat io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:111)\n\tat io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl.processExecuteWithLock(ProcessInstanceDispatcherImpl.java:98)\n\tat io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1$1.doWithLockNoResult(ProcessInstanceDispatcherImpl.java:71)\n\tat io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7)\n\tat io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl$4.doWithLock(AbstractLockManagerImpl.java:50)\n\tat io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:25)\n\tat io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:47)\n\tat io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1.doRun(ProcessInstanceDispatcherImpl.java:68)\n\tat 
org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15)\n\tat org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)\n\tat org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)\n\tat org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108)\n\tat org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)\n\tat org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:745)\n"

},
		• "shouldContinue": false,
		• "shouldDelegate": false,
		• "chainProcessName": null

OK, last comment… I retried the entire setup from scratch, making sure there were no images or containers on the system before starting. This time I added local authentication as soon as the v1.1.4 Rancher was started. The 1.2 Rancher seems to be in exactly the same state: hung at “Upgrading Environment…”

$ docker run --name rancher -d --restart=unless-stopped -p 8080:8080 rancher/server:v1.1.4
# added local authentication
# started Weavescope and WordPress stacks from the catalog
$ docker stop rancher
$ docker create --volumes-from rancher --name rancher-data rancher/server:v1.1.4
$ docker pull rancher/server:latest
$ docker run -d --volumes-from rancher-data --restart=unless-stopped -p 8080:8080 rancher/server:latest

I had exactly the same problem. It completely broke my setup, so Rancher (either version) wouldn’t start anymore. I’m having to restore the database to a previous point. We’re moving everything to Kubernetes; this was the last straw.

I had the exact same problem upgrading from Rancher v1.1.1.

We have a Rancher setup with an external MariaDB database server. The server has no internet access, and I suspect that is causing the problem. Could that be the case?

The exception can be found below.

Moreover, I noticed that there are some SNAPSHOT dependencies listed in the stack trace. I guess I don’t need to mention that having SNAPSHOT dependencies in releases is not really a best practice.

Exception:

2017-01-24 22:30:48,905 ERROR [:] [] [] [] [cutorService-50] [.e.s.i.ProcessInstanceDispatcherImpl] Unknown exception running process [account.upgrade:2630147] on [10] java.lang.IllegalStateException: Failed to find default template for upgrade
        at io.cattle.platform.systemstack.process.EnvironmentUpgrade.assignTemplate(EnvironmentUpgrade.java:156) ~[cattle-system-stack-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.systemstack.process.EnvironmentUpgrade.handle(EnvironmentUpgrade.java:69) ~[cattle-system-stack-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:448) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:399) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:393) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:393) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:492) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:326) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:243) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:240) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:182) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:155) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:114) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:111) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:111) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl.processExecuteWithLock(ProcessInstanceDispatcherImpl.java:98) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1$1.doWithLockNoResult(ProcessInstanceDispatcherImpl.java:71) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$4.doWithLock(AbstractLockManagerImpl.java:50) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:25) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:47) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1.doRun(ProcessInstanceDispatcherImpl.java:68) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]

Yes, you need all the appropriate images and catalogs available for anything to work. http://docs.rancher.com/rancher/v1.3/en/installing-rancher/installing-server/no-internet-access/#using-a-private-registry

They’re our libraries; SNAPSHOT != shipping random untested uncommitted code.

Thanks for your reply.

Details:

  • upgrading from rancher/server:v1.1.1
  • external database (mariadb)
  • server has no internet access at all
  • only “cattle” environments

I now performed the following actions:

  • make a DB backup
  • make the following images available in our private registry:
    rancher/scheduler:v0.5.1
    rancher/net:v0.8.1
    rancher/network-manager:v0.4.0
    rancher/healthcheck:v0.2.0
    rancher/agent:v1.1.2
    rancher/lb-service-haproxy:v0.4.6
  • I started rancher using the old database and new rancher/server image
  • When I go to “Environments” I see that “Environment templates” is empty. I think it’s not indexing the environment templates that are already present in /var/lib/cattle/DATA/library
  • When I click “Upgrade” on an environment, it fails because the environment templates are not there (see screenshot)
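The "make the images available in our private registry" step above can be sketched roughly as follows. This is only an illustration: the registry address `registry.example.internal:5000` is a placeholder, the `REGISTRY` and `DRY_RUN` variables and the function names are made up, and whether Rancher actually picks the images up from a private registry depends on how the server and environments are configured. The image list is copied from the post.

```shell
#!/bin/sh
# Hypothetical registry address; replace it with your own.
REGISTRY="${REGISTRY:-registry.example.internal:5000}"
# DRY_RUN=1 (the default in this sketch) only prints the docker commands.
DRY_RUN="${DRY_RUN:-1}"

# Image list copied from the post above.
IMAGES="rancher/scheduler:v0.5.1
rancher/net:v0.8.1
rancher/network-manager:v0.4.0
rancher/healthcheck:v0.2.0
rancher/agent:v1.1.2
rancher/lb-service-haproxy:v0.4.6"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

mirror_images() {
    # Pull each image, retag it under the registry prefix, and push it.
    for image in $IMAGES; do
        target="$REGISTRY/$image"
        run docker pull "$image" || return 1
        run docker tag "$image" "$target" || return 1
        run docker push "$target" || return 1
    done
}
```

Running `mirror_images` as-is prints the pull, tag, and push commands for review; setting `DRY_RUN=0` on a machine that can reach both Docker Hub and the registry would execute them.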

Side research:

  • I set up a Rancher server on my machine with internet access and it managed to do the upgrade just fine. There I see the “Environment templates” (Cattle, Kubernetes, etc.) available
  • I git clone’d your rancher certified library and compared the whole directory structure with the structure in the rancher/server image (on my server w/o internet access), and there are no differences.

What can I try next?

Hi @vincent, do you have any suggestions? For our TA environment we can simply install a new Rancher server and move on, but for P this is not the case.

I also ran into a similar issue. I managed to roll back to an older version to get it working again, but the error is the same when upgrading: Failed to find default template for upgrade.

What are possible solutions?

In the end, what I needed to do was:

  0. Back up the cattle database.
  1. (Optional, I guess) Delete from the DB all instances that had been in status “Stopping” for a long time. I don’t know how they ended up in my DB, but they caused a lot of errors.
  2. Arrange internet access (proxy) for the Rancher server. This was the only way for me to make the environment templates available in Rancher.
  3. Download the following Docker images from Rancher and make them available on all the servers that need an upgrade (yes, also on servers that only run a Rancher agent). Unfortunately, Rancher doesn’t pull them from a (custom) registry.
    rancher/agent:v1.1.2
    rancher/network-manager:v0.2.19
    rancher/scheduler:v0.4.0
    rancher/net:v0.7.5
    rancher/lb-service-haproxy:v0.4.6
    rancher/healthcheck:v0.1.0
    rancher/metadata:v0.6.8
    rancher/dns:v0.11.0
  4. Click “Upgrade environment” and wait for the process to finish, for every environment. I had 1 service that had a load balancer. That one failed the upgrade, so I decided to delete it and recreate it later.
  5. After a successful upgrade, I recommend upgrading to 1.3.4 (latest stable) immediately. For that you need:
    rancher/agent:v1.1.3
    rancher/lb-service-haproxy:v0.4.9
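Step 3 (getting the images onto hosts that cannot pull them) could be done with `docker save` and `docker load`, roughly as sketched here. The archive name, the `DRY_RUN` switch, and the function names are made up for illustration; the image list is the one from step 3, and copying the archive between machines (for example with scp) is left out.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the docker commands.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical archive file name.
ARCHIVE="${ARCHIVE:-rancher-upgrade-images.tar}"

# Image list copied from step 3 above.
IMAGES="rancher/agent:v1.1.2
rancher/network-manager:v0.2.19
rancher/scheduler:v0.4.0
rancher/net:v0.7.5
rancher/lb-service-haproxy:v0.4.6
rancher/healthcheck:v0.1.0
rancher/metadata:v0.6.8
rancher/dns:v0.11.0"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on a machine with internet access.
export_images() {
    for image in $IMAGES; do
        run docker pull "$image" || return 1
    done
    # docker save accepts several images and writes them into one tar archive.
    run docker save -o "$ARCHIVE" $IMAGES
}

# Run on each host after copying the archive there.
import_images() {
    run docker load -i "$ARCHIVE"
}
```

With the defaults, `export_images` and `import_images` only print what they would run, which makes it easy to check the list before setting `DRY_RUN=0`.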