Viability of going directly from 0.51.0 to 1.0.1?

I’d love to get my Rancher instance running on the latest stable (1.0.1). But since I’ve got some production workloads running, I’m cautious about taking anything down without a plan.

My understanding is that 0.56.0 introduced charon as a replacement for racoon for the IPsec tunnels. I’m also unclear whether there are special migrations that need to be done for each intermediate release (0.51.0 -> 0.56.0 -> ??? -> 1.0.1).

So my question is: how do I perform this upgrade, and should I plan for containers to be restarted and/or to lose connections?

We typically test the upgrade path from the previous release to the new release and aren’t able to test the many combinations of even older releases to the latest.

Since you have stuff in production, what I’d advise is to set up a test environment, export some of your production stacks, and use rancher-compose to import them into the test environment to see how the upgrade fares from v0.51.0 to v1.0.1.
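As a rough sketch of the import side, assuming you’ve exported a stack’s docker-compose.yml and rancher-compose.yml into the current directory and created an API key for the test environment (the URL, keys, and stack name below are just placeholders):

export RANCHER_URL=http://test-rancher:8080
export RANCHER_ACCESS_KEY=<access key>
export RANCHER_SECRET_KEY=<secret key>
rancher-compose -p my-stack up -d

rancher-compose picks up docker-compose.yml and rancher-compose.yml from the current directory by default, so that should recreate the stack in the test environment.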

OK, so we spent a lot of this week mocking up our production environment and doing a staged upgrade. The Rancher server came up after a bit. Then we saw the agents upgrade to v1.0.1. At this point networking was working and containers seemed happy.

We then deployed a stack to the environment, and this is where things went south. The new stack came up and then the existing stacks stopped communicating with each other. The new stack doesn’t appear to be able to communicate with other containers either (guessing this is where the charon upgrade was triggered?).

I’d prefer NOT to have to restart / redeploy containers or reboot the hosts. I’m wondering what the next thing to try here would be?

Can you ping the IP of a container on one of the hosts, from another host? There are some commands you can run in the network agent to troubleshoot:

The default location for the strongSwan configuration file is /etc/strongswan.conf
Control is via the swanctl command:
swanctl --list-sas to list the security associations (the IPsec tunnels)
swanctl --list-conns to list the loaded connections
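If it’s easier, you can run those from the host by exec’ing into the network agent container; the image name I’m grepping for here is an assumption based on how the agent normally shows up in docker ps, so adjust to what you actually see:

docker ps | grep agent-instance
docker exec -it <network agent container id> swanctl --list-sas
docker exec -it <network agent container id> swanctl --list-conns

And from another host, try pinging the container’s managed-network IP (the 10.42.x.x address shown in the UI) to confirm whether cross-host traffic is flowing at all.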

Another option would be to delete the Network Agent and then do something that triggers networking on that host so that a new network agent gets deployed. We typically put upgrades in the network agent, but that upgrade could have gone south, so deploying a fresh network agent is worth trying.
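Something along these lines from the host should do it; the grep pattern for the agent container and the idea that poking any managed container is enough to trigger a redeploy are assumptions on my part:

docker rm -f $(docker ps -a | grep agent-instance | awk '{print $1}')
docker restart <any container in the managed network>   # nudges Rancher into scheduling a fresh network agent on the host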

Is there a difference between restarting the network agent and removing it? We ended up just restarting it, and communication between containers came back within a matter of seconds. We saw in the logs that it downloaded something and seemed to re-provision itself.

We then removed the agent on another node and it did something similar, but it seemed to install a lot more and took much longer. My current plan is to script the network agent restart in Ansible so we can restart it on all servers after the upgrade. Wondering if we need to actually remove it instead?
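Roughly what I was planning to wrap in an Ansible shell task on each host (grepping for the agent-instance image is a guess on my part, and this is the restart variant rather than a remove):

docker restart $(docker ps | grep agent-instance | awk '{print $1}')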

There should be no difference. Even though the network agent will be a different tagged version, the software inside should be updated to match what would happen if you launched the correctly tagged version of the network agent for your Rancher server version.

But restarting the network agent, as opposed to stopping and starting it (or similarly deleting and recreating it), has triggered bugs in Docker before that end up making the daemon unresponsive. So we generally just say to delete it.

Yep, we ran into that scenario: Docker stopped responding on a couple of hosts (maybe 5 out of 50 or so). We also needed to restart some of the load balancers (maybe because of an HAProxy syntax change for proxies using custom parameters?). Other than that, we were able to upgrade all our hosts in about half an hour, with less than 10 minutes of network downtime (plus the downtime of the services that happened to be on the borked servers where Docker locked up).