Local IP for rancher agent with Scale Sets

I’ve been fiddling around, trying to integrate Rancher with the new Azure Virtual Machine Scale Sets (https://azure.microsoft.com/en-us/blog/azure-vm-scale-sets-public-preview/).

I was looking to achieve this by extending my previous ARM template: https://github.com/Azure/azure-quickstart-templates/blob/master/docker-rancher/nodes.json

For those not familiar with ARM (Azure Resource Manager): I’m basically using Docker Compose to deploy the Rancher agent:

         "compose": {
                "rancheragent": {
                  "image": "rancher/agent:v0.8.2",
                  "restart": "always",
    			  "privileged": true,
    			  "volumes": [
                    "/var/run/docker.sock:/var/run/docker.sock"
                  ],
                  "command": "[parameters('rancherApi')]"
                }
              }

The downside of this approach is that the agent always uses the external/public IP address for the server communication. If you didn’t do this, inter-host networking would fail due to the IPsec VPN setup underneath.

Since the host IP is dynamic (only known at boot), I’m having trouble setting the CATTLE_AGENT_IP environment variable, so I ended up needing the public IP. That has a huge downside, though: the number of public IP addresses in Azure is limited (and charged for), and when using scale sets you would typically scale beyond those limits.
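
For reference, this is roughly the “docker run” equivalent of the compose block above, but with the agent IP pinned explicitly via CATTLE_AGENT_IP. The IP and registration URL below are placeholders, not values from my template:

    # Same agent as in the compose block, but with the IP pinned.
    # 10.0.0.4 and the registration URL are placeholders.
    sudo docker run -d --privileged --restart=always \
      -e CATTLE_AGENT_IP=10.0.0.4 \
      -v /var/run/docker.sock:/var/run/docker.sock \
      rancher/agent:v0.8.2 \
      http://<rancher-server>:8080/v1/scripts/<registration-token>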

Any suggestions on how to tackle this? The paths I’ve considered:

  • using the variable interpolation of Docker Compose in combination with CATTLE_AGENT_IP => though I don’t think this would prove stable
  • deploying the server in the same subnet & using the internal address as host IP => not tested, not sure it would fix it
  • extending the Docker images with a bash script that fills in the IP dynamically => though this is very work-intensive when it comes to upgrades
  • extending the ARM template with a shell script as wrapper (see the sketch after this list) => at the moment this seems to be the best way, though it is far more complex than a “simple” docker compose
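
To make that last option concrete, here’s a minimal sketch of such a wrapper, assuming the ARM CustomScript extension hands over the registration URL as the first argument (that calling convention is my assumption, not something in the current template):

    #!/bin/bash
    # wrapper.sh - sketch of the shell-wrapper idea (last option above).
    # Assumption: the ARM CustomScript extension passes the Rancher
    # registration URL as $1.
    RANCHER_URL="$1"

    # Take the source IP of the default route, i.e. the Azure-internal
    # address of this scale set instance.
    LOCAL_IP=$(ip route get 8.8.8.8 \
      | awk '{for (i=1;i<=NF;i++) if ($i=="src") {print $(i+1); exit}}')

    docker run -d --privileged --restart=always \
      -e CATTLE_AGENT_IP="$LOCAL_IP" \
      -v /var/run/docker.sock:/var/run/docker.sock \
      rancher/agent:v0.8.2 "$RANCHER_URL"

This would work, but it trades one declarative compose block for an imperative script, which is exactly the extra complexity I mentioned.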

Anyhow, am I the only one experiencing these kinds of deployment issues? Or am I pushing it too far in terms of automation? Any suggestions on the best course of action to get “easy” scalability in terms of hosts?

TL;DR

  • CATTLE_AGENT_IP is needed as the # of public IPs is limited
  • setting CATTLE_AGENT_IP dynamically/automatically is not without implementation risks
  • asking for suggestions / take me to school! :wink:

Issues that I’m currently facing when working with auto scale sets:

  • When scaling down
    Hosts go into the “reconnecting” state, where I would expect a cleanup after a given period, i.e. automatically removing disconnected hosts.

  • When scaling up
    As I want to scale from 0 to …, I’m limited by the number of public IPs I can assign (in the case of Azure). So I want the inter-host network to use the Azure-internal networks and to expose services via the load balancer. The caveat here is that I can only do this via CATTLE_AGENT_IP. Suggestion: a “switch” (triggered by an environment variable) that tells the agent to use a local network interface instead of the source-NAT IP address of the host. OR (more complex) extend the agent to work in NATted environments, as the main reason for needing the public IP address is that ports 500 & 4500 are not configurable.

Rancher will most likely never automatically remove a disconnected host, as we don’t know why the host is in the reconnecting state. Also, I couldn’t find one, but please feel free to create a GitHub issue for an option to automatically clean up hosts after a certain period.

Please also feel free to open a GitHub issue on the switch, i.e. trying to use the private IP instead of the public one.

@denise: Good suggestion!

https://github.com/rancher/rancher/issues/3745
https://github.com/rancher/rancher/issues/3746