Load balancer failing on new hosts after upgrade of rancher server

We have a rancher-server with 10 hosts running around 250 containers. We had a lot of UI issues caused by heap dumps, and saw in a post that the heap dump problem is fixed in version 1.0.1, so we upgraded rancher-server to 1.0.1. Performance has definitely improved, but we are now having problems with the load balancer.

We have a load balancer that runs one container on each host and uses hostname-based routing.

After the upgrade, when we add a new host it spins up agent v0.8.1, but our existing hosts are running version 0.8.0. Below is the error I see when spinning up the load balancer:

 Degraded (Waiting for [instance:Default_rancher-lb_1]. Instance status: 500 Server Error: Internal Server Error ("Cannot start container 58fd1a41f3121143d3e5ce41a9706b15ee820f55c233980101af47f9eb93192e: [9] System error: argument list too long"))

time="2016-05-04T15:28:57Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"57e4efc4f995535e7743f55e3c863f3e2473e3b5536002638c88bd68a2694714\", From:\"rancher/agent-instance:v0.8.1\", Time:1462375737}" 
time="2016-05-04T15:28:57Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"die\", ID:\"57e4efc4f995535e7743f55e3c863f3e2473e3b5536002638c88bd68a2694714\", From:\"rancher/agent-instance:v0.8.1\", Time:1462375737}" 
2016-05-04 15:28:57,476 ERROR agent [139765886887568] [event.py:112] Error in request : 66fbd568-c411-4555-a0da-1b827fd8492e 
Traceback (most recent call last):
  File "/var/lib/cattle/pyagent/cattle/agent/event.py", line 95, in _worker_main
    resp = agent.execute(req)
  File "/var/lib/cattle/pyagent/cattle/agent/__init__.py", line 15, in execute
    return self._router.route(req)
  File "/var/lib/cattle/pyagent/cattle/plugins/core/event_router.py", line 13, in route
    resp = handler.execute(req)
  File "/var/lib/cattle/pyagent/cattle/agent/handler.py", line 34, in execute
    return method(req=req, **req.data.__dict__)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 529, in instance_activate
    self._do_instance_activate(instance, host, progress)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 608, in _do_instance_activate
    client.start(container_id)
  File "/var/lib/cattle/pyagent/dist/docker/utils/decorators.py", line 21, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/var/lib/cattle/pyagent/dist/docker/api/container.py", line 363, in start
    self._raise_for_status(res)
  File "/var/lib/cattle/pyagent/dist/docker/client.py", line 146, in _raise_for_status
    raise errors.APIError(e, response, explanation=explanation)
APIError: 500 Server Error: Internal Server Error ("rpc error: code = 2 desc = "oci runtime error: argument list too long"")
time="2016-05-04T15:28:57Z" level="info" msg="Container [57e4efc4f995535e7743f55e3c863f3e2473e3b5536002638c88bd68a2694714] not running. Can't assign IP [10.42.250.37/16]." 

In theory, there should be no issues with your load balancer running different versions of rancher/agent-instance as we would upgrade the software inside the container to match what the new image would be running.

Are you still having these issues? Have you tried deleting all the old versions of the load balancer containers so that new versions of the containers come up?

@denise, yes, I tried, but it wasn’t launching the new version of the agent. Below is the docker-compose file:

rancher-lb:
  ports:
  - 80:80
  external_links:
  - SVG2JPG/svg2jpg:svg2jpg
  labels:
    io.rancher.loadbalancer.target.SVG2JPG/svg2jpg-test: svg2jpg-test.synduit.com:80=80
  tty: true
  image: rancher/load-balancer-service
  stdin_open: true

Like SVG2JPG, we have 800 rules in external_links and io.rancher.loadbalancer.target. I trimmed the config for convenience so I could paste it here.

@Swaroop_Kundeti
I know you have about 800 rules in external_links and io.rancher.loadbalancer.target, but in the one you provided, the external_links entry and the target service are not exactly the same.

One is SVG2JPG/svg2jpg and the other is SVG2JPG/svg2jpg-test. Can you confirm that’s a typo?
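
With roughly 800 rules, a mismatch like this is easier to find with a script than by eye. Below is a minimal Python sketch, assuming the service definition has already been loaded into a plain dict shaped like the compose snippet above. The function name `unmatched_targets` and the dict layout are illustrative assumptions, not part of Rancher.

```python
# Sketch only: `unmatched_targets` is a hypothetical helper, not a Rancher API.
TARGET_PREFIX = "io.rancher.loadbalancer.target."


def unmatched_targets(service):
    """Return target-label services that have no matching external_links entry."""
    # external_links entries look like "STACK/service:alias"; keep the part before ":".
    linked = {link.split(":", 1)[0] for link in service.get("external_links", [])}
    # Target labels look like "io.rancher.loadbalancer.target.STACK/service".
    targets = {
        key[len(TARGET_PREFIX):]
        for key in service.get("labels", {})
        if key.startswith(TARGET_PREFIX)
    }
    return sorted(targets - linked)


# The trimmed service from the compose file posted above.
rancher_lb = {
    "external_links": ["SVG2JPG/svg2jpg:svg2jpg"],
    "labels": {
        "io.rancher.loadbalancer.target.SVG2JPG/svg2jpg-test":
            "svg2jpg-test.synduit.com:80=80",
    },
}

if __name__ == "__main__":
    print(unmatched_targets(rancher_lb))
```

For the trimmed example this reports `SVG2JPG/svg2jpg-test`, which is the discrepancy in question.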

When upgrading, did your docker version change?

Each hostname routing rule is a label on a Docker container. Could you test whether Docker can handle 800 labels in a docker run command?
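
One rough way to set up that test is to generate the command instead of typing it. The Python sketch below builds a `docker run` argument list with one `--label` per rule and reports its size next to the host's `SC_ARG_MAX`. The `busybox` image, the `STACKn/servicen` names, and the `example.com` hostnames are placeholders I made up for illustration; only the label prefix comes from the compose file above. It does not start a container by itself, and a command line that fits on the client side would not rule out a limit being hit later when the runtime starts the container.

```python
import os

# Label prefix taken from the compose file in this thread.
LABEL_PREFIX = "io.rancher.loadbalancer.target."


def build_docker_run_args(rule_count, image="busybox"):
    """Build a `docker run` argument list with one --label per routing rule."""
    args = ["docker", "run", "--rm"]
    for i in range(rule_count):
        # Placeholder stack, service and hostname; substitute real rules to test.
        key = f"{LABEL_PREFIX}STACK{i}/service{i}"
        value = f"service{i}.example.com:80=80"
        args += ["--label", f"{key}={value}"]
    # Run a trivial command so the container exits immediately.
    return args + [image, "true"]


def args_bytes(args):
    """Approximate bytes the argument strings occupy (each plus a NUL terminator)."""
    return sum(len(a.encode("utf-8")) + 1 for a in args)


if __name__ == "__main__":
    args = build_docker_run_args(800)
    print(f"{len(args)} arguments, {args_bytes(args)} bytes")
    print(f"SC_ARG_MAX on this host: {os.sysconf('SC_ARG_MAX')} bytes")
    # To actually try it on a test host: subprocess.run(args, check=True)
```

Passing the list to `subprocess.run` on a spare host would show whether the container starts with that many labels.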

@denise The Docker version did not change; it stayed the same. Anyway, we moved from the Rancher load balancer to Nginx. It's all fine now. Thank you.