Help. *Everything* is going to shit

I like Rancher. I really do. Unfortunately, from time to time, Rancher commits suicide and I would love to know why.

We currently run a dev stack (recently upgraded to 1.5.2) and a production stack (until yesterday happily running on 1.4.1). The “happily” part changed yesterday.

In short: everything is going to shit right now!

So that in itself is bad enough. It might even be my fault (although I didn’t do anything since yesterday, and my team didn’t do anything that I know of). But even if it is, I still want to fix it.

Just … HOW??? I don’t know …

  • … why something is not coming up
  • … why basically ANYTHING is in some kind of fucked up “Upgrading” state
  • … what the error messages mean (“Waiting: container stopping [container:1i32783]” - for OVER ONE HOUR now. So??)
  • … what I can do to fix anything.

I NEED HELP. Sorry for yelling. But I really do.

Let me illustrate

This has been like this for about an hour:

Why??

The other infra stacks look like this (same time span):

What’s happening??

This is one of our production stacks. Did I mention we did NOT trigger any updates (except the LB; that was me, and it doesn’t work either)?? And pretty much all other stacks look exactly like this.

Oh yeah, the hosts are super-fine, btw:

I also get the following in the Rancher server logs:

2017-03-23 08:29:37,911 ERROR [1e1a92d3-7dc5-4ae3-900a-00afb50efdfe:844001] [instance:32791->instanceHostMap:10457] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-7] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=1453]: Timeout getting IP address 
2017-03-23 08:29:37,911 ERROR [1e1a92d3-7dc5-4ae3-900a-00afb50efdfe:844001] [instance:32791] [instance.start->(InstanceStart)] [] [ecutorService-7] [i.c.p.process.instance.InstanceStart] Failed [2/3] to Starting for instance [32791] 
2017-03-23 08:29:38,029 ERROR [1e1a92d3-7dc5-4ae3-900a-00afb50efdfe:844001] [instance:32791->instanceHostMap:10457] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-7] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=1453]: Timeout getting IP address 
2017-03-23 08:29:38,029 ERROR [1e1a92d3-7dc5-4ae3-900a-00afb50efdfe:844001] [instance:32791] [instance.start->(InstanceStart)] [] [ecutorService-7] [i.c.p.process.instance.InstanceStart] Failed [3/3] to Starting for instance [32791] 
2017-03-23 08:29:38,298 ERROR [1e1a92d3-7dc5-4ae3-900a-00afb50efdfe:844001] [instance:32791] [instance.start] [] [ecutorService-7] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=1453]: Timeout getting IP address 

What’s this??

I restarted the Rancher master now, after rebooting basically all hosts one-by-one, and suddenly things … work.

Rebooting the hosts did not change a thing, but restarting the Rancher master did. What is this?!? I am on a stable image. I would not expect this kind of behavior.

Anyway, it’s green now - and I will include “reboot Rancher” in my troubleshooting docs. Hello, good old Windows days - rebooting as the first step of problem solving never gets old, it seems.

I would still greatly appreciate any feedback on what this is and why it was happening.

Now I am trying to fix our dev cluster running on 1.5.2. Restarting anything did not help. I get “Error checking for upgrade” on pretty much every service, which is backed up by this entry in the logs:

2017-03-23 09:54:09,225 ERROR [:] [] [] [] [tp859417998-169] 
  [i.g.i.g.r.handler.ExceptionHandler  ] 
  Exception in API for request 
  [http://OUR_RANCHER_DEV_URL/v1/proxy/http://localhost:8088/v1-catalog/catalogs]. 
  Error id: [589a14ca-3ce1-407a-bbea-bc9b057bcc1a]. 
  org.apache.http.conn.HttpHostConnectException: Connect to localhost:8088 
  [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused

I also get an internal 500 error when going to the catalog page, and again I have absolutely no clue how to fix this. It seems Rancher is trying to proxy a request to an internal proxy path endpoint, but the target (“localhost”) is not in the settings entry for allowed proxy targets. I don’t know if this makes a difference, because adding “localhost” didn’t work.
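
For reference, this is how I poked at the settings from the shell instead of the UI - the exact key name of the proxy whitelist is a guess on my part, hence the grep (adjust URL and API keys for your install):

# list all API settings and look for anything proxy-related
# (URL, keys, and the whitelist key name are placeholders / assumptions)
curl -s -u "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" \
        "http://OUR_RANCHER_DEV_URL/v1/settings?limit=1000" | grep -i proxy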

Really, this is not the day Rancher shines.
I do like the new secrets feature, though.

I also get this every two seconds:

time="2017-03-23T10:03:59Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:01Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:03Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:05Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:07Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:09Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:11Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:13Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:15Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:17Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:19Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:21Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:23Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:25Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:27Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 
time="2017-03-23T10:04:29Z" level=fatal msg="Error 1045: Access denied for user 'cattle'@'10.10.15.83' (using password: YES)" 

The IP is our rancher-dev master, and this seems to be a MySQL error message. It does connect to the database though (in earlier versions it did not even come up without a DB), but I pass a different user / password as env variables:

docker run \
        --net bridge -m 0b \
        -e CATTLE_DB_CATTLE_MYSQL_HOST=OUR_DB \
        -e CATTLE_DB_CATTLE_MYSQL_NAME=DB_NAME \
        -e CATTLE_DB_CATTLE_MYSQL_PORT=3306 \
        -e CATTLE_DB_CATTLE_MYSQL_USERNAME=OUR_DB_USER \
        -e CATTLE_DB_CATTLE_MYSQL_PASSWORD=YUP \
        -p 80:8080  \
        --name rancher-dev \
        rancher/server:v1.5.2
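
Purely as a debugging idea of mine: a quick sanity check against the DB from the Rancher host, using the same placeholders as in the command above, to see whether the default “cattle” user even exists and whether our real credentials still work:

# expected to be refused if no 'cattle' user exists on OUR_DB
mysql -h OUR_DB -P 3306 -u cattle -p -e 'SELECT 1;'
# our actual credentials from the env variables above
mysql -h OUR_DB -P 3306 -u OUR_DB_USER -p DB_NAME -e 'SELECT 1;'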

[UPDATE]

I was able to get rid of the error by using the command line parameters (--db-*) for the settings. Then it worked, and we also got connectivity to the catalog back.
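
For completeness, this is roughly the run command that worked for me - the flag names are the --db-* parameters from the external-database section of the Rancher docs, so double-check them against your version:

docker run -d \
        --net bridge \
        -p 80:8080 \
        --name rancher-dev \
        rancher/server:v1.5.2 \
        --db-host OUR_DB \
        --db-port 3306 \
        --db-user OUR_DB_USER \
        --db-pass YUP \
        --db-name DB_NAME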

Your production issue looks like it was due to the network-manager being hurt somehow. Not sure if the logs for those containers still exist; it would be good to take a look at those. Scrub them for anything sensitive.
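
Something along these lines on one of the affected hosts should get them (container names vary per setup, so the grep is just to find yours):

# find the network infrastructure containers on the host
docker ps -a | grep -i network
# dump the logs of the network-manager container (use the name/ID from above)
docker logs --tail 1000 <network-manager-container> > network-manager.log 2>&1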

It looks like you were trying to upgrade all your infra stacks at once. In our upgrade docs, we recommend upgrading one infra stack at a time and finishing the upgrade before upgrading the next one.

Thing is, I didn’t do this. That was the state I found when opening the UI …

I feel your pain. We are currently dealing with two of our Rancher environments being down.

Services don’t just randomly upgrade themselves… So the obvious explanation would be that a human triggered it, or that the new option in 1.5.x to automatically upgrade infrastructure stacks kicked in, which should currently be off by default (Admin -> Settings -> Advanced Settings -> upgrade.manager). Either way, you can look at the Audit Log to see what triggered it.
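
You can also check the current value of that setting through the API instead of the UI, something like this (adjust URL and API keys for your install):

curl -s -u "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" \
        "http://OUR_RANCHER_DEV_URL/v1/settings/upgrade.manager"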

In 1.5 the catalog service uses the DB too, so you need to use the CLI flags as you’ve noted. This is also where “Error checking for upgrade” comes from (UI can’t reach catalog because it’s not running).

Things that seemed to help us:

  • restart the Rancher master
  • use command line params to pass DB config, not env variables (for 1.5.2)
  • use the rancher cli as often as you can - the UI still freezes your system (at least ours) when shit goes down (see the sketch after this list)
  • re-deploy the infra scheduler stack (delete, deploy! not stop, start)
  • be patient :wink:
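
For the CLI point, the handful of commands we kept reaching for looked roughly like this - it assumes RANCHER_URL, RANCHER_ACCESS_KEY and RANCHER_SECRET_KEY are exported for the environment, and the service names come from whatever "rancher ps" shows you:

# state of all stacks/services without opening the UI
rancher ps
# logs of a stuck service (name or ID as shown by 'rancher ps')
rancher logs <service>
# restart a wedged service
rancher restart <service>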

I unfortunately also don’t know more.

@vincent - yes, I kinda figured that. I do not recall having read that in the release notes though :confused:.

I also have no idea who might have triggered the service upgrades, because they should have been current, and nobody from my team did it, I am sure of that. Actually, I think it might have been this weird “updating-acti…” state in the beginning, where I then pressed the upgrade button since it definitely was not working. I am not 100% sure about that; I tried a lot of stuff.

[UPDATE - I read the original post again, and if I said there that I did not trigger updates, that’s most probably the case. I wrote the post in parallel with trying to fix things, so when I wrote it I knew where I was in the process…]

This was also a 1.4.1 release, so that setting was not there yet AFAIK …

I had this issue last week as well with v1.4.1.
The network manager stopped DNS from working, causing all manner of havoc. I tried manually setting hosts files etc.,
but in the end I had to restart all nodes for it to come back up as well. I wasn’t able to track down the issue. It happened across 2 separate AWS VPCs at the same time, too.
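
For anyone hitting the same thing, a quick way to see whether the internal DNS is still answering at all (169.254.169.250 is Rancher’s internal DNS/metadata address in 1.x; the container and service/stack names are placeholders, and the image needs nslookup installed):

# run on an affected host, from inside any managed container
docker exec -it <any-managed-container> \
        nslookup myservice.mystack.rancher.internal 169.254.169.250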

Two clusters at the same time? THAT is bad news. We just added a 2nd cluster to cope with a single-cluster outage. If you had it like this … hm.