Rancher is feeling like a mistake. Help!

I posted about 2.5 weeks ago and got no help.

We have been “using” Rancher in production for 2+ months and have committed to it, which is feeling like a mistake.

We can put a light load on it, and it seems to work fine… but once you push anything to it (30 to 50 services), it starts really misbehaving… deployments taking 1 hour+, a lot of services not starting or working, the front end not responding.

And, the UI goes unresponsive, forcing a reboot.

I’m not sure what the problem is with it. We’ve tried a myriad of things. I posted 2+ weeks ago, filled in a bunch of details for people, and got no answer.

So, is Rancher BS and not really ready for this? I need to know. It’s killing my little startup at this point, with a simple “code push” causing massive instability, downtime, and 1+ day of cleaning crap up to get things running again.

HELP!

To add to this, our logs are full of this:

time="2016-12-09T13:35:16Z" level=info msg="Setting log level" logLevel=info
time="2016-12-09T13:35:16Z" level=info msg="Starting go-machine-service…" gitcommit=v0.34.1
time="2016-12-09T13:35:16Z" level=info msg="Waiting for handler registration (1/2)"
time="2016-12-09T13:35:16Z" level=info msg="Starting rancher-compose-executor" version=v0.12.0
time="2016-12-09T13:35:28Z" level=fatal msg="Exiting go-machine-service: Timed out waiting for transtion."

If that helps.
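When the server log is long, a small filter can make the fatal entries easier to spot. The following is only a sketch: the helper names `fatal_lines` and `count_by_level` are made up for illustration, and it assumes the log lines keep the `level=...` key/value format shown above.

```shell
# Hypothetical helper: print only the fatal entries from
# key=value log lines read on standard input.
fatal_lines() {
  grep 'level=fatal'
}

# Hypothetical helper: count how many log lines there are per level
# (info, fatal, and so on), one "count level=name" pair per line.
count_by_level() {
  grep -o 'level=[a-z]*' | sort | uniq -c
}
```

Something like `docker logs <rancher_server_container_ID> 2>&1 | fatal_lines` would then show just the lines where a component gave up, such as the go-machine-service exit above.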

Another bit of data… our image sizes are pretty large, about 2 GB each, if that makes any difference?

It seems like your rancher-compose-executor is not able to start up. Could you try this on the host running rancher server?

docker exec <rancher_server_container_ID> killall -9 rancher-compose-executor

I will do this…

But, this happens with any deployment/load.

I worked all morning to reduce the image size to around 1 GB. We managed to get a subset of the system up, but now, as is becoming usual with Rancher… the UI is giving us the grey spinner and no actual UI. The “work” of setup is already done (I think), and now it just died.

Would this error cause that?

According to Raghav, the executor is not failing right now. However, the UI shows the spinner and then never returns anything else.

Refreshing, different browsers, different locations: all have this same result. It will continue until we restart the Docker container.

I have taken to calling it the grey cow of death. The screenshot doesn’t show it, but it greys out over time.

P.S. We have run your command… it doesn’t change anything.

To give even more data: this is an 8-core, 64 GB RAM machine with a 1 TB disk. The database is hosted on the same machine. The machine doesn’t show any load at all.

Here is the end of our logs. The agents don’t seem to be reachable, and UI is unresponsive.

Network characteristics: 1Gb private LAN between all nodes.

What version of Rancher? What version of Docker? Is the DB internal to Rancher server, or do you have a MySQL service running on the box which holds the DB? Where are your servers hosted? What are the specs of the Rancher hosts? What is the disk I/O of the Rancher server and hosts when it takes 1 hour to deploy something?

Is there any firewall or proxy between the hosts and the Rancher server?
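One rough way to answer the firewall/proxy question is to check from each host whether the Rancher server answers over HTTP at all. The sketch below is an assumption-heavy illustration: the function name `check_rancher_server` is made up, the example URL is a placeholder, and it assumes the server exposes a `/ping` health path on its HTTP port and that `curl` is installed on the host.

```shell
# Hypothetical helper: report whether the Rancher server answers on its
# /ping path (assumed health endpoint) within 5 seconds.
# $1 is the base URL of the server, e.g. http://rancher.example.internal:8080
# (placeholder address, not taken from this thread).
check_rancher_server() {
  url="$1/ping"
  if curl --silent --fail --max-time 5 "$url" >/dev/null 2>&1; then
    echo "reachable: $url"
  else
    echo "unreachable: $url"
  fi
}
```

Running `check_rancher_server http://<rancher_server_address>:8080` on every host, and comparing the results, would at least show whether some hosts are cut off from the server while others are not.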

Hi @William_Flanagan

We are so sorry about the problems you are suffering. Obviously, this behavior doesn’t seem normal.

In addition to @Phillip_Ulberg’s questions, could you please tell us how many servers you have? Is your system coming from an upgrade or is it a fresh installation? Is your Rancher server installed in HA or standalone?

Is there any possibility that we could connect to your system to take a deeper look?

Thanks, best regards…

(FYI @ibuildthecloud and I went through a bunch of stuff with @William_Flanagan on IRC Friday and at least one problem is the external DB being non-responsive)

Hey all,

So, update. After totally wiping and starting from scratch, doing a bunch of tuning to get the system as clean and pristine as possible, and changing our deploy strategy to catalog updates and a manual, one-by-one push, we were able to get through an initial deploy.

However, on upgrade, we died again… exact same symptoms. And the system has now reverted to the behavior above.

FYI Vincent, I tried to reach out to you on IRC as well to give you an update yesterday… and now today, I’m back to totally wiping and rebuilding things from scratch.

This is miserable. As a startup, my app is DOWN right now, as I got most of my back end deployed, but can’t get my front end deployed as Rancher has locked up/crashed.

@rawmind: 1 server (standalone). Fresh 1.2.0 installation (no upgrade from 1.1.4 from a DB perspective). I have 10 “hosts”, and MySQL is co-hosted. The Rancher server is on a machine with 64 GB RAM and a 2 TB HDD.

And, I’m happy to give someone access to look around… I offered that to Vincent last Friday.

Related: the UI is locking up… on the /identities path. Using the console rancher app, all hosts are disconnected.

And, more information: per this guide… https://github.com/rancher/rancher/wiki/Cowpoke-2%3A-Halp!-(Debugging,-troubleshooting,-starting-over) I tried to go in and clear the database locks. After logging into the container, running mysql -u root reports that the socket isn’t available. Trying to restart the mysql service results in an error:

root@69209d48bf71:/etc/mysql# service mysql restart

 * Stopping MySQL database server mysqld /usr/sbin/mysqld: error while loading shared libraries: libaio.so.1: cannot open shared object file: Permission denied

Not sure if that’s a red herring, or something legitimate. So, I thought I’d add it here.

And last bit, more about the “environment”. There are 10 nodes, each with 64 GB RAM, running 13 to 30 containers. Each container has at least 2 GB potentially available to it. It’s hard to imagine that this would need more hardware.

Let’s add to the potential red herrings.

My concern here is that even the “running processes” are affected. We just attempted the cowpoke that I mentioned above, and this is what happened.

The chart covers a 15-minute window.

This is from our Kibana logserver (Kibana is NOT running inside Rancher).

Like I said before, this could be a red herring. But it does suggest that my app’s existing, running, deployed processes are being locked down by Rancher and not running.

I’d love to log in and look, but my app is currently down because I’m in the middle of a busted config.

@William_Flanagan, could you give me access to your system, UI and SSH? I would like to do a deeper review of it.

With that system sizing and that container load, it should work fine…

Please send me a private email with your system access details and I will send you my SSH public key…

raul@rancher.com

Best regards…

Gathering and sending…

Hi @rawmind we sent a key, usernames and passwords for you to get in. Please let me know when you are done so we can remove… as we do not typically have user-based login enabled.

WF