Rancher is feeling like a mistake. Help!

William_Flanagan · December 9, 2016, 12:55pm

I posted before about 2.5 weeks ago, and got no help.

We have been “using” Rancher for 2+ months, in production, and committed to it for production, which is feeling like a mistake.

We can put a light load on it, and it seems to work fine… but once you push anything to it (30 to 50 services), it starts really misbehaving… deployments taking 1 hour+, a lot of services not starting or working, the front end not responding.

And, the UI goes unresponsive, forcing a reboot.

I’m not sure what the problem is with it. We’ve tried a myriad of things. I posted 2+ weeks ago, filled in a bunch of details for people, and got no answer.

So, is Rancher BS and not really ready for this? I need to know. It’s killing my little startup at this point, with a simple “code push” causing massive instability, downtime, and 1+ day of cleaning crap up to get things running again.

HELP!

William_Flanagan · December 9, 2016, 1:38pm

To add to this, our logs are full of this:

time="2016-12-09T13:35:16Z" level=info msg="Setting log level" logLevel=info
time="2016-12-09T13:35:16Z" level=info msg="Starting go-machine-service…" gitcommit=v0.34.1
time="2016-12-09T13:35:16Z" level=info msg="Waiting for handler registration (1/2)"
time="2016-12-09T13:35:16Z" level=info msg="Starting rancher-compose-executor" version=v0.12.0
time="2016-12-09T13:35:28Z" level=fatal msg="Exiting go-machine-service: Timed out waiting for transtion."

If that helps.

William_Flanagan · December 9, 2016, 2:24pm

Another bit of data… our image sizes are pretty large, about 2 GB each, if that makes any difference?

denise · December 9, 2016, 7:15pm

It seems like your rancher-compose-executor is not able to start up. Could you try this on the host running rancher server?

docker exec <rancher_server_container_ID> killall -9 rancher-compose-executor
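If the container ID is not handy, the lookup and the command above can be combined. This is only a sketch: the function name `kill_compose_executor` is made up for illustration, and it assumes the server container was started from an image whose name begins with `rancher/server`, so adjust the match if yours differs.

```shell
# Hypothetical helper: find the Rancher server container and kill the executor.
# Assumption: the server runs from an image named rancher/server[:tag].
kill_compose_executor() {
  # List running containers as "<id> <image>" and keep the first rancher/server match.
  cid="$(docker ps --format '{{.ID}} {{.Image}}' | awk '$2 ~ /^rancher\/server/ {print $1; exit}')"
  if [ -z "$cid" ]; then
    echo "no running rancher/server container found" >&2
    return 1
  fi
  # Same command as above, with the looked-up container ID filled in.
  docker exec "$cid" killall -9 rancher-compose-executor
}
```

Run `kill_compose_executor` on the host running Rancher server after loading the function into your shell.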

William_Flanagan · December 9, 2016, 10:11pm

I will do this…

But, this happens with any deployment/load.

I worked all morning to reduce the image size to around 1 GB. We managed to get a subset of the system up, but now, as is becoming usual with Rancher, the UI is showing only the grey spinner and no actual UI. The “work” of setup is already done (I think), and now it just died.

Would this error cause that?

William_Flanagan · December 9, 2016, 10:25pm

According to Raghav, the executor is not failing right now. However, the UI shows the spinner and then never returns anything else.

Refreshing, different browsers, different locations–all have this same result. It will continue until we restart the docker container.

I have taken to calling it the grey cow of death. The screenshot doesn’t show it, but it greys out over time.

William_Flanagan · December 9, 2016, 10:29pm

P.S. We have run your command… it doesn’t change anything.

William_Flanagan · December 9, 2016, 10:38pm

To give even more data: this is an 8-core, 64 GB RAM machine with a 1 TB disk. The database is hosted on the same machine. The machine doesn’t show any load at all.

William_Flanagan · December 9, 2016, 11:04pm

Here is the end of our logs. The agents don’t seem to be reachable, and UI is unresponsive.

Network characteristics: 1Gb private LAN between all nodes.

Phillip_Ulberg · December 10, 2016, 4:44am

What version of Rancher? What version of Docker? Is the DB internal to Rancher server, or do you have a MySQL service running on the box which holds the DB? Where are your servers hosted? What are the specs of the Rancher hosts? What is the disk I/O of the Rancher server and hosts when it takes 1 hour to deploy something?

Is there any firewall or proxy between the hosts and the Rancher server?

rawmind · December 10, 2016, 11:38am

Hi @William_Flanagan…

We are so sorry about the problems you are suffering. Obviously, this behavior doesn’t seem normal.

In addition to @Phillip_Ulberg’s questions, could you please tell us how many servers you have? Is your system coming from an upgrade, or is it a fresh installation? Is your Rancher server installed in HA or standalone?

Is there any possibility that we could connect to your system to take a deeper look?

Thanks, best regards…

vincent · December 10, 2016, 5:29pm

(FYI @ibuildthecloud and I went through a bunch of stuff with @William_Flanagan on IRC Friday and at least one problem is the external DB being non-responsive)

William_Flanagan · December 14, 2016, 1:17pm

Hey all,

So, update. After totally wiping and starting from scratch, doing a bunch of tuning to get the system as clean and pristine as possible, and changing our deploy strategy to catalog updates and a manual, one-by-one push, we were able to get through an initial deploy.

However, on upgrade, we died again… exact same symptoms. And, the system has now reverted to the behavior above.

FYI Vincent, I tried to reach out to you on IRC as well to give you an update yesterday… and now today, I’m back to totally wiping and rebuilding things from scratch.

This is miserable. As a startup, my app is DOWN right now, as I got most of my back end deployed, but can’t get my front end deployed as Rancher has locked up/crashed.

@rawmind: 1 server (standalone). Fresh 1.2.0 installation (no upgrade from 1.1.4 from a DB perspective). I have 10 “hosts”; MySQL is cohosted. The Rancher server is on a machine with 64GB RAM and a 2TB HDD.

And, I’m happy to give someone access to look around… I offered that to Vincent last Friday.

William_Flanagan · December 14, 2016, 1:18pm

Related: the UI is locking up… on the /identities path. Using the console rancher app, all hosts are disconnected.

William_Flanagan · December 14, 2016, 1:24pm

And, more information: per this guide… https://github.com/rancher/rancher/wiki/Cowpoke-2%3A-Halp!-(Debugging,-troubleshooting,-starting-over) I tried to go in and clear the database locks. After logging into the container, running mysql -u root says the socket isn’t available. Trying to restart the mysql service results in an error:

root@69209d48bf71:/etc/mysql# service mysql restart

Stopping MySQL database server mysqld /usr/sbin/mysqld: error while loading shared libraries: libaio.so.1: cannot open shared object file: Permission denied

Not sure if that’s a red herring, or something legitimate. So, I thought I’d add it here.

William_Flanagan · December 14, 2016, 1:40pm

And last bit, more about the “environment”. There are 10 nodes, each with 64GB RAM, running 13 to 30 containers. Each container has at least 2GB potentially available to it. It’s hard to imagine that this would need more hardware.

William_Flanagan · December 14, 2016, 2:19pm

Let’s add to the potential red herrings.

My concern here is that even the “running processes” are affected. We just attempted the cowpoke that I mentioned above, and this is what happened.

Chart is a 15 minute window

This is from our Kibana logserver (Kibana is NOT running inside Rancher).

Like I said before, this could be a red herring. But it does correlate: my app’s existing, running, deployed processes are being locked down by Rancher and are not running.

I’d love to log in and look, but my app is currently down because I’m in the middle of a busted config.

rawmind · December 14, 2016, 3:28pm

@William_Flanagan, could you give me access to your system, UI and SSH? I would like to do a deeper review of it.

With that system sizing and that container load, it should work fine…

Please send me a private mail with your system access details and I will send you my SSH public key…

raul@rancher.com

Best regards…

William_Flanagan · December 14, 2016, 3:44pm

Gathering and sending…

William_Flanagan · December 14, 2016, 5:08pm

Hi @rawmind, we sent a key, usernames, and passwords for you to get in. Please let me know when you are done so we can remove them… as we do not typically have user-based login enabled.

WF
