Rancher is feeling like a mistake. Help!

I’m not on IRC 24/7, but I did reply to you 15 minutes later and you never came back.

Last time we talked you were looking at the external database being non-responsive. So going into the server container and trying to connect to the local database (which isn’t running) doesn’t make sense; go to your external DB. But you don’t need to clear the database lock; you’re not stuck trying to upgrade the schema on startup.

There are at least two separate issues that your use-case is good at hitting (#7017 and #6995) and they will be fixed in the next release (this week). Restarting and reinstalling and all that is not going to just start producing different results.

@vincent, thanks so much for the update.

@William_Flanagan… I’ve connected to your system, and the server where rancher and mysql are running is not working well. You are running other processes apart from rancher and mysql (mongo, postgres, …), and your i/o wait is very high. As you can see, your sdb disk is at 100% utilization…

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.83    0.00    0.93   26.78    0.00   70.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    66.00    0.00  117.40     0.00     5.28    92.06     0.65    5.56    0.00    5.56   4.15  48.72
sdb               0.00    60.00   10.00  652.00     0.09     3.13     9.97   144.54  217.29  109.28  218.95   1.51 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.71    0.00    0.65   28.15    0.00   68.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    34.60    0.00   96.80     0.00     4.93   104.36     0.64    6.58    0.00    6.58   4.92  47.60
sdb               0.00    36.40    8.00  606.60     0.10     2.76     9.54   144.82  234.46  125.00  235.91   1.63 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.60    0.00    2.19   25.12    0.00   66.09

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    23.40    0.00   72.00     0.00     4.87   138.60     0.51    7.14    0.00    7.14   5.91  42.56
sdb               0.00    59.80   57.40  572.80     1.35     2.72    13.21   148.04  235.19  123.14  246.41   1.59 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.12    0.00    0.75   24.05    0.00   72.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    72.60    0.00  118.40     0.00     5.01    86.59     0.70    5.92    0.00    5.92   3.57  42.24
sdb               0.00    72.60  113.40  513.80     2.82     2.56    17.57   162.91  260.84  166.34  281.70   1.59 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.99    0.00    1.08   21.22    0.00   74.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    22.60    0.00   72.60     0.00     1.89    53.36     0.44    6.05    0.00    6.05   4.95  35.92
sdb               0.00    65.00   51.40  599.00     1.21     2.86    12.81   148.99  229.32  107.02  239.81   1.54 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.22    0.00    2.01   21.63    0.00   68.13

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    20.20    0.00   61.40     0.00     2.51    83.78     0.41    6.64    0.00    6.64   6.18  37.92
sdb               0.00    15.60   28.00  613.00     0.61     2.63    10.37   146.67  229.49  124.20  234.30   1.56 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.67    0.00    0.63   23.57    0.00   73.13

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    48.60    0.00   97.40     0.00     3.16    66.51     0.61    6.23    0.00    6.23   5.19  50.56
sdb               0.00    57.20   82.60  564.80     2.00     2.59    14.53   156.71  241.84  176.05  251.47   1.54 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.49    0.00    0.70   20.58    0.00   75.23

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    10.20    0.00   47.80     0.00     1.13    48.47     0.26    5.44    0.00    5.44   4.60  22.00
sdb               0.00    94.20  106.20  552.40     2.56     2.70    16.33   154.45  234.52  104.11  259.59   1.52 100.08

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.53    0.00    2.21   16.90    0.00   73.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    23.20    0.00   38.60     0.00     3.59   190.63     0.47   12.17    0.00   12.17   8.23  31.76
sdb               0.00   209.80   13.40  678.80     0.16     3.69    11.38   141.07  203.27  100.18  205.30   1.44 100.00
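For reference (an assumption about the exact invocation, not necessarily the command I ran): extended per-device stats in this format are what sysstat’s iostat prints, and the saturated disk is the one pinned at 100.00 in the %util column.

    # extended device stats in MB/s, sampled every 5 seconds
    iostat -xm 5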

Besides waiting for the fixes that @vincent mentioned, you would need to improve your database performance, maybe with an external one or by making some improvements to your storage to avoid this amount of i/o wait.

I’m done with the access, please remove it.

Best regards…

@rawmind, I had inquired about the disk I/O several days ago; it seems like a very good metric to track for issues like this.

Yes @Phillip_Ulberg, I know… but no data was posted…

Ok.

To be clear, you show iowaits in the 13-25% range. While higher than you’d like, those aren’t “grinding to a halt” IO times. In fact, for database applications as I understand it, optimal performance is “under 20”. So, the stats you posted above are not showing a system that has ground to a halt in terms of IOWait.

This is now just running the rancher server and an external MySQL service.

1.6% IOWait. Nothing else running on this 64GB RAM, 1TB+ machine.
sdb, by the way, was ONLY graphite storage, and wasn’t the disk where MySQL or any other process was writing.

You were showing iowaits in the 20s… not great… but not “this app shouldn’t work at all.”

This is it now: IOwaits in the 0-5 range. Yet when I try to access the front end, the same problems occur.

Further, let’s look at which processes are causing IO.
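For reference, and just one way to get the same data as the screenshot (assuming sysstat and iotop are installed on the host):

    # per-process disk read/write rates, sampled every 5 seconds
    pidstat -d 5
    # or, as root, show only the processes currently doing I/O
    iotop -o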

So, if this was an IO problem, shouldn’t it be cleared by now? Or do I have to rebuild again to see any benefit of all of this free IO? Or, if I do rebuild and it doesn’t work, is the next step the fixes in a number of days?

@William_Flanagan, the data that you posted just now shows that mongodb and/or postgresql, which were running and are now stopped, were consuming your i/o massively. That seems like a lot for a single SATA disk.

Could you please try to refresh? The rancher server logs don’t show anything from the last 10-15 minutes…

2016/12/14 18:15:50 http: proxy error: net/http: request canceled
2016/12/14 18:15:50 http: proxy error: net/http: request canceled
2016/12/14 18:44:37 http: proxy error: net/http: request canceled
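(If you want to watch them yourself, one way is to tail the server container’s logs; the container name here is a placeholder for whatever yours is called:)

    # follow the last 100 lines of the rancher/server container logs
    docker logs -f --tail 100 rancher-server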

Yep… more or less the steps would be what you describe.
An i/o bottleneck could cause those rancher-server problems; it seems reasonable. So first, you could check whether removing it cleared the problem.
If not, you could try to rebuild the system while avoiding the i/o bottleneck.
If that doesn’t help either, @vincent has explained that some fixes will be included in the next release (this week)…

Best regards…

The bugs with pulling large images timing out and processes preventing API calls from the UI from returning do not magically go away when you reinstall or the database starts responding again. We don’t normally suggest people use RC releases, but since nothing works for you and you’re reinstalling all the time anyway, you can try v1.2.1-rc2.
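If you do try it, it comes down to running the RC tag against your existing external MySQL; this is only a sketch, and the hostname and credentials below are placeholders for your own:

    # run the release candidate pointed at the existing external database
    docker run -d --restart=unless-stopped -p 8080:8080 rancher/server:v1.2.1-rc2 \
        --db-host mysql.example.com --db-port 3306 \
        --db-user cattle --db-pass changeme --db-name cattle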

I suspect you’ll still run into contention making the database unresponsive at exactly the time when it’s needed and busy trying to deploy a bunch of new containers. A single consumer-grade spinning disk trying to handle the rancher server, multiple unrelated database servers, and (I’m assuming) all the containers you’re trying to deploy connecting and sending metrics to those databases doesn’t sound like a great situation.

Yes, I refreshed before I posted that link. I’m not that stupid.

So, let me try to translate. You are saying that a machine with 64GB RAM and a 1TB+ disk, bare metal, dedicated exclusively to Rancher server (with a MySQL server running locally outside rancher) isn’t enough to handle the deployment/management of 200 containers across 10 hosts?

It needs more horsepower than that?

And this is a server in a data center… not my local disk. I’m pretty sure the hosting company I’m using would take issue with their equipment being positioned as a “consumer grade spinning disk”. Or are you saying that an SSD is required to run Rancher Server?

It doesn’t matter how much CPU and RAM and disk space you have if the disk I/O is pegged, as it was when @rawmind looked at it. I didn’t log in myself, but he said there’s MySQL, Mongo, Postgres, Carbon (Graphite) and other things running on there, and that it got better when you stopped Mongo and PG.

You or your hosting can take issue with whatever you like, but that’s what it is. A run-of-the-mill SATA-attached 7200RPM hard drive. The bottom line is rancher/server needs a database that is responsive to queries in order to work and it doesn’t seem like this disk can provide that while servicing all the other things running on the host.
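A quick way to sanity-check that (just a suggestion; the host and credentials are placeholders) is to time a trivial query against the database from the host running rancher/server while a deployment is in progress:

    # should come back in milliseconds; multi-second times mean the server’s own queries are stalling too
    time mysql -h mysql.example.com -u cattle -p -e 'SELECT 1;'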

Well. Here we are again.

We took one of our database servers offline, wiped it. It has 64GB of RAM… and pure SSD. Installed Rancher from scratch, and we were able to deploy (as has been the case for us previously on a fresh install).

About 24 hours later, though, our Rancher server has stopped responding. No upgrades. Nothing other than it simply sitting there.

This is NOT IO. Attached is a picture showing the iowait at 0.1.

Since my “consumer grade spinning disks” plus shared resources were determined to be the culprit, and my basic browsing skills were called into question, note that the top process is a java process (presumably the rancher server), which is the #1 task in terms of usage.

Also, here is EVERYTHING running on this box. There is nothing but Rancher running.

So, is it still IO? Is it still that we somehow have old/bad config from bad IO wait (on a completely fresh, newly formatted system that has never run anything but 1.2.0)?

Or, do you now believe this is the bug introduced in 1.2? This is the same behavior we experienced before 1.2, FYI.

Hi William,

“Great” to hear someone else gets the “cow of death” :slight_smile:. Hint: our team found out that using Chrome makes this much better. (I think you are using Firefox, right? Bad choice for this; it blocks my machine for minutes sometimes.) That really should be fixed.

I can also acknowledge that on huge deployments (e.g. when you remove a host and wait for the containers to be redistributed to the remaining ones) Rancher becomes really, really unusable, which I also don’t like. Especially when you have opened the “Infrastructure -> Hosts” view in a tab.

And we are also having some heavy stability issues right now with Rancher 1.1.4, which drive me personally crazy, but I can’t pin that to Rancher per se.

So I just subscribed to this thread and am very interested to see which conclusion it will lead to. :wink:

Cheers!
Axel.

Again, you’re conflating more than one problem… (Now rc3)

To be clear since I didn’t really say it before, there are specific fixes for those issues in 1.2.1.

Hi Vincent, was this addressed to me? I know that those are multiple problems; I just wanted to shout out to William that “he’s not alone” :wink: and confirm that I share a couple of his problems.

I will definitely not try -rc3; I’d rather wait for 1.2.1 final to be released, for I am still running 1.1.4. My first two attempts at migrating to R1.2.0 were pretty disastrous, and “just pull and install the new image” did not really go well at all. :frowning: The next strategy is actually to create a completely fresh cluster and re-deploy everything on there.

… Ah, not sure you did, but if there are already some release notes I’ll happily have a look!

Cheers!
Axel.

No it was for @William_Flanagan.

@flypengiun - I’m on Chrome on OSX/MacOS El Capitan. Thanks though!

Vincent,

My point is that my engineering team and our CI and production deployment process is unable to work again.

The only way to “fix” it is to:

  1. Send everyone home and wait until 1.2.1 comes out (your suggestion).
  2. Tear the entire thing down, wipe the hosts, and start over again, which is what we do literally EVERY time we need to deploy a new version, which burns 10 hours of Devops time or so (and my time).

At this point, you have to understand how skeptical, and frankly pissed off, I am. This SAME problem manifested itself on 1.1.4; it’s not unique to 1.2.0. With 1.1.4 and even earlier versions, the first deploy after a wipe would work. It was when we tried to push an upgrade that it would fail, or when we would use the system for more than a day or so.

And we have spent since last Friday chasing an “it’s your environment and hardware” thing, chasing our tail, and you blaming “consumer grade spinning disks.” While the UI is a bit snappier on the hardware it’s on, the net behavior of this system is NO DIFFERENT than it was before. We’ve made no progress, other than to determine that the system runs marginally faster on faster hardware.

And that the “next version” will fix it, which is the same comment that was previously made about 1.2.

I’m trying to believe, but it’s getting hard to continue to do so…

My suggestion multiple times now has been exactly the opposite, to try the 1.2.1 RC now since it contains fixes for specifically the issues you’re hitting, you are rebuilding all the time anyway, and have little to lose. Redeploying the same 1.2.0 with the same known problems you’re hitting and expecting it to get better is not going to help.

There are any number of problems that can manifest themselves as “the UI won’t load”, and you definitely did not have just a single problem. This last UI screenshot with identities is not the same problem you had in 1.1.4. It is a specific issue, introduced late in the 1.2 cycle, with API requests for the UI going into a work queue with other backend tasks, which can cause them to never (or only very slowly) get responded to.

Another good way for the UI to be unresponsive is if the server can’t get stuff from the database. You mentioned not being able to connect to it on IRC Friday. You let @rawmind into the server and he saw evidence that the disk on the host was overworked. If the server container can’t perform queries in a timely manner, it is clear that nothing is going to work. You put in SSDs and can presumably see that it is better now and the DB stays responsive. It seems likely this was previously a factor in the problems you had with 1.1.4.

I still think you have/had a combination of 4 things:

    1. #6995 - API requests going to the wrong queue. This is new in 1.2.0. Fixed in 1.2.1-rcX.
    2. #7017 - Pulling large images times out and has to retry. This existed in 1.1.x too. Fixed in 1.2.1-rcX.
    3. A slow/overloaded/sometimes-unresponsive/whatever external database. I assume you’ve addressed this now, but you likely had the same issue in 1.1.x.
    4. Cascading failure. You try to update service(s), which writes a lot of stuff to the DB, which gets slow or stops responding (#3); pulls time out because of #2, which causes retries, which cause more DB operations; you’re trying to load the page but can’t because of #1; etc. Eventually hosts also probably disconnect because their heartbeats aren’t getting through, and nothing works.

Being indignant about the open-source software you’re using not working perfectly is not helpful for anybody. If you don’t like ours there are other choices, but I can pretty much guarantee you the core Google employees that work on Kubernetes are not going to be on IRC on a Friday night if you have a problem.

We’re trying to help you, we really are. I want our software to work for people and I’m sorry it’s affecting your company. We are a startup too and @ibuildthecloud @rawmind and I have spent hours (and after-hours) of our time talking to you and trying to resolve your problem. But at this point we’re pretty much just going around in circles… 1.2.0 has known issues that affect you, we believe they’re fixed in 1.2.1. The likelihood of them actually being resolved would go up if you could try the RC and show us it does or doesn’t fix your use-case.

We will try 1.2.1-rc. Do I need to totally wipe the database before, or can I upgrade in place?

In-place should work.
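For the record, “in place” with an external DB is roughly the following (a sketch; the container name, host, and credentials are placeholders, and a dump first is just good practice):

    # back up the cattle schema before upgrading, just in case
    mysqldump -h mysql.example.com -u cattle -p cattle > cattle-backup.sql

    # replace the server container; all the state lives in the external DB
    docker stop rancher-server && docker rm rancher-server
    docker run -d --restart=unless-stopped --name rancher-server -p 8080:8080 \
        rancher/server:v1.2.1-rc3 --db-host mysql.example.com --db-port 3306 \
        --db-user cattle --db-pass changeme --db-name cattle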