Frequent errors and reliability issues

I’ve been evaluating Rancher for a couple of weeks and keeping a log of the issues I’ve encountered. I have many UI-related usability issues that I’ll write up in a separate post. In general, however, I’m finding Rancher to be: a) pretty unstable; b) very difficult to debug, since when things go wrong you often have to switch between the service, host, and container views to figure out what failed.

So far this week:

A) Fatal P0 Errors:

1). To begin with, I installed rancher/server on a CoreOS host (with 1 GB of RAM) on Digital Ocean. After seeing Rancher crash pretty much every day, and after speaking with @willchan, I re-installed on a 2 GB machine. After numerous problems while testing today, I gave up on CoreOS and re-installed on Ubuntu 14.04 with 2 GB, but I am still seeing frequent fatal errors. On one occasion the server crash erased all state (i.e., all host and service definitions were lost).

2). Error while deploying services: The service UI gets stuck in the “In Progress” state, but looking at the container view I see the following error:

142d63eb-18da-4347-b48e-d67e3efff774 : Image [richburdon/meteor-demo:latest] failed to pull : Error pulling image (latest) from richburdon/meteor-demo, Driver aufs failed to create image rootfs bf84c1d84a8fbea92675f0e8ff61d5b7f484462c4c44fd59f0fdda8093620024: open /var/lib/docker/aufs/layers/64e5325c0d9d80a28031d3c3689ac02041d74360cb0e7383a4df8a780328d833: no such file or directory

Then this container is destroyed and a retry begins, but the container hangs in the “Starting” state. No other errors or logs are visible. Furthermore, selecting “Stop” from the service menu shows the “Deactivating” status, but this just hangs too.

3). Multiple times when creating a new host the status indicates “Almost there” immediately but then hangs. The “Contacting Digital Ocean” message is never shown. When I attempt to create a second host I see the following error in the Hosts view:

segmentation fault (core dumped)

B) Serious P1 Errors:

1). If a bad image name is provided when creating a service, Rancher retries indefinitely.

2). Service logs do not display “docker run” errors (using CoreOS with rancher/server).

3). The service frequently displays “In progress” after it is up and running.
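
To make (1) concrete, here is a rough sketch of the bounded behavior I would expect instead of an endless retry. This is only an illustration in plain shell, not Rancher's code; the `retry` helper, its arguments, and the doubling delay are all my own assumptions.

```shell
# retry MAX DELAY COMMAND...
# Hypothetical helper: run COMMAND up to MAX times, sleeping DELAY seconds
# between attempts (doubling each time), then give up with the last exit status.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$attempt" -ge "$max" ]; then
            # Surface the failure instead of retrying forever.
            echo "giving up after $attempt attempts (exit status $status): $*" >&2
            return "$status"
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Usage (not run here; the image name is a placeholder):
#   retry 3 2 docker pull some/image:tag
```

With something like this, a bad image name would fail visibly after a few attempts rather than leaving the service stuck.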

C) Non-fatal Errors:

1). The JS app leaks memory and gets really slow. After an hour or so it hangs the browser tab. Opening a new tab resolves the issue.

2). “Invalid date” displayed in log messages.

3). The container count is frequently off by one in the Services view (possibly related to the “In progress” bug above, or perhaps it includes unpurged containers?).

Hi Rich,

Thanks for providing this invaluable feedback during the beta phase. In general, you are correct about the error handling: we do retry, but we do not give the user enough feedback about what could be wrong. We will address this fairly soon.

I will pass your findings on to the team as feedback on how to improve Rancher. I am a bit concerned about (1), the server crashing, though. I’ve had Rancher run fine on Ubuntu on DO for weeks and have never encountered a crash. Perhaps you can send me the Rancher logs so we can try to determine what could be causing this.
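
In case it helps, one way to capture those logs might look like the sketch below. It assumes the server is running from the rancher/server image, as in this thread, and that it logs to the container's stdout/stderr. The function name and the default output file name are made up for illustration.

```shell
# collect_rancher_logs [OUTPUT_FILE]
# Hypothetical helper: find the running rancher/server container and dump its
# docker logs (stdout and stderr) to a file that can be attached to a report.
collect_rancher_logs() {
    outfile=${1:-rancher-server.log}
    # Take the first container whose IMAGE column starts with rancher/server.
    cid=$(docker ps | awk '$2 ~ /^rancher\/server/ {print $1; exit}')
    if [ -z "$cid" ]; then
        echo "no running rancher/server container found" >&2
        return 1
    fi
    docker logs "$cid" > "$outfile" 2>&1
}

# Usage (not run here):
#   collect_rancher_logs /tmp/rancher-server.log
```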

Will

I haven’t had any issues running Rancher on a 1GB DO Ubuntu machine for the last 3 weeks. I currently have it controlling 4 other hosts with a few containers on each and I’ve done a TON of testing. That server type is all I’ve ever used Rancher on and, as far as I can tell, I’ve never had any issues related specifically to the server. And for what it’s worth, I always set up my Rancher host with docker-machine. @richburdon, have you tried provisioning your machine that way yet? Super easy…

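# 1. Create the server droplet ($MY_TOKEN holds your Digital Ocean API token)
docker-machine create --driver digitalocean \
                      --digitalocean-access-token "$MY_TOKEN" \
                      --digitalocean-size 1gb \
                      server-name

# 2. Point the local docker client at the new machine
eval "$(docker-machine env server-name)"

# 3. Start the Rancher server on that machine
docker run -d --restart=always -p 8080:8080 rancher/server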

Thanks @jeremy for the docker-machine tip. How do you get access to the machine via ssh with this method? (I don’t see how to set .ssh keys after creation.)

When you create machines with docker-machine, you can SSH in using docker-machine ssh.
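
As a small sketch, assuming the machine is named server-name as in the create command earlier in this thread: `docker-machine ssh NAME [COMMAND...]` is the real CLI form, while `on_machine` is just a wrapper name invented here for illustration.

```shell
# on_machine NAME [COMMAND...]
# Hypothetical wrapper around "docker-machine ssh": with no COMMAND it opens an
# interactive shell on the machine; with a COMMAND it runs that command remotely.
on_machine() {
    name=$1
    shift
    docker-machine ssh "$name" "$@"
}

# Usage (not run here):
#   on_machine server-name              # interactive shell
#   on_machine server-name docker ps    # one-off command on the host
```

There is no need to set up keys yourself; docker-machine uses the key it generated when the machine was created.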

From the Docker documentation, scroll down to the section about using docker-machine with cloud providers.

When the creation of a host is initiated, a unique SSH key for accessing the host (initially for provisioning, then directly later if the user runs the docker-machine ssh command) will be created automatically and stored in the client’s directory in ~/.docker/machines. After the creation of the SSH key, Docker will be installed on the remote machine and the daemon will be configured to accept remote connections over TCP using TLS for authentication. Once this is finished, the host is ready for connection.

Awesome. Thanks to you both.