I’ve been evaluating Rancher for a couple of weeks and have been keeping a log of the issues I’ve encountered. I have many UI-related usability issues that I’ll write up in a separate post. In general, however, I’m finding Rancher to be: a) pretty unstable; b) very difficult to debug, since when things go wrong you often have to switch between the service, host, and container views to figure out what failed.
So far this week:
A) Fatal P0 Errors:
1). To begin with, I installed rancher/server on a CoreOS host (with 1 GB) on Digital Ocean. After seeing Rancher crash pretty much every day, and after speaking with @willchan, I re-installed on a 2 GB machine. After numerous problems while testing today, I gave up on CoreOS and re-installed on Ubuntu 14.04 with 2 GB, but I am still seeing frequent fatal errors. On one occasion the server crash erased all state (i.e., all host and service definitions were lost).
2). Error while deploying services: The service UI gets stuck in the “In Progress” state, but looking at the container view I see the following error:
142d63eb-18da-4347-b48e-d67e3efff774 : Image [richburdon/meteor-demo:latest] failed to pull : Error pulling image (latest) from richburdon/meteor-demo, Driver aufs failed to create image rootfs bf84c1d84a8fbea92675f0e8ff61d5b7f484462c4c44fd59f0fdda8093620024: open /var/lib/docker/aufs/layers/64e5325c0d9d80a28031d3c3689ac02041d74360cb0e7383a4df8a780328d833: no such file or directory
This container is then destroyed and a retry begins, but the new container hangs in the “Starting” state. No other errors or logs are visible. Furthermore, selecting “Stop” from the service menu shows the “Deactivating” status, which hangs as well.
3). Several times when creating a new host, the status indicates “Almost there” immediately but then hangs. The “Contacting Digital Ocean” message is never shown. When I attempt to create a second host I see the following error in the Hosts view:
segmentation fault (core dumped)
B) Serious P1 Errors:
1). If a bad image name is provided when creating a service, Rancher retries indefinitely.
2). Service logs do not display “docker run” errors (using CoreOS with rancher/server).
3). The service frequently displays “In progress” after it is up and running.
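To make B1 concrete, here is a minimal Python sketch of the behavior I would expect instead of endless retries: a bounded number of attempts with backoff, and an immediate failure when the error cannot be fixed by retrying. This is not Rancher's code; `PermanentError`, `run_with_retries`, and the parameter names are hypothetical and exist only for illustration.

```python
import time


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix,
    such as an image name that does not exist."""


def run_with_retries(action, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, giving up after max_attempts.

    Returns whatever action() returns. A PermanentError is re-raised
    immediately; any other error is retried with exponential backoff
    and re-raised once the attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except PermanentError:
            raise  # retrying will not help, surface the error right away
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, report the last failure
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

With something like this, a bad image name would surface as a visible error on the service after one attempt, while a transient failure would still get a few retries before giving up.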
C) Non-fatal Errors:
1). The JS app leaks memory and gradually becomes very slow. After an hour or so it hangs the browser tab. Opening a new tab resolves the issue.
2). “Invalid date” is displayed in log messages.
3). The container count is frequently off by one in the Services view (possibly related to the “In progress” bug above, or perhaps it counts unpurged containers?).