Very high CPU and memory usage by Java process

I’m running rancher/server on RancherOS v0.8.0 (I also tried v0.8.0-rc11). The server is started with the following command:

sudo docker run -d --restart=unless-stopped -p 8080:8080 rancher/server:stable
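On a box this small it may be worth capping the container and the JVM rather than letting them take whatever they want. This is a sketch, not a recommendation: the `-m` flag is standard Docker, but whether this image honors a `JAVA_OPTS` override (and what heap sizes are sane) is an assumption to verify against the docs for your version:

```shell
# Cap the container at 1 GiB and (assumption) pass a smaller heap to the JVM.
docker run -d --restart=unless-stopped -p 8080:8080 \
  -m 1g \
  -e JAVA_OPTS="-Xms256m -Xmx700m" \
  rancher/server:stable
```

With a hard cap the container gets OOM-killed and restarted instead of dragging the whole host down with it.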

I noticed that the web UI was crashing quite a lot. Investigating led me to very high CPU and memory usage by the Java process: consistently ~40% RAM and ~15% CPU.

Eventually it eats all the memory, which leads to nonsense like load average: 18.76, 10.55, 4.73 and also to the crash below, even though I have only 3 workers registered to it:

rancher@ros-m01:~$ docker ps
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0xa6b6f7 m=0

goroutine 0 [idle]:

goroutine 1 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc82001a770 sp=0xc82001a768
runtime.main()
	/usr/local/go/src/runtime/proc.go:126 +0x62 fp=0xc82001a7c0 sp=0xc82001a770
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc82001a7c8 sp=0xc82001a7c0

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1

rax    0x0
rbx    0x1374fc8
rcx    0xa6b6f7
rdx    0x6
rdi    0xc01
rsi    0xc01
rbp    0xeed7de
rsp    0x7ffefb614a98
r8     0xa
r9     0x31c0880
r10    0x8
r11    0x202
r12    0x31c2c10
r13    0xeba544
r14    0x0
r15    0x8
rip    0xa6b6f7
rflags 0x202
cs     0x33
fs     0x0
gs     0x0
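For what it’s worth, `pthread_create failed: Resource temporarily unavailable` means the kernel refused to create another thread, i.e. a process/thread limit was exhausted, not necessarily RAM alone. On a Linux host you can see where you stand with plain `/proc` reads (nothing Rancher-specific):

```shell
# Limits relevant to pthread_create returning EAGAIN:
grep -i 'max processes' /proc/self/limits   # per-user task limit (RLIMIT_NPROC)
cat /proc/sys/kernel/threads-max            # system-wide thread ceiling
cat /proc/sys/kernel/pid_max                # total PID/TID space
ls /proc | grep -c '^[0-9]'                 # processes currently alive
```

If the Java process has leaked threads, the current count will be sitting near one of those ceilings.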

What are the rancher/server minimum requirements? The current resource usage seems disproportionate for what it’s doing.

My current setup:

rancher@ros-m01:~$ free -m
             total       used       free     shared    buffers     cached
Mem:           993        899         94          0         13        191
-/+ buffers/cache:        694        298
Swap:            0          0          0
rancher@ros-m01:~$ lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel Core Processor (Broadwell)
Stepping:              2
CPU MHz:               3399.836
BogoMIPS:              6799.67
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0
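With 1 GB total and no swap, the `free -m` output above leaves under 300 MB of headroom. A quick sanity check of real headroom (1024 MB here is just the commonly cited minimum for the server container, not a verified number; `MemAvailable` needs kernel 3.14+):

```shell
# Compare available memory against the ~1 GB the server container wants.
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "available: $((avail_kb / 1024)) MB"
if [ "$((avail_kb / 1024))" -lt 1024 ]; then
  echo "warning: less than 1 GB available for rancher/server"
fi
```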

Please advise.

EDIT:

It seems this behavior mainly occurs during startup; once all or most services are initialized it tends to stabilize, though it still uses a vast amount of RAM.

I’m having the same issue. I finally got Rancher Server going on my DO droplet, but I wanted to see if I could run the server and agents on the same box. Unfortunately, after 30+ minutes not all of the services have passed health checks, the dashboard UI is really slow, and the server load is similar to what you report. This doesn’t seem like a workable setup for small/hobby sites like mine. It would be nice if the docs said as much.

http://docs.rancher.com/rancher/v1.5/en/installing-rancher/installing-server/#requirements

This is just for the server container, so the agent, its related containers, and whatever applications you run come on top of that.
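One way to see how that adds up on a single box is a one-shot `docker stats` snapshot, which separates the server container’s footprint from the agent and workload containers. A sketch: the `--format` flag needs Docker 1.13+, and the block is guarded so it degrades to a message when no daemon is reachable:

```shell
# Per-container CPU/memory snapshot: server vs. agent vs. workload containers.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  stats=$(docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}')
else
  stats="docker daemon not reachable"
fi
echo "$stats"
```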

What size DO droplet?

I have a similar problem running Rancher Server on a 2 GB memory droplet. Normally the load average is hardly over 1; now it averages around 5, and the Rancher Java process often uses more than 100% CPU.

  • Running Rancher v1.4.1
  • 0 running and 0 delayed processes; not a single issue shown in the web UI.
  • /var/lib/cattle/logs/cattle-error.log below shows only a few issues, but the last errors are from 3 hours ago.
  • The same applies to the docker logs; last error 3 hours ago, see below.

cattle-error.log
2017-03-10 08:00:08,348 ERROR [4c26fb67-bfb2-4218-9ef8-4f13e422613f:133724] [instance:1685->instanceHostMap:1553] [instance.start->(InstanceStart)->instancehostmap.activate] [] [utorService-751] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=20]: no such file or directory
2017-03-10 08:00:08,348 ERROR [4c26fb67-bfb2-4218-9ef8-4f13e422613f:133724] [instance:1685] [instance.start->(InstanceStart)] [] [utorService-751] [i.c.p.process.instance.InstanceStart] Failed [1/3] to Starting for instance [1685]
2017-03-10 09:11:45,862 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [129] count [3]
2017-03-10 09:11:45,862 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [138] count [3]
2017-03-10 09:11:45,863 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [164] count [3]
2017-03-10 09:11:45,863 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [185] count [3]

docker logs of the rancher server container
2017-03-10 09:11:45,862 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [129] count [3]
2017-03-10 09:11:45,862 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [138] count [3]
2017-03-10 09:11:45,863 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [164] count [3]
2017-03-10 09:11:45,863 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [185] count [3]
time="2017-03-10T09:11:47Z" level=info msg="Installing builtin drivers"
time="2017-03-10T09:11:49Z" level=info msg="Downloading all drivers"
time="2017-03-10T09:11:51Z" level=info msg="Copying /var/lib/cattle/machine-drivers/1f7058341420e2f525168052818c3f819ff78e9ca5f57d5a650a049bcd5945e9-docker-machine-driver-packet => /usr/local/bin/docker-machine-driver-packet"
time="2017-03-10T09:11:52Z" level=info msg="Done downloading all drivers"
2017/03/10 09:33:35 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:62503: tls: oversized record received with length 20624
2017/03/10 09:33:35 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:31399: tls: oversized record received with length 20624
2017/03/10 09:33:35 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:61967: tls: oversized record received with length 20624
2017/03/10 09:33:46 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:45813: EOF
2017/03/10 09:33:57 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:33611: EOF
2017/03/10 09:34:08 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:38837: EOF
2017/03/10 09:34:08 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:47253: tls: oversized record received with length 20480
2017/03/10 09:34:08 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:46653: tls: oversized record received with length 20480
2017/03/10 09:34:08 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:60657: tls: oversized record received with length 20480
2017/03/10 09:34:19 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:46815: EOF
2017/03/10 09:34:30 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:52511: EOF
2017/03/10 09:34:41 http: TLS handshake error from XX.XXX.XXX.XXX.XXX:27281: EOF

Hi, we’ve got the same problem: when Rancher starts on a 2 GB RAM droplet everything seems to be okay, but after several days of usage the Java process’s memory/CPU usage almost freezes the VM.
It looks like some kind of memory leak.

Any ideas on how to find the root cause so we can help you guys fix it?
rancher v1.4.1
cattle v0.176.9
rancher-compose v0.12.2
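If the container is still responsive when this happens, a JVM class histogram from inside it would be useful evidence for a leak report. A sketch only: it assumes the image ships `jmap`, that the JVM is PID 1 inside the container, and `rancher-server` is a placeholder container name to adjust:

```shell
# Grab the top of a live-object histogram from the Cattle JVM (hypothetical
# container name; guarded so it is a no-op when no daemon is reachable).
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  hist=$(docker exec rancher-server jmap -histo:live 1 2>&1 | head -n 20)
else
  hist="docker daemon not reachable"
fi
echo "$hist"
```

Two histograms taken a day apart would show which classes are growing without bound.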

UPD: not sure how telling it is, but here’s htop output filtered by java:

Why is this happening?

I’m getting the same issue!

I have a writeup here:

And created a bug report here:

No response yet.