Haproxy containers producing zombies + logging stops

We run a lot of haproxy containers in production, and on 3 separate hosts we have a bunch of zombie processes, all of them from logrotate. It took a little digging, but I found that the haproxy containers were producing these, all with monit as the parent PID.

— Rancher 0.56.1 here

root@a37178ad75ea:/# ps ax | grep -w Zs                                         
10401 ?        Zs     0:00 [logrotate] <defunct> 

On the docker host…

ubuntu@rancher51:~$ ps ax | grep -w Zs | grep -v grep
21972 ?        Zs     0:00 [logrotate] <defunct>    
21980 ?        Zs     0:00 [logrotate] <defunct>
21988 ?        Zs     0:00 [logrotate] <defunct>
22024 ?        Zs     0:00 [logrotate] <defunct>
22032 ?        Zs     0:00 [logrotate] <defunct>
22045 ?        Zs     0:00 [logrotate] <defunct>
22053 ?        Zs     0:00 [logrotate] <defunct>
22070 ?        Zs     0:00 [logrotate] <defunct>
22078 ?        Zs     0:00 [logrotate] <defunct>
22086 ?        Zs     0:00 [logrotate] <defunct>
22094 ?        Zs     0:00 [logrotate] <defunct>
22102 ?        Zs     0:00 [logrotate] <defunct>
22110 ?        Zs     0:00 [logrotate] <defunct>
22118 ?        Zs     0:00 [logrotate] <defunct>
22131 ?        Zs     0:00 [logrotate] <defunct>
22145 ?        Zs     0:00 [logrotate] <defunct>
22163 ?        Zs     0:00 [logrotate] <defunct>
22175 ?        Zs     0:00 [logrotate] <defunct>

And that’s just one of the hosts… they’re all like this.
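For anyone else chasing this down, a quick way to tie the zombies back to whatever parent isn't reaping them (monit, in our case) is to print the parent PID alongside each Z-state process. A sketch using standard procps-style ps flags:

```shell
# List zombie processes together with their PID, parent PID, and command,
# so the non-reaping parent is easy to spot
ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/'
```

You can then feed the PPID column back into `ps -p <ppid>` to see what the parent actually is.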

Also, logging has stopped for both the haproxy and rancher-agent containers. I can see rotated logs such as rancher-dns.log.1.gz, but the “live” log is gone, and /proc/<pid>/fd shows “(deleted)” on the open log files the process is still trying to write to.

root@7a96e4f3c2a1:/# ps ax|grep dns                                             
864 ?        Sl    71:26 /var/lib/cattle/bin/rancher-dns -log /var/log/rancher...
27287 pts/2    S+     0:00 grep dns
                                         
root@7a96e4f3c2a1:/# ls -l /proc/864/fd                                         
total 0                                                                         
lrwx------ 1 root root 64 Mar 14 22:04 0 -> /dev/null                           
lrwx------ 1 root root 64 Mar 14 22:04 1 -> /dev/null                           
lrwx------ 1 root root 64 Mar 14 22:04 2 -> /dev/null                           
lrwx------ 1 root root 64 Mar 14 22:04 3 -> /var/log/rancher-dns.log.1 (deleted)
lrwx------ 1 root root 64 Mar 14 22:04 4 -> socket:[22766]                      
lrwx------ 1 root root 64 Mar 14 22:04 5 -> anon_inode:[eventpoll]              
lrwx------ 1 root root 64 Mar 14 22:04 6 -> socket:[20462]         

Notice the “(deleted)” above.
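This is the classic symptom of rename-style rotation when the writer is never told to reopen its log: the process keeps writing to the old inode, which no longer has a name on disk. A minimal reproduction (the path is arbitrary, Linux-only since it relies on /proc):

```shell
# Reproduce the "(deleted)" symptom: open a log, "rotate" it away,
# and the writer's fd now points at an unlinked inode
exec 3>>/tmp/demo.log     # writer opens its log on fd 3
rm /tmp/demo.log          # rotation removes the name, not the inode
readlink /proc/$$/fd/3    # -> /tmp/demo.log (deleted)
echo "lost line" >&3      # writes land in the unlinked inode, invisible on disk
exec 3>&-                 # closing the fd finally frees the inode
```

Until the writer reopens (or is restarted), everything it logs goes into that deleted file, which also quietly holds its disk space.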

Basically, logrotate is messed up in multiple rancher containers (haproxy, rancher-agent) and we’ve made no changes to these containers; they’re running stock versions.
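Until a fix lands, the usual logrotate-side workaround for writers that never reopen their log is `copytruncate`, which copies the live file and then truncates it in place so the writer’s open fd stays valid. A sketch of what such a stanza might look like (the path and schedule here are assumptions, not the stock Rancher config):

```
/var/log/rancher-*.log {
    daily
    rotate 7
    compress
    # copy the live log, then truncate it in place; the writer's open
    # file descriptor keeps pointing at the same (now empty) file
    copytruncate
}
```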

Oops… I meant to post this in the beta forum, but here it is anyway. ¯\_(ツ)_/¯

This seems like it’s related to

This was fixed in v0.59.0+

Awesome! We’re a couple of versions behind that, and we were putting off upgrading until your GA release comes out, but maybe we should go ahead and do it anyway. Especially if it includes that fix for the load balancer 503’ing when all the services are actually up!