Rancher fails to start due to "no space left on device"

Hello, new to Rancher. The system has been up for 7 years running 2.3.1. Container 7b2e51e2ebab (rancher/rancher:latest, "entrypoint.sh") shows the error below, even though I have plenty of space on the physical server's OS:

log: exiting because of error: log: cannot create log: open /tmp/rancher.7b2e51e2ebab.root.log.INFO.20230103-162949.6: no space left on device

If I ssh into the Rancher container and run df, I get:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/docker-253:8-125834138-d151ae3d32e20fe3b95d29de0c27ea5f1640c23e95b2cf52c4e50c4385f518d9 10G 10G 56K 100% /
tmpfs 64M 0 64M 0% /dev
tmpfs 126G 0 126G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
/dev/mapper/OS_Vol-lv_var 224G 32G 192G 15% /etc/hosts
/dev/mapper/OS_Vol-lv_root 10G 4.0G 6.1G 40% /var/lib/rancher
tmpfs 126G 0 126G 0% /proc/acpi
tmpfs 126G 0 126G 0% /proc/scsi
tmpfs 126G 0 126G 0% /sys/firmware

I don't know how to create more space, as / shows 100% used.

Problem solved: I was able to get a shell into the container and remove all the excess logs in /tmp.
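
For anyone who hits the same thing, this is roughly what that cleanup looks like (a sketch; the container ID is the one from the error above, and the rancher.* filename pattern is assumed from the log path in the error message):

# get a shell inside the Rancher container
docker exec -it 7b2e51e2ebab sh
# the rotated log files accumulate under /tmp on the 10G container filesystem
ls -lah /tmp
rm -f /tmp/rancher.*
exit

Note that a log file Rancher still holds open may not release its space until the container is restarted.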

PROBLEM:

Rancher UI shows the cluster status as Updating due to Node DiskPressure.

Logs reveal "No space left on device":

Mar 11 03:02:54  rsyslogd: file '/var/log/syslog'[7] write error : No space left on device
Mar 11 03:02:54  rsyslogd: file '/var/log/syslog'[7] write error : No space left on device 

The same errors appeared on the console.

SOLUTION:

Free up space by deleting local etcd snapshots so you can troubleshoot the problem further.
In my particular case, I ran out of space because the root ( / ) partition was a mere 30 GB and the etcd snapshot retention (days) was set too high. I resolved the DiskPressure condition by expanding the VM's virtual disk to 100 GB and decreasing etcd retention to 10 days.

Ideally you want your etcd snapshots stored in an S3 bucket.
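
For an RKE2-managed cluster, you can set both the S3 target and the local retention in the server config. A minimal sketch, assuming the standard /etc/rancher/rke2/config.yaml location (bucket, region, and credentials are placeholders, and note that etcd-snapshot-retention counts snapshots rather than days):

# /etc/rancher/rke2/config.yaml on each master node
etcd-snapshot-schedule-cron: "0 */12 * * *"  # snapshot every 12 hours
etcd-snapshot-retention: 10                  # keep only the last 10 local snapshots
etcd-s3: true
etcd-s3-bucket: my-etcd-snapshots            # placeholder bucket name
etcd-s3-region: us-east-1
etcd-s3-access-key: <access-key>
etcd-s3-secret-key: <secret-key>

Restart rke2-server afterwards so the change takes effect: systemctl restart rke2-server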

List the snapshots to determine how many days' worth to keep:

ls -halt /var/lib/rancher/rke2/server/db/snapshots/

Now delete snapshots older than X days (here X=5; adjust as needed):

find /var/lib/rancher/rke2/server/db/snapshots/* -mtime +5 -exec rm {} \;
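
If you want to preview what will be deleted before running the rm, the same find with -print instead of -exec lists the candidates:

find /var/lib/rancher/rke2/server/db/snapshots/* -mtime +5 -print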

Display the available disk space:

df -hl /

Repeat these steps on all master nodes and wait for the cluster to recover on its own.
It takes approximately 2-5 minutes depending on your node specs, etc.
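
To confirm the recovery, you can watch the DiskPressure condition clear on each node (generic kubectl; <node-name> is a placeholder):

kubectl get nodes
kubectl describe node <node-name> | grep -i diskpressure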


Hi! Thank you for sharing 🙂