Server crashes with a long BTRFS error list

Suddenly I receive an error with the dump below when SLES 12.4 starts. The system then becomes unavailable.

When I repair the root filesystem [from a helper guest VM, since it isn’t possible while the root filesystem is mounted], the root partition is marked as full even though not all space is used, as the [FONT=Courier New][COLOR="#008000"]df[/COLOR][/FONT] output shows.

What happened, and how can I solve it? The very strange thing is that when I restore a 3-week-old backup [where everything was OK] into a new VM, I receive the same error messages.

[FONT=Courier New][SIZE=2][ 37.505018] invalid opcode: 0000 [#1] SMP NOPTI
[ 37.505170] CPU: 10 PID: 568 Comm: systemd-journal Not tainted 4.12.14-95.13-default #1 SLE12-SP4
[ 37.505476] task: ffff880003bacc00 task.stack: ffffc90041620000
[ 37.505701] RIP: e030:create_reloc_root+0x295/0x2a0 [btrfs]
[ 37.505853] RSP: e02b:ffffc90041623b98 EFLAGS: 00010282
[ 37.505997] RAX: 00000000ffffffef RBX: ffff8800f731ae00 RCX: 0000000000000001
[ 37.506205] RDX: 0000000000000003 RSI: ffff8800f5323460 RDI: 0000000000000200
[ 37.506486] RBP: ffff8800f965f000 R08: ffff8800ef659cb0 R09: ffffc900416239d8
[ 37.506678] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88000376a2d0
[ 37.506873] R13: ffff8800f58a0000 R14: 0000000000000110 R15: ffff880003bacc00
[ 37.507071] FS: 00007fa2c23a0880(0000) GS:ffff8800faa80000(0000) knlGS:0000000000000000
[ 37.507319] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.507532] CR2: 00007fa2bef25550 CR3: 00000000872fa000 CR4: 0000000000040660
[ 37.507796] Call Trace:
[ 37.507923] btrfs_init_reloc_root+0x8e/0xa0 [btrfs]
[ 37.508130] record_root_in_trans+0xa9/0xf0 [btrfs]
[ 37.508333] btrfs_record_root_in_trans+0x4a/0x70 [btrfs]
[ 37.508539] start_transaction+0xab/0x440 [btrfs]
[ 37.508691] btrfs_dirty_inode+0x49/0xe0 [btrfs]
[ 37.508839] file_update_time+0xa6/0xf0
[ 37.508972] btrfs_page_mkwrite+0x129/0x490 [btrfs]
[ 37.509109] ? vsnprintf+0x1e5/0x4b0
[ 37.509212] do_page_mkwrite+0x31/0x70
[ 37.509373] do_wp_page+0x43f/0x570
[ 37.509473] __handle_mm_fault+0x793/0xef0
[ 37.509601] handle_mm_fault+0xc4/0x1d0
[ 37.509719] __do_page_fault+0x1f3/0x4c0
[ 37.509831] do_page_fault+0x2b/0x70
[ 37.509934] ? do_syscall_64+0x9a/0x150
[ 37.510044] ? page_fault+0x2f/0x50
[ 37.510172] page_fault+0x45/0x50
[ 37.510301] RIP: 0510:0x7ffefbe30518
[ 37.510424] RSP: 0024:00005575011ab0a0 EFLAGS: 5575011a46b0
[ 37.510427] Code: 48 83 c6 02 41 83 e8 02 66 89 4f fe e9 37 fe ff ff 8b 0e 48 83 c7 04 48 83 c6 04 41 83 e8 04 89 4f fc e9 2b fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 48 89 f9 45 31
[ 37.511135] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache af_packet iscsi_ibft iscsi_boot_sysfs xenfs xen_privcmd intel_rapl sb_edac x86_pkg_temp_thermal coretemp crc32_pclmul ghash_clmulni_intel pcbc xen_netfront aesni_intel aes_x86_64 crypto_simd glue_helper cryptd pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc btrfs xor raid6_pq xen_blkfront crc32c_intel sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
[ 37.512270] Supported: Yes
[ 37.512419] ---[ end trace ab510ab54e7d565e ]---
[ 37.512563] RIP: e030:create_reloc_root+0x295/0x2a0 [btrfs]
[ 37.512721] RSP: e02b:ffffc90041623b98 EFLAGS: 00010282
[ 37.512724] RAX: 00000000ffffffef RBX: ffff8800f731ae00 RCX: 0000000000000001
[ 37.512726] RDX: 0000000000000003 RSI: ffff8800f5323460 RDI: 0000000000000200
[ 37.512728] RBP: ffff8800f965f000 R08: ffff8800ef659cb0 R09: ffffc900416239d8
[ 37.512730] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88000376a2d0
[ 37.512732] R13: ffff8800f58a0000 R14: 0000000000000110 R15: ffff880003bacc00
[ 37.512740] FS: 00007fa2c23a0880(0000) GS:ffff8800faa80000(0000) knlGS:0000000000000000
[ 37.512744] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.512750] CR2: 00007fa2bef25550 CR3: 00000000872fa000 CR4: 0000000000040660
[/SIZE][/FONT]

This is the output from [FONT=Courier New][SIZE=2][COLOR="#008000"]btrfs check --repair /dev/xvdc2[/COLOR][/SIZE][/FONT]

[FONT=Courier New][SIZE=2]enabling repair mode
Checking filesystem on /dev/xvdc2
UUID: c88dbf5b-3513-4966-b3d6-5bb6c9b7717e
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don’t match, space cache will be invalidated
checking fs roots
checking csums
checking root refs
found 9380937728 bytes used err is 0
total csum bytes: 8061100
total tree bytes: 230670336
total fs tree bytes: 181600256
total extent tree bytes: 34013184
btree space waste bytes: 41120097
file data blocks allocated: 10417725440
referenced 8217731072[/SIZE][/FONT]

Then I get this:

[FONT=Courier New][SIZE=2]Filesystem 1MB-blocks Used Available Use% Mounted on
/dev/xvda2 11249MB 9918MB 0MB 100% /[/SIZE][/FONT]

Hi
The df tools in SP4 are not btrfs friendly…

See this thread: http://forums.suse.com/showthread.php?t=13627

And also https://www.suse.com/documentation/sles11/stor_admin/data/trbl_btrfs_volfull.html
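
A side note for readers hitting the same confusion: plain df can report 0 bytes available on btrfs even when space exists, because btrfs pre-allocates chunks. The btrfs-native commands give the real picture; a quick sketch (run as root, assuming / is the btrfs mount point):

```shell
# df can show 0% available on btrfs even when chunks still have free space.
# These btrfs-native commands show the real allocation:
btrfs filesystem show /      # devices and allocated bytes
btrfs filesystem df /        # usage per type: data / metadata / system
btrfs filesystem usage /     # most detailed view (newer btrfs-progs)
```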

[QUOTE=malcolmlewis;57675]Hi
The df tools in SP4 are not btrfs friendly…

See this thread: http://forums.suse.com/showthread.php?t=13627

And also https://www.suse.com/documentation/sles11/stor_admin/data/trbl_btrfs_volfull.html[/QUOTE]

@df: You are right, but this is not the problem.
@volume full: This is not the problem and does not solve my issue.

I may not have been accurate enough: the root filesystem is totally damaged. When I start the server for the first time [from a 3-week-old backup], after a while I get the above error messages.
When I then try to start the server again, the flood of error messages appears and the server becomes unavailable. I also found another post where the same error messages are reported.

Please note that with the message ‘[FONT=Courier New][SIZE=2][COLOR="#FF0000"]kernel BUG at …/fs/btrfs/relocation.c:1449![/COLOR][/SIZE][/FONT]’ in the first line of the dump, the system is pointing to a bug in the kernel!

Again: the server is idle; apart from the normal services, nothing special is running. Suddenly, the error messages from the first post appear on the console and then the VM is broken!

It looks to me like a very serious bug was introduced into the kernel in one of the previous updates.

[QUOTE=AAEBHolding;57678]@df: You are right, but this is not the problem.
@volume full: This is not the problem and does not solve my issue.
…[/QUOTE]
Hi
At GRUB, can you select advanced boot options and boot to an earlier snapshot?

[QUOTE=malcolmlewis;57684]Hi
At GRUB, can you select advanced boot options and boot to an earlier snapshot?[/QUOTE]

It is a VM running under XenServer. Only the snapshots provided by XenServer are available; the SLES snapshots are not.

I think I should provide this information because it may be the real reason for this issue: the root partition was running out of space, so I did the following to enlarge it:
[LIST=1]
[*]Resized the root disk in XenCenter.
[*]Detached it from the SLES 12.4 VM.
[*]Attached the root partition to another SLES 12.4 VM.
[*]Resized the root partition in that helper VM with the YaST Partitioner.
[*]Detached the root partition from the helper VM.
[*]Attached the resized root partition to the original VM.
[/LIST]

All resizing, detaching, and attaching steps were performed while the VMs were shut down.
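
One step that is easy to miss in a procedure like this: after the virtual disk and the partition have been grown, the btrfs filesystem itself also has to be resized, or it keeps its old size. If YaST did not already grow the filesystem along with the partition, a minimal sketch would be (assuming /dev/xvda2 is the enlarged root partition, mounted at /mnt from rescue mode):

```shell
# Grow the btrfs filesystem to fill its (already enlarged) partition.
mount /dev/xvda2 /mnt
btrfs filesystem resize max /mnt   # 'max' = use all space the partition offers
btrfs filesystem show /mnt         # verify the new size is actually in use
```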

Since then, even an old backup crashes a few seconds after the VM starts.
Is it possible that some deeper information is stored on the disk, so that even when I restore a snapshot from before the disk was resized, something no longer matches and the problems start?

Does it help?

Hi
It does make it clearer :slight_smile: Are you in a position to raise an SR (Support
Request)?


Cheers Malcolm °¿° SUSE Knowledge Partner (Linux Counter #276890)
Tumbleweed 20190512 | GNOME Shell 3.32.1 | 5.0.13-1-default
If you find this post helpful and are logged into the web interface,
please show your appreciation and click on the star below… Thanks!

[QUOTE=malcolmlewis;57691]Hi
It does make it clearer :slight_smile: Are you in a position to raise a SR (Support
Request)?
[/QUOTE]

Not really, I have never raised an SR. I run my business alone and maintain my servers and the XenServer myself. It has been working more or less perfectly like this for 4 years, but with this issue I am completely out of my depth.

Hi
So if you boot the system to runlevel 1 (at GRUB, add 1 to the boot options), can you mount the / partition and look at the logs to see what’s failing? Or boot the system in rescue mode.

[QUOTE=malcolmlewis;57716]Hi
So if you boot the system to runlevel 1 (at grub and 1 to the options), can you mount the / partition and look at the logs to see what’s failing? Or boot the system in rescue mode.[/QUOTE]

Which logs should I check? I am in rescue mode and have mounted the root partition at [SIZE=2][FONT=Courier New][COLOR="#008000"]/mnt[/COLOR][/FONT][/SIZE].
Can I try to fix it somehow? Running [FONT=Courier New][SIZE=2][COLOR="#008000"]btrfs check --repair /dev/xvda2[/COLOR][/SIZE][/FONT] doesn’t solve the problem. As I wrote, when I then boot regularly, the / partition is out of space.

[QUOTE=AAEBHolding;57720]Which logs should I check? I am in rescue mode and have mounted the root partition at [SIZE=2][FONT=Courier New][COLOR="#008000"]/mnt[/COLOR][/FONT][/SIZE].
Can I try to fix it somehow? Running [FONT=Courier New][SIZE=2][COLOR="#008000"]btrfs check --repair /dev/xvda2[/COLOR][/SIZE][/FONT] doesn’t solve the problem. As I wrote, when I then boot regularly, the / partition is out of space.[/QUOTE]
Hi
Check down in /var/log for big files and check those for clues (maybe even copy them off to an external drive), especially the messages log.

Maybe it’s coredumping?

Run:

coredumpctl list

If there are old logs you think can be deleted, remove them and see whether freeing some disk space helps.

[QUOTE=malcolmlewis;57722]Hi
Check down in /var/log for big files, check those for clues (maybe even copy them off to an external drive), especially messages log.

Maybe it’s coredumping?

If there are old logs you think can be deleted, remove those and see how it goes getting some disk space.[/QUOTE]

Good news: I think I managed to fix it after hours of desperate trying.
These threads did it:
[LIST=1]
[*]https://unix.stackexchange.com/questions/174446/btrfs-error-error-during-balancing-no-space-left-on-device
[*]http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
[/LIST]

I summarize what I did:
[LIST=1]
[*]Start the VM in rescue mode.
[*][FONT=Courier New][SIZE=2][COLOR="#0000FF"]mount /dev/xvda2 /mnt[/COLOR][/SIZE][/FONT]
[*][FONT=Courier New][SIZE=2][COLOR="#0000FF"]btrfs balance start -v --full-balance /mnt[/COLOR][/SIZE][/FONT]
[/LIST]

When I got [FONT=Courier New][SIZE=2][COLOR="#FF0000"]Done, had to relocate 12 out of 12 chunks[/COLOR][/SIZE][/FONT], the balance had worked and link (1) was enough.
When I got the message
[FONT=Courier New][SIZE=2][COLOR="#FF0000"]ERROR: error during balancing ‘/mnt’ - No space left on device
There may be more info in syslog - try dmesg | tail[/COLOR][/SIZE][/FONT]
I had to proceed with link (2).
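
For reference, the escalation described in link (2) can be sketched like this (the usage thresholds are example values; each pass only relocates data chunks that are at most N% full, so the early passes need very little free space):

```shell
# Escalating balance: start with nearly-empty data chunks, raise the
# threshold until enough space is freed (filesystem assumed at /mnt):
for pct in 0 5 10 20 40 60 80; do
    echo "=== balancing data chunks <= ${pct}% full ==="
    btrfs balance start -dusage=$pct /mnt || break
done
# If even -dusage=0 fails with ENOSPC, the linked article suggests
# temporarily adding a scratch device to give the balance room to work:
#   btrfs device add /dev/xvdb /mnt && btrfs balance start -dusage=0 /mnt
#   ... and afterwards: btrfs device delete /dev/xvdb /mnt
```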

And here, I had two different situations:

  1. One VM recovered completely and now works well, even though before it would not start at all and crashed while booting; in other words, the VM had become unusable.
  2. The other VM was more stubborn and really tiresome. I had to find a XenServer snapshot of the VM where it at least booted properly, even though it crashed within a few seconds.

Then, to check whether it really works, I ran some heavy disk-access routines that previously crashed the VM within seconds. Now it runs without any problem.
I hope it will stay that way.

[QUOTE=AAEBHolding;57729]Good, I think I could fix it after hours of desperate trying.
…
Then, to check if it really works I was running some heavy disk access routines where before the VM crashed within seconds. Now, it runs without any problem.[/QUOTE]
Hi
Thanks for the feedback and good work :slight_smile:

I’ve experienced this exact same error at relocation.c:1449 running “4.12.14-lp150.12.58-default #1 openSUSE Leap 15.0”. The call to btrfs_insert_root() within create_reloc_root() returns an error code which causes the subsequent BUG_ON() assertion to fail.

I’m hopeful that the following kernel commit will fix the underlying issue:

https://bugzilla.kernel.org/show_bug.cgi?id=203405
https://lkml.org/lkml/2019/6/7/720
https://github.com/torvalds/linux/commit/30d40577e322b670551ad7e2faa9570b6e23eb2b
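
A quick way to check whether a fix like this has reached the installed kernel is to compare versions and search the package changelog; a sketch (kernel-default is the usual SUSE kernel package, and the grep pattern is only a guess at the changelog wording):

```shell
# Installed kernel version, to compare against the first fixed release:
uname -r
# Search the kernel package changelog for the relocation fix:
rpm -q --changelog kernel-default | grep -i -m 5 'reloc' \
    || echo "no matching changelog entry found"
```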

OMG, after I installed the latest updates SUSE offered, it started again. But now I cannot fix it. One VM that had the problem was not updated, and it does not have any problem.
It is the same all over again: there is no way I can fix it by restoring a snapshot. This VM is apparently broken.