SSD - failure

I have been using SLED for some years. My laptop, which runs for a few hours every day, has an SSD. Some years ago we had to replace one disk. OK, shit happens. Now I am getting failures again.
The laptop freezes completely and the HD controller runs continuously. GSmartControl reports no disk failures.
Could there be an issue with SSDs on Linux, or on SUSE specifically?

Here is the data as well:
SLED15:/home/hans-christoph # hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number: KINGSTON SA400S37240G
Serial Number: 50026B767B01BD02
Firmware Revision: SBFK71E0
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 11 10 9 8 7 6 5
Likely used: 11
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63

CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 468862128
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 228936 MBytes
device size with M = 1000*1000: 240057 MBytes (240 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec’d by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
SET_MAX security extension
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
Device-initiated interface power management
* Software settings preservation
* DOWNLOAD MICROCODE DMA command
* SET MAX SETPASSWORD/UNLOCK DMA commands
* WRITE BUFFER DMA command
* READ BUFFER DMA command
* DEVICE CONFIGURATION SET/IDENTIFY DMA commands
* Data Set Management TRIM supported (limit 8 blocks)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
20min for SECURITY ERASE UNIT. 60min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50026b767b01bd02
NAA : 5
IEEE OUI : 0026b7
Unique ID : 67b01bd02
Checksum: correct
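
For anyone cross-checking the two "device size" lines in the hdparm output: a minimal sketch, assuming the size is simply the LBA48 sector count times the 512-byte logical sector size, rounded down. The function name is made up for illustration.

```python
def device_size_mbytes(sectors: int, sector_bytes: int, m: int) -> int:
    """Return the device size in units of m bytes, rounded down."""
    return (sectors * sector_bytes) // m

LBA48_SECTORS = 468862128  # "LBA48 user addressable sectors" in the output above
SECTOR_BYTES = 512         # "Logical Sector size" in the output above

print(device_size_mbytes(LBA48_SECTORS, SECTOR_BYTES, 1024 * 1024))  # 228936
print(device_size_mbytes(LBA48_SECTORS, SECTOR_BYTES, 1000 * 1000))  # 240057
```

Both results match what hdparm reports, so the drive is a nominal 240 GB model.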

@HANS-CHRISTOPH Hi, what does smartctl -a /dev/sda show? Are you running btrfs? Perhaps it needs a defrag/balance etc.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 3910
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 1581
148 Unknown_Attribute 0x0000 255 255 000 Old_age Offline - 8
149 Unknown_Attribute 0x0000 255 255 000 Old_age Offline - 2
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 SATA_Phy_Error_Count 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 6
170 Bad_Blk_Ct_Erl/Lat 0x0013 100 100 010 Pre-fail Always - 0/7
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 MaxAvgErase_Ct 0x0000 100 100 000 Old_age Offline - 21 (Average 13)
181 Program_Fail_Cnt_Total 0x0012 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0000 255 255 000 Old_age Offline - 1
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 1
192 Unsafe_Shutdown_Count 0x0012 100 100 000 Old_age Always - 60
194 Temperature_Celsius 0x0023 066 048 000 Pre-fail Always - 34 (Min/Max 11/52)
196 Not_In_Use 0x0000 100 100 000 Old_age Offline - 2
199 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 CRC_Error_Count 0x0000 100 100 000 Old_age Offline - 0
231 SSD_Life_Left 0x0013 100 100 000 Pre-fail Always - 98
233 Flash_Writes_GiB 0x0013 100 100 000 Pre-fail Always - 3219
241 Lifetime_Writes_GiB 0x0012 100 100 000 Old_age Always - 3142
242 Lifetime_Reads_GiB 0x0012 100 100 000 Old_age Always - 1505
244 Average_Erase_Count 0x0000 100 100 000 Old_age Offline - 13
245 Max_Erase_Count 0x0000 100 100 000 Old_age Offline - 21
246 Total_Erase_Count 0x0000 100 100 000 Old_age Offline - 160584

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Completed without error 00% 3907 -
# 2 Short offline Completed without error 00% 3905 -
# 3 Short offline Completed without error 00% 3890 -
# 4 Extended offline Completed without error 00% 3889 -
# 5 Short offline Completed without error 00% 3877 -
# 6 Short offline Completed without error 00% 3871 -
# 7 Short offline Completed without error 00% 3867 -
# 8 Short offline Completed without error 00% 3861 -
# 9 Short offline Completed without error 00% 3858 -

#10 Short offline Completed without error 00% 3854 -
#11 Short offline Completed without error 00% 3850 -
#12 Short offline Completed without error 00% 3842 -
#13 Short offline Completed without error 00% 3834 -
#14 Short offline Completed without error 00% 3819 -
#15 Short offline Completed without error 00% 3815 -
#16 Short offline Completed without error 00% 3814 -
#17 Short offline Completed without error 00% 3808 -
#18 Short offline Completed without error 00% 3795 -
#19 Short offline Completed without error 00% 3781 -
#20 Short offline Completed without error 00% 3762 -
#21 Short offline Completed without error 00% 3752 -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I use BTRFS as the native file system, as SUSE suggests.
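
A long attribute table like the one above is easier to scan with a small script. This is only a rough sketch: it assumes the ten-column layout shown in the smartctl output, the sample lines are copied from that output, and the function names are invented here. It lists attributes whose normalized VALUE has dropped to or below a non-zero THRESH.

```python
# A few lines copied from the smartctl attribute table in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
170 Bad_Blk_Ct_Erl/Lat 0x0013 100 100 010 Pre-fail Always - 0/7
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 1
192 Unsafe_Shutdown_Count 0x0012 100 100 000 Old_age Always - 60
199 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
231 SSD_Life_Left 0x0013 100 100 000 Pre-fail Always - 98
"""

def parse_attributes(text):
    """Parse attribute rows into a dict keyed by attribute ID."""
    attrs = {}
    for line in text.splitlines():
        # The raw value may contain spaces, so split at most 9 times.
        parts = line.split(None, 9)
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip the header and anything that is not a table row
        attrs[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "thresh": int(parts[5]),
            "raw": parts[9],
        }
    return attrs

def below_threshold(attrs):
    """Names of attributes whose VALUE is at or below a non-zero THRESH."""
    return [a["name"] for a in attrs.values()
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]

attrs = parse_attributes(SAMPLE)
print("below threshold:", below_threshold(attrs))
print("Reported_Uncorrect raw:", attrs[187]["raw"])
print("Unsafe_Shutdown_Count raw:", attrs[192]["raw"])
```

On the sample lines nothing is below its threshold. The raw Unsafe_Shutdown_Count of 60 would be consistent with the hard reboots after freezes, but that is only a guess.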

@HANS-CHRISTOPH Hi, that looks OK. What scheduler is in use?

cat /sys/block/sda/queue/scheduler

hans-christoph@SLED15:~> cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
This is the result. The disks are not full; they have free space. It is strange, and the issue happens randomly. Sometimes I think it happens when I am using several programs and the Internet, but I couldn’t find any problem there.
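
In that sysfs output the active scheduler is the one in square brackets. If you want to check it from a script, a minimal sketch (the helper name is hypothetical):

```python
import re

def active_scheduler(line):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", line)
    return match.group(1) if match else None

# The line reported in this thread:
print(active_scheduler("[mq-deadline] kyber bfq none"))  # mq-deadline
```

On a live system the input line would come from /sys/block/sda/queue/scheduler.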

@HANS-CHRISTOPH Hi, that’s the correct one for an SSD. Can you confirm that the likes of fstrim, btrfs defrag, balance etc. have been run?

@HANS-CHRISTOPH Hi, I forgot to add the command to check: systemctl list-timers will show when each one last ran etc…

NEXT LEFT LAST PASSED UNIT ACTIVATES
Thu 2021-02-11 08:00:00 CET 49min left Thu 2021-02-11 07:00:07 CET 10min ago snapper-timeline.timer snapper-timeline.service
Thu 2021-02-11 23:05:47 CET 15h left Tue 2021-02-09 10:08:38 CET 1 day 21h ago snapper-cleanup.timer snapper-cleanup.service
Thu 2021-02-11 23:11:04 CET 16h left Tue 2021-02-09 10:13:56 CET 1 day 20h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Fri 2021-02-12 00:00:00 CET 16h left Thu 2021-02-11 06:50:48 CET 20min ago logrotate.timer logrotate.service
Fri 2021-02-12 00:00:00 CET 16h left Thu 2021-02-11 06:50:48 CET 20min ago mandb.timer mandb.service
Fri 2021-02-12 00:15:52 CET 17h left Thu 2021-02-11 06:50:48 CET 20min ago check-battery.timer check-battery.service
Fri 2021-02-12 01:39:26 CET 18h left Thu 2021-02-11 06:50:48 CET 20min ago backup-sysconfig.timer backup-sysconfig.service
Fri 2021-02-12 01:58:12 CET 18h left Thu 2021-02-11 06:50:48 CET 20min ago backup-rpmdb.timer backup-rpmdb.service
Mon 2021-02-15 00:00:00 CET 3 days left Mon 2021-02-08 08:06:54 CET 2 days ago btrfs-balance.timer btrfs-balance.service
Mon 2021-02-15 00:00:00 CET 3 days left Mon 2021-02-08 08:06:54 CET 2 days ago fstrim.timer fstrim.service
Mon 2021-03-01 00:00:00 CET 2 weeks 3 days left Mon 2021-02-01 07:39:12 CET 1 weeks 2 days ago btrfs-scrub.timer btrfs-scrub.service

It looks like everything ran fine.
Note that I have two SLED 15.2 machines running. One is my old desktop with an HD; there are no disk problems there.
The other is my laptop with an SSD. I replaced one disk in the past. The problem seems to be the same as last time: random freezing of the laptop. In that case a hard reboot is the only option. What I can see when it happens is that the SSD controller (LED) is running.
Searching for “Linux SSD freeze”, I can see that others have had the same problem.

Hi
So if you look at the times when the above services ran, do they correspond with when you get a freeze?

I also suggest enabling the magic SysRq key: https://en.wikipedia.org/wiki/Magic_SysRq_key. Use cat /proc/sys/kernel/sysrq to see the current value; you can also set up a /etc/sysctl.d/10-magic-sysrq.conf file…

Hi
I’m not sure. The services run often, and at night, as can be seen from the timestamps.
The sysrq value is 184 - ??
I can see I have to work with it a little more, but it seems to have a lot of functions. When it freezes, ALT+F2 doesn’t work.
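
On the 184: per the kernel's SysRq documentation, a value greater than 1 in /proc/sys/kernel/sysrq is a bitmask of allowed functions. A small sketch to decode it; the bit meanings are taken from that documentation, while the function and table names are just for illustration.

```python
# Bit values as listed in the kernel admin guide for /proc/sys/kernel/sysrq.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}

def decode_sysrq(value):
    """Return the list of SysRq function groups enabled by this value."""
    if value == 0:
        return ["SysRq disabled"]
    if value == 1:
        return ["all SysRq functions enabled"]
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]

for name in decode_sysrq(184):
    print(name)
```

So 184 = 128 + 32 + 16 + 8: reboot/poweroff, remount read-only, sync and debugging dumps are allowed, while keyboard control and process signalling are not.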

@HANS-CHRISTOPH Hi, have a read here: https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html

Try the key combo, pressing the keys in the required order: press and hold Alt, press the SysRq key and release it, then press the following keys one at a time: R E I S U B, and then release the Alt key. The system should reboot…
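
One caveat with the value 184 reported earlier in the thread: not every key of that sequence may be permitted. A rough sketch to check which REISUB keys a given mask allows, assuming the key-to-bit mapping from the kernel's SysRq documentation (R needs keyboard control, E and I need process signalling); the names below are made up for this example.

```python
# (key, action, required bit in /proc/sys/kernel/sysrq)
REISUB = [
    ("R", "take keyboard out of raw mode", 4),
    ("E", "send SIGTERM to all processes", 64),
    ("I", "send SIGKILL to all processes", 64),
    ("S", "sync filesystems", 16),
    ("U", "remount filesystems read-only", 32),
    ("B", "reboot", 128),
]

def allowed_keys(mask):
    """Return the REISUB keys permitted by a kernel.sysrq value."""
    if mask == 0:
        return []                          # SysRq disabled entirely
    if mask == 1:
        return [key for key, _, _ in REISUB]  # everything enabled
    return [key for key, _, bit in REISUB if mask & bit]

print(" ".join(allowed_keys(184)))  # S U B
```

With 184, only S, U and B would take effect, which is still enough for a clean sync and reboot. Setting kernel.sysrq = 1 in the sysctl.d file mentioned above would enable the full sequence.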

Now I have seen the same failure on my desktop as well, which has a regular HD. I think I can relate it to Firefox. It happens when many windows are open and Firefox has been running for a long time.
It always happens when I am working with Firefox.

@HANS-CHRISTOPH Hi, sounds like you might need to look at Firefox tweaks, e.g. reducing the disk cache.
How much system RAM do you have?

Maybe reduce swappiness?

cat /etc/sysctl.d/98-swap.conf

#Reduce swappiness
vm.swappiness=1
vm.vfs_cache_pressure=50
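
To answer the RAM question and see how much swap is in use when things get slow, something like the following could help. It is a sketch only, assuming the usual "Key: value kB" layout of /proc/meminfo; the helper names are hypothetical.

```python
from pathlib import Path

def parse_meminfo(text):
    """Map /proc/meminfo keys to their numeric values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def swap_used_kb(info):
    """Swap currently in use, in kB (0 if the keys are missing)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)

meminfo_path = Path("/proc/meminfo")
if meminfo_path.exists():  # only present on Linux
    current = parse_meminfo(meminfo_path.read_text())
    print("RAM total (MiB):", current.get("MemTotal", 0) // 1024)
    print("Swap used (MiB):", swap_used_kb(current) // 1024)
```

If swap usage climbs steeply while Firefox has many windows open, that would point towards memory pressure rather than the disk itself.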