I/O error

OS: SLES 12 for SAP SP 3 + fully updated/patched.

About once a week, the system (application and database) becomes unresponsive due to I/O errors.
While the I/O errors persist, we can still ping the system and log in via SSH/PuTTY, but none of the standard Linux commands run successfully:

:~ # top
:~ # /usr/bin/top: Input/output error
:~ # dmesg
:~ # /usr/bin/dmesg: Input/output error
:~ # tail -f /var/log/messages
:~ # /usr/bin/tail: Input/output error

The interesting part is that a hard reboot always fixes the issue for the next 4-5 days (the system does not even reboot via command), and the system then runs without any problem until the next I/O error, which recurs every 5-6 days.

Not a single file system error is ever reported in the logs on this system. We have even run the file system checks.

SUSE Support advised us

I am unable to understand how memory tuning would prevent the I/O error. Interestingly, this is the SAP HANA replication target, i.e. this system is the passive node, and we never face I/O errors on the master/primary SAP server.

# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        252G     0  252G   0% /dev
tmpfs           393G   80K  393G   1% /dev/shm
tmpfs           252G  9.9M  252G   1% /run
tmpfs           252G     0  252G   0% /sys/fs/cgroup
/dev/sda4       788G  266G  522G  34% /
/dev/sda2       985M   74M  860M   8% /boot
/dev/sda6       1.0T   28G  997G   3% /hana/log
/dev/sda7       297G   24G  273G   8% /hana/shared
/dev/sda5       2.0T  321G  1.7T  16% /hana/data
tmpfs            51G     0   51G   0% /run/user/485
tmpfs            51G     0   51G   0% /run/user/1000
tmpfs            51G     0   51G   0% /run/user/1006
tmpfs            51G   16K   51G   1% /run/user/487
tmpfs            51G     0   51G   0% /run/user/1004

# /usr/bin/free -h
             total       used       free     shared    buffers     cached
Mem:          503G       140G       363G       7.5G        85M       8.9G
-/+ buffers/cache:       131G       372G
Swap:          20G         0B        20G

Hi
With that amount of memory available, consider tweaking any swap usage?

I use:

cat /etc/sysctl.d/98-grover.conf

# minimize swapping (swap itself stays enabled)
vm.swappiness=1
vm.vfs_cache_pressure=50

This will ensure RAM is actually used before hitting the swap space.
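
To confirm that the two values above are actually in effect on the running kernel, something like the following could be used. It is only a sketch: the helper name `vm_setting` is made up for illustration, and it simply reads the files under /proc/sys/vm.

```shell
#!/bin/sh
# Read one vm.* tunable from /proc/sys/vm (or from another directory given
# as the second argument). "vm_setting" is a hypothetical helper name.
vm_setting() {
    cat "${2:-/proc/sys/vm}/$1"
}

# Show the two values set in the file above, if they are readable here.
for key in swappiness vfs_cache_pressure; do
    if [ -r "/proc/sys/vm/$key" ]; then
        echo "vm.$key = $(vm_setting "$key")"
    fi
done

# To load everything under /etc/sysctl.d without a reboot, as root:
#   sysctl --system
```

If the printed values do not match the file, the settings have not been loaded yet.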

So disks are all ok, memory has all been tested, filesystem checks run?

Have you run iotop rather than top? I would leave it (iotop) running in a session…
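
One way to leave iotop running is batch mode with timestamps, started before the problem occurs. The sketch below assumes iotop is installed and run as root; the function names and the /dev/shm location are my own choices, picked so the log lives in RAM rather than on the disk that stops responding.

```shell
#!/bin/sh
# Build the log file name; the directory defaults to /dev/shm (tmpfs, i.e. RAM).
# Both function names here are hypothetical.
iotop_log_path() {
    printf '%s/iotop-%s.log\n' "${1:-/dev/shm}" "$(uname -n)"
}

# Start iotop in the background (needs root):
#   -o  only processes actually doing I/O
#   -b  batch (non-interactive) output
#   -t  timestamp on every line
#   -d 10  one sample every 10 seconds
start_iotop_log() {
    nohup iotop -o -b -t -d 10 >> "$(iotop_log_path "$1")" 2>&1 &
}

# Usage (as root):  start_iotop_log
```

Keep in mind that tmpfs contents are lost on the hard reboot, so the log has to be read while the system is still up. Alternatively, run the same iotop command in the foreground of an SSH session so the output stays in the client's scrollback.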

Nice advice, however I am interested to know whether tuning memory could actually prevent the I/O error from occurring. For me it's hard to imagine.

Hi
Well, is the system swapping when the I/O issues occur? Have you checked the disks and filesystems as well as the RAM?

Running iotop may give a better indication of what is happening.
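
To answer the swapping question with numbers, the kernel's cumulative swap counters in /proc/vmstat (`pswpin` and `pswpout`, counted in pages) can be sampled twice and compared. A minimal sketch, with helper names that are made up for illustration:

```shell
#!/bin/sh
# Print cumulative pages swapped in and out as "<in> <out>".
# Reads /proc/vmstat unless another file is given.
swap_counters() {
    awk '$1 == "pswpin" {i = $2} $1 == "pswpout" {o = $2} END {print i + 0, o + 0}' \
        "${1:-/proc/vmstat}"
}

# Difference between two "<in> <out>" samples (second minus first).
swap_delta() {
    echo "$1 $2" | awk '{print $3 - $1, $4 - $2}'
}

# Usage:
#   before=$(swap_counters); sleep 60; after=$(swap_counters)
#   swap_delta "$before" "$after"
```

If both numbers stay at 0 over the interval, the system did not swap during that time.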