OS: SLES 12 for SAP SP3, fully updated/patched.
About once a week, the system (application and database) becomes unresponsive due to I/O errors.
While the I/O errors are occurring, we can still ping the system and log in via SSH/PuTTY, but none of the standard Linux commands run successfully; they all fail with an I/O error:
:~ # top
:~ # /usr/bin/top: Input/output error
:~ # dmesg
:~ # /usr/bin/dmesg: Input/output error
:~ # tail -f /var/log/messages
:~ # /usr/bin/tail: Input/output error
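The errors above suggest that the binaries under /usr/bin can no longer be read from disk. In an SSH session that is already open, the shell's own builtins may still work, because they do not need to load anything from the file system. A minimal sketch, assuming the session runs bash and that /proc is still readable in this state (the function name cat_builtin is hypothetical, not a standard tool):

```shell
#!/bin/bash
# Print a file using only bash builtins (read, printf, [),
# so no external binary such as /usr/bin/cat has to be loaded from disk.
cat_builtin() {
    local line
    # The "|| [ -n ... ]" part also prints a final line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}

# Example calls (assumption: /proc, a kernel-backed virtual file system, is still readable):
#   cat_builtin /proc/meminfo   # memory counters, including Dirty and Writeback
#   cat_builtin /proc/mounts    # shows whether a file system was remounted read-only
```

This only helps if the function is typed or pasted into the session while the problem is happening, since sourcing it from a file would itself require disk access.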
The interesting part is that the issue is always fixed (for the next 4-5 days) simply by hard rebooting the server (the system does not even reboot via command), and the system then keeps running without any issue until the next I/O error (which repeats every 5-6 days).
Not a single file system error is ever reported (in the logs) on this system. We have even run file system checks.
SUSE Support advised us
I am unable to understand how memory tuning would prevent the I/O error. Interestingly, this is the SAP HANA replication target, i.e. this system is a passive node, while we never face I/O errors on the master/primary SAP server.