I have a cluster with 2 SLES 11 SP1 servers and I'm running OCFS2 in order to keep a disk mounted on both servers. It had been working perfectly for a long time, but last Friday the OCFS2 filesystem became read-only. I unmounted it and ran fsck.ocfs2; the problem was solved for a few hours and then it happened again.
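For reference, this is roughly the sequence I used ("/shared" is just a placeholder for the actual mount point; /dev/sda is the device named in the log below):
[CODE]
# on both nodes: stop whatever uses the volume, then unmount it
umount /shared

# on ONE node only, while the volume is unmounted everywhere:
fsck.ocfs2 -fy /dev/sda   # -f forces a full check, -y answers "yes" to repair prompts

# afterwards, remount on both nodes (assuming the volume is in /etc/fstab)
mount /shared
[/CODE]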
The errors found in the log are:
Mar 15 14:12:01 server2 kernel: [13327703.909661] OCFS2: ERROR (device sda): ocfs2_validate_dx_leaf: Dir Index Leaf has bad signature onName:
Mar 15 14:12:01 server2 kernel: [13327703.909666] File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
Mar 15 14:12:01 server2 kernel: [13327703.909671] (mv,5986,4):ocfs2_dx_dir_search:961 ERROR: status = -30
Mar 15 14:12:01 server2 kernel: [13327703.909674] (mv,5986,4):ocfs2_find_entry_dx:1066 ERROR: status = -30
This might be defective back-end storage... what is it running on? SAN? DRBD? Is it a RAID set with redundancy, or a single disk? You might want to check either the back-end's logs or the disk itself, e.g. via SMART reports.
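If it turns out to be locally attached disks, a quick SMART look could be something like the following (smartctl is part of the smartmontools package; the device name is only an example - on a SAN/DS back end the equivalent checks live in the storage controller):
[CODE]
smartctl -H /dev/sda   # overall health self-assessment
smartctl -a /dev/sda   # all SMART attributes, error log and self-test log
[/CODE]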
Are both servers at the same software level, especially concerning OCFS2?
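One way to compare that on both nodes might be something like this (package names can differ between SLES releases, so treat it as a sketch):
[CODE]
# run on each node and compare the output
uname -r                                         # running kernel (the ocfs2 modules are tied to it)
rpm -qa | grep -i ocfs2                          # installed OCFS2 userspace/tools packages
modinfo ocfs2 | grep -iE '^(version|vermagic)'   # version info of the available ocfs2 module
[/CODE]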
Both servers are at exactly the same SUSE version and patch level.
The back-end storage is an IBM DS, attached via SAS connections, and it reports no errors. It shows optimal status. There are multiple disks in a RAID5 configuration.
The system worked for more than a year without any issues, and now it suddenly keeps switching the filesystem to read-only every few days.
[QUOTE=kyriakos;12597]Both servers are at exactly the same SUSE version and patch level.
The back-end storage is an IBM DS, attached via SAS connections, and it reports no errors. It shows optimal status. There are multiple disks in a RAID5 configuration.[/QUOTE]
RAID5 should indeed rule out single-disk failures, especially if no errors are reported at the RAID level.
May I ask you to check for additional messages in syslog, especially related to the SAS connection/adapter, on both nodes, and probably from some time before the error message (as quoted in your first message) was recorded?
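A rough way to scan for that (on SLES 11 the default log file is /var/log/messages; the patterns are only a starting point):
[CODE]
# look for SAS/SCSI/I/O-related kernel messages on both nodes,
# ideally covering the hours before the OCFS2 error was logged
grep -iE 'sas|scsi|i/o error|reset|timeout' /var/log/messages | less
[/CODE]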
Of course the most obvious question: Has anything changed in the overall system (OS/driver/firmware updates, re-wiring,… either at the server or the storage back-end) right before the problems started to pop up?
[QUOTE]RAID5 should indeed rule out single-disk failures, especially if no errors are reported at the RAID level.
May I ask you to check for additional messages in syslog, especially related to the SAS connection/adapter, on both nodes, and probably from some time before the error message (as quoted in your first message) was recorded?
Of course the most obvious question: Has anything changed in the overall system (OS/driver/firmware updates, re-wiring,… either at the server or the storage back-end) right before the problems started to pop up?
Regards,
Jens[/QUOTE]
Hi Jens
There are no messages to indicate anything wrong with the hardware or software. Everything works normally, and then suddenly the errors start to appear.
Also, we haven't made any changes. The only thing that has changed is that there is more data stored in the OCFS2 filesystem than there used to be, as it is ever-increasing; however, it is still less than half the total capacity.
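For completeness, this is how I'm judging the usage (again, "/shared" is just a placeholder for the real mount point):
[CODE]
df -h /shared   # block usage - well under 50% here
df -i /shared   # inode usage, just to rule out running out of inodes
[/CODE]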
[QUOTE]There are no messages to indicate anything wrong with the hardware or software. Everything works normally, and then suddenly the errors start to appear.
Also, we haven't made any changes. The only thing that has changed is that there is more data stored in the OCFS2 filesystem than there used to be, as it is ever-increasing; however, it is still less than half the total capacity.[/QUOTE]
Then, I have to admit, I cannot be of much help at the moment - your best bet to get this resolved is opening a support ticket with SUSE, if possible.
A word of personal opinion: In earlier releases of OCFS2 (already declared production-ready by Oracle), I stumbled over quite a few bugs and fatal errors. During the last year or two I haven't hit any of them anymore - but this may well be because we moved on to providing individual disks to our VMs, rather than storing their disk images on a common, shared OCFS2 volume. The most unstable behaviour was seen when using sparse files (while not overly large, those had a fair amount of I/O in already allocated areas). It may well be that your individual access pattern has hit an OCFS2 bug.
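If you want to see whether sparse files play a role on your volume, GNU find can print a sparseness ratio; something along these lines might give a hint (the path is a placeholder, and this is only an indicator, not a diagnosis):
[CODE]
# %S = allocated size / apparent size; values well below 1 hint at sparse files
find /shared -type f -printf '%S\t%s\t%p\n' | awk '$1 < 0.9' | sort -n | head
[/CODE]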
Sorry that I cannot give better advice. To find out the details, a close look at the file system, the nodes, and their FS access seems to be required, together with in-depth knowledge of the innards of the OCFS2 version in use.