SLES 11 GFS2 Problem

Hello,
on my SLES 11 system with a custom kernel 3.10.9, I have added a GFS2 volume (iSCSI).
It works fine at first, but after a while of data access on the LUN, the following error appears.

[ 6240.651154] “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
[ 6240.651159] httpd D ffff88007f619c58 0 19957 19893 0x00000000
[ 6240.651170] ffff88007653f9d8 0000000000000086 0000000000000000 ffff880036c27bc0
[ 6240.651180] 0000000000000000 0000000000000000 ffff88007653f9a8 ffff88007653e010
[ 6240.651189] ffff88007653e000 ffff88007653e010 ffff88007653e000 ffff88007653e000
[ 6240.651208] Call Trace:
[ 6240.651222] [] ? find_get_page+0x4d/0xc0
[ 6240.651229] [] ? find_lock_page+0x25/0x80
[ 6240.651235] [] ? find_or_create_page+0x3a/0xa0
[ 6240.651242] [] ? default_spin_lock_flags+0x13/0x30
[ 6240.651249] [] schedule+0x24/0x70
[ 6240.651268] [] gfs2_glock_holder_wait+0x9/0x10 [gfs2]
[ 6240.651274] [] __wait_on_bit+0x5a/0x90
[ 6240.651287] [] ? gfs2_glock_demote_wait+0x10/0x10 [gfs2]
[ 6240.651301] [] ? gfs2_glock_demote_wait+0x10/0x10 [gfs2]
[ 6240.651306] [] out_of_line_wait_on_bit+0x74/0x90
[ 6240.651314] [] ? autoremove_wake_function+0x40/0x40
[ 6240.651327] [] ? gfs2_glock_put+0x4c/0x260 [gfs2]
[ 6240.651341] [] gfs2_glock_wait+0x3e/0x80 [gfs2]
[ 6240.651355] [] gfs2_glock_nq+0x2f0/0x3d0 [gfs2]
[ 6240.651372] [] gfs2_glock_nq_init+0x21/0x40 [gfs2]
[ 6240.651417] [] gfs2_permission+0xf1/0x100 [gfs2]
[ 6240.651434] [] ? gfs2_glock_nq_init+0x19/0x40 [gfs2]
[ 6240.651441] [] __inode_permission+0x46/0xf0
[ 6240.651447] [] inode_permission+0x3d/0x60
[ 6240.651453] [] link_path_walk+0x46d/0x9b0
[ 6240.651459] [] ? lock_rcu_walk+0x15/0x20
[ 6240.651465] [] path_lookupat+0x53/0x8a0
[ 6240.651471] [] ? getname_flags+0x53/0x1b0
[ 6240.651477] [] filename_lookup+0x33/0xd0
[ 6240.651483] [] user_path_at_empty+0x7b/0xb0
[ 6240.651490] [] ? bad_area_nosemaphore+0xe/0x10
[ 6240.651496] [] ? __do_page_fault+0x2d8/0x540
[ 6240.651502] [] user_path_at+0xc/0x10
[ 6240.651507] [] vfs_fstatat+0x51/0xb0
[ 6240.651512] [] vfs_lstat+0x19/0x20
[ 6240.651517] [] SyS_newlstat+0x1f/0x50
[ 6240.651522] [] ? do_page_fault+0x9/0x10
[ 6240.651529] [] ? page_fault+0x28/0x30
[ 6240.651535] [] system_call_fastpath+0x1a/0x1f

or this:

[ 2880.650146] INFO: task httpd:3295 blocked for more than 480 seconds.
[ 2880.650152] “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
[ 2880.650154] httpd D ffff88007f616c58 0 3295 3253 0x00000000
[ 2880.650161] ffff880062735c78 0000000000000082 ffff880062734000 ffff88003737a040
[ 2880.650167] 0000000000000000 00000000000080d0 ffff8800448dd3b0 ffff880062734010
[ 2880.650172] ffff880062734000 ffff880062734010 ffff880062734000 ffff880062734000
[ 2880.650177] Call Trace:
[ 2880.650207] [] ? gfs2_holder_uninit+0x1e/0x40 [gfs2]
[ 2880.650218] [] ? gfs2_glock_dq_uninit+0x19/0x20 [gfs2]
[ 2880.650231] [] ? gfs2_open+0xe5/0x160 [gfs2]
[ 2880.650240] [] ? default_spin_lock_flags+0x13/0x30
[ 2880.650251] [] schedule+0x24/0x70
[ 2880.650261] [] gfs2_glock_holder_wait+0x9/0x10 [gfs2]
[ 2880.650265] [] __wait_on_bit+0x5a/0x90
[ 2880.650275] [] ? gfs2_glock_demote_wait+0x10/0x10 [gfs2]

An Apache web server runs on the LUN, and it stops working when this happens.

Does anybody have any hints or tricks?

Best regards
B.-D.

Hi B.-D.,

[ 2880.650146] INFO: task httpd:3295 blocked for more than 480 seconds.
Does anybody have any hints or tricks?

There could be tons of root causes, but my first guess would be DLM. In other words, does the DLM connection work reliably? Since you are not mentioning any details: how is your cluster set up?
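For a quick sanity check (just a sketch, assuming the dlm and corosync command-line tools are installed on your box):

# list the lockspaces the kernel DLM currently holds
dlm_tool ls

# dlm_controld's view of membership and fencing state
dlm_tool status

# corosync ring status - a ring marked FAULTY would be a red flag
corosync-cfgtool -s

If a lockspace is missing or a ring is faulty, that would already narrow things down.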

SLES 11 with custom kernel 3.10.9

Why are you using a custom kernel (not judging, just asking)? Are there any specific patches you include, or any user-space tools you additionally updated?

Are you running HAE? Oh, and which SLES11 is that, SP3 or some older version? (“cat /etc/SuSE-release” would provide the answer, in case you’re unsure.)

Regards,
Jens

Hello Jens,

yes, the DLM connection works fine.
DLM runs on top of corosync; here is the corosync config:

##################################################################################
totem {
    version: 2

    crypto_cipher: none
    crypto_hash: none
    secauth: off
    cluster_name: web_cluster
    rrp_mode: passive
    #rrp_mode: active

    interface {
        ringnumber: 0
        bindnetaddr: 172.16.190.201
        mcastaddr: 226.94.1.3
        mcastport: 5405
        ttl: 10
    }

    #transport: udpu
    token: 30000
}

logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

nodelist {
    node {
        ring0_addr: 172.16.190.10
        nodeid: 1
    }

    #node {
    #    ring0_addr: 172.16.190.12
    #    nodeid: 2
    #}
}

quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    #provider: corosync_votequorum
    #expected_votes: 1
}

aisexec {
    # Run as root - this is necessary to be able to manage resources with Pacemaker
    user: root
    group: root
}

service {
    # Load the Pacemaker Cluster Resource Manager
    ver: 0
    name: pacemakerl

    use_mgmtd: yes
    use_logd: yes
}

amf {
    mode: disabled
}
##################################################################################

and here is the DLM config:

cat /etc/dlm/dlm.conf
enable_fencing=0

I started DLM with the command “dlm_controld -r1”
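For completeness, the bring-up is basically the following (the device path and mount point below are only placeholders, not the real ones):

# 1. start corosync (membership layer)
corosync

# 2. start the DLM control daemon, as mentioned above
dlm_controld -r1

# 3. mount the GFS2 volume from the iSCSI LUN (placeholder device and mount point)
mount -t gfs2 /dev/sdX /srv/www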

With the out-of-the-box kernel, the same error appeared.
I compiled everything that is needed for GFS2 myself.
Here is my SLES version (“cat /etc/SuSE-release”):
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

The LUN is only mounted on one server.

Best regards
B.-D.

Hi B.-D.,

I’ve had my share of difficulties with DLM and OCFS2, albeit with earlier versions of SLES 11 SP2 and SP3, and when using the file system on more than one node. Our use case included heavy locking operations. HAE updates made the problems go away, so I flagged them as “bugs”; some of them showed behaviour similar to what you report.

Since this is a custom installation, I cannot point you to SLES support via a service request (the folks at SUSE are highly knowledgeable and helpful in the HAE area), but you might consider having a look at the debugging features available for DLM. When you run into this situation again, try to gather status information not only from syslog, but from DLM’s /sys tree as well. Additionally, you might want to do some poor man’s monitoring of the response time of your file system, e.g. by timing a run of “ls -l” or the like; a rough sketch follows below. We noticed a steady increase (across days) when working with problematic versions, prior to the actual program abort.
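Something along these lines might be a starting point the next time it hangs (just a sketch; the mount point “/srv/www” and the file-system/lockspace name “gfs2web” are placeholders, and debugfs must be mounted for the glock and DLM dumps):

# make sure debugfs is available (often mounted already)
mount -t debugfs none /sys/kernel/debug 2>/dev/null

# GFS2 glock state of the affected file system (directory name is clustername:fsname)
cat /sys/kernel/debug/gfs2/web_cluster:gfs2web/glocks > /tmp/gfs2-glocks.txt

# DLM view: lockspaces plus the lock state of the suspect lockspace
dlm_tool ls > /tmp/dlm-ls.txt
dlm_tool lockdebug gfs2web > /tmp/dlm-lockdebug.txt

# dump blocked-task backtraces into the kernel log and save it
echo w > /proc/sysrq-trigger
dmesg > /tmp/dmesg-blocked.txt

# poor man's latency monitoring of the mount point
while true; do date; time ls -l /srv/www > /dev/null; sleep 60; done >> /tmp/gfs2-latency.log 2>&1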

From my experience, debugging this issue will be a hassle :-( But we’ll try to assist as best as we can :-)

Regards,
Jens