Hello,
on my SLES 11 system with a custom 3.10.9 kernel, I have set up a GFS2 volume on an iSCSI LUN.
It works fine, but after a while of data access on the LUN, the following error appears:
[ 2880.650146] INFO: task httpd:3295 blocked for more than 480 seconds.
Does anybody have any hints or tricks?
There could be tons of root causes, but my first guess would be DLM. IOW, does the DLM connection work reliably? Since you're not mentioning any details: how is your cluster set up?
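If you want a quick sanity check, something along these lines might be a starting point (assuming the usual corosync / dlm_controld user-space tools are installed; tool names and options can differ depending on the versions in your stack):

# ring / membership status as seen by corosync
corosync-cfgtool -s
# quorum state
corosync-quorumtool -s
# lockspaces known to DLM and the daemon's view of the membership
dlm_tool ls
dlm_tool status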
SLES 11 with custom kernel 3.10.9
Why are you using a custom kernel (not judging, just asking)? Any specific patches you've included, any user-space tools you've additionally updated?
Are you running HAE? Oh, and which SLES11 is that, SP3 or some older version? (“cat /etc/SuSE-release” would provide the answer, in case you’re unsure.)
logging {
fileline: off
to_logfile: yes
to_syslog: yes
logfile: /var/log/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: QUORUM
debug: off
}
}
quorum {
# Enable and configure quorum subsystem (default: off)
# see also corosync.conf.5 and votequorum.5
provider: corosync_votequorum
#expected_votes: 1
}
aisexec {
# Run as root - this is necessary to be able to manage resources with Pacemaker
user: root
group: root
}
service {
# Load the Pacemaker Cluster Resource Manager
ver: 0
name: pacemaker
}
With the out-of-the-box kernel, the same error appeared.
I compiled everything that is needed for GFS2 myself.
Here is my SLES version:
“cat /etc/SuSE-release”
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3
I’ve had my share of difficulties with DLM and OCFS2, albeit with earlier versions of SLES11 SP2 and SP3, and when using the file system on more than one node. Our use case included heavy locking operations. HAE updates made the problems go away, so I flagged them as “bugs”; some of them showed behaviour similar to what you report.
Since this is a custom installation, I cannot point you to SLES support via a service request (the folks at SUSE are highly knowledgeable and helpful in the HAE area), but you might consider having a look at the debugging features available from DLM. When you run into this situation again, try to gather status information not only from syslog, but from DLM’s /sys tree as well. Additionally, you might want to do some poor man’s monitoring of the response time of your file system, e.g. by timing a run of “ls -l” or similar. We noticed a steady increase (across days) when working with problematic versions, prior to the actual program abort.
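A rough sketch of what I mean - the mount point below is just a placeholder, and the debugfs entries only exist if debugfs is mounted and DLM/GFS2 debugging support is enabled in your kernel:

# mount debugfs if it isn't mounted already
mount -t debugfs none /sys/kernel/debug 2>/dev/null

# DLM lock state per lockspace (file names depend on your lockspaces)
ls /sys/kernel/debug/dlm/
cat /sys/kernel/debug/dlm/*_locks

# GFS2 glock state for the mounted file systems
cat /sys/kernel/debug/gfs2/*/glocks

# kernel stacks of blocked tasks, handy once the "blocked for more
# than N seconds" message shows up
echo w > /proc/sysrq-trigger

# poor man's latency monitoring: time a directory listing once a minute
# (/mnt/gfs2 is just an example path - use your actual mount point)
while true; do
  date
  time ls -l /mnt/gfs2 > /dev/null
  sleep 60
done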
From my experience, debugging this issue will be a hassle, but we'll try to assist as best we can.