OCFS2 response time

Hi all,

we are seeing a significant performance problem here and I’d like to know whether it is “normal” and/or whether we can do anything about it (“performance tuning”-wise).

We’re running two SLES11 SP1 + HAE nodes, latest patches applied, Xen kernel. The problem has existed from the beginning (applying patches has changed nothing so far). Both nodes access two shared Fibre Channel-based LUNs (disks), with OCFS2 as the file system on top. We’re using Pacemaker as the heartbeat layer for OCFS2. A third node (without OCFS2 or the like) is running as a standby node and has taken the DC role; both “OCFS2” nodes are simple members. DLM is only active on those two nodes.

One of the file systems is used to store virtual disk images (mounted on /var/lib/xen/images) but is currently empty; the other holds VM config files and xend lock files (mounted on /etc/xen/vm).

(“xend lock files” refers to the xend-config.sxp options “(xend-domain-lock yes)” plus “(xend-domain-lock-path /etc/xen/vm/vm_locks)” active on both nodes).
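For reference, the corresponding fragment of xend-config.sxp would look roughly like this (exact surrounding file layout may differ on your installation):

```
# /etc/xen/xend-config.sxp (fragment) -- the two options stated above,
# active on both nodes
(xend-domain-lock yes)
(xend-domain-lock-path /etc/xen/vm/vm_locks)
```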

For debugging purposes, we run a periodic access check (“time ls -lR /var/lib/xen/images/” and “time ls -lR /etc/xen/vm/vm_locks”) and monitor the time each command takes, on both nodes.
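A minimal one-shot sketch of such a check (paths as in the post; the millisecond formatting and cron-style repetition are assumptions, not part of the original setup) could look like:

```shell
# Hypothetical one-shot version of the periodic check described above;
# run it from cron or a loop to collect a time series on each node.

# measure_ms DIR -- print the wall-clock time of "ls -lR DIR" in milliseconds
measure_ms() {
    start=$(date +%s%N)                    # nanoseconds since epoch (GNU date)
    ls -lR "$1" > /dev/null 2>&1 || true   # ignore errors, we only time the call
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

for d in /var/lib/xen/images /etc/xen/vm/vm_locks; do
    printf '%s: %s ms\n' "$d" "$(measure_ms "$d")"
done
```

Logging the timestamped output from both nodes side by side makes it easy to correlate the latency spikes with DLM activity.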

With /var/lib/xen/images, we always get sub-second response times (< 100 ms typically) on both nodes.

With /etc/xen/vm, things are much different:
[LIST]
[*]When the FS is mounted on a single node only, the test is equally quick.
[*]When the FS is also mounted on the second node, and locks are used on the second node too, the times on the first node jump to around 2000 ms (with peaks at 5 seconds) and the second node responds in 6 to 10 seconds, sometimes even higher.
[*]I currently have no verified data at hand for the situation “both nodes mounted but locks only on one node”.
[/LIST]

Obviously, the delays have to do with the distributed locking across both nodes. But since both nodes aren’t under significant load (neither is the FC server) and the networks between both nodes (production network plus dedicated connection via a separate switch) are almost idle, I believe these values to be a bit high…

Anyone out there who could share her/his experiences with me?

With regards,
Jens

I have been dealing with similar issues, though not exactly the same. I don’t know if this TID will help: 7009790. I have been working with support since early January on these kinds of issues. I was running the hosts under SLES 10/OES2 using iSCSI, upgraded to SLES11 SP1/OES11, and have been trying many things to get the performance back. I have been trying other things as well, but I don’t want to share those here since they are not documented by Novell right now.

Rick,

thanks for your feedback.

The referenced TID describes communication delays between Dom0 and DomU. Although this could be related to the difficulties we are experiencing, I currently cannot confirm any unusual network delays while pinging

  • Dom0 A to Dom0 B
  • Dom0 A to third, non-Xen standby cluster node (the DC)
  • Dom0 A to DomU on server A
  • DomU on Server A to Dom0 A

whether with the ping message size referenced in the TID (97 bytes), with a larger size, or with the standard size.
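The probes above can be sketched as a dry run; the hostnames (dom0-b, standby-dc, domu-a) are placeholders for the real nodes, and the size list is an assumption (56 is the default ICMP payload, 97 matches the TID, 1400 approximates a full frame):

```shell
# Hypothetical dry run of the latency probes: print, rather than execute,
# one ping command per host/payload-size combination.
probes=$(
    for host in dom0-b standby-dc domu-a; do
        for size in 56 97 1400; do
            echo "ping -c 5 -s $size $host"
        done
    done
)
echo "$probes"
```

Running the printed commands from each node in turn covers all four directions listed above.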

I have been dealing with similar issues but not exact.

Were those issues related to OCFS2 file system performance on Dom0, or to network performance (with possible iSCSI performance impacts for iSCSI initiators in DomUs)?

With regards

Jens

I did not either. As I said, I have been having issues. Maybe you should open an SR (service request); it could be similar. I have been working with Novell on mine.