Very, very slow file access to iSCSI NSS on SLES11/XEN/OES11

Hi,

Like many Novell customers, while carrying out a hardware refresh we are moving off traditional NetWare 6.5 to OES11 and at the same time virtualising our environment.

We have new Dell PowerEdge 620 servers attached via 10 Gb iSCSI to an EqualLogic SAN.

We installed SLES with all patches and updates, plus Xen, and then created OES11 SP2 virtual machines; these connect to the NSS volumes via iSCSI.

We migrated files from the traditional NetWare server to the new hardware, started testing, and ran into very, very slow file access times.

A 3.5 MB PDF file takes close to 10 minutes to open from a local PC with the Novell Client installed; it is the same with no client, opening via CIFS. Opening the same file off the traditional NW 6.5 server takes 3-4 seconds.

We have had a case open with Novell for almost 2 months but they have been unable to resolve it.

To test other options we installed VMware ESXi on the internal USB flash drive and booted off that, created the same OES11 VM, connected to the NSS on the SAN, and the same PDF opened in seconds.

The current SLES11/Xen/OES11 stack cannot be put into production.

Any ideas where the bottleneck might be? We think it is in Xen.

Thanks

idgandrewg wrote:

Any ideas where the bottleneck might be? We think it is in Xen.

So don’t use Xen - try KVM instead, or stick with ESX.
My understanding is Xen’s network stack is a bit tricky to get right.
If your setup isn’t right, that would explain the iSCSI bottleneck. Sorry I
can’t help you troubleshoot Xen; it’s not something I am familiar with. Maybe
someone else will chime in.

Hi idgandrewg,

First (of probably a long list of) questions: where do you terminate the iSCSI sessions - in Dom0 (the “host”) or DomU (the “VM”)? In other words, what are the elements involved? (The following list is intended to give a better picture of the situation.)

  • iSCSI server is an Equalogic SAN - single-/multipath iSCSI access?
  • dedicated Ethernet network between Xen server and SAN server? Or separate VLAN of production network? Can you rule out the switches? (I saw the ESXi comment, see question below)
  • servers connected to SAN via Gbit ethernet adapters or via iSCSI HBAs?
  • how’s your DomU defined - PVM or HVM? What disk resources are defined for the DomU, and where are they from (local server disk / SAN LUN via e.g. iSCSI/FC/…)?
  • as already asked above: Do the iSCSI connections for the NSS “devices” terminate in Dom0 or in DomU?
  • with your ESXi test, was the iSCSI connection the same (iSCSI terminating in host vs.VM) as with your Xen test?
  • have you run a network (iSCSI) trace during the “slow” accesses to the PDF, and could you identify a delay?

I’m no OES guy, but these problems seem to be of a more generic nature (“We think it is in Xen”), so at least we can get the ball rolling…
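
As a quick first check - and this is only a sketch, assuming the LUNs come in via the standard open-iscsi software initiator rather than a hardware HBA - you can ask both sides which iSCSI sessions they hold:

# on Dom0 (the Xen host)
iscsiadm -m session -P 1     # lists active sessions and their targets

# inside the DomU (the OES VM)
iscsiadm -m session -P 1     # if the NSS target shows up here, the session terminates in the DomU

Whichever side lists the NSS target is where the iSCSI session terminates.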

Regards,
Jens

Hi GofBorg,

My understanding is Xen’s network stack is a bit tricky to get right.

I wouldn’t say so… you just have to decide whom you’ll let do the job (base OS vs. Xen scripts) :smiley: But then it’s straight-forward and easily comprehensible.
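
To illustrate what I mean by letting the base OS do the job: on SLES you can define the bridge as a regular sysconfig interface and leave Xen’s network-bridge script disabled in /etc/xen/xend-config.sxp. A minimal sketch, with assumed device names and addressing:

# /etc/sysconfig/network/ifcfg-br0  (names/IP are assumptions)
STARTMODE='auto'
BOOTPROTO='static'
BRIDGE='yes'
BRIDGE_PORTS='eth0'
BRIDGE_STP='off'
BRIDGE_FORWARDDELAY='0'
IPADDR='192.168.1.10/24'

The DomUs then simply attach their vifs to br0, and everything is visible with the normal networking tools.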

Regards,
Jens

jmozdzen wrote:

Hi GofBorg,

My understanding is Xen’s network stack is a bit tricky to get right.

I wouldn’t say so… you just have to decide whom you’ll let do the job
(base OS vs. Xen scripts) :smiley: But then it’s straight-forward and easily
comprehensible.

Regards,
Jens

I run against 1 Gb iSCSI and have not seen these kinds of issues. Just making
sure I understand your layout: I assume you are clustering, and I assume you
are using iSCSI in the VMs as well to connect to the resources inside the VM.
If you are mounting them from the host side, I would not do that; run iSCSI
directly from the VM itself for all SAN-attached storage. It has worked like a
charm. I take less than a 10% performance hit versus hardware.

jmozdzen wrote:

I wouldn’t say so… you just have to decide whom you’ll let do the job
(base OS vs. Xen scripts) :smiley: But then it’s straight-forward and easily
comprehensible.

Or it could be that it is more fleshed out since I last played with it.
That would be a good thing. I do recall early on that it was a big gripe
with Xen. I’ll take your word for it that it has changed. As you stated in
your other post, there are a myriad of variables that need to be considered.
It could be a driver issue.

idgandrewg wrote:

Any ideas where the bottleneck might be? We think it is in Xen.

Here’s a shot in the dark…

There have been various performance issues related to TCP Offloading.
Some examples are described in these TIDs: TID7000478, TID3344651,
TID7007604.

TID7005304 describes a workaround:
Howto change network specific settings using ethtool in combination
with NetworkManager
http://www.novell.com/support/kb/doc.php?id=7005304

It’s a simple matter to disable TCP Offloading to see if it resolves
your issue.
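
For a quick test, something along these lines should do (a sketch only - the interface name is an assumption, and the exact set of offloads your driver supports may differ):

ethtool -k eth2                           # show the current offload settings
ethtool -K eth2 tso off gso off gro off   # disable the most common culprits

If it helps, the change can later be made persistent via ETHTOOL_OPTIONS in the interface's ifcfg file.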


Kevin Boyle - Knowledge Partner

Thanks, will give it a try and see if it resolves the problem.

Hi Jens, answers to your questions are below.

First (of probably a long list of) questions: where do you terminate the iSCSI sessions - in Dom0 (the “host”) or DomU (the “VM”)? In other words, what are the elements involved? (The following list is intended to give a better picture of the situation.)

Q - iSCSI server is an Equalogic SAN - single-/multipath iSCSI access?
A - iSCSI connection to SAN is multipath
Q - dedicated Ethernet network between Xen server and SAN server? Or separate VLAN of production network? Can you rule out the switches? (I saw the ESXi comment, see question below)
A - 10 Gb on a dedicated network. The ESXi test used the same switches and we saw fast file access, so we don’t think it is the switches.
Q - servers connected to SAN via Gbit ethernet adapters or via iSCSI HBAs?
A - iSCSI HBA cards
Q - how’s your DomU defined - PVM or HVM? What disk resources are defined for the DomU, and where are they from (local server disk/SAN LUN via i.e. iSCSI/FC/…)?
A - DomU is PVM and boots from local server disk; the SAN is connected via iSCSI. Hosts boot from the SAN; the DomU boots from storage provided by Dom0’s iSCSI connection to the SAN. The DomU’s NSS disk resource is obtained “directly” via an iSCSI connection to the SAN, i.e. the LUN is presented to the DomU, not to Dom0.
Q - as already asked above: Do the iSCSI connections for the NSS “devices” terminate in Dom0 or in DomU?
A - NSS for the OES11 guests terminates in DomU
Q - with your ESXi test, was the iSCSI connection the same (iSCSI terminating in host vs.VM) as with your Xen test?
A - For the ESXi test the only change was that we booted VMware off the internal USB flash card rather than the local disk that SLES/Xen booted off.
Q - had you run a network (iSCSI) trace during the “slow” accesses to the PDF and could identify a delay?
A - Yes, we have done captures for the Novell tech and they saw delays.

Hi idgandrewg,

Q - servers connected to SAN via Gbit ethernet adapters or via iSCSI HBAs?
A - iSCSI HBA cards
Q - as already asked above: Do the iSCSI connections for the NSS “devices” terminate in Dom0 or in DomU?
A - NSS for the OES11 guests terminates in DomU

just out of curiosity: What was your intention behind the decision to use the DomU’s iSCSI initiator, rather than passing through some LUN(s) provided by the HBA? (I could see reasons for both variants, so this is no good/bad question.)

A - Yes, we have done captures for the Novell tech and they saw delays.
It’d of course be interesting to know where the delays were observed - DomU to Dom0 (the VIF part) or in the network, initiator to SAN server vs. SAN server to initiator, DomU delays between receiving the latest SAN response and sending the next request, …

Have you yet had a chance to change the offload settings of the NICs?
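
If you want to record the current state before changing anything, a rough sketch (device names are assumptions) would be to dump the offload settings of the physical NICs as well as the bridge and vif devices in Dom0:

ethtool -k eth2        # physical NIC
ethtool -k br0         # the bridge the DomU is attached to
ethtool -k vif1.0      # the DomU’s virtual interface as seen from Dom0

That makes it easier to see afterwards which knob actually made the difference.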

Regards,
Jens

Curious - how have you set up the multipath configuration? Have you also installed the latest EqualLogic Linux Hit Kit on the SLES host?
I’ve had very good results with it, even if the Xen kernel is not fully supported; just watch that you are using the SuppressPartitions option (when running SLES 11 SP2) for LUNs connected directly to the Xen host that need to be passed through to the domU guest.

The EQL Hit Kit also takes care of handling the multipath configuration (multipathd can be switched off) and volumes created and assigned to the Xen host appear automagically under /dev/eql/[volume name].

It really makes maintaining the iSCSI connections a breeze, adding volumes is a question of setting them up + iqn access on the Equallogic - and then simply refreshing the iscsiadm target.
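
(For completeness, the refresh is just the usual open-iscsi discovery/login cycle - a sketch, with a placeholder group IP:

iscsiadm -m discovery -t sendtargets -p 10.0.0.10   # rediscover targets on the EQL group
iscsiadm -m node -l                                  # log in to any newly discovered targets

after which the new volume shows up under /dev/eql/ as described above.)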

Are you configuring multipath on the OES11 SP2 virtual machine, or is the volume/LUN connected to the Xen host and then passed on to the OES11 SP2 VM as a physical/phy connected DomU device? I prefer this latter method (the Xen host does all the iSCSI connection and device handling) and performance is, as I’ve seen it, optimal.
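
For reference, with that latter method the guest configuration stays trivial - the DomU just gets a plain phy device pointing at the EQL-managed device node on the host. A sketch with a made-up volume name:

# in the DomU configuration file (volume name is an assumption)
disk = [ 'phy:/dev/eql/nss-vol1,xvdb,w' ]   # plus whatever boot disk the guest already has

The Xen host keeps the iSCSI session and multipathing to itself; the guest only ever sees an ordinary block device.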

-Willem

I don’t think my systems guys used the EqualLogic Linux Hit Kit - I will check.

Jens and Kevin,

Your hunches were right.

Yes, it looks like the offload settings are the problem on the host’s two interfaces, p3p1 and p3p2.

Using ethtool -K I turned off rx, tx, sg, tso, gso, gro, lro, rxvlan, txvlan and rxhash on p3p1 and p3p2 - FAST file access, very fast.

To test, I turned rx, tx, sg, tso, gso, gro, lro, rxvlan, txvlan and rxhash back on for p3p1 and p3p2 and file access was SLOOOOOOOOOWWWWWWW.
Turned them off again and got fast file access - I am able to repeat slow or fast file access by switching the offloads on p3p1 and p3p2 on or off with ethtool -K.
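
For the record, the commands were along these lines, run on the Dom0 host (only the interface names are specific to our boxes):

ethtool -K p3p1 rx off tx off sg off tso off gso off gro off lro off rxvlan off txvlan off rxhash off
ethtool -K p3p2 rx off tx off sg off tso off gso off gro off lro off rxvlan off txvlan off rxhash off

Running the same commands with “on” instead of “off” brings the slow behaviour straight back.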

I am told Interfaces p3p1 and p3p2 are enslaved interfaces in the bond configuration of br2. ???
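
To see how they hang together I can look at the bond and the bridge from Dom0 - a sketch, assuming the bond device is called bond0:

cat /proc/net/bonding/bond0   # shows the bonding mode and the enslaved interfaces
brctl show                    # shows which devices are attached to which bridge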

Waiting for support to tell me what the implications of this finding are and the best way to fix it.

Andrew

It’s not needed per se… it does however configure and set a lot of “stuff” for you so you can achieve an optimal setup.

Even if you only run it on a test SLES box, run the eqltune tool (included in the Hit Kit) to see which settings it recommends. You can translate most of those to your current setups.

Also see my reply in the Novell forums.

Cheers,
Willem

On both hosts we edited the ifcfg files for the two interfaces that we had identified as causing the slowdown - p3p1 and p3p2 - and set ETHTOOL_OPTIONS.

This means the settings are now persistent after a reboot.

barney # cd /etc/sysconfig/network/

barney:/etc/sysconfig/network # vi ifcfg-p3p1
BOOTPROTO='none'
BROADCAST=''
ETHTOOL_OPTIONS='-K iface rx off tx off sg off tso off gso off gro off lro off rxvlan off txvlan off rxhash off'
IPADDR=''
MTU=''
NAME='Intel Ethernet controller'
NETMASK=''
NETWORK=''
REMOTE_IPADDR=''
STARTMODE='auto'
USERCONTROL='no'
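
The same ETHTOOL_OPTIONS line went into ifcfg-p3p2. As a sketch - and minding the active iSCSI sessions and multipath while you do it - the settings can also be re-applied without a full reboot by bouncing each interface:

barney:/etc/sysconfig/network # ifdown p3p1 && ifup p3p1
barney:/etc/sysconfig/network # ifdown p3p2 && ifup p3p2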