very bad network performance

Dear experts,

We are struggling with bad network performance on a SLES 11 SP4 installation running as a PowerLinux LPAR. The LPARs have 10 Gbit/s network adapters assigned to them. Currently in our project we are communicating with AIX sandbox systems that only have 1 Gbit/s. We were able to increase the throughput between those systems by 12 MB/s simply by upgrading the Linux kernel from 3.0.101-63-ppc64 to 3.0.101-77-ppc64 via zypper, because ethtool -k eth0 now shows that TSO and GRO are on; that gave us the boost.
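For reference, this is roughly how we check (and, if needed, switch on) the offloads. eth0 here is simply the interface mentioned above, and the exact feature names in the ethtool -k output can vary slightly between ethtool versions:

ethtool -k eth0 | grep -i offload
ethtool -K eth0 tso on
ethtool -K eth0 gro on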

However, after the kernel upgrade we tested the network connection between SLES on 10 Gbit/s and a productive AIX machine on 10 Gbit/s, both residing in the same data center on different IBM Power servers, but in the same subnet of course, so traceroute shows no intermediate hops. We measured only 1.2 Gbit/s, while between two AIX systems on 10 Gbit/s we measure > 7 Gbit/s on the same network.

When I query the devices on the SLES machine via ethtool eth0 and ethtool eth1, the supported ports are detected as FIBRE, the supported link mode is 1000baseT/Full, the advertised link mode is 1000baseT/Full, auto-negotiation is on, duplex is full, and the speed is only 1000 Mb/s. However, that can't really be the case, because as stated above we measured 1.2 Gbit/s between SLES at 10 Gbit/s and AIX at 10 Gbit/s.
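(For completeness, those queries were essentially the following; the -i call is just an extra idea, not something we have run yet - it shows which driver sits behind the interface, e.g. ibmveth in the case of a virtual adapter:)

ethtool eth0
ethtool eth1
ethtool -i eth0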

My concern is: shouldn't SLES be able to advertise a speed of 10000 Mb/s? Why does it only advertise 1000 Mb/s?

On 06/30/2016 04:34 AM, dafrk wrote:[color=blue]

We are struggling with bad network performance on a SLES 11 SP4 installation running as a PowerLinux LPAR. The LPARs have 10 Gbit/s network adapters assigned to them. Currently in our project we are communicating with AIX sandbox systems that only have 1 Gbit/s. We were able to increase the throughput between those systems by 12 MB/s simply by upgrading the Linux kernel from 3.0.101-63-ppc64 to 3.0.101-77-ppc64 via zypper, because ethtool -k eth0 now shows that TSO and GRO are on; that gave us the boost.[/color]

How fast were the SLES-to-AIX connections before the kernel upgrade? I
see the max speed, and I see some increase, but I do not see the actual
speed prior to the update.
[color=blue]

However, after the kernel upgrade we tested the network connection between SLES on 10 Gbit/s and a productive AIX machine on 10 Gbit/s, both residing in the same data center on different IBM Power servers, but in the same subnet of course, so traceroute shows no intermediate hops. We measured only 1.2 Gbit/s, while between two AIX systems on 10 Gbit/s we measure > 7 Gbit/s on the same network.[/color]

What kind of speed do you get between two LPARs on the same host?
[color=blue]

When I query the devices on the SLES machine via ethtool eth0 and ethtool eth1, the supported ports are detected as FIBRE, the supported link mode is 1000baseT/Full, the advertised link mode is 1000baseT/Full, auto-negotiation is on, duplex is full, and the speed is only 1000 Mb/s. However, that can't really be the case, because as stated above we measured 1.2 Gbit/s between SLES at 10 Gbit/s and AIX at 10 Gbit/s.[/color]

In other virtual environments I have seen SLES send far faster than advertised, because it's all virtual and, at the end of the day, the virtual bits moving between virtual systems are not really limited by 1 Gbit hardware, particularly when sending among VMs on the same host - hence my question above about intra-host communication among VMs.
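If you have a tool like iperf on both LPARs (just a sketch - any comparable tool works, and the address and port below are placeholders), an intra-host test would look roughly like this:

iperf -s -p 5001                          # on the receiving LPAR
iperf -c 10.0.0.2 -p 5001 -t 30 -P 4      # on the sending LPAR; 10.0.0.2 = receiver's IP (placeholder)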


Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below…

[QUOTE=ab;33303]How fast were the SLES-to-AIX connections before the kernel upgrade?

What kind of speed do you get between two LPARs on the same host?[/QUOTE]

Hello,

First of all, let me thank you for your fast reply.

Before the kernel upgrade we were at about 600 Mbit/s, afterwards at ~700 Mbit/s. Before that we were at only 200 Mbit/s, but we increased that simply by upgrading our IBM VIO Server to version 2.4.2.32. We also weren't able to restore backups from Tivoli Storage Manager until we upgraded that, too. It seems the Power kernel for Linux and the IBM support for PowerLinux are still a work in progress.
Between two LPARs on the same host we are getting 3948 Mbit/s. Note that this should be much faster as well, because the communication between those LPARs only passes through the hypervisor itself and never goes out over the physical network.

By the way, I seem to get the best out of the network when using the following settings:
ethtool -K eth0 tso on
ethtool -K eth0 gro on
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_fack=0
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_no_metrics_save=1
sysctl -w net.core.rmem_max=12582912
sysctl -w net.core.wmem_max=12582912
sysctl -w net.core.netdev_max_backlog=9000
sysctl -w net.core.somaxconn=512
sysctl -w net.ipv4.tcp_rmem="4096 87380 9437184"
sysctl -w net.ipv4.tcp_wmem="4096 87380 9437184"
sysctl -w net.ipv4.ipfrag_low_thresh=393216
sysctl -w net.ipv4.ipfrag_high_thresh=544288
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.ipv4.tcp_synack_retries=3
sysctl -w net.ipv4.tcp_retries2=6
sysctl -w net.ipv4.tcp_keepalive_time=1000
sysctl -w net.ipv4.tcp_keepalive_probes=4
sysctl -w net.ipv4.tcp_keepalive_intvl=20
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=30

With these settings I was able to boost SLES 10 Gbit/s to AIX 10 Gbit/s communication from 1.2 Gbit/s to 1.95 Gbit/s, but no further.
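For what it's worth, to make these values survive a reboot they can also go into /etc/sysctl.conf (just a sketch with the same values as above, in file form):

# /etc/sysctl.conf (excerpt)
net.core.rmem_max = 12582912
net.core.wmem_max = 12582912
net.core.netdev_max_backlog = 9000
net.ipv4.tcp_rmem = 4096 87380 9437184
net.ipv4.tcp_wmem = 4096 87380 9437184
# ...remaining entries analogous; reload with "sysctl -p"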

Also, I cannot use a wide variety of ethtool options, like ethtool -a, which I can use on Intel Linux installations.

So yes, I guess you are right that the advertised speed has nothing to do with it. However, something is still fishy.

Best regards

Sorry for the double post.

Just wanted to add that my problem is that currently AIX to AIX on 10 Gbit/s is making about 3 Gbit/s. So my boss is asking me why SLES - even after tuning parameters - is still only getting about 1 Gbit/s down its throat, while the AIX machines are working fine when talking to each other.

Hi dafrk,

there's more to network performance than just the parameters you posted. And I couldn't see how you measured the throughput, which would tell us something about the nature of the network traffic:

  • If you're transferring large amounts of data, using large packets would increase the overall throughput. For 10G, you may have large Ethernet frames (jumbo frames) enabled (and usable) on the AIX side of things, while the Linux LPARs use the standard size (see the sketch after this list)

  • When transferring via a windowed protocol (where transmission of further packets requires the receiver's acknowledgment of the previously sent packets), different window sizing algorithms may affect the throughput

  • using different protocols for the tests (or simply measuring the link utilization) would of course produce results that aren't comparable (this is only for future readers, as I'm sure you already took that into account)

  • of course, comparing virtualized environments with OSes running natively would have to take virtualization-specific effects into account, too (you didn't say whether the “production servers” are LPARed AIX instances, so I'm just mentioning it)
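For the packet-size point above: a quick way to compare would be to check the MTU on both ends (just a sketch - the interface names are examples, eth0 on the SLES LPAR and en0 on AIX):

ip link show eth0          # on SLES; look at the "mtu" value
lsattr -El en0 -a mtu      # on AIX; en0 is only an example adapter name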

Maybe you'd be able to spot differences by looking at the actual packet transfer - that would give you facts about packet and window sizes to compare AIX-to-AIX vs. AIX-to-SLES.
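Just as a starting point (a sketch - adjust the interface, and 10.0.0.5 is a placeholder for the AIX peer), a short capture on the SLES side during a transfer would already show the negotiated MSS and the advertised windows:

tcpdump -i eth0 -nn -v -c 100 'host 10.0.0.5 and tcp'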

Regards,
J