Slow network response when one CPU core is at 100% load

Hello All,

Please help me investigate a strange issue.
I have a few SLES 11 SP3 for VMware x86_64 servers where network responsiveness becomes very slow when one CPU core is under full load.
The servers run under vSphere 5.1 with 20GB RAM and a 4x2 vCPU configuration. Neither the vCPUs nor the memory is overcommitted.
I am using the vmxnet3 network adapter and the paravirtual SCSI controller. VMware Tools is up to date for this vSphere version.
At first I suspected the load of the ESXi host, but the response time is also slow when I run only one VM on the host.
Please help me with some suggestions.

Thank You in advance and kind regards,

Tamas

Hi Tamas,

How did you measure network responsiveness? Please describe in as much detail as required to understand:

  • from where to where did you run your measurement (e.g. from an external physical server to a virtual server)
  • which tools you used (e.g. “ping”)
  • if possible, your invocation plus results (e.g. "root@extserver # ping " and the resulting output, for the good and the bad case)
  • which results you would call “fast” and which you would call “slow” (e.g. “with no vCPU over 50%, ping turn-around times are 5 ms; when one vCPU is at 100%, they increase to 2000 ms”)

I have not yet fully understood under which circumstances the responsiveness is below expectations: a single VM on a vSphere 5.1 server, with how many vCPUs out of how many physical CPUs, and one vCPU loaded at 100%? What is the state of the physical machine?

When responsiveness drops (and you have more than a single VM active), is only one VM affected (the one with the high vCPU load), or are all VMs affected? Does the vSphere host itself still respond quickly, or does it slow down too?

Many questions, I know, but all just to get a picture of what’s actually going on and to allow for ideas to pop up…

Regards,
Jens

Hello Jens,

Thank You for the reply.

  • I run the test from the virtual server which is the only virtual machine on the ESXi host.

  • To test the responsiveness, I use regular ping. The target is a physical host on the local network that is not under any load at all.

  • I created a basic shell script to simulate the load of the application:
    infiniteloop.sh

    #!/bin/bash
    # Busy loop: keeps one vCPU at ~100% until interrupted with CTRL+C.
    int=0
    for (( ; ; ))
    do
        echo "Press CTRL+C to stop…"
        int=$((int+1))
    done

I start this script once; it then loads one vCPU to ~100%.

My test is the following (a scripted version of it is sketched right after this list):

  1. examine the current system load; it must be idle
  2. start pinging a host on the network; the response time must be under 0.5 ms
  3. start infiniteloop.sh
  4. examine the ping response time
  5. stop infiniteloop.sh
  6. examine the ping response time
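For reference, here is a rough script that automates this test (the sleep durations and the path to infiniteloop.sh are just placeholders for my environment):

Code:

#!/bin/bash
# Rough automation of the test above: ping while idle, under load, then idle again.
TARGET=192.168.244.56              # unloaded physical host on the local network

ping "$TARGET" > /tmp/ping.log 2>&1 &
PING_PID=$!

sleep 15                           # phase 1: baseline, system idle

./infiniteloop.sh > /dev/null 2>&1 &
LOAD_PID=$!
sleep 30                           # phase 2: one vCPU at ~100%

kill "$LOAD_PID"
sleep 15                           # phase 3: idle again

kill "$PING_PID"
grep 'time=' /tmp/ping.log         # compare round-trip times across the three phases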

Here is the output of my test:
@07:34:28 sapv-prd-as31:~ # ping 192.168.244.56
PING 192.168.244.56 (192.168.244.56) 56(84) bytes of data.
64 bytes from 192.168.244.56: icmp_seq=1 ttl=64 time=0.299 ms
64 bytes from 192.168.244.56: icmp_seq=2 ttl=64 time=0.297 ms
64 bytes from 192.168.244.56: icmp_seq=3 ttl=64 time=0.299 ms
64 bytes from 192.168.244.56: icmp_seq=4 ttl=64 time=0.247 ms
64 bytes from 192.168.244.56: icmp_seq=5 ttl=64 time=0.282 ms
64 bytes from 192.168.244.56: icmp_seq=6 ttl=64 time=0.264 ms
64 bytes from 192.168.244.56: icmp_seq=7 ttl=64 time=0.292 ms
64 bytes from 192.168.244.56: icmp_seq=8 ttl=64 time=0.296 ms
64 bytes from 192.168.244.56: icmp_seq=9 ttl=64 time=0.257 ms
64 bytes from 192.168.244.56: icmp_seq=10 ttl=64 time=0.281 ms
64 bytes from 192.168.244.56: icmp_seq=11 ttl=64 time=0.263 ms
64 bytes from 192.168.244.56: icmp_seq=12 ttl=64 time=0.304 ms
64 bytes from 192.168.244.56: icmp_seq=13 ttl=64 time=0.234 ms
64 bytes from 192.168.244.56: icmp_seq=14 ttl=64 time=0.275 ms
64 bytes from 192.168.244.56: icmp_seq=15 ttl=64 time=0.305 ms
64 bytes from 192.168.244.56: icmp_seq=16 ttl=64 time=0.263 ms
64 bytes from 192.168.244.56: icmp_seq=17 ttl=64 time=5.45 ms ← Here I started the infiniteloop.sh
64 bytes from 192.168.244.56: icmp_seq=18 ttl=64 time=5.99 ms
64 bytes from 192.168.244.56: icmp_seq=19 ttl=64 time=5.00 ms
64 bytes from 192.168.244.56: icmp_seq=20 ttl=64 time=8.37 ms
64 bytes from 192.168.244.56: icmp_seq=21 ttl=64 time=2.58 ms
64 bytes from 192.168.244.56: icmp_seq=22 ttl=64 time=6.97 ms
64 bytes from 192.168.244.56: icmp_seq=23 ttl=64 time=9.99 ms
64 bytes from 192.168.244.56: icmp_seq=24 ttl=64 time=8.99 ms
64 bytes from 192.168.244.56: icmp_seq=25 ttl=64 time=7.99 ms
64 bytes from 192.168.244.56: icmp_seq=26 ttl=64 time=7.00 ms
64 bytes from 192.168.244.56: icmp_seq=27 ttl=64 time=8.02 ms
64 bytes from 192.168.244.56: icmp_seq=28 ttl=64 time=6.94 ms
64 bytes from 192.168.244.56: icmp_seq=29 ttl=64 time=6.97 ms
64 bytes from 192.168.244.56: icmp_seq=30 ttl=64 time=6.00 ms
64 bytes from 192.168.244.56: icmp_seq=31 ttl=64 time=3.34 ms
64 bytes from 192.168.244.56: icmp_seq=32 ttl=64 time=7.62 ms
64 bytes from 192.168.244.56: icmp_seq=33 ttl=64 time=5.99 ms
64 bytes from 192.168.244.56: icmp_seq=34 ttl=64 time=7.01 ms
64 bytes from 192.168.244.56: icmp_seq=35 ttl=64 time=5.98 ms
64 bytes from 192.168.244.56: icmp_seq=36 ttl=64 time=4.99 ms
64 bytes from 192.168.244.56: icmp_seq=37 ttl=64 time=3.97 ms
64 bytes from 192.168.244.56: icmp_seq=38 ttl=64 time=3.00 ms
64 bytes from 192.168.244.56: icmp_seq=39 ttl=64 time=6.42 ms
64 bytes from 192.168.244.56: icmp_seq=40 ttl=64 time=5.56 ms
64 bytes from 192.168.244.56: icmp_seq=41 ttl=64 time=8.97 ms
64 bytes from 192.168.244.56: icmp_seq=42 ttl=64 time=8.00 ms
64 bytes from 192.168.244.56: icmp_seq=43 ttl=64 time=5.47 ms ← Here I stopped the infiniteloop.sh
64 bytes from 192.168.244.56: icmp_seq=44 ttl=64 time=0.261 ms
64 bytes from 192.168.244.56: icmp_seq=45 ttl=64 time=0.260 ms
64 bytes from 192.168.244.56: icmp_seq=46 ttl=64 time=0.224 ms
64 bytes from 192.168.244.56: icmp_seq=47 ttl=64 time=0.256 ms
64 bytes from 192.168.244.56: icmp_seq=48 ttl=64 time=0.255 ms
64 bytes from 192.168.244.56: icmp_seq=49 ttl=64 time=0.248 ms
64 bytes from 192.168.244.56: icmp_seq=50 ttl=64 time=0.234 ms
64 bytes from 192.168.244.56: icmp_seq=51 ttl=64 time=0.272 ms
64 bytes from 192.168.244.56: icmp_seq=52 ttl=64 time=0.304 ms
64 bytes from 192.168.244.56: icmp_seq=53 ttl=64 time=0.254 ms
64 bytes from 192.168.244.56: icmp_seq=54 ttl=64 time=0.248 ms
64 bytes from 192.168.244.56: icmp_seq=55 ttl=64 time=0.241 ms
^C
— 192.168.244.56 ping statistics —
55 packets transmitted, 55 received, 0% packet loss, time 54031ms
rtt min/avg/max/mdev = 0.224/3.276/9.995/3.318 ms

Here is the output of top during the slow period:
top - 07:44:38 up 16:50, 7 users, load average: 2.61, 2.26, 1.97
Tasks: 182 total, 2 running, 180 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 98.0%us, 2.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.5%us, 0.5%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 40257M total, 1588M used, 38668M free, 83M buffers
Swap: 67036M total, 0M used, 67036M free, 586M cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7814 root 20 0 11584 1368 1116 R 100 0.0 0:17.44 infiniteloop.sh
6283 root 20 0 8924 1188 832 S 1 0.0 16:03.07 top
1968 root 20 0 85472 4380 3528 S 0 0.0 0:37.20 vmtoolsd
6229 root 20 0 175m 19m 10m S 0 0.0 3:22.96 gnome-terminal
7648 root 20 0 8924 1196 832 R 0 0.0 0:03.72 top
1 root 20 0 10544 816 680 S 0 0.0 0:01.79 init
2 root 20 0 0 0 0 S 0 0.0 0:00.01 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:01.14 ksoftirqd/0
5 root 20 0 0 0 0 S 0 0.0 0:00.99 kworker/u:0

Only one vCPU is running at 100%.
Currently this virtual server is the only VM on the ESXi host. The physical host has 40 cores; in the VM we load only one vCPU out of the 8, which can occupy only one physical core, i.e. the physical machine is not loaded at all.
We are planning to run multiple response-time-sensitive application servers in multiple VMs on several physical hosts, but the current situation does not allow us to take this solution into production.

Regards,

Tamas

Just an update:
If I replace the vmxnet3 network card with an e1000 one, then the response time with 1 vCPU at 100% load is around 0.3-0.4 ms. :confused:

On 03/26/2014 03:24 AM, uracst wrote:

> Just an update:
> If I replace the vmxnet3 network card with an e1000 one, then the response
> time with 1 vCPU at 100% load is around 0.3-0.4 ms. :confused:

Well that’s good information right there (your other troubleshooting has
also been fun to read). Have you tried any other virtualization solutions
(KVM or Xen, or maybe even an LXC container)? I think you mentioned bare
metal boxes do not have this issue, but if not that’d be a good check too.
What happens if you load a second vCPU with a second instance of your
script? Additional degradation or just the same as before? Any word back
from vmware on this, if you’ve asked/checked there?


Good luck.


No, I haven’t tried other hypervisors, because I inherited the environment. Unfortunately bare-metal installations aren’t an option either, because our SLES for VMware licenses come with the vSphere Standard license.
As for the second instance of the script: further instances do not cause any additional response time degradation.
I escalated the issue to our VMware support. Unfortunately it is an OEM contract, so I don’t expect a quick response.

Testing on a KVM image of mine with two virtual processors I see no change
at all, which makes sense since you have narrowed this down to one of two
NIC drivers, with the problem being VMware’s own:

Code:

64 bytes from 192.168.1.202: icmp_seq=217 ttl=64 time=0.301 ms
64 bytes from 192.168.1.202: icmp_seq=218 ttl=64 time=0.352 ms
64 bytes from 192.168.1.202: icmp_seq=219 ttl=64 time=0.376 ms
64 bytes from 192.168.1.202: icmp_seq=220 ttl=64 time=0.367 ms
64 bytes from 192.168.1.202: icmp_seq=221 ttl=64 time=0.348 ms
64 bytes from 192.168.1.202: icmp_seq=222 ttl=64 time=0.298 ms  <- start of load
64 bytes from 192.168.1.202: icmp_seq=223 ttl=64 time=0.274 ms
64 bytes from 192.168.1.202: icmp_seq=224 ttl=64 time=0.285 ms
64 bytes from 192.168.1.202: icmp_seq=225 ttl=64 time=0.297 ms
64 bytes from 192.168.1.202: icmp_seq=226 ttl=64 time=0.287 ms
64 bytes from 192.168.1.202: icmp_seq=227 ttl=64 time=0.254 ms
64 bytes from 192.168.1.202: icmp_seq=228 ttl=64 time=0.335 ms
64 bytes from 192.168.1.202: icmp_seq=229 ttl=64 time=0.281 ms
64 bytes from 192.168.1.202: icmp_seq=230 ttl=64 time=0.338 ms
64 bytes from 192.168.1.202: icmp_seq=231 ttl=64 time=0.258 ms
64 bytes from 192.168.1.202: icmp_seq=232 ttl=64 time=0.279 ms
64 bytes from 192.168.1.202: icmp_seq=233 ttl=64 time=0.319 ms


Good luck.


This makes no sense, because the recommended network card for SLES 11 is vmxnet3.
Anyway, e1000 is obsolete and will be removed in future versions.

I just moved the VM between physical servers to see whether there is any difference. I ran my tests and found that I can no longer reproduce the workaround with e1000. I even moved the VM back to its original location, and even there I cannot solve the issue just by changing the NIC type.
Well this isn’t funny at all. :mad:

I found a couple of systems on which to test, but still cannot duplicate the
issue, even though I see these systems are using vmxnet3 like you are. They
are also SLES 11 SP3 x86_64 systems… not sure about post-SP3 patches.
Ping times are constantly around 0.1 ms no matter what; all systems have
just two processors again (could the slowness be an issue with the scheduler
sharing work among your higher number of processors? Just a silly thought…).
Unfortunately the only ping tests I can do are probably from systems on the
same host, since the tests I do from other hosts go geographically across
ponds and have response times that vary by several milliseconds even without
any test load in the first place.

Could you clarify what you wrote: is the problem now happening all of the
time no matter which host or NIC driver you use, or is it no longer
happening at all? A few of the sentences from your last post have me a
bit confused.


Good luck.


Yes, the problem is happening on all the hosts (we have 10 of them, actually).
Regardless of what I wrote before, it doesn’t matter what type of NIC I use.

I just unintentionally managed to ‘repair’ (?) one of the VMs with the following steps:

  • added an e1000 vNIC, configured its address, downed the vmxnet3 vNIC (the guest-side commands are sketched after this list)
  • tested; the result was bad
  • rebooted
  • tested; the result was good (!)
  • rebooted and tested more than 5 times; the result was good
  • removed the e1000 vNIC, brought the vmxnet3 vNIC back up
  • rebooted and tested more than 5 times; the result was good
  • stopped (powered off) the VM
  • started it and retested; the result was bad (!)
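The guest-side part of those steps, roughly (the e1000 vNIC itself is added and removed in vSphere “Edit Settings”; the interface names and the address below are only placeholders):

Code:

# Bring up the newly added e1000 vNIC (eth1 and the address are placeholders)
ifconfig eth1 192.168.244.31 netmask 255.255.255.0 up
# Down the vmxnet3 vNIC
ifconfig eth0 down
# ... test, reboot, test again ...
# Later, after removing the e1000 vNIC again: bring the vmxnet3 vNIC back up
ifconfig eth0 up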

I can reproduce this multiple times. :smiley: [crying]
It seems that the problem is at the hypervisor level, given that the test was good until I powered off the VM. But it is unclear to me what changed.

Hi Tamas,

I don’t know too much about the internals of the current ESX versions, but network packet handling (passing packets between the physical and the virtual device) is done in software. Might it be that, by some bad coincidence, the same physical CPU is used both for the loaded vCPU and for the bridging software at the hypervisor level?
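If you want to check that theory, here is a rough starting point (only a sketch, I’m not sure of the exact world names on your ESXi build): esxtop on the host can show whether the VM’s vCPU world and the network helper worlds compete for the same physical CPU time.

Code:

# On the ESXi host (SSH or console), run esxtop interactively:
esxtop
#   press 'c' for the CPU view
#   press 'e' and enter the VM's GID to expand its individual worlds,
#     then watch %USED, %RDY and %CSTP of the vCPU and network-related worlds
#   press 'n' for the network view (per-port statistics)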

Regards,
Jens

Hello Jens,

True, it could happen that the same physical CPU core handles the network traffic for the physical host and I load that same core with the vCPU process. But three times in a row? I don’t think this is just bad luck.
I will continue testing the workaround on another ESXi host.
I will post back the results.

Regards,

Tamas

uracst wrote:

> This makes no sense, because the recommended network card for SLES 11
> is vmxnet3. Anyway, e1000 is obsolete and will be removed in future
> versions.
>
> I just moved the VM between physical servers to see whether there is
> any difference. I ran my tests and found that I can no longer reproduce
> the workaround with e1000. I even moved the VM back to its original
> location, and even there I cannot solve the issue just by changing the
> NIC type.
> Well this isn’t funny at all. :mad:

While I believe it’s better to diagnose the problem before trying to
fix it, sometimes it’s easier to try a few quick fixes.

Since your issue is related to your network card(s) and there have been
some performance issues with TCP offloading, especially in a VM, you
may want to look into temporarily disabling this feature.

https://www.suse.com/support/kb/doc.php?id=7005304

While I have not seen this feature cause excessive CPU utilisation, it
is easy enough to turn it off to see.
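Not quoting the TID verbatim, but on SLES 11 the offload features can be
toggled at runtime with “ethtool -K”, for example (assuming the interface
is eth0; the changes do not persist across a reboot):

Code:

# Disable the common offload features on eth0 (runtime only, not persistent)
ethtool -K eth0 tso off
ethtool -K eth0 gso off
ethtool -K eth0 gro off
ethtool -K eth0 lro off
ethtool -K eth0 sg off
ethtool -K eth0 tx off
ethtool -K eth0 rx off
# Verify
ethtool -k eth0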


Kevin Boyle - Knowledge Partner

KBOYLE wrote:

> Since your issue is related to your network card(s) and there have been
> some performance issues with TCP offloading, especially in a VM, you
> may want to look into temporarily disabling this feature.
>
> https://www.suse.com/support/kb/doc.php?id=7005304

Hello Kevin,

Thank You for the answer.
Here is the output from ethtool:
ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off

Do You see anything suspicious?

Regards,

Tamas

Hello Kevin,

I turned off everything I could:
ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: off
ntuple-filters: off
receive-hashing: off

The problem is still there. I will update my VMware SR today.

Tamas

uracst wrote:

> I turned off everything I could:
> […]
> The problem is still there. I will update my VMware SR today.

Thanks for the feedback, Tamas. IMO, it was worth a try and didn’t cost
anything other than a few minutes of your time.


Kevin Boyle - Knowledge Partner

I know this thread is already quite old, but we were facing the same issue over the last few days, and since I didn’t find an answer to it on the net I would like to share our solution.

Background:
We have a farm of 70+ VMware hosts with Windows and Linux VMs.
We experienced poor latency behaviour on quite a few Linux VMs when one of the CPUs (also on multi-core VMs) was busy.
Strangely enough, this didn’t happen on all VMs.
It was also not consistent within one cluster or ESXi host.

Solution:
When comparing the *.vmx files of the VMs, we found a parameter called sched.cpu.latencySensitivity.

sched.cpu.latencySensitivity = "low" 

On the systems with poor latency values this was set to “low”, while the others had “normal”.
Changing this parameter to “normal” solved the problem!

sched.cpu.latencySensitivity = "normal" 
  1. Check in the *.vmx file whether the parameter sched.cpu.latencySensitivity is set to “low” (a quick way to check many VMs at once is sketched after these steps).
  2. If yes, shut down the VM.
  3. Either edit the *.vmx file directly, or go to “Edit Settings” → “Options” → “General” → “Configuration Parameters” and change the parameter there. Note: this is only available when the VM is powered off.
  4. Start the VM.
  5. Test (e.g. with the script mentioned in the posts above).
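A quick way to check many VMs at once, as a rough sketch (assuming shell access to the ESXi hosts and the usual /vmfs/volumes datastore layout; adjust the paths to your environment):

Code:

# On an ESXi host: list the .vmx files that pin latency sensitivity to "low"
grep -l '^sched.cpu.latencySensitivity.*"low"' /vmfs/volumes/*/*/*.vmx
# For each affected VM: power it off, change the line to
#   sched.cpu.latencySensitivity = "normal"
# (or use "Edit Settings" -> "Options" -> "General" -> "Configuration Parameters"),
# then power the VM on again and re-run the ping test.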

have fun

Hi Arnkit,

Thank You, I can confirm that this advanced VMware setting is indeed the solution for us too.

KR,

Tamas