Mystery RX packet drops on SLES11 SP2 every 30 sec

We have deployed several SP2 servers for testing and are running into an annoying issue.
Approximately every 30 seconds, the RX dropped counter ticks up by 1. This only happens on our SLES11 SP2 systems.
Normally when this counter ticks it lines up with rx_fw_discards in the ethtool statistics, but here I cannot find a matching stat for these drops.
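
A quick way to watch it tick, for anyone trying to reproduce this (a rough one-liner; assumes the interface is eth0 and GNU awk for strftime; splitting /proc/net/dev on colons/spaces puts the RX drop count in field 6):

while sleep 5; do awk -F '[: ]+' '/eth0:/ { print strftime("%T"), "rx dropped:", $6 }' /proc/net/dev; done

ifconfig currently shows: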

eth0 Link encap:Ethernet HWaddr 00:21:9B:A0:07:7E
inet addr:192.168.69.247 Bcast:192.168.69.255 Mask:255.255.255.0
inet6 addr: fe80::221:9bff:fea0:77e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:61591 errors:0 dropped:276 overruns:0 frame:0
TX packets:38438099 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:39053090 (37.2 Mb) TX bytes:49909260932 (47597.1 Mb)
Interrupt:36 Memory:d2000000-d2012800

NIC statistics:
rx_bytes: 42672588
rx_error_bytes: 0
tx_bytes: 55266311872
tx_error_bytes: 0
rx_ucast_packets: 32153
rx_mcast_packets: 31988
rx_bcast_packets: 4239
tx_ucast_packets: 37456
tx_mcast_packets: 42720893
tx_bcast_packets: 4
tx_mac_errors: 0
tx_carrier_errors: 0
rx_crc_errors: 0
rx_align_errors: 0
tx_single_collisions: 0
tx_multi_collisions: 0
tx_deferred: 0
tx_excess_collisions: 0
tx_late_collisions: 0
tx_total_collisions: 0
rx_fragments: 0
rx_jabbers: 0
rx_undersize_packets: 0
rx_oversize_packets: 0
rx_64_byte_packets: 3667
rx_65_to_127_byte_packets: 22572
rx_128_to_255_byte_packets: 16761
rx_256_to_511_byte_packets: 353
rx_512_to_1023_byte_packets: 70
rx_1024_to_1522_byte_packets: 24957
rx_1523_to_9022_byte_packets: 0
tx_64_byte_packets: 56728
tx_65_to_127_byte_packets: 34705
tx_128_to_255_byte_packets: 1478329
tx_256_to_511_byte_packets: 2166944
tx_512_to_1023_byte_packets: 3321491
tx_1024_to_1522_byte_packets: 35700156
tx_1523_to_9022_byte_packets: 0
rx_xon_frames: 0
rx_xoff_frames: 0
tx_xon_frames: 0
tx_xoff_frames: 0
rx_mac_ctrl_frames: 0
rx_filtered_packets: 45678
rx_ftq_discards: 0
rx_discards: 0
rx_fw_discards: 0

driver: bnx2
version: 2.1.11
firmware-version: 6.4.5 bc 5.2.3 NCSI 2.0.11
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes

I have tried the latest driver from Broadcom, 2.2.11j, with the same results. Any ideas what might be ticking this counter?

I can confirm this behaviour… also within ESXi5 VMs, so it does not depend on the NIC (we use vmxnet3 NICs).
Anyway, one dropped packet every ~30 seconds seems not to bother anything.

[QUOTE=enovaklbank;5817]I can confirm this behaviour… also within ESXi5 VMs, so it does not depend on the NIC (we use vmxnet3 NICs).
Anyway, one dropped packet every ~30 seconds seems not to bother anything.[/QUOTE]

Well it is a pain because we monitor these counts for legitimate drops. There has to be a reason why this is happening.

I’ve been seeing this too with a couple of SLES 11 SP2 Xen setups. I was planning on setting up a small (network) isolated test server to see if it happens there too.

Trying to work out how to find what the server is dropping, I came across a tool called dropwatch (http://linux.die.net/man/1/dropwatch). Not sure if it will run on SLES 11, but there are some packages for it here: http://pkgs.org/download/dropwatch
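
From the man page, usage looks simple enough; a minimal session would be something like this (-l kas resolves drop locations to kernel symbol+offset via /proc/kallsyms):

dropwatch -l kas
dropwatch> start
dropwatch> stop
dropwatch> exit

While monitoring is started, each kernel drop point should show up as a line like "N drops at <symbol>+<offset>".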

How are you guys planning to go about finding what’s getting dropped?

Cheers,
Willem

[QUOTE=Magic31;5956]Trying to work out how to find what the server is dropping, I came across a tool called dropwatch (http://linux.die.net/man/1/dropwatch). Not sure if it will run on SLES 11, but there are some packages for it here: http://pkgs.org/download/dropwatch
[/QUOTE]

Ah, even better - an OBS build for SLES 11 SP2 : https://build.opensuse.org/package/revisions?package=dropwatch&project=home%3Abenjamin_poirier%3Adropwatch (as I also don’t know how trustworthy pkgs.org downloads are)

Interesting utility.
I was able to get this every time I saw the counter move up, about every 30 seconds. No idea what it means:

1 drops at __netif_receive_skb+1fe (0xffffffff8138388e)
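
The most I could work out: the base address of that function is in /proc/kallsyms, and base + 0x1fe matches the absolute address dropwatch printed:

grep ' __netif_receive_skb$' /proc/kallsyms

Presumably with the kernel debuginfo package installed, addr2line against the matching vmlinux could turn that address into a source line, but I have not tried that.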

[QUOTE=kevins7189_5;6117]Interesting utility.
I was able to get this every time I saw the counter move up, about every 30 seconds. No idea what it means:

1 drops at __netif_receive_skb+1fe (0xffffffff8138388e)[/QUOTE]
Interesting indeed :)

I haven’t been able to put more time into this than trying to simulate it in a (very small) test environment. Funnily enough, I did not witness any dropped packets there.

My next move is to do this at the two sites where I am seeing this, but I have not had a window to do so yet (time and other priorities).

If you can open an SR, that would be best. It could well be that certain types of packets are intentionally dropped for some security reason or other. I don’t know nearly enough about the workings of the kernel and modules… but this snippet taken from a Google search did catch my interest:

  * Add a packet_type handler and see if we can prevent
  * other packet_type’s from handling an skb.
  * Specifically, we will register our packet_type to be
  * the first handler invoked by netif_receive_skb().
  * If the packet received meets certain conditions, then
  * drop it, i.e., prevent subsequent ptype_all and ptype_base
  * handlers in netif_receive_skb() from processing the packet.

Sorry I can’t be of more help here. I will pass on this thread to my Novell contact to see if this might be something Novell is aware of.

-Willem

I should have asked earlier! : http://www.novell.com/support/kb/doc.php?id=7007165

There you go (and me too).

Cheers,
Willem

[QUOTE=Magic31;6138]I should have asked earlier! : http://www.novell.com/support/kb/doc.php?id=7007165

There you go (and me too).

Cheers,
Willem[/QUOTE]

That reply seems kind of cop-out-ish. It seems like this would be an easy problem to track down if it has been happening since kernel 2.6.37, but I can’t find anything that easily.

No drops here, so that one is out:
cat /proc/net/softnet_stat
03619a59 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035d90a8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
03609a2e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035dce37 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
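
The second column is the per-CPU drop count, in hex. A quick way to print it in decimal, assuming GNU awk for strtonum:

awk '{ printf "CPU%d: %d drops\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat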

The other issue is that this goes away if I run tcpdump on the machine and try to catch the packets. I can go hours and hours with tcpdump running and the RX counter won’t move, but soon after shutting it down the RX counters start incrementing again.
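
One thing I still want to try: tcpdump’s -p flag keeps the interface out of promiscuous mode, which should separate the “tcpdump acts as a sink for every protocol” effect from any promiscuous-mode side effect:

tcpdump -p -i eth0 -w /dev/null

If the drops stay away even with -p, it is the sink effect rather than promiscuous mode.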

Does anyone know if “oui Unknown” messages from ARP would cause this counter to increment?

Could well be… opening an SR with SUSE would seem the way to get the best answers.

Not a prio for me at the moment, as this is not affecting production, but I will be on the lookout to see if the drop can be ignored within the counter.

-Willem

This may be a known issue with the bnx2 driver’s default buffer coalescing settings. Does the rx_filtered_packets value match ( approximately ) the RX dropped value shown in ifconfig? If so, then show us the output of

ethtool -c eth0

and post here…

Then try the following:

ethtool -C eth0 rx-usecs 6 rx-usecs-irq 6 rx-frames 0 rx-frames-irq 0

and see if this reduces / eliminates the counters increasing.

Also this can be ( cough ) “perfectly normal”, because when the server is unable to find a sink for the packets, it drops them. These can be things like BPDU packets or other traffic that is not a layer 3 protocol the server listens for. Use of tcpdump or Wireshark may well stop the counter from increasing, as it acts as a sink for all packets; it also changes how the ring buffer captures packets.

If the above incantation fixes the issue, you can leave it that way. The likely cause is a periodic burst of packets which overruns the driver buffer pool. “Stuff happens.”
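
To make it persistent on SLES, the ETHTOOL_OPTIONS variable in the interface’s ifcfg file should do it ( check “man ifcfg” on your release for the exact syntax; the form below is my best guess ):

# in /etc/sysconfig/network/ifcfg-eth0
ETHTOOL_OPTIONS='-C eth0 rx-usecs 6 rx-usecs-irq 6 rx-frames 0 rx-frames-irq 0'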

– Bob

Thanks for the reply!

rx_fw_discards = 117
rx dropped in ifconfig (62 days uptime) is 237501

I’ve done some reading on the coalesce settings with bnx2, but I thought that was more of a troubleshooting step than a “bug” workaround. I’m using the defaults now:
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 18
rx-frames: 12
rx-usecs-irq: 18
rx-frames-irq: 2

I tried what you listed:

Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 6
rx-frames: 0
rx-usecs-irq: 6
rx-frames-irq: 0

Still getting RX packet drops in ifconfig (now 237523). I’m well aware of the rx_fw_discards issue, because I definitely need to monitor that. I usually monitor it through the ifconfig drops counter, however. But I CAN’T, because these other “drops”, whatever they are, also count toward the same counter. I guess I could monitor rx_fw_discards directly, but I don’t think I should have to. There should be a separate counter for the kernel to report dropped unknown packets, instead of adding them to the main ifconfig RX drop counter.
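
For the record, pulling it directly is simple enough if I end up having to:

ethtool -S eth0 | awk '/rx_fw_discards/ { print $2 }'
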
I’m trying to explain to my network guys about these unknown packets, and they are completely confused and uninterested (imagine that).
From what I can ascertain, I believe the “oui Unknown” packets are the cause, but I haven’t found a good explanation of what “oui Unknown” means. Since it says “unknown”, though, it seems a good candidate for these drops.

[QUOTE=kevins7189_5;6266]Thanks for the reply!
[/QUOTE]

Since the drops go away when you use packet capturing, that is significant. It either means the capture driver is acting as a sink for packets that have no owning protocol ( things like BPDU ), or it messes with the buffering in a way that allows the driver to accept the packets. I would do about 2-5 minutes of packet captures, eliminate everything which is IP / TCP / UDP, and look for something with the frequency you are seeing.
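
Something along these lines should do for the capture ( a sketch; -e prints the link-level headers so you can see the ethertype and source MAC of whatever is left ):

tcpdump -i eth0 -e 'not ip and not ip6 and not arp'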

If you are good friends with the network guy, have them disable BPDU, CDP, and other digital detritus on the switch port(s) feeding the server.

But the fact that the counters do not increment while you have captures running is a significant finding, supporting my theory that the drops counter is pretty meaningless.

– Bob

[QUOTE=Bob-O-Rama;6281]Since the drops go away when you use packet capturing, that is significant. It either means the capture driver is acting as a sink for packets that have no owning protocol ( things like BPDU ), or it messes with the buffering in a way that allows the driver to accept the packets. I would do about 2-5 minutes of packet captures, eliminate everything which is IP / TCP / UDP, and look for something with the frequency you are seeing.

If you are good friends with the network guy, have them disable BPDU, CDP, and other digital detritus on the switch port(s) feeding the server.

But the fact that the counters do not increment while you have captures running is a significant finding, supporting my theory that the drops counter is pretty meaningless.

– Bob[/QUOTE]

I’ll keep going with packet captures. The network guys have actually been helpful, but there seems to be a giant lack of information on why this “error” counting was implemented and what conditions trigger it. If the kernel devs thought this necessary, they should document the conditions so people could prepare.
Right now I’m focusing on the only “unknowns” I see in tcpdump, which are the “oui Unknown” entries from ARP. I don’t even know if these are the correct ones, but they’re all I have to go on. After further investigation, the drops are not quite as regular as every 30 seconds; sometimes no drops occur for several minutes, then 3 or 4 show up.
This is very annoying.

Are any of these issues related?
http://hardforum.com/showthread.php?t=1472177
http://serverfault.com/questions/77510/unknown-tcpdump-packets

If Broadcom NICs are supported by SLES, then shouldn’t their loopback packets be “supported” and therefore not count as errors?

Has anybody found any new information that they can share? I can’t believe this isn’t a big problem for a lot of people…

Did you read Magic31’s reply on 2012-08-09 at 09:54 (MDT)? He posted a link to a TID which does not seem, as you stated, “cop-out ish”, since it describes what is happening, why it happens, and I think it even provides a link back to when this change was checked in to the mainline kernel; it also describes using tcpdump as a test, as you found, to stop the counter from incrementing, showing that it is indeed because of the kernel change.

What would you have be different at this point? It seems to me that this change gives you a better picture of reality on your network, even though that now means you can see things being dropped where before you did not (“Pay no attention to the man behind the curtain.”). The reality of networking is that data get stopped, but networking is designed to handle that, too, from collision detection (which matters less today, it seems) to TCP checksums and acknowledgments.

Good luck.

[QUOTE=ab;6411]Did you read Magic31’s reply on 2012-08-09 at 09:54 (MDT)? He posted a link to a TID which does not seem, as you stated, “cop-out ish”, since it describes what is happening, why it happens, and I think it even provides a link back to when this change was checked in to the mainline kernel; it also describes using tcpdump as a test, as you found, to stop the counter from incrementing, showing that it is indeed because of the kernel change.

What would you have be different at this point? It seems to me that this change gives you a better picture of reality on your network, even though that now means you can see things being dropped where before you did not (“Pay no attention to the man behind the curtain.”). The reality of networking is that data get stopped, but networking is designed to handle that, too, from collision detection (which matters less today, it seems) to TCP checksums and acknowledgments.

Good luck.
[/QUOTE]

The issue is that the kernel devs could have made a separate “counter” for their “unknown” drops, which are, in reality, the kernel devs not keeping up with network protocols, not the other way around. We’ve found the SLES11 kernel doesn’t understand several Cisco protocols (really??), and even bonding protocols. If we need to “see” unknown packets, put them in a different counter. There are already LEGITIMATE drops in the DROP counter; we don’t need the unknown ones mixed in here too.
Nowhere that I can find does it say that, to be a kernel 3.0+ user, you need to clean up the unknown tiny packets that may be lurking on your network, because the devs think it’s important to count them…

rant over.