We have deployed several SP2 servers for testing, and are finding this annoying issue.
Approx every 30 seconds, the RX dropped counter ticks up by 1. This is only happening on SLES11 SP2 systems.
Usually when this counter ticks, it lines up with rx_fw_discards in ethtool statistics. I cannot find a matching stat for this drop.
I can confirm this behaviour… also within ESXi5 VMs, so it does not depend on the NIC (we use vmxnet3 NICs).
Anyway, one dropped packet every ~30 seconds seems not to bother anything.
[QUOTE=enovaklbank;5817]I can confirm this behaviour… also within ESXi5 VMs, so it does not depend on the NIC (we use vmxnet3 NICs).
Anyway, one dropped packet every ~30 seconds seems not to bother anything.[/QUOTE]
Well it is a pain because we monitor these counts for legitimate drops. There has to be a reason why this is happening.
I’ve been seeing this too with a couple of SLES 11 SP2 Xen setups. I was planning on setting up a small (network) isolated test server to see if it happens there too.
[QUOTE=Magic31;5956]While trying to work out how to find out what the server is dropping, I came across a tool called dropwatch (http://linux.die.net/man/1/dropwatch). Not sure if it will run on SLES 11, but there are some packages for it here: http://pkgs.org/download/dropwatch
[/QUOTE]
[QUOTE=kevins7189_5;6117]Interesting utility.
I was able to get this every time I saw the counter move up about every 30 sec. No idea what it means
1 drops at __netif_receive_skb+1fe (0xffffffff8138388e)[/QUOTE]
Interesting indeed
I haven’t been able to put more time into this than trying to simulate it in a (very small) test environment. Funnily enough, I did not witness any dropped packets there.
My next move is to do this at the two sites where I am seeing this, but I have not had a window to do so yet (time and other priorities).
If you can open an SR, that would be best. It could well be that certain types of packets are intentionally dropped for some security reason or other. I don’t know nearly enough about the workings of the kernel and modules… but this snippet taken from a Google search did catch my interest:
Add a packet_type handler and see if we can prevent
other packet_type’s from handling an skb
Specifically, we will register our packet_type to be
the first handler invoked by netif_receive_skb()
If the packet received meets certain conditions, then,
drop it, i.e, prevent subsequent ptype_all and ptype_base
handlers in netif_receive_skb() from processing the packet
Sorry I can’t be of more help here. I will pass on this thread to my Novell contact to see if this might be something Novell is aware of.
That reply seems kinda cop-out-ish. It seems like this would be an easy problem to find information on if it has been happening since 2.6.37, but I can’t find any that easily.
Have no drops here, so that one is out
cat /proc/net/softnet_stat
03619a59 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035d90a8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
03609a2e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
035dce37 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
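For anyone who wants to check the same thing without reading hex by eye, here is a minimal sketch. It assumes the usual layout of /proc/net/softnet_stat: one line per CPU, hexadecimal values, with the first three columns being packets processed, packets dropped because the backlog queue was full, and time_squeeze. The function names are my own, and later columns differ between kernel versions, so they are ignored.

```python
from typing import Dict, List

# Names for the first three hexadecimal columns of /proc/net/softnet_stat.
# Later columns vary between kernel versions, so this sketch ignores them.
FIELDS = ("processed", "dropped", "time_squeeze")


def parse_softnet_stat(text: str) -> List[Dict[str, int]]:
    """Return one dict per CPU line with the first three counters decoded."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < len(FIELDS):
            continue  # skip blank or truncated lines
        rows.append({name: int(cols[i], 16) for i, name in enumerate(FIELDS)})
    return rows


def read_softnet_stat(path: str = "/proc/net/softnet_stat") -> List[Dict[str, int]]:
    """Read and parse the live file (Linux only)."""
    with open(path) as fh:
        return parse_softnet_stat(fh.read())
```

Calling `read_softnet_stat()` on the box and checking that every row has `dropped == 0` is the same test as the output above: if that column stays at zero, the backlog queue is not where the packets are going.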
The other issue is that this goes away if I run tcpdump on the machine and try to catch the packets. I can go hours and hours with tcpdump running and the rx counter won’t move, but soon after shutting it down the rx counters start incrementing again.
This may be a known issue with the bnx2 driver’s default buffer coalescing settings. Does the rx_filtered_packets value match ( approximately ) the rx discards value shown in ifconfig? If so, then show us output of
and see if this reduces / eliminates the counters increasing.
Also, this can be ( cough ) “perfectly normal”, because when the server is unable to find a sink for the packets, it drops them. These can be things like BPDU packets or other traffic that is not a layer 3 protocol the server listens for. Use of tcpdump or Wireshark may well stop the counter from increasing, as it will be a sink for all packets; it also changes the ring buffer to capture packets.
If the above incantation fixes the issue, you can leave it that way. The likely cause is a periodic burst of packets which overruns the driver buffer pool. “Stuff happens.”
rx_fw_discards = 117
rx dropped in ifconfig (62 days uptime) is 237501
I’ve done some reading on the coalesce thing with bnx2 but I thought it was more of a troubleshooting step than a “bug”, but using default now.
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
I still get rx packet drops in ifconfig (now 237523). I’m well aware of the rx_fw_discards issue, because I definitely need to monitor that. I usually monitor it through ifconfig drops, however. But I CAN’T, because these other “drops”, whatever they are, are also counted in the same counter. I guess I could monitor rx_fw_discards directly, but I don’t think I should have to. There should be a separate counter for the kernel to report dropping unknown packets, rather than the main ifconfig rx counter.
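If monitoring rx_fw_discards directly turns out to be the workaround, a rough sketch of the bookkeeping might look like this. It assumes the `name: value` line format that `ethtool -S <iface>` prints, and it assumes (as described above, not verified) that the firmware discards are included in the ifconfig dropped total. The function names are hypothetical.

```python
from typing import Dict


def parse_ethtool_stats(text: str) -> Dict[str, int]:
    """Parse the 'name: value' lines printed by `ethtool -S <iface>`."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so names that contain colons survive.
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def unexplained_drops(rx_dropped: int, stats: Dict[str, int],
                      key: str = "rx_fw_discards") -> int:
    """Drops in the interface counter that the driver counter does not explain."""
    return rx_dropped - stats.get(key, 0)
```

With the numbers from this thread (237501 dropped in ifconfig, rx_fw_discards = 117), `unexplained_drops` gives 237384, which is the part that would have to come from somewhere other than the NIC firmware.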
I’m trying to explain to my network guys about these unknown packets, and they are completely confused and uninterested (imagine that).
From what I can ascertain, I believe the OUI UNKNOWN packets are the cause, but I haven’t found a good explanation of what oui Unknown means. But since it says “unknown”, it seems a good candidate for these drops.
[QUOTE=kevins7189_5;6266]Thanks for the reply!
[/QUOTE]
Since the drops go away when you use packet capturing, that is significant. It either means the capture driver is acting as a sink for packets that have no owning protocol ( things like BPDU ) or it messes with the buffering to allow the driver to correctly accept the packets. I would do about 2-5 minutes of packet captures and eliminate everything which is IP / TCP / UDP and look for something with the frequency you are seeing.
If you are good friends with the network guy, have them disable BPDU, CDP, and other digital detritus on the switch port(s) feeding the server.
But that the counters do not increment when you have captures is a significant finding supporting my theory that the drops counter is pretty meaningless.
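To make the "eliminate everything which is IP" step concrete, here is a small sketch that labels a raw Ethernet frame by its type/length field. It is an illustration only: the EtherType table is deliberately short, the function name is made up, and it assumes untagged frames handed in as bytes (for example, read from a saved capture). Values below 0x0600 are treated as an 802.3 length followed by an LLC header, which is where BPDU (DSAP/SSAP 0x42) and CDP (SNAP, Cisco OUI 00:00:0c, PID 0x2000) show up.

```python
import struct

# A few well-known EtherType values; anything else is reported numerically.
ETHERTYPES = {
    0x0800: "IPv4",
    0x0806: "ARP",
    0x86DD: "IPv6",
    0x8100: "802.1Q VLAN tag",
    0x8809: "Slow protocols (e.g. LACP)",
    0x88CC: "LLDP",
}


def classify_frame(frame: bytes) -> str:
    """Return a short label for an untagged Ethernet frame."""
    if len(frame) < 14:
        return "truncated"
    (type_or_len,) = struct.unpack("!H", frame[12:14])
    if type_or_len >= 0x0600:
        return ETHERTYPES.get(type_or_len, "EtherType 0x%04x" % type_or_len)
    # 802.3 length field: an LLC header (DSAP, SSAP, control) follows.
    if len(frame) < 17:
        return "802.3 (truncated LLC)"
    dsap, ssap = frame[14], frame[15]
    if dsap == 0x42 and ssap == 0x42:
        return "STP BPDU"
    if dsap == 0xAA and ssap == 0xAA and len(frame) >= 22:
        # SNAP: 3-byte OUI and 2-byte protocol ID after the control byte.
        oui = frame[17:20]
        (pid,) = struct.unpack("!H", frame[20:22])
        if oui == b"\x00\x00\x0c" and pid == 0x2000:
            return "CDP"
        return "SNAP oui %s pid 0x%04x" % (oui.hex(), pid)
    return "LLC dsap 0x%02x ssap 0x%02x" % (dsap, ssap)
```

Counting the labels over a few minutes of capture and comparing their frequency with the rate the drop counter was moving before the capture started should narrow down the candidates. Keep in mind that, as seen earlier in the thread, the counter itself stops moving while the capture is running.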
[QUOTE=Bob-O-Rama;6281]Since the drops go away when you use packet capturing, that is significant. It either means the capture driver is acting as a sink for packets that have no owning protocol ( things like BPDU ) or it messes with the buffering to allow the driver to correctly accept the packets. I would do about 2-5 minutes of packet captures and eliminate everything which is IP / TCP / UDP and look for something with the frequency you are seeing.
If you are good friends with the network guy, have them disable BPDU, CDP, and other digital detritus on the switch port(s) feeding the server.
But that the counters do not increment when you have captures is a significant finding supporting my theory that the drops counter is pretty meaningless.
– Bob[/QUOTE]
I’ll keep going with packet captures. The network guys have actually been helpful, but there seems to be a giant lack of information on why this “error” was implemented, and what parameters would make it do so. If the kernel devs thought this necessary, they should document the conditions so people could prepare.
Right now I’m focusing on the only “unknowns” I see in tcpdump which are “OUI Unknown” from arp. I don’t even know if these are the correct ones, but all I have to go on. After further investigation, the errors are not quite as timely as every 30 seconds, sometimes no drops occur for several minutes, then 3 or 4 show up.
This is very annoying.
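Since the timing turned out to be irregular, it may help to log exactly when the counter moves instead of watching ifconfig by hand. The sketch below assumes the per-interface counter is exposed at /sys/class/net/<iface>/statistics/rx_dropped (standard sysfs on Linux); the interface name, polling interval, and function names are placeholders of mine.

```python
import time
from typing import Iterable, List, Tuple


def read_rx_dropped(iface: str) -> int:
    """Read the kernel's rx_dropped counter for one interface from sysfs."""
    with open("/sys/class/net/%s/statistics/rx_dropped" % iface) as fh:
        return int(fh.read())


def increments(samples: Iterable[Tuple[float, int]]) -> List[Tuple[float, int]]:
    """Given (timestamp, counter) samples, return (timestamp, delta) for each rise."""
    events = []
    prev = None
    for ts, value in samples:
        if prev is not None and value > prev:
            events.append((ts, value - prev))
        prev = value
    return events


def watch(iface: str = "eth0", interval: float = 1.0,
          count: int = 300) -> List[Tuple[float, int]]:
    """Poll the counter `count` times and report when it went up."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), read_rx_dropped(iface)))
        time.sleep(interval)
    return increments(samples)
```

The timestamps from `watch()` can then be lined up against switch logs or a capture taken on another port, which avoids the problem that running tcpdump on the server itself hides the drops.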
Did you read Magic31’s reply on 2012-08-09 at 09:54 (MDT)? He posted a
link to a TID which does not seem, as you stated, “cop-out ish”, since
it describes what is happening, why it happens, and I think even
provides a link back to when this was checked-in to the mainline kernel;
it also describes using tcpdump as a test, as you found, to stop the
counter from incrementing to show that it is indeed because of the
kernel change.
What would you have be different at this point? It seems to me that
this change gives you a better picture of reality on your network, even
though that now means you can see things which are being dropped where
before you did not (“Pay no attention to the man behind the curtain.”).
The reality of networking is that data are dropped, but networking is
designed to handle that too from collision detection (which matters less
today it seems) to TCP checksums and acknowledgments.
Good luck.
[QUOTE=ab;6411]Did you read Magic31’s reply on 2012-08-09 at 09:54 (MDT)? He posted a
link to a TID which does not seem, as you stated, “cop-out ish”, since
it describes what is happening, why it happens, and I think even
provides a link back to when this was checked-in to the mainline kernel;
it also describes using tcpdump as a test, as you found, to stop the
counter from incrementing to show that it is indeed because of the
kernel change.
What would you have be different at this point? It seems to me that
this change gives you a better picture of reality on your network, even
though that now means you can see things which are being dropped where
before you did not (“Pay no attention to the man behind the curtain.”).
The reality of networking is that data are dropped, but networking is
designed to handle that too from collision detection (which matters less
today it seems) to TCP checksums and acknowledgments.
Good luck.
[/QUOTE]
The issue is that the kernel devs could have made their own “counter” for their “unknown” drops, which are, in reality, the kernel devs not keeping up with network protocols, not the other way around. We’ve found the SLES11 kernel doesn’t understand several Cisco protocols (really??), or even bonding protocols. If we need to “see” unknown packets, put them in a different counter. There are already LEGITIMATE drops in the DROP counter; we don’t need to see the unknown ones here too.
Nowhere that I can find does it say that, to be a kernel 3.0-and-up user, you need to clean up the unknown tiny packets that may be lurking on your network, because we think it’s important to count them…