Increasing amount of kernel deadlocks caused by kworker-proc

Hey guys,

we have some issues with a SLES 11SP4 machine. In the /var/log/messages and the dmesg as well, of course, there are some kernel deadlock messages. See code below.
These deadlocks are mostly pointing to the kworkerprocess. Sometimes they affect other processes as well.

Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.784995] warn_alloc_failed: 2 callbacks suppressed Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.784998] kworker/0:1: page allocation failure: order:1, mode:0x20 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785008] Pid: 0, comm: kworker/0:1 Tainted: G X 3.0.101-77-default #1 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785009] Call Trace: Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785033] [<ffffffff81004b95>] dump_trace+0x75/0x300 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785045] [<ffffffff81466bb3>] dump_stack+0x69/0x6f Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785053] [<ffffffff81102ff6>] warn_alloc_failed+0xc6/0x170 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785056] [<ffffffff81104c0f>] __alloc_pages_slowpath+0x56f/0x820 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785059] [<ffffffff811050a9>] __alloc_pages_nodemask+0x1e9/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785070] [<ffffffff81147be3>] kmem_getpages+0x53/0x190 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785073] [<ffffffff81148a16>] fallback_alloc+0x196/0x270 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785075] [<ffffffff81149607>] kmem_cache_alloc+0x207/0x2a0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785087] [<ffffffff813995e2>] sk_prot_alloc+0x42/0x1b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785090] [<ffffffff8139ac3a>] sk_clone+0x1a/0x320 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785098] [<ffffffff813e9220>] inet_csk_clone+0x10/0xb0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785102] [<ffffffff814027e1>] tcp_create_openreq_child+0x21/0x4b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785104] [<ffffffff81400b37>] tcp_v4_syn_recv_sock+0x47/0x240 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785106] [<ffffffff8140264f>] tcp_check_req+0x2df/0x450 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785108] [<ffffffff81401727>] tcp_v4_hnd_req+0x67/0x130 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785110] [<ffffffff8140197b>] tcp_v4_do_rcv+0x18b/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785112] [<ffffffff81401f87>] tcp_v4_rcv+0x597/0x7f0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785115] [<ffffffff813dea38>] ip_local_deliver_finish+0x118/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785118] [<ffffffff813a760a>] __netif_receive_skb+0x60a/0x640 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785124] [<ffffffff813acf30>] netif_receive_skb+0x90/0xa0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785134] [<ffffffffa02036f1>] vmxnet3_rq_rx_complete+0x4b1/0xa60 [vmxnet3] Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785144] [<ffffffffa0203e04>] vmxnet3_poll_rx_only+0x44/0xd0 [vmxnet3] Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785147] [<ffffffff813adfda>] net_rx_action+0x10a/0x2b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785157] [<ffffffff8106925f>] __do_softirq+0x11f/0x260 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785163] [<ffffffff81472fdc>] call_softirq+0x1c/0x30 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785165] [<ffffffff81004695>] do_softirq+0x65/0xa0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785168] [<ffffffff81069025>] irq_exit+0xc5/0xe0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785171] [<ffffffff81003a5f>] do_IRQ+0x5f/0xe0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785174] [<ffffffff8146a193>] common_interrupt+0x13/0x13 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785181] [<ffffffff810300a2>] native_safe_halt+0x2/0x10 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785186] [<ffffffff8100b195>] default_idle+0x145/0x150 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785189] [<ffffffff81002116>] cpu_idle+0x66/0xb0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785196] kworker/0:1: page allocation failure: order:1, mode:0x20 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785197] Pid: 0, comm: kworker/0:1 Tainted: G X 3.0.101-77-default #1 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785198] Call Trace: Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785200] [<ffffffff81004b95>] dump_trace+0x75/0x300 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785202] [<ffffffff81466bb3>] dump_stack+0x69/0x6f Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785204] [<ffffffff81102ff6>] warn_alloc_failed+0xc6/0x170 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785206] [<ffffffff81104c0f>] __alloc_pages_slowpath+0x56f/0x820 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785208] [<ffffffff811050a9>] __alloc_pages_nodemask+0x1e9/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785211] [<ffffffff81147be3>] kmem_getpages+0x53/0x190 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785213] [<ffffffff81148a16>] fallback_alloc+0x196/0x270 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785215] [<ffffffff81149607>] kmem_cache_alloc+0x207/0x2a0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785218] [<ffffffff813995e2>] sk_prot_alloc+0x42/0x1b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785219] [<ffffffff8139ac3a>] sk_clone+0x1a/0x320 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785222] [<ffffffff813e9220>] inet_csk_clone+0x10/0xb0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785224] [<ffffffff814027e1>] tcp_create_openreq_child+0x21/0x4b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785226] [<ffffffff81400b37>] tcp_v4_syn_recv_sock+0x47/0x240 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785227] [<ffffffff8140264f>] tcp_check_req+0x2df/0x450 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785229] [<ffffffff81401727>] tcp_v4_hnd_req+0x67/0x130 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785231] [<ffffffff8140197b>] tcp_v4_do_rcv+0x18b/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785233] [<ffffffff81401f87>] tcp_v4_rcv+0x597/0x7f0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785235] [<ffffffff813dea38>] ip_local_deliver_finish+0x118/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785238] [<ffffffff813a760a>] __netif_receive_skb+0x60a/0x640 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785240] [<ffffffff813acf30>] netif_receive_skb+0x90/0xa0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785243] [<ffffffffa02036f1>] vmxnet3_rq_rx_complete+0x4b1/0xa60 [vmxnet3] Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785247] [<ffffffffa0203e04>] vmxnet3_poll_rx_only+0x44/0xd0 [vmxnet3] Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785250] [<ffffffff813adfda>] net_rx_action+0x10a/0x2b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785252] [<ffffffff8106925f>] __do_softirq+0x11f/0x260 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785255] [<ffffffff81472fdc>] call_softirq+0x1c/0x30 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785257] [<ffffffff81004695>] do_softirq+0x65/0xa0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785259] [<ffffffff81069025>] irq_exit+0xc5/0xe0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785261] [<ffffffff81003a5f>] do_IRQ+0x5f/0xe0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785263] [<ffffffff8146a193>] common_interrupt+0x13/0x13 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785266] [<ffffffff810300a2>] native_safe_halt+0x2/0x10 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785268] [<ffffffff8100b195>] default_idle+0x145/0x150 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.785270] [<ffffffff81002116>] cpu_idle+0x66/0xb0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984156] kworker/0:1: page allocation failure: order:1, mode:0x20 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984159] Pid: 0, comm: kworker/0:1 Tainted: G X 3.0.101-77-default #1 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984161] Call Trace: Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984171] [<ffffffff81004b95>] dump_trace+0x75/0x300 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984176] [<ffffffff81466bb3>] dump_stack+0x69/0x6f Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984180] [<ffffffff81102ff6>] warn_alloc_failed+0xc6/0x170 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984183] [<ffffffff81104c0f>] __alloc_pages_slowpath+0x56f/0x820 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984186] [<ffffffff811050a9>] __alloc_pages_nodemask+0x1e9/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984190] [<ffffffff81147be3>] kmem_getpages+0x53/0x190 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984193] [<ffffffff81148a16>] fallback_alloc+0x196/0x270 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984196] [<ffffffff81149607>] kmem_cache_alloc+0x207/0x2a0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984201] [<ffffffff813995e2>] sk_prot_alloc+0x42/0x1b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984203] [<ffffffff8139ac3a>] sk_clone+0x1a/0x320 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984207] [<ffffffff813e9220>] inet_csk_clone+0x10/0xb0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984210] [<ffffffff814027e1>] tcp_create_openreq_child+0x21/0x4b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984212] [<ffffffff81400b37>] tcp_v4_syn_recv_sock+0x47/0x240 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984214] [<ffffffff8140264f>] tcp_check_req+0x2df/0x450 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984216] [<ffffffff81401727>] tcp_v4_hnd_req+0x67/0x130 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984218] [<ffffffff8140197b>] tcp_v4_do_rcv+0x18b/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984220] [<ffffffff81401f87>] tcp_v4_rcv+0x597/0x7f0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984223] [<ffffffff813dea38>] ip_local_deliver_finish+0x118/0x200 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984226] [<ffffffff813a760a>] __netif_receive_skb+0x60a/0x640 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984229] [<ffffffff813acf30>] netif_receive_skb+0x90/0xa0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984240] [<ffffffffa02036f1>] vmxnet3_rq_rx_complete+0x4b1/0xa60 [vmxnet3] Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984250] [<ffffffffa0203e04>] vmxnet3_poll_rx_only+0x44/0xd0 [vmxnet3] Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984253] [<ffffffff813adfda>] net_rx_action+0x10a/0x2b0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984258] [<ffffffff8106925f>] __do_softirq+0x11f/0x260 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984262] [<ffffffff81472fdc>] call_softirq+0x1c/0x30 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984264] [<ffffffff81004695>] do_softirq+0x65/0xa0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984267] [<ffffffff81069025>] irq_exit+0xc5/0xe0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984270] [<ffffffff81003a5f>] do_IRQ+0x5f/0xe0 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984273] [<ffffffff8146a193>] common_interrupt+0x13/0x13 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984277] [<ffffffff810300a2>] native_safe_halt+0x2/0x10 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984280] [<ffffffff8100b195>] default_idle+0x145/0x150 Nov 24 12:49:27 SOME-SYSTEM-NAME kernel: [11737893.984283] [<ffffffff81002116>] cpu_idle+0x66/0xb0

Anyone here, that could give us some clues where are we locking at?
I know that a deadlock is nothing necessary to take care about, but in this increasing amount i want to do something :slight_smile:

Thank everyone for helping us out.

Greetings from germany

Ulli

A couple questions:

Physical or virtual, and with which technologies if virtual (KVM, Xen,
VirtualBox, etc.)? Actually, looks like VMWare code is there; have you
tried making sure that is patched, or removing it temporarily from the VM,
or testing with KVM or something else like that?

How much RAM does this box have?

What is this box doing with regard to running services/processes?

Having little experience with these things, the only thing that stands out
to me is that the top of the stack appears to be trying to allocate
memory, presumably because of an attempt to setup a network connection
somehow due to vmxnet3, though I’m not sure yet if it is when receiving a
connection or making a new outbound connection. Running out of memory at
that time seems crazy short of a bug, thus the tests above. Maybe this is
just the symptom after something else has negatively impacted the system
(taking all memory) in which case seeing more about current system memory
utilization may be useful.


Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below…

Thank you for your answer.
Here are the informations that you requested.

[QUOTE=ab;35505]A couple questions:

Physical or virtual, and with which technologies if virtual (KVM, Xen,
VirtualBox, etc.)? Actually, looks like VMWare code is there; have you
tried making sure that is patched, or removing it temporarily from the VM,
or testing with KVM or something else like that?
[/QUOTE]
We are using VMware. That is the only virtualisationtechnologie we are using, so i cannot test it with an other version.

Here is the output of “free”, i hope that is okay.

SOME-MACHINE-NAME:~ # free total used free shared buffers cached Mem: 8062704 6203156 1859548 400 369092 651052 -/+ buffers/cache: 5183012 2879692 Swap: 2095100 273764 1821336

[QUOTE=ab;35505]What is this box doing with regard to running services/processes?[/CODE]
It is a tomcat webserver which provides services to customers. This services has some perfomance issues as well, but i think that this has nothing to do with that kind of performanceissues.

[QUOTE=ab;35505]Having little experience with these things, the only thing that stands out
to me is that the top of the stack appears to be trying to allocate
memory, presumably because of an attempt to setup a network connection
somehow due to vmxnet3, though I’m not sure yet if it is when receiving a
connection or making a new outbound connection. Running out of memory at
that time seems crazy short of a bug, thus the tests above. Maybe this is
just the symptom after something else has negatively impacted the system
(taking all memory) in which case seeing more about current system memory
utilization may be useful.


Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below…[/QUOTE]

I thought about that earlier :slight_smile: and i do it already.
Maybe it has to do something with the ipv6 …?
We will deactivate that today in a maintenancewindow. The machine will be rebooted as well.

The thing with the connections reminds me to a thing, i heard at the beginning of that perfomance issue, some of the users are runnning in timeouts, when they want to use this tomcatwebserver.

Thank you for your help, i appriciate that.
Maybe the new informations could help getting new suggestions.

Bye

Ulli

Hey Guys,

after rebooting and deactivating the deadlocks are gone.
The performanceissues with the tomcatwebapplication is gone as well…

@ab
thank you for your help

Someone an idea if the deadlocks have something to do with ipv6?

I will mark this thread as solved - if someone like to share suggestions, feel free - i dont like to sit in the dark about that :slight_smile:

edit:
i can’t mark it as solved :rolleyes: