Need assistance diagnosing a possible interface problem.

This past Sunday, 11/3, I upgraded my servers (all running as virtual guests under z/VM) to SLES 11 SP3 (from SLES 11 SP2). On Monday my monitoring tool showed that WebSphere stopped processing work for just a few minutes at a time, with several occurrences that day; each time, the server suddenly started processing work again. On Tuesday, 11/5, processing stopped on one server for almost 16 minutes. I searched the WebSphere and HTTP logs and found that no messages were issued during that time.

I started a tcpdump on the web server looking for any traffic from/to the F5 and found that there was none. I asked Network Engineering to assist and they said that the load balancer in the F5 sends keepalive requests to the web server and removes it from the rotation if it doesn’t respond in 5 seconds. The F5 doesn’t place the web server back into the rotation until it responds properly for several iterations.
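For anyone following along, a capture along these lines (ip.addr.of.f5 is a placeholder for the F5 self IP) is enough to see whether the F5's monitor traffic is reaching the guest at all:

Code:

tcpdump -i eth0 -nn host ip.addr.of.f5   # -nn skips DNS/port name lookups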

The CPU utilization for the web server stays between 0.2 and about 1.2% (as displayed by ‘top -H’) and z/VM reports that the guest is using about 2%. The LPAR on the mainframe that runs z/VM has 5 IFL engines (1204 MIPS each) and was using the equivalent of 2.5 engines, so CPU resources aren’t the problem. z/VM was not paging very heavily either.

I noticed, while the issue was occurring, that it took over a second for the login prompt (via PuTTY) to be displayed and a second or two for the password prompt to be displayed. Normally, both are displayed almost instantly. Note that I log in via a different interface, hsi0.

I thought that the problem was related to the SLES upgrade, but the Network Engineering tech said that he saw the issue happening last week, although then it appeared to the F5 that the interface went down and came right back up. He did say that the problem got worse starting on Sunday, after the SLES upgrade was completed.

What commands can I issue while the issue is occurring to gain some insight into what is happening? At this point I am not ruling out that IBM HTTP Server is the one that stops responding, but nothing in the access or error logs shows a problem.

I checked the output of ifconfig for the eth0 interface and there are no errors, dropped packets, etc.
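That was only a point-in-time check, though; something like this (an untested sketch) could log the counters every second, so a transient spike during an incident would be caught:

Code:

watch --interval=1 'date +%s >> /tmp/eth0.out; ifconfig eth0 | grep -E "errors|dropped" >> /tmp/eth0.out'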

Please let me know if you need more info.

Harley

First, great write-up.

On 11/06/2013 11:34 AM, x0500hl wrote:

> This past Sunday, 11/3, I upgraded my servers (all running as virtual
> guests under z/VM) to SLES 11 SP3 (from SLES 11 SP2). On Monday my
> monitoring tool showed that WebSphere stopped processing work for just a
> few minutes at a time, with several occurrences that day; each time, the
> server suddenly started processing work again. On Tuesday, 11/5,
> processing stopped on one server for almost 16 minutes. I searched the
> WebSphere and HTTP logs and found that no messages were issued during
> that time.
>
> I started a tcpdump on the web server looking for any traffic from/to
> the F5 and found that there was none. I asked Network Engineering to
> assist and they said that the load balancer in the F5 sends keepalive
> requests to the web server and removes it from the rotation if it
> doesn’t respond in 5 seconds. The F5 doesn’t place the web server back
> into the rotation until it responds properly for several iterations.

How exactly does it check for connectivity? Is it checking the application (HTTP) layer, the transport layer (an open port), the network layer (able to ping, etc.), or some combination? Knowing this would help narrow down where the failure is.

> The CPU utilization for the web server stays between 0.2 and about 1.2%
> (as displayed by ‘top -H’) and z/VM reports that the guest is using
> about 2%. The LPAR on the mainframe that runs z/VM has 5 IFL engines
> (1204 MIPS each) and was using the equivalent of 2.5 engines, so CPU
> resources aren’t the problem. z/VM was not paging very heavily either.

What does the ‘uptime’ command return on the SLES system in question,
particularly with regard to the load average numbers?
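If top looks idle but the system still stalls, logging the load average once a second during an incident can catch threads blocked in uninterruptible (D-state) I/O wait, which inflates the load average without using any CPU. A quick sketch:

Code:

watch --interval=1 'date +%s >> /tmp/load.out; cat /proc/loadavg >> /tmp/load.out'

ps aux | awk '$8 ~ /D/'   # during a stall: list processes stuck in D state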

> I noticed, while the issue was occurring, that it took over a second for
> the login prompt (via PuTTY) to be displayed and a second or two for the
> password prompt to be displayed. Normally, both are displayed almost
> instantly. Note that I log in via a different interface, hsi0.
>
> I thought that the problem was related to the SLES upgrade, but the
> Network Engineering tech said that he saw the issue happening last week,
> although then it appeared to the F5 that the interface went down and
> came right back up. He did say that the problem got worse starting on
> Sunday, after the SLES upgrade was completed.
>
> What commands can I issue while the issue is occurring to gain some
> insight into what is happening? At this point I am not ruling out that
> IBM HTTP Server is the one that stops responding, but nothing in the
> access or error logs shows a problem.

Have you tried recreating the same test that the load balancer does, at some level? For example, maybe run the following, each in its own shell (assuming the web server uses TCP 80), on a machine of yours to test that system:

Code:

ping ip.addr.of.system

watch --interval=1 'date +%s >> /tmp/netcat.out; netcat -znv ip.addr.of.system 80 >> /tmp/netcat.out 2>&1'   # netcat -v reports on stderr, hence 2>&1

watch --interval=1 'date +%s >> /tmp/curl.out; curl http://ip.addr.of.system/; echo $? >> /tmp/curl.out'   # make this URL return something useful
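If the stall recurs, a gap in the epoch timestamps, a change in the netcat result, or a nonzero curl exit status in those files will tell you which layer (ICMP, TCP, or HTTP) stopped answering, and for exactly how long.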

The output of /var/log/messages during a time of error, or the output of
the ‘dmesg’ command, may be useful.
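For example (just a sketch): follow the syslog live in one shell, and after an incident check the kernel ring buffer for driver messages (e.g., from the qeth driver that backs the network interfaces on a z/VM guest) that may not have reached syslog:

Code:

tail -f /var/log/messages

dmesg | tail -50   # run after an incident; look for qeth/interface messages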


Good luck.

Ab, thank you for your post.

My coworker was on vacation that week, and when I described the problem to him he remembered that he had worked on the same issue in January 2012. Apparently the upgrade to SP3 increased memory usage, which limited the number of threads available to IBM HTTP Server (IHS). In other occurrences of the problem, IHS issued messages saying that it couldn’t create a new thread and that it was performing a graceful restart. Several of the graceful restarts kept the web server out of the rotation for up to 40 minutes (while the in-flight work cleared out).

We received some Apache tuning recommendations from IBM (to increase the number of threads) and we increased the virtual memory available to IHS from 128 MB to 256 MB. So far, the problem hasn’t recurred.
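For anyone who finds this thread later: the thread knobs live in IHS's httpd.conf (worker MPM). The snippet below only illustrates which directives are involved; the values are placeholders, not the numbers IBM actually recommended, and the virtual memory change isn't shown here:

Code:

# httpd.conf, worker MPM section (illustrative values only)
<IfModule worker.c>
    ServerLimit      4      # max number of child processes
    ThreadLimit      64     # hard ceiling for ThreadsPerChild
    ThreadsPerChild  64     # threads each child may create
    MaxClients       256    # ServerLimit x ThreadsPerChild
    StartServers     2
</IfModule>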

Harley