This past Sunday, 11/3, I upgraded my servers (all running as virtual guests under z/VM) from SLES 11 SP2 to SLES 11 SP3. On Monday my monitoring tool showed that WebSphere stopped processing work for a few minutes at a time, with several occurrences that day. Each time, the server suddenly started processing work again on its own. On Tuesday, 11/5, processing stopped on one server for almost 16 minutes. I searched the WebSphere and HTTP logs and found that no messages were issued during that time.
I started a tcpdump on the web server looking for any traffic to or from the F5 and found that there was none. I asked Network Engineering to assist and they said that the load balancer in the F5 sends keepalive requests to the web server and removes it from the rotation if it doesn’t respond within 5 seconds. The F5 doesn’t place the web server back into the rotation until it has responded properly for several iterations.
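To catch the exact moment the web server stops answering, something like the following could stand in for the F5 monitor. This is only a minimal sketch: the 5-second timeout comes from what Network Engineering told me, but the plain HTTP GET of "/", the one-second polling interval, and the names probe and watch are my own assumptions, since I don't know what the F5 monitor actually sends.

```python
import socket
import time


def probe(host, port, timeout=5.0, path="/"):
    """Send one HTTP GET and return (ok, seconds_elapsed, detail)."""
    start = time.monotonic()
    try:
        # Connect and read only the first chunk; the status line is enough.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
            sock.sendall(request.encode("ascii"))
            status_line = sock.recv(256).split(b"\r\n", 1)[0]
        elapsed = time.monotonic() - start
        # Assumption: any HTTP status line inside the timeout counts as alive.
        ok = status_line.startswith(b"HTTP/") and elapsed <= timeout
        return ok, elapsed, status_line.decode("latin-1")
    except OSError as exc:
        # Covers refused connections, resets, and socket timeouts.
        return False, time.monotonic() - start, repr(exc)


def watch(host, port, interval=1.0, count=None):
    """Probe repeatedly and print one timestamped line per attempt."""
    attempts = 0
    while count is None or attempts < count:
        ok, elapsed, detail = probe(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        state = "OK" if ok else "FAIL"
        print(f"{stamp} {state} {elapsed:.3f}s {detail}", flush=True)
        attempts += 1
        time.sleep(interval)
```

Calling watch(host, 80) from another guest on the same network, with host set to the web server's address, would give timestamps to line up against the tcpdump output and the F5's own log.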
The CPU utilization for the web server stays between 0.2% and about 1.2% (as displayed by ‘top -H’) and z/VM reports that the guest is using about 2%. The LPAR on the mainframe that runs z/VM has 5 IFL engines (1204 MIPS each) and was using the equivalent of 2.5 engines, so CPU resources aren’t the problem. z/VM was not paging very heavily either.
I noticed that, while the issue was occurring, it took over a second for the login prompt (via PuTTY) to be displayed and another second or two for the password prompt to be displayed. Normally, these are displayed almost instantly. I log in via a different interface, hsi0.
I thought that the problem was related to the SLES upgrade, but the engineer in Network Engineering said that he saw the issue happening last week, although at that time it appeared to the F5 that the interface went down and came right back up. He did say that the problem got worse starting on Sunday after the SLES upgrade was completed.
What commands can I issue while the issue is occurring to gain some insight into what is happening? At this point I am not ruling out that IBM HTTP Server is the one that stops responding, but there is nothing in either the access log or the error log that shows a problem.
I checked the output of ifconfig for the eth0 interface and there are no errors, dropped packets, etc.
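Since ifconfig only shows running totals, comparing two snapshots of /proc/net/dev taken before and after an occurrence might show whether any counter moves during the stall. A rough sketch, assuming the usual 16-column /proc/net/dev layout; the function names are only illustrative.

```python
# Column order of /proc/net/dev after the interface name (8 receive, 8 transmit).
FIELDS = (
    "rx_bytes", "rx_packets", "rx_errs", "rx_drop",
    "rx_fifo", "rx_frame", "rx_compressed", "rx_multicast",
    "tx_bytes", "tx_packets", "tx_errs", "tx_drop",
    "tx_fifo", "tx_colls", "tx_carrier", "tx_compressed",
)


def parse_net_dev(text):
    """Turn the text of /proc/net/dev into {interface: {field: value}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, _, rest = line.partition(":")
        values = rest.split()
        if len(values) == len(FIELDS):
            counters[name.strip()] = dict(zip(FIELDS, map(int, values)))
    return counters


def snapshot(path="/proc/net/dev"):
    """Read the current counters for every interface."""
    with open(path) as handle:
        return parse_net_dev(handle.read())


def delta(before, after, iface):
    """Per-field difference between two snapshots for one interface."""
    return {key: after[iface][key] - before[iface][key] for key in FIELDS}
```

Taking one snapshot() when things are normal and another right after a stall, then calling delta(before, after, "eth0"), would show whether the error or drop counters changed at all in that window.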
Please let me know if you need more info.
Harley