I have a pretty stable system with SLES 11 SP2 and HA, running Xen virtual servers on OCFS2 (IBM Blade with 3 physical servers).
A few weeks ago I created a new disk image on the OCFS2 partition, and that's when the problems started. When I write to that image, sometimes everything goes fine, but sometimes it freezes everything: all virtual machines freeze on all physical machines, as if they can no longer read or write to disk. The physical machines themselves seem to keep working.
After that, the virtual servers won't start properly for a few hours. For example, I start VirtualServer1 on Host1 and it starts, then I start VirtualServer2 on Host1 and it hangs, freezing both virtual servers. Or sometimes VirtualServer1 freezes during its boot. Sometimes they freeze and, after a while, a Windows virtual server for example complains that it has problems writing to disk and then boots or behaves strangely.
It makes no difference whether I reboot the physical servers or all of the hardware, or run fsck.ocfs2 (which reports no problems).
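For what it's worth, the check I mean is just the basic read-only pass, roughly along these lines (the device name here is only a placeholder, not my actual device):

    fsck.ocfs2 -fn /dev/sdb1    # -f forces the check, -n answers "no" to everything (read-only)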
Then suddenly, after maybe 2-3 hours, everything starts working again (until the next time I try to write something to the new image).
In the logs I can see, for example:
Mar 3 09:00:54 Server1 corosync[5651]: [TOTEM ] Process pause detected for 9846 ms, flushing membership messages.
Mar 3 09:00:54 Server1 corosync[5651]: [TOTEM ] A processor failed, forming new configuration.
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] CLM CONFIGURATION CHANGE
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] New Configuration:
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] r(0) ip(10.20.99.1)
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] r(0) ip(10.20.99.2)
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] r(0) ip(10.20.99.3)
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] Members Left:
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] Members Joined:
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 67356: memb=3, new=0, lost=0
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] info: pcmk_peer_update: memb: Server1 1
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] info: pcmk_peer_update: memb: Server2 2
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] info: pcmk_peer_update: memb: Server3 3
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] CLM CONFIGURATION CHANGE
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] New Configuration:
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] r(0) ip(10.20.99.1)
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] r(0) ip(10.20.99.2)
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] r(0) ip(10.20.99.3)
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] Members Left:
Mar 3 09:00:54 Server1 corosync[5651]: [CLM ] Members Joined:
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 67356: memb=3, new=0, lost=0
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] info: pcmk_peer_update: MEMB: Server1 1
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] info: pcmk_peer_update: MEMB: Server2 2
Mar 3 09:00:54 Server1 corosync[5651]: [pcmk ] info: pcmk_peer_update: MEMB: Server3 3
Mar 3 09:00:54 Server1 corosync[5651]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 3 09:00:54 Server1 corosync[5651]: [CPG ] chosen downlist: sender r(0) ip(10.20.99.1) ; members(old:3 left:0)
Mar 3 09:00:54 Server1 corosync[5651]: [MAIN ] Completed service synchronization, ready to provide service.
Overall it seems there is much more going on in the logs than there used to be. I'm beginning to think I have somehow messed up some config file.
That message about almost 10 seconds of scheduling delay would make me nervous.
Do you see other indications of network problems? Does this message ("[TOTEM ] Process pause detected for 9846 ms, flushing membership messages") appear in only a single node’s log?
Have you done any "tuning" to the kernels lately? Are the hosts (or "the host", if this appears on only a single physical node) under heavy load? Is that disk image you put on the OCFS2 drive the first one, and/or did it increase the network i/o substantially because of heavy disk i/o inside that VM?
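To narrow the first point down, it may help to grep each node's log and look at the local ring status, for example (the log path is the SLES default, adjust if yours differs):

    grep "Process pause detected" /var/log/messages
    corosync-cfgtool -s    # ring status and fault counters on the local node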
If you have a support contract, I strongly advise opening a service request to get optimum help from SUSE specialists on the subject… this may easily go beyond what can be handled via a public forum.
BTW, my guess is that it's not related to some Pacemaker config change, at least not in terms of resources/colocations/actions. I'd put my money on an i/o increase with resulting processing delays and/or a network i/o increase, leading to the reported delays and thus to OCFS2 timing out.
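Just to illustrate the mechanism (the values below are purely illustrative, not a suggestion to change your configuration): the tolerance for such pauses is governed by the totem token settings in /etc/corosync/corosync.conf, and a pause of almost 10 seconds will exceed any usual token timeout, forcing the membership change you see in the log:

    totem {
        # illustrative values - check what your own config actually contains
        token: 5000                                # ms without a token before a node is declared failed
        token_retransmits_before_loss_const: 10
    }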
Well, there are these "corosync[5457]: [TOTEM ] A processor failed, forming new configuration." errors on the other nodes, but I don't see the "[TOTEM ] Process pause detected…" message on the other nodes.
I haven't done any tuning on the kernel; I'm basically running the "stock" version and configs.
There hasn't been an increase in disk I/O or network I/O - rather a decrease, because I have taken down a few virtual machines. The new image was just more disk space for one machine. The freezing happens when I copy files from the "old" image to the "new" image, so there is more disk I/O at that moment. But I have copied files between images before and never had any problems.
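Next time I trigger such a copy I'll keep an eye on the load on the dom0s while it runs, with nothing fancier than the standard tools, e.g.:

    vmstat 1        # watch iowait and steal time while the copy is running
    iostat -x 1     # per-device utilization and await during the copy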
I am so sorry - obviously I missed your response, so this reply is worse than late.
There have been HAE updates in the meantime; were you able to resolve the issue? If not, I strongly suggest opening a service request to receive competent one-on-one support from a SUSE engineer.