What is the “proper” method for updating an HAE cluster?
I have encountered a problem several times now after doing the following (roughly the commands sketched after this list):
- Put a node in standby mode
- Run zypper update on that node
- Reboot that node
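For reference, this is approximately the sequence I run on the node being patched; whether the node is named explicitly or the commands are run locally without an argument shouldn't matter here (uaweb02 is just one of my node names):

# put the node into standby so its resources move to the peer
crm node standby uaweb02
# apply the pending HAE updates
zypper update
# reboot to load the updated kernel modules (the *-kmp-default packages)
reboot
# once it has rejoined the cluster, bring it back online
crm node online uaweb02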
When the node comes back up, the cluster is in some sort of split-brain mess: the node that was updated sees the one that stayed online as offline, and that one sees the updated node as pending or offline.
I can’t bring the updated node back online so I can take the other one down, which means I end up disrupting services by having them BOTH down while applying the update to the other node.
The updates that were applied this time were (columns: status, repository, package, current version, available version, arch):
v | SLE11-HAE-SP2-Updates | cluster-glue | 1.0.9.1-0.36.1 | 1.0.9.1-0.38.2 | x86_64
v | SLE11-HAE-SP2-Updates | cluster-network-kmp-default | 1.4_3.0.34_0.7-2.10.30 | 1.4_3.0.38_0.5-2.16.1 | x86_64
v | SLE11-HAE-SP2-Updates | corosync | 1.4.1-0.13.1 | 1.4.3-0.5.1 | x86_64
v | SLE11-HAE-SP2-Updates | crmsh | 1.1.0-0.17.3 | 1.1.0-0.19.16 | x86_64
v | SLE11-HAE-SP2-Updates | gfs2-kmp-default | 2_3.0.34_0.7-0.7.30 | 2_3.0.38_0.5-0.7.37 | x86_64
v | SLE11-HAE-SP2-Updates | ldirectord | 3.9.2-0.25.5 | 3.9.3-0.7.1 | x86_64
v | SLE11-HAE-SP2-Updates | libcorosync4 | 1.4.1-0.13.1 | 1.4.3-0.5.1 | x86_64
v | SLE11-HAE-SP2-Updates | libglue2 | 1.0.9.1-0.36.1 | 1.0.9.1-0.38.2 | x86_64
v | SLE11-HAE-SP2-Updates | libpacemaker3 | 1.1.6-1.29.1 | 1.1.7-0.9.1 | x86_64
v | SLE11-HAE-SP2-Updates for x86_64 | ocfs2-kmp-default | 1.6_3.0.34_0.7-0.7.30 | 1.6_3.0.38_0.5-0.7.37 | x86_64
v | SLE11-HAE-SP2-Updates | pacemaker | 1.1.6-1.29.1 | 1.1.7-0.9.1 | x86_64
v | SLE11-HAE-SP2-Updates | pacemaker-mgmt | 2.1.0-0.8.74 | 2.1.0-0.10.2 | x86_64
v | SLE11-HAE-SP2-Updates | pacemaker-mgmt-client | 2.1.0-0.8.74 | 2.1.0-0.10.2 | x86_64
v | SLE11-HAE-SP2-Updates | resource-agents | 3.9.2-0.25.5 | 3.9.3-0.7.1 | x86_64
v | SLE11-HAE-SP2-Updates | yast2-cluster | 2.15.0-8.35.4 | 2.15.0-8.39.1 | noarch
The errors spewing into /var/log/messages are:
Sep 25 15:50:08 uaweb02 crmd: [4461]: info: update_dc: Set DC to uaweb01 (3.0.5)
Sep 25 15:50:08 uaweb02 cib: [4456]: WARN: cib_process_replace: Replacement 0.86.74 not applied to 0.88.27: current epoch is greater than the replacement
Sep 25 15:50:08 uaweb02 cib: [4456]: WARN: cib_diff_notify: Update (client: uaweb02, call:392016): -1.-1.-1 -> 0.86.74 (Update was older than existing configuration)
Sep 25 15:50:08 uaweb02 cib: [4456]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=uaweb01/crmd/392018, version=0.88.27): ok (rc=0)
Sep 25 15:50:08 uaweb02 crmd: [4461]: info: do_election_count_vote: Election 175156 (owner: uaweb01) lost: vote from uaweb01 (Version)
Any ideas why this occurs, and whether there is any way I can get this node back online long enough to update the other one?
Thanks.
Allen Beddingfield
Systems Engineer
The University of Alabama