Stopping pacemaker doesn't move resources to the other node

I have created a 2-node cluster on SLES 12. The configuration is as follows -

msnode1:~ # crm --version
crm 3.0.0

msnode1:~ #  corosync -v
Corosync Cluster Engine, version '2.3.6'
Copyright (c) 2006-2009 Red Hat, Inc.

msnode1:~ # crm config show
node 1: msnode1
node 2: msnode2
primitive mspersonal systemd:mspersonal \
op monitor interval=30s
primitive virtip IPaddr \
params ip=10.243.109.103 cidr_netmask=21 \
op monitor interval=30s
location cli-prefer-virtip virtip role=Started inf: msnode1
colocation msconstraint inf: virtip mspersonal
order msorder Mandatory: virtip mspersonal
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.16-4.8-77ea74d \
cluster-infrastructure=corosync \
cluster-name=mscluster \
stonith-enabled=false \
placement-strategy=balanced \
help \
list \
last-lrm-refresh=1561341732
rsc_defaults rsc-options: \
resource-stickiness=100 \
migration-threshold=2
op_defaults op-options: \
timeout=600 \
record-pending=true

msnode1:~ # crm status
Stack: corosync
Current DC: msnode1 (version 1.1.16-4.8-77ea74d) - partition with quorum
Last updated: Tue Jun 25 17:43:44 2019
Last change: Tue Jun 25 17:38:21 2019 by hacluster via cibadmin on msnode1

2 nodes configured
2 resources configured

Online: [ msnode1 msnode2 ]

Full list of resources:

virtip (ocf::heartbeat:IPaddr): Started msnode1
mspersonal (systemd:mspersonal): Started msnode1

When I stop the cluster on node1 (or reboot node1), the resources start on msnode2 but then immediately stop, and the status changes to -

msnode1:~ # systemctl stop pacemaker
msnode2:~ # crm status
Stack: corosync
Current DC: msnode2 (version 1.1.16-4.8-77ea74d) - partition WITHOUT quorum
Last updated: Tue Jun 25 17:44:26 2019
Last change: Tue Jun 25 17:38:20 2019 by hacluster via cibadmin on msnode1

2 nodes configured
2 resources configured

Online: [ msnode2 ]
OFFLINE: [ msnode1 ]

Full list of resources:

virtip (ocf::heartbeat:IPaddr): Stopped
mspersonal (systemd:mspersonal): Stopped

When I restart the pacemaker service on msnode1, the resources start back on msnode1 -

msnode1:~ # systemctl start pacemaker
msnode1:~ # crm status
Stack: corosync
Current DC: msnode2 (version 1.1.16-4.8-77ea74d) - partition with quorum
Last updated: Tue Jun 25 17:46:09 2019
Last change: Tue Jun 25 17:38:20 2019 by hacluster via cibadmin on msnode1

2 nodes configured
2 resources configured

Online: [ msnode1 msnode2 ]

Full list of resources:

virtip (ocf::heartbeat:IPaddr): Started msnode1
mspersonal (systemd:mspersonal): Started msnode1

But when I redo the same exercise, the resources actually start on msnode2 -

msnode1:~ # systemctl stop pacemaker
msnode2:~ # crm status
Stack: corosync
Current DC: msnode2 (version 1.1.16-4.8-77ea74d) - partition WITHOUT quorum
Last updated: Tue Jun 25 17:47:00 2019
Last change: Tue Jun 25 17:38:20 2019 by hacluster via cibadmin on msnode1

2 nodes configured
2 resources configured

Online: [ msnode2 ]
OFFLINE: [ msnode1 ]

Full list of resources:

virtip (ocf::heartbeat:IPaddr): Started msnode2
mspersonal (systemd:mspersonal): Started msnode2

But when I start pacemaker again on msnode1, the resources move back to msnode1, which I didn't expect because resource-stickiness is set to 100 (and stickiness works fine when one of the resources fails).

I am not able to figure out what I am missing in this cluster configuration that would set everything right.

The cli-prefer-virtip location constraint means that you have run "crm resource migrate virtip msnode1". When node1 dies, the resource will migrate to node2, and when node1 is back, it will move back to node1.

Remove it by running "crm resource unmigrate virtip", which will delete the cli-prefer constraint.
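For reference, a minimal sketch of clearing that constraint (the second command is an equivalent alternative that removes the constraint by its ID):

[CODE]# clears the constraint created by "crm resource migrate"
crm resource unmigrate virtip

# equivalent: delete the location constraint directly by its ID
crm configure delete cli-prefer-virtip[/CODE]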

Also, consider adding a fencing mechanism - either SBD or another type - in order to guarantee that resources will be failed over.
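If you go the SBD route, a rough sketch of the setup looks like this. The shared-disk path is a placeholder for whatever small LUN you dedicate to SBD, SBD_DEVICE in /etc/sysconfig/sbd must point at the same device, and the agent name may be external/sbd or fence_sbd depending on what is installed on your system:

[CODE]# initialize the SBD header on a small shared LUN (placeholder path)
sbd -d /dev/disk/by-id/<shared-sbd-disk> create

# define the fencing resource and enable STONITH
crm configure primitive stonith-sbd stonith:external/sbd
crm configure property stonith-enabled=true[/CODE]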

Another issue:

You need "two_node: 1" set in corosync.conf on both nodes (use "csync2 -m /etc/corosync/corosync.conf; csync2 -x" to sync it between the nodes).

Note: enabling two_node automatically enables "wait_for_all", which means that if both nodes die (power outage, double fencing, anything like that), then both nodes must be started before any resource is taken.
I would recommend leaving that at its default, as you don't have a fencing mechanism and you could otherwise end up with the IP and the service started on both nodes - which is bad.
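As a rough illustration (your existing quorum section may contain additional settings), the quorum block in /etc/corosync/corosync.conf would look something like this:

[CODE]quorum {
    # use the votequorum provider
    provider: corosync_votequorum
    # allow the cluster to keep quorum with a single node
    two_node: 1
    # implied by two_node: 1 unless explicitly disabled
    # wait_for_all: 1
}[/CODE]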

[QUOTE=strahil-nikolov-dxc;58016]The cli-prefer-virtip location constraint means that you have run "crm resource migrate virtip msnode1". When node1 dies, the resource will migrate to node2, and when node1 is back, it will move back to node1.[/QUOTE]

I am actually interested in adding a fencing mechanism, and would like to find out what other types of fencing can be added and how. Also, what is the best fencing mechanism? From my requirement perspective, I don't want a node to be rebooted as part of fencing - is that possible?

[QUOTE]Another issue:

You need "two_node: 1" set in corosync.conf on both nodes (use "csync2 -m /etc/corosync/corosync.conf; csync2 -x" to sync it between the nodes).

Note: enabling two_node automatically enables "wait_for_all", which means that if both nodes die (power outage, double fencing, anything like that), then both nodes must be started before any resource is taken.
I would recommend leaving that at its default, as you don't have a fencing mechanism and you could otherwise end up with the IP and the service started on both nodes - which is bad.[/QUOTE]

I probably didn't understand this feature. You mean if I have two_node=0, then there is a greater chance of split brain? And if two_node=1, then both nodes must be up before resources are taken? Well, I don't want either. I actually want the node which comes up first with the cluster to take the resources, and the other to remain as standby.

However, if I set up fencing, will this attribute still need to be set? Moreover, my cluster can grow and shrink dynamically (nodes can be added and removed), so I don't want to rely on any such attribute.

Thanks a lot!

By default quorum is 50% + 1 vote, so with 2 nodes you need 2 votes to have quorum.
Yet in a two-node cluster you want to keep quorum even with 1 node, so there is a special option, "two_node: 1", which tells the cluster that only 1 vote is needed to survive.

So, in your case you need "two_node: 1" and a proper stonith device (whether SBD or another type). Otherwise this will never be a real cluster, as it will be susceptible to split-brain.
If you add a 3rd node via "ha-cluster-join", that flag (two_node) is automatically removed; it should be added back when you use "ha-cluster-remove" to go from 3 nodes down to 2.

[QUOTE=strahil-nikolov-dxc;58019]By default quorum is 50% + 1 vote, so with 2 nodes you need 2 votes to have quorum.
Yet in a two-node cluster you want to keep quorum even with 1 node, so there is a special option, "two_node: 1", which tells the cluster that only 1 vote is needed to survive.[/QUOTE]
Understood, thanks. However, I noticed that with this option enabled, when the node where the services are running reboots, the services and virtual IP don't migrate to the other node that is still running. Is that how it is supposed to work?

Can you please suggest a fencing mechanism where there is no need to bring a node down, only to stop the required services?

[QUOTE]By default quorum is 50% + 1 vote, so with 2 nodes you need 2 votes to have quorum.
Yet in a two-node cluster you want to keep quorum even with 1 node, so there is a special option, "two_node: 1", which tells the cluster that only 1 vote is needed to survive.[/QUOTE]
Thanks for the explanation. Understood the point.

[QUOTE]So, in your case you need "two_node: 1" and a proper stonith device (whether SBD or another type).[/QUOTE]

Can you please help me implement a fencing mechanism which doesn't shoot down the node but just stops the services that I want?

[QUOTE=singhm16;58031]Understood, thanks. However, I noticed that with this option enabled, when the node where the services are running reboots, the services and virtual IP don't migrate to the other node that is still running. Is that how it is supposed to work?

Can you please suggest a fencing mechanism where there is no need to bring a node down, only to stop the required services?[/QUOTE]

Yes, when you reboot a node (or the cluster service on it is restarted), the cluster will try to minimize downtime and bring the resources up on the other node.
Of course that can be controlled via a location constraint (for example, if you have a 3-node cluster and you don't want the cluster to start the resource/group on a specific node).
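A minimal sketch of such a constraint in crm shell, assuming a hypothetical group "msgroup" and a third node "msnode3" that should never run it:

[CODE]# never place the group on msnode3
location loc-avoid-msnode3 msgroup -inf: msnode3[/CODE]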

About the fencing - the idea behind STONITH (Shoot The Other Node In The Head) is to guarantee that the resource will not be used by the problematic node (for example the node might freeze, and after a minute or two - in a recent case of mine, 5 minutes - it recovers just in time to write something to the filesystem), so the cluster can safely start the resource on the working node.

If your service uses shared storage, you can use fence_scsi (with pcmk_reboot_action="off"), which uses SCSI persistent reservations (the storage must support them - most does), but there is no guarantee that the frozen node will release the IP if the resources are not in a single resource group.
So you should have:

  1. Filesystem
  2. IP
  3. APP

Once the node is fenced, it will fail to write to the filesystem and the Filesystem resource will be marked as failed. Everything after it in the group will then be stopped.
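A rough sketch of that layout in crm configure syntax, with placeholder device paths, mount point, and resource names (adjust to your environment); the stonith primitive assumes the fence-agents package is installed:

[CODE]# fencing via SCSI persistent reservations; "off" instead of reboot,
# and "provides=unfencing" so the cluster can unfence the node again
primitive fence-scsi stonith:fence_scsi \
    params devices="/dev/disk/by-id/<shared-lun>" \
        pcmk_host_list="msnode1 msnode2" pcmk_reboot_action="off" \
    meta provides=unfencing

# filesystem first, then the IP, then the application
primitive msfs Filesystem \
    params device="/dev/disk/by-id/<shared-lun>-part1" \
        directory="/srv/msdata" fstype="xfs" \
    op monitor interval=30s
group msgroup msfs virtip mspersonal

property stonith-enabled=true[/CODE]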

Edit:
Of course, putting the cluster into maintenance mode is mandatory when doing a system update (with SBD I would recommend also stopping the cluster software locally after setting maintenance mode) or when doing maintenance on the application itself.
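For reference, a minimal sketch of toggling maintenance mode cluster-wide:

[CODE]# enable maintenance mode before the update...
crm configure property maintenance-mode=true

# ...and disable it afterwards
crm configure property maintenance-mode=false[/CODE]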

[QUOTE=mnshsnghl;58032]
Can you please help me implement a fencing mechanism which doesn't shoot down the node but just stops the services that I want?[/QUOTE]

That's tricky. If you use shared storage, you can use fence_scsi with pcmk_reboot_action="off", which will deregister the node from the shared storage.

For this to work properly, you need to put the IP after the filesystem resource in a single group. Once the node is fenced, the filesystem will no longer be available and the node will stop everything after the filesystem resource (for example the shared IP + app).

Edit: Check my previous comment for an example.

For Filesystem monitoring, use "OCF_CHECK_LEVEL=20" in the 'op' section.
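In crm shell that looks roughly like this (device path, mount point, interval and timeout are just example values); check level 20 makes the monitor perform a write test on the mounted filesystem rather than only checking that it is mounted:

[CODE]primitive msfs Filesystem \
    params device="/dev/disk/by-id/<shared-lun>-part1" \
        directory="/srv/msdata" fstype="xfs" \
    op monitor interval=30s timeout=60s OCF_CHECK_LEVEL=20[/CODE]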

[QUOTE=strahil-nikolov-dxc;58040]That's tricky. If you use shared storage, you can use fence_scsi with pcmk_reboot_action="off", which will deregister the node from the shared storage.

For this to work properly, you need to put the IP after the filesystem resource in a single group. Once the node is fenced, the filesystem will no longer be available and the node will stop everything after the filesystem resource (for example the shared IP + app).

Edit: Check my previous comment for an example.[/QUOTE]

Hi,
I was trying to implement your suggestion but found that my SLES system (SLES 15) doesn't have the fence_scsi agent at all -

$ crm ra list stonith
apcmaster apcmastersnmp apcsmart
baytech cyclades drac3
external/drac5 external/dracmc-telnet external/ec2
external/hetzner external/hmchttp external/ibmrsa
external/ibmrsa-telnet external/ipmi external/ippower9258
external/kdumpcheck external/libvirt external/nut
external/rackpdu external/riloe external/vcenter
external/vmware external/xen0 external/xen0-ha
fence_legacy ibmhmc meatware
nw_rpc100s rcd_serial rps10
suicide wti_mpc wti_nps

Is there a way I can get this resource agent, or has it been replaced with something else?

I have it on my test SLES 15.1 cluster

[CODE]# crm ra list stonith | grep scsi
fence_pve fence_raritan fence_rcd_serial fence_rhevm fence_rsa fence_rsb fence_sanbox2 fence_sbd fence_scsi

# rpm -qf /usr/sbin/fence_scsi

fence-agents-4.2.1+git.1537269352.7b1fd536-5.38.x86_64[/CODE]

The patterns I selected during install:

[CODE]S  | Name          | Summary                    | Type
---+---------------+----------------------------+--------
i  | apparmor      | AppArmor                   | pattern
i  | base          | Minimal Base System        | pattern
i+ | enhanced_base | Enhanced Base System       | pattern
i+ | ha_sles       | High Availability          | pattern
i+ | minimal_base  | Minimal Appliance Base     | pattern
i  | yast2_basis   | YaST System Administration | pattern[/CODE]

[QUOTE=strahil-nikolov-dxc;58432]I have it on my test SLES 15.1 cluster

[CODE]# crm ra list stonith | grep scsi
fence_pve fence_raritan fence_rcd_serial fence_rhevm fence_rsa fence_rsb fence_sanbox2 fence_sbd fence_scsi

# rpm -qf /usr/sbin/fence_scsi

fence-agents-4.2.1+git.1537269352.7b1fd536-5.38.x86_64[/CODE]

The patterns I selected during install:

[CODE]S  | Name          | Summary                    | Type
---+---------------+----------------------------+--------
i  | apparmor      | AppArmor                   | pattern
i  | base          | Minimal Base System        | pattern
i+ | enhanced_base | Enhanced Base System       | pattern
i+ | ha_sles       | High Availability          | pattern
i+ | minimal_base  | Minimal Appliance Base     | pattern
i  | yast2_basis   | YaST System Administration | pattern[/CODE][/QUOTE]

Thanks. I think that will work. Let me try that.

Hi,

Finally, I was able to test with fence_scsi and it worked as far as it is supposed to. But in order to unfence the node it apparently has to be rebooted, which requires manual attention (or can be automated in some way), and that is probably not what I am looking for.

My requirements are -

  1. The node should not go down with fencing.
  2. The unfencing operation should not require a reboot of the node.

1) can be achieved with fence_scsi, but for 2) I couldn't find a way.

So, if 1) and 2) are not both possible, then I am fine going with SBD-based fencing, but in that case I would like the fence operation to actually shut down the node instead of rebooting it; the reboot of the node is orchestrated from some other flow.

Is there a way I can achieve this?

Can you show the fencing device configuration?

crm configure show <device_name>

Most probably you haven't added 'meta provides="unfencing"' to your fencing device.
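As a sketch, assuming your stonith primitive is called fence-scsi, the meta attribute can be added to the existing resource like this:

[CODE]# add the unfencing hint to an existing stonith resource
crm resource meta fence-scsi set provides unfencing[/CODE]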

[QUOTE=strahil-nikolov-dxc;58693]Can you show the fencing device configuration?

crm configure show <device_name>

Most probably you haven't added 'meta provides="unfencing"' to your fencing device.[/QUOTE]

I don't have the fence_scsi configuration right now, because I started experimenting with something else. I will try this option, but even if I configure it, I still don't know what will trigger the unfencing operation on the node. I read in multiple places: "The failed node will no longer be able to write to the device(s). A manual reboot is required."
Since a reboot was not an option for me, I didn't configure that. But doesn't this sentence mean that a manual reboot needs to be triggered on the node, and that during the course of the reboot (maybe before rebooting) the node will be unfenced?

What I would want is to be able to "unfence" the node without rebooting. Is there a way to do it? Can a node unfence itself once the issue with the node is resolved?

Regards
Maneesh

[QUOTE=mnshsnghl;58698]I don't have the fence_scsi configuration right now, because I started experimenting with something else. I will try this option, but even if I configure it, I still don't know what will trigger the unfencing operation on the node. I read in multiple places: "The failed node will no longer be able to write to the device(s). A manual reboot is required."
Since a reboot was not an option for me, I didn't configure that. But doesn't this sentence mean that a manual reboot needs to be triggered on the node, and that during the course of the reboot (maybe before rebooting) the node will be unfenced?

What I would want is to be able to "unfence" the node without rebooting. Is there a way to do it? Can a node unfence itself once the issue with the node is resolved?

Regards
Maneesh[/QUOTE]

Unfencing should be done by pacemaker itself via the meta attribute from the previous comments. Sadly, my test cluster runs on top of VMware, which doesn't support SCSI-3 persistent reservations.
If I have time, I will deploy an iSCSI target and try it myself.
Keep in mind that fence_scsi will automatically detect which LUNs need to be fenced/unfenced if they are part of a volume group with the "c" (clustered) flag, but it will require the "devices=" parameter if you use HA-LVM (which I prefer, as it supports dual corosync rings).
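For reference, a quick way to check whether a volume group carries the clustered flag (the last character of the attribute string is "c" for a clustered VG):

[CODE]# list VGs with their attribute string; a trailing "c" means clustered
vgs -o vg_name,vg_attr[/CODE]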