SAP HANA HA SBD STONITH device questions.

We are planning multiple two-node SLES for SAP 15 SP1 clusters for SAP HANA and SAP NetWeaver ASCS. Below are some details.

  1. 2 data centers about 16 miles apart
  2. SLES for SAP 15 SP1
  3. VMware ESXi 6.7
  4. SAP HANA 2.0
  5. SAP ABAP ASCS
  6. PURE Storage SAN in each datacenter

Since the SBD device needs to be shared, how do I protect it if I have two datacenters? The shared disk would be in one datacenter or the other, but not both, or can it be?
If we lose the datacenter that is hosting the SBD device, then we lose it. We are thinking about using Azure to host an iSCSI target for this, but that means our on-prem clusters would rely on an SBD STONITH device in the cloud.
Is this a viable solution? I have read the clustering guide from SUSE, but I have not seen this addressed anywhere. Any recommendations, suggestions, and documentation would be appreciated.

Bryan

[QUOTE=snopudAdmin;59376]We are planning multiple two-node SLES for SAP 15 SP1 clusters for SAP HANA and SAP NetWeaver ASCS. Below are some details.

  1. 2 data centers about 16 miles apart
  2. SLES for SAP 15 SP1
  3. VMware ESXi 6.7
  4. SAP HANA 2.0
  5. SAP ABAP ASCS
  6. PURE Storage SAN in each datacenter

Since the SBD device needs to be shared, how do I protect it if I have two datacenters? The shared disk would be in one datacenter or the other, but not both, or can it be?
If we lose the datacenter that is hosting the SBD device, then we lose it. We are thinking about using Azure to host an iSCSI target for this, but that means our on-prem clusters would rely on an SBD STONITH device in the cloud.
Is this a viable solution? I have read the clustering guide from SUSE, but I have not seen this addressed anywhere. Any recommendations, suggestions, and documentation would be appreciated.

Bryan[/QUOTE]

Hi Bryan,
You are not limited to SBD only. The HA stack also supports VMware-based fencing (e.g. via vCenter), so that is also an option.
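
As a rough sketch, a vCenter-based STONITH resource configured with crmsh could look like the following; the server name, credential store path, and node-to-VM mapping are placeholders, and this assumes the external/vcenter STONITH plugin shipped with the HA extension:

[CODE]
# crm configure: fence via vCenter (placeholder values)
primitive stonith-vcenter stonith:external/vcenter \
    params VI_SERVER="vcenter.example.com" \
    VI_CREDSTORE="/etc/vicredentials.xml" \
    HOSTLIST="node1=vm_node1;node2=vm_node2" \
    RESETPOWERON="0" \
    op monitor interval="60s"
[/CODE]
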
You can also have the following setup (an SBD configuration sketch follows the list):

  1. SBD from SAN in DC1 - shared to both nodes
  2. SBD from SAN in DC2 - shared to both nodes
  3. SBD from iSCSI server in nearest cloud provider - shared to both nodes
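
As a minimal sketch of that layout, each device is initialized once and then all three are listed in /etc/sysconfig/sbd on both nodes; the device paths below are placeholders for the LUNs from DC1, DC2, and the cloud iSCSI target:

[CODE]
# Initialize each SBD device from one node (this wipes the device!)
sbd -d /dev/disk/by-id/scsi-SAN-DC1 create
sbd -d /dev/disk/by-id/scsi-SAN-DC2 create
sbd -d /dev/disk/by-id/scsi-ISCSI-CLOUD create

# /etc/sysconfig/sbd on both nodes: semicolon-separated list of all three devices
SBD_DEVICE="/dev/disk/by-id/scsi-SAN-DC1;/dev/disk/by-id/scsi-SAN-DC2;/dev/disk/by-id/scsi-ISCSI-CLOUD"
SBD_WATCHDOG_DEV="/dev/watchdog"
[/CODE]

With three devices, SBD keeps working as long as a majority (two of three) is reachable, so losing one datacenter does not take fencing down with it.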

Keep in mind that two-node clusters are quite vulnerable to split-brain situations. If you don’t have a third location where you can bring up a VM, SLES 15 now supports a quorum device: a simple arbitrator daemon (corosync-qnetd) that runs on any Linux machine and can be shared between clusters, which you can install on a cloud instance. You can use it to establish a quorum of 3 and prevent split-brain (both DCs should have separate connectivity, e.g. a VPN, to that device).
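
For illustration, the quorum section of /etc/corosync/corosync.conf on the cluster nodes would point at the arbitrator roughly like this; the host address is a placeholder:

[CODE]
quorum {
    provider: corosync_votequorum
    # remove any two_node: 1 setting when a qdevice is configured
    device {
        model: net
        votes: 1
        net {
            host: 10.0.0.10       # address of the corosync-qnetd instance (placeholder)
            algorithm: ffsplit    # 50/50 split algorithm, suited to 2-node clusters
            tls: on
        }
    }
}
[/CODE]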

[QUOTE=strahil-nikolov-dxc;59377]Hi Bryan,
You are not limited to SBD only. The HA stack also supports VMware-based fencing (e.g. via vCenter), so that is also an option.
You can also have the following setup:

  1. SBD from SAN in DC1 - shared to both nodes
  2. SBD from SAN in DC2 - shared to both nodes
  3. SBD from iSCSI server in nearest cloud provider - shared to both nodes

Keep in mind that two-node clusters are quite vulnerable to split-brain situations. If you don’t have a third location where you can bring up a VM, SLES 15 now supports a quorum device: a simple arbitrator daemon (corosync-qnetd) that runs on any Linux machine and can be shared between clusters, which you can install on a cloud instance. You can use it to establish a quorum of 3 and prevent split-brain (both DCs should have separate connectivity, e.g. a VPN, to that device).[/QUOTE]

Thank you for the information. We currently have vCenter STONITH for our SAP HANA DB. Would it require vCenter in both datacenters to eliminate a single point of failure?

Cheers

Bryan

Also, can you point me to some documentation that talks about the quorum device?

Sadly, I couldn’t find SUSE-specific documentation.
Actually, the process would involve the following (a command-level sketch follows the list):

  1. Install corosync-qdevice on both nodes, and corosync-qnetd on the third location
  2. Check the man page: 'man corosync-qdevice'
  3. Open the firewall port; in the Red Hat documentation it’s 5403/TCP
  4. Configure and start qnetd on the third (shared between many clusters) location
  5. Configure corosync.conf on the cluster nodes and sync it with csync2
  6. Reload corosync and verify with 'corosync-quorumtool -s' and 'corosync-cfgtool -s'
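
Roughly, and assuming the SLES package names and a firewalld-based setup (addresses are placeholders; see the corosync.conf snippet earlier in the thread), the steps map to commands like these:

[CODE]
# On the third location (the arbitrator):
zypper install corosync-qnetd
firewall-cmd --permanent --add-port=5403/tcp && firewall-cmd --reload
systemctl enable --now corosync-qnetd

# On both cluster nodes:
zypper install corosync-qdevice
# ...add the quorum/device section to /etc/corosync/corosync.conf...
csync2 -xv                    # push the config to the other node
systemctl restart corosync-qdevice

# Verify:
corosync-quorumtool -s        # on a node: expected votes should now be 3
corosync-qnetd-tool -l        # on the arbitrator: lists connected clusters
[/CODE]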

Of course, reloading corosync in production is riskier, so set the cluster to maintenance mode and stop the cluster one node at a time with 'crm cluster stop'.
Then start the nodes in the opposite order (the last node stopped is started first) and verify with 'crm_mon -r1'.
Before removing the maintenance mode, use 'crm_simulate' to verify what will happen when maintenance is switched off.
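
As a sketch of that sequence (run the node-local commands on each node in turn):

[CODE]
crm configure property maintenance-mode=true   # put the cluster into maintenance

crm cluster stop          # on node1 first, then on node2

# Start in reverse order: node2 first, then node1
crm cluster start

crm_mon -r1               # one-shot status, including inactive resources
crm_simulate -sL          # dry run against the live CIB, with scores
crm configure property maintenance-mode=false  # only after the dry run looks sane
[/CODE]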