SLES 11 SP1 CLUSTER - NODE OFFLINE

Hello,

After configuring the cluster with 2 nodes, each node shows itself as the
DC in its status and reports the other node as offline (unclean). Please
help me troubleshoot this.

Node 1

Code:

============
Last updated: Fri Nov 11 05:39:25 2011
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
0 Resources configured.

Online: [ cluster1 ]
OFFLINE: [ cluster2 ]


Node 2

Code:

============
Last updated: Fri Nov 11 05:07:45 2011
Stack: openais
Current DC: cluster2 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
0 Resources configured.

Node cluster1: UNCLEAN (offline)
Online: [ cluster2 ]


How do I start troubleshooting this?

regards,

ccilleperuma.



Hi ccilleperuma,

this looks like some sort of communication problem to me. I recommend
taking a closer look at the log output (probably syslog) - it can be
rather verbose and tells you step by step what both nodes are
attempting.

The usual suggestions are to check for IP connectivity, firewall issues
and the like. Also double-check your configuration, e.g. the cluster IP
ports defined on both nodes…
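
A minimal first round of checks could look like this (a sketch - the
peer address and multicast port are examples taken from this thread, so
substitute your own values):

Code:

# On each node, verify the peer is reachable over the cluster interface
ping -c 3 192.168.30.71      # from cluster1; use 192.168.30.70 from cluster2

# Check that corosync/openais is listening on the configured multicast port
netstat -ulpn | grep 5454

# Make sure no firewall is blocking UDP traffic on the cluster interface
rcSuSEfirewall2 status

# Watch the cluster messages in syslog while restarting one node
tail -f /var/log/messages | grep -Ei 'corosync|openais|pacemaker'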

Regards
Jens


from the times when today’s “old school” was “new school” :eek:


I will post the configuration details that I posted in another forum, to
give a clearer idea of the situation.

Hi,

Sorry for the delayed reply, I had another issue to solve.

On my cluster servers there is no ha.cf, probably because of the
following:
Code:

Ultimately it will change in SLES11, HA will be replaced with OpenAIS and follow the same packaging and naming convention according to the recent changes in the project.

and the openais.conf file says:

Code:

This configuration file is not used any more

Please refer to /etc/corosync/corosync.conf


So I am listing the corosync.conf file of cluster1 here:

Code:

aisexec {
        # Group to run aisexec as. Needs to be root for Pacemaker
        group:  root

        # User to run aisexec as. Needs to be root for Pacemaker
        user:   root
}
service {
        # Default to start mgmtd with pacemaker
        use_mgmtd:      yes
        ver:    0
        name:   pacemaker
}
totem {
        # The mode for the redundant ring. "none" is used when only one
        # interface is specified; otherwise only active or passive may be chosen
        rrp_mode:       none

        # How long to wait for join messages in the membership protocol, in ms
        join:   60

        # The maximum number of messages that may be sent by one processor
        # on receipt of the token
        max_messages:   20

        # The virtual synchrony filter type used to identify a primary
        # component. Change with care.
        vsftype:        none

        # The fixed 32-bit value to identify a node to cluster membership.
        # Optional for IPv4, required for IPv6. 0 is reserved for other usage.
        nodeid: 1

        # How long to wait for consensus to be achieved before starting a
        # new round of membership configuration
        consensus:      4000

        # HMAC/SHA1 should be used to authenticate all messages
        secauth:        on

        # How many token retransmits should be attempted before forming a
        # new configuration
        token_retransmits_before_loss_const:    10

        # How many threads should be used to encrypt and send messages.
        # Only meaningful when secauth is turned on.
        threads:        1

        # Timeout for a lost token, in ms
        token:  3000

        # The only valid version is 2
        version:        2

        interface {
                # Network address to bind to for this interface setting
                bindnetaddr:    192.168.30.0

                # The multicast address to be used
                mcastaddr:      226.0.1.5

                # The multicast port to be used
                mcastport:      5454

                # The ring number assigned to this interface setting
                ringnumber:     0
        }

        # To make sure the auto-generated nodeid is positive
        clear_node_high_bit:    no
}
logging {
        # Log to a specified file
        to_logfile:     no

        # Log to syslog
        to_syslog:      yes

        # Whether or not to include debug information in the log
        debug:  off

        # Log timestamps as well
        timestamp:      on

        # Log to the standard error output
        to_stderr:      yes

        # Log the source file and line as well
        fileline:       off

        # Facility to use in syslog
        syslog_facility:        daemon
}
amf {
        # Enable or disable AMF
        mode:   disable
}

The hosts file of cluster1:

Code:

# hosts         This file describes a number of hostname-to-address
#               mappings for the TCP/IP subsystem. It is mostly
#               used at boot time, when no name servers are running.
#               On small systems, this file can be used instead of a
#               "named" name server.
# Syntax:
#    IP-Address  Full-Qualified-Hostname  Short-Hostname
#
127.0.0.1       localhost

# special IPv6 addresses
::1             localhost ipv6-localhost ipv6-loopback
fe00::0         ipv6-localnet
ff00::0         ipv6-mcastprefix
ff02::1         ipv6-allnodes
ff02::2         ipv6-allrouters
ff02::3         ipv6-allhosts

127.0.0.2       cluster1.cbl cluster1
192.168.30.71   cluster2.cbl cluster2
192.168.30.70   cluster1.cbl cluster1

The hosts file of cluster2:

Code:

# hosts         This file describes a number of hostname-to-address
#               mappings for the TCP/IP subsystem. It is mostly
#               used at boot time, when no name servers are running.
#               On small systems, this file can be used instead of a
#               "named" name server.
# Syntax:
#    IP-Address  Full-Qualified-Hostname  Short-Hostname
#
127.0.0.1       localhost

# special IPv6 addresses
::1             localhost ipv6-localhost ipv6-loopback
fe00::0         ipv6-localnet
ff00::0         ipv6-mcastprefix
ff02::1         ipv6-allnodes
ff02::2         ipv6-allrouters
ff02::3         ipv6-allhosts

127.0.0.2       cluster2.cbl cluster2
192.168.30.70   cluster1.cbl cluster1
192.168.30.71   cluster2.cbl cluster2

This is the output of crm_mon -1
Cluster1

Code:

============
Last updated: Sat Nov 19 02:07:42 2011
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
0 Resources configured.

Node cluster2: UNCLEAN (offline)
Online: [ cluster1 ]


Cluster2

Code:

Last updated: Sat Nov 19 02:07:35 2011
Stack: openais
Current DC: cluster2 - partition WITHOUT quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
0 Resources configured.

Node cluster1: UNCLEAN (offline)
Online: [ cluster2 ]


I have e0 of both servers configured as 192.100.100.70 and .71, which
connects to the outside network, and e1 of both configured as
192.168.30.70 and .71, connected through a crossover cable.

Yes, I have installed both Heartbeat and Pacemaker, but I have not yet
configured Pacemaker, as I want to check the cluster connectivity first.
In the Heartbeat communication channels, I gave 192.168.30.0 as the bind
network address, as it only offers the subnets
(192.100.100.0/192.168.30.0) to select from.
It also asks for a multicast address and port, which I assigned as
226.0.1.5:5454 and 226.0.1.6:5454 respectively.
Cluster1's node ID is 1 and cluster2's is 2.
rrp_mode is none for both, as I don't have redundant channels.

I hope this will help you get an idea of my setup. Thanks very much for
the interest you have shown in this issue.

Regards,

ccilleperuma.



Hi ccilleperuma,
> It also asks for a multicast address and port, which I assigned as
> 226.0.1.5:5454 and 226.0.1.6:5454 respectively.

You mean you have told the nodes to communicate via different multicast
addresses? That alone could be the cause of the split - all nodes of a
cluster use the same multicast channel to communicate.
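
To illustrate (a sketch using the values from the corosync.conf you
posted), the interface section in the totem block has to be identical on
both nodes:

Code:

# /etc/corosync/corosync.conf - this block must match on cluster1 AND cluster2
totem {
        interface {
                bindnetaddr:    192.168.30.0
                mcastaddr:      226.0.1.5   # same multicast group on both nodes
                mcastport:      5454        # same port on both nodes
                ringnumber:     0
        }
}

Only nodeid should differ between the two nodes (1 on cluster1, 2 on
cluster2).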

With regards
Jens



I have assigned the same multicast address and restarted the cluster
service, but the issue is still the same.
Is a time sync server a must for the cluster servers?



Hi ccilleperuma,
> Is a time sync server a must for the cluster servers?

I’m not sure, but I would recommend it anyhow - debugging distributed
problems is a mess when timestamps differ.

Have you found any indicators in the cluster logs?
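
If nothing obvious shows up there, you could also ask corosync itself
about its ring status (these tools ship with the cluster stack on SLES
11 SP1):

Code:

# Each node should report "ring 0 active with no faults" and show its
# own 192.168.30.x address as the ring address
corosync-cfgtool -s

# Cross-check the membership as Pacemaker sees it
crm_mon -1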

With regards
Jens



Hi,

You can sync the configuration from the active node using csync2 -xv.
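
For example (a sketch - the group name and file list below are
assumptions, so adjust them to your actual /etc/csync2/csync2.cfg):

Code:

# /etc/csync2/csync2.cfg - hypothetical minimal example
group ha_group
{
        host cluster1 cluster2;
        key /etc/csync2/key_hagroup;
        include /etc/corosync/corosync.conf;
}

With that in place, running csync2 -xv on the active node pushes the
listed files to the peer.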

Regards
Dev


Hi Dev,

> Is a time sync server a must for the cluster servers?
> You can sync the configuration from the active node using csync2 -xv.

This thread is rather old (2011), and while csync2 is fine for synchronizing files in a cluster, it will not help with drifting system time bases. Setting up a proper NTP configuration might be more helpful for this issue.
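
A minimal sketch for SLES (the server name is a placeholder - point it at your own NTP source):

Code:

# /etc/ntp.conf - minimal client configuration, the same on both nodes
server 0.pool.ntp.org iburst

# Enable and start the NTP daemon via the SLES init scripts
chkconfig ntp on
rcntp start

# Verify that the node is synchronizing
ntpq -p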

Regards,
Jens