OCFS2 Volume not mounted during boot

Hi,

i have two nodes which access concurrently a OCFS2 Volume on a SAN. One Node (let’s call him 20), mounts the volume automatically.
The other does not (lets call him 10).
This is the actual error message (in /var/log/messages from 10):

Oct 18 14:20:32 sunhb65277 kernel: [ 76.198486] (kworker/u:0,5,0):o2net_connect_expired:1724 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors. Oct 18 14:20:32 sunhb65277 kernel: [ 76.198506] (mount.ocfs2,6559,1):dlm_request_join:1472 ERROR: Error -107 when sending message 510 (key 0x666c6172) to node 2 Oct 18 14:20:32 sunhb65277 kernel: [ 76.198513] (mount.ocfs2,6559,1):dlm_try_to_join_domain:1648 ERROR: status = -107 Oct 18 14:20:32 sunhb65277 kernel: [ 76.198518] (mount.ocfs2,6559,1):dlm_join_domain:1948 ERROR: status = -107 Oct 18 14:20:32 sunhb65277 kernel: [ 76.198628] (mount.ocfs2,6559,1):dlm_register_domain:2214 ERROR: status = -107 Oct 18 14:20:32 sunhb65277 kernel: [ 76.198649] (mount.ocfs2,6559,1):o2cb_cluster_connect:313 ERROR: status = -107 Oct 18 14:20:32 sunhb65277 kernel: [ 76.198653] (mount.ocfs2,6559,1):ocfs2_dlm_init:2995 ERROR: status = -107 Oct 18 14:20:32 sunhb65277 kernel: [ 76.198668] (mount.ocfs2,6559,1):ocfs2_mount_volume:1881 ERROR: status = -107 Oct 18 14:20:32 sunhb65277 kernel: [ 76.198713] ocfs2: Unmounting device (252,5) on (node 0) Oct 18 14:20:32 sunhb65277 kernel: [ 76.198720] (mount.ocfs2,6559,1):ocfs2_fill_super:1236 ERROR: status = -107 Oct 18 14:20:35 sunhb65277 sm-notify[6644]: Version 1.2.3 starting

/var/log/boot.msg:

[CODE]<notice – Oct 18 14:19:57.711549000> multipathd start
Starting multipathddone
<notice – Oct 18 14:19:58.775058000> ‘multipathd start’ exits with status 0
<notice – Oct 18 14:19:58.776666000> ocfs2 start
Starting Oracle Cluster File System (OCFS2) mount.ocfs2: Transport endpoint is not connected while mounting /dev/mapper/3600c0ff00012824b04af7a5201000000 on /images. Check ‘dmesg’ for more information on this error.

failed
<notice – Oct 18 14:20:34.965416000> ‘ocfs2 start’ exits with status 1
[/CODE]

I don’t know if it’s a connection problem with the two nodes, or if node 10 has problems.
Connection between the hosts seem ok. I have on each a bond with a private address just for the communication between the two.
They are attached directly, without a switch or s.th. like that. One can ping each other.

This is my/etc/ocfs2/cluster.conf:

cluster:
        node_count = 2
        name = idg

node:
        ip_port = 7777
        ip_address = 192.168.100.10
        number = 1
        name = sunhb65277
        cluster = idg

node:
        ip_port = 7777
        ip_address = 192.168.100.20
        number = 2
        name = sunhb58820
        cluster = idg

File is identical on both (checked with md5sum). Port is open on both.

I found stuff like that in the logs from 10:

Oct 18 14:19:58 sunhb65277 kernel: [ 41.212716] alua: release port group 0
Oct 18 14:19:58 sunhb65277 kernel: [ 41.212725] sd 8:0:0:0: alua: Detached
Oct 18 14:19:58 sunhb65277 kernel: [ 41.244688] alua: release port group 1
Oct 18 14:19:58 sunhb65277 kernel: [ 41.244696] sd 7:0:0:0: alua: Detached
Oct 18 14:19:58 sunhb65277 multipathd: 3600c0ff00012824b04af7a5201000000: load table [0 2111328000 multipath 1 queue_if_no_path 0 2 1 service-time 0 1 1 8:32 1 service-time 0 1 1 8:16 1]
Oct 18 14:19:58 sunhb65277 multipathd: 3600508b1001cbde9bbc03f7d66f637a7: ignoring map
Oct 18 14:19:58 sunhb65277 multipathd: 3600c0ff00012824b04af7a5201000000: event checker started
Oct 18 14:19:58 sunhb65277 kernel: [ 41.687108] device-mapper: table: 252:6: multipath: error getting device
Oct 18 14:19:58 sunhb65277 kernel: [ 41.687114] device-mapper: ioctl: error adding target to table
Oct 18 14:19:58 sunhb65277 kernel: [ 41.687756] device-mapper: table: 252:6: multipath: error getting device
Oct 18 14:19:58 sunhb65277 kernel: [ 41.687761] device-mapper: ioctl: error adding target to table
Oct 18 14:19:58 sunhb65277 multipathd: path checkers start up

Is the SAN not available ?

DLM seems to start fine on 10:

Oct 18 14:19:56 sunhb65277 kernel: [ 40.102779] OCFS2 Node Manager 1.5.0 Oct 18 14:19:56 sunhb65277 kernel: [ 40.106011] OCFS2 DLM 1.5.0 Oct 18 14:19:56 sunhb65277 kernel: [ 40.106872] ocfs2: Registered cluster interface o2cb Oct 18 14:19:56 sunhb65277 kernel: [ 40.153028] OCFS2 DLMFS 1.5.0 Oct 18 14:19:56 sunhb65277 kernel: [ 40.153244] OCFS2 User DLM kernel interface loaded Oct 18 14:19:57 sunhb65277 kernel: [ 40.243627] o2hb: Heartbeat mode set to local Oct 18 14:19:57 sunhb65277 o2hbmonitor: Starting

/dlm is available.

After boot, without any intervention, everything seems fine:

[CODE]sunhb65277:~ # multipathd -k
multipathd> list paths
hcil dev dev_t pri dm_st chk_st dev_st next_check
7:0:0:0 sdb 8:16 10 active ready running XXXXXXXX… 17/20
8:0:0:0 sdc 8:32 50 active ready running XXXXXXXX… 17/20
0:0:0:0 sda 8:0 1 undef ready running orphan

multipathd> list maps
name sysfs uuid
3600c0ff00012824b04af7a5201000000 dm-5 3600c0ff00012824b04af7a5201000000
multipathd> list topology
create: 3600c0ff00012824b04af7a5201000000 dm-5 HP,P2000 G3 FC
size=1007G features=‘1 queue_if_no_path’ hwhandler=‘0’ wp=rw
|-± policy=‘service-time 0’ prio=50 status=active
| - 8:0:0:0 sdc 8:32 active ready running -± policy=‘service-time 0’ prio=10 status=enabled
`- 7:0:0:0 sdb 8:16 active ready running
multipathd>

sunhb65277:~ # dmsetup ls --tree
vg1-lv_srv (252:4)
└─ (8:1)
3600c0ff00012824b04af7a5201000000 (252:5)
├─ (8:16)
└─ (8:32)
vg1-lv_root_snapshot (252:3)
├─vg1-lv_root_snapshot-cow (252:2)
│ └─ (8:1)
└─vg1-lv_root-real (252:0)
└─ (8:1)
vg1-lv_root (252:1)
└─vg1-lv_root-real (252:0)
└─ (8:1)

[/CODE]

Mounting manually after boot works fine:

sunhb65277:~ # mount -a sunhb65277:~ # mount /dev/mapper/vg1-lv_root on / type ext3 (rw,strictatime,acl,user_xattr) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) debugfs on /sys/kernel/debug type debugfs (rw) udev on /dev type tmpfs (rw,mode=0755) tmpfs on /dev/shm type tmpfs (rw,mode=1777) devpts on /dev/pts type devpts (rw,mode=0620,gid=5) /dev/sda3 on /boot type ext3 (rw,strictatime,acl,user_xattr) fusectl on /sys/fs/fuse/connections type fusectl (rw) securityfs on /sys/kernel/security type securityfs (rw) configfs on /sys/kernel/config type configfs (rw) ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw) none on /var/lib/ntp/proc type proc (ro,nosuid,nodev) /dev/dm-5 on /images type ocfs2 (rw,_netdev,heartbeat=local)

Any ideas ?

Bernd

There is s.th. i completely don’t understand.
I would expect that the booting host tries to connect the other node(s) mentioned in /etc/ocfs2/cluster.conf.
But when i issue a tcpdump on host 20 while host 10 is booting, i got the following:

tcpdump net 192.168.100.0/24 or port 7777 or 49152 -i any -vvvn & sunhb58820:~ # tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes 15:56:12.789724 arp who-has 192.168.100.10 tell 192.168.100.20 15:56:12.789733 arp who-has 192.168.100.10 tell 192.168.100.20 15:56:12.789912 arp reply 192.168.100.10 is-at 70:10:6f:47:0c:46 15:56:12.789912 arp reply 192.168.100.10 is-at 70:10:6f:47:0c:46 15:56:12.789926 IP (tos 0x0, ttl 64, id 8421, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x018c), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20485380 0,nop,wscale 7> 15:56:12.789930 IP (tos 0x0, ttl 64, id 8421, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x018c), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20485380 0,nop,wscale 7> 15:56:15.793702 IP (tos 0x0, ttl 64, id 8422, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0xfe9b), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20486132 0,nop,wscale 7> 15:56:15.793712 IP (tos 0x0, ttl 64, id 8422, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0xfe9b), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20486132 0,nop,wscale 7> 15:56:21.809740 IP (tos 0x0, ttl 64, id 8423, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0xf8bb), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20487636 0,nop,wscale 7> 15:56:21.809750 IP (tos 0x0, ttl 64, id 8423, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0xf8bb), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20487636 0,nop,wscale 7> 15:56:33.825724 IP (tos 0x0, ttl 64, id 8424, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0xecff), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20490640 0,nop,wscale 7> 15:56:33.825736 IP (tos 0x0, ttl 64, id 8424, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.57678 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0xecff), 2016327137:2016327137(0) win 14600 <mss 1460,sackOK,timestamp 20490640 0,nop,wscale 7>
Why does the already running host contact the booting one ?
I get the packets right in the moment in which host 10 tries to start OCFS2. Not any packet before !
From where does host 20 knows that OCFS2 is just in that moment starting on 10 ?

Bernd

I tested it the other way round:
Host 10 is running with manually mounted OCFS2 volume, host 20 is booting.
Host 20 mounts automatically.
And here the flow of connection is the way i expect:

17:12:16.978152 arp who-has 192.168.100.10 tell 192.168.100.20 17:12:16.978152 arp who-has 192.168.100.10 tell 192.168.100.20 17:12:16.978177 arp reply 192.168.100.10 is-at 70:10:6f:47:0c:46 17:12:16.978180 arp reply 192.168.100.10 is-at 70:10:6f:47:0c:46 17:12:16.978301 IP (tos 0x0, ttl 64, id 36158, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.53309 > 192.168.100.10.7777: S, cksum 0x5085 (correct), 4171464791:4171464791(0) win 14600 <mss 1460,sackOK,timestamp 4294905193 0,nop,wscale 7> 17:12:16.978301 IP (tos 0x0, ttl 64, id 36158, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.53309 > 192.168.100.10.7777: S, cksum 0x5085 (correct), 4171464791:4171464791(0) win 14600 <mss 1460,sackOK,timestamp 4294905193 0,nop,wscale 7> 17:12:16.978370 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.10.7777 > 192.168.100.20.53309: S, cksum 0x499e (incorrect (-> 0xa716), 428745341:428745341(0) ack 4171464792 win 14480 <mss 1460,sackOK,timestamp 1076666 4294905193,nop,wscale 7> 17:12:16.978372 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.10.7777 > 192.168.100.20.53309: S, cksum 0x499e (incorrect (-> 0xa716), 428745341:428745341(0) ack 4171464792 win 14480 <mss 1460,sackOK,timestamp 1076666 4294905193,nop,wscale 7> 17:12:16.978554 IP (tos 0x0, ttl 64, id 36159, offset 0, flags [DF], proto TCP (6), length 52) 192.168.100.20.53309 > 192.168.100.10.7777: ., cksum 0x0dff (correct), 1:1(0) ack 1 win 115 <nop,nop,timestamp 4294905194 1076666> 17:12:16.978554 IP (tos 0x0, ttl 64, id 36159, offset 0, flags [DF], proto TCP (6), length 52) 192.168.100.20.53309 > 192.168.100.10.7777: ., cksum 0x0dff (correct), 1:1(0) ack 1 win 115 <nop,nop,timestamp 4294905194 1076666> 17:12:16.978578 IP (tos 0x0, ttl 64, id 36160, offset 0, flags [DF], proto TCP (6), length 84) 192.168.100.20.53309 > 192.168.100.10.7777: P, cksum 0x9e99 (correct), 1:33(32) ack 1 win 115 <nop,nop,timestamp 4294905194 1076666> 17:12:16.978578 IP (tos 0x0, ttl 64, id 36160, offset 0, flags [DF], proto TCP (6), length 84) 192.168.100.20.53309 > 192.168.100.10.7777: P, cksum 0x9e99 (correct), 1:33(32) ack 1 win 115 <nop,nop,timestamp 4294905194 1076666> 17:12:16.978597 IP (tos 0x0, ttl 64, id 20420, offset 0, flags [DF], proto TCP (6), length 52) 192.168.100.10.7777 > 192.168.100.20.53309: ., cksum 0x4996 (incorrect (-> 0x0de0), 1:1(0) ack 33 win 114 <nop,nop,timestamp 1076666 4294905194> 17:12:16.978600 IP (tos 0x0, ttl 64, id 20420, offset 0, flags [DF], proto TCP (6), length 52) 192.168.100.10.7777 > 192.168.100.20.53309: ., cksum 0x4996 (incorrect (-> 0x0de0), 1:1(0) ack 33 win 114 <nop,nop,timestamp 1076666 429490519

Host 20 (the booting one) contacts host 10.

I’m confused.

Bernd

Bernd

Hi,
i tested it again as i started. Host 20 running (mounted OCFS2 volume), host 10 booting.
As expected the volume is not mounted via fstab.
But what i see is:

tcpdump -i any -vvvn net 192.168.100.0/24 or port 7777 or port 49152 or ip broadcast and not \\( port 67 or port 68 or port 161 or port 162 or port 1947 or port 5246 or port 12223 or port 17500\\) 17:46:17.062526 arp who-has 192.168.100.10 tell 192.168.100.20 17:46:17.062536 arp who-has 192.168.100.10 tell 192.168.100.20 17:46:17.062720 arp reply 192.168.100.10 is-at 70:10:6f:47:0c:46 17:46:17.062720 arp reply 192.168.100.10 is-at 70:10:6f:47:0c:46 17:46:17.062731 IP (tos 0x0, ttl 64, id 42021, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x95f9), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 447937 0,nop,wscale 7> 17:46:17.062734 IP (tos 0x0, ttl 64, id 42021, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x95f9), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 447937 0,nop,wscale 7> 17:46:20.058592 IP (tos 0x0, ttl 64, id 42022, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x930a), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 448688 0,nop,wscale 7> 17:46:20.058603 IP (tos 0x0, ttl 64, id 42022, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x930a), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 448688 0,nop,wscale 7> 17:46:26.074594 IP (tos 0x0, ttl 64, id 42023, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x8d2a), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 450192 0,nop,wscale 7> 17:46:26.074603 IP (tos 0x0, ttl 64, id 42023, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x8d2a), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 450192 0,nop,wscale 7> 17:46:38.106550 IP (tos 0x0, ttl 64, id 42024, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x816a), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 453200 0,nop,wscale 7> 17:46:38.106560 IP (tos 0x0, ttl 64, id 42024, offset 0, flags [DF], proto TCP (6), length 60) 192.168.100.20.41140 > 192.168.100.10.7777: S, cksum 0x499e (incorrect (-> 0x816a), 684552676:684552676(0) win 14600 <mss 1460,sackOK,timestamp 453200 0,nop,wscale 7>
From what i know from TCP these are all syn packets, and there is no ack response.
This is a very basic problem. That means that maybe nothing is listening on port 7777 ?
In my third post (where everything went well) you see a fine tcp handshake.

Host 20 says in /var/log/messages:
Oct 18 17:46:47 sunhb58820 kernel: [ 2121.319644] (kworker/u:0,9051,7):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.

Bernd

I found it out.
Because the init-script SuSEfirewall2_setup is executed as the last one, and the ocfs2 script is executed before, and after 30s waiting the port 7777 is still not open, ocfs2 terminates the connection tries.
Is there a good reason why the SuSEfirewall2_setup script is executed as the last one ?

I tried already changing the boot order by manipulating the LSB headers in the init scripts, but insserv always complained:
“insserv: Starting ocfs2 depends on SuSEfirewall2_setup and therefore on system facility $all' which can not be true! insserv: Starting o2cb depends on SuSEfirewall2_setup and therefore on system facility $all’ which can not be true!”

Bernd

On 20/10/16 16:14, berndgsflinux wrote:
[color=blue]

I found it out.
Because the init-script SuSEfirewall2_setup is executed as the last
one, and the ocfs2 script is executed before, and after 30s waiting the
port 7777 is still not open, ocfs2 terminates the connection tries.
Is there a good reason why the SuSEfirewall2_setup script is executed
as the last one ?

I tried already changing the boot order by manipulating the LSB headers
in the init scripts, but insserv always complained:
“insserv: Starting ocfs2 depends on SuSEfirewall2_setup and therefore on
system facility $all' which can not be true! insserv: Starting o2cb depends on SuSEfirewall2_setup and therefore on system facility $all’ which can not be true!”[/color]

Have you tried disabling the firewall? Perhaps that is how you found this?

Are you allowing port 7777 through the firewall?

HTH.

Simon
SUSE Knowledge Partner


If you find this post helpful and are logged into the web interface,
please show your appreciation and click on the star below. Thanks.

The port is opened by the SuSE-Firewall. But too late.
The ocfs2-init-script stopped trying to get a connection before the port is opened by the fw.
The init-script for the firewall is the last one executed by init. Is there a good reason for that ?
What is strange: On the host 10 the ocfs2-init-script runs completely until “O2CB_IDLE_TIMEOUT_MS” in /etc/sysconfig/o2cb
has run out. Then init continues to start scripts.
On host 20, which mounts the volume fine and automatically, the ocfs2 script starts, but just some seconds later init continues to start
other scripts. So the time does not run out before the port is opened.

I don’t understand from where this difference comes.
I compared /etc/init.d/ocfs2, /etc/init.d/o2cb, /etc/fstab, /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb with diff.
There is no difference.

Why does the init in host 20 continue to start init-scripts, and on host 10 it waits until the ocfs2 script is finished ?

Bernd

I finally found a workaround i’m not really happy with, but i didn’t find out why the hosts are behaving differently.
In /etc/sysconfig/SuSEfirewall2 i changed FW_BOOT_FULL_INIT to “yes”.
Now my rules are already applied when the SuSEfirewall_init script starts. So the port is opened when the host tries to mount the OCFS2 volume.

Bernd