libnetsnmp.so.30.0.3 segfault and crm_mon coredump

Hi,

I tried to create a ClusterMon resource:
ha-idg-1:~ # crm configure show SNMP
primitive SNMP ocf:pacemaker:ClusterMon \
params user=root \
params update=5000 \
params extra_options="-S vm49093-4.scidom.de -C idg-ha" \
params htmlfile="/srv/www/hawk/public/crm_mon.html" \
op start timeout=20 interval=0 \
op stop timeout=20 interval=0 \
op monitor interval=30 timeout=20

ClusterMon uses crm_mon, but the resource always fails:
first /usr/lib64/libnetsnmp.so.30.0.3 causes a segfault, and immediately afterwards crm_mon dumps core.

This is the typical procedure:

2019-01-16T14:12:35.921439+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:0 on ha-idg-1: not running
2019-01-16T14:12:35.924387+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:1 on ha-idg-2: not running
2019-01-16T14:12:35.925833+01:00 ha-idg-1 pengine[5690]: notice: * Recover SNMP:0 ( ha-idg-1 )
2019-01-16T14:12:35.926191+01:00 ha-idg-1 pengine[5690]: notice: Calculated transition 191, saving inputs in /var/lib/pacemaker/pengine/pe-input-2406.bz2
2019-01-16T14:12:35.944837+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:0 on ha-idg-1: not running
2019-01-16T14:12:35.945743+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:1 on ha-idg-2: not running
2019-01-16T14:12:35.949082+01:00 ha-idg-1 pengine[5690]: notice: * Recover SNMP:0 ( ha-idg-1 )
2019-01-16T14:12:35.949952+01:00 ha-idg-1 pengine[5690]: notice: Calculated transition 192, saving inputs in /var/lib/pacemaker/pengine/pe-input-2407.bz2
2019-01-16T14:12:35.950240+01:00 ha-idg-1 crmd[5691]: notice: Processing graph 192 (ref=pe_calc-dc-1547644355-512) derived from /var/lib/pacemaker/pengine/pe-input-2407.bz2
2019-01-16T14:12:35.950463+01:00 ha-idg-1 crmd[5691]: notice: Initiating stop operation SNMP_stop_0 locally on ha-idg-1
2019-01-16T14:12:35.951522+01:00 ha-idg-1 lrmd[5687]: notice: executing - rsc:SNMP action:stop call_id:242
2019-01-16T14:12:35.967848+01:00 ha-idg-1 lrmd[5687]: notice: SNMP_stop_0:29153:stderr [ /usr/lib/ocf/resource.d/pacemaker/ClusterMon: line 147: kill: (28105) - No such process ]
2019-01-16T14:12:35.968248+01:00 ha-idg-1 lrmd[5687]: notice: finished - rsc:SNMP action:stop call_id:242 pid:29153 exit-code:0 exec-time:17ms queue-time:0ms
2019-01-16T14:12:35.968657+01:00 ha-idg-1 crmd[5691]: notice: Result of stop operation for SNMP on ha-idg-1: 0 (ok)
2019-01-16T14:12:35.971903+01:00 ha-idg-1 crmd[5691]: notice: Initiating start operation SNMP_start_0 locally on ha-idg-1
2019-01-16T14:12:35.972624+01:00 ha-idg-1 lrmd[5687]: notice: executing - rsc:SNMP action:start call_id:243
2019-01-16T14:12:35.989012+01:00 ha-idg-1 su: pam_unix(su-l:session): session opened for user root by (uid=0)
2019-01-16T14:12:35.991876+01:00 ha-idg-1 systemd[1]: Started Session c4 of user root.
2019-01-16T14:12:36.046399+01:00 ha-idg-1 su: pam_unix(su-l:session): session closed for user root
2019-01-16T14:12:36.049003+01:00 ha-idg-1 lrmd[5687]: notice: finished - rsc:SNMP action:start call_id:243 pid:29158 exit-code:0 exec-time:76ms queue-time:1ms
2019-01-16T14:12:36.049729+01:00 ha-idg-1 crmd[5691]: notice: Result of start operation for SNMP on ha-idg-1: 0 (ok)
2019-01-16T14:12:36.055968+01:00 ha-idg-1 crmd[5691]: notice: Initiating monitor operation SNMP_monitor_30000 locally on ha-idg-1
2019-01-16T14:12:36.062611+01:00 ha-idg-1 crmd[5691]: notice: Transition 192 aborted by operation SNMP_monitor_30000 'modify' on ha-idg-2: Old event
2019-01-16T14:12:36.098341+01:00 ha-idg-1 crmd[5691]: notice: Transition 192 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2407.bz2): Complete
2019-01-16T14:12:36.107834+01:00 ha-idg-1 kernel: [157578.314958] crm_mon[29187]: segfault at 6c ip 00007fc2ff4d928d sp 00007fff6e231800 error 4 in libnetsnmp.so.30.0.3[7fc2ff49e000+c8000]
2019-01-16T14:12:36.123148+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:0 on ha-idg-1: not running
2019-01-16T14:12:36.124655+01:00 ha-idg-1 pengine[5690]: warning: Processing failed monitor of SNMP:1 on ha-idg-2: not running
2019-01-16T14:12:36.127945+01:00 ha-idg-1 pengine[5690]: notice: * Recover SNMP:1 ( ha-idg-2 )
2019-01-16T14:12:36.129056+01:00 ha-idg-1 pengine[5690]: notice: Calculated transition 193, saving inputs in /var/lib/pacemaker/pengine/pe-input-2408.bz2
2019-01-16T14:12:36.129795+01:00 ha-idg-1 crmd[5691]: notice: Processing graph 193 (ref=pe_calc-dc-1547644356-516) derived from /var/lib/pacemaker/pengine/pe-input-2408.bz2
2019-01-16T14:12:36.130047+01:00 ha-idg-1 crmd[5691]: notice: Initiating stop operation SNMP_stop_0 on ha-idg-2
2019-01-16T14:12:36.153502+01:00 ha-idg-1 crmd[5691]: notice: Initiating start operation SNMP_start_0 on ha-idg-2
2019-01-16T14:12:36.244619+01:00 ha-idg-1 crmd[5691]: notice: Initiating monitor operation SNMP_monitor_30000 on ha-idg-2
2019-01-16T14:12:36.288010+01:00 ha-idg-1 crmd[5691]: notice: Transition 193 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2408.bz2): Complete
2019-01-16T14:12:36.288350+01:00 ha-idg-1 crmd[5691]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
2019-01-16T14:12:37.371629+01:00 ha-idg-1 systemd-coredump[29197]: Process 29187 (crm_mon) of user 0 dumped core.

It’s always the same:
the cluster recognizes that resource SNMP isn’t running, stops it and starts it again.
crm_mon segfaults while accessing the library and terminates with a core dump. The next monitor operation for SNMP recognizes that it isn’t running, and the procedure starts again. This repeats every 30 seconds.

Any ideas?

Bernd

Which version of SUSE Linux Enterprise Server (SLES) and High
Availability Extension are you using?

HTH.

Simon Flood
SUSE Knowledge Partner


It’s SP4.

Bernd

Hi Bernd,

To make debugging easier, you should be able to reproduce the symptoms by calling

root@ha-idg-1# crm_mon -p /tmp/ClusterMon_testing.pid -i 5000 -S vm49093-4.scidom.de -C idg-ha -h /srv/www/hawk/public/crm_mon.html

This should fail with a segfault, like the invocation done by the cluster resource. The only difference to the resource script is that I omitted the “-d” option to keep the process in the foreground.

Do you see any additional output on stdout/stderr that might hint at the root cause?
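Independent of that, the kernel line in the log already gives a little to go on. The fault address 0x6c is very close to zero, which usually points to reading a field through a NULL pointer, and "error 4" on x86 means a user-mode read of an unmapped page. As a rough sketch (the function name fault_offset is made up for illustration), the offset of the faulting instruction inside the library can be computed from the ip value and the mapping base in brackets, assuming the mapping shown is the start of the library image:

```shell
#!/bin/sh
# fault_offset is a hypothetical helper, not part of any package.
# It takes one kernel "segfault at ..." log line and prints the offset of
# the faulting instruction relative to the mapping base shown in brackets.
fault_offset() {
    # Instruction pointer: the hex value following " ip "
    _ip=$(printf '%s\n' "$1" | sed -n 's/.* ip \([0-9a-f]\{1,\}\) .*/\1/p')
    # Mapping base: the hex value between "[" and "+" at the end of the line
    _base=$(printf '%s\n' "$1" | sed -n 's/.*\[\([0-9a-f]\{1,\}\)+[0-9a-f]\{1,\}\]$/\1/p')
    # Give up if the line does not look like a segfault message
    [ -n "$_ip" ] && [ -n "$_base" ] || return 1
    printf '0x%x\n' $(( 0x$_ip - 0x$_base ))
}

# Kernel message taken from the log above
line='[157578.314958] crm_mon[29187]: segfault at 6c ip 00007fc2ff4d928d sp 00007fff6e231800 error 4 in libnetsnmp.so.30.0.3[7fc2ff49e000+c8000]'
fault_offset "$line"
```

For the line above this prints 0x3b28d. If the debuginfo package for net-snmp is installed, something like "addr2line -f -e /usr/lib64/libnetsnmp.so.30.0.3 0x3b28d" may then name the function, but whether it resolves depends on how the library was built.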

Regards,
J

PS: I’m not at a matching system right now - what are the -S and -C options about?

But SP4 of which version? SLES9, SLES10, SLES11, or SLES12?

HTH.

From https://manpages.debian.org/testing/pacemaker-cli-utils/crm_mon.8.en.html :

Modes (mutually exclusive):

…snip…

-S, --snmp-traps=value
    Send SNMP traps to this station

-C, --snmp-community=value
    Specify community for SNMP traps (default is NULL)

HTH.

[QUOTE=smflood;56343]But SP4 of which version? SLES9, SLES10, SLES11, or SLES12?

HTH.[/QUOTE]

It’s 12.

Bernd

Hi,

I found this in the syslog of one of the nodes:

I think that’s a clear statement. We shouldn’t waste our time with something deprecated. I will switch to alerts.
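For anyone following the same route, an alert definition in crm shell syntax might look roughly like the fragment below. This is only a sketch under several assumptions: the script path is hypothetical (a copy of the alert_snmp.sh.sample script that ships with Pacemaker, made executable and placed at the same path on both nodes), the trap_community attribute is the name that sample script reads, and the recipient and community simply reuse the -S and -C values from the old ClusterMon resource.

```
# Assumed path: local copy of Pacemaker's alert_snmp.sh.sample
alert snmp_alert "/var/lib/pacemaker/alert_snmp.sh" \
        attributes trap_community="idg-ha" \
        to "vm49093-4.scidom.de"
```

The recipient after "to" is handed to the script as the trap destination, so it should be a name that resolves on both nodes.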

Bernd

[QUOTE=berndgsflinux;56367]Hi,

I think that’s a clear statement. We shouldn’t waste our time with something deprecated. I will switch to alerts.

Bernd[/QUOTE]

For the sake of completeness: the hostname in my RA was wrong; fixing it solved the problem. Nevertheless, I switched to alerts.
Bernd