ceph-mds crashed and was restarted by systemd

Hello everybody. Over the last few days I have been getting messages from my Zabbix monitoring system that my ceph-mds is down, and after about 30 minutes it comes back up. When I look at journalctl -u ceph-mds@… I find lines that I cannot understand, followed by a line saying the MDS started again. Has anybody seen this message? What does it mean?

“Oct 04 02:59:38 cph-host ceph-mds[1357515]: /home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/mds/Server.cc: In function ‘bool Server::_dir_is_nonempty(MDRequestRef&, CInode*)’ thread 7f7dfecee700 time 2019-10-04 02:59:38.636539
Oct 04 02:59:38 cph-host ceph-mds[1357515]: /home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/mds/Server.cc: 6728: FAILED assert(in->filelock.can_read(mdr->get_client()))
Oct 04 02:59:38 cph-host ceph-mds[1357515]: ceph version 12.2.12-573-g67074fa839 (67074fa839ea075078c95b0f6d1679c37f1c61f5) luminous (stable)
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x557899f2a9ee]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 2: (Server::_dir_is_nonempty(boost::intrusive_ptr&, CInode*)+0xc4) [0x557899c396d4]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 3: (Server::handle_client_unlink(boost::intrusive_ptr&)+0xfee) [0x557899c630de]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 4: (Server::dispatch_client_request(boost::intrusive_ptr&)+0xc30) [0x557899c83a30]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 5: (MDSInternalContextBase::complete(int)+0x1db) [0x557899e773cb]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 6: (void finish_contexts(CephContext*, std::list<MDSInternalContextBase*, std::allocator<MDSInternalContextBase*> >&, int)+0x94) [0x557899bf2514]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 7: (Locker::eval(CInode*, int, bool)+0x117) [0x557899d856c7]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 8: (Locker::handle_client_caps(MClientCaps*)+0x141f) [0x557899d9b17f]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 9: (Locker::dispatch(Message*)+0x85) [0x557899d9ca95]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 10: (MDSRank::handle_deferrable_message(Message*)+0xbac) [0x557899bf829c]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 11: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x557899c0f31b]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 12: (MDSRankDispatcher::ms_dispatch(Message*)+0x90) [0x557899c10280]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 13: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557899bef463]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 14: (DispatchQueue::entry()+0x78b) [0x55789a21996b]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x557899face4d]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 16: (()+0x8724) [0x7f7e041c0724]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 17: (clone()+0x6d) [0x7f7e03230e8d]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Main process exited, code=killed, status=6/ABRT
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Unit entered failed state.
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Failed with result ‘signal’.
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Service RestartSec=100ms expired, scheduling restart.
Oct 04 03:25:28 cph-host systemd[1]: Stopped Ceph metadata server daemon.
Oct 04 03:25:28 cph-host systemd[1]: Started Ceph metadata server daemon.
Oct 04 03:25:28 cph-host ceph-mds[1519847]: starting mds.cph-host at -”

Some additional log lines from just before the crash:

-2> 2019-10-04 09:33:46.631956 7fb6c1106700 5 – 10.17.169.27:6802/3837479083 >> 10.17.37.202:0/162945971 conn(0x557f9fbd4000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2559191 cs=1 l=0). rx client.4432501 seq 307479 0x557fa19d4a00 client_request(client.4432501:36437546 lookup #0x100044fec6c/9 2019-10-04 09:26:51.076636 caller_uid=0, caller_gid=0{}) v2
-1> 2019-10-04 09:33:46.634528 7fb6c1106700 5 – 10.17.169.27:6802/3837479083 >> 10.17.37.202:0/162945971 conn(0x557f9fbd4000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2559191 cs=1 l=0). rx client.4432501 seq 307480 0x557fda649500 client_request(client.4432501:36437547 lookup #0x100044fec70/16 2019-10-04 09:26:51.079636 caller_uid=0, caller_gid=0{}) v2
0> 2019-10-04 09:33:46.635853 7fb6be978700 -1 /home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/mds/Server.cc: In function ‘bool Server::_dir_is_nonempty(MDRequestRef&, CInode*)’ thread 7fb6be978700 time 2019-10-04 09:33:46.631850
/home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/mds/Server.cc: 6728: FAILED assert(in->filelock.can_read(mdr->get_client()))

Looking at the logs, there’s a crash - I suggest you open a support request so that we can investigate.

[QUOTE=symbatkarzhassov;58591]Hello everybody. Over the last few days I have been getting messages from my Zabbix monitoring system that my ceph-mds is down, and after about 30 minutes it comes back up. When I look at journalctl -u ceph-mds@… I find lines that I cannot understand, followed by a line saying the MDS started again. Has anybody seen this message? What does it mean?

[log snipped][/QUOTE]

It looks like a bug.
Update to the latest release and, if the crash still occurs, set up systemd coredumps as described in https://www.suse.com/support/kb/doc/?id=7017137 .
The data from the coredumps will help to find the reason behind the crashes.
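For reference, a rough sketch of what that setup usually involves on a systemd host (the KB article above is the authoritative procedure; the unit name ceph-mds@cph-host is taken from your journal output, so adjust it for your host):

# check where the kernel sends core dumps; for coredumpctl to pick them up
# it should point at systemd-coredump
sysctl kernel.core_pattern

# allow the MDS unit to write core dumps (systemctl edit opens an editor
# and creates a drop-in override; add the two lines shown below)
systemctl edit ceph-mds@cph-host.service
    [Service]
    LimitCORE=infinity
systemctl daemon-reload
systemctl restart ceph-mds@cph-host.service

# after the next crash, list and inspect the dump (a usable backtrace
# needs gdb and the matching ceph debuginfo packages installed)
coredumpctl list ceph-mds
coredumpctl info ceph-mds
coredumpctl gdb ceph-mds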

I remounted CephFS on the client, and the MDS no longer crashes.