Hello everybody. For the last few days I have been getting messages from my Zabbix monitoring system that my ceph-mds is down, and after about 30 minutes it comes up again. When I look at journalctl -u ceph-mds@… I find lines that I cannot understand, followed by a line showing the MDS starting again. Has anybody seen this message? What does it mean?
"Oct 04 02:59:38 cph-host ceph-mds[1357515]: /home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/mds/Server.cc: In function 'bool Server::_dir_is_nonempty(MDRequestRef&, CInode*)' thread 7f7dfecee700 time 2019-10-04 02:59:38.636539
Oct 04 02:59:38 cph-host ceph-mds[1357515]: /home/abuild/rpmbuild/BUILD/ceph-12.2.12-573-g67074fa839/src/mds/Server.cc: 6728: FAILED assert(in->filelock.can_read(mdr->get_client()))
Oct 04 02:59:38 cph-host ceph-mds[1357515]: ceph version 12.2.12-573-g67074fa839 (67074fa839ea075078c95b0f6d1679c37f1c61f5) luminous (stable)
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x557899f2a9ee]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 2: (Server::_dir_is_nonempty(boost::intrusive_ptr&, CInode*)+0xc4) [0x557899c396d4]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 3: (Server::handle_client_unlink(boost::intrusive_ptr&)+0xfee) [0x557899c630de]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 4: (Server::dispatch_client_request(boost::intrusive_ptr&)+0xc30) [0x557899c83a30]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 5: (MDSInternalContextBase::complete(int)+0x1db) [0x557899e773cb]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 6: (void finish_contexts(CephContext*, std::list<MDSInternalContextBase*, std::allocator<MDSInternalContextBase*> >&, int)+0x94) [0x557899bf2514]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 7: (Locker::eval(CInode*, int, bool)+0x117) [0x557899d856c7]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 8: (Locker::handle_client_caps(MClientCaps*)+0x141f) [0x557899d9b17f]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 9: (Locker::dispatch(Message*)+0x85) [0x557899d9ca95]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 10: (MDSRank::handle_deferrable_message(Message*)+0xbac) [0x557899bf829c]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 11: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x557899c0f31b]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 12: (MDSRankDispatcher::ms_dispatch(Message*)+0x90) [0x557899c10280]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 13: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557899bef463]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 14: (DispatchQueue::entry()+0x78b) [0x55789a21996b]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x557899face4d]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 16: (()+0x8724) [0x7f7e041c0724]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: 17: (clone()+0x6d) [0x7f7e03230e8d]
Oct 04 02:59:38 cph-host ceph-mds[1357515]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Main process exited, code=killed, status=6/ABRT
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Unit entered failed state.
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Failed with result 'signal'.
Oct 04 03:25:28 cph-host systemd[1]: ceph-mds@cph-host.service: Service RestartSec=100ms expired, scheduling restart.
Oct 04 03:25:28 cph-host systemd[1]: Stopped Ceph metadata server daemon.
Oct 04 03:25:28 cph-host systemd[1]: Started Ceph metadata server daemon.
Oct 04 03:25:28 cph-host ceph-mds[1519847]: starting mds.cph-host at -"
[QUOTE=symbatkarzhassov;58591]Has anybody seen this message? What does it mean?[/QUOTE]
The "FAILED assert" line means the MDS tripped an internal consistency check and aborted itself, which is why systemd reports status=6/ABRT; the gap between the assert at 02:59 and the restart logged at 03:25 matches the roughly 30-minute outage Zabbix reported. It looks like a bug in the MDS itself rather than a problem with your setup.
First, update to the latest release.
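On a SUSE host (the /home/abuild/rpmbuild build path in the trace suggests one), something along these lines should show what is installed and bring it current; this is just a sketch that assumes the Ceph update repositories are already configured:

[CODE]
# Show the ceph-mds build that is currently installed
ceph-mds --version
rpm -q ceph-mds

# Refresh the repositories and pull in the latest maintenance updates
sudo zypper refresh
sudo zypper update
[/CODE]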
If it still crashes after updating, set up systemd coredumps as described in https://www.suse.com/support/kb/doc/?id=7017137 . The data from the coredumps will help to find the reason behind the crashes.
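Roughly, the steps in that article come down to something like the following; treat this as a sketch and follow the article for your exact service pack (the drop-in file name coredump.conf is just an illustrative choice, and on some releases the coredump handler ships with systemd itself rather than as a separate package):

[CODE]
# Install the systemd coredump handler if it is not present yet
sudo zypper install systemd-coredump

# Route kernel core dumps through systemd-coredump
sudo sysctl -w kernel.core_pattern='|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e'

# Let the MDS unit write an unlimited core file; "cph-host" is the
# instance name from the journal above
sudo mkdir -p /etc/systemd/system/ceph-mds@.service.d
printf '[Service]\nLimitCORE=infinity\n' | \
    sudo tee /etc/systemd/system/ceph-mds@.service.d/coredump.conf
sudo systemctl daemon-reload
sudo systemctl restart ceph-mds@cph-host

# After the next crash (the assert aborts with SIGABRT, which
# systemd-coredump captures), inspect and export the dump
coredumpctl list ceph-mds
coredumpctl info ceph-mds
coredumpctl dump ceph-mds -o /tmp/ceph-mds.core
[/CODE]

Together with the matching ceph debuginfo package, that core lets the backtrace from the log be resolved to exact source lines, which is what the "objdump -rdS" note in the log is getting at.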