mds: flush journal asok command should trim only up to the segment created for the flush (#58936)

Merged
batrick merged 15 commits into ceph:main from batrick:i67276
Sep 26, 2024

Conversation

batrick (Member) commented Jul 30, 2024

The overall goal here is to avoid situations where the MDS trims aggressively past the segment created for the flush. Some code has demonstrably been sensitive to that.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the checklist below, you may tick boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

github-actions bot added the cephfs (Ceph File System) label Jul 30, 2024
batrick requested a review from vshankar July 31, 2024 01:49
mdlog->wait_for_safe(new MDSInternalContextWrapper(mds, ctx));
}

void handle_flush_mdlog(int r) {
Contributor:

nice change 👍

* segment.
*/
int MDLog::trim_all()
int MDLog::trim_to(SegmentBoundary::seq_t seq)
vshankar (Contributor) commented Jul 31, 2024

Do we have to worry about doing something similar in MDLog::trim()?

Contributor:

Never mind. trim() would run until max_segments or the threshold is reached.

batrick (Member Author) commented Aug 9, 2024

@vshankar good to approve?

vshankar (Contributor) commented Aug 9, 2024

> @vshankar good to approve?

I'll have another look soon.

batrick (Member Author) commented Aug 13, 2024

vshankar (Contributor) left a comment

Otherwise LGTM.


MDCache *mdcache;
MDLog *mdlog;
SegmentBoundary* sb = nullptr;
Contributor:

minor nit: this only needs the seq number to pass to trim_to() so just include that rather than using SegmentBoundary?

Member Author:

sure, changed

trim_segments();
}

void trim_segments() {
Contributor:

IIRC, this was done this way to invoke trim_expired_segments() via finisher thread. You probably have already checked the commit history for an explanation and didn't find one? (that should be my fault) :)

Member Author:

Hmm, perhaps to avoid a call to mdlog->trim_expired_segments() from deep within context callback paths? I suppose that could create trouble somehow. In any case, I don't understand why the original code doesn't just push the FunctionContext onto the finisher queue instead of wrapping it with C_OnFinisher, which it immediately completes. Do you know why?

Member Author:

Ah, I see. Sometimes the context will complete with MDLog::submit_mutex locked and cause a deadlock. I'll add a note.

Member Author:

PTAL

batrick force-pushed the i67276 branch 2 times, most recently from 52f2648 to 397f298 on August 15, 2024 00:50
batrick (Member Author) commented Aug 15, 2024

batrick (Member Author) commented Aug 27, 2024

flush command hangs:

2024-08-15T06:08:36.054+0000 7f2940205640  1 mds.b asok_command: flush journal {format=json,prefix=flush journal} (starting...)
2024-08-15T06:08:36.054+0000 7f2940205640 10 mds.0.265 handle_asok_command: flush journal
2024-08-15T06:08:36.054+0000 7f2940205640 20 mds.0.265 send
2024-08-15T06:08:36.054+0000 7f2940205640 20 mds.0.265 flush_mdlog
2024-08-15T06:08:36.054+0000 7f2940205640 10 mds.0.cache create_subtree_map 2 subtrees, 2 fullauth
2024-08-15T06:08:36.054+0000 7f2940205640 15 mds.0.cache show_subtrees
2024-08-15T06:08:36.054+0000 7f2940205640 10 mds.0.cache |__ 0    auth [dir 0x1 / [2,head] auth v=1 cv=1/1 REP dir_auth=0 state=1073741825|complete f() n() hs=0+1,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x560e1ebef200]
2024-08-15T06:08:36.054+0000 7f2940205640 10 mds.0.cache |__ 0    auth [dir 0x100 ~mds0/ [2,head] auth v=10 cv=10/10 dir_auth=0 state=1073741825|complete f(v0 10=0+10) n(v1 rc2024-08-15T06:07:22.699723+0000 10=0+10) hs=10+0,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x560e1ebef680]
2024-08-15T06:08:36.054+0000 7f2940205640 15 mds.0.cache  auth subtree [dir 0x1 / [2,head] auth v=1 cv=1/1 REP dir_auth=0 state=1073741825|complete f() n() hs=0+1,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x560e1ebef200]
2024-08-15T06:08:36.054+0000 7f2940205640 15 mds.0.cache  auth subtree [dir 0x100 ~mds0/ [2,head] auth v=10 cv=10/10 dir_auth=0 state=1073741825|complete f(v0 10=0+10) n(v1 rc2024-08-15T06:07:22.699723+0000 10=0+10) hs=10+0,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x560e1ebef680]
2024-08-15T06:08:36.054+0000 7f2940205640 20 mds.0.journal EMetaBlob::add_dir_context final:
2024-08-15T06:08:36.054+0000 7f2940205640 20 mds.0.journal EMetaBlob::add_dir_context final:
2024-08-15T06:08:36.054+0000 7f2940205640 15 mds.0.cache  subtrees {0x1=[],0x100=[]}
2024-08-15T06:08:36.054+0000 7f2940205640 15 mds.0.cache  ambiguous_subtrees
2024-08-15T06:08:36.054+0000 7f2940205640 20 mds.0.log _submit_entry ESubtreeMap 2 subtrees , 0 ambiguous [metablob 0x1, 2 dirs]
2024-08-15T06:08:36.054+0000 7f2940205640 20 mds.0.log _start_new_segment: seq = 17 starting new segment LogSegment(17/0xffffffffffffffff events=0)
2024-08-15T06:08:36.054+0000 7f2940205640 10 mds.0.cache advance_stray to index 4 fragmenting index -1
2024-08-15T06:08:36.054+0000 7f29371f3640  5 mds.0.log _submit_thread 4208266~872 : ESubtreeMap 2 subtrees , 0 ambiguous [metablob 0x1, 2 dirs]
2024-08-15T06:08:36.054+0000 7f29371f3640  1 -- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] --> [v2:172.21.15.98:6808/1635857926,v1:172.21.15.98:6809/1635857926] -- osd_op(unknown.0.265:73 33.14 33:2e2fa760:::200.00000001:head [write 13962~892 [fadvise_dontneed] in=892b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e216) -- 0x560e1ebc3400 con 0x560e1edbe900
2024-08-15T06:08:36.054+0000 7f2940a06640  2 --2- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] >> [v2:172.21.15.98:6808/1635857926,v1:172.21.15.98:6809/1635857926] conn(0x560e1edbe900 0x560e1eb63600 crc :-1 s=READY pgs=68 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x560e1ebc3400 seq=41 osd_op(mds.0.265:73 33.14 33:2e2fa760:::200.00000001:head [write 13962~892 [fadvise_dontneed] in=892b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e216)
2024-08-15T06:08:36.056+0000 7f2940a06640  1 -- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] <== osd.5 v2:172.21.15.98:6808/1635857926 41 ==== osd_op_reply(73 200.00000001 [write 13962~892 [fadvise_dontneed]] v216'23 uv23 ondisk = 0) ==== 156+0+0 (crc 0 0 0) 0x560e1edc6280 con 0x560e1edbe900
2024-08-15T06:08:36.056+0000 7f29379f4640 10 MDSIOContextBase::complete: 13C_MDL_Flushed
2024-08-15T06:08:36.056+0000 7f29379f4640 10 MDSContext::complete: 13C_MDL_Flushed
2024-08-15T06:08:36.056+0000 7f29379f4640 10 MDSIOContextBase::complete: 13C_MDL_Flushed
2024-08-15T06:08:36.056+0000 7f29379f4640 10 MDSContext::complete: 13C_MDL_Flushed
2024-08-15T06:08:36.056+0000 7f29379f4640 10 MDSContext::complete: 25MDSInternalContextWrapper
2024-08-15T06:08:36.056+0000 7f29379f4640 20 mds.0.265 handle_clear_mdlog: r=0
2024-08-15T06:08:36.056+0000 7f29379f4640  5 mds.0.265 trim_mdlog: beginning segment expiry
2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.log trim_to: 17 2/0/0
2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.openfiles commit log_seq 17 committing_log_seq:16
2024-08-15T06:08:36.056+0000 7f29379f4640  1 -- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] --> [v2:172.21.15.98:6826/1301235023,v1:172.21.15.98:6827/1301235023] -- osd_op(unknown.0.265:74 33.6 33:654134d2:::mds0_openfiles.0:head [omap-set-header in=42b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e216) -- 0x560e1ef11000 con 0x560e1edbfb00
2024-08-15T06:08:36.056+0000 7f29379f4640  6 mds.0.journal LogSegment(16/4207374).try_to_expire
2024-08-15T06:08:36.056+0000 7f29379f4640  4 mds.0.sessionmap save_if_dirty: writing 0
2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.journal try_to_expire waiting for purge of []
2024-08-15T06:08:36.056+0000 7f29379f4640  6 mds.0.journal LogSegment(16/4207374).try_to_expire success
2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.log try_expire expired LogSegment(16/0x40330e events=1)
2024-08-15T06:08:36.056+0000 7f29379f4640  5 mds.0.log _expired LogSegment(16/0x40330e events=1)
2024-08-15T06:08:36.056+0000 7f29379f4640 20 mds.0.log _trim_expired_segments: examining LogSegment(16/0x40330e events=1)
2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.log _trim_expired_segments defer expire for open file table committedseq 16 <= 16/4207374
2024-08-15T06:08:36.056+0000 7f29379f4640 20 mds.0.265 expire_segments
2024-08-15T06:08:36.056+0000 7f29379f4640  5 mds.0.265 expire_segments: waiting for 0 segments to expire
2024-08-15T06:08:36.056+0000 7f29379f4640  5 mds.0.265 trim_expired_segments: expiry complete, expire_pos/trim_pos is now 40330e/400000
2024-08-15T06:08:36.056+0000 7f29379f4640 20 mds.0.log _trim_expired_segments: examining LogSegment(16/0x40330e events=1)
2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.log _trim_expired_segments defer expire for open file table committedseq 16 <= 16/4207374
2024-08-15T06:08:36.056+0000 7f29379f4640  5 mds.0.265 trim_expired_segments: trimming is complete; wait for journal head write. Journal expire_pos/trim_pos is now 40330e/400000
2024-08-15T06:08:36.056+0000 7f2941207640  2 --2- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] >> [v2:172.21.15.98:6826/1301235023,v1:172.21.15.98:6827/1301235023] conn(0x560e1edbfb00 0x560e1ec01700 crc :-1 s=READY pgs=69 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x560e1ef11000 seq=8 osd_op(mds.0.265:74 33.6 33:654134d2:::mds0_openfiles.0:head [omap-set-header in=42b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e216)
2024-08-15T06:08:36.057+0000 7f2941207640  1 -- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] <== osd.7 v2:172.21.15.98:6826/1301235023 8 ==== osd_op_reply(74 mds0_openfiles.0 [omap-set-header] v216'3 uv3 ondisk = 0) ==== 160+0+0 (crc 0 0 0) 0x560e1ecdc000 con 0x560e1edbfb00
2024-08-15T06:08:36.057+0000 7f29379f4640 10 MDSIOContextBase::complete: 13C_IO_OFT_Save
2024-08-15T06:08:36.057+0000 7f29379f4640 10 MDSContext::complete: 13C_IO_OFT_Save
2024-08-15T06:08:36.057+0000 7f29379f4640 10 mds.0.openfiles _commit_finish log_seq 17 committed_log_seq 16 committing_log_seq 17
2024-08-15T06:08:36.057+0000 7f29379f4640 10 MDSContext::complete: 15C_OFT_Committed
2024-08-15T06:08:36.057+0000 7f29379f4640 20 mds.0.log _trim_expired_segments: examining LogSegment(16/0x40330e events=1)
2024-08-15T06:08:36.057+0000 7f29379f4640 10 mds.0.log _trim_expired_segments: maybe expiring LogSegment(16/0x40330e events=1)
2024-08-15T06:08:36.057+0000 7f29379f4640 20 mds.0.log _trim_expired_segments: examining LogSegment(17/0x40368a events=1)
2024-08-15T06:08:36.057+0000 7f29379f4640 10 mds.0.log _trim_expired_segments: expiring up to this major segment seq=17
2024-08-15T06:08:36.057+0000 7f29379f4640 20 mds.0.log _trim_expired_segments: expiring LogSegment(16/0x40330e events=1)
2024-08-15T06:08:36.057+0000 7f29379f4640 10 mds.0.log _trim_expired_segments waiting for expiry LogSegment(17/0x40368a events=1)
2024-08-15T06:08:36.057+0000 7f29379f4640  1 -- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] --> [v2:172.21.15.98:6808/1635857926,v1:172.21.15.98:6809/1635857926] -- osd_op(unknown.0.265:75 33.14 33:292cf221:::200.00000000:head [writefull 0~90 [fadvise_dontneed] in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e216) -- 0x560e1ef11400 con 0x560e1edbe900
2024-08-15T06:08:36.057+0000 7f2940a06640  2 --2- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] >> [v2:172.21.15.98:6808/1635857926,v1:172.21.15.98:6809/1635857926] conn(0x560e1edbe900 0x560e1eb63600 crc :-1 s=READY pgs=68 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x560e1ef11400 seq=42 osd_op(mds.0.265:75 33.14 33:292cf221:::200.00000000:head [writefull 0~90 [fadvise_dontneed] in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e216)
2024-08-15T06:08:36.058+0000 7f2940a06640  1 -- [v2:172.21.15.98:6834/2000876941,v1:172.21.15.98:6836/2000876941] <== osd.5 v2:172.21.15.98:6808/1635857926 42 ==== osd_op_reply(75 200.00000000 [writefull 0~90 [fadvise_dontneed]] v216'24 uv24 ondisk = 0) ==== 156+0+0 (crc 0 0 0) 0x560e1edc6280 con 0x560e1edbe900

From: /teuthology/pdonnell-2024-08-15_05:17:58-fs-wip-pdonnell-testing-20240815.005222-debug-distro-default-smithi/7856292/remote/smithi098/log/ceph-mds.b.log.gz

The hang is on the open file table flush. We need to add a context to wait for this type of thing during trim.

batrick (Member Author) commented Aug 27, 2024

batrick (Member Author) commented Aug 29, 2024

batrick (Member Author) commented Aug 29, 2024

@vshankar this needs a second look. These are the new commits

            mds: do not duplicate journaler write heads
            mds: use Journaler getters
            osdc: properly acquire locks for getters
            osdc: add print method for Journaler::Header
            mds: do not trim segments after open file table commit
            mds: delay expiry if LogSegment is ahead of committed oft seq

batrick requested a review from vshankar August 29, 2024 18:23
batrick (Member Author) commented Aug 29, 2024

This passes QA, just needs reviews.

vshankar (Contributor):

> @vshankar this needs a second look. These are the new commits
>
>     mds: do not duplicate journaler write heads
>     mds: use Journaler getters
>     osdc: properly acquire locks for getters
>     osdc: add print method for Journaler::Header
>     mds: do not trim segments after open file table commit
>     mds: delay expiry if LogSegment is ahead of committed oft seq

Noted.

batrick (Member Author) commented Sep 5, 2024

(trivial rebase to hopefully fix the broken Windows test)

batrick (Member Author) commented Sep 5, 2024

jenkins test make check arm64

looks like

    2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.journal try_to_expire waiting for purge of []

Nothing to purge!

Signed-off-by: Patrick Donnelly <[email protected]>
This neatly avoids a situation where the MDS trims past the segment created for
the `flush journal` command.

Here's what it looks like:

    2024-07-30T20:15:51.679+0000 7f70a809a640  1 mds.d asok_command: flush journal {prefix=flush journal} (starting...)
    2024-07-30T20:15:51.679+0000 7f70a809a640 10 mds.1.26 handle_asok_command: flush journal
    2024-07-30T20:15:51.679+0000 7f70a809a640 20 mds.1.26 send
    2024-07-30T20:15:51.679+0000 7f70a809a640 20 mds.1.26 flush_mdlog
    2024-07-30T20:15:51.679+0000 7f70a809a640 10 mds.1.cache create_subtree_map 106 subtrees, 105 fullauth
    ...
    2024-07-30T20:15:51.680+0000 7f70a809a640 20 mds.1.log _submit_entry ESubtreeMap 105 subtrees , 0 ambiguous [metablob 0x101, 107 dirs]
    2024-07-30T20:15:51.680+0000 7f70a809a640 20 mds.1.log _start_new_segment: seq = 522843 starting new segment LogSegment(522843/0xffffffffffffffff events=0)
    ...
    2024-07-30T20:15:51.686+0000 7f709e887640 20 mds.1.26 handle_clear_mdlog: r=0
    2024-07-30T20:15:51.686+0000 7f709e887640  5 mds.1.26 trim_mdlog: beginning segment expiry
    2024-07-30T20:15:51.686+0000 7f709e887640 10 mds.1.log trim_to: 522843 23/0/0
    ...
    2024-07-30T20:15:51.858+0000 7f709e887640 20 mds.1.26 expire_segments
    2024-07-30T20:15:51.858+0000 7f709e887640  5 mds.1.26 expire_segments: waiting for 22 segments to expire
    ...
    2024-07-30T20:15:52.386+0000 7f70a008a640 20 mds.1.26 handle_expire_segments: r=0
    2024-07-30T20:15:52.386+0000 7f70a008a640 20 mds.1.26 trim_segments
    2024-07-30T20:15:52.386+0000 7f709e887640  5 mds.1.26 trim_expired_segments: expiry complete, expire_pos/trim_pos is now 22f4c774/22c00000
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(503314/0x22f4c774 events=649)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(503314/0x22f4c774 events=649)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(503963/0x230019e6 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(503963/0x230019e6 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(504987/0x23119d93 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(504987/0x23119d93 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(506011/0x23234d24 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(506011/0x23234d24 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(507035/0x23392678 events=397)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(507035/0x23392678 events=397)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(507432/0x23400db4 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(507432/0x23400db4 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(508456/0x2351c0af events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(508456/0x2351c0af events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(509480/0x23639df0 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(509480/0x23639df0 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(510504/0x2375912d events=607)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(510504/0x2375912d events=607)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(511111/0x2380290f events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(511111/0x2380290f events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(512135/0x2391b7f3 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(512135/0x2391b7f3 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(513159/0x23a6a15f events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(513159/0x23a6a15f events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(514183/0x23b855e7 events=442)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(514183/0x23b855e7 events=442)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(514625/0x23c01325 events=979)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(514625/0x23c01325 events=979)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(515604/0x23d0e086 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: expiring up to this major segment seq=515604
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(503314/0x22f4c774 events=649)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(503963/0x230019e6 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(504987/0x23119d93 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(506011/0x23234d24 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(507035/0x23392678 events=397)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(507432/0x23400db4 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(508456/0x2351c0af events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(509480/0x23639df0 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(510504/0x2375912d events=607)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(511111/0x2380290f events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(512135/0x2391b7f3 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(513159/0x23a6a15f events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(514183/0x23b855e7 events=442)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(514625/0x23c01325 events=979)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(515604/0x23d0e086 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(516628/0x23e2a879 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(516628/0x23e2a879 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(517652/0x23f47dbf events=665)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(517652/0x23f47dbf events=665)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(518317/0x240007be events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(518317/0x240007be events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(519341/0x2411af83 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(519341/0x2411af83 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(520365/0x24234a83 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(520365/0x24234a83 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(521389/0x24344292 events=691)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(521389/0x24344292 events=691)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(522080/0x24400e24 events=763)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(522080/0x24400e24 events=763)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(522843/0x244c4fda events=1)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: expiring up to this major segment seq=522843
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(515604/0x23d0e086 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(516628/0x23e2a879 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(517652/0x23f47dbf events=665)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(518317/0x240007be events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(519341/0x2411af83 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(520365/0x24234a83 events=1024)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(521389/0x24344292 events=691)
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(522080/0x24400e24 events=763)
    2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments waiting for expiry LogSegment(522843/0x244c4fda events=1)
    2024-07-30T20:15:52.386+0000 7f709e887640  1 -- [v2:127.0.0.1:6872/1197350146,v1:127.0.0.1:6873/1197350146] --> [v2:127.0.0.1:6850/2027314847,v1:127.0.0.1:6851/2027314847] -- osd_op(unknown.0.26:131573 2.7 2:ea85266a:::201.00000000:head [writefull 0~90 [fadvise_dontneed] in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e76) -- 0x555daee1ac00 con 0x555da2bc7180
    2024-07-30T20:15:52.386+0000 7f709e887640  5 mds.1.26 trim_expired_segments: trim complete, expire_pos/trim_pos is now 244c4fda/22c00000
    2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.26 write_journal_head

Related-to: https://tracker.ceph.com/issues/59119
Signed-off-by: Patrick Donnelly <[email protected]>
Now that we only trim up to the segment created for the flush, this second wait
should no longer be necessary.

Signed-off-by: Patrick Donnelly <[email protected]>
Add context to wait for MDLog::trim_expired_segments to write the journal head.

Signed-off-by: Patrick Donnelly <[email protected]>
And remove the misplaced conditional in ::trim_expiring_segments.

This is necessary as the `flush journal` command gets confused by missing a
wait_for_expiry on a LogSegment that is not actually expired.

Signed-off-by: Patrick Donnelly <[email protected]>
Previously, the trimming of expired segments would only occur if the open file
table's committed sequence number was past the segment to be trimmed. This
check is now part of the expiry checks, so it's no longer necessary to restart
trimming when the open file table commit completes.

Furthermore, this was confusing to `flush journal` as it was waiting for the
journal head write but the open file table commit already triggered the trim.

Signed-off-by: Patrick Donnelly <[email protected]>
For dout prints.

Signed-off-by: Patrick Donnelly <[email protected]>
This was left as a TODO. : /

Signed-off-by: Patrick Donnelly <[email protected]>
To access the Journaler::Header safely.

Signed-off-by: Patrick Donnelly <[email protected]>
The MDS can unnecessarily write out the journaler head multiple times. Check
the last_written Header to see whether a write is necessary, and gather waiters
otherwise.

Signed-off-by: Patrick Donnelly <[email protected]>
batrick (Member Author) commented Sep 25, 2024

batrick (Member Author) commented Sep 26, 2024

```
[2024-09-25T23:44:20.000Z] [isolated][googletest] ceph_test_libcephfs failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs\ceph_test_libcephfs_results.xml --gtest_filter="-LibCephFS.Deleg*" >> C:\workspace\test_results\out\ceph_test_libcephfs\ceph_test_libcephfs_results.log 2>&1'".
[2024-09-25T23:44:20.000Z] [isolated][googletest] ceph_test_libcephfs_access failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs_access.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs_access\ceph_test_libcephfs_access_results.xml --gtest_filter="-LibCephFS.Deleg*" >> C:\workspace\test_results\out\ceph_test_libcephfs_access\ceph_test_libcephfs_access_results.log 2>&1'".
[2024-09-25T23:44:20.000Z] [isolated][googletest] ceph_test_libcephfs_lazyio failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs_lazyio.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs_lazyio\ceph_test_libcephfs_lazyio_results.xml --gtest_filter="-LibCephFS.Deleg*" >> C:\workspace\test_results\out\ceph_test_libcephfs_lazyio\ceph_test_libcephfs_lazyio_results.log 2>&1'".
[2024-09-25T23:44:20.000Z] [isolated][googletest] ceph_test_libcephfs_newops failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs_newops.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs_newops\ceph_test_libcephfs_newops_results.xml --gtest_filter="-LibCephFS.Deleg*" >> C:\workspace\test_results\out\ceph_test_libcephfs_newops\ceph_test_libcephfs_newops_results.log 2>&1'".
[2024-09-25T23:44:20.000Z] [isolated][googletest] ceph_test_libcephfs_snapdiff failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs_snapdiff.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs_snapdiff\ceph_test_libcephfs_snapdiff_results.xml --gtest_filter="-LibCephFS.Deleg*" >> C:\workspace\test_results\out\ceph_test_libcephfs_snapdiff\ceph_test_libcephfs_snapdiff_results.log 2>&1'".
[2024-09-25T23:44:20.000Z] [isolated][googletest] ceph_test_libcephfs_suidsgid failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs_suidsgid.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs_suidsgid\ceph_test_libcephfs_suidsgid_results.xml --gtest_filter="-LibCephFS.Deleg*" >> C:\workspace\test_results\out\ceph_test_libcephfs_suidsgid\ceph_test_libcephfs_suidsgid_results.log 2>&1'".
[2024-09-25T23:44:20.000Z] [isolated][googletest] ceph_test_libcephfs_vxattr failed. Error: Command returned non-zero code(1): "cmd /c 'C:\ceph\ceph_test_libcephfs_vxattr.exe --gtest_output=xml:C:\workspace\test_results\out\ceph_test_libcephfs_vxattr\ceph_test_libcephfs_vxattr_results.xml --gtest_filter="-LibCephFS.Deleg*" >> C:\workspace\test_results\out\ceph_test_libcephfs_vxattr\ceph_test_libcephfs_vxattr_results.log 2>&1'".
```

Hmm, will look at this tomorrow.

@batrick
Member Author

batrick commented Sep 26, 2024

Windows test failure ticket: https://tracker.ceph.com/issues/68282

Also appears in #59936, so it doesn't appear to be caused by this PR.

@batrick
Member Author

batrick commented Sep 26, 2024

Will force merge.

@batrick batrick merged commit 4f09b38 into ceph:main Sep 26, 2024
@batrick batrick deleted the i67276 branch September 26, 2024 15:25
@petrutlucian94
Contributor

petrutlucian94 commented Sep 27, 2024

@batrick I see that this PR was force-merged; however, it broke the MDS. The Windows CI failure was legitimate: the MDS now crashes: https://jenkins.ceph.com/job/ceph-windows-pull-requests/47415/artifact/artifacts/cluster/ceph_logs/mds.a.log

```
2024-09-25T22:52:15.995+0000 7f5443bc7640 -1 /home/ubuntu/ceph/src/osdc/Journaler.h: In function 'bool Journaler::is_readonly() const' thread 7f5443bc7640 time 2024-09-25T22:52:15.993313+0000
/home/ubuntu/ceph/src/osdc/Journaler.h: 568: FAILED ceph_assert(!true)

 ceph version 351d92 (c351d92b0d9db66780dbf0781c81428652c3eec7) squid (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x15d) [0x7f5449e62b72]
 2: /home/ubuntu/ceph/build/lib/libceph-common.so.2(+0x2bada1) [0x7f5449e62da1]
 3: (MDLog::create(MDSContext*)+0x266) [0x55d6fdcf0346]
 4: (MDSRank::boot_create()+0x1bb) [0x55d6fd8ff1bb]
 5: (MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x280b) [0x55d6fd905bfb]
 6: (MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0xe8b) [0x55d6fd8cf25b]
 7: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x371) [0x55d6fd8d2fb1]
 8: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xfb) [0x55d6fd8d373b]
 9: (DispatchQueue::entry()+0x629) [0x7f544a1844f9]
 10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f544a278e61]
 11: (Thread::entry_wrapper()+0x54) [0x7f5449f88b94]
 12: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f54496a1ac3]
 13: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f5449733850]

2024-09-25T22:52:15.999+0000 7f5443bc7640 -1 *** Caught signal (Aborted) **
 in thread 7f5443bc7640 thread_name:ms_dispatch
```

I tried a clean vstart cluster (Linux only) and got the same behavior: the MDS crashes immediately after I attempt a CephFS mount.

petrutlucian94 added a commit to petrutlucian94/ceph that referenced this pull request Sep 27, 2024
This reverts commit 4f09b38, reversing
changes made to 308ec88.

Signed-off-by: Lucian Petrut <[email protected]>
@batrick
Member Author

batrick commented Sep 27, 2024

@petrutlucian94 well that's surprising. I will try to reproduce.

Do you mind giving a CephFS watercooler talk or CephFS walkthrough (see Ceph calendar) on debugging these?

@batrick
Member Author

batrick commented Sep 27, 2024

I see the issue. It only happens when it's not a debug build of Ceph, since non-debug builds disable mutex debugging.

@petrutlucian94
Contributor

@batrick thanks for sending a fix. I've prepared a doc that describes the Windows CI job and the test result structure: #60030
