mds: flush journal asok command should trim only up to the segment created for the flush#58936
mds: flush journal asok command should trim only up to the segment created for the flush#58936
flush journal asok command should trim only up to the segment created for the flush#58936Conversation
| mdlog->wait_for_safe(new MDSInternalContextWrapper(mds, ctx)); | ||
| } | ||
|
|
||
| void handle_flush_mdlog(int r) { |
| * segment. | ||
| */ | ||
| int MDLog::trim_all() | ||
| int MDLog::trim_to(SegmentBoundary::seq_t seq) |
There was a problem hiding this comment.
Do we have to worry about doing something similar in MDLog::trim()?
There was a problem hiding this comment.
Never mind. trim() would run till max_segments or the threshold.
|
@vshankar good to approve? |
I'll have another look soon. |
|
This PR is under test in https://tracker.ceph.com/issues/67516. |
src/mds/MDSRank.cc
Outdated
|
|
||
| MDCache *mdcache; | ||
| MDLog *mdlog; | ||
| SegmentBoundary* sb = nullptr; |
There was a problem hiding this comment.
minor nit: this only needs the seq number to pass to trim_to() so just include that rather than using SegmentBoundary?
| trim_segments(); | ||
| } | ||
|
|
||
| void trim_segments() { |
There was a problem hiding this comment.
IIRC, this was done this way to invoke trim_expired_segments() via finisher thread. You probably have already checked the commit history for an explanation and didn't find one? (that should be my fault) :)
There was a problem hiding this comment.
Hmm, perhaps to avoid a call to mdlog->trim_expired_segments() from in deep context callback paths? I suppose that could create trouble somehow. In any case, I don't understand why the original code doesn't just push the FunctionContext onto the finisher queue instead of wrapping with C_OnFinisher which it immediately completes. Do you know why?
There was a problem hiding this comment.
Ah, I see. Sometimes the context will complete with the MDLog::submit_mutex locked and cause deadlock. I'll add a note.
52f2648 to
397f298
Compare
|
This PR is under test in https://tracker.ceph.com/issues/67563. |
|
flush command hangs: From: /teuthology/pdonnell-2024-08-15_05:17:58-fs-wip-pdonnell-testing-20240815.005222-debug-distro-default-smithi/7856292/remote/smithi098/log/ceph-mds.b.log.gz on openfiletable flush. Need to add a context to wait for this type of thing during trim. |
|
This PR is under test in https://tracker.ceph.com/issues/67757. |
|
@vshankar this needs a second look. These are the new commits |
|
This passes QA, just needs reviews. |
Noted. |
|
(trivial rebase to hopefully fix the broken windows test) |
|
jenkins test make check arm64 |
looks like
2024-08-15T06:08:36.056+0000 7f29379f4640 10 mds.0.journal try_to_expire waiting for purge of []
Nothing to purge!
Signed-off-by: Patrick Donnelly <[email protected]>
Signed-off-by: Patrick Donnelly <[email protected]>
This neatly avoids a situation where the MDS trims past the segment created for
the `flush journal` command.
Here's what it looks like:
2024-07-30T20:15:51.679+0000 7f70a809a640 1 mds.d asok_command: flush journal {prefix=flush journal} (starting...)
2024-07-30T20:15:51.679+0000 7f70a809a640 10 mds.1.26 handle_asok_command: flush journal
2024-07-30T20:15:51.679+0000 7f70a809a640 20 mds.1.26 send
2024-07-30T20:15:51.679+0000 7f70a809a640 20 mds.1.26 flush_mdlog
2024-07-30T20:15:51.679+0000 7f70a809a640 10 mds.1.cache create_subtree_map 106 subtrees, 105 fullauth
...
2024-07-30T20:15:51.680+0000 7f70a809a640 20 mds.1.log _submit_entry ESubtreeMap 105 subtrees , 0 ambiguous [metablob 0x101, 107 dirs]
2024-07-30T20:15:51.680+0000 7f70a809a640 20 mds.1.log _start_new_segment: seq = 522843 starting new segment LogSegment(522843/0xffffffffffffffff events=0)
...
2024-07-30T20:15:51.686+0000 7f709e887640 20 mds.1.26 handle_clear_mdlog: r=0
2024-07-30T20:15:51.686+0000 7f709e887640 5 mds.1.26 trim_mdlog: beginning segment expiry
2024-07-30T20:15:51.686+0000 7f709e887640 10 mds.1.log trim_to: 522843 23/0/0
...
2024-07-30T20:15:51.858+0000 7f709e887640 20 mds.1.26 expire_segments
2024-07-30T20:15:51.858+0000 7f709e887640 5 mds.1.26 expire_segments: waiting for 22 segments to expire
...
2024-07-30T20:15:52.386+0000 7f70a008a640 20 mds.1.26 handle_expire_segments: r=0
2024-07-30T20:15:52.386+0000 7f70a008a640 20 mds.1.26 trim_segments
2024-07-30T20:15:52.386+0000 7f709e887640 5 mds.1.26 trim_expired_segments: expiry complete, expire_pos/trim_pos is now 22f4c774/22c00000
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(503314/0x22f4c774 events=649)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(503314/0x22f4c774 events=649)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(503963/0x230019e6 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(503963/0x230019e6 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(504987/0x23119d93 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(504987/0x23119d93 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(506011/0x23234d24 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(506011/0x23234d24 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(507035/0x23392678 events=397)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(507035/0x23392678 events=397)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(507432/0x23400db4 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(507432/0x23400db4 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(508456/0x2351c0af events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(508456/0x2351c0af events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(509480/0x23639df0 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(509480/0x23639df0 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(510504/0x2375912d events=607)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(510504/0x2375912d events=607)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(511111/0x2380290f events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(511111/0x2380290f events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(512135/0x2391b7f3 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(512135/0x2391b7f3 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(513159/0x23a6a15f events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(513159/0x23a6a15f events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(514183/0x23b855e7 events=442)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(514183/0x23b855e7 events=442)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(514625/0x23c01325 events=979)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(514625/0x23c01325 events=979)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(515604/0x23d0e086 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: expiring up to this major segment seq=515604
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(503314/0x22f4c774 events=649)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(503963/0x230019e6 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(504987/0x23119d93 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(506011/0x23234d24 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(507035/0x23392678 events=397)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(507432/0x23400db4 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(508456/0x2351c0af events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(509480/0x23639df0 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(510504/0x2375912d events=607)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(511111/0x2380290f events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(512135/0x2391b7f3 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(513159/0x23a6a15f events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(514183/0x23b855e7 events=442)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(514625/0x23c01325 events=979)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(515604/0x23d0e086 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(516628/0x23e2a879 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(516628/0x23e2a879 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(517652/0x23f47dbf events=665)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(517652/0x23f47dbf events=665)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(518317/0x240007be events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(518317/0x240007be events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(519341/0x2411af83 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(519341/0x2411af83 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(520365/0x24234a83 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(520365/0x24234a83 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(521389/0x24344292 events=691)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(521389/0x24344292 events=691)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(522080/0x24400e24 events=763)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: maybe expiring LogSegment(522080/0x24400e24 events=763)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: examining LogSegment(522843/0x244c4fda events=1)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments: expiring up to this major segment seq=522843
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(515604/0x23d0e086 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(516628/0x23e2a879 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(517652/0x23f47dbf events=665)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(518317/0x240007be events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(519341/0x2411af83 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(520365/0x24234a83 events=1024)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(521389/0x24344292 events=691)
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.log _trim_expired_segments: expiring LogSegment(522080/0x24400e24 events=763)
2024-07-30T20:15:52.386+0000 7f709e887640 10 mds.1.log _trim_expired_segments waiting for expiry LogSegment(522843/0x244c4fda events=1)
2024-07-30T20:15:52.386+0000 7f709e887640 1 -- [v2:127.0.0.1:6872/1197350146,v1:127.0.0.1:6873/1197350146] --> [v2:127.0.0.1:6850/2027314847,v1:127.0.0.1:6851/2027314847] -- osd_op(unknown.0.26:131573 2.7 2:ea85266a:::201.00000000:head [writefull 0~90 [fadvise_dontneed] in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e76) -- 0x555daee1ac00 con 0x555da2bc7180
2024-07-30T20:15:52.386+0000 7f709e887640 5 mds.1.26 trim_expired_segments: trim complete, expire_pos/trim_pos is now 244c4fda/22c00000
2024-07-30T20:15:52.386+0000 7f709e887640 20 mds.1.26 write_journal_head
Related-to: https://tracker.ceph.com/issues/59119
Signed-off-by: Patrick Donnelly <[email protected]>
Now that we only trim up to the segment created for the flush, this second wait should no longer be necessary. Signed-off-by: Patrick Donnelly <[email protected]>
Signed-off-by: Patrick Donnelly <[email protected]>
Signed-off-by: Patrick Donnelly <[email protected]>
Add context to wait for MDLog::trim_expired_segments to write the journal head. Signed-off-by: Patrick Donnelly <[email protected]>
And remove the misplaced conditional in ::trim_expiring_segments. This is necessary as the `flush journal` command gets confused by missing a wait_for_expiry on a LogSegment that is not actually expired. Signed-off-by: Patrick Donnelly <[email protected]>
Previously, the trimming of expired segments would only occur if the open file table's committed sequence number is past the segment to be trimmed. This is now part of the expiry checks so it's no longer necessary to restart trimming when the open file table commit completes. Furthermore, this was confusing to `flush journal` as it was waiting for the journal head write but the open file table commit already triggered the trim. Signed-off-by: Patrick Donnelly <[email protected]>
For dout prints. Signed-off-by: Patrick Donnelly <[email protected]>
This was left as a TODO. : / Signed-off-by: Patrick Donnelly <[email protected]>
To access the Journaler::Header safely. Signed-off-by: Patrick Donnelly <[email protected]>
The MDS can unnecessarily write out the journaler head multiple times. Check the last_written Header to see if it's necessary to write and gather waiters. Signed-off-by: Patrick Donnelly <[email protected]>
hmm, will look at this tomorrow. |
|
Windows test failure ticket: https://tracker.ceph.com/issues/68282 Also appears in #59936. Doesn't appear to be caused by this PR. |
|
Will force merging. |
|
@batrick I see that this PR was forcefully merged, however it broke MDS. The Windows CI failure was legitimate, MDS now crashes: https://jenkins.ceph.com/job/ceph-windows-pull-requests/47415/artifact/artifacts/cluster/ceph_logs/mds.a.log I tried a clean vstart cluster (Linux only) and got the same behavior, MDS crashes immediately after I attempt to do a cephfs mount. |
This reverts commit 4f09b38, reversing changes made to 308ec88. Signed-off-by: Lucian Petrut <[email protected]>
|
@petrutlucian94 well that's surprising. I will try to reproduce. Do you mind giving a CephFS watercooler talk or CephFS walkthrough (see Ceph calendar) on debugging these? |
|
I see the issue. This only happens when it's not a debug build of Ceph as that disables mutex debugging. |
Overall goal here is to avoid situations where the MDS is trimming aggressively past the segment created for the flush. Some code has demonstratively been sensitive to that.
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e