Avoid cloning source code multiple times#6629
Conversation
PR Check ResultsEcosystem✅ ecosystem check detected no changes. BenchmarkLinuxWindows |
MichaReiser
left a comment
There was a problem hiding this comment.
Nice, this should reduce memory usage quiet a bit and may even improve the cpython benchmark.
The new solution still feels a bit awkward to me. Maybe it is because I don't understand the separation between Sources, SourceType, and SourceKind well enough. Maybe we don't have these abstractions right yet which makes this more complicated than it needs to be? Could it also help to avoid storing the Notebook contents twice?
crates/ruff_cli/src/diagnostics.rs
Outdated
| } | ||
|
|
||
| #[derive(Debug)] | ||
| struct Sources<'a> { |
There was a problem hiding this comment.
I find Sources a confusing name, because it only contains a single source file.
Could we maybe marry this with our SourceFile implementation that also gives us cheap cloning?
crates/ruff_cli/src/diagnostics.rs
Outdated
| } else { | ||
| Ok(Sources { | ||
| source_type, | ||
| source_kind: SourceKind::Python, |
There was a problem hiding this comment.
Could we instead change SourceKind to:
enum SourceKind<'a> {
Python(&'a str),
Notebook(Notebook)
}
impl SourceKind<'_> {
fn source_code(&self) -> &str {
match self {
Self::Python(source) => source,
Self::Notebook(notebook) => notebook.contents()
}
}
There was a problem hiding this comment.
I'm using source_codehere (or source) because I find contents a too generic term (I can't even tell from its name if it is a string)
There was a problem hiding this comment.
This was my instinct too. I tried it but it didn’t work, because the Diagnostics struct includes the SourceKind so that it can reconstruct the cell ranges when it reports diagnostics at the end. So we’d have to add lifetimes to Diagnostics and then keep all the source code in memory for the life of the program.
There was a problem hiding this comment.
Hmm I see. I have to take a closer look at this tomorrow. I wonder if SourceKind (or something similar) should be a trait and / or if we can structure these types differently to also avoid the overlap between SourceKind and SourceType
There was a problem hiding this comment.
Diagnostics only needs access to the index though, not the file contents, which may help.
There was a problem hiding this comment.
Yeah, it's another source mapping (the reverse after applying fixes). It would be nice to have a more formal concept for that rather than special casing jupyter notebooks in the Emitter. Because we'll have the same problem when linting markdown files: We need to map back the relative code block indices to the absolute file indices, potentially customizing the message.
|
It feels awkward to me too, but I didn’t want to do anything more invasive. I want to see if we can remove the contents from Notebook, perhaps, which would make the responsibilities clearer. |
ee92fff to
7f35448
Compare
11dfb35 to
1554c36
Compare
1554c36 to
ccef889
Compare
|
@MichaReiser - I made some improvements based on your suggestions but LMK how you want to proceed (e.g., whether you want to spend time on this, or want me to do so, or want to merge and revisit later). |
|
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
MichaReiser
left a comment
There was a problem hiding this comment.
Thanks. I like the improvements and the improved naming.
We should follow up on notebooks. I think it would be nice if SourceKind could store the source code as well, to avoid passing two arguments everywhere. However, I didn't manage to do that because of some lifetime issues when fixing notebooks (and updating notebook indices)
dhruvmanila
left a comment
There was a problem hiding this comment.
Looks good! Thanks for doing this :)

Summary
In working on #6628, I noticed that we clone the source code contents, potentially multiple times, prior to linting. The issue is that
SourceKind::Pythontakes aString, so we first have to provide it with aString. In the stdin case, that means cloning. However, on top of this, we then have to clonesource_kind.contents()becauseSourceKindgets mutated. So for stdin, we end up cloning twice. For non-stdin, we end up cloning once, but unnecessarily (since the contents don't get mutated, only the kind).This PR removes the
Stringfromsource_kind, instead requiring that we parse it out elsewhere. It reduces the number of clones down to 1 for Jupyter Notebooks, and zero otherwise.