Refactor `SourceKind` to store file content #6640
Conversation
PR Check Results: ✅ ecosystem check detected no changes.
```diff
 #[derive(Clone, Debug, PartialEq, is_macro::Is)]
 pub enum SourceKind {
-    Python,
+    Python(String),
```
It would be nice to store the `PySourceType` on the `SourceKind` as well (making it a `Program` or `SourceFile` entity). However, it is a bit awkward because `PySourceType` has a `Jupyter` variant, but that variant should only be used for Jupyter...
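What that could look like as a sketch (everything here is hypothetical and simplified; only the `PySourceType` and `Jupyter` names come from the discussion):

```rust
// Hypothetical stand-in for ruff's PySourceType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PySourceType {
    Python,
    Stub,
    Jupyter,
}

// Hypothetical "SourceFile entity": content plus its source type. The
// awkward part is that `Jupyter` would now appear both here and wherever
// the code already distinguishes notebooks from plain files.
#[derive(Debug, Clone, PartialEq)]
struct SourceFile {
    source_type: PySourceType,
    content: String,
}

impl SourceFile {
    fn is_notebook(&self) -> bool {
        self.source_type == PySourceType::Jupyter
    }
}
```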
```diff
 pub result: LinterResult<(Vec<Message>, Option<ImportMap>)>,
 /// The resulting source code, after applying any fixes.
-pub transformed: Cow<'a, str>,
+pub transformed: Cow<'a, SourceKind>,
```
The change here is that the linter now returns the transformed `SourceKind`, which aligns the handling of notebooks with regular Python files in the sense that it no longer mutates the source file in place.
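A minimal sketch of the `Cow`-based return (simplified, hypothetical types and function body; not the actual ruff code):

```rust
use std::borrow::Cow;

// Hypothetical, reduced stand-in for ruff's SourceKind.
#[derive(Clone, Debug, PartialEq)]
enum SourceKind {
    Python(String),
}

// The linter borrows the input and only allocates when a fix changed it.
fn lint_fix(source: &SourceKind, apply_fix: bool) -> Cow<'_, SourceKind> {
    if apply_fix {
        // A fix was applied: return an owned, transformed copy instead of
        // mutating the caller's source in place.
        let SourceKind::Python(code) = source;
        Cow::Owned(SourceKind::Python(code.replace("l == None", "l is None")))
    } else {
        // Nothing changed: hand back a cheap borrow of the input.
        Cow::Borrowed(source)
    }
}
```

The caller can then check `matches!(result, Cow::Owned(_))` to know whether anything was fixed, instead of diffing against a mutated input.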
```diff
 settings,
 noqa,
-source_kind,
+Some(source_kind),
```
Not sure why this is optional, but I wanted to avoid increasing the scope of this even more.
The reason this is optional is that there's no `--add-noqa` support for Jupyter notebooks yet. The logic for `add_noqa_to_path` basically calls into `check_path`, and as far as I recall there were some circular dependency problems. Maybe it could be solved now with the recent changes.
Could `add_noqa_to_path` detect the `PySourceType` and exit early (with a warning) if it is a Jupyter notebook, so that downstream code doesn't have to handle optional source kinds?
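One way that early exit could look (a hedged sketch; the type shape, function signature, and extension check are all illustrative, not the real ruff implementation):

```rust
// Illustrative stand-in for ruff's PySourceType.
#[derive(Debug, PartialEq)]
enum PySourceType {
    Python,
    Stub,
    Jupyter,
}

// Toy classification by file extension.
fn source_type_from_extension(path: &str) -> PySourceType {
    if path.ends_with(".ipynb") {
        PySourceType::Jupyter
    } else if path.ends_with(".pyi") {
        PySourceType::Stub
    } else {
        PySourceType::Python
    }
}

// Hypothetical sketch: bail out before any noqa handling for notebooks,
// so everything downstream can assume a non-optional source kind.
fn add_noqa_to_path(path: &str) -> Result<usize, String> {
    if source_type_from_extension(path) == PySourceType::Jupyter {
        eprintln!("warning: `--add-noqa` is not yet supported for Jupyter notebooks: {path}");
        return Ok(0);
    }
    // ... a real implementation would lint and insert `# noqa` comments here.
    Ok(0)
}
```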
```rust
pub(crate) fn updated(&self, new_source: String, source_map: &SourceMap) -> Self {
    match self {
        SourceKind::Jupyter(notebook) => {
            let mut cloned = notebook.clone();
```
This is certainly more expensive than the old "update in place".
Yeah, my main reason to update in place was to avoid cloning the entire notebook. Actually, the only things we need to update in each linting cycle are the cell markers; the transformed source code can be carried over to the next iteration. The cell content can be updated at the end, right before we actually update the `Notebook`, and only when `--fix` is passed.

Using the above idea, we could create a `NotebookView` which would only contain the necessary fields (for now I think that's the cell markers and the concatenated source code?) and should be much cheaper to clone. We would not even need to clone it: an `updated` method on the view would take the `SourceMap` and the transformed source code and create a new view with the updated source code and cell markers. Once the linter iteration is complete, we can apply the view onto the `Notebook`, update the cell content then, and write the `Notebook` back to disk.

I think the abstraction would be a view into the source code, be it Python or Jupyter, with an `updated` method to update the view. At the end, the view is applied to the concrete type and saved to disk.

This is just a rough idea based on my morning walk :)
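A rough sketch of that `NotebookView` idea (all names hypothetical; the real `updated` would shift the offsets via the `SourceMap` rather than taking them as an argument):

```rust
// Hypothetical lightweight view: just the concatenated source and the
// cell boundary offsets, so a per-iteration clone stays cheap compared
// to cloning a whole Notebook.
#[derive(Clone, Debug)]
struct NotebookView {
    source: String,
    // Byte offset at which each cell starts in `source`.
    cell_offsets: Vec<usize>,
}

impl NotebookView {
    // Produce a new view for the next linter iteration; the original
    // view (and the underlying Notebook) stay untouched.
    fn updated(&self, new_source: String, new_offsets: Vec<usize>) -> Self {
        Self {
            source: new_source,
            cell_offsets: new_offsets,
        }
    }
}
```

Once the loop finishes, the final view would be applied back onto the `Notebook` (updating cell content once) and written to disk.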
I don't think I'm able to follow. Do you want to take a stab at this together with performing the renaming that you suggested? Or is this something that needs changing before landing this PR?
I tried to change `raw` to an `Rc` to get cheap cloning. However, it turns out that `update_cell_content` mutates the raw cells. The only larger structure that doesn't seem to get updated is `valid_code_cells`.

I think we could also move `index` out of the `Notebook`, because it is only used in `Diagnostics`, and instead have a `JupyterIndex` that is either `Lazy(Notebook)` or `Built(Index)`.
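The `JupyterIndex` idea could be sketched as follows (hypothetical; `Notebook` and `Index` are reduced to minimal stand-ins, and the index construction is a toy):

```rust
// Minimal stand-ins for the real types.
#[derive(Debug, Clone)]
struct Notebook {
    // Row at which each cell starts, plus a final sentinel row.
    cell_starts: Vec<u32>,
}

#[derive(Debug, Clone, PartialEq)]
struct Index {
    // For each row of the concatenated source, the cell it belongs to.
    row_to_cell: Vec<u32>,
}

// Either defer building the index until diagnostics need it, or reuse
// one that was already built.
enum JupyterIndex {
    Lazy(Notebook),
    Built(Index),
}

impl JupyterIndex {
    fn build(self) -> Index {
        match self {
            JupyterIndex::Built(index) => index,
            JupyterIndex::Lazy(notebook) => {
                // Toy construction: map each row to its cell number.
                let mut row_to_cell = Vec::new();
                for (cell, window) in notebook.cell_starts.windows(2).enumerate() {
                    for _ in window[0]..window[1] {
                        row_to_cell.push(cell as u32);
                    }
                }
                Index { row_to_cell }
            }
        }
    }
}
```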
> Do you want to take a stab at this together with performing the renaming that you suggested?

Yeah, I can give it a go next week.

> Or is this something that needs changing before landing this PR?

No

> The only larger structure that doesn't seem to get updated is `valid_code_cells`.

What do you mean here by "larger structure"? `valid_code_cells` is just a vector of `u32`.

> However, it turns out that `update_cell_content` mutates the raw cells.

Yeah, this update isn't really required for every linter loop. We can avoid doing that and only update it at the end; this basically uses the concatenated source code and cell markers to update the cell content.

> I think we could also move `index` out of the `Notebook`, because it is only used in `Diagnostics`, and instead have a `JupyterIndex` that is either `Lazy(Notebook)` or `Built(Index)`.

Yeah, this is a good idea.
> What do you mean here by "larger structure"? `valid_code_cells` is just a vector of `u32`.

With "larger", I mean any non-fixed-size, heap-allocated data structure that requires a deep clone.
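For context on the `Rc` attempt mentioned above: cloning an `Rc` is O(1) because it only bumps a reference count, which is exactly what clashes with in-place mutation once a second handle is alive. A small illustration (hypothetical helper name):

```rust
use std::rc::Rc;

// Returns true if `handle` is the only live reference to the cells,
// i.e. they could still be mutated in place via `Rc::get_mut`.
fn can_mutate(handle: &mut Rc<Vec<String>>) -> bool {
    Rc::get_mut(handle).is_some()
}
```

This is why `update_cell_content` mutating the raw cells breaks the cheap-clone plan: as soon as the old and new source kinds share the allocation, neither can be updated in place.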
```diff
 pub(crate) fixed: FxHashMap<String, FixTable>,
 pub(crate) imports: ImportMap,
-pub(crate) source_kind: FxHashMap<String, SourceKind>,
+pub(crate) notebooks: FxHashMap<String, Notebook>,
```
This looks reasonable. So this allows us to keep the string content on `SourceKind::Python`, since we no longer store the Python source kind on here anyway.
charliermarsh left a comment
I think I'm supportive of this. It's really nice to avoid the `&mut` on the source kind.
@dhruvmanila any opinion on this? Any recommendations on how to test the changes?
dhruvmanila left a comment
Thanks for making these changes :)
In the near future we should move towards finalizing an abstraction layer between different sources and the engines (linter, formatter, etc.).
```rust
Ok(test_contents(&SourceKind::Python(contents), &path, settings).0)
}

pub(crate) struct TestedNotebook {
```
This is good. Thanks for doing this :)
My main concern would be to test the linter loop (> 2 iterations) with these cells:

```python
import random
import math
```

```python
try:
    print()
except ValueError:
    pass
```

```python
import random
import pprint

random.randint(10, 20)
```

```python
foo = 1
if foo is 2:
    raise ValueError(f"Invalid foo: {foo is 1}")
```

Command:
Thanks @dhruvmanila for the extensive test plan. I played through the commands (showing diagnostics, diff, and fixing) and it works as expected.

Summary
This PR changes two things. You may want to take a look at the individual commits. I'm not entirely convinced whether we should land this change, and it certainly needs more testing:

1. Changed `Diagnostics` to store the notebooks rather than the `SourceKind`. The sole reason is that we don't need the kind; we only need the notebook index to resolve cells. Ideally, this would be handled by a custom diagnostic kind, but we aren't there yet.
2. Changed `SourceKind` to store the content for both `Notebook`s and regular Python files. This allows us to align `lint_fix` to never update the input in place, and instead return the updated result as a `Cow<'a, SourceKind>`. I find this easier to reason about than having to remember that notebooks are updated in place but regular source code is not. However, this has the downside that we now need to clone the `Notebook`s, which may be rather expensive (multiple large vecs).

I'm looking for general feedback. Also happy if someone with more understanding of Jupyter notebooks wants to take this over.
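A hedged sketch of point 1 (simplified, hypothetical types): diagnostics only need a path-to-notebook map so that rows in the concatenated source can be resolved back to cells, without carrying the whole `SourceKind` around:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Notebook {
    // Toy stand-in: row at which each cell starts, in ascending order.
    cell_starts: Vec<u32>,
}

impl Notebook {
    // Resolve a concatenated-source row to its 1-based cell number by
    // counting how many cells start at or before that row.
    fn cell_for_row(&self, row: u32) -> usize {
        self.cell_starts
            .iter()
            .take_while(|&&start| start <= row)
            .count()
    }
}

#[derive(Default)]
struct Diagnostics {
    // Only notebooks are stored; plain Python files need no index.
    notebooks: HashMap<String, Notebook>,
}
```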
Test Plan