Fix ColumnTransformer in parallel with joblib's auto memmapping#28822
Merged
ogrisel merged 5 commits intoscikit-learn:mainfrom Apr 22, 2024
Merged
Fix ColumnTransformer in parallel with joblib's auto memmapping#28822ogrisel merged 5 commits intoscikit-learn:mainfrom
ogrisel merged 5 commits intoscikit-learn:mainfrom
Conversation
Member
Author
|
I opened #28824 to discuss the read-only situation more globally. |
jeremiedbb
commented
Apr 12, 2024
Comment on lines
2465
to
2467
| with parallel_backend("loky", max_nbytes=1): | ||
| Xt = transformer.fit_transform(X) | ||
|
|
Member
Author
There was a problem hiding this comment.
CI is failing because this is only doable in joblib>=1.13 and our min is 1.12.
I can use a bigger array for now
Member
There was a problem hiding this comment.
we could skip the test on that joblib though.
adrinjalali
approved these changes
Apr 15, 2024
Member
adrinjalali
left a comment
There was a problem hiding this comment.
I agree this might be a bigger issue, but this is a minimal change that fixes a few cases. So LGTM.
Comment on lines
2465
to
2467
| with parallel_backend("loky", max_nbytes=1): | ||
| Xt = transformer.fit_transform(X) | ||
|
|
Member
There was a problem hiding this comment.
we could skip the test on that joblib though.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #28781
When running in parallel, ColumnTransformer will crash if joblib's auto memmap triggers and copies are not made in time.
Currently we index X when declaring the jobs. It means we have copy then read-only memmap. Then if the transformer fails to do inplace transfo, or fails earlier in case of dataframe (see #28781 (comment)).
The fix here proposes to index X within each job instead. This way we have read-only memmap then copy, and the transformer can do inplace transfo.
Disclaimer: it doesn't solve the underlying problem completely. If you select columns by slice it still fails because it creates a view and not a copy. I'm starting to think that the issue is more profound, and lies between the
copyparameter andcheck_array, for all estimators. I thinkcheck_arrayshould always make a copy if the array is read-only, even ifcopy=Falsebecause when an estimator has acopyparameter, it's because it wants to do inplace modifications.