[WIP] Make random_state accept np.random.Generator#23962
BenjaminBossan wants to merge 11 commits into scikit-learn:main from
Conversation
Done:
- Added tests for estimators
- Made tests for estimators pass

Missing:
- splitters
- other components, e.g. for creating random datasets
- documentation
- docstrings

Caveats

It's almost impossible to have complete test coverage for this feature. The reason is that even though we check all estimators that support random_state, we don't know if the code path that actually uses random_state is being taken or not, since it might depend on hyper-parameters.
sklearn/utils/fixes.py
Outdated

```python
# below copied verbatim from scipy._lib._util.py to be used in rng_integers
try:
    from numpy.random import Generator as Generator
```
Our minimum supported NumPy version is 1.17.3, so we can assume that Generators can be imported.
I missed that 👍
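Since the minimum supported NumPy version already ships `Generator`, the guarded import can become a plain one. A quick sanity check (illustrative, not code from this PR):

```python
import numpy as np
# With NumPy >= 1.17 this import always succeeds, so the try/except
# copied from scipy is no longer needed:
from numpy.random import Generator, default_rng

# default_rng is the recommended way to create a Generator instance
rng = default_rng(0)
assert isinstance(rng, Generator)
```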
sklearn/tree/tests/test_tree.py
Outdated

```diff
  X_sparse_pos = random_state.uniform(size=(20, 5))
  X_sparse_pos[X_sparse_pos <= 0.8] = 0.0
- y_random = random_state.randint(0, 4, size=(20,))
+ y_random = rng_integers(random_state, 0, 4, size=(20,))
```
To make this PR smaller, I prefer to leave the test unchanged. Currently, the tests are always using a RandomState object.
```diff
  (
- rng.randint(n_samples, size=n_nonzeros),
- rng.randint(n_features, size=n_nonzeros),
+ rng_integers(rng, n_samples, size=n_nonzeros),
```
Same here: no need to change files in the main sklearn codebase.
- Import can assume that Generator exists
- Revert rng_integers use where not necessary
thomasjpfan left a comment
The linting error and CI failure look related to this PR.
If it's okay, I would address the linting problems later, before creating the non-draft PR.

The CI does not fully run unless linting passes. This makes it harder to evaluate the PR even as a draft.
@thomasjpfan Regarding the failing tests, I think we have an interesting problem here.

For me, it is about the scope. This PR turns on Generator support everywhere, which touches a lot of estimators all at once. I prefer to turn on Generator support incrementally.
- Remove rng_integers call where not necessary
- Isinstance check for random ints also accepts np.int_
- Black formatting
I changed the tests to also accept np.int_.
This is certainly feasible. A disadvantage would be that
There are still problems stemming from the integer dtype, e.g. here: I believe those go back to the problem I mentioned earlier:
There are some possible solutions to that problem but I'm not sure which one to take.
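For illustration (not code from this PR), the scalar-dtype mismatch mentioned above can be reproduced with plain NumPy: `RandomState.randint` returns a Python `int` for scalar draws, while `Generator.integers` returns a NumPy integer scalar, so a strict `isinstance(x, int)` check fails for Generators:

```python
import numpy as np

a = np.random.RandomState(0).randint(10)   # plain Python int
b = np.random.default_rng(0).integers(10)  # numpy integer scalar (int64)

assert isinstance(a, int)
assert isinstance(b, np.integer) and not isinstance(b, int)

# One way to write a check that accepts both flavors:
assert all(isinstance(x, (int, np.integer)) for x in (a, b))
```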
I am okay with that as long as we document which estimators support generators in their docstrings. We can incrementally update the docstrings of
I'm thinking more about estimators opting in to Generator support, not about user opt-in. Let's say I want to add `"random_state": ["random_state", np.random.Generator]` and include it in the common test. During review, we can look at MDS's code to make sure the estimator is configured in a way that actually uses the random state. With this PR turning on Generator support everywhere, it is hard to confirm that all estimators are configured to actually use the generator. For me, this makes it harder to review.
We decided to opt in estimators (and all the rest) step by step into using Generators. Therefore, I reverted all the changes in the actual estimators that were necessary to accommodate Generators, which comes down to the use of rng_integers for now. The common test has been adjusted to have a long list of excluded estimators -- currently containing all estimators -- that are skipped for testing. The idea is that if a new PR comes along that opts an estimator in, it should be as easy as crossing that estimator off the list to check if it still works.

Note for future developers: the "random_state" variable is sometimes also referred to as "rng" or "rnd" (and maybe others that I missed), so a simple grep for "random_state" is not enough.
Updated status

After discussion with maintainers, we decided that estimators, splitters, and other functions should be opted in step by step into allowing Generators. Currently, this PR thus contains tests for estimators that check if they can be fitted with a Generator.

Guide how to opt an estimator into allowing Generators:
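The kind of opt-in fit test described above could look roughly like the following sketch. The toy estimator and helper names here are illustrative stand-ins, not the PR's actual code:

```python
import numpy as np

class ToyEstimator:
    """Stand-in for a scikit-learn estimator that uses random_state."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X):
        # `choice` exists with the same signature on both RandomState
        # and Generator, so this works with either flavor.
        self.sample_idx_ = self.random_state.choice(
            len(X), size=5, replace=False
        )
        return self

def check_fit_accepts_generator(Estimator):
    """Fit once with a RandomState and once with a Generator."""
    X = np.arange(100.0).reshape(20, 5)
    for rng in (np.random.RandomState(0), np.random.default_rng(0)):
        Estimator(random_state=rng).fit(X)

check_fit_accepts_generator(ToyEstimator)
```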
No actual changes to how the code works, since KBinsDiscretizer only uses the 'choice' method, which is backwards compatible.
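To illustrate why no code change is needed: `choice` is one of the methods whose name and signature are shared by `RandomState` and `Generator`, unlike the `randint`/`integers` pair:

```python
import numpy as np

rs = np.random.RandomState(0)
gen = np.random.default_rng(0)

# The identical call works on both RNG flavors:
a = rs.choice(10, size=5, replace=False)
b = gen.choice(10, size=5, replace=False)
assert a.shape == b.shape == (5,)

# `randint`, by contrast, only exists on RandomState:
assert hasattr(rs, "randint") and not hasattr(gen, "randint")
```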
TODOs

Here is a list of classes and functions I could find that use a random_state:

Estimators
Splitters
Rest
Documentation

Besides the individual docstrings of the classes/functions mentioned above, the documentation should be adjusted here:
@thomasjpfan as discussed, I changed the PR to only test a single estimator to decrease the review burden. That estimator is KBinsDiscretizer.

As for the updated docstring, for now I went with this very simple change:

```diff
- random_state : int, RandomState instance or None, default=None
+ random_state : int, RandomState/Generator instance or None, default=None
```

The reason is that this line is already quite long and, from what I can tell, very long lines (or line breaks) are not desired for parameter types. The body itself has not been altered. LMK if we want to do that.
Reference Issues/PRs
Fixes #16988
What does this implement/fix? Explain your changes.

The `random_state` argument accepts `numpy.random.Generator`.

Any other comments?
Context
Update: Please see this comment.
This is WIP and I discussed with @thomasjpfan that it would make sense to share the current progress to evaluate if the scope is sufficiently small for a single PR or if we need to split it.
Done

- Made tests for estimators pass (reverted)

Missing

- `n_jobs > 1` is probably out of scope

Implementation
One difficulty is that `Generator` has a slightly different API than the existing `RandomState` class, namely that creating integers now happens through the `integers` method, not `randint`. We (Thomas and I) discussed 3 different approaches to support `Generator`s:

1. Adapter exposing the `RandomState` API

   If `check_random_state` sees a `Generator`, it returns an adapter that supports the `randint` method with the old signature. This would be backwards compatible with all existing code but locks sklearn into the "old way". Also, the appearance of this new class could be surprising to users.

2. Adapter exposing the `Generator` API

   If `check_random_state` sees a `RandomState`, it returns an adapter that supports the `integers` method with the old signature. This would be forwards compatible with the "new way". However, it requires changing all existing calls to `randint`, and the appearance of this new class could be surprising to users.

3. An `rng_integers` helper function

   This is the way that scipy approached the problem. It also requires changing all the calls to `randint`, but it's more transparent than solution 2. One disadvantage is that all other sampling functions are method calls on the object; only integers require this function, which can be surprising.

In the end, we decided to go with option 3 because we assume that it worked well for scipy and should thus also serve sklearn well.
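A simplified sketch of the scipy-style helper from option 3 (scipy's actual `rng_integers` additionally handles `dtype` and `endpoint` arguments; this minimal version only shows the dispatch idea):

```python
import numpy as np

def rng_integers(gen, low, high=None, size=None):
    """Draw random integers from either RNG flavor.

    Dispatches to ``Generator.integers`` or ``RandomState.randint``
    so call sites don't need to know which object they hold.
    """
    if isinstance(gen, np.random.Generator):
        return gen.integers(low, high=high, size=size)
    return gen.randint(low, high=high, size=size)

# The same call site now works with both flavors:
old = rng_integers(np.random.RandomState(0), 0, 4, size=(20,))
new = rng_integers(np.random.default_rng(0), 0, 4, size=(20,))
assert old.shape == new.shape == (20,)
```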
Another decision that I made while working on the feature is not to change `randint` method calls where the object is known to be a `RandomState`, e.g. in the many tests that draw integers directly from a seeded `RandomState`.

Therefore, grepping through the repo for `randint` still reveals many direct calls, but unless I overlooked something, they should all be safe.

Caveats
It's almost impossible to have complete test coverage for this feature. The reason is that even though we check all estimators that support `random_state`, we don't know if the code path that actually uses `random_state` is being taken or not, since it might depend on hyper-parameters. A similar argument applies to splitters and other functions.