[WIP] Make random_state accept np.random.Generator#23962
BenjaminBossan wants to merge 11 commits into scikit-learn:main from
Conversation
Done:
- Added tests for estimators
- Made tests for estimators pass

Missing:
- splitters
- other components, e.g. for creating random datasets
- documentation
- docstrings

Caveats

It's almost impossible to have complete test coverage for this feature. The reason is that even though we check all estimators that support random_state, we don't know if the code path that actually uses random_state is being taken or not, since it might depend on hyper-parameters.
sklearn/utils/fixes.py
Outdated

```python
# below copied verbatim from scipy._lib._util.py to be used in rng_integers
try:
    from numpy.random import Generator as Generator
```
Our minimum supported NumPy version is 1.17.3, so we can assume that Generators can be imported.
I missed that 👍
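Since the minimum supported NumPy version already ships `Generator`, the guarded import can become a plain one. A quick sanity check (illustrative, not code from this PR):

```python
import numpy as np
# With NumPy >= 1.17 this import always succeeds, so the try/except
# copied from scipy is no longer needed:
from numpy.random import Generator, default_rng

# default_rng is the recommended way to create a Generator instance
rng = default_rng(0)
assert isinstance(rng, Generator)
```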
sklearn/tree/tests/test_tree.py
Outdated

```diff
  X_sparse_pos = random_state.uniform(size=(20, 5))
  X_sparse_pos[X_sparse_pos <= 0.8] = 0.0
- y_random = random_state.randint(0, 4, size=(20,))
+ y_random = rng_integers(random_state, 0, 4, size=(20,))
```
To make this PR smaller, I prefer to leave the test unchanged. Currently, the tests are always using a RandomState object.
```diff
  (
- rng.randint(n_samples, size=n_nonzeros),
- rng.randint(n_features, size=n_nonzeros),
+ rng_integers(rng, n_samples, size=n_nonzeros),
```
Same here: no need to change files in the main sklearn codebase.
- Import can assume that Generator exists
- Revert rng_integers use where not necessary
thomasjpfan left a comment
The linting error and CI failure look related to this PR.
If it's okay, I would address the linting problems later, before creating the non-draft PR.

The CI does not fully run unless linting passes. This makes it harder to evaluate the PR even as a draft.
@thomasjpfan Regarding the failing tests, I think we have an interesting problem here.

For me, it is about the scope. This PR turns on Generator support everywhere, which touches a lot of estimators all at once. I prefer to turn on Generator support incrementally.
- Remove rng_integers call where not necessary
- Isinstance check for random ints also accepts np.int_
- Black formatting
I changed the tests to also accept np.int_.
This is certainly feasible. A disadvantage would be that
There are still problems stemming from the integer dtype, e.g. here: I believe those go back to the problem I mentioned earlier:
There are some possible solutions to that problem but I'm not sure which one to take.
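For illustration (not code from this PR), the scalar-dtype mismatch mentioned above can be reproduced with plain NumPy: `RandomState.randint` returns a Python `int` for scalar draws, while `Generator.integers` returns a NumPy integer scalar, so a strict `isinstance(x, int)` check fails for Generators:

```python
import numpy as np

a = np.random.RandomState(0).randint(10)   # plain Python int
b = np.random.default_rng(0).integers(10)  # numpy integer scalar (int64)

assert isinstance(a, int)
assert isinstance(b, np.integer) and not isinstance(b, int)

# One way to write a check that accepts both flavors:
assert all(isinstance(x, (int, np.integer)) for x in (a, b))
```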
I am okay with that as long as we document which estimators support generators in their docstrings. We can incrementally update the docstrings of
I'm thinking more about estimators opting in to Generator support, not about user opt-in. Let's say I want to add `"random_state": ["random_state", np.random.Generator]` and include it in the common test. During review, we can look at MDS's code to make sure the estimator is configured in a way that actually uses the random state. With this PR turning on Generator support everywhere, it is hard to confirm that all estimators are configured to actually use the generator. For me, this makes it harder to review.
We decided to opt in estimators (and all the rest) step by step into using Generators. Therefore, I reverted all the changes in the actual estimators that were necessary to accommodate Generators, which comes down to the use of rng_integers for now. The common test has been adjusted to have a long list of excluded estimators -- currently containing all estimators -- that are skipped for testing. The idea is that if a new PR comes along that opts an estimator in, it should be as easy as crossing that estimator off the list to check if it still works.

Note for future developers: the "random_state" variable is sometimes also referred to as "rng" or "rnd" (and maybe others that I missed), so a simple grep for "random_state" is not enough.
Updated status

After discussion with maintainers, we decided that estimators, splitters, and other functions should be opted in step by step into allowing Generators. Currently, this PR thus contains tests for estimators that check if they can be fitted with a Generator.

Guide how to opt an estimator into allowing Generators:
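The kind of opt-in fit test described above could look roughly like the following sketch. The toy estimator and helper names here are illustrative stand-ins, not the PR's actual code:

```python
import numpy as np

class ToyEstimator:
    """Stand-in for a scikit-learn estimator that uses random_state."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X):
        # `choice` exists with the same signature on both RandomState
        # and Generator, so this works with either flavor.
        self.sample_idx_ = self.random_state.choice(
            len(X), size=5, replace=False
        )
        return self

def check_fit_accepts_generator(Estimator):
    """Fit once with a RandomState and once with a Generator."""
    X = np.arange(100.0).reshape(20, 5)
    for rng in (np.random.RandomState(0), np.random.default_rng(0)):
        Estimator(random_state=rng).fit(X)

check_fit_accepts_generator(ToyEstimator)
```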
No actual changes to how the code works, since KBinsDiscretizer only uses the 'choice' method, which is backwards compatible.
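To illustrate why no code change is needed: `choice` is one of the methods whose name and signature are shared by `RandomState` and `Generator`, unlike the `randint`/`integers` pair:

```python
import numpy as np

rs = np.random.RandomState(0)
gen = np.random.default_rng(0)

# The identical call works on both RNG flavors:
a = rs.choice(10, size=5, replace=False)
b = gen.choice(10, size=5, replace=False)
assert a.shape == b.shape == (5,)

# `randint`, by contrast, only exists on RandomState:
assert hasattr(rs, "randint") and not hasattr(gen, "randint")
```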
TODOs

Here is a list of classes and functions I could find that use a random_state:

Estimators
Splitters
Rest
Documentation

Besides the individual docstrings of the classes/functions mentioned above, the documentation should be adjusted here:
@thomasjpfan as discussed, I changed the PR to only test a single estimator to decrease the review burden. That estimator is KBinsDiscretizer.

As for the updated docstring, for now I went with this very simple change:

```diff
- random_state : int, RandomState instance or None, default=None
+ random_state : int, RandomState/Generator instance or None, default=None
```

The reason is that this line is already quite long and, from what I can tell, very long lines (or line breaks) are not desired for parameter types. The body itself has not been altered. LMK if we want to do that.
Reference Issues/PRs
Fixes #16988
What does this implement/fix? Explain your changes.

The `random_state` argument accepts `numpy.random.Generator`.

Any other comments?
Context
Update: Please see this comment.
This is WIP and I discussed with @thomasjpfan that it would make sense to share the current progress to evaluate if the scope is sufficiently small for a single PR or if we need to split it.
Done

- Made tests for estimators pass (reverted)

Missing

- `n_jobs > 1` is probably out of scope

Implementation
One difficulty is that `Generator` has a slightly different API than the existing `RandomState` class, namely that creating integers now happens through the `integers` method, not `randint`. We (Thomas and I) discussed 3 different approaches to support `Generator`s:

1. Adapter exposing the `RandomState` API

   If `check_random_state` sees a `Generator`, it returns an adapter that supports the `randint` method with the old signature. This would be backwards compatible with all existing code but locks sklearn into the "old way". Also, the appearance of this new class could be surprising to users.

2. Adapter exposing the `Generator` API

   If `check_random_state` sees a `RandomState`, it returns an adapter that supports the `integers` method with the old signature. This would be forwards compatible with the "new way". However, it requires changing all existing calls to `randint`, and the appearance of this new class could be surprising to users.

3. An `rng_integers` helper function

   This is the way that scipy approached the problem. It also requires changing all the calls to `randint`, but it's more transparent than solution 2. One disadvantage is that all other sampling functions are method calls on the object; only integers require this function, which can be surprising.

In the end, we decided to go with option 3 because we assume that it worked well for scipy and should thus also serve sklearn well.
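A simplified sketch of the scipy-style helper from option 3 (scipy's actual `rng_integers` additionally handles `dtype` and `endpoint` arguments; this minimal version only shows the dispatch idea):

```python
import numpy as np

def rng_integers(gen, low, high=None, size=None):
    """Draw random integers from either RNG flavor.

    Dispatches to ``Generator.integers`` or ``RandomState.randint``
    so call sites don't need to know which object they hold.
    """
    if isinstance(gen, np.random.Generator):
        return gen.integers(low, high=high, size=size)
    return gen.randint(low, high=high, size=size)

# The same call site now works with both flavors:
old = rng_integers(np.random.RandomState(0), 0, 4, size=(20,))
new = rng_integers(np.random.default_rng(0), 0, 4, size=(20,))
assert old.shape == new.shape == (20,)
```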
Another decision that I made while working on the feature is not to change `randint` method calls where the object is known to be a `RandomState`, e.g. in the many tests that draw integers directly from a seeded `RandomState`.

Therefore, grepping through the repo for `randint` still reveals many direct calls, but unless I overlooked something, they should all be safe.

Caveats
It's almost impossible to have complete test coverage for this feature. The reason is that even though we check all estimators that support `random_state`, we don't know if the code path that actually uses `random_state` is being taken or not, since it might depend on hyper-parameters. A similar argument applies to splitters and other functions.