FEA Add writeable parameter to check_array by jeremiedbb · Pull Request #29018 · scikit-learn/scikit-learn

jeremiedbb · 2024-05-14T13:44:52Z

closes #28824
closes #14481
closes #29103
Fixes #28899
Fixes #29182

This PR proposes to add a writeable parameter to check_array. It acts as a toggle: it can be True meaning it's set or None (unset). If True, check_array will make sure that the returned array is writeable (which may require a copy). If None, the writeability of the array is left untouched.

It has the same status as the dtype or order parameters. They define desired properties of the output array. Sometimes they can only be applied by making a copy, even if copy=False.
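To make the toggle concrete, here is a minimal NumPy-only sketch of the semantics described above. The helper `validate` and its signature are hypothetical and are not the actual `check_array` implementation. It only illustrates that `writeable=True` may force a copy even when `copy=False`, while `writeable=None` leaves the flag untouched.

```python
import numpy as np


def validate(array, copy=False, writeable=None):
    """Hypothetical stand-in for the proposed behavior, not sklearn code."""
    out = np.array(array, copy=True) if copy else np.asarray(array)
    if writeable and not out.flags.writeable:
        # A read-only input cannot be modified inplace: fall back to a copy,
        # even though the caller asked for copy=False.
        out = out.copy()
    return out


X = np.arange(6, dtype=float).reshape(3, 2)
X.flags.writeable = False  # simulate a read-only input

untouched = validate(X, copy=False, writeable=None)
forced = validate(X, copy=False, writeable=True)

# writeable=None: same memory, still read-only.
print(untouched.flags.writeable, np.may_share_memory(untouched, X))
# writeable=True: a fresh, writeable copy.
print(forced.flags.writeable, np.may_share_memory(forced, X))
```

This mirrors how dtype and order already work: the request describes the output array, and a copy is the fallback when the input cannot satisfy it.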

Writeable arrays are required in estimators that can perform inplace operations. These estimators have a copy or copy_X parameter and currently they raise an error if copy=False and X is read-only. This behavior seems expected for the basic use case where the user has full control over the input array of the estimator. But in a complex pipeline, it can happen that an array is created in read-only mode (e.g. from joblib's auto memmapping) at an intermediate step, which triggers an (unexpected to me) error, the last one being #28781.

This PR also presents an alternative to #28348, which isn't safe because changing the writeable flag of an array is not possible if the array doesn't own its data. This happens even if the array is already writeable: just trying to set the flag is not allowed. That's what happens in #28899.
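As an illustration of why flipping the flag is unsafe, the following sketch shows one way the flip can be refused by NumPy. It uses an array backed by an immutable `bytes` object, which is an assumption chosen for simplicity; the arrays involved in #28899 may be of a different kind.

```python
import numpy as np

# An array built on top of an immutable bytes object does not own its data
# and is read-only.
buffer = bytes(range(8))
view = np.frombuffer(buffer, dtype=np.uint8)
print(view.flags.owndata, view.flags.writeable)

try:
    view.flags.writeable = True  # NumPy refuses to flip the flag here
    flag_flip_failed = False
except ValueError as exc:
    flag_flip_failed = True
    print(f"ValueError: {exc}")

# Copying always yields an array that owns its memory and is writeable.
safe = view.copy()
safe[0] = 255  # the original buffer is left untouched
```

Copying is therefore the only approach that works regardless of who owns the underlying memory.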

I added a common test, which for now fails as expected for most estimators because I haven't applied the writeable param to all inplace estimators, so it shows the current behavior. A few of them still pass:

AffinityPropagation: already applied writeable in this PR
FactorAnalysis: already applied writeable in this PR
SimpleImputer: already applied writeable in this PR
Birch: doesn't perform inplace operations so the param copy should be deprecated (Deprecate copy in Birch #29092)
TheilSenRegressor: doesn't perform inplace operations so the copy param should be deprecated (Deprecate copy_X in TheilSenRegressor #29098)
KernelPCA: seems to always make a copy even if copy=False so needs investigation (Fix Make centering inplace in KernelPCA #29100)
OrthogonalMatchingPursuitCV: seems to always make a copy even if copy=False so needs investigation
(I found that the only way to modify the input data is to pass a custom splitter that returns slices instead of arrays of indices, which goes against our splitter documentation. We could therefore deprecate the copy param, or leave it as is because we don't want to spend effort on this estimator and don't want to break user code.)

cc/ @thomasjpfan Here's a proposed implementation of what we're discussing in #28824 (comment)

github-actions · 2024-05-14T13:46:07Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 19dfb21. Link to the linter CI: here

thomasjpfan

Although we are adding a new parameter, this feels like a big enough bug fix to be in 1.5.1.

sklearn/tests/test_common.py

thomasjpfan · 2024-05-18T16:43:34Z

sklearn/tests/test_common.py

+    _set_checking_parameters(estimator)
+
+    # The following estimators can work inplace only with certain settings
+    if estimator.__class__.__name__ == "HDBSCAN":


Given that we are in our testing code sklearn/tests/test_common.py can we use isinstance here?

I used the class name because it's what we use in _set_checking_parameters(estimator). Otherwise we have to import all estimators for which we want to set a specific param, which feels a bit cumbersome.

jeremiedbb · 2024-05-23T13:43:28Z

I modified the behavior a bit. Previously I tried to change the writeability without copying and only copied if that failed. I felt that it was not a healthy behavior for the user. If a user calls StandardScaler(copy=False).fit_transform(X) where X is read-only, it's an ambiguous situation and it should not be scikit-learn's role to decide which path to choose. So now we always make a copy in that situation. There's one exception that comes from #28348, where it's an intermediate array that is created in read-only mode.

Arguably, we could raise an error in the ambiguous copy=False + read-only situation, but I think it's better to not raise and make estimators work in that case because intermediate arrays in complex pipelines can be read-only due to joblib's auto-memmapping (last time it happened was #28781). There are always workarounds but with this PR, we should not have to worry about this issue in the future.
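The read-only situation can be reproduced without joblib. The sketch below uses `np.load(..., mmap_mode="r")` as a stand-in for a memmapped intermediate array (an assumption, since joblib's auto-memmapping is not invoked here) and shows that an inplace operation fails while the same operation on a copy succeeds.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "X.npy")
    np.save(path, X)
    # mmap_mode="r" gives a read-only memory map, similar in spirit to what
    # a worker process may receive from joblib's auto-memmapping.
    X_ro = np.load(path, mmap_mode="r")
    is_writeable = X_ro.flags.writeable

    try:
        X_ro -= X_ro.mean(axis=0)  # inplace centering on read-only data
        inplace_failed = False
    except ValueError:
        inplace_failed = True

    # Working on a copy sidesteps the problem at the cost of extra memory.
    X_centered = np.array(X_ro)  # plain in-memory copy
    X_centered -= X_centered.mean(axis=0)
    del X_ro  # release the memory map before the directory is removed
```

Making the copy inside the validation step, rather than raising, is what lets such pipelines keep working without user intervention.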

jeremiedbb · 2024-05-23T13:50:25Z

I also found a use case that is currently failing unexpectedly in main (due to #28348) and works with this PR:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression()
X.flags.writeable = False
df = pd.DataFrame(X, copy=False)  # dataframe backed by a read-only array

LinearRegression().fit(df, y)
# fails, although LinearRegression doesn't even want to do any inplace operation

jeremiedbb · 2024-05-23T14:13:20Z

When we're happy with the behavior, I'll set the new writeable param to all estimators that want to perform inplace operations.

thomasjpfan

I'm okay with the proposed solution. This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.

jeremiedbb · 2024-05-24T12:34:50Z

Alright, I'm going to add the new kwarg in estimators. In the meantime, I opened issues and PRs for the estimators with an unexpected behavior regarding the copy param and updated the PR description with the corresponding links.

This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.

I can propose an easy quick fix for #28899 in a separate PR that we'd include in 1.5.1, and then the rest of this PR would fit in 1.6.

Edited: quick fix for #28899 in #29103.

jeremiedbb · 2024-06-05T13:55:20Z

@thomasjpfan, given that #29018 (comment) is a regression that impacts all estimators, has been reported by other users (e.g. #29182), and is not fixed by #29103, I now think that we should release it in 1.5.1 despite adding a new public param to check_array.

thomasjpfan · 2024-06-06T17:51:20Z

sklearn/utils/validation.py

        Whether an array will be forced to be fortran or c-style. If
        `None`, then the input data's order is preserved when possible.

+    writeable : True or None, default=None


I think accepting True but not False is a bit weird. Can we use more descriptive names? For example:

Suggested change

writeable : True or None, default=None

writeable : "force" or "preserve", default="preserve"

I agree that it's a bit weird. This is because I wanted to give it a similar status as order and dtype. I think a better option would have been ensure_writeable = True/False but the existing parameters with this naming pattern (ensure_xxx) are just to enable or disable a check, not to act on the output array.

Maybe we could go with force_writeable = True/False or make_writeable=True/False. There's already force_all_finite that is just to enable a check but arguably it should be ensure_all_finite. What do you think ? I think my preference is force_writeable and rename force_all_finite into ensure_all_finite later.

For this PR, I like force_writeable.

Can you open an issue on ensure_all_finite vs force_all_finite?

Alright, I made the change and opened #29262

seems to workaround of: scikit-learn/scikit-learn#29018 failing otherwise with ~: python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array array.flags.writeable = True

thomasjpfan

LGTM

ogrisel · 2024-06-18T15:59:36Z

I triggered the array API tests for CUDA GPU on this branch (after an update with main) as I suspect that it might not be 100% neutral for non-numpy inputs:

https://github.com/scikit-learn/scikit-learn/actions/runs/9568320710

EDIT: there are failing tests with CuPy:

>           if hasattr(array_data, "flags") and not array_data.flags.writeable:
E           AttributeError: 'Flags' object has no attribute 'writeable'

ogrisel

A pass of feedback.

doc/whats_new/v1.5.rst

ogrisel · 2024-06-18T15:49:12Z

sklearn/preprocessing/_data.py

            accept_sparse="csc",
            copy=copy,
            dtype=FLOAT_DTYPES,
+            force_writeable=True if not in_fit else None,


I think this would deserve an inline comment to explain why force_writeable is needed only in transform.

sklearn/utils/validation.py

ogrisel · 2024-06-18T16:10:09Z

sklearn/utils/tests/test_validation.py

+    out = check_array(df, copy=False, force_writeable=True)
+    # df is backed by a read-only array, a copy is made
+    assert not np.may_share_memory(out, df)
+    assert out.flags.writeable


I think we need a similar test for array API inputs using the generic yield_namespace_device_dtype_combinations helper.

The array API states the following: https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html

So maybe we could just raise an exception when copy is False and force_writeable and not _is_numpy_namespace(xp).

If we decide that we should not raise an exception in that case for some reason (e.g. by always triggering a copy for safety?), then we should have a dedicated test such as:

xp = pytest.importorskip("array_api_strict")

X_np = np.random.uniform(size=(10, 10))
X_np.flags.writeable = False
X_np_copy = X_np.copy()

X_xp = xp.asarray(X_np)
with sklearn.config_context(array_api_dispatch=True):
    X_xp_checked = check_array(X_xp, copy=False, force_writeable=True)
    out_ns, is_array_api = get_namespace(X_xp_checked)
    assert is_array_api
    assert out_ns == xp
    assert_allclose(_convert_to_numpy(X_xp_checked), X_np)
    X_xp_checked[:] = 0
    assert_allclose(X_np, X_np_copy)

And maybe something similar for PyTorch. I am not sure if it's possible to create readonly PyTorch tensors. On CPU it might be possible with memmaping? EDIT: I experimented with joblib.load("serialized_pytorch_cpu_tensor.pkl", mmap_mode="r") and I don't think it's possible: the result is a writeable PyTorch tensor.

However I am not sure this is what we want...

The array API draft spec for the 2024 version was updated to mention readonly flags exposed by the DLPack interchange protocol:

https://github.com/data-apis/array-api/pull/749/files

However numpy 1.16.6 does not support this (yet) and raises instead...

In [1]: import numpy as np

In [2]: a = np.random.randn(10)

In [3]: a.flags.writeable = False

In [4]: a.__dlpack__()
---------------------------------------------------------------------------
BufferError                               Traceback (most recent call last)
Cell In[4], line 1
----> 1 a.__dlpack__()

BufferError: Cannot export readonly array since signalling readonly is unsupported by DLPack.

Maybe we can reconsider inspecting __dlpack__ attributes later, once it's more widely adopted by libraries.

In retrospect, I think I would be in favor of raising an exception as first suggested in https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1644733447.

However to keep that PR minimally focused on the changes actually needed to fix the blocking bugs for 1.5.1, we can defer the new array API specific tests and the exception to a dedicated follow-up PR for 1.6 and only fix:

https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1645720337

as part of the current PR.

Alright, let's do that in a follow-up PR, the current fix should be enough for 1.5.1

ogrisel · 2024-06-19T08:50:29Z

sklearn/utils/validation.py

+    if force_writeable:
+        array_data = array.data if sp.issparse(array) else array
+        copy_params = {"order": "K"} if not sp.issparse(array) else {}
+        if hasattr(array_data, "flags") and not array_data.flags.writeable:


Suggested change

if hasattr(array_data, "flags") and not array_data.flags.writeable:

flags = getattr(array_data, "flags", None)

if not getattr(flags, "writeable", True):
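The effect of the suggested change can be checked in isolation. In the sketch below, `FlagsWithoutWriteable`, `FakeDeviceArray` and `is_read_only` are hypothetical names introduced only for illustration; the fake classes merely imitate an array whose `flags` object has no `writeable` attribute, as in the CuPy failure reported above.

```python
import numpy as np


class FlagsWithoutWriteable:
    """Hypothetical flags object that exposes no `writeable` attribute."""

    c_contiguous = True


class FakeDeviceArray:
    """Hypothetical array-like standing in for a non-NumPy array."""

    flags = FlagsWithoutWriteable()


def is_read_only(array_data):
    # Missing `flags`, or a `flags` object without `writeable`, is treated as
    # "writeable": only an explicit writeable=False counts as read-only.
    flags = getattr(array_data, "flags", None)
    return not getattr(flags, "writeable", True)


read_only = np.zeros(3)
read_only.flags.writeable = False

print(is_read_only(read_only))          # NumPy array flagged read-only
print(is_read_only(np.zeros(3)))        # regular writeable NumPy array
print(is_read_only(FakeDeviceArray()))  # flags without `writeable`
print(is_read_only([1, 2, 3]))          # no `flags` attribute at all
```

Unlike the `hasattr(array_data, "flags")` check, this form never touches `flags.writeable` directly, so it cannot raise `AttributeError` on array types that expose a partial flags object.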

Do you think we need to trigger the array API tests again after this fix to be safe ?

Let me do that here:

https://github.com/scikit-learn/scikit-learn/actions/runs/9601131915

The failures seem unrelated to this PR. We need to check if they also occur on main:

https://github.com/scikit-learn/scikit-learn/actions/runs/9604238613

Confirmed, those 8 failures are not related to this specific PR.

ogrisel

Other than fixing the existing array API tests (https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1645849964) and other small details in the previous review, LGTM.

ogrisel · 2024-06-21T07:55:11Z

I merged too quickly, we now get:

FAILED tests/test_common.py::test_check_inplace_ensure_writeable[KernelPCA()] - ValueError: output array is read-only

on main.

Co-authored-by: Olivier Grisel <[email protected]>

common test + first applications

8dc719b

jeremiedbb mentioned this pull request May 14, 2024

RFC Trigger a copy when copy=False and X is read-only #28824

Closed

jeremiedbb added 2 commits May 14, 2024 16:34

include sparse

5bda2c3

simpler

98010b3

glemaitre self-assigned this May 16, 2024

thomasjpfan reviewed May 18, 2024

glemaitre removed their assignment May 20, 2024

glemaitre self-requested a review May 20, 2024 13:35

jeremiedbb added 3 commits May 23, 2024 10:50

Merge remote-tracking branch 'upstream/main' into check-array-writeable

64966ed

always copy when writeable + read-only but for 1 pandas exception

50d7cd0

nit

a8b56a7

thomasjpfan reviewed May 23, 2024

jeremiedbb mentioned this pull request May 24, 2024

FIX change writeability only if not already writeable #29103

Closed

jeremiedbb added 6 commits May 24, 2024 16:03

Merge remote-tracking branch 'upstream/main' into check-array-writeable

1c42d5a

wip

23cda9c

add writeable to estimators with inplace operations

08cbaee

Merge remote-tracking branch 'upstream/main' into check-array-writeable

dfe8483

fix sparse and select arrays with flags

230d19c

rework mmap test using existing testing tool

0fe8eaf

jeremiedbb mentioned this pull request Jun 5, 2024

Validation step fails when trying to set WRITEABLE flag to True #29182

Closed

add change log entry

2f796e2

thomasjpfan reviewed Jun 6, 2024

Merge remote-tracking branch 'upstream/main' into check-array-writeable

e1e7aa8

Merge remote-tracking branch 'origin/check-array-writeable' into chec…

096936e

…k-array-writeable

jeremiedbb mentioned this pull request Jun 13, 2024

Performance Regression in scikit-learn 1.5.0: Execution Time for ColumnTransformer Scales Quadratically with the Number of Transformers when n_jobs > 1 #29229

Closed

rename force_writeable and make it a bool

9ed87a2

jeremiedbb mentioned this pull request Jun 14, 2024

API rename force_all_finite into ensure_all_finite in check_array ? #29262

Closed

thomasjpfan approved these changes Jun 14, 2024

astrojuanlu mentioned this pull request Jun 14, 2024

Investigate why Spaceflights project failing with ParallelRunner kedro-org/kedro#3674

Closed

Merge branch 'main' into check-array-writeable

301c5c3

ogrisel reviewed Jun 18, 2024

ogrisel reviewed Jun 19, 2024

ogrisel approved these changes Jun 19, 2024

jeremiedbb added 5 commits June 20, 2024 11:35

Merge remote-tracking branch 'upstream/main' into check-array-writeable

ccfcd05

fix what's new + add comments

5344d7b

Merge remote-tracking branch 'origin/check-array-writeable' into chec…

3e47271

…k-array-writeable

cln merge

145e36d

Merge remote-tracking branch 'upstream/main' into check-array-writeable

19dfb21

ogrisel merged commit ef6efef into scikit-learn:main Jun 20, 2024

lesteve mentioned this pull request Jun 21, 2024

⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev (last failure: Jun 21, 2024) ⚠️ #29325

Closed

ogrisel mentioned this pull request Jun 21, 2024

FIX missing force_writeable in KernelCenterer.transform #29328

Merged

jeremiedbb mentioned this pull request Jun 21, 2024

Fix performance regression in ColumnTransformer #29330

Merged

jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024

FEA Add writeable parameter to check_array (scikit-learn#29018)

7d80f6b

Co-authored-by: Olivier Grisel <[email protected]>

jeremiedbb mentioned this pull request Jul 2, 2024

Release 1.5.1 #29382

Merged

11 tasks

jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024

FEA Add writeable parameter to check_array (scikit-learn#29018)

a9823c7

Co-authored-by: Olivier Grisel <[email protected]>

jeremiedbb added a commit that referenced this pull request Jul 2, 2024

FEA Add writeable parameter to check_array (#29018)

9c9f106

Co-authored-by: Olivier Grisel <[email protected]>

This was referenced Jul 5, 2024

[Blocking issue] ValueError: need at least one array to concatenate py-why/EconML#854

Open

Update responsibleai to scikit-learn 1.5.1 microsoft/responsible-ai-toolbox#2570

Merged

prateekdesai04 mentioned this pull request Aug 23, 2024

Upgrade scikit-learn to 1.5.1 autogluon/autogluon#4420

Merged
