FEA Add writeable parameter to check_array#29018
Conversation
thomasjpfan
left a comment
There was a problem hiding this comment.
Although we are adding a new parameter, this feels like a big enough bug fix to be in 1.5.1.
sklearn/tests/test_common.py
Outdated
| _set_checking_parameters(estimator) | ||
|
|
||
| # The following estimators can work inplace only with certain settings | ||
| if estimator.__class__.__name__ == "HDBSCAN": |
There was a problem hiding this comment.
Given that we are in our testing code sklearn/tests/test_common.py can we use isinstance here?
There was a problem hiding this comment.
I used the class name because it's what we use in _set_checking_parameters(estimator). Otherwise we have to import all all estimators for which we want to set a specific param which feels a bit cumbersome.
|
I modified the behavior a bit. Previously I tried to change the writeability without copy and only copy if it failed. I felt that it was not a healthy behavior for the user. If a user calls Arguably, we could raise an error in the ambiguous |
|
I also found a use case that is currently failing unexpectedly in main (due to #28348) and works with this PR: import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
X, y = make_regression()
X.flags.writeable = False
df = pd.DataFrame(X, copy=False) # dataframe backed by a read-only array
LinearRegression().fit(df, y)
# fails, although LinearRegression doesn't even want to do any inplace operation |
|
When we're happy with the behavior, I'll set the new writeable param to all estimators that want to perform inplace operations. |
thomasjpfan
left a comment
There was a problem hiding this comment.
I'm okay with the proposed solution. This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.
|
Alright, I'm going to add the new kwarg in estimators. In the mean time, I opened issues and PRs for the estimators with an unexpected behavior regarding the copy param and update the PR description with corresponding links.
I can propose an easy quick fix for #28899 in a separate PR that we'd include in 1.5.1, and then the rest of this PR would fit in 1.6. |
|
@thomasjpfan, given that #29018 (comment) is a regression that impacts all estimators, has been reported by other users (e.g. #29182), and is not fixed by #29103, I now think that we should release it in 1.5.1 despite adding a new public param to check_array. |
sklearn/utils/validation.py
Outdated
| Whether an array will be forced to be fortran or c-style. If | ||
| `None`, then the input data's order is preserved when possible. | ||
|
|
||
| writeable : True or None, default=None |
There was a problem hiding this comment.
I think accepting True but not False is a bit weird. Can we use more descriptive names? For example:
| writeable : True or None, default=None | |
| writeable : "force" or "preserve", default="preserve" |
There was a problem hiding this comment.
I agree that it's a bit weird. This is because I wanted to give it a similar status as order and dtype. I think a better option would have been ensure_writeable = True/False but the existing parameters with this naming pattern (ensure_xxx) are just to enable or disable a check, not to act on the output array.
Maybe we could go with force_writeable = True/False or make_writeable=True/False. There's already force_all_finite that is just to enable a check but arguably it should be ensure_all_finite. What do you think ? I think my preference is force_writeable and rename force_all_finite into ensure_all_finite later.
There was a problem hiding this comment.
For this PR, I like force_writeable.
Can you open an issue on ensure_all_finite vs force_all_finite?
seems to workaround of: scikit-learn/scikit-learn#29018 failing otherwise with ~: python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array array.flags.writeable = True
…k-array-writeable
|
I triggered the array API tests for CUDA GPU on this branch (after an update with EDIT: there are failing tests with CuPy: |
| accept_sparse="csc", | ||
| copy=copy, | ||
| dtype=FLOAT_DTYPES, | ||
| force_writeable=True if not in_fit else None, |
There was a problem hiding this comment.
I think this would deserve an inline comment to explain why force_writeable is needed only in transform.
| out = check_array(df, copy=False, force_writeable=True) | ||
| # df is backed by a read-only array, a copy is made | ||
| assert not np.may_share_memory(out, df) | ||
| assert out.flags.writeable |
There was a problem hiding this comment.
I think we need a similar test for array API inputs using the generic yield_namespace_device_dtype_combinations helper.
The array API states the following: https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html
So maybe we could just raise an exception when copy is False and force_writeable and not _is_numpy_namespace(xp).
There was a problem hiding this comment.
If we decide that we should not raise an exception in that case for some reason (e.g. by always triggering a copy for safety?), then we should have a dedicated test such as:
xp = pytest.importorskip("array_api_strict")
X_np = np.random.uniform(size=(10, 10))
X_np.flags.writeable = False
X_np_copy = X_np.copy()
X_xp = xp.asarray(X_np)
with sklearn.config_context(array_api_dispatch=True):
X_xp_checked = check_array(X_xp, copy=False, force_writeable=True)
out_ns, is_array_api = get_namespace(X_xp_checked)
assert is_array_api
assert out_ns == xp
assert_allclose(_convert_to_numpy(X_xp_checked), X_np)
X_xp_checked[:] = 0
assert_allclose(X_np, X_np_copy)And maybe something similar for PyTorch. I am not sure if it's possible to create readonly PyTorch tensors. On CPU it might be possible with memmaping? EDIT: I experimented with joblib.load("serialized_pytorch_cpu_tensor.pkl", mmap_mode="r") and I don't think it's possible: the result is a writeable PyTorch tensor.
However I am not sure this is what we want...
There was a problem hiding this comment.
The array API draft spec for the 2024 version was updated to mention readonly flags exposed by the DLPack interchange protocol:
However numpy 1.16.6 does not support this (yet) and raises instead...
In [1]: import numpy as np
In [2]: a = np.random.randn(10)
In [3]: a.flags.writeable = False
In [4]: a.__dlpack__()
---------------------------------------------------------------------------
BufferError Traceback (most recent call last)
Cell In[4], line 1
----> 1 a.__dlpack__()
BufferError: Cannot export readonly array since signalling readonly is unsupported by DLPack.maybe we can reconsider inspecting __dlpack__ attributes later, once it's more widely adopted by libraries.
There was a problem hiding this comment.
In retrospect, I think I would be in favor of raising an exception as first suggested in https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1644733447.
However to keep that PR minimally focused on the changes actually needed to fix the blocking bugs for 1.5.1, we can defer the new array API specific tests and the exception to a dedicated follow-up PR for 1.6 and only fix:
as part of the current PR.
There was a problem hiding this comment.
Alright, let's do that in a follow-up PR, the current fix should be enough for 1.5.1
sklearn/utils/validation.py
Outdated
| if force_writeable: | ||
| array_data = array.data if sp.issparse(array) else array | ||
| copy_params = {"order": "K"} if not sp.issparse(array) else {} | ||
| if hasattr(array_data, "flags") and not array_data.flags.writeable: |
There was a problem hiding this comment.
| if hasattr(array_data, "flags") and not array_data.flags.writeable: | |
| flags = getattr(array_data, "flags", None) | |
| if not getattr(flags, "writeable", True): |
There was a problem hiding this comment.
Do you think we need to trigger the array API tests again after this fix to be safe ?
There was a problem hiding this comment.
There was a problem hiding this comment.
The failures seem unrelated to this PR. We need to check if they also occur on main:
There was a problem hiding this comment.
Confirmed, those 8 failures are not related to this specific PR.
ogrisel
left a comment
There was a problem hiding this comment.
Other than fixing the existing array API tests (https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1645849964) and other small details in the previous review, LGTM.
…k-array-writeable
|
I merged too quickly, we now get: on |
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
closes #28824
closes #14481
closes #29103
Fixes #28899
Fixes #29182
This PR proposes to add a
writeableparameter tocheck_array. It acts as a toggle: it can be True meaning it's set or None (unset). If True,check_arraywill make sure that the returned array is writeable (which may require a copy). If None, the writeability of the array is left untouched.It has the same status as the
dtypeororderparameters. They define desired properties of the output array. Sometimes they can only be applied by making a copy, even if copy=False.Writeable arrays are required in estimators that can perform inplace operations. These estimators have a
copyorcopy_Xparameter and currently they raise an error ifcopy=Falseand X is read-only. This behavior seems expected of rthe basic use case where the user has full control over the input array of the estimator. But in a complex pipeline, in can happen that an array is created in read-only mode (e.g. from joblib's auto memmapping) at an intermediate step which triggers an (unexpected to me) error, the last one being #28781.This PR also presents an alternative to #28348, which isn't safe because changing the writeable flag of an array is not possible if the array doesn't own its data. And it happens even if the array is aleardy writeable, just trying to set the flag is not allowed. That's what happens in #28899.
I added a common test, which for now fails as expected for most estimators because I haven't applied the
writeableparam to all inplace estimators, so it shows the current behavior. A few of them still pass:writeablein this PRwriteablein this PRwriteablein this PRcopy=Falseso needs investigation (Fix Make centering inplace in KernelPCA #29100)copy=Falseso needs investigation(I found that the only way to modify the input data is to pass a custom splitter that returns slices instead of arrays of indices which goes against our splitter doc, see doc, so we could deprecate the copy param or leave it as is because we don't want to put effort on this estimator and don't want to break user code)
cc/ @thomasjpfan Here's a proposed implementation of what we're discussing in #28824 (comment)