FIX Remove warnings when fitting a dataframe #21578

thomasjpfan · 2021-11-06T20:17:53Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR removes input validation for methods that already validate the input. The common test was updated to catch warnings during fit.
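A minimal sketch of the failure mode and of a warning-catching check may help readers who do not have the diff open. It assumes the warning appears because `fit` validates the dataframe (dropping the column names) and then calls a public method that validates the already-converted data a second time. `NamedTable`, `ToyEstimator` and `fit_warnings` are hypothetical names invented for this illustration; none of this is scikit-learn code.

```python
import warnings


class NamedTable:
    """Hypothetical stand-in for a dataframe: rows plus column names."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns


class ToyEstimator:
    """Toy estimator used only to illustrate double validation."""

    def __init__(self, use_private_predict):
        self.use_private_predict = use_private_predict

    def _validate(self, X, reset):
        names = getattr(X, "columns", None)
        if reset:
            # Remember the column names seen during fit.
            self.feature_names_in_ = names
        elif names is None and self.feature_names_in_ is not None:
            warnings.warn(
                "X does not have feature names, but the estimator "
                "was fitted with feature names",
                UserWarning,
            )
        # Validation returns plain rows, so the names are lost from here on.
        return X.rows if isinstance(X, NamedTable) else X

    def fit(self, X):
        X = self._validate(X, reset=True)
        self.mean_ = sum(row[0] for row in X) / len(X)
        if self.use_private_predict:
            # Skip the second validation: X was already checked above.
            self.labels_ = self._predict(X)
        else:
            # Validates the stripped X again and triggers the warning.
            self.labels_ = self.predict(X)
        return self

    def predict(self, X):
        X = self._validate(X, reset=False)
        return self._predict(X)

    def _predict(self, X):
        return [int(row[0] > self.mean_) for row in X]


def fit_warnings(estimator, X):
    """Fit the estimator and return the messages of all warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        estimator.fit(X)
    return [str(w.message) for w in caught]
```

With this toy, `fit_warnings(ToyEstimator(use_private_predict=False), X)` reports one warning for a `NamedTable` input, while the variant that calls `_predict` internally reports none.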

glemaitre

LGTM. It would be good if @jeremiedbb could look at it, since we worked on this before the release but did not think about these methods.

I assume that it is relatively difficult to have a general test, since it would depend on some parameters for some estimators. We might still have such corner cases for some of the estimators.

sklearn/cluster/_birch.py

adrinjalali · 2021-11-09T10:38:24Z

sklearn/ensemble/_forest.py

            The OOB predictions.
      """
-        X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
+        X = check_array(X, dtype=DTYPE, accept_sparse="csr", copy=False)


I feel like we either want to validate the data or not. If this is a private method and we know validation is done somewhere else, why do we need to call check_array?

In this case, check_array is required to convert the CSC matrix (used during fit) into a CSR matrix for prediction. I'm undecided on who should be responsible for this: either the caller of _compute_oob_predictions or _compute_oob_predictions itself.

In any case, I updated the PR with a comment regarding this behavior.

adrinjalali · 2021-11-09T10:39:56Z

sklearn/cluster/_birch.py


        if compute_labels:
-            self.labels_ = self.predict(X)
+            self.labels_ = self._predict(X)


Wouldn't a cleaner API be something like self.predict(X, validate_input=False), or self.validate_input(predict=False).predict(X)?

> self.predict(X, validate_input=False)

I don't think introducing a public arg for that is cleaner. I find it clean that we use a private function internally and expose a public function that does extra validation.

> self.validate_input(predict=False).predict(X)

I don't get the predict=False arg. Could you explain more about what you have in mind?

There are other reasons why a public, or a developer API, would be nice to have when it comes to [skipping] validation: #16653 (comment)

The predict=False would kinda set a flag in the estimator to skip the validation in a certain method.

> The predict=False would kinda set a flag in the estimator to skip the validation in a certain method.

I think adding more state to the estimator after __init__ is outside the scope of this PR, but we can use this PR as a motivation to do it. It would kind of be like "inference mode".

> self.predict(X, validate_input=False)

I think it would be very nice to have this type of kwarg everywhere. It would be similar to the check_finite flag in SciPy. (Every year I see a "Scikit-learn is slow during prediction" report, and it comes down to the validation we do.)

In both cases, I do not think we should change public API with a bug fix PR.
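For readers weighing the two designs in this thread, the keyword-argument alternative could look roughly like the sketch below. The `validate_input` parameter is hypothetical: it is the idea proposed above, not an existing scikit-learn argument, and `ToyThreshold` is a made-up class. The check mimics what a finiteness check would do, loosely in the spirit of SciPy's `check_finite` flag.

```python
import math


class ToyThreshold:
    """Toy model; the validate_input keyword is hypothetical."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold
        self.n_validations = 0  # counts how often validation actually ran

    def _check(self, X):
        self.n_validations += 1
        for value in X:
            if not math.isfinite(value):
                raise ValueError("X contains NaN or infinity")
        return X

    def predict(self, X, validate_input=True):
        # Callers that already validated X can opt out of the check.
        if validate_input:
            X = self._check(X)
        return [int(value > self.threshold) for value in X]
```

The trade-off is visible even in the toy: opting out saves the pass over the data, but non-finite values then flow through silently, which is one reason such a switch is a public API decision rather than a bug-fix detail.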

sklearn/utils/estimator_checks.py

adrinjalali · 2021-11-12T15:16:46Z

I find the introduction of a private method like _transform in this PR just to handle cases where we want validation and cases where we don't, quite a hacky solution.

I think this would be a nice trigger to introduce the developer or public API we've talked about, to disable input validation in these estimators. Not sure what others think @scikit-learn/core-devs

ogrisel

LGTM. I agree that having a generic public API to disable validation checks at inference is beyond the scope of this bugfix PR and just calling ad-hoc private functions when necessary is good enough for now.

I just have a question:

ogrisel · 2021-11-15T14:13:04Z

sklearn/ensemble/_forest.py

+        """
+        # Prediction requires X to be in CSR format
+        if issparse(X):
+            X = check_array(X, accept_sparse="csr", force_all_finite=True)


Why force_all_finite=True here if input validation has already been performed in the caller?

Wouldn't the following be enough?

Suggested change

-            X = check_array(X, accept_sparse="csr", force_all_finite=True)
+            X = X.tocsr()

ogrisel · 2021-11-15T14:16:10Z

Also, this comment about adding a comment to explain the test has not been addressed: https://github.com/scikit-learn/scikit-learn/pull/21578/files#r745491797

thomasjpfan · 2021-11-18T16:18:51Z

I updated the PR with the comment and the suggestion.

ogrisel · 2021-11-19T08:39:33Z

A test froze in the macOS CI and triggered the 60-minute timeout. I pushed another commit to check whether this happens deterministically in this PR or was a rare random event.

ogrisel · 2021-11-19T08:40:49Z

Here is where it happened in the run for the 5c22dec commit:

[...]
......................................s.s..s........s.s.s.......s.s.s... [ 47%]
....s.s.s.......s.s.s.......s.s.s....................................... [ 47%]
............
##[error]The operation was canceled.

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=35125&view=logs&j=97641769-79fb-5590-9088-a30ce9b850b9&t=4745baa1-36b5-56c8-9a8e-6480742db1a6

lorentzenchr · 2021-11-19T08:59:45Z

> I find the introduction of a private method like _transform in this PR just to handle cases where we want validation and cases where we don't, quite a hacky solution.
>
> I think this would be a nice trigger to introduce the developer or public API we've talked about, to disable input validation in these estimators. Not sure what others think @scikit-learn/core-devs

Should we open an issue for this to be discussed? Or is there already one?

adrinjalali · 2021-11-19T09:21:37Z

@lorentzenchr we could open a new issue, some background is present here: #16653

jjerphan

LGTM. Thank you, @thomasjpfan.

I agree with you: this PR is fine as is and a general API is to be introduced in another.
And as always, Julien is nitpicking with some tiny suggestions when nothing has to fundamentally be changed.

sklearn/ensemble/_gb.py

doc/whats_new/v1.0.rst

sklearn/utils/estimator_checks.py

lorentzenchr · 2021-11-27T11:48:45Z

> @lorentzenchr we could open a new issue, some background is present here: #16653

Done in #21804.

sklearn/tests/test_common.py

jjerphan

OK with the follow-up PR for MLP, and with merging this one after resolving the problem on the CI.

doc/whats_new/v1.0.rst

Co-authored-by: Olivier Grisel <[email protected]>

thomasjpfan added 2 commits November 6, 2021 16:08

FIX Removes double validation in fit

3ed49d5

DOC Adds whats new

7021d0c

thomasjpfan added the Blocker label Nov 6, 2021

thomasjpfan added this to the 1.0.2 milestone Nov 6, 2021

DOC Adds pr number

4b3d2f5

thomasjpfan mentioned this pull request Nov 6, 2021

new bug in V1.0 new added attribute 'feature_names_in' #21577

Closed

glemaitre approved these changes Nov 7, 2021

jeremiedbb reviewed Nov 8, 2021

sklearn/cluster/_birch.py (outdated, resolved)

thomasjpfan added 2 commits November 8, 2021 11:08

Merge remote-tracking branch 'upstream/main' into remove_warning_random_forest_classifier

7fd4f65

CLN Address comments

d9ca88a

adrinjalali reviewed Nov 9, 2021

thomasjpfan added 4 commits November 10, 2021 09:44

Merge remote-tracking branch 'upstream/main' into remove_warning_random_forest_classifier

a449a64

DOC Adds comment about CSR format for prediction

45ed3ad

TST Updates common test to check every estimator that predicts during fit

27027a0

DOC Update whats new

a13f7ee

thomasjpfan mentioned this pull request Nov 10, 2021

_check_feature_names raising a false positive when fitting a GBDT and n_iter_no_change is not None #21618

Closed

ArturoAmorQ mentioned this pull request Nov 12, 2021

MNT Fix or explain remaining warnings INRIA/scikit-learn-mooc#486

Merged

ogrisel approved these changes Nov 15, 2021

thomasjpfan added 2 commits November 18, 2021 08:15

Merge remote-tracking branch 'upstream/main' into remove_warning_random_forest_classifier

cde9373

ENH Use tocsr

29aa492

thomasjpfan and others added 2 commits November 18, 2021 08:59

BUG Fixes bug

5c22dec

Trigger CI

fbf7006

jjerphan approved these changes Nov 22, 2021

sklearn/ensemble/_gb.py (resolved)

doc/whats_new/v1.0.rst (outdated, resolved)

sklearn/utils/estimator_checks.py (outdated, resolved)

thomasjpfan added 3 commits November 24, 2021 11:45

Merge remote-tracking branch 'upstream/main' into remove_warning_random_forest_classifier

c74fec0

CLN Address comments

d0a5c4c

Merge remote-tracking branch 'upstream/main' into remove_warning_random_forest_classifier

4e3e8e8

lorentzenchr mentioned this pull request Nov 27, 2021

RFC / API add option to fit/predict without input validation #21804

Open

thomasjpfan added 4 commits November 27, 2021 09:49

TST Fixes test (Will fail now on CI)

3c1d57f

FIX Fixes issue same issue with MLP

5a22b70

XFAIL the MLP case

8cea8d8

REV Enable the other tests

5d323ac

thomasjpfan commented Nov 27, 2021

sklearn/tests/test_common.py (resolved)

jjerphan approved these changes Nov 29, 2021

doc/whats_new/v1.0.rst (outdated, resolved)

thomasjpfan added 2 commits November 29, 2021 11:24

Merge remote-tracking branch 'upstream/main' into remove_warning_random_forest_classifier

0849124

DOC Fix syntax error in whats_new

fab1268

jjerphan merged commit 071f98f into scikit-learn:main Nov 29, 2021

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

FIX Remove warnings when fitting a dataframe (scikit-learn#21578)

f426c30

Co-authored-by: Olivier Grisel <[email protected]>

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021

FIX Remove warnings when fitting a dataframe (scikit-learn#21578)

d309ec4

Co-authored-by: Olivier Grisel <[email protected]>

glemaitre pushed a commit that referenced this pull request Dec 25, 2021

FIX Remove warnings when fitting a dataframe (#21578)

eee1851

Co-authored-by: Olivier Grisel <[email protected]>

7 participants
