[MRG+2] warn_on_dtype for DataFrames#10949
Conversation
rth
left a comment
There was a problem hiding this comment.
Thanks for your PR! Overall this feature makes sense, I think. I haven't reviewed the code yet.
Please change the title to something meaningful (i.e. describe what this PR does). Also please add some unit tests that are skipped if pandas is not found (cf this example)
|
Will do, thanks! Should I also make some tests with |
|
you might be better off doing it with a real DataFrame, using importorskip
to avoid the test where pandas is unavailable
|
|
Done, I also changed the title of the PR to MRG, since all tests pass |
sklearn/utils/validation.py
Outdated
| # check if the object contains several dtypes (typically a pandas | ||
| # DataFrame), and store them. If not, store None. | ||
| dtypes_orig = None | ||
| if hasattr(array, "dtypes"): |
There was a problem hiding this comment.
Maybe we should be a bit more precise on the duck typing, to avoid to capture other objects that happen by chance to have a dtype attribute. For instance, we could test for "array" which, I believe is the method used by numpy to convert to a numpy array.
There was a problem hiding this comment.
Note that the attribute here is dtypes, not dtype.
There was a problem hiding this comment.
Thanks, will do
sklearn/utils/validation.py
Outdated
| # some data must have been converted | ||
| msg = ("Data with input dtype %s were all converted to %s%s." | ||
| % (', '.join(map(str, set(dtypes_orig))), array.dtype, context)) | ||
| warnings.warn(msg, DataConversionWarning) |
There was a problem hiding this comment.
Maybe add "stacklevel=2" here, or should we have 3?
There was a problem hiding this comment.
Thanks, will do
|
+1 on merge once the small comments have been addressed. |
|
Note that the attribute here is dtypes, not dtype.
Yes indeed. I was still thinking that we could do a stronger duck-typing.
|
|
Thanks for the review @GaelVaroquaux , I ll update the code according to your comments and resolve conflicts |
sklearn/utils/validation.py
Outdated
| # (for instance in a DataFrame that can contain several dtypes) then | ||
| # some data must have been converted | ||
| msg = ("Data with input dtype %s were all converted to %s%s." | ||
| % (', '.join(map(str, set(dtypes_orig))), array.dtype, context)) |
There was a problem hiding this comment.
this does not have deterministic order.... which is probably fine. I would have preferred '%s' % sorted(dtypes_orig) but this will be alright.
sklearn/utils/validation.py
Outdated
| # some data must have been converted | ||
| msg = ("Data with input dtype %s were all converted to %s%s." | ||
| % (', '.join(map(str, set(dtypes_orig))), array.dtype, context)) | ||
| warnings.warn(msg, DataConversionWarning, stacklevel=2) |
There was a problem hiding this comment.
The other DataConversionWarning does not use stacklevel. Generally, I find stacklevel unhelpful. (The best way to get the stack info correct is to run again with an 'error' filter.)
There was a problem hiding this comment.
I personally find correct stacklevels very useful (typically when not at the interactive interpreter), however, since this is in check_array which is then still used in other sklearn methods (and not directly by the user), I agree this is less useful (or at least should be more than 2)
|
Do we think this needs a what's new entry? |
sklearn/utils/validation.py
Outdated
| array = np.array(array, dtype=dtype, order=order) | ||
|
|
||
| if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \ | ||
| set(dtypes_orig): |
There was a problem hiding this comment.
@jnothman does sklearn have some style rules on avoiding \ in new code for line continuation?
There was a problem hiding this comment.
We usually avoid it, and here it can be easily avoided by bracketing ...
sklearn/utils/validation.py
Outdated
| # some data must have been converted | ||
| msg = ("Data with input dtype %s were all converted to %s%s." | ||
| % (', '.join(map(str, set(dtypes_orig))), array.dtype, context)) | ||
| warnings.warn(msg, DataConversionWarning, stacklevel=2) |
There was a problem hiding this comment.
I personally find correct stacklevels very useful (typically when not at the interactive interpreter), however, since this is in check_array which is then still used in other sklearn methods (and not directly by the user), I agree this is less useful (or at least should be more than 2)
- sort dtypes to make output deterministic - put parenthesis instead of backslash for continuation line - put stacklevel=3 for more targeted output
|
Thanks for the review @jnothman and @jorisvandenbossche |
sklearn/utils/validation.py
Outdated
|
|
||
| if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \ | ||
| set(dtypes_orig): | ||
| if (warn_on_dtype and dtypes_orig is not None and {array.dtype} != |
There was a problem hiding this comment.
I would much prefer the line break next to a Boolean operators than a comparison, given order of precedence!
There was a problem hiding this comment.
You're right
Done
|
Thanks!!
|
|
@wdevazelhes In #12104, @ogrisel is removing the warning in case the original data was object dtype data. In your original issue report, you used an object array as example: did you have a specific use case for this (specifically for object dtype)? Or was this rather as a small example to reproduce the issue and was the actual case where you encountered this not about numeric object being casted to object? |
|
Another comment, now looking at the changes in this PR. I think a case where you have all numeric data in a DataFrame, but which is consisting of mixed int/float columns, is quite common. Now that is processed fine by sklearn transformers as this gets simply converted to a floating array. But with the change in this PR, that will start to raise a warning in eg the StandardScaler. Do we think it is actually worth to warn in such a case? |
I think this was just a small example to reproduce the issue, I could probably have put something else instead of object
In fact if I remember correctly, at that time I was trying to understand when |
|
FYI seems that this introduces a bug #12622, suggestions are welcomed since we're going to include the fix in 0.20.1. |
|
fix in #12625 |
Reference Issues/PRs
Fixes #10948
What does this implement/fix? Explain your changes.
Pandas DataFrames can have different dtypes for each Series they contain. This PR fixes #10948 while displaying a message specific to this case of
DataFrameswith several dtypes. However this code just adds some "ifs" and the behaviour is maybe not what we would want. But if it is OK I will also add some tests.The following examples shows the new behaviour of this PR:
Example 1 (the one in the issue):
Returns:
Example 2: some dtypes are different than the one asked
Returns:
Example 3: all dtypes are different than the one asked
Returns: