Skip to content

Conversation

@glemaitre
Copy link
Member

@glemaitre glemaitre commented Feb 1, 2022

This is a fix such that we downcast to int or float values of nominal features instead of storing them always as strings.
It would be consistent with the behaviour of the future pandas parser implemented in #21938

Indeed it is related to the following quirk: https://github.com/scikit-learn/scikit-learn/pull/21938/files#diff-c14ad6f3f0f87029a67f4cca75114754191e3d2db225bb895b30e4d48b662cf0R825-R838

@glemaitre glemaitre changed the title FIX downcast nominal features whenver possible when reading ARFF FIX downcast nominal features whenever possible in LIAC-ARFF Feb 1, 2022
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether or not this is a good idea:

  • a) I suspect this can cause a large performance regression datasets with categorical columns with string values
  • b) the result would be very weird if a column has some values that can be converted to int and other to float and other that would stay as str: in this case I believe pandas would use a single type for all values of the same column.

To address a) and b) we could first parse the metadata to know which categorical column can be consistently converted to int values and only attempt to cover those.

But that would not solve the issue of type inference for numerical columns with only int values. I don't think it worse investing time in trying to fix LIAC-ARFF.

I think we can move forward with #21938 without changing the behavior of liac-arff as long as we properly document that the 2 parsers do not necessarily yield the same types for int/float values and categories.

try:
return float(value)
except ValueError:
return value
Copy link
Member

@ogrisel ogrisel Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to pursue this PR anyway, please write a dedicate unittest for this helper function in isolation.

# values because of the empty string case :(.)
return [None if s in ('?', '') else s
return [None if s in ('?', '') else _downcast(s)
for s in next(csv.reader([s]))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so confusing to reuse s as the comprehension variable name as it's already a local variable name of the function!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and do not look around too much :P

Co-authored-by: Olivier Grisel <[email protected]>
@ogrisel
Copy link
Member

ogrisel commented Feb 11, 2022

@glemaitre agreed we can close IRL.

@ogrisel ogrisel closed this Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants