Skip to content

Conversation

@Neeratyoy
Copy link
Contributor

Reference Issue

Addresses #639.

What does this PR implement/fix? Explain your changes.

Clarifies the type conversions involved with the choice of array_format parameter.

Any other comments?

The docstrings are as per my understanding of the discussion on #548.
Kindly advice on further corrections.

@Neeratyoy Neeratyoy changed the title Adding documentation for array_format Adding documentation for _convert_array_format output parameter Jun 6, 2019
@Neeratyoy Neeratyoy requested a review from PGijsbers June 6, 2019 18:22
Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an explicit list of possible conversions? If I would call the function now trying to convert a sparse dataframe to a dense numpy array, it would return the sparse dataframe without any warning/error.
This might be okay for an internal function (I personally would rather see an error), but it definitely should be documented.

Parameters
----------
array_format : str
Tag to attach to the dataset to get a pandas SparseDataFrame or a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean with "tag to attach to the dataset". From my understanding it would rather read something like "Datatype to convert the data to".
Please also add the list of legal values ('array' and 'dataframe').

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was the existing docstring and I was sceptical in removing it altogether so kind of cut-pasted what was there.
I'll make the changes.

- If array_format='array'
Converts non-sparse numeric data to numpy-array
Enforces numeric encoding of categorical columns
Missing values are represented as NaN in the dataframe
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean ndarrray instead of dataframe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, my bad.

@Neeratyoy
Copy link
Contributor Author

Please add an explicit list of possible conversions?

By explicit list I believe you want to highlight the two if-conditions in the function, i.e.,

  • Converts a non-sparse matrix to numpy array.
  • Converts a sparse matrix to a sparse dataframe.

Am I understanding this right?

If I would call the function now trying to convert a sparse dataframe to a dense numpy array, it would return the sparse dataframe without any warning/error.

You seem to be describing the case when both the if-conditions are evaluated as False.

This might be okay for an internal function (I personally would rather see an error), but it definitely should be documented.

Assuming the above case is correct, I believe converting the second if to an elif and raising a warning under a subsequent else would handle the case you have mentioned.
Should I proceed in this manner?

@PGijsbers
Copy link
Collaborator

Am I understanding this right?

Yes.

Assuming the above case is correct, I believe converting the second if to an elif and raising a warning under a subsequent else would handle the case you have mentioned.
Should I proceed in this manner?

I think so. I'll leave it up to you if calling it with data being a sparse matrix with array_format='array' should also raise an error or just return the data.

@Neeratyoy
Copy link
Contributor Author

Have made the changes as I mentioned. Tried to be explicit by mentioning the same in the documentation. Hope the structure is correct.

elif array_format == "dataframe" and scipy.sparse.issparse(data):
return pd.SparseDataFrame(data, columns=attribute_names)
else:
warn("Conversion criteria not satisfied. Returning input data.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add something that mentions its arguments? Makes it easier to track this down when a user runs into this. E.g.: "Cannot convert <type(data)> to <array_format>. Returning input data."

@PGijsbers
Copy link
Collaborator

Looks good :) Thanks! Just the one (new) minor comment and then it should be good. I do want to figure out why all the tests fail on this PR and the next one before merging though. But I can also look into that.

@Neeratyoy
Copy link
Contributor Author

Neeratyoy commented Jun 11, 2019

I do want to figure out why all the tests fail on this PR and the next one before merging though.

For the openml.config.server as https://www.openml.org/api/v1/xml, len(openml.tasks.list_tasks(tag='OpenML100') is 91.

For the openml.config.server as https://test.openml.org/api/v1/xml, len(openml.tasks.list_tasks(tag='OpenML100') is 99.

Whereas the unit test has a variable num_basic_tasks set to 100 and the test is to see if more than 100 such tasks are returned. Hence the failure.

@PGijsbers, how do you recommend I fix this? Should I just update the variable value from 100 to 90? Or something else needs to be fixed?

After certain fixes, #701 also fails the test cases with the same error.

@PGijsbers
Copy link
Collaborator

Ah. I did not look into the actual error logs yet. Those are known issues that are being worked on. No need to do anything about that.

@PGijsbers
Copy link
Collaborator

Everything looks good. As these changes are not invasive I would like to hold off a merge until we fix the dev branch. I'll look at that sooner rather than later.

@PGijsbers PGijsbers merged commit 3905544 into develop Jun 13, 2019
@PGijsbers PGijsbers deleted the fix_639 branch June 13, 2019 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants