[MRG] EHN: Add support for pandas DataFrame and SparseDataFrame when loading #548
Conversation
@mfeurer For the unit test, would it be fine to actually use the following dataset: https://www.openml.org/d/40945 It contains all possible feature types, which would be handy. But I don't know if this is robust enough (is anybody going to remove it anytime?)
@mfeurer @joaquinvanschoren @janvanrijn |
Codecov Report
@@ Coverage Diff @@
## develop #548 +/- ##
=========================================
+ Coverage 89.79% 90.1% +0.3%
=========================================
Files 32 32
Lines 3313 3366 +53
=========================================
+ Hits 2975 3033 +58
+ Misses 338 333 -5
Continue to review full report at Codecov.
mfeurer
left a comment
Hey, I only briefly skimmed this one. It looks good, and having pandas support will definitely advance the package. There are two things we should consider, though:
- What are the consequences of always returning a dataframe instead of a numpy array? Can all scikit-learn models deal with this?
- Is there a reason why categories aren't added to the dataframe? Right now they're categorical with integer values, but it would actually be great if the values were the right categories.
And yes, I'd be happy with using the titanic dataset as an example.
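As an aside, the second question above can be sketched with pandas' own categorical machinery. The snippet below is only an illustration, not the openml code; the codes and labels are made-up stand-ins for what an ARFF decoder might produce.

```python
import pandas as pd

# Hypothetical integer codes as an ARFF decoder might emit them,
# plus the category labels declared in the ARFF header.
codes = [0, 2, 1, 0]
labels = ["setosa", "versicolor", "virginica"]

# Categorical.from_codes maps the integer codes back to their labels,
# so the column carries the actual category values instead of bare ints.
column = pd.Categorical.from_codes(codes, categories=labels)
df = pd.DataFrame({"species": column})
```

With this, df["species"] has dtype "category" and holds the label strings while still storing the compact integer codes internally.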
@mfeurer @janvanrijn I would need a hand with the testing here. I uploaded a specific dataset to the test server. However, this will go wrong when you are clearing the server. I was wondering if we could have a couple of specific datasets somewhere to be able to test all scenarios.
ping @mfeurer @janvanrijn @joaquinvanschoren We will need to read those datasets since we can upload them :)
My apologies, I thought this was merged already, and missed your previous comment.
Actually, yes, I have a collection of datasets that are always available on the test server (symlinked to datasets that are actually on the main server). Dataset ids 1 through 130 will always be available on the server, even after every reset. If you need any more / other datasets, please let me know.
In addition to Jan's suggestions, as we can merge sparse dataset upload support in a bit, you could upload the dataset first and then check if loading it again works. Otherwise you could check them into the directory
examples/datasets_tutorial.py
Outdated
# Get the actual data.
#
# Returned as numpy array, with meta-info (e.g. target feature, feature names,...)
# The dataset can be returned in 2 possible formats: as a NumPy array or as
It could also be a scipy.sparse.? matrix in case the dataset is sparse.
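For illustration, a minimal sketch of that sparse case, with a hand-built scipy.sparse matrix standing in for a loaded dataset (the variable names are made up):

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for what a sparse dataset load might return:
# a scipy.sparse matrix rather than a dense numpy array.
X = sparse.csr_matrix(np.array([[0.0, 1.0], [0.0, 0.0]]))

# Callers can branch on sparsity before doing dense-only operations.
if sparse.issparse(X):
    X_dense = X.toarray()
else:
    X_dense = np.asarray(X)
```

sparse.issparse is a cheap check that works for all scipy sparse formats (csr, csc, coo, ...), so downstream code does not need to know which one was returned.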
Thanks for updating this, but could you please also update the number in the text?
examples/datasets_tutorial.py
Outdated
print(eeg[:10])
############################################################################
# Instead of creating manually the dataframe, you can already request a
I think 'creating' and 'manually' need to be switched.
I think I'm fine and will approve once you briefly explain what happens/happened to the open TODO and after you have had a look at the unit tests. I will afterwards check whether and how this collides with existing pickle files on the hard drives.
@mfeurer You are probably right regarding the timeout. On the builds that do not fail, the test which commonly fails runs close to the timeout.
Do you have any idea what makes that test take so long? I guess it takes only a few seconds locally?
Thinking about this more generally, it seems that the time required to execute the unit tests roughly doubles with this PR. Do you have any idea why this is the case?
It is very slow locally as well, and my laptop is fairly powerful. I can expect Travis to be slower.
I can think of two things:
Since we are using the dataframe by default and then converting it to numpy, all the current code will be affected by this slowdown. However, I don't see any straightforward solution.
openml/datasets/dataset.py
Outdated
    column = column.apply(lambda x: np.nan if x == -1 else x)
    return column
if data.ndim == 2:
    for column_name in data.columns:
I just briefly profiled the code and it looks like this for loop is the culprit of the slowdown. Replacing it by:
columns = {
    column_name: _encode_if_category(data.loc[:, column_name])
    for column_name in data.columns
}
data = pd.DataFrame(columns)
increases the test speed again (while the unit tests are still passing). Could you please check (and replace if this is the case)?
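To make the suggested change concrete, here is a self-contained sketch. The body of _encode_if_category is an assumption for illustration (it mirrors the -1-to-NaN replacement quoted above) and is not the actual openml helper.

```python
import numpy as np
import pandas as pd

def _encode_if_category(column):
    # Hypothetical helper mirroring the discussion above: for categorical
    # columns, take the integer codes and map the -1 "missing" code to NaN.
    if isinstance(column.dtype, pd.CategoricalDtype):
        column = column.cat.codes.astype(float)
        column = column.replace(-1, np.nan)
    return column

data = pd.DataFrame({
    "color": pd.Categorical(["red", None, "blue"]),
    "value": [1.0, 2.0, 3.0],
})

# Building every column in one dict comprehension and constructing the
# frame once avoids repeated per-column assignment on the original
# DataFrame, which is what made the for loop slow.
columns = {
    column_name: _encode_if_category(data.loc[:, column_name])
    for column_name in data.columns
}
data = pd.DataFrame(columns)
```

With default sorted categories (['blue', 'red']), 'red' encodes to 1.0, the missing value becomes NaN, and non-categorical columns pass through untouched.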
Upsy. I did not notice this one. Thanks for the profiling
Tests look good (except for a minor merge issue). Would you mind fixing the two remaining issues (import error and pep8) and then merge?
@janvanrijn @joaquinvanschoren Do you want to have another look at the PR?
Awesome, it's green 🎉 🎈 Let's give them a few hours, also @amueller (who opened the original PR for Pandas support) and @PGijsbers |
    return decode_arff(fh)

@staticmethod
def _convert_array_format(data, array_format, attribute_names):
Maybe document supported conversions and how they ought to be specified. From reading the code, it looks like there are only two specific conversions allowed:
- non-sparse numeric data to a numpy array (array_format="array")
- sparse data to a sparse dataframe (array_format="dataframe")
No cross combinations are possible (sparse to non-sparse and vice versa).
Additionally it seems the method converts categorical columns by using their numeric representation? It might be good to document that as well.
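As a hedged illustration of the conversion rules described in this remark, here is a sketch of such a dispatcher. It is not the actual openml implementation (whose method also takes attribute_names); the function name and error messages are made up.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def convert_array_format(data, array_format):
    # Illustrative sketch only: dense data can become a numpy array,
    # sparse data can become a frame of sparse columns, and the
    # sparse-to-dense cross combination is rejected.
    if array_format == "array":
        if sparse.issparse(data):
            raise ValueError("sparse data cannot be converted to an array")
        return np.asarray(data)
    if array_format == "dataframe":
        if sparse.issparse(data):
            # Modern pandas spells the old SparseDataFrame via sparse dtypes.
            return pd.DataFrame.sparse.from_spmatrix(data)
        return pd.DataFrame(data)
    raise ValueError("array_format must be 'array' or 'dataframe'")

dense = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
arr = convert_array_format(dense, "array")
sdf = convert_array_format(sparse.csr_matrix(np.eye(2)), "dataframe")
```

Documenting exactly this dispatch table (including the raised errors) in the method's docstring would address the remark above.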
Looks good to me! I left a remark on one particular internal function that I felt was underdocumented. Though if it is merged in its current state, that would be fine with me too.
Thanks for the feedback @PGijsbers, I just merged (to make sure that this is no longer delayed by merge issues) and created an issue (#639) to improve the docs. |
supersedes #363
closes #257
closes #364
closes #408
Add support to load ARFF into pandas DataFrame and pandas SparseDataFrame.
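As a rough sketch of what "load ARFF into a DataFrame" means in practice, the snippet below uses scipy's ARFF reader on a made-up in-memory file; the PR's own decoding path may differ, and the relation/attribute names are invented for illustration.

```python
import io
import pandas as pd
from scipy.io import arff

# A tiny in-memory ARFF file standing in for a real OpenML download.
ARFF_TEXT = """\
@relation toy
@attribute sepal_length numeric
@attribute species {setosa,virginica}
@data
5.1,setosa
6.3,virginica
"""

# scipy's ARFF reader yields a numpy structured array plus metadata;
# pandas turns the structured array into a DataFrame directly.
records, meta = arff.loadarff(io.StringIO(ARFF_TEXT))
df = pd.DataFrame(records)
```

Note that scipy decodes nominal values as bytes, so real loading code still has to decode strings and attach proper categorical dtypes, which is part of what this PR handles.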
TODO:
Check the dtype consistency: