
Conversation

@glemaitre (Contributor) commented Sep 21, 2018

Supersedes #363.
Closes #257.
Closes #364.
Closes #408.

Add support for loading ARFF data into a pandas DataFrame or a pandas SparseDataFrame.

TODO:

  • Add unit tests
  • Add documentation
  • Add example
  • Add unit tests for exceptions
  • reuse the `row_id_attribute` to rename the dataframe index if possible.

Check the dtype consistency:

  • booleans will be encoded as a category with the values 'True' and 'False' and need to be converted back to a boolean dtype (see the sketch after this list)
  • check that integer dtypes are preserved
  • check the behavior with missing values for the different dtypes

@glemaitre (Contributor Author)

@mfeurer For the unit tests, would it be fine to use the following dataset:

https://www.openml.org/d/40945

It contains all the possible types, which would be handy. But I don't know whether it is robust enough (is anybody going to remove it at some point?)

@glemaitre (Contributor Author)

@mfeurer @joaquinvanschoren @janvanrijn
Feel free to comment already, even though this is still WIP.

@glemaitre glemaitre changed the title [WIP] EHN: Add support for pandas DataFrame and SparseDataFrame when loading [MRG] EHN: Add support for pandas DataFrame and SparseDataFrame when loading Sep 21, 2018
@codecov-io commented Sep 22, 2018

Codecov Report

Merging #548 into develop will increase coverage by 0.3%.
The diff coverage is 93.81%.


@@            Coverage Diff            @@
##           develop    #548     +/-   ##
=========================================
+ Coverage    89.79%   90.1%   +0.3%     
=========================================
  Files           32      32             
  Lines         3313    3366     +53     
=========================================
+ Hits          2975    3033     +58     
+ Misses         338     333      -5
Impacted Files                        Coverage Δ
openml/tasks/task.py                  95.87% <100%> (ø) ⬆️
openml/datasets/dataset.py            82.4% <93.75%> (+4.7%) ⬆️
openml/flows/sklearn_converter.py     90.65% <0%> (+0.14%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0235c51...5971762.

@mfeurer (Collaborator) left a comment

Hey, I only briefly skimmed this one. It looks good, and having pandas support will definitely advance the package. There are two things we should consider, though:

  1. What are the consequences of always returning a dataframe instead of a numpy array? Can all scikit-learn models deal with this?
  2. Is there a reason why the categories aren't added to the dataframe? Right now the columns are categorical with integer values, but it would be great if the values were the actual category labels (see the sketch below).

And yes, I'd be happy with using the titanic dataset as an example.

@glemaitre (Contributor Author)

@mfeurer @janvanrijn I would need a hand with the testing here.
I want to be able to test the dtype inference in the presence of missing values.

I uploaded a specific dataset to the test server. However, this will break whenever the server is cleared. I was wondering if we could have a couple of specific datasets somewhere to be able to test all scenarios.

@glemaitre (Contributor Author)

ping @mfeurer @janvanrijn @joaquinvanschoren We will need to be able to read these data since we can upload them :)

@janvanrijn (Member)

My apologies, I thought this was merged already, and I missed your previous comment.

I uploaded a specific dataset to the test server. However, this will break whenever the server is cleared. I was wondering if we could have a couple of specific datasets somewhere to be able to test all scenarios.

Actually, yes, I have a collection of datasets that are always available on the test server (symlinked to datasets that are on the main server). Dataset ids 1 through 130 will always be available on the server, even after every reset. If you need any more or other datasets, please let me know.

@mfeurer (Collaborator) commented Nov 13, 2018

In addition to Jan's suggestions: since we can merge sparse dataset upload support in a bit, you could upload the dataset first and then check whether loading it again works. Otherwise you could check the files into the directory test/files and test the load function without invoking server calls, or by mocking them out.

# Get the actual data.
#
# Returned as numpy array, with meta-info (e.g. target feature, feature names,...)
# The dataset can be returned in 2 possible formats: as a NumPy array or as
Review comment (Collaborator):

It could also be a scipy.sparse.? matrix in case the dataset is sparse.

Review comment (Collaborator):

Thanks for updating this, but could you please also update the number in the text?

print(eeg[:10])

############################################################################
# Instead of creating manually the dataframe, you can already request a
Review comment (Collaborator):

I think 'creating' and 'manually' need to be switched.

@mfeurer (Collaborator) commented Nov 16, 2018

This actually implements #257, #364, and #408.

@mfeurer (Collaborator) commented Mar 5, 2019

I think I'm fine and will approve once you briefly explain what happened to the open TODOs and after you have had a look at the unit tests. Afterwards, I will check whether and how this collides with existing pickle files on the hard drive.

@glemaitre (Contributor Author) commented Mar 5, 2019

@mfeurer You are probably right regarding the timeout. On the build that does not fail, the test which commonly fails is close to the timeout:

519.58s call     tests/test_runs/test_run_functions.py::TestRun::test_predict_proba_hardclassifier

@mfeurer (Collaborator) commented Mar 6, 2019

Do you have any idea what makes that test take so long? I guess it takes only a few seconds locally?

@mfeurer (Collaborator) commented Mar 6, 2019

Thinking about this more generally, it seems that the time required to execute the unit tests roughly doubles with this PR. Do you have any idea why this is the case?

@glemaitre (Contributor Author)

Do you have any idea what makes that test take so long? I guess it takes only a few seconds locally?

It is very slow locally as well, even on my laptop, which is fairly powerful. I can expect Travis to be slower:

321.45s call     tests/test_runs/test_run_functions.py::TestRun::test_predict_proba_hardclassifier

@glemaitre (Contributor Author)

Thinking about this more generally, it seems that the time required to execute the unit tests roughly doubles with this PR. Do you have any idea why this is the case?

I can think of two things:

  • The pickling could be slower with a dataframe: we keep the dtypes, so we then have to pickle several numpy arrays.
  • When creating the dataframe, we have some check/replace logic that goes through the categorical data, which slows things down.

Since we use the dataframe by default and then convert it to numpy, all the current code will be affected by this slowdown. However, I don't see any straightforward solution.

        column = column.apply(lambda x: np.nan if x == -1 else x)
        return column
    if data.ndim == 2:
        for column_name in data.columns:
@mfeurer (Collaborator) commented Mar 6, 2019

I just briefly profiled the code and it looks like this for loop is the culprit of the slowdown. Replacing it with:

# Build all the columns at once instead of mutating the dataframe in a loop.
columns = {
    column_name: _encode_if_category(data.loc[:, column_name])
    for column_name in data.columns
}
data = pd.DataFrame(columns)

increases the test speed again (while the unit tests still pass). Could you please check (and replace it if this is the case)?

Reply from @glemaitre (Contributor Author):

Oops, I did not notice this one. Thanks for the profiling!

@mfeurer (Collaborator) commented Mar 6, 2019

Tests look good (except for a minor merge issue). Would you mind fixing the two remaining issues (import error and PEP8) and then merging?

@glemaitre (Contributor Author)

@janvanrijn @joaquinvanschoren Do you want to have another look at the PR?

@mfeurer (Collaborator) commented Mar 7, 2019

Awesome, it's green 🎉 🎈

Let's give the others a few hours to comment, in particular @amueller (who opened the original PR for pandas support) and @PGijsbers.

        return decode_arff(fh)

    @staticmethod
    def _convert_array_format(data, array_format, attribute_names):
Review comment (Collaborator):

Maybe document the supported conversions and how they ought to be specified. From reading the code, it looks like only two specific conversions are allowed:

  • non-sparse numeric data to a numpy array (array_format="array")
  • sparse data to a sparse dataframe (array_format="dataframe")

No cross combinations are possible (sparse to non-sparse and vice versa). Additionally, it seems the method converts categorical columns to their numeric representation? It might be good to document that as well (see the docstring sketch below).

@PGijsbers (Collaborator)

Looks good to me! I left a remark on one particular internal function that I felt was under-documented. Though if it is merged in its current state, that would be fine with me too.

@mfeurer mfeurer merged commit 94102f3 into openml:develop Mar 7, 2019
@mfeurer (Collaborator) commented Mar 7, 2019

Thanks for the feedback @PGijsbers. I just merged (to make sure that this is no longer delayed by merge issues) and created an issue (#639) to improve the docs.



Development

Successfully merging this pull request may close these issues.

  • Getting non-encoded dataset
  • Return tables as dataframes
  • String feature support
