
Conversation

@glemaitre (Contributor) commented Sep 21, 2018

Supersedes #363.
Closes #257.
Closes #364.
Closes #408.

Add support for loading ARFF data into a pandas DataFrame or a pandas SparseDataFrame.

TODO:

  • Add unit tests
  • Add documentation
  • Add example
  • Add unit tests for exceptions
  • reuse the `row_id_attribute` to rename the dataframe index if possible.

Check the dtype consistency:

  • booleans will be encoded as a category with the values 'True' and 'False' and need to be converted back to a boolean dtype (see the sketch after this list)
  • check that integer dtypes are preserved
  • check the behavior with missing values for the different dtypes

@glemaitre (Contributor Author)

@mfeurer For the unit tests, would it be fine to use the following dataset:

https://www.openml.org/d/40945

It contains all the possible types, which would be handy. But I don't know whether it is robust enough (is anybody going to remove it at some point?)

@glemaitre (Contributor Author)

@mfeurer @joaquinvanschoren @janvanrijn
Feel free to comment already, even though this is still WIP.

@glemaitre glemaitre changed the title [WIP] EHN: Add support for pandas DataFrame and SparseDataFrame when loading [MRG] EHN: Add support for pandas DataFrame and SparseDataFrame when loading Sep 21, 2018
@codecov-io commented Sep 22, 2018

Codecov Report

Merging #548 into develop will increase coverage by 0.3%.
The diff coverage is 93.81%.


@@            Coverage Diff            @@
##           develop    #548     +/-   ##
=========================================
+ Coverage    89.79%   90.1%   +0.3%     
=========================================
  Files           32      32             
  Lines         3313    3366     +53     
=========================================
+ Hits          2975    3033     +58     
+ Misses         338     333      -5
Impacted Files                        Coverage Δ
openml/tasks/task.py                  95.87% <100%> (ø) ⬆️
openml/datasets/dataset.py            82.4% <93.75%> (+4.7%) ⬆️
openml/flows/sklearn_converter.py     90.65% <0%> (+0.14%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0235c51...5971762.

@mfeurer (Collaborator) left a comment

Hey, I only briefly skimmed this one. It looks good, and having pandas support will definitely advance the package. There are two things we should consider, though:

  1. What are the consequences of always returning a dataframe instead of a numpy array? Can all scikit-learn models deal with this?
  2. Is there a reason why the categories aren't added to the dataframe? Right now the columns are categorical with integer values, but it would be great if the values were the actual category labels (see the sketch below).

And yes, I'd be happy with using the titanic dataset as an example.

@glemaitre (Contributor Author)

@mfeurer @janvanrijn I would need a hand with the testing here.
I want to be able to test the dtype inference in the presence of missing values.

I uploaded a specific dataset to the test server. However, this will break whenever the server is cleared. I was wondering if we could have a couple of specific datasets somewhere to be able to test all scenarios.

@glemaitre (Contributor Author)

ping @mfeurer @janvanrijn @joaquinvanschoren We will need to be able to read these data since we can upload them :)

@janvanrijn (Member)

My apologies, I thought this was merged already, and I missed your previous comment.

I uploaded a specific dataset to the test server. However, this will break whenever the server is cleared. I was wondering if we could have a couple of specific datasets somewhere to be able to test all scenarios.

Actually, yes, I have a collection of datasets that are always available on the test server (symlinked to datasets that are on the main server). Dataset ids 1 through 130 will always be available on the server, even after every reset. If you need any more or other datasets, please let me know.

@mfeurer (Collaborator) commented Nov 13, 2018

In addition to Jan's suggestions: since we can merge sparse dataset upload support in a bit, you could upload the dataset first and then check whether loading it again works. Otherwise you could check the files into the directory test/files and test the load function without invoking server calls, or by mocking them out.

# Get the actual data.
#
# Returned as numpy array, with meta-info (e.g. target feature, feature names,...)
# The dataset can be returned in 2 possible formats: as a NumPy array or as
Review comment (Collaborator):

It could also be a scipy.sparse.? matrix in case the dataset is sparse.

Review comment (Collaborator):

Thanks for updating this, but could you please also update the number in the text?

print(eeg[:10])

############################################################################
# Instead of creating manually the dataframe, you can already request a
Review comment (Collaborator):

I think 'creating' and 'manually' need to be switched.

@mfeurer (Collaborator) commented Nov 16, 2018

This actually implements #257, #364, and #408.

@mfeurer (Collaborator) commented Mar 5, 2019

I think I'm fine and will approve once you briefly explain what happened to the open TODOs and after you have had a look at the unit tests. Afterwards, I will check whether and how this collides with existing pickle files on the hard drive.

@glemaitre (Contributor Author) commented Mar 5, 2019

@mfeurer You are probably right regarding the timeout. On the build that does not fail, the test which commonly fails is close to the timeout:

519.58s call     tests/test_runs/test_run_functions.py::TestRun::test_predict_proba_hardclassifier

@mfeurer (Collaborator) commented Mar 6, 2019

Do you have any idea what makes that test take so long? I guess it takes only a few seconds locally?

@mfeurer (Collaborator) commented Mar 6, 2019

Thinking about this more generally, it seems that the time required to execute the unit tests roughly doubles with this PR. Do you have any idea why this is the case?

@glemaitre (Contributor Author)

Do you have any idea what makes that test take so long? I guess it takes only a few seconds locally?

It is very slow locally as well, even on my laptop, which is fairly powerful. I can expect Travis to be slower:

321.45s call     tests/test_runs/test_run_functions.py::TestRun::test_predict_proba_hardclassifier

@glemaitre (Contributor Author)

Thinking about this more generally, it seems that the time required to execute the unit tests roughly doubles with this PR. Do you have any idea why this is the case?

I can think of two things:

  • The pickling could be slower with a dataframe: we keep the dtypes, so we then have to pickle several numpy arrays.
  • When creating the dataframe, we have some check/replace logic that goes through the categorical data, which slows things down.

Since we use the dataframe by default and then convert it to numpy, all the current code will be affected by this slowdown. However, I don't see any straightforward solution.

        column = column.apply(lambda x: np.nan if x == -1 else x)
        return column
    if data.ndim == 2:
        for column_name in data.columns:
@mfeurer (Collaborator) commented Mar 6, 2019

I just briefly profiled the code and it looks like this for loop is the culprit of the slowdown. Replacing it with:

# Build all the columns at once instead of mutating the dataframe in a loop.
columns = {
    column_name: _encode_if_category(data.loc[:, column_name])
    for column_name in data.columns
}
data = pd.DataFrame(columns)

increases the test speed again (while the unit tests still pass). Could you please check (and replace it if this is the case)?

Reply from @glemaitre (Contributor Author):

Oops, I did not notice this one. Thanks for the profiling!

@mfeurer (Collaborator) commented Mar 6, 2019

Tests look good (except for a minor merge issue). Would you mind fixing the two remaining issues (import error and PEP8) and then merging?

@glemaitre (Contributor Author)

@janvanrijn @joaquinvanschoren Do you want to have another look at the PR?

@mfeurer (Collaborator) commented Mar 7, 2019

Awesome, it's green 🎉 🎈

Let's give the others a few hours to comment, in particular @amueller (who opened the original PR for pandas support) and @PGijsbers.

        return decode_arff(fh)

    @staticmethod
    def _convert_array_format(data, array_format, attribute_names):
Review comment (Collaborator):

Maybe document the supported conversions and how they ought to be specified. From reading the code, it looks like only two specific conversions are allowed:

  • non-sparse numeric data to a numpy array (array_format="array")
  • sparse data to a sparse dataframe (array_format="dataframe")

No cross combinations are possible (sparse to non-sparse and vice versa). Additionally, it seems the method converts categorical columns to their numeric representation? It might be good to document that as well (see the docstring sketch below).

@PGijsbers (Collaborator)

Looks good to me! I left a remark on one particular internal function that I felt was under-documented. Though if it is merged in its current state, that would be fine with me too.

@mfeurer mfeurer merged commit 94102f3 into openml:develop Mar 7, 2019
@mfeurer (Collaborator) commented Mar 7, 2019

Thanks for the feedback @PGijsbers. I just merged (to make sure that this is no longer delayed by merge issues) and created an issue (#639) to improve the docs.



Development

Successfully merging this pull request may close these issues.

  • Getting non-encoded dataset
  • Return tables as dataframes
  • String feature support
