
Conversation

@glemaitre (Contributor) commented Sep 21, 2018

Partially addresses #540

Allow using a pandas DataFrame when creating and publishing a dataset on OpenML.

Important points to be considered:

  • categorical columns are not automatically converted to string and will instead raise an error
  • boolean columns are converted to categorical columns, encoding True and False as strings
  • INTEGER and REAL are used depending on the dtype, to preserve the dtype when loading with liac-arff
  • a dict can be passed to override the dtype inference for some specific columns
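The inference rules above can be sketched roughly as follows. This is a minimal illustration built on `pd.api.types.infer_dtype`; the function name and the exact set of handled cases are assumptions for illustration, not the PR's actual code:

```python
import pandas as pd

def infer_arff_attributes(df):
    """Hypothetical sketch mirroring the rules above; names are illustrative."""
    attributes = []
    for name in df.columns:
        # Drop missing values first so NA entries do not confuse the inference.
        inferred = pd.api.types.infer_dtype(df[name].dropna())
        if inferred == 'categorical':
            categories = df[name].cat.categories
            if pd.api.types.infer_dtype(categories) != 'string':
                # Categorical columns are not silently converted: raise instead.
                raise ValueError("Categories of column '{}' must be strings."
                                 .format(name))
            attributes.append((name, categories.tolist()))
        elif inferred == 'boolean':
            # Booleans become a two-category nominal attribute.
            attributes.append((name, ['False', 'True']))
        elif inferred == 'integer':
            attributes.append((name, 'INTEGER'))
        elif inferred in ('floating', 'mixed-integer-float'):
            attributes.append((name, 'REAL'))
        elif inferred == 'string':
            attributes.append((name, 'STRING'))
        else:
            raise ValueError("The dtype '{}' of column '{}' is not supported."
                             .format(inferred, name))
    return attributes
```

A user-supplied dict of overrides would then be merged on top of this inferred list.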

TODO:

  • Add unit tests for the exceptions

@glemaitre (Contributor, author)

@ArlindKadra @mfeurer @janvanrijn Any suggestions are welcome.

@glemaitre glemaitre changed the title [MRG] EHN: allow to upload DataFrame and infer dtype and column name [WIP] EHN: allow to upload DataFrame and infer dtype and column name Sep 21, 2018
@glemaitre (Contributor, author)

The PEP8 error is somewhat unrelated, since we can't avoid the import at the top of the file with mock.

@codecov-io commented Sep 21, 2018

Codecov Report

Merging #545 into develop will increase coverage by 0.08%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##           develop     #545      +/-   ##
===========================================
+ Coverage    89.82%   89.91%   +0.08%     
===========================================
  Files           32       32              
  Lines         2959     2985      +26     
===========================================
+ Hits          2658     2684      +26     
  Misses         301      301
Impacted Files Coverage Δ
openml/datasets/functions.py 91.79% <100%> (+0.99%) ⬆️
openml/utils.py 92.63% <0%> (-0.23%) ⬇️

Last update 95d12e5...ccf7b82.

@glemaitre glemaitre changed the title [WIP] EHN: allow to upload DataFrame and infer dtype and column name [MRG] EHN: allow to upload DataFrame and infer dtype and column name Sep 21, 2018
@glemaitre glemaitre force-pushed the is/create_dataset_pandas branch from 5135b81 to 2ed1928 Compare September 21, 2018 19:41
@mfeurer (Collaborator) left a comment

Thanks a lot! I only have minor comments and would really like to have the error tested, but I'm optimistic that we can merge this soon.

openml.config.server = 'https://test.openml.org/api/v1/xml'

############################################################################
# Uploading a data set store in a NumPy array
Collaborator:

Should be stored

)

############################################################################
try:
Collaborator:

I actually think these try/excepts are no longer necessary and only add little extra information for the user. Could you please remove them?

# call :func:`create_dataset` by passing the dataframe and fixing the parameter
# ``attributes`` to ``'auto'``.

# force OpenML to infer the attributes from the dataframe
Collaborator:

Is there a reason to define these arguments outside of the function you'll be calling?

Contributor (author):

I found it more didactic to have a comment and an assignment instead of just assigning inside the function. WDYT?

Collaborator:

The current example has comments inside the function so everything is in one place. Not sure which one is better though as the current version has a pretty long function call.

Collaborator:

I discussed this with @ArlindKadra regarding #547, to base this on the length. But that would spread out the creation of the arguments. I'm starting to see why creating them beforehand is a good idea.


def test_create_dataset_pandas(self):
# pandas is only a optional dependency and we need to skip the test if
# it is not installed.
Collaborator:

In the long run we will probably make pandas required (it is already required for the examples and documentation), so I'd be totally okay if you add it as a dependency.


original_data_url=original_data_url,
paper_url=paper_url
)
dataset.publish()
Collaborator:

Could you please download and check that everything was converted and uploaded properly?

Contributor (author):

I assumed that I can do that with something like:

upload_id = dataset.publish()
dataset_uploaded = openml.datasets.get_dataset(upload_id)

However, I receive the following exception:

openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/features/5915 returned code 273: Dataset not processed yet

What is the usual way you do this?

@glemaitre (Contributor, author), Oct 2, 2018:

One way could be something like this:

        upload_id = dataset.publish()
        while True:
            try:
                dataset_uploaded = openml.datasets.get_dataset(upload_id)
                break
            except OpenMLServerException as exc:
                if "code 273" not in str(exc):
                    raise

but I am not sure how long it can take for the dataset to be processed and we could get in some timeout nightmare.
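To avoid that nightmare, the loop could be bounded with a retry budget. A generic sketch follows; the helper and its parameter names are hypothetical, not part of openml-python:

```python
import time

def poll_until_processed(fetch, is_not_ready, max_retries=30, delay=2.0):
    """Bounded retry loop (a sketch; the retry budget is an assumption).

    fetch        -- zero-argument callable returning the dataset, or raising
    is_not_ready -- predicate on the exception: True means "retry later"
    """
    last_exc = None
    for _ in range(max_retries):
        try:
            return fetch()
        except Exception as exc:
            # Re-raise anything that is not the "not processed yet" case.
            if not is_not_ready(exc):
                raise
            last_exc = exc
        time.sleep(delay)
    raise TimeoutError('gave up waiting: {}'.format(last_exc))
```

With OpenML it could then be invoked as `poll_until_processed(lambda: openml.datasets.get_dataset(upload_id), lambda e: 'code 273' in str(e))`.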

@ArlindKadra (Member), Oct 2, 2018:

@glemaitre We already had this problem in #547. We decided to compare based on the ARFF only, and we also implemented two functions which fetch the ARFF and the format based on the id (those are available once it is uploaded). I have to refactor my PR a bit more, but it should be available tomorrow.

Contributor (author):

So do you mean to use this function:

https://github.com/openml/openml-python/pull/547/files#diff-2ffad3a1ae43bbcc0e841e2f1df8967bR692

So I would probably wait for your PR to go in first and then resolve the conflicts.

Member:

yes, @mfeurer are you okay with this?

Collaborator:

This looks good at first glance.

@glemaitre (Contributor, author) commented Sep 24, 2018 via email

"entries which are not string."
.format(column_name))
elif column_dtype.name == 'object':
arff_dtype = 'STRING'
Contributor:

Usually we want this to be category as well, right? kinda tricky ;)
And what about integers that encode categories?

Contributor (author):

And what about integers that encode categories?

If they encode categories, then one should use the category dtype for those columns.

However, this part needs to be modified. We should manage INTEGER and REAL/NUMERIC differently. We also have to think more in depth about missing values. I'll use the expertise of @jorisvandenbossche.

Basically, we are working on uploading the datasets from dirty_cat and we are confronted with those issues.


Yes, I would also say that it is up to the user to use a categorical dtype. Actual string columns are not that rare (eg titanic dataset), and it is quite impossible to guess from a string column if it should be categorical or not.

Of course, there could also be a keyword that the user can pass to indicate which columns to treat as categorical, without requiring them to do the actual conversion themselves.

And what about integers that encode categories?

I think arff does not support integer categories?
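That keyword idea could look roughly like this (a hypothetical helper, not something the PR implements): convert the named columns to the pandas category dtype, stringifying the values so the categories meet the string requirement.

```python
import pandas as pd

def mark_categorical(df, columns):
    """Hypothetical sketch of the keyword idea: convert the listed columns
    to the category dtype, casting values to string first so the resulting
    categories are strings as liac-arff requires."""
    df = df.copy()  # do not mutate the caller's frame
    for name in columns:
        df[name] = df[name].astype(str).astype('category')
    return df
```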

Collaborator:

Conceptually, arff should allow this (as long as the categories are integers, too). However, liac-arff does not support this. Can be reproduced with:

import arff

arff.dumps({
    'relation': 'test',
    'description': 'abc',
    'attributes': [['test', [1, 2, 3]]],
    'data': [[1], [2]],
})

I'm wondering if it's worth changing this. Formally, there's also no difference between REAL, INTEGER and NUMERIC, but I don't see a reason to not be more specific about the data type as the arff specification allows this.

Contributor (author):

I think that there is a difference between REAL and INTEGER, since the dtype will be preserved, won't it?

In [5]: X = arff.dumps({ 
   ...:   'relation': 'test', 
   ...:   'description': 'abc', 
   ...:   'attributes': [['test', 'REAL']], 
   ...:   'data': [[1], [2]] 
   ...: })                                                                                              

In [6]: arff.loads(X)                                                                                   
Out[6]: 
{'description': 'abc',
 'relation': 'test',
 'attributes': [('test', 'REAL')],
 'data': [[1.0], [2.0]]}

In [7]: X = arff.dumps({ 
   ...:   'relation': 'test', 
   ...:   'description': 'abc', 
   ...:   'attributes': [['test', 'INTEGER']], 
   ...:   'data': [[1], [2]] 
   ...: })                                                                                              

In [8]: arff.loads(X)                                                                                   
Out[8]: 
{'description': 'abc',
 'relation': 'test',
 'attributes': [('test', 'INTEGER')],
 'data': [[1], [2]]}

Member:

Going by the official ARFF specification, all numeric attributes should be declared as NUMERIC:
https://www.cs.waikato.ac.nz/ml/weka/arff.html

In order to make OpenML datasets comply with as many ARFF parsers as possible, maybe it's best to enforce this server-side too? In that case the Python uploader should of course also upload the NUMERIC key. @mfeurer @joaquinvanschoren @berndbischl WDYT?

Member:

I have to add that currently this is not enforced server-side.


@janvanrijn you pointed to an outdated version of the ARFF standard (see the first sentence on the page you linked to); the newer one (https://waikato.github.io/weka-wiki/arff_stable/) actually mentions 'integer' and 'real' (although it says they are treated as NUMERIC).

Collaborator:

I think they're more a guideline than a necessity. But it would be great if we could support them as well as possible in pandas. I don't think this would be easy in numpy, though.

@amueller (Contributor) commented Oct 4, 2018

I'm so excited!

# infer the type of data for each column of the DataFrame
attributes_ = [(col_name,
                _pandas_dtype_to_arff_dtype(data, col_name, col_dtype))
               for col_name, col_dtype in data.dtypes.iteritems()]


you don't necessarily need to pass the dtype here, you could also iterate only through the column names

"entries which are not string."
.format(column_name))
elif column_dtype.name == 'object':
arff_dtype = 'STRING'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would also say that it is up to the user to use a categorical dtype. Actual string columns are not that rare (eg titanic dataset), and it is quite impossible to guess from a string column if it should be categorical or not.

Of course, it could also be a keyword that the user can pass to indicate which columns to see as categorical without the need that the (s)he does the actual conversion themselves.

And what about integers that encode categories?

I think arff does not support integer categories?

@glemaitre (Contributor, author)
I have to add many more tests, but basically I changed the way the dtype inference is done using pandas (only available for pandas >= 0.19, I think):
https://github.com/openml/openml-python/pull/545/files#diff-2ffad3a1ae43bbcc0e841e2f1df8967bR357

It should be robust enough to find the right dtype even in the presence of NA values. It should also be extensible to the DATE (datetime) dtype, since pandas also infers those. I see that there is a PR in liac-arff which could be worth pushing forward ;)
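A small illustration of why inferring on the NA-free values helps (this uses `pd.api.types.infer_dtype`; on older pandas versions, inferring on the raw column could report 'mixed' because of the missing entries):

```python
import pandas as pd

# An object column containing missing values: dropping the NA entries
# before inference lets pandas report the type of the real values.
s = pd.Series(['low', 'high', None])
inferred = pd.api.types.infer_dtype(s.dropna())
print(inferred)  # string
```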

@glemaitre (Contributor, author)
The error on the CI is a bit cryptic to me. Does anyone have an idea why it is failing only for some of the builds and not others? I can use this function locally as well.

@glemaitre (Contributor, author)
I think that it could be good to actually re-download the dataset once uploaded, because I don't see any other way to check that converting the dataframe into ARFF went fine.

@glemaitre (Contributor, author)
@jorisvandenbossche If you want to have a look at it

@janvanrijn (Member) commented Oct 7, 2018

The error on the CI is a bit cryptic to me. Anyone has an idea why is it failing only for some of the build and not others. I can use this function locally as well.

I also encountered this error. Opened issue #565 and PR #566. If you merge #566 into this branch it should be fixed.

To add a little bit more context: due to the structure of the specific test, it only manifests itself in certain cases, for example when the test server is wiped empty again (which I did on Friday). This is why it was not picked up earlier and got merged into develop.

# infer the type of data for each column of the DataFrame
attributes_ = attributes_arff_from_df(data)
if isinstance(attributes, dict):
# override the attributes which was specified by the user
Collaborator:

Doesn't this do the exact opposite? It overrides the attributes from the dataframe with the arguments passed by the user, right?

Contributor (author):

Yes. It is the mechanism that allows the user to overwrite some specific attributes. It could be useful to force a specific data type or to specify the categories (e.g. if some categories are missing from the data column).
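The override mechanism can be pictured as a simple merge of the user dict into the inferred list. A sketch with illustrative names, not the PR's actual code:

```python
def override_attributes(inferred, overrides):
    """Merge user-supplied overrides into the inferred attribute list.

    inferred  -- list of (name, arff_type) pairs derived from the DataFrame
    overrides -- dict mapping a column name to the ARFF type (or category
                 list) the user wants to force for that column
    """
    return [(name, overrides.get(name, arff_type))
            for name, arff_type in inferred]
```

For example, `override_attributes(attrs, {'class': ['neg', 'pos']})` would force explicit categories for the `class` column while leaving the others untouched.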

setup.py (outdated)
'nbformat',
'python-dateutil',
'oslo.concurrency',
'pandas',
Collaborator:

In one of your comments you mentioned that this needs to be at least 0.19. Would it make sense to add this piece of information here?


def test_create_dataset_pandas(self):
# pandas is only a optional dependency and we need to skip the test if
# it is not installed.
Collaborator:

This could be deleted now, right?

# raise an error asking to convert all entries to string.
categories = df[column_name].cat.categories
categories_dtype = pd.api.types.infer_dtype(categories)
if categories_dtype != 'string':


You might need to allow 'unicode' here as well for python 2 compat ?

else:
raise ValueError("The dtype '{}' of the column '{}' is not "
"currently supported by liac-arff. Supported "
"dtypes are categorical, string, interger, "


interger -> integer

@glemaitre (Contributor, author)
@mfeurer @janvanrijn @ArlindKadra @amueller

The PR is ready for a full round of review. With the previous work it should be almost ready to be merged (the CI should be green).

@janvanrijn (Member) left a comment

Thanks for making this PR. I find it really awesome. I have not been too involved in this one, so please don't mind my comments too much, but I left some (open) questions. Apart from this, LGTM.

def attributes_arff_from_df(df):
"""Create the attributes as specified by the ARFF format using a dataframe.
Arguments:
Member:

The doc string that you applied seems different from the other docstrings that we use, i.e.,

Parameters
----------

and

Returns
-------

# dropped before the inference instead.
column_dtype = pd.api.types.infer_dtype(df[column_name].dropna())

if column_dtype == 'categorical':
Member:

I missed the discussion in the thread. is this a regular pandas data type? Is there any reason to not use the dtype str?

(the comments below do not seem to help me too much, sorry)

Contributor (author):

This is an available dtype in pandas to indicate that a column is categorical.

Is there any reason to not use the dtype str?

Actually, I don't see why categories should be strings :). Right now liac-arff does not allow otherwise; we could actually think about silently converting the categories to string if a column is of category dtype.
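Such a silent conversion could be as simple as renaming the categories to their string representation. A hypothetical helper, not part of the PR:

```python
import pandas as pd

def stringify_categories(series):
    """Rename the categories of a categorical Series to their string
    representation, so that liac-arff's string-categories requirement
    is met without touching the underlying codes."""
    new_categories = [str(c) for c in series.cat.categories]
    return series.cat.rename_categories(new_categories)
```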

Member:

Cool, thanks for the clarification :)

"Got {} dtype in this column."
.format(column_name, categories_dtype))
attributes_arff.append((column_name, categories.tolist()))
elif column_dtype == 'boolean':
Member:

why not use dtype=bool?

Contributor (author):

Because we are using the pd.api.types.infer_dtype function, which returns a string.

else:
attributes_ = attributes

data = data.values if hasattr(data, "columns") else data
Member:

This seems to imply that internally we store everything as ARFF. Wouldn't it be much cooler to store everything internally as pandas? That would make it much easier to do operations on the datasets (and of course convert before uploading a dataset / performing a run). However, maybe that should be a next PR.

Contributor (author):

I would say that this is another PR.

I would imagine that it would require quite some changes, but if the consensus is to use pandas then why not. Then the data would be tabular only (which is probably the main use case in OpenML).

Contributor:

Sparse data might be an issue, right?

Contributor (author):

Sparse arrays do not have a values attribute, so we should be fine.

If you mean SparseDataFrame, I would say that we should not offer support. They are going to be deprecated in favor of the sparse dtype.

Is liac-arff actually usable with sparse data?

Collaborator:

Yes, it can handle scipy's COO data structure.

Contributor (author):

So this will not be an issue. @amueller, could you elaborate on your thoughts?

@mfeurer (Collaborator) left a comment

This looks really great now, I think there is only one assert statement missing.

"Uploaded ARFF does not match original one"
)

# Check that we can overwrite the attributes
Collaborator:

There's no assert for this, right?

Contributor (author):

On line 797 we checked that the attributes contain 'g', which is not present in the original data. I would consider that an assert. Do you want to test something else?

Collaborator:

Do you mean this?

        self.assertEqual(
            _get_online_dataset_arff(upload_did),
            dataset._dataset,
            "Uploaded ARFF does not match original one"
        )

Either I'm missing something, or there is no strict check that 'g' is in the attributes of the downloaded file? Line 797 only checks that the download is equal to the upload, right?

Contributor (author):

OK, I see; we can add a strict check then.
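A strict check could, for instance, pull the attribute names out of the downloaded ARFF header and assert that 'g' appears among them. A minimal stdlib-only sketch (it ignores quoted attribute names, and it is not the test's actual code):

```python
import re

def attribute_names(arff_text):
    """Return the attribute names declared in an ARFF document's header.

    Limitation: attribute names containing spaces (which ARFF allows when
    quoted) are not handled by this simple pattern.
    """
    return re.findall(r'(?im)^@attribute\s+(\S+)', arff_text)
```

The test could then assert something like `'g' in attribute_names(downloaded_arff)` on the document fetched back from the server.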

Contributor (author):

Actually we could support SparseDataFrame for now. Calling to_coo instead of values should be enough then.

@glemaitre (Contributor, author)
Do you add an entry to document changes in the library, like a what's new or changelog? (I see a progress.rst, maybe.)

@glemaitre glemaitre force-pushed the is/create_dataset_pandas branch from 20dbc23 to ccf7b82 Compare October 22, 2018 16:03
@glemaitre (Contributor, author)
Anything else?

@janvanrijn (Member)
LGTM. Also, I promised Lisheng Sun (PhD student in Paris) to update her once this was merged. I don't have her GitHub handle; does someone have it?

@mfeurer mfeurer merged commit f22c393 into openml:develop Oct 23, 2018
@mfeurer (Collaborator) commented Oct 23, 2018

@janvanrijn @LishengSun

@glemaitre (Contributor, author)
I can do the support for SparseDataFrame in an upcoming PR.
