[MRG] EHN: inferred row_id_attribute from dataframe to create a dataset #586

glemaitre · 2018-10-31T12:50:29Z

This allows the name of the index column to be inferred automatically.

However, this breaks backward compatibility, since the parameter is now a keyword argument. The same was already done for format.

glemaitre · 2018-11-07T17:52:28Z

ping @mfeurer @janvanrijn @joaquinvanschoren Any feedback?

janvanrijn · 2018-11-07T18:04:21Z

tests/test_datasets/test_dataset_functions.py

+            original_data_url=original_data_url,
+            paper_url=paper_url
+        )
+        self.assertEqual(dataset.row_id_attribute, 'index_column')


should we also test that the "publish" function correctly uploads the index column?

That would indeed be great.

janvanrijn · 2018-11-07T18:05:33Z

openml/datasets/functions.py

    else:
        attributes_ = attributes

+    if row_id_attribute is None and hasattr(data, "index"):


what happens with the dataframe index if row_id_attribute is set to a value other than the index? Will it be ignored at upload time? Can this be added to comments and somehow tested?

Could you quickly show a code snippet (even pseudo-code) to illustrate what you mean? I am not sure I see the use case.

I think @janvanrijn wanted to only point out a potential issue:

    In [17]: dataset = openml.datasets.functions.create_dataset(
        ...:     name=name,
        ...:     description=description,
        ...:     creator=creator,
        ...:     contributor=None,
        ...:     collection_date=collection_date,
        ...:     language=language,
        ...:     licence=licence,
        ...:     default_target_attribute=default_target_attribute,
        ...:     ignore_attribute=None,
        ...:     citation=citation,
        ...:     attributes='auto',
        ...:     data=df,
        ...:     row_id_attribute='rnd_str',
        ...:     format=None,
        ...:     version_label='test',
        ...:     original_data_url=original_data_url,
        ...:     paper_url=paper_url,
        ...: )

    In [18]: dataset.row_id_attribute
    Out[18]: 'rnd_str'

    In [19]: df.index.name
    Out[19]: 'index'

OK so the PR is wrong. The behaviour needs to be better defined:

If data is an array and row_id_attribute is None, then we leave row_id_attribute as None.

If data is an array and row_id_attribute is not None, we need to check that it is actually one of the attributes.

If data is a dataframe and row_id_attribute is None, we need to infer it. If df.index.name is None, then we are back to the first bullet. Otherwise, we take the index name as row_id_attribute and we need to call df.reset_index() so that the index data is included in the array when converting to numpy with df.values.

we need to infer it

I'd rather say "we can infer it". There is no necessity to have an ID column. But I think these three bullet points describe the behavior I would expect very well.
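The three cases above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `resolve_row_id_attribute` and its `attribute_names` parameter are hypothetical names, and the sketch assumes that an explicitly passed `row_id_attribute` is only validated and never overridden by the dataframe index.

```python
import numpy as np
import pandas as pd


def resolve_row_id_attribute(data, attribute_names=None, row_id_attribute=None):
    """Hypothetical helper illustrating the three cases discussed above.

    Returns the (possibly modified) data and the resolved row_id_attribute.
    """
    if isinstance(data, pd.DataFrame):
        index_name = data.index.name
        # Case 3: nothing was passed (or the index name itself was passed),
        # and the index is named, so use the index name. An unnamed index
        # leaves row_id_attribute as None, as in case 1.
        if index_name is not None and row_id_attribute in (None, index_name):
            row_id_attribute = index_name
            # df.values skips the index, so turn it into a regular column.
            data = data.reset_index()
        attribute_names = list(data.columns)

    # Case 2: an explicit row_id_attribute must be a known attribute.
    if row_id_attribute is not None and row_id_attribute not in (attribute_names or []):
        raise ValueError(
            "row_id_attribute %r is not a known attribute" % (row_id_attribute,)
        )

    # Case 1: array input without row_id_attribute simply stays None.
    return data, row_id_attribute


df = pd.DataFrame(
    {"rnd_str": ["a", "b"]},
    index=pd.Index([10, 11], name="index_column"),
)
new_df, row_id = resolve_row_id_attribute(df)
print(row_id, list(new_df.columns), new_df.values.shape)
```

Whether a named index should also be promoted to a column when a different `row_id_attribute` is passed explicitly is left open here; the sketch ignores the index in that situation.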

janvanrijn · 2018-11-07T18:08:16Z

openml/datasets/functions.py

+        specified, it will be inferred.
+        .. versionadded: 0.8
+           Inference of ``row_id_attribute`` from a dataframe.
+    format : str, optional


Personally, I wouldn't mind just removing format rather than deprecating it. It seems that the function signature is changed anyway, making the function not backwards compatible (bad, but acceptable and nothing really that can be done about this)

OK, but I would do that in a separate PR.

janvanrijn · 2018-11-07T18:09:05Z

Thanks for this PR! I left some questions. @mfeurer is currently out of office, hopefully he can also have a look at it, but I think this can be merged quite easily :)

mfeurer · 2018-11-09T15:00:29Z

openml/datasets/functions.py

+    row_id_attribute : str, optional
+        The attribute that represents the row-id column, if present in the
+        dataset. If ``data`` is a dataframe and ``row_id_attribute`` is not
+        specified, it will be inferred.


It would be great if you could be a bit more specific on how it will be inferred. Honestly I'm a bit confused right now. Doesn't a dataframe always have an index?

    In [1]: import pandas as pd

    In [2]: data = [['a'], ['b'], ['c'], ['d'], ['e']]

    In [3]: column_names = ['rnd_str']

    In [4]: df = pd.DataFrame(data, columns=column_names)

    In [5]: hasattr(data, "index")
    Out[5]: True

Unless the user specifies an index column it should not be uploaded. And I assume that the pandas index column is not uploaded right now.

yes, it always has an index and the index.name is either a string or None. By default, it will be None.

Just to be sure, row_id_attribute corresponds to the name of the index column, doesn't it?

Just to be sure, row_id_attribute corresponds to the name of the index column, doesn't it?

Yes, but only if there is one. For most datasets there is none.

yes, it always has an index and the index.name is either a string or None. By default, it will be None.

Shouldn't the code then be hasattr(data, "index") and hasattr(data.index, "name")?

Shouldn't the code then be hasattr(data, "index") and hasattr(data.index, "name")?

No, the index always has a name attribute. It is None by default, which is also the value required when row_id_attribute should not be specified.

To illustrate what I mean.

    In [2]: df = pd.DataFrame([1, 2])

    In [3]: df
    Out[3]:
       0
    0  1
    1  2

    In [4]: df.index.name

    In [5]: print(df.index.name)
    None

Okay, I got it. Do you think it would make sense to add a brief description about this behavior to the docstring?

Python lists also have an index. So it would break for this case.
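To make the pitfall concrete: `list.index` is a method, so `hasattr(data, "index")` is true for plain lists as well. One possible way around it, shown here only as a sketch, is to check the type explicitly. The helper name `infer_row_id_attribute` is hypothetical and not part of this PR.

```python
import pandas as pd


def infer_row_id_attribute(data):
    """Hypothetical helper: return the index name for dataframes, else None."""
    if isinstance(data, pd.DataFrame):
        # index.name is None unless the user named the index explicitly.
        return data.index.name
    # Lists, arrays, etc. carry no named index.
    return None


rows = [["a"], ["b"], ["c"]]
df = pd.DataFrame(rows, columns=["rnd_str"])

# hasattr() cannot tell a list from a dataframe here: both have "index".
print(hasattr(rows, "index"), hasattr(df, "index"))

print(infer_row_id_attribute(rows))  # list input: nothing to infer
print(infer_row_id_attribute(df))    # unnamed index: nothing to infer

df.index.name = "index_column"
print(infer_row_id_attribute(df))    # named index: its name is returned
```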

mfeurer · 2018-11-09T15:01:18Z

tests/test_datasets/test_dataset_functions.py

+            original_data_url=original_data_url,
+            paper_url=paper_url
+        )
+        self.assertEqual(dataset.row_id_attribute, 'index_column')


That would indeed be great.

mfeurer · 2018-11-12T10:09:42Z

Looks good to me, but it would be great to have the additional unit test and maybe a warning if the row ID is given twice (in the form of an index and as an attribute to the create dataset function).
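A warning of that kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the implementation in this PR: `pick_row_id_attribute` is a hypothetical helper, and it assumes that the explicit argument wins over the index name, which matches the session shown earlier where `dataset.row_id_attribute` was `'rnd_str'` while `df.index.name` was `'index'`.

```python
import warnings

import pandas as pd


def pick_row_id_attribute(df, row_id_attribute=None):
    """Hypothetical helper: choose the row ID and warn if it is given twice."""
    index_name = df.index.name
    if row_id_attribute is None:
        # Only the index (if named) can provide the row ID.
        return index_name
    if index_name is not None and index_name != row_id_attribute:
        # Both a named index and an explicit argument were provided.
        warnings.warn(
            "row_id_attribute=%r was given, but the dataframe index is named "
            "%r; the explicit argument is used." % (row_id_attribute, index_name),
            UserWarning,
            stacklevel=2,
        )
    return row_id_attribute


df = pd.DataFrame({"rnd_str": ["a", "b"]})
df.index.name = "index"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = pick_row_id_attribute(df, "rnd_str")
print(chosen, len(caught))
```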

glemaitre · 2018-11-12T15:49:43Z

Looks good to me, but it would be great to have the additional unit test and maybe a warning if the row ID is given twice (in the form of an index and as an attribute to the create dataset function).

I am a bit lost then. How do you actually store the index column in ARFF if this is not listed in the attribute?

glemaitre · 2018-11-12T16:16:12Z

@janvanrijn

I am getting the following error both locally and on Travis:

>           parser.Parse(xml_input, True)
E           xml.parsers.expat.ExpatError: junk after document element: line 71, column 6

Is it linked to the test server? It makes it hard to write tests locally.

mfeurer · 2018-11-13T08:25:13Z

I am a bit lost then. How do you actually store the index column in ARFF if this is not listed in the attribute?

There is not necessarily an ID column. Most datasets, for example Iris, do not have an ID column, and therefore no ID column is uploaded to OpenML. However, the MUSK dataset, for example, has an index column in the original data, called configuration_name, which is marked as an index.

glemaitre · 2018-11-14T15:05:45Z

The errors are unrelated. @mfeurer I modified the documentation to reflect the behaviour of the attributes.

…et (openml#586)
* EHN: inferred row_id_attribute from dataframe to create a dataset
* reset the index of dataframe after inference
* TST: check the size of the dataset
* PEP8
* TST: check that an error is raised when row_id_attributes is not a known attribute
* DOC: Update the docstring
* PEP8

* EHN: support SparseDataFrame when creating a dataset
* TST: check attributes inference dtype
* PEP8
* EXA: add sparse dataframe in the example
* Fix typos.
* Fix typo.
* Refactoring task.py (#588)
* [MRG] EHN: inferred row_id_attribute from dataframe to create a dataset (#586)
* EHN: inferred row_id_attribute from dataframe to create a dataset
* reset the index of dataframe after inference
* TST: check the size of the dataset
* PEP8
* TST: check that an error is raised when row_id_attributes is not a known attribute
* DOC: Update the docstring
* PEP8
* add examples to the menu, remove double progress (#554)
* PEP8
* PEP8

EHN: inferred row_id_attribute from dataframe to create a dataset

bd413cd

janvanrijn reviewed Nov 7, 2018

mfeurer reviewed Nov 9, 2018

reset the index of dataframe after inference

d4d4bd9

glemaitre added 6 commits November 14, 2018 11:11

Merge remote-tracking branch 'origin/develop' into is/row_is_attribute

011c094

TST: check the size of the dataset

84137d4

PEP8

4a7e43b

TST: check that an error is raised when row_id_attributes is not a known attribute

3b0f4db

DOC: Update the docstring

6a82f1c

PEP8

e6fd25b

glemaitre mentioned this pull request Nov 14, 2018

[MRG] DEPR: remove the format parameter from create_dataset #592

Merged

mfeurer approved these changes Nov 16, 2018

mfeurer merged commit 696db49 into openml:develop Nov 16, 2018

3 participants