Skip to content

Conversation

@glemaitre
Copy link
Contributor

This allows to automatically inferred the name of the index column.

However, I break the backward compatibility since that the parameter is now a keyword argument. It was already done like that for format.

@glemaitre
Copy link
Contributor Author

ping @mfeurer @janvanrijn @joaquinvanschoren Any feedback?

original_data_url=original_data_url,
paper_url=paper_url
)
self.assertEqual(dataset.row_id_attribute, 'index_column')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also test that the "publish" function correctly uploads the index column?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would indeed be great.

else:
attributes_ = attributes

if row_id_attribute is None and hasattr(data, "index"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens with the dataframe index if row_id_attribute is set to a value other than the index? Will it be ignored at upload time? Can this be added to comments and somehow tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you quickly show a code snippet (even pseudo-code) to illustrate what you mean? I am not sure to see the use-case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @janvanrijn wanted to only point out a potential issue:

In [17]: dataset = openml.datasets.functions.create_dataset(
    ...:             name=name,
    ...:             description=description,
    ...:             creator=creator,
    ...:             contributor=None,
    ...:             collection_date=collection_date,
    ...:             language=language,
    ...:             licence=licence,
    ...:             default_target_attribute=default_target_attribute,
    ...:             ignore_attribute=None,
    ...:             citation=citation,
    ...:             attributes='auto',
    ...:             data=df,
    ...:             row_id_attribute='rnd_str',
    ...:             format=None,
    ...:             version_label='test',
    ...:             original_data_url=original_data_url,
    ...:             paper_url=paper_url,
    ...: )

In [18]: dataset.row_id_attribute
Out[18]: 'rnd_str'

In [19]: df.index.name
Out[19]: 'index'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so the PR is wrong. The behaviour needs to be better defined:

  • If data is an array and row_id_attribute is None then we let row_id_attribute at None.
  • If data is an array and row_id_attribute is not None, we need to have a check that this is actually an attribute.
  • If data is a dataframe and row_id_attribute is None, we need to infer it. If df.index.name is None then we are back to the first bullet. Otherwise, we take the name as row_id_attribute and we need to df.reset_index() to include the data into the array when converting to numpy df.values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to infer it

I'd rather say "we can infer it". There is no necessity to have an ID column. But I think these three bullet points describe the behavior I would expect very well.

specified, it will be inferred.
.. versionadded: 0.8
Inference of ``row_id_attribute`` from a dataframe.
format : str, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I wouldn't mind just removing format rather than deprecating it. It seems that the function signature is changed anyway, making the function not backwards compatible (bad, but acceptable and nothing really that can be done about this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but I would do that in a separate PR.

@janvanrijn
Copy link
Member

Thanks for this PR! I left some questions. @mfeurer is currently out of office, hopefully he can also have a look at it, but I think this can be merged quite easily :)

row_id_attribute : str, optional
The attribute that represents the row-id column, if present in the
dataset. If ``data`` is a dataframe and ``row_id_attribute`` is not
specified, it will be inferred.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you could be a bit more specific on how it will be inferred. Honestly I'm a bit confused right now. Doesn't a dataframe always have an index?

In [1]: import pandas as pd                                                                                                                                                                                  

In [2]: data = [['a'], ['b'], ['c'], ['d'], ['e']]                                                                                                                                                           

In [3]: column_names = ['rnd_str']                                                                                                                                                                           

In [4]: df = pd.DataFrame(data, columns=column_names)                                                                                                                                                        

In [5]: hasattr(data, "index")                                                                                                                                                                               
Out[5]: True

Unless the user specifies an index column it should not be uploaded. And I assume that the pandas index column is not uploaded right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it always has an index and the index.name is either a string or None. By default, it will be None.

Just to be sure, row_id_attribute corresponds to the name of the index column, isn't it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be sure, row_id_attribute corresponds to the name of the index column, isn't it

Yes, but only if there is one. For most datasets there is none.

yes, it always has an index and the index.name is either a string or None. By default, it will be None.

Shouldn't the code then be hasattr(data, "index") and hasattr(data.index, "name")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the code then be hasattr(data, "index") and hasattr(data.index, "name")?

No, index as always a name attribute. It is None by default which is the also the value required when row_id_attribute should not be specified.

To illustrate what I mean.

In [2]: df = pd.DataFrame([1, 2])

In [3]: df
Out[3]: 
   0
0  1
1  2

In [4]: df.index.name

In [5]: print(df.index.name)
None

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I got it. Do you think it would make sense to add a brief description about this behavior to the docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python lists also have an index. So it would break for this case.

original_data_url=original_data_url,
paper_url=paper_url
)
self.assertEqual(dataset.row_id_attribute, 'index_column')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would indeed be great.

@mfeurer
Copy link
Collaborator

mfeurer commented Nov 12, 2018

Looks good to me, but it would be great to have the additional unit test and maybe a warning if the row ID is given twice (in the form of an index and as an attribute to the create dataset function).

@glemaitre
Copy link
Contributor Author

Looks good to me, but it would be great to have the additional unit test and maybe a warning if the row ID is given twice (in the form of an index and as an attribute to the create dataset function).

I am a bit lost then. How do you actually store the index column in ARFF if this is not listed in the attribute?

@glemaitre
Copy link
Contributor Author

@janvanrijn

I am getting locally and on travis the following:

>           parser.Parse(xml_input, True)
E           xml.parsers.expat.ExpatError: junk after document element: line 71, column 6

Is it linked with the test server? This is a bit blocking to write test locally actually.

@mfeurer
Copy link
Collaborator

mfeurer commented Nov 13, 2018

I am a bit lost then. How do you actually store the index column in ARFF if this is not listed in the attribute?

There is not necessarily an ID column. Most datasets, for example Iris, do not have an ID column and therefore, no ID column is uploaded to OpenML. However, for example the MUSK dataset has an index column in original data, called configuration_name, which describes is marked as an index.

@glemaitre
Copy link
Contributor Author

The errors are unrelated. @mfeurer I modified the documentation to reflect the behaviour of the attributes.

@mfeurer mfeurer merged commit 696db49 into openml:develop Nov 16, 2018
glemaitre added a commit to glemaitre/openml-python that referenced this pull request Nov 16, 2018
…et (openml#586)

* EHN: inferred row_id_attribute from dataframe to create a dataset

* reset the index of dataframe after inference

* TST: check the size of the dataset

* PEP8

* TST: check that an error is raised when row_id_attributes is not a known attribute

* DOC: Update the docstring

* PEP8
mfeurer pushed a commit that referenced this pull request Nov 16, 2018
* EHN: support SparseDataFrame when creating a dataset

* TST: check attributes inference dtype

* PEP8

* EXA: add sparse dataframe in the example

* Fix typos.

* Fix typo.

* Refactoring task.py (#588)

* [MRG] EHN: inferred row_id_attribute from dataframe to create a dataset (#586)

* EHN: inferred row_id_attribute from dataframe to create a dataset

* reset the index of dataframe after inference

* TST: check the size of the dataset

* PEP8

* TST: check that an error is raised when row_id_attributes is not a known attribute

* DOC: Update the docstring

* PEP8

* add examples to the menu, remove double progress (#554)

* PEP8

* PEP8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants