-
-
Notifications
You must be signed in to change notification settings - Fork 211
[MRG] EHN: inferred row_id_attribute from dataframe to create a dataset #586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ping @mfeurer @janvanrijn @joaquinvanschoren Any feedback? |
| original_data_url=original_data_url, | ||
| paper_url=paper_url | ||
| ) | ||
| self.assertEqual(dataset.row_id_attribute, 'index_column') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also test that the "publish" function correctly uploads the index column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would indeed be great.
openml/datasets/functions.py
Outdated
| else: | ||
| attributes_ = attributes | ||
|
|
||
| if row_id_attribute is None and hasattr(data, "index"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens with the dataframe index if row_id_attribute is set to a value other than the index? Will it be ignored at upload time? Can this be added to comments and somehow tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you quickly show a code snippet (even pseudo-code) to illustrate what you mean? I am not sure to see the use-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @janvanrijn wanted to only point out a potential issue:
In [17]: dataset = openml.datasets.functions.create_dataset(
...: name=name,
...: description=description,
...: creator=creator,
...: contributor=None,
...: collection_date=collection_date,
...: language=language,
...: licence=licence,
...: default_target_attribute=default_target_attribute,
...: ignore_attribute=None,
...: citation=citation,
...: attributes='auto',
...: data=df,
...: row_id_attribute='rnd_str',
...: format=None,
...: version_label='test',
...: original_data_url=original_data_url,
...: paper_url=paper_url,
...: )
In [18]: dataset.row_id_attribute
Out[18]: 'rnd_str'
In [19]: df.index.name
Out[19]: 'index'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so the PR is wrong. The behaviour needs to be better defined:
- If data is an array and
row_id_attributeisNonethen we letrow_id_attributeatNone. - If data is an array and
row_id_attributeis notNone, we need to have a check that this is actually an attribute. - If data is a dataframe and
row_id_attributeisNone, we need to infer it. Ifdf.index.nameisNonethen we are back to the first bullet. Otherwise, we take the name asrow_id_attributeand we need todf.reset_index()to include the data into the array when converting to numpydf.values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to infer it
I'd rather say "we can infer it". There is no necessity to have an ID column. But I think these three bullet points describe the behavior I would expect very well.
| specified, it will be inferred. | ||
| .. versionadded: 0.8 | ||
| Inference of ``row_id_attribute`` from a dataframe. | ||
| format : str, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I wouldn't mind just removing format rather than deprecating it. It seems that the function signature is changed anyway, making the function not backwards compatible (bad, but acceptable and nothing really that can be done about this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but I would do that in a separate PR.
|
Thanks for this PR! I left some questions. @mfeurer is currently out of office, hopefully he can also have a look at it, but I think this can be merged quite easily :) |
openml/datasets/functions.py
Outdated
| row_id_attribute : str, optional | ||
| The attribute that represents the row-id column, if present in the | ||
| dataset. If ``data`` is a dataframe and ``row_id_attribute`` is not | ||
| specified, it will be inferred. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if you could be a bit more specific on how it will be inferred. Honestly I'm a bit confused right now. Doesn't a dataframe always have an index?
In [1]: import pandas as pd
In [2]: data = [['a'], ['b'], ['c'], ['d'], ['e']]
In [3]: column_names = ['rnd_str']
In [4]: df = pd.DataFrame(data, columns=column_names)
In [5]: hasattr(data, "index")
Out[5]: True
Unless the user specifies an index column it should not be uploaded. And I assume that the pandas index column is not uploaded right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it always has an index and the index.name is either a string or None. By default, it will be None.
Just to be sure, row_id_attribute corresponds to the name of the index column, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be sure, row_id_attribute corresponds to the name of the index column, isn't it
Yes, but only if there is one. For most datasets there is none.
yes, it always has an index and the index.name is either a string or None. By default, it will be None.
Shouldn't the code then be hasattr(data, "index") and hasattr(data.index, "name")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the code then be hasattr(data, "index") and hasattr(data.index, "name")?
No, index as always a name attribute. It is None by default which is the also the value required when row_id_attribute should not be specified.
To illustrate what I mean.
In [2]: df = pd.DataFrame([1, 2])
In [3]: df
Out[3]:
0
0 1
1 2
In [4]: df.index.name
In [5]: print(df.index.name)
NoneThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I got it. Do you think it would make sense to add a brief description about this behavior to the docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python lists also have an index. So it would break for this case.
| original_data_url=original_data_url, | ||
| paper_url=paper_url | ||
| ) | ||
| self.assertEqual(dataset.row_id_attribute, 'index_column') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would indeed be great.
|
Looks good to me, but it would be great to have the additional unit test and maybe a warning if the row ID is given twice (in the form of an index and as an attribute to the create dataset function). |
I am a bit lost then. How do you actually store the index column in ARFF if this is not listed in the attribute? |
|
I am getting locally and on travis the following: Is it linked with the test server? This is a bit blocking to write test locally actually. |
There is not necessarily an ID column. Most datasets, for example Iris, do not have an ID column and therefore, no ID column is uploaded to OpenML. However, for example the MUSK dataset has an index column in original data, called |
|
The errors are unrelated. @mfeurer I modified the documentation to reflect the behaviour of the attributes. |
…et (openml#586) * EHN: inferred row_id_attribute from dataframe to create a dataset * reset the index of dataframe after inference * TST: check the size of the dataset * PEP8 * TST: check that an error is raised when row_id_attributes is not a known attribute * DOC: Update the docstring * PEP8
* EHN: support SparseDataFrame when creating a dataset * TST: check attributes inference dtype * PEP8 * EXA: add sparse dataframe in the example * Fix typos. * Fix typo. * Refactoring task.py (#588) * [MRG] EHN: inferred row_id_attribute from dataframe to create a dataset (#586) * EHN: inferred row_id_attribute from dataframe to create a dataset * reset the index of dataframe after inference * TST: check the size of the dataset * PEP8 * TST: check that an error is raised when row_id_attributes is not a known attribute * DOC: Update the docstring * PEP8 * add examples to the menu, remove double progress (#554) * PEP8 * PEP8
This allows to automatically inferred the name of the index column.
However, I break the backward compatibility since that the parameter is now a keyword argument. It was already done like that for
format.