[MRG] EHN: allow to upload DataFrame and infer dtype and column name #545
Conversation
@ArlindKadra @mfeurer @janvanrijn Any suggestions are welcome.
The PEP8 error is somewhat unrelated, since we can't avoid the conditional import of mock at the top of the file.
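(For context, a guess at the kind of import being referred to: the usual Python 2/3 compatibility pattern for mock. This is illustrative only, not the actual code in the PR.)

```python
# Hypothetical illustration of the mock import pattern under discussion:
# the conditional fallback is what tends to upset PEP8/flake8 checks on
# top-of-file imports.
try:
    from unittest import mock  # Python 3: part of the standard library
except ImportError:
    import mock  # Python 2: third-party backport package
```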
Codecov Report
    @@            Coverage Diff             @@
    ##           develop     #545      +/-  ##
    ===========================================
    + Coverage    89.82%   89.91%    +0.08%
    ===========================================
      Files           32       32
      Lines         2959     2985       +26
    ===========================================
    + Hits          2658     2684       +26
      Misses         301      301
Continue to review full report at Codecov.
Force-pushed from 5135b81 to 2ed1928.
mfeurer left a comment:
Thanks a lot! I only have minor comments and would really like to have the error tested, but I'm optimistic that we can merge this soon.
examples/create_upload_tutorial.py (Outdated)

    openml.config.server = 'https://test.openml.org/api/v1/xml'

    ############################################################################
    # Uploading a data set store in a NumPy array
Should be stored
examples/create_upload_tutorial.py (Outdated)

    )

    ############################################################################
    try:
I actually think these try/excepts are no longer necessary and only add little extra information for the user. Could you please remove them?
examples/create_upload_tutorial.py (Outdated)

    # call :func:`create_dataset` by passing the dataframe and fixing the parameter
    # ``attributes`` to ``'auto'``.

    # force OpenML to infer the attributes from the dataframe
Is there a reason to define these arguments outside of the function you'll be calling?
I found it more didactic to have a comment and an assignment instead of just assigning inside the function. WDYT?
The current example has comments inside the function so everything is in one place. Not sure which one is better though as the current version has a pretty long function call.
I discussed this with @ArlindKadra regarding #547, to have this based on the length. But that would distribute the creation of the arguments. I'm starting to see why creating them beforehand is a good idea.
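(To make this concrete, a minimal sketch of the pattern under discussion: the arguments are bound to names first, with explanatory comments, then passed to openml.datasets.create_dataset. The dataframe content and metadata values are invented, and the exact create_dataset parameter list may differ from the version in this PR.)

```python
import openml
import pandas as pd

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'species': pd.Series(['setosa', 'setosa', 'setosa'], dtype='category'),
})

# force OpenML to infer the attributes from the dataframe
attributes = 'auto'

dataset = openml.datasets.create_dataset(
    name='toy-iris',
    description='A toy dataset created from a pandas DataFrame.',
    attributes=attributes,
    data=df,
    default_target_attribute='species',
    creator=None, contributor=None, collection_date=None,
    language='English', licence=None, citation=None,
    row_id_attribute=None, ignore_attribute=None,
)
dataset.publish()
```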
    def test_create_dataset_pandas(self):
        # pandas is only a optional dependency and we need to skip the test if
        # it is not installed.
In the long run we will probably make pandas required (it is already required for the examples and documentation), so I'd be totally okay if you add it as a dependency.
This could be deleted now, right?
        original_data_url=original_data_url,
        paper_url=paper_url
    )
    dataset.publish()
Could you please download and check that everything was converted and uploaded properly?
I assumed that I can do that with something like:

    upload_id = dataset.publish()
    dataset_uploaded = openml.datasets.get_dataset(upload_id)

However, I receive the following exception:

    openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/features/5915 returned code 273: Dataset not processed yet

What is the usual way you do this?
One way could be something like this:

    upload_id = dataset.publish()
    while True:
        try:
            dataset_uploaded = openml.datasets.get_dataset(upload_id)
            break
        except OpenMLServerException as msg_exception:
            if "code 273" in str(msg_exception):
                pass
            else:
                raise msg_exception

but I am not sure how long it can take for the dataset to be processed, and we could get into some timeout nightmare.
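(A sketch of the same idea with a retry cap and a pause between attempts, to avoid the open-ended loop; the helper name, delay, and retry count are invented.)

```python
import time

import openml
from openml.exceptions import OpenMLServerException


def wait_until_processed(dataset_id, max_retries=30, delay=10):
    """Poll the server until the dataset is processed or retries run out."""
    for _ in range(max_retries):
        try:
            return openml.datasets.get_dataset(dataset_id)
        except OpenMLServerException as exc:
            # Error code 273 means "Dataset not processed yet"; anything
            # else is a real failure and should propagate.
            if "code 273" not in str(exc):
                raise
            time.sleep(delay)
    raise RuntimeError(
        "Dataset {} not processed after {} retries".format(dataset_id,
                                                           max_retries))
```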
@glemaitre We already had this problem in #547. We decided to only compare based on the arff, and we also implemented 2 functions which get the arff and format based on the id (those are available once it is uploaded). I have to refactor my PR a bit more, but it should be available tomorrow.
So do you mean to use this function:
So I would probably wait for your PR to get in first, and I will manage the conflict then.
Yes. @mfeurer, are you okay with this?
This looks good at first glance.
Yes, I have to have full coverage. I have to check that.
Sent from my phone - sorry for being brief and for potential misspellings.
openml/datasets/functions.py (Outdated)

                             "entries which are not string."
                             .format(column_name))
    elif column_dtype.name == 'object':
        arff_dtype = 'STRING'
Usually we want this to be category as well, right? kinda tricky ;)
And what about integers that encode categories?
> And what about integers that encode categories?

If they encode categories, then one should use the category dtype for those columns.
However, this part needs to be modified. We should manage INTEGER and REAL/NUMERIC differently. We also have to think more in depth about the missing values. I'll use the expertise of @jorisvandenbossche.
Basically, we are working on uploading the datasets from dirty_cat and we are confronted with those issues.
Yes, I would also say that it is up to the user to use a categorical dtype. Actual string columns are not that rare (e.g. the titanic dataset), and it is quite impossible to guess from a string column whether it should be categorical or not.
Of course, there could also be a keyword that the user can pass to indicate which columns to treat as categorical, without requiring the user to do the actual conversion themselves.
> And what about integers that encode categories?

I think arff does not support integer categories?
Conceptually, arff should allow this (as long as the categories are integers, too). However, liac-arff does not support this. Can be reproduced with:
    import arff
    arff.dumps({
        'relation': 'test',
        'description': 'abc',
        'attributes': [['test', [1, 2, 3]]],
        'data': [[1], [2]],
    })
I'm wondering if it's worth changing this. Formally, there's also no difference between REAL, INTEGER and NUMERIC, but I don't see a reason to not be more specific about the data type as the arff specification allows this.
I think that there is a difference between REAL and INTEGER, since the dtype will be kept, won't it?
    In [5]: X = arff.dumps({
       ...:     'relation': 'test',
       ...:     'description': 'abc',
       ...:     'attributes': [['test', 'REAL']],
       ...:     'data': [[1], [2]]
       ...: })

    In [6]: arff.loads(X)
    Out[6]:
    {'description': 'abc',
     'relation': 'test',
     'attributes': [('test', 'REAL')],
     'data': [[1.0], [2.0]]}

    In [7]: X = arff.dumps({
       ...:     'relation': 'test',
       ...:     'description': 'abc',
       ...:     'attributes': [['test', 'INTEGER']],
       ...:     'data': [[1], [2]]
       ...: })

    In [8]: arff.loads(X)
    Out[8]:
    {'description': 'abc',
     'relation': 'test',
     'attributes': [('test', 'INTEGER')],
     'data': [[1], [2]]}
Going by the official arff specification, all numeric attributes should go by NUMERIC:
https://www.cs.waikato.ac.nz/ml/weka/arff.html
In order to make OpenML datasets comply with as many arff parsers as possible, maybe it's best to enforce this server side too? In that case the Python uploader should of course also upload the NUMERIC key. @mfeurer @joaquinvanschoren @berndbischl WDYT?
I have to add that currently this is not enforced server side.
@janvanrijn you pointed to an outdated version of the arff standard (see the first sentence on the page you linked to), the newer one (https://waikato.github.io/weka-wiki/arff_stable/) actually mentions 'integer' and 'real' (although it says they are treated as numeric)
I think they're more a guideline than a necessity. But it would be great if we can support them as well as possible in pandas. I don't think this would be easy in numpy, though.
I'm so excited!
openml/datasets/functions.py (Outdated)

    # infer the type of data for each column of the DataFrame
    attributes_ = [(col_name,
                    _pandas_dtype_to_arff_dtype(data, col_name, col_dtype))
                   for col_name, col_dtype in data.dtypes.iteritems()]
you don't necessarily need to pass the dtype here, you could also iterate only through the column names
I have to add many more tests, but basically I changed the way the dtype inference is done using pandas (only available for pandas >= 0.19, I think): it should be robust enough to find the right dtype even in the presence of NA values. It should also be extensible to the DATE (datetime) dtype, since pandas also infers those. I see that there is a PR in liac-arff which could be worth pushing forward ;)
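(A small illustration of why missing values are dropped before inference; the exact label returned for the NA-polluted series depends on the pandas version — older releases report 'mixed'.)

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])

# With the NaN still present, older pandas versions infer 'mixed' ...
print(pd.api.types.infer_dtype(s))
# ... whereas dropping missing values first recovers the underlying dtype.
print(pd.api.types.infer_dtype(s.dropna()))  # 'string'
```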
The error on the CI is a bit cryptic to me. Does anyone have an idea why it is failing only for some of the builds and not others? I can use this function locally as well.
I think that it could be good to actually re-download the dataset once uploaded, because I don't see any other way to check that everything went fine when converting the dataframe into arff.
@jorisvandenbossche If you want to have a look at it.
I also encountered this error. I opened issue #565 and PR #566; if you merge #566 into this branch it should be fixed. To add a little bit more context: due to the structure of the specific test, it only manifests itself in certain cases, for example when the test server is wiped empty again (which I did on Friday). This is why it was not picked up on earlier and merged into develop.
    # infer the type of data for each column of the DataFrame
    attributes_ = attributes_arff_from_df(data)
    if isinstance(attributes, dict):
        # override the attributes which was specified by the user
Doesn't this do the exact opposite? It overrides the attributes from the dataframe with the arguments passed by the user, right?
Yes. It is the mechanism that allows the user to overwrite some specific attributes. It could be useful to force a specific data type or to specify the categories (e.g. if some categories are missing from the data column).
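(A sketch of how a caller could use that mechanism, with invented column names; the merge shown mirrors my reading of the snippet above rather than the exact implementation.)

```python
# Attributes as inferred from the dataframe:
attributes_ = [('outlook', ['sunny', 'rainy']), ('temperature', 'REAL')]

# User-supplied overrides, e.g. adding a category absent from the data:
attributes = {'outlook': ['sunny', 'overcast', 'rainy']}

# Entries in the user dict replace the corresponding inferred ones.
attributes_ = [(name, attributes.get(name, arff_dtype))
               for name, arff_dtype in attributes_]
print(attributes_)
# [('outlook', ['sunny', 'overcast', 'rainy']), ('temperature', 'REAL')]
```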
setup.py (Outdated)

    'nbformat',
    'python-dateutil',
    'oslo.concurrency',
    'pandas',
In one of your comments you mentioned that this needs to be at least 0.19. Would it make sense to add this piece of information here?
openml/datasets/functions.py (Outdated)

    # raise an error asking to convert all entries to string.
    categories = df[column_name].cat.categories
    categories_dtype = pd.api.types.infer_dtype(categories)
    if categories_dtype != 'string':
You might need to allow 'unicode' here as well, for Python 2 compat?
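(A minimal sketch of the suggested tweak, reusing the names from the hunk above: accept both inference labels.)

```python
import pandas as pd

categories = pd.Index(['a', 'b'])
categories_dtype = pd.api.types.infer_dtype(categories)

# Accept 'string' (Python 3) as well as 'unicode' (Python 2) categories.
if categories_dtype not in ('string', 'unicode'):
    raise ValueError("Categories must be strings, got dtype '{}'"
                     .format(categories_dtype))
```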
openml/datasets/functions.py (Outdated)

    else:
        raise ValueError("The dtype '{}' of the column '{}' is not "
                         "currently supported by liac-arff. Supported "
                         "dtypes are categorical, string, interger, "
interger -> integer
@mfeurer @janvanrijn @ArlindKadra @amueller The PR is ready for a full round of review. With the previous work it should be almost ready to be merged (the CI should be green).
janvanrijn left a comment:
Thanks for making this PR. I find it really awesome. I have not been too involved in this one, so please don't mind my comments too much, but I left some (open) questions. Apart from this, LGTM.
openml/datasets/functions.py (Outdated)

    def attributes_arff_from_df(df):
        """Create the attributes as specified by the ARFF format using a dataframe.
        Arguments:
The docstring that you applied seems different from the other docstrings that we use, i.e.,

    Parameters
    ----------

and

    Returns
    -------
    # dropped before the inference instead.
    column_dtype = pd.api.types.infer_dtype(df[column_name].dropna())

    if column_dtype == 'categorical':
I missed the discussion in the thread. Is this a regular pandas data type? Is there any reason not to use the dtype str?
(The comments below do not seem to help me too much, sorry.)
This is an available dtype in pandas to indicate that a column is a categorical column.
> Is there any reason not to use the dtype str?

Actually, I don't see why categories should be strings :). Right now, liac-arff does not allow otherwise; we could actually think about silently converting the categories to strings if a column is of category dtype.
Cool, thanks for the clarification :)
| "Got {} dtype in this column." | ||
| .format(column_name, categories_dtype)) | ||
| attributes_arff.append((column_name, categories.tolist())) | ||
| elif column_dtype == 'boolean': |
why not use dtype=bool?
Because we are using the pd.api.types.infer_dtype function, which returns strings.
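(For illustration: infer_dtype returns string labels rather than numpy dtypes, which is why the code compares against 'boolean' instead of dtype=bool.)

```python
import pandas as pd

print(pd.api.types.infer_dtype(pd.Series([True, False])))  # 'boolean'
print(pd.api.types.infer_dtype(pd.Series(['a', 'b'], dtype='category')))
# 'categorical'
```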
    else:
        attributes_ = attributes

    data = data.values if hasattr(data, "columns") else data
This seems to imply that internally we store everything as arff. Wouldn't it be much cooler to store everything internally as pandas, that will make it much easier to do operations on the datasets. (And of course convert before uploading dataset / performing a run). However maybe that should be a next PR.
I would say that this is another PR.
I would imagine that it would require quite some changes, but if the consensus is to use pandas, then why not. Then the data would be restricted to tabular data (which is probably the main use case in openml).
Sparse data might be an issue, right?
Sparse arrays do not have a values attribute, so we should be fine.
If you are thinking about SparseDataFrame, I would say that we should not offer support: they are going to be deprecated in favor of the sparse dtype.
Is liac-arff actually usable with sparse data?
Yes, it can handle scipy's COO data structure.
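(For reference, a sketch of the loading side, assuming I remember the liac-arff API correctly: return_type=arff.COO yields the (values, rows, columns) triplets of a sparse matrix. The file name is made up.)

```python
import arff
import scipy.sparse

# Decode a sparse ARFF file into COO triplets.
with open('dataset.arff') as fp:
    decoded = arff.load(fp, return_type=arff.COO)

values, rows, cols = decoded['data']
X = scipy.sparse.coo_matrix((values, (rows, cols)))
```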
So this will not be an issue. @amueller, could you elaborate on your thoughts?
mfeurer left a comment:
This looks really great now, I think there is only one assert statement missing.
| "Uploaded ARFF does not match original one" | ||
| ) | ||
|
|
||
| # Check that we can overwrite the attributes |
There's no assert for this, right?
Line 797: we checked that the attributes contain 'g', which is not present in the original data. I would consider that an assert. Do you want to test something else?
Do you mean this?

    self.assertEqual(
        _get_online_dataset_arff(upload_did),
        dataset._dataset,
        "Uploaded ARFF does not match original one"
    )

Either I'm missing something, or there is no strict check that 'g' is in the attributes of the downloaded file? Line 797 only checks that the download is equal to the upload, right?
OK, I see. We can add a strict check then.
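(Such a check could look like the sketch below, reusing _get_online_dataset_arff and upload_did from the test discussed above; liac-arff returns attributes as (name, type) pairs.)

```python
import arff

arff_dict = arff.loads(_get_online_dataset_arff(upload_did))
attribute_names = [name for name, _ in arff_dict['attributes']]
self.assertIn('g', attribute_names)
```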
Actually we could support SparseDataFrame for now. Calling to_coo instead of values should be enough then.
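(A sketch of that idea; pd.SparseDataFrame and its to_coo method existed at the time of this PR but were later deprecated in favor of sparse dtypes.)

```python
import pandas as pd


def _extract_data(data):
    # Sparse frames become a scipy COO matrix, dense frames a numpy array;
    # anything else is passed through unchanged.
    if isinstance(data, pd.SparseDataFrame):
        return data.to_coo()
    if hasattr(data, 'columns'):
        return data.values
    return data
```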
Do you add an entry to document changes in the library, like a what's new or changelog? (I see a
Force-pushed from 20dbc23 to ccf7b82.
Anything else?
LGTM. Also, I promised Lisheng Sun (PhD student in Paris) to update her once this was merged. I don't have her GitHub handle; does someone have it?
I can add support for SparseDataFrame in an upcoming PR.
Partially addresses #540.
Allow using a pandas DataFrame when creating and publishing a dataset on OpenML.
Important points to be considered:
TODO: