
Conversation

@glemaitre (Contributor) commented Sep 21, 2018

Partially addresses #540

Allow using a pandas DataFrame when creating and publishing a dataset on OpenML.

Important points to be considered:

  • categorical columns are not automatically converted to string and will instead raise an error
  • boolean columns are converted to categorical columns, encoding True and False as strings
  • INTEGER and REAL are used depending on the dtype, to preserve the dtype when loading with liac-arff
  • a dict can be passed to override the dtype inference for some specific columns
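The inference rules above can be sketched roughly as follows. This is a minimal illustration built on `pd.api.types.infer_dtype`; the function name and the exact set of handled cases are assumptions for illustration, not the PR's actual code:

```python
import pandas as pd

def infer_arff_attributes(df):
    """Hypothetical sketch mirroring the rules above; names are illustrative."""
    attributes = []
    for name in df.columns:
        # Drop missing values first so NA entries do not confuse the inference.
        inferred = pd.api.types.infer_dtype(df[name].dropna())
        if inferred == 'categorical':
            categories = df[name].cat.categories
            if pd.api.types.infer_dtype(categories) != 'string':
                # Categorical columns are not silently converted: raise instead.
                raise ValueError("Categories of column '{}' must be strings."
                                 .format(name))
            attributes.append((name, categories.tolist()))
        elif inferred == 'boolean':
            # Booleans become a two-category nominal attribute.
            attributes.append((name, ['False', 'True']))
        elif inferred == 'integer':
            attributes.append((name, 'INTEGER'))
        elif inferred in ('floating', 'mixed-integer-float'):
            attributes.append((name, 'REAL'))
        elif inferred == 'string':
            attributes.append((name, 'STRING'))
        else:
            raise ValueError("The dtype '{}' of column '{}' is not supported."
                             .format(inferred, name))
    return attributes
```

A user-supplied dict of overrides would then be merged on top of this inferred list.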

TODO:

  • Add unit tests for the exceptions

@glemaitre (Contributor, author)

@ArlindKadra @mfeurer @janvanrijn Any suggestions are welcome.

@glemaitre glemaitre changed the title [MRG] EHN: allow to upload DataFrame and infer dtype and column name [WIP] EHN: allow to upload DataFrame and infer dtype and column name Sep 21, 2018
@glemaitre (Contributor, author)

The PEP8 error is somewhat unrelated, since we can't avoid the import at the top of the file with mock.

@codecov-io commented Sep 21, 2018

Codecov Report

Merging #545 into develop will increase coverage by 0.08%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##           develop     #545      +/-   ##
===========================================
+ Coverage    89.82%   89.91%   +0.08%     
===========================================
  Files           32       32              
  Lines         2959     2985      +26     
===========================================
+ Hits          2658     2684      +26     
  Misses         301      301
Impacted Files Coverage Δ
openml/datasets/functions.py 91.79% <100%> (+0.99%) ⬆️
openml/utils.py 92.63% <0%> (-0.23%) ⬇️

Last update 95d12e5...ccf7b82.

@glemaitre glemaitre changed the title [WIP] EHN: allow to upload DataFrame and infer dtype and column name [MRG] EHN: allow to upload DataFrame and infer dtype and column name Sep 21, 2018
@glemaitre glemaitre force-pushed the is/create_dataset_pandas branch from 5135b81 to 2ed1928 Compare September 21, 2018 19:41
@mfeurer (Collaborator) left a comment

Thanks a lot! I only have minor comments and would really like to have the error tested, but I'm optimistic that we can merge this soon.

openml.config.server = 'https://test.openml.org/api/v1/xml'

############################################################################
# Uploading a data set store in a NumPy array
Collaborator:

Should be stored

)

############################################################################
try:
Collaborator:

I actually think these try/excepts are no longer necessary and only add little extra information for the user. Could you please remove them?

# call :func:`create_dataset` by passing the dataframe and fixing the parameter
# ``attributes`` to ``'auto'``.

# force OpenML to infer the attributes from the dataframe
Collaborator:

Is there a reason to define these arguments outside of the function you'll be calling?

Contributor (author):

I found it more didactic to have a comment and an assignment instead of just assigning inside the function. WDYT?

Collaborator:

The current example has comments inside the function so everything is in one place. Not sure which one is better though as the current version has a pretty long function call.

Collaborator:

I discussed this with @ArlindKadra regarding #547, to base this on the length. But that would spread out the creation of the arguments. I'm starting to see why creating them beforehand is a good idea.


def test_create_dataset_pandas(self):
# pandas is only a optional dependency and we need to skip the test if
# it is not installed.
Collaborator:

In the long run we will probably make pandas required (it is already required for the examples and documentation), so I'd be totally okay if you add it as a dependency.


original_data_url=original_data_url,
paper_url=paper_url
)
dataset.publish()
Collaborator:

Could you please download and check that everything was converted and uploaded properly?

Contributor (author):

I assumed that I can do that with something like:

upload_id = dataset.publish()
dataset_uploaded = openml.datasets.get_dataset(upload_id)

However, I receive the following exception:

openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/features/5915 returned code 273: Dataset not processed yet

What is the usual way you do this?

@glemaitre (Contributor, author), Oct 2, 2018:

One way could be something like this:

        upload_id = dataset.publish()
        while True:
            try:
                dataset_uploaded = openml.datasets.get_dataset(upload_id)
                break
            except OpenMLServerException as exc:
                if "code 273" not in str(exc):
                    raise

but I am not sure how long it can take for the dataset to be processed and we could get in some timeout nightmare.
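To avoid that nightmare, the loop could be bounded with a retry budget. A generic sketch follows; the helper and its parameter names are hypothetical, not part of openml-python:

```python
import time

def poll_until_processed(fetch, is_not_ready, max_retries=30, delay=2.0):
    """Bounded retry loop (a sketch; the retry budget is an assumption).

    fetch        -- zero-argument callable returning the dataset, or raising
    is_not_ready -- predicate on the exception: True means "retry later"
    """
    last_exc = None
    for _ in range(max_retries):
        try:
            return fetch()
        except Exception as exc:
            # Re-raise anything that is not the "not processed yet" case.
            if not is_not_ready(exc):
                raise
            last_exc = exc
        time.sleep(delay)
    raise TimeoutError('gave up waiting: {}'.format(last_exc))
```

With OpenML it could then be invoked as `poll_until_processed(lambda: openml.datasets.get_dataset(upload_id), lambda e: 'code 273' in str(e))`.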

@ArlindKadra (Member), Oct 2, 2018:

@glemaitre We already had this problem in #547. We decided to compare based on the ARFF only, and we also implemented two functions which fetch the ARFF and the format based on the id (those are available once it is uploaded). I have to refactor my PR a bit more, but it should be available tomorrow.

Contributor (author):

So do you mean to use this function:

https://github.com/openml/openml-python/pull/547/files#diff-2ffad3a1ae43bbcc0e841e2f1df8967bR692

So I would probably wait for your PR to go in first and then resolve the conflicts.

Member:

yes, @mfeurer are you okay with this?

Collaborator:

This looks good at first glance.

@glemaitre (Contributor, author) commented Sep 24, 2018 via email

"entries which are not string."
.format(column_name))
elif column_dtype.name == 'object':
arff_dtype = 'STRING'
Contributor:

Usually we want this to be category as well, right? kinda tricky ;)
And what about integers that encode categories?

Contributor (author):

And what about integers that encode categories?

If they encode categories, then one should use the category dtype for those columns.

However, this part needs to be modified. We should manage INTEGER and REAL/NUMERIC differently. We also have to think more in depth about missing values. I'll use the expertise of @jorisvandenbossche.

Basically, we are working on uploading the datasets from dirty_cat and we are confronted with those issues.


Yes, I would also say that it is up to the user to use a categorical dtype. Actual string columns are not that rare (eg titanic dataset), and it is quite impossible to guess from a string column if it should be categorical or not.

Of course, there could also be a keyword that the user can pass to indicate which columns to treat as categorical, without requiring them to do the actual conversion themselves.

And what about integers that encode categories?

I think arff does not support integer categories?
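That keyword idea could look roughly like this (a hypothetical helper, not something the PR implements): convert the named columns to the pandas category dtype, stringifying the values so the categories meet the string requirement.

```python
import pandas as pd

def mark_categorical(df, columns):
    """Hypothetical sketch of the keyword idea: convert the listed columns
    to the category dtype, casting values to string first so the resulting
    categories are strings as liac-arff requires."""
    df = df.copy()  # do not mutate the caller's frame
    for name in columns:
        df[name] = df[name].astype(str).astype('category')
    return df
```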

Collaborator:

Conceptually, arff should allow this (as long as the categories are integers, too). However, liac-arff does not support this. Can be reproduced with:

import arff

arff.dumps({
    'relation': 'test',
    'description': 'abc',
    'attributes': [['test', [1, 2, 3]]],
    'data': [[1], [2]],
})

I'm wondering if it's worth changing this. Formally, there's also no difference between REAL, INTEGER and NUMERIC, but I don't see a reason to not be more specific about the data type as the arff specification allows this.

Contributor (author):

I think that there is a difference between REAL and INTEGER, since the dtype will be preserved, won't it?

In [5]: X = arff.dumps({ 
   ...:   'relation': 'test', 
   ...:   'description': 'abc', 
   ...:   'attributes': [['test', 'REAL']], 
   ...:   'data': [[1], [2]] 
   ...: })                                                                                              

In [6]: arff.loads(X)                                                                                   
Out[6]: 
{'description': 'abc',
 'relation': 'test',
 'attributes': [('test', 'REAL')],
 'data': [[1.0], [2.0]]}

In [7]: X = arff.dumps({ 
   ...:   'relation': 'test', 
   ...:   'description': 'abc', 
   ...:   'attributes': [['test', 'INTEGER']], 
   ...:   'data': [[1], [2]] 
   ...: })                                                                                              

In [8]: arff.loads(X)                                                                                   
Out[8]: 
{'description': 'abc',
 'relation': 'test',
 'attributes': [('test', 'INTEGER')],
 'data': [[1], [2]]}

Member:

Going by the official ARFF specification, all numeric attributes should be declared as NUMERIC:
https://www.cs.waikato.ac.nz/ml/weka/arff.html

In order to make OpenML datasets comply with as many ARFF parsers as possible, maybe it's best to enforce this server-side too? In that case the Python uploader should of course also upload the NUMERIC key. @mfeurer @joaquinvanschoren @berndbischl WDYT?

Member:

I have to add that currently this is not enforced server-side.


@janvanrijn you pointed to an outdated version of the ARFF standard (see the first sentence on the page you linked to); the newer one (https://waikato.github.io/weka-wiki/arff_stable/) actually mentions 'integer' and 'real' (although it says they are treated as NUMERIC).

Collaborator:

I think they're more a guideline than a necessity. But it would be great if we could support them as well as possible in pandas. I don't think this would be easy in numpy, though.

@amueller (Contributor) commented Oct 4, 2018

I'm so excited!

# infer the type of data for each column of the DataFrame
attributes_ = [(col_name,
                _pandas_dtype_to_arff_dtype(data, col_name, col_dtype))
               for col_name, col_dtype in data.dtypes.iteritems()]


you don't necessarily need to pass the dtype here, you could also iterate only through the column names

"entries which are not string."
.format(column_name))
elif column_dtype.name == 'object':
arff_dtype = 'STRING'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would also say that it is up to the user to use a categorical dtype. Actual string columns are not that rare (eg titanic dataset), and it is quite impossible to guess from a string column if it should be categorical or not.

Of course, it could also be a keyword that the user can pass to indicate which columns to see as categorical without the need that the (s)he does the actual conversion themselves.

And what about integers that encode categories?

I think arff does not support integer categories?

@glemaitre (Contributor, author)
I have to add many more tests, but basically I changed the way the dtype inference is done using pandas (only available for pandas >= 0.19, I think):
https://github.com/openml/openml-python/pull/545/files#diff-2ffad3a1ae43bbcc0e841e2f1df8967bR357

It should be robust enough to find the right dtype even in the presence of NA values. It should also be extensible to the DATE (datetime) dtype, since pandas also infers those. I see that there is a PR in liac-arff which could be worth pushing forward ;)
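A small illustration of why inferring on the NA-free values helps (this uses `pd.api.types.infer_dtype`; on older pandas versions, inferring on the raw column could report 'mixed' because of the missing entries):

```python
import pandas as pd

# An object column containing missing values: dropping the NA entries
# before inference lets pandas report the type of the real values.
s = pd.Series(['low', 'high', None])
inferred = pd.api.types.infer_dtype(s.dropna())
print(inferred)  # string
```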

@glemaitre (Contributor, author)
The error on the CI is a bit cryptic to me. Does anyone have an idea why it is failing only for some of the builds and not others? I can use this function locally as well.

@glemaitre (Contributor, author)
I think that it could be good to actually re-download the dataset once uploaded, because I don't see any other way to check that converting the dataframe into ARFF went fine.

@glemaitre (Contributor, author)
@jorisvandenbossche If you want to have a look at it

@janvanrijn (Member) commented Oct 7, 2018

The error on the CI is a bit cryptic to me. Anyone has an idea why is it failing only for some of the build and not others. I can use this function locally as well.

I also encountered this error. Opened issue #565 and PR #566. If you merge #566 into this branch it should be fixed.

To add a little bit more context: due to the structure of the specific test, it only manifests itself in certain cases, for example when the test server is wiped empty again (which I did on Friday). This is why it was not picked up earlier and got merged into develop.

# infer the type of data for each column of the DataFrame
attributes_ = attributes_arff_from_df(data)
if isinstance(attributes, dict):
# override the attributes which was specified by the user
Collaborator:

Doesn't this do the exact opposite? It overrides the attributes from the dataframe with the arguments passed by the user, right?

Contributor (author):

Yes. It is the mechanism that allows the user to overwrite some specific attributes. It could be useful to force a specific data type or to specify the categories (e.g. if some categories are missing from the data column).
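The override mechanism can be pictured as a simple merge of the user dict into the inferred list. A sketch with illustrative names, not the PR's actual code:

```python
def override_attributes(inferred, overrides):
    """Merge user-supplied overrides into the inferred attribute list.

    inferred  -- list of (name, arff_type) pairs derived from the DataFrame
    overrides -- dict mapping a column name to the ARFF type (or category
                 list) the user wants to force for that column
    """
    return [(name, overrides.get(name, arff_type))
            for name, arff_type in inferred]
```

For example, `override_attributes(attrs, {'class': ['neg', 'pos']})` would force explicit categories for the `class` column while leaving the others untouched.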

setup.py (outdated)
'nbformat',
'python-dateutil',
'oslo.concurrency',
'pandas',
Collaborator:

In one of your comments you mentioned that this needs to be at least 0.19. Would it make sense to add this piece of information here?


def test_create_dataset_pandas(self):
# pandas is only a optional dependency and we need to skip the test if
# it is not installed.
Collaborator:

This could be deleted now, right?

# raise an error asking to convert all entries to string.
categories = df[column_name].cat.categories
categories_dtype = pd.api.types.infer_dtype(categories)
if categories_dtype != 'string':


You might need to allow 'unicode' here as well for python 2 compat ?

else:
raise ValueError("The dtype '{}' of the column '{}' is not "
"currently supported by liac-arff. Supported "
"dtypes are categorical, string, interger, "


interger -> integer

@glemaitre (Contributor, author)
@mfeurer @janvanrijn @ArlindKadra @amueller

The PR is ready for a full round of review. With the previous work it should be almost ready to be merged (the CI should be green).

@janvanrijn (Member) left a comment

Thanks for making this PR. I find it really awesome. I have not been too involved in this one, so please don't mind my comments too much, but I left some (open) questions. Apart from this, LGTM.

def attributes_arff_from_df(df):
"""Create the attributes as specified by the ARFF format using a dataframe.
Arguments:
Member:

The doc string that you applied seems different from the other docstrings that we use, i.e.,

Parameters
----------

and

Returns
-------

# dropped before the inference instead.
column_dtype = pd.api.types.infer_dtype(df[column_name].dropna())

if column_dtype == 'categorical':
Member:

I missed the discussion in the thread. is this a regular pandas data type? Is there any reason to not use the dtype str?

(the comments below do not seem to help me too much, sorry)

Contributor (author):

This is an available dtype in pandas to indicate that a column is categorical.

Is there any reason to not use the dtype str?

Actually, I don't see why categories should be strings :). Right now liac-arff does not allow otherwise; we could actually think about silently converting the categories to string if a column is of category dtype.
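Such a silent conversion could be as simple as renaming the categories to their string representation. A hypothetical helper, not part of the PR:

```python
import pandas as pd

def stringify_categories(series):
    """Rename the categories of a categorical Series to their string
    representation, so that liac-arff's string-categories requirement
    is met without touching the underlying codes."""
    new_categories = [str(c) for c in series.cat.categories]
    return series.cat.rename_categories(new_categories)
```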

Member:

Cool, thanks for the clarification :)

"Got {} dtype in this column."
.format(column_name, categories_dtype))
attributes_arff.append((column_name, categories.tolist()))
elif column_dtype == 'boolean':
Member:

why not use dtype=bool?

Contributor (author):

Because we are using the pd.api.types.infer_dtype function, which returns a string.

else:
attributes_ = attributes

data = data.values if hasattr(data, "columns") else data
Member:

This seems to imply that internally we store everything as ARFF. Wouldn't it be much cooler to store everything internally as pandas? That would make it much easier to do operations on the datasets (and of course convert before uploading a dataset / performing a run). However, maybe that should be a next PR.

Contributor (author):

I would say that this is another PR.

I would imagine that it would require quite some changes, but if the consensus is to use pandas then why not. Then the data would be tabular only (which is probably the main use case in OpenML).

Contributor:

Sparse data might be an issue, right?

Contributor (author):

Sparse arrays do not have a values attribute, so we should be fine.

If you mean SparseDataFrame, I would say that we should not offer support. They are going to be deprecated in favor of the sparse dtype.

Is liac-arff actually usable with sparse data?

Collaborator:

Yes, it can handle scipy's COO data structure.

Contributor (author):

So this will not be an issue. @amueller, could you elaborate on your thoughts?

@mfeurer (Collaborator) left a comment

This looks really great now, I think there is only one assert statement missing.

"Uploaded ARFF does not match original one"
)

# Check that we can overwrite the attributes
Collaborator:

There's no assert for this, right?

Contributor (author):

On line 797 we checked that the attributes contain 'g', which is not present in the original data. I would consider that an assert. Do you want to test something else?

Collaborator:

Do you mean this?

        self.assertEqual(
            _get_online_dataset_arff(upload_did),
            dataset._dataset,
            "Uploaded ARFF does not match original one"
        )

Either I'm missing something, or there is no strict check that 'g' is in the attributes of the downloaded file? Line 797 only checks that the download is equal to the upload, right?

Contributor (author):

OK, I see; we can add a strict check then.
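A strict check could, for instance, pull the attribute names out of the downloaded ARFF header and assert that 'g' appears among them. A minimal stdlib-only sketch (it ignores quoted attribute names, and it is not the test's actual code):

```python
import re

def attribute_names(arff_text):
    """Return the attribute names declared in an ARFF document's header.

    Limitation: attribute names containing spaces (which ARFF allows when
    quoted) are not handled by this simple pattern.
    """
    return re.findall(r'(?im)^@attribute\s+(\S+)', arff_text)
```

The test could then assert something like `'g' in attribute_names(downloaded_arff)` on the document fetched back from the server.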

Contributor (author):

Actually we could support SparseDataFrame for now. Calling to_coo instead of values should be enough then.

@glemaitre (Contributor, author)
Do you add an entry to document changes in the library, like a what's new or changelog? (I see a progress.rst, maybe.)

@glemaitre glemaitre force-pushed the is/create_dataset_pandas branch from 20dbc23 to ccf7b82 Compare October 22, 2018 16:03
@glemaitre (Contributor, author)
Anything else?

@janvanrijn (Member)
LGTM. Also, I promised Lisheng Sun (PhD student in Paris) to update her once this was merged. I don't have her GitHub handle; does someone have it?

@mfeurer mfeurer merged commit f22c393 into openml:develop Oct 23, 2018
@mfeurer (Collaborator) commented Oct 23, 2018

@janvanrijn @LishengSun

@glemaitre (Contributor, author)
I can do the support for SparseDataFrame in an upcoming PR.
