
Conversation

@ArlindKadra
Member

Reference Issue

#540

What does this PR implement/fix? Explain your changes.

Makes the arff attribute optional for the dataset and fixes the target type in the dataset publish doc from Real to Nominal. create_dataset is imported into the openml.datasets namespace, and a unit test for uploading a dataset as a list of lists is added.

How should this PR be tested?

Through the added unit test; the other changes are covered by the existing tests.

@ArlindKadra ArlindKadra requested a review from mfeurer September 21, 2018 12:23
@codecov-io

codecov-io commented Sep 21, 2018

Codecov Report

Merging #547 into develop will decrease coverage by 0.07%.
The diff coverage is 85.71%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #547      +/-   ##
===========================================
- Coverage    89.89%   89.81%   -0.08%     
===========================================
  Files           32       32              
  Lines         2928     2956      +28     
===========================================
+ Hits          2632     2655      +23     
- Misses         296      301       +5
Impacted Files Coverage Δ
openml/datasets/__init__.py 100% <100%> (ø) ⬆️
openml/datasets/dataset.py 77.41% <70%> (-0.81%) ⬇️
openml/datasets/functions.py 90.79% <96.29%> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ef4694...0c66cfc. Read the comment docs.

@mfeurer
Collaborator

mfeurer commented Sep 24, 2018

Thanks a lot! Could you please add the following:

  • download the uploaded dataset after publishing and check whether the upload worked properly
  • the class attribute should be categorical, not an integer
  • upload the weather dataset as an example, not only as a unit test
  • add support, tests, and an example for sparse data

@openml openml deleted a comment from ArlindKadra Sep 24, 2018
@mfeurer mfeurer mentioned this pull request Sep 24, 2018
"""
dataset_xml = openml._api_calls._perform_api_call("data/%d" % did)
# build a dict from the xml and get the format from the dataset description
return xmltodict\
Contributor

I am not a big fan of \. I would instead break the line inside parentheses or assign the keys to two intermediate variables. But this is more of a personal thing, I assume.

Member Author

Neither am I. Just trying to keep it as readable as possible.
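For illustration, here is a minimal sketch of the parenthesized style being discussed, with a hypothetical dict standing in for the result of `xmltodict.parse(dataset_xml)`:

```python
# Hypothetical parsed result, standing in for xmltodict.parse(dataset_xml);
# the keys mirror the OpenML dataset-description XML.
parsed = {"oml:data_set_description": {"oml:format": "Sparse_ARFF"}}

# Break the long chained expression inside parentheses rather than
# continuing lines with '\':
data_format = (
    parsed["oml:data_set_description"]["oml:format"]
    .lower()
)
```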

_get_dataset_qualities,
DATASETS_CACHE_DIR_NAME)
DATASETS_CACHE_DIR_NAME,
_get_online_dataset_arff,
Contributor

Put them in alphabetical order.

version=1, licence="public", default_target_attribute="class", data_file=file_path)
"anneal",
"test",
data_format="ARFF",
Contributor

uhm, shouldn't this be "arff" in lower case, even if it is converted afterwards?

Member Author

you mean it should be "arff".

Contributor

Yep

"UploadTestWithURL", "test", "ARFF",
"UploadTestWithURL",
"test",
data_format="ARFF",
Contributor

lower case?

Member Author

yep

@glemaitre
Contributor

glemaitre commented Oct 7, 2018

In addition, check:

https://codecov.io/gh/openml/openml-python/pull/547/diff

Basically a test checking the equality between OpenMLDataset(s) is missing

@ArlindKadra
Member Author

In addition, check:

https://codecov.io/gh/openml/openml-python/pull/547/diff

Basically a test checking the equality between OpenMLDataset(s) is missing

@glemaitre I agree, that would be a good issue. I am not sure we should handle it in this PR, though. The reason I changed the eq method in this PR is that I originally planned to use it, and the way it was implemented did not seem functional.

description : str
Description of the dataset.
format : str, optional
Format of the dataset. Only 'arff' for now.
Contributor

"Only 'arff' for now" is not correct anymore.

Contributor

You can add
Assuming a deprecation cycle of 2 releases:

format : str, optional
    Format of the dataset which can be either 'arff' or 'sparse-arff'.
    By default, the format is automatically inferred.

    .. deprecated:: 0.8
       ``format`` is deprecated in 0.8 and will be removed in 0.10.

Member Author

Oops, quite true.

Contributor

In addition, you need to add a test to check that the deprecation warning is raised.
Ideally, we should remove the format parameter from all the other tests to avoid deprecation warnings in the test suite.

Member Author

I might be mistaken, but I think I did that. I have to double check though.

}

# Determine arff format from the dataset
if format is not None:
Contributor

Ideally, you should use the parameters if given otherwise you make the inference

if format is not None:
    warn(...)
    d_format = format
else:
    if isinstance ...

Collaborator
@mfeurer mfeurer left a comment

Finally I was able to do another pass on this one. I think the comments are mostly of stylistic nature and hopefully are rather easy to address.

- DISTRIB="conda" PYTHON_VERSION="3.6" SKLEARN_VERSION="0.18.2" RUN_FLAKE8="true" SKIP_TESTS="true"
- DISTRIB="conda" PYTHON_VERSION="3.7" SKLEARN_VERSION="0.19.2"

before_install:
Collaborator

Could you please add a comment here on why this is necessary as in #560?

Member Author

yes, I missed doing that.

breast_cancer = sklearn.datasets.load_breast_cancer()
name = 'BreastCancer(scikit-learn)'
X = breast_cancer.data
x = breast_cancer.data
Collaborator

Please leave this upper case (although flake will fail on it). @glemaitre do you have a solution in scikit-learn to avoid countless complaints about X being a bad variable name?


Member Author

But why do we want to use an uppercase variable name?

Collaborator

Convention. All ML books (I'm aware of) use upper case bold face variables for matrices. We can't use boldface, but we can at least write it upper case.

Contributor

I would say that this is more of a scikit-learn convention (it is used in the examples and in all estimator methods).
The other solution is to rename X -> data and y -> label; that removes the issue and is explicit.

Collaborator

That's a good point, but we should do this in a separate PR.

Member Author

@mfeurer then we should probably also add a note to that issue, so that the PR solving it removes the flake8 ignore for capitalized variable names?

# The target feature is indicated as meta-data of the
# dataset (and tasks on that data).

data = np.concatenate((x, y.reshape((-1, 1))), axis=1)
Collaborator

Now that I see this I find it kind of weird to use a classification dataset as an example for the upload of a numpy array (as the array will be of mixed dtype which makes the usage unintuitive). @ArlindKadra what do you think of changing this to the scikit-learn diabetes dataset?

Member Author

@mfeurer I get your point. The array is actually of one type: after the concat, the np.ndarray.dtype is '<U32', i.e. the numbers are converted to strings.
With the diabetes dataset, I guess all elements would be floats after the concatenation.
I agree on using the latter.
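A small illustration of the dtype behavior described above (the array values are made up): stacking a float feature matrix with a string target column upcasts everything to a unicode dtype, whereas an all-numeric target keeps floats.

```python
import numpy as np

# Toy feature matrix standing in for the scikit-learn data.
x = np.array([[1.0, 2.0], [3.0, 4.0]])

# A string target column upcasts the whole array to a unicode dtype
# ('<U32' when combined with float64).
y_str = np.array(["a", "b"])
mixed = np.concatenate((x, y_str.reshape((-1, 1))), axis=1)

# A numeric target (as with the diabetes dataset) keeps everything float.
y_num = np.array([0.5, 1.5])
numeric = np.concatenate((x, y_num.reshape((-1, 1))), axis=1)
```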

print('URL for dataset: %s/data/%d' % (openml.config.server, upload_did))

############################################################################
# Dataset is a list of lists
Collaborator

Could you please add a note that this can be generalized, and that it is basically only necessary for the object to be a generator returning generators?
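A minimal sketch of that generalization: any iterable yielding iterable rows can be materialized the same way as a list of lists, e.g. a generator that yields generators (names and values here are illustrative only).

```python
def rows():
    # A generator returning generators: each yielded row is itself a
    # generator over that row's values.
    for i in range(3):
        yield (i * 10 + j for j in range(2))

# Consuming it row by row recovers the same structure as a list of lists.
materialized = [list(row) for row in rows()]
```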

)

wind_dataset = create_dataset(
name="Wind",
Collaborator

Shouldn't it be called weather?

Member Author

yes it should

description : str
Description of the dataset.
format : str, optional
Format of the dataset which can be either 'arff' or 'sparse-arff'.
Collaborator

Shouldn't it be 'sparse_arff'?

# Determine ARFF format from the dataset
else:
if isinstance(data, list):
if isinstance(data[0], list):
Collaborator

I don't think this is a strict requirement. It could also be a list of numpy arrays. Maybe check whether it is a dict, and otherwise assume that it is dense and then drop the else statement?
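A sketch of the looser check suggested here, using a hypothetical `infer_format` helper: a dict is treated as sparse, and everything else, including a list of numpy arrays, is assumed dense, with no inspection of `data[0]`.

```python
import numpy as np

def infer_format(data):
    # Hypothetical helper following the suggestion above: a dict means
    # sparse data; anything else is assumed dense, so a list of numpy
    # arrays is accepted just like a list of lists.
    return 'sparse_arff' if isinstance(data, dict) else 'arff'

# Dense data may be a list of numpy arrays rather than a list of lists.
dense_rows = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
```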

# build a dict from the xml and get the format from the dataset description
return xmltodict\
.parse(dataset_xml)['oml:data_set_description']['oml:format']\
.lower()
Collaborator

Please add tests for this function and the function above.

Member Author

self.assertEqual(
_get_online_dataset_arff(upload_did),
dataset._dataset,
"Uploaded ARFF does not match original one"
)
self.assertEqual(
_get_online_dataset_format(upload_did),
'arff',
"Wrong format for dataset"
)

self.assertEqual(
_get_online_dataset_arff(upload_did),
xor_dataset._dataset,
"Uploaded ARFF does not match original one"
)
self.assertEqual(
_get_online_dataset_format(upload_did),
'sparse_arff',
"Wrong format for dataset"
)

@mfeurer Now that I am thinking about it, the above checks already happen in test_create_dataset_sparse and test_create_dataset_list. Although they check that the datasets are uploaded correctly, they also check that the functions work correctly. In my opinion this already covers the functions; any other test where we do not upload might fail in the future if information is removed from the test server.

Collaborator

True, but this is also suboptimal because it is not immediately clear where this is tested. How about a simple test, like for all the other functions that retrieve dataset-related information?

Member Author

I agree that it is not clear where it is tested. Are you thinking of checking an existing dataset on the server?

import numpy as np
from scipy import sparse
import six
import openml
Collaborator

Please rearrange the imports to be PEP8 compliant.

if sys.version_info[0] >= 3:
from unittest import mock
else:
import mock
Collaborator

Please undo this reordering.

x = breast_cancer.data
y = breast_cancer.target
target_names = breast_cancer.target_names
y = np.array([target_names[i] for i in y])
Contributor

I would force the array to be of object dtype, since it contains strings.

Member Author
@ArlindKadra ArlindKadra Oct 15, 2018

I just noticed this a few hours ago. I would agree with this; however, we are instead going to go for another dataset.
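For reference, a minimal sketch of the object-dtype suggestion above (the target names and labels are made up for illustration): passing `dtype=object` keeps the strings as Python objects instead of a fixed-width unicode dtype.

```python
import numpy as np

# Illustrative class names and integer-encoded targets, standing in for
# breast_cancer.target_names and breast_cancer.target.
target_names = np.array(["malignant", "benign"])
y_int = np.array([0, 1, 1, 0])

# Force object dtype explicitly, as suggested, rather than letting
# numpy choose a fixed-width unicode dtype such as '<U9'.
y = np.array([target_names[i] for i in y_int], dtype=object)
```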

@mfeurer mfeurer merged commit 8ed133e into develop Oct 17, 2018
@mfeurer mfeurer deleted the fix540 branch October 17, 2018 12:24