Issue 540 #547
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #547 +/- ##
===========================================
- Coverage 89.89% 89.81% -0.08%
===========================================
Files 32 32
Lines 2928 2956 +28
===========================================
+ Hits 2632 2655 +23
- Misses 296 301 +5
Continue to review full report at Codecov.
…taset with format attribute

Thanks a lot! Could you please add the following:
    """
    dataset_xml = openml._api_calls._perform_api_call("data/%d" % did)
    # build a dict from the xml and get the format from the dataset description
    return xmltodict\
I am not a big fan of \. I would instead break the line inside the parentheses or reassign the keys into 2 variables. But this is a more personal thing, I assume.
Neither am I. Just trying to keep it as readable as possible.
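Since the thread above contrasts backslash continuation with the parenthesized alternative, here is a minimal sketch of both styles; the plain dict stands in for `xmltodict`'s parse result, so no OpenML call is involved.

```python
# Plain dict standing in for the parsed XML (an assumption for illustration).
parsed = {"oml:data_set_description": {"oml:format": "ARFF"}}

# Backslash continuation, the style questioned in the review:
fmt = parsed\
    ["oml:data_set_description"]["oml:format"]\
    .lower()

# Parenthesized / intermediate-name alternative:
description = parsed["oml:data_set_description"]
fmt = description["oml:format"].lower()
print(fmt)  # arff
```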
    _get_dataset_qualities,
    DATASETS_CACHE_DIR_NAME)
    DATASETS_CACHE_DIR_NAME,
    _get_online_dataset_arff,
Put them in alphabetical order.
    version=1, licence="public", default_target_attribute="class", data_file=file_path)
    "anneal",
    "test",
    data_format="ARFF",
uhm, it should not be "arff" in lower case, even if this is converted afterwards.
You mean it should be "arff"?
Yep
    "UploadTestWithURL", "test", "ARFF",
    "UploadTestWithURL",
    "test",
    data_format="ARFF",
lower case?
yep
In addition, check: basically, a test checking equality between OpenMLDataset(s) is missing.

@glemaitre I agree, that would be a good issue. Not sure if we should handle it in this PR. The reason that I changed the eq method in this PR is that originally I planned to use it, and the way it was implemented did not seem functional.
openml/datasets/functions.py
Outdated
    description : str
        Description of the dataset.
    format : str, optional
        Format of the dataset. Only 'arff' for now.
"Only 'arff' for now" is not correct anymore.
You can add, assuming a deprecation cycle of 2 releases:

    format : str, optional
        Format of the dataset which can be either 'arff' or 'sparse-arff'.
        By default, the format is automatically inferred.

        .. deprecated:: 0.8
            ``format`` is deprecated in 0.8 and will be removed in 0.10.
Oops, quite true.
In addition, you need to add a test to check if the deprecation warning is raised.
Ideally, we should remove the format parameter from all the other tests to avoid deprecation warnings in the test suite.
I might be mistaken, but I think I did that. I have to double check though.
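A hedged sketch of such a deprecation test; the `create_dataset` stub and its signature here are assumptions standing in for the real openml function, not the PR's code.

```python
import warnings

def create_dataset(name, data, format=None):
    # Hypothetical stand-in for openml's function: warn when the
    # deprecated `format` argument is passed explicitly.
    if format is not None:
        warnings.warn(
            "format is deprecated in 0.8 and will be removed in 0.10.",
            DeprecationWarning,
        )
    return {"name": name, "data": data}

# The test records warnings and asserts a DeprecationWarning was raised:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    create_dataset("anneal", [[0, 1]], format="arff")
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
print("deprecation warning raised")
```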
    }

    # Determine arff format from the dataset
    if format is not None:
Ideally, you should use the parameter if given, otherwise make the inference:

    if format is not None:
        warn(...)
        d_format = format
    else:
        if isinstance ...
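Fleshing out that suggestion as a runnable sketch; the helper name `infer_format` and the dict-means-sparse check are assumptions for illustration, not the PR's actual code.

```python
import warnings

def infer_format(data, format=None):
    # Prefer the explicit (deprecated) parameter when given...
    if format is not None:
        warnings.warn("format is deprecated", DeprecationWarning)
        return format
    # ...otherwise infer: treat dicts as sparse, anything else as dense.
    if isinstance(data, dict):
        return "sparse_arff"
    return "arff"

print(infer_format([[1, 2], [3, 4]]))  # arff
print(infer_format({0: {1: 1.0}}))     # sparse_arff
```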
mfeurer left a comment
Finally I was able to do another pass on this one. I think the comments are mostly of stylistic nature and hopefully are rather easy to address.
    - DISTRIB="conda" PYTHON_VERSION="3.6" SKLEARN_VERSION="0.18.2" RUN_FLAKE8="true" SKIP_TESTS="true"
    - DISTRIB="conda" PYTHON_VERSION="3.7" SKLEARN_VERSION="0.19.2"

    before_install:
Could you please add a comment here on why this is necessary as in #560?
yes, I missed doing that.
examples/create_upload_tutorial.py
Outdated
    breast_cancer = sklearn.datasets.load_breast_cancer()
    name = 'BreastCancer(scikit-learn)'
    X = breast_cancer.data
    x = breast_cancer.data
Please leave this upper case (although flake will fail on it). @glemaitre do you have a solution in scikit-learn to avoid countless complaints about X being a bad variable name?
True that we should not use X, but I think it has become the convention now :)
But why do we want to use an uppercase variable name?
Convention. All ML books (I'm aware of) use upper case bold face variables for matrices. We can't use boldface, but we can at least write it upper case.
I would say that this is more of a scikit-learn convention (it is used in examples and in all methods of estimators).
The other solution is to rename X -> data and y -> label; that removes this issue and is explicit.
That's a good point, but we should do this in a separate PR.
@mfeurer then we should probably also make a note on the issue that the PR will solve, so that it removes the ignore for capital variable names?
examples/create_upload_tutorial.py
Outdated
    # The target feature is indicated as meta-data of the
    # dataset (and tasks on that data).

    data = np.concatenate((x, y.reshape((-1, 1))), axis=1)
Now that I see this I find it kind of weird to use a classification dataset as an example for the upload of a numpy array (as the array will be of mixed dtype which makes the usage unintuitive). @ArlindKadra what do you think of changing this to the scikit-learn diabetes dataset?
@mfeurer I get your point. The array is actually of one type; after the concat, the np.ndarray.dtype is '<U32', i.e. the numbers are converted to strings.
With diabetes, I guess after the concatenate all the elements would be floats.
I agree on using the latter.
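The dtype point above can be demonstrated directly; the column values below are made up for illustration.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array(["malignant", "benign"])

# Concatenating floats with a string column upcasts everything to strings:
data = np.concatenate((x, y.reshape((-1, 1))), axis=1)
print(data.dtype)  # a unicode dtype such as <U32, not float

# With a numeric target (as with the diabetes dataset) the result stays float:
y_num = np.array([151.0, 75.0])
data_num = np.concatenate((x, y_num.reshape((-1, 1))), axis=1)
print(data_num.dtype)  # float64
```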
examples/create_upload_tutorial.py
Outdated
    print('URL for dataset: %s/data/%d' % (openml.config.server, upload_did))

    ############################################################################
    # Dataset is a list of lists
Could you please add a note that this can be generalized, and that it basically is only necessary for the object to be a generator returning generators?
examples/create_upload_tutorial.py
Outdated
    )

    wind_dataset = create_dataset(
        name="Wind",
Shouldn't it be called weather?
yes it should
openml/datasets/functions.py
Outdated
    description : str
        Description of the dataset.
    format : str, optional
        Format of the dataset which can be either 'arff' or 'sparse-arff'.
Shouldn't it be sparse_arff?
openml/datasets/functions.py
Outdated
    # Determine ARFF format from the dataset
    else:
        if isinstance(data, list):
            if isinstance(data[0], list):
I don't think this is a strict requirement. It could also be a list of numpy arrays. Maybe check whether it is a dict, and otherwise assume that it is dense and then drop the else statement?
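A small sketch of that point; `looks_sparse` is a hypothetical helper named here for illustration, not code from the PR.

```python
import numpy as np

def looks_sparse(data):
    # Only dicts are treated as sparse; everything else is assumed dense,
    # which also covers a list of numpy arrays, not just a list of lists.
    return isinstance(data, dict)

print(looks_sparse([[1, 2], [3, 4]]))                      # False
print(looks_sparse([np.array([1, 2]), np.array([3, 4])]))  # False
print(looks_sparse({0: {1: 1.0}}))                         # True
```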
    # build a dict from the xml and get the format from the dataset description
    return xmltodict\
        .parse(dataset_xml)['oml:data_set_description']['oml:format']\
        .lower()
Please add tests for this function and the function above.
openml-python/tests/test_datasets/test_dataset_functions.py, lines 468 to 477 in e711267:

    self.assertEqual(
        _get_online_dataset_arff(upload_did),
        dataset._dataset,
        "Uploaded ARFF does not match original one"
    )
    self.assertEqual(
        _get_online_dataset_format(upload_did),
        'arff',
        "Wrong format for dataset"
    )

openml-python/tests/test_datasets/test_dataset_functions.py, lines 511 to 520 in e711267:

    self.assertEqual(
        _get_online_dataset_arff(upload_did),
        xor_dataset._dataset,
        "Uploaded ARFF does not match original one"
    )
    self.assertEqual(
        _get_online_dataset_format(upload_did),
        'sparse_arff',
        "Wrong format for dataset"
    )
@mfeurer Now that I am thinking about it, the above checks are happening in test_create_dataset_sparse and test_create_dataset_list. Although they check that the datasets are uploaded correctly, they also check that the functions work correctly. In my opinion this already covers the functions; any other test where we do not upload might fail in the future if information is removed from the test server.
True, but this is also suboptimal because it is not immediately clear where this is tested. How about a simple test, like for all the other functions which retrieve dataset-related things?
I agree that it is not clear where it is tested. Are you thinking of checking an existing dataset on the server?
tests/test_datasets/test_dataset.py
Outdated
    import numpy as np
    from scipy import sparse
    import six
    import openml
Please rearrange the imports to be PEP8 compliant.
    if sys.version_info[0] >= 3:
        from unittest import mock
    else:
        import mock
Please undo this reordering.
examples/create_upload_tutorial.py
Outdated
    x = breast_cancer.data
    y = breast_cancer.target
    target_names = breast_cancer.target_names
    y = np.array([target_names[i] for i in y])
I would force the array to be of object dtype since it contains strings.
Just noticed this a few hours ago. I would agree with this; however, we are going to use another dataset instead.
…her dataset, adding creator of weather dataset
Reference Issue
#540
What does this PR implement/fix? Explain your changes.
Makes the arff attribute optional for the dataset, fixes the target type in the dataset publish doc to Nominal from Real. create_dataset is imported into the openml.datasets namespace and a unit test to upload a dataset as a list of lists is added.
How should this PR be tested?
Added a unit test; other changes are covered by the default tests.