
Conversation

@ArlindKadra
Member

Reference Issue

#540

What does this PR implement/fix? Explain your changes.

Makes the arff attribute optional for the dataset and fixes the target type in the dataset publish doc from Real to Nominal. create_dataset is imported into the openml.datasets namespace, and a unit test for uploading a dataset as a list of lists is added.

How should this PR be tested?

Through the added unit test; the other changes are covered by the existing tests.

@ArlindKadra ArlindKadra requested a review from mfeurer September 21, 2018 12:23
@codecov-io

codecov-io commented Sep 21, 2018

Codecov Report

Merging #547 into develop will decrease coverage by 0.07%.
The diff coverage is 85.71%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #547      +/-   ##
===========================================
- Coverage    89.89%   89.81%   -0.08%     
===========================================
  Files           32       32              
  Lines         2928     2956      +28     
===========================================
+ Hits          2632     2655      +23     
- Misses         296      301       +5
Impacted Files Coverage Δ
openml/datasets/__init__.py 100% <100%> (ø) ⬆️
openml/datasets/dataset.py 77.41% <70%> (-0.81%) ⬇️
openml/datasets/functions.py 90.79% <96.29%> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ef4694...0c66cfc. Read the comment docs.

@mfeurer
Collaborator

mfeurer commented Sep 24, 2018

Thanks a lot! Could you please add the following:

  • download the uploaded dataset after publishing and check whether the upload worked properly
  • the class attribute should be categorical, not an integer
  • upload the weather dataset as an example, not only as a unit test
  • add support, tests, and an example for sparse data

@openml openml deleted a comment from ArlindKadra Sep 24, 2018
@mfeurer mfeurer mentioned this pull request Sep 24, 2018
"""
dataset_xml = openml._api_calls._perform_api_call("data/%d" % did)
# build a dict from the xml and get the format from the dataset description
return xmltodict\
Contributor

I am not a big fan of \. I would instead break the line inside parentheses or assign the keys to two intermediate variables. But this is more of a personal thing, I assume.

Member Author

Neither am I. Just trying to keep it as readable as possible.
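For illustration, here is a minimal sketch of the parenthesized style being discussed, with a hypothetical dict standing in for the result of `xmltodict.parse(dataset_xml)`:

```python
# Hypothetical parsed result, standing in for xmltodict.parse(dataset_xml);
# the keys mirror the OpenML dataset-description XML.
parsed = {"oml:data_set_description": {"oml:format": "Sparse_ARFF"}}

# Break the long chained expression inside parentheses rather than
# continuing lines with '\':
data_format = (
    parsed["oml:data_set_description"]["oml:format"]
    .lower()
)
```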

_get_dataset_qualities,
DATASETS_CACHE_DIR_NAME)
DATASETS_CACHE_DIR_NAME,
_get_online_dataset_arff,
Contributor

Put them in alphabetical order.

version=1, licence="public", default_target_attribute="class", data_file=file_path)
"anneal",
"test",
data_format="ARFF",
Contributor

uhm, shouldn't this be "arff" in lower case, even if it is converted afterwards?

Member Author

you mean it should be "arff".

Contributor

Yep

"UploadTestWithURL", "test", "ARFF",
"UploadTestWithURL",
"test",
data_format="ARFF",
Contributor

lower case?

Member Author

yep

@glemaitre
Contributor

glemaitre commented Oct 7, 2018

In addition, check:

https://codecov.io/gh/openml/openml-python/pull/547/diff

Basically a test checking the equality between OpenMLDataset(s) is missing

@ArlindKadra
Member Author

In addition, check:

https://codecov.io/gh/openml/openml-python/pull/547/diff

Basically a test checking the equality between OpenMLDataset(s) is missing

@glemaitre I agree, that would be a good issue. I am not sure we should handle it in this PR, though. The reason I changed the eq method in this PR is that I originally planned to use it, and the way it was implemented did not seem functional.

description : str
Description of the dataset.
format : str, optional
Format of the dataset. Only 'arff' for now.
Contributor

"Only 'arff' for now" is not correct anymore.

Contributor

You can add
Assuming a deprecation cycle of 2 releases:

format : str, optional
    Format of the dataset which can be either 'arff' or 'sparse-arff'.
    By default, the format is automatically inferred.

    .. deprecated:: 0.8
       ``format`` is deprecated in 0.8 and will be removed in 0.10.

Member Author

Oops, quite true.

Contributor

In addition, you need to add a test to check that the deprecation warning is raised.
Ideally, we should remove the format parameter from all the other tests to avoid deprecation warnings in the test suite.

Member Author

I might be mistaken, but I think I did that. I have to double check though.

}

# Determine arff format from the dataset
if format is not None:
Contributor

Ideally, you should use the parameters if given otherwise you make the inference

if format is not None:
    warn(...)
    d_format = format
else:
    if isinstance ...

Collaborator
@mfeurer mfeurer left a comment

Finally I was able to do another pass on this one. I think the comments are mostly of stylistic nature and hopefully are rather easy to address.

- DISTRIB="conda" PYTHON_VERSION="3.6" SKLEARN_VERSION="0.18.2" RUN_FLAKE8="true" SKIP_TESTS="true"
- DISTRIB="conda" PYTHON_VERSION="3.7" SKLEARN_VERSION="0.19.2"

before_install:
Collaborator

Could you please add a comment here on why this is necessary as in #560?

Member Author

yes, I missed doing that.

breast_cancer = sklearn.datasets.load_breast_cancer()
name = 'BreastCancer(scikit-learn)'
X = breast_cancer.data
x = breast_cancer.data
Collaborator

Please leave this upper case (although flake will fail on it). @glemaitre do you have a solution in scikit-learn to avoid countless complaints about X being a bad variable name?


Member Author

But why do we want to use an uppercase variable name?

Collaborator

Convention. All ML books (I'm aware of) use upper case bold face variables for matrices. We can't use boldface, but we can at least write it upper case.

Contributor

I would say that this is more of a scikit-learn convention (it is used in the examples and in all estimator methods).
The other solution is to rename X -> data and y -> label; that removes the issue and is explicit.

Collaborator

That's a good point, but we should do this in a separate PR.

Member Author

@mfeurer then we should probably also add a note to that issue, so that the PR solving it removes the flake8 ignore for capitalized variable names?

# The target feature is indicated as meta-data of the
# dataset (and tasks on that data).

data = np.concatenate((x, y.reshape((-1, 1))), axis=1)
Collaborator

Now that I see this I find it kind of weird to use a classification dataset as an example for the upload of a numpy array (as the array will be of mixed dtype which makes the usage unintuitive). @ArlindKadra what do you think of changing this to the scikit-learn diabetes dataset?

Member Author

@mfeurer I get your point. The array is actually of one type: after the concat, the np.ndarray.dtype is '<U32', i.e. the numbers are converted to strings.
With the diabetes dataset, I guess all elements would be floats after the concatenation.
I agree on using the latter.
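A small illustration of the dtype behavior described above (the array values are made up): stacking a float feature matrix with a string target column upcasts everything to a unicode dtype, whereas an all-numeric target keeps floats.

```python
import numpy as np

# Toy feature matrix standing in for the scikit-learn data.
x = np.array([[1.0, 2.0], [3.0, 4.0]])

# A string target column upcasts the whole array to a unicode dtype
# ('<U32' when combined with float64).
y_str = np.array(["a", "b"])
mixed = np.concatenate((x, y_str.reshape((-1, 1))), axis=1)

# A numeric target (as with the diabetes dataset) keeps everything float.
y_num = np.array([0.5, 1.5])
numeric = np.concatenate((x, y_num.reshape((-1, 1))), axis=1)
```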

print('URL for dataset: %s/data/%d' % (openml.config.server, upload_did))

############################################################################
# Dataset is a list of lists
Collaborator

Could you please add a note that this can be generalized, and that it is basically only necessary for the object to be a generator returning generators?
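A minimal sketch of that generalization: any iterable yielding iterable rows can be materialized the same way as a list of lists, e.g. a generator that yields generators (names and values here are illustrative only).

```python
def rows():
    # A generator returning generators: each yielded row is itself a
    # generator over that row's values.
    for i in range(3):
        yield (i * 10 + j for j in range(2))

# Consuming it row by row recovers the same structure as a list of lists.
materialized = [list(row) for row in rows()]
```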

)

wind_dataset = create_dataset(
name="Wind",
Collaborator

Shouldn't it be called weather?

Member Author

yes it should

description : str
Description of the dataset.
format : str, optional
Format of the dataset which can be either 'arff' or 'sparse-arff'.
Collaborator

Shouldn't it be 'sparse_arff'?

# Determine ARFF format from the dataset
else:
if isinstance(data, list):
if isinstance(data[0], list):
Collaborator

I don't think this is a strict requirement. It could also be a list of numpy arrays. Maybe check whether it is a dict, and otherwise assume that it is dense and then drop the else statement?
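A sketch of the looser check suggested here, using a hypothetical `infer_format` helper: a dict is treated as sparse, and everything else, including a list of numpy arrays, is assumed dense, with no inspection of `data[0]`.

```python
import numpy as np

def infer_format(data):
    # Hypothetical helper following the suggestion above: a dict means
    # sparse data; anything else is assumed dense, so a list of numpy
    # arrays is accepted just like a list of lists.
    return 'sparse_arff' if isinstance(data, dict) else 'arff'

# Dense data may be a list of numpy arrays rather than a list of lists.
dense_rows = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
```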

# build a dict from the xml and get the format from the dataset description
return xmltodict\
.parse(dataset_xml)['oml:data_set_description']['oml:format']\
.lower()
Collaborator

Please add tests for this function and the function above.

Member Author

self.assertEqual(
_get_online_dataset_arff(upload_did),
dataset._dataset,
"Uploaded ARFF does not match original one"
)
self.assertEqual(
_get_online_dataset_format(upload_did),
'arff',
"Wrong format for dataset"
)

self.assertEqual(
_get_online_dataset_arff(upload_did),
xor_dataset._dataset,
"Uploaded ARFF does not match original one"
)
self.assertEqual(
_get_online_dataset_format(upload_did),
'sparse_arff',
"Wrong format for dataset"
)

@mfeurer Now that I am thinking about it, the above checks already happen in test_create_dataset_sparse and test_create_dataset_list. Although they check that the datasets are uploaded correctly, they also check that the functions work correctly. In my opinion this already covers the functions; any other test where we do not upload might fail in the future if information is removed from the test server.

Collaborator

True, but this is also suboptimal because it is not immediately clear where this is tested. How about a simple test, like for all the other functions that retrieve dataset-related information?

Member Author

I agree that it is not clear where it is tested. Are you thinking of checking an existing dataset on the server?

import numpy as np
from scipy import sparse
import six
import openml
Collaborator

Please rearrange the imports to be PEP8 compliant.

if sys.version_info[0] >= 3:
from unittest import mock
else:
import mock
Collaborator

Please undo this reordering.

x = breast_cancer.data
y = breast_cancer.target
target_names = breast_cancer.target_names
y = np.array([target_names[i] for i in y])
Contributor

I would force the array to be of object dtype, since it contains strings.

Member Author
@ArlindKadra ArlindKadra Oct 15, 2018

I just noticed this a few hours ago. I would agree with this; however, we are instead going to go for another dataset.
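For reference, a minimal sketch of the object-dtype suggestion above (the target names and labels are made up for illustration): passing `dtype=object` keeps the strings as Python objects instead of a fixed-width unicode dtype.

```python
import numpy as np

# Illustrative class names and integer-encoded targets, standing in for
# breast_cancer.target_names and breast_cancer.target.
target_names = np.array(["malignant", "benign"])
y_int = np.array([0, 1, 1, 0])

# Force object dtype explicitly, as suggested, rather than letting
# numpy choose a fixed-width unicode dtype such as '<U9'.
y = np.array([target_names[i] for i in y_int], dtype=object)
```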

@mfeurer mfeurer merged commit 8ed133e into develop Oct 17, 2018
@mfeurer mfeurer deleted the fix540 branch October 17, 2018 12:24