-
-
Notifications
You must be signed in to change notification settings - Fork 211
[WIP] Fix and improve dataset upload #440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…he dataset functions module
mfeurer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
Could you please remove the function the publish function, too? I think that this function currently does not provide enough information to the user on how to use it and is therefore not really helpful. Also, besides for re-uploading of a fixed dataset, I don't see a reason to keep this. Maybe we should add a helper function to re-upload a fixed dataset?
openml/datasets/functions.py
Outdated
|
|
||
|
|
||
| def upload_dataset(dataset_description, file): | ||
| """Upload a dataset to OpenMl. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use an uppercase L in OpenML.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type, fixed on the next commit.
| dataset = OpenMLDataset( | ||
| name="anneal", version=1, description="test", | ||
| format="ARFF", licence="public", default_target_attribute="class", data_file=file_path) | ||
| "anneal", "test", "ARFF", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to do these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I changed name, description and format to positional arguments as they were mandatory when uploading a dataset to the server. To reflect the changes on the dataset class, I made the changes above to the unit tests.
| dataset = OpenMLDataset( | ||
| name="UploadTestWithURL", version=1, description="test", | ||
| format="ARFF", | ||
| "UploadTestWithURL", "test", "ARFF", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to do these changes?
openml/datasets/dataset.py
Outdated
| xml_dataset += "<oml:{0}>{1}</oml:{0}>\n".format(prop, content) | ||
| xml_dataset += "</oml:data_set_description>" | ||
| return xml_dataset | ||
| #if isinstance(content, (list,set)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please remove those comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
openml/datasets/functions.py
Outdated
| ---------- | ||
| dataset_description : OpenMLDataset | ||
| OpenMLDataset which contains the description of the dataset. | ||
| file : str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if this was an arff dictionary as expected by liac-arff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the function as we discussed regarding the arff file.
openml/datasets/functions.py
Outdated
| Parameters | ||
| ---------- | ||
| dataset_description : OpenMLDataset | ||
| OpenMLDataset which contains the description of the dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is a great idea, and I think besides for re-uploading a dataset this will be handy very often. Could you please change it so that the user needs to pass all relevant information separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not agree on this one, I think we should pass an OpenMLDataset as it encapsulates all the necessary information and the user can decide what he wants to fill, aside from the mandatory fields. I also did not increase the number of function arguments, as some can be retrieved from the OpenMLDataset object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Probably we should then move the logic of this function to publish and actually add a function called create_dataset. The user would then pass all arguments necessary to create the dataset object locally and could then upload it.
mfeurer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Could you please make the requested changes? As I cannot approve this as I created the PR in the first place, please also approve and squash and merge it.
openml/datasets/functions.py
Outdated
| License of the data. | ||
| attributes: list | ||
| A list of tuples. Each tuple consists of the attribute name and type. | ||
| data : numpy.matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct, it should be array-like or sparse matrix, shape=(n_samples, n_features)
openml/datasets/functions.py
Outdated
| attributes: list | ||
| A list of tuples. Each tuple consists of the attribute name and type. | ||
| data : numpy.matrix | ||
| A matrix that contains both the attributes and targets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the term array instead of matrix because arrays are more common in python.
openml/datasets/functions.py
Outdated
| Returns | ||
| ------- | ||
| class:`openml.OpenMLDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a formatting error.
examples/Dataset_import.ipynb
Outdated
| ], | ||
| "source": [ | ||
| "%load_ext autoreload\n", | ||
| "%autoreload 2\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please remove the autoreload extension and run the notebook again to remove the text?
Fixes/Implements #337, #331, #292, #117 and #113.