-
-
Notifications
You must be signed in to change notification settings - Fork 211
Adding object summary #701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
mfeurer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, I added quite a few comments. Could you please apply the comments about string formatting etc to all applicable places? Could you also please use the format function instead of string interpolation as discussed in the roadmap?
openml/datasets/dataset.py
Outdated
| date = '%-14s: %s\n' % ("Upload Date", object_dict['upload_date'].replace('T', ' ')) | ||
| licence = '%-14s: %s\n' % ("Licence", object_dict['licence']) | ||
| d_url = '%-14s: %s\n' % ("Download URL", object_dict['url']) | ||
| base_url = 'https://www.openml.org/d/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please use the server URL given by openml.config.server? In case the user works against the test server, this URL would be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure of what to do here.
Shall I change the base_url = 'https://www.openml.org/d/' to base_url = 'https://test.openml.org/d/' ?
However, https://www.openml.org/d/31 and https://test.openml.org/d/31 don't point to the same resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, he suggests to look at openml.config.server to see to which server the client is connected.
Right now, when connected to the test server, a user will be shown false information.
Try for example:
import openml
openml.config.start_using_configuration_for_example()
print(openml.datasets.get_dataset(128))prints OpenML URL....: https://www.openml.org/d/128 where it should be OpenML URL....: https://test.openml.org/d/128.
I think changing the base_url to base_url="{}d/".format(openml.config.server[:-len('api/v1/xml')]) should fix the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Made the changes.
openml/datasets/dataset.py
Outdated
| def __str__(self): | ||
| object_dict = self.__dict__ | ||
| output_str = '' | ||
| name = '\n%-14s: %s\n' % ("Name", object_dict['name']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is necessary to add the identifier via string interpolation, you can simply have it in the main string.
openml/datasets/dataset.py
Outdated
| pickle_file = '%-14s: %s\n' % ("Pickle file", object_dict['data_pickle_file']) | ||
| num_instances = '' | ||
| if object_dict['qualities']['NumberOfInstances'] is not None: | ||
| num_instances = '%-14s: %d\n' % ("# of instances", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please move the "# of instances" to the next line?
openml/datasets/dataset.py
Outdated
| num_instances = '' | ||
| if object_dict['qualities']['NumberOfInstances'] is not None: | ||
| num_instances = '%-14s: %d\n' % ("# of instances", | ||
| object_dict['qualities']['NumberOfInstances']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add the number of features, too? And also the number of classes if that's available?
openml/tasks/task.py
Outdated
| class_labels = '%-20s: %s\n' % ("# of Classes", len(object_dict['class_labels'])) | ||
| if 'cost_matrix' in object_dict: | ||
| cost_matrix = '%-20s: %s\n' % ("Cost Matrix", "Available") | ||
| output_str = task_type + task_id + task_url + evaluation_measure + estimation_procedure + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please switch the estimation procedure and the evaluation measure?
openml/flows/flow.py
Outdated
| language = '%-16s: %s\n' % ('Language', object_dict['language']) | ||
| dependencies = '%-16s: %s\n' % ('Dependencies', object_dict['dependencies']) | ||
| # 3740 for example | ||
| output_str = id + version + url + name + description + binary + upload + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name should probably be above the URL? Also, what do you think of Flow ID: %d (version %d) to make the display of the version more compact?
openml/flows/flow.py
Outdated
| binary = '%-16s: %s\n\n' % ('Binary URL', object_dict['binary_url']) | ||
|
|
||
| upload = '%-16s: %s\n' % ('Upload Date', object_dict['upload_date'].replace('T', ' ')) | ||
| language = '%-16s: %s\n' % ('Language', object_dict['language']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the language, it is kind of unimportant (it is mostly English anyway).
openml/study/study.py
Outdated
| object_dict['creation_date'].replace('T', ' ')) | ||
| output_str = id + name + status + main_entity_type + study_url + data + \ | ||
| tasks + flows + runs + creator + upload_time | ||
| return(output_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please no brackets around the return values.
openml/study/study.py
Outdated
| creator = '\n%-16s: %s\n' % ("Creator", url) | ||
| upload_time = '%-16s: %s\n' % ("Upload Time", | ||
| object_dict['creation_date'].replace('T', ' ')) | ||
| output_str = id + name + status + main_entity_type + study_url + data + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about adding a header to the output string, for example
OpenML Dataset
==============
ID: 5 (version 1)
...
obviously, this needs to be adapted for the different entities. Also, you'd need to create separate functions for the study and the suite class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added __str__ functions for the other two classes which inherit this class, OpenMLStudy and OpenMLBenchmarkSuite. Their __str__ appends the relevant header to the base class' __str__ string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the string representations should have leading and trailing newlines. Was this ever discussed? Other large pretty prints like pandas.DataFrame or numpy.ndarray don't use them either.
What is the reason for referencing self through object_dict instead?
There are some magic numbers going on here, such as the length of the left column (i.e. number of trailing periods). I suspect it has to do with however long the longest line is. There is also a lot of repetition. Let me suggest some refactoring, for e.g. dataset:
def __str__(self):
header = "OpenML Dataset"
header = '{}\n{}\n'.format(header, '=' * len(header))
openml_url = "{}d/{}".format(openml.config.server[:-len('api/v1/xml')], self.dataset_id)
fields = [("Name", self.name),
("Version", self.version),
("Format", self.format),
("Upload Date", self.upload_date.replace('T', ' ')),
("Licence", self.licence),
("Download URL", self.url),
("OpenML URL", openml_url),
("Data file", self.data_file),
("Pickle file", self.data_pickle_file),
("# of features", len(self.features))]
if self.qualities['NumberOfInstances'] is not None:
fields.append(("# of instances", self.qualities['NumberOfInstances']))
longest_field_name_length = max(len(name) for name, value in fields)
field_line_format = "{{:.<{}}}: {{}}".format(longest_field_name_length)
body = '\n'.join(field_line_format.format(name, value) for name, value in fields)
return header + bodyNote that this refactor incorporates the advice to look at openml.config.server to determine the base_url , discards leading/trailing newlines and directly uses self.
Use list.insert instead of list.append for optional fields that might be inserted in the middle of a body.
Other than that, it gives identical output (as long as the magic number was indeed determined by the longest field name).
openml/datasets/dataset.py
Outdated
| date = '%-14s: %s\n' % ("Upload Date", object_dict['upload_date'].replace('T', ' ')) | ||
| licence = '%-14s: %s\n' % ("Licence", object_dict['licence']) | ||
| d_url = '%-14s: %s\n' % ("Download URL", object_dict['url']) | ||
| base_url = 'https://www.openml.org/d/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, he suggests to look at openml.config.server to see to which server the client is connected.
Right now, when connected to the test server, a user will be shown false information.
Try for example:
import openml
openml.config.start_using_configuration_for_example()
print(openml.datasets.get_dataset(128))prints OpenML URL....: https://www.openml.org/d/128 where it should be OpenML URL....: https://test.openml.org/d/128.
I think changing the base_url to base_url="{}d/".format(openml.config.server[:-len('api/v1/xml')]) should fix the issue.
No, it wasn't discussed. These implementations were my independent design choices. Have corrected them as per your suggestions.
Oh, this was a silly mistake wherein I was using the
This looks really good. I wanted to do something about the character padding. Wasn't aware of the escape formatting with
I understood the reasoning behind |
|
I understand where you're coming from. I perhaps should have been more explicit. I would have something like this for flow, where possibly a binary url would come before upload date: if self.binary_url is not None:
before_idx = [name for name, value in fields].index("Upload Date")
fields.insert(before_idx , ("Binary URL": self.binary_url))that said, it does introduce some work for the developer if they want to find out the full order. Additionally they also have to be careful about ordering if multiple such optional elements should be printed one after another. I think that's where your solution is better. But can I ask why use if you don't want to print any None values, or if you want to print I haven't tested it myself, but I think this small change should keep equivalent behavior while using standard python over |
|
Well, this is completely down to my history with R :/ Ordering by 'rownames' is quite common there and I just rabbit-holed into this particular solution. You're right, I can do the same with a dict representation and use a list comprehension to create the equivalent |
|
No worries :) I'm still learning to write cleaner Python myself. But whenever I spot easy improvements I really would like them to be done, it just helps so much down the line (easier for maintenance, extending, newcomers, etc.). Sorry for being a hard-ass 😅 |
|
Haha! Absolutely understandable and much needed, to be honest. I am learning well ;) |
|
Everything looks good. As these changes are not invasive I would like to hold off a merge until we fix the dev branch. I'll look at that sooner rather than later. |
Reference Issue
Addresses #653.
What does this PR implement/fix? Explain your changes.
Adds
__str__for top-level classes that returns a custom string for displaying brief, necessary object information.How should this PR be tested?
By a
print()of the object variable.Examples:
d = openml.datasets.get_dataset(523); print(d)f = openml.flows.get_flow(3740); print(f)r = openml.runs.get_run(10000); print(r)t = openml.tasks.get_task(31); print(t)s = openml.study.get_study(16); print(s)b = openml.study.get_suite(218); print(b)s = openml.setups.get_setup(10); print(s)print(s.parameters[190])e = openml.evaluations.list_evaluations(function='predictive_accuracy', task=[31], size=10)print(e[51])Any other comments?