Skip to content

Conversation

@Neeratyoy
Copy link
Contributor

Reference Issue

Addresses #653.

What does this PR implement/fix? Explain your changes.

Adds __str__ for top-level classes that returns a custom string for displaying brief, necessary object information.

How should this PR be tested?

By a print() of the object variable.

Examples:

  • d = openml.datasets.get_dataset(523); print(d)
  • f = openml.flows.get_flow(3740); print(f)
  • r = openml.runs.get_run(10000); print(r)
  • t = openml.tasks.get_task(31); print(t)
  • s = openml.study.get_study(16); print(s)
  • b = openml.study.get_suite(218); print(b)
  • s = openml.setups.get_setup(10); print(s)
    • print(s.parameters[190])
  • e = openml.evaluations.list_evaluations(function='predictive_accuracy', task=[31], size=10)
    • print(e[51])

Any other comments?

  • The reason for including URLs (which are not object attributes) is to nudge the user to visit the website for additional information since certain descriptions can get arbitrarily long.
  • I have tested with only a handful of IDs. I would be keen to test these on datasets, tasks, flows, etc. that are anomalous and can break this.
  • Please provide suggestions, corrections on how the above can be done better.
  • Also, if the examples should be edited to show that object summaries can be pretty printed.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, I added quite a few comments. Could you please apply the comments about string formatting etc to all applicable places? Could you also please use the format function instead of string interpolation as discussed in the roadmap?

date = '%-14s: %s\n' % ("Upload Date", object_dict['upload_date'].replace('T', ' '))
licence = '%-14s: %s\n' % ("Licence", object_dict['licence'])
d_url = '%-14s: %s\n' % ("Download URL", object_dict['url'])
base_url = 'https://www.openml.org/d/'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use the server URL given by openml.config.server? In case the user works against the test server, this URL would be wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure of what to do here.
Shall I change the base_url = 'https://www.openml.org/d/' to base_url = 'https://test.openml.org/d/' ?
However, https://www.openml.org/d/31 and https://test.openml.org/d/31 don't point to the same resource.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, he suggests to look at openml.config.server to see to which server the client is connected.
Right now, when connected to the test server, a user will be shown false information.
Try for example:

import openml
openml.config.start_using_configuration_for_example()
print(openml.datasets.get_dataset(128))

prints OpenML URL....: https://www.openml.org/d/128 where it should be OpenML URL....: https://test.openml.org/d/128.

I think changing the base_url to base_url="{}d/".format(openml.config.server[:-len('api/v1/xml')]) should fix the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Made the changes.

def __str__(self):
object_dict = self.__dict__
output_str = ''
name = '\n%-14s: %s\n' % ("Name", object_dict['name'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is necessary to add the identifier via string interpolation, you can simply have it in the main string.

pickle_file = '%-14s: %s\n' % ("Pickle file", object_dict['data_pickle_file'])
num_instances = ''
if object_dict['qualities']['NumberOfInstances'] is not None:
num_instances = '%-14s: %d\n' % ("# of instances",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please move the "# of instances" to the next line?

num_instances = ''
if object_dict['qualities']['NumberOfInstances'] is not None:
num_instances = '%-14s: %d\n' % ("# of instances",
object_dict['qualities']['NumberOfInstances'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add the number of features, too? And also the number of classes if that's available?

class_labels = '%-20s: %s\n' % ("# of Classes", len(object_dict['class_labels']))
if 'cost_matrix' in object_dict:
cost_matrix = '%-20s: %s\n' % ("Cost Matrix", "Available")
output_str = task_type + task_id + task_url + evaluation_measure + estimation_procedure + \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please switch the estimation procedure and the evaluation measure?

language = '%-16s: %s\n' % ('Language', object_dict['language'])
dependencies = '%-16s: %s\n' % ('Dependencies', object_dict['dependencies'])
# 3740 for example
output_str = id + version + url + name + description + binary + upload + \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name should probably be above the URL? Also, what do you think of Flow ID: %d (version %d) to make the display of the version more compact?

binary = '%-16s: %s\n\n' % ('Binary URL', object_dict['binary_url'])

upload = '%-16s: %s\n' % ('Upload Date', object_dict['upload_date'].replace('T', ' '))
language = '%-16s: %s\n' % ('Language', object_dict['language'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the language, it is kind of unimportant (it is mostly English anyway).

object_dict['creation_date'].replace('T', ' '))
output_str = id + name + status + main_entity_type + study_url + data + \
tasks + flows + runs + creator + upload_time
return(output_str)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please no brackets around the return values.

creator = '\n%-16s: %s\n' % ("Creator", url)
upload_time = '%-16s: %s\n' % ("Upload Time",
object_dict['creation_date'].replace('T', ' '))
output_str = id + name + status + main_entity_type + study_url + data + \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding a header to the output string, for example

OpenML Dataset
==============
ID: 5 (version 1)
...

obviously, this needs to be adapted for the different entities. Also, you'd need to create separate functions for the study and the suite class.

Copy link
Contributor Author

@Neeratyoy Neeratyoy Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added __str__ functions for the other two classes which inherit this class, OpenMLStudy and OpenMLBenchmarkSuite. Their __str__ appends the relevant header to the base class' __str__ string.

Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the string representations should have leading and trailing newlines. Was this ever discussed? Other large pretty prints like pandas.DataFrame or numpy.ndarray don't use them either.

What is the reason for referencing self through object_dict instead?

There are some magic numbers going on here, such as the length of the left column (i.e. number of trailing periods). I suspect it has to do with however long the longest line is. There is also a lot of repetition. Let me suggest some refactoring, for e.g. dataset:

    def __str__(self):
        header = "OpenML Dataset"
        header = '{}\n{}\n'.format(header, '=' * len(header))

        openml_url = "{}d/{}".format(openml.config.server[:-len('api/v1/xml')], self.dataset_id)
        fields = [("Name", self.name),
                  ("Version", self.version),
                  ("Format", self.format),
                  ("Upload Date", self.upload_date.replace('T', ' ')),
                  ("Licence", self.licence),
                  ("Download URL", self.url),
                  ("OpenML URL", openml_url),
                  ("Data file", self.data_file),
                  ("Pickle file", self.data_pickle_file),
                  ("# of features", len(self.features))]

        if self.qualities['NumberOfInstances'] is not None:
            fields.append(("# of instances", self.qualities['NumberOfInstances']))

        longest_field_name_length = max(len(name) for name, value in fields)
        field_line_format = "{{:.<{}}}: {{}}".format(longest_field_name_length)
        body = '\n'.join(field_line_format.format(name, value) for name, value in fields)
        return header + body

Note that this refactor incorporates the advice to look at openml.config.server to determine the base_url , discards leading/trailing newlines and directly uses self.
Use list.insert instead of list.append for optional fields that might be inserted in the middle of a body.
Other than that, it gives identical output (as long as the magic number was indeed determined by the longest field name).

date = '%-14s: %s\n' % ("Upload Date", object_dict['upload_date'].replace('T', ' '))
licence = '%-14s: %s\n' % ("Licence", object_dict['licence'])
d_url = '%-14s: %s\n' % ("Download URL", object_dict['url'])
base_url = 'https://www.openml.org/d/'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, he suggests to look at openml.config.server to see to which server the client is connected.
Right now, when connected to the test server, a user will be shown false information.
Try for example:

import openml
openml.config.start_using_configuration_for_example()
print(openml.datasets.get_dataset(128))

prints OpenML URL....: https://www.openml.org/d/128 where it should be OpenML URL....: https://test.openml.org/d/128.

I think changing the base_url to base_url="{}d/".format(openml.config.server[:-len('api/v1/xml')]) should fix the issue.

@Neeratyoy
Copy link
Contributor Author

I don't think the string representations should have leading and trailing newlines. Was this ever discussed? Other large pretty prints like pandas.DataFrame or numpy.ndarray don't use them either.

No, it wasn't discussed. These implementations were my independent design choices. Have corrected them as per your suggestions.

What is the reason for referencing self through object_dict instead?

Oh, this was a silly mistake wherein I was using the object_dict to study which attributes can be printed for the objects. Continued using that for the rest of the function instead of referencing through self. Sorry about that.

There are some magic numbers going on here, such as the length of the left column (i.e. number of trailing periods). I suspect it has to do with however long the longest line is. There is also a lot of repetition. Let me suggest some refactoring, for e.g. dataset:

    def __str__(self):
        header = "OpenML Dataset"
        header = '{}\n{}\n'.format(header, '=' * len(header))

        openml_url = "{}d/{}".format(openml.config.server[:-len('api/v1/xml')], self.dataset_id)
        fields = [("Name", self.name),
                  ("Version", self.version),
                  ("Format", self.format),
                  ("Upload Date", self.upload_date.replace('T', ' ')),
                  ("Licence", self.licence),
                  ("Download URL", self.url),
                  ("OpenML URL", openml_url),
                  ("Data file", self.data_file),
                  ("Pickle file", self.data_pickle_file),
                  ("# of features", len(self.features))]

        if self.qualities['NumberOfInstances'] is not None:
            fields.append(("# of instances", self.qualities['NumberOfInstances']))

        longest_field_name_length = max(len(name) for name, value in fields)
        field_line_format = "{{:.<{}}}: {{}}".format(longest_field_name_length)
        body = '\n'.join(field_line_format.format(name, value) for name, value in fields)
        return header + body

Note that this refactor incorporates the advice to look at openml.config.server to determine the base_url , discards leading/trailing newlines and directly uses self.

This looks really good. I wanted to do something about the character padding. Wasn't aware of the escape formatting with {{. This is what I was looking for. Thanks a lot!

Use list.insert instead of list.append for optional fields that might be inserted in the middle of a body.
Other than that, it gives identical output (as long as the magic number was indeed determined by the longest field name).

I understood the reasoning behind insert and not append. However, with insert the index number needed to be specified again, which would again be ghost numbers that might become incomprehensible later.
I am therefore now using Pandas Series to create the fields to be printed. Along with a list of strings which independently determines the order in which the field appears and is able to handle a missing value for a (for optional fields).
Just felt that the slightly redundant list of strings is a worthy trade-off for explicit insert indexes. Also, this seems more customizable, in terms of addition/deletion of fields, or reordering. Kindly advise.

@PGijsbers
Copy link
Collaborator

I understand where you're coming from. I perhaps should have been more explicit. I would have something like this for flow, where possibly a binary url would come before upload date:

if self.binary_url is not None:
    before_idx = [name for name, value in fields].index("Upload Date")
    fields.insert(before_idx , ("Binary URL": self.binary_url))

that said, it does introduce some work for the developer if they want to find out the full order. Additionally they also have to be careful about ordering if multiple such optional elements should be printed one after another. I think that's where your solution is better.

But can I ask why use pd.Series and not a regular dict?
It adds a lot of additional verbosity (pd.Series({...}) everywhere), and the only use for the series object itself seems to be (unless I missed it) in
fields = list(fields.reindex(order).dropna().iteritems())
Which, if fields was a dict instead, would be replaced by:

fields = [(name, fields[name]) for name in order if fields.get(name) is not None]

if you don't want to print any None values, or

fields = [(name, fields[name]) for name in order if name in fields]

if you want to print None values if they have been explicitly added to the fields dict.

I haven't tested it myself, but I think this small change should keep equivalent behavior while using standard python over pd.Series.

@Neeratyoy
Copy link
Contributor Author

Well, this is completely down to my history with R :/ Ordering by 'rownames' is quite common there and I just rabbit-holed into this particular solution. You're right, I can do the same with a dict representation and use a list comprehension to create the equivalent field variable. Rest remaining the same.
And I think with the current design I don't need to check for a None since I am adding the optional attributes if available and just checking if the relevant key is present or not should suffice.
I'll make this change.

@PGijsbers
Copy link
Collaborator

No worries :) I'm still learning to write cleaner Python myself. But whenever I spot easy improvements I really would like them to be done, it just helps so much down the line (easier for maintenance, extending, newcomers, etc.). Sorry for being a hard-ass 😅

@Neeratyoy
Copy link
Contributor Author

Haha! Absolutely understandable and much needed, to be honest. I am learning well ;)

@PGijsbers
Copy link
Collaborator

Everything looks good. As these changes are not invasive I would like to hold off a merge until we fix the dev branch. I'll look at that sooner rather than later.

@PGijsbers PGijsbers merged commit 6ec4ad1 into develop Jun 12, 2019
@PGijsbers PGijsbers deleted the fix_653 branch June 12, 2019 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants