Adding object summary #701

Neeratyoy · 2019-05-30T14:21:50Z

Reference Issue

Addresses #653.

What does this PR implement/fix? Explain your changes.

Adds __str__ for top-level classes that returns a custom string for displaying brief, necessary object information.

How should this PR be tested?

By a print() of the object variable.

Examples:

d = openml.datasets.get_dataset(523); print(d)
f = openml.flows.get_flow(3740); print(f)
r = openml.runs.get_run(10000); print(r)
t = openml.tasks.get_task(31); print(t)
s = openml.study.get_study(16); print(s)
b = openml.study.get_suite(218); print(b)
s = openml.setups.get_setup(10); print(s)
- print(s.parameters[190])
e = openml.evaluations.list_evaluations(function='predictive_accuracy', task=[31], size=10)
- print(e[51])

Any other comments?

The reason for including URLs (which are not object attributes) is to nudge the user to visit the website for additional information since certain descriptions can get arbitrarily long.
I have tested with only a handful of IDs. I would be keen to test these on datasets, tasks, flows, etc. that are anomalous and can break this.
Please provide suggestions, corrections on how the above can be done better.
Also, if the examples should be edited to show that object summaries can be pretty printed.

mfeurer

Thanks a lot, I added quite a few comments. Could you please apply the comments about string formatting etc to all applicable places? Could you also please use the format function instead of string interpolation as discussed in the roadmap?

mfeurer · 2019-05-31T13:14:34Z

openml/datasets/dataset.py

+        date = '%-14s: %s\n' % ("Upload Date", object_dict['upload_date'].replace('T', ' '))
+        licence = '%-14s: %s\n' % ("Licence", object_dict['licence'])
+        d_url = '%-14s: %s\n' % ("Download URL", object_dict['url'])
+        base_url = 'https://www.openml.org/d/'


Could you please use the server URL given by openml.config.server? In case the user works against the test server, this URL would be wrong.

I am unsure of what to do here.
Shall I change the base_url = 'https://www.openml.org/d/' to base_url = 'https://test.openml.org/d/' ?
However, https://www.openml.org/d/31 and https://test.openml.org/d/31 don't point to the same resource.

No, he suggests to look at openml.config.server to see to which server the client is connected.
Right now, when connected to the test server, a user will be shown false information.
Try for example:

import openml openml.config.start_using_configuration_for_example() print(openml.datasets.get_dataset(128))

prints OpenML URL....: https://www.openml.org/d/128 where it should be OpenML URL....: https://test.openml.org/d/128.

I think changing the base_url to base_url="{}d/".format(openml.config.server[:-len('api/v1/xml')]) should fix the issue.

Understood. Made the changes.

mfeurer · 2019-05-31T13:15:58Z

openml/datasets/dataset.py

+    def __str__(self):
+        object_dict = self.__dict__
+        output_str = ''
+        name = '\n%-14s: %s\n' % ("Name", object_dict['name'])


I don't think it is necessary to add the identifier via string interpolation, you can simply have it in the main string.

mfeurer · 2019-05-31T13:16:41Z

openml/datasets/dataset.py

+        pickle_file = '%-14s: %s\n' % ("Pickle file", object_dict['data_pickle_file'])
+        num_instances = ''
+        if object_dict['qualities']['NumberOfInstances'] is not None:
+            num_instances = '%-14s: %d\n' % ("# of instances",


Could you please move the "# of instances" to the next line?

mfeurer · 2019-05-31T13:17:05Z

openml/datasets/dataset.py

+        num_instances = ''
+        if object_dict['qualities']['NumberOfInstances'] is not None:
+            num_instances = '%-14s: %d\n' % ("# of instances",
+                                             object_dict['qualities']['NumberOfInstances'])


Could you please add the number of features, too? And also the number of classes if that's available?

mfeurer · 2019-05-31T13:18:44Z

openml/tasks/task.py

+                class_labels = '%-20s: %s\n' % ("# of Classes", len(object_dict['class_labels']))
+            if 'cost_matrix' in object_dict:
+                cost_matrix = '%-20s: %s\n' % ("Cost Matrix", "Available")
+        output_str = task_type + task_id + task_url + evaluation_measure + estimation_procedure + \


Could you please switch the estimation procedure and the evaluation measure?

mfeurer · 2019-05-31T13:22:41Z

openml/flows/flow.py

+        language = '%-16s: %s\n' % ('Language', object_dict['language'])
+        dependencies = '%-16s: %s\n' % ('Dependencies', object_dict['dependencies'])
+        # 3740 for example
+        output_str = id + version + url + name + description + binary + upload + \


The name should probably be above the URL? Also, what do you think of Flow ID: %d (version %d) to make the display of the version more compact?

mfeurer · 2019-05-31T13:23:09Z

openml/flows/flow.py

+            binary = '%-16s: %s\n\n' % ('Binary URL', object_dict['binary_url'])
+
+        upload = '%-16s: %s\n' % ('Upload Date', object_dict['upload_date'].replace('T', ' '))
+        language = '%-16s: %s\n' % ('Language', object_dict['language'])


Please remove the language, it is kind of unimportant (it is mostly English anyway).

mfeurer · 2019-05-31T13:28:15Z

openml/study/study.py

+                                       object_dict['creation_date'].replace('T', ' '))
+        output_str = id + name + status + main_entity_type + study_url + data + \
+            tasks + flows + runs + creator + upload_time
+        return(output_str)


Please no brackets around the return values.

mfeurer · 2019-05-31T13:29:31Z

openml/study/study.py

+        creator = '\n%-16s: %s\n' % ("Creator", url)
+        upload_time = '%-16s: %s\n' % ("Upload Time",
+                                       object_dict['creation_date'].replace('T', ' '))
+        output_str = id + name + status + main_entity_type + study_url + data + \


What do you think about adding a header to the output string, for example

OpenML Dataset ============== ID: 5 (version 1) ...

obviously, this needs to be adapted for the different entities. Also, you'd need to create separate functions for the study and the suite class.

I have added __str__ functions for the other two classes which inherit this class, OpenMLStudy and OpenMLBenchmarkSuite. Their __str__ appends the relevant header to the base class' __str__ string.

PGijsbers

I don't think the string representations should have leading and trailing newlines. Was this ever discussed? Other large pretty prints like pandas.DataFrame or numpy.ndarray don't use them either.

What is the reason for referencing self through object_dict instead?

There are some magic numbers going on here, such as the length of the left column (i.e. number of trailing periods). I suspect it has to do with however long the longest line is. There is also a lot of repetition. Let me suggest some refactoring, for e.g. dataset:

    def __str__(self):
        header = "OpenML Dataset"
        header = '{}\n{}\n'.format(header, '=' * len(header))

        openml_url = "{}d/{}".format(openml.config.server[:-len('api/v1/xml')], self.dataset_id)
        fields = [("Name", self.name),
                  ("Version", self.version),
                  ("Format", self.format),
                  ("Upload Date", self.upload_date.replace('T', ' ')),
                  ("Licence", self.licence),
                  ("Download URL", self.url),
                  ("OpenML URL", openml_url),
                  ("Data file", self.data_file),
                  ("Pickle file", self.data_pickle_file),
                  ("# of features", len(self.features))]

        if self.qualities['NumberOfInstances'] is not None:
            fields.append(("# of instances", self.qualities['NumberOfInstances']))

        longest_field_name_length = max(len(name) for name, value in fields)
        field_line_format = "{{:.<{}}}: {{}}".format(longest_field_name_length)
        body = '\n'.join(field_line_format.format(name, value) for name, value in fields)
        return header + body

Note that this refactor incorporates the advice to look at openml.config.server to determine the base_url , discards leading/trailing newlines and directly uses self.
Use list.insert instead of list.append for optional fields that might be inserted in the middle of a body.
Other than that, it gives identical output (as long as the magic number was indeed determined by the longest field name).

PGijsbers · 2019-06-07T08:39:46Z

openml/datasets/dataset.py

+        date = '%-14s: %s\n' % ("Upload Date", object_dict['upload_date'].replace('T', ' '))
+        licence = '%-14s: %s\n' % ("Licence", object_dict['licence'])
+        d_url = '%-14s: %s\n' % ("Download URL", object_dict['url'])
+        base_url = 'https://www.openml.org/d/'


No, he suggests to look at openml.config.server to see to which server the client is connected.
Right now, when connected to the test server, a user will be shown false information.
Try for example:

import openml openml.config.start_using_configuration_for_example() print(openml.datasets.get_dataset(128))

prints OpenML URL....: https://www.openml.org/d/128 where it should be OpenML URL....: https://test.openml.org/d/128.

I think changing the base_url to base_url="{}d/".format(openml.config.server[:-len('api/v1/xml')]) should fix the issue.

Neeratyoy · 2019-06-07T17:32:53Z

I don't think the string representations should have leading and trailing newlines. Was this ever discussed? Other large pretty prints like pandas.DataFrame or numpy.ndarray don't use them either.

No, it wasn't discussed. These implementations were my independent design choices. Have corrected them as per your suggestions.

What is the reason for referencing self through object_dict instead?

Oh, this was a silly mistake wherein I was using the object_dict to study which attributes can be printed for the objects. Continued using that for the rest of the function instead of referencing through self. Sorry about that.

There are some magic numbers going on here, such as the length of the left column (i.e. number of trailing periods). I suspect it has to do with however long the longest line is. There is also a lot of repetition. Let me suggest some refactoring, for e.g. dataset:

    def __str__(self):
        header = "OpenML Dataset"
        header = '{}\n{}\n'.format(header, '=' * len(header))

        openml_url = "{}d/{}".format(openml.config.server[:-len('api/v1/xml')], self.dataset_id)
        fields = [("Name", self.name),
                  ("Version", self.version),
                  ("Format", self.format),
                  ("Upload Date", self.upload_date.replace('T', ' ')),
                  ("Licence", self.licence),
                  ("Download URL", self.url),
                  ("OpenML URL", openml_url),
                  ("Data file", self.data_file),
                  ("Pickle file", self.data_pickle_file),
                  ("# of features", len(self.features))]

        if self.qualities['NumberOfInstances'] is not None:
            fields.append(("# of instances", self.qualities['NumberOfInstances']))

        longest_field_name_length = max(len(name) for name, value in fields)
        field_line_format = "{{:.<{}}}: {{}}".format(longest_field_name_length)
        body = '\n'.join(field_line_format.format(name, value) for name, value in fields)
        return header + body

Note that this refactor incorporates the advice to look at openml.config.server to determine the base_url , discards leading/trailing newlines and directly uses self.

This looks really good. I wanted to do something about the character padding. Wasn't aware of the escape formatting with {{. This is what I was looking for. Thanks a lot!

Use list.insert instead of list.append for optional fields that might be inserted in the middle of a body.
Other than that, it gives identical output (as long as the magic number was indeed determined by the longest field name).

I understood the reasoning behind insert and not append. However, with insert the index number needed to be specified again, which would again be ghost numbers that might become incomprehensible later.
I am therefore now using Pandas Series to create the fields to be printed. Along with a list of strings which independently determines the order in which the field appears and is able to handle a missing value for a (for optional fields).
Just felt that the slightly redundant list of strings is a worthy trade-off for explicit insert indexes. Also, this seems more customizable, in terms of addition/deletion of fields, or reordering. Kindly advise.

PGijsbers · 2019-06-07T19:18:18Z

I understand where you're coming from. I perhaps should have been more explicit. I would have something like this for flow, where possibly a binary url would come before upload date:

if self.binary_url is not None:
    before_idx = [name for name, value in fields].index("Upload Date")
    fields.insert(before_idx , ("Binary URL": self.binary_url))

that said, it does introduce some work for the developer if they want to find out the full order. Additionally they also have to be careful about ordering if multiple such optional elements should be printed one after another. I think that's where your solution is better.

But can I ask why use pd.Series and not a regular dict?
It adds a lot of additional verbosity (pd.Series({...}) everywhere), and the only use for the series object itself seems to be (unless I missed it) in
fields = list(fields.reindex(order).dropna().iteritems())
Which, if fields was a dict instead, would be replaced by:

fields = [(name, fields[name]) for name in order if fields.get(name) is not None]

if you don't want to print any None values, or

fields = [(name, fields[name]) for name in order if name in fields]

if you want to print None values if they have been explicitly added to the fields dict.

I haven't tested it myself, but I think this small change should keep equivalent behavior while using standard python over pd.Series.

Neeratyoy · 2019-06-08T13:11:16Z

Well, this is completely down to my history with R :/ Ordering by 'rownames' is quite common there and I just rabbit-holed into this particular solution. You're right, I can do the same with a dict representation and use a list comprehension to create the equivalent field variable. Rest remaining the same.
And I think with the current design I don't need to check for a None since I am adding the optional attributes if available and just checking if the relevant key is present or not should suffice.
I'll make this change.

PGijsbers · 2019-06-09T16:23:17Z

No worries :) I'm still learning to write cleaner Python myself. But whenever I spot easy improvements I really would like them to be done, it just helps so much down the line (easier for maintenance, extending, newcomers, etc.). Sorry for being a hard-ass 😅

Neeratyoy · 2019-06-11T09:37:23Z

Haha! Absolutely understandable and much needed, to be honest. I am learning well ;)

PGijsbers · 2019-06-11T21:21:28Z

Everything looks good. As these changes are not invasive I would like to hold off a merge until we fix the dev branch. I'll look at that sooner rather than later.

Neeratyoy added 8 commits May 27, 2019 13:51

Adding __str__ for OpenMLDataset

7692337

Adding __str__ for OpenMLEvaluation

1d4e851

Adding __str__ for OpenMLFlow

893295c

Adding __str__ for OpenMLRun

b3bdb42

Fixing flake issues

8673200

Adding __str__ for OpenMLSetup and OpenMLParameter

837cb9b

Adding __str__ for OpenMLStudy + Fixing flake issues

ca6e852

Adding __str__ to OpenMLTask

e57a21b

Neeratyoy requested review from PGijsbers, janvanrijn, joaquinvanschoren and mfeurer May 30, 2019 14:22

Cleaning code

8eb49a0

mfeurer reviewed May 31, 2019

View reviewed changes

Changing sting interpolation to format + minor edits

90f425f

PGijsbers requested changes Jun 7, 2019

View reviewed changes

Refactoring __str__ to remove redundancies

948aebe

Cleaning stray code

8eae8b7

Neeratyoy added 2 commits June 11, 2019 12:20

Replacing pd.Series with dict for simplicity

c4920ea

Adding checks for printing optional attributes in __str__

0ab7cd3

Neeratyoy mentioned this pull request Jun 11, 2019

Adding documentation for _convert_array_format output parameter #708

Merged

PGijsbers approved these changes Jun 12, 2019

View reviewed changes

PGijsbers merged commit 6ec4ad1 into develop Jun 12, 2019

PGijsbers deleted the fix_653 branch June 12, 2019 21:43

PGijsbers mentioned this pull request Jun 12, 2019

Quick summaries of openml objects #653

Closed

mfeurer mentioned this pull request Jun 12, 2019

nice repr for all classes #75

Closed

Uh oh!

Adding object summary #701

Adding object summary #701

Uh oh!

Conversation

Neeratyoy commented May 30, 2019

Reference Issue

What does this PR implement/fix? Explain your changes.

How should this PR be tested?

Any other comments?

Uh oh!

mfeurer left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Neeratyoy Jun 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PGijsbers left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Neeratyoy commented Jun 7, 2019

Uh oh!

PGijsbers commented Jun 7, 2019

Uh oh!

Neeratyoy commented Jun 8, 2019

Uh oh!

PGijsbers commented Jun 9, 2019

Uh oh!

Neeratyoy commented Jun 11, 2019

Uh oh!

PGijsbers commented Jun 11, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Neeratyoy Jun 6, 2019 •

edited

Loading

PGijsbers left a comment •

edited

Loading