Skip to content

Conversation

@janvanrijn
Copy link
Member

Maybe, let's solve #564 in two separate PR's, too keep the PR small and manageable? For #564 to happen, we need to have some knowledge about the structure of the scikit-learn flow. This is all in the OpenML flow object, but in a hard to extract format. I added a function flow_format, that extracts all needed information from the scikit-learn flow. Nice byproduct is that the result of this function integrates nicely with unit test functions. I integrated several, and can add more.

@janvanrijn janvanrijn requested a review from mfeurer October 6, 2018 22:53
@janvanrijn janvanrijn changed the base branch from master to develop October 6, 2018 22:53
@janvanrijn
Copy link
Member Author

Btw, the following testcases fail because they are ran against the live server. These will be fine once this PR is merged.

[gw2] FAILED ../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_cached_setup 
../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_setup 
[gw2] FAILED ../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_setup 
../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_uncached_setup 
[gw2] FAILED ../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_setup_list_filter_flow 
../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_setuplist_offset 

@janvanrijn
Copy link
Member Author

Nevermind, I added the functionality that I needed (as this function is only 2 lines)

I am not happy about the tests, this probably hinges a bit upon #568

Would be great to hear your opinion @mfeurer

@codecov-io
Copy link

codecov-io commented Oct 10, 2018

Codecov Report

Merging #567 into develop will decrease coverage by <.01%.
The diff coverage is 85.5%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #567      +/-   ##
===========================================
- Coverage    90.09%   90.08%   -0.01%     
===========================================
  Files           32       32              
  Lines         2999     3118     +119     
===========================================
+ Hits          2702     2809     +107     
- Misses         297      309      +12
Impacted Files Coverage Δ
openml/setups/__init__.py 100% <100%> (ø) ⬆️
openml/runs/functions.py 87.69% <100%> (+0.05%) ⬆️
openml/setups/setup.py 86.36% <100%> (+0.64%) ⬆️
openml/flows/__init__.py 100% <100%> (ø) ⬆️
openml/setups/functions.py 94.49% <100%> (+0.09%) ⬆️
openml/tasks/functions.py 88.07% <66.66%> (-0.52%) ⬇️
openml/flows/sklearn_converter.py 91.05% <70%> (-0.6%) ⬇️
openml/flows/flow.py 93.83% <85.36%> (-1.4%) ⬇️
openml/flows/functions.py 94.4% <0%> (+0.21%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 876be65...786cfcb. Read the comment docs.

@janvanrijn janvanrijn changed the title Adds flow function flow_structure Adds flow.get_structure and flow.get_subflow (which are complements of each other). Also fixes #564 Oct 10, 2018
@janvanrijn
Copy link
Member Author

This PR now adds the functionality that is required for #564.

In order to do this off we need more information about the flow structure. For this, I added the functions flow.get_structure (to return the flow structure in a usable way) and flow.get_subflow, which traverses the flow structural tree by means of the result of flow.get_structure. These two functions can be seen as each others complement.

With these functionalities, solving #564 is rather easy. For this I added a function to the sklearn converter (in setups, but maybe we want to pull these in the flow sklearn converter later). This function now provides a convenience function that maps the name of the OpenMLParameter to the name of the sklearn parameter (e.g., if the openml name is sklearn.pipeline.Pipeline(classifier=sklearn.trees.DecisionTreeClassifier))(1)_min_samples_leaf, the sklearn name is classifier__min_samples_leaf

Ready for review @mfeurer @ArlindKadra

@mfeurer
Copy link
Collaborator

mfeurer commented Oct 23, 2018

Could you please add an example on how to use this to the docs so it is more clear to potential users?

@janvanrijn
Copy link
Member Author

I know that you asked for an example, but I found it hard to construct a general usable example. In the process, I did come up with a general (run setup) tutorial, and I used it to simplify and remove a complex part of the setup reinstantiation code. Hope this clarifies this PR a bit. Let me know if you have further questions

@janvanrijn
Copy link
Member Author

To add to this, here is the function where I used this:
https://github.com/openml/openml-python-contrib/blob/master/openmlcontrib/setups/functions.py#L114

It is to transform an OpenML setup object into a sklearn parameter dict, something that is quite hard without these convenience functions. This is what I wrote in the tutorial, before I realized that the tutorial is obsolute due to easier functions:

# Interestingly, OpenML has a different way of storing setup names than sklearn
# has. Let's investigate
for hyperparameter in setup.parameters.values():
    logging.info('Hyperparameter id=%d, parameter name=%s, full name=%s'
                 % (hyperparameter.id, hyperparameter.parameter_name,
                    hyperparameter.full_name))
# note that each hyperparameter has an id, a parameter name (e.g.,
# 'max_features', corresponding to the name of the actual parameter in the
# sklearn component) and a full name, i.e., a unique name within OpenML. Note
# that this full name does not correspond to the scikit-learn name. The main
# question is: How do we map the OpenML name, e.g.,
# sklearn.ensemble.forest.RandomForestClassifier(1)_max_depth to the
# corresponding sklearn name randomforestclassifier__max_features?
# The answer to that is a set of convenience features build in the setup object

hyperparameters_duplicate = {
    openml.setups.openml_param_name_to_sklearn(hp, flow):
        openml.flows.flow_to_sklearn(hp.value) for hp in setup.parameters.values()
}
model_duplicate.set_params(**hyperparameters_duplicate)

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I think I got the functions and they make sense. I only have a few comments.

use them)
A key requirement for reinstantiating a flow is to have the same scikit-learn
version as the flow that was uploaded. This tutorial will upload the flow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't these two sentences contradict each other?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I also don't get your updated version. You're saying it is important to have the same scikit-learn version, but then say this doesn't matter because the tutorial uploads the flow. I simply don't get how this makes the version not matter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to reformulate. Please let me know if this is any better.

and solve the same task again.
3) We will verify that the obtained results are exactly the same.
Readers interested in reinstantiating a setup can skip part 1 and 2 and start
with part 3 immediately.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't part 3 only compare the two arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but is is quite an important point and it combines pt 1 and pt 2. I would prefer to keep it in a separate part, albeit only one line of code (it could be extended with more checks later, if someone is motivated)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on that. What I wanted to point out is that the sentence says:

Readers interested in reinstantiating a setup can skip part 1 and 2 and start with part 3 immediately.

which to me reads like "if you're interested in 1 and 2 you can skip them and commence to 3" because 2 is about reinstantiating the flow, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I removed the sentence.

@janvanrijn
Copy link
Member Author

ping @mfeurer

@mfeurer mfeurer merged commit 7c0a77d into develop Dec 11, 2018
@mfeurer mfeurer deleted the add_#564 branch December 11, 2018 10:13
@janvanrijn
Copy link
Member Author

Awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants