Adds flow.get_structure and flow.get_subflow (which are complements of each other). Also fixes #564 #567

janvanrijn · 2018-10-06T22:53:17Z

Maybe, let's solve #564 in two separate PR's, too keep the PR small and manageable? For #564 to happen, we need to have some knowledge about the structure of the scikit-learn flow. This is all in the OpenML flow object, but in a hard to extract format. I added a function flow_format, that extracts all needed information from the scikit-learn flow. Nice byproduct is that the result of this function integrates nicely with unit test functions. I integrated several, and can add more.

janvanrijn · 2018-10-06T23:59:33Z

Btw, the following testcases fail because they are ran against the live server. These will be fine once this PR is merged.

[gw2] FAILED ../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_cached_setup 
../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_setup 
[gw2] FAILED ../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_setup 
../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_get_uncached_setup 
[gw2] FAILED ../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_setup_list_filter_flow 
../../home/travis/build/openml/openml-python/tests/test_setups/test_setup_functions.py::TestSetupFunctions::test_setuplist_offset

janvanrijn · 2018-10-07T02:12:33Z

Nevermind, I added the functionality that I needed (as this function is only 2 lines)

I am not happy about the tests, this probably hinges a bit upon #568

Would be great to hear your opinion @mfeurer

…name

codecov-io · 2018-10-10T00:56:19Z

Codecov Report

Merging #567 into develop will decrease coverage by <.01%.
The diff coverage is 85.5%.

@@             Coverage Diff             @@
##           develop     #567      +/-   ##
===========================================
- Coverage    90.09%   90.08%   -0.01%     
===========================================
  Files           32       32              
  Lines         2999     3118     +119     
===========================================
+ Hits          2702     2809     +107     
- Misses         297      309      +12

Impacted Files	Coverage Δ
openml/setups/__init__.py	`100% <100%> (ø)`	⬆️
openml/runs/functions.py	`87.69% <100%> (+0.05%)`	⬆️
openml/setups/setup.py	`86.36% <100%> (+0.64%)`	⬆️
openml/flows/__init__.py	`100% <100%> (ø)`	⬆️
openml/setups/functions.py	`94.49% <100%> (+0.09%)`	⬆️
openml/tasks/functions.py	`88.07% <66.66%> (-0.52%)`	⬇️
openml/flows/sklearn_converter.py	`91.05% <70%> (-0.6%)`	⬇️
openml/flows/flow.py	`93.83% <85.36%> (-1.4%)`	⬇️
openml/flows/functions.py	`94.4% <0%> (+0.21%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 876be65...786cfcb. Read the comment docs.

…class. updated unit tests accordingly

janvanrijn · 2018-10-10T02:01:02Z

This PR now adds the functionality that is required for #564.

In order to do this off we need more information about the flow structure. For this, I added the functions flow.get_structure (to return the flow structure in a usable way) and flow.get_subflow, which traverses the flow structural tree by means of the result of flow.get_structure. These two functions can be seen as each others complement.

With these functionalities, solving #564 is rather easy. For this I added a function to the sklearn converter (in setups, but maybe we want to pull these in the flow sklearn converter later). This function now provides a convenience function that maps the name of the OpenMLParameter to the name of the sklearn parameter (e.g., if the openml name is sklearn.pipeline.Pipeline(classifier=sklearn.trees.DecisionTreeClassifier))(1)_min_samples_leaf, the sklearn name is classifier__min_samples_leaf

Ready for review @mfeurer @ArlindKadra

Develop

mfeurer · 2018-10-23T07:29:26Z

Could you please add an example on how to use this to the docs so it is more clear to potential users?

janvanrijn · 2018-12-05T17:18:13Z

I know that you asked for an example, but I found it hard to construct a general usable example. In the process, I did come up with a general (run setup) tutorial, and I used it to simplify and remove a complex part of the setup reinstantiation code. Hope this clarifies this PR a bit. Let me know if you have further questions

janvanrijn · 2018-12-05T17:22:30Z

To add to this, here is the function where I used this:
https://github.com/openml/openml-python-contrib/blob/master/openmlcontrib/setups/functions.py#L114

It is to transform an OpenML setup object into a sklearn parameter dict, something that is quite hard without these convenience functions. This is what I wrote in the tutorial, before I realized that the tutorial is obsolute due to easier functions:

# Interestingly, OpenML has a different way of storing setup names than sklearn
# has. Let's investigate
for hyperparameter in setup.parameters.values():
    logging.info('Hyperparameter id=%d, parameter name=%s, full name=%s'
                 % (hyperparameter.id, hyperparameter.parameter_name,
                    hyperparameter.full_name))
# note that each hyperparameter has an id, a parameter name (e.g.,
# 'max_features', corresponding to the name of the actual parameter in the
# sklearn component) and a full name, i.e., a unique name within OpenML. Note
# that this full name does not correspond to the scikit-learn name. The main
# question is: How do we map the OpenML name, e.g.,
# sklearn.ensemble.forest.RandomForestClassifier(1)_max_depth to the
# corresponding sklearn name randomforestclassifier__max_features?
# The answer to that is a set of convenience features build in the setup object

hyperparameters_duplicate = {
    openml.setups.openml_param_name_to_sklearn(hp, flow):
        openml.flows.flow_to_sklearn(hp.value) for hp in setup.parameters.values()
}
model_duplicate.set_params(**hyperparameters_duplicate)

mfeurer

Hey, I think I got the functions and they make sense. I only have a few comments.