
Conversation

@amueller
Contributor

Closes #718.

alternatively we could be more conservative and add a strict=True keyword somewhere that the user needs to set to False to allow circumventing the version check.

@mfeurer
Collaborator

mfeurer commented Jul 23, 2019

alternatively we could be more conservative and add a strict=True keyword somewhere that the user needs to set to False to allow circumventing the version check.

I'd be in favor of that solution as it makes the user think about dependency versions.
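
For concreteness, a usage sketch of what such a keyword looks like from the user's side, using the reinstantiate and strict_version keywords that show up in the test further down this thread (assuming the locally installed scikit-learn differs from the flow's pinned 0.19.1):

import openml

# Default: strict check, deserializing a flow pinned to a different
# scikit-learn version raises a ValueError.
try:
    flow = openml.flows.get_flow(flow_id=8175, reinstantiate=True)
except ValueError as e:
    print(e)  # "Trying to deserialize a model with dependency sklearn==0.19.1 not satisfied."

# Explicit opt-out: the user acknowledges the version mismatch.
flow = openml.flows.get_flow(flow_id=8175, reinstantiate=True, strict_version=False)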

@hp2500

hp2500 commented Jul 23, 2019

I have played around with this pull request a little bit and ran into another issue. Some of the runs don't seem to be recognized as sklearn runs at all.

Here is an example:

import openml
model = openml.runs.initialize_model_from_run(2041946)

Which throws the following error:
[screenshot of the resulting traceback]

I see the warning message, which would have thrown an error before this pull request, but there seems to be some other problem that I don't understand.

Of course I have checked manually whether the runs in question are sklearn runs and they are. e.g.
'flow_name': 'sklearn.pipeline.Pipeline(dualimputer=extra.dual_imputer.DualImputer,standardscaler=sklearn.preprocessing.data.StandardScaler,svc=sklearn.svm.classes.SVC)(1)'

This only happens with a subset of the runs.

@amueller
Contributor Author

it looks like extra.dual_imputer.DualImputer is not from sklearn.

@hp2500

hp2500 commented Jul 23, 2019

Oops, yes, I should have realized that. However, there is more:

model = openml.runs.initialize_model_from_run(2081320)

for example, gives me:
[screenshot of the resulting traceback]

'flow_name': 'sklearn.pipeline.Pipeline(standardscaler=sklearn.preprocessing.data.StandardScaler,svc=sklearn.svm.classes.SVC)(1)',

I am not sure where the words 'data' and 'classes' are coming from. As far as I know they are not part of the class hierarchy. Was this the case in older sklearn versions?

@amueller
Contributor Author

@mfeurer where would you want to have that flag set? It seems like it would be a property of the extension but the extension is created on-demand whenever needed.

I could add it to get_flow as an additional parameter and then propagate it to flow_to_model. But then I'd also need to add it to run_flow_on_task, I think.
Also, to propagate it downwards, I'd have to add it to _deserialize_sklearn, _deserialize_model, and _check_dependencies.

Passing it in where the extension is created in get_flow would also be possible, but that seems to involve even more chained function calls that it would need to be added to.

The other option would be to add a config option, i.e. a global variable that stores this, which is also not pretty.
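
Roughly, the plumbing described above would look like this (a simplified sketch with shortened signatures; _download_flow_xml is a hypothetical helper for the API call, and the bodies of the extension methods are elided):

from openml.extensions import get_extension_by_flow

def get_flow(flow_id, reinstantiate=False, strict_version=True):
    flow = _download_flow_xml(flow_id)
    if reinstantiate:
        extension = get_extension_by_flow(flow)  # extension is created on demand
        flow.model = extension.flow_to_model(flow, strict_version=strict_version)
    return flow

class SklearnExtension:
    def flow_to_model(self, flow, strict_version=True):
        return self._deserialize_sklearn(flow, strict_version=strict_version)

    def _deserialize_sklearn(self, o, strict_version=True, **kwargs):
        # Recurses through nested components, forwarding strict_version each time.
        ...

    def _deserialize_model(self, flow, strict_version=True, **kwargs):
        self._check_dependencies(flow.dependencies, strict_version=strict_version)
        ...

    def _check_dependencies(self, dependencies, strict_version=True):
        # Raise on a version mismatch only when strict_version is True;
        # otherwise warn and continue.
        ...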

@amueller
Contributor Author

@hp2500 the data and classes are part of the module hierarchy and should be there.

It's an issue with the deserialization of the pipeline, possibly because the format for storing flows has changed in openml? I can try to look into it but maybe @janvanrijn or @mfeurer know.

@PGijsbers
Collaborator

But then I'd also need to add it to run_flow_on_task, I think

I'm not sure about that. If you have mismatching local libraries and want to run the flow on a task, does it not make sense to create a new flow for your version-loose replication and run that on the task?

@amueller
Contributor Author

@PGijsbers how would you do that? Or do you mean just doing get_flow would be fine? I was just a bit concerned about having the two have different interfaces.
Either way, it needs to be added to at least 5 functions or so.

@PGijsbers
Collaborator

Actually, I'd like to take that back and agree with you instead: if we allow it in a straightforward fashion, then a parameter should be added there.

Basically, the behavior I would expect is this (in run_flow_on_task):

  • run the flow_to_model strictly
  • if an error happens, do it loosely instead, and construct a new flow from the newly constructed model.

But this really shouldn't be done without explicit consent of the user, I don't think.
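
A minimal sketch of that fallback, assuming a strict_version keyword on run_flow_on_task and using the extension's flow_to_model / model_to_flow entry points (the _execute_run helper is hypothetical; this is not the merged implementation):

from openml.extensions import get_extension_by_flow

def run_flow_on_task(flow, task, strict_version=True, **kwargs):
    extension = get_extension_by_flow(flow)
    try:
        # First attempt: strict deserialization against the pinned versions.
        model = extension.flow_to_model(flow, strict_version=True)
    except ValueError:
        if strict_version:
            raise  # no silent fallback without the user's explicit consent
        # Deserialize loosely with the local library versions, then rebuild the
        # flow from the resulting model so it reflects those versions.
        model = extension.flow_to_model(flow, strict_version=False)
        flow = extension.model_to_flow(model)
    return _execute_run(flow, model, task, **kwargs)  # hypothetical helper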

However, we should also think about how it would interact with avoid_duplicate_runs. What would happen if I want to run_model_on_task with avoid_duplicate_runs=True but you allow loose instantiation? Then do you check for duplicate results against the flow that would be generated from the model with your specific versions, or the one you initially specified as argument, or both?

@amueller
Contributor Author

I really don't have a strong opinion on this. I don't even care that much about running the flow again, just instantiating a model at all is what we need, and right now that's quite tricky :-/

@mfeurer
Collaborator

mfeurer commented Jul 24, 2019

Oops, yes, I should have realized that. However, there is more:

Could you please create a new issue for this? It seems to be a different problem.

I could add it to get_flow as an additional parameter and then propagate it to flow_to_model. But then I'd also need to add it to run_flow_on_task, I think.

This makes sense to me and I think I would prefer this solution to the other ones you mentioned.

@mfeurer
Collaborator

mfeurer commented Jul 24, 2019

What would happen if I want to run_model_on_task with avoid_duplicate_runs=True but you allow loose instantiation

We could just forbid enabling both to avoid this issue.
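
For example, a simple guard along these lines would do (a sketch only; strict_version as a keyword on run_model_on_task is hypothetical here, while avoid_duplicate_runs already exists):

def run_model_on_task(model, task, avoid_duplicate_runs=True, strict_version=True, **kwargs):
    # Forbid combining loose instantiation with duplicate-run checking, since the
    # loosely instantiated model may not correspond to the originally referenced flow.
    if avoid_duplicate_runs and not strict_version:
        raise ValueError(
            "avoid_duplicate_runs=True cannot be combined with strict_version=False."
        )
    ...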

@amueller
Contributor Author

It looks like CI is not running with the latest commit in the branch? I'm a bit confused...

@amueller
Contributor Author

Hm, so this is a bit fragile. I'm not passing it around everywhere, and it's hard to ensure that I am. The tests will pass now, but I haven't added it to deserialize_cross_validation, so this will actually not work with GridSearchCV, I think.

@amueller
Contributor Author

I feel like this should be an attribute of the parser but the parser is wrapped up in the extension and we don't have control over where it's instantiated.

I think the clean version would be to refactor the extension into several classes with clearer responsibilities, like a deserializer that could have a flag, instead of passing options around everywhere... but that seems outside of the scope of this PR.

@amueller
Contributor Author

should work now ... maybe?

@PGijsbers
Collaborator

Loose instantiation fails if the flow records hyperparameters that are present in newer scikit-learn versions but not in older ones. E.g. min_impurity_decrease for RandomForestEnsemble was added in 0.19, so reinstantiation using 0.18 fails (see below). I think ignoring the hyperparameter and printing a warning is the best solution here? (A sketch of that idea follows the traceback.)

=================================== FAILURES ===================================
______ TestFlowFunctions.test_get_flow_reinstantiate_model_wrong_version _______
[gw0] linux -- Python 3.6.8 /home/travis/miniconda/envs/testenv/bin/python
self = <tests.test_flows.test_flow_functions.TestFlowFunctions testMethod=test_get_flow_reinstantiate_model_wrong_version>
    @unittest.skipIf(LooseVersion(sklearn.__version__) == "0.19.1",
                     reason="Target flow is from sklearn 0.19.1")
    def test_get_flow_reinstantiate_model_wrong_version(self):
        # Note that CI does not test against 0.19.1.
        openml.config.server = self.production_server
        _, sklearn_major, _ = LooseVersion(sklearn.__version__).version[:3]
        flow = 8175
        expected = 'Trying to deserialize a model with dependency sklearn==0.19.1 not satisfied.'
        self.assertRaisesRegex(ValueError,
                               expected,
                               openml.flows.get_flow,
                               flow_id=flow,
                               reinstantiate=True)
        openml.flows.get_flow(flow_id=flow, reinstantiate=True,
>                             strict_version=False)
/home/travis/build/openml/openml-python/tests/test_flows/test_flow_functions.py:300: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/travis/build/openml/openml-python/openml/utils.py:328: in safe_func
    return func(*args, **kwargs)
/home/travis/build/openml/openml-python/openml/flows/functions.py:99: in get_flow
    flow, strict_version=strict_version)
/home/travis/build/openml/openml-python/openml/extensions/sklearn/extension.py:233: in flow_to_model
    strict_version=strict_version)
/home/travis/build/openml/openml-python/openml/extensions/sklearn/extension.py:374: in _deserialize_sklearn
    strict_version=strict_version
/home/travis/build/openml/openml-python/openml/extensions/sklearn/extension.py:823: in _deserialize_model
    strict_version=strict_version,
/home/travis/build/openml/openml-python/openml/extensions/sklearn/extension.py:361: in _deserialize_sklearn
    for element in o
/home/travis/build/openml/openml-python/openml/extensions/sklearn/extension.py:361: in <listcomp>
    for element in o
/home/travis/build/openml/openml-python/openml/extensions/sklearn/extension.py:312: in _deserialize_sklearn
    strict_version=strict_version
/home/travis/build/openml/openml-python/openml/extensions/sklearn/extension.py:374: in _deserialize_sklearn
    strict_version=strict_version
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <openml.extensions.sklearn.extension.SklearnExtension object at 0x7fe7be78c710>
flow = <openml.flows.flow.OpenMLFlow object at 0x7fe7be786d68>
keep_defaults = False, recursion_depth = 3, strict_version = False
    def _deserialize_model(
        self,
        flow: OpenMLFlow,
        keep_defaults: bool,
        recursion_depth: int,
        strict_version: Optional[bool] = True
    ) -> Any:
        logging.info('-%s deserialize %s' % ('-' * recursion_depth, flow.name))
        model_name = flow.class_name
        self._check_dependencies(flow.dependencies,
                                 strict_version=strict_version)
    
        parameters = flow.parameters
        components = flow.components
        parameter_dict = OrderedDict()  # type: Dict[str, Any]
    
        # Do a shallow copy of the components dictionary so we can remove the
        # components from this copy once we added them into the pipeline. This
        # allows us to not consider them any more when looping over the
        # components, but keeping the dictionary of components untouched in the
        # original components dictionary.
        components_ = copy.copy(components)
    
        for name in parameters:
            value = parameters.get(name)
            logging.info('--%s flow_parameter=%s, value=%s' %
                         ('-' * recursion_depth, name, value))
            rval = self._deserialize_sklearn(
                value,
                components=components_,
                initialize_with_defaults=keep_defaults,
                recursion_depth=recursion_depth + 1,
                strict_version=strict_version,
            )
            parameter_dict[name] = rval
    
        for name in components:
            if name in parameter_dict:
                continue
            if name not in components_:
                continue
            value = components[name]
            logging.info('--%s flow_component=%s, value=%s'
                         % ('-' * recursion_depth, name, value))
            rval = self._deserialize_sklearn(
                value,
                recursion_depth=recursion_depth + 1,
                strict_version=strict_version
            )
            parameter_dict[name] = rval
    
        module_name = model_name.rsplit('.', 1)
        model_class = getattr(importlib.import_module(module_name[0]),
                              module_name[1])
    
        if keep_defaults:
            # obtain all params with a default
            param_defaults, _ = \
                self._get_fn_arguments_with_defaults(model_class.__init__)
    
            # delete the params that have a default from the dict,
            # so they get initialized with their default value
            # except [...]
            for param in param_defaults:
                # [...] the ones that also have a key in the components dict.
                # As OpenML stores different flows for ensembles with different
                # (base-)components, in OpenML terms, these are not considered
                # hyperparameters but rather constants (i.e., changing them would
                # result in a different flow)
                if param not in components.keys():
                    del parameter_dict[param]
>       return model_class(**parameter_dict)
E       TypeError: __init__() got an unexpected keyword argument 'min_impurity_decrease'
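
A minimal sketch of that warn-and-drop behavior, applied just before the model_class(**parameter_dict) call shown above and gated on non-strict mode (the helper name _instantiate_loosely is hypothetical; this is not the code that was merged):

import inspect
import warnings

def _instantiate_loosely(model_class, parameter_dict, strict_version):
    if not strict_version:
        # Drop hyperparameters that the locally installed class does not accept.
        accepted = set(inspect.signature(model_class.__init__).parameters)
        for name in list(parameter_dict):
            if name not in accepted:
                warnings.warn(
                    'Ignoring hyperparameter %r: not accepted by the installed '
                    'version of %s' % (name, model_class.__name__)
                )
                del parameter_dict[name]
    return model_class(**parameter_dict)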

@mfeurer
Collaborator

mfeurer commented Jul 25, 2019

I think ignoring the hyperparameter and printing a warning is the best solution here?

This just makes me think of a potential issue I hadn't seen yet:

  1. You create a flow with a scikit-learn version 0.X
  2. You create a run on task Y and upload it to OpenML
  3. scikit-learn version 0.X+1 gets released
  4. You download the flow and reinstantiate the model
  5. You create a run on task Y and upload it to OpenML

The flow downloaded in step 4 now has an external version attribute scikit-learn 0.X, but we used scikit-learn 0.X+1 in step 5. When uploading, we will reference the flow with the wrong scikit-learn version and therefore create faulty entries on OpenML.

I'm wondering if such re-instantiation is a) a good idea and b) if it is, how we can get around this issue? Maybe we need to remove the flow's own ID if we deserialize it to a different version?
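
One way that could look, as a sketch: after a non-strict deserialization, forget the server-side identity so that publishing creates a new flow instead of referencing the old one (whether the server-assigned version attribute should also be cleared is an assumption here):

def _reset_flow_identity(flow):
    # Treat the loosely deserialized flow as a new, unpublished flow.
    flow.flow_id = None
    flow.version = None  # assumption: drop the server-assigned version as well
    for component in flow.components.values():
        _reset_flow_identity(component)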

@PGijsbers
Collaborator

Maybe we need to remove the flow's own ID if we deserialize it to a different version?

From our design perspective, I think that is the easiest solution. For a user it might be a little confusing if they are running flow X on task Y and no new results for flow X are added (through run_model_on_task). Do we just want to provide better warning messages?

@amueller
Contributor Author

I think if there's a real incompatibility, it's fine to just error (at least for now).
And I think discarding the ID and treating it as a new flow is the only consistent choice. Maybe with more error messages.
So not all of the requirements are == versions; some are >=, right? If the requirement is fulfilled, the deserialization works, but it should probably still be a different flow, right?

For us, instantiation was mostly a way to get to the parameters, because that is so hard otherwise.

@mfeurer
Collaborator

mfeurer commented Jul 26, 2019

I think if there's a real incompatibility, it's fine to just error (at least for now). And I think discarding the ID and treating it as a new flow is the only consistent choice. Maybe with more error messages.

I agree 100%.

So not all of the requirements are == versions; some are >=, right? If the requirement is fulfilled, the deserialization works, but it should probably still be a different flow, right?

I'm not sure. This is mostly about libraries like numpy and scipy, which are dependencies of scikit-learn, I guess. They can introduce changes in performance; on the other hand, one wants to avoid the server being swamped by flows with all kinds of combinations of sklearn, numpy, and scipy versions. I'd suggest leaving it as it is at the moment?

@amueller
Contributor Author

How does the interface decide which libraries get a == and which get a >=?

@codecov-io

codecov-io commented Jul 26, 2019

Codecov Report

Merging #744 into develop will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #744      +/-   ##
===========================================
- Coverage    88.06%   88.06%   -0.01%     
===========================================
  Files           36       36              
  Lines         4090     4121      +31     
===========================================
+ Hits          3602     3629      +27     
- Misses         488      492       +4
Impacted Files Coverage Δ
openml/extensions/sklearn/extension.py 91.38% <100%> (+0.03%) ⬆️
openml/flows/functions.py 88.32% <100%> (+0.35%) ⬆️
openml/_api_calls.py 84.41% <0%> (-3.9%) ⬇️
openml/datasets/functions.py 95.16% <0%> (-0.33%) ⬇️
openml/evaluations/functions.py 92.3% <0%> (+2.75%) ⬆️


@amueller
Contributor Author

So do I need to do something here? Do I need to handle resetting the flow id? Or change something so that the flow realizes it's actually a different flow?
Or can we merge it like this and care about creating new flows later?

@mfeurer
Collaborator

mfeurer commented Jul 26, 2019

How does the interface decide which libraries get a == and which get a >=?

Like this: https://github.com/openml/openml-python/blob/develop/openml/extensions/sklearn/extension.py#L526
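
For context, the sklearn extension pins its own version with == while dependencies like numpy and scipy get >=, and the check over each specifier boils down to something like this simplified sketch (not the actual _check_dependencies):

import importlib
from distutils.version import LooseVersion

def check_dependency(spec, strict_version=True):
    # spec is a pip-style specifier such as "sklearn==0.19.1" or "numpy>=1.6.1".
    for op in ('==', '>=', '>'):
        if op in spec:
            name, required = spec.split(op)
            break
    else:
        raise NotImplementedError('Operator not supported in: %s' % spec)
    installed = LooseVersion(importlib.import_module(name).__version__)
    required = LooseVersion(required)
    satisfied = {
        '==': installed == required,
        '>=': installed >= required,
        '>': installed > required,
    }[op]
    if not satisfied and strict_version:
        raise ValueError('Dependency %s not satisfied (installed: %s)' % (spec, installed))
    return satisfied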

So do I need to do something here? Do I need to handle resetting the flow id?

That should be sufficient, plus a check which actually makes sure that, when downloading a flow and running it again with a different scikit-learn version, a new flow is generated.

Or can we merge it like this and care about creating new flows later?

I would prefer to avoid having unfinished/suboptimal code in the dev branch.

@amueller
Contributor Author

Maybe a naive question, but why do flows have versions? And is there something that makes sure the versions of a run and a flow are consistent?

@amueller
Contributor Author

anyway, I proposed a fix.

@amueller
Contributor Author

I think I addressed your comments?

@mfeurer mfeurer merged commit fe218bc into openml:develop Sep 26, 2019