Conversation

@PGijsbers (Collaborator)

There is a desire for shorter flow names for scikit-learn flows.
I added a function which takes the flow name string and trims it down to its major components.
See #412 for more.
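
For illustration, a minimal sketch of the kind of trimming meant here; the actual helper in the PR and its exact rules may differ:

    import re

    def trim_flow_name_sketch(long_name: str) -> str:
        """Illustrative only, not the PR implementation: drop the module path
        of every component and the 'parameter=' prefixes, so e.g. the long
        Pipeline name quoted later in this thread shortens to
        'sklearn.Pipeline(Imputer,DecisionTreeClassifier)'."""
        if not long_name.startswith('sklearn.'):
            raise ValueError('Not a scikit-learn flow name: %s' % long_name)
        # Keep only the class name of each component, e.g.
        # 'sklearn.tree.tree.DecisionTreeClassifier' -> 'DecisionTreeClassifier'.
        short = re.sub(r'sklearn\.(\w+\.)*', '', long_name)
        # Drop the 'parameter=' part of each component, e.g. 'classifier='.
        short = re.sub(r'\w+=', '', short)
        return 'sklearn.' + short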

@PGijsbers (Collaborator, Author)

@mfeurer what's the reason for the dummy learners and/or why they are linked to the scikit-learn extension?

Currently the generated flow names are not scikit-learn standard names (i.e. they don't start with sklearn...). Consequently, these long names are not valid input to the trim_pipeline_name method. I currently raise an error for invalid strings, but I could also add an optional parameter that specifies returning the full name instead.
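
For example, the optional-parameter alternative could look roughly like this (the parameter name is hypothetical, not necessarily what ends up in the PR):

    def trim_flow_name_sketch(long_name: str, return_full_on_invalid: bool = False) -> str:
        """Hypothetical variant: fall back to the full name instead of raising."""
        if not long_name.startswith('sklearn.'):
            if return_full_on_invalid:
                return long_name
            raise ValueError('Not a scikit-learn flow name: %s' % long_name)
        ...  # trim the name as usual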

@mfeurer (Collaborator) left a comment

Do you mean the class DummyModel, @PGijsbers?

A general answer regarding the 2nd part of your comment is that the scikit-learn extension so far does not require that the ML algorithms are actually implemented in scikit-learn. The only requirement is that the classes follow the scikit-learn conventions (fit, predict, transform, etc.). It would be great if we could keep this property. One possibility would be to add the package name of non-scikit-learn packages to the short name, too.

Also, I guess feature unions should be treated as a top-level possibility next to pipelines and model selection classes.
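
As an illustration of the duck-typing point above, a class from a non-scikit-learn package only needs to follow the scikit-learn interface for the extension to handle it (sketch; the class and its package are made up):

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class MajorityClassifier(BaseEstimator, ClassifierMixin):
        """Toy estimator from a hypothetical external package: not part of
        scikit-learn, but it follows the scikit-learn conventions (fit/predict,
        plus get_params/set_params inherited from BaseEstimator)."""

        def fit(self, X, y):
            # Remember the most frequent class label seen during training.
            values, counts = np.unique(y, return_counts=True)
            self.majority_ = values[np.argmax(counts)]
            return self

        def predict(self, X):
            # Predict the majority class for every sample.
            return np.full(len(X), fill_value=self.majority_)

Under the suggestion above, the short name for such a flow would then be prefixed with that package's name rather than 'sklearn.'.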

@PGijsbers marked this pull request as ready for review July 20, 2019 21:50
@PGijsbers (Collaborator, Author) commented Jul 20, 2019

This breaks OpenMLFlow.publish on a flow which already exists on the server, due to an introduced inconsistency: the local flow will have a custom_name while the remote flow does not, yet they are enforced to have the same custom_name. See below for a failing unit test. What do we do here?

  • Update server flows
  • Don't check for custom_name
  • Allow custom_name to be unequal only if the server field is None

I think the last option is probably desirable? Or perhaps the second one, as it allows us to change the way we shorten names in the future more easily.


The unit test test_to_from_filesystem_search fails here:

    def publish(self, raise_error_if_exists: bool = False) -> 'OpenMLFlow':
        """ Publish this flow to OpenML server.
    
        Raises a PyOpenMLError if the flow exists on the server, but
        `self.flow_id` does not match the server known flow id.
    
        Parameters
        ----------
        raise_error_if_exists : bool, optional (default=False)
            If True, raise PyOpenMLError if the flow exists on the server.
            If False, update the local flow to match the server flow.
    
        Returns
        -------
        self : OpenMLFlow
    
        """
        ... <removed code> ...
        try:
            openml.flows.functions.assert_flows_equal(
                self, flow, flow.upload_date, ignore_parameter_values=True
            )
        except ValueError as e:
            message = e.args[0]
            raise ValueError("Flow was not stored correctly on the server. "
                             "New flow ID is %d. Please check manually and "
                             "remove the flow if necessary! Error is:\n'%s'" %
>                            (flow_id, message))
E           ValueError: Flow was not stored correctly on the server. New flow ID is 22101. Please check manually and remove the flow if necessary! Error is:
E           'Flow sklearn.model_selection._search.GridSearchCV(estimator=sklearn.pipeline.Pipeline(imputer=sklearn.preprocessing.imputation.Imputer,classifier=sklearn.tree.tree.DecisionTreeClassifier)): values for attribute 'custom_name' differ: 'sklearn.GridSearchCV(Pipeline(Imputer,DecisionTreeClassifier))'
E           vs
E           'None'.'

This is expected, as the existing flow does not have the custom_name. Come to think of it, perhaps we should also use a different error message there. I propose to change it to:

"The flow on the server is inconsistent with the local flow.  The server flow ID is %d. Please check manually and remove the flow if necessary! Error is:\n'%s'" % (flow_id, message))

@mfeurer (Collaborator) commented Jul 22, 2019

I think the 3rd option, together with an argument to the assert function to enable this behavior, would be best.

And the different error message looks fine.
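
A rough sketch of what the 3rd option plus such an argument could amount to inside the comparison (the helper and parameter names are hypothetical, not the actual assert_flows_equal signature):

    def custom_names_equal(local_name, server_name, ignore_missing_server_name=True):
        """Option 3: accept a custom_name mismatch only when the server-side
        value is None, i.e. the flow predates the new naming scheme."""
        if local_name == server_name:
            return True
        return ignore_missing_server_name and server_name is None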

@PGijsbers (Collaborator, Author)

Getting

_________ OpenMLTaskTest.test_list_datasets_with_high_size_parameter __________
[gw3] win32 -- Python 3.5.6 c:\miniconda35-x64\python.exe
self = <tests.test_utils.test_utils.OpenMLTaskTest testMethod=test_list_datasets_with_high_size_parameter>
    def test_list_datasets_with_high_size_parameter(self):
        datasets_a = openml.datasets.list_datasets()
        datasets_b = openml.datasets.list_datasets(size=np.inf)
        # note that in the meantime the number of datasets could have increased
        # due to tests that run in parallel.
        # instead of equality of size of list, checking if a valid subset
        a = set(datasets_a.keys())
        b = set(datasets_b.keys())
>       self.assertTrue(b.issubset(a))
E       AssertionError: False is not true
tests\test_utils\test_utils.py:54: AssertionError

Maybe restart tests tomorrow

@codecov-io commented Jul 24, 2019

Codecov Report

Merging #723 into develop will increase coverage by 0.12%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #723      +/-   ##
===========================================
+ Coverage    87.89%   88.02%   +0.12%     
===========================================
  Files           36       36              
  Lines         4033     4076      +43     
===========================================
+ Hits          3545     3588      +43     
  Misses         488      488
Impacted Files Coverage Δ
openml/flows/flow.py 93.8% <100%> (ø) ⬆️
openml/extensions/sklearn/extension.py 91.16% <100%> (+0.54%) ⬆️
openml/flows/functions.py 87.96% <100%> (+0.18%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 56fcc00...f7343ec.

@PGijsbers (Collaborator, Author) commented Jul 24, 2019

Flow 20 seems to have been removed from the test server. @janvanrijn do you know what's up? (https://test.openml.org/api/v1/xml/flow/20)

short_name = 'sklearn.{}'
# Generally, we want to trim all hyperparameters, the exception to that is for model
# selection, as the `estimator` hyperparameter is very indicative of what is in the flow.
# So we first trim pipeline names of the `estimator` parameter. For reference:
Collaborator

I'm having trouble matching

So we first trim pipeline names of the `estimator` parameter.

with the code. The code below does not really check for pipelines, but for model selection. I think the issue is that by "pipeline" you do not mean a scikit-learn Pipeline object, is that correct? Could you update the comment to reflect this more clearly?

Collaborator Author

That's right. I meant any number of components specified under the `estimator=` parameter of the model selection object. I understand it's confusing and I'll change it.

run = openml.runs.run_flow_on_task(flow, task)
run = run.publish()
TestBase._mark_entity_for_removal('run', run.run_id)
TestBase.logger.info("collected from {}: {}".format(__file__.split('/')[-1], run.run_id))
Collaborator

Could you please add unit tests for feature union and simple non-pipeline, non-feature selection models here, too?

Collaborator Author

Good catch. FeatureUnion did not work as expected. Added the tests and fixed the issue.

super(TestFlowFunctions, self).setUp()

def tearDown(self):
super(TestFlowFunctions, self).tearDown()
Collaborator

I know it wasn't there, but could you please change this to an assertRaisesRegex?

@PGijsbers (Collaborator, Author) Jul 24, 2019

Done. I also found that flow 8175 is 0.19.1 and not 0.19.2, so in terms of CI we can actually use it for all scikit-learn versions we test against.

@PGijsbers (Collaborator, Author) commented Jul 24, 2019

appveyor/branch, travis-ci/push, travis-ci/pr: [gw1] FAILED tests/test_utils/test_utils.py::OpenMLTaskTest::test_list_datasets_with_high_size_parameter

@mfeurer merged commit 4715796 into develop Jul 24, 2019
@mfeurer deleted the Add_412 branch July 24, 2019 15:29