Conversation

@PGijsbers (Collaborator)

There is a desire for shorter flow names for scikit-learn flows.
I added a function which takes the flow name string and trims it down to its major components.
See #412 for more.
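
For illustration, a minimal sketch of the kind of trimming meant here; the actual helper in the PR and its exact rules may differ:

    import re

    def trim_flow_name_sketch(long_name: str) -> str:
        """Illustrative only, not the PR implementation: drop the module path
        of every component and the 'parameter=' prefixes, so e.g. the long
        Pipeline name quoted later in this thread shortens to
        'sklearn.Pipeline(Imputer,DecisionTreeClassifier)'."""
        if not long_name.startswith('sklearn.'):
            raise ValueError('Not a scikit-learn flow name: %s' % long_name)
        # Keep only the class name of each component, e.g.
        # 'sklearn.tree.tree.DecisionTreeClassifier' -> 'DecisionTreeClassifier'.
        short = re.sub(r'sklearn\.(\w+\.)*', '', long_name)
        # Drop the 'parameter=' part of each component, e.g. 'classifier='.
        short = re.sub(r'\w+=', '', short)
        return 'sklearn.' + short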

@PGijsbers (Collaborator, Author)

@mfeurer what's the reason for the dummy learners and/or why they are linked to the scikit-learn extension?

Currently the generated flow names are not scikit-learn standard names (i.e. they don't start with sklearn...). Consequently, these long names are not valid input to the trim_pipeline_name method. I currently raise an error for invalid strings, but I could also add an optional parameter that specifies returning the full name instead.
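
For example, the optional-parameter alternative could look roughly like this (the parameter name is hypothetical, not necessarily what ends up in the PR):

    def trim_flow_name_sketch(long_name: str, return_full_on_invalid: bool = False) -> str:
        """Hypothetical variant: fall back to the full name instead of raising."""
        if not long_name.startswith('sklearn.'):
            if return_full_on_invalid:
                return long_name
            raise ValueError('Not a scikit-learn flow name: %s' % long_name)
        ...  # trim the name as usual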

@mfeurer (Collaborator) left a comment

Do you mean the class DummyModel, @PGijsbers?

A general answer regarding the 2nd part of your comment is that the scikit-learn extension so far does not require that the ML algorithms are actually implemented in scikit-learn. The only requirement is that the classes follow the scikit-learn conventions (fit, predict, transform, etc.). It would be great if we could keep this property. One possibility would be to add the package name of non-scikit-learn packages to the short name, too.

Also, I guess feature unions should be treated as a top-level possibility next to pipelines and model selection classes.
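
As an illustration of the duck-typing point above, a class from a non-scikit-learn package only needs to follow the scikit-learn interface for the extension to handle it (sketch; the class and its package are made up):

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class MajorityClassifier(BaseEstimator, ClassifierMixin):
        """Toy estimator from a hypothetical external package: not part of
        scikit-learn, but it follows the scikit-learn conventions (fit/predict,
        plus get_params/set_params inherited from BaseEstimator)."""

        def fit(self, X, y):
            # Remember the most frequent class label seen during training.
            values, counts = np.unique(y, return_counts=True)
            self.majority_ = values[np.argmax(counts)]
            return self

        def predict(self, X):
            # Predict the majority class for every sample.
            return np.full(len(X), fill_value=self.majority_)

Under the suggestion above, the short name for such a flow would then be prefixed with that package's name rather than 'sklearn.'.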

@PGijsbers marked this pull request as ready for review July 20, 2019 21:50
@PGijsbers (Collaborator, Author) commented Jul 20, 2019

This breaks OpenMLFlow.publish on a flow which already exists on the server, due to an introduced inconsistency: the local flow will have a custom_name while the remote flow does not, yet they are enforced to have the same custom_name. See below for a failing unit test. What do we do here?

  • Update server flows
  • Don't check for custom_name
  • Allow custom_name to be unequal only if the server field is None

I think the last option is probably desirable? Or perhaps the second one, as it allows us to change the way we shorten names in the future more easily.


The unit test test_to_from_filesystem_search fails here:

    def publish(self, raise_error_if_exists: bool = False) -> 'OpenMLFlow':
        """ Publish this flow to OpenML server.
    
        Raises a PyOpenMLError if the flow exists on the server, but
        `self.flow_id` does not match the server known flow id.
    
        Parameters
        ----------
        raise_error_if_exists : bool, optional (default=False)
            If True, raise PyOpenMLError if the flow exists on the server.
            If False, update the local flow to match the server flow.
    
        Returns
        -------
        self : OpenMLFlow
    
        """
        ... <removed code> ...
        try:
            openml.flows.functions.assert_flows_equal(
                self, flow, flow.upload_date, ignore_parameter_values=True
            )
        except ValueError as e:
            message = e.args[0]
            raise ValueError("Flow was not stored correctly on the server. "
                             "New flow ID is %d. Please check manually and "
                             "remove the flow if necessary! Error is:\n'%s'" %
>                            (flow_id, message))
E           ValueError: Flow was not stored correctly on the server. New flow ID is 22101. Please check manually and remove the flow if necessary! Error is:
E           'Flow sklearn.model_selection._search.GridSearchCV(estimator=sklearn.pipeline.Pipeline(imputer=sklearn.preprocessing.imputation.Imputer,classifier=sklearn.tree.tree.DecisionTreeClassifier)): values for attribute 'custom_name' differ: 'sklearn.GridSearchCV(Pipeline(Imputer,DecisionTreeClassifier))'
E           vs
E           'None'.'

This is expected, as the existing flow does not have the custom_name. Come to think of it, perhaps we should also use a different error message there. I propose to change it to:

"The flow on the server is inconsistent with the local flow.  The server flow ID is %d. Please check manually and remove the flow if necessary! Error is:\n'%s'" % (flow_id, message))

@mfeurer (Collaborator) commented Jul 22, 2019

I think the 3rd option, together with an argument to the assert function to enable this behavior, would be best.

And the different error message looks fine.
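
A rough sketch of what the 3rd option plus such an argument could amount to inside the comparison (the helper and parameter names are hypothetical, not the actual assert_flows_equal signature):

    def custom_names_equal(local_name, server_name, ignore_missing_server_name=True):
        """Option 3: accept a custom_name mismatch only when the server-side
        value is None, i.e. the flow predates the new naming scheme."""
        if local_name == server_name:
            return True
        return ignore_missing_server_name and server_name is None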

@PGijsbers (Collaborator, Author)

Getting

_________ OpenMLTaskTest.test_list_datasets_with_high_size_parameter __________
[gw3] win32 -- Python 3.5.6 c:\miniconda35-x64\python.exe
self = <tests.test_utils.test_utils.OpenMLTaskTest testMethod=test_list_datasets_with_high_size_parameter>
    def test_list_datasets_with_high_size_parameter(self):
        datasets_a = openml.datasets.list_datasets()
        datasets_b = openml.datasets.list_datasets(size=np.inf)
        # note that in the meantime the number of datasets could have increased
        # due to tests that run in parallel.
        # instead of equality of size of list, checking if a valid subset
        a = set(datasets_a.keys())
        b = set(datasets_b.keys())
>       self.assertTrue(b.issubset(a))
E       AssertionError: False is not true
tests\test_utils\test_utils.py:54: AssertionError

Maybe restart tests tomorrow

@codecov-io commented Jul 24, 2019

Codecov Report

Merging #723 into develop will increase coverage by 0.12%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #723      +/-   ##
===========================================
+ Coverage    87.89%   88.02%   +0.12%     
===========================================
  Files           36       36              
  Lines         4033     4076      +43     
===========================================
+ Hits          3545     3588      +43     
  Misses         488      488
Impacted Files Coverage Δ
openml/flows/flow.py 93.8% <100%> (ø) ⬆️
openml/extensions/sklearn/extension.py 91.16% <100%> (+0.54%) ⬆️
openml/flows/functions.py 87.96% <100%> (+0.18%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 56fcc00...f7343ec.

@PGijsbers (Collaborator, Author) commented Jul 24, 2019

Flow 20 seems to have been removed from the test server. @janvanrijn do you know what's up? (https://test.openml.org/api/v1/xml/flow/20)

short_name = 'sklearn.{}'
# Generally, we want to trim all hyperparameters, the exception to that is for model
# selection, as the `estimator` hyperparameter is very indicative of what is in the flow.
# So we first trim pipeline names of the `estimator` parameter. For reference:
Collaborator

I'm having trouble matching

So we first trim pipeline names of the `estimator` parameter.

with the code. The code below does not really check for pipelines, but for model selection. I think the issue is that by "pipeline" you do not mean a scikit-learn Pipeline object, is that correct? Could you update the comment to reflect this more clearly?

Collaborator Author

That's right. I meant any number of components specified under the `estimator=` parameter of the model selection object. I understand it's confusing and I'll change it.

run = openml.runs.run_flow_on_task(flow, task)
run = run.publish()
TestBase._mark_entity_for_removal('run', run.run_id)
TestBase.logger.info("collected from {}: {}".format(__file__.split('/')[-1], run.run_id))
Collaborator

Could you please add unit tests for feature union and simple non-pipeline, non-feature selection models here, too?

Collaborator Author

Good catch. FeatureUnion did not work as expected. Added the tests and fixed the issue.

super(TestFlowFunctions, self).setUp()

def tearDown(self):
super(TestFlowFunctions, self).tearDown()
Collaborator

I know it wasn't there, but could you please change this to an assertRaisesRegex?

@PGijsbers (Collaborator, Author) Jul 24, 2019

Done. I also found that flow 8175 is 0.19.1 and not 0.19.2, so in terms of CI we can actually use it for all scikit-learn versions we test against.

@PGijsbers (Collaborator, Author) commented Jul 24, 2019

appveyor/branch, travis-ci/push, travis-ci/pr: [gw1] FAILED tests/test_utils/test_utils.py::OpenMLTaskTest::test_list_datasets_with_high_size_parameter

@mfeurer merged commit 4715796 into develop Jul 24, 2019
@mfeurer deleted the Add_412 branch July 24, 2019 15:29