-
-
Notifications
You must be signed in to change notification settings - Fork 211
Function to trim flownames for scikit-learn flows. #723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@mfeurer what's the reason for the dummy learners and/or why they are linked to the scikit-learn extension? Currently the generated flow names are not scikit-learn standard names (i.e. they don't start with |
mfeurer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the class DummyModel @PGijsbers ?
A general answer regarding the 2nd part of your comment is that the scikit-learn extension so far does not require the ML algorithms actually being implemented in scikit-learn. The only requirement is that the classes follow the scikit-learn conventions (fit, predict, transform etc.). It would be great if we can keep this property. One possibility would be to add the package name of non-scikit-learn packages to the short name, too.
Also, I guess feature unions should also be treated as a top-level possibility next to pipelines and model selection classes.
|
This breaks
I think the last option is probably desirable? Or perhaps the second one, as it allows us to change the way we shorten names in the future more easily. The unit test def publish(self, raise_error_if_exists: bool = False) -> 'OpenMLFlow':
""" Publish this flow to OpenML server.
Raises a PyOpenMLError if the flow exists on the server, but
`self.flow_id` does not match the server known flow id.
Parameters
----------
raise_error_if_exists : bool, optional (default=False)
If True, raise PyOpenMLError if the flow exists on the server.
If False, update the local flow to match the server flow.
Returns
-------
self : OpenMLFlow
"""
... <removed code> ...
try:
openml.flows.functions.assert_flows_equal(
self, flow, flow.upload_date, ignore_parameter_values=True
)
except ValueError as e:
message = e.args[0]
raise ValueError("Flow was not stored correctly on the server. "
"New flow ID is %d. Please check manually and "
"remove the flow if necessary! Error is:\n'%s'" %
> (flow_id, message))
E ValueError: Flow was not stored correctly on the server. New flow ID is 22101. Please check manually and remove the flow if necessary! Error is:
E 'Flow sklearn.model_selection._search.GridSearchCV(estimator=sklearn.pipeline.Pipeline(imputer=sklearn.preprocessing.imputation.Imputer,classifier=sklearn.tree.tree.DecisionTreeClassifier)): values for attribute 'custom_name' differ: 'sklearn.GridSearchCV(Pipeline(Imputer,DecisionTreeClassifier))'
E vs
E 'None'.'This is expected as the existing flow does not have the |
|
I think the 3rd option, together with an argument to the assert function to enable this behavior would be best. And the different error message looks fine. |
|
Getting Maybe restart tests tomorrow |
Codecov Report
@@ Coverage Diff @@
## develop #723 +/- ##
===========================================
+ Coverage 87.89% 88.02% +0.12%
===========================================
Files 36 36
Lines 4033 4076 +43
===========================================
+ Hits 3545 3588 +43
Misses 488 488
Continue to review full report at Codecov.
|
|
Flow 20 seems to have been removed from the test server. @janvanrijn do you know what's up? (https://test.openml.org/api/v1/xml/flow/20) |
| short_name = 'sklearn.{}' | ||
| # Generally, we want to trim all hyperparameters, the exception to that is for model | ||
| # selection, as the `estimator` hyperparameter is very indicative of what is in the flow. | ||
| # So we first trim pipeline names of the `estimator` parameter. For reference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble to match
So we first trim pipeline names of the
estimatorparameter.
with the code. The code does not really check for pipelines below, but for model selection. I think the issue is that you do not mean a scikit-learn pipeline object by pipeline, is that correct? Could you update the comment to reflect this more clearly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I meant any number of components specified under estimators= of the model selection. I understand it's confusing and I'll change it.
| run = openml.runs.run_flow_on_task(flow, task) | ||
| run = run.publish() | ||
| TestBase._mark_entity_for_removal('run', run.run_id) | ||
| TestBase.logger.info("collected from {}: {}".format(__file__.split('/')[-1], run.run_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add unit tests for feature union and simple non-pipeline, non-feature selection models here, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. FeatureUnion did not work as expected. Added the tests and fixed the issue.
| super(TestFlowFunctions, self).setUp() | ||
|
|
||
| def tearDown(self): | ||
| super(TestFlowFunctions, self).tearDown() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it wasn't there, but could you please change this to an assertRaisesRegex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, also found that flow 8175 is 0.19.1 and not 0.19.2 so in terms of CI we can actually use it for all scikit-learn versions we test against.
|
appveyor/branch, travis-ci/push, travis-ci/pr: |
There is a desire for shorter flow names for scikit-learn flows.
I added a function which uses the flow name string, and trims it down to its major components.
See #412 for more.