
Conversation

@ArlindKadra (Member) commented Sep 11, 2018

Reference Issue

#515, #498, #457

What does this PR implement/fix? Explain your changes.

The PR fixes the documentation for run_flow_on_task; it also tackles run_flow_on_task always uploading the flow.

How should this PR be tested?

Call the function and verify that it returns a run object in the end without uploading the flow to OpenML.

@ArlindKadra changed the title from "[WIP] Refactoring run_flow_on_task" to "[WIP] Refactoring run_flow_on_task and doc add for run_model" on Sep 11, 2018
@PGijsbers (Collaborator)

Hi, I am interested in having this functionality and wonder if it is OK for me to work on this. I am a bit at a loss about the changes made so far, so I have some questions to better understand your work:

  • What is the _to_dict function for (in tasks/tasks.py)? It returns some flow_container that is not referenced elsewhere, and it seems to serialize flows even though it is located in tasks.py. Was it meant for flow serialization (for saving to disk)?
  • Preventing flow checks/uploads is not actually handled yet, is that correct?

In general, I thought of updating the process to the following:

  • Check whether the flow exists online; if so, download it. If not, or if there is no internet connection, use the local flow (and possibly publish it if a flag is set and internet is available).
  • Do the run logic.
  • If still using a local flow, reference it when creating an OpenMLRun.

Then upload the flow when:

  • publishing the run: if a local flow was used, upload it first and update the OpenMLRun with the new flow information before uploading the run (a rough sketch follows below), or
  • flow.publish is called.
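
To make the intended behaviour concrete, here is a rough sketch of that publish-time logic; the helper names (flow_exists_on_server, upload_run_to_server) and the exact attributes are illustrative assumptions, not the actual openml-python API.

# Rough sketch only: deferred flow upload when publishing a run.
# flow_exists_on_server / upload_run_to_server are hypothetical helpers.
def publish_run(run):
    flow = run.flow
    if run.flow_id is None:                  # run was created against a local flow
        if not flow_exists_on_server(flow):  # hypothetical server-side existence check
            flow.publish()                   # upload the local flow first
        run.flow_id = flow.flow_id           # point the run at the server-assigned flow id
    upload_run_to_server(run)                # hypothetical call that uploads the run itself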

@ArlindKadra (Member, Author) commented Feb 21, 2019

Hey @PGijsbers

What is the _to_dict function for (in tasks/tasks.py)? It returns some flow_container that is not referenced elsewhere, and it seems to serialize flows even though it is located in tasks.py. Was it meant for flow serialization (for saving to disk)?

It has been quite a bit of time. Judging from the commit name, I was implementing the generation of a task dictionary and just copy-pasted the code from the _to_dict function in the flow class. I started tweaking it and never got back to the PR, which is why there are still flow parts in there. Basically it is not useful, so you should remove that function completely.

This is addressed in #607 .

Preventing flow checks/uploads is not actually handled yet, is that correct?

Yes, it is not handled yet. Preventing flow uploads is not easily done, since a flow id is needed to create the run; the idea, if I remember correctly, was to store the run locally.
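
For context, a minimal illustration of that dependency (the constructor arguments shown here are an assumption about the API, not a verified signature):

# Illustrative only: creating a run references a flow id, so a purely local,
# unpublished flow has no server-side id to point to yet.
run = openml.runs.OpenMLRun(
    task_id=task.task_id,
    flow_id=flow.flow_id,        # None for a local, unpublished flow
    dataset_id=task.dataset_id,
)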

As to how you want to proceed, it seems really good to me. We had a discussion on how to handle this in the last OpenML workshop and this was what we agreed upon. @mfeurer @janvanrijn can you confirm this was and still is the case ?

The functions to get the cached flows should be useful, and the documentation change for run_model_on_task is nice too.

@ArlindKadra (Member, Author)

@mfeurer should the flow cache functions be moved to another PR?

@PGijsbers (Collaborator)

So basically I should undo the last commit (22b1e62) before proceeding?

@ArlindKadra (Member, Author)

Yes, the other commits should be useful.

@codecov-io commented Feb 21, 2019

Codecov Report

Merging #516 into develop will increase coverage by 0.43%.
The diff coverage is 90.21%.


@@             Coverage Diff             @@
##           develop     #516      +/-   ##
===========================================
+ Coverage    90.12%   90.56%   +0.43%     
===========================================
  Files           32       32              
  Lines         3232     3520     +288     
===========================================
+ Hits          2913     3188     +275     
- Misses         319      332      +13
Impacted Files Coverage Δ
openml/flows/sklearn_converter.py 94.39% <100%> (+3.99%) ⬆️
openml/datasets/functions.py 92.73% <100%> (ø) ⬆️
openml/flows/functions.py 87.4% <79.16%> (-5.86%) ⬇️
openml/runs/functions.py 86.43% <92.45%> (-0.63%) ⬇️
openml/runs/run.py 90.56% <93.33%> (+1.46%) ⬆️
openml/exceptions.py 96.87% <94.11%> (-3.13%) ⬇️
openml/flows/flow.py 94.08% <96.77%> (+0.22%) ⬆️
openml/_api_calls.py 83.11% <0%> (-5.2%) ⬇️
openml/tasks/functions.py 86.45% <0%> (-1.94%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 98a73b3...e63e495.

@mfeurer (Collaborator) commented Feb 22, 2019

I think this is what we discussed in Paris. And flow caching should be fine within this PR as it would already test the new flow caching functionality.

@mfeurer (Collaborator) left a review comment

A few more comments regarding the unit tests.

@mfeurer (Collaborator) commented Mar 5, 2019

I played a bit with the code and it appears that it cannot handle a simple random forest at the moment:

import sklearn.ensemble

import openml

openml.config.server = "https://test.openml.org/api/v1/xml"
openml.config.apikey = "610344db6388d9ba34f6db45a3cf71de"

model = sklearn.ensemble.RandomForestClassifier(n_estimators=33)

task = openml.tasks.get_task(12)

run = openml.runs.run_model_on_task(
    model=model,
    task=task,
    avoid_duplicate_runs=False,
    upload_flow=False,
)

print(run)
print(run.flow)
print(run.flow.flow_id)
print(run.flow_id)

run.publish()

as it results in:

[run id: None, task id: 12, flow id: None, flow name: sklearn.ensemble.forest.Ra...]
<openml.flows.flow.OpenMLFlow object at 0x7fe912411438>
None
None
Traceback (most recent call last):
  File "/home/feurerm/sync_dir/projects/openml/python/testing3.py", line 24, in <module>
    run.publish()
  File "/home/feurerm/sync_dir/projects/openml/python/openml/runs/run.py", line 402, in publish
    self.parameter_settings = openml.flows.obtain_parameter_values(self.flow, self.model)
  File "/home/feurerm/sync_dir/projects/openml/python/openml/flows/sklearn_converter.py", line 373, in obtain_parameter_values
    model = model if model else flow.model
  File "/home/feurerm/miniconda/3-4.5.4/envs/openml/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 140, in __len__
    return len(self.estimators_)
AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_'

@PGijsbers (Collaborator)

Weird. I can reproduce the problem locally and am looking into it.

@PGijsbers (Collaborator)

It seems that in the absence of an explicit __bool__ operator, Python objects fall back on the __len__ operator to handle 'truthiness'. Since BaseEnsemble has a __len__ method but not a __bool__ method, __len__ was invoked. I have now refactored the problematic line
model = model if model else flow.model
to include an explicit None check:
model = model if model is not None else flow.model
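
A small self-contained illustration of that fallback (a toy class standing in for BaseEnsemble, not the actual sklearn code):

# Toy example: truth-testing an object that defines __len__ but no __bool__
# falls back to __len__, which here raises just like an unfitted ensemble.
class FakeEnsemble:
    def __len__(self):
        # mirrors BaseEnsemble.__len__, which returns len(self.estimators_);
        # estimators_ only exists after fit(), so this raises beforehand
        return len(self.estimators_)

model = FakeEnsemble()

try:
    if model:  # no __bool__, so Python calls __len__ for the truth test
        pass
except AttributeError as err:
    print(err)  # 'FakeEnsemble' object has no attribute 'estimators_'

if model is not None:  # the explicit None check never touches __len__
    print("model is set")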

…flow to ensure it does not exist on the server.
…f associated flow did not exist but was also not uploaded. This gave errors at publish-time.
@mfeurer (Collaborator) left a review comment

From my side we're good to merge. The only nitpick would be that some fields are not filled prior to uploading, so the user can get None when accessing them. However, I think this is more related to #634.

@ArlindKadra do you have any comments as you started this PR?

@PGijsbers (Collaborator)

But that is essentially the same as a user just creating the flow through sklearn_to_flow and then accessing the None fields, right? If so, I would agree that it's not so much related to this PR as to #634.
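
For comparison, the unpublished-flow case looks roughly like this (assuming the sklearn_to_flow converter available in openml-python at the time):

# Illustrative: a flow built locally has no server-assigned fields yet.
import sklearn.tree
import openml

flow = openml.flows.sklearn_to_flow(sklearn.tree.DecisionTreeClassifier())
print(flow.flow_id)  # None until flow.publish() is called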

@ArlindKadra (Member, Author)

@mfeurer @PGijsbers except for a small suggestion, everything looks really good to me :)
