-
-
Notifications
You must be signed in to change notification settings - Fork 211
Adding Perrone example for building surrogate #832
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@mfeurer might need some more textual descriptions, but I presume that the person going through this example already is aware of the paper. |
| a tabular format that can be used to build models. | ||
| """ | ||
|
|
||
| def fetch_evaluations(run_full=False, flow_type='svm', metric = 'area_under_roc_curve'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please remove the whitespace around the equals for metric=...?
| return eval_df, task_ids, flow_id | ||
|
|
||
|
|
||
| def create_table_from_evaluations(eval_df, flow_type='svm', run_count=np.iinfo(np.int64).max, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please put each argument into its own line?
|
|
||
|
|
||
| def create_table_from_evaluations(eval_df, flow_type='svm', run_count=np.iinfo(np.int64).max, | ||
| metric = 'area_under_roc_curve', task_ids=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
| values : list | ||
| ''' | ||
| if task_ids is not None: | ||
| eval_df = eval_df.loc[eval_df.task_id.isin(task_ids)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about eval_df.query()? I usually find this easier to read where possible.
| ''' | ||
| if task_ids is not None: | ||
| eval_df = eval_df.loc[eval_df.task_id.isin(task_ids)] | ||
| ncols = 4 if flow_type == 'svm' else 10 # ncols determine the number of hyperparameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That line is duplicated in the if/else statement below.
| # Replacing NaNs with fixed values outside the range of the parameters | ||
| # given in the supplement material of the paper | ||
| if flow_type == 'svm': | ||
| eval_table.kernel.fillna("None", inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please not use the syntax of indexing the columns of the array like they were attributes? This is likely to be removed from pandas in the future (and kind of confusing attributes of the dataframe object with the content of the dataframe).
| eval_table = impute_missing_values(eval_table, flow_type) | ||
| # Encode categorical variables as one-hot vectors | ||
| enc = OneHotEncoder(handle_unknown='ignore') | ||
| enc.fit(eval_table.kernel.to_numpy().reshape(-1, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
|
|
||
|
|
||
| def preprocess(eval_table, flow_type='svm'): | ||
| eval_table = impute_missing_values(eval_table, flow_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please construct a scikit-learn pipeline? That could then be easily used for also predicting for new hyperparameter settings with calling a single function.
|
|
||
| eval_df, task_ids, flow_id = fetch_evaluations(run_full=False) | ||
| X, y = create_table_from_evaluations(eval_df, run_count=1000) | ||
| X = preprocess(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a print statement here so that the user sees the data format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, could you please relate more to the paper? Is this the metadata for the meta-tasks in the paper?
Codecov Report
@@ Coverage Diff @@
## develop #832 +/- ##
==========================================
Coverage ? 89.78%
==========================================
Files ? 36
Lines ? 5118
Branches ? 0
==========================================
Hits ? 4595
Misses ? 523
Partials ? 0Continue to review full report at Codecov.
|
|
@Neeratyoy could you please have a look at my changes? Apparently, the flake8 checks only fail on the push, not the pr check, so we should be fine. Please feel free to merge if you agree with my changes. |
What does this PR implement/fix? Explain your changes.
Shows an example of Perrone et al's use of previous OpenML runs to build a surrogate model.