Skip to content

Conversation

@Neeratyoy
Copy link
Contributor

What does this PR implement/fix? Explain your changes.

Shows an example of Perrone et al's use of previous OpenML runs to build a surrogate model.

@Neeratyoy
Copy link
Contributor Author

@mfeurer might need some more textual descriptions, but I presume that the person going through this example already is aware of the paper.

@Neeratyoy Neeratyoy requested a review from mfeurer October 15, 2019 18:18
a tabular format that can be used to build models.
"""

def fetch_evaluations(run_full=False, flow_type='svm', metric = 'area_under_roc_curve'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please remove the whitespace around the equals for metric=...?

return eval_df, task_ids, flow_id


def create_table_from_evaluations(eval_df, flow_type='svm', run_count=np.iinfo(np.int64).max,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please put each argument into its own line?



def create_table_from_evaluations(eval_df, flow_type='svm', run_count=np.iinfo(np.int64).max,
metric = 'area_under_roc_curve', task_ids=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

values : list
'''
if task_ids is not None:
eval_df = eval_df.loc[eval_df.task_id.isin(task_ids)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about eval_df.query()? I usually find this easier to read where possible.

'''
if task_ids is not None:
eval_df = eval_df.loc[eval_df.task_id.isin(task_ids)]
ncols = 4 if flow_type == 'svm' else 10 # ncols determine the number of hyperparameters
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That line is duplicated in the if/else statement below.

# Replacing NaNs with fixed values outside the range of the parameters
# given in the supplement material of the paper
if flow_type == 'svm':
eval_table.kernel.fillna("None", inplace=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please not use the syntax of indexing the columns of the array like they were attributes? This is likely to be removed from pandas in the future (and kind of confusing attributes of the dataframe object with the content of the dataframe).

eval_table = impute_missing_values(eval_table, flow_type)
# Encode categorical variables as one-hot vectors
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(eval_table.kernel.to_numpy().reshape(-1, 1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.



def preprocess(eval_table, flow_type='svm'):
eval_table = impute_missing_values(eval_table, flow_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please construct a scikit-learn pipeline? That could then be easily used for also predicting for new hyperparameter settings with calling a single function.


eval_df, task_ids, flow_id = fetch_evaluations(run_full=False)
X, y = create_table_from_evaluations(eval_df, run_count=1000)
X = preprocess(X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a print statement here so that the user sees the data format?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could you please relate more to the paper? Is this the metadata for the meta-tasks in the paper?

@codecov-io
Copy link

codecov-io commented Oct 16, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@29a023c). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop     #832   +/-   ##
==========================================
  Coverage           ?   89.78%           
==========================================
  Files              ?       36           
  Lines              ?     5118           
  Branches           ?        0           
==========================================
  Hits               ?     4595           
  Misses             ?      523           
  Partials           ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29a023c...9ca9d87. Read the comment docs.

@Neeratyoy Neeratyoy requested a review from mfeurer October 17, 2019 14:45
@mfeurer
Copy link
Collaborator

mfeurer commented Oct 17, 2019

@Neeratyoy could you please have a look at my changes? Apparently, the flake8 checks only fail on the push, not the pr check, so we should be fine. Please feel free to merge if you agree with my changes.

@Neeratyoy Neeratyoy merged commit 56fa7f9 into develop Oct 18, 2019
@Neeratyoy Neeratyoy deleted the transfer_learning_example branch October 18, 2019 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants