[MRG+1] Make ParameterSampler sample without replacement #3850
Conversation
sklearn/grid_search.py
Would be worth mentioning the concrete values for the size of the parameter space and the value of n_iter in the error message.
Done. Do you think we should introduce an option?
sklearn/grid_search.py
Don't you think that this way of doing sampling without replacement might be too costly for values of n_iter close to the size of the parameter space? The last few samples will have a very high likelihood of being rejected. Maybe we should check for the case self.n_iter > 0.1 * grid_size and special-case it by returning the first n_iter elements of a permutation of the materialized list of parameter combinations.
Well, in the worst case a single sample could have time complexity O(n_iter). I thought that was probably not an issue, but I can just special-case the whole finite grid-size path.
Ended up implementing your heuristic ;)
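The heuristic under discussion can be sketched as follows. This is an illustrative stand-alone version under the assumptions from the comments above (the `0.1 * grid_size` threshold, rejection sampling for the sparse case); the function name is hypothetical and this is not the actual grid_search.py code:

```python
import random
from itertools import product

def sample_grid_without_replacement(param_lists, n_iter, seed=0):
    """Illustrative sketch of the heuristic discussed above
    (hypothetical helper, not the grid_search.py implementation)."""
    rng = random.Random(seed)
    keys = sorted(param_lists)
    grid_size = 1
    for k in keys:
        grid_size *= len(param_lists[k])
    if n_iter > grid_size:
        raise ValueError("n_iter=%d exceeds grid size %d"
                         % (n_iter, grid_size))
    if n_iter > 0.1 * grid_size:
        # We need a large fraction of the grid anyway, so materialise it
        # and return a shuffled prefix: no rejections at all.
        grid = [dict(zip(keys, combo))
                for combo in product(*(param_lists[k] for k in keys))]
        rng.shuffle(grid)
        return grid[:n_iter]
    # Small fraction of the grid: rejection-sample index tuples;
    # re-draws are rare because under 10% of combinations are taken.
    seen, out = set(), []
    while len(out) < n_iter:
        combo = tuple(rng.choice(param_lists[k]) for k in keys)
        if combo not in seen:
            seen.add(combo)
            out.append(dict(zip(keys, combo)))
    return out
```

The cutover avoids the pathological case where rejection sampling re-draws almost every candidate near exhaustion of the grid.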
I have no strong opinion. Maybe a warning is enough.

How would you warn so that it doesn't pop up all the time?
sklearn/grid_search.py
80 columns :)
Apart from my cosmetic comments, +1 on my side.
Any other reviews?
I think this probably should be an option. Not altogether certain, though.
There is not really a benefit in doing sampling with replacement, but maybe it would be less magic.
Okay. I recall the issue of ParameterSampler sampling with replacement. (Sent by email on 6 January 2015 at 04:01, in reply to Andreas Mueller.)
sklearn/grid_search.py
Why not just use sample_without_replacement at this point? It uses a similar heuristic to decide which algorithm to use...
Oh, never mind. I didn't read properly that you still do sampling with replacement when RVs are involved, which may be necessary for discrete RVs.
Still, could just use that function for the all_lists case.
True.
fixed
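For the all-lists case, the suggestion above amounts to drawing n_iter distinct flat indices and decoding each one into a combination, so the full grid is never materialised. A rough sketch under that assumption, using the stdlib's `random.sample` as a stand-in for `sklearn.utils.random.sample_without_replacement` (the function name below is hypothetical):

```python
import random

def sample_indices_into_grid(param_lists, n_iter, seed=0):
    # Draw n_iter *distinct* flat indices into the Cartesian product,
    # then decode each index with mixed-radix arithmetic; no
    # combination is ever stored beyond the ones returned.
    keys = sorted(param_lists)
    sizes = [len(param_lists[k]) for k in keys]
    grid_size = 1
    for s in sizes:
        grid_size *= s
    rng = random.Random(seed)
    out = []
    for ind in rng.sample(range(grid_size), n_iter):
        combo = {}
        # Peel off one "digit" per parameter, least-significant first.
        for k, size in zip(reversed(keys), reversed(sizes)):
            ind, offset = divmod(ind, size)
            combo[k] = param_lists[k][offset]
        out.append(combo)
    return out
```

Because the indices are distinct and the index-to-combination decoding is a bijection, the resulting parameter settings are distinct as well.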
sklearn/grid_search.py
This if block can be collapsed with the previous one.
done
Should the new behaviour be documented in …?
+1 for merge (again).
@jnothman Documented it in both places.
Do we think sampling without replacement is more appropriate for e.g. poisson sampling than for lists?
Sorry, I don't understand your question. For distributions we do sampling with replacement, as we can't know whether the space has been exhausted yet.
Of course, wasn't thinking (although the alternative is to promise the user …). (Sent by email on 18 January 2015 at 07:00, in reply to Andreas Mueller.)
Merging.
[MRG] Make ParameterSampler sample without replacement
Thanks @amueller.
Thanks for the reviews :)
Thanks for this, @amueller!
np |
Hi @amueller, @adinh and I ran into an issue as a result of this change. It makes the ParameterSampler extremely memory hungry even if you have a small number of categorical parameters to search over, because it now tries to generate all possible combinations in memory. The code below uses 600MB just for setting up the parameter search; add a couple more parameters and you quickly hit many gigabytes.

```python
from sklearn.grid_search import ParameterSampler
from memory_profiler import profile

@profile
def sample_params(params):
    my_sampler = ParameterSampler(params, 10)
    for sample in my_sampler:
        print(sample)

if __name__ == '__main__':
    parameter_distribution = {
        'param_a': range(10),
        'param_b': range(10),
        'param_c': range(10),
        'param_d': range(10),
        'param_e': range(10),
        'param_f': range(10),
    }
    sample_params(parameter_distribution)
```

You mentioned somewhere above adding the ability to choose whether or not to sample with replacement; I think that would be a good option. Or at least we should update the docstring to point out what's going on here. I'm not sure whether to file a bug report for this or if it's expected behaviour. Thoughts?
I think this is a "bug", or at least an issue. A parameter to fix this might help, but really the problem is in realising the parameter grid in memory. ParameterGrid could instead support indexing:

```python
def __getitem__(self, ind):
    """
    >>> grid = ParameterGrid([{'kernel': ['linear']},
    ...                       {'kernel': ['rbf'], 'gamma': [1, 10]}])
    >>> [grid[i] for i in range(len(grid))] == list(grid)
    True
    """
    for sub_grid in self.param_grid:
        # XXX: could memoize information used here
        keys, values = zip(*sorted(sub_grid.items()))
        sizes = [len(v) for v in values]
        modulo = np.cumprod(np.hstack([1, sizes[::-1]]))[::-1]
        if ind >= modulo[0]:
            ind -= modulo[0]
        else:
            offsets = ind // modulo[1:] % sizes
            return {k: v[o] for k, v, o in zip(keys, values, offsets)}
    raise ValueError
```

and below, change

```python
param_grid = list(ParameterGrid(self.param_distributions))
```

to

```python
param_grid = ParameterGrid(self.param_distributions)
```
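As a self-contained illustration of that proposal, the same mixed-radix arithmetic can be exercised as a plain function outside the class (the name `grid_item` is hypothetical; the decoding logic mirrors the snippet above):

```python
import numpy as np

def grid_item(param_grid, ind):
    # Standalone version of the proposed ParameterGrid.__getitem__:
    # decode a flat index into one combination, one sub-grid at a time,
    # without materialising any combinations.
    for sub_grid in param_grid:
        keys, values = zip(*sorted(sub_grid.items()))
        sizes = [len(v) for v in values]
        # modulo[0] is this sub-grid's size; modulo[1:] are the strides
        modulo = np.cumprod(np.hstack([1, sizes[::-1]]))[::-1]
        if ind >= modulo[0]:
            ind -= modulo[0]  # index falls in a later sub-grid
        else:
            offsets = ind // modulo[1:] % sizes
            return {k: v[o] for k, v, o in zip(keys, values, offsets)}
    raise IndexError(ind)

grid = [{'kernel': ['linear']}, {'kernel': ['rbf'], 'gamma': [1, 10]}]
```

With this, a RandomizedSearchCV-style caller only needs O(1) memory per drawn index instead of realising the whole grid up front.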
Fixes #3792.
I think this issue came up before, so maybe it is worth addressing.
This makes the ParameterSampler sample without replacement if all parameters are given as lists.
This has a couple of gotchas:

- It raises an error if the size of the grid is smaller than n_iter.
- 'bla': bernoulli(0.5) is not equivalent any more to 'bla': [0, 1] (the first does sampling with replacement, the second without).

Possibilities to resolve these would be:

- Warn if n_iter > grid_size and produce a smaller grid (or sample with replacement in this case?). Might invalidate the __len__ implementation.
- Add a parameter with_replacement, set it to None, and go through a deprecation cycle to set it to False. The parameter would only do something in the case all parameters are lists, though.
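The second gotcha can be illustrated directly: with a list, two draws without replacement must exhaust [0, 1], while a Bernoulli RV is sampled with replacement and can repeat a value. A minimal sketch, simulating the RV with `random.choice` rather than scipy.stats.bernoulli purely for illustration:

```python
import random

rng = random.Random(0)

# 'bla': [0, 1] -> sampling without replacement: two draws see each
# value exactly once, in some order.
draws_from_list = rng.sample([0, 1], 2)

# 'bla': bernoulli(0.5) -> sampling with replacement (simulated here
# with choice): two draws may well repeat the same value.
draws_from_rv = [rng.choice([0, 1]) for _ in range(2)]
```

So the two specifications describe the same set of values but different sampling semantics, which is why an explicit option was considered.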