Add classes for calibrating average ensemble and ensemble of tragedies#1048
Add classes for calibrating average ensemble and ensemble of tragedies#1048emanuel-schmid merged 126 commits intodevelopfrom
Conversation
# Conflicts: # script/jenkins/branches/Jenkinsfile # tests_runner.py
Use negative cost function as target function in BayesianOptimizer Co-authored-by: Thomas Vogt <[email protected]>
…pact-functions # Conflicts: # doc/user-guide/climada_util_calibrate.ipynb
emanuel-schmid
left a comment
There was a problem hiding this comment.
Awesome amendment! Very well written module I'd say.
I just wish the pydoc strings were more elaborate and I wonder about the use of dataclasses.
climada/util/calibrate/ensemble.py
Outdated
| from ...engine.unsequa.input_var import InputVar | ||
| from ...entity.impact_funcs import ImpactFunc, ImpactFuncSet | ||
| from ..coordinates import country_to_iso | ||
| from .base import Input, Optimizer, Output |
There was a problem hiding this comment.
PEP 8: "Absolute imports are recommended, as they are usually more readable and tend to be better behaved (or at least give better error messages)"
Imo, this is probably true for ..-modules. For siblings relative imports are OK.
| from ...engine.unsequa.input_var import InputVar | |
| from ...entity.impact_funcs import ImpactFunc, ImpactFuncSet | |
| from ..coordinates import country_to_iso | |
| from .base import Input, Optimizer, Output | |
| from climada.engine.unsequa.input_var import InputVar | |
| from climada.entity.impact_funcs import ImpactFunc, ImpactFuncSet | |
| from climada.util.coordinates import country_to_iso | |
| from .base import Input, Optimizer, Output |
| @dataclass | ||
| class EnsembleOptimizerOutput: | ||
| """The collective output of an ensemble optimization""" | ||
|
|
||
| data: pd.DataFrame |
There was a problem hiding this comment.
This is stretching the concept of a dataclass as I see it a lot. For instance, you can't even do this:
EnsembleOptimizerOutput(df1) == EnsembleOptimizerOutput(df2)
Is there a reason why it has to be a dataclass? Woldn't it be more natural to have an ordinary class that uses from_outputs as __init__ ?
There was a problem hiding this comment.
I am using dataclass to avoid boilerplate code. In my view, the __init__ of the class should be as permissive as possible. Therefore, a usual init will look like this:
class MyClass:
def __init__(attr1, attr2):
self.attr1 = attr1
self.attr2 = attr2I think the code is trivial and using dataclass to construct is seems reasonable to me. It improves readability, maintainability, and extensibility.
Woldn't it be more natural to have an ordinary class that uses
from_outputsas__init__?
I think from_outputs is too restrictive for making it the default initialization (although I use it exclusively throughout my code). See my comment above.
You can't even do this:
A good point! I will implement a custom __eq__. But I don't think that's a reason to abandon dataclass in this case.
There was a problem hiding this comment.
I implemented a custom __eq__ for EnsembleOptimizerOutput
| return ax | ||
|
|
||
|
|
||
| @dataclass |
There was a problem hiding this comment.
Input has no __eq__, hence this class has neither. Here too, I'm not sure why it needs to be a dataclass.
Is it about the implicit immutability of a dataclass object? If so, we maybe even should make it explicit by
| @dataclass | |
| @dataclass(frozen=True) |
Although this would require to drop the __post_init__ methods and InitVars and replace them by something like
@classmethod
def from_initvars(cls, x, y):
return cls(something(x,y))
Even so, it would be nice to allow comparing two such objects meaningfully - and if it means that we have to implement __eq__ for a dataclass. Otherwise I honestly don't see a point of calling it dataclass.
There was a problem hiding this comment.
Again, I use dataclass mostly to avoid a trivial init. This is especially useful for derived classes, where you might easily miss that you also need to initialize the base class.
I did not check it, but a default __eq__ should be created by dataclass. I did not really consider freezing the instances, as I think it's not necessary. Input probably is the only sensible candidate for freezing, as results are distorted if one changes it.
There was a problem hiding this comment.
not really convinced given the need to implement __eq__ and __post__
- but I guess it doesn't matter much. 😎
climada/util/calibrate/ensemble.py
Outdated
|
|
||
| @dataclass | ||
| class TragedyEnsembleOptimizer(EnsembleOptimizer): | ||
| """An optimizer for the ensemble of tragedies |
There was a problem hiding this comment.
A very short summary that doesn't give much of a clue what this implementation of the EnsembleOpitmizer does.
| self.samples = rng.choice(self.samples, ensemble_size, replace=False) | ||
|
|
||
| def input_from_sample(self, sample: list[tuple[int, int]]): | ||
| """Subselect all input""" |
There was a problem hiding this comment.
I agree that this does not become clear in the docs. The idea is the following: Each EnsembleOptimizer only stores a list of samples, which contains the indices of the data points that are to be used in each optimization. The concrete optimizer then defines in input_from_sample, how a new Input object is generated for each sample and thus each optimization. In TragedyEnsembleOptimizer, this is used to drastically reduce the amount of hazard data that needs to be processed by each optimization. I'll try to clarify
climada/util/calibrate/ensemble.py
Outdated
|
|
||
| @dataclass | ||
| class AverageEnsembleOptimizer(EnsembleOptimizer): | ||
| """An optimizer for the average ensemble |
There was a problem hiding this comment.
🤔 Hard to understand. I wish the description was more elaborate. s.b.
climada/util/calibrate/ensemble.py
Outdated
|
|
||
| @dataclass | ||
| class EnsembleOptimizer(ABC): | ||
| """Abstract base class for defining an ensemble optimizer |
There was a problem hiding this comment.
What is an ensemble optimizer?
There was a problem hiding this comment.
Will clarify in the docs. An optimizer yields a single set of optimal parameters (or impact functions), whereas an ensemble optimizer yields an ensemble of them.
* Use weights for sampling with replacement in AverageEnsembleOptimizer. * Update tests
|
@emanuel-schmid I hope the docs are clearer now! I also found a bug in the implementation where the |
…-project/climada_python into cross-calibrate-impact-functions
emanuel-schmid
left a comment
There was a problem hiding this comment.
Thanks for improving the docs! Sorry for taking so long.
It all looks good to me, apart from a few minor, mainly cosmetical, concerns.
Happy to merge.
| return -self.input.cost_func(data, predicted, weights) | ||
|
|
||
| def run(self, controller: BayesianOptimizerController) -> BayesianOptimizerOutput: | ||
| def run(self, **opt_kwargs) -> BayesianOptimizerOutput: |
There was a problem hiding this comment.
Why is the signature not just like the pydoc describes it?
There was a problem hiding this comment.
Same argument as we had in the discussion about the unsequa abstract base class:
The Optimizer abstract base class defines an interface that should be kept for all derived classes. However, the derived classes can, or in this case must, take additional, class specific arguments. The only elegant (?) way I see for doing this is simply going with a variadic kwargs argument and then documenting the method like it had the actual arguments that are needed here.
tldr: The linter would complain: https://pylint.pycqa.org/en/latest/user_guide/messages/warning/arguments-differ.html
There was a problem hiding this comment.
Note that we do the same thing in the ScipyOptimizer:
| # Take the controller | ||
| try: | ||
| controller = opt_kwargs.pop("controller") | ||
| except KeyError as err: | ||
| raise RuntimeError( | ||
| "BayesianOptimizer.run requires 'controller' as keyword argument" | ||
| ) from err | ||
|
|
There was a problem hiding this comment.
all of this wouldn't be necessary with a signature as described in the pydoc
| return ax | ||
|
|
||
|
|
||
| @dataclass |
There was a problem hiding this comment.
not really convinced given the need to implement __eq__ and __post__
- but I guess it doesn't matter much. 😎
| LOGGER = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def sample_data(data: pd.DataFrame, sample: list[tuple[int, int]]): |
There was a problem hiding this comment.
function could be private if we wanted (with regard to possible Extensions better not though)
| return data_sampled | ||
|
|
||
|
|
||
| def sample_weights(weights: pd.DataFrame, sample: list[tuple[int, int]]): |
| return weights_sampled | ||
|
|
||
|
|
||
| def event_info_from_input(inp: Input) -> dict[str, Any]: |
Co-authored-by: Emanuel Schmid <[email protected]>
|
@peanutfun this is ready for mergin - isn't it? |
Changes proposed in this PR:
PR Author Checklist
develop)PR Reviewer Checklist