-
Notifications
You must be signed in to change notification settings - Fork 13
Refactor Task class to include multiple rolling window evaluations
#33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
src/fev/__about__.py
Outdated
| @@ -1 +1 @@ | |||
| __version__ = "0.6.0rc2" | |||
| __version__ = "1.0.0" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll need to bump the version to v1.0.0 given how breaking the changes are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a new branch pre-v1.0.0. We can merge all PRs before the next release into this branch to avoid making the documentation on main out-of-sync with the latest stable release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need to bump it? Are we sure fev is now stable? If not, I'd stick to 0.7 as per semver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, is __about__.py standard / meaningfully consumed by setuptools or similar? why not just define in init.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I moved this to __init__.py. Using __about__.py has the advantage of being able to check the version without loading the full package, but it's probably not worth it.
Task class to include multiple rolling window evaluationsTask class to include multiple rolling window evaluations
| if as_univariate: | ||
| if univariate_target_column in past.column_names and univariate_target_column != task.target_column: | ||
| # Raise error if column called `univariate_target_column` already exists and it's not the *only* target column | ||
| if univariate_target_column in past.column_names and window.target_columns_list != [univariate_target_column]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we hit this often? If yes, maybe it makes sense to change the default to something which would lead to fewer collisions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can only hit this if there is a columns called target in the dataset that is not actually the only target column. This looks like a really rare case to me, so I wouldn't worry about too much.
canerturkmen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great! Dropping some comments.
| data loading. | ||
| window_step_size : int | str | None | ||
| Step size between consecutive evaluation windows. Must be an integer if `initial_cutoff` is an integer. | ||
| Can be an integer or pandas offset string (e.g., 'D', '15min') if `initial_cutoff` is a timestamp. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. if this pandas candy is there, as a user I expect it to be there for horizon as well for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making horizon: int | str can be really annoying for the users since then they won't be able to set MyModel(prediction_length=task.horizon) anymore, and will require some complex pandas offset manipulations.
In contrast, I don't think that user will ever want to access task.window_step_size since this parameter just configures which windows will be produced by iter_windows().
src/fev/__about__.py
Outdated
| @@ -1 +1 @@ | |||
| __version__ = "0.6.0rc2" | |||
| __version__ = "1.0.0" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need to bump it? Are we sure fev is now stable? If not, I'd stick to 0.7 as per semver.
| future: datasets.Dataset, | ||
| *, | ||
| target_column: str | list[str], | ||
| target_columns_list: list[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if at all possible I would avoid type encodings in variables (i.e., ..._list) unless in local variables where they clearly serve to disambiguate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name here was chosen to mimic the task property Task.target_columns_list
Lines 727 to 733 in 9a835d0
| @property | |
| def target_columns_list(self) -> list[str]: | |
| """A list including names of all target columns for this task.""" | |
| if isinstance(self.target_column, list): | |
| return self.target_column | |
| else: | |
| return [self.target_column] |
I think the current naming where target_column can be of type str | list[str] is a bit counterintuitive. How about we rename as follows in a follow-up PR?
Task.target_column->target: str | list[str]Task.target_columns_list->target_columns: list[str]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
up to you. I understand the problem, but the solution doesn't help me.
| raise ValueError("`window_step_size` must be an int if `initial_cutoff` is an int") | ||
| assert self.window_step_size >= 1 | ||
| max_allowed_cutoff = -self.horizon - (self.num_windows - 1) * self.window_step_size | ||
| if self.initial_cutoff < 0 and self.initial_cutoff > max_allowed_cutoff: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite follow this check. So if I understand correctly I can only specify initial cutoff to allow for at least the required number of steps (roughly horizon * nr_windows) in the future. if so, why is it not checked for timestamp as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of timestamp-based cutoffs we cannot perform this check during Task.__post_init__ since we don't yet know which timestamps are available in the dataset.
If there is not enough data to generate the splits with a timestamp-based initial_cutoff, an error will be raised once the user accesses EvaluationWindow.get_input_data() for the window where not enough data is available.
For an integer-based negative initial_cutoff, we can immediately tell if the task configuration is invalid during __post_init__, so this check here is just to save user's time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool
canerturkmen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Target branch should likely be master. Otherwise, LGTM!
@canerturkmen I will merge |
Issue #, if available:
Description of changes:
TaskGeneratorclassTaskEvaluationWindowclass that corresponds to a single cutoff in the original datasetTo do:
Task/EvaluationWindowmethodsBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.