-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Hypothesis strategy for chunking arrays #9374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Can one of the admins verify this patch? Admins can comment |
|
ok to test |
|
Q: Should this utility expose |
|
For some reason the new strategy is not showing up on the API docs page |
|
which I would think is hypothesis' fault? |
|
Thanks for opening this @TomNicholas! I have a few thoughts:
|
|
Thanks for your comments @jsignell !
I noticed there are recent changes to the code for that decorator in hypothesis so it might be that requiring a more recent version would fix this typing error? Looks like this PR was intended to make typing work with
This depends on whether you think the
This is really an intuition question for dask devs about which they think is more important for finding bugs.
I kind of don't really want to do that because they imply different strategies implementations under the hood, so allowing both is twice as much work 😅
I'm putting these here so that I can import them in xarray (/xGCM/xhistogram...)! So I am happy to maintain them, but I also want the |
Ah ok. Thanks for explaining. I took a look and realized that
|
|
Coming back to this after a year 😅 I've become interested in this problem again, and would like to merge this, but I'm just fighting with the CI to get everything to pass. |
|
I've been adding hypothesis to the test envs but I guess I could just add an |
Zac-HD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋 exciting to see this active again!
My best guess is that the mypy error is because you're pinned to mypy == 1.1 - the current version is 1.7, and has a lot of bugfixes.
| chunks = data.draw(strategies.chunks(shape, axes=axes)) | ||
|
|
||
| # assert that chunks add up to array lengths along all axes | ||
| lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks) | |
| lengths = tuple(sum(chunks_along_axis) for chunks_along_axis in chunks) |
| if not isinstance(shape, tuple): | ||
| raise ValueError("shape argument must be a tuple of ints") | ||
|
|
||
| if min_chunk_length < 1 or not isinstance(min_chunk_length, int): | ||
| raise ValueError("min_chunk_length must be an integer >= 1") | ||
|
|
||
| if max_chunk_length is not None: | ||
| if max_chunk_length < 1 or not isinstance(min_chunk_length, int): | ||
| raise ValueError("max_chunk_length must be an integer >= 1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use hypothesis.errors.InvalidArgument, as the convention for "you passed an invalid argument to a strategy function" (it inherits from TypeError).
| if min_chunk_length > max_chunk_length_remaining: | ||
| # if we are at the end of the array we have no choice but to use a smaller chunk | ||
| chunk = remaining_length | ||
| else: | ||
| chunk = draw( | ||
| st.integers( | ||
| min_value=min_chunk_length, max_value=max_chunk_length_remaining | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shrinking will tend to work better if the underlying elements are drawn from the same domain. In this case I'd implement the strategy by:
- calculating the range of valid chunk sizes and range of number-of-chunks (excluding the final chunk)
- drawing from
st.lists(st.integers(...), ...) - discard trailing elements which would take you over
ax_lengthand manually insert a final chunk of sizeax_length - sum(chunk_sizes)
You could even implement this as a non-@st.composite function which did some validation and precomputation, and then return st.lists(...).map(_turn_into_valid_chunks) to better amortize the setup cost.

Example of using this PR to create examples of arrays with a specific shape but arbitrary chunk structure:
pre-commit run --all-files