
Conversation

@TomNicholas commented Aug 11, 2022

An example of using this PR to create arrays with a specific shape but an arbitrary chunk structure:

In [9]: arr
Out[9]: 
array([[1, 2, 3],
       [4, 5, 6]])

In [10]: from dask.array.strategies import chunks

In [11]: dask.array.from_array(arr, chunks=chunks(arr.shape).example())
Out[11]: dask.array<array, shape=(2, 3), dtype=int64, chunksize=(1, 3), chunktype=numpy.ndarray>

In [12]: dask.array.from_array(arr, chunks=chunks(arr.shape).example())
Out[12]: dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 2), chunktype=numpy.ndarray>
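
For context, here is a minimal sketch (not part of this PR) of how the strategy might be used inside a property-based test. It assumes the dask.array.strategies.chunks signature shown above, and the test name and use of st.data() are just illustrative:

import numpy as np
import dask.array as da
from hypothesis import given, strategies as st

from dask.array.strategies import chunks  # the strategy added in this PR


@given(data=st.data())
def test_arbitrary_rechunking_preserves_values(data):
    arr = np.array([[1, 2, 3], [4, 5, 6]])
    # Draw one arbitrary (but valid) chunk structure for this shape.
    chunk_structure = data.draw(chunks(arr.shape))
    darr = da.from_array(arr, chunks=chunk_structure)
    # Whatever the chunking, computing should reproduce the original array.
    np.testing.assert_array_equal(darr.compute(), arr)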

@GPUtester
Collaborator

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

@github-actions bot added the array label Aug 11, 2022
@github-actions bot added the documentation (Improve or add to documentation) label Aug 11, 2022
@pavithraes
Member

ok to test

@TomNicholas
Author

Q: Should this utility expose max_chunk_lengths or be changed to expose max_num_chunks instead?

@TomNicholas
Author

For some reason the new strategy is not showing up on the API docs page

[Screenshot from 2022-08-11 15-58-13: the API docs page]

@TomNicholas
Author

mypy is failing with

dask/array/strategies.py:11: error: Untyped decorator makes function "block_lengths" untyped  [misc]
dask/array/strategies.py:45: error: Untyped decorator makes function "chunks" untyped  [misc]
Found 2 errors in 1 file (checked 244 source files)

which I would think is hypothesis' fault?

@jsignell
Member

Thanks for opening this @TomNicholas! I have a few thoughts:

  1. The mypy failure definitely feels like a hypothesis issue. It seems like it would not be too hard to pass the type through the decorator (I'm looking at https://stackoverflow.com/questions/65621789/mypy-untyped-decorator-makes-function-my-method-untyped), so they would likely welcome that change upstream.
  2. I am not convinced that this belongs in the dask/array part of the codebase. This is something that people would only use in tests, right? So it should probably live in dask/array/tests/strategies.py.
  3. I don't have an opinion on whether to expose max_num_chunks or max_chunk_length. You could allow them both and error if they are both specified. Regardless of that choice, these methods should be private, at least to start.
  4. Is this something that you are interested in maintaining? If so then great - I feel fine merging it into Dask if we can just ping you if something breaks :)

@TomNicholas
Author

Thanks for your comments @jsignell !

The mypy failure definitely feels like a hypothesis issue.

I noticed there are recent changes to the code for that decorator in hypothesis, so it might be that requiring a more recent version would fix this typing error? It looks like this PR was intended to make typing work with @st.composite.
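
For illustration, this is roughly what an annotated composite strategy could look like once the decorator carries type information. This is only a sketch, assuming a hypothesis version that exports st.DrawFn and ships a typed @st.composite; the trivial body is just a placeholder, not the PR's implementation:

from hypothesis import strategies as st


@st.composite
def chunks(draw: st.DrawFn, shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    # Placeholder body: split each axis into a first chunk of drawn length,
    # plus (if anything is left over) a second chunk holding the remainder.
    out = []
    for ax_length in shape:
        first = draw(st.integers(min_value=1, max_value=ax_length))
        out.append((first, ax_length - first) if first < ax_length else (first,))
    return tuple(out)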

I am not convinced that this belongs in the dask/array part of the codebase. This is something that people would only use in tests, right? So it should probably live in dask/array/tests/strategies.py.

This depends on whether you think the tests namespace should be entirely private library internals or not (I do). In xarray at least (and in numpy) we use tests to refer to internal checks that are run by pytest, and testing to refer to any test-related public API that someone else might want to import. So for example xarray.tests imports assert_equal from xarray.testing. (And yes, we have an xarray/tests/test_testing.py file 😁) However, if dask has a different policy then I'm happy to defer; I just think this separation is neat.

I don't have an opinion on whether to expose max_num_chunks or max_chunk_length

This is really an intuition question for dask devs about which they think is more important for finding bugs. max_num_chunks would create uniform-length chunks by default (which we might want the strategy to arbitrarily deviate a bit from). max_chunk_length would create completely non-uniform chunks, but that might be overkill for finding bugs and/or mean an inefficient strategy.
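
To make the trade-off concrete, here is a rough single-axis sketch of the max_num_chunks flavour (the name chunks_along_axis and its signature are illustrative, not what this PR currently exposes): it draws a number of chunks and splits the axis as evenly as possible, so chunk lengths differ by at most one.

from hypothesis import strategies as st


def chunks_along_axis(ax_length: int, max_num_chunks: int):
    def split_evenly(num_chunks: int) -> tuple[int, ...]:
        base, extra = divmod(ax_length, num_chunks)
        # The first `extra` chunks get one extra element so the lengths sum to ax_length.
        return tuple(base + (1 if i < extra else 0) for i in range(num_chunks))

    return st.integers(min_value=1, max_value=min(max_num_chunks, ax_length)).map(
        split_evenly
    )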

You could allow them both and error if they are both specified.

I kind of don't really want to do that, because they imply different strategy implementations under the hood, so allowing both is twice as much work 😅

Regardless of that choice these methods should be private at least to start.

Is this something that you are interested in maintaining? If so then great - I feel fine merging it into Dask if we can just ping you if something breaks :)

I'm putting these here so that I can import them in xarray (/xGCM/xhistogram...)! So I am happy to maintain them, but I also want the chunks strategy to be public so that I'm not relying on a private API.

@jsignell
Member

I am not convinced that this belongs in the dask/array part of the codebase. This is something that people would only use in tests, right? So it should probably live in dask/array/tests/strategies.py.

This depends on whether you think the tests namespace should be entirely private library internals or not (I do). In xarray at least (and in numpy) we use tests to refer to internal checks that are run by pytest, and testing to refer to any test-related public API that someone else might want to import. So for example xarray.tests imports assert_equal from xarray.testing. (And yes, we have an xarray/tests/test_testing.py file 😁) However, if dask has a different policy then I'm happy to defer; I just think this separation is neat.

Ah ok. Thanks for explaining. I took a look and realized that assert_eq is defined in dask/array/utils.py so I guess there is precedent. Given that, I am ok with the proposed placement and with it being public.

This is really an intuition question for dask devs about which they think is more important for finding bugs. max_num_chunks would create uniform-length chunks by default (which we might want the strategy to arbitrarily deviate a bit from). max_chunk_length would create completely non-uniform chunks, but that might be overkill for finding bugs and/or mean an inefficient strategy.

max_chunk_length feels slightly more dasky to me.

@TomNicholas
Author

Coming back to this after a year 😅

I've become interested in this problem again, and would like to merge this, but I'm just fighting with the CI to get everything to pass.

@TomNicholas
Author

I've been adding hypothesis to the test envs, but I guess I could just add an importorskip instead - that might be preferable to adding it to the GPU test env?
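
For reference, the importorskip route would just be a module-level guard in the test file, e.g.:

import pytest

# Skip the whole test module if hypothesis isn't installed in this environment
# (e.g. the GPU test env), instead of adding it to every environment file.
hypothesis = pytest.importorskip("hypothesis")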


@Zac-HD left a comment

👋 exciting to see this active again!

My best guess is that the mypy error is because you're pinned to mypy == 1.1 - the current version is 1.7, and has a lot of bugfixes.

chunks = data.draw(strategies.chunks(shape, axes=axes))

# assert that chunks add up to array lengths along all axes
lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks)

Suggested change
lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks)
lengths = tuple(sum(chunks_along_axis) for chunks_along_axis in chunks)

Comment on lines +106 to +114
if not isinstance(shape, tuple):
    raise ValueError("shape argument must be a tuple of ints")

if min_chunk_length < 1 or not isinstance(min_chunk_length, int):
    raise ValueError("min_chunk_length must be an integer >= 1")

if max_chunk_length is not None:
    if max_chunk_length < 1 or not isinstance(max_chunk_length, int):
        raise ValueError("max_chunk_length must be an integer >= 1")

I'd use hypothesis.errors.InvalidArgument, as the convention for "you passed an invalid argument to a strategy function" (it inherits from TypeError).
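
For instance, the quoted checks could become something like the following sketch, which reuses the PR's argument names; the helper name _validate_chunks_args is hypothetical:

from hypothesis.errors import InvalidArgument  # inherits from TypeError


def _validate_chunks_args(shape, min_chunk_length, max_chunk_length=None):
    # Same checks as the block above, using hypothesis' conventional exception type.
    if not isinstance(shape, tuple):
        raise InvalidArgument(f"shape must be a tuple of ints, got {shape!r}")
    if not isinstance(min_chunk_length, int) or min_chunk_length < 1:
        raise InvalidArgument("min_chunk_length must be an integer >= 1")
    if max_chunk_length is not None and (
        not isinstance(max_chunk_length, int) or max_chunk_length < 1
    ):
        raise InvalidArgument("max_chunk_length must be an integer >= 1")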

Comment on lines +29 to +37
if min_chunk_length > max_chunk_length_remaining:
    # if we are at the end of the array we have no choice but to use a smaller chunk
    chunk = remaining_length
else:
    chunk = draw(
        st.integers(
            min_value=min_chunk_length, max_value=max_chunk_length_remaining
        )
    )

Shrinking will tend to work better if the underlying elements are drawn from the same domain. In this case I'd implement the strategy by:

  1. calculating the range of valid chunk sizes and range of number-of-chunks (excluding the final chunk)
  2. drawing from st.lists(st.integers(...), ...)
  3. discard trailing elements which would take you over ax_length and manually insert a final chunk of size ax_length - sum(chunk_sizes)

You could even implement this as a non-@st.composite function which does some validation and precomputation, and then returns st.lists(...).map(_turn_into_valid_chunks), to better amortize the setup cost.
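
A rough sketch of that shape of implementation, with illustrative names and defaults rather than the PR's exact API:

from hypothesis import strategies as st


def chunks_along_axis(ax_length, min_chunk_length=1, max_chunk_length=None):
    # Validation and precomputation happen once, outside any drawing.
    if max_chunk_length is None or max_chunk_length > ax_length:
        max_chunk_length = ax_length
    max_num_chunks = -(-ax_length // min_chunk_length)  # ceil division

    def _turn_into_valid_chunks(chunk_sizes):
        out, total = [], 0
        for size in chunk_sizes:
            if total + size > ax_length:
                break  # discard trailing elements that would overshoot ax_length
            out.append(size)
            total += size
        if total < ax_length:
            out.append(ax_length - total)  # manually insert the final chunk
        return tuple(out)

    return st.lists(
        st.integers(min_value=min_chunk_length, max_value=max_chunk_length),
        max_size=max_num_chunks,
    ).map(_turn_into_valid_chunks)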

@github-actions bot added the needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.) label Jun 10, 2024

Development

Successfully merging this pull request may close these issues: Hypothesis strategy for chunking arrays in different patterns
