Skip to content

Conversation

@jakirkham
Copy link
Member

@jakirkham jakirkham commented Jun 7, 2018

Fixes #1926
Fixes #2415

This provides a Dask Array implementation of NumPy's pad. It dispatches through 1 of 4 different functions depending on the type of padding (i.e. mode) used. If padding only involves edge chunks, then NumPy's pad is applied to the edge chunks and internal chunks are left untouched. If padding involves some sort of tiling, then the array is sliced up into pieces that orientated and organized as need, which are then combined with the original array using block. If the padding involves the computation of some statistics of the array, the statistics are computed and broadcast to match padding, which are then combined with the original array using block. If a user defined function is provided, then the array is padded with 0s and map_blocks is used (with some rechunking) to apply the user function and get the resulting array.

  • Tests added / passed
  • Passes flake8 dask

@jakirkham
Copy link
Member Author

Should add this is a rough implementation currently. There are no tests yet (obviously those will be needed). Nor are there API docs. Many padding options are not supported currently. Also there are kwargs that need to be passed on, but wasn't sure how to accomplish that. In any event, put this up here to get some rough/big picture feedback and suggestions before spending too long on it. Can iterate on details once we are happy with the big picture.

@jakirkham jakirkham force-pushed the add_pad branch 3 times, most recently from 08a8203 to 760e9a4 Compare June 7, 2018 23:57
array = asarray(array)

if mode not in ["constant", "edge", "linear_ramp"]:
raise NotImplementError("`pad` does not support the given `mode`.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are various statistical paddings like maximum, mean, and minimum. These could be handled by computing on the global array (possibly sliced if stat_length is provided) and then treating them like the constant padding case. The median case gets a little tricky. Though should be doable as this only gets computed along 1-D pieces of the array.

Things like reflect, symmetric, and wrap can be constructed by slicing the Dask Array into pieces, moving them around, and running block to stitch them all together into one Dask Array. Given how this behavior differs from the pad behavior included, we may opt to have this be a separate internal function that pad calls.

Finally there is the possibility of a user supplying an arbitrary function. We probably could implement this. Have some thoughts on it, but it is enough of a special case that we might want to wait and see if people ask for this feature and see what they expect from it.

@jakirkham jakirkham force-pushed the add_pad branch 10 times, most recently from ac70f12 to b6e7e79 Compare June 8, 2018 08:58
@jakirkham jakirkham changed the title WIP: Add Dask Array implementation of pad Add Dask Array implementation of pad Jun 8, 2018
@jakirkham
Copy link
Member Author

CI failure appears to be caused by issue ( numpy/numpy#7353 ). Bumping to NumPy 1.11.1 to see if that fixes the issue.


if any(map(any, pad_chunk_width)):
dsk[result_chunk_key] = (
np_pad, array_chunk_key, pad_chunk_width, mode
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we reuse NumPy's pad here. This increases the chunks on the edge of the array and simplifies the logic we need to handle.

An alternative to this would be to do the padding ourselves manually in Dask. The benefit being the input array would be unchanged and the padded data would appear as new chunks on the exterior of the array.

All of the other padding cases already add the padding using new, distinct chunks while not changing the array itself. Could be useful if we wanted to provide a user option to override the chunking of the padded array.

jakirkham added 5 commits June 8, 2018 13:16
This provides a Dask Array implementation of NumPy's `pad`. It
dispatches through 1 of 4 different functions depending on the type of
padding (i.e. `mode`) used. If padding only involves edge chunks, then
NumPy's `pad` is applied to the edge chunks and internal chunks are left
untouched. If padding involves some sort of tiling, then the array is
sliced up into pieces that orientated and organized as need, which are
then combined with the original array using `block`. If the padding
involves the computation of some statistics of the array, the statistics
are computed and broadcast to match padding, which are then combined
with the original array using `block`. If a user defined function is
provided, then the array is padded with 0s and `map_blocks` is used
(with some `rechunk`ing) to apply the user function and get the
resulting array.

Some special cases like computation of the `median` and usage of
`reflect_type="odd"` are currently not supported. The former could be
supported within some reasonable constraints. The latter has somewhat
mystifying behavior, which would need to be understood to be
implemented.
Appears a bug fix we need for NumPy's `pad` function is added in NumPy
1.11.1. So try bumping to NumPy 1.11.1 to see if that resolves the
issue.
@jakirkham
Copy link
Member Author

jakirkham commented Jun 8, 2018

This is now a pretty complete implementation of pad for Dask Arrays. So this would be a good time to get reviews.

Currently we handle all cases except the following.

  • mode="median"
  • reflect_type="odd" (used by reflect and symmetric)

As median is a weird thing for Dask Arrays, this probably can't be added generally. It might be possible to add it with some constraints (e.g. running as a UDF). Though user feedback would be helpful in determining whether this is useful and what sort of constraints would be tolerable.

With reflect_type="odd", this should be solvable. Have some ideas on how to solve it. Just handling corners appears to be a bit subtle. ( numpy/numpy#11283 ) Given as it is unclear to me how useful this would be and it appears a bit tricky, have skipped it for now. Should be a reasonable problem for a user interested in this feature to add.

@jakirkham
Copy link
Member Author

Will plan on merging this end of day Friday if there are no comments.

@jakirkham jakirkham merged commit cfd0f7c into dask:master Jun 16, 2018
@jakirkham jakirkham deleted the add_pad branch June 16, 2018 17:42
@jakirkham
Copy link
Member Author

Went ahead and merged. That said, if there are any issues, please let me know. We can fix them in a follow-up.

@crusaderky
Copy link
Collaborator

@jakirkham all travis tests are failing. Please revert until fixed.

@jakirkham
Copy link
Member Author

Saw some failures related to toolz not being importable in dask.dataframe, which is totally unrelated to this change. Also not all builds on master had this issue (for example this one already passed). Instead this is likely a transient installation error on CIs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants