The current `unique` implementation, while lazy, does not leverage tree reductions akin to those used for operations like `sum`. These tree reductions compute an operation over each chunk and then repeatedly combine small groups of partial results into new, smaller chunks until only one small chunk remains. Through this process it is possible to work with much larger pieces of data. To allow `unique` to work on larger data, it would be nice to leverage this tree-reduction strategy in its implementation.
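For reference, the tree-reduction pattern might be sketched in numpy-only form like this (the `tree_reduce` helper here is a hypothetical illustration, not dask's actual API, which builds a lazy task graph instead of evaluating eagerly):

```python
import numpy as np


def tree_reduce(chunks, chunk_fn, combine_fn, split_every=2):
    """Apply chunk_fn to each chunk, then pairwise-combine partial
    results until only one remains. Eager, simplified sketch of the
    pattern dask expresses lazily in its task graph."""
    results = [chunk_fn(c) for c in chunks]
    while len(results) > 1:
        results = [
            combine_fn(results[i:i + split_every])
            for i in range(0, len(results), split_every)
        ]
    return results[0]


chunks = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
# Sum each chunk, then sum the partial sums pairwise.
total = tree_reduce(chunks, np.sum, sum)  # equals np.arange(12).sum() == 66
```

Because each round shrinks the number of partial results by `split_every`, no single step ever needs to hold more than a small group of chunks in memory.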
Currently `unique` makes use of a wrapper function, `_unique_internal`, which makes it straightforward to support `unique` with its various optional extra results (indices, inverse mapping, counts). This function is well suited to more complex reductions, since it already supports reducing over those other arguments as well. So little if any restructuring of `unique` itself should be needed for this change.
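A chunkwise/combine split for `unique` could look roughly like the following numpy-only sketch (the function names `unique_chunk` and `unique_combine` are hypothetical; this mirrors the shape of the computation, not `_unique_internal`'s actual signature, and only handles counts for brevity):

```python
import numpy as np


def unique_chunk(x):
    # Per-chunk partial result: unique values and their counts.
    values, counts = np.unique(x, return_counts=True)
    return values, counts


def unique_combine(partials):
    # Merge several partial results into one of the same shape;
    # counts for values seen in multiple partials are summed.
    values = np.concatenate([v for v, _ in partials])
    counts = np.concatenate([c for _, c in partials])
    merged, inverse = np.unique(values, return_inverse=True)
    merged_counts = np.zeros(len(merged), dtype=counts.dtype)
    np.add.at(merged_counts, inverse, counts)
    return merged, merged_counts


parts = [unique_chunk(np.array(c)) for c in ([1, 2, 2], [2, 3], [1, 3, 3])]
values, counts = unique_combine(parts)
```

Since the combine step produces a result of the same form as the chunk step, it can be applied repeatedly up the tree.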
It is almost possible to implement `unique` using the existing reduction utilities in `dask.array.reductions`, except that these utilities seem to assume that a reduction step yields a chunk with a single element (or one element per reduced dimension). However `unique` may return an unknown number of elements, anywhere from one up to the length of the data itself. Hence we would need a way to express that the chunks produced by each reduction step may have a different or indeterminate size. Perhaps an additional argument to `reduction` and `_tree_reduce` is all that is needed.
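The size mismatch can be seen directly: a `sum`-style partial always has a fixed, known shape, while a `unique` partial's length depends on the data itself, which is why the graph would need chunk sizes marked as unknown (e.g. `np.nan`-sized chunks) rather than fixed ones:

```python
import numpy as np

# A sum partial is always a single (0-d) element, regardless of input:
assert np.sum(np.array([5, 5, 5])).ndim == 0

# A unique partial's length is data-dependent, from 1 up to len(chunk):
assert len(np.unique(np.array([7, 7, 7]))) == 1
assert len(np.unique(np.array([1, 2, 3]))) == 3
```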