@jakirkham jakirkham commented Nov 2, 2018

If a size 0 array shows up in concatenate, it doesn't contribute any data. So it would be good to drop these terms from the sequence to avoid unnecessary constraints on computations.

To-do: Handle case with all size 0 arrays.

  • Tests added / passed
  • Passes flake8 dask

Make sure that `da.concatenate` does not include empty arrays in the
result as they don't contribute any data.
If any of the arrays passed to `da.concatenate` has a size of 0, then it
won't contribute anything to the array created by concatenation. As such
make sure to drop any size 0 arrays from the sequence of arrays to
concatenate before proceeding.
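
A minimal NumPy sketch of the idea above, not the code from this PR: `drop_empty` is a hypothetical helper that filters out size 0 arrays before concatenating. It assumes at least one input is non-empty and ignores the dtype and shape corner cases.

```python
import numpy as np


def drop_empty(arrays):
    """Hypothetical helper: keep only arrays that hold at least one element."""
    return [a for a in arrays if a.size != 0]


arrays = [np.zeros(0), np.arange(3.0), np.zeros(0), np.arange(2.0)]
kept = drop_empty(arrays)

# The values are the same either way; the filtered sequence has fewer terms.
full = np.concatenate(arrays)
trimmed = np.concatenate(kept)
```
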
@shoyer
shoyer commented Nov 2, 2018

Size 0 arrays don't affect the values, but they could change the dtype, e.g., note

>>> np.concatenate([np.zeros((0,), dtype=float), np.zeros((1,), dtype=int)]).dtype
dtype('float64')
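
One way to account for this, sketched in plain NumPy: compute the result dtype from every input, empty or not, before dropping the empty ones. The helper name `concat_keep_dtype` is hypothetical, and the sketch assumes at least one input is non-empty.

```python
import numpy as np


def concat_keep_dtype(arrays):
    """Hypothetical helper: drop size 0 arrays but keep their effect on the dtype."""
    # Promote over every input dtype, including those of the empty arrays.
    dtype = np.result_type(*(a.dtype for a in arrays))
    kept = [a.astype(dtype) for a in arrays if a.size != 0]
    return np.concatenate(kept)


arrays = [np.zeros((0,), dtype=float), np.zeros((1,), dtype=int)]
result = concat_keep_dtype(arrays)
```

Here the single kept array starts out as an integer array, yet `result` has the same dtype NumPy would produce for the full sequence.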

@jakirkham

Thanks for that. Didn't catch the type point when playing earlier.

Also they can affect final shape in some cases.

In [1]: import numpy as np

In [2]: np.concatenate([np.empty((0, 2, 4)), np.empty((0, 2, 3))], axis=2)
Out[2]: array([], shape=(0, 2, 7), dtype=float64)
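
The output shape can be worked out from the input shapes alone, so it is still well defined when every input is empty. A small pure-Python sketch (the helper `concatenated_shape` is hypothetical and assumes a non-empty list of shapes with at least one dimension each):

```python
def concatenated_shape(shapes, axis):
    """Hypothetical helper: shape obtained by concatenating along `axis`."""
    first = shapes[0]
    ndim = len(first)
    axis = axis % ndim  # allow negative axes
    for s in shapes:
        same_ndim = len(s) == ndim
        if not same_ndim or any(s[i] != first[i] for i in range(ndim) if i != axis):
            raise ValueError("shapes must match except along the concatenation axis")
    out = list(first)
    # Only the length along the concatenation axis accumulates.
    out[axis] = sum(s[axis] for s in shapes)
    return tuple(out)


shape = concatenated_shape([(0, 2, 4), (0, 2, 3)], axis=2)
```
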

@jakirkham

Yeah, there is a da.result_type that we could use to accomplish that.

@jakirkham

Am wondering if we should solve this problem at an even higher level. Namely optimizing out empty chunks from the graph entirely. ( #4173 )

@jrbourbeau

I recently had someone bring the issue of size 0 arrays affecting the output of da.concatenate to my attention. For example,

import numpy as np
import dask.array as da

x_np = np.concatenate([np.zeros(0), np.zeros(1)])
x_da = da.concatenate([da.zeros(0), da.zeros(1)])

result_np = np.concatenate([x_np, x_np])
result_da = da.concatenate([x_da, x_da])

print(f'result_np = {result_np}')
print(f'result_da = {result_da.compute()}')

outputs

result_np = [0. 0.]
result_da = []

The proposed solution of dropping size 0 arrays seems like a reasonable fix to me. I'm happy to help move this along by pushing commits to this PR, if that's okay with you @jakirkham

@jakirkham

Maybe we should bring this up as core work? Though I'm not entirely sure how to propose it.

@mrocklin

[Image: Screen Shot 2019-05-20 at 12 27 06 PM]

Click "Projects" above

@jakirkham

Ah ok. Was trying to do it from the project board. Thanks for the pointer.

@jrbourbeau

jrbourbeau commented May 22, 2019

@jakirkham I pushed some commits to ensure that the output dtype, and the case where all inputs to da.concatenate have size 0, are handled properly, so I think this should be ready for another review.

@jrbourbeau jrbourbeau changed the title from "WIP: Drop size 0 arrays in concatenate" to "Drop size 0 arrays in concatenate" May 23, 2019

@jakirkham jakirkham left a comment

Thanks @jrbourbeau. Pushed a few commits to tidy some things as this has simplified the code a bit. Hope that is ok. Think there is one outstanding issue to address (noted below).

@jakirkham jakirkham self-assigned this Jun 4, 2019
There was some left over type promotion code for the arrays to
concatenate using their `dtype`s. However this should now use the
`_meta` information instead since that is available.
@jakirkham jakirkham force-pushed the concat_drop_0_size_arrs branch from 476f203 to 79690e2 June 12, 2019 15:50
NumPy will raise if no arrays are provided to concatenate as it is
unclear what to do. This adds a similar exception for Dask Arrays. Also
this short circuits handling unusual cases later. Plus raises a clearer
exception than one might see if this weren't raised.
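
A short illustration of the behaviour being mirrored, using NumPy only. The wrapper `checked_concatenate` and its error message are hypothetical and are not the code added to Dask.

```python
import numpy as np


def checked_concatenate(arrays, axis=0):
    """Hypothetical wrapper that fails early when given no arrays."""
    arrays = list(arrays)
    if not arrays:
        raise ValueError("Need array(s) to concatenate")
    return np.concatenate(arrays, axis=axis)


numpy_error = None
wrapper_error = None

# NumPy itself raises ValueError for an empty sequence.
try:
    np.concatenate([])
except ValueError as err:
    numpy_error = err

# The wrapper raises the same exception type before doing any other work.
try:
    checked_concatenate([])
except ValueError as err:
    wrapper_error = err
```
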
@jakirkham

Broke out some smaller pieces from this into PR ( #4925 ) and PR ( #4927 ), which have been integrated.

Have updated to include the latest changes from master (including _meta).

Finally pushed some fixes to address some of the remaining issues. This is passing for me locally.

Would be great to get another review if someone has a chance.

@jrbourbeau jrbourbeau left a comment

Thanks for all your work on this @jakirkham! This generally looks good to me. I've attached a couple of questions and a few nitpicky comments

@jakirkham

Thanks @jrbourbeau. Think I've addressed your concerns. So please give it another look.

@jrbourbeau

The Travis build failure here is due to dropping Python 2.7 in #4919 and can be ignored

@jakirkham

> The Travis build failure here is due to dropping Python 2.7 in #4919 and can be ignored

Have merged with master. So this should be fixed.

Simplifies the `_meta` handling logic in `concatenate` to assume that
`_meta` is valid. As all arguments have been coerced to Dask Arrays,
this is a reasonable assumption to make.
@jakirkham

@jrbourbeau, could you please take another look?

@jrbourbeau jrbourbeau left a comment

This looks good to me. Thanks for all your work on this @jakirkham! Merging on green

@jrbourbeau jrbourbeau merged commit 66531ba into dask:master Jun 13, 2019
@jakirkham jakirkham deleted the concat_drop_0_size_arrs branch June 13, 2019 16:22
@jakirkham

Thanks for the reviews.

Have raised an issue about doing this for da.stack as well. ( #4928 )

TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Jun 13, 2019
commit 66531ba
Author: jakirkham <[email protected]>
Date:   Thu Jun 13 12:13:55 2019 -0400

    Drop size 0 arrays in concatenate (dask#4167)

    * Test `da.concatenate` with size 0 array

    Make sure that `da.concatenate` does not include empty arrays in the
    result as they don't contribute any data.

    * Drop size 0 arrays from `da.concatenate`

    If any of the arrays passed to `da.concatenate` has a size of 0, then it
    won't contribute anything to the array created by concatenation. As such
    make sure to drop any size 0 arrays from the sequence of arrays to
    concatenate before proceeding.

    * Handle dtype and all 0 size case

    * Cast inputs with asarray

    * Coerce all arrays to concatenate to the same type

    * Drop obsoleted type handling code

    * Comment on why arrays are being dropped

    * Use `np.promote_types` for parity w/old behavior

    * Handle endianness during type promotion

    * Construct empty array of right type

    Avoids the need to cast later and the addition of another node to the
    graph.

    * Promote types in `concatenate` using `_meta`

    There was some left over type promotion code for the arrays to
    concatenate using their `dtype`s. However this should now use the
    `_meta` information instead since that is available.

    * Ensure `concatenate` is working on Dask Arrays

    * Raise `ValueError` if `concatenate` gets no arrays

    NumPy will raise if no arrays are provided to concatenate as it is
    unclear what to do. This adds a similar exception for Dask Arrays. Also
    this short circuits handling unusual cases later. Plus raises a clearer
    exception than one might see if this weren't raised.

    * Test `concatenate` raises when no arrays are given

    * Determine the concatenated array's shape

    Needed to handle the case where all arrays have trivial shapes.

    * Handle special sequence cases together

    * Update dask/array/core.py

    Co-Authored-By: James Bourbeau <[email protected]>

    * Drop outdated comment

    * Assume valid `_meta` in `concatenate`

    Simplifies the `_meta` handling logic in `concatenate` to assume that
    `_meta` is valid. As all arguments have been coerced to Dask Arrays,
    this is a reasonable assumption to make.

commit 46aef58
Author: James Bourbeau <[email protected]>
Date:   Thu Jun 13 11:04:47 2019 -0500

    Overload HLG values method (dask#4918)

    * Overload HLG values method

    * Return lists for keys, values, and items

    * Add tests for keys and items

commit f9cd802
Author: mcsoini <[email protected]>
Date:   Thu Jun 13 18:03:55 2019 +0200

    Merge dtype warning (dask#4917)

    * add test covering the merge column dtype mismatch warning

    * for various merge types: checks that the resulting dataframe
      has either no nans or that a UserWarning has been thrown

    * Add warning for mismatches between column data types

    * fixes issue dask#4574
    * Warning is thrown if the on-columns of left and right have
      different dtypes

    * flake8 fixes

    * fixes

    * use asciitable for warning string

commit c400691
Author: Hugo <[email protected]>
Date:   Thu Jun 13 17:38:37 2019 +0300

    Docs: Drop support for Python 2.7 (dask#4932)

commit 985cdf2
Author: Benjamin Zaitlen <[email protected]>
Date:   Thu Jun 13 10:38:15 2019 -0400

    Groupby Covariance/Correlation (dask#4889)

commit 6e8c1b7
Author: Jim Crist <[email protected]>
Date:   Wed Jun 12 15:55:11 2019 -0500

    Drop Python 2.7 (dask#4919)

    * Drop Python 2.7

    Drops Python 2.7 from our `setup.py`, and from our test matrix. We don't
    drop any of the compatibility fixes (yet), but won't be adding new ones.

    * fixup

commit 7a9cfaf
Author: Ian Bolliger <[email protected]>
Date:   Wed Jun 12 11:44:26 2019 -0700

    keep index name with to_datetime (dask#4905)

    * keep index name with to_datetime

    * allow users to pass meta

    * Update dask/dataframe/core.py

    put meta as explicit kwarg

    Co-Authored-By: Matthew Rocklin <[email protected]>

    * Update dask/dataframe/core.py

    remove meta kwargs.pop

    Co-Authored-By: Matthew Rocklin <[email protected]>

    * remove test for index

    * allow index

commit abc86d3
Author: jakirkham <[email protected]>
Date:   Wed Jun 12 14:20:59 2019 -0400

    Raise ValueError if concatenate is given no arrays (dask#4927)

    * Raise `ValueError` if `concatenate` gets no arrays

    NumPy will raise if no arrays are provided to concatenate as it is
    unclear what to do. This adds a similar exception for Dask Arrays. Also
    this short circuits handling unusual cases later. Plus raises a clearer
    exception than one might see if this weren't raised.

    * Test `concatenate` raises when no arrays are given

commit ce2f866
Author: jakirkham <[email protected]>
Date:   Wed Jun 12 14:09:35 2019 -0400

    Promote types in `concatenate` using `_meta` (dask#4925)

    * Promote types in `concatenate` using `_meta`

    There was some left over type promotion code for the arrays to
    concatenate using their `dtype`s. However this should now use the
    `_meta` information instead since that is available.

    * Ensure `concatenate` is working on Dask Arrays
TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Jun 17, 2019
commit 255cc5b
Author: Justin Waugh <[email protected]>
Date:   Mon Jun 17 08:18:26 2019 -0600

    Map Dask Series to Dask Series (dask#4872)

    * index-test needed fix

    * single-parititon-error

    * added code to make it work

    * add tests

    * delete some comments

    * remove seed set

    * updated tests

    * remove sort_index and add tests

commit f7d73f8
Author: Matthew Rocklin <[email protected]>
Date:   Mon Jun 17 15:22:35 2019 +0200

    Further relax Array meta checks for Xarray (dask#4944)

    Our checks in slicing were causing issues for Xarray, which has some
    unslicable array types.  Additionally, this centralizes a bit of logic
    from blockwise into meta_from_array

    * simplify slicing meta code with meta_from_array

commit 4f97be6
Author: Peter Andreas Entschev <[email protected]>
Date:   Mon Jun 17 15:21:15 2019 +0200

    Expand *_like_safe usage (dask#4946)

commit abe9e28
Author: Peter Andreas Entschev <[email protected]>
Date:   Mon Jun 17 15:19:24 2019 +0200

    Defer order/casting einsum parameters to NumPy implementation (dask#4914)

commit 76f55fd
Author: Matthew Rocklin <[email protected]>
Date:   Mon Jun 17 09:28:07 2019 +0200

    Remove numpy warning in moment calculation (dask#4921)

    Previously we would divide by 0 in meta calculations for dask array
    moments, which would raise a Numpy RuntimeWarning to users.

    Now we avoid that situation, though we may also want to investigate a
    more thorough solution.

commit c437e63
Author: Matthew Rocklin <[email protected]>
Date:   Sun Jun 16 10:42:16 2019 +0200

    Fix meta_from_array to support Xarray test suite (dask#4938)

    Fixes pydata/xarray#3009

commit d8ff4c4
Author: jakirkham <[email protected]>
Date:   Fri Jun 14 10:35:00 2019 -0400

    Add a diagnostics extra (includes bokeh) (dask#4924)

    * Add a diagnostics extra (includes bokeh)

    * Bump bokeh minimum to 0.13.0

    * Add to `test_imports`

commit 773f775
Author: btw08 <[email protected]>
Date:   Fri Jun 14 14:34:34 2019 +0000

    4809 fix extra cr (dask#4935)

    * added test that fails to demonstrate the issue in 4809

    * modified open_files/OpenFile to accept a newline parameter, similar to io.TextIOWrapper or the builtin open on py3. Pass newline='' to open_files when preparing to write csv files.

    Fixed dask#4809

    * modified newline documentation to follow convention

    * added blank line to make test_csv.py flake8-compliant

commit 419d27e
Author: Peter Andreas Entschev <[email protected]>
Date:   Fri Jun 14 15:18:42 2019 +0200

    Minor meta construction cleanup in concatenate (dask#4937)

commit 1f821f4
Author: Bruce Merry <[email protected]>
Date:   Fri Jun 14 12:49:59 2019 +0200

    Cache chunk boundaries for integer slicing (dask#4923)

    This is an alternative to dask#4909, to implement dask#4867.

    Instead of caching in the class as in dask#4909, use functools.lru_cache.
    This unfortunately has a fixed cache size rather than a cache entry
    stored with each array, but simplifies the code as it is not necessary
    to pass the cached value from the Array class down through the call tree
    to the point of use.

    A quick benchmark shows that the result for indexing a single value from
    a large array is similar to that from dask#4909, i.e., around 10x faster for
    constructing the graph.

    This only applies the cache in `_slice_1d`, so should be considered a
    proof-of-concept.

    * Move cached_cumsum to dask/array/slicing.py

    It can't go in dask/utils.py because the top level is not supposed to
    depend on numpy.

    * cached_cumsum: index cache by both id and hash

    The underlying _cumsum is first called with _HashIdWrapper, which will
    hit (very cheaply) if we've seen this tuple object before. If not, it
    will call itself again without the wrapper, which will hit (but at a
    higher cost for tuple.__hash__) if we've seen the same value before but
    in a different tuple object.

    * Apply cached_cumsum in more places
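
A rough sketch of the two-level lookup described in the commit message above, under stated assumptions: it uses `itertools.accumulate` in place of a NumPy cumulative sum, and the names `HashIdWrapper`, `_cumsum`, and `cached_cumsum` are illustrative rather than copied from Dask.

```python
from functools import lru_cache
from itertools import accumulate


class HashIdWrapper:
    """Hash and compare a wrapped object by identity rather than by value."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __eq__(self, other):
        return isinstance(other, HashIdWrapper) and self.wrapped is other.wrapped

    def __hash__(self):
        return id(self.wrapped)


@lru_cache(maxsize=128)
def _cumsum(seq):
    if isinstance(seq, HashIdWrapper):
        # Identity lookup missed: fall back to the entry keyed by value.
        return _cumsum(seq.wrapped)
    return tuple(accumulate(seq))


def cached_cumsum(seq):
    """Cumulative sum of a tuple, cached by identity first and by value second."""
    return _cumsum(HashIdWrapper(seq))


chunks = (2, 3, 5)
first = cached_cumsum(chunks)
second = cached_cumsum(chunks)  # same tuple object: cheap identity hit
same_value = cached_cumsum(tuple([2, 3, 5]))  # new object, equal value
```

The cache keeps a reference to each wrapper, and so to the wrapped tuple, which keeps the `id`-based key valid for as long as the entry stays in the cache.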

Merge remote-tracking branch 'upstream/master' into dataframe-warnings
if n == 1:
n = len(seq)
if n == 0:
    return from_array(np.empty(shape=shape, dtype=meta.dtype))
@jakirkham

This isn't very meta-friendly. Fixing in PR ( #4979 ).
