Skip to content

concat() with compat='no_conflicts' on dask arrays has accidentally quadratic runtime #5381

@shoyer

Description

@shoyer

This ends up calling fillna() in a loop inside xarray.core.merge.unique_variable(), something like:

  out = variables[0]
  for var in variables[1:]:
    out = out.fillna(var)

xarray/xarray/core/merge.py

Lines 147 to 149 in 55e5b5a

if combine_method:
for var in variables[1:]:
out = getattr(out, combine_method)(var)

This has quadratic behavior if the variables are stored in dask arrays (the dask graph gets one element larger after each loop iteration). This is OK for merge() (which typically only has two arguments) but is problematic for dealing with variables that shouldn't be concatenated inside concat(), which should be able to handle very long lists of arguments.

I encountered this because compat='no_conflicts' is the default for xarray.combine_nested().

I guess there's also the related issue which is that even if we produced the output dask graph by hand without a loop, it still wouldn't be easy to evaluate for a large number of elements. Ideally we would use some sort of tree-reduction to ensure the operation can be parallelized.

xref google/xarray-beam#13

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions