Skip to content

Loading metadata slow for zarr archives with many objects stored in GCS #6363

@rafa-guedes

Description

@rafa-guedes

I'm not sure if this topic would be more related to dask, zarr or xarray, apologies if this is not the correct place to post it.

Loading metadata / coordinates from a zarr archive (stored on GCS) with xarray to build a dask array could take relatively long for archives with large number of objects. As an example, below is part of the profiling output from loading a dask array (from the same GCS region) from an archive with around 5e6 objects. The most costly calls are those 40 calls (1 per data variable in the dataset) to dask.array.core.slices_from_chunks and dask.array.core.getem functions.

The total time in these calls usually vary between 5 and 20 sec for this dataset. These times can be greatly reduced (to be sub-second) by decreasing the overall number of objects in the archive (by increasing the chunk sizes). My question is whether these loading times need to scale with the number of objects in the zarr store, considering the metadata are consolidated and the actual coordinate variables are stored in one single chunk.

Thanks

         545613 function calls (538362 primitive calls) in 10.786 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       40    3.694    0.092    3.715    0.093 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:168(slices_from_chunks)
       40    3.163    0.079    3.163    0.079 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:224(<listcomp>)
       40    1.928    0.048    8.827    0.221 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:187(getem)
       93    1.408    0.015    1.408    0.015 {built-in method _openssl.SSL_read}
       40    0.103    0.003    8.986    0.225 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:2670(from_array)
        3    0.064    0.021    0.065    0.022 /usr/local/lib/python3.8/dist-packages/zarr/core.py:1734(_decode_chunk)
      275    0.025    0.000    0.025    0.000 {built-in method marshal.loads}
       14    0.017    0.001    0.017    0.001 {method 'reduce' of 'numpy.ufunc' objects}
    86463    0.015    0.000    0.021    0.000 /usr/local/lib/python3.8/dist-packages/toolz/itertoolz.py:31(accumulate)
      669    0.010    0.000    0.023    0.000 {built-in method builtins.__build_class__}

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrayioneeds attentionIt's been a while since this was pushed on. Needs attention from the owner or a maintainer.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions