Description
I'm not sure whether this topic is more related to dask, zarr, or xarray; apologies if this is not the correct place to post it.
Loading the metadata / coordinates of a zarr archive (stored on GCS) with xarray to build a dask array can take a relatively long time for archives with a large number of objects. As an example, below is part of the profiling output from loading a dask array (from within the same GCS region) out of an archive with around 5e6 objects. The most costly calls are the 40 calls (1 per data variable in the dataset) to the dask.array.core.slices_from_chunks and dask.array.core.getem functions.
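For reference, this is roughly how such a dataset would be opened and profiled. This is only a sketch: the bucket path is a placeholder, and the use of gcsfs and consolidated=True are assumptions about the setup rather than the actual code behind the numbers below.

```python
# Minimal sketch (assumed setup, hypothetical bucket path) of opening and
# profiling a zarr store on GCS with xarray.
import cProfile

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem()                          # default credentials
store = fs.get_mapper("my-bucket/my-archive.zarr")  # placeholder path

with cProfile.Profile() as profiler:
    # With consolidated metadata a single .zmetadata object is read, but dask
    # still builds one task per chunk of every variable when wrapping the arrays.
    ds = xr.open_zarr(store, consolidated=True)

profiler.print_stats("tottime")
```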
The total time spent in these calls usually varies between 5 and 20 seconds for this dataset. These times can be greatly reduced (to sub-second) by decreasing the overall number of objects in the archive, i.e. by increasing the chunk sizes. My question is whether these loading times need to scale with the number of objects in the zarr store, considering that the metadata are consolidated and each of the actual coordinate variables is stored in a single chunk.
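To illustrate how that part of the cost scales, here is a small standalone sketch (not taken from the real dataset) that times dask.array.core.slices_from_chunks directly for a fixed array shape and different chunk sizes; the time grows with the number of chunks, which is consistent with the 40 expensive slices_from_chunks calls in the profile below.

```python
# Standalone sketch: building the slice tuples costs time proportional to the
# number of chunks, regardless of how small the consolidated metadata is.
import time

from dask.array.core import normalize_chunks, slices_from_chunks

shape = (100_000, 100_000)

for chunk in (100, 1_000, 10_000):
    chunks = normalize_chunks((chunk, chunk), shape=shape)
    n_chunks = len(chunks[0]) * len(chunks[1])
    start = time.perf_counter()
    slices_from_chunks(chunks)  # one tuple of slices per chunk
    elapsed = time.perf_counter() - start
    print(f"chunk size {chunk:>6}: {n_chunks:>9} chunks -> {elapsed:.3f} s")
```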
Thanks
```
         545613 function calls (538362 primitive calls) in 10.786 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       40    3.694    0.092    3.715    0.093 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:168(slices_from_chunks)
       40    3.163    0.079    3.163    0.079 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:224(<listcomp>)
       40    1.928    0.048    8.827    0.221 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:187(getem)
       93    1.408    0.015    1.408    0.015 {built-in method _openssl.SSL_read}
       40    0.103    0.003    8.986    0.225 /usr/local/lib/python3.8/dist-packages/dask/array/core.py:2670(from_array)
        3    0.064    0.021    0.065    0.022 /usr/local/lib/python3.8/dist-packages/zarr/core.py:1734(_decode_chunk)
      275    0.025    0.000    0.025    0.000 {built-in method marshal.loads}
       14    0.017    0.001    0.017    0.001 {method 'reduce' of 'numpy.ufunc' objects}
    86463    0.015    0.000    0.021    0.000 /usr/local/lib/python3.8/dist-packages/toolz/itertoolz.py:31(accumulate)
      669    0.010    0.000    0.023    0.000 {built-in method builtins.__build_class__}
```