
Conversation

@phofl (Collaborator) commented Jan 8, 2025

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

cc @dcherian is this suitable for xarray?

@phofl changed the title to "Add cached version for normalize_chunks" Jan 8, 2025
@github-actions bot (Contributor) commented Jan 8, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files  ± 0       15 suites  ±0   4h 29m 11s ⏱️ -2s
 17 163 tests + 1   15 968 ✅ + 1   1 195 💤 ±0  0 ❌ ±0 
211 476 runs  +13  194 296 ✅ +10  17 180 💤 +3  0 ❌ ±0 

Results for commit f4c0011. ± Comparison against base commit 7393a77.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.
dask.array.tests.test_array_core ‑ test_normalize_chunks
dask.array.tests.test_array_core ‑ test_normalize_chunks[normalize_chunks]
dask.array.tests.test_array_core ‑ test_normalize_chunks[normalize_chunks_cached]

@dcherian (Collaborator) commented Jan 9, 2025

Yes, decent impact.

# BEFORE: dask  : 2024.12.1, xarray: 2025.1.0
# Wall time: 1min 12s
#
# AFTER: xarray: 2024.11.1.dev50+gd9365109.d20250108, dask  : 2024.12.1+966.gf4c001150
# Wall time: 49.1 s
%time xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3")
from dask.array.core import normalize_chunks_cached

normalize_chunks_cached.cache_info()
# CacheInfo(hits=271, misses=2, maxsize=128, currsize=2)
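The CacheInfo numbers above come straight from functools.lru_cache. A minimal stand-in (not the PR's actual code; `normalize` here is a hypothetical toy) shows how repeated calls with identical, hashable arguments turn into cache hits:

```python
import functools

# Toy stand-in for normalize_chunks; the PR wraps the real function the same
# way. Arguments must be hashable, so chunks are passed as tuples.
@functools.lru_cache(maxsize=128)
def normalize(chunks, shape):
    return tuple((c,) * (s // c) for c, s in zip(chunks, shape))

for _ in range(5):
    normalize((1,), (3,))

info = normalize.cache_info()
# first call misses, the next four hit: hits=4, misses=1, currsize=1
```

The hits/misses ratio above (271 hits for 2 misses) is exactly this effect at dataset scale: many variables share the same chunk specification, so only the first normalization per distinct input pays full cost.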

Now the big issue is tokenize.

Review comment (Collaborator) on the `@functools.lru_cache` decorator in the diff:
I bet you could use this function internally if you wrote a caching decorator that cached both by id and hash. That way you check id first, and then with the hash if that fails.
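One possible shape for such a decorator (a hedged sketch, not anything in dask; `cache_by_id_then_hash` is a made-up name): probe an id()-keyed dict first, and only fall back to hashing the argument when the identity check misses:

```python
import functools

def cache_by_id_then_hash(func):
    # Fast path: id()-keyed dict, no hashing of the (possibly huge) argument.
    # Slow path: an ordinary hash-based lru_cache.
    by_hash = functools.lru_cache(maxsize=128)(func)
    by_id = {}

    @functools.wraps(func)
    def wrapper(arg):
        entry = by_id.get(id(arg))
        if entry is not None and entry[0] is arg:  # identity guard: ids can be reused
            return entry[1]
        result = by_hash(arg)           # hashes arg only when the id probe misses
        by_id[id(arg)] = (arg, result)  # strong ref keeps the id from being recycled
        return result

    return wrapper
```

The strong reference in `by_id` keeps entries alive (so their ids stay valid), but it also means that cache never shrinks; a real implementation would need to bound it.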

@phofl (Collaborator, Author) replied:

Oh, this is interesting. I'll take a look at it, but will merge this one for now.

@dcherian (Collaborator) commented Jan 9, 2025

This is the issue:

import dask.base
from dask.layers import ArraySliceDep

chunks = ((1,) * 1_000_000, (721,), (1441,))
%timeit dask.base.tokenize(ArraySliceDep(chunks))  # 55 ms

For 300 variables, that alone takes 16 s. Somehow using a cache keyed on id(chunks) in that tokenization would fix it.

EDIT: The cached values are the tuples within chunks, not chunks itself.
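A sketch of what an id()-keyed token cache could look like (hypothetical, not dask code; `cached_token` and `_token_cache` are made-up names). Because the inner per-variable tuples are shared objects, keying on their identity skips re-hashing a million-element tuple on every tokenize call:

```python
# Hypothetical id-keyed memo for tokens. Holding a strong reference to the
# object keeps its id from being recycled while the entry is cached.
_token_cache = {}

def cached_token(obj, tokenize):
    entry = _token_cache.get(id(obj))
    if entry is not None and entry[0] is obj:  # identity check guards id reuse
        return entry[1]
    token = tokenize(obj)                      # expensive path, taken once per object
    _token_cache[id(obj)] = (obj, token)
    return token
```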

@phofl phofl merged commit ca15d49 into dask:main Jan 9, 2025
27 checks passed
@phofl (Collaborator, Author) commented Jan 9, 2025

> for 300 variables, that alone takes 16s. somehow using a cache with id(chunks) in that tokenization would fix it.
>
> EDIT: The tuples within chunks are the cached values, not chunks itself

Yeah, this is tricky; I'll see if we can do something here.

@phofl phofl deleted the normalize-chunks-cached branch January 9, 2025 09:49
@phofl (Collaborator, Author) commented Jan 9, 2025

@dcherian can you import from dask.array.api when you add this to xarray?

dcherian added a commit to dcherian/xarray that referenced this pull request Oct 15, 2025
