Description
Hi @pp-mo, another one from me, with apologies for machine-gunning you with issues - but I hope these help:
Issue Summary
Basic use of ncdata to convert an Xarray Dataset loaded from a Zarr store into an Iris cube doesn't look very efficient, particularly on memory; please help us improve efficiency 🍻
MRE
This basic test measures time and max RES memory for the two API blocks: the Xarray data loading, and the ncdata conversion to a cube (which does, indeed, have lazy data). Note that this is not some crazy HEALPix file like the one from my other issue (and, indeed, no issues with the CI client etc. - all works well); this is a bog-standard CMIP6 file.
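For reference, the time and memory figures quoted in the code comments were gathered with a small harness roughly along these lines - an illustrative sketch only, not part of the MRE; the measure helper is hypothetical, and on Linux ru_maxrss reports the process-wide peak resident set size in KiB:

import resource
import time


def measure(label, func, *args, **kwargs):
    """Call func, report elapsed wall time and the process peak RSS so far."""
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    # ru_maxrss is the peak resident set size: KiB on Linux, bytes on macOS
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print(f"{label}: {elapsed:.1f}s, peak RSS ~{peak_mb:.0f} MB")
    return result

# e.g. cubes = measure("API2: cubes_from_xarray", conversion_func, zarr_xr)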
import iris
import ncdata.iris_xarray
import xarray as xr


def test_load_zarr3_cmip6_via_ncdata():
    """
    Test loading a Zarr3 store from an https Object Store.

    This test is meant to determine how much memory we need via the
    two main API routes we go through to load an Iris cube from a Zarr
    store held on an object storage unit:

    - API1: xarray.open_dataset
    - API2: ncdata.iris_xarray.cubes_from_xarray

    We have a permanent bucket, esmvaltool-zarr, at CEDA's object store
    ("url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk"), where we will host a number
    of test files like this one.

    This is an actual CMIP6 dataset (Zarr built from netCDF4 via Xarray):

    - Zarr store on disk: 243 MiB
    - compression: Blosc
    - dimensions: (lat: 128, lon: 256, time: 2352, axis_nbounds: 2)
    - chunking: time-slices; netCDF4.Dataset.chunking() = [1, 128, 256]

    The test takes 8-9s (median: 8.5s) and needs max RES mem of 1GB.
    """
    zarr_path = (
        "https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/pr_Amon_CNRM-ESM2-1_02Kpd-11_r1i1p2f2_gr_200601-220112.zarr3"
    )

    # API1: load the Zarr store lazily with Xarray (~420MB memory; 1.5s)
    time_coder = xr.coders.CFDatetimeCoder(use_cftime=True)
    zarr_xr = xr.open_dataset(
        zarr_path,
        consolidated=True,
        decode_times=time_coder,
        engine="zarr",
        backend_kwargs={},
    )

    # API2: convert the Xarray Dataset to Iris cubes via ncdata (~1GB memory; 8.5s)
    conversion_func = ncdata.iris_xarray.cubes_from_xarray
    cubes = conversion_func(zarr_xr)

    assert isinstance(cubes, iris.cube.CubeList)
    assert len(cubes) == 1
    assert cubes[0].has_lazy_data()

As you can see, the ncdata converter needs about 4x more memory than the compressed size of the store on disk (roughly 1GB peak against a 243 MiB Zarr store), and it spends a wee bit of time converting it too (we are not particularly concerned about runtimes at this point, just the memory needed). Is there a trick we are missing to make it more efficient?
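For completeness, the one variation we can think of trying is sketched below: asking Xarray for dask-backed variables up front with chunks={}, which should reuse the store's own chunking. It reuses zarr_path and time_coder from the MRE above, and whether it actually lowers the peak memory of the ncdata conversion is exactly what we don't know.

# Hypothetical variation, not a confirmed fix: open with dask-backed
# variables using the on-disk chunking, then convert as before.
zarr_xr = xr.open_dataset(
    zarr_path,
    consolidated=True,
    decode_times=time_coder,
    engine="zarr",
    chunks={},  # wrap each variable in a dask array, chunked as stored
)
cubes = ncdata.iris_xarray.cubes_from_xarray(zarr_xr)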
Many thanks as ever 🍺