How to improve performance: ncdata conversion from Xarray(zarr) to Cube is not ideal #139

@valeriupredoi

Description

Hi @pp-mo, another one from me, with apologies for machine-gunning you with issues - but I hope these help:

Issue Summary

Basic use of ncdata when converting an Xarray Dataset from a Zarr file to an Iris cube doesn't look very efficient; please help us improve efficiency 🍻

MRE

This basic test measures time and max RES memory for the two API blocks: the Xarray data loading, and the ncdata conversion to a cube (which, indeed, has lazy data); a rough sketch of one way to capture such figures is included after the MRE. Note that this is not some crazy HEALPix file like the one from my other issue (and, indeed, no issues with the CI client etc., all works well) - this is a bog-standard CMIP6 file.

import iris
import ncdata.iris_xarray
import xarray as xr


def test_load_zarr3_cmip6_via_ncdata():
    """
    Test loading a Zarr3 store from a https Object Store.

    This test is meant to determine how much memory we need via the
    two main API routes we need to go to load an Iris cube from a Zarr
    store, using an object storage unit:

    - API1: Xarray.open_dataset
    - API2: ncdata.iris_xarray.cubes_from_xarray

    We have a permanent bucket, esmvaltool-zarr, at CEDA's object store
    ("url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk"),
    where we will host a number of test files like this one.

    This is an actual CMIP6 dataset (Zarr built from netCDF4 via Xarray)
    - Zarr store on disk: 243 MiB
    - compression: Blosc
    - Dimensions: (lat: 128, lon: 256, time: 2352, axis_nbounds: 2)
    - chunking: time-slices; netCDF4.Dataset.chunking() = [1, 128, 256]

    Test takes 8-9s (median: 8.5s) and needs max Res mem: 1GB
    """
    zarr_path = (
        "https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/pr_Amon_CNRM-ESM2-1_02Kpd-11_r1i1p2f2_gr_200601-220112.zarr3"
    )

    time_coder = xr.coders.CFDatetimeCoder(use_cftime=True)
    zarr_xr = xr.open_dataset(
        zarr_path,
        consolidated=True,
        decode_times=time_coder,
        engine="zarr",
        backend_kwargs={},
    )
    # API1: 420MB memory; 1.5s

    conversion_func = ncdata.iris_xarray.cubes_from_xarray
    cubes = conversion_func(zarr_xr)
    # API2: 1GB memory; 8.5s

    assert isinstance(cubes, iris.cube.CubeList)
    assert len(cubes) == 1
    assert cubes[0].has_lazy_data()
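
The inline comments above quote time and max RES memory per API block. As a minimal sketch, a helper along these lines could capture comparable numbers inline; the measure helper is hypothetical, and ru_maxrss is the cumulative peak for the whole process (reported in KiB on Linux, bytes on macOS), so the two API blocks really need separate runs to be isolated:

import resource
import time


def measure(label, func, *args, **kwargs):
    """Call func(*args, **kwargs) and print wall time plus peak RSS so far."""
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    # ru_maxrss is in KiB on Linux; it is the peak over the whole process lifetime,
    # not just this call.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: {elapsed:.1f}s, peak RSS ~{peak_kib / 1024:.0f} MiB")
    return result


# e.g.
# zarr_xr = measure("API1", xr.open_dataset, zarr_path, engine="zarr", consolidated=True)
# cubes = measure("API2", ncdata.iris_xarray.cubes_from_xarray, zarr_xr)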

As you can see, the ncdata converter needs roughly 4x the compressed on-disk size of the file in memory, and it spends a wee bit of time converting too (we're not particularly concerned about runtimes at this point, just the memory needed). Is there a trick we are missing to make this more efficient?
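
One thing that might help pinpoint the cost, assuming cubes_from_xarray is essentially a thin wrapper over ncdata's two underlying steps, is to split the conversion explicitly so the intermediate NcData object can be profiled on its own:

import ncdata.iris
import ncdata.xarray

# Step 1: xarray Dataset -> in-memory NcData (netCDF-like) structure
ncd = ncdata.xarray.from_xarray(zarr_xr)

# Step 2: NcData -> Iris CubeList
cubes = ncdata.iris.to_iris(ncd)

Measuring each step separately would at least tell us whether the ~1 GB peak comes from the xarray -> NcData translation or from the NcData -> Iris load.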

Many thanks as ever 🍺
