How to improve performance: ncdata conversion from Xarray(zarr) to Cube is not ideal #139

@valeriupredoi

Description

Hi @pp-mo, another one from me, with apologies for machine-gunning you with issues - but I hope these help:

Issue Summary

Basic use of ncdata when converting an Xarray Dataset from a Zarr file to an Iris cube doesn't look very efficient; please help us improve efficiency 🍻

MRE

This basic test measures time and max RES memory for the two API blocks: the Xarray data loading, and the ncdata conversion to a cube (which, indeed, has lazy data); a rough sketch of one way to capture such figures is included after the MRE. Note that this is not some crazy HEALPix file like the one from my other issue (and, indeed, no issues with the CI client etc., all works well) - this is a bog-standard CMIP6 file.

import iris
import ncdata.iris_xarray
import xarray as xr


def test_load_zarr3_cmip6_via_ncdata():
    """
    Test loading a Zarr3 store from a https Object Store.

    This test is meant to determine how much memory we need via the
    two main API routes we need to go to load an Iris cube from a Zarr
    store, using an object storage unit:

    - API1: Xarray.open_dataset
    - API2: ncdata.iris_xarray.cubes_from_xarray

    We have a permanent bucket, esmvaltool-zarr, at CEDA's object store
    ("url": "https://uor-aces-o.s3-ext.jc.rl.ac.uk"),
    where we will host a number of test files like this one.

    This is an actual CMIP6 dataset (Zarr built from netCDF4 via Xarray)
    - Zarr store on disk: 243 MiB
    - compression: Blosc
    - Dimensions: (lat: 128, lon: 256, time: 2352, axis_nbounds: 2)
    - chunking: time-slices; netCDF4.Dataset.chunking() = [1, 128, 256]

    Test takes 8-9s (median: 8.5s) and needs max Res mem: 1GB
    """
    zarr_path = (
        "https://uor-aces-o.s3-ext.jc.rl.ac.uk/"
        "esmvaltool-zarr/pr_Amon_CNRM-ESM2-1_02Kpd-11_r1i1p2f2_gr_200601-220112.zarr3"
    )

    time_coder = xr.coders.CFDatetimeCoder(use_cftime=True)
    zarr_xr = xr.open_dataset(
        zarr_path,
        consolidated=True,
        decode_times=time_coder,
        engine="zarr",
        backend_kwargs={},
    )
    # API1: 420MB memory; 1.5s

    conversion_func = ncdata.iris_xarray.cubes_from_xarray
    cubes = conversion_func(zarr_xr)
    # API2: 1GB memory; 8.5s

    assert isinstance(cubes, iris.cube.CubeList)
    assert len(cubes) == 1
    assert cubes[0].has_lazy_data()
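
The inline comments above quote time and max RES memory per API block. As a minimal sketch, a helper along these lines could capture comparable numbers inline; the measure helper is hypothetical, and ru_maxrss is the cumulative peak for the whole process (reported in KiB on Linux, bytes on macOS), so the two API blocks really need separate runs to be isolated:

import resource
import time


def measure(label, func, *args, **kwargs):
    """Call func(*args, **kwargs) and print wall time plus peak RSS so far."""
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    # ru_maxrss is in KiB on Linux; it is the peak over the whole process lifetime,
    # not just this call.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: {elapsed:.1f}s, peak RSS ~{peak_kib / 1024:.0f} MiB")
    return result


# e.g.
# zarr_xr = measure("API1", xr.open_dataset, zarr_path, engine="zarr", consolidated=True)
# cubes = measure("API2", ncdata.iris_xarray.cubes_from_xarray, zarr_xr)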

As you can see, the ncdata converter needs roughly 4x the compressed on-disk size of the file in memory, and it spends a wee bit of time converting too (we're not particularly concerned about runtimes at this point, just the memory needed). Is there a trick we are missing to make this more efficient?
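
One thing that might help pinpoint the cost, assuming cubes_from_xarray is essentially a thin wrapper over ncdata's two underlying steps, is to split the conversion explicitly so the intermediate NcData object can be profiled on its own:

import ncdata.iris
import ncdata.xarray

# Step 1: xarray Dataset -> in-memory NcData (netCDF-like) structure
ncd = ncdata.xarray.from_xarray(zarr_xr)

# Step 2: NcData -> Iris CubeList
cubes = ncdata.iris.to_iris(ncd)

Measuring each step separately would at least tell us whether the ~1 GB peak comes from the xarray -> NcData translation or from the NcData -> Iris load.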

Many thanks as ever 🍺
