Discussions with @maxrjones made me realise that we're thinking about testing wrong. In particular, with #124 in place we will have well-defined expected properties that we can and should test against. Let me explain.
@maxrjones asked whether implementing loading via a ManifestStore/FunctionalStore as described by @ayushnag in #124 would be a problematic mixing of concerns between creating virtual references and also loading data. But actually it's the opposite: this is exactly what we should be doing.
The reason is that if we cannot correctly load data via a ManifestStore, we cannot correctly load it from an IcechunkStore either. Loading from an in-memory ManifestArray via zarr-python reading a ManifestStore and loading from an in-memory/on-disk/S3 IcechunkStore via zarr-python perform exactly the same set of decoding steps. Contrast that with how we currently load variables via a separate xarray backend (e.g. h5netcdf), which does its own decoding logic that isn't encoded in the zarr model.
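To make that claim concrete, here is a minimal sketch, assuming the ManifestStore proposed in #124 exposes the zarr store interface; the helper function and both store arguments are hypothetical, introduced only for illustration:

```python
# Minimal sketch, assuming the ManifestStore proposed in #124 exposes the zarr
# store interface. Both arguments are hypothetical objects introduced for
# illustration: a ManifestStore wrapping in-memory ManifestArrays, and an
# IcechunkStore that the same references were written to.
import xarray as xr
import xarray.testing as xrt


def assert_same_zarr_decoding(manifest_store, icechunk_store) -> None:
    ds_via_manifests = xr.open_zarr(manifest_store).load()
    ds_via_icechunk = xr.open_zarr(icechunk_store).load()
    # Only the store implementation differs; zarr-python runs the same chunk
    # fetching and codec pipeline in both cases, so the results should agree.
    xrt.assert_identical(ds_via_manifests, ds_via_icechunk)
```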
To check that we are able to correctly load variables, the property to test is:

```python
import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset

loaded_ds_from_manifests = open_virtual_dataset('file.nc', backend='netcdf', loadable_variables='all')
loaded_ds_from_backend = xr.open_dataset('file.nc', engine='h5netcdf').load()
xrt.assert_identical(loaded_ds_from_manifests, loaded_ds_from_backend)
```

i.e. we check our creation of manifests by loading from them and comparing against loading via a pre-existing xarray backend for that filetype. @dcherian told me once that it's a good thing to have multiple different ways to load data from the same file format, as it allows cross-checking. This property just encodes that. We could write a similar property for any other virtualizable file format that has an xarray backend, and each test like this should live in virtualizarr/tests/test_readers/test_hdf.py etc.
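For concreteness, here is a rough sketch of what one such per-reader test might look like; the `netcdf4_file` fixture is hypothetical, and the `backend=`/`loadable_variables='all'` arguments simply follow the property above rather than any finalised API:

```python
# Hedged sketch of a per-reader property test, e.g. for
# virtualizarr/tests/test_readers/test_hdf.py. The netcdf4_file fixture is
# hypothetical, and the open_virtual_dataset signature follows the proposal
# above rather than an existing API.
import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset


def test_load_matches_xarray_backend(netcdf4_file):
    # Load via our ManifestStore-backed path...
    via_manifests = open_virtual_dataset(netcdf4_file, backend='netcdf', loadable_variables='all')
    # ...and via the pre-existing xarray backend for this filetype.
    via_backend = xr.open_dataset(netcdf4_file, engine='h5netcdf').load()
    # Two independent loading paths should agree exactly.
    xrt.assert_identical(via_manifests, via_backend)
```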
Similarly we could also load from an in-memory icechunk store instead, which would make the property:
```python
# Assumes `icechunkstore` is a pre-existing (e.g. in-memory) icechunk store.
vds = open_virtual_dataset('file.nc', backend='netcdf')
vds.virtualize.to_icechunk(icechunkstore)
loaded_ds_from_icechunk = xr.open_zarr(icechunkstore).load()
loaded_ds_from_backend = xr.open_dataset('file.nc', engine='h5netcdf').load()
xrt.assert_identical(loaded_ds_from_icechunk, loaded_ds_from_backend)
```

This now looks a lot like our existing round-trip tests, inclusive of #376 (FYI @jsignell).
But there is no fundamental difference between how the data is loaded in these two cases: it's still zarr-python getting and decoding the data according to the ManifestArrays our readers created. So I now think we should use the ManifestStore idea (the first property) as much as possible internally, as it's the most lightweight approach and doesn't depend on the optional icechunk dependency. Then we additionally test that the second property holds for round-tripping via icechunk.
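As a rough sketch of how that split might look in the test suite while keeping icechunk optional (both fixtures are hypothetical, and the store setup is deliberately left to a fixture rather than guessed at):

```python
# Hedged sketch of the icechunk round-trip property as a test that skips
# cleanly when the optional icechunk dependency isn't installed. Both fixtures
# (a path to a test netCDF file and an in-memory icechunk store) are
# hypothetical, introduced only for illustration.
import pytest
import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset

icechunk = pytest.importorskip("icechunk")


def test_icechunk_roundtrip_matches_backend(netcdf4_file, in_memory_icechunk_store):
    vds = open_virtual_dataset(netcdf4_file, backend='netcdf')
    vds.virtualize.to_icechunk(in_memory_icechunk_store)
    via_icechunk = xr.open_zarr(in_memory_icechunk_store).load()
    via_backend = xr.open_dataset(netcdf4_file, engine='h5netcdf').load()
    xrt.assert_identical(via_icechunk, via_backend)
```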