Discussions with @maxrjones made me realise that we're thinking about testing wrong. In particular, with #124 in place we will have well-defined expected properties that we can and should test against. Let me explain.
@maxrjones asked whether implementing loading via a ManifestStore/FunctionalStore as described by @ayushnag in #124 would be a problematic mixing of concerns between creating virtual references and also loading data. But actually it's the opposite: this is exactly what we should be doing.
The reason is that if we cannot correctly load data via a ManifestStore, we cannot correctly load it from an IcechunkStore either. Loading from an in-memory ManifestArray via zarr-python reading a ManifestStore and loading from an in-memory/on-disk/S3 IcechunkStore via zarr-python perform exactly the same set of decoding steps. Contrast that with how we currently load variables via a separate xarray backend (e.g. h5netcdf), which does its own decoding logic that isn't encoded in the zarr model.
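To make that claim concrete, here is a minimal sketch, assuming the ManifestStore proposed in #124 exposes the zarr store interface; the helper function and both store arguments are hypothetical, introduced only for illustration:

```python
# Minimal sketch, assuming the ManifestStore proposed in #124 exposes the zarr
# store interface. Both arguments are hypothetical objects introduced for
# illustration: a ManifestStore wrapping in-memory ManifestArrays, and an
# IcechunkStore that the same references were written to.
import xarray as xr
import xarray.testing as xrt


def assert_same_zarr_decoding(manifest_store, icechunk_store) -> None:
    ds_via_manifests = xr.open_zarr(manifest_store).load()
    ds_via_icechunk = xr.open_zarr(icechunk_store).load()
    # Only the store implementation differs; zarr-python runs the same chunk
    # fetching and codec pipeline in both cases, so the results should agree.
    xrt.assert_identical(ds_via_manifests, ds_via_icechunk)
```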
To check that we are able to correctly load variables, the property to test is:

```python
import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset

loaded_ds_from_manifests = open_virtual_dataset('file.nc', backend='netcdf', loadable_variables='all')
loaded_ds_from_backend = xr.open_dataset('file.nc', engine='h5netcdf').load()
xrt.assert_identical(loaded_ds_from_manifests, loaded_ds_from_backend)
```

i.e. we check our creation of manifests by loading from them and comparing against loading via a pre-existing xarray backend for that filetype. @dcherian told me once that it's a good thing to have multiple different ways to load data from the same file format, as it allows cross-checking. This property just encodes that. We could write a similar property for any other virtualizable file format that has an xarray backend, and each test like this should live in virtualizarr/tests/test_readers/test_hdf.py etc.
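For concreteness, here is a rough sketch of what one such per-reader test might look like; the `netcdf4_file` fixture is hypothetical, and the `backend=`/`loadable_variables='all'` arguments simply follow the property above rather than any finalised API:

```python
# Hedged sketch of a per-reader property test, e.g. for
# virtualizarr/tests/test_readers/test_hdf.py. The netcdf4_file fixture is
# hypothetical, and the open_virtual_dataset signature follows the proposal
# above rather than an existing API.
import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset


def test_load_matches_xarray_backend(netcdf4_file):
    # Load via our ManifestStore-backed path...
    via_manifests = open_virtual_dataset(netcdf4_file, backend='netcdf', loadable_variables='all')
    # ...and via the pre-existing xarray backend for this filetype.
    via_backend = xr.open_dataset(netcdf4_file, engine='h5netcdf').load()
    # Two independent loading paths should agree exactly.
    xrt.assert_identical(via_manifests, via_backend)
```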
Similarly we could also load from an in-memory icechunk store instead, which would make the property:
```python
# Assumes `icechunkstore` is a pre-existing (e.g. in-memory) icechunk store.
vds = open_virtual_dataset('file.nc', backend='netcdf')
vds.virtualize.to_icechunk(icechunkstore)
loaded_ds_from_icechunk = xr.open_zarr(icechunkstore).load()
loaded_ds_from_backend = xr.open_dataset('file.nc', engine='h5netcdf').load()
xrt.assert_identical(loaded_ds_from_icechunk, loaded_ds_from_backend)
```

This now looks a lot like our existing round-trip tests, inclusive of #376 (FYI @jsignell).
But there is no fundamental difference between how the data is loaded in these two cases: it's still zarr-python getting and decoding the data according to the ManifestArrays our readers created. So I now think we should use the ManifestStore idea (the first property) as much as possible internally, as it's the most lightweight approach and doesn't depend on the optional icechunk dependency. Then we additionally test that the second property holds for round-tripping via icechunk.
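As a rough sketch of how that split might look in the test suite while keeping icechunk optional (both fixtures are hypothetical, and the store setup is deliberately left to a fixture rather than guessed at):

```python
# Hedged sketch of the icechunk round-trip property as a test that skips
# cleanly when the optional icechunk dependency isn't installed. Both fixtures
# (a path to a test netCDF file and an in-memory icechunk store) are
# hypothetical, introduced only for illustration.
import pytest
import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset

icechunk = pytest.importorskip("icechunk")


def test_icechunk_roundtrip_matches_backend(netcdf4_file, in_memory_icechunk_store):
    vds = open_virtual_dataset(netcdf4_file, backend='netcdf')
    vds.virtualize.to_icechunk(in_memory_icechunk_store)
    via_icechunk = xr.open_zarr(in_memory_icechunk_store).load()
    via_backend = xr.open_dataset(netcdf4_file, engine='h5netcdf').load()
    xrt.assert_identical(via_icechunk, via_backend)
```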