
Extension to the HDF5 chunks API #309

@davidhassell

Currently (v1.11.1.0), support for HDF5 chunking is limited:

  • Chunking can only be set on a per-Data object basis
  • Chunking can only be defined by explicitly setting the chunks shape on each axis
  • Chunking is ignored in an output file unless native compression is on
  • Chunks from an input file are not stored

A more comprehensive and flexible API is needed:

  • cfdm.write should chunk by default, and have a keyword argument (hdf5_chunks) to configure the default chunking.
  • cfdm.read should, by default, store the HDF5 chunking on the returned data, so that it will be used when writing out to a new netCDF4 file.
  • Setting an HDF5 chunking strategy should be more intuitive. E.g. it should be easy to "chunk the time axis by 12 elements, leaving all other axes unchunked": f.nc_set_hdf_chunksizes({'T': 12})
  • Setting HDF5 chunk sizes should follow the Dask API for defining its computational chunk sizes. E.g. f.nc_set_hdf_chunksizes("8 MiB")
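To make the per-axis dictionary form concrete, here is a minimal sketch of how a partial specification such as {'T': 12} could be expanded to a full chunk shape. The function resolve_chunk_shape and its arguments are hypothetical and for illustration only; they are not part of cfdm, and the eventual implementation may differ.

```python
def resolve_chunk_shape(shape, axis_names, spec):
    """Expand a partial per-axis chunk spec to a full chunk shape.

    Hypothetical helper, not part of cfdm. Axes missing from `spec`
    are left unchunked, i.e. their chunk size is the full axis length.
    """
    unknown = set(spec) - set(axis_names)
    if unknown:
        raise ValueError(f"Unknown axes in chunk spec: {sorted(unknown)}")

    chunk_shape = []
    for name, size in zip(axis_names, shape):
        requested = spec.get(name, size)
        # A chunk can never be larger than the axis itself
        chunk_shape.append(min(requested, size))
    return tuple(chunk_shape)


# Assumed example: data with axes (T, Y, X) and shape (120, 73, 96)
print(resolve_chunk_shape((120, 73, 96), ("T", "Y", "X"), {"T": 12}))
# (12, 73, 96)
```

With this interpretation, only the time axis is split (into chunks of 12 elements), and each chunk spans the whole of the other two axes.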

PR to follow.

Labels: dataset read, dataset write, enhancement, performance
