netCDF4-python writes string (unicode) attributes as 1-d arrays, not scalars #448

@shoyer

This code writes a single string attribute to an HDF5 file using netCDF4:

# Python 3.4.3
In [1]: import netCDF4

In [3]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr.nc', 'w')

In [4]: ds.units = 'days since 1900'

In [5]: ds.close()

In [7]: !h5dump /Users/shoyer/Downloads/global-attr.nc
HDF5 "/Users/shoyer/Downloads/global-attr.nc" {
GROUP "/" {
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "days since 1900"
      }
   }
}
}

Here's code to do the same thing with h5py:

In [8]: import h5py

In [9]: f = h5py.File('/Users/shoyer/Downloads/global-attr-h5py.nc', 'w')

In [10]: f.attrs['units'] = 'days since 1900'

In [11]: f.close()

In [12]: !h5dump /Users/shoyer/Downloads/global-attr-h5py.nc
HDF5 "/Users/shoyer/Downloads/global-attr-h5py.nc" {
GROUP "/" {
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "days since 1900"
      }
   }
}
}

As the h5dump output shows, netCDF4-python writes the attribute with a "simple" dataspace, i.e. a one-dimensional array with a single element, whereas h5py writes it with a scalar dataspace:
https://www.hdfgroup.org/HDF5/doc/UG/UG_frame12Dataspaces.html

In fact, a one-element array is exactly what you get if you read the file created by netCDF4-python using h5py (to netCDF4-python and ncdump, the two files appear identical):

In [13]: f = h5py.File('/Users/shoyer/Downloads/global-attr.nc')

In [14]: f.attrs['units']
Out[14]: array([b'days since 1900'], dtype=object)
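
Until the writer side changes, code that reads such files with h5py could unwrap the value itself. The following is only a minimal sketch: `normalize_attr` is a hypothetical helper name, not part of h5py or netCDF4-python, and it assumes the attribute text is UTF-8 (or ASCII).

```python
def normalize_attr(value):
    """Return a plain str for string attributes, whether stored as scalar or 1-d.

    Hypothetical helper: unwraps a length-1, 1-d array holding a string and
    decodes bytes, leaving every other kind of value unchanged.
    """
    # Array-like values (e.g. NumPy arrays returned by h5py) expose .shape.
    shape = getattr(value, "shape", None)
    if shape == (1,) and isinstance(value[0], (bytes, str)):
        value = value[0]
    # Assumption: string attributes are encoded as UTF-8 (ASCII is a subset).
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value
```

With this, `normalize_attr(f.attrs['units'])` should give the same string for both files above. Only one-element string arrays are unwrapped, so a genuine one-element numeric array attribute is left alone.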

I believe netCDF4-python should be writing the attribute as a scalar, similarly to what it does if you write bytes (or a string on Python 2):

# python 2.7
In [11]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr-py27.nc', 'w')

In [12]: ds.bytes_str = 'days since 1900'

In [13]: ds.unicode_str = u'days since 1900'

In [14]: ds.close()

In [15]: !h5dump /Users/shoyer/Downloads/global-attr-py27.nc
HDF5 "/Users/shoyer/Downloads/global-attr-py27.nc" {
GROUP "/" {
   ATTRIBUTE "bytes_str" {
      DATATYPE  H5T_STRING {
         STRSIZE 15;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "days since 1900"
      }
   }
   ATTRIBUTE "unicode_str" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "days since 1900"
      }
   }
}
}
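
To compare dumps like these without reading them by eye, the dataspace of each attribute can be pulled out of the h5dump text. This is a rough sketch that assumes the output layout shown above (an `ATTRIBUTE "name"` line followed by a `DATASPACE` line); `attribute_dataspaces` is a hypothetical name.

```python
import re


def attribute_dataspaces(h5dump_text):
    """Map each attribute name in h5dump output to its dataspace kind.

    Assumes each ATTRIBUTE "name" line is followed, before the next
    ATTRIBUTE, by a DATASPACE line such as 'DATASPACE  SCALAR' or
    'DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }'.
    """
    result = {}
    current = None
    for line in h5dump_text.splitlines():
        match = re.search(r'ATTRIBUTE "([^"]+)"', line)
        if match:
            current = match.group(1)
            continue
        match = re.search(r'DATASPACE\s+(\w+)', line)
        if match and current is not None:
            # Record only the kind (SCALAR, SIMPLE, ...), not the extent.
            result[current] = match.group(1)
            current = None
    return result
```

Applied to the dump above, this should report `SCALAR` for `bytes_str` and `SIMPLE` for `unicode_str`.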

Given that netCDF4-python is simply using the netCDF-C library's nc_put_att_string function, this may very well be a bug upstream in the netCDF-C library.
