Commit 3433c29

committed
Completion of original review comments (mostly, a few from new set).
1 parent a1fa515 commit 3433c29

File tree

11 files changed: +250 additions, -195 deletions
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
.. _string-and-character-data:


Character and String Data Handling
----------------------------------
NetCDF can contain string and character data in at least 3 different contexts:


Characters in Data Component Names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That is, names of groups, variables, attributes or dimensions.
Component names in the API are just native Python strings.

Since NetCDF version 4, the names of components within files are fully unicode
compliant, using UTF-8.

These names can use virtually **any** characters, with the exception of the forward
slash "/", since in some technical cases a component name needs to be specified as a
"path-like" compound.

Characters in Attribute Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in string *attribute* values can likewise be read and written simply as
Python strings.

However, they are actually *stored* in an :class:`~ncdata.NcAttribute`'s
``.value`` as a character array of dtype "<U??" (that is, the dtype does not really
contain "??", but some definite length). These are returned by
:meth:`ncdata.NcAttribute.as_python_value` as a simple Python string.

A vector of strings is also a permitted attribute value, but bear in mind that
**a vector of strings is not currently supported in netCDF4 implementations**.
Thus, you cannot have an array or list of strings as an attribute value in an actual
file, and if stored to a file such an attribute will be concatenated into a single
string value.

In actual files, Unicode is again supported via UTF-8, and seamlessly encoded/decoded.

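The "<U??" storage described above can be illustrated with plain numpy alone (this is a minimal sketch of the underlying numpy behaviour, not the ncdata API; the example string is arbitrary):

```python
import numpy as np

# A Python string stored as a numpy array acquires a fixed-length unicode
# dtype: here "<U15", because the string has exactly 15 characters.
value = np.asarray("air_temperature")
print(value.dtype)   # <U15
print(value.item())  # .item() recovers a plain Python string
```

Note that the fixed length is determined per-value, which is why the dtype is written "<U??" above: each attribute's array has its own definite length.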
Characters in Variable Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in variable *data* arrays is generally stored as fixed-length arrays of
characters (i.e. fixed-width strings), and no unicode interpretation is applied by the
libraries (neither netCDF4 nor ncdata). In this case, the strings appear in Python as
numpy character arrays of dtype "<U1". All elements have the same fixed length, but
may contain zero bytes, so that they convert to variable-width (Python) strings up to a
maximum width. Trailing characters are filled with "NUL", i.e. the "\\0" character,
aka "zero byte". The (maximum) string length is a separate dimension, which is
recorded as a normal netCDF file dimension like any other.

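This fixed-width layout can be sketched with plain numpy (numpy only, no ncdata API; the array values are illustrative):

```python
import numpy as np

# Two "strings" of maximum width 4, stored as a (2, 4) array of single
# characters (dtype "<U1"). The string-length axis is an ordinary dimension.
chars = np.array(
    [list("abcd"),
     list("hi") + ["", ""]],  # trailing elements are NUL ("\0") padding
    dtype="<U1",
)

# numpy drops trailing NULs on element access, so joining along the
# string-length axis recovers variable-width Python strings:
strings = ["".join(row) for row in chars]
print(strings)  # ['abcd', 'hi']
```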
.. note::

    Although it is not tested, it has proved possible (and useful) at present to load
    files with variables containing variable-length string data, but it is
    necessary to supply an explicit user chunking to work around limitations in Dask.
    Please see the :ref:`howto example <howto_load_variablewidth_strings>`.

.. warning::

    The netCDF4 package will perform automatic character encoding/decoding of a
    character variable if it has a special ``_Encoding`` attribute. Ncdata does not
    currently allow for this. See : :ref:`known-issues`

docs/details/details_index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Detail reference topics
     ../change_log
     ./known_issues
     ./interface_support
+    ./character_handling
     ./threadlock_sharing
     ./developer_notes
docs/details/interface_support.rst

Lines changed: 24 additions & 7 deletions
@@ -17,13 +17,24 @@ variable-length and user-defined datatypes.
 Please see : :ref:`data-types`.
 
-Data Scaling, Masking and Compression
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Ncdata does not implement scaling and offset within data arrays : The ".data"
+Data Scaling and Masking
+^^^^^^^^^^^^^^^^^^^^^^^^
+Ncdata does not implement scaling and offset within variable data arrays : The ".data"
 array has the actual variable dtype, and the "scale_factor" and
 "add_offset" attributes are treated like any other attribute.
 
-The existence of a "_FillValue" attribute controls how.. TODO
+Likewise, Ncdata does not use masking within its variable data arrays: variable
+data arrays contain "raw" data, including any "fill" values -- i.e. at any missing
+data point you will have a "fill" value rather than a masked point.
+
+The "scale_factor", "add_offset" and "_FillValue" attributes are standard
+conventions described in the NetCDF documentation itself, and implemented by NetCDF
+library software including the Python netCDF4 library. To ignore these default
+interpretations, ncdata has to actually turn these features "off". The rationale is
+that the low-level unprocessed data content, equivalent to the actual file storage,
+may be more likely to form a stable common basis of equivalence, particularly
+between different system architectures.
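For reference, the standard decode that ncdata deliberately does *not* apply can be sketched in plain numpy (illustrative values only; this shows the NetCDF packing convention itself, not any ncdata or netCDF4 call):

```python
import numpy as np

# "Raw" packed data, as ncdata would hold it, plus hypothetical values for
# the "scale_factor", "add_offset" and "_FillValue" attributes.
raw = np.array([10, 20, -127], dtype=np.int8)
scale_factor, add_offset = 0.5, 100.0
fill_value = np.int8(-127)

# What a convention-aware reader (e.g. netCDF4, by default) would hand back
# instead: fill points masked, then data unpacked to the decoded dtype.
unpacked = np.ma.masked_equal(raw, fill_value) * scale_factor + add_offset
# -> values 105.0 and 110.0, with the third point masked
```

With ncdata you would see the raw ``int8`` values, including ``-127``, exactly as stored in the file.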
 
 .. _file-storage:
 
@@ -33,14 +44,20 @@ The :func:`ncdata.netcdf4.to_nc4` cannot control compression or storage options
 provided by :meth:`netCDF4.Dataset.createVariable`, which means you can't
 control the data compression and translation facilities of the NetCDF file
 library.
-If required, you should use :mod:`iris` or :mod:`xarray` for this.
+If required, you should use :mod:`iris` or :mod:`xarray` for this, i.e. use
+:meth:`xarray.Dataset.to_netcdf` or :func:`iris.save` instead of
+:func:`ncdata.netcdf4.to_nc4`, as these provide more options for controlling
+netcdf file creation.
 
 File-specific storage aspects, such as chunking, data-paths or compression
 strategies, are not recorded in the core objects. However, array representations in
 variable and attribute data (notably dask lazy arrays) may hold such information.
 
-The concept of "unlimited" dimensions is also, arguably an exception. However, this is a
-core provision in the NetCDF data model itself (see "Dimension" in the `NetCDF Classic Data Model`_).
+The concept of "unlimited" dimensions might also seem outside the abstract model of
+NetCDF data, and so not of concern to Ncdata. However, this concept is in fact a core
+property of dimensions in the classic NetCDF data model (see "Dimension" in the
+`NetCDF Classic Data Model`_), which is why it **is** an essential property of an
+NcDimension also.
 
 
 Dask chunking control

docs/details/known_issues.rst

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _known-issues:
+
 Outstanding Issues
 ==================
 
@@ -21,6 +23,12 @@ To be fixed
 
 * `issue#66 <https://github.com/pp-mo/ncdata/issues/66>`_
 
+* in conversion to/from netCDF4 files
+
+  * netCDF4 performs automatic encoding/decoding of byte data to characters, triggered
+    by the existence of an ``_Encoding`` attribute on a character type variable.
+    Ncdata does not currently account for this, and may fail to read/write correctly.
+
 
 .. _todo:
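Conceptually, the decode that netCDF4 applies here (and that ncdata currently does not account for) is byte-to-string decoding of single-character data. A rough sketch in plain Python/numpy, not the netCDF4 API itself:

```python
import numpy as np

# Char-type file data: one name of maximum width 4, NUL-padded, as raw bytes.
raw = np.array([b"h", b"i", b"", b""], dtype="S1")

# With an ``_Encoding`` attribute (e.g. "utf-8"), netCDF4 would -- roughly --
# join the bytes along the string-length axis and decode them with that codec,
# yielding ordinary Python strings instead of the raw byte elements:
decoded = b"".join(raw).decode("utf-8")
print(decoded)  # hi
```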

docs/details/threadlock_sharing.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ created from netcdf file data, which it is either computing or storing to an
 output netcdf file.
 
 In short, this is not needed when all your data is loaded with only **one** of the data
-packages (Iris, Xarray or ncata). The problem only occurs when you try to
+packages (Iris, Xarray or ncdata). The problem only occurs when you try to
 realise/calculate/save results which combine data loaded from a mixture of sources.
 
 sample code::

docs/userdocs/user_guide/_snippets.rst

Lines changed: 0 additions & 89 deletions
This file was deleted.

docs/userdocs/user_guide/common_operations.rst

Lines changed: 35 additions & 4 deletions
@@ -45,11 +45,42 @@ Example : ``dataset.variables.rename("x", "y")``
 
 Copying
 -------
-All core objects support a ``.copy()`` method, which however does not copy array content
-(e.g. variable data or attribute arrays). See for instance :meth:`ncdata.NcData.copy`.
+All core objects support a ``.copy()`` method. See for instance
+:meth:`ncdata.NcData.copy`.
 
-There is also a utility function :func:`ncdata.utils.ncdata_copy`, this is effectively
-the same as the NcData object copy.
+These however do *not* copy variable data arrays (either real or lazy), but produce new
+(copied) variables referencing the same arrays. So, for example:
+
+.. code-block::
+
+    >>> # Construct a simple test dataset
+    >>> ds = NcData(
+    ...     dimensions=[NcDimension('x', 12)],
+    ...     variables=[NcVariable('vx', ['x'], np.ones(12))]
+    ... )
+
+    >>> # Make a copy
+    >>> ds_copy = ds.copy()
+
+    >>> # The new dataset has a new matching variable with a matching data array.
+    >>> # The variables are different ...
+    >>> ds_copy.variables['vx'] is ds.variables['vx']
+    False
+    >>> # ... but the arrays are THE SAME ARRAY
+    >>> ds_copy.variables['vx'].data is ds.variables['vx'].data
+    True
+
+    >>> # So changing one actually CHANGES THE OTHER ...
+    >>> ds.variables['vx'].data[6:] = 777
+    >>> ds_copy.variables['vx'].data
+    array([1., 1., 1., 1., 1., 1., 777., 777., 777., 777., 777., 777.])
+
+If needed you can of course replace variable data with copies yourself, since you can
+freely assign to ``.data``.
+For real data, this is just ``var.data = var.data.copy()``.
+
+There is also a utility function :func:`ncdata.utils.ncdata_copy` : This is
+effectively the same thing as the NcData object :meth:`~ncdata.NcData.copy` method.
 
 
 Equality Checking

docs/userdocs/user_guide/data_objects.rst

Lines changed: 2 additions & 4 deletions
@@ -34,7 +34,7 @@ Notes :
 
 :class:`~ncdata.NcData`
 ^^^^^^^^^^^^^^^^^^^^^^^
-This represents a dataset containing variables, attributes and groups.
+This represents a dataset containing variables, dimensions, attributes and groups.
 It is also used to represent groups.
 
 :class:`~ncdata.NcDimension`
@@ -168,9 +168,7 @@ Thus to fetch an attribute you might write, for example one of these :
 
 but **not** ``unit = dataset.variables['x'].attributes['attr1']``
 
-And not ``unit = dataset.variables['x'].attributes['attr1']``
-
-Or, likewise, to ***set*** values, one of
+Or, likewise, to **set** values, one of
 
 .. code-block::
