Commit 3433c29

committed
Completion of original review comments (mostly, a few from new set).
1 parent a1fa515 commit 3433c29

File tree

11 files changed: +250 additions, -195 deletions
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
.. _string-and-character-data:


Character and String Data Handling
----------------------------------
NetCDF can contain string and character data in at least 3 different contexts:


Characters in Data Component Names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That is, names of groups, variables, attributes or dimensions.
Component names in the API are just native Python strings.

Since NetCDF version 4, the names of components within files are fully unicode
compliant, using UTF-8.

These names can use virtually **any** characters, with the exception of the forward
slash "/", since in some technical cases a component name needs to be specified as a
"path-like" compound.

Characters in Attribute Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in string *attribute* values can likewise be read and written simply as
Python strings.

However, they are actually *stored* in an :class:`~ncdata.NcAttribute`'s
``.value`` as a character array of dtype "<U??" (that is, the dtype does not really
contain "??", but some definite length). These are returned by
:meth:`ncdata.NcAttribute.as_python_value` as a simple Python string.

A vector of strings is also a permitted attribute value, but bear in mind that
**a vector of strings is not currently supported in netCDF4 implementations**.
Thus, you cannot have an array or list of strings as an attribute value in an actual
file, and if stored to a file such an attribute will be concatenated into a single
string value.

In actual files, Unicode is again supported via UTF-8, and seamlessly encoded/decoded.

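The "<U??" storage described above can be illustrated with plain numpy alone (this is a minimal sketch of the underlying numpy behaviour, not the ncdata API; the example string is arbitrary):

```python
import numpy as np

# A Python string stored as a numpy array acquires a fixed-length unicode
# dtype: here "<U15", because the string has exactly 15 characters.
value = np.asarray("air_temperature")
print(value.dtype)   # <U15
print(value.item())  # .item() recovers a plain Python string
```

Note that the fixed length is determined per-value, which is why the dtype is written "<U??" above: each attribute's array has its own definite length.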
Characters in Variable Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Character data in variable *data* arrays is generally stored as fixed-length arrays of
characters (i.e. fixed-width strings), and no unicode interpretation is applied by the
libraries (neither netCDF4 nor ncdata). In this case, the strings appear in Python as
numpy character arrays of dtype "<U1". All elements have the same fixed length, but
may contain zero bytes, so that they convert to variable-width (Python) strings up to a
maximum width. Trailing characters are filled with "NUL", i.e. the "\\0" character,
aka "zero byte". The (maximum) string length is a separate dimension, which is
recorded as a normal netCDF file dimension like any other.

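This fixed-width layout can be sketched with plain numpy (numpy only, no ncdata API; the array values are illustrative):

```python
import numpy as np

# Two "strings" of maximum width 4, stored as a (2, 4) array of single
# characters (dtype "<U1"). The string-length axis is an ordinary dimension.
chars = np.array(
    [list("abcd"),
     list("hi") + ["", ""]],  # trailing elements are NUL ("\0") padding
    dtype="<U1",
)

# numpy drops trailing NULs on element access, so joining along the
# string-length axis recovers variable-width Python strings:
strings = ["".join(row) for row in chars]
print(strings)  # ['abcd', 'hi']
```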
.. note::

    Although it is not tested, it has proved possible (and useful) at present to load
    files with variables containing variable-length string data, but it is
    necessary to supply an explicit user chunking to work around limitations in Dask.
    Please see the :ref:`howto example <howto_load_variablewidth_strings>`.

.. warning::

    The netCDF4 package will perform automatic character encoding/decoding of a
    character variable if it has a special ``_Encoding`` attribute. Ncdata does not
    currently allow for this. See : :ref:`known-issues`

docs/details/details_index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Detail reference topics
     ../change_log
     ./known_issues
     ./interface_support
+    ./character_handling
     ./threadlock_sharing
     ./developer_notes
docs/details/interface_support.rst

Lines changed: 24 additions & 7 deletions
@@ -17,13 +17,24 @@ variable-length and user-defined datatypes.
 Please see : :ref:`data-types`.
 
-Data Scaling, Masking and Compression
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Ncdata does not implement scaling and offset within data arrays : The ".data"
+Data Scaling and Masking
+^^^^^^^^^^^^^^^^^^^^^^^^
+Ncdata does not implement scaling and offset within variable data arrays : The ".data"
 array has the actual variable dtype, and the "scale_factor" and
 "add_offset" attributes are treated like any other attribute.
 
-The existence of a "_FillValue" attribute controls how.. TODO
+Likewise, Ncdata does not use masking within its variable data arrays: variable
+data arrays contain "raw" data, including any "fill" values -- i.e. at any missing
+data point you will have a "fill" value rather than a masked point.
+
+The "scale_factor", "add_offset" and "_FillValue" attributes are standard
+conventions described in the NetCDF documentation itself, and implemented by NetCDF
+library software including the Python netCDF4 library. To ignore these default
+interpretations, ncdata has to actually turn these features "off". The rationale is
+that the low-level unprocessed data content, equivalent to the actual file storage,
+may be more likely to form a stable common basis of equivalence, particularly
+between different system architectures.
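For reference, the standard decode that ncdata deliberately does *not* apply can be sketched in plain numpy (illustrative values only; this shows the NetCDF packing convention itself, not any ncdata or netCDF4 call):

```python
import numpy as np

# "Raw" packed data, as ncdata would hold it, plus hypothetical values for
# the "scale_factor", "add_offset" and "_FillValue" attributes.
raw = np.array([10, 20, -127], dtype=np.int8)
scale_factor, add_offset = 0.5, 100.0
fill_value = np.int8(-127)

# What a convention-aware reader (e.g. netCDF4, by default) would hand back
# instead: fill points masked, then data unpacked to the decoded dtype.
unpacked = np.ma.masked_equal(raw, fill_value) * scale_factor + add_offset
# -> values 105.0 and 110.0, with the third point masked
```

With ncdata you would see the raw ``int8`` values, including ``-127``, exactly as stored in the file.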
 
 .. _file-storage:
 
@@ -33,14 +44,20 @@ The :func:`ncdata.netcdf4.to_nc4` cannot control compression or storage options
 provided by :meth:`netCDF4.Dataset.createVariable`, which means you can't
 control the data compression and translation facilities of the NetCDF file
 library.
-If required, you should use :mod:`iris` or :mod:`xarray` for this.
+If required, you should use :mod:`iris` or :mod:`xarray` for this, i.e. use
+:meth:`xarray.Dataset.to_netcdf` or :func:`iris.save` instead of
+:func:`ncdata.netcdf4.to_nc4`, as these provide more options for controlling
+netcdf file creation.
 
 File-specific storage aspects, such as chunking, data-paths or compression
 strategies, are not recorded in the core objects. However, array representations in
 variable and attribute data (notably dask lazy arrays) may hold such information.
 
-The concept of "unlimited" dimensions is also, arguably an exception. However, this is a
-core provision in the NetCDF data model itself (see "Dimension" in the `NetCDF Classic Data Model`_).
+The concept of "unlimited" dimensions might also seem outside the abstract model of
+NetCDF data, and so not of concern to Ncdata. However, this concept is in fact a core
+property of dimensions in the classic NetCDF data model (see "Dimension" in the
+`NetCDF Classic Data Model`_), which is why it **is** an essential property of an
+NcDimension also.
 
 
 Dask chunking control

docs/details/known_issues.rst

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _known-issues:
+
 Outstanding Issues
 ==================
 
@@ -21,6 +23,12 @@ To be fixed
 
 * `issue#66 <https://github.com/pp-mo/ncdata/issues/66>`_
 
+* in conversion to/from netCDF4 files
+
+  * netCDF4 performs automatic encoding/decoding of byte data to characters, triggered
+    by the existence of an ``_Encoding`` attribute on a character type variable.
+    Ncdata does not currently account for this, and may fail to read/write correctly.
+
 
 .. _todo:
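Conceptually, the decode that netCDF4 applies here (and that ncdata currently does not account for) is byte-to-string decoding of single-character data. A rough sketch in plain Python/numpy, not the netCDF4 API itself:

```python
import numpy as np

# Char-type file data: one name of maximum width 4, NUL-padded, as raw bytes.
raw = np.array([b"h", b"i", b"", b""], dtype="S1")

# With an ``_Encoding`` attribute (e.g. "utf-8"), netCDF4 would -- roughly --
# join the bytes along the string-length axis and decode them with that codec,
# yielding ordinary Python strings instead of the raw byte elements:
decoded = b"".join(raw).decode("utf-8")
print(decoded)  # hi
```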

docs/details/threadlock_sharing.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ created from netcdf file data, which it is either computing or storing to an
 output netcdf file.
 
 In short, this is not needed when all your data is loaded with only **one** of the data
-packages (Iris, Xarray or ncata). The problem only occurs when you try to
+packages (Iris, Xarray or ncdata). The problem only occurs when you try to
 realise/calculate/save results which combine data loaded from a mixture of sources.
 
 sample code::

docs/userdocs/user_guide/_snippets.rst

Lines changed: 0 additions & 89 deletions
This file was deleted.

docs/userdocs/user_guide/common_operations.rst

Lines changed: 35 additions & 4 deletions
@@ -45,11 +45,42 @@ Example : ``dataset.variables.rename("x", "y")``
 
 Copying
 -------
-All core objects support a ``.copy()`` method, which however does not copy array content
-(e.g. variable data or attribute arrays). See for instance :meth:`ncdata.NcData.copy`.
+All core objects support a ``.copy()`` method. See for instance
+:meth:`ncdata.NcData.copy`.
 
-There is also a utility function :func:`ncdata.utils.ncdata_copy`, this is effectively
-the same as the NcData object copy.
+These however do *not* copy variable data arrays (either real or lazy), but produce new
+(copied) variables referencing the same arrays. So, for example:
+
+.. code-block::
+
+    >>> # Construct a simple test dataset
+    >>> ds = NcData(
+    ...     dimensions=[NcDimension('x', 12)],
+    ...     variables=[NcVariable('vx', ['x'], np.ones(12))]
+    ... )
+
+    >>> # Make a copy
+    >>> ds_copy = ds.copy()
+
+    >>> # The new dataset has a new matching variable with a matching data array.
+    >>> # The variables are different ...
+    >>> ds_copy.variables['vx'] is ds.variables['vx']
+    False
+    >>> # ... but the arrays are THE SAME ARRAY
+    >>> ds_copy.variables['vx'].data is ds.variables['vx'].data
+    True
+
+    >>> # So changing one actually CHANGES THE OTHER ...
+    >>> ds.variables['vx'].data[6:] = 777
+    >>> ds_copy.variables['vx'].data
+    array([1., 1., 1., 1., 1., 1., 777., 777., 777., 777., 777., 777.])
+
+If needed you can of course replace variable data with copies yourself, since you can
+freely assign to ``.data``.
+For real data, this is just ``var.data = var.data.copy()``.
+
+There is also a utility function :func:`ncdata.utils.ncdata_copy` : This is
+effectively the same thing as the NcData object :meth:`~ncdata.NcData.copy` method.
 
 
 Equality Checking

docs/userdocs/user_guide/data_objects.rst

Lines changed: 2 additions & 4 deletions
@@ -34,7 +34,7 @@ Notes :
 
 :class:`~ncdata.NcData`
 ^^^^^^^^^^^^^^^^^^^^^^^
-This represents a dataset containing variables, attributes and groups.
+This represents a dataset containing variables, dimensions, attributes and groups.
 It is also used to represent groups.
 
 :class:`~ncdata.NcDimension`
@@ -168,9 +168,7 @@ Thus to fetch an attribute you might write, for example one of these :
 
 but **not** ``unit = dataset.variables['x'].attributes['attr1']``
 
-And not ``unit = dataset.variables['x'].attributes['attr1']``
-
-Or, likewise, to ***set*** values, one of
+Or, likewise, to **set** values, one of
 
 .. code-block::
