Name collisions between Dataset variables and child tree nodes #38
Description
I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means __getitem__ becomes ambiguous.
See this example:
In [3]: dt = DataNode('root', data=xr.Dataset({'a': [0], 'b': 1}))
In [4]: child = DataNode('a', data=None, parent=dt)
In [5]: print(dt)
DataNode('root')
│ Dimensions: (a: 1)
│ Coordinates:
│ * a (a) int64 0
│ Data variables:
│ b int64 1
└── DataNode('a')
In [6]: dt['a']
Out[6]:
<xarray.DataArray 'a' (a: 1)>
array([0])
Coordinates:
* a (a) int64 0
In [7]: dt.get_node('a')
Out[7]: DataNode(name='a', parent='root', children=[], data=None)

Here print(dt) shows that dt is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in a netCDF file).
Furthermore, when choosing an item via DataTree.__getitem__, it merrily picks out the DataArray even though the situation is ambiguous: I might have intended to pick out the child node 'a' instead.
The node is still accessible via .get_node, but only because .get_node is inherited from TreeNode, which has no concept of data variables.
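To make the ambiguity concrete, here is a minimal sketch (not the real datatree implementation) of a stricter __getitem__ that surfaces the collision instead of silently preferring the variable. The class and attribute names here are illustrative assumptions, not datatree's actual API:

```python
# SketchNode is a hypothetical stand-in for DataTree; a plain dict
# stands in for the wrapped xarray.Dataset's variables.

class SketchNode:
    def __init__(self, name, variables=None):
        self.name = name
        self.variables = dict(variables or {})  # stand-in for dataset variables
        self.children = {}

    def __getitem__(self, key):
        in_vars = key in self.variables
        in_children = key in self.children
        if in_vars and in_children:
            # Raise rather than silently resolving the collision
            raise KeyError(
                f"{key!r} names both a variable and a child node of {self.name!r}"
            )
        if in_vars:
            return self.variables[key]
        if in_children:
            return self.children[key]
        raise KeyError(key)

root = SketchNode('root', variables={'a': [0], 'b': 1})
root.children['a'] = SketchNode('a')
try:
    root['a']
except KeyError as e:
    print(e)
```

Whether lookup should raise like this or prefer one namespace is itself a design question, but at minimum the collision would no longer pass unnoticed.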
Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:
In [8]: child = DataNode('a', data=None, parent=dt)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-8-8c4a956bcdd5> in <module>
----> 1 child = DataNode('a', data=None, parent=dt)
~/Documents/Work/Code/datatree/datatree/datatree.py in _init_single_datatree_node(cls, name, data, parent, children)
162 # This approach was inspired by xarray.Dataset._construct_direct()
163 obj = object.__new__(cls)
--> 164 obj = _init_single_treenode(obj, name=name, parent=parent, children=children)
165 obj.ds = data
166 return obj
~/Documents/Work/Code/datatree/datatree/treenode.py in _init_single_treenode(obj, name, parent, children)
13 obj.name = name
14
---> 15 obj.parent = parent
16 if children:
17 obj.children = children
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in parent(self, value)
133 self.__check_loop(value)
134 self.__detach(parent)
--> 135 self.__attach(value)
136
137 def __check_loop(self, node):
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in __attach(self, parent)
157 def __attach(self, parent):
158 if parent is not None:
--> 159 self._pre_attach(parent)
160 parentchildren = parent.__children_or_empty
161 assert not any(child is self for child in parentchildren), "Tree is corrupt." # pragma: no cover
~/Documents/Work/Code/datatree/datatree/treenode.py in _pre_attach(self, parent)
84 """
85 if self.name in list(c.name for c in parent.children):
---> 86 raise KeyError(
87 f"parent {parent.name} already has a child named {self.name}"
88 )
KeyError: 'parent root already has a child named a'

To prevent this we need better checks on assignment between variables and children. For example, TreeNode.set_node(key, new_child) currently checks for any existing children named key, but it also needs to check for any variables in the dataset named key. (That's not too hard to implement; it could be done by overloading set_node on DataTree to check against variables as well as children, for example.)
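A minimal sketch of that overloaded set_node check, using a hypothetical SketchNode class in place of DataTree (a set of names stands in for the wrapped dataset's variables):

```python
# SketchNode and its attributes are illustrative assumptions, not
# datatree's actual classes.

class SketchNode:
    def __init__(self, name, variables=None):
        self.name = name
        self.variables = set(variables or [])  # names of dataset variables
        self.children = {}

    def set_node(self, key, new_child):
        # Existing check: reject a duplicate child name
        if key in self.children:
            raise KeyError(f"parent {self.name} already has a child named {key}")
        # Proposed additional check: reject a collision with a variable
        if key in self.variables:
            raise KeyError(f"parent {self.name} already has a variable named {key}")
        self.children[key] = new_child

root = SketchNode('root', variables={'a', 'b'})
try:
    root.set_node('a', SketchNode('a'))  # collides with variable 'a'
except KeyError as e:
    print(e)
```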
What is more difficult is if a child named key exists, but the user tries to assign a variable named key to the wrapped dataset. If the user does this via node.ds.assign(key=new_da) then that's manageable: assign() returns a new dataset, which they need to assign back to the node via node.ds = node.ds.assign(key=new_da). We could check for name conflicts with children in the .ds property setter method.
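The .ds setter check could look something like this sketch, where a plain dict stands in for xarray.Dataset and all names are illustrative assumptions:

```python
# SketchNode is a hypothetical stand-in for DataTree.

class SketchNode:
    def __init__(self, name):
        self.name = name
        self.children = {}
        self._ds = {}

    @property
    def ds(self):
        return self._ds

    @ds.setter
    def ds(self, new_ds):
        # Reject any incoming variable name that shadows a child node
        clashes = set(new_ds) & set(self.children)
        if clashes:
            raise KeyError(
                f"variables {sorted(clashes)} conflict with child node names"
            )
        self._ds = new_ds

root = SketchNode('root')
root.children['a'] = SketchNode('a')
try:
    root.ds = {'a': 1}  # would shadow child 'a'
except KeyError as e:
    print(e)
```

Because node.ds = node.ds.assign(key=new_da) goes through this setter, that assignment path would be caught.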
However, if the user adds a variable via node.ds[key] = new_da then I think node.ds will be updated in-place without its wrapping DataTree class ever having a chance to intervene. A similar issue with node[key] = new_da is preventable by improving the checking in DataTree.__setitem__, but I don't know how we can prevent this happening when all that is being called is Dataset.__setitem__.
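The node-level path can be guarded with a check like this sketch of an overridden __setitem__ (hypothetical names throughout); note it does nothing for the in-place Dataset.__setitem__ case described above, which never touches the wrapping node:

```python
# SketchNode is a hypothetical stand-in for DataTree; a plain dict
# stands in for the wrapped Dataset.

class SketchNode:
    def __init__(self, name):
        self.name = name
        self.children = {}
        self._ds = {}

    def __setitem__(self, key, value):
        # Guard the node[key] = value path against shadowing a child
        if key in self.children:
            raise KeyError(
                f"cannot set variable {key!r}: node already has a child with that name"
            )
        self._ds[key] = value

root = SketchNode('root')
root.children['a'] = SketchNode('a')
try:
    root['a'] = [0]
except KeyError as e:
    print(e)
```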
I don't really know what to do about this, other than moving to a much more complicated class design which is no longer simple composition 😕 Any ideas @dcherian maybe?