Name collisions between Dataset variables and child tree nodes #38
Description
I realised that it is currently possible to get a tree into a state which (a) cannot be represented as a netCDF file, and (b) means __getitem__ becomes ambiguous.
See this example:
In [3]: dt = DataNode('root', data=xr.Dataset({'a': [0], 'b': 1}))
In [4]: child = DataNode('a', data=None, parent=dt)
In [5]: print(dt)
DataNode('root')
│ Dimensions: (a: 1)
│ Coordinates:
│ * a (a) int64 0
│ Data variables:
│ b int64 1
└── DataNode('a')
In [6]: dt['a']
Out[6]:
<xarray.DataArray 'a' (a: 1)>
array([0])
Coordinates:
* a (a) int64 0
In [7]: dt.get_node('a')
Out[7]: DataNode(name='a', parent='root', children=[], data=None)

Here print(dt) shows that dt is in a form forbidden by netCDF, because we have a child node and a variable with the same name (equivalent to having a group and a variable with the same name at the same level in a netCDF file).
Furthermore, when choosing an item via DataTree.__getitem__, it merrily picks out the DataArray even though the situation is ambiguous: I might have intended to pick out the child node 'a' instead.
The node is still accessible via .get_node, but only because .get_node is inherited from TreeNode, which has no concept of data variables.
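To make the ambiguity concrete, here is a minimal sketch (not the real datatree implementation) of a stricter __getitem__ that surfaces the collision instead of silently preferring the variable. The class and attribute names here are illustrative assumptions, not datatree's actual API:

```python
# SketchNode is a hypothetical stand-in for DataTree; a plain dict
# stands in for the wrapped xarray.Dataset's variables.

class SketchNode:
    def __init__(self, name, variables=None):
        self.name = name
        self.variables = dict(variables or {})  # stand-in for dataset variables
        self.children = {}

    def __getitem__(self, key):
        in_vars = key in self.variables
        in_children = key in self.children
        if in_vars and in_children:
            # Raise rather than silently resolving the collision
            raise KeyError(
                f"{key!r} names both a variable and a child node of {self.name!r}"
            )
        if in_vars:
            return self.variables[key]
        if in_children:
            return self.children[key]
        raise KeyError(key)

root = SketchNode('root', variables={'a': [0], 'b': 1})
root.children['a'] = SketchNode('a')
try:
    root['a']
except KeyError as e:
    print(e)
```

Whether lookup should raise like this or prefer one namespace is itself a design question, but at minimum the collision would no longer pass unnoticed.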
Contrast this silent collision of variable and child names with what happens if you try to assign two children with the same name:
In [8]: child = DataNode('a', data=None, parent=dt)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-8-8c4a956bcdd5> in <module>
----> 1 child = DataNode('a', data=None, parent=dt)
~/Documents/Work/Code/datatree/datatree/datatree.py in _init_single_datatree_node(cls, name, data, parent, children)
162 # This approach was inspired by xarray.Dataset._construct_direct()
163 obj = object.__new__(cls)
--> 164 obj = _init_single_treenode(obj, name=name, parent=parent, children=children)
165 obj.ds = data
166 return obj
~/Documents/Work/Code/datatree/datatree/treenode.py in _init_single_treenode(obj, name, parent, children)
13 obj.name = name
14
---> 15 obj.parent = parent
16 if children:
17 obj.children = children
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in parent(self, value)
133 self.__check_loop(value)
134 self.__detach(parent)
--> 135 self.__attach(value)
136
137 def __check_loop(self, node):
~/Documents/Work/Code/anytree/anytree/node/nodemixin.py in __attach(self, parent)
157 def __attach(self, parent):
158 if parent is not None:
--> 159 self._pre_attach(parent)
160 parentchildren = parent.__children_or_empty
161 assert not any(child is self for child in parentchildren), "Tree is corrupt." # pragma: no cover
~/Documents/Work/Code/datatree/datatree/treenode.py in _pre_attach(self, parent)
84 """
85 if self.name in list(c.name for c in parent.children):
---> 86 raise KeyError(
87 f"parent {parent.name} already has a child named {self.name}"
88 )
KeyError: 'parent root already has a child named a'

To prevent this we need better checks on assignment between variables and children. For example, TreeNode.set_node(key, new_child) currently checks for any existing children named key, but it also needs to check for any variables in the dataset named key. (That's not too hard to implement; it could be done by overloading set_node on DataTree to check against variables as well as children, for example.)
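A minimal sketch of that overloaded set_node check, using a hypothetical SketchNode class in place of DataTree (a set of names stands in for the wrapped dataset's variables):

```python
# SketchNode and its attributes are illustrative assumptions, not
# datatree's actual classes.

class SketchNode:
    def __init__(self, name, variables=None):
        self.name = name
        self.variables = set(variables or [])  # names of dataset variables
        self.children = {}

    def set_node(self, key, new_child):
        # Existing check: reject a duplicate child name
        if key in self.children:
            raise KeyError(f"parent {self.name} already has a child named {key}")
        # Proposed additional check: reject a collision with a variable
        if key in self.variables:
            raise KeyError(f"parent {self.name} already has a variable named {key}")
        self.children[key] = new_child

root = SketchNode('root', variables={'a', 'b'})
try:
    root.set_node('a', SketchNode('a'))  # collides with variable 'a'
except KeyError as e:
    print(e)
```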
What is more difficult is if a child named key exists, but the user tries to assign a variable named key to the wrapped dataset. If the user does this via node.ds.assign(key=new_da) then that's manageable: assign() returns a new dataset, which they need to assign back to the node via node.ds = node.ds.assign(key=new_da). We could check for name conflicts with children in the .ds property setter method.
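The .ds setter check could look something like this sketch, where a plain dict stands in for xarray.Dataset and all names are illustrative assumptions:

```python
# SketchNode is a hypothetical stand-in for DataTree.

class SketchNode:
    def __init__(self, name):
        self.name = name
        self.children = {}
        self._ds = {}

    @property
    def ds(self):
        return self._ds

    @ds.setter
    def ds(self, new_ds):
        # Reject any incoming variable name that shadows a child node
        clashes = set(new_ds) & set(self.children)
        if clashes:
            raise KeyError(
                f"variables {sorted(clashes)} conflict with child node names"
            )
        self._ds = new_ds

root = SketchNode('root')
root.children['a'] = SketchNode('a')
try:
    root.ds = {'a': 1}  # would shadow child 'a'
except KeyError as e:
    print(e)
```

Because node.ds = node.ds.assign(key=new_da) goes through this setter, that assignment path would be caught.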
However, if the user adds a variable via node.ds[key] = new_da then I think node.ds will be updated in-place without its wrapping DataTree class ever having a chance to intervene. A similar issue with node[key] = new_da is preventable by improving the checking in DataTree.__setitem__, but I don't know how we can prevent this happening when all that is being called is Dataset.__setitem__.
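The node-level path can be guarded with a check like this sketch of an overridden __setitem__ (hypothetical names throughout); note it does nothing for the in-place Dataset.__setitem__ case described above, which never touches the wrapping node:

```python
# SketchNode is a hypothetical stand-in for DataTree; a plain dict
# stands in for the wrapped Dataset.

class SketchNode:
    def __init__(self, name):
        self.name = name
        self.children = {}
        self._ds = {}

    def __setitem__(self, key, value):
        # Guard the node[key] = value path against shadowing a child
        if key in self.children:
            raise KeyError(
                f"cannot set variable {key!r}: node already has a child with that name"
            )
        self._ds[key] = value

root = SketchNode('root')
root.children['a'] = SketchNode('a')
try:
    root['a'] = [0]
except KeyError as e:
    print(e)
```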
I don't really know what to do about this, other than moving to a much more complicated class design which is no longer simple composition 😕 Any ideas @dcherian maybe?