Hi!
I just found that the diag operator does not support N-d arrays with N > 2. In my experience, it would be more useful if the N > 2 case were properly defined. For example, it is troublesome to take the diagonals of several matrices of the same shape at once. I know this can be done with a combination of arange, tile and pick, but that is complicated, confusing and error-prone. To support this, diag for N > 2 could take the diagonal over the last two axes: given an array of shape [d1, d2, d3, ..., dn-2, dn-1, dn], where the diagonal of a [dn-1, dn] matrix has length k (k = min(dn-1, dn) for the main diagonal), diag would return an array of shape [d1, d2, d3, ..., dn-2, k]. Of course, this could be made more flexible, for example by allowing the caller to specify which two axes to reduce.
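To make the proposed shape rule concrete, here is a minimal sketch in NumPy rather than MXNet. `batched_diag` is a hypothetical helper name used only for illustration, not an existing MXNet operator; it relies on `np.diagonal`, which already accepts `axis1`/`axis2` and places the diagonal on the last axis of the result.

```python
import numpy as np

def batched_diag(x, offset=0):
    """Hypothetical helper: take the diagonal over the last two axes.

    Input shape  [d1, ..., dn-2, dn-1, dn]
    Output shape [d1, ..., dn-2, k], with k = min(dn-1, dn) when offset == 0.
    """
    x = np.asarray(x)
    if x.ndim < 2:
        raise ValueError("expected an array with at least 2 dimensions")
    # np.diagonal appends the diagonal as the last axis of the result.
    return np.diagonal(x, offset=offset, axis1=-2, axis2=-1)

# A batch of 2 x 3 matrices, each of shape 4 x 5.
batch = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
diags = batched_diag(batch)
print(diags.shape)  # (2, 3, 4)
```

Each `diags[i, j]` equals the ordinary 2-D diagonal of `batch[i, j]`, which is the behaviour I would like diag to have for N > 2.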
PyTorch provides a diag operator that behaves in the same way as MXNet's current diag. TensorFlow splits it into two operators, diag and diag_part: the former constructs diagonal matrices and the latter extracts diagonals from matrices. Both are designed to support N > 2, but not in a way I find useful or flexible.
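For comparison, here is a rough NumPy emulation of what diag_part does for a rank-4 input, as I understand its documented behaviour (output[i, j] = input[i, j, i, j]). `diag_part_rank4` is a hypothetical name for this sketch, not a TensorFlow function. It shows why that design does not help with a batch of matrices: the input must have shape [D1, D2, D1, D2], and no leading axes are treated as batch axes.

```python
import numpy as np

def diag_part_rank4(x):
    """Hypothetical emulation of diag_part semantics for rank-4 input."""
    x = np.asarray(x)
    if x.ndim != 4 or x.shape[:2] != x.shape[2:]:
        raise ValueError("expected shape [D1, D2, D1, D2]")
    # Repeated einsum indices select entries where both index pairs agree.
    return np.einsum("ijij->ij", x)

x = np.arange(2 * 3 * 2 * 3).reshape(2, 3, 2, 3)
part = diag_part_rank4(x)
print(part.shape)  # (2, 3)
```

A stack of matrices with shape such as [2, 3, 4, 5] is rejected here, whereas the last-two-axes rule handles it directly.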
On the MXNet forum: https://discuss.mxnet.io/t/diag-for-n-d-arrays/1707