In RNNBase, self.all_weights holds references to all of the module's parameters. When the module is used with DataParallel, replicate replaces those parameters on each replica, but self.all_weights still holds the stale references, and those are what get sent to cuDNN.
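The stale-reference problem can be reproduced without torch. The following is a minimal pure-Python sketch, not the actual RNNBase or replicate code: `ToyParam`, `CachedRNN`, `LookupRNN`, and `toy_replicate` are hypothetical names, and the device strings are plain labels. It assumes replication makes a shallow copy of the module and then rebinds each parameter attribute. Resolving the weights by name on every access is one possible way to avoid the problem and may differ from the actual change.

```python
import copy


class ToyParam:
    """Stand-in for a parameter; `device` is just a label for which copy this is."""

    def __init__(self, name, device):
        self.name = name
        self.device = device


class CachedRNN:
    """Caches the parameter objects themselves at construction time."""

    def __init__(self):
        self.w_ih = ToyParam("w_ih", "gpu0")
        self.w_hh = ToyParam("w_hh", "gpu0")
        # These references are never refreshed when the attributes are rebound.
        self.all_weights = [self.w_ih, self.w_hh]


class LookupRNN:
    """Stores only parameter names and resolves them on every access."""

    def __init__(self):
        self.w_ih = ToyParam("w_ih", "gpu0")
        self.w_hh = ToyParam("w_hh", "gpu0")
        self._weight_names = ["w_ih", "w_hh"]

    @property
    def all_weights(self):
        return [getattr(self, name) for name in self._weight_names]


def toy_replicate(module, device):
    """Hypothetical replicate: shallow-copy the module, then rebind its parameters."""
    replica = copy.copy(module)
    for name in ("w_ih", "w_hh"):
        setattr(replica, name, ToyParam(name, device))
    return replica


stale = toy_replicate(CachedRNN(), "gpu1")
fresh = toy_replicate(LookupRNN(), "gpu1")

# The cached list still points at the original gpu0 objects.
print([p.device for p in stale.all_weights])
# The name-based lookup follows the rebound attributes.
print([p.device for p in fresh.all_weights])
```

In the cached variant the replica's own attributes point at the new copies while its all_weights list does not, which is the mismatch described above.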
In thnn/sparse.py, Embedding's backward should create grad_weight, _indices, _counts, and _sorted on the same device as grad_output.