Always deserializing onto device 0 is dangerous, because device 0 often doesn't have enough free memory.
Getting this right requires some heuristics (e.g., what should happen if you deserialize onto a machine with a different number of GPUs?). You can look at what we did in cunn. I think this is low priority.
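One possible heuristic, sketched here with a toy stand-in class rather than real CUDA tensors (the `Tensor` class and `load_with_device_remap` helper below are hypothetical, not PyTorch API): record the device id at save time, and on load clamp it to the GPUs actually present on the target machine.

```python
import pickle

class Tensor:
    """Toy stand-in for a CUDA tensor: only records its device id."""
    def __init__(self, device):
        self.device = device

def load_with_device_remap(data, num_gpus):
    """Deserialize, then remap any saved device id that doesn't exist
    on this machine (hypothetical heuristic, not PyTorch's behavior)."""
    obj = pickle.loads(data)
    if obj.device >= num_gpus:
        # Wrap around rather than silently piling everything on device 0.
        obj.device = obj.device % num_gpus
    return obj

data = pickle.dumps(Tensor(device=3))   # saved on a 4-GPU machine
t = load_with_device_remap(data, num_gpus=2)  # loaded on a 2-GPU machine
print(t.device)  # 1, not an invalid device 3 and not always 0
```

Wrapping modulo the GPU count is just one choice; falling back to a user-supplied mapping would be more flexible, since round-robin placement may not match how memory is actually laid out on the target machine.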
In [127]: with open('checkpoint4.pt', 'wb') as f:
     ...:     pickle.dump(torch.FloatTensor(10).cuda(3), f)
     ...:

In [128]: with open('checkpoint4.pt', 'rb') as f:
     ...:     obj = pickle.load(f)
     ...:

In [130]: obj.getDevice()
Out[130]: 0