This repository was archived by the owner on Jan 22, 2025. It is now read-only.

enable moving traced model between devices#203

Closed
wat3rBro wants to merge 1 commit intofacebookresearch:mainfrom
wat3rBro:export-D35367772

Conversation

@wat3rBro
Contributor

@wat3rBro wat3rBro commented Apr 5, 2022

Summary:
For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` call causes problems when moving the traced TorchScript module to another device (e.g. from CPU to GPU, or even from `cuda:0` to `cuda:1`). The reason is that the device is not a `torch.Tensor`, so the tracer simply hardcodes its value during tracing. The solution is to script the casting operation.

Here's the code snippet illustrating this:

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# define MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the cast instead of tracing it; the following code
# demonstrates how, using mixed scripting/tracing.
@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
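As a quick sanity check (a sketch, not part of this diff), a traced module can be scanned for the hardcoded `Device` constant that causes the failure above; it shows up in the printed graph as, e.g., `%16 : Device = prim::Constant[value="cpu"]()`:

```python
import torch
import torch.nn as nn


class ToCpu(nn.Module):
    def forward(self, x):
        # same pattern as MyModel: the device is read at trace time
        # and baked into the graph as a constant
        return x.to(torch.device("cpu")) + 1


def has_hardcoded_device(ts: torch.jit.ScriptModule) -> bool:
    # a Device-typed prim::Constant in the graph means the module
    # will not survive being moved to another device
    return ": Device = prim::Constant" in str(ts.graph)


ts = torch.jit.trace(ToCpu(), torch.zeros(1))
print(has_hardcoded_device(ts))
```

Running such a check before shipping an exported model would catch the device-hardcoding problem early, without needing a second GPU to reproduce the runtime error.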

For D2 (87374ef), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`, and updates `rcnn.py` and all its utils.
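The helper's implementation isn't shown here; a minimal sketch of what `move_tensor_device_same_as_another(A, B)` could look like, following the `cast_device_like` pattern above (the actual code in the diff may differ):

```python
import torch


@torch.jit.script_if_tracing
def move_tensor_device_same_as_another(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    # Scripted rather than traced, so the device lookup stays dynamic
    # instead of being baked into the graph as a constant.
    return src.to(dst.device)


# eager-mode behavior is unchanged: src follows dst's device
x = move_tensor_device_same_as_another(torch.zeros(2), torch.ones(2))
print(x.device)  # cpu
```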

For D2Go, since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor type.

Differential Revision: D35367772

@facebook-github-bot added the CLA Signed and fb-exported labels on Apr 5, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35367772

wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 6, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

Differential Revision: D35367772

fbshipit-source-id: 78358affec19b9a68e2e25a5dff2124d86143189
wat3rBro added a commit to wat3rBro/d2go that referenced this pull request Apr 6, 2022
Summary:
X-link: facebookresearch/detectron2#4132

X-link: fairinternal/detectron2#568

Pull Request resolved: facebookresearch#203

Differential Revision: D35367772

fbshipit-source-id: 02b93d2e06fb3a81d5fa09805b3f85e48dc550e6
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35367772

wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 7, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

Differential Revision: D35367772

fbshipit-source-id: b3db1dcee554b1075f6db50094b4f5ed11496efc
wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 12, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

Differential Revision: D35367772
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is to script the cast instead of tracing it; the following code
# demonstrates how, using mixed scripting/tracing.
@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # Cast the input to the same device as this model; this makes it possible to
        # take a CPU tensor as input when the model is on a GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`, and updates `rcnn.py` and all its utils.

For D2Go, since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor type.

Update (April 11):
Added a test covering tracing on one device and moving the traced model to another device for inference. When a GPU is available, it traces on `cuda:0` and runs inference on `cpu` and `cuda:0` (and `cuda:N-1` if available).
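A minimal sketch of the kind of test described above (the actual test in this diff may be structured differently); `move_device_like` here is a stand-in helper defined the same way as `cast_device_like` in the snippet above:

```python
import torch
from torch import nn

# Stand-in for the scripted device-casting helper; because it is scripted,
# the destination device is read at run time instead of being traced as a constant.
@torch.jit.script_if_tracing
def move_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)

    def forward(self, x):
        # Cast the (possibly CPU) input to wherever the weights live.
        x = move_device_like(x, self.conv.weight)
        return self.conv(x)

# Trace once, then run the same traced module on every available device;
# the input always stays on CPU.
ts = torch.jit.trace(TinyModel(), torch.rand(1, 3, 16, 16))
for dev in ["cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]:
    ts = ts.to(dev)
    y = ts(torch.rand(1, 3, 16, 16))
    assert str(y.device) == dev
```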

Summary of the device-related patterns:
- `.to(dtype=another_dtype)` does not affect the device.
- Explicit device casting like `.to(device)` can generally be replaced by `move_device_like`.
- For tensors created directly on a device (e.g. `torch.zeros`, `torch.arange`), we can replace them with a ScriptModule to avoid first creating on CPU and then moving to the new device.
    - Creating tensors on the tracing device and then moving them to a new device is dangerous, because the tracing device (e.g. `cuda:0`) might not be available at run time (e.g. on a CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because sizes behave differently during tracing (int vs. scalar tensor); in this diff we still create on CPU first and then move to the target device.
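The pattern of creating a tensor directly on the right device can be sketched like this; `arange_like` is a hypothetical helper for illustration, not the one added by this diff:

```python
import torch
from torch import nn

# Scripted, so `like.device` is resolved at run time; the tensor is created
# directly on that device rather than created on CPU and then moved.
@torch.jit.script_if_tracing
def arange_like(n: int, like: torch.Tensor) -> torch.Tensor:
    return torch.arange(n, device=like.device)

class Offsets(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.ones(1))

    def forward(self, x):
        # Note: n itself is still captured as an int during tracing (the
        # image_list.py caveat above); only the device is kept dynamic here.
        idx = arange_like(x.shape[-1], self.scale)
        return x + idx * self.scale

ts = torch.jit.trace(Offsets(), torch.zeros(2, 4))
```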

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 15fca8173561b33e6366e74c683a521580325f22
wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 12, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` will cause problem when moving the traced torchscript to another device (eg. from cpu to gpu, or even, from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer just hardcode the value during tracing. The solution is scripting the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the casting instead of tracing it, the folloing code demonstrate how to do it. We need to use mixed scripting/tracing
torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`. This diff updates the `rcnn.py` and all its utils.

For D2 (facebookresearch@11528ce083dc9ff83ee3a8f9086a1ef54d2a402f)Go, since the exported model will become device-agnostic, we can remove the "_gpu" from predictor-type.

Update (April 11):
Add test to cover tracing on one device and move traced model to another device for inference. When GPU is available, it'll trace on `cuda:0` and run inference on `cpu`, `cuda:0` (and `cuda:N-1` if available).

Summary of the device related patterns
- The usage of `.to(dtype=another_dype)` won't affect device.
- Explicit device casting like `.to(device)` can be generally replaced by `move_device_like`.
- For creating variable directly on device (eg. `torch.zeros`, `torch.arange`), we can replace then with ScriptModule to avoid first create on CPU and then move to new device.
    - Creating things on tracing device and then moving to new device is dangerous, because tracing device (eg. `cuda:0`) might not be available (eg. running on CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor), in this diff, still create on CPU first and then move to target device.

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 79b49201dd33a628f304cd1de3db3eb93c5bfc34
wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 12, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` will cause problem when moving the traced torchscript to another device (eg. from cpu to gpu, or even, from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer just hardcode the value during tracing. The solution is scripting the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the casting instead of tracing it, the folloing code demonstrate how to do it. We need to use mixed scripting/tracing
torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`. This diff updates the `rcnn.py` and all its utils.

For D2 (facebookresearch@11528ce083dc9ff83ee3a8f9086a1ef54d2a402f)Go, since the exported model will become device-agnostic, we can remove the "_gpu" from predictor-type.

Update (April 11):
Add test to cover tracing on one device and move traced model to another device for inference. When GPU is available, it'll trace on `cuda:0` and run inference on `cpu`, `cuda:0` (and `cuda:N-1` if available).

Summary of the device related patterns
- The usage of `.to(dtype=another_dype)` won't affect device.
- Explicit device casting like `.to(device)` can be generally replaced by `move_device_like`.
- For creating variable directly on device (eg. `torch.zeros`, `torch.arange`), we can replace then with ScriptModule to avoid first create on CPU and then move to new device.
    - Creating things on tracing device and then moving to new device is dangerous, because tracing device (eg. `cuda:0`) might not be available (eg. running on CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor), in this diff, still create on CPU first and then move to target device.

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 65e1c4903323c3fd02caccedaf97ac9ac052531f
wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 12, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` will cause problem when moving the traced torchscript to another device (eg. from cpu to gpu, or even, from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer just hardcode the value during tracing. The solution is scripting the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the casting instead of tracing it, the folloing code demonstrate how to do it. We need to use mixed scripting/tracing
torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`. This diff updates the `rcnn.py` and all its utils.

For D2 (facebookresearch@11528ce083dc9ff83ee3a8f9086a1ef54d2a402f)Go, since the exported model will become device-agnostic, we can remove the "_gpu" from predictor-type.

Update (April 11):
Add test to cover tracing on one device and move traced model to another device for inference. When GPU is available, it'll trace on `cuda:0` and run inference on `cpu`, `cuda:0` (and `cuda:N-1` if available).

Summary of the device related patterns
- The usage of `.to(dtype=another_dype)` won't affect device.
- Explicit device casting like `.to(device)` can be generally replaced by `move_device_like`.
- For creating variable directly on device (eg. `torch.zeros`, `torch.arange`), we can replace then with ScriptModule to avoid first create on CPU and then move to new device.
    - Creating things on tracing device and then moving to new device is dangerous, because tracing device (eg. `cuda:0`) might not be available (eg. running on CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor), in this diff, still create on CPU first and then move to target device.

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 4acaacaa9e6e848db0f14bc357520b2e00d75291
wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 12, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` call causes problems when moving the traced TorchScript model to another device (eg. from CPU to GPU, or even from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer simply hardcodes its value during tracing. The solution is to script the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the casting instead of tracing it; the following
# code demonstrates how to do it using mixed scripting/tracing.
@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`, and updates `rcnn.py` and all its utils.

For D2Go (facebookresearch@11528ce083dc9ff83ee3a8f9086a1ef54d2a402f), since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor type.

Update (April 11):
Add a test covering tracing on one device and moving the traced model to another device for inference. When a GPU is available, it traces on `cuda:0` and runs inference on `cpu` and `cuda:0` (and `cuda:N-1` if available).

Summary of the device-related patterns:
- Using `.to(dtype=another_dtype)` doesn't affect the device.
- Explicit device casting like `.to(device)` can generally be replaced by `move_device_like`.
- For creating tensors directly on a device (eg. `torch.zeros`, `torch.arange`), we can replace them with a ScriptModule to avoid creating on CPU first and then moving to the new device.
    - Creating tensors on the tracing device and then moving them to a new device is dangerous, because the tracing device (eg. `cuda:0`) might not be available at inference time (eg. on a CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor); in this diff it still creates on CPU first and then moves to the target device.

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 64914731f68d93c76a2dcbc0f43bf5343b435da8
wat3rBro added a commit to wat3rBro/d2go that referenced this pull request Apr 12, 2022
Summary:
X-link: facebookresearch/detectron2#4132

X-link: fairinternal/detectron2#568

Pull Request resolved: facebookresearch#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` call causes problems when moving the traced TorchScript model to another device (eg. from CPU to GPU, or even from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer simply hardcodes its value during tracing. The solution is to script the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the casting instead of tracing it; the following
# code demonstrates how to do it using mixed scripting/tracing.
@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@87374ef), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`, and updates `rcnn.py` and all its utils.

For D2Go (facebookresearch@87374efb134e539090e0b5c476809dc35bf6aedb), since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor type.

Update (April 11):
Add a test covering tracing on one device and moving the traced model to another device for inference. When a GPU is available, it traces on `cuda:0` and runs inference on `cpu` and `cuda:0` (and `cuda:N-1` if available).

Summary of the device-related patterns:
- Using `.to(dtype=another_dtype)` doesn't affect the device.
- Explicit device casting like `.to(device)` can generally be replaced by `move_device_like`.
- For creating tensors directly on a device (eg. `torch.zeros`, `torch.arange`), we can replace them with a ScriptModule to avoid creating on CPU first and then moving to the new device.
    - Creating tensors on the tracing device and then moving them to a new device is dangerous, because the tracing device (eg. `cuda:0`) might not be available at inference time (eg. on a CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor); in this diff it still creates on CPU first and then moves to the target device.

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 404965f5908e17e6d9aeb68cfcb7c6dc5be42c02
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35367772

wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 15, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` call causes problems when moving the traced TorchScript model to another device (eg. from CPU to GPU, or even from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer simply hardcodes its value during tracing. The solution is to script the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the casting instead of tracing it; the following
# code demonstrates how to do it using mixed scripting/tracing.
@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`, and updates `rcnn.py` and all its utils.

For D2Go (facebookresearch@11528ce083dc9ff83ee3a8f9086a1ef54d2a402f), since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor type.

Update (April 11):
Add a test covering tracing on one device and moving the traced model to another device for inference. When a GPU is available, it traces on `cuda:0` and runs inference on `cpu` and `cuda:0` (and `cuda:N-1` if available).

Summary of the device-related patterns:
- Using `.to(dtype=another_dtype)` doesn't affect the device.
- Explicit device casting like `.to(device)` can generally be replaced by `move_device_like`.
- For creating tensors directly on a device (eg. `torch.zeros`, `torch.arange`), we can replace them with a ScriptModule to avoid creating on CPU first and then moving to the new device.
    - Creating tensors on the tracing device and then moving them to a new device is dangerous, because the tracing device (eg. `cuda:0`) might not be available at inference time (eg. on a CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor); in this diff it still creates on CPU first and then moves to the target device.

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 20d037156026d5bd3b3394858752b4c4f1d24ae4
wat3rBro added a commit to wat3rBro/d2go that referenced this pull request Apr 15, 2022
Summary:
X-link: facebookresearch/detectron2#4132

X-link: fairinternal/detectron2#568

Pull Request resolved: facebookresearch#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing the `.to(device)` call causes problems when moving the traced TorchScript model to another device (eg. from CPU to GPU, or even from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer simply hardcodes its value during tracing. The solution is to script the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is scripting the casting instead of tracing it; the following
# code demonstrates how to do it using mixed scripting/tracing.
@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (facebookresearch@87374ef), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`. This diff updates `rcnn.py` and all of its utils.

For D2Go (facebookresearch@87374efb134e539090e0b5c476809dc35bf6aedb), since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor-type.

Update (April 11):
Added a test covering tracing on one device and moving the traced model to another device for inference. When a GPU is available, it traces on `cuda:0` and runs inference on `cpu` and `cuda:0` (and on `cuda:N-1` if more than one GPU is available).
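The shape of that test can be sketched roughly as follows — a hypothetical minimal version with a toy model (`TinyModel` and `check_device_roundtrip` are illustrative names, not the actual test added in this diff):

```python
import torch
import torch.nn as nn

@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    # Scripted, so the device is read at run time instead of being
    # hardcoded into the traced graph as a prim::Constant.
    return src.to(dst.device)

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        x = cast_device_like(x, self.linear.weight)
        return self.linear(x)

def check_device_roundtrip():
    model = TinyModel()
    ts = torch.jit.trace(model, torch.zeros(1, 4))
    devices = ["cpu"]
    if torch.cuda.is_available():
        devices.append("cuda:0")
        if torch.cuda.device_count() > 1:
            devices.append("cuda:%d" % (torch.cuda.device_count() - 1))
    for dev in devices:
        ts.to(dev)
        # CPU input; the scripted cast moves it onto the module's device.
        y = ts(torch.zeros(1, 4))
        assert y.device.type == torch.device(dev).type
    return devices
```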

Summary of the device-related patterns:
- `.to(dtype=another_dtype)` doesn't affect the device.
- Explicit device casts like `.to(device)` can generally be replaced by `move_device_like`.
- For creating variables directly on a device (eg. `torch.zeros`, `torch.arange`), we can replace them with a ScriptModule to avoid first creating on CPU and then moving to the target device.
    - Creating tensors on the tracing device and then moving them to a new device is dangerous, because the tracing device (eg. `cuda:0`) might not be available (eg. when running on a CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor), so in this diff we still create on CPU first and then move to the target device.
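The on-device-creation pattern above can be sketched with a scripted creation helper — a hypothetical example, not the code in this diff (`arange_like_device` is an illustrative name):

```python
import torch

@torch.jit.script_if_tracing
def arange_like_device(n: int, like: torch.Tensor) -> torch.Tensor:
    # Because this runs as a scripted function, `like.device` is read at
    # run time, so the result is created directly on whatever device the
    # model currently lives on -- no CPU detour, no hardcoded device.
    return torch.arange(n, device=like.device)
```

During tracing this compiles to a function call whose device argument follows the reference tensor, instead of a device constant baked in at trace time.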

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: c38d0ecb69225f5f5268fd8df14448416d12b446
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35367772

wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 15, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039


For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`. This diff updates `rcnn.py` and all of its utils.

For D2Go (facebookresearch@11528ce083dc9ff83ee3a8f9086a1ef54d2a402f), since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor-type.


Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: bc357b4bd65bc11ebf6c44e201c9bca29586fa84
Summary:
X-link: facebookresearch/detectron2#4132

X-link: fairinternal/detectron2#568

Pull Request resolved: facebookresearch#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039


For D2 (facebookresearch@87374ef), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`. This diff updates `rcnn.py` and all of its utils.

For D2Go (facebookresearch@87374efb134e539090e0b5c476809dc35bf6aedb), since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor-type.


Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 190c97eff77acec465f49f4337764060ec75d5e4
wat3rBro added a commit to wat3rBro/detectron2-1 that referenced this pull request Apr 15, 2022
Summary:
Pull Request resolved: facebookresearch#4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039


For D2 (facebookresearch@11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`. This diff updates `rcnn.py` and all of its utils.

For D2Go (facebookresearch@11528ce083dc9ff83ee3a8f9086a1ef54d2a402f), since the exported model becomes device-agnostic, we can remove the "_gpu" suffix from the predictor-type.


Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: c4493daa381e0c22bfbfc82915f4034c9b55fbd0
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35367772

facebook-github-bot pushed a commit to facebookresearch/detectron2 that referenced this pull request Apr 15, 2022
Summary:
Pull Request resolved: #4132

X-link: fairinternal/detectron2#568

X-link: facebookresearch/d2go#203

For full discussion: https://fb.workplace.com/groups/1405155842844877/posts/5744470455580039

Tracing `.to(device)` causes problems when moving the traced TorchScript model to another device (eg. from cpu to gpu, or even from `cuda:0` to `cuda:1`). The reason is that `device` is not a `torch.Tensor`, so the tracer just hardcodes the value during tracing. The solution is to script the casting operation.

Here's the code snippet illustrating this:
```
# define the MyModel similar to GeneralizedRCNN, which casts the input to the model's device
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = x.to(self.conv1.weight.device)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_0.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %14 : int = prim::Constant[value=6]() # <ipython-input-2-5abde0efc36f>:11:0
  %15 : int = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %16 : Device = prim::Constant[value="cpu"]() # <ipython-input-2-5abde0efc36f>:11:0
  %17 : NoneType = prim::Constant()
  %18 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %19 : bool = prim::Constant[value=0]() # <ipython-input-2-5abde0efc36f>:11:0
  %20 : NoneType = prim::Constant()
  %input.1 : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu) = aten::to(%x, %14, %15, %16, %17, %18, %19, %20) # <ipython-input-2-5abde0efc36f>:11:0
  %72 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%72) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %73 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %61 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%73) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%61)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# torchscript cuda doesn't work
ts = ts.to("cuda")
y = ts(x)

# =====================================================
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-2aece3ad6c9a> in <module>
      7 # torchscript cuda doesn't work
      8 ts = ts.to("cuda")
----> 9 y = ts(x)
/mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []
RuntimeError: The following operation failed in the TorchScript interpreter.
# =====================================================

# One solution is to script the device cast instead of tracing it; the following
# code demonstrates how, using mixed scripting/tracing.
@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    return src.to(dst.device)

class MyModel2(nn.Module):
    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        # cast the input to the same device as this model, this makes it possible to
        # take a cpu tensor as input when the model is on GPU.
        x = cast_device_like(x, self.conv1.weight)

        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

# export the model by tracing
model = MyModel2()
x = torch.zeros([1, 3, 32, 32])
ts = torch.jit.trace(model, x)
print(ts.graph)

# =====================================================
graph(%self.1 : __torch__.MyModel2,
      %x : Float(1, 3, 32, 32, strides=[3072, 1024, 32, 1], requires_grad=0, device=cpu)):
  %conv2 : __torch__.torch.nn.modules.conv.___torch_mangle_5.Conv2d = prim::GetAttr[name="conv2"](%self.1)
  %conv1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.___torch_mangle_4.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %weight.5 : Tensor = prim::GetAttr[name="weight"](%conv1.1)
  %14 : Function = prim::Constant[name="cast_device_like"]()
  %input.1 : Tensor = prim::CallFunction(%14, %x, %weight.5)
  %68 : Tensor = prim::CallMethod[name="forward"](%conv1, %input.1)
  %input.5 : Float(1, 20, 28, 28, strides=[15680, 784, 28, 1], requires_grad=1, device=cpu) = aten::relu(%68) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  %69 : Tensor = prim::CallMethod[name="forward"](%conv2, %input.5)
  %55 : Float(1, 20, 24, 24, strides=[11520, 576, 24, 1], requires_grad=1, device=cpu) = aten::relu(%69) # /mnt/xarfuse/uid-20293/a90d1698-seed-nspid4026533681_cgpid21128615-ns-4026533618/torch/nn/functional.py:1406:0
  return (%55)
# =====================================================

# PyTorch cuda works
model = copy.deepcopy(model)
model.to("cuda")
y = model(x)
# torchscript cpu works
y = ts(x)
# Note that now torchscript cuda works
ts = ts.to("cuda")
y = ts(x)
print(y.device)

# =====================================================
cuda:0
# =====================================================
```

For D2 (11528ce), this diff creates a `move_tensor_device_same_as_another(A, B)` function to replace `A.to(B.device)`, and updates `rcnn.py` and all its utils accordingly.

For D2Go, since the exported model becomes device-agnostic, we can remove the "_gpu" from the predictor type.

Update (April 11):
Add test to cover tracing on one device and move traced model to another device for inference. When GPU is available, it'll trace on `cuda:0` and run inference on `cpu`, `cuda:0` (and `cuda:N-1` if available).
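A minimal sketch of such a round-trip test (the module and helper names here are illustrative, not the actual d2go test code): trace on the last available device, then run inference on every device while feeding a CPU input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.jit.script_if_tracing
def cast_device_like(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    # Scripted, so the device is looked up at run time instead of
    # being hardcoded into the trace.
    return src.to(dst.device)

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = cast_device_like(x, self.conv1.weight)
        return F.relu(self.conv2(F.relu(self.conv1(x))))

def test_trace_then_move():
    devices = ["cpu"]
    if torch.cuda.is_available():
        devices += ["cuda:0", "cuda:%d" % (torch.cuda.device_count() - 1)]
    # Trace on the last device in the list ...
    trace_device = devices[-1]
    model = TinyModel().to(trace_device)
    example = torch.zeros(1, 3, 32, 32, device=trace_device)
    ts = torch.jit.trace(model, example)
    # ... then run inference on every device, always feeding a CPU tensor;
    # the scripted cast moves the input to the model's current device.
    x = torch.zeros(1, 3, 32, 32)
    for dev in devices:
        y = ts.to(dev)(x)
        assert y.device.type == torch.device(dev).type
```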

Summary of device-related patterns:
- The usage of `.to(dtype=another_dtype)` won't affect the device.
- Explicit device casting like `.to(device)` can generally be replaced by `move_device_like`.
- For creating variables directly on a device (e.g. `torch.zeros`, `torch.arange`), we can replace them with scripted code to avoid first creating on CPU and then moving to the new device.
    - Creating things on the tracing device and then moving them to a new device is dangerous, because the tracing device (e.g. `cuda:0`) might not be available (e.g. when running on a CPU-only machine).
    - It's hard to write `image_list.py` in this pattern because the size behaves differently during tracing (int vs. scalar tensor), so in this diff we still create on CPU first and then move to the target device.
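The device-aware creation pattern from the list above can be sketched as follows; `arange_on_device_of` is a hypothetical helper (not the actual d2go code) that creates the tensor directly on a reference tensor's device rather than creating it on CPU and moving it.

```python
import torch

@torch.jit.script_if_tracing
def arange_on_device_of(n: int, like: torch.Tensor) -> torch.Tensor:
    # Because this function is scripted during tracing, `like.device` is
    # evaluated at run time; plain tracing would instead bake in whatever
    # device was seen at trace time.
    return torch.arange(n, device=like.device)

# Eager usage: the result lands directly on the reference tensor's device.
ref = torch.zeros(4)               # cpu tensor
idx = arange_on_device_of(3, ref)  # created directly on ref's device
```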

Reviewed By: tglik

Differential Revision: D35367772

fbshipit-source-id: 02d07e3d96da85f4cfbeb996e3c14c2a6f619beb