🐛 Bug
The first 10-20 calls to forward are very slow for a torch.jit.script version of the pre-packaged Inception v3 model from torchvision.
To Reproduce
Code:
import sys
import time

import psutil
import torch
from torch import nn
from torchvision import models


def get_vms():
    # Virtual memory size of the current process, in bytes.
    return psutil.Process().memory_info().vms


def sync_cuda():
    t0 = time.time()
    torch.cuda.synchronize('cuda')
    print(f"cuda sync took {time.time() - t0:.3f} second(s)")


if __name__ == '__main__':
    device = sys.argv[1]  # 'cpu' or 'cuda'
    net = models.inception_v3(pretrained=True, aux_logits=True)
    net = torch.jit.script(net)
    if device == 'cuda':
        sync_cuda()
    net = net.to(device).eval()
    input = torch.randn(1, 3, 299, 299, requires_grad=False).to(device)
    if device == 'cuda':
        sync_cuda()
    out0 = None
    with torch.no_grad():
        for i in range(30):
            t0 = time.time()
            pred = net(input)
            # Compare against the previous iteration's logits to spot drift.
            if out0 is not None and not torch.allclose(out0.logits, pred.logits):
                print(f"{i}: logits aren't allclose; abs-sum={(out0.logits - pred.logits).abs().sum()}")
            print(f"{i} vms:{get_vms() / 1024 / 1024 / 1024:.3f}Gb seconds/iter={time.time() - t0:.3f}")
            out0 = pred
CPU-based run:
(pt-nightly) igor@w ~/playground$ python memory_leakage_test.py cpu
0 vms:4.635Gb seconds/iter=2.548
1 vms:4.910Gb seconds/iter=10.017
2 vms:5.153Gb seconds/iter=10.036
3 vms:5.386Gb seconds/iter=9.321
4 vms:5.620Gb seconds/iter=8.975
5 vms:5.844Gb seconds/iter=7.905
6 vms:6.058Gb seconds/iter=7.819
7 vms:6.262Gb seconds/iter=6.870
8 vms:6.457Gb seconds/iter=6.727
9 vms:6.642Gb seconds/iter=5.774
10 vms:6.824Gb seconds/iter=5.617
11 vms:7.002Gb seconds/iter=5.067
12 vms:7.162Gb seconds/iter=5.008
13 vms:7.310Gb seconds/iter=3.942
14 vms:7.443Gb seconds/iter=3.695
15 vms:7.565Gb seconds/iter=2.950
16 vms:7.672Gb seconds/iter=2.746
17 vms:7.767Gb seconds/iter=2.042
18 vms:7.846Gb seconds/iter=1.871
19 vms:7.911Gb seconds/iter=1.318
20 vms:7.966Gb seconds/iter=1.048
21 vms:8.014Gb seconds/iter=0.841
22 vms:8.048Gb seconds/iter=0.813
23 vms:8.070Gb seconds/iter=0.561
24 vms:8.082Gb seconds/iter=0.468
25 vms:8.095Gb seconds/iter=0.370
26 vms:8.099Gb seconds/iter=0.327
CUDA run:
(pt-nightly) igor@w ~/playground$ python memory_leakage_test.py cuda
cuda sync took 2.181 second(s)
cuda sync took 0.000 second(s)
0 vms:10.510Gb seconds/iter=2.357
1: logits aren't allclose; abs-sum=0.0007971674203872681
1 vms:10.769Gb seconds/iter=9.826
2 vms:11.018Gb seconds/iter=9.541
3 vms:11.261Gb seconds/iter=8.995
4 vms:11.493Gb seconds/iter=8.586
5 vms:11.717Gb seconds/iter=7.597
6 vms:11.929Gb seconds/iter=7.358
7 vms:12.135Gb seconds/iter=6.468
8 vms:12.329Gb seconds/iter=6.244
9 vms:12.516Gb seconds/iter=5.422
10 vms:12.697Gb seconds/iter=5.233
11 vms:12.874Gb seconds/iter=4.771
12 vms:13.034Gb seconds/iter=4.647
13 vms:13.185Gb seconds/iter=3.569
14 vms:13.316Gb seconds/iter=3.374
15 vms:13.439Gb seconds/iter=2.519
16 vms:13.545Gb seconds/iter=2.359
17 vms:13.642Gb seconds/iter=1.663
18 vms:13.721Gb seconds/iter=1.508
19 vms:13.787Gb seconds/iter=0.950
20 vms:13.841Gb seconds/iter=0.739
21 vms:13.889Gb seconds/iter=0.504
22 vms:13.922Gb seconds/iter=0.440
23 vms:13.946Gb seconds/iter=0.201
24 vms:13.954Gb seconds/iter=0.141
25 vms:13.955Gb seconds/iter=0.017
26 vms:13.956Gb seconds/iter=0.010
27 vms:13.956Gb seconds/iter=0.008
28 vms:13.956Gb seconds/iter=0.007
29 vms:13.956Gb seconds/iter=0.007
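For reference (not part of the original report), a minimal eager-mode baseline under the same setup; if the slow warm-up is specific to the scripted module, per-iteration times for the plain (non-scripted) model should be roughly flat from the first call:

import time

import torch
from torchvision import models

# Eager (non-scripted) Inception v3 on CPU, timed the same way as above.
net = models.inception_v3(pretrained=True, aux_logits=True).eval()
x = torch.randn(1, 3, 299, 299)

with torch.no_grad():
    for i in range(10):
        t0 = time.time()
        net(x)
        print(f"{i}: seconds/iter={time.time() - t0:.3f}")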
Expected behavior
All forward calls are expected to take roughly the same amount of time.
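One diagnostic worth trying (a sketch, not something verified in this report) is to disable the JIT profiling executor before scripting and re-run the same timing loop. The flags below are internal, non-public APIs and are assumed to be available in this nightly:

import time

import torch
from torchvision import models

# Internal flags; assumed present in this build (not a stable API).
torch._C._jit_set_profiling_executor(False)
torch._C._jit_set_profiling_mode(False)

net = torch.jit.script(models.inception_v3(pretrained=True, aux_logits=True)).eval()
x = torch.randn(1, 3, 299, 299)

with torch.no_grad():
    for i in range(10):
        t0 = time.time()
        net(x)
        print(f"{i}: seconds/iter={time.time() - t0:.3f}")

If the early iterations flatten out with these flags set, that would point at the profiling/optimizing executor respecializing the graph over the first several calls.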
Environment
PyTorch version: 1.4.0.dev20191205
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 19.10
GCC version: (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
CMake version: version 3.13.4
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080 SUPER
Nvidia driver version: 440.36
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.16.2
[conda] mkl 2019.4 243
[conda] pytorch 1.4.0.dev20191205 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch-nightly
[conda] torchvision 0.5.0.dev20191205 py37_cu101 pytorch-nightly
cc @suo