
Conversation

@xuhdev
Collaborator

@xuhdev commented Jul 11, 2019

Stack from ghstack:

Differential Revision: D16257781

@pytorchbot added the module: cuda and module: operators labels Jul 11, 2019

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@pytorchbot added the module: cpu label Jul 11, 2019
@xuhdev changed the title from "Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.cu" to "Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}" Jul 11, 2019

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@zou3519
Contributor

zou3519 commented Jul 11, 2019

Can we get some before/after numbers?

xuhdev added 2 commits July 11, 2019 12:45

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@xuhdev
Collaborator Author

xuhdev commented Jul 11, 2019

@zou3519 Can we trigger a benchmark run, since a lot of functions are using fill? (like this one)


Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@zou3519
Contributor

zou3519 commented Jul 11, 2019

I don't know how to trigger a benchmark run. @bddppq do you know?

Although a benchmark run would show us how much impact this has at a macro scale, it's nice (and faster) to just look at numbers for the op itself to make sure we haven't changed its behavior too much.

@bddppq
Contributor

bddppq commented Jul 11, 2019

@xuhdev @zou3519 lol that benchmark was just for ROCm Caffe2 training. For this change, it's better to run the benchmarks here. cc @mingzhe09088

@xuhdev
Collaborator Author

xuhdev commented Jul 11, 2019

OK. Hopefully I didn't change performance in any significant way. I turned off CPU turbo and warmed up my GPU. Here's the benchmark:

Before:

a.fill_(10) (a.numel() == 10) for 100000 times
device: cpu
  dtype:torch.int8		1.4398678180004936
  dtype:torch.uint8		1.4053343380001024
  dtype:torch.int16		1.4766908799992962
  dtype:torch.int32		1.451924565999434
  dtype:torch.int64		1.4752186900004745
  dtype:torch.half		1.4896835120016476
  dtype:torch.float		1.5013499300002877
  dtype:torch.double		1.5264382410005055
device: cuda
  dtype:torch.int8		2.681039557999611
  dtype:torch.uint8		2.680637668001509
  dtype:torch.int16		2.7244960089992674
  dtype:torch.int32		2.748717996000778
  dtype:torch.int64		2.718285664999712
  dtype:torch.half		2.770288576000894
  dtype:torch.float		2.759359248000692
  dtype:torch.double		2.768819724000423
a.fill_(10) (a.numel() == 1000) for 10000 times
device: cpu
  dtype:torch.int8		0.19708076200004143
  dtype:torch.uint8		0.19885199899908912
  dtype:torch.int16		0.1959762740007136
  dtype:torch.int32		0.21790278300068167
  dtype:torch.int64		0.2854765039992344
  dtype:torch.half		0.21166744800029846
  dtype:torch.float		0.21991784899910272
  dtype:torch.double		0.29566019300000335
device: cuda
  dtype:torch.int8		0.27165307299947017
  dtype:torch.uint8		0.27046840399998473
  dtype:torch.int16		0.2719461199994839
  dtype:torch.int32		0.27715157000056934
  dtype:torch.int64		0.27338939899891557
  dtype:torch.half		0.2763788480006042
  dtype:torch.float		0.2772146719999
  dtype:torch.double		0.2752408190008282

After:

a.fill_(10) (a.numel() == 10) for 100000 times
device: cpu
  dtype:torch.int8		1.2886789590011176
  dtype:torch.uint8		1.315619697999864
  dtype:torch.int16		1.3257245380009408
  dtype:torch.int32		1.3391901470004086
  dtype:torch.int64		1.3278755680003087
  dtype:torch.half		1.377823962999173
  dtype:torch.float		1.3750181650011655
  dtype:torch.double		1.4091837569994823
device: cuda
  dtype:torch.int8		2.5793657489994075
  dtype:torch.uint8		2.5789307300001383
  dtype:torch.int16		2.6026060299991514
  dtype:torch.int32		2.652911500999835
  dtype:torch.int64		2.63532043899977
  dtype:torch.half		2.6785809670000162
  dtype:torch.float		2.6598699250007485
  dtype:torch.double		2.6494504060010513
a.fill_(10) (a.numel() == 1000) for 10000 times
device: cpu
  dtype:torch.int8		0.18751540899938846
  dtype:torch.uint8		0.1910664649985847
  dtype:torch.int16		0.18229868299931695
  dtype:torch.int32		0.20953493699926184
  dtype:torch.int64		0.27226362799956405
  dtype:torch.half		0.2080454450006073
  dtype:torch.float		0.2122907649991248
  dtype:torch.double		0.26846734500031744
device: cuda
  dtype:torch.int8		0.26064572900031635
  dtype:torch.uint8		0.2591479809998418
  dtype:torch.int16		0.25905369599968253
  dtype:torch.int32		0.2640146449994063
  dtype:torch.int64		0.26635702299972763
  dtype:torch.half		0.27022074299929955
  dtype:torch.float		0.26834364000023925
  dtype:torch.double		0.266556746999413
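
If anything, the new code looks slightly faster: for example, the CPU int8 case at numel() == 10 drops from about 1.44 s to 1.29 s over 100000 calls, roughly a 10% improvement.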


Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@xuhdev requested a review from zou3519 July 11, 2019 21:32
@zou3519
Contributor

zou3519 commented Jul 11, 2019

a.fill_(10) for 100000 times
a.fill_(1000) for 10000 times

How large is a? How can we reproduce this benchmark?

@xuhdev
Collaborator Author

xuhdev commented Jul 11, 2019

Oops, I meant a.numel() == 10 and a.numel() == 1000, not a.fill_(10) or a.fill_(1000).

Here's the script:

import timeit

# Time a.fill_(10) on tensors of 10 and 1000 elements, for each device and
# dtype; timeit reports the total time in seconds over all iterations.
for n, t in [(10, 100000),
             (1000, 10000)]:
    print('a.fill_(10) (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        print('device: ' + device)
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64', 'torch.half', 'torch.float', 'torch.double'):
            print('  dtype:' + dtype, end='\t\t')
            # The tensor is created in setup, so allocation is not timed.
            print(timeit.timeit('a.fill_(10)', setup='import torch; a = torch.zeros({}, device="{}", dtype={})'.format(n, device, dtype), number=t))
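
As an aside, CUDA kernels launch asynchronously, so the loop above largely measures enqueue overhead rather than kernel time. A minimal sketch of a synchronized variant, if you also want kernel completion included in the measurement (an assumption about methodology, not something run above):

import timeit

# Hypothetical variant: appending torch.cuda.synchronize() makes the timed
# statement wait for the kernel to finish instead of only enqueueing it.
print(timeit.timeit(
    'a.fill_(10); torch.cuda.synchronize()',
    setup='import torch; a = torch.zeros(1000, device="cuda", dtype=torch.float)',
    number=10000))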

@mingzhe09088
Contributor

There is an operator benchmark suite that reports the execution time of operators. Many operators are already covered, and the code is in the benchmarks/operator_benchmark/pt directory. Let me know if you are interested in using it for this case; I can add a test to that directory.
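
For illustration, a rough sketch of what a fill_ entry in that suite might look like; the exact op_bench names used here (cross_product_configs, TorchBenchmarkBase, generate_pt_test) follow the suite's general conventions and should be treated as an assumption, not as code from this PR:

import operator_benchmark as op_bench
import torch

# Sweep the same sizes the ad-hoc script above used, on both devices.
fill_configs = op_bench.cross_product_configs(
    N=[10, 1000],
    device=['cpu', 'cuda'],
    tags=['short'],
)

class FillBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, N, device):
        # Allocation happens once here, outside the timed region.
        self.input = torch.zeros(N, device=device)

    def forward(self):
        return self.input.fill_(10)

op_bench.generate_pt_test(fill_configs, FillBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()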

@xuhdev
Collaborator Author

xuhdev commented Jul 12, 2019

@mingzhe09088 That would be helpful. Thanks!

@colesbury
Member

Does this PR change anything other than the location of the functions?

@xuhdev
Collaborator Author

xuhdev commented Jul 12, 2019

Does this PR change anything other than the location of the functions?

No, but I think the benchmark run was a safeguard measure to ensure nothing was changed accidentally.

@zou3519
Contributor

zou3519 commented Jul 12, 2019

Does this PR change anything other than the location of the functions?

No, but I think the benchmark run was a safeguard measure to ensure nothing was changed accidentally.

Sorry, I didn't realize that nothing else was changed. Thanks for catching that @colesbury.

In general, a dedicated benchmark for fill_ would be good, but we don't have to block this PR on that.


using unary_fn = void(*)(TensorIterator&);

DECLARE_DISPATCH(void(*)(TensorIterator&, Scalar), fill_stub);
Contributor

Was this just unused?

Member

No, the definition got moved to Fill.h

@facebook-github-bot
Contributor

@ezyang merged this pull request in fc297b8.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 15, 2019
Summary: Pull Request resolved: pytorch/pytorch#22758

Test Plan: Imported from OSS

Differential Revision: D16257781

Pulled By: ezyang

fbshipit-source-id: 9e5ed06e95ef65b036eb388488faad981f1e8012
@xuhdev deleted the gh/xuhdev/10/head branch July 15, 2019 05:53