
Conversation

@xuhdev
Collaborator

@xuhdev commented Jul 11, 2019

Stack from ghstack:

Differential Revision: D16257781

@pytorchbot added the module: cuda and module: operators labels Jul 11, 2019

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@pytorchbot added the module: cpu label Jul 11, 2019
@xuhdev changed the title from "Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.cu" to "Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}" Jul 11, 2019

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@zou3519
Contributor

zou3519 commented Jul 11, 2019

Can we get some before/after numbers?

xuhdev added 2 commits July 11, 2019 12:45

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head

Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@xuhdev
Collaborator Author

xuhdev commented Jul 11, 2019

@zou3519 Can we trigger a benchmark run, since a lot of functions are using fill? (like this one)


Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@zou3519
Contributor

zou3519 commented Jul 11, 2019

I don't know how to trigger a benchmark run. @bddppq do you know?

Although a benchmark run would show us how much impact this has at a macro scale, it's nice (and faster) to just look at numbers for the op itself to make sure we haven't changed its behavior too much.

@bddppq
Contributor

bddppq commented Jul 11, 2019

@xuhdev @zou3519 lol that benchmark was just for ROCm Caffe2 training. For this change, it's better to run the benchmarks here. cc @mingzhe09088

@xuhdev
Collaborator Author

xuhdev commented Jul 11, 2019

OK. Hopefully I didn't change performance in any significant way. I turned off CPU turbo and warmed up my GPU. Here's the benchmark:

Before:

a.fill_(10) (a.numel() == 10) for 100000 times
device: cpu
  dtype:torch.int8		1.4398678180004936
  dtype:torch.uint8		1.4053343380001024
  dtype:torch.int16		1.4766908799992962
  dtype:torch.int32		1.451924565999434
  dtype:torch.int64		1.4752186900004745
  dtype:torch.half		1.4896835120016476
  dtype:torch.float		1.5013499300002877
  dtype:torch.double		1.5264382410005055
device: cuda
  dtype:torch.int8		2.681039557999611
  dtype:torch.uint8		2.680637668001509
  dtype:torch.int16		2.7244960089992674
  dtype:torch.int32		2.748717996000778
  dtype:torch.int64		2.718285664999712
  dtype:torch.half		2.770288576000894
  dtype:torch.float		2.759359248000692
  dtype:torch.double		2.768819724000423
a.fill_(10) (a.numel() == 1000) for 10000 times
device: cpu
  dtype:torch.int8		0.19708076200004143
  dtype:torch.uint8		0.19885199899908912
  dtype:torch.int16		0.1959762740007136
  dtype:torch.int32		0.21790278300068167
  dtype:torch.int64		0.2854765039992344
  dtype:torch.half		0.21166744800029846
  dtype:torch.float		0.21991784899910272
  dtype:torch.double		0.29566019300000335
device: cuda
  dtype:torch.int8		0.27165307299947017
  dtype:torch.uint8		0.27046840399998473
  dtype:torch.int16		0.2719461199994839
  dtype:torch.int32		0.27715157000056934
  dtype:torch.int64		0.27338939899891557
  dtype:torch.half		0.2763788480006042
  dtype:torch.float		0.2772146719999
  dtype:torch.double		0.2752408190008282

After:

a.fill_(10) (a.numel() == 10) for 100000 times
device: cpu
  dtype:torch.int8		1.2886789590011176
  dtype:torch.uint8		1.315619697999864
  dtype:torch.int16		1.3257245380009408
  dtype:torch.int32		1.3391901470004086
  dtype:torch.int64		1.3278755680003087
  dtype:torch.half		1.377823962999173
  dtype:torch.float		1.3750181650011655
  dtype:torch.double		1.4091837569994823
device: cuda
  dtype:torch.int8		2.5793657489994075
  dtype:torch.uint8		2.5789307300001383
  dtype:torch.int16		2.6026060299991514
  dtype:torch.int32		2.652911500999835
  dtype:torch.int64		2.63532043899977
  dtype:torch.half		2.6785809670000162
  dtype:torch.float		2.6598699250007485
  dtype:torch.double		2.6494504060010513
a.fill_(10) (a.numel() == 1000) for 10000 times
device: cpu
  dtype:torch.int8		0.18751540899938846
  dtype:torch.uint8		0.1910664649985847
  dtype:torch.int16		0.18229868299931695
  dtype:torch.int32		0.20953493699926184
  dtype:torch.int64		0.27226362799956405
  dtype:torch.half		0.2080454450006073
  dtype:torch.float		0.2122907649991248
  dtype:torch.double		0.26846734500031744
device: cuda
  dtype:torch.int8		0.26064572900031635
  dtype:torch.uint8		0.2591479809998418
  dtype:torch.int16		0.25905369599968253
  dtype:torch.int32		0.2640146449994063
  dtype:torch.int64		0.26635702299972763
  dtype:torch.half		0.27022074299929955
  dtype:torch.float		0.26834364000023925
  dtype:torch.double		0.266556746999413
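
If anything, the new code looks slightly faster: for example, the CPU int8 case at numel() == 10 drops from about 1.44 s to 1.29 s over 100000 calls, roughly a 10% improvement.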


Move fill and fill_diagonal to Fill.cpp, Fill.h, and FillKernel.{cpp,cu}

gh-metadata: pytorch pytorch 22758 gh/xuhdev/10/head
@xuhdev requested a review from zou3519 July 11, 2019 21:32
@zou3519
Contributor

zou3519 commented Jul 11, 2019

a.fill_(10) for 100000 times
a.fill_(1000) for 10000 times

How large is a? How can we reproduce this benchmark?

@xuhdev
Collaborator Author

xuhdev commented Jul 11, 2019

Oops, I meant a.numel() == 10 and a.numel() == 1000, not a.fill_(10) or a.fill_(1000).

Here's the script:

import timeit

# Time a.fill_(10) on tensors of 10 and 1000 elements, for each device and
# dtype; timeit reports the total time in seconds over all iterations.
for n, t in [(10, 100000),
             (1000, 10000)]:
    print('a.fill_(10) (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        print('device: ' + device)
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64', 'torch.half', 'torch.float', 'torch.double'):
            print('  dtype:' + dtype, end='\t\t')
            # The tensor is created in setup, so allocation is not timed.
            print(timeit.timeit('a.fill_(10)', setup='import torch; a = torch.zeros({}, device="{}", dtype={})'.format(n, device, dtype), number=t))
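
As an aside, CUDA kernels launch asynchronously, so the loop above largely measures enqueue overhead rather than kernel time. A minimal sketch of a synchronized variant, if you also want kernel completion included in the measurement (an assumption about methodology, not something run above):

import timeit

# Hypothetical variant: appending torch.cuda.synchronize() makes the timed
# statement wait for the kernel to finish instead of only enqueueing it.
print(timeit.timeit(
    'a.fill_(10); torch.cuda.synchronize()',
    setup='import torch; a = torch.zeros(1000, device="cuda", dtype=torch.float)',
    number=10000))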

@mingzhe09088
Contributor

There is an operator benchmark suite that reports the execution time of operators. Many operators are already covered, and the code is in the benchmarks/operator_benchmark/pt directory. Let me know if you are interested in using it for this case; I can add a test to that directory.
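
For illustration, a rough sketch of what a fill_ entry in that suite might look like; the exact op_bench names used here (cross_product_configs, TorchBenchmarkBase, generate_pt_test) follow the suite's general conventions and should be treated as an assumption, not as code from this PR:

import operator_benchmark as op_bench
import torch

# Sweep the same sizes the ad-hoc script above used, on both devices.
fill_configs = op_bench.cross_product_configs(
    N=[10, 1000],
    device=['cpu', 'cuda'],
    tags=['short'],
)

class FillBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, N, device):
        # Allocation happens once here, outside the timed region.
        self.input = torch.zeros(N, device=device)

    def forward(self):
        return self.input.fill_(10)

op_bench.generate_pt_test(fill_configs, FillBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()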

@xuhdev
Collaborator Author

xuhdev commented Jul 12, 2019

@mingzhe09088 That would be helpful. Thanks!

@colesbury
Member

Does this PR change anything other than the location of the functions?

@xuhdev
Collaborator Author

xuhdev commented Jul 12, 2019

Does this PR change anything other than the location of the functions?

No, but I think the benchmark run was a safeguard measure to ensure nothing was changed accidentally.

@zou3519
Contributor

zou3519 commented Jul 12, 2019

Does this PR change anything other than the location of the functions?

No, but I think the benchmark run was a safeguard measure to ensure nothing was changed accidentally.

Sorry, I didn't realize that nothing else was changed. Thanks for catching that @colesbury.

In general, a dedicated benchmark for fill_ would be good, but we don't have to block this PR on that.


using unary_fn = void(*)(TensorIterator&);

DECLARE_DISPATCH(void(*)(TensorIterator&, Scalar), fill_stub);
Contributor

Was this just unused?

Member

No, the definition got moved to Fill.h

@facebook-github-bot
Contributor

@ezyang merged this pull request in fc297b8.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 15, 2019
Summary: Pull Request resolved: pytorch/pytorch#22758

Test Plan: Imported from OSS

Differential Revision: D16257781

Pulled By: ezyang

fbshipit-source-id: 9e5ed06e95ef65b036eb388488faad981f1e8012
@xuhdev deleted the gh/xuhdev/10/head branch July 15, 2019 05:53