Skip to content

Conversation

@navahgar
Copy link
Contributor

Summary:
Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.

Also, added a SR microbenchmark for this kernel which shows the performance improvement.

Without fusion:

--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615

With NNC fusion:

--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186

This clearly shows >15% improvement in performance of this kernel with NNC fusion.

On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
without fusion: 0.9% (computed by adding the % spent on all the 4 ops involved)
with NNC fusion: 0.55%

Test Plan:
buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p

Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)

get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0

Reviewed By: hlu1

Differential Revision: D30609492

@facebook-github-bot facebook-github-bot added oncall: jit Add this issue/PR to JIT oncall triage queue cla signed labels Sep 21, 2021
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Sep 21, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit b137a03 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D30609492

@codecov
Copy link

codecov bot commented Sep 21, 2021

Codecov Report

Merging #65387 (b137a03) into master (64d3c73) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #65387      +/-   ##
==========================================
- Coverage   66.38%   66.37%   -0.01%     
==========================================
  Files         739      739              
  Lines       94295    94295              
==========================================
- Hits        62594    62593       -1     
- Misses      31701    31702       +1     

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D30609492

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D30609492

…ytorch#65387)

Summary:
Pull Request resolved: pytorch#65387

Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.

Also, added a SR microbenchmark for this kernel which shows the performance improvement.

Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615
```

With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186
```

This clearly shows >15% improvement in performance of this kernel with NNC fusion.

On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
  without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
  with NNC fusion: `0.55%`

Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)

```
get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0
```

Reviewed By: hlu1

Differential Revision: D30609492

fbshipit-source-id: 61e09889390e5d7fd8c1b7c615ea0e09640549c1
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D30609492

@navahgar navahgar force-pushed the export-D30609492 branch 2 times, most recently from c180852 to b137a03 Compare September 22, 2021 06:15
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D30609492

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 31584d0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed fb-exported Merged oncall: jit Add this issue/PR to JIT oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants