
add BFloat16 support for LayerNorm CPU #55210

Closed
mingfeima wants to merge 15 commits into gh/mingfeima/17/base from gh/mingfeima/17/head

Conversation

@mingfeima
Collaborator

@mingfeima mingfeima commented Apr 2, 2021

Stack from ghstack:

Differential Revision: D28836793

@facebook-github-bot
Contributor

facebook-github-bot commented Apr 2, 2021

💊 CI failures summary and remediations

As of commit 5ef2baf (more details on the Dr. CI page and at hud.pytorch.org/pr/55210):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

1 failure not recognized by patterns:

Job:    GitHub Actions Windows CI (pytorch-win-vs2019-cuda10-cudnn7-py3) / test (2)
Step:   Install Visual Studio 2019 toolchain
Action: 🔁 rerun

ci.pytorch.org: 1 failed



This comment was automatically generated by Dr. CI.

mingfeima added a commit that referenced this pull request Apr 2, 2021
ghstack-source-id: f26d56d
Pull Request resolved: #55210
@mingfeima
Collaborator Author

Since this PR is not related to the parallelization feature, only single-core performance is tested.
NB: LayerNorm on CPU did not previously support BFloat16; the "before" numbers refer to a naive implementation enabled as follows:

-  AT_DISPATCH_FLOATING_TYPES(X.scalar_type(), "LayerNormKernelImpl", [&]() {
+  AT_DISPATCH_FLOATING_TYPES_AND(kBFloat16, X.scalar_type(), "LayerNormKernelImpl", [&]() {
  • performance update on an AVX-512 machine: Xeon(R) Gold 6248 CPU @ 2.50GHz
    before: LayerNorm 32x128x1024: fp32: 2.806 ms; bf16: 9.901 ms
    after:  LayerNorm 32x128x1024: fp32: 2.813 ms; bf16: 2.306 ms
  • performance update on an AVX2 machine: Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    before: LayerNorm 32x128x1024: fp32: 5.286 ms; bf16: 15.186 ms
    after:  LayerNorm 32x128x1024: fp32: 5.258 ms; bf16: 3.469 ms

mingfeima added a commit to mingfeima/pytorch that referenced this pull request Apr 28, 2021
ghstack-source-id: 88be540
Pull Request resolved: pytorch#55210
dgl-intel pushed a commit to dgl-intel/pytorch that referenced this pull request May 14, 2021
ghstack-source-id: 18d4100
Pull Request resolved: pytorch#55210
@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mingfeima
Collaborator Author

Rebased and cleared the test-case failures from test_ops.py.

@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor

Please rebase

@mingfeima
Collaborator Author

@VitalyFedyunin rebased, please check!

@VitalyFedyunin
Contributor

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 652d911.


4 participants