Make sure the number of MKL and OpenMP threads match #1740
Conversation
Otherwise, on many machines, the size of the OpenMP thread pool will change between MKL and our OpenMP-enabled functions. The constant thread creation and destruction results in worse performance and leaks memory on GCC 5.4.
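As a rough user-side sketch of the same idea (this is not the TH-level change made in this pull request), the two pools can be kept the same size by pinning the standard OMP_NUM_THREADS and MKL_NUM_THREADS knobs to one value before either runtime creates its thread pool; the value below is only an illustrative choice:

```python
# Sketch only: pin both runtimes to the same thread count via the standard
# environment variables. The value is illustrative, not taken from the PR;
# a common choice is the number of physical cores.
import os

n_threads = "24"                            # hypothetical value
os.environ["OMP_NUM_THREADS"] = n_threads   # pool size for OpenMP-parallel TH kernels
os.environ["MKL_NUM_THREADS"] = n_threads   # pool size for MKL-backed ops such as torch.mm

import torch  # import after setting the variables so both pools start at the same size
```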
Awesome! But we're still linking both […]

Also, shouldn't we limit the number of threads to be lower than the number of cores? I've seen performance improve when I lowered […]

we are and should only be linking to libgomp

We are and should, but as far as I remember gcc is linking gomp too. Can you check the output of […]

this is now merged into master
Summary: This is needed for pytorch#1740. Verified that `./build.sh py2-android-ubuntu16.04` builds an Android base image with CMake 3.6.3.
Closes facebookarchive/caffe2#1747
Differential Revision: D6729823
Pulled By: pietern
fbshipit-source-id: f7c888b4fba14ff6ea703cc269175b327b49f6b8
* Extend the grouped grid reduction kernel

  The kernel itself should work with an arbitrary number of inputs, but the underlying data structure, Tuple, still explicitly needs to be specialized for the number of values, which is currently limited to 8.
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:
- TransformPropagator refactor: switched to Dijkstra instead of exhaustive enumeration of all possible paths to reduce compilation time of transform propagation;
- Indexing refactor: remove reference tensor creation in all tensor indexing logic (#1690);
- (more) generic grouped grid reduction kernel;
- Minor parser/fuser patches:
  1. zero-dim tensor reduction support
  3. no-op binary removal within fused graph
  4. expand supported in fusion

Squashed commits to WAR the GitHub API. Commits actually in this PR from the devel branch:

```
a054b3e Refactor TransormPropagator to allow specifying a position and propagating to part of the DAG (#1775)
d67e1cd Indexing refactor stage 1: remove reference tensor creation in all tensor indexing logic (#1690)
1b65299 Issue 1770 (#1774)
35b0427 Avoid compilation errors like below: (#1773)
452c773 Ignore reductions of zero-dim tensors per PyTorch conventions (#1771)
31d6c56 TransformPropagator refactor (#1769)
570c5a8 Merge pull request #1767 from csarofeen/upstream_merge_0621
9d6c3d8 merging upstream 61305cd
0ed815f New TransformPropagator algorithm (#1763)
6c19520 no-op binary removal (#1764)
ec7fa41 Proper propagation of IterType (#1762)
b263562 Fix dimensionality check (#1759)
2d6343f More generic grouped grid reduction kernel (#1740)
64e2b56 [nvfuser] prevent spamming warning message (#77777) (#1758)
0c43162 [nvFuser] Improving bitwise ops support (#77158) (#1757)
b93a147 Parser expand (#1754)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: #80355
Approved by: https://github.com/davidberard98
Here's an example program which triggers the thread-pool resizing described in the original report above:
https://gist.github.com/colesbury/9ac92dfe5346bb71dee885af9c8cdd5d
`torch.add` / `THTensor_(cadd)` with a destination tensor is an OpenMP-enabled TH function. `torch.mm` / `THTensor_(addmm)` uses MKL. On my machine, MKL defaults to 24 threads while OpenMP defaults to 48 threads. Without this fix, the example gets about 200 iterations/sec. With the fix, it gets about 3000 iterations/sec.
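The gist isn't reproduced here, but a benchmark in the same spirit alternates the OpenMP-backed add (into a preallocated destination) with the MKL-backed matrix multiply and reports iterations per second; the tensor sizes and the timing window below are arbitrary assumptions rather than values taken from the gist:

```python
# Sketch in the spirit of the linked gist (the actual gist may differ). Alternates an
# OpenMP-parallel TH kernel (torch.add into a preallocated destination) with an
# MKL-backed matmul (torch.mm) and reports iterations per second.
import time
import torch

a = torch.randn(1000, 1000)   # sizes are arbitrary
b = torch.randn(1000, 1000)
out = torch.empty(1000, 1000)

iters = 0
start = time.time()
while time.time() - start < 5.0:   # measure for roughly five seconds
    torch.add(a, b, out=out)       # element-wise add, parallelized with OpenMP
    torch.mm(a, b)                 # matrix multiply, dispatched to MKL on CPU
    iters += 1

elapsed = time.time() - start
print("%.0f iterations/sec" % (iters / elapsed))
```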