Efficient usage of parallelism is important for achieving high performance in CPU tasks. Currently there are three main use cases/sources of parallelism in PT:
- Operator implementations (intra-op parallelism) - using multiple parallel tasks to execute an operator/function;
- TorchScript JIT interpreter - explicit fork/wait calls used in JIT programs;
- Autograd engine - uses multiple CPU threads pinned to devices (one for CPU and one per GPU device) to run the backward pass.
That's in addition to other mechanisms, such as multiprocessing used in eager-mode training and low-level techniques (e.g. vectorization), which are not discussed here. This issue focuses mainly on server CPU inference and, to a lesser extent, on autograd engine parallelism.
As of now, the implementations of intra- and inter-op parallelism are based on:
- For intra-op parallelism (within an op) we typically use OpenMP - either explicitly (through pragmas) or indirectly through MKL/MKL-DNN, which use OpenMP internally; a minimal sketch of the pragma style follows this list.
- For inter-op parallelism we either explicitly use a global (per-process) thread pool (in the JIT interpreter) or a set of threads pinned to devices (in the autograd engine).
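For illustration, here is a minimal sketch of the explicit pragma style; the `add_scalar` kernel is hypothetical, not an actual PyTorch operator:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical element-wise kernel, written in the explicit OpenMP style
// common in existing operator implementations. Real kernels also apply a
// minimum-work threshold before going parallel; that is omitted here.
void add_scalar(std::vector<float>& data, float value) {
  const int64_t n = static_cast<int64_t>(data.size());
  #pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    data[i] += value;
  }
}
```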
The main goal of the proposed parallelism work is to unify the usage of parallelism in PT behind a simple interface and abstract it away from specific parallelization libraries. This will allow us to switch between, and experiment with, different parallel implementations, including:
- OpenMP-based (for operator implementations only);
- TBB-based;
- Native thread pool based.
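Regardless of which backend is selected, operator code would target a single entry point. The sketch below shows the same hypothetical `add_scalar` kernel rewritten against `at::parallel_for`, the intra-op primitive in ATen's Parallel.h; the grain size value here is purely illustrative:

```cpp
#include <cstdint>
#include <vector>

#include <ATen/Parallel.h>

void add_scalar(std::vector<float>& data, float value) {
  const int64_t n = static_cast<int64_t>(data.size());
  // The backend (OpenMP, TBB, or a native pool) splits [0, n) into chunks
  // of at least grain_size elements and runs the lambda on its worker
  // threads; ranges below the threshold run inline on the calling thread.
  constexpr int64_t grain_size = 2048;  // illustrative threshold
  at::parallel_for(0, n, grain_size, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] += value;
    }
  });
}
```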
More specifically, the proposed work has the following steps:
Intra-op parallelism:
- Update torch.get_num_threads/set_num_threads to use the Parallel.h interface
- Redirect management of the number of intra-op threads to the Parallel.h implementation
- For details on handling OMP/MKL threads, see "Intra-operator parallelism settings in PyTorch" (#19001)
- Port existing intra-op use cases to Parallel.h
- TH
- THNN
- ATen/native
- Split Parallel.h into interface and impl parts (_openmp.cpp)
- CMake scaffolding to select the Parallel.h backend
- Add native backend for Parallel.h
- Using a separate native thread pool; a simplified sketch of such a backend follows this list
- Add TBB as a submodule of PT
- Add a CMake TBB option that enables TBB, disables OpenMP, and links with MKL-TBB and MKLDNN-TBB
- Users reported perf regressions when linking with TBB ("Serious perf drop on CPU", #7903); in the TBB case we might need to turn off OpenMP usage entirely and link with the TBB versions of MKL and MKL-DNN
- Add TBB backend for Parallel.h
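To make the interface/implementation split concrete, here is a heavily simplified, hypothetical sketch of what a native parallel_for backend has to provide: partition the range and run the chunks on worker threads. A real backend would reuse a persistent thread pool, respect the configured thread count, and propagate exceptions; none of that is shown here.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical "native" backend for the intra-op loop. Spawning threads
// per call is only for illustration; a real implementation would submit
// the chunks to a persistent pool.
void parallel_for_native(
    int64_t begin, int64_t end, int64_t grain_size,
    const std::function<void(int64_t, int64_t)>& f) {
  const int64_t n = end - begin;
  if (n <= 0) return;
  const int64_t gs = std::max<int64_t>(grain_size, 1);
  const int64_t max_threads =
      std::max<int64_t>(1, std::thread::hardware_concurrency());
  const int64_t num_tasks =
      std::max<int64_t>(1, std::min(max_threads, (n + gs - 1) / gs));
  if (num_tasks == 1) {
    f(begin, end);  // range too small to be worth parallelizing
    return;
  }
  const int64_t chunk = (n + num_tasks - 1) / num_tasks;
  std::vector<std::thread> workers;
  for (int64_t t = 0; t < num_tasks; ++t) {
    const int64_t b = begin + t * chunk;
    const int64_t e = std::min(end, b + chunk);
    if (b >= e) break;
    workers.emplace_back([&f, b, e] { f(b, e); });
  }
  for (auto& w : workers) w.join();
}
```

A TBB backend would map the same signature onto tbb::parallel_for instead, which is what makes CMake-level backend selection possible without touching operator code.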
Inter-op parallelism:
- Move the Future and task-launching interface into Parallel.h
- Scaffolding to set the number of inter-op threads and to switch between a single shared pool vs separate inter-/intra-op thread pools
- Use Parallel.h from the JIT interpreter to launch tasks (see the sketch after this list)
- Update Autograd Engine to use Parallel.h
- Needs some discussion about pinning threads to specific GPUs
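As a rough sketch of the inter-op side, the JIT's fork/wait would reduce to launching a closure on the shared pool and completing a future. The example below assumes `at::launch` and `at::set_num_interop_threads` as the Parallel.h entry points and uses a plain std::promise/std::future pair as a stand-in for the JIT's Future type:

```cpp
#include <future>
#include <iostream>
#include <memory>

#include <ATen/Parallel.h>

int main() {
  // Size the shared inter-op pool; this must happen before the pool is
  // first used. (Assumed entry point; see the scaffolding step above.)
  at::set_num_interop_threads(4);

  // "fork": run a task asynchronously on the inter-op pool. std::promise
  // stands in for the JIT's Future; shared_ptr keeps the closure copyable.
  auto p = std::make_shared<std::promise<int>>();
  std::future<int> result = p->get_future();
  at::launch([p] { p->set_value(6 * 7); });

  // "wait": block until the forked task has produced its value.
  std::cout << result.get() << std::endl;  // prints 42
  return 0;
}
```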