
Intra- and inter-operator parallelism in PyTorch (work plan) #19002

@ilia-cher

Efficient use of parallelism is important for achieving high performance in CPU tasks. Currently there are three main use cases/sources of parallelism in PT:

  1. Operator implementations (intra-op parallelism) - using multiple parallel tasks to execute an operator/function;
  2. TorchScript JIT interpreter - explicit fork/wait calls used in JIT programs;
  3. Autograd engine - uses multiple CPU threads pinned to devices (one for CPU and one per GPU device) to run the backward pass.

That's in addition to other mechanisms, such as multiprocessing used in eager-mode training, and low-level techniques (e.g. vectorization) not discussed here. This issue focuses mainly on server CPU inference and, to a lesser extent, on autograd engine parallelism.

As of now, the implementations of intra- and inter-op parallelism are based on:

  • For intra-op (within an op) parallelism we typically use OpenMP - either explicitly (through pragmas) or via MKL/MKL-DNN, which themselves use OpenMP (a typical explicit pattern is sketched below);
  • For inter-op parallelism we either explicitly use a global (per-process) thread pool (in the JIT interpreter) or a set of threads pinned to devices (in the autograd engine).
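
For context, the "explicit pragmas" case looks roughly like this inside an operator kernel (a minimal sketch, not actual PT source; the function name and loop body are illustrative):

```cpp
#include <cstdint>

// Illustrative intra-op pattern: an elementwise loop split across
// OpenMP threads via a pragma (compile with -fopenmp).
void add_scalar(float* out, const float* in, float alpha, int64_t n) {
  #pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    out[i] = in[i] + alpha;
  }
}
```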

The main goal of the proposed parallelism work is to unify the usage of parallelism in PT behind a simple interface and to abstract away the specific parallelization libraries. This will allow us to switch between and experiment with different parallel implementations (a usage sketch follows the list), including:

  1. OpenMP-based (for operator implementations only);
  2. TBB-based;
  3. Native thread pool based.
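
As a usage sketch, operator code written against the unified interface (ATen's Parallel.h) stays the same regardless of which of these backends executes the tasks; the grain size and loop body below are illustrative:

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// The same elementwise loop as above, ported to at::parallel_for:
// the active backend (OpenMP, TBB, or a native thread pool) decides
// how the [0, n) range is split into tasks of at least grain_size.
void add_scalar(float* out, const float* in, float alpha, int64_t n) {
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      out[i] = in[i] + alpha;
    }
  });
}
```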

More specifically, the proposed work has the following steps:
Intra-op parallelism:

  • Update torch.get_num_threads/set_num_threads to use the Parallel interface (see the sketch after this list)
  • Port existing intra-op use cases to Parallel.h
    • TH
    • THNN
    • ATen/native
  • Split Parallel.h into interface and impl parts (_openmp.cpp)
  • CMake scaffolding to select the Parallel.h backend
  • Add native backend for Parallel.h
    • Using a separate native thread pool
  • Add TBB as a submodule in PT
  • Add a CMake TBB option that enables TBB, disables OpenMP, and links with MKL-TBB and MKLDNN-TBB
    • Users reported perf regressions when linking with TBB (Serious perf drop on CPU #7903). In the TBB case we might need to turn off OpenMP usage and link with the TBB versions of MKL and MKL-DNN
  • Add TBB backend for Parallel.h
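
For the first item above, the idea is that the Python-level thread controls route through the same Parallel.h calls regardless of backend; in C++ the corresponding entry points are at::set_num_threads/at::get_num_threads (a minimal sketch, assuming a build linked against ATen):

```cpp
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  // Configure intra-op parallelism through the unified interface;
  // the call maps onto the active backend (OpenMP, TBB, or native
  // pool) rather than, say, omp_set_num_threads() directly.
  at::set_num_threads(4);
  std::cout << "intra-op threads: " << at::get_num_threads() << "\n";
  return 0;
}
```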

Inter-op parallelism:

  • Move the Future and task launching interface into Parallel.h (see the sketch after this list)
  • Scaffolding to set the number of inter-op threads and to switch between single vs. separate inter-/intra-op thread pool usage
  • Use Parallel.h from JIT interpreter to launch tasks
  • Update Autograd Engine to use Parallel.h
    • Needs some discussion about pinning of threads to specific GPUs
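
As a sketch of the task-launching half of this list, at::launch submits a closure to the inter-op thread pool (the same mechanism the JIT interpreter would use to run fork'd work); waiting is done here with a plain std::future, standing in for the Future type mentioned above:

```cpp
#include <ATen/Parallel.h>
#include <future>
#include <iostream>

int main() {
  // Size the inter-op pool before it is first used.
  at::set_num_interop_threads(2);

  std::promise<int> result;
  auto fut = result.get_future();

  // Submit a task to the inter-op thread pool.
  at::launch([&result]() { result.set_value(6 * 7); });

  std::cout << "task result: " << fut.get() << "\n";  // block until done
  return 0;
}
```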

Labels

feature, high priority, module: cpu, module: internals, module: multithreading, triaged
