Description
Motivation
This RFC proposes a design for a series of generic device runtime APIs tailored to stream-based accelerators, to help users simplify the runtime code they write for different devices (including out-of-tree devices, for which a registration mechanism is provided).
Background
Currently, the runtime of stream-based accelerators can be separated into 6 components:
- Device
- Stream
- Event
- Generator
- Guard
- Allocator
PyTorch already provides some generic APIs for Stream, Event, Guard, and Allocator. For Stream and Event, following the generic Stream and Event design, we can write device-agnostic Python code:
```python
event = torch.Event(device_type)
stream = torch.Stream(device_type)
event.record(stream)
stream.synchronize()
```
A simple summary related to device-agnostic code is listed below.
- Stream (provided in C++ and Python)
- Event (provided in C++ and Python)
- Device (provided in C++, but lacks some functionality)
- Guard (provided in C++)
However, PyTorch lacks some generic device-agnostic APIs, mainly in two scenarios (see the sketch after this list for what users must do today):
- getting/setting the status of Device and Stream on the Python side, with functionality similar to torch.xpu.set_device(1) and torch.xpu.current_stream() but accepting the device type as a parameter rather than being XPU-specific;
- Device/Stream guards on the Python side that work for every type of device.
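For concreteness, without such APIs the same logic has to branch per backend today. A minimal, non-exhaustive illustration (torch.xpu is present only in builds with XPU support):

```python
import torch

def set_device_and_sync(device_type: str, index: int) -> None:
    # Without generic runtime APIs, every backend needs its own branch.
    if device_type == "cuda":
        torch.cuda.set_device(index)
        torch.cuda.synchronize()
    elif device_type == "xpu":
        torch.xpu.set_device(index)
        torch.xpu.synchronize()
    else:
        raise NotImplementedError(f"no generic runtime API for {device_type!r}")
```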
Usage
Since there are no generic device-agnostic APIs yet, how can device-agnostic code be written in other PyTorch components? We found two designs in PyTorch that cover this usage.
- For FSDP, it uses getattr and the _register_device_module registration mechanism to handle different devices, like

```python
backend = getattr(torch, device_type)
if backend.is_available():
    backend.set_device(1)
    backend.synchronize()
```

- For Inductor, we propose a device interface registration mechanism, like
```python
device_interface = get_interface_for_device(device_type)
if device_interface.is_available():
    device_interface.set_device(1)
    device_interface.synchronize()
```

These two methods are not yet unified in PyTorch; a sketch of how an out-of-tree backend can hook into both mechanisms is shown below.
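As a rough sketch only: the module paths below (torch._register_device_module and torch._dynamo.device_interface) are private APIs found in recent PyTorch releases and may move; the stub module and FooInterface are illustrative assumptions.

```python
import types
import torch
from torch._dynamo.device_interface import DeviceInterface, register_interface_for_device

# Out-of-tree backends typically ship under the "privateuse1" device type.
foo = types.ModuleType("foo")
foo.is_available = lambda: True          # toy stubs for illustration only
foo.set_device = lambda index: None
foo.synchronize = lambda: None

# 1) FSDP-style: make `getattr(torch, "privateuse1")` resolve to the module.
torch._register_device_module("privateuse1", foo)

# 2) Inductor-style: register a DeviceInterface for the same device type.
class FooInterface(DeviceInterface):
    @staticmethod
    def is_available() -> bool:
        return True

register_interface_for_device("privateuse1", FooInterface)
```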
Design
To simplify and unify the code, this RFC proposes a series of generic device runtime APIs tailored to stream-based accelerators across different devices (including out-of-tree devices).
As described above, PyTorch already provides generic Python code for Stream and Event (torch.Stream and torch.Event, respectively). However, no Python code is provided for Guard, and some APIs covering Device and Stream status are missing.
So, we propose a design to cover these missing device-agnostic runtime APIs, like the code below.
```python
import torch

device_type = tensor.device.type  # maybe cuda, xpu, mps, mtia, privateuse1, ...
assert torch.has_accelerator(device_type), "No available accelerator detected!"
stream = torch.current_stream(device_type)
torch.set_device(0, device_type)
d1 = torch.maybe_exchange_device(1, device_type)
s1 = torch.Stream(device_type)
with torch.DeviceGuard(2, device_type):
    d1 = torch.current_device(device_type)
with torch.StreamGuard(s1):
    s2 = torch.current_stream(device_type)
...
```

The inspiration for this design comes from the design of the generic Stream and Event: PyTorch promotes torch.xxx.Stream & torch.xxx.Event to torch.Stream & torch.Event and makes the latter device-agnostic. Following this design, we list below the device-agnostic Python runtime APIs that are most used, filtered from some popular repos:
| Device-specific runtime APIs torch.xxx.foo | Device-agnostic runtime APIs torch.foo |
| --- | --- |
| torch.xxx.set_device | torch.set_device |
| torch.xxx.current_device | torch.current_device |
| torch.xxx.device_count | torch.device_count |
| torch.xxx.is_available | torch.has_accelerator |
| torch.xxx.exchange_device | torch.exchange_device |
| torch.xxx.maybe_exchange_device | torch.maybe_exchange_device |
| torch.xxx.set_stream | torch.set_stream |
| torch.xxx.current_stream | torch.current_stream |
Our goal is for torch.foo to cover the most common runtime scenarios and usages, with if/else statements and torch.xxx.foo as an additional supplement in other situations.
NB: We will not provide a device-agnostic API for some backend-specific APIs, such as torch.cuda.default_stream, as other backends have no concept of a default stream. Due to the significant differences in device properties across backends, get_device_properties will also not be covered at this stage.
Simple version: for more convenience, the device-agnostic APIs could stop accepting the device type and instead resolve it via getAccelerator. The common code above can then be simplified to this:
```python
import torch

assert torch.has_accelerator(), "No available accelerator detected!"
stream = torch.current_stream()
torch.set_device(0)
d1 = torch.maybe_exchange_device(1)
s1 = torch.Stream()
with torch.DeviceGuard(2):
    d1 = torch.current_device()
with torch.StreamGuard(s1):
    s2 = torch.current_stream()
...
```

Obviously, this can greatly simplify the code and save users the effort of migrating their code to follow this design. The drawback is that it relies on the assumption that there is only one type of accelerator on the machine. This is an open issue, pending feedback from the PyTorch community; a sketch of how the resolution step could work is shown below.
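As a rough Python-side illustration of that resolution step (the helper name _get_accelerator_type and the probing strategy are hypothetical, standing in for the C++ getAccelerator):

```python
import torch

_CANDIDATE_ACCELERATORS = ("cuda", "xpu", "mtia")  # illustrative, not exhaustive

def _get_accelerator_type() -> str:
    # Hypothetical stand-in for getAccelerator: probe which accelerator
    # backend this build supports and has available.
    for name in _CANDIDATE_ACCELERATORS:
        mod = getattr(torch, name, None)
        if mod is not None and hasattr(mod, "is_available") and mod.is_available():
            return name
    raise RuntimeError("No available accelerator detected!")

def current_stream():
    # Forward to the device-specific API, e.g. torch.cuda.current_stream().
    return getattr(torch, _get_accelerator_type()).current_stream()
```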
I personally prefer that these device-agnostic APIs no longer accept the device type as input. The reasons are:
- Currently, it is enough for a PyTorch binary build to support only one accelerator type;
- It is easy to extend these device-agnostic APIs later to handle scenarios with multiple types of accelerators;
- Users can still use the device-specific APIs torch.xxx.foo instead of torch.foo to handle scenarios with multiple types of accelerators;
- The design does NOT break the previous design philosophies, since torch.foo is only used for accelerators (excluding the CPU), unlike torch.empty, which needs an explicit device parameter to tell the user where the empty tensor will be created.
Also, I list the pros and cons of the simple version here to help us make a decision:
Pros:
- torch.foo will take the same input arguments as torch.xxx.foo, bringing a better user experience;
- it is more concise, making it easier for developers to write device-agnostic code.

Cons:
- no obvious drawbacks.
Also, in some situations, users would like to check which type of accelerator they are using. To handle this scenario, we provide an extra API, torch.current_accelerator, which returns the type of the current accelerator as a string. It can help users handle backend-specific situations, such as the default stream:
```python
if torch.has_accelerator():
    if torch.current_accelerator() == "cuda":
        stream = torch.cuda.default_stream()
    else:
        stream = torch.Stream()
    ...
```

Additional context
We will implement this design on top of DeviceGuardImplInterface, which also provides a registration mechanism for out-of-tree devices.
These device-agnostic runtime APIs should accept the same input types (possibly torch.device, str, int, or None) as torch.xxx.foo. The two calls below should be equivalent when XPU is available (a sketch of the input normalization follows the list):
- torch.xpu.set_device(1)
- torch.set_device(1) # based on getAccelerator.
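A hypothetical normalization helper for these input types (the name _normalize_device_index and the defaulting behavior are illustrative assumptions, mirroring what torch.cuda.set_device already accepts):

```python
from typing import Optional, Union
import torch

def _normalize_device_index(device: Union[torch.device, str, int, None],
                            accelerator: str) -> Optional[int]:
    # Hypothetical helper: map the accepted input types onto a plain index
    # for the underlying device-specific runtime.
    if device is None:
        return None  # let the runtime keep the current device
    if isinstance(device, int):
        return device
    dev = torch.device(device)  # handles both "xpu:1" strings and torch.device
    if dev.type != accelerator:
        raise ValueError(f"expected a {accelerator} device, got {dev.type}")
    return dev.index
```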
Open: is it necessary to pass a device type to these proposed APIs as an input argument? A single accelerator sounds sufficient for most people and most situations.
Besides, we will help
- unify the FSDP and Inductor code using these new APIs, and
- investigate how to unify the device-agnostic APIs related to Generator and Allocator (a note on Generator is sketched below).
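As a possible starting point for the Generator part, note that torch.Generator is already constructible per device today; what remains to unify is the surrounding runtime API. A small illustration:

```python
import torch

# torch.Generator already accepts a device argument; the open work is
# unifying the surrounding runtime APIs (seeding, RNG state) across backends.
device = "cuda" if torch.cuda.is_available() else "cpu"
gen = torch.Generator(device=device)
gen.manual_seed(42)
t = torch.randn(2, 2, generator=gen, device=device)
```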
cc @albanD @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @EikanWang @gujinghui