Python + torch + cuda in multiprocess application freeze

So… jetpack 6.1.2, python 3.10, python torch package from nvidia, pycuda

I have a python application that spawns multiple processes (using multiprocessing.Process)

Subprocess A uses torch to do some matrix computations to process an input image

Subprocess B uses pycuda to run inference on image (using a tensorrt compiled model)

B is freezing periodically in what appears to be something related to gpu and kernel mutexes
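
Since the processes are created with multiprocessing.Process, the start method is worth ruling out first. On Linux with Python 3.10 the default start method is fork, and the PyTorch documentation states that the CUDA runtime does not support the fork start method and that spawn or forkserver is required to use CUDA in subprocesses. Whether this is the cause of the freeze described here is not established. The following is only a minimal sketch of starting A and B with spawn and importing the GPU libraries inside each child; `probe_cuda`, `worker` and `start_workers` are hypothetical names, and the worker body is a placeholder for the real processing.

```python
import multiprocessing as mp


def probe_cuda():
    """Return True if torch is importable and reports a CUDA device."""
    try:
        import torch  # imported lazily so the parent never touches CUDA
    except (ImportError, OSError):
        return False
    return bool(torch.cuda.is_available())


def worker(name, results):
    # Placeholder body (assumption): the real process A or B would
    # initialize torch (and, for B, PyCUDA) here, inside the child.
    results.put((name, probe_cuda()))


def start_workers(names, timeout_s=120.0):
    """Start one spawned process per name and collect one result from each."""
    ctx = mp.get_context("spawn")  # explicit start method instead of fork
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(n, results)) for n in names]
    for p in procs:
        p.start()
    # Read results before joining so a full queue cannot block the children.
    collected = dict(results.get(timeout=timeout_s) for _ in procs)
    for p in procs:
        p.join()
    return collected
```

With spawn, each child re-imports the main module, so the call (for example `start_workers(["A", "B"])`) has to sit under an `if __name__ == "__main__":` guard in the launching script.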

Right now the initialization of torch and cuda is as follows:

A initializes torch using _ = torch.randn(1, device="cuda")
B also initializes torch using _ = torch.randn(1, device="cuda"), then attaches to the CUDA context with cuda.Context.attach()

Which is the correct way to handle this scenario? I’ve seen far too many different examples, but I have yet to find one that avoids the application freeze.

Hi,

Could you check if MPS can meet your requirements?

Thanks.

At this point I am looking for a confirmation that I use the correct initialization sequence so that torch and cuda coexist peacefully in python

This is the latest code I use (as suggested by AI after several not-so-successful iterations), not sure if correct:

#!/usr/bin/env python3

import torch
import pycuda.driver as cuda

# 1. Initialize PyTorch's CUDA state
print("Initializing torch")
_ = torch.randn(1, device="cuda")

# 2. Initialize the PyCUDA driver
print("Initializing cuda")
cuda.init()

# 3. Retrieve the primary context created by PyTorch
# Use the device ID matching your torch tensor (usually 0)
print("Get pytorch context")
device = cuda.Device(torch.cuda.current_device())
ctx = device.retain_primary_context()

# 4. Push the context to make it active for PyCUDA
print("Push context")
ctx.push()

try:
    # Your PyCUDA kernel operations go here
    print("Run sample torch operation")
    x = torch.ones(10, device="cuda")
finally:
    # 5. Pop the context when done to avoid cleanup hangs
    print("Pop context")
    ctx.pop()
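
The push/pop pairing in the script above could also be wrapped in a small context manager, so that the pop cannot be skipped when the body raises. This is only a sketch: `active_context` is a hypothetical helper, not part of PyCUDA, and `RecordingContext` is a stand-in object used so the example runs without a GPU. It does not by itself confirm that retaining the primary context is the right fix for the freeze.

```python
from contextlib import contextmanager


@contextmanager
def active_context(ctx):
    """Push ctx for the duration of the with-block, then pop it again."""
    ctx.push()
    try:
        yield ctx
    finally:
        # Runs on normal exit and on exceptions, keeping push/pop balanced.
        ctx.pop()


class RecordingContext:
    """Hypothetical stand-in for a PyCUDA context, used only for this demo."""

    def __init__(self):
        self.calls = []

    def push(self):
        self.calls.append("push")

    def pop(self):
        self.calls.append("pop")


def demo():
    """Show that pop still runs when the body of the with-block fails."""
    ctx = RecordingContext()
    try:
        with active_context(ctx):
            raise RuntimeError("simulated kernel failure")
    except RuntimeError:
        pass
    return ctx.calls
```

With the real objects from the script, the equivalent would be `with active_context(device.retain_primary_context()):` around the PyCUDA work.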

Hi,

Do you want a sample with PyTorch?
If yes, could you check if the following example helps?

Thanks.

I am looking for the proper way to initialize torch and cuda in the same script so that I do not risk a deadlock when using both, especially since torch is also used in another process

What I have is a multiprocess python app in which process A uses torch and process B uses both torch and cuda. They run at the same time, which has led to some gpu-related futex deadlocks

For example I’ve read that cuda autoinit is a no-no and I should attach cuda to torch context (there are several ways to do this, not sure which one is correct)

Hi,

Do processes A and B share the same data?
Or do you just want to share GPU resources among different tasks?

Thanks

The data is not shared; each process does its own processing on different data

Depending on how we initialize torch and cuda in python, in some scenarios we end up in a gpu-related futex deadlock, and various reports online suggest that this is known to happen

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks
~0422

Hi,

Could you try to run each process with a different CUDA stream?
Thanks.
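
For the torch side, the suggestion above could be tried roughly as follows. This is a sketch under assumptions: `run_on_private_stream` is a hypothetical helper name, it covers torch operations only, and it falls back to running the work directly when torch or a CUDA device is not available. The PyCUDA/TensorRT inference in B would need its own stream object (for example `pycuda.driver.Stream`), which is not shown here.

```python
def run_on_private_stream(fn):
    """Run fn() on a dedicated CUDA stream when one is available."""
    try:
        import torch
    except (ImportError, OSError):
        return fn()  # no torch in this environment: just run the work
    if not torch.cuda.is_available():
        return fn()  # no CUDA device: nothing to put on a stream

    stream = torch.cuda.Stream()     # non-default stream for this call
    with torch.cuda.stream(stream):  # torch ops issued here are queued on it
        result = fn()
    stream.synchronize()             # wait until the queued work has finished
    return result
```

In a long-running process it would make more sense to create the stream once at startup and reuse it, rather than once per call as in this sketch.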