Python + torch + cuda in multiprocess application freeze

dan_borlovan · March 18, 2026, 9:37am

So… jetpack 6.1.2, python 3.10, python torch package from nvidia, pycuda

I have a python application that spawns multiple processes (using multiprocessing.Process)

Subprocess A uses torch to do some matrix computations to process an input image

Subprocess B uses pycuda to run inference on image (using a tensorrt compiled model)

B is freezing periodically in what appears to be something related to gpu and kernel mutexes

Right now the initialization of torch and cuda is as follows:

A initializes torch using _ = torch.randn(1, device='cuda')
B also initializes torch using _ = torch.randn(1, device='cuda'), then attaches to the CUDA context with cuda.Context.attach()

Which is the correct way to handle this scenario? I’ve seen many different examples, but I have yet to find one that avoids the application freeze
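One thing worth ruling out first: the CUDA runtime does not support the fork start method, so PyTorch's multiprocessing notes call for spawn (or forkserver) whenever subprocesses use CUDA. The post does not say which start method is in use, so the following is only a sketch of the process layout under that assumption. The names `worker_a`, `worker_b`, and `build_processes` are hypothetical, and the PyCUDA/TensorRT part of B is left as a placeholder.

```python
import multiprocessing as mp


def worker_a():
    # Import torch inside the child so that CUDA is first touched here,
    # not in the parent process.
    import torch

    _ = torch.randn(1, device="cuda")  # same warm-up call as in the post
    # ... matrix computations on the input image would go here ...


def worker_b():
    import torch

    _ = torch.randn(1, device="cuda")
    # ... PyCUDA / TensorRT setup and inference would go here ...


def build_processes():
    """Create (but do not start) both workers using the spawn start method."""
    ctx = mp.get_context("spawn")
    return [
        ctx.Process(target=worker_a, name="A"),
        ctx.Process(target=worker_b, name="B"),
    ]
```

The processes would be started from under an `if __name__ == "__main__":` guard (which spawn requires) by calling `start()` and then `join()` on each one. With spawn, each child begins in a fresh interpreter, so no CUDA state is inherited from the parent.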

AastaLLL · March 19, 2026, 5:43am

Hi,

Could you check if MPS can meet your requirements?

Thanks.

dan_borlovan · March 19, 2026, 1:56pm

At this point I am looking for a confirmation that I use the correct initialization sequence so that torch and cuda coexist peacefully in python

This is the latest code I use (as suggested by AI after several not-so-successful iterations), not sure if correct:

#!/usr/bin/env python3

import torch
import pycuda.driver as cuda

# 1. Initialize PyTorch's CUDA state; this creates the device's primary context
print("Initializing torch")
_ = torch.randn(1, device="cuda")

# 2. Initialize the PyCUDA driver
print("Initializing cuda")
cuda.init()

# 3. Retrieve the primary context created by PyTorch
# Use the device ID matching your torch tensor (usually 0)
print("Get pytorch context")
device = cuda.Device(torch.cuda.current_device())
ctx = device.retain_primary_context()

# 4. Push the context to make it active for PyCUDA
print("Push context")
ctx.push()

try:
    # PyCUDA kernel operations would go here; a torch op is run as a smoke test
    print("Run sample torch operation")
    x = torch.ones(10, device="cuda")
finally:
    # 5. Pop the context when done to avoid cleanup hangs
    print("Pop context")
    ctx.pop()

AastaLLL · March 23, 2026, 8:11am

Hi,

Do you want a sample with PyTorch?
If yes, could you check if the following example helps?

github.com/dusty-nv/jetson-containers, packages/ml/pytorch/distributed_test.py (master)

#!/usr/bin/env python3

# From https://docs.vllm.ai/en/stable/usage/troubleshooting.html#incorrect-hardwaredriver
# Code to confirm whether the GPU/CPU communication is working correctly.
#----------------------------------------------------------------

# Test PyTorch NCCL
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

This file has been truncated.

Thanks.

dan_borlovan · March 23, 2026, 9:08am

I am looking for the proper way to initialize torch and cuda in the same script so that I do not risk a deadlock when using both, especially since torch is also used in another process

What I have is a multiprocess python app in which process A uses torch and process B uses both torch and cuda. They run at the same time, which has led to some gpu-related futex deadlocks

For example I’ve read that cuda autoinit is a no-no and I should attach cuda to torch context (there are several ways to do this, not sure which one is correct)
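To make the "attach to the torch context" option concrete: `pycuda.autoinit` creates a context of its own, whereas `Device.retain_primary_context()` hands back the primary context that torch's runtime already uses. A minimal sketch of that second route follows, wrapped in a context manager so the pop always runs. The helper names `pushed_primary_context` and `open_torch_device` are hypothetical, and nothing in this thread confirms that this sequence avoids the freeze.

```python
from contextlib import contextmanager


@contextmanager
def pushed_primary_context(device):
    """Make `device`'s primary context current for the duration of the block.

    `device` is expected to behave like a pycuda.driver.Device.
    """
    ctx = device.retain_primary_context()
    ctx.push()
    try:
        yield ctx
    finally:
        # Always undo the push, even if the body raised.
        ctx.pop()


def open_torch_device():
    """Hypothetical setup for process B; call it inside the child process."""
    import torch
    import pycuda.driver as cuda  # deliberately not pycuda.autoinit

    _ = torch.randn(1, device="cuda")  # let torch create the primary context
    cuda.init()
    return cuda.Device(torch.cuda.current_device())
```

In process B this would be used as `with pushed_primary_context(open_torch_device()):` around the PyCUDA/TensorRT calls, so torch and PyCUDA share one context instead of each creating its own.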

AastaLLL · March 25, 2026, 8:46am

Hi,

Do processes A and B share the same data?
Or do you just want to share GPU resources among different tasks?

Thanks

dan_borlovan · March 25, 2026, 9:55am

The data is not shared; each process does its own processing on different data

Depending on how we initialize python torch and cuda, in some scenarios we end up in a gpu-related futex deadlock, and various information on the net suggests that this is known to happen
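To pin down which call is blocked when the deadlock occurs, a watchdog built on the standard-library `faulthandler` module can dump the Python stack of every thread once a block of code runs longer than expected. This is only a diagnostic sketch: the helper name `hang_watchdog` and the timeout are hypothetical, and the dump shows Python frames only, not the native CUDA driver stack.

```python
import faulthandler
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_file):
    """Dump all thread tracebacks to `log_file` if the block runs too long.

    `log_file` must be a real open file (it needs a file descriptor) and
    must stay open until the block exits.
    """
    # repeat=True writes a new dump every `timeout_s` seconds while stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
```

In process B, the inference step could be wrapped as `with open("b_hang.log", "w") as log, hang_watchdog(30, log):` so that a freeze leaves a traceback pointing at the last Python call that entered the GPU stack.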

AastaLLL · March 30, 2026, 9:05am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks ~0422

Hi,

Could you try to run each process with a different CUDA stream?
Thanks.
