Description
🐛 Describe the bug
`AOTIModelPackageLoader::run` is designed to dispatch to the `run` method of the device-specific container runner: `AOTIModelContainerRunnerCuda`, `AOTIModelContainerRunnerCpu`, or the `AOTIModelContainerRunnerXpu` that I'm implementing. The `runner_` member is selected by device from the container runners above, and is declared as
`std::unique_ptr<AOTIModelContainerRunner> runner_;`.
`AOTIModelContainerRunner` is the base class of `AOTIModelContainerRunnerCuda`, `AOTIModelContainerRunnerCpu`, and `AOTIModelContainerRunnerXpu`.
We expected that when the device is CUDA, calling `runner_->run()` would invoke `AOTIModelContainerRunnerCuda::run`, but it actually calls `AOTIModelContainerRunner::run`.
This happens for two reasons:
- The `run` method in the base class is not declared `virtual`.
- The function signatures differ between the base class and the derived class:
  - base: `std::vector<at::Tensor> run(const std::vector<at::Tensor>& inputs, AOTInductorStreamHandle cuda_stream_handle = nullptr);`
  - derived: `std::vector<at::Tensor> run(const std::vector<at::Tensor>& inputs);`
Because of these differences, the derived `run` hides the base `run` rather than overriding it, so `runner_->run()` always executes the base class implementation.
This causes problems, especially for GPU backends that need a stream, because the stream parameter is always `nullptr` inside `AOTIModelContainerRunner::run`. For CUDA, when the stream is `nullptr` the CUDA API falls back to the current stream, so it happens to work. For XPU, however, the API crashes when the stream is `nullptr`, resulting in a null pointer dereference.
@desertfire Sorry, I'm not sure who to assign this issue to, so I assigned it to you; please feel free to re-assign it to the appropriate developer. Thanks.
Versions
Collecting environment information...
PyTorch version: 2.6.0a0+git8a80cee
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.0
Libc version: glibc-2.35
Python version: 3.9.20 | packaged by conda-forge | (main, Sep 30 2024, 17:49:10) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version: 550.120
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
cc @ezyang @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @aakhundov @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4