[webgpu] Enable graph capture#24900

Merged
qjia7 merged 33 commits into main from graph_capture
Jul 7, 2025
Conversation

@qjia7
Contributor

@qjia7 qjia7 commented May 29, 2025

This PR enables graph capture capabilities in the WebGPU provider, similar to the JSEP one (#18989).

The limitations are the same as for the JS/CUDA EPs:

  1. Models with control-flow ops (i.e. If, Loop and Scan) are not supported.
  2. Graph capture is limited to models in which all ops can be partitioned to the WebGPU EP or CPU EP, with no memory copies between them.
  3. Shapes of inputs/outputs cannot change across inference calls.
  4. IOBinding is required, and all inputs/outputs must be pre-allocated GPU buffers.

When using the graph capture feature, users are expected to do some pre-processing and post-processing of the inference inputs and outputs so that the whole pipeline stays on the GPU, avoiding unnecessary CPU-to-GPU or GPU-to-CPU copies. Usage looks like the following:

```
// Initialize Dawn
{
  // 1. Create the Dawn instance
  ...
  instance = wgpu::CreateInstance(&instanceDescriptor);
  // 2. Create the adapter
  ...
  instance.RequestAdapter
  // 3. Create the device from the adapter
  ...
  adapter.RequestDevice
}

// Create session options
webgpu_options_ = std::make_unique<Ort::SessionOptions>();
std::unordered_map<std::string, std::string> provider_options;
provider_options["dawnProcTable"] = std::to_string(reinterpret_cast<size_t>(&dawn::native::GetProcs()));
provider_options["webgpuInstance"] = std::to_string(reinterpret_cast<size_t>(instance_.Get()));
provider_options["webgpuDevice"] = std::to_string(reinterpret_cast<size_t>(device_.Get()));
provider_options["deviceId"] = "1";
provider_options["enableGraphCapture"] = "1";
// Add the WebGPU provider
webgpu_options_->AppendExecutionProvider("WebGPU", provider_options);
...
// Create the WebGPU session
webgpu_session_ = std::make_unique<Ort::Session>(*env_, model_path_.c_str(), *webgpu_options_);
...
Ort::MemoryInfo memory_info_gpu("WebGPU_Buffer", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
Ort::Allocator allocator(*webgpu_session_, memory_info_gpu);
auto input_buffer = allocator.GetAllocation(input_tensor_size_ * sizeof(float));
auto output_buffer = allocator.GetAllocation(output_tensor_size_ * sizeof(float));
// Create the IoBinding object
Ort::IoBinding webgpu_binding(*webgpu_session_);
// Upload cpu data to input_buffer or copy a gpu buffer to input_buffer
...
// Create OrtValue tensors backed by data in gpu memory
Ort::Value bound_x = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(input_buffer.get()), input_tensor_size_,
                input_dims_.data(), input_dims_.size());

Ort::Value bound_y = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(output_buffer.get()), output_tensor_size_,
                output_dims_.data(), output_dims_.size());
webgpu_binding.BindInput("input", bound_x);
webgpu_binding.BindOutput("output", bound_y);

// Run inference
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // normal run + capturing
...
// Post-process output_buffer's content
...
// Update input_buffer's content
...

// Run again
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // replay()
...
// Post-process output_buffer's content
...
```

@guschmue
Contributor

guschmue commented Jun 2, 2025

wow, that was fast :)

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jun 2, 2025
@qjia7 qjia7 marked this pull request as draft June 7, 2025 13:25
ishwar-raut1 added a commit to ishwar-raut1/onnxruntime that referenced this pull request Jun 10, 2025
@qjia7 qjia7 requested review from fs-eire and guschmue June 11, 2025 06:36
@qjia7 qjia7 marked this pull request as ready for review June 11, 2025 06:37
@qjia7 qjia7 requested a review from fs-eire June 16, 2025 05:26
@guschmue
Contributor

Tried it out; it seems to work for models like mobilenet-v4. I see a ~20% gain in my setup.
It's kind of unfortunate that in ORT the capture hangs off the session rather than a partition.
Looking at whether we can get an LLM running with capture.

guschmue previously approved these changes Jun 17, 2025
@guschmue
Contributor

There are a few ops that are partitioned to CPU, all because of int64. One option is to change the model to use int32.

@qjia7
Contributor Author

qjia7 commented Jun 21, 2025

@guschmue @fs-eire Rebased the code onto the latest main. Please take another look, thanks.

@qjia7 qjia7 requested a review from fs-eire July 1, 2025 06:28
@qjia7 qjia7 merged commit e63e053 into main Jul 7, 2025
90 of 93 checks passed
@qjia7 qjia7 deleted the graph_capture branch July 7, 2025 02:31
fs-eire pushed a commit that referenced this pull request Jul 7, 2025
daijh pushed a commit to daijh/onnxruntime that referenced this pull request Jul 10, 2025
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025

Labels

ep:WebGPU ort-web webgpu provider

3 participants