[webgpu] Enable graph capture#24900

Merged
qjia7 merged 33 commits into main from graph_capture
Jul 7, 2025
Conversation

@qjia7
Contributor

@qjia7 qjia7 commented May 29, 2025

This PR enables graph capture capabilities in the WebGPU provider, similar to the JSEP one (#18989).

The limitations are the same as for the JS/CUDA EPs:

  1. Models with control-flow ops (i.e. If, Loop and Scan) are not supported.
  2. Graph capture is limited to models in which all ops can be partitioned to the WebGPU EP or CPU EP, with no memory copies between them.
  3. Shapes of inputs/outputs cannot change across inference calls.
  4. IOBinding is required, and all inputs/outputs must be pre-allocated GPU buffers.

When using the graph capture feature, users are expected to do some pre-processing and post-processing of the inference inputs and outputs so that the whole pipeline stays on the GPU, avoiding unnecessary CPU-to-GPU or GPU-to-CPU copies. Usage looks like the following:

```
// Initialize Dawn
{
  // 1. Create the Dawn instance
  ...
  instance = wgpu::CreateInstance(&instanceDescriptor);
  // 2. Create the adapter
  ...
  instance.RequestAdapter
  // 3. Create the device from the adapter
  ...
  adapter.RequestDevice
}

// Create session options
webgpu_options_ = std::make_unique<Ort::SessionOptions>();
std::unordered_map<std::string, std::string> provider_options;
provider_options["dawnProcTable"] = std::to_string(reinterpret_cast<size_t>(&dawn::native::GetProcs()));
provider_options["webgpuInstance"] = std::to_string(reinterpret_cast<size_t>(instance_.Get()));
provider_options["webgpuDevice"] = std::to_string(reinterpret_cast<size_t>(device_.Get()));
provider_options["deviceId"] = "1";
provider_options["enableGraphCapture"] = "1";
// Add the WebGPU provider
webgpu_options_->AppendExecutionProvider("WebGPU", provider_options);
...
// Create the WebGPU session
webgpu_session_ = std::make_unique<Ort::Session>(*env_, model_path_.c_str(), *webgpu_options_);
...
Ort::MemoryInfo memory_info_gpu("WebGPU_Buffer", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
Ort::Allocator allocator(*webgpu_session_, memory_info_gpu);
auto input_buffer = allocator.GetAllocation(input_tensor_size_ * sizeof(float));
auto output_buffer = allocator.GetAllocation(output_tensor_size_ * sizeof(float));
// Create the IoBinding object
Ort::IoBinding webgpu_binding(*webgpu_session_);
// Upload cpu data to input_buffer or copy a gpu buffer to input_buffer
...
// Create OrtValue tensors backed by data in gpu memory
Ort::Value bound_x = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(input_buffer.get()), input_tensor_size_,
                input_dims_.data(), input_dims_.size());

Ort::Value bound_y = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(output_buffer.get()), output_tensor_size_,
                output_dims_.data(), output_dims_.size());
webgpu_binding.BindInput("input", bound_x);
webgpu_binding.BindOutput("output", bound_y);

// Run inference
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // normal run + capturing
...
// Post-process output_buffer's content
...
// Update input_buffer's content
...

// Run again
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // replay()
...
// Post-process output_buffer's content
...
```

@guschmue
Contributor

guschmue commented Jun 2, 2025

wow, that was fast :)

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jun 2, 2025
@qjia7 qjia7 marked this pull request as draft June 7, 2025 13:25
ishwar-raut1 added a commit to ishwar-raut1/onnxruntime that referenced this pull request Jun 10, 2025
@qjia7 qjia7 requested review from fs-eire and guschmue June 11, 2025 06:36
@qjia7 qjia7 marked this pull request as ready for review June 11, 2025 06:37
@qjia7 qjia7 requested a review from fs-eire June 16, 2025 05:26
@guschmue
Contributor

Tried it out; it seems to work for models like mobilenet-v4. I see a ~20% gain in my setup.
It's kind of unfortunate that in ORT the capture hangs off the session rather than a partition.
Looking at whether we can get an LLM running with capture.

guschmue previously approved these changes Jun 17, 2025
@guschmue
Contributor

There are a few ops that are partitioned to CPU, all because of int64. One option is to change the model to use int32.

@qjia7
Contributor Author

qjia7 commented Jun 21, 2025

@guschmue @fs-eire Rebased the code onto the latest main. Please take another look, thanks.

@qjia7 qjia7 requested a review from fs-eire July 1, 2025 06:28
@qjia7 qjia7 merged commit e63e053 into main Jul 7, 2025
90 of 93 checks passed
@qjia7 qjia7 deleted the graph_capture branch July 7, 2025 02:31
fs-eire pushed a commit that referenced this pull request Jul 7, 2025
daijh pushed a commit to daijh/onnxruntime that referenced this pull request Jul 10, 2025
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025

Labels

ep:WebGPU ort-web webgpu provider

3 participants