Conversation
qjia7 (Contributor) commented on May 29, 2025:
wow, that was fast :)
ishwar-raut1 added a commit to ishwar-raut1/onnxruntime that referenced this pull request on Jun 10, 2025
fs-eire reviewed on Jun 11, 2025
fs-eire reviewed on Jun 11, 2025
Contributor:
Tried it out; it seems to work for models like mobilenet-v4. I see a ~20% gain in my setup.
guschmue previously approved these changes on Jun 17, 2025
Contributor:
There are a few ops that are partitioned to the CPU, all because of int64. You can try changing the model to use int32.
fs-eire reviewed on Jul 1, 2025
fs-eire approved these changes on Jul 7, 2025
fs-eire pushed a commit that referenced this pull request on Jul 7, 2025
This PR enables graph capture in the WebGPU provider, similar to the JSEP one (#18989). The limitations are the same as for the JS/CUDA EPs:

1. Models with control-flow ops (i.e., If, Loop, and Scan) are not supported.
2. Graph capture is limited to models in which every op can be partitioned to the WebGPU EP or the CPU EP with no memory copies between them.
3. Input/output shapes cannot change across inference calls.
4. IOBinding is required, and all inputs/outputs must be pre-allocated GPU buffers.

When using the graph capture feature, users are expected to pre-process and post-process the inference inputs and outputs themselves so that the whole pipeline stays on the GPU, avoiding unnecessary CPU-to-GPU or GPU-to-CPU copies. Usage looks like this:

```cpp
// Initialize Dawn
{
  // 1. Create the Dawn instance
  ...
  instance = wgpu::CreateInstance(&instanceDescriptor);
  // 2. Create the adapter
  ...
  instance.RequestAdapter
  // 3. Create the device from the adapter
  ...
  adapter.RequestDevice
}

// Create session options
webgpu_options_ = std::make_unique<Ort::SessionOptions>();
std::unordered_map<std::string, std::string> provider_options;
provider_options["dawnProcTable"] = std::to_string(reinterpret_cast<size_t>(&dawn::native::GetProcs()));
provider_options["webgpuInstance"] = std::to_string(reinterpret_cast<size_t>(instance_.Get()));
provider_options["webgpuDevice"] = std::to_string(reinterpret_cast<size_t>(device_.Get()));
provider_options["deviceId"] = "1";
provider_options["enableGraphCapture"] = "1";

// Add the WebGPU provider
webgpu_options_->AppendExecutionProvider("WebGPU", provider_options);
...
// Create the WebGPU session
webgpu_session_ = std::make_unique<Ort::Session>(*env_, model_path_.c_str(), *webgpu_options_);
...
Ort::MemoryInfo memory_info_gpu("WebGPU_Buffer", OrtAllocatorType::OrtDeviceAllocator, 0,
                                OrtMemType::OrtMemTypeDefault);
Ort::Allocator allocator(*webgpu_session_, memory_info_gpu);
auto input_buffer = allocator.GetAllocation(input_tensor_size_ * sizeof(float));
auto output_buffer = allocator.GetAllocation(output_tensor_size_ * sizeof(float));

// Create IoBinding objects
Ort::IoBinding webgpu_binding(*webgpu_session_);

// Upload CPU data to input_buffer, or copy an existing GPU buffer into input_buffer
...

// Create OrtValue tensors backed by GPU memory
Ort::Value bound_x = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(input_buffer.get()),
                                              input_tensor_size_, input_dims_.data(), input_dims_.size());
Ort::Value bound_y = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(output_buffer.get()),
                                              output_tensor_size_, output_dims_.data(), output_dims_.size());
webgpu_binding.BindInput("input", bound_x);
webgpu_binding.BindOutput("output", bound_y);

// Run inference
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding);  // normal run + capturing
...
// Post-process output_buffer's content
...
// Update input_buffer's content
...
// Run again
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding);  // replay()
...
// Post-process output_buffer's content
...
```
daijh pushed a commit to daijh/onnxruntime that referenced this pull request on Jul 10, 2025
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request on Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request on Aug 11, 2025