
Conversation

@apsonawane (Contributor):

This pull request focuses on improving the robustness and reliability of the QMoECPU quantization code in moe_quantization_cpu.cc. The main changes add extensive input validation and bounds checking throughout the dequantization and routing logic, helping to prevent out-of-bounds memory access and potential crashes. Additionally, buffer size calculations are simplified for clarity and consistency.

The most important changes are:

Input Validation and Bounds Checking:

  • Added checks to validate scale_idx bounds in all dequantization code paths (including the 8-bit and 4-bit cases) to prevent out-of-bounds access to the scales array; a hedged sketch of this pattern follows the list below. [1] [2] [3]
  • Added validation of route indices and token indices in expert routing logic to skip invalid or out-of-bounds entries, preventing possible crashes or incorrect memory access. [1] [2] [3]
  • Added validation to ensure the number of tokens per expert does not exceed allocated workspace, logging an error and skipping the expert if exceeded.
  • Added additional buffer and thread index validations in the output accumulation step to prevent out-of-bounds writes and ensure thread safety.
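
As a rough, hypothetical sketch of the bounds-checking pattern described in the list above (the names DequantizeBlockwise8Bit and block_size, and the zero-point of 128, are illustrative assumptions rather than the actual QMoECPU code):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical 8-bit block-wise dequantization with a scale_idx bounds check:
// returns false (and logs) instead of reading past the end of `scales`.
bool DequantizeBlockwise8Bit(const uint8_t* quant, size_t num_elements,
                             const float* scales, size_t num_scales,
                             size_t block_size, std::vector<float>& out) {
  if (block_size == 0) return false;  // avoid division by zero below
  out.resize(num_elements);
  for (size_t i = 0; i < num_elements; ++i) {
    const size_t scale_idx = i / block_size;
    if (scale_idx >= num_scales) {  // validate before indexing into scales
      std::fprintf(stderr, "scale_idx %zu out of bounds (num_scales=%zu)\n",
                   scale_idx, num_scales);
      return false;
    }
    // Zero-point of 128 assumed purely for illustration.
    out[i] = static_cast<float>(static_cast<int32_t>(quant[i]) - 128) * scales[scale_idx];
  }
  return true;
}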

Buffer Size Calculation Simplification:

  • Removed unnecessary alignment for float array buffer sizes, simplifying workspace allocation and making buffer size calculations more consistent.
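
As a loose illustration of that simplification (the buffer names and element counts here are hypothetical, not the kernel's actual workspace layout), float buffers can be sized directly from element counts instead of rounding each one up to an alignment boundary:

#include <cstddef>

// Hypothetical sketch: float sub-buffers sized straight from element counts,
// with no per-buffer alignment padding. Buffer names are illustrative only.
size_t FloatWorkspaceBytes(size_t num_tokens, size_t hidden_size,
                           size_t inter_size) {
  const size_t fc1_output_elems = num_tokens * inter_size;   // FC1 activations
  const size_t fc2_output_elems = num_tokens * hidden_size;  // FC2 outputs
  return (fc1_output_elems + fc2_output_elems) * sizeof(float);
}

This is only meant to show the direction of the change, not the actual buffers the kernel allocates.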

const int64_t thread_divisor = std::max(1, max_threads * 4);
const int64_t min_work_per_thread = std::max(int64_t{32}, static_cast<int64_t>(num_tokens / thread_divisor));
const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / std::max(int64_t{1}, min_work_per_thread)), max_threads);
const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / min_work_per_thread), max_threads);
@tianleiwu (Contributor) commented on Sep 16, 2025:

num_tokens / min_work_per_thread could be zero here, since min_work_per_thread >= 32 from the previous line: if num_tokens < 32, then num_tokens / min_work_per_thread = 0.

In the end, only one thread will be used when num_tokens < 32. Is that expected if we want to optimize performance for decoding?
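
To make the arithmetic concrete, here is a small standalone sketch of the quoted thread-selection logic (max_threads = 8 and a non-null thread pool are assumptions for illustration):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  const int max_threads = 8;  // assumed value for illustration
  for (int64_t num_tokens : {8, 31, 32, 64, 256, 1024}) {
    const int64_t thread_divisor = std::max(1, max_threads * 4);
    const int64_t min_work_per_thread =
        std::max(int64_t{32}, num_tokens / thread_divisor);
    // Same shape as the quoted expression, with tp assumed non-null.
    const int optimal_routing_threads =
        (num_tokens < min_work_per_thread)
            ? 1
            : std::min(static_cast<int>(num_tokens / min_work_per_thread),
                       max_threads);
    std::printf("num_tokens=%lld -> threads=%d\n",
                static_cast<long long>(num_tokens), optimal_routing_threads);
  }
  return 0;
}

With these assumptions, every num_tokens below 32 falls into the single-thread branch, which is the behavior being questioned for small decoding batches.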

fc1_gemm_done:

const int64_t activation_threshold = std::max(int64_t{4}, 256 / std::max(int64_t{1}, inter_size));
const int64_t activation_threshold = std::max(int64_t{4}, 256 / inter_size);
Contributor commented:

How do you choose the magic numbers (4, 256) here?

@apsonawane (Contributor, Author) replied:

256 is chosen because it fits well in the L1 cache and improves CPU cache efficiency. The value 4 is derived from inter_size: it is the minimum number of tokens required before considering parallel processing. For example, with inter_size = 64, 256 / 64 = 4.
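
To see how the threshold behaves, here is a tiny standalone sketch of that arithmetic (using the guarded form of the line quoted above; the inter_size values are arbitrary examples):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  for (int64_t inter_size : {16, 64, 128, 512}) {
    const int64_t activation_threshold =
        std::max(int64_t{4}, 256 / std::max(int64_t{1}, inter_size));
    std::printf("inter_size=%lld -> activation_threshold=%lld\n",
                static_cast<long long>(inter_size),
                static_cast<long long>(activation_threshold));
  }
  return 0;
}

Because of the outer std::max, the threshold never drops below 4 tokens, even when inter_size is large enough that 256 / inter_size rounds down to 0.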

@apsonawane (Contributor, Author):

Created a new PR: #26091

This PR is no longer required.

@apsonawane closed this on Sep 19, 2025.