
Conversation

@apsonawane (Contributor):

This pull request focuses on improving the robustness and reliability of the QMoECPU quantization code in moe_quantization_cpu.cc. The main changes add extensive input validation and bounds checking throughout the dequantization and routing logic, helping to prevent out-of-bounds memory access and potential crashes. Additionally, buffer size calculations are simplified for clarity and consistency.

The most important changes are:

Input Validation and Bounds Checking:

  • Added checks to validate scale_idx bounds in all dequantization code paths (including the 8-bit and 4-bit cases) to prevent out-of-bounds access to the scales array; a hedged sketch of this pattern follows the list below. [1] [2] [3]
  • Added validation of route indices and token indices in expert routing logic to skip invalid or out-of-bounds entries, preventing possible crashes or incorrect memory access. [1] [2] [3]
  • Added validation to ensure the number of tokens per expert does not exceed allocated workspace, logging an error and skipping the expert if exceeded.
  • Added additional buffer and thread index validations in the output accumulation step to prevent out-of-bounds writes and ensure thread safety.
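
As a rough, hypothetical sketch of the bounds-checking pattern described in the list above (the names DequantizeBlockwise8Bit and block_size, and the zero-point of 128, are illustrative assumptions rather than the actual QMoECPU code):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical 8-bit block-wise dequantization with a scale_idx bounds check:
// returns false (and logs) instead of reading past the end of `scales`.
bool DequantizeBlockwise8Bit(const uint8_t* quant, size_t num_elements,
                             const float* scales, size_t num_scales,
                             size_t block_size, std::vector<float>& out) {
  if (block_size == 0) return false;  // avoid division by zero below
  out.resize(num_elements);
  for (size_t i = 0; i < num_elements; ++i) {
    const size_t scale_idx = i / block_size;
    if (scale_idx >= num_scales) {  // validate before indexing into scales
      std::fprintf(stderr, "scale_idx %zu out of bounds (num_scales=%zu)\n",
                   scale_idx, num_scales);
      return false;
    }
    // Zero-point of 128 assumed purely for illustration.
    out[i] = static_cast<float>(static_cast<int32_t>(quant[i]) - 128) * scales[scale_idx];
  }
  return true;
}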

Buffer Size Calculation Simplification:

  • Removed unnecessary alignment for float array buffer sizes, simplifying workspace allocation and making buffer size calculations more consistent.
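
As a loose illustration of that simplification (the buffer names and element counts here are hypothetical, not the kernel's actual workspace layout), float buffers can be sized directly from element counts instead of rounding each one up to an alignment boundary:

#include <cstddef>

// Hypothetical sketch: float sub-buffers sized straight from element counts,
// with no per-buffer alignment padding. Buffer names are illustrative only.
size_t FloatWorkspaceBytes(size_t num_tokens, size_t hidden_size,
                           size_t inter_size) {
  const size_t fc1_output_elems = num_tokens * inter_size;   // FC1 activations
  const size_t fc2_output_elems = num_tokens * hidden_size;  // FC2 outputs
  return (fc1_output_elems + fc2_output_elems) * sizeof(float);
}

This is only meant to show the direction of the change, not the actual buffers the kernel allocates.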

const int64_t thread_divisor = std::max(1, max_threads * 4);
const int64_t min_work_per_thread = std::max(int64_t{32}, static_cast<int64_t>(num_tokens / thread_divisor));
const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / std::max(int64_t{1}, min_work_per_thread)), max_threads);
const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / min_work_per_thread), max_threads);
@tianleiwu (Contributor) commented on Sep 16, 2025:

num_tokens / min_work_per_thread could be zero here, since min_work_per_thread >= 32 from the previous line: if num_tokens < 32, then num_tokens / min_work_per_thread = 0.

In the end, only one thread will be used when num_tokens < 32. Is that expected if we want to optimize performance for decoding?
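
To make the arithmetic concrete, here is a small standalone sketch of the quoted thread-selection logic (max_threads = 8 and a non-null thread pool are assumptions for illustration):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  const int max_threads = 8;  // assumed value for illustration
  for (int64_t num_tokens : {8, 31, 32, 64, 256, 1024}) {
    const int64_t thread_divisor = std::max(1, max_threads * 4);
    const int64_t min_work_per_thread =
        std::max(int64_t{32}, num_tokens / thread_divisor);
    // Same shape as the quoted expression, with tp assumed non-null.
    const int optimal_routing_threads =
        (num_tokens < min_work_per_thread)
            ? 1
            : std::min(static_cast<int>(num_tokens / min_work_per_thread),
                       max_threads);
    std::printf("num_tokens=%lld -> threads=%d\n",
                static_cast<long long>(num_tokens), optimal_routing_threads);
  }
  return 0;
}

With these assumptions, every num_tokens below 32 falls into the single-thread branch, which is the behavior being questioned for small decoding batches.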

fc1_gemm_done:

const int64_t activation_threshold = std::max(int64_t{4}, 256 / std::max(int64_t{1}, inter_size));
const int64_t activation_threshold = std::max(int64_t{4}, 256 / inter_size);
Contributor commented:

How do you choose the magic numbers (4, 256) here?

@apsonawane (Contributor, Author) replied:

256 is chosen because it fits well in the L1 cache and improves CPU cache efficiency. The value 4 is derived from inter_size: it is the minimum number of tokens required before considering parallel processing. For example, with inter_size = 64, 256 / 64 = 4.
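
To see how the threshold behaves, here is a tiny standalone sketch of that arithmetic (using the guarded form of the line quoted above; the inter_size values are arbitrary examples):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  for (int64_t inter_size : {16, 64, 128, 512}) {
    const int64_t activation_threshold =
        std::max(int64_t{4}, 256 / std::max(int64_t{1}, inter_size));
    std::printf("inter_size=%lld -> activation_threshold=%lld\n",
                static_cast<long long>(inter_size),
                static_cast<long long>(activation_threshold));
  }
  return 0;
}

Because of the outer std::max, the threshold never drops below 4 tokens, even when inter_size is large enough that 256 / inter_size rounds down to 0.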

@apsonawane (Contributor, Author):

Created a new PR: #26091

This PR is no longer required.

@apsonawane closed this on Sep 19, 2025.