-
Notifications
You must be signed in to change notification settings - Fork 3.7k
QMoE kernel further optimizations #26048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| const int64_t thread_divisor = std::max(1, max_threads * 4); | ||
| const int64_t min_work_per_thread = std::max(int64_t{32}, static_cast<int64_t>(num_tokens / thread_divisor)); | ||
| const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / std::max(int64_t{1}, min_work_per_thread)), max_threads); | ||
| const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / min_work_per_thread), max_threads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_tokens / min_work_per_thread could be zero here since min_work_per_thread >= 32 in previous line. If num_tokens < 32, then num_tokens / min_work_per_thread = 0.
In the end, you will use only one thread when num_tokens < 32. Is it expected if we want optimize performance for decoding?
| fc1_gemm_done: | ||
|
|
||
| const int64_t activation_threshold = std::max(int64_t{4}, 256 / std::max(int64_t{1}, inter_size)); | ||
| const int64_t activation_threshold = std::max(int64_t{4}, 256 / inter_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you choose the magic numbers (4, 256) here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
256 is chosen because it fits well in L1 cache and is better for CPU Cache efficiency. We get the number 4 based on the inter_size it is the minimum number of token required before considering parallel processing.
inter_size = 64 --> 256/64 = 4.
|
Created a new PR: #26091 This PR is not required |
This pull request focuses on improving the robustness and reliability of the
QMoECPUquantization code inmoe_quantization_cpu.cc. The main changes add extensive input validation and bounds checking throughout the dequantization and routing logic, helping to prevent out-of-bounds memory access and potential crashes. Additionally, buffer size calculations are simplified for clarity and consistency.The most important changes are:
Input Validation and Bounds Checking:
scale_idxbounds in all dequantization code paths (including 8-bit and 4-bit cases) to prevent out-of-bounds access to thescalesarray. [1] [2] [3]Buffer Size Calculation Simplification: