Support GPT-OSS-BF16 #4240
Conversation
Thanks for your contribution!
ming1753 left a comment:
for CI
if (
    hasattr(self.fd_config.model_config, "layer_types")
    and self.fd_config.model_config.layer_types[layer.layer_id] == "sliding_attention"
):
    sliding_window = self.fd_config.model_config.sliding_window
else:
    sliding_window = 0
Let's move this block into attention.py.
Currently only append_attention supports SWA. We can move this into attention.py in a follow-up PR; the other backends would then need to raise NotImplementedError.
ming1753 left a comment:
LGTM
gongshaotian left a comment:
LGTM
gongshaotian left a comment:
LGTM
XiaoguangHu01 left a comment:
LGTM
This PR adds support for the GPT-OSS bf16 model. Compared with vLLM, this PR also implements Wint8 quantization and achieves roughly a 15% lead in metrics such as QPS, TPS, and TTFT. It also introduces several new features that improve model flexibility and performance: sinks in append attention, sliding window attention, bias support for MoE layers, and the swigluoai activation function.
New Features
Feature 1: Support Sinks in Append Attention
This feature introduces sinks in append attention, allowing designated tokens to remain visible throughout decoding. This improves the control and stability of the attention mechanism, especially in long-context or multi-turn scenarios.
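A minimal sketch of the idea, using hypothetical tensor names and shapes (the actual implementation lives in the fused append-attention kernel): each head carries a learned sink logit that joins the softmax normalization but contributes no value vector.

import torch

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: [num_heads, seq_len, head_dim]; sink_logit: [num_heads] learned scalar per head.
    # Causal masking is omitted for brevity.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale            # [H, S, S]
    sink = sink_logit.view(-1, 1, 1).expand(-1, scores.shape[1], 1)  # [H, S, 1]
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return torch.matmul(probs[..., :-1], v)                          # sink weight is dropped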
Feature 2: Support Sliding Window Attention (SWA)
This feature implements Sliding Window Attention, an efficient mechanism for handling long sequences by limiting each token's attention scope. The sliding window constrains the key-value pairs visible during decoding, improving memory usage and efficiency in long-sequence inference.
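For illustration only (the PR implements SWA inside the append_attention backend; other backends would raise NotImplementedError), a sliding-window causal mask can be expressed as:

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Token i may attend to tokens j with i - window < j <= i.
    idx = torch.arange(seq_len)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)
    within = (idx.unsqueeze(1) - idx.unsqueeze(0)) < window
    return causal & within    # [seq_len, seq_len] boolean mask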
Feature 3: Implement "swigluoai" activation function
This adds support for the SwigluOAI activation, a SwiGLU variant with configurable scaling factors (1.702, 7.0) and support for interleaved mode.
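A reference-style sketch of the activation, based on the commonly published GPT-OSS formulation (the PR's version is a fused kernel, and the exact clamping details here should be treated as an assumption): the input interleaves gate and linear channels, the gate uses a SiLU scaled by alpha = 1.702, and both halves are clamped by a limit of 7.0.

import torch

def swiglu_oai(x: torch.Tensor, alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    # Interleaved layout: even channels are the gate, odd channels are the linear part.
    x_glu, x_linear = x[..., ::2], x[..., 1::2]
    x_glu = x_glu.clamp(max=limit)
    x_linear = x_linear.clamp(min=-limit, max=limit)
    gate = x_glu * torch.sigmoid(alpha * x_glu)    # scaled SiLU gate
    return gate * (x_linear + 1)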
Feature 4: Add Bias support for MoE layers
This extends the MoE feed-forward layer to correctly apply an expert-specific bias during the down projection, ensuring each token is routed to the correct expert together with its associated bias term.
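A minimal, unfused sketch of the idea (hypothetical tensor names; FastDeploy's MoE path is fused): after routing, each token's down projection gathers both the weight and the bias of its assigned expert.

import torch

def moe_down_proj(hidden, expert_ids, down_weight, down_bias):
    # hidden:      [num_tokens, inter_dim]   activated FFN output per token
    # expert_ids:  [num_tokens]              expert assigned to each token
    # down_weight: [num_experts, inter_dim, hidden_dim]
    # down_bias:   [num_experts, hidden_dim]
    w = down_weight[expert_ids]               # per-token expert weight
    b = down_bias[expert_ids]                 # per-token expert bias
    return torch.einsum("ti,tio->to", hidden, w) + b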
Usage Example
Start online service
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/gpt-oss-20b-bf16 \
    --port 8188 \
    --engine-worker-queue-port 51001 \
    --cache-queue-port 51002 \
    --host 0.0.0.0 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --quantization wint8
Send a request
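One way to send a request, assuming the server above exposes the standard OpenAI-compatible /v1/chat/completions route on port 8188 (the model name in the payload is illustrative):

import requests

payload = {
    "model": "gpt-oss-20b-bf16",   # illustrative; use the name your deployment reports
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128,
}
resp = requests.post("http://0.0.0.0:8188/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])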