
Conversation

Contributor

@Limerances commented Sep 23, 2025

This PR adds support for the GPT-OSS bf16 model. Compared with vLLM, this PR implements Wint8 quantization and achieves a 15% lead in metrics such as QPS, TPS, and TTFT. It also introduces several new features that enhance model flexibility and performance: sinks in append attention, sliding window attention, bias support for MoE layers, and the swigluoai activation function.

New Features

Feature 1: Support Sinks in Append Attention

This feature introduces sinks in append attention, allowing certain attention slots to remain visible throughout decoding. This enhances the control and stability of the attention mechanism, especially in long-context or multi-turn scenarios.
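
For intuition, here is a minimal NumPy sketch of one common sink formulation (a learned per-head sink logit joins the softmax but contributes no value); names are illustrative and this is not the FastDeploy kernel:

import numpy as np

def attention_with_sink(q, k, v, sink_logit):
    # q: (head_dim,), k/v: (seq, head_dim), sink_logit: learned scalar per head
    scores = k @ q / np.sqrt(q.shape[-1])            # (seq,)
    scores = np.concatenate([scores, [sink_logit]])  # append the sink slot
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax over seq + sink
    return probs[:-1] @ v                            # sink absorbs mass, adds no value

The sink slot soaks up probability mass that would otherwise be forced onto real tokens, which stabilizes the attention distribution during long decoding runs.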

Feature 2: Support Sliding Window Attention (SWA)

This feature implements Sliding Window Attention, an efficient mechanism for handling long sequences by limiting the attention scope of each token. The sliding window constrains the visible key-value pairs during decoding, improving memory usage and efficiency in long-sequence inference.
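
A minimal NumPy sketch of the causal sliding-window mask (illustrative only; the actual masking happens inside the append attention kernel): each query position i attends only to keys in [i - window + 1, i].

import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]     # query positions
    j = np.arange(seq_len)[None, :]     # key positions
    return (j <= i) & (j > i - window)  # causal, truncated to `window` keys

print(sliding_window_mask(6, 3).astype(int))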

Feature 3: Implement "swigluoai" activation function

This adds support for the SwigluOAI activation, a variant of SwiGLU with optimized scaling that provides configurable scaling factors (1.702 and 7.0) and supports an interleaved mode.
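
A hedged NumPy sketch of this activation, assuming the two factors are the sigmoid scale (alpha = 1.702) and a clamp limit (7.0), with gate/up interleaved along the last axis; exact names and clamping may differ from the FastDeploy implementation:

import numpy as np

def swigluoai(x, alpha=1.702, limit=7.0):
    gate, up = x[..., ::2], x[..., 1::2]        # interleaved layout
    gate = np.clip(gate, None, limit)           # clamp gate branch from above
    up = np.clip(up, -limit, limit)             # clamp linear branch both ways
    glu = gate / (1.0 + np.exp(-alpha * gate))  # gate * sigmoid(alpha * gate)
    return glu * (up + 1.0)

print(swigluoai(np.random.randn(4, 8)).shape)   # (4, 4)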

Feature 4: Add Bias support for MoE layers

This extends the MoE feed-forward pass to correctly apply an expert-specific bias during the down projection, ensuring each token is routed to the correct expert together with its associated bias term.
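
As a shape-level NumPy sketch of the routed bias (illustrative; the real MoE kernels are fused), each token's hidden state is projected with, and biased by, the expert it routed to:

import numpy as np

num_experts, d_ff, d_model, num_tokens = 4, 16, 8, 5
w_down = np.random.randn(num_experts, d_ff, d_model)
b_down = np.random.randn(num_experts, d_model)                   # per-expert bias
h = np.random.randn(num_tokens, d_ff)                            # post-activation states
expert_ids = np.random.randint(0, num_experts, size=num_tokens)  # routing result

out = np.einsum("tf,tfd->td", h, w_down[expert_ids]) + b_down[expert_ids]
print(out.shape)  # (5, 8)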

Usage Example

Start online service

python -m fastdeploy.entrypoints.openai.api_server \
       --model /path/to/gpt-oss-20b-bf16 \
       --port 8188 \
       --engine-worker-queue-port 51001 \
       --cache-queue-port 51002 \
       --host 0.0.0.0 \
       --max-model-len 32768 \
       --max-num-seqs 256 \
       --quantization wint8

Send a request

# Send a request
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "你是谁"}
  ]
}'
# Response
"content":"<|channel|>analysis<|message|>The user speaks Chinese: \"你是谁\" (\"Who are you?\"). We should respond in Chinese, presumably. We have instructions: we should act as ChatGPT. We are to answer. The user wants an answer. Could be a brief intro. They might want a description: \"我是 ChatGPT, OpenAI's large language model...\"\n\nGiven the policy, we can do that, we have no disallowed content. The user didn't mention anything requiring policy.\n\nWe need to respond appropriately. There's no sensitive content.\n\nWe just reply with a self-introduction. We should add that \"I am ChatGPT, a language model developed by OpenAI.\" Possibly mention capabilities. That should be enough.<|end|><|start|>assistant<|channel|>final<|message|>我是 ChatGPT,一个由 OpenAI 训练的大型语言模型。我的主要任务是帮助你回答问题、提供信息、协助写作、翻译、进行创作等。你可以把我当作一个随时准备好回答你各种语义和知识问题的小助手。🚀"


paddle-bot bot commented Sep 23, 2025

Thanks for your contribution!


CLAassistant commented Sep 23, 2025

CLA assistant check
All committers have signed the CLA.

paddle-bot added the contributor (External developers) label Sep 23, 2025
ming1753 previously approved these changes Oct 13, 2025
Collaborator

@ming1753 left a comment

for CI

@ming1753 changed the title from Support GPT-OSS to Support GPT-OSS-BF16 Oct 14, 2025
Comment on lines 222 to 228
if (
hasattr(self.fd_config.model_config, "layer_types")
and self.fd_config.model_config.layer_types[layer.layer_id] == "sliding_attention"
):
sliding_window = self.fd_config.model_config.sliding_window
else:
sliding_window = 0
Collaborator

Let's move this block into attention.py.

Collaborator

Currently only append_attention supports SWA; we can move this into attention.py in a follow-up PR. Otherwise the other backends would have to raise NotImplementedError.

qingqing01 previously approved these changes Oct 15, 2025
ming1753 previously approved these changes Oct 15, 2025
Collaborator

@ming1753 left a comment

LGTM

DDDivano previously approved these changes Oct 16, 2025
gongshaotian previously approved these changes Oct 16, 2025
Collaborator

@gongshaotian left a comment

LGTM

Collaborator

@gongshaotian left a comment

LGTM

@XiaoguangHu01 left a comment

LGTM

@Jiang-Jia-Jun merged commit 1b9f351 into PaddlePaddle:develop Oct 20, 2025
24 of 31 checks passed