[Feature] Support zai-org/GLM-4.5-Air BF16 model #3928
Conversation
Thanks for your contribution!

YuanRisheng left a comment:
Unit tests need to be added.
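A unit test along the lines the reviewer asks for might compare the new rotary-embedding tables against an independent reference. The sketch below is hypothetical: the class name `GlmRotaryEmbedding` comes from the diff further down, and the reference formula is the standard RoPE definition, not FastDeploy's actual test code.

```python
import math

def reference_rope(head_dim, position, base=10000.0):
    # Standard RoPE reference: angle_i = position * base^(-2i/d)
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    angles = [position * f for f in inv_freq]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]

def test_glm_rotary_matches_reference():
    # In the real test this would import GlmRotaryEmbedding from
    # fastdeploy and compare its cos/sin tables to this reference.
    cos, sin = reference_rope(head_dim=8, position=3)
    assert len(cos) == len(sin) == 4
    assert math.isclose(cos[0], math.cos(3.0))
```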
return rot_emb
...
class GlmRotaryEmbedding:
Could you find a way to turn ErnieRotaryEmbedding into a base class and inherit from it here? There is too much duplicated code.
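The refactor the reviewer suggests could look roughly like the sketch below: the shared frequency and cos/sin construction moves into a base class, and `GlmRotaryEmbedding` keeps only what differs. This is a minimal illustration, not FastDeploy's actual implementation; the GLM-specific partial-rotary factor shown here is an assumption.

```python
import math

class RotaryEmbedding:
    """Hypothetical base class holding the logic currently duplicated
    between ErnieRotaryEmbedding and GlmRotaryEmbedding."""

    def __init__(self, rotary_dim: int, base: float = 10000.0):
        self.rotary_dim = rotary_dim
        self.base = base

    def inv_freq(self):
        # Shared inverse-frequency table: base^(-2i/d) for i in [0, d/2)
        return [self.base ** (-2.0 * i / self.rotary_dim)
                for i in range(self.rotary_dim // 2)]

    def __call__(self, position: int):
        # Shared cos/sin construction; subclasses override only the
        # model-specific pieces.
        angles = [position * f for f in self.inv_freq()]
        return ([math.cos(a) for a in angles],
                [math.sin(a) for a in angles])

class GlmRotaryEmbedding(RotaryEmbedding):
    """Subclass keeps only the GLM-specific difference, sketched here
    as a partial-rotary factor (an assumption for illustration)."""

    def __init__(self, head_dim: int, base: float = 10000.0,
                 partial_rotary_factor: float = 0.5):
        super().__init__(int(head_dim * partial_rotary_factor), base)
```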
* support glm45_air
self.model.clear_grpah_opt_backend(fd_config=self.fd_config)
...
class Glm4MoePretrainedModel(PretrainedModel):
This class can be deleted now.
* support glm45_air
* [Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)
* support glm45_air
* [Feature] GLM-45-AIR Support Mix Quantization (Dense wfp8afp8 and wint8 triton_moe_backend) (#4051)
* check
* fix v1 load for mix and wint8
* check --quantizations 'None'
* check
* support RL rollout
* check v1 loader
* check glm rollout_model, change wfp8afp8 per_token_cast_to_fp8 to native impl
* check rollout moe gate begin layer_id
* check rollout e_score_correction_bias
* delete infer_to_train_mapping={}
* code check
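One of the commits above replaces the wfp8afp8 `per_token_cast_to_fp8` helper with a native implementation. A per-token FP8 cast typically derives a scale per row (token) from that row's absolute maximum, then clamps to the FP8 range. The sketch below emulates this in plain Python with the float8_e4m3 max of 448.0; the function shape is an assumption for illustration, not FastDeploy's actual code.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the float8_e4m3 format

def per_token_cast_to_fp8(x):
    """Quantize each row (token) of x with its own scale.

    Returns (quantized, scales); dequantization of element [i][j]
    is quantized[i][j] * scales[i].
    """
    quantized, scales = [], []
    for row in x:
        amax = max(abs(v) for v in row) or 1.0  # avoid divide-by-zero
        scale = amax / FP8_E4M3_MAX
        quantized.append(
            [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
        )
        scales.append(scale)
    return quantized, scales
```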
Support loading the PyTorch zai-org/GLM-4.5-Air model and running inference.
model_path=./torch_models/GLM-4.5-Air
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 18 \
    --tensor-parallel-size 4 \
    --port 8112 \
    --graph-optimization-config '{"use_cudagraph":true, "graph_opt_level":0}' \
    --load_choices "default_v1"
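Since the launch command starts an OpenAI-compatible server on port 8112, a request can be sent to the standard chat-completions route. The sketch below only builds the request with the stdlib; the served model name is an assumption taken from the model path above.

```python
import json
import urllib.request

# Payload for the OpenAI-compatible /v1/chat/completions endpoint
# exposed by the api_server launch command above (port 8112).
payload = {
    "model": "GLM-4.5-Air",  # served model name (assumption)
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:8112/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with the server running
```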