feat: add partial mode for assistant message prefill#306

Merged
jundot merged 1 commit into jundot:main from blightbow:feat/partial-mode
Mar 21, 2026

Conversation

Contributor

@blightbow blightbow commented Mar 18, 2026

This PR implements support for Moonshot's "partial mode", an extension to the OpenAI chat completion API that allows the final assistant message to be prefilled and role anchors to be renamed. This enables prefill patterns (JSON mode via `"content": "{"`) and named-assistant persona consistency (Kimi K2/K2.5 `name` field rendering).

Full disclosure: Moonshot has not publicly documented how "partial mode" is implemented on their API backend, but when you compare their documentation to the model templates for K2 and K2.5, it becomes pretty obvious what they're doing here.

The underlying infrastructure is already present in mlx_lm and in the chat templates of Kimi K2 and K2.5. This change passes the `name` field through to the backend unconditionally, and reserves the `partial` field as a boolean toggle between `continue_final_message=True` and `add_generation_prompt=True` in `apply_chat_template`. Unlike `name`, `partial` is consumed by the API server and never passed to the model template.
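For illustration, a client request using partial mode might look like the sketch below. The model name and exact payload shape are assumptions; only the `partial` flag and the prefill-via-`content` semantics come from the PR description.

```python
# Hypothetical request payload: the final assistant message carries a
# prefill and `partial: true`, so the server continues that message
# instead of opening a fresh assistant turn.
payload = {
    "model": "kimi-k2",  # assumed model name, for illustration only
    "messages": [
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": "List three primary colors."},
        # Prefill: generation resumes after the opening brace,
        # nudging the model into emitting JSON.
        {"role": "assistant", "content": "{", "partial": True},
    ],
}
```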

For more information, refer to Moonshot's documentation on Partial Mode.


Schema changes:

  • Add `name` and `partial` fields to the Message model
  • Add `detect_and_strip_partial()` helper in api/utils.py
  • Preserve `name`/`partial` through `extract_text_content()` and `extract_multimodal_content()` message extraction
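A minimal sketch of what the helper named above might look like; the function name comes from this PR, but the body here is an assumption and the real implementation may differ:

```python
def detect_and_strip_partial(messages):
    """Return (messages, is_partial).

    is_partial is True when the final message is an assistant message
    flagged `partial: true`. The flag is popped from every message so it
    never leaks into the chat template.
    """
    is_partial = bool(
        messages
        and messages[-1].get("role") == "assistant"
        and messages[-1].get("partial")
    )
    for msg in messages:
        msg.pop("partial", None)  # strip from all messages, not just the last
    return messages, is_partial
```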

Engine changes:

  • `BatchedEngine._apply_chat_template()`: conditional generation prompt based on partial mode detection
  • `VLMBatchedEngine`: strip the `partial` field in both template methods but always use `add_generation_prompt=True` (vision models do not support continuation)
  • `MLXLanguageModel.chat()`: same partial mode logic as `BatchedEngine`
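The engine-side branch can be sketched as follows. `continue_final_message` and `add_generation_prompt` are the real `apply_chat_template` kwargs named in this PR; the function name and surrounding structure here are illustrative assumptions:

```python
def apply_chat_template_for_engine(tokenizer, messages):
    """Sketch of the conditional generation prompt described above.

    When the final assistant message was flagged partial, render the
    prompt with continue_final_message=True so generation resumes inside
    that message; otherwise open a fresh assistant turn.
    """
    last = messages[-1] if messages else {}
    is_partial = last.get("role") == "assistant" and bool(last.get("partial"))
    for msg in messages:
        msg.pop("partial", None)  # the flag never reaches the template
    if is_partial:
        return tokenizer.apply_chat_template(
            messages,
            continue_final_message=True,
            add_generation_prompt=False,
        )
    return tokenizer.apply_chat_template(messages, add_generation_prompt=True)
```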


Ported from mlx-openai-server PR jundot#213.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@blightbow
Contributor Author

Part of me doesn't like the isolated pop loop in `detect_and_strip_partial()`, but the only clean way to avoid it is an explicit kwarg for partial mode on the engine. Overriding generation mode via `chat_template_kwargs` would be clumsy because the mode should be deterministic.

I'm happy to rework that, just let me know what your preference is.

@jundot jundot force-pushed the main branch 7 times, most recently from f6faf2f to c2beead on March 21, 2026 at 05:58
Owner

@jundot jundot left a comment


Thanks for this; it's a clean implementation with solid test coverage. The `partial`/`name` flow through the extraction pipeline and engine layer looks well thought out, and the VLM handling (strip the flag but always use `add_generation_prompt=True`) is the right call.

One minor thing: `_merge_consecutive_roles` merges consecutive assistant messages and keeps only the first message's dict, so a second consecutive assistant message's `partial: true` would get dropped before it reaches `detect_and_strip_partial`. Pretty unlikely to happen in practice though, so not a concern for this PR.
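The edge case can be demonstrated with a simplified stand-in merge (an illustrative sketch; the real `_merge_consecutive_roles` may differ in details):

```python
def merge_consecutive_roles(messages):
    """Simplified sketch: consecutive same-role messages are folded into
    the FIRST message's dict, so extra keys on later messages
    (e.g. `partial`) are silently dropped."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += msg["content"]  # first dict's keys win
        else:
            merged.append(dict(msg))
    return merged

merged = merge_consecutive_roles([
    {"role": "assistant", "content": "Partial answer "},
    {"role": "assistant", "content": "{", "partial": True},
])
# The second message's partial flag does not survive the merge.
```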

@jundot jundot merged commit e9530f5 into jundot:main Mar 21, 2026