feat: add partial mode for assistant message prefill #306
When the final assistant message has `partial: true`, use
`continue_final_message=True` instead of `add_generation_prompt=True`
in apply_chat_template. This enables prefill patterns (JSON mode
via `"content": "{"`) and named-assistant persona consistency
(Kimi K2/K2.5 `name` field rendering).
Schema changes:
- Add `name` and `partial` fields to Message model
- Add `detect_and_strip_partial()` helper in api/utils.py
- Preserve `name`/`partial` through extract_text_content() and
extract_multimodal_content() message extraction
Engine changes:
- BatchedEngine._apply_chat_template(): conditional generation
prompt based on partial mode detection
- VLMBatchedEngine: strip partial field in both template methods
but always use add_generation_prompt=True (vision models do
not support continuation)
- MLXLanguageModel.chat(): same partial mode logic as BatchedEngine
Ported from mlx-openai-server PR jundot#213.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
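The toggle the commit describes can be sketched as follows. This is an illustrative reconstruction from the description above, not the PR's actual code; `detect_and_strip_partial` is named per the commit, while `apply_template` is a hypothetical wrapper.

```python
def detect_and_strip_partial(messages):
    """Illustrative sketch: report whether the final message is a partial
    assistant message, and strip `partial` from every message so it never
    reaches the chat template."""
    is_partial = bool(
        messages
        and messages[-1].get("role") == "assistant"
        and messages[-1].get("partial")
    )
    for msg in messages:
        msg.pop("partial", None)  # API-level field, not a template input
    return is_partial


def apply_template(tokenizer, messages):
    # Continue the prefilled assistant message instead of opening a new turn.
    is_partial = detect_and_strip_partial(messages)
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=not is_partial,
        continue_final_message=is_partial,
    )
```

When `partial` is absent or false, this reduces to the usual `add_generation_prompt=True` path, so existing requests behave unchanged.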
Part of me doesn't like the isolated pop loop in `detect_and_strip_partial()`, but the only clean way to avoid that is an explicit kwarg for partial mode on the engine. I'm happy to rework that, just let me know what your preference is.
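The explicit-kwarg alternative mentioned here might look like this. The signature is hypothetical, and the stub tokenizer in the usage example just echoes the kwargs it receives:

```python
def build_prompt(tokenizer, messages, *, partial=False):
    """Hypothetical alternative: the caller passes partial mode explicitly,
    so no pop loop over message dicts is needed before templating."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=not partial,
        continue_final_message=partial,
    )
```

The trade-off is that the API layer must then thread `partial` through every engine call site instead of letting the template helper discover it.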
Force-pushed from f6faf2f to c2beead
jundot left a comment
Thanks for this, clean implementation and solid test coverage. The partial/name flow through the extraction pipeline and engine layer looks well thought out, and the VLM handling (strip but always use `add_generation_prompt=True`) is the right call.
One minor thing: `_merge_consecutive_roles` merges assistant messages and only keeps the first message's dict, so a second consecutive assistant message's `partial: true` would get dropped before it reaches `detect_and_strip_partial`. Pretty unlikely to happen in practice though, so not a concern for this PR.
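The edge case can be reproduced with a toy merge. This is an assumed reconstruction of the behavior `_merge_consecutive_roles` is described as having, not the project's actual code:

```python
def merge_consecutive_roles(messages):
    """Assumed behavior per the review: consecutive same-role messages are
    merged by appending content into the first message's dict, so keys that
    exist only on later messages (like `partial`) are silently dropped."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += msg["content"]
        else:
            merged.append(dict(msg))
    return merged


msgs = [
    {"role": "user", "content": "Emit JSON."},
    {"role": "assistant", "content": "Sure: "},
    {"role": "assistant", "content": "{", "partial": True},
]
merged = merge_consecutive_roles(msgs)
# merged[-1] keeps only the first assistant dict's keys; `partial` is gone,
# so downstream partial detection never sees the flag.
```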
This PR implements support for Moonshot's "partial mode", an extension to the OpenAI chat completion API that enables message prefills and renaming of role anchors. This enables prefill patterns (JSON mode via `"content": "{"`) and named-assistant persona consistency (Kimi K2/K2.5 `name` field rendering).
Full disclosure: Moonshot has not publicly documented how "partial mode" is implemented on their API backend, but when you compare their documentation to the model templates for K2 and K2.5, it becomes pretty obvious what they're doing here.
The underlying infrastructure is already present in mlx_lm and the chat templates of Kimi K2 and K2.5. This change allows the `name` field to be passed through to the backend (in all circumstances), and reserves the `partial` field as a boolean toggle for switching between `continue_final_message=True` and `add_generation_prompt=True` in apply_chat_template. `partial` is consumed by the API server and not passed to the model template, unlike `name`. For more information on Partial Mode, refer to the following:
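A request exercising both fields might look like this. The model id and the assistant `name` value are placeholders, not from the PR:

```python
import json

payload = {
    "model": "kimi-k2",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Return the current user as JSON."},
        # Prefill "{" and set partial=True: the server continues this
        # message (continue_final_message) instead of opening a fresh
        # assistant turn, so the reply is forced to complete the JSON.
        {"role": "assistant", "name": "Kimi", "content": "{", "partial": True},
    ],
}
body = json.dumps(payload)
```

`name` is rendered by the K2/K2.5 chat template itself; `partial` is consumed server-side and never reaches the template.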
Schema changes:
- Add `name` and `partial` fields to Message model
- Add `detect_and_strip_partial()` helper in api/utils.py
- Preserve `name`/`partial` through extract_text_content() and extract_multimodal_content() message extraction
Engine changes:
- BatchedEngine._apply_chat_template(): conditional generation prompt based on partial mode detection
- VLMBatchedEngine: strip `partial` field in both template methods but always use `add_generation_prompt=True` (vision models do not support continuation)
- MLXLanguageModel.chat(): same partial mode logic as BatchedEngine
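The VLM rule could be sketched like so (`vlm_apply_template` is an assumed name; the point is only that `partial` is stripped but continuation is never enabled):

```python
def vlm_apply_template(processor, messages):
    """Sketch of the VLM path: accept and strip `partial` so requests stay
    schema-compatible, but always open a fresh assistant turn, since vision
    chat templates do not support continue_final_message."""
    for msg in messages:
        msg.pop("partial", None)
    return processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```

This keeps a prefill-carrying request from erroring against a vision model while leaving its generation behavior unchanged.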