xgrammar-based constrained decoding with per-model reasoning parser #335

Merged

jundot merged 1 commit into jundot:main on Mar 29, 2026
Conversation
Add grammar-constrained structured output generation using xgrammar,
with support for JSON schema, regex, EBNF grammar, and choice constraints
via the OpenAI-compatible `structured_outputs` API field.
A per-model `reasoning_parser` setting (configured via Admin UI) maps to
xgrammar's builtin structural tags, enabling grammar constraints to work
transparently with thinking/reasoning model protocols (Qwen's <think>,
Harmony's channel system, DeepSeek, Llama, etc.). When set, the user's
grammar is patched into the output slot of the structural tag so protocol
tokens flow freely while only the final output is constrained.
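The patching idea can be illustrated with a regex toy (this is NOT xgrammar's structural-tag mechanism, just an analogy; the pattern and test strings below are made up): the protocol wrapper stays unconstrained while the output slot carries the user's grammar.

```python
import re

# Toy analogy: user grammar = a choice constraint ("yes" or "no");
# protocol = a Qwen-style <think>...</think> block whose contents flow freely.
user_grammar = r"(yes|no)"
structural = rf"<think>.*?</think>\s*{user_grammar}"

ok = re.fullmatch(structural, "<think>hmm, probably</think> yes", flags=re.S)
bad = re.fullmatch(structural, "<think>hmm</think> maybe", flags=re.S)
print(bool(ok), bool(bad))  # True False
```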
Key components:
- GrammarConstraintProcessor: logits processor that applies xgrammar
bitmasks to mask invalid tokens at sampling time
- Batched grammar support via xgrammar.BatchGrammarMatcher for efficient
parallel bitmask computation across concurrent requests
- mx.async_eval overlap: bitmask computation (CPU) runs in parallel with
the model forward pass (GPU), making grammar overhead near-zero
- Per-model reasoning_parser setting exposed in Admin UI as a dropdown
- Auto-budget: thinking_budget is auto-set only when reasoning_parser is
configured, ensuring the model exits reasoning before grammar activates
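The masking step can be sketched in pure Python (a minimal illustration, not the actual MLX implementation; the toy vocabulary and logit values are made up):

```python
import math

def apply_token_bitmask(logits, bitmask):
    """Set logits of tokens disallowed by the grammar bitmask to -inf.

    The bitmask packs one bit per vocabulary token (1 = allowed),
    mirroring the packed bitmasks xgrammar produces; here it is a
    plain list of ints rather than an array on the GPU.
    """
    masked = []
    for token_id, logit in enumerate(logits):
        word, bit = divmod(token_id, 32)
        allowed = (bitmask[word] >> bit) & 1
        masked.append(logit if allowed else -math.inf)
    return masked

# Toy vocabulary of 4 tokens; only tokens 1 and 3 are grammar-legal.
logits = [0.5, 1.2, -0.3, 2.0]
bitmask = [0b1010]  # bits 1 and 3 set
print(apply_token_bitmask(logits, bitmask))
# tokens 0 and 2 are masked to -inf
```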
Performance (128 tokens, 60s time-boxed benchmark):
- Grammar overhead: 1.09x-1.24x across Qwen 4B, Gemma 4B, gpt-oss 20B
- TTFT unaffected by grammar constraints
- Overhead inversely correlates with model size (longer forward pass =
more overlap with bitmask computation)
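The overlap in the last point can be sketched with plain threads standing in for `mx.async_eval` (timings and names are illustrative; the real scheduler does not use `threading` like this):

```python
import threading
import time

def forward_pass(result):
    # Stand-in for the asynchronously evaluated model forward pass (GPU).
    time.sleep(0.05)
    result["logits"] = [0.1, 0.2, 0.3]

def compute_bitmask(result):
    # Stand-in for CPU-side grammar bitmask computation.
    time.sleep(0.02)
    result["bitmask"] = [0b111]

shared = {}
t = threading.Thread(target=forward_pass, args=(shared,))
t.start()                # kick off the forward pass (like mx.async_eval)
compute_bitmask(shared)  # the bitmask is computed on the CPU meanwhile
t.join()                 # forward pass completes; both results are ready
print(sorted(shared))    # ['bitmask', 'logits']
```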
Files changed:
- omlx/api/grammar.py (new): GrammarConstraintProcessor
- omlx/api/openai_models.py: StructuredOutputOptions, structured_outputs field
- omlx/engine/{base,batched,vlm}.py: grammar_compiler property
- omlx/model_settings.py: reasoning_parser field
- omlx/request.py: compiled_grammar in SamplingParams
- omlx/scheduler.py: batched grammar path with async overlap
- omlx/server.py: grammar compilation, structural tag patching, call sites
- omlx/admin/: reasoning_parser dropdown in UI, routes, dashboard.js
- pyproject.toml: xgrammar optional dependency
- tests/: 55 unit tests, live integration tests, performance benchmarks
Made-with: Cursor
Owner:
Thanks for this, merged and tested with a real model (Qwen3.5-35B-A3B-oQ4e). JSON schema, regex, and choice constraints all worked correctly. I made a follow-up change on top:
Owner:
@leuski v0.3.0rc1 is out with your xgrammar integration included. https://github.com/jundot/omlx/releases/tag/v0.3.0rc1 — if you get a chance, please give it a test and let me know if anything looks off. Thanks!
Author:
@jundot Looks good! Thanks.
xgrammar-based constrained decoding with per-model reasoning parser

Summary

This PR adds grammar-constrained structured output generation to oMLX using xgrammar. Users can request structured outputs (JSON schema, regex, EBNF grammar, choice lists) via the OpenAI-compatible `structured_outputs` field, and oMLX enforces them at the logit level during generation.

A key design decision is the per-model `reasoning_parser` setting, which maps to xgrammar's builtin structural tags. This lets grammar constraints work transparently with different model protocols — Qwen's `<think>...</think>`, Harmony's `<|channel|>analysis/final` system, DeepSeek, Llama, etc. — without any client-side changes.

How it works
- The client sends `structured_outputs: {"json": {...schema...}}`
- The server looks up the model's `reasoning_parser` setting (configured via Admin UI)
- If set, it calls `xgrammar.get_builtin_structural_tag(parser_name, reasoning=...)` to get the model's protocol structure, then patches the user's grammar into the output slot
- During decoding, `GrammarConstraintProcessor` applies token-level bitmasks to enforce the grammar
- Bitmask computation overlaps with the model forward pass via the `mx.async_eval` API
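A hedged sketch of what a request body carrying the `structured_outputs` field might look like (the model id and schema are placeholders; the field shape follows the vLLM-style convention described in this PR):

```python
import json

# A JSON-schema-constrained request to the OpenAI-compatible chat endpoint.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

body = {
    "model": "qwen3-4b",  # placeholder model id
    "messages": [{"role": "user", "content": "Give me a person as JSON."}],
    # vLLM-compatible field; "json" could instead be "regex", "grammar",
    # or "choice" per the supported grammar types below.
    "structured_outputs": {"json": schema},
}

payload = json.dumps(body)
print(payload[:40])
```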
Requests use the vLLM-compatible `structured_outputs` field in `extra_body`.

Supported grammar types:
- `json` — JSON schema (dict or string)
- `regex` — regular expression pattern
- `grammar` — EBNF/GBNF grammar string
- `choice` — list of allowed string values

Per-model reasoning_parser
Configured via the Admin UI dropdown (or `PUT /admin/api/models/{id}/settings`). Valid values: `qwen`, `qwen_coder`, `harmony`, `llama`, `deepseek_r1`, `deepseek_v3_2`, `kimi`, `minimax`, or empty (no structural tag).

When set, the auto-budget logic also kicks in: if the user doesn't specify a `thinking_budget`, one is auto-set so the model exits reasoning before the grammar activates.

Performance
Benchmarked with 128 max_tokens, 60-second time-boxed runs per configuration; full results in `tests/bench_results.csv`.

Files changed
- `omlx/api/grammar.py` (new): `GrammarConstraintProcessor` — logits processor with bitmask masking
- `omlx/server.py`: grammar compilation (`_patch_output_format`, `_compile_with_structural_tag`, `_compile_bare_grammar`), call sites, auto-budget
- `omlx/scheduler.py`: `BatchGrammarMatcher` + `mx.async_eval` overlap
- `omlx/request.py`: `compiled_grammar` field in `SamplingParams`
- `omlx/api/openai_models.py`: `StructuredOutputOptions`, `structured_outputs` field on `ChatCompletionRequest`
- `omlx/engine/{base,batched,vlm}.py`: `grammar_compiler` property (lazy `xgrammar.GrammarCompiler` init)
- `omlx/model_settings.py`: `reasoning_parser` field
- `omlx/admin/routes.py`: `reasoning_parser` in API
- `omlx/admin/_modal_model_settings.html`, `_settings.html`, `dashboard.js`
- `pyproject.toml`: `xgrammar` optional dependency (`pip install omlx[grammar]`)
- `tests/test_grammar.py`, `tests/test_grammar_live.py`, `tests/bench_grammar_bitmask.py`

Testing
Unit tests cover `_build_format_element`, `_patch_output_format`, `_compile_with_structural_tag`, `_compile_bare_grammar`, `_compile_grammar_for_request`, `GrammarConstraintProcessor`, the scheduler grammar path, batched grammar, and `_get_model_vocab_size`.

Dependencies
`xgrammar` — added as optional: `pip install 'omlx[grammar]'`. When not installed, structured output requests via `response_format` fall back to prompt injection (existing behavior); requests via `structured_outputs` return a 400 error explaining how to install.
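The documented fallback behavior can be sketched as follows (the function and its signature are hypothetical, not oMLX's actual code):

```python
def structured_outputs_status(has_xgrammar: bool, uses_structured_outputs: bool) -> int:
    """Sketch of the documented behavior: explicit `structured_outputs`
    requests need xgrammar installed; everything else proceeds normally
    (with `response_format` falling back to prompt injection)."""
    if uses_structured_outputs and not has_xgrammar:
        return 400  # tell the client to `pip install 'omlx[grammar]'`
    return 200

print(structured_outputs_status(has_xgrammar=False, uses_structured_outputs=True))  # 400
```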