
xgrammar-based constrained decoding with per-model reasoning parser#335

Merged
jundot merged 1 commit into jundot:main from leuski:xgrammar-integration
Mar 29, 2026

Conversation


@leuski leuski commented Mar 21, 2026

xgrammar-based constrained decoding with per-model reasoning parser

Summary

This PR adds grammar-constrained structured output generation to oMLX using xgrammar. Users can request structured outputs (JSON schema, regex, EBNF grammar, choice lists) via the OpenAI-compatible structured_outputs field, and oMLX enforces them at the logit level during generation.

A key design decision is the per-model reasoning_parser setting, which maps to xgrammar's builtin structural tags. This lets grammar constraints work transparently with different model protocols — Qwen's <think>...</think>, Harmony's <|channel|>analysis/final system, DeepSeek, Llama, etc. — without any client-side changes.

How it works

  1. Client sends a request with structured_outputs: {"json": {...schema...}}
  2. Server reads the model's reasoning_parser setting (configured via Admin UI)
  3. If set, calls xgrammar.get_builtin_structural_tag(parser_name, reasoning=...) to get the model's protocol structure, then patches the user's grammar into the output slot
  4. If not set, compiles a bare grammar (no structural wrapping)
  5. During generation, GrammarConstraintProcessor applies token-level bitmasks to enforce the grammar
  6. Bitmask computation runs in parallel with the model forward pass via mx.async_eval
Client request ──► reasoning_parser setting ──► get_builtin_structural_tag()
                                                        │
                                               Patch user grammar into
                                               output slot of tag
                                                        │
                                               compile_structural_tag()
                                                        │
                                               GrammarConstraintProcessor
                                               (bitmask per token, batched)
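The masking in step 5 can be illustrated with a minimal NumPy sketch. This is a conceptual stand-in for `GrammarConstraintProcessor`, not the actual implementation (which uses xgrammar's bitmask helpers and MLX arrays); the packed-uint32 layout mirrors xgrammar's token bitmasks, but the code is illustrative only:

```python
import numpy as np

def apply_token_bitmask(logits: np.ndarray, bitmask: np.ndarray) -> np.ndarray:
    """Set logits of grammar-invalid tokens to -inf so they cannot be sampled.

    `bitmask` packs one bit per vocabulary token (1 = allowed) into uint32
    words, similar to the packed layout xgrammar uses for token bitmasks.
    """
    vocab_size = logits.shape[-1]
    # Unpack the 32-bit words into one boolean flag per vocabulary token.
    bits = ((bitmask[:, None] >> np.arange(32)) & 1).reshape(-1)[:vocab_size]
    masked = logits.copy()
    masked[bits == 0] = -np.inf
    return masked

# Toy example: vocab of 5 tokens; the grammar allows only tokens 1 and 3.
logits = np.array([2.0, 0.5, 1.0, 3.0, -1.0])
bitmask = np.array([(1 << 1) | (1 << 3)], dtype=np.uint32)
masked = apply_token_bitmask(logits, bitmask)
print(int(np.argmax(masked)))  # greedy sampling now picks token 3
```

In the real processor this masking runs once per decode step, and step 6 hides its cost by computing the next bitmask on CPU while the GPU forward pass is in flight.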

API

Uses the vLLM-compatible structured_outputs field in extra_body:

response = client.chat.completions.create(
    model="Qwen3.5-4B-4bit",
    messages=[{"role": "user", "content": "Give me a person"}],
    extra_body={
        "structured_outputs": {
            "json": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"]
            }
        }
    }
)

Supported grammar types:

  • json — JSON schema (dict or string)
  • regex — regular expression pattern
  • grammar — EBNF/GBNF grammar string
  • choice — list of allowed string values
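For reference, the other three grammar types follow the same `extra_body` pattern as the JSON example above. The concrete values below are illustrative; only the field names come from the list above:

```python
# One structured_outputs payload per grammar type (use one key per request).
json_payload = {"structured_outputs": {"json": {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}}}
regex_payload = {"structured_outputs": {"regex": r"[A-Z]{2}-\d{4}"}}
grammar_payload = {"structured_outputs": {"grammar": 'root ::= "yes" | "no"'}}
choice_payload = {"structured_outputs": {"choice": ["red", "green", "blue"]}}
```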

Per-model reasoning_parser

Configured via the Admin UI dropdown (or PUT /admin/api/models/{id}/settings). Valid values: qwen, qwen_coder, harmony, llama, deepseek_r1, deepseek_v3_2, kimi, minimax, or empty (no structural tag).

When set, the auto-budget logic also kicks in: if the user doesn't specify a thinking_budget, one is auto-set so the model exits reasoning before the grammar activates.
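A hypothetical settings update against `PUT /admin/api/models/{id}/settings` might build its body like this. The `reasoning_parser` field name and the valid values come from above; the exact request shape beyond that field is an assumption:

```python
import json

# Valid values per the Admin UI dropdown (empty string = no structural tag).
VALID_PARSERS = {"qwen", "qwen_coder", "harmony", "llama",
                 "deepseek_r1", "deepseek_v3_2", "kimi", "minimax", ""}

def make_settings_payload(reasoning_parser: str) -> str:
    """Build an illustrative JSON body for the settings endpoint."""
    if reasoning_parser not in VALID_PARSERS:
        raise ValueError(f"unknown reasoning_parser: {reasoning_parser!r}")
    return json.dumps({"reasoning_parser": reasoning_parser})

print(make_settings_payload("qwen"))  # {"reasoning_parser": "qwen"}
```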

Performance

Benchmarked with 128 max_tokens, 60-second time-boxed runs per configuration:

| Model | Thinking | Grammar | Conc | Reqs/60s | TPS (mean ± std) | Overhead |
|-------------|----------|---------|------|----------|------------------|----------|
| Qwen3.5-4B | off | none | 1 | 60 | 127.6 ± 2.6 | |
| Qwen3.5-4B | off | json | 1 | 50 | 106.3 ± 2.0 | 1.20x |
| Qwen3.5-4B | on | none | 1 | 65 | 137.3 ± 3.2 | |
| Qwen3.5-4B | on | json | 1 | 53 | 110.5 ± 2.9 | 1.24x |
| Gemma-3-4B | off | none | 1 | 64 | 135.5 ± 2.6 | |
| Gemma-3-4B | off | json | 1 | 54 | 114.1 ± 1.7 | 1.19x |
| gpt-oss-20B | off | none | 1 | 20 | 130.3 ± 1.2 | |
| gpt-oss-20B | off | json | 1 | 6 | 104.7 ± 0.9 | 1.24x |
| gpt-oss-20B | on | none | 1 | 19 | 121.7 ± 2.7 | |
| gpt-oss-20B | on | json | 1 | 17 | 111.7 ± 0.6 | 1.09x |
  • Grammar overhead: 1.09x–1.24x (9–24% slower per request)
  • TTFT unaffected by grammar constraints
  • Overhead inversely correlates with model size — larger models have longer forward passes that better overlap with bitmask computation
  • Full CSV with all concurrency levels in tests/bench_results.csv
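The overhead column is simply the ratio of baseline TPS to grammar-constrained TPS for the matching row. For example, for Qwen3.5-4B with thinking off:

```python
# Overhead = baseline TPS / grammar TPS, Qwen3.5-4B (thinking off) row.
baseline_tps = 127.6
grammar_tps = 106.3
overhead = baseline_tps / grammar_tps
print(f"{overhead:.2f}x")  # 1.20x
```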

Files changed

| Area | Files | What |
|----------|-------|------|
| Core | omlx/api/grammar.py (new) | `GrammarConstraintProcessor` — logits processor with bitmask masking |
| Core | omlx/server.py | Grammar compilation, structural tag patching (`_patch_output_format`, `_compile_with_structural_tag`, `_compile_bare_grammar`), call sites, auto-budget |
| Core | omlx/scheduler.py | Batched grammar path with `BatchGrammarMatcher` + `mx.async_eval` overlap |
| Core | omlx/request.py | `compiled_grammar` field in `SamplingParams` |
| API | omlx/api/openai_models.py | `StructuredOutputOptions`, `structured_outputs` field on `ChatCompletionRequest` |
| Engine | omlx/engine/{base,batched,vlm}.py | `grammar_compiler` property (lazy `xgrammar.GrammarCompiler` init) |
| Settings | omlx/model_settings.py | `reasoning_parser` field |
| Admin | omlx/admin/routes.py | Wire `reasoning_parser` into the API |
| Admin | _modal_model_settings.html | Dropdown with parser choices |
| Admin | _settings.html | Display badge |
| Admin | dashboard.js | Wire into open/save |
| Config | pyproject.toml | xgrammar optional dependency (`pip install omlx[grammar]`) |
| Tests | tests/test_grammar.py | 55 unit tests |
| Tests | tests/test_grammar_live.py | Live integration + performance benchmarks |
| Tests | tests/bench_grammar_bitmask.py | Microbenchmark for bitmask strategies |

Testing

  • 55 unit tests covering _build_format_element, _patch_output_format, _compile_with_structural_tag, _compile_bare_grammar, _compile_grammar_for_request, GrammarConstraintProcessor, scheduler grammar path, batched grammar, and _get_model_vocab_size
  • Live integration tests verifying JSON schema, regex, and choice grammars produce correct output across Qwen, Gemma, and Harmony/OSS models
  • Time-boxed performance benchmarks at concurrency 1/2/4 with thinking on/off

Dependencies

  • xgrammar — added as optional: pip install 'omlx[grammar]'. When not installed, structured output requests via response_format fall back to prompt injection (existing behavior); requests via structured_outputs return a 400 error explaining how to install.
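The documented fallback behavior can be sketched as a small routing function. This is a simplified illustration of the rules stated above, not the server's actual code; in particular, the response strings are placeholders:

```python
def route_structured_request(has_structured_outputs: bool,
                             has_response_format: bool,
                             xgrammar_installed: bool) -> str:
    """Illustrate the documented fallback: structured_outputs hard-fails
    without xgrammar, while response_format degrades to prompt injection."""
    if has_structured_outputs and not xgrammar_installed:
        return "400: install xgrammar via pip install 'omlx[grammar]'"
    if has_response_format and not xgrammar_installed:
        return "prompt-injection fallback"
    return "grammar-constrained decoding"
```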

@leuski leuski force-pushed the xgrammar-integration branch from 9efa440 to 48c7449 on March 21, 2026 18:12
xgrammar-based constrained decoding with per-model reasoning parser

Add grammar-constrained structured output generation using xgrammar,
with support for JSON schema, regex, EBNF grammar, and choice constraints
via the OpenAI-compatible `structured_outputs` API field.

A per-model `reasoning_parser` setting (configured via Admin UI) maps to
xgrammar's builtin structural tags, enabling grammar constraints to work
transparently with thinking/reasoning model protocols (Qwen's <think>,
Harmony's channel system, DeepSeek, Llama, etc.). When set, the user's
grammar is patched into the output slot of the structural tag so protocol
tokens flow freely while only the final output is constrained.

Key components:

- GrammarConstraintProcessor: logits processor that applies xgrammar
  bitmasks to mask invalid tokens at sampling time
- Batched grammar support via xgrammar.BatchGrammarMatcher for efficient
  parallel bitmask computation across concurrent requests
- mx.async_eval overlap: bitmask computation (CPU) runs in parallel with
  the model forward pass (GPU), making grammar overhead near-zero
- Per-model reasoning_parser setting exposed in Admin UI as a dropdown
- Auto-budget: thinking_budget is auto-set only when reasoning_parser is
  configured, ensuring the model exits reasoning before grammar activates

Performance (128 tokens, 60s time-boxed benchmark):
- Grammar overhead: 1.09x-1.24x across Qwen 4B, Gemma 4B, gpt-oss 20B
- TTFT unaffected by grammar constraints
- Overhead inversely correlates with model size (longer forward pass =
  more overlap with bitmask computation)

Files changed:
- omlx/api/grammar.py (new): GrammarConstraintProcessor
- omlx/api/openai_models.py: StructuredOutputOptions, structured_outputs field
- omlx/engine/{base,batched,vlm}.py: grammar_compiler property
- omlx/model_settings.py: reasoning_parser field
- omlx/request.py: compiled_grammar in SamplingParams
- omlx/scheduler.py: batched grammar path with async overlap
- omlx/server.py: grammar compilation, structural tag patching, call sites
- omlx/admin/: reasoning_parser dropdown in UI, routes, dashboard.js
- pyproject.toml: xgrammar optional dependency
- tests/: 55 unit tests, live integration tests, performance benchmarks

Made-with: Cursor
@leuski leuski force-pushed the xgrammar-integration branch from 48c7449 to e74758d on March 21, 2026 18:21
@jundot jundot force-pushed the main branch 10 times, most recently from 65b4ef1 to 2e39d71 on March 28, 2026 01:20
@jundot jundot merged commit 8939303 into jundot:main Mar 29, 2026
Owner

jundot commented Mar 29, 2026

Thanks for this, merged and tested with a real model (Qwen3.5-35B-A3B-oQ4e). JSON schema, regex, and choice constraints all worked correctly.

I made a follow-up change on top:

fde9794 refactor: consolidate grammar utils, make xgrammar a core dependency

  • Moved xgrammar from optional [grammar] extra to core dependencies (also added to venvstacks.toml for the app bundle)
  • Extracted unwrap_tokenizer and resolve_vocab_size into utils/tokenizer.py to deduplicate the 3 copies across batched.py, vlm.py, and scheduler.py
  • Added create_grammar_compiler factory in api/grammar.py so both engines share the same init logic

Owner

jundot commented Mar 29, 2026

@leuski v0.3.0rc1 is out with your xgrammar integration included. https://github.com/jundot/omlx/releases/tag/v0.3.0rc1 — if you get a chance, please give it a test and let me know if anything looks off. thanks!

Author

leuski commented Mar 30, 2026

@jundot Looks good! Thanks.

@leuski leuski deleted the xgrammar-integration branch March 30, 2026 00:37