
xgrammar-based constrained decoding with per-model reasoning parser#335

Merged
jundot merged 1 commit into jundot:main from leuski:xgrammar-integration
Mar 29, 2026

Conversation


@leuski leuski commented Mar 21, 2026

xgrammar-based constrained decoding with per-model reasoning parser

Summary

This PR adds grammar-constrained structured output generation to oMLX using xgrammar. Users can request structured outputs (JSON schema, regex, EBNF grammar, choice lists) via the OpenAI-compatible structured_outputs field, and oMLX enforces them at the logit level during generation.

A key design decision is the per-model reasoning_parser setting, which maps to xgrammar's builtin structural tags. This lets grammar constraints work transparently with different model protocols — Qwen's <think>...</think>, Harmony's <|channel|>analysis/final system, DeepSeek, Llama, etc. — without any client-side changes.

How it works

  1. Client sends a request with structured_outputs: {"json": {...schema...}}
  2. Server reads the model's reasoning_parser setting (configured via Admin UI)
  3. If set, calls xgrammar.get_builtin_structural_tag(parser_name, reasoning=...) to get the model's protocol structure, then patches the user's grammar into the output slot
  4. If not set, compiles a bare grammar (no structural wrapping)
  5. During generation, GrammarConstraintProcessor applies token-level bitmasks to enforce the grammar
  6. Bitmask computation runs in parallel with the model forward pass via mx.async_eval
Client request ──► reasoning_parser setting ──► get_builtin_structural_tag()
                                                        │
                                               Patch user grammar into
                                               output slot of tag
                                                        │
                                               compile_structural_tag()
                                                        │
                                               GrammarConstraintProcessor
                                               (bitmask per token, batched)
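The masking in step 5 can be illustrated with a minimal NumPy sketch. This is a conceptual stand-in for `GrammarConstraintProcessor`, not the actual implementation (which uses xgrammar's bitmask helpers and MLX arrays); the packed-uint32 layout mirrors xgrammar's token bitmasks, but the code is illustrative only:

```python
import numpy as np

def apply_token_bitmask(logits: np.ndarray, bitmask: np.ndarray) -> np.ndarray:
    """Set logits of grammar-invalid tokens to -inf so they cannot be sampled.

    `bitmask` packs one bit per vocabulary token (1 = allowed) into uint32
    words, similar to the packed layout xgrammar uses for token bitmasks.
    """
    vocab_size = logits.shape[-1]
    # Unpack the 32-bit words into one boolean flag per vocabulary token.
    bits = ((bitmask[:, None] >> np.arange(32)) & 1).reshape(-1)[:vocab_size]
    masked = logits.copy()
    masked[bits == 0] = -np.inf
    return masked

# Toy example: vocab of 5 tokens; the grammar allows only tokens 1 and 3.
logits = np.array([2.0, 0.5, 1.0, 3.0, -1.0])
bitmask = np.array([(1 << 1) | (1 << 3)], dtype=np.uint32)
masked = apply_token_bitmask(logits, bitmask)
print(int(np.argmax(masked)))  # greedy sampling now picks token 3
```

In the real processor this masking runs once per decode step, and step 6 hides its cost by computing the next bitmask on CPU while the GPU forward pass is in flight.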

API

Uses the vLLM-compatible structured_outputs field in extra_body:

response = client.chat.completions.create(
    model="Qwen3.5-4B-4bit",
    messages=[{"role": "user", "content": "Give me a person"}],
    extra_body={
        "structured_outputs": {
            "json": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"]
            }
        }
    }
)

Supported grammar types:

  • json — JSON schema (dict or string)
  • regex — regular expression pattern
  • grammar — EBNF/GBNF grammar string
  • choice — list of allowed string values
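For reference, the other three grammar types follow the same `extra_body` pattern as the JSON example above. The concrete values below are illustrative; only the field names come from the list above:

```python
# One structured_outputs payload per grammar type (use one key per request).
json_payload = {"structured_outputs": {"json": {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}}}
regex_payload = {"structured_outputs": {"regex": r"[A-Z]{2}-\d{4}"}}
grammar_payload = {"structured_outputs": {"grammar": 'root ::= "yes" | "no"'}}
choice_payload = {"structured_outputs": {"choice": ["red", "green", "blue"]}}
```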

Per-model reasoning_parser

Configured via the Admin UI dropdown (or PUT /admin/api/models/{id}/settings). Valid values: qwen, qwen_coder, harmony, llama, deepseek_r1, deepseek_v3_2, kimi, minimax, or empty (no structural tag).

When set, the auto-budget logic also kicks in: if the user doesn't specify a thinking_budget, one is auto-set so the model exits reasoning before the grammar activates.
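A hypothetical settings update against `PUT /admin/api/models/{id}/settings` might build its body like this. The `reasoning_parser` field name and the valid values come from above; the exact request shape beyond that field is an assumption:

```python
import json

# Valid values per the Admin UI dropdown (empty string = no structural tag).
VALID_PARSERS = {"qwen", "qwen_coder", "harmony", "llama",
                 "deepseek_r1", "deepseek_v3_2", "kimi", "minimax", ""}

def make_settings_payload(reasoning_parser: str) -> str:
    """Build an illustrative JSON body for the settings endpoint."""
    if reasoning_parser not in VALID_PARSERS:
        raise ValueError(f"unknown reasoning_parser: {reasoning_parser!r}")
    return json.dumps({"reasoning_parser": reasoning_parser})

print(make_settings_payload("qwen"))  # {"reasoning_parser": "qwen"}
```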

Performance

Benchmarked with 128 max_tokens, 60-second time-boxed runs per configuration:

| Model | Thinking | Grammar | Conc | Reqs/60s | TPS (mean ± std) | Overhead |
|-------------|----------|---------|------|----------|------------------|----------|
| Qwen3.5-4B | off | none | 1 | 60 | 127.6 ± 2.6 | |
| Qwen3.5-4B | off | json | 1 | 50 | 106.3 ± 2.0 | 1.20x |
| Qwen3.5-4B | on | none | 1 | 65 | 137.3 ± 3.2 | |
| Qwen3.5-4B | on | json | 1 | 53 | 110.5 ± 2.9 | 1.24x |
| Gemma-3-4B | off | none | 1 | 64 | 135.5 ± 2.6 | |
| Gemma-3-4B | off | json | 1 | 54 | 114.1 ± 1.7 | 1.19x |
| gpt-oss-20B | off | none | 1 | 20 | 130.3 ± 1.2 | |
| gpt-oss-20B | off | json | 1 | 6 | 104.7 ± 0.9 | 1.24x |
| gpt-oss-20B | on | none | 1 | 19 | 121.7 ± 2.7 | |
| gpt-oss-20B | on | json | 1 | 17 | 111.7 ± 0.6 | 1.09x |
  • Grammar overhead: 1.09x–1.24x (9–24% slower per request)
  • TTFT unaffected by grammar constraints
  • Overhead inversely correlates with model size — larger models have longer forward passes that better overlap with bitmask computation
  • Full CSV with all concurrency levels in tests/bench_results.csv
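The overhead column is simply the ratio of baseline TPS to grammar-constrained TPS for the matching row. For example, for Qwen3.5-4B with thinking off:

```python
# Overhead = baseline TPS / grammar TPS, Qwen3.5-4B (thinking off) row.
baseline_tps = 127.6
grammar_tps = 106.3
overhead = baseline_tps / grammar_tps
print(f"{overhead:.2f}x")  # 1.20x
```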

Files changed

| Area | Files | What |
|----------|-------|------|
| Core | omlx/api/grammar.py (new) | `GrammarConstraintProcessor` — logits processor with bitmask masking |
| Core | omlx/server.py | Grammar compilation, structural tag patching (`_patch_output_format`, `_compile_with_structural_tag`, `_compile_bare_grammar`), call sites, auto-budget |
| Core | omlx/scheduler.py | Batched grammar path with `BatchGrammarMatcher` + `mx.async_eval` overlap |
| Core | omlx/request.py | `compiled_grammar` field in `SamplingParams` |
| API | omlx/api/openai_models.py | `StructuredOutputOptions`, `structured_outputs` field on `ChatCompletionRequest` |
| Engine | omlx/engine/{base,batched,vlm}.py | `grammar_compiler` property (lazy `xgrammar.GrammarCompiler` init) |
| Settings | omlx/model_settings.py | `reasoning_parser` field |
| Admin | omlx/admin/routes.py | Wire `reasoning_parser` into the API |
| Admin | _modal_model_settings.html | Dropdown with parser choices |
| Admin | _settings.html | Display badge |
| Admin | dashboard.js | Wire into open/save |
| Config | pyproject.toml | xgrammar optional dependency (`pip install omlx[grammar]`) |
| Tests | tests/test_grammar.py | 55 unit tests |
| Tests | tests/test_grammar_live.py | Live integration + performance benchmarks |
| Tests | tests/bench_grammar_bitmask.py | Microbenchmark for bitmask strategies |

Testing

  • 55 unit tests covering _build_format_element, _patch_output_format, _compile_with_structural_tag, _compile_bare_grammar, _compile_grammar_for_request, GrammarConstraintProcessor, scheduler grammar path, batched grammar, and _get_model_vocab_size
  • Live integration tests verifying JSON schema, regex, and choice grammars produce correct output across Qwen, Gemma, and Harmony/OSS models
  • Time-boxed performance benchmarks at concurrency 1/2/4 with thinking on/off

Dependencies

  • xgrammar — added as optional: pip install 'omlx[grammar]'. When not installed, structured output requests via response_format fall back to prompt injection (existing behavior); requests via structured_outputs return a 400 error explaining how to install.
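The documented fallback behavior can be sketched as a small routing function. This is a simplified illustration of the rules stated above, not the server's actual code; in particular, the response strings are placeholders:

```python
def route_structured_request(has_structured_outputs: bool,
                             has_response_format: bool,
                             xgrammar_installed: bool) -> str:
    """Illustrate the documented fallback: structured_outputs hard-fails
    without xgrammar, while response_format degrades to prompt injection."""
    if has_structured_outputs and not xgrammar_installed:
        return "400: install xgrammar via pip install 'omlx[grammar]'"
    if has_response_format and not xgrammar_installed:
        return "prompt-injection fallback"
    return "grammar-constrained decoding"
```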

@leuski leuski force-pushed the xgrammar-integration branch from 9efa440 to 48c7449 on March 21, 2026 18:12
xgrammar-based constrained decoding with per-model reasoning parser

Add grammar-constrained structured output generation using xgrammar,
with support for JSON schema, regex, EBNF grammar, and choice constraints
via the OpenAI-compatible `structured_outputs` API field.

A per-model `reasoning_parser` setting (configured via Admin UI) maps to
xgrammar's builtin structural tags, enabling grammar constraints to work
transparently with thinking/reasoning model protocols (Qwen's <think>,
Harmony's channel system, DeepSeek, Llama, etc.). When set, the user's
grammar is patched into the output slot of the structural tag so protocol
tokens flow freely while only the final output is constrained.

Key components:

- GrammarConstraintProcessor: logits processor that applies xgrammar
  bitmasks to mask invalid tokens at sampling time
- Batched grammar support via xgrammar.BatchGrammarMatcher for efficient
  parallel bitmask computation across concurrent requests
- mx.async_eval overlap: bitmask computation (CPU) runs in parallel with
  the model forward pass (GPU), making grammar overhead near-zero
- Per-model reasoning_parser setting exposed in Admin UI as a dropdown
- Auto-budget: thinking_budget is auto-set only when reasoning_parser is
  configured, ensuring the model exits reasoning before grammar activates

Performance (128 tokens, 60s time-boxed benchmark):
- Grammar overhead: 1.09x-1.24x across Qwen 4B, Gemma 4B, gpt-oss 20B
- TTFT unaffected by grammar constraints
- Overhead inversely correlates with model size (longer forward pass =
  more overlap with bitmask computation)

Files changed:
- omlx/api/grammar.py (new): GrammarConstraintProcessor
- omlx/api/openai_models.py: StructuredOutputOptions, structured_outputs field
- omlx/engine/{base,batched,vlm}.py: grammar_compiler property
- omlx/model_settings.py: reasoning_parser field
- omlx/request.py: compiled_grammar in SamplingParams
- omlx/scheduler.py: batched grammar path with async overlap
- omlx/server.py: grammar compilation, structural tag patching, call sites
- omlx/admin/: reasoning_parser dropdown in UI, routes, dashboard.js
- pyproject.toml: xgrammar optional dependency
- tests/: 55 unit tests, live integration tests, performance benchmarks

Made-with: Cursor
@leuski leuski force-pushed the xgrammar-integration branch from 48c7449 to e74758d on March 21, 2026 18:21
@jundot jundot force-pushed the main branch 10 times, most recently from 65b4ef1 to 2e39d71 on March 28, 2026 01:20
@jundot jundot merged commit 8939303 into jundot:main Mar 29, 2026
Owner

jundot commented Mar 29, 2026

Thanks for this, merged and tested with a real model (Qwen3.5-35B-A3B-oQ4e). JSON schema, regex, and choice constraints all worked correctly.

I made a follow-up change on top:

fde9794 refactor: consolidate grammar utils, make xgrammar a core dependency

  • Moved xgrammar from optional [grammar] extra to core dependencies (also added to venvstacks.toml for the app bundle)
  • Extracted unwrap_tokenizer and resolve_vocab_size into utils/tokenizer.py to deduplicate the 3 copies across batched.py, vlm.py, and scheduler.py
  • Added create_grammar_compiler factory in api/grammar.py so both engines share the same init logic

Owner

jundot commented Mar 29, 2026

@leuski v0.3.0rc1 is out with your xgrammar integration included. https://github.com/jundot/omlx/releases/tag/v0.3.0rc1 — if you get a chance, please give it a test and let me know if anything looks off. thanks!

Author

leuski commented Mar 30, 2026

@jundot Looks good! Thanks.

@leuski leuski deleted the xgrammar-integration branch March 30, 2026 00:37