Skip to content

jinja : implement mixed type object keys#18955

Merged
CISC merged 17 commits intomasterfrom
cisc/jinja-mixed-type-object-keys
Jan 27, 2026
Merged

jinja : implement mixed type object keys#18955
CISC merged 17 commits intomasterfrom
cisc/jinja-mixed-type-object-keys

Conversation

@CISC
Copy link
Copy Markdown
Member

@CISC CISC commented Jan 20, 2026

Allow all hashable types as object keys, taking care to replicate special python/jinja behavior between int/float/bool.

Fixed array/object output with string filter.

Fixed object tojson output (did not properly escape key string).

Fixed object item order when replacing an item.

Copy link
Copy Markdown
Contributor

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rethinking about this approach a bit, probably we should drop the map::unordered, keep all the data inside std::vector<std::pair<value, value>> ordered, and always perform a linear search when we access an object.

Ofc, that will be slower, but realistically a template in the wild never has more than 50 or even 100 keys inside an object, so it's probably fine.

Comment thread common/jinja/value.h Outdated
Comment thread common/jinja/value.h Outdated
Comment thread common/jinja/value.h Outdated
if (std::all_of(vec.begin(), vec.end(), [&](auto ikv) -> bool {
return is_hashable(std::get<2>(ikv));
})) {
key_type = "NamedTuple";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think kwarg is a good alternative to NamedTuple

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, possibly.

Comment thread tests/test-jinja.cpp Outdated
Comment thread common/jinja/caps.cpp Outdated
@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented Jan 20, 2026

Btw, the value.h header is becoming larger. We should move long function definitions into value.cpp, while keeping one-liner definition inside the header (for more visibility)

@github-actions github-actions Bot added testing Everything test related jinja parser Issues related to the jinja parser labels Jan 20, 2026
Comment thread common/jinja/value.h Outdated
Comment thread common/jinja/value.h Outdated
@CISC
Copy link
Copy Markdown
Member Author

CISC commented Jan 22, 2026

Phew, this led me down quite a few rabbit holes (how to use hash specialization in a class of the specialized type, c++ standard lacking hash_combine drama, virtual equality overload pitfalls, what have you), now it finally all makes sense and even compiles, but unfortunately does not work, more fun for tomorrow! :)

@CISC
Copy link
Copy Markdown
Member Author

CISC commented Jan 25, 2026

Massive refactoring done, now has proper value hashing and equality operators (equivalence and strict (non-)equal). Added real value_tuple so that we can allow only immutable types as object keys.

Will be easy to add further operators for sorting in follow-up PR.

Lessons learned:

  • You can't override specialized hash of shared_ptr
  • You can't forward declare hashing functor of unordered_map
  • The rest has been repressed, never to be mentioned again

Hopefully not too many mistakes made, but for some reason test-chat-template fails, so obviously some mistakes present, will investigate further unless someone brighter spots my error first. :)

@CISC
Copy link
Copy Markdown
Member Author

CISC commented Jan 25, 2026

for some reason test-chat-template fails, so obviously some mistakes present, will investigate further unless someone brighter spots my error first. :)

Actually, it's the template that's faulty (not sure why this worked before?), we set tools = [], and the Mistral-Large-Instruct-2407 template does this:

{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}
{%- if tools is not none and (message == user_messages[-1]) %}
    {{- "[AVAILABLE_TOOLS] [" }}
    {%- for tool in tools %}
        ...
    {%- endfor %}
    {{- "[/AVAILABLE_TOOLS]" }}
{%- endif %}

@CISC CISC requested a review from ngxson January 25, 2026 12:32
Copy link
Copy Markdown
Contributor

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty close! Just need to optimize the a little bit by avoiding relying too much on string.

Btw, your hash_bytes with seed is in some ways an implementation of hash_combine, you can combine hash of multiple elements by:

size_t cur_hash = 0; // or maybe something else
for (auto elem : arr) {
  cur_hash = hash_bytes(/* seed */ cur_hash, elem.unique_hash());
}
return cur_hash;

Doing str() or as_repr() uses a lot more memory. An example can be:

my_tuple = ("some string")
my_outer_tuple = (my_arr, my_arr, my_arr, my_arr, my_arr)

Calculating hash of my_outer_tuple will now use 5 times more memory than needed, because as_repr need to allocate memory for all strings, even when they points to the same memory

Comment thread common/jinja/value.h Outdated
Comment thread common/jinja/value.h Outdated
Comment thread common/jinja/value.h Outdated
Comment thread common/jinja/value.h Outdated
@CISC
Copy link
Copy Markdown
Member Author

CISC commented Jan 26, 2026

Btw, your hash_bytes with seed is in some ways an implementation of hash_combine, you can combine hash of multiple elements by:

size_t cur_hash = 0; // or maybe something else
for (auto elem : arr) {
  cur_hash = hash_bytes(/* seed */ cur_hash, elem.unique_hash());
}
return cur_hash;

Yep, that was the point, but I see I made a footgun, never start with an initial seed (unless it is a previous hash).

Calculating hash of my_outer_tuple will now use 5 times more memory than needed, because as_repr need to allocate memory for all strings, even when they points to the same memory

Very good point, I'll look into your suggestions and see what can be done.

Comment thread common/jinja/value.h
Comment on lines +494 to +504
virtual string as_string() const override {
std::ostringstream ss;
ss << "{";
for (size_t i = 0; i < val_obj.size(); i++) {
if (i > 0) ss << ", ";
auto & [key, val] = val_obj.at(i);
ss << value_to_string_repr(key) << ": " << value_to_string_repr(val);
}
ss << "}";
return ss.str();
}
Copy link
Copy Markdown
Contributor

@ngxson ngxson Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just note that I initially don't allow as_repr or to_string to be recursive, because it can go into an infinitive loop if the object/array entity is pointed back to itself:

obj = {}
obj["a"] = obj

# or even harder to detect, nested circular
obj["a"] = {"b": obj}

a bit ironically that this is not actually classified as a vulnerability. just sometimes program are actually coded this way, and it is indeed a very common practice in high-level languages like javascript or python:

var node = {"parent": null}
node["child"] = {
  "parent" : node,
  "child": null,
};
// now, JSON.stringify(node) will throw:
// TypeError: Converting circular structure to JSON

btw, may worth a fix for to_json as it can also stuck on this case. IIRC javascript simply throw an error if it detect circular in an object

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, very aware of this, decided to ignore it for now. :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested circulars are usually bypassed by keeping a reference of every encountered object that can nest when processing and simply skip/print ... when coming across one that has already been visited. Will follow up if no-one else does.

@CISC
Copy link
Copy Markdown
Member Author

CISC commented Jan 26, 2026

@ngxson Ok, be80c62 should improve things, though I don't see how we can avoid composing strings as the parts may not be the same while the whole is.

Edit: I mean for equality, for hash it is fine processing parts due to the nature of FNV-1a.

@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented Jan 26, 2026

I'll work on this a bit, will push some optimizations directly here

Comment thread common/jinja/utils.h Outdated
Comment on lines +57 to +74
static size_t hash_bytes(size_t seed, void const * bytes, size_t len, Args&&... args) noexcept
{
static_assert(sizeof...(args) % 2 == 0);
static constexpr size_t prime = size_t_digits == 64 ? 0x100000001b3 : 0x01000193;

unsigned char const * c = static_cast<unsigned char const *>(bytes);
unsigned char const * const e = c + len;

for (; c < e; ++c) {
seed = (seed ^ *c) * prime;
}

if constexpr (sizeof...(args) > 0) {
seed = hash_bytes(seed, std::forward<Args>(args)...);
}

return seed;
}
Copy link
Copy Markdown
Contributor

@ngxson ngxson Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking out loud in math terms: if we consider hash_bytes(seed, data, size) as a (surjective) function f(s, x) with s = seed and x = tuple(data, size)

given input data x0, x1, x2:

s = initial_seed
hash = f(f(f(s, x0), x1), x2)

and your "convenient" function hash_bytes(x0, x1, ...) can be defined as g:

g() = s
g(x0) = f(s, x0)
g(x0, x1) = f(f(s, x0), x1)
...

therefore, g is still surjective and g(x0, x1) != g(x1, x0) which is so far so good.

but now, consider: f(x0 ~ x1) == g(x0, x1) with ~ the "concatenation" operation: as long as x0 ~ x1 == x2 ~ x3, then g(x0, x1) == g(x2, x3), which is expected when calculating hash of string parts (where each part is different but concatenated version is the same). technically says, this make g no longer a good hash function, since we have a known set of collisions. but in our context this is acceptable. FNV-1a is not that good anyway.

(note: the property f(x0 ~ x1) == g(x0, x1) can be derived from the implementation and can be considered as a postulation in this context)

in practice, to prevent this, most hash functions use some kind of internal state such that: hash(s, x) = output(s, mix(x)), so hash(x0 ~ x1) != hash(x1 ~ x0). we can implement this easily but it's not really necessary. we are not doing cryptographic hash anyway.

end of the math part. now, in term of speed, what I think can be nice & fun to make hash_bytes to operate on blocks of data instead of one byte at a time. because one update depends on the prior state, CPU will have hard time with out-of-order execution. a simple vectorization should help a lot. just the case of x0 ~ x1 will be a bit complicated. I'll see how to do that.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FNV-1a certainly has its flaws, it was mainly chosen for its simplicity and chainability, it performs well enough in terms of quality and speed here I think.

Feel free to improve upon it if you wish though. :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I ended up refactor this into a new struct called hasher(), it is inspired by nodejs's crypto.createHash()

// Old notion:
size_t seed = hash_bytes(data0, len0);
       seed = hash_bytes(data1, len1);
       // ...
size_t output = hash_bytes(seed, dataN, lenN);

// New notion
hasher hash = hasher().update(data0, len0).update(data1, len1); // ...
size_t output = hash.digest();

With this type of notion, we can in theory implement the notion of "internal state", though I won't add it because it's unnecessary (just mentioning here for completeness)

The new notion should have the exact same mathematical properties as the old one (reflected by the new test cases in test-jinja)

The result will be different from the old one though, because we are processing block of N bytes at once (on 64-bit system, we process 64/8 = 8 bytes). I expect most compilers will know how to vectorize it. If data is not a multiple of the block size, they will be buffered.

I'll do another pass tomorrow to see if there are any other clean up needed. In the meantime, feel free to review my last commit and lmk if anything seems off to you.

Comment thread common/jinja/utils.h Outdated
Comment thread common/jinja/utils.h Outdated
Comment thread common/jinja/value.h
Comment thread common/jinja/value.h
Comment on lines +444 to +447
for (const auto & val : val_arr) {
const size_t val_hash = val->unique_hash().digest();
hash.update(&val_hash, sizeof(size_t));
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should allow unique_hash to be passed a hasher so that arrays and objects can just update it instead of hashing the hashes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot leave a comment here, but that won't work in this case: ["ab", "c"] vs ["a", "bc"]

the case of reusing state is mostly used by string parts, I don't think it's valid anywhere else

Copy link
Copy Markdown
Member Author

@CISC CISC Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's actually a problem as typeid is hashed inbetween those.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mathematically say, hashing the digest of elements vs letting each element to update are (quasi-)equivalent in our case, because the digest() only do one job: to add padding to the input data.

my current version is still mathematically equivalent to adding a typeid in-between, while being more simple than having to add a version of unique_hash that takes another hasher as input.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this code path is quite rarely used in practice (in case of using tuple as key), so I prefer to keep it simple for now. it should still be efficient enough doing this way.

Comment thread common/jinja/value.h
Copy link
Copy Markdown
Contributor

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to merge when you're ready

@CISC
Copy link
Copy Markdown
Member Author

CISC commented Jan 27, 2026

Feel free to merge when you're ready

Seems to need cstring include, also still think #18955 (comment) is a good idea (see my last comment).

@CISC
Copy link
Copy Markdown
Member Author

CISC commented Jan 27, 2026

Ok, all good, will merge when CI is done.

@CISC CISC merged commit 2b4cbd2 into master Jan 27, 2026
75 of 78 checks passed
@CISC CISC deleted the cisc/jinja-mixed-type-object-keys branch January 27, 2026 18:50
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
* implement mixed type object keys

* add tests

* refactor

* minor fixes

* massive refactor

* add more tests

* forgotten tuples

* fix array/object is_hashable

* correct (albeit broken) jinja responses

verified with transformers

* improved hashing and equality

* refactor hash function

* more exhausive test case

* clean up

* cont

* cont (2)

* missing cstring

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* implement mixed type object keys

* add tests

* refactor

* minor fixes

* massive refactor

* add more tests

* forgotten tuples

* fix array/object is_hashable

* correct (albeit broken) jinja responses

verified with transformers

* improved hashing and equality

* refactor hash function

* more exhausive test case

* clean up

* cont

* cont (2)

* missing cstring

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
… and new jinja template engine (ggml-org#1369)

---------

Co-authored-by: Piotr Wilkin <[email protected]>

common : add nemotron 3 parsing (ggml-org#18077)

common : add parser for ministral/mistral large 3/devstral 2 (ggml-org#17713)

common : default content to an empty string (ggml-org#18485)

chat: make tool description and parameters optional per OpenAI spec (ggml-org#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix ggml-org#17667

common : implement new jinja template engine (ggml-org#18462)
---------

Co-authored-by: Alde Rojas <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

jinja: correct member access rule (ggml-org#18905)

jinja : fix lexing of float literals with sign (ggml-org#18901)

jinja : add missing tojson filter for bool (ggml-org#18900)

jinja : attribute support for join, map and sort (ggml-org#18883)

jinja : fix object item order (and properly implement dictsort) (ggml-org#18904)

tests : add test-jinja -py option for cross-checking (ggml-org#18906)

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>

ci : run test-jinja -py on high perf [no ci] (ggml-org#18916)

jinja : fix undefined keys and attributes and int/float as bool (ggml-org#18924)

jinja: support none|string (ggml-org#18995)

Co-authored-by: Sigbjørn Skjæret <[email protected]>

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>

jinja : implement mixed type object keys (ggml-org#18955)

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (ggml-org#19147)

`tojson` is not a supported `undefined` filter

keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (ggml-org#19176)

jinja : add unordered_map include to value.h [no ci] (ggml-org#19205)

jinja : add missing 'in' test to template engine (ggml-org#19004) (ggml-org#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

---------

Co-authored-by: Sid Mohan <[email protected]>
Co-authored-by: Claude Opus 4.5 <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>

Add Jinja support for "indent" string filter (ggml-org#19529)

Co-authored-by: Sigbjørn Skjæret <[email protected]>

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>

add vendor

refactor chat

server : support preserving reasoning_content in assistant message (ggml-org#18994)

chat : fix translategemma crash on common_chat_format_example (ggml-org#19019)

chat: fix language input for translategemma (ggml-org#19052)

Co-authored-by: Aldehir Rojas <[email protected]>

---------

Co-authored-by: Aldehir Rojas <[email protected]>

chat: fix case where template accepts type content only (ggml-org#19419)

mtmd : chat : Fix extra \n between text and media marker (ggml-org#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.

However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.

PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

Co-authored-by: Piotr Wilkin (ilintar) <[email protected]>

---------

Co-authored-by: Piotr Wilkin (ilintar) <[email protected]>

common : merge qwen3-coder and nemotron nano 3 parsers (ggml-org#19765)

common : fix improper trimming in XML parser on complete message (ggml-org#19805)

Co-authored-by: Jules LEIDELINGER <[email protected]>

jinja: correct stats for tojson and string filters (ggml-org#19785)

jinja : correct default size for string slices (ggml-org#19913)

common : handle unicode during partial json parsing (ggml-org#16526)

common : fix json schema with '\' in literals (ggml-org#17307)

add back qwen_coder_xml and mirothinker

Co-authored-by: Aldehir Rojas <[email protected]>
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
* implement mixed type object keys

* add tests

* refactor

* minor fixes

* massive refactor

* add more tests

* forgotten tuples

* fix array/object is_hashable

* correct (albeit broken) jinja responses

verified with transformers

* improved hashing and equality

* refactor hash function

* more exhausive test case

* clean up

* cont

* cont (2)

* missing cstring

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jinja parser Issues related to the jinja parser testing Everything test related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants