Use the SipHash-1-3 function for randomized hash tables#9764
Use the SipHash-1-3 function for randomized hash tables#9764xavierleroy wants to merge 1 commit intoocaml:trunkfrom
Conversation
|
This PR is now ready for review. |
runtime/hash.c
Outdated
|
|
||
| /* Mix an OCaml string */ | ||
|
|
||
| static void sip_string(struct sip_state * st, value s) |
There was a problem hiding this comment.
It might be worth changing these functions to avoid accessing sip_state through a pointer. GCC introduces spills of the siphash state in the inner loop, because it can't see that the state and the string don't alias.
|
I'm afraid that there are still seed-independent collisions with this hash! Here's an example: let z = "\000\000\000\000\000\000\000\000"
let w = String.init (31 * 8) (fun _ -> '!')
let a = ("", w ^ z)
let b = (z ^ w, "")
let () =
let seed = Random.bits () in
Printf.printf "%08x %08x %b\n"
(Hashtbl.seeded_hash seed a)
(Hashtbl.seeded_hash seed b)
(a = b)produces: The issue is that the encoding of strings is ambiguous - there are distinct values that feed the same sequence of 64-bit words to SipHash. |
|
Interesting! The initial claims about seed-independent collisions were focused on strings as hash keys, not structured OCaml values. I agree the latter raise other issues. It is certainly possible to linearize a structure value into a sequence of 64-bit words in an injective manner -- that's what the marshaler does -- but I don't know how costly this is. I'll look into possible approaches. |
|
Note that there is also a problem with custom blocks, which are first hashed (without seeding) to a 32-bit integer, then this integer is mixed via SIPhash. This makes it easy to find seed-independent collisions. They are trivial for int64, for instance. Solving this issue would involve a different API for custom hash functions, taking a hash state as in-out parameter. This has been on my to do list for a while too. But I wonder whether we should do it now, or have an intermediate state where the SipHash-based seeded hash behaves well for strings but not that well for other data types. |
SipHash-1-3 is a seeded hash function that, unlike MurmurHash 3, has no known seed-independent collisions. This PR updates the Hashtbl implementation to use SipHash-1-3 as the hash function for randomized hash tables, making them truly resistant to hash flooding attacks. Non-randomized hash tables still use MurmurHash 3, as it is faster than SipHash-1-3, especially on 32-bit platforms, yet good enough, statistically speaking, for a classic hash table. This PR also adds an API for randomized hashing of custom blocks. A "hash_ext" function is added to the operations that can be attached to custom blocks. Unlike the existing "hash" operation, which takes a 32-bit hash state as argument and returns an updated state as result, "hash_ext" takes a pointer to an opaque "caml_hash_state" struct, which is updated in-place. Closes: ocaml#24
|
I just pushed a new proposal that tries hard to avoid @stedolan's "confusing deputy" attack. Let's view seeded hashing of a structured value as 1- serializing the structured value to a sequence of 64-bit words, and 2- combining the 64-bit words in a single hash value using SipHash-1-3. (In reality the two steps occur in parallel.) We must make sure that the serialization schema is injective: structured values that compare unequal should produce different sequences of 64-bit words; otherwise, we have a seed-independent collision. Here is the serialization schema used in the new proposal. A tagged integer is serialized to one 64-bit word, with low bit 1. A heap block is serialized to one header word possibly followed by one or several data words. The header word looks a lot like a heap block header: A header cannot be confused with a tagged integer because the low bit is 0. The tag is that of the heap block (String_tag, Double_tag, etc). If tag >= No_scan_tag and tag <> Custom_tag, the size is the number of 64-bit words representing the block contents. (Even on 32-bit platforms.) The header word is followed by "size" 64-bit words representing these contents. For a Double_tag, size = 1 and the next word is the binary64 representation of the FP number, with -0.0 and NaNs normalized. For a String_tag, the size is the length divided by 8 and rounded up, and the string is encoded as in the original SipHash algorithm, with the length modulo 256 in the last byte. If tag < No_scan_tag and tag <> Closure_tag, the size is the number of words in the heap block. Each field of the block is pushed on a queue and will be serialized later, in breadth-first manner. If tag = Closure_tag, we have a mixed encoding: the first words of the closure block, up to the first environment field, are serialized just after the header, and the environment fields are pushed on the queue for later serialization. The second word determines how many "first words" there are, so the format is still non-ambiguous, although it's getting convoluted. If tag = Custom_tag, we call the method If no As can be seen above, the encoding of structured values into sequences of 64-bit words tries very hard to be injective. There is one known case where it is not: to prevent runaway, the breadth-first traversal stops after a fixed number of nodes were seen. For instance, only the first N elements of a list are serialized and contribute to the hash value. So, all lists that agree on the first N elements collide. If we really want a truly injective encoding, the only reasonable way I can think of is to use the marshaller ( |
|
Even if the encoding must be injective, we can still be faster than the marshaller, for several reasons:
|
| uint64_t w; | ||
|
|
||
| num_elts = 1; | ||
| for (n = 0; n < b->num_dims; i++) num_elts = num_elts * b->dim[n]; |
There was a problem hiding this comment.
This is the appveyor failure:
| for (n = 0; n < b->num_dims; i++) num_elts = num_elts * b->dim[n]; | |
| for (n = 0; n < b->num_dims; n++) num_elts = num_elts * b->dim[n]; |
|
As suggested by @omasanori, PolymurHash (https://github.com/orlp/polymur-hash) could be a good alternative to SipHash 1-3. The question of the injective encoding (of values to string of bytes) remains. |
SipHash-1-3 is a seeded hash function that, unlike MurmurHash 3, has no known seed-independent collisions.
This PR updates the Hashtbl implementation to use SipHash-1-3 as the hash function for randomized hash tables, making them truly resistant to hash flooding attacks.
Non-randomized hash tables still use MurmurHash 3, as it is faster than SipHash-1-3, especially on 32-bit platforms, yet good enough, statistically speaking, for a classic hash table. Plus, this gives a modicum of backward compatibility with programs that marshal (non-randomized) hash tables to persistent storage.
If accepted, this will close PR #24, six years later. Why did it take so long? First, at the time, only SipHash-2-4 was considered, with more rounds and stronger cryptographic properties but slower running times. It took a while before the SipHash-1-3 reduced-round version was confirmed as good enough. Second, it took me all of 6 years to realize that we can keep using MurmurHash 3 for non-randomized hash tables, resulting in an excellent deal: more security for randomized hash tables, no slowdown for classic hash tables.