Use the SipHash-1-3 function for randomized hash tables by xavierleroy · Pull Request #9764 · ocaml/ocaml

xavierleroy · 2020-07-13T17:07:43Z

SipHash-1-3 is a seeded hash function that, unlike MurmurHash 3, has no known seed-independent collisions.

This PR updates the Hashtbl implementation to use SipHash-1-3 as the hash function for randomized hash tables, making them truly resistant to hash flooding attacks.

Non-randomized hash tables still use MurmurHash 3, as it is faster than SipHash-1-3, especially on 32-bit platforms, yet good enough, statistically speaking, for a classic hash table. Plus, this gives a modicum of backward compatibility with programs that marshal (non-randomized) hash tables to persistent storage.

If accepted, this will close PR #24, six years later. Why did it take so long? First, at the time, only SipHash-2-4 was considered, with more rounds and stronger cryptographic properties but slower running times. It took a while before the SipHash-1-3 reduced-round version was confirmed as good enough. Second, it took me all of 6 years to realize that we can keep using MurmurHash 3 for non-randomized hash tables, resulting in an excellent deal: more security for randomized hash tables, no slowdown for classic hash tables.

xavierleroy · 2020-07-20T08:24:09Z

This PR is now ready for review.

stedolan · 2020-07-29T09:50:30Z

runtime/hash.c

+
+/* Mix an OCaml string */
+
+static void sip_string(struct sip_state * st, value s)


It might be worth changing these functions to avoid accessing sip_state through a pointer. GCC introduces spills of the siphash state in the inner loop, because it can't see that the state and the string don't alias.
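
To illustrate the suggestion, here is a minimal sketch of the two styles. The struct layout and the names `sip_state_sketch`, `mix_words_ptr` and `mix_words_val` are hypothetical and not taken from the PR; only the SipRound itself follows the published SipHash description. Whether a given compiler actually spills in the pointer version depends on its version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state layout: the four SipHash lanes. */
struct sip_state_sketch { uint64_t v0, v1, v2, v3; };

#define ROTL64(x, n) (((x) << (n)) | ((x) >> (64 - (n))))

/* One SipRound, as in the SipHash description. */
#define SIPROUND(v0, v1, v2, v3) do {                               \
    v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32);   \
    v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2;                        \
    v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0;                        \
    v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32);   \
  } while (0)

/* Pointer style: every access goes through st, and the compiler must
   assume that st and w may alias. */
static void mix_words_ptr(struct sip_state_sketch *st,
                          const uint64_t *w, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    st->v3 ^= w[i];
    SIPROUND(st->v0, st->v1, st->v2, st->v3);  /* one compression round (the "1" in 1-3) */
    st->v0 ^= w[i];
  }
}

/* Value style: the lanes live in locals for the whole loop, so they can
   stay in registers; the updated state is returned by value. */
static struct sip_state_sketch mix_words_val(struct sip_state_sketch st,
                                             const uint64_t *w, size_t n)
{
  uint64_t v0 = st.v0, v1 = st.v1, v2 = st.v2, v3 = st.v3;
  for (size_t i = 0; i < n; i++) {
    v3 ^= w[i];
    SIPROUND(v0, v1, v2, v3);
    v0 ^= w[i];
  }
  return (struct sip_state_sketch){ v0, v1, v2, v3 };
}
```

Both functions compute the same state; the second only changes where the lanes are kept during the loop.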

stedolan · 2020-07-29T11:23:54Z

I'm afraid that there are still seed-independent collisions with this hash! Here's an example:

let z = "\000\000\000\000\000\000\000\000"
let w = String.init (31 * 8) (fun _ -> '!')

let a = ("", w ^ z)
let b = (z ^ w, "")

let () =
  let seed = Random.bits () in
  Printf.printf "%08x %08x %b\n"
    (Hashtbl.seeded_hash seed a)
    (Hashtbl.seeded_hash seed b)
    (a = b)

produces:

06d8d8d6 06d8d8d6 false

The issue is that the encoding of strings is ambiguous - there are distinct values that feed the same sequence of 64-bit words to SipHash.
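
The ambiguity can be reproduced outside the runtime. The sketch below assumes that each string is fed to the hasher using the standard SipHash padding (full 8-byte words, then a final word carrying the leftover bytes and the length modulo 256 in its top byte) and that nothing else marks where one string ends and the next begins. The function names are hypothetical; this is not the code from the PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append the SipHash-style encoding of s[0..len) to out, starting at
   index n. Returns the new word count. */
static size_t encode_string(uint64_t *out, size_t n,
                            const unsigned char *s, size_t len)
{
  size_t i;
  for (i = 0; i + 8 <= len; i += 8) {
    uint64_t w = 0;
    for (size_t j = 0; j < 8; j++) w |= (uint64_t) s[i + j] << (8 * j);
    out[n++] = w;
  }
  /* Final word: 0..7 leftover bytes, length modulo 256 in the top byte. */
  uint64_t last = (uint64_t) (len & 0xFF) << 56;
  for (size_t j = 0; i + j < len; j++) last |= (uint64_t) s[i + j] << (8 * j);
  out[n++] = last;
  return n;
}

/* Assumed encoding of a pair of strings: the two string encodings back
   to back, with no header in between. */
static size_t encode_pair(uint64_t *out,
                          const unsigned char *s1, size_t l1,
                          const unsigned char *s2, size_t l2)
{
  size_t n = encode_string(out, 0, s1, l1);
  return encode_string(out, n, s2, l2);
}

/* Returns 1 if ("", w ^ z) and (z ^ w, "") from the example above yield
   the same sequence of 64-bit words. */
static int example_pairs_collide(void)
{
  unsigned char wz[256], zw[256];
  uint64_t ea[40], eb[40];
  memset(wz, '!', 248); memset(wz + 248, 0, 8);   /* w ^ z */
  memset(zw, 0, 8);     memset(zw + 8, '!', 248); /* z ^ w */
  size_t na = encode_pair(ea, (const unsigned char *) "", 0, wz, 256);
  size_t nb = encode_pair(eb, zw, 256, (const unsigned char *) "", 0);
  return na == nb && memcmp(ea, eb, na * sizeof(uint64_t)) == 0;
}
```

Under these assumptions the empty string and the 8-byte zero string both contribute all-zero words, and the length 256 contributes a zero length byte, so the two pairs produce identical word sequences whatever the seed.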

xavierleroy · 2020-07-29T13:31:42Z

Interesting! The initial claims about seed-independent collisions were focused on strings as hash keys, not structured OCaml values. I agree the latter raise other issues. It is certainly possible to linearize a structured value into a sequence of 64-bit words in an injective manner -- that's what the marshaler does -- but I don't know how costly this is. I'll look into possible approaches.

xavierleroy · 2020-07-29T14:13:31Z

Note that there is also a problem with custom blocks, which are first hashed (without seeding) to a 32-bit integer, then this integer is mixed via SipHash. This makes it easy to find seed-independent collisions. They are trivial for int64, for instance. Solving this issue would involve a different API for custom hash functions, taking a hash state as in-out parameter. This has been on my to-do list for a while too. But I wonder whether we should do it now, or have an intermediate state where the SipHash-based seeded hash behaves well for strings but not that well for other data types.

SipHash-1-3 is a seeded hash function that, unlike MurmurHash 3, has no known seed-independent collisions.

This PR updates the Hashtbl implementation to use SipHash-1-3 as the hash function for randomized hash tables, making them truly resistant to hash flooding attacks.

Non-randomized hash tables still use MurmurHash 3, as it is faster than SipHash-1-3, especially on 32-bit platforms, yet good enough, statistically speaking, for a classic hash table.

This PR also adds an API for randomized hashing of custom blocks. A "hash_ext" function is added to the operations that can be attached to custom blocks. Unlike the existing "hash" operation, which takes a 32-bit hash state as argument and returns an updated state as result, "hash_ext" takes a pointer to an opaque "caml_hash_state" struct, which is updated in-place.

Closes: ocaml#24

xavierleroy · 2020-08-19T08:28:14Z

I just pushed a new proposal that tries hard to avoid @stedolan's "confusing deputy" attack.

Let's view seeded hashing of a structured value as 1- serializing the structured value to a sequence of 64-bit words, and 2- combining the 64-bit words in a single hash value using SipHash-1-3. (In reality the two steps occur in parallel.) We must make sure that the serialization schema is injective: structured values that compare unequal should produce different sequences of 64-bit words; otherwise, we have a seed-independent collision.

Here is the serialization schema used in the new proposal.

A tagged integer is serialized to one 64-bit word, with low bit 1.

A heap block is serialized to one header word possibly followed by one or several data words. The header word looks a lot like a heap block header:

   size (54 bits) . tag (8 bits) . two zero bits (2 bits)

A header cannot be confused with a tagged integer because the low bit is 0.

The tag is that of the heap block (String_tag, Double_tag, etc).
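
As a rough illustration of the two word formats described so far, the sketch below packs a header as size (54 bits), tag (8 bits) and two zero bits, and a tagged integer with its low bit set. The helper names are hypothetical; only the bit layout comes from the description above.

```c
#include <stdint.h>

/* Header word: size (54 bits) . tag (8 bits) . two zero bits. */
static uint64_t make_header(uint64_t size, unsigned tag)
{
  uint64_t size54 = size & ((UINT64_C(1) << 54) - 1);
  return (size54 << 10) | ((uint64_t) (tag & 0xFF) << 2);
}

/* Tagged integer word: the value shifted left by one, low bit 1. */
static uint64_t make_tagged_int(int64_t n)
{
  return ((uint64_t) n << 1) | 1;
}

/* The low bit alone tells a header from a tagged integer. */
static int is_header(uint64_t w)       { return (w & 1) == 0; }
static uint64_t header_size(uint64_t w) { return w >> 10; }
static unsigned header_tag(uint64_t w)  { return (unsigned) ((w >> 2) & 0xFF); }
```

For example, `make_header(3, 252)` describes a block with tag 252 (String_tag in the OCaml runtime) whose contents occupy three 64-bit words.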

If tag >= No_scan_tag and tag <> Custom_tag, the size is the number of 64-bit words representing the block contents. (Even on 32-bit platforms.) The header word is followed by "size" 64-bit words representing these contents. For a Double_tag, size = 1 and the next word is the binary64 representation of the FP number, with -0.0 and NaNs normalized. For a String_tag, the size is the length divided by 8 and rounded up, and the string is encoded as in the original SipHash algorithm, with the length modulo 256 in the last byte.
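
The normalization of floats matters because `0.0` and `-0.0` compare equal, and all NaNs compare equal under `compare`, while their bit patterns differ. A minimal sketch of such a normalization follows, assuming IEEE binary64 doubles; the function name and the particular canonical NaN pattern are choices made for this sketch, not necessarily those of the PR.

```c
#include <stdint.h>
#include <string.h>

/* Canonical NaN bit pattern chosen for this sketch (an assumption). */
#define SKETCH_CANONICAL_NAN UINT64_C(0x7FF8000000000000)

/* Map a double to the 64-bit word that gets hashed. */
static uint64_t double_to_word(double d)
{
  uint64_t w;
  if (d != d) return SKETCH_CANONICAL_NAN;  /* any NaN, whatever its payload */
  if (d == 0.0) return 0;                   /* +0.0 and -0.0 both map to +0.0 */
  memcpy(&w, &d, sizeof w);                 /* otherwise, the raw bits */
  return w;
}
```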

If tag < No_scan_tag and tag <> Closure_tag, the size is the number of words in the heap block. Each field of the block is pushed on a queue and will be serialized later, in breadth-first manner.

If tag = Closure_tag, we have a mixed encoding: the first words of the closure block, up to the first environment field, are serialized just after the header, and the environment fields are pushed on the queue for later serialization. The second word determines how many "first words" there are, so the format is still non-ambiguous, although it's getting convoluted.

If tag = Custom_tag, we call the method hash_ext associated with the custom block, if it exists. This is a new method, added to the custom_operations struct by this PR. It is supposed to serialize and hash the custom object following the same conventions as for strings or floats: one header word with tag = Custom_tag and size N, followed by N words describing the contents of the block. Appropriate hash_ext functions are implemented in this PR for boxed integers and for bigarrays. In particular, 1-dimension bigarrays of characters are hashed very much like strings.

If no hash_ext method is provided, the old hash method is called, and the resulting 32-bit hash value is used as one 64-bit word of content. Of course this makes it trivial to find collisions, i.e. two custom blocks that hash equal but compare different. That's why the hash_ext method had to be introduced and used for commonly-used custom blocks such as boxed integers.
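
The difference between the two paths can be sketched for boxed 64-bit integers. Everything here is illustrative: `word_buf` merely records the words that would be fed to SipHash, the 32-bit fold by XOR is an assumed stand-in for a legacy unseeded hash, and the function names are hypothetical rather than the PR's API.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CUSTOM_TAG 255  /* value of Custom_tag in the OCaml runtime */

/* Stand-in for the hash state: records the words instead of mixing them. */
struct word_buf { uint64_t w[4]; size_t n; };

static void push_word(struct word_buf *b, uint64_t x) { b->w[b->n++] = x; }

/* Header with size 1 and tag Custom_tag, per the layout described above. */
static uint64_t custom_header_1(void)
{
  return (UINT64_C(1) << 10) | ((uint64_t) SKETCH_CUSTOM_TAG << 2);
}

/* Assumed legacy-style unseeded hash: fold the two halves with XOR. */
static uint32_t int64_hash32_sketch(int64_t x)
{
  return (uint32_t) x ^ (uint32_t) ((uint64_t) x >> 32);
}

/* Fallback path: the 32-bit hash is the only content word. */
static void serialize_int64_fallback(struct word_buf *b, int64_t x)
{
  push_word(b, custom_header_1());
  push_word(b, int64_hash32_sketch(x));
}

/* hash_ext-style path: the full 64-bit value is the content word. */
static void serialize_int64_ext(struct word_buf *b, int64_t x)
{
  push_word(b, custom_header_1());
  push_word(b, (uint64_t) x);
}
```

With this fold, 0 and 0x0000000100000001 produce the same words on the fallback path, so they collide for every seed; on the hash_ext-style path their content words differ.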

As can be seen above, the encoding of structured values into sequences of 64-bit words tries very hard to be injective. There is one known case where it is not: to prevent runaway, the breadth-first traversal stops after a fixed number of nodes have been seen. For instance, only the first N elements of a list are serialized and contribute to the hash value. So, all lists that agree on the first N elements collide.
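
The effect of the cutoff can be shown with a deliberately simplified model: a list of integers is represented as an array, block headers are ignored, and the limit `SKETCH_NODE_LIMIT` is a made-up constant (the actual limit in the runtime is not given here).

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NODE_LIMIT 4  /* hypothetical cutoff, for illustration only */

/* Serialize at most SKETCH_NODE_LIMIT elements as tagged-integer words.
   Returns the number of words written to out. */
static size_t serialize_bounded(uint64_t *out, const int64_t *elts, size_t len)
{
  size_t n = 0;
  for (size_t i = 0; i < len && n < SKETCH_NODE_LIMIT; i++)
    out[n++] = ((uint64_t) elts[i] << 1) | 1;
  return n;
}
```

Two inputs that differ only after the first `SKETCH_NODE_LIMIT` elements yield the same words, and therefore the same hash for every seed.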

If we really want a truly injective encoding, the only reasonable way I can think of is to use the marshaller (Marshal.to_string) to produce a string of bytes, which is then hashed using SipHash. On the positive side, it would save code in the runtime system and avoid introducing the hash_ext custom method. On the negative side, I'm afraid this is going to make seeded hashing much slower, in particular for simple data such as strings or integers.

DemiMarie · 2020-11-02T00:06:23Z

Even if the encoding must be injective, we can still be faster than the marshaller, for several reasons:

The encoding doesn’t need to be portable, so we can avoid unnecessary byte swapping.
We can stream the data into the hasher as it is generated, avoiding allocations and copies.
We can special-case integers and strings and use optimized fast paths for them.
The encoding doesn’t need to be easy to decode, or even possible.

damiendoligez · 2022-01-21T15:07:51Z

runtime/bigarray.c

+  uint64_t w;
+
+  num_elts = 1;
+  for (n = 0; n < b->num_dims; i++) num_elts = num_elts * b->dim[n];


This is the appveyor failure:

Suggested change

-  for (n = 0; n < b->num_dims; i++) num_elts = num_elts * b->dim[n];
+  for (n = 0; n < b->num_dims; n++) num_elts = num_elts * b->dim[n];

xavierleroy · 2023-07-17T08:09:52Z

As suggested by @omasanori, PolymurHash (https://github.com/orlp/polymur-hash) could be a good alternative to SipHash-1-3. The question of the injective encoding (of values to strings of bytes) remains.

xavierleroy mentioned this pull request Jul 13, 2020

Switch to the SipHash hash function #24

Closed

xavierleroy marked this pull request as ready for review July 20, 2020 08:13

xavierleroy force-pushed the siphash branch from 506dc4a to 4d07cae Compare July 20, 2020 08:23

stedolan reviewed Jul 29, 2020

xavierleroy force-pushed the siphash branch from 4d07cae to feafcb4 Compare August 18, 2020 15:05

xavierleroy force-pushed the siphash branch from feafcb4 to 1b0af10 Compare August 19, 2020 08:20

damiendoligez reviewed Jan 21, 2022

omasanori mentioned this pull request Jul 17, 2023

Consider PolymurHash #12380

Open

hyphenrf mentioned this pull request Jun 17, 2024

Bring Uchar hashing on par with other base types like Int, Char, ... #13240

Merged

xavierleroy wants to merge 1 commit into ocaml:trunk from xavierleroy:siphash

4 participants

