Skip to content

Use the SipHash-1-3 function for randomized hash tables#9764

Open
xavierleroy wants to merge 1 commit intoocaml:trunkfrom
xavierleroy:siphash
Open

Use the SipHash-1-3 function for randomized hash tables#9764
xavierleroy wants to merge 1 commit intoocaml:trunkfrom
xavierleroy:siphash

Conversation

@xavierleroy
Copy link
Contributor

@xavierleroy xavierleroy commented Jul 13, 2020

SipHash-1-3 is a seeded hash function that, unlike MurmurHash 3, has no known seed-independent collisions.

This PR updates the Hashtbl implementation to use SipHash-1-3 as the hash function for randomized hash tables, making them truly resistant to hash flooding attacks.

Non-randomized hash tables still use MurmurHash 3, as it is faster than SipHash-1-3, especially on 32-bit platforms, yet good enough, statistically speaking, for a classic hash table. Plus, this gives a modicum of backward compatibility with programs that marshal (non-randomized) hash tables to persistent storage.

If accepted, this will close PR #24, six years later. Why did it take so long? First, at the time, only SipHash-2-4 was considered, with more rounds and stronger cryptographic properties but slower running times. It took a while before the SipHash-1-3 reduced-round version was confirmed as good enough. Second, it took me all of 6 years to realize that we can keep using MurmurHash 3 for non-randomized hash tables, resulting in an excellent deal: more security for randomized hash tables, no slowdown for classic hash tables.

@xavierleroy xavierleroy marked this pull request as ready for review July 20, 2020 08:13
@xavierleroy
Copy link
Contributor Author

This PR is now ready for review.

runtime/hash.c Outdated

/* Mix an OCaml string */

static void sip_string(struct sip_state * st, value s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth changing these functions to avoid accessing sip_state through a pointer. GCC introduces spills of the siphash state in the inner loop, because it can't see that the state and the string don't alias.

@stedolan
Copy link
Contributor

I'm afraid that there are still seed-independent collisions with this hash! Here's an example:

let z = "\000\000\000\000\000\000\000\000"
let w = String.init (31 * 8) (fun _ -> '!')

let a = ("", w ^ z)
let b = (z ^ w, "")

let () =
  let seed = Random.bits () in
  Printf.printf "%08x %08x %b\n"
    (Hashtbl.seeded_hash seed a)
    (Hashtbl.seeded_hash seed b)
    (a = b)

produces:

06d8d8d6 06d8d8d6 false

The issue is that the encoding of strings is ambiguous - there are distinct values that feed the same sequence of 64-bit words to SipHash.

@xavierleroy
Copy link
Contributor Author

Interesting! The initial claims about seed-independent collisions were focused on strings as hash keys, not structured OCaml values. I agree the latter raise other issues. It is certainly possible to linearize a structure value into a sequence of 64-bit words in an injective manner -- that's what the marshaler does -- but I don't know how costly this is. I'll look into possible approaches.

@xavierleroy
Copy link
Contributor Author

Note that there is also a problem with custom blocks, which are first hashed (without seeding) to a 32-bit integer, then this integer is mixed via SIPhash. This makes it easy to find seed-independent collisions. They are trivial for int64, for instance. Solving this issue would involve a different API for custom hash functions, taking a hash state as in-out parameter. This has been on my to do list for a while too. But I wonder whether we should do it now, or have an intermediate state where the SipHash-based seeded hash behaves well for strings but not that well for other data types.

SipHash-1-3 is a seeded hash function that, unlike MurmurHash 3, has
no known seed-independent collisions.

This PR updates the Hashtbl implementation to use SipHash-1-3 as
the hash function for randomized hash tables, making them truly
resistant to hash flooding attacks.

Non-randomized hash tables still use MurmurHash 3, as it is faster
than SipHash-1-3, especially on 32-bit platforms, yet good enough,
statistically speaking, for a classic hash table.

This PR also adds an API for randomized hashing of custom blocks.
A "hash_ext" function is added to the operations that can be attached
to custom blocks.  Unlike the existing "hash" operation, which takes
a 32-bit hash state as argument and returns an updated state as result,
"hash_ext" takes a pointer to an opaque "caml_hash_state" struct, which
is updated in-place.

Closes: ocaml#24
@xavierleroy
Copy link
Contributor Author

I just pushed a new proposal that tries hard to avoid @stedolan's "confusing deputy" attack.

Let's view seeded hashing of a structured value as 1- serializing the structured value to a sequence of 64-bit words, and 2- combining the 64-bit words in a single hash value using SipHash-1-3. (In reality the two steps occur in parallel.) We must make sure that the serialization schema is injective: structured values that compare unequal should produce different sequences of 64-bit words; otherwise, we have a seed-independent collision.

Here is the serialization schema used in the new proposal.

A tagged integer is serialized to one 64-bit word, with low bit 1.

A heap block is serialized to one header word possibly followed by one or several data words. The header word looks a lot like a heap block header:

   size (54 bits) . tag (8 bits) . two zero bits (2 bits)

A header cannot be confused with a tagged integer because the low bit is 0.

The tag is that of the heap block (String_tag, Double_tag, etc).

If tag >= No_scan_tag and tag <> Custom_tag, the size is the number of 64-bit words representing the block contents. (Even on 32-bit platforms.) The header word is followed by "size" 64-bit words representing these contents. For a Double_tag, size = 1 and the next word is the binary64 representation of the FP number, with -0.0 and NaNs normalized. For a String_tag, the size is the length divided by 8 and rounded up, and the string is encoded as in the original SipHash algorithm, with the length modulo 256 in the last byte.

If tag < No_scan_tag and tag <> Closure_tag, the size is the number of words in the heap block. Each field of the block is pushed on a queue and will be serialized later, in breadth-first manner.

If tag = Closure_tag, we have a mixed encoding: the first words of the closure block, up to the first environment field, are serialized just after the header, and the environment fields are pushed on the queue for later serialization. The second word determines how many "first words" there are, so the format is still non-ambiguous, although it's getting convoluted.

If tag = Custom_tag, we call the method hash_ext associated with the custom block, if it exists. This is a new method, added to the custom_operations struct by this PR. It is supposed to serialize and hash the custom object following the same conventions as for strings or floats: one header word with tag = Custom_tag and size N, followed by N words describing the contents of the block. Appropriate hash_ext functions are implemented in this PR for boxed integers and for bigarrays. In particular, 1-dimension bigarrays of characters are hashed very much like strings.

If no hash_ext method is provided, the old hash method is called, and the resulting 32-bit hash value is used as one 64-bit word of content. Of course this makes it trivial to find collisions, i.e. two custom blocks that hash equal but compare different. That's why the hash_ext method had to be introduced and used for commonly-used custom blocks such as boxed integers.

As can be seen above, the encoding of structured values into sequences of 64-bit words tries very hard to be injective. There is one known case where it is not: to prevent runaway, the breadth-first traversal stops after a fixed number of nodes were seen. For instance, only the first N elements of a list are serialized and contribute to the hash value. So, all lists that agree on the first N elements collide.

If we really want a truly injective encoding, the only reasonable way I can think of is to use the marshaller (Marshal.to_string) to produce a string of bytes, which is then hashed using SipHash. On the positive side, it would save code in the runtime system and avoid introducing the hash_ext custom method. On the negative side, I'm afraid this is going to make seeded hashing much slower, in particular for simple data such as strings or integers.

@DemiMarie
Copy link
Contributor

Even if the encoding must be injective, we can still be faster than the marshaller, for several reasons:

  • The encoding doesn’t need to be portable, so we can avoid unnecessary byte swapping.
  • We can stream the data into the hasher as it is generated, avoiding allocations and copies.
  • We can special-case integers and strings and use optimized fast paths for them.
  • The encoding doesn’t need to be easy to decode, or even possible.

uint64_t w;

num_elts = 1;
for (n = 0; n < b->num_dims; i++) num_elts = num_elts * b->dim[n];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the appveyor failure:

Suggested change
for (n = 0; n < b->num_dims; i++) num_elts = num_elts * b->dim[n];
for (n = 0; n < b->num_dims; n++) num_elts = num_elts * b->dim[n];

@xavierleroy
Copy link
Contributor Author

As suggested by @omasanori, PolymurHash (https://github.com/orlp/polymur-hash) could be a good alternative to SipHash 1-3. The question of the injective encoding (of values to string of bytes) remains.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants