Skip to content

fishy strategy for order-independent hashing #15660

@oconnor663

Description

@oconnor663

Description of the bug:

With apologies, this isn't a bug that I can reproduce in practice. It's just a concern I have reading the code. The code in question is MetadataDigestUtils.java:

/**
 * @param env A collection of (String, String) pairs.
 * @return an order-independent digest of the given set of pairs.
 */
public static byte[] fromEnv(Map<String, String> env) {
  byte[] result = new byte[0];
  Fingerprint fp = new Fingerprint();
  for (Map.Entry<String, String> entry : env.entrySet()) {
    fp.addString(entry.getKey());
    fp.addString(entry.getValue());
    result = DigestUtils.xor(result, fp.digestAndReset());
  }
  return result;
}

(Similarly, the fromMetadata method immediately above that one.)

In short, this code XOR's a bunch of different hashes together, with the goal of getting a final result that's independent of the order of the inputs. My concern is that XOR'ing hashes is a fishy thing to do in general, because if two inputs are ever identical to each other, their hashes will also be identical, and XOR'ing them will cancel them both out. This makes it possible to change both inputs at the same time, as long as they remain the same as each other. Their hashes will always cancel out, and the final answer will be totally independent of the values of those inputs.

Now in this case, it looks like part of the input is a key from a hash map, so we might hope that those keys are guaranteed to be unique. If that was the whole story, I think it would still be worth a comment in the code. ("NB: The following is only valid because we're hashing the keys of a map. Do not copy this approach if you can't guarantee the inputs are all distinct.") But I'm not so sure it's the whole story. One detail that jumps out to me is that hashing appears to be done by encoding each string into Protobuf bytes. That entails a conversion from Java's UTF-16 strings to Protobuf's UTF-8 strings. Can two different Java strings ever wind up getting converted to the same Protobuf bytes? I don't know the answer to that, but the first thing I'd want to look at is invalid UTF-16 strings with "isolated surrogates". It's common for UTF-8 conversions of invalid UTF-16 strings to be lossy, which could trigger this bug. Even if this particular issue doesn't pan out, I think it's a good example of how this XOR strategy entails some pretty beefy assumptions about what a lot of underlying libraries are doing, which could change over time in unexpected ways.

For a more robust approach to order-independent hashing, I'd recommend taking all the hashes-of-inputs, sorting them, and then hashing the concatenation of all those sorted hashes. That requires a bit more work and linear memory, but it's simple to implement and robust to identical inputs. If constant memory was a hard requirement, you could think about getting fancy with e.g. scalar multiplication of elliptic curve points, but that's way more complicated and also much more expensive.

Anyway, fingers crossed that I haven't wasted your time by misreading the code :)

Metadata

Metadata

Assignees

Labels

P3We're not considering working on this, but happy to review a PR. (No assignee)team-CoreSkyframe, bazel query, BEP, options parsing, bazelrctype: bug

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions