fishy strategy for order-independent hashing

### Description of the bug:


With apologies, this isn't a bug that I can reproduce in practice. It's just a concern I have reading the code. The code in question is [MetadataDigestUtils.java](https://github.com/bazelbuild/bazel/blob/f0e63c4c934dfd9e85d2e55ede7c5bc0409946c7/src/main/java/com/google/devtools/build/lib/actions/cache/MetadataDigestUtils.java#L49-L60):

```java
/**
 * @param env A collection of (String, String) pairs.
 * @return an order-independent digest of the given set of pairs.
 */
public static byte[] fromEnv(Map<String, String> env) {
  byte[] result = new byte[0];
  Fingerprint fp = new Fingerprint();
  for (Map.Entry<String, String> entry : env.entrySet()) {
    fp.addString(entry.getKey());
    fp.addString(entry.getValue());
    result = DigestUtils.xor(result, fp.digestAndReset());
  }
  return result;
}
```

(Similarly, the `fromMetadata` method immediately above that one.)

In short, this code XOR's a bunch of different hashes together, with the goal of getting a final result that's independent of the order of the inputs. My concern is that XOR'ing hashes is a fishy thing to do in general, because if two inputs are ever identical to each other, their hashes will also be identical, and XOR'ing them will cancel them both out. This makes it possible to change both inputs at the same time, as long as they remain the same as each other. Their hashes will always cancel out, and the final answer will be totally independent of the values of those inputs.

Now in this case, it looks like part of the input is a key from a hash map, so we might hope that those keys are guaranteed to be unique. If that was the whole story, I think it would still be worth a comment in the code. ("NB: The following is only valid because we're hashing the keys of a map. Do not copy this approach if you can't guarantee the inputs are all distinct.") But I'm not so sure it's the whole story. One detail that jumps out to me is that hashing appears to be done by encoding each string into Protobuf bytes. That entails a conversion from Java's UTF-16 strings to Protobuf's UTF-8 strings. Can two different Java strings ever wind up getting converted to the same Protobuf bytes? I don't know the answer to that, but the first thing I'd want to look at is invalid UTF-16 strings with "isolated surrogates". It's common for UTF-8 conversions of invalid UTF-16 strings to be lossy, which could trigger this bug. Even if this particular issue doesn't pan out, I think it's a good example of how this XOR strategy entails some pretty beefy assumptions about what a lot of underlying libraries are doing, which could change over time in unexpected ways.

For a more robust approach to order-independent hashing, I'd recommend taking all the hashes-of-inputs, sorting them, and then hashing the concatenation of all those sorted hashes. That requires a bit more work and linear memory, but it's simple to implement and robust to identical inputs. If constant memory was a hard requirement, you could think about getting fancy with e.g. scalar multiplication of elliptic curve points, but that's way more complicated and also much more expensive.

Anyway, fingers crossed that I haven't wasted your time by misreading the code :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fishy strategy for order-independent hashing #15660

Description of the bug:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

fishy strategy for order-independent hashing #15660

Description

Description of the bug:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions