-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Description
Description
There seems to be a hidden assumption within HashSet<string> that, if broken, leads to a set containing strings that fail to be looked up. The assumption is that the instance returned by EqualityComparer<string>.Default is the only instance of its type.
This is a simple reproduction of the issue:
[TestMethod]
public void HashSet_MinimalIssueReproduction()
{
// Create another instance of GenericEqualityComparer<string> using reflection.
Type comparerType = EqualityComparer<string>.Default.GetType();
ConstructorInfo ctorInfo = comparerType.GetConstructor(Array.Empty<Type>());
EqualityComparer<string> comparer = (EqualityComparer<string>)ctorInfo.Invoke(Array.Empty<object>());
// Create a set using that comparer, and then run it through a copy constructor.
HashSet<string> set = new HashSet<string>(comparer) { "a", "b", "c", "d" };
HashSet<string> setClone = new HashSet<string>(set);
// The copied set is broken!
Assert.IsTrue(setClone.Contains("d"), "Fails!");
}Of course, the question would then be - why would there ever be a second instance of the type backed by EqualityComparer<string>.Default?
The NonRandomizedStringEqualityComparer in its ISerializable implementation indicates via GetObjectData that it should serialize as a GenericEqualityComparer<string>:
void ISerializable.GetObjectData(SerializationInfo info, StreamingContext context)
{
// We are doing this to stay compatible with .NET Framework.
// Our own collection types will never call this (since this type is a wrapper),
// but perhaps third-party collection types could try serializing an instance
// of this.
info.SetType(typeof(GenericEqualityComparer<string>));
}So an implementation of a serializer that respects ISerializable would probably create a new instance of GenericEqualityComparer<string> when serializing NonRandomizedStringEqualityComparer (the default comparer that HashSet<string> seems to use, at least in my testing).
In other words, a realistic impact of this bug is: if you round trip a HashSet<string> through an ISerializable serializer, and then copy it via copy constructor, the copy will be broken. Here is a demonstration of this:
[TestMethod]
public void HashSet_SerializationIssueExample()
{
HashSet<string> set = new HashSet<string>() { "a", "b", "c", "d" };
// Mimic ISerializable serialization
SerializationInfo info = new SerializationInfo(typeof(HashSet<string>), new FormatterConverter());
StreamingContext context = new StreamingContext(StreamingContextStates.All);
set.GetObjectData(info, context);
ConstructorInfo constructor = typeof(HashSet<string>).GetConstructor(BindingFlags.NonPublic | BindingFlags.Instance, null, new[] { typeof(SerializationInfo), typeof(StreamingContext) }, null);
HashSet<string> setRoundTripped = (HashSet<string>)constructor.Invoke(new object[] { info, context });
setRoundTripped.OnDeserialization(null);
Assert.IsTrue(setRoundTripped.Contains("d"), "This works.");
HashSet<string> setClone = new HashSet<string>(setRoundTripped);
Assert.IsTrue(setClone.Contains("d"), "Fails!");
}Reproduction Steps
Run either of the above test cases in .NET 8.
Expected behavior
The HashSet<string> should always report that it contains strings that are within it. The tests should pass on .NET Framework and .NET 8.
Actual behavior
The tests pass on .NET Framework, and fail on .NET 8. (I did not try .NET 9, but I have been looking at source.dot.net when investigating this, and don't see any indication it would be fixed.)
Regression?
This was a regression noticed when migrating to .NET 8 from .NET Framework.
Known Workarounds
A serializer can special case these three singletons:
EqualityComparer<string>.DefaultStringComparer.OrdinalStringComparer.OrdinalIgnoreCase
If a serializer sees any of these, and upon deserializing ensures it returns the original singleton, assumptions can hold.
Configuration
.NET 8 / Windows 11, unlikely to be specific.
Other information
I'm not exactly sure how you would fix this, or if it would be realistic to do so at this point.
My understanding of the problem is that HashSet<string> ends up copying its backing bucket data incorrectly when the copy is made. It goes through this optimized code path when it shouldn't:
if (collection is HashSet<T> otherAsHashSet && EqualityComparersAreEqual(this, otherAsHashSet))
{
ConstructFrom(otherAsHashSet);
}In the problem circumstance, this is the new copy HashSet<string> that is being created, and it is using a NonRandomizedStringEqualityComparer instance that it gets from GetStringComparer. The NonRandomizedStringEqualityComparer basically "wraps" the EqualityComparer<string>.Default, which is a GenericEqualityComparer<string>.
The otherAsHashSet is in the unexpected state where its comparer is a non-singleton GenericEqualityComparer<string>, which isn't wrapped. The comparer isn't wrapped because GetStringComparer only wraps 3 specific singletons, when the given comparer is reference equal to one of them. (This is the assumption I refer to at the start of the issue.)
The problem arises because EqualityComparersAreEqual is implemented as set1.Comparer.Equals(set2.Comparer) and the Comparer property "unwraps" a wrapped comparer. this.Comparer will return an unwrapped GenericEqualityComparer<string>, and otherAsHashSet.Comparer returns a GenericEqualityComparer<string> that was never wrapped. I think these two non-reference equal instances of GenericEqualityComparer<string> are considered equal by Equals, leading to the bucket data copy in ConstructFrom. But it's not safe to copy bucket data, since the two sets are actually using comparers that use different hash codes.
For the bucket data copy scenario, I think it would be more correct to check set1._comparer.Equals(set2._comparer) so the unwrapping doesn't hide the fact that the comparers are actually different.