-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New mutable.HashSet and mutable.HashMap implementations #7348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I grepped the jenkins log for "contains object and package with same name", and this hasn't come up in a while at least. Perhaps has something to do with iteration-order-sensitivity in the compiler? |
|
Ah, right, that could be the issue. I assumed it's some change in my code that trips up the compiler but the failing tests are ones that use the new compiler and library and therefore the new HashSet. So it could be either a bug that the regular tests don't catch or the compiler relying on undocumented behavior of the old version. |
3e16f04 to
4e6d9df
Compare
|
Here's a benchmark for the current |
|
Not sure if this is helpful, but I know in java land you can transform any
I guess it could be worth a benchmark |
|
Sharing an implementation between Set and Map is the obvious thing to do. Java does it (HashSet simply wraps HashMap). Our old design was also prepared for this by separating the actual hash table implementation from HashMap and HashSet (but HashSet uses a different implementation anyway). You're wasting one pointer per element if you do it this way, which shouldn't be too bad. Another option would be to have different node implementations for maps and sets. This will need some benchmarking to make sure HotSpot can optimize it properly. |
|
I suppose what I’m wondering is if there should be something like |
|
@javax-swing: Are there other cases than |
|
BTW, the first performance results show that my new HashMap is much faster than the old one except in collision resistance. Maybe the old Scala hash mapping provides an advantage over Java's simpler one (that I'm using). Investigating further. |
|
@szeiger not really, just wondering whether the abstraction is useful. It could simplify the implementations for some of the other kinds of sets (for example |
4e6d9df to
8640fd8
Compare
|
|
8640fd8 to
a724245
Compare
|
Ah ok makes sense :) |
|
I did some benchmarks on an abstracted version that allows different Node implementations so that HashSet could have a Node without the value. Some variations are worse than others but in all versions that I tried there was a significant performance penalty. So if we want to it be fast, the choices are separate implementations or wrapping a Map in a Set (and thus wasting one pointer per element). |
|
Here's the new Notice the The performance data from Java is pretty much the same as it was for This is with the new |
8addfaa to
0c8e94a
Compare
|
It's probably best to start a new PR for the order-aware implementations. I consider this PR ready. |
| val e = table.findEntry(key) | ||
| if (e eq null) default else e.value | ||
| } | ||
| override def put(key: K, value: V): Option[V] = put0(key, value, true) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put0 could probably return None rather than null. This would avoid the extra pattern match here. Unless it’s faster to return null in put0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I ran a benchmark for this. I assumed that null would be faster in general and the extra null check when you do care about the result (which is only in put) is cheap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, did some benchmarks. They're not entirely conclusive but if anything, using null is a little bit faster.
|
@szeiger not sure if this would be interesting to you, but i had a crack at implementing the more efficient grow function that you hint at in the comments. |
|
Since OpenHashMap is deprecated, does it make sense to introduce OpenHashSet? |
|
(could also summarize why OpenHashMap was deprecated? scala/collection-strawman#388) |
You went for the separate implemenation, I think it's the right choice. À la limite we could have a template and some code gen, but it's probably not worth the effort. Having separate classes might allow the JIT to do a better job optimizing each one individually. |
lrytz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work! A couple of questions in review comments.
| * @define willNotTerminateInf | ||
| */ | ||
| class HashMap[K, V] | ||
| class HashMap[K, V](initialCapacity: Int, loadFactor: Double) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it final..? I guess the main use case for extending it is to override default. We can make that default function an assignable field (withDefaultValue?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 3 other subclasses in the compiler: QuantMap in scala.reflect.internal.util.Statistics, result in scala.tools.nsc.doc.model.IndexModelFactory and ReponseMap in scala.tools.nsc.interactive.Global. Plus some more in test cases. Making it final would probably break a lot of code if we already have that much in our own codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should make it final / deprecatedOverriding. The 3 examples don't do much more than adding defaults, I think people should use composition not inheritance here. Allowing subclasses of HashMap makes it hard for us to change things, to optimize the implementaiton, and even changes that look safe might break subclasses in unexpected ways - you cannot anticipate what implementation details subclasses rely on.
For convenience, we could provide a non-final DelegatingMap that people could extend to add methods to a map. But just doing it for mutable.Map seems unsystematic. We have many other final collection types in the library, I don't see why mutable.HashMap should be an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the default method is not specific to HashMap, it is defined in collection.Map and we already have withDefault to create a delegate.
|
"Template and code gen" - Something like C++ templates would be useful here. I suppose we'd have to settle for something more ad-hoc and more horrifying. I think the reason for deprecating |
|
@javax-swing Did you run any benchmarks on this algorithm? I assume it would be beneficial to keep the separate version for the common 2x case but maybe the generic one can be just as fast. |
|
@szeiger no I didn’t actually. I will have a go at that when I find some free time. I’ve never used jmh before so it might be fun! I initially hashed it out on my phone to combat boredom 😁 and the solution seems surprisingly small. The caveat being that the array of last values could be quite large and therefore take up a fair chunk of memory of the difference in the size of the tables is big enough |
|
@szeiger I’ve attempted some benchmarking https://github.com/javax-swing/hashtable-grow-brenchmarking From initial tests there is a performance degradation when the size doubles, I will post some results once I’m happy with it.i suspect that it’s a combination of the array access and the null checks, I made a second variant using I’m yet to test other grow factors but hopefully will find time this evening |
|
Is OpenHashSet/OpenHashMap a good name for the old implementations? Using "Open" in the name makes it hard to know whether it is referring to "Open Hashing" or "Open Addressing (aka closed hashing). |
|
@szeiger i managed to get some more time for running some benchmarks. The generic grow algorithm shows some improvements, but only when the number of elements is large or the grow factor is large. Below Below are the results from doubling in size From these numbers i'm not sure it's worth changing it, since 99% of the time i imagine that the hashtable will double in size. And people could avoid a costly call to grow the hashtable by supplying the size hint on construction. What do you think? |
0c8e94a to
7b40d5c
Compare
|
Updated to address review comments. Important changes:
@javax-swing While working on |
- The old HashSet was based on open addressing. Since the similar `OpenHashMap` is deprecated I removed it (along with the actual implementing class `FlatHashTable`) instead of adding yet another collection implementation (`OpenHashSet`). - The new HashSet implementation caches the hash codes and uses linked lists for chaining. This is consistent with mutable.HashMap, the immutable HashSet/HashMap, and Java’s mutable HashSet/HashMap. - The linked lists are kept ordered by hash code. This costs a few percent of performance for well-distributed hashes but is significantly faster in cases with lots of bucket collisions - Hash set computation is fixed to properly implement Scala’s equality semantics for numeric types. - The new HashMap is based on the new HashSet. Fixes scala/scala-dev#567
7b40d5c to
3a6f637
Compare
| override def addAll(xs: IterableOnce[A]): this.type = { | ||
| sizeHint(xs.knownSize) | ||
| super.addAll(xs) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could take advantage of the cached hashcodes in both this implementation of mutable.HashSet as well as in immutable.HashSet.
Maybe an other optimization:
When case xs: mutable.HashMap[A] if xs.table.length >= this.table.length, then each of those buckets maps onto exactly one bucket in this.table. Maybe each bucket in that could be "merge joined" into the bucket in this, since they're both sorted linked lists? Maybe it would get complicated with the extra housekeeping of making sure to resize the table when this gets too big
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could check for equal table sizes after growing this table. If you check before growing and delay growing until you've added the new data you could end up adding a lot of data in an inefficient way to linked lists even though you know you'll have to split them afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm but whether or not this table should grow depends on the overlap of the two HashMaps. It's probably not worth doing I guess.
| def iterator: Iterator[(K, V)] = { | ||
| if (isEmpty) Iterator.empty | ||
| else table.entriesIterator.map(e => (e.key, e.value)) | ||
| override def addAll(xs: IterableOnce[(K, V)]): this.type = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to move this override to mutable.MapOps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sizeHint comes from Builder which extends Growable. It would make sense to override addAll (which is concrete in Growable) in Builder to set the size hint.
|
I forgot to mention: could you update the PR description? Say that OpenHashMap and OpenHashSet are removed. And it would be good to give an explanation why OpenHashMap was deprecated. |
|
@szeiger Should the equivalent changes to this also be done for LinkedHashMap and LinkedHashSet? |
OpenHashMapis deprecated to lower the maintenance burden (and nobody seems to be complaining about that deprecation) I removed it (along with the actual implementing classFlatHashTable) instead of adding yet another collection implementation (OpenHashSet).Fixes scala/scala-dev#567
Note: The following performance data is from the original version of this PR. See the other comments for newer data.
Performance is similar to Java and much better than the old implementation (
ohs=OpenHashSet, i.e. the old version plus the equality fix).Checking for elements not in the set:
Checking for elements in the set:
Iterating over all elements with an
Iterator:Building a pre-allocated set without the need to regrow the table: The old implementation doesn't support size hints and is therefore much slower (even more so than in the other benchmarks)
Building the same set without pre-allocating:
Building a set with objects whose hashes are reduced to only 4 significant bits to simulate a badly distributed hash function with many collisions: Keeping the linked lists sorted by hash code provides a significant advantage over the Java implementation:
The next step is comparing the new
mutable.HashSetwithmutable.HashMap, possibly replacing one with the other. After that I'll look into ordering-aware implementations to address scala/bug#11203. As suggested by @Ichoran I'll keep these separate so they can make use of anOrderingwhich is more idiomatic in Scala than having the objects themselves implementComparable. It should also make the implementations simpler and faster.