ARROW-5902: [Java] Implement hash table and equals & hashCode API for dictionary encoding#4846
ARROW-5902: [Java] Implement hash table and equals & hashCode API for dictionary encoding#4846tianchen92 wants to merge 6 commits intoapache:masterfrom
Conversation
|
@emkornfield Hi Micah, I made a simple implementation for dictionary hash table. |
java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java
Outdated
Show resolved
Hide resolved
|
@liyafan82 Thanks for your comments, This PR is just a demo as I commented as WIP and this will work for VarCharVector type encoding in TestDictionaryVector to test correctness. default and memory copy already commented as TODO to make it perfect since we have to use memory level hashCode and implement equals in all type vectors. |
Codecov Report
@@ Coverage Diff @@
## master #4846 +/- ##
==========================================
+ Coverage 87.42% 89.58% +2.16%
==========================================
Files 994 661 -333
Lines 140102 96617 -43485
Branches 1418 0 -1418
==========================================
- Hits 122481 86555 -35926
+ Misses 17259 10062 -7197
+ Partials 362 0 -362Continue to review full report at Codecov.
|
There was a problem hiding this comment.
this is a no-op I believe.
There was a problem hiding this comment.
Thanks for comments, IMO, what dose this mean? :)
There was a problem hiding this comment.
@emkornfield Could you please explain this a little more? thanks
There was a problem hiding this comment.
I think Integer.hashCode(value) == value?
There was a problem hiding this comment.
I see, you are right, thanks, fixed.
emkornfield
left a comment
There was a problem hiding this comment.
This looks like a good start do you see performance improvements? If so I think the next step would be to add unit tests.
Also, CC @praveenbingo and @pravindra to make sure they don't have concerns with the new interfaces.
java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryHashTable.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestExtensionType.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestExtensionType.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
Outdated
Show resolved
Hide resolved
Sure, I have already added a DictionaryEncoderBenchmark and list its previous performance data in #4786. And add what kind of UT? now TestDictionaryVector already support many types like Struct, List, Binary, Varchar etc. |
|
Sorry forgot about the existing unit tests. LGTM. |
|
+1, than you |
… dictionary encoding As discussed in #4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <[email protected]> Closes #4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
… dictionary encoding As discussed in apache#4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <[email protected]> Closes apache#4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
As discussed in #4792
Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.