-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Implement Native Text Operator #8384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level question: does FST + inverted index enough to solve the phrase query? Based on my understanding, we need to also keep the token positioning info in order to solve it?
We should take all the use cases we want to solve into consideration because it is hard to change the index once it is generated and pushed to the cluster.
...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java
Outdated
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/IndexCreationContext.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java
Outdated
Show resolved
Hide resolved
richardstartin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good, just minor comments.
|
Let's not merge this PR yet. Want to ensure all the functionalities we want to support for text index is covered by the index (positional info needs to be stored). The actual support can be added in multiple PRs, but we don't want to change index format frequently. |
...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## master #8384 +/- ##
=============================================
- Coverage 70.33% 14.09% -56.25%
+ Complexity 4375 84 -4291
=============================================
Files 1705 1664 -41
Lines 89699 87989 -1710
Branches 13568 13387 -181
=============================================
- Hits 63093 12398 -50695
- Misses 22155 74660 +52505
+ Partials 4451 931 -3520
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
You are right if we try to change text match operator semantics.. I was thinking more along using like operator to use this index and leave the text match operator as it is. Phrase match is not that important and can be done later in another PR Wdyt? |
|
Is this exposed via TEXT_MATCH function and thus requires Lucene query syntax ? IIRC, in the initial design we wanted to expose this through a different function / udf (LIKE to get ANSI SQL) so that the query syntax doesn't necessarily have to be Lucene based. Referencing our earlier discussion - #7395 (comment) |
We don't need this index in order to support LIKE. LIKE and regexpLike is already supported with the FST (either lucene or native) and won't use this index. |
Yes, the plan is to support LIKE and a new method for PHRASE search. The native text index is exposed as an option within the text index, thus supports TEXT_MATCH by default. We will have to disable TEXT_MATCH on native indices explicitly. |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to have some discussion on the syntax of the new text index. Currently we are using Java regexp format (with ^$.* instead of _% in sql LIKE) for the native FST, which is okay (not sure if all regexp are supported though, sql LIKE syntax is much simpler). Do we want to reuse the current TEXT_MATCH which currently takes a lucene search query, or come up with a new function that performs regexpLike/Like + logical operators?
cc @kishoreg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor, readability) Remove this empty line
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor should take a PinotDataBuffer instead of column and indexDir. The PinotDataBuffer can be part of the combined index file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface follows the same semantics as TextIndexReader interface, hence this signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should throw UnsupportedException here as we should never ask for dictionary ids from text index (the returned dictionary id is not the dictionary id for the column, but for the tokens)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As one of the basic requirement, we should support logical operators in the search query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CONTAINS supports boolean operators
...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java
Outdated
Show resolved
Hide resolved
My suggestion would be to use a new function that accepts potentially ANSI SQL LIKE style + delta needed to handle phrase, fuzzy and is not tied to lucene syntax currently used by TEXT_MATCH. On the other hand, exposing this new index via TEXT_MATCH and Lucene syntax along with index rebuilding in SegmentPreprocessor can help with easy migration of users currently using existing lucene text index |
|
Ping to see how do we want to move forward ? |
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Outdated
Show resolved
Hide resolved
Here is the plan -- LIKE is going to start using text indices if available, thus providing support for LIKE on native indices + logical operators come free due to the ability to combine multiple LIKE predicates. In the current PR, I will allow TEXT_MATCH to work on top of native text indices, but any query passed in to the native text index through TEXT_MATCH will be treated as a regex query. Alternative is to disable TEXT_MATCH on native indices completely. Phrase and fuzzy will be supported by new functions. |
This might not align with the SQL
I'd suggest coming up with a new function to support logical operations on token regex matching. Something like Looking for inputs here @siddharthteotia @richardstartin @kishoreg |
I am hesitant to add a new function just for the regex matching. I like your suggestion around CONTAINS, will add that for the same, and a new function for phrase matching when we get there. |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 major comments:
- Support boolean operator in search query
- Use magic header and version when reading the index file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can extend BasePredicate. See other predicate for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) Add a new line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) Put CONTAINS after REGEXP_LIKE for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to implement record() (see TextMatchFilterOperator for example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it ever be negative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, FST will return an empty value if nothing found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(MAJOR) What I meant is to support search query with logical operators, e.g. www.domain1% OR %www.domain1. One way to achieve that is to use CalciteSqlParser.compileToExpression() and then treat identifier as literal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We support boolean operators for CONTAINS using the standard syntax -- A CONTAINS "foo" AND A CONTAINS "bar". I will do a follow up PR for supporting the syntax mentioned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some text (multiple terms) instead of simple domain name?
pinot-core/src/test/java/org/apache/pinot/queries/NativeAndLuceneComparisonTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to compare the result from lucene and native. Currently all the expectedResults are null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(optional) It will be faster to collect the bitmaps into an array and or them afterwards, because it avoids computing the cardinalities of each intermediate union.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do that as a follow up
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkNativeVsLuceneTextIndex.java
Outdated
Show resolved
Hide resolved
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkNativeVsLuceneTextIndex.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please base this benchmark on the structure of BenchmarkNativeAndLuceneBasedLike which ensures that the segments can't be mixed up. I don't have time to sanity check this benchmark right now, but BenchmarkNativeAndLuceneBasedLike is the right pattern to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is impossible to mix segments up since TEXT_MATCH works only on Lucene based segments and CONTAINS works only on native segments.
Nevertheless, I have taken your advise and refactored the benchmark to be in line with BenchmarkNativeAndLuceneBasedLike
|
@Jackie-Jiang As discussed:
All other comments are fixed. Please take a look. |
…al/segment/store/TextIndexUtils.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>
…al/segment/store/TextIndexUtils.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>
…al/utils/nativefst/NativeTextIndexCreator.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>
…al/indexsegment/mutable/MutableSegmentImpl.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>
…al/utils/nativefst/NativeTextIndexCreator.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>
89ab437 to
b57d0a2
Compare
b57d0a2 to
d287d60
Compare
This PR implements native text operator. It is based on the native FST implementation and is configured as an extra attribute to the existing text index.
This PR also implements the TEXT_CONTAINS operator. TEXT_CONTAINS operator allows you to search a text field using tokens. e.g.:
SELECT * FROM foo WHERE TEXT_CONTAINS(bar, '.*l') OR TEXT_CONTAINS(barbar, 'p')Please note that TEXT_CONTAINS works only on native text indices.
A new FieldConfig property "fstType" is defined to allow defining the index type. If none is specified, Lucene index type is used.
The current implementation supports regex queries through TEXT_CONTAINS. Phrase and wildcard queries will be supported soon, using new methods.