Implement Native Text Operator #8384

atris · 2022-03-22T18:52:08Z

This PR implements a native text operator. It is based on the native FST implementation and is configured as an extra attribute on the existing text index.

This PR also implements the TEXT_CONTAINS operator, which lets you search a text field by token. For example:

SELECT * FROM foo WHERE TEXT_CONTAINS(bar, '.*l') OR TEXT_CONTAINS(barbar, 'p')

Please note that TEXT_CONTAINS works only on native text indices.
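To make the token-level semantics concrete, here is a minimal sketch of what matching a regex against tokens (rather than against the whole field value) means. It is a brute-force stand-in written for illustration, not the FST-backed implementation in this PR; the class name and the whitespace tokenizer are assumptions.

```java
import java.util.regex.Pattern;

// Illustrative only: brute-force stand-in for token-level regex matching.
// Splitting on whitespace is an assumption, not the tokenizer used by the PR.
final class TokenRegexSketch {
  private TokenRegexSketch() {
  }

  /** Returns true if at least one whitespace-delimited token fully matches the regex. */
  static boolean textContains(String document, String tokenRegex) {
    Pattern pattern = Pattern.compile(tokenRegex);
    for (String token : document.trim().split("\\s+")) {
      if (!token.isEmpty() && pattern.matcher(token).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // "still" ends with 'l', so the pattern ".*l" matches one token.
    System.out.println(textContains("still waters run deep", ".*l"));
    // No token is exactly "p", so this does not match even though "deep" contains 'p'.
    System.out.println(textContains("still waters run deep", "p"));
  }
}
```

Under these assumptions, a pattern has to match an entire token, which is why 'p' alone does not match a document whose only 'p' sits inside a longer token.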

A new FieldConfig property, "fstType", selects the index type. If none is specified, the Lucene index type is used.
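The defaulting rule can be sketched as follows. Only the "fstType" key comes from this description; the enum, the helper, the accepted value strings, and the case-insensitive parsing are hypothetical and are not the classes touched by the PR.

```java
import java.util.Locale;
import java.util.Map;

// Illustrative only: the enum and helper below are hypothetical, not Pinot classes.
final class FstTypeSketch {
  enum IndexType { LUCENE, NATIVE }

  static final String FST_TYPE_KEY = "fstType";

  private FstTypeSketch() {
  }

  /** Resolves the index type from field properties, defaulting to LUCENE when unset. */
  static IndexType resolve(Map<String, String> properties) {
    String value = properties == null ? null : properties.get(FST_TYPE_KEY);
    if (value == null) {
      return IndexType.LUCENE;
    }
    // Case-insensitive parsing is an assumption of this sketch.
    return IndexType.valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}
```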

The current implementation supports regex queries through TEXT_CONTAINS. Phrase and wildcard queries will be supported soon, using new methods.

Jackie-Jiang

High level question: is FST + inverted index enough to solve the phrase query? Based on my understanding, we also need to keep the token position info in order to solve it.
We should take all the use cases we want to solve into consideration because it is hard to change the index once it is generated and pushed to the cluster.
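The point about positional info can be illustrated with a toy index. In the sketch below, which is an assumption-laden illustration and not the native text index format, two documents contain exactly the same tokens, so a token-to-docId inverted index cannot tell them apart; only the stored positions let a phrase query distinguish them. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy positional index, not the native text index format.
final class PositionalIndexSketch {
  // token -> (docId -> positions of the token within that document)
  private final Map<String, Map<Integer, List<Integer>>> _postings = new HashMap<>();

  void add(int docId, String text) {
    String[] tokens = text.trim().split("\\s+");
    for (int pos = 0; pos < tokens.length; pos++) {
      _postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
          .computeIfAbsent(docId, d -> new ArrayList<>())
          .add(pos);
    }
  }

  /** Returns the sorted doc ids in which the phrase tokens occur at consecutive positions. */
  List<Integer> matchPhrase(String phrase) {
    String[] terms = phrase.trim().split("\\s+");
    List<Integer> result = new ArrayList<>();
    Map<Integer, List<Integer>> first = _postings.getOrDefault(terms[0], Collections.emptyMap());
    for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
      int docId = entry.getKey();
      for (int start : entry.getValue()) {
        if (followsAt(docId, start, terms)) {
          result.add(docId);
          break;
        }
      }
    }
    Collections.sort(result);
    return result;
  }

  // Checks that terms[1..] appear at start+1, start+2, ... in the given document.
  private boolean followsAt(int docId, int start, String[] terms) {
    for (int i = 1; i < terms.length; i++) {
      List<Integer> positions = _postings.getOrDefault(terms[i], Collections.emptyMap()).get(docId);
      if (positions == null || !positions.contains(start + i)) {
        return false;
      }
    }
    return true;
  }
}
```

Without the per-document position lists, the index could only answer "which documents contain all of these tokens", which is a superset of the phrase matches.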

...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/IndexCreationContext.java

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

richardstartin

Looks really good, just minor comments.

Jackie-Jiang · 2022-03-22T20:39:10Z

Let's not merge this PR yet. Want to ensure all the functionalities we want to support for text index are covered by the index (positional info needs to be stored). The actual support can be added in multiple PRs, but we don't want to change index format frequently.

...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java

concerns about MV logic

codecov-commenter · 2022-03-23T06:15:57Z

Codecov Report

Merging #8384 (89ab437) into master (72e1844) will decrease coverage by 56.24%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #8384       +/-   ##
=============================================
- Coverage     70.33%   14.09%   -56.25%     
+ Complexity     4375       84     -4291     
=============================================
  Files          1705     1664       -41     
  Lines         89699    87989     -1710     
  Branches      13568    13387      -181     
=============================================
- Hits          63093    12398    -50695     
- Misses        22155    74660    +52505     
+ Partials       4451      931     -3520

Flag	Coverage Δ
integration1	`?`
integration2	`?`
unittests1	`?`
unittests2	`14.09% <0.00%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...ot/common/request/context/RequestContextUtils.java	`0.00% <0.00%> (-70.44%)`	⬇️
...n/request/context/predicate/ContainsPredicate.java	`0.00% <0.00%> (ø)`
...ot/common/request/context/predicate/Predicate.java	`0.00% <0.00%> (-100.00%)`	⬇️
.../apache/pinot/pql/parsers/pql2/ast/FilterKind.java	`0.00% <0.00%> (-100.00%)`	⬇️
...manager/realtime/LLRealtimeSegmentDataManager.java	`0.00% <0.00%> (-70.88%)`	⬇️
...t/core/operator/filter/ContainsFilterOperator.java	`0.00% <0.00%> (ø)`
...inot/core/operator/filter/FilterOperatorUtils.java	`0.00% <0.00%> (-88.75%)`	⬇️
...ava/org/apache/pinot/core/plan/FilterPlanNode.java	`0.00% <0.00%> (-82.54%)`	⬇️
...local/indexsegment/mutable/MutableSegmentImpl.java	`0.00% <0.00%> (-58.82%)`	⬇️
...ent/local/realtime/impl/RealtimeSegmentConfig.java	`0.00% <0.00%> (-91.87%)`	⬇️
... and 1374 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 72e1844...89ab437.

kishoreg · 2022-03-23T06:16:27Z

> Let's not merge this PR yet. Want to ensure all the functionalities we want to support for text index are covered by the index (positional info needs to be stored). The actual support can be added in multiple PRs, but we don't want to change index format frequently.

You are right if we try to change the text match operator semantics. I was thinking more along the lines of having the LIKE operator use this index and leaving the text match operator as it is. Phrase match is not that important and can be done later in another PR.

Wdyt?

siddharthteotia · 2022-03-23T16:02:06Z

Is this exposed via the TEXT_MATCH function, and does it therefore require Lucene query syntax? IIRC, in the initial design we wanted to expose this through a different function / UDF (LIKE, to get ANSI SQL) so that the query syntax doesn't necessarily have to be Lucene based.

Referencing our earlier discussion - #7395 (comment)

Jackie-Jiang · 2022-03-23T16:57:29Z

> > Let's not merge this PR yet. Want to ensure all the functionalities we want to support for text index are covered by the index (positional info needs to be stored). The actual support can be added in multiple PRs, but we don't want to change index format frequently.
>
> You are right if we try to change the text match operator semantics. I was thinking more along the lines of having the LIKE operator use this index and leaving the text match operator as it is. Phrase match is not that important and can be done later in another PR.
>
> Wdyt?

We don't need this index in order to support LIKE. LIKE and regexpLike are already supported with the FST (either Lucene or native) and won't use this index.

atris · 2022-03-23T17:04:23Z

> Is this exposed via the TEXT_MATCH function, and does it therefore require Lucene query syntax? IIRC, in the initial design we wanted to expose this through a different function / UDF (LIKE, to get ANSI SQL) so that the query syntax doesn't necessarily have to be Lucene based.
>
> Referencing our earlier discussion - #7395 (comment)

Yes, the plan is to support LIKE and a new method for PHRASE search. The native text index is exposed as an option within the text index, thus supports TEXT_MATCH by default. We will have to disable TEXT_MATCH on native indices explicitly.

Jackie-Jiang

Want to have some discussion on the syntax of the new text index. Currently we are using Java regexp format (with ^$.* instead of _% in sql LIKE) for the native FST, which is okay (not sure if all regexp are supported though, sql LIKE syntax is much simpler). Do we want to reuse the current TEXT_MATCH which currently takes a lucene search query, or come up with a new function that performs regexpLike/Like + logical operators?

cc @kishoreg

Jackie-Jiang · 2022-03-24T20:08:36Z

...al/src/main/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfig.java

(minor, readability) Remove this empty line

...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

Jackie-Jiang · 2022-03-24T20:17:05Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

The constructor should take a PinotDataBuffer instead of column and indexDir. The PinotDataBuffer can be part of the combined index file

The interface follows the same semantics as TextIndexReader interface, hence this signature

Jackie-Jiang · 2022-03-24T20:21:50Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

Should throw UnsupportedOperationException here, as we should never ask for dictionary ids from the text index (the returned dictionary id is not the dictionary id for the column, but for the tokens)

Jackie-Jiang · 2022-03-24T20:27:53Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

As one of the basic requirements, we should support logical operators in the search query

CONTAINS supports boolean operators

...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java

siddharthteotia · 2022-03-25T21:18:59Z

> Want to have some discussion on the syntax of the new text index. Currently we are using Java regexp format (with ^$.* instead of _% in sql LIKE) for the native FST, which is okay (not sure if all regexp are supported though, sql LIKE syntax is much simpler). Do we want to reuse the current TEXT_MATCH which currently takes a lucene search query, or come up with a new function that performs regexpLike/Like + logical operators?
>
> cc @kishoreg

My suggestion would be to use a new function that potentially accepts ANSI SQL LIKE style syntax, plus the delta needed to handle phrase and fuzzy queries, and is not tied to the Lucene syntax currently used by TEXT_MATCH.

On the other hand, exposing this new index via TEXT_MATCH and Lucene syntax, along with index rebuilding in SegmentPreprocessor, can help with easy migration of users currently using the existing Lucene text index.

siddharthteotia · 2022-03-29T04:24:27Z

Ping to see how we want to move forward?

...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java

atris · 2022-04-04T08:14:44Z

> Ping to see how we want to move forward?

Here is the plan -- LIKE is going to start using text indices if available, thus providing support for LIKE on native indices + logical operators come free due to the ability to combine multiple LIKE predicates.

In the current PR, I will allow TEXT_MATCH to work on top of native text indices, but any query passed in to the native text index through TEXT_MATCH will be treated as a regex query. Alternative is to disable TEXT_MATCH on native indices completely.

Phrase and fuzzy will be supported by new functions.

Jackie-Jiang · 2022-04-04T21:02:13Z

> Here is the plan -- LIKE is going to start using text indices if available, thus providing support for LIKE on native indices + logical operators come free due to the ability to combine multiple LIKE predicates.

This might not align with SQL LIKE semantics, which match against the whole document rather than against individual tokens. This is more like the CONTAINS described here. I didn't find a standard SQL function for token matching; not sure if there is one.

> In the current PR, I will allow TEXT_MATCH to work on top of native text indices, but any query passed in to the native text index through TEXT_MATCH will be treated as a regex query. Alternative is to disable TEXT_MATCH on native indices completely.

I'd suggest coming up with a new function to support logical operations on token regex matching. Something like TEXT_LIKE(textCol, 'ab.*' AND '.cd')

Looking for inputs here @siddharthteotia @richardstartin @kishoreg
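The difference between whole-value LIKE semantics and token matching raised in this comment can be sketched as below. This is an illustration under stated assumptions: the class and method names are hypothetical, tokens are split on whitespace, and the LIKE translation handles only % and _ (no ESCAPE clause).

```java
import java.util.regex.Pattern;

// Illustrative only: contrasts whole-value LIKE with per-token matching.
final class LikeVersusTokenSketch {
  private LikeVersusTokenSketch() {
  }

  /** Translates a SQL LIKE pattern into a Java regex: % becomes .*, _ becomes ., the rest is literal. */
  static Pattern likeToRegex(String like) {
    StringBuilder regex = new StringBuilder();
    for (char c : like.toCharArray()) {
      if (c == '%') {
        regex.append(".*");
      } else if (c == '_') {
        regex.append('.');
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(regex.toString(), Pattern.DOTALL);
  }

  /** SQL LIKE semantics: the pattern must match the entire column value. */
  static boolean likeWholeValue(String value, String like) {
    return likeToRegex(like).matcher(value).matches();
  }

  /** Token semantics: the pattern only has to match one whitespace-delimited token. */
  static boolean likeAnyToken(String value, String like) {
    Pattern pattern = likeToRegex(like);
    for (String token : value.trim().split("\\s+")) {
      if (pattern.matcher(token).matches()) {
        return true;
      }
    }
    return false;
  }
}
```

For a value such as "visit www.domain1.com today", the pattern www.domain1% fails as a whole-value LIKE but succeeds under token matching, which is the mismatch described above.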

atris · 2022-04-05T06:59:49Z

I am hesitant to add a new function just for the regex matching. I like your suggestion around CONTAINS, will add that for the same, and a new function for phrase matching when we get there.

Jackie-Jiang

2 major comments:

Support boolean operator in search query
Use magic header and version when reading the index file
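The second point, validating a magic header and version on read, could look roughly like the sketch below. The magic value, version number, byte layout, and class name are all hypothetical and chosen for illustration; they are not the format written by NativeTextIndexCreator.

```java
import java.nio.ByteBuffer;

// Illustrative only: the magic value, version, and layout are hypothetical.
final class IndexHeaderSketch {
  static final int MAGIC = 0x4E545849; // arbitrary marker chosen for this sketch
  static final int VERSION = 1;
  static final int HEADER_BYTES = 2 * Integer.BYTES;

  private IndexHeaderSketch() {
  }

  /** Prefixes the payload with [magic][version]. */
  static ByteBuffer write(byte[] payload) {
    ByteBuffer buffer = ByteBuffer.allocate(HEADER_BYTES + payload.length);
    buffer.putInt(MAGIC).putInt(VERSION).put(payload);
    buffer.flip();
    return buffer;
  }

  /** Validates the header and returns the payload; rejects foreign files and unknown versions. */
  static byte[] read(ByteBuffer buffer) {
    ByteBuffer view = buffer.duplicate();
    if (view.remaining() < HEADER_BYTES) {
      throw new IllegalArgumentException("Truncated index header");
    }
    int magic = view.getInt();
    if (magic != MAGIC) {
      throw new IllegalArgumentException("Unexpected magic: 0x" + Integer.toHexString(magic));
    }
    int version = view.getInt();
    if (version != VERSION) {
      throw new IllegalArgumentException("Unsupported index version: " + version);
    }
    byte[] payload = new byte[view.remaining()];
    view.get(payload);
    return payload;
  }
}
```

Checking the version up front is what allows the on-disk format to evolve later without silently misreading older segments.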

Jackie-Jiang · 2022-04-27T17:33:33Z

...ommon/src/main/java/org/apache/pinot/common/request/context/predicate/ContainsPredicate.java

It can extend BasePredicate. See other predicate for example

Jackie-Jiang · 2022-04-27T17:33:46Z

...ommon/src/main/java/org/apache/pinot/common/request/context/predicate/ContainsPredicate.java

(minor) Add a new line

Jackie-Jiang · 2022-04-27T17:35:11Z

pinot-common/src/main/java/org/apache/pinot/common/request/context/predicate/Predicate.java

(minor) Put CONTAINS after REGEXP_LIKE for consistency

Jackie-Jiang · 2022-04-27T17:38:54Z

pinot-core/src/main/java/org/apache/pinot/core/operator/filter/ContainsFilterOperator.java

Remove this field

Jackie-Jiang · 2022-04-27T17:41:34Z

pinot-core/src/main/java/org/apache/pinot/core/operator/filter/ContainsFilterOperator.java

You might want to implement record() (see TextMatchFilterOperator for example)

Jackie-Jiang · 2022-04-27T19:00:48Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

Can it ever be negative?

No, FST will return an empty value if nothing found.

Jackie-Jiang · 2022-04-27T19:02:21Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

(MAJOR) What I meant is to support search query with logical operators, e.g. www.domain1% OR %www.domain1. One way to achieve that is to use CalciteSqlParser.compileToExpression() and then treat identifier as literal

We support boolean operators for CONTAINS using the standard syntax -- A CONTAINS "foo" AND A CONTAINS "bar". I will do a follow up PR for supporting the syntax mentioned
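As a rough illustration of what evaluating logical operators inside a single search query involves, here is a hand-rolled evaluator over token postings. It does not use CalciteSqlParser.compileToExpression() as suggested above; it is a simplified stand-in that assumes regex terms, AND binding tighter than OR, and no parentheses, NOT, or quoting. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Illustrative only: a tiny OR/AND evaluator over token postings.
final class BooleanQuerySketch {
  private BooleanQuerySketch() {
  }

  /** Evaluates a query such as "blog OR news AND sports", where AND binds tighter than OR. */
  static TreeSet<Integer> evaluate(String query, Map<String, TreeSet<Integer>> postings) {
    TreeSet<Integer> union = new TreeSet<>();
    for (String disjunct : query.split("\\s+OR\\s+")) {
      TreeSet<Integer> intersection = null;
      for (String term : disjunct.split("\\s+AND\\s+")) {
        TreeSet<Integer> docs = matchTerm(term.trim(), postings);
        if (intersection == null) {
          intersection = docs;
        } else {
          intersection.retainAll(docs);
        }
      }
      if (intersection != null) {
        union.addAll(intersection);
      }
    }
    return union;
  }

  /** Returns the union of doc ids of every indexed token that fully matches the regex. */
  private static TreeSet<Integer> matchTerm(String regex, Map<String, TreeSet<Integer>> postings) {
    Pattern pattern = Pattern.compile(regex);
    TreeSet<Integer> docs = new TreeSet<>();
    for (Map.Entry<String, TreeSet<Integer>> entry : postings.entrySet()) {
      if (pattern.matcher(entry.getKey()).matches()) {
        docs.addAll(entry.getValue());
      }
    }
    return docs;
  }
}
```

Combining separate predicates at the SQL level gives the same result sets; the in-query form mainly saves repeating the column name for every term.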

Jackie-Jiang · 2022-04-27T19:05:46Z

pinot-core/src/test/java/org/apache/pinot/queries/NativeAndLuceneComparisonTest.java

Can we add some text (multiple terms) instead of simple domain name?

pinot-core/src/test/java/org/apache/pinot/queries/NativeAndLuceneComparisonTest.java

Jackie-Jiang · 2022-04-27T19:07:42Z

pinot-core/src/test/java/org/apache/pinot/queries/NativeAndLuceneComparisonTest.java

We want to compare the result from lucene and native. Currently all the expectedResults are null

richardstartin · 2022-04-28T19:31:28Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

(optional) It will be faster to collect the bitmaps into an array and OR them afterwards, because it avoids computing the cardinalities of each intermediate union.

Will do that as a follow up

pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkNativeVsLuceneTextIndex.java

richardstartin · 2022-04-28T19:35:39Z

pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkNativeVsLuceneTextIndex.java

Please base this benchmark on the structure of BenchmarkNativeAndLuceneBasedLike which ensures that the segments can't be mixed up. I don't have time to sanity check this benchmark right now, but BenchmarkNativeAndLuceneBasedLike is the right pattern to follow.

It is impossible to mix segments up since TEXT_MATCH works only on Lucene based segments and CONTAINS works only on native segments.

Nevertheless, I have taken your advice and refactored the benchmark to be in line with BenchmarkNativeAndLuceneBasedLike

atris · 2022-04-29T13:22:11Z

@Jackie-Jiang As discussed:

We already support boolean queries on CONTAINS (field1 CONTAINS foo AND field1 CONTAINS bar). I will follow up with a PR to support the syntax you mentioned.

All other comments are fixed. Please take a look.
@richardstartin

atris requested a review from Jackie-Jiang March 22, 2022 19:08

Jackie-Jiang reviewed Mar 22, 2022

...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java

richardstartin reviewed Mar 22, 2022

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/IndexCreationContext.java

richardstartin reviewed Mar 22, 2022

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

richardstartin reviewed Mar 22, 2022

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/NativeTextIndexReader.java

richardstartin previously approved these changes Mar 22, 2022

richardstartin reviewed Mar 22, 2022

...cal/src/main/java/org/apache/pinot/segment/local/utils/nativefst/NativeTextIndexCreator.java

richardstartin self-requested a review March 22, 2022 21:05

Jackie-Jiang reviewed Mar 24, 2022

richardstartin reviewed Mar 31, 2022

...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java

atris force-pushed the text_operator branch from fd7f714 to dda717d April 25, 2022 17:17

Jackie-Jiang reviewed Apr 27, 2022

richardstartin reviewed Apr 28, 2022

pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkNativeVsLuceneTextIndex.java

richardstartin reviewed Apr 28, 2022

pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkNativeVsLuceneTextIndex.java

richardstartin reviewed Apr 28, 2022

atris force-pushed the text_operator branch from 6e2b63a to 3f5e47f May 2, 2022 04:53

Atri Sharma and others added 22 commits May 3, 2022 14:00

Revert check for lucene text indices

3a4fa02

Use peekable iterator

444f0bc

Add option of returning list

03be8ce

V1 of comment fixing

a914ea2

More review comments

fe20e9a

More reviews

807b50d

More review comments

4a0cf88

Remove redundant call

18ea328

More refactoring

1ec1b52

Update per rebase

4ccfa14

Update pinot-segment-local/src/main/java/org/apache/pinot/segment/loc…

2e85118

…al/segment/store/TextIndexUtils.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>

Update pinot-segment-local/src/main/java/org/apache/pinot/segment/loc…

b041adb

…al/segment/store/TextIndexUtils.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>

Update pinot-segment-local/src/main/java/org/apache/pinot/segment/loc…

0c1a115

…al/utils/nativefst/NativeTextIndexCreator.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>

Update pinot-segment-local/src/main/java/org/apache/pinot/segment/loc…

92cf013

…al/indexsegment/mutable/MutableSegmentImpl.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>

Update pinot-segment-local/src/main/java/org/apache/pinot/segment/loc…

135f5e7

…al/utils/nativefst/NativeTextIndexCreator.java Co-authored-by: Xiaotian (Jackie) Jiang <[email protected]>

More review comments

9a9318f

More review fixes

e7bcdea

Fix import failure

3e86554

Remove new integration test

1449460

More fixes

ac07c63

Cleanup

15c3b48

Cleanup

b75a2a0

Jackie-Jiang force-pushed the text_operator branch from 89ab437 to b57d0a2 May 3, 2022 21:01

Change CONTAINS to TEXT_CONTAINS to avoid conflict

d287d60

Jackie-Jiang force-pushed the text_operator branch from b57d0a2 to d287d60 May 3, 2022 21:06

Jackie-Jiang added the feature and release-notes labels May 4, 2022

Jackie-Jiang approved these changes May 4, 2022

Jackie-Jiang merged commit 907b023 into apache:master May 4, 2022
