Skip to content

Conversation

@atris
Copy link
Contributor

@atris atris commented Mar 22, 2022

This PR implements native text operator. It is based on the native FST implementation and is configured as an extra attribute to the existing text index.

This PR also implements the TEXT_CONTAINS operator. TEXT_CONTAINS operator allows you to search a text field using tokens. e.g.:

SELECT * FROM foo WHERE TEXT_CONTAINS(bar, '.*l') OR TEXT_CONTAINS(barbar, 'p')

Please note that TEXT_CONTAINS works only on native text indices.

A new FieldConfig property "fstType" is defined to allow defining the index type. If none is specified, Lucene index type is used.

The current implementation supports regex queries through TEXT_CONTAINS. Phrase and wildcard queries will be supported soon, using new methods.

@atris atris requested a review from Jackie-Jiang March 22, 2022 19:08
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level question: does FST + inverted index enough to solve the phrase query? Based on my understanding, we need to also keep the token positioning info in order to solve it?
We should take all the use cases we want to solve into consideration because it is hard to change the index once it is generated and pushed to the cluster.

richardstartin
richardstartin previously approved these changes Mar 22, 2022
Copy link
Member

@richardstartin richardstartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good, just minor comments.

@Jackie-Jiang
Copy link
Contributor

Let's not merge this PR yet. Want to ensure all the functionalities we want to support for text index is covered by the index (positional info needs to be stored). The actual support can be added in multiple PRs, but we don't want to change index format frequently.

@richardstartin richardstartin self-requested a review March 22, 2022 21:05
@richardstartin richardstartin dismissed their stale review March 22, 2022 21:06

concerns about MV logic

@codecov-commenter
Copy link

codecov-commenter commented Mar 23, 2022

Codecov Report

Merging #8384 (89ab437) into master (72e1844) will decrease coverage by 56.24%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #8384       +/-   ##
=============================================
- Coverage     70.33%   14.09%   -56.25%     
+ Complexity     4375       84     -4291     
=============================================
  Files          1705     1664       -41     
  Lines         89699    87989     -1710     
  Branches      13568    13387      -181     
=============================================
- Hits          63093    12398    -50695     
- Misses        22155    74660    +52505     
+ Partials       4451      931     -3520     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 ?
unittests2 14.09% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ot/common/request/context/RequestContextUtils.java 0.00% <0.00%> (-70.44%) ⬇️
...n/request/context/predicate/ContainsPredicate.java 0.00% <0.00%> (ø)
...ot/common/request/context/predicate/Predicate.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/pql/parsers/pql2/ast/FilterKind.java 0.00% <0.00%> (-100.00%) ⬇️
...manager/realtime/LLRealtimeSegmentDataManager.java 0.00% <0.00%> (-70.88%) ⬇️
...t/core/operator/filter/ContainsFilterOperator.java 0.00% <0.00%> (ø)
...inot/core/operator/filter/FilterOperatorUtils.java 0.00% <0.00%> (-88.75%) ⬇️
...ava/org/apache/pinot/core/plan/FilterPlanNode.java 0.00% <0.00%> (-82.54%) ⬇️
...local/indexsegment/mutable/MutableSegmentImpl.java 0.00% <0.00%> (-58.82%) ⬇️
...ent/local/realtime/impl/RealtimeSegmentConfig.java 0.00% <0.00%> (-91.87%) ⬇️
... and 1374 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72e1844...89ab437. Read the comment docs.

@kishoreg
Copy link
Member

Let's not merge this PR yet. Want to ensure all the functionalities we want to support for text index is covered by the index (positional info needs to be stored). The actual support can be added in multiple PRs, but we don't want to change index format frequently.

You are right if we try to change text match operator semantics.. I was thinking more along using like operator to use this index and leave the text match operator as it is. Phrase match is not that important and can be done later in another PR

Wdyt?

@siddharthteotia
Copy link
Contributor

siddharthteotia commented Mar 23, 2022

Is this exposed via TEXT_MATCH function and thus requires Lucene query syntax ? IIRC, in the initial design we wanted to expose this through a different function / udf (LIKE to get ANSI SQL) so that the query syntax doesn't necessarily have to be Lucene based.

Referencing our earlier discussion - #7395 (comment)

@Jackie-Jiang
Copy link
Contributor

Let's not merge this PR yet. Want to ensure all the functionalities we want to support for text index is covered by the index (positional info needs to be stored). The actual support can be added in multiple PRs, but we don't want to change index format frequently.

You are right if we try to change text match operator semantics.. I was thinking more along using like operator to use this index and leave the text match operator as it is. Phrase match is not that important and can be done later in another PR

Wdyt?

We don't need this index in order to support LIKE. LIKE and regexpLike is already supported with the FST (either lucene or native) and won't use this index.

@atris
Copy link
Contributor Author

atris commented Mar 23, 2022

Is this exposed via TEXT_MATCH function and thus requires Lucene query syntax ? IIRC, in the initial design we wanted to expose this through a different function / udf (LIKE to get ANSI SQL) so that the query syntax doesn't necessarily have to be Lucene based.

Referencing our earlier discussion - #7395 (comment)

Yes, the plan is to support LIKE and a new method for PHRASE search. The native text index is exposed as an option within the text index, thus supports TEXT_MATCH by default. We will have to disable TEXT_MATCH on native indices explicitly.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to have some discussion on the syntax of the new text index. Currently we are using Java regexp format (with ^$.* instead of _% in sql LIKE) for the native FST, which is okay (not sure if all regexp are supported though, sql LIKE syntax is much simpler). Do we want to reuse the current TEXT_MATCH which currently takes a lucene search query, or come up with a new function that performs regexpLike/Like + logical operators?

cc @kishoreg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor, readability) Remove this empty line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor should take a PinotDataBuffer instead of column and indexDir. The PinotDataBuffer can be part of the combined index file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface follows the same semantics as TextIndexReader interface, hence this signature

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should throw UnsupportedException here as we should never ask for dictionary ids from text index (the returned dictionary id is not the dictionary id for the column, but for the tokens)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As one of the basic requirement, we should support logical operators in the search query

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTAINS supports boolean operators

@siddharthteotia
Copy link
Contributor

Want to have some discussion on the syntax of the new text index. Currently we are using Java regexp format (with ^$.* instead of _% in sql LIKE) for the native FST, which is okay (not sure if all regexp are supported though, sql LIKE syntax is much simpler). Do we want to reuse the current TEXT_MATCH which currently takes a lucene search query, or come up with a new function that performs regexpLike/Like + logical operators?

cc @kishoreg

My suggestion would be to use a new function that accepts potentially ANSI SQL LIKE style + delta needed to handle phrase, fuzzy and is not tied to lucene syntax currently used by TEXT_MATCH.

On the other hand, exposing this new index via TEXT_MATCH and Lucene syntax along with index rebuilding in SegmentPreprocessor can help with easy migration of users currently using existing lucene text index

@siddharthteotia
Copy link
Contributor

Ping to see how do we want to move forward ?

@atris
Copy link
Contributor Author

atris commented Apr 4, 2022

Ping to see how do we want to move forward ?

Here is the plan -- LIKE is going to start using text indices if available, thus providing support for LIKE on native indices + logical operators come free due to the ability to combine multiple LIKE predicates.

In the current PR, I will allow TEXT_MATCH to work on top of native text indices, but any query passed in to the native text index through TEXT_MATCH will be treated as a regex query. Alternative is to disable TEXT_MATCH on native indices completely.

Phrase and fuzzy will be supported by new functions.

@Jackie-Jiang
Copy link
Contributor

Jackie-Jiang commented Apr 4, 2022

Here is the plan -- LIKE is going to start using text indices if available, thus providing support for LIKE on native indices + logical operators come free due to the ability to combine multiple LIKE predicates.

This might not align with the SQL LIKE semantic which should match the whole document instead of the token matching. This is more like the CONTAINS described here. I didn't find a standard SQL function for the token matching, not sure if there is one.

In the current PR, I will allow TEXT_MATCH to work on top of native text indices, but any query passed in to the native text index through TEXT_MATCH will be treated as a regex query. Alternative is to disable TEXT_MATCH on native indices completely.

I'd suggest coming up with a new function to support logical operations on token regex matching. Something like TEXT_LIKE(textCol, 'ab.*' AND '.cd')

Looking for inputs here @siddharthteotia @richardstartin @kishoreg

@atris
Copy link
Contributor Author

atris commented Apr 5, 2022

here

I am hesitant to add a new function just for the regex matching. I like your suggestion around CONTAINS, will add that for the same, and a new function for phrase matching when we get there.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 major comments:

  • Support boolean operator in search query
  • Use magic header and version when reading the index file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can extend BasePredicate. See other predicate for example

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Add a new line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Put CONTAINS after REGEXP_LIKE for consistency

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this field

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to implement record() (see TextMatchFilterOperator for example)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it ever be negative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, FST will return an empty value if nothing found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(MAJOR) What I meant is to support search query with logical operators, e.g. www.domain1% OR %www.domain1. One way to achieve that is to use CalciteSqlParser.compileToExpression() and then treat identifier as literal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We support boolean operators for CONTAINS using the standard syntax -- A CONTAINS "foo" AND A CONTAINS "bar". I will do a follow up PR for supporting the syntax mentioned

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some text (multiple terms) instead of simple domain name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to compare the result from lucene and native. Currently all the expectedResults are null

Comment on lines 115 to 122
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(optional) It will be faster to collect the bitmaps into an array and or them afterwards, because it avoids computing the cardinalities of each intermediate union.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do that as a follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please base this benchmark on the structure of BenchmarkNativeAndLuceneBasedLike which ensures that the segments can't be mixed up. I don't have time to sanity check this benchmark right now, but BenchmarkNativeAndLuceneBasedLike is the right pattern to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is impossible to mix segments up since TEXT_MATCH works only on Lucene based segments and CONTAINS works only on native segments.

Nevertheless, I have taken your advise and refactored the benchmark to be in line with BenchmarkNativeAndLuceneBasedLike

@atris
Copy link
Contributor Author

atris commented Apr 29, 2022

@Jackie-Jiang As discussed:

  1. We already support boolean queries on CONTAINS (field1 CONTAINS foo AND field1 CONTAINS bar. I will follow up with a PR to support the syntax you mentioned.

All other comments are fixed. Please take a look.
@richardstartin

@atris atris force-pushed the text_operator branch from 6e2b63a to 3f5e47f Compare May 2, 2022 04:53
@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels May 4, 2022
@Jackie-Jiang Jackie-Jiang merged commit 907b023 into apache:master May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants