-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Customize stopword for Lucene Index #9708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b5e3536 to
0cb9609
Compare
Codecov Report
@@ Coverage Diff @@
## master #9708 +/- ##
============================================
- Coverage 69.91% 60.35% -9.56%
- Complexity 4855 5192 +337
============================================
Files 1950 1938 -12
Lines 104339 104020 -319
Branches 15804 15768 -36
============================================
- Hits 72945 62778 -10167
- Misses 26267 36550 +10283
+ Partials 5127 4692 -435
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java
Outdated
Show resolved
Hide resolved
4703975 to
49d0289
Compare
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java
Outdated
Show resolved
Hide resolved
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Outdated
Show resolved
Hide resolved
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Outdated
Show resolved
Hide resolved
siddharthteotia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Mostly minor comments.
...java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java
Outdated
Show resolved
Hide resolved
jackjlli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor but LGTM.
...in/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if stopWordsInclude and stopWordsExclude have the same word, will the code throw any exception on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current behavior is to prioritize exclude these duplicate words. I'm planning to put default ENGLISH_STOP_WORDS_SET, config keys, and this behavior in the user doc.
...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java
Outdated
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java
Outdated
Show resolved
Hide resolved
|
@jasperjiaguo can you also rebase once if not already. Just in case we see some spurious failures again. |
03572dd to
5c7c949
Compare
|
The tests pass locally and this failed even after retriggering couple of times. Merging this |
|
Thanks for adding the feature. Can you please update the PR description with an example on how to configure the stop words? |
This PR addresses a production scenario where we see
IT(Information Technology) in text data but it was pruned out from the text index asitis in theENGLISH_STOP_WORDS_SET. Therefore, we would like to make the stop word set configurable per column.The way to use this is to add
"stopWordInclude":"incl1, incl2, ..."and"stopWordExclude":"IT, excl2, excl3 ..."in thepropertiesof field config. As will include/exclude words from the default stop words list of"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", "they", "this", "to", "was", "will", "with", "those"