Skip to content

Conversation

@jasperjiaguo
Copy link
Contributor

@jasperjiaguo jasperjiaguo commented Nov 2, 2022

This PR addresses a production scenario where we see IT (Information Technology) in text data but it was pruned out from the text index as it is in the ENGLISH_STOP_WORDS_SET. Therefore, we would like to make the stop word set configurable per column.

The way to use this is to add "stopWordInclude":"incl1, incl2, ..." and "stopWordExclude":"IT, excl2, excl3 ..." in the properties of field config. As will include/exclude words from the default stop words list of "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "than", "there", "these", "they", "this", "to", "was", "will", "with", "those"

@codecov-commenter
Copy link

codecov-commenter commented Nov 2, 2022

Codecov Report

Merging #9708 (5c7c949) into master (a541396) will decrease coverage by 9.55%.
The diff coverage is 93.33%.

@@             Coverage Diff              @@
##             master    #9708      +/-   ##
============================================
- Coverage     69.91%   60.35%   -9.56%     
- Complexity     4855     5192     +337     
============================================
  Files          1950     1938      -12     
  Lines        104339   104020     -319     
  Branches      15804    15768      -36     
============================================
- Hits          72945    62778   -10167     
- Misses        26267    36550   +10283     
+ Partials       5127     4692     -435     
Flag Coverage Δ
integration1 ?
integration2 24.59% <0.00%> (+<0.01%) ⬆️
unittests1 67.48% <93.33%> (+0.05%) ⬆️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...org/apache/pinot/spi/config/table/FieldConfig.java 96.42% <ø> (ø)
...local/indexsegment/mutable/MutableSegmentImpl.java 57.83% <40.00%> (-0.05%) ⬇️
...me/impl/invertedindex/RealtimeLuceneTextIndex.java 66.10% <100.00%> (ø)
...ment/creator/impl/DefaultIndexCreatorProvider.java 67.44% <100.00%> (ø)
...ment/creator/impl/SegmentColumnarIndexCreator.java 79.88% <100.00%> (+0.27%) ⬆️
...ment/creator/impl/text/LuceneTextIndexCreator.java 79.59% <100.00%> (+0.86%) ⬆️
...t/index/loader/invertedindex/TextIndexHandler.java 86.51% <100.00%> (+0.47%) ⬆️
...ment/index/readers/text/LuceneTextIndexReader.java 79.68% <100.00%> (+0.65%) ⬆️
...ot/segment/local/segment/store/TextIndexUtils.java 87.50% <100.00%> (+11.02%) ⬆️
...inot/segment/spi/creator/IndexCreationContext.java 100.00% <100.00%> (ø)
... and 378 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Mostly minor comments.

@jasperjiaguo jasperjiaguo marked this pull request as ready for review November 2, 2022 07:10
Copy link
Member

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor but LGTM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if stopWordsInclude and stopWordsExclude have the same word, will the code throw any exception on that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior is to prioritize exclude these duplicate words. I'm planning to put default ENGLISH_STOP_WORDS_SET, config keys, and this behavior in the user doc.

@siddharthteotia
Copy link
Contributor

@jasperjiaguo can you also rebase once if not already. Just in case we see some spurious failures again.

@siddharthteotia
Copy link
Contributor

The tests pass locally and this failed even after retriggering couple of times.

Merging this

@siddharthteotia siddharthteotia merged commit 229c55e into apache:master Nov 2, 2022
@Jackie-Jiang Jackie-Jiang added feature Configuration Config changes (addition/deletion/change in behavior) release-notes Referenced by PRs that need attention when compiling the next release notes labels Nov 7, 2022
@Jackie-Jiang
Copy link
Contributor

Jackie-Jiang commented Nov 7, 2022

Thanks for adding the feature. Can you please update the PR description with an example on how to configure the stop words?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Configuration Config changes (addition/deletion/change in behavior) feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants