[Index SPI] IndexType #10191

gortiz · 2023-01-27T09:37:12Z

This is the first draft PR related to #10183. The whole PR is in #10184. It isn't intended to be merged as it is and it doesn't require to compile or pass the checks. It has been created in order to make it easier to understand the PR with all the changes.

It shows the basics of the proposed IndexType without actually modifying the code.

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/FieldIndexConfigs.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexDeclaration.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexCreator.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexHandler.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexReaderFactory.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java

…nto index-spi-index-type-only

codecov-commenter · 2023-02-01T10:47:31Z

Codecov Report

Merging #10191 (41b28b0) into master (7c3c8e8) will increase coverage by 1.99%.
The diff coverage is 38.88%.

@@             Coverage Diff              @@
##             master   #10191      +/-   ##
============================================
+ Coverage     68.27%   70.26%   +1.99%     
- Complexity     5262     6027     +765     
============================================
  Files          2049     2053       +4     
  Lines        111033   111122      +89     
  Branches      16891    16899       +8     
============================================
+ Hits          75810    78085    +2275     
+ Misses        29816    27558    -2258     
- Partials       5407     5479      +72

Flag	Coverage Δ
integration1	`24.45% <0.00%> (+0.10%)`	⬆️
integration2	`24.44% <0.00%> (?)`
unittests1	`67.87% <38.88%> (-0.01%)`	⬇️
unittests2	`13.90% <0.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...t/local/segment/index/loader/BaseIndexHandler.java	`88.88% <ø> (ø)`
...ocal/segment/index/loader/IndexHandlerFactory.java	`100.00% <ø> (ø)`
...tedIndexAndDictionaryBasedForwardIndexCreator.java	`77.24% <ø> (ø)`
...ocal/segment/index/loader/SegmentPreProcessor.java	`84.76% <ø> (ø)`
...ment/spi/index/IndexReaderConstraintException.java	`0.00% <0.00%> (ø)`
.../org/apache/pinot/segment/spi/index/IndexType.java	`0.00% <0.00%> (ø)`
...che/pinot/segment/spi/index/FieldIndexConfigs.java	`45.45% <45.45%> (ø)`
...org/apache/pinot/spi/config/table/IndexConfig.java	`100.00% <100.00%> (ø)`

... and 194 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

…e-only

…tead

gortiz · 2023-02-28T14:17:05Z

@Jackie-Jiang I've modified this PR in order to remove the IndexDeclaration, as we discussed online. I'll be checking the tests, but this should be ready to merge

Jackie-Jiang

It is much cleaner now

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/FieldIndexConfigs.java

Jackie-Jiang · 2023-02-28T22:33:13Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/FieldIndexConfigs.java

+      _configMap = new HashMap<>(other._configMap);
+    }
+
+    public <C extends IndexConfig, I extends IndexType<C, ?, ?>> Builder add(I indexType, @Nullable C config) {


Do we allow adding null config?

We can remove it and delegate on undeclare

Hmm, should we add the default config when it is not explicitly configured?

It isn't necessary and it would force us to have larger maps for no reason. It is not necessary because the getConfig method is implemented as:

public <C extends IndexConfig, I extends IndexType<C, ?, ?>> C getConfig(I indexType) { IndexConfig config = _configMap.get(indexType); if (config == null) { return indexType.getDefaultConfig(); } return (C) config; }

The size of the map wouldn't be an actual memory problem, but:

It would force us to iterate over all the index types registered whenever we create one of these objects

We would therefore need to add a dependency to StandardIndexes. This pattern were we have to (ab)use of static references is not something I really like, so I'm trying to limit it as much as possible.

This would make debugging (or logging) quite more difficult, as these objects will have one entry per index instead of one entry per explicitly define index.

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexCreator.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexHandler.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexReaderFactory.java

pinot-spi/src/main/java/org/apache/pinot/spi/config/table/IndexConfig.java

gortiz · 2023-03-01T10:08:56Z

Just realize this is copy-pasted from the class in segment local, but the original IndexHandler is modified recently..
Let's directly move the class to segment spi since there are very few usages of

I've just checked and there is no recent change in the interface. Both interfaces are not the same because I removed IndexCreatorProvider, as it is not needed in this version.

I'm going to move the class. It would increase the number of changes in this PR, but they should only be imports.

PS: I will move the current IndexHandler class, with IndexCreatorProvider. Therefore I will need to modify that class in following PRs

…he.pinot.segment.local.segment.index.loader

gortiz · 2023-03-01T10:21:56Z

But in ColumnIndexType, the _indexName is always the lower case of the enum value. Can we always use IndexName? That can avoid a lot of confusion. If you really think we need 2 different names, we can call the internal one getInternalName()

I don't like having the two concepts either, but I think we need them for compatibility and usability reasons.

Yes, the ids of the literals in ColumnIndexType are in upper case and the names are the same value in lower case. But the literals are the following:

DICTIONARY("dictionary"),
  FORWARD_INDEX("forward_index"),
  INVERTED_INDEX("inverted_index"),
  BLOOM_FILTER("bloom_filter"),
  NULLVALUE_VECTOR("nullvalue_vector"),
  TEXT_INDEX("text_index"),
  FST_INDEX("fst_index"),
  JSON_INDEX("json_index"),
  RANGE_INDEX("range_index"),
  H3_INDEX("h3_index");

As you can see, all but dictionary and null value have the _index suffix.

In index-spi we have a new concept of id which is the one used to describe the index to the user. We use this id to indicate that something is wrong with an index and the user uses that id to define the config for that specific index. This new concept of id is different than the older one. We don't want the user to write the _index suffix everywhere like the following:

{
  "fieldConfigList": [
    {
       "name": “col1”,
       "indexes": {
         "inverted_index": {
           "enabled": true
         },
         "range_index": {
           "enabled": true
         },
         "text_index": {
           "type": “lucene”,
           "enableQueryCacheForTextIndex": false,
           "stopWordsInclude": ["a", "b", "c"]
         },
         "bloom_index": {
           "fpp": 0.01,
           "maxSizeInBytes": 1000000,
           "loadOnHeap": true
          }
       }
    }
  ]
}

And I think printing errors like "there was a problem with index of type " + indexType would be quite repetitive if they print "there was a problem with index of type bloom_index".

At the same time we cannot decide to just remove the _index suffix because that literal is used in the segment metadata, so we need to read segments commited with _index suffix. And at the same time the suffix is not always there (dictionary and null vector do not add that suffix) so we cannot just create the version with suffix by appending _index to the new user-friendly concept of id.

Jackie-Jiang · 2023-03-01T19:51:40Z

IMO it is okay to have the _index suffix for inverted, forward etc. We don't have bloom_index (it is bloom_filter), dictionary_index which is controversial.
E.g. in IndexingConfig we have _jsonIndexColumns, _rangeIndexColumns, _bloomFilterColumns

gortiz · 2023-03-02T07:39:26Z

IMO it is okay to have the _index suffix for inverted, forward etc. We don't have bloom_index (it is bloom_filter), dictionary_index which is controversial.

IMO that is cheep, repetitive (we define these objects as keys in indexes) and in general a bad user experience that is very easy to solve (in fact, it is solved with this PR). We can discuss whether we want to use other names on methods (maybe not calling them getId and getName), but forcing users to use these not that friendly names when it is easy to let them use better names is not a good idea.

…e-only

Jackie-Jiang

LGTM otherwise

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/FieldIndexConfigs.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java

…e-only

gortiz · 2023-03-09T09:49:02Z

Should we rename this method to getName()?

I think id is a better description for this concept. Unlike names, ids usually imply unicity and stability in time, while name is usually a term more focused on how to present the information to the user and they usually can change.

Also, all other branches are already using id so I would prefer to do not have to change them again unless you really think name is better.

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexCreator.java

klsince · 2023-03-09T21:16:01Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexCreator.java

+   * @param values The nonnull value of the cell. In case the cell was actually null, an empty array is received instead
+   * @param dictIds An optional array of dictionary values. If there is no dictionary, null is received.
+   */
+  void add(@Nonnull Object[] values, @Nullable int[] dictIds)


IIRC, the convention is to annotate @Nullable only and leave default as @Nonnull.

I know, but here I'm introducing both to be more explicit. Like adding bold in a text just to emphasize something. If you think that is problematic I can remove it.

I know this is not the place to discuss it, but I have to say that the convention is not correct. Or better said, it is partial in the sense that tools doesn't know that. We should add @ParametersAreNonnullByDefault to all packages in order to inform tools that this is our standard. For example with that spotless or spotbugs can detect NPE at compile time. Otherwise they require explicit nullability annotations anywhere.

klsince · 2023-03-09T21:16:49Z

...ent-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexReaderConstraintException.java

+
+  public IndexReaderConstraintException(String columnName, IndexType<?, ?, ?> type, String constraintDesc,
+      Throwable cause) {
+    this("Cannot read an index of type " + type + " on column " + columnName + ". Reason: " + constraintDesc,


nit: String.format(...) to read more easily

I think that is a personal taste. I find String.format quite more difficult to read given that I have to move my reading cursor to the end of the sentence in order to know what is going to feel the format variable.

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java

klsince · 2023-03-09T21:23:22Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java

+   * Sometimes it doesn't make sense to create an index, even when the user explicitly asked for it. For example, an
+   * inverted index shouldn't be created when the column is sorted.
+   *
+   * Apache Pinot will call this method once all index configurations have been parsed and it is included in the


Apache Pinot feels too generic to me here. Perhaps list some example callers of this method here, e.g. SegmentPreProcessor or SegmentCreator (iiuc) .

I prefer to be too generic than too specific. Using an IDE (and even github!) it is easy to know who calls a function, so an interested reader can calculate the complete list of callers in real time. But in the too specific case, it may happen that in a future the caller changes to another class and it would be very easy to forget to change this javadoc, which would make it incorrect.

klsince · 2023-03-09T21:29:19Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/IndexType.java

+   *
+   * Most indexes are stored as a buffer, but for example TextIndexType is stored in a separate lucene file.
+   */
+  default boolean storedAsBuffer() {


curious how this method is going to be used?

The StarTree index data is kept in a separate file (star_tree_index) as PinotDataBuffers. Would it break any assumption here?

This is going to be used in TextIndexType. StarTree index is not an index covered by index-spi.

You can see how it is being used here (in case the link doesn't work: file is pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/converter/SegmentV1V2ToV3FormatConverter.java in #10184).

TL;DR: if this method returns false, the buffer is not copied. I think this will only be false in TextIndexType, but we need to introduce the method here in order to do not introduce a special case in the generic code of SegmentV1V2ToV3FormatConverter

Draft commit with the IndexType draft

a002ce8

gortiz marked this pull request as draft January 27, 2023 09:37

gortiz mentioned this pull request Jan 27, 2023

Index Service Provider Interface (SPI) #10183

Closed

5 tasks

gortiz added 2 commits January 27, 2023 11:34

Simplify IndexType with tons of UnsupportedOperationException

3ed6dfe

Remove BloomIndexType to make this code compile

049bf33

gortiz mentioned this pull request Jan 27, 2023

Index spi: index service #10192

Merged

gortiz changed the title ~~Draft commit with the IndexType draft~~ [draft] index service: IndexType draft Jan 27, 2023

gortiz mentioned this pull request Jan 27, 2023

Draft commit with the IndexService draft gortiz/pinot#1

Draft

Fix two checkstyle issues

6e4c5d3

Jackie-Jiang added the index-spi label Jan 31, 2023

Jackie-Jiang reviewed Jan 31, 2023

View reviewed changes

gortiz added 3 commits February 1, 2023 11:00

Remove segment dir param from IndexReaderFactory.read

eac81b4

Merge branch 'index-spi-index-type-only' of github.com:gortiz/pinot i…

23c85b4

…nto index-spi-index-type-only

Improve javadoc on IndexType.deserialize

83450fd

gortiz added 5 commits February 1, 2023 12:08

Declare a constant before attributes

b476808

Remove unused @JsonIgnore annotation

68f0f78

Merge remote-tracking branch 'origin/master' into index-spi-index-typ…

6645098

…e-only

Merge remote-tracking branch 'origin/master' into index-spi-index-typ…

89ffd18

…e-only

Merge branch 'master' into index-spi-index-type-only

0288378

Jackie-Jiang changed the title ~~[draft] index service: IndexType draft~~ [Index SPI] IndexType Feb 21, 2023

Jackie-Jiang marked this pull request as ready for review February 21, 2023 21:24

gortiz added 2 commits February 28, 2023 13:13

Merge remote-tracking branch 'origin/master' into index-spi-index-typ…

fa7b848

…e-only

Remove IndexDeclaration. Use the single config deserialize method ins…

7b3c70e

…tead

remove alternativeSingleValue method from IndexCreator

2c2ae31

Jackie-Jiang reviewed Feb 28, 2023

View reviewed changes

gortiz added 3 commits March 1, 2023 10:46

Change FieldIndexConfigs.Builder to do not accept null configs

3b8ad74

Rename add methods in IndexCreator as add

4b71dec

Rename IndexType.deserialize as IndexType.getConfig

ac825a9

Rename IndexReaderFactory.read as IndexReaderFactory.createIndexReader

8c055e6

gortiz added 2 commits March 1, 2023 11:10

Remove copy of IndexHandler

4b0a56c

Move IndexHandler to org.apache.pinot.segment.spi.index from org.apac…

bcb99e1

…he.pinot.segment.local.segment.index.loader

gortiz added 4 commits March 2, 2023 15:37

Add FieldIndexConfigs.toString

08e4bf0

Add javadoc

346fca6

Remove indexName concept

8950402

Merge remote-tracking branch 'origin/master' into index-spi-index-typ…

182820e

…e-only

Jackie-Jiang approved these changes Mar 8, 2023

View reviewed changes

Merge remote-tracking branch 'origin/master' into index-spi-index-typ…

a941f30

…e-only

Apply some stetical changes recommended in the PR

f910590

klsince reviewed Mar 9, 2023

View reviewed changes

gortiz added 2 commits March 10, 2023 11:33

Improve javadoc

8fb35dc

Fix typo in javadoc

41b28b0

Jackie-Jiang merged commit f8d4320 into apache:master Mar 13, 2023

gortiz mentioned this pull request Mar 21, 2023

Complete index spi #10184

Merged

saharahmed27 mentioned this pull request Jan 23, 2025

Index Service Provider Interface (SPI) #14899

Closed

[Index SPI] IndexType #10191

[Index SPI] IndexType #10191

Uh oh!

Conversation

gortiz commented Jan 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Feb 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gortiz commented Feb 28, 2023

Uh oh!

Jackie-Jiang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gortiz commented Mar 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gortiz commented Mar 1, 2023

Uh oh!

Jackie-Jiang commented Mar 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gortiz commented Mar 2, 2023

Uh oh!

Jackie-Jiang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gortiz commented Mar 9, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gortiz Mar 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gortiz commented Jan 27, 2023 •

edited

Loading

codecov-commenter commented Feb 1, 2023 •

edited

Loading

gortiz commented Mar 1, 2023 •

edited

Loading

Jackie-Jiang commented Mar 1, 2023 •

edited

Loading

gortiz Mar 10, 2023 •

edited

Loading