Skip to content

Conversation

@gortiz
Copy link
Contributor

@gortiz gortiz commented Jan 27, 2023

This is the first draft PR related to #10183. The whole PR is in #10184. It isn't intended to be merged as it is and it doesn't require to compile or pass the checks. It has been created in order to make it easier to understand the PR with all the changes.

It shows the basics of the proposed IndexType without actually modifying the code.

@gortiz gortiz marked this pull request as draft January 27, 2023 09:37
@gortiz gortiz changed the title Draft commit with the IndexType draft [draft] index service: IndexType draft Jan 27, 2023
@codecov-commenter
Copy link

codecov-commenter commented Feb 1, 2023

Codecov Report

Merging #10191 (41b28b0) into master (7c3c8e8) will increase coverage by 1.99%.
The diff coverage is 38.88%.

@@             Coverage Diff              @@
##             master   #10191      +/-   ##
============================================
+ Coverage     68.27%   70.26%   +1.99%     
- Complexity     5262     6027     +765     
============================================
  Files          2049     2053       +4     
  Lines        111033   111122      +89     
  Branches      16891    16899       +8     
============================================
+ Hits          75810    78085    +2275     
+ Misses        29816    27558    -2258     
- Partials       5407     5479      +72     
Flag Coverage Δ
integration1 24.45% <0.00%> (+0.10%) ⬆️
integration2 24.44% <0.00%> (?)
unittests1 67.87% <38.88%> (-0.01%) ⬇️
unittests2 13.90% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...t/local/segment/index/loader/BaseIndexHandler.java 88.88% <ø> (ø)
...ocal/segment/index/loader/IndexHandlerFactory.java 100.00% <ø> (ø)
...tedIndexAndDictionaryBasedForwardIndexCreator.java 77.24% <ø> (ø)
...ocal/segment/index/loader/SegmentPreProcessor.java 84.76% <ø> (ø)
...ment/spi/index/IndexReaderConstraintException.java 0.00% <0.00%> (ø)
.../org/apache/pinot/segment/spi/index/IndexType.java 0.00% <0.00%> (ø)
...che/pinot/segment/spi/index/FieldIndexConfigs.java 45.45% <45.45%> (ø)
...org/apache/pinot/spi/config/table/IndexConfig.java 100.00% <100.00%> (ø)

... and 194 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Jackie-Jiang Jackie-Jiang changed the title [draft] index service: IndexType draft [Index SPI] IndexType Feb 21, 2023
@Jackie-Jiang Jackie-Jiang marked this pull request as ready for review February 21, 2023 21:24
@gortiz
Copy link
Contributor Author

gortiz commented Feb 28, 2023

@Jackie-Jiang I've modified this PR in order to remove the IndexDeclaration, as we discussed online. I'll be checking the tests, but this should be ready to merge

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is much cleaner now

_configMap = new HashMap<>(other._configMap);
}

public <C extends IndexConfig, I extends IndexType<C, ?, ?>> Builder add(I indexType, @Nullable C config) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allow adding null config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove it and delegate on undeclare

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, should we add the default config when it is not explicitly configured?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't necessary and it would force us to have larger maps for no reason. It is not necessary because the getConfig method is implemented as:

  public <C extends IndexConfig, I extends IndexType<C, ?, ?>> C getConfig(I indexType) {
    IndexConfig config = _configMap.get(indexType);
    if (config == null) {
      return indexType.getDefaultConfig();
    }
    return (C) config;
  }

The size of the map wouldn't be an actual memory problem, but:

  • It would force us to iterate over all the index types registered whenever we create one of these objects
  • We would therefore need to add a dependency to StandardIndexes. This pattern were we have to (ab)use of static references is not something I really like, so I'm trying to limit it as much as possible.
  • This would make debugging (or logging) quite more difficult, as these objects will have one entry per index instead of one entry per explicitly define index.

@gortiz
Copy link
Contributor Author

gortiz commented Mar 1, 2023

Just realize this is copy-pasted from the class in segment local, but the original IndexHandler is modified recently..
Let's directly move the class to segment spi since there are very few usages of

I've just checked and there is no recent change in the interface. Both interfaces are not the same because I removed IndexCreatorProvider, as it is not needed in this version.

I'm going to move the class. It would increase the number of changes in this PR, but they should only be imports.

PS: I will move the current IndexHandler class, with IndexCreatorProvider. Therefore I will need to modify that class in following PRs

@gortiz
Copy link
Contributor Author

gortiz commented Mar 1, 2023

But in ColumnIndexType, the _indexName is always the lower case of the enum value. Can we always use IndexName? That can avoid a lot of confusion. If you really think we need 2 different names, we can call the internal one getInternalName()

I don't like having the two concepts either, but I think we need them for compatibility and usability reasons.

Yes, the ids of the literals in ColumnIndexType are in upper case and the names are the same value in lower case. But the literals are the following:

DICTIONARY("dictionary"),
  FORWARD_INDEX("forward_index"),
  INVERTED_INDEX("inverted_index"),
  BLOOM_FILTER("bloom_filter"),
  NULLVALUE_VECTOR("nullvalue_vector"),
  TEXT_INDEX("text_index"),
  FST_INDEX("fst_index"),
  JSON_INDEX("json_index"),
  RANGE_INDEX("range_index"),
  H3_INDEX("h3_index");

As you can see, all but dictionary and null value have the _index suffix.

In index-spi we have a new concept of id which is the one used to describe the index to the user. We use this id to indicate that something is wrong with an index and the user uses that id to define the config for that specific index. This new concept of id is different than the older one. We don't want the user to write the _index suffix everywhere like the following:

{
  "fieldConfigList": [
    {
       "name": “col1”,
       "indexes": {
         "inverted_index": {
           "enabled": true
         },
         "range_index": {
           "enabled": true
         },
         "text_index": {
           "type": “lucene”,
           "enableQueryCacheForTextIndex": false,
           "stopWordsInclude": ["a", "b", "c"]
         },
         "bloom_index": {
           "fpp": 0.01,
           "maxSizeInBytes": 1000000,
           "loadOnHeap": true
          }
       }
    }
  ]
}

And I think printing errors like "there was a problem with index of type " + indexType would be quite repetitive if they print "there was a problem with index of type bloom_index".

At the same time we cannot decide to just remove the _index suffix because that literal is used in the segment metadata, so we need to read segments commited with _index suffix. And at the same time the suffix is not always there (dictionary and null vector do not add that suffix) so we cannot just create the version with suffix by appending _index to the new user-friendly concept of id.

@Jackie-Jiang
Copy link
Contributor

Jackie-Jiang commented Mar 1, 2023

IMO it is okay to have the _index suffix for inverted, forward etc. We don't have bloom_index (it is bloom_filter), dictionary_index which is controversial.
E.g. in IndexingConfig we have _jsonIndexColumns, _rangeIndexColumns, _bloomFilterColumns

@gortiz
Copy link
Contributor Author

gortiz commented Mar 2, 2023

IMO it is okay to have the _index suffix for inverted, forward etc. We don't have bloom_index (it is bloom_filter), dictionary_index which is controversial.

IMO that is cheep, repetitive (we define these objects as keys in indexes) and in general a bad user experience that is very easy to solve (in fact, it is solved with this PR). We can discuss whether we want to use other names on methods (maybe not calling them getId and getName), but forcing users to use these not that friendly names when it is easy to let them use better names is not a good idea.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@gortiz
Copy link
Contributor Author

gortiz commented Mar 9, 2023

Should we rename this method to getName()?

I think id is a better description for this concept. Unlike names, ids usually imply unicity and stability in time, while name is usually a term more focused on how to present the information to the user and they usually can change.

Also, all other branches are already using id so I would prefer to do not have to change them again unless you really think name is better.

* @param values The nonnull value of the cell. In case the cell was actually null, an empty array is received instead
* @param dictIds An optional array of dictionary values. If there is no dictionary, null is received.
*/
void add(@Nonnull Object[] values, @Nullable int[] dictIds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the convention is to annotate @Nullable only and leave default as @Nonnull.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but here I'm introducing both to be more explicit. Like adding bold in a text just to emphasize something. If you think that is problematic I can remove it.

I know this is not the place to discuss it, but I have to say that the convention is not correct. Or better said, it is partial in the sense that tools doesn't know that. We should add @ParametersAreNonnullByDefault to all packages in order to inform tools that this is our standard. For example with that spotless or spotbugs can detect NPE at compile time. Otherwise they require explicit nullability annotations anywhere.


public IndexReaderConstraintException(String columnName, IndexType<?, ?, ?> type, String constraintDesc,
Throwable cause) {
this("Cannot read an index of type " + type + " on column " + columnName + ". Reason: " + constraintDesc,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: String.format(...) to read more easily

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a personal taste. I find String.format quite more difficult to read given that I have to move my reading cursor to the end of the sentence in order to know what is going to feel the format variable.

* Sometimes it doesn't make sense to create an index, even when the user explicitly asked for it. For example, an
* inverted index shouldn't be created when the column is sorted.
*
* Apache Pinot will call this method once all index configurations have been parsed and it is included in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache Pinot feels too generic to me here. Perhaps list some example callers of this method here, e.g. SegmentPreProcessor or SegmentCreator (iiuc) .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to be too generic than too specific. Using an IDE (and even github!) it is easy to know who calls a function, so an interested reader can calculate the complete list of callers in real time. But in the too specific case, it may happen that in a future the caller changes to another class and it would be very easy to forget to change this javadoc, which would make it incorrect.

*
* Most indexes are stored as a buffer, but for example TextIndexType is stored in a separate lucene file.
*/
default boolean storedAsBuffer() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious how this method is going to be used?

The StarTree index data is kept in a separate file (star_tree_index) as PinotDataBuffers. Would it break any assumption here?

Copy link
Contributor Author

@gortiz gortiz Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be used in TextIndexType. StarTree index is not an index covered by index-spi.

You can see how it is being used here (in case the link doesn't work: file is pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/converter/SegmentV1V2ToV3FormatConverter.java in #10184).

TL;DR: if this method returns false, the buffer is not copied. I think this will only be false in TextIndexType, but we need to introduce the method here in order to do not introduce a special case in the generic code of SegmentV1V2ToV3FormatConverter

@Jackie-Jiang Jackie-Jiang merged commit f8d4320 into apache:master Mar 13, 2023
@gortiz gortiz mentioned this pull request Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants