Skip to content

Conversation

@saurabhd336
Copy link
Contributor

@saurabhd336 saurabhd336 commented Mar 1, 2024

Allows support for multi value fields to jsonExtractIndex. Explained below

columnName: jsonField
value:
{
    "stringField": "xyz",
    "arrField": [{"f1": 1, "f2": 2}, {"f1": 3, "f2": 4}, {"f2": 6}]
}

jsonExtractIndex(jsonField, '$.arrField[*].f1', 'INT_ARRAY') returns [1, 3] for this row
jsonExtractIndex(jsonField, '$.arrField[*].f2', 'INT_ARRAY') returns [2, 4, 6] for this row

Caveat to be covered in documentation:
If the json doc has > 1 array fields, users must set disableCrossArrayUnnest: true in the JSON index configs for the results to be correct.

There is a followup PR #12683 which adds support for contextual filtering when extracting MV values

@codecov-commenter
Copy link

codecov-commenter commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 94.68085% with 10 lines in your changes are missing coverage. Please review.

Project coverage is 61.50%. Comparing base (59551e4) to head (8bebd78).
Report is 150 commits behind head on master.

Files Patch % Lines
...rm/function/JsonExtractIndexTransformFunction.java 82.45% 10 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12532      +/-   ##
============================================
- Coverage     61.75%   61.50%   -0.26%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2452      +16     
  Lines        133233   133978     +745     
  Branches      20636    20756     +120     
============================================
+ Hits          82274    82398     +124     
- Misses        44911    45413     +502     
- Partials       6048     6167     +119     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.43% <94.68%> (-0.28%) ⬇️
java-21 61.37% <94.68%> (-0.25%) ⬇️
skip-bytebuffers-false 61.47% <94.68%> (-0.28%) ⬇️
skip-bytebuffers-true 61.32% <94.68%> (+33.59%) ⬆️
temurin 61.50% <94.68%> (-0.26%) ⬇️
unittests 61.49% <94.68%> (-0.26%) ⬇️
unittests1 46.32% <25.00%> (-0.57%) ⬇️
unittests2 27.80% <69.68%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Saurabh Dubey added 2 commits March 1, 2024 23:46
@saurabhd336 saurabhd336 marked this pull request as ready for review March 2, 2024 02:36
@saurabhd336 saurabhd336 changed the title Json extract index mv Json extract index mv + range and ragexp support on json index Mar 2, 2024
@saurabhd336 saurabhd336 marked this pull request as draft March 2, 2024 03:12
@saurabhd336 saurabhd336 marked this pull request as ready for review March 3, 2024 18:10
@saurabhd336 saurabhd336 changed the title Json extract index mv + range and ragexp support on json index Json extract index mv + range and regexp evaluation support with json index Mar 4, 2024
@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Mar 5, 2024
@saurabhd336
Copy link
Contributor Author

@wirybeaver @Jackie-Jiang I've updated the PR to consolidate with jsonExtractIndex (essentially added MV support to jsonExtractIndex, as well as support for jsonPath filter). There is one caveat here wrt MV return types, wherein for the MV return type to work correctly, users must set disableCrossArrayUnnest: true in order to get correct results. Please do have a look!

@wirybeaver
Copy link
Contributor

wirybeaver commented Mar 21, 2024

Just paste my previous suggestion from the slack discussion so that other reviewers can understand how we get here:

Previously, I thought Saurabh wants to support non leaf json path. It turns out we are aiming to do the same thing except the execution order. I planed to support json path containing array index and then support array return type whereas Saurabh wants to support array type first (maybe array index later?)

I think it’s better to wait for my PR 12466 get merged since my PR considers to process json path containing array index, including [*]. Take the example below, if the json path is "$.[1][0].value", then just finding the key with prefix "…value<0000>" doesn’t work. We need more steps to get the value belonging to [1][0], i.e., intersecting with the roaringBitmap of .$index, .$index<0000>1 and ..$index<0000>0. getMatchingDocsMap(String jsonPath) already handle this logic.
And then Saurahb can work on the Regexpr predicate and Array type with the getMatchingDocsMap. I am good to take over the Array type if Saurahb doesn’t have bandwidth. I suggest that the array return type and Regexp should be splitted into two PRs.

 "[[{\"value\":1}, {\"value\":2}], [{\"value\":1}, {\"value\":4}]]"

"$.[*][0].value" should return [1, 1] , am I correct? If so, in order to support array as return type, getMatchingDocsMap(String key) should have return FlattendDocIds rather than the original DocIds to allow duplication elements as Saurahb does in their PR. At the moment, the getMatchingDocsMap returing original DocsIds is fine since the return type is just SV (no duplication since it’s SV).

@wirybeaver
Copy link
Contributor

@saurabhd336 Two comments.
Could you split the filter and mv into two PR? It's pretty tough to cherry-pick the MV feature only.

}

@Override
public Map<String, RoaringBitmap> getValueToFlattenedDocIdsMap(String jsonPathKey,
Copy link
Contributor

@wirybeaver wirybeaver Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly modify the getMatchingDocsMap which should have return FlattenDocIds and move the converting of flattenDocIds to DocIds into the up layer, i.e. transform function. I understand fixing the unit test JsonIndexTest might take more time but it's worth to do that since more test cases can be reused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good

_matchingDocsMap = _jsonIndexReader.getMatchingDocsMap(_jsonPathString);
_resultMetadata = new TransformResultMetadata(dataType, isSingleValue, false);
_valueToMatchingFlattenedDocIdsMap =
_jsonIndexReader.getMatchingFlattenedDocsMap(inputJsonPath.substring(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Suggest passing in the full path, and JSON index reader should validate the path

* @return String[][] where String[i] is the mv array for docIds[i]
*/
String[] getValuesForKeyAndDocs(int[] docIds, Map<String, RoaringBitmap> context);
String[][] getValuesForMv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String[][] getValuesForMv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);
String[][] getValuesMV(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) We usually put API for SV in front of MV

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

* @return String[] where String[i] is the sv value for docIds[i]
*/
Map<String, RoaringBitmap> getMatchingDocsMap(String key);
String[] getValuesForSv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String[] getValuesForSv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);
String[] getValuesSV(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

int numChildren = children.size();
MutableRoaringBitmap matchingDocIds = getMatchingFlattenedDocIds(children.get(0));
MutableRoaringBitmap matchingDocIds =
getMatchingFlattenedDocIds(children.get(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(format) Please reformat all the changes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Map<String, RoaringBitmap> valueToMatchingFlattenedDocIdsMap = new HashMap<>();
_readLock.lock();
try {
Pair<String, RoaringBitmap> result = getKeyAndFlattenDocId(jsonPathKey);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor, not introduced here) Rename the method to getKeyAndFlattenedDocId

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

public Map<String, RoaringBitmap> getMatchingDocsMap(String key) {
Map<String, RoaringBitmap> matchingDocsMap = new HashMap<>();
@VisibleForTesting
public Map<String, RoaringBitmap> convertFlattenedDocIdsToDocIds(Map<String, RoaringBitmap> valueToFlattenedDocIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now used to optimize SV extraction path by converting flattened doc IDs map to unflattened.

}

@VisibleForTesting
public Map<String, RoaringBitmap> convertFlattenedDocIdsToDocIds(Map<String, RoaringBitmap> valueToFlattenedDocIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now used to optimize SV extraction path by converting flattened doc IDs map to unflattened.

Copy link
Contributor

@wirybeaver wirybeaver Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense since the Transform function might be invoked multiple times. Then we should remove VisibleForTesting? I just noticed that I incidentally remove an optimization comment:

I just realized that we should keep both valueToDocId and valueToFlattenDocId. Using valueToFlattenDocIds would increase the read latency since it need to iterate the flattenDocId bitmap and create the docId bitmap. If the return type isSingleValue, init valueToDocId; otherwise, init valueToFlattenDocIds.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments

* @return String[] where String[i] is the sv value for docIds[i]
*/
String[] getValuesForKeyAndDocs(int[] docIds, Map<String, RoaringBitmap> context);
String[] getValuesSv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToDocs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String[] getValuesSv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToDocs,
String[] getValuesSV(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToDocs,

* @return String[][] where String[i] is the mv array for docIds[i]
*/
Map<String, RoaringBitmap> getMatchingDocsMap(String key);
String[][] getValuesMv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String[][] getValuesMv(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);
String[][] getValuesMV(int[] docIds, int length, Map<String, RoaringBitmap> matchingValueToFlattenedDocs);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

* and the associated flattenDocId bitmap
*/
private Pair<String, MutableRoaringBitmap> getKeyAndFlattenDocId(String key) {
private Pair<String, MutableRoaringBitmap> getKeyAndFlattenedDocId(String key) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private Pair<String, MutableRoaringBitmap> getKeyAndFlattenedDocId(String key) {
private Pair<String, MutableRoaringBitmap> getKeyAndFlattenedDocIds(String key) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new syntax, I think we want to pass in the real json path, e.g. $.foo, or do we want to keep the current way of no dollar sign?

@saurabhd336 saurabhd336 merged commit ddc3d0b into apache:master Mar 21, 2024
@Jackie-Jiang
Copy link
Contributor

One more question: with current implementation, does jsonExtractIndex(jsonField, '$.arrField[*].f1', 'INT_ARRAY') work? Or user should use jsonExtractIndex(jsonField, '.arrField[*].f1', 'INT_ARRAY') (no dollar sign)?

@wirybeaver
Copy link
Contributor

One more question: with current implementation, does jsonExtractIndex(jsonField, '$.arrField[*].f1', 'INT_ARRAY') work? Or user should use jsonExtractIndex(jsonField, '.arrField[*].f1', 'INT_ARRAY') (no dollar sign)?

the former one

@npawar
Copy link
Contributor

npawar commented May 22, 2024

was the documentation updated with this change @saurabhd336 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants