Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. #11210

kirkrodrigues · 2023-07-28T19:05:17Z

tags: feature, release-notes

This adds a RecordTransformer to transform semi-structured (e.g., JSON) log events to fit a table's schema without dropping fields.

JSON log events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus, this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall field.

For example, consider this log event:

 {
   "timestamp": 1687786535928,
   "hostname": "host1",
   "level": "INFO",
   "message": "Started processing job1",
   "tags": {
     "platform": "data",
     "service": "serializer",
     "params": {
       "queueLength": 5,
       "timeout": 299,
       "userData_noIndex": {
         "nth": 99
       }
     }
   }
 }

And let's say the table's schema contains these fields:

timestamp
hostname
level
message
tags.platform
tags.service
indexableExtras
unindexableExtras

Without this transformer, the entire tags field would be dropped when storing the record in the table. However,
with this transformer, the record would be transformed into the following:

 {
   "timestamp": 1687786535928,
   "hostname": "host1",
   "level": "INFO",
   "message": "Started processing job1",
   "tags.platform": "data",
   "tags.service": "serializer",
   "indexableExtras": {
     "tags": {
       "params": {
         "queueLength": 5,
         "timeout": 299
       }
     }
   },
   "unindexableExtras": {
     "tags": {
       "userData_noIndex": {
         "nth": 99
       }
     }
   }
 }

Notice that the transformer:

Flattens nested fields which exist in the schema, like tags.platform
Moves fields which don't exist in the schema into the indexableExtras field
Moves fields which don't exist in the schema and have the suffix "_noIndex" into the unindexableExtras field

The unindexableExtras field allows the transformer to separate fields which don't need indexing (because they are
only retrieved, not searched) from those that do. The transformer also has other configuration options specified in JsonLogTransformerConfig.

This is part of the change requested in #9819 and described in this design doc.

Testing performed

Added new unit tests.
Validated JSON log events with dynamic schemas could be ingested into a table without dropping fields (unless configured to).

… a table's schema without dropping fields.

codecov-commenter · 2023-07-28T19:47:46Z

Codecov Report

Merging #11210 (a21abab) into master (11b85df) will increase coverage by 62.85%.
Report is 137 commits behind head on master.
The diff coverage is 86.16%.

@@              Coverage Diff              @@
##             master   #11210       +/-   ##
=============================================
+ Coverage      0.11%   62.96%   +62.85%     
- Complexity        0     1091     +1091     
=============================================
  Files          2223     2303       +80     
  Lines        119304   123992     +4688     
  Branches      18059    18878      +819     
=============================================
+ Hits            137    78072    +77935     
+ Misses       119147    40364    -78783     
- Partials         20     5556     +5536

Flag	Coverage Δ
integration	`<0.01% <0.00%> (?)`
integration1	`<0.01% <0.00%> (?)`
integration1temurin17	`?`
integration2	`0.00% <0.00%> (?)`
java-11	`49.95% <86.16%> (?)`
java-17	`62.81% <86.16%> (?)`
java-20	`62.83% <86.16%> (?)`
temurin	`62.96% <86.16%> (?)`
unittests	`62.96% <86.16%> (?)`
unittests1	`67.50% <86.16%> (?)`
unittests1temurin11	`?`
unittests1temurin17	`?`
unittests1temurin20	`?`
unittests2	`14.47% <0.00%> (?)`
unittests2temurin11	`?`
unittests2temurin17	`?`
unittests2temurin20	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
...ache/pinot/segment/local/utils/IngestionUtils.java	`28.32% <0.00%> (+28.32%)`	⬆️
.../org/apache/pinot/spi/data/readers/GenericRow.java	`65.88% <0.00%> (+65.88%)`	⬆️
...apache/pinot/spi/stream/StreamDataDecoderImpl.java	`79.16% <0.00%> (+79.16%)`	⬆️
...he/pinot/segment/local/utils/TableConfigUtils.java	`70.65% <80.00%> (+70.65%)`	⬆️
...recordtransformer/SchemaConformingTransformer.java	`86.80% <86.80%> (ø)`
.../local/recordtransformer/CompositeTransformer.java	`84.61% <100.00%> (+84.61%)`	⬆️
...ot/spi/config/table/ingestion/IngestionConfig.java	`100.00% <100.00%> (+100.00%)`	⬆️
...e/ingestion/SchemaConformingTransformerConfig.java	`100.00% <100.00%> (ø)`

... and 1981 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

pinot-spi/src/main/java/org/apache/pinot/spi/stream/StreamDataDecoderImpl.java

chenboat · 2023-08-04T20:57:03Z

...local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java

+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>


(1) what happens if the nest field occurs multiple times in the json? will we put all the values in the column?

Do you mean entries in an array, where multiple entries have the same key? We don't flatten the contents of arrays.

What I meant is fields like "platform" occur multiple times under "tags".

I might be misunderstanding, but since the input to the transformer is a GenericRow (which uses a map to store columns), I don't think we can have duplicate keys in the row.

chenboat · 2023-08-04T20:58:43Z

...local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java

+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>


Is this index referring to Pinot index? Is there any restriction or assumption on the index that should be applied?

Yes, it's referring to a Pinot index; the suggestion is for the user to index indexableExtras and to not index unindexableExtras. However, the transformer doesn't restrict the user from applying any index they choose to either column, so long as it can apply to a JSON/string type column.

chenboat · 2023-08-04T21:01:19Z

...local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java

+ * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
+ * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
+ */
+public class JsonLogTransformer implements RecordTransformer {


I wonder how much of the json transformation here is specific to log related application only? Can we refactor the config/transformer so that it can be used for all Json transformation in general? A lot of what this transformer does can be used for non-logging purpose. If possible, I prefer we made a general json transformer.

Good point. Instead of calling it JsonLogTransformer, we could pick a name that reflects that it takes an input record and transforms it to fit the schema losslessly. Here are some options I can think of:

LosslessSchemaTransformer

SchemaAdaptationTransformer

SchemaFitTransformer

SchemaMoldingTransformer

TableAdaptationTransformer

No 2 looks like a good one. Should we also add json to the class name and make it JsonSchemaAdaptionTransformer?

We could add "Json", but I think this is more general than JSON since it could handle records from any of the decoders like CSV, Parquet, etc. I'm fine with adding "Json" if you think that's best.

SchemaConformingTransformer then?

chenboat · 2023-08-08T22:38:04Z

...local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java

+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"


Can we make the suffix configurable in the table config?

Oh it's already configurable through JsonLogTransformerConfig. I added some text to the Javadoc indicating it's configurable.

chenboat · 2023-08-08T22:47:37Z

...local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java

+   * Validates the schema with a JsonLogTransformerConfig instance and creates a tree representing the fields in the
+   * schema to be used when transforming input records. For instance, the field "a.b" in the schema would be
+   * un-flattened into "{a: b: null}" in the tree, allowing us to more easily process records containing the latter.
+   * @throws IllegalArgumentException if schema validation fails


Can you add javadoc (and some examples) on when the schema validation will fail?

chenboat · 2023-08-08T23:01:07Z

...local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java

+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in


Can we mention the fieldPathsToDrop field and show how it is used?

Good catch!

chenboat · 2023-08-08T23:03:21Z

...local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java

+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider


I think in the javadoc, we should in general explain only the behavior of the transformer. How the transformer is implemented can be removed from this javadoc section and moved to the codes. In the following example, I think we should emphasize that a schema path can be more than 2 levels.

Good point. Moved this section to processFields.

kirkrodrigues added 4 commits July 27, 2023 17:53

Add JsonLogTransformer to transform semi-structured log events to fit…

379926b

… a table's schema without dropping fields.

Minor fix

d077c10

Add unit tests for JsonLogTransformer.

ca89cff

Minor fixes.

9ad0b87

Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Jul 31, 2023

chenboat self-assigned this Aug 4, 2023

chenboat reviewed Aug 4, 2023

View reviewed changes

chenboat reviewed Aug 8, 2023

View reviewed changes

kirkrodrigues added 3 commits August 10, 2023 11:47

Refactor docstrings.

6ad75da

Rename JsonLogTransformer to SchemaConformingTransformer

816512b

JsonLogTransformer: Pass through GenericRow's special keys

a21abab

kirkrodrigues changed the title ~~Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.~~ Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. Aug 24, 2023

chenboat approved these changes Aug 24, 2023

View reviewed changes

chenboat merged commit d91e5db into apache:master Aug 24, 2023

Jackie-Jiang added the documentation label Aug 29, 2023

kirkrodrigues mentioned this pull request Sep 1, 2023

Document the SchemaConformingTransformer. pinot-contrib/pinot-docs#238

Merged

Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. #11210

Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. #11210

Uh oh!

Conversation

kirkrodrigues commented Jul 28, 2023

Testing performed

Uh oh!

codecov-commenter commented Jul 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chenboat Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jul 28, 2023 •

edited

Loading

chenboat Aug 8, 2023 •

edited

Loading