Skip to content

Conversation

@kirkrodrigues
Copy link
Contributor

tags: feature, release-notes

This adds a RecordTransformer to transform semi-structured (e.g., JSON) log events to fit a table's schema without dropping fields.

JSON log events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus, this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall field.

For example, consider this log event:

 {
   "timestamp": 1687786535928,
   "hostname": "host1",
   "level": "INFO",
   "message": "Started processing job1",
   "tags": {
     "platform": "data",
     "service": "serializer",
     "params": {
       "queueLength": 5,
       "timeout": 299,
       "userData_noIndex": {
         "nth": 99
       }
     }
   }
 }

And let's say the table's schema contains these fields:

  • timestamp
  • hostname
  • level
  • message
  • tags.platform
  • tags.service
  • indexableExtras
  • unindexableExtras

Without this transformer, the entire tags field would be dropped when storing the record in the table. However,
with this transformer, the record would be transformed into the following:

 {
   "timestamp": 1687786535928,
   "hostname": "host1",
   "level": "INFO",
   "message": "Started processing job1",
   "tags.platform": "data",
   "tags.service": "serializer",
   "indexableExtras": {
     "tags": {
       "params": {
         "queueLength": 5,
         "timeout": 299
       }
     }
   },
   "unindexableExtras": {
     "tags": {
       "userData_noIndex": {
         "nth": 99
       }
     }
   }
 }

Notice that the transformer:

  • Flattens nested fields which exist in the schema, like tags.platform
  • Moves fields which don't exist in the schema into the indexableExtras field
  • Moves fields which don't exist in the schema and have the suffix "_noIndex" into the unindexableExtras field

The unindexableExtras field allows the transformer to separate fields which don't need indexing (because they are
only retrieved, not searched) from those that do. The transformer also has other configuration options specified in JsonLogTransformerConfig.

This is part of the change requested in #9819 and described in this design doc.

Testing performed

  • Added new unit tests.
  • Validated JSON log events with dynamic schemas could be ingested into a table without dropping fields (unless configured to).

@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2023

Codecov Report

Merging #11210 (a21abab) into master (11b85df) will increase coverage by 62.85%.
Report is 137 commits behind head on master.
The diff coverage is 86.16%.

@@              Coverage Diff              @@
##             master   #11210       +/-   ##
=============================================
+ Coverage      0.11%   62.96%   +62.85%     
- Complexity        0     1091     +1091     
=============================================
  Files          2223     2303       +80     
  Lines        119304   123992     +4688     
  Branches      18059    18878      +819     
=============================================
+ Hits            137    78072    +77935     
+ Misses       119147    40364    -78783     
- Partials         20     5556     +5536     
Flag Coverage Δ
integration <0.01% <0.00%> (?)
integration1 <0.01% <0.00%> (?)
integration1temurin17 ?
integration2 0.00% <0.00%> (?)
java-11 49.95% <86.16%> (?)
java-17 62.81% <86.16%> (?)
java-20 62.83% <86.16%> (?)
temurin 62.96% <86.16%> (?)
unittests 62.96% <86.16%> (?)
unittests1 67.50% <86.16%> (?)
unittests1temurin11 ?
unittests1temurin17 ?
unittests1temurin20 ?
unittests2 14.47% <0.00%> (?)
unittests2temurin11 ?
unittests2temurin17 ?
unittests2temurin20 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...ache/pinot/segment/local/utils/IngestionUtils.java 28.32% <0.00%> (+28.32%) ⬆️
.../org/apache/pinot/spi/data/readers/GenericRow.java 65.88% <0.00%> (+65.88%) ⬆️
...apache/pinot/spi/stream/StreamDataDecoderImpl.java 79.16% <0.00%> (+79.16%) ⬆️
...he/pinot/segment/local/utils/TableConfigUtils.java 70.65% <80.00%> (+70.65%) ⬆️
...recordtransformer/SchemaConformingTransformer.java 86.80% <86.80%> (ø)
.../local/recordtransformer/CompositeTransformer.java 84.61% <100.00%> (+84.61%) ⬆️
...ot/spi/config/table/ingestion/IngestionConfig.java 100.00% <100.00%> (+100.00%) ⬆️
...e/ingestion/SchemaConformingTransformerConfig.java 100.00% <100.00%> (ø)

... and 1981 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Jul 31, 2023
@chenboat chenboat self-assigned this Aug 4, 2023
* </pre>
* Notice that the transformer:
* <ul>
* <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) what happens if the nest field occurs multiple times in the json? will we put all the values in the column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean entries in an array, where multiple entries have the same key? We don't flatten the contents of arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is fields like "platform" occur multiple times under "tags".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding, but since the input to the transformer is a GenericRow (which uses a map to store columns), I don't think we can have duplicate keys in the row.

* Notice that the transformer:
* <ul>
* <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
* <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this index referring to Pinot index? Is there any restriction or assumption on the index that should be applied?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's referring to a Pinot index; the suggestion is for the user to index indexableExtras and to not index unindexableExtras. However, the transformer doesn't restrict the user from applying any index they choose to either column, so long as it can apply to a JSON/string type column.

* need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
* branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
*/
public class JsonLogTransformer implements RecordTransformer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much of the json transformation here is specific to log related application only? Can we refactor the config/transformer so that it can be used for all Json transformation in general? A lot of what this transformer does can be used for non-logging purpose. If possible, I prefer we made a general json transformer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Instead of calling it JsonLogTransformer, we could pick a name that reflects that it takes an input record and transforms it to fit the schema losslessly. Here are some options I can think of:

  1. LosslessSchemaTransformer
  2. SchemaAdaptationTransformer
  3. SchemaFitTransformer
  4. SchemaMoldingTransformer
  5. TableAdaptationTransformer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No 2 looks like a good one. Should we also add json to the class name and make it JsonSchemaAdaptionTransformer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add "Json", but I think this is more general than JSON since it could handle records from any of the decoders like CSV, Parquet, etc. I'm fine with adding "Json" if you think that's best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SchemaConformingTransformer then?

* <ul>
* <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
* <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
* <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the suffix configurable in the table config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it's already configurable through JsonLogTransformerConfig. I added some text to the Javadoc indicating it's configurable.

* Validates the schema with a JsonLogTransformerConfig instance and creates a tree representing the fields in the
* schema to be used when transforming input records. For instance, the field "a.b" in the schema would be
* un-flattened into "{a: b: null}" in the tree, allowing us to more easily process records containing the latter.
* @throws IllegalArgumentException if schema validation fails
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadoc (and some examples) on when the schema validation will fail?

* </ul>
* <p>
* The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
* only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mention the fieldPathsToDrop field and show how it is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

* only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
* {@link JsonLogTransformerConfig}.
* <p>
* One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider
Copy link
Contributor

@chenboat chenboat Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the javadoc, we should in general explain only the behavior of the transformer. How the transformer is implemented can be removed from this javadoc section and moved to the codes. In the following example, I think we should emphasize that a schema path can be more than 2 levels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Moved this section to processFields.

@kirkrodrigues kirkrodrigues changed the title Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields. Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants