-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields. #11210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… a table's schema without dropping fields.
Codecov Report
@@ Coverage Diff @@
## master #11210 +/- ##
=============================================
+ Coverage 0.11% 62.96% +62.85%
- Complexity 0 1091 +1091
=============================================
Files 2223 2303 +80
Lines 119304 123992 +4688
Branches 18059 18878 +819
=============================================
+ Hits 137 78072 +77935
+ Misses 119147 40364 -78783
- Partials 20 5556 +5536
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1981 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
pinot-spi/src/main/java/org/apache/pinot/spi/stream/StreamDataDecoderImpl.java
Show resolved
Hide resolved
| * </pre> | ||
| * Notice that the transformer: | ||
| * <ul> | ||
| * <li>Flattens nested fields which exist in the schema, like "tags.platform"</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) what happens if the nest field occurs multiple times in the json? will we put all the values in the column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean entries in an array, where multiple entries have the same key? We don't flatten the contents of arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is fields like "platform" occur multiple times under "tags".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding, but since the input to the transformer is a GenericRow (which uses a map to store columns), I don't think we can have duplicate keys in the row.
| * Notice that the transformer: | ||
| * <ul> | ||
| * <li>Flattens nested fields which exist in the schema, like "tags.platform"</li> | ||
| * <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this index referring to Pinot index? Is there any restriction or assumption on the index that should be applied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's referring to a Pinot index; the suggestion is for the user to index indexableExtras and to not index unindexableExtras. However, the transformer doesn't restrict the user from applying any index they choose to either column, so long as it can apply to a JSON/string type column.
| * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this | ||
| * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call. | ||
| */ | ||
| public class JsonLogTransformer implements RecordTransformer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how much of the json transformation here is specific to log related application only? Can we refactor the config/transformer so that it can be used for all Json transformation in general? A lot of what this transformer does can be used for non-logging purpose. If possible, I prefer we made a general json transformer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Instead of calling it JsonLogTransformer, we could pick a name that reflects that it takes an input record and transforms it to fit the schema losslessly. Here are some options I can think of:
LosslessSchemaTransformerSchemaAdaptationTransformerSchemaFitTransformerSchemaMoldingTransformerTableAdaptationTransformer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No 2 looks like a good one. Should we also add json to the class name and make it JsonSchemaAdaptionTransformer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add "Json", but I think this is more general than JSON since it could handle records from any of the decoders like CSV, Parquet, etc. I'm fine with adding "Json" if you think that's best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SchemaConformingTransformer then?
| * <ul> | ||
| * <li>Flattens nested fields which exist in the schema, like "tags.platform"</li> | ||
| * <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li> | ||
| * <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the suffix configurable in the table config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh it's already configurable through JsonLogTransformerConfig. I added some text to the Javadoc indicating it's configurable.
| * Validates the schema with a JsonLogTransformerConfig instance and creates a tree representing the fields in the | ||
| * schema to be used when transforming input records. For instance, the field "a.b" in the schema would be | ||
| * un-flattened into "{a: b: null}" in the tree, allowing us to more easily process records containing the latter. | ||
| * @throws IllegalArgumentException if schema validation fails |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add javadoc (and some examples) on when the schema validation will fail?
| * </ul> | ||
| * <p> | ||
| * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are | ||
| * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we mention the fieldPathsToDrop field and show how it is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
| * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in | ||
| * {@link JsonLogTransformerConfig}. | ||
| * <p> | ||
| * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the javadoc, we should in general explain only the behavior of the transformer. How the transformer is implemented can be removed from this javadoc section and moved to the codes. In the following example, I think we should emphasize that a schema path can be more than 2 levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Moved this section to processFields.
tags: feature, release-notes
This adds a RecordTransformer to transform semi-structured (e.g., JSON) log events to fit a table's schema without dropping fields.
JSON log events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus, this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall field.
For example, consider this log event:
And let's say the table's schema contains these fields:
Without this transformer, the entire
tagsfield would be dropped when storing the record in the table. However,with this transformer, the record would be transformed into the following:
Notice that the transformer:
tags.platformindexableExtrasfieldunindexableExtrasfieldThe
unindexableExtrasfield allows the transformer to separate fields which don't need indexing (because they areonly retrieved, not searched) from those that do. The transformer also has other configuration options specified in
JsonLogTransformerConfig.This is part of the change requested in #9819 and described in this design doc.
Testing performed