[BEAM-12697] Add SBE module and initial classes by zhoufek · Pull Request #15733 · apache/beam

zhoufek · 2021-10-15T18:31:04Z

Adds some types for helping to represent an SBE schema in Beam.

This is focused on types found under SBE's date and time encodings: https://www.fixtrading.org/standards/sbe-online/#date-and-time-encoding


codecov · 2021-10-15T19:10:12Z

Codecov Report

Merging #15733 (49c87b2) into master (0111cff) will decrease coverage by 0.17%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #15733      +/-   ##
==========================================
- Coverage   83.77%   83.60%   -0.18%     
==========================================
  Files         444      445       +1     
  Lines       60414    61329     +915     
==========================================
+ Hits        50614    51276     +662     
- Misses       9800    10053     +253

Impacted Files	Coverage Δ
sdks/python/apache_beam/io/gcp/bigquery.py	`62.72% <0.00%> (-12.84%)`	⬇️
...ython/apache_beam/runners/interactive/sql/utils.py	`76.09% <0.00%> (-7.91%)`	⬇️
sdks/python/apache_beam/utils/interactive_utils.py	`87.80% <0.00%> (-7.32%)`	⬇️
...ython/apache_beam/io/gcp/experimental/spannerio.py	`82.52% <0.00%> (-5.69%)`	⬇️
...thon/apache_beam/runners/worker/operation_specs.py	`40.67% <0.00%> (-4.90%)`	⬇️
...he_beam/runners/interactive/sql/beam_sql_magics.py	`49.75% <0.00%> (-4.79%)`	⬇️
...ython/apache_beam/io/gcp/bigquery_read_internal.py	`53.92% <0.00%> (-4.24%)`	⬇️
...eam/portability/api/beam_expansion_api_pb2_grpc.py	`57.89% <0.00%> (-4.02%)`	⬇️
sdks/python/apache_beam/io/gcp/bigquery_tools.py	`82.91% <0.00%> (-3.82%)`	⬇️
...eam/portability/api/beam_provision_api_pb2_grpc.py	`73.68% <0.00%> (-2.51%)`	⬇️
... and 82 more


zhoufek · 2021-10-15T19:21:19Z

Run Java_Examples_Dataflow PreCommit

zhoufek · 2021-10-19T14:07:22Z

R: @TheNeuralBit

TheNeuralBit · 2021-10-20T22:46:58Z

+  private SbeLogicalTypes() {}
+
+  // Unsigned types are all stored at the next highest value. This prevents unexpected behavior
+  // when reading and likely has negligible space impact.


The space impact will be 2x which could be non-trivial if there are many unsigned integer fields. We could at least save the space on the wire by using the same bit width for the base type. But that would still consume memory on the worker. You'd also have to make a custom logical type for that.

Another option is the protobuf approach. It just maps unsigned integers to their signed counterparts directly and users are responsible for munging negative values. See footnote [2] here: https://developers.google.com/protocol-buffers/docs/proto3#scalar

@reuvenlax have you thought at all about the best way to represent unsigned integers in Java beam schemas?

My main concern with doing the protobuf approach is that conversion from Row to JSON will use the base value, which will be signed. Since conversion to JSON is probably going to be a standard part of any pipeline using this, I'm a bit concerned that it'll lead to a lot of unexpected output.

Or am I misunderstanding how ToJson works?
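To make the trade-off concrete, here is a minimal sketch in plain Java (no Beam dependency) of the two representations of a uint32: widened into a `long`, or kept in a signed `int` as protobuf does. The class and method names are hypothetical, and the sketch assumes the 32 wire bits have already been read into an `int`.

```java
final class UnsignedWidening {
  private UnsignedWidening() {}

  /** Widening approach: hold a uint32 in a long so it is never negative. */
  static long widen(int wireBits) {
    return Integer.toUnsignedLong(wireBits);
  }

  /** Inverse of widen; rejects values outside the uint32 range. */
  static int narrow(long value) {
    if (value < 0 || value > 0xFFFF_FFFFL) {
      throw new IllegalArgumentException("Not a uint32: " + value);
    }
    return (int) value;
  }

  public static void main(String[] args) {
    int wireBits = (int) 4_000_000_000L; // the 32 bits of uint32 4000000000
    System.out.println("signed view:   " + wireBits);
    System.out.println("unsigned view: " + Integer.toUnsignedString(wireBits));
    System.out.println("widened:       " + widen(wireBits));
  }
}
```

If a serializer prints the base value directly, as the JSON concern above assumes, the signed view is what would appear in the output unless the consumer remembers to reinterpret it.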

TheNeuralBit · 2021-10-29T19:08:50Z

+ *
+ * <p>These are convertible to/from a {@link Row} that is a direct mapping of an SBE composite type.
+ */
+public final class TimeValues {


Would it be possible to just use joda or java time types instead of defining our own date/time classes?

For Row -> T, that'll work.

For T -> Row, the difficulty will be recovering the original unit. This isn't an issue if the SBE schema is a variable unit, since we can just pick a unit and set that. However, SBE allows setting a constant unit, like:

<composite name="UTCTimestampNanos" description="UTC timestamp with nanosecond precision">
  <type name="time" primitiveType="uint64" />
  <type name="unit" primitiveType="uint8" presence="constant" valueRef="TimeUnit.nanosecond" />
</composite>

(Source)

I'm guessing that there's some potential for detecting this on Row -> SBE type, since the signature will be different:

// Variable unit
public UTCTimestampEncoder unit(final short value) {
  buffer.putByte(offset + 8, (byte)value);
  return this;
}

// Constant unit
public short unit() {
  return (short)9;
}

If we assume nanos when doing T -> Row, then going Row -> SBE type, we would either write nanos or reduce precision if constant.

So it is doable, but I'm thinking that:

1. It is harder to implement and potentially more error prone.
2. It is more expensive (more analysis of reflection).
3. A sudden unit change might be surprising, even if it is variable. (e.g. "Why did it output nano when I only input seconds and millis?")

Overall, I thought it was better to use an intermediate type that bridges the gap between T and Row and which preserves the original value/unit.

So I have two main concerns with this approach:

1. I really don't want to get in the business of maintaining a date/time library :)
2. It would be preferable to have a library of common types that most connectors are using, so that it's trivial to interop with the schemas that they produce.

We could maybe alleviate (1) by discouraging users from using these types directly, instead preferring to convert to java time or joda time if they need to.

Note we can encode more information in the schema logical type than is represented in the Java type that it maps to. So another approach to be able to represent the mixed vs. fixed precision case could be to have a family of logical types that all map to Instant, but that have different representation types (i.e. wire types):

- micros_instant
- millis_instant
- seconds_instant
- variable_unit_instant (or something)

In the variable unit case the representation type could be row<int64 timestamp, int64 unit>. The drawback with this approach would be that the user can't interrogate the precision of the instant in their Java code, but I'm not so sure that's a significant problem.
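The fixed-unit family suggested above could look roughly like the following. This is a plain-Java sketch rather than Beam's `LogicalType` API: the `FixedUnitInstant` enum and its `encode`/`decode` methods are hypothetical names, and only the three fixed units from the list are covered. Each constant maps the same Java type (`Instant`) to a `long` in a different unit.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical family of encodings that all surface as java.time.Instant. */
enum FixedUnitInstant {
  SECONDS(ChronoUnit.SECONDS),
  MILLIS(ChronoUnit.MILLIS),
  MICROS(ChronoUnit.MICROS);

  private final ChronoUnit unit;

  FixedUnitInstant(ChronoUnit unit) {
    this.unit = unit;
  }

  /** Wire value: whole units since the epoch (any finer precision is dropped). */
  long encode(Instant instant) {
    return unit.between(Instant.EPOCH, instant);
  }

  /** Java value: the wire count interpreted in this constant's unit. */
  Instant decode(long wireValue) {
    return Instant.EPOCH.plus(wireValue, unit);
  }

  public static void main(String[] args) {
    Instant t = Instant.ofEpochSecond(1_600_000_000L, 123_456_789);
    for (FixedUnitInstant type : values()) {
      long wire = type.encode(t);
      System.out.println(type + " -> " + wire + " -> " + type.decode(wire));
    }
  }
}
```

Once decoded, a value that came from `SECONDS` is indistinguishable from an equal value that came from `MILLIS`, which is the drawback noted above: the precision lives in the type, not in the `Instant`.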

We may want to discuss this on the dev list, as it's a pretty significant decision about how we want to deal with date/time types in Schemas in general.

I don't want your work on SBE to be hung up by that though. Can you still make progress on the schema mapping, just leaving out the types that are represented in this PR for now?

We can get by on just using a Row. I would still like to have a logical type that provides the same semantic meaning as SBE does.

I've changed them all to use PassThroughLogicalType<Row> and have removed all the relevant datetime values. I've left UTCDateOnly and LocalMktDate, since they're just using Java's LocalDate, not a custom type.

What I meant by make progress was to build other infrastructure, like the PayloadSerializerProvider, that only supports the unambiguous types (for now).

If we must provide types that have the same semantic meaning as SBE, I do think it's preferable to define concrete types (as you had before) rather than using Row. Row is supposed to be an implementation detail - we aim to create APIs where users don't need to interact with it directly. I just think we need to be careful creating these types, as users may come to expect date/time library-like functionality. If we're clear from the outset that these are intended to be simple containers to faithfully represent SBE data, that could be ok.

However that solution would still have drawback (2) - the types would have little utility outside of SBE. What if a user wanted to use these types in a SqlTransform, or write them to Avro, or take data from one of those sources and write it as SBE? Ideally they could do that without boilerplate to convert between SBE-native date/time types and standard types. This doesn't need to be a blocker, but it's something to think about.

> What I meant by make progress was to build other infrastructure, like the PayloadSerializerProvider, that only supports the unambiguous types (for now).

Yeah, that's doable. I was just noting that in not explicitly accounting for these types, they'll likely be translated into a Row of primitive fields. That's how we would be handling unfamiliar composite types.

Actually, I am thinking back to my earlier comment about detecting the unit in converting to the SBE type, and I think that analyzing the type with reflection will always be necessary to avoid trying to write the unit when it is constant, which removes the first two concerns I had. The only challenge is getting the right unit in the variable case, but I can think of some ways to do that easily, though we may still choose a less precise unit than the original if the less precise unit would give the same result. I've at least tried this out with Instant, and I'd imagine it will work the same for the other types.

Basically, I'm thinking we could probably use Java time types and determine the wire format that works best. I'll update the logical types, and if there are still concerns, I can remove them and revisit them in a later PR.
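One way the variable-unit case could pick a unit is to take the coarsest one that still represents the `Instant` exactly. The sketch below is an assumption about how that might work, not the code in this PR; `UnitSelection` and `coarsestExactUnit` are hypothetical names.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class UnitSelection {
  private UnitSelection() {}

  /** Returns the least precise unit that loses nothing for this instant. */
  static ChronoUnit coarsestExactUnit(Instant instant) {
    int nanos = instant.getNano(); // always in [0, 999_999_999]
    if (nanos == 0) {
      return ChronoUnit.SECONDS;
    }
    if (nanos % 1_000_000 == 0) {
      return ChronoUnit.MILLIS;
    }
    if (nanos % 1_000 == 0) {
      return ChronoUnit.MICROS;
    }
    return ChronoUnit.NANOS;
  }

  public static void main(String[] args) {
    System.out.println(coarsestExactUnit(Instant.ofEpochMilli(5_000)));
    System.out.println(coarsestExactUnit(Instant.ofEpochMilli(5_001)));
    System.out.println(coarsestExactUnit(Instant.ofEpochSecond(5, 1)));
  }
}
```

Note that an input originally written as 5000 milliseconds is reported as `SECONDS` here. The value is unchanged, but the unit differs from the original, which is the kind of unit change discussed earlier in the thread.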

Awesome, thank you. I just realized we probably want to provide a path for joda time where possible, since Beam Java uses it for event times. I don't think it supports nanosecond precision though, which complicates things.
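The precision gap can be illustrated without pulling in Joda at all. Joda-Time instants are stored as milliseconds since the epoch, so the sketch below approximates a round trip through Joda by going through epoch milliseconds with `java.time` only; `MillisTruncation` and `throughMillis` are hypothetical names.

```java
import java.time.Instant;

final class MillisTruncation {
  private MillisTruncation() {}

  /** Simulates passing an instant through a millisecond-precision type. */
  static Instant throughMillis(Instant instant) {
    return Instant.ofEpochMilli(instant.toEpochMilli());
  }

  public static void main(String[] args) {
    Instant nanosPrecision = Instant.ofEpochSecond(1, 123_456_789);
    Instant afterRoundTrip = throughMillis(nanosPrecision);
    System.out.println(nanosPrecision + " -> " + afterRoundTrip);
    System.out.println("lossless: " + nanosPrecision.equals(afterRoundTrip));
  }
}
```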

I've added Java time for this. All of them are converted to/from String for simplicity and consistency. The UTC* and date-only types could use Long and Integer respectively if memory is a concern.
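For the date-only case, the two representations mentioned here can be compared with a small plain-Java sketch. The `DateRepresentations` class and its methods are hypothetical, and the sketch assumes ISO-8601 text for the String form and days since the epoch for the Integer form; the PR's actual formats may differ.

```java
import java.time.LocalDate;

final class DateRepresentations {
  private DateRepresentations() {}

  /** String form: ISO-8601 text such as 2021-11-29. */
  static String toText(LocalDate date) {
    return date.toString();
  }

  static LocalDate fromText(String text) {
    return LocalDate.parse(text);
  }

  /** Integer form: days since 1970-01-01; throws if it does not fit in an int. */
  static int toEpochDays(LocalDate date) {
    return Math.toIntExact(date.toEpochDay());
  }

  static LocalDate fromEpochDays(int days) {
    return LocalDate.ofEpochDay(days);
  }

  public static void main(String[] args) {
    LocalDate date = LocalDate.of(2021, 11, 29);
    System.out.println(toText(date) + " <-> " + toEpochDays(date));
  }
}
```

Both forms round-trip the same `LocalDate`; the String form is easier to read in output, while the Integer form is smaller.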

I'll do Joda in a later PR.

zhoufek · 2021-11-01T15:44:04Z

The Java failure seems related to BEAM-11689.

Go precommit failures are related to Go tests:

09:45:03 Test for github.com/apache/beam/sdks/v2/go/cmd/starcgen finished, 6 completed, 4 failed.

Go portable precommit also seems related to the Go test environment:

09:53:15 ./run_validatesrunner_tests.sh: line 399: go: command not found

TheNeuralBit · 2021-11-01T19:59:11Z

Yeah it looks like Go precommits are broken at head: https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/4871/

zhoufek · 2021-11-11T18:33:24Z

Run Java_Examples_Dataflow PreCommit

zhoufek · 2021-11-11T19:08:47Z

Run Java PreCommit

zhoufek · 2021-11-11T20:07:44Z

Run Java_Examples_Dataflow PreCommit

TheNeuralBit

LGTM, just a minor suggestion. Really sorry I let this drop for so long.

TheNeuralBit · 2021-11-29T19:14:45Z

+ * Beam schemas with just a primitive.
+ */
+@Experimental(Kind.SCHEMAS)
+public final class SbeLogicalTypes {


nit: You might consider making all the LogicalType implementations private, and just exposing concrete instances of them here. I'm also fine merging as-is if you'd prefer.

This is what we did in SqlTypes, so that we can easily migrate if/when it's necessary:

beam/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/logicaltypes/SqlTypes.java

Lines 32 to 42 in d500129

/** Beam LogicalType corresponding to ZetaSQL/CalciteSQL DATE type. */
public static final LogicalType<LocalDate, Long> DATE = new Date();

/** Beam LogicalType corresponding to ZetaSQL/CalciteSQL TIME type. */
public static final LogicalType<LocalTime, Long> TIME = new Time();

/** Beam LogicalType corresponding to ZetaSQL DATETIME type. */
public static final LogicalType<LocalDateTime, Row> DATETIME = new DateTime();

/** Beam LogicalType corresponding to ZetaSQL TIMESTAMP type. */
public static final LogicalType<Instant, Row> TIMESTAMP = new MicrosInstant();
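The pattern being suggested (hide the implementation class, expose only a constant typed by the interface) can be sketched without Beam. Everything below is hypothetical: `Conversion` stands in for a logical-type interface, and `ExampleTypes` for a holder class such as `SqlTypes`.

```java
import java.time.LocalDate;

/** Hypothetical stand-in for a logical-type interface. */
interface Conversion<InputT, BaseT> {
  BaseT toBase(InputT input);

  InputT toInput(BaseT base);
}

final class ExampleTypes {
  private ExampleTypes() {}

  /** Callers see only the interface, so the implementation can be swapped later. */
  static final Conversion<LocalDate, Long> DATE = new EpochDayDate();

  /** Private: not part of the API surface. */
  private static final class EpochDayDate implements Conversion<LocalDate, Long> {
    @Override
    public Long toBase(LocalDate input) {
      return input.toEpochDay();
    }

    @Override
    public LocalDate toInput(Long base) {
      return LocalDate.ofEpochDay(base);
    }
  }

  public static void main(String[] args) {
    Long days = DATE.toBase(LocalDate.of(1970, 1, 11));
    System.out.println(days + " -> " + DATE.toInput(days));
  }
}
```

Because no caller can name `EpochDayDate`, replacing it with a different implementation does not break user code, which is the migration benefit mentioned above.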

I would prefer to keep them public for now to remain consistent with the way that the protobuf extension does things:

beam/sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoSchemaLogicalTypes.java

Lines 67 to 73 in 3769d75

public static class UInt32 extends PassThroughLogicalType<Integer> {
  public static final String IDENTIFIER = "Uint32";

  UInt32() {
    super(IDENTIFIER, FieldType.STRING, "", FieldType.INT32);
  }
}

Sounds good 👍

zhoufek force-pushed the sbe_lt branch from 8602d21 to f55c439 Compare October 15, 2021 18:39

[BEAM-12697] Add SBE module and initial classes

6a7c14d

zhoufek force-pushed the sbe_lt branch from f55c439 to 6a7c14d Compare October 15, 2021 18:48

zhoufek commented Oct 19, 2021

Comment thread sdks/java/extensions/sbe/src/main/java/org/apache/beam/sdk/extensions/sbe/SbeLogicalTypes.java

zhoufek marked this pull request as ready for review October 19, 2021 14:07

TheNeuralBit self-requested a review October 19, 2021 16:20

TheNeuralBit reviewed Oct 29, 2021

zhoufek requested a review from TheNeuralBit October 29, 2021 21:20

[BEAM-12697] Simplify datetime types

1556379

[BEAM-12697] Use Java time

89dc3fb

[BEAM-12697] Remove newly unused Guava dependency

49c87b2

TheNeuralBit approved these changes Nov 29, 2021

TheNeuralBit merged commit 9160ba2 into apache:master Nov 29, 2021

zhoufek deleted the sbe_lt branch March 25, 2022 19:42
