[BEAM-11408] Integrate BigQuery sink streaming inserts with GroupIntoBatches#13496
[BEAM-11408] Integrate BigQuery sink streaming inserts with GroupIntoBatches#13496pabloem merged 5 commits intoapache:masterfrom
Conversation
468b4b7 to
d187951
Compare
d187951 to
3b0992c
Compare
There was a problem hiding this comment.
Not sure about this limit. What would be a proper value? Should we make it configurable?
There was a problem hiding this comment.
to be hones, I am not sure what's a good duration either. I think this is acceptable for now, until we find out more. Thoughts?
There was a problem hiding this comment.
Yeah sounds reasonable to proceed with this for now.
There was a problem hiding this comment.
We are relying on the fact that the GroupIntoBatches produces stable output. Really we should tag this with RequiresStableInput. Can you find out if this is safe to do in Dataflow?
There was a problem hiding this comment.
There is a transform override in Dataflow to add a preceding Reshuffle to DoFns marked with RequiresStableInput.
The override is disabled though. So I guess currently Dataflow does nothing for this tag.
Also my understanding is that dding a Reshuffle before GroupIntoBatches will introduce an extra shuffle as Reshuffle is essentially a GBK + value expansion in Dataflow.
f8790e3 to
0ab05d9
Compare
|
@reuvenlax I also have changes for FILE_LOADS ready in my local branch. If it is ok for you, I can merge those into this PR. Otherwise I will send a follow-up PR. |
|
R: @pabloem |
Hey Pablo, let me know if it is ok for you to include the changes in FILE_LOADS in this PR as well. If so, I will push a new commit. |
|
Run Java PostCommit |
|
Run Java PreCommit |
|
There are a couple checkstyle warnings on Precommit: These mean that the variables should be static to have CAPITALIZED_NAMES, or should be named with camelCase if they are not static. Can you fix that? Also, it seems there are some merge conflicts. Can you fix that as well? The last thing to figure out is how to address Reuven's comment regarding stable input |
0ab05d9 to
c033215
Compare
|
Run Java PostCommit |
|
Run Java_Examples_Dataflow_Java11 PreCommit |
Thanks for pointing out! Pushed a commit to fix those.
Done.
Based on my understanding Dataflow currently doesn't do anything for @RequiresStableInputs so it may be considered as safe for now but if we add naive support like a Reshuffle it would be adding duplicated shuffles. How about adding a TODO here so we don't forget? |
|
There seems to be an issue with a try/catch: https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Commit/12361/console |
|
Run Java PostCommit |
|
Run Java PreCommit |
|
Postcommit from previous commit: https://ci-beam.apache.org/job/beam_PostCommit_Java_PR/558/ |
|
Thanks @nehsyc ! |
…sink streaming inserts with GroupIntoBatches * Integrate BQ streaming inserts with GroupIntoBatches * Moved autosharding option from BigQueryOption to BigQueryIOBuilder; addressed comments. * fix checkstyle error * Revert the logic that was dropped during merge * Add comments for RequiresStableInput
Use
GroupIntoBatches.WithShardedKeyAPI to group and batch write before streaming to BigQuery service. Currently batching is done best-effort on bundle finalization.This PR
BigQueryOptionsto toggle between the existing and new implementation;BatchedStreamingWriteand provides an option to choose the implementation.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.