[BEAM-10990] Elasticsearch response filtering and [BEAM-5172] Tries to reduce ES UTest flakiness by egalpin · Pull Request #15381 · apache/beam

egalpin · 2021-08-24T21:26:49Z

Adds the ability to prevent infinite retries on non-transient Elasticsearch write failures through a user-configurable setting, withThrowWriteFailures (the name is provisional; I'm not certain it is perfect). It is true by default to maintain backward compatibility.

If withThrowWriteFailures is set to false, the response from the Elasticsearch Bulk API is used to capture the result of persisting (or deleting) each document. The items in a Bulk API response are guaranteed to be in the same order as the actions in the Bulk API request [1], so we can stitch together the write result for any given input document.

This PR introduces a new class, WriteSummary, which collects the context of a document as it moves through the write: the raw input document (i.e. an element of the PCollection sent to ElasticsearchIO.Write/ElasticsearchIO.DocToBulk), the Bulk Directive derived from the transform settings and the input document (e.g. delete directive, upsert, scripted upsert), and whether the document resulted in an error from ES. Because all of this context/ancestry is maintained, users can get at, for example, the input document together with its resulting error. Without the ancestry, we could only return the Bulk Directive, which would be less helpful in many cases.
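As a rough illustration of the stitching described above, here is a minimal plain-Java sketch. It is not the code in this PR: `BulkStitchSketch`, `Summary`, and `stitch` are hypothetical names, documents and directives are simplified to strings, and the Bulk response is reduced to a list of per-item error messages (null on success). The only property it relies on is the one cited in [1], that response items come back in request order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; names are illustrative and are not the types added by this PR. */
final class BulkStitchSketch {

  /** Carries a document's ancestry: raw input, derived bulk directive, and any error. */
  static final class Summary {
    final String inputDoc;
    final String bulkDirective;
    final String error; // null when the write succeeded

    Summary(String inputDoc, String bulkDirective, String error) {
      this.inputDoc = inputDoc;
      this.bulkDirective = bulkDirective;
      this.error = error;
    }

    boolean hasError() {
      return error != null;
    }
  }

  /**
   * Pairs the i-th request entry with the i-th response item, relying on the Bulk API
   * returning items in request order. itemErrors.get(i) is null for a successful item.
   */
  static List<Summary> stitch(
      List<String> inputDocs, List<String> directives, List<String> itemErrors) {
    if (inputDocs.size() != directives.size() || inputDocs.size() != itemErrors.size()) {
      throw new IllegalArgumentException("request and response sizes differ");
    }
    List<Summary> summaries = new ArrayList<>(inputDocs.size());
    for (int i = 0; i < inputDocs.size(); i++) {
      summaries.add(new Summary(inputDocs.get(i), directives.get(i), itemErrors.get(i)));
    }
    return summaries;
  }
}
```

Joining by position rather than by document id also covers documents whose ids are auto-generated by Elasticsearch, since no id is needed to match an input to its result.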

[1] https://discuss.elastic.co/t/ordering-of-responses-in-the-bulk-api/13264

egalpin · 2021-08-24T21:27:42Z

R: @jbonofre @timrobertson100 @echauchot

egalpin · 2021-08-25T01:17:07Z

Run Java PreCommit

egalpin · 2021-09-01T15:11:32Z

Friendly bump. Anyone able to review?

timrobertson100 · 2021-09-01T17:29:16Z

I'm sorry I haven't got time to complete a review, but I have read through the changes once, and it all looks well structured and has good attention to detail on comments, tests etc.

The area I can't comment on immediately is the adapter and windowing behavior - if someone could confirm that section looks reasonable I think it looks good to merge.

egalpin · 2021-09-01T19:46:57Z

Ya I agree that maintaining a record of timestamps and using the context adapter is a lot of complexity. The concept was borrowed heavily from this RedisIO PR: #5841

I'm interested in investigating how much work would be involved in completing BEAM-1287 to allow FinishBundle methods to accept OutputReceiver/MultiOutputReceiver as that would make this type of pattern (which seems to be used in at least a few places) much easier to work with.

timrobertson100 · 2021-09-02T06:05:29Z

Thanks @egalpin

@echauchot - if you have time, could you please take a look at the section of changes in ElasticsearchIO from lines 2346 onwards? I'm just not familiar enough with the windowing to verify. The rest of the changes seem very reasonable.

echauchot · 2021-09-02T12:19:30Z

@timrobertson100 sure ! thanks for taking a look !
@egalpin thanks for your work !

egalpin · 2021-09-02T13:44:12Z

Thanks @timrobertson100 for having a look!

@echauchot I’m definitely keen to understand if the strategy in the PR will work and its windowing implications, and would be happy to learn of alternative approaches!

echauchot · 2021-09-15T14:37:27Z

@egalpin sorry for the late review, taking a look now

egalpin · 2021-09-16T01:06:07Z

Thanks for having a look @echauchot! And no problem, I’ve been swamped with work and unable to address the flaky tests so I completely understand!

echauchot · 2021-09-24T06:36:51Z

@aaltay @egalpin sorry guys, I have had very reduced availability these days because of ApacheCon. Resuming review.

egalpin · 2021-09-27T22:36:15Z

Oh I had also forgotten about an alternative solution that would have less complexity, but I’m not sure how it might or might not adhere to best practices as a sink in the system.

We could instead combine inputs globally and forget about maintaining windows altogether. I believe this is what the BQ Write method does, and I think that not maintaining windows in a sink is generally sane 🤷‍♂️

Thoughts?

echauchot

Hi Evan, thanks a lot for your work !
I have some minor change requests around wording, but a major concern with the complexity of having to deal with windows. It should be transparent to an IO. I think you can merge input doc and response globally and not deal with windows.

egalpin · 2021-09-29T19:39:54Z

Thanks for the feedback @echauchot! I wasn't sure if re-windowing within an IO was acceptable but it sounds like that's the path to go. It's way less complex, and I definitely like that. I'll make those changes

echauchot · 2021-09-30T07:42:04Z

@egalpin my pleasure !

I mean: you should not change the windows of the elements. In fact, you should not deal with the windows at all. Your problem is to join the input elements with the status (JSON and error) of the write. You could:

- join by doc id, but that would not be possible when the id is not provided in the input doc (autogeneration), so it is not the correct way to go.
- maintain the same order between input docs and WriteSummary objects, so you could simply join index 1 with index 1, index 2 with index 2, etc.

I don't get why you bothered with windows in the first place, but maybe there is something I missed.

egalpin · 2021-09-30T12:11:01Z

Thanks for clarifying, that makes sense 👍 My original intent was to leave windows alone entirely and just output without modification. It’s worth noting that the window each element belongs to should be left unmodified, and that’s why the complexity is present.

The challenge arose from the use of FinishBundle, where the output method requires explicit use of a BoundedWindow. So all of the window collection/multimap/adapter complexity is all done so that we could output any buffered elements when FinishBundle is called. If FinishBundle could accept a MultiOutput or OutputReceiver, I believe this would be solved neatly.

I just didn’t want to gate this change on FinishBundle changes, but maybe that’s the right path?

echauchot · 2021-10-01T08:39:36Z

@egalpin I don't get it. Neither BulkIOBaseFn#finishBundle() nor ElasticsearchIO#flushbatch() expect windowing information.

egalpin · 2021-10-01T14:27:59Z

@echauchot It's highly possible that I have a fundamental misunderstanding about how to handle window data. I'm going to try to outline my goals and challenges, and take the proposed implementation out of focus.

Goals:

- Change BulkIO/Write to output a PCollectionTuple rather than PDone, in order to support reporting the status of indexing each input document
- Leave windows/timestamps/etc. of input data entirely unaltered

Challenges:

- BulkIOBaseFn relies on buffering inputs, either using bundles or Stateful specs
- BulkIOBaseFn#finishBundle() must be called to ensure that any buffered inputs are sent to ES and, as of this PR, output to the PCollectionTuple
- DoFn.FinishBundle methods can accept a FinishBundleContext in order to output elements
- FinishBundleContext output methods all require explicit specification of a BoundedWindow instance

I got a bit stuck on that last point. My impression was that in order to ensure buffered docs' results were output, and in order to leave those elements' windows unaltered, I needed to keep track of the windows to which those elements belong so that they could be explicitly passed to FinishBundleContext#output.
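To make the bookkeeping concrete, here is a minimal plain-Java sketch of the idea of remembering each buffered element's window so it can be passed back explicitly at flush time. It deliberately avoids Beam types so that it stands alone: `WindowedBufferSketch` and `Emitter` are hypothetical names, the window is a generic key, and `Emitter#emit` merely stands in for an output call that needs an explicit window. The timestamp is carried along on the assumption, taken from the earlier mention of keeping a record of timestamps, that it must be preserved as well.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a window-keyed buffer; this is not the PR's implementation. */
final class WindowedBufferSketch<W, T> {

  /** Stand-in for an output call that requires the window to be given explicitly. */
  interface Emitter<WinT, ElemT> {
    void emit(ElemT element, long timestampMillis, WinT window);
  }

  /** A buffered element together with the timestamp it arrived with. */
  private static final class Stamped<E> {
    final E element;
    final long timestampMillis;

    Stamped(E element, long timestampMillis) {
      this.element = element;
      this.timestampMillis = timestampMillis;
    }
  }

  // Insertion-ordered so that flush output is deterministic.
  private final Map<W, List<Stamped<T>>> buffered = new LinkedHashMap<>();

  /** Buffers an element under the window it belongs to. */
  void add(W window, T element, long timestampMillis) {
    buffered
        .computeIfAbsent(window, w -> new ArrayList<>())
        .add(new Stamped<>(element, timestampMillis));
  }

  /** Emits every buffered element with its original window and timestamp, then clears. */
  void flush(Emitter<W, T> out) {
    for (Map.Entry<W, List<Stamped<T>>> entry : buffered.entrySet()) {
      for (Stamped<T> stamped : entry.getValue()) {
        out.emit(stamped.element, stamped.timestampMillis, entry.getKey());
      }
    }
    buffered.clear();
  }

  /** Number of elements currently buffered across all windows. */
  int size() {
    int count = 0;
    for (List<Stamped<T>> elements : buffered.values()) {
      count += elements.size();
    }
    return count;
  }
}
```

The point of keying by window is that an element is never emitted into a window other than the one it arrived in, which is the "leave windows unaltered" goal listed above.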

I'd definitely be keen to learn more about how to handle windows and challenge my assumptions here. Thanks for your time @echauchot in reviewing and teaching.

echauchot · 2021-10-08T08:29:45Z

Just to write here what we discussed privately yesterday (Apache way: what did not happen publicly did not happen at all):

My bad, I did not know about the DoFn#finishBundle() signature change. It now forces specifying a window in the output. I first thought that dealing with windows was not needed and brought unnecessary complexity, but it seems it is mandatory 😄
But I hope it is only temporary, until OutputReceiver is fleshed out.
I took a look at the existing IOs that output data to PCollections:
- FhirIO: outputs to the last seen window. Seems incorrect.
- HadoopFormatIO and BigQueryIO: store elements in a map keyed by window, then output per window, similarly to what you do.
  => So I guess the general window maintenance looks good. I need to look at the code in more detail to give LGTM.

echauchot

Overall windowing management and flush* methods look good to me, please address the minor comments and we will be good to merge

egalpin · 2021-10-18T18:15:50Z

Note that e01efcd aims to also address BEAM-5172

echauchot · 2021-10-19T09:37:33Z

BEAM-5172

Nice !
But please rename the commit with a [BEAM-5172] header, as it refers to a different ticket; I guess the utest docCount decrease commit refers to the same ticket as well. Also, please squash the [BEAM-10990] commits together and the [BEAM-5172] commits together.

echauchot

Thanks for your work Evan ! Almost LGTM. One final thing: please add a test that covers the case you indicated about PDone. Please also squash as explained; that way we will avoid a round trip and I can merge directly once the above UTest is done.

egalpin · 2021-10-19T16:31:31Z

The pre-commit failure seems to indicate that a missing cache entry was the cause 🤔

12:03:32 * What went wrong:
12:03:32 Execution failed for task ':sdks:java:testing:load-tests:compileJava'.
12:03:32 > Failed to load cache entry for task ':sdks:java:testing:load-tests:compileJava'

I'll try re-running and then will dig in further if persistent.

egalpin · 2021-10-19T16:31:36Z

Run Java PreCommit

echauchot

LGTM ! Perfect work and fluid communication as always, thanks Evan !
Merging

egalpin · 2021-10-20T13:26:46Z

Thanks for your review efforts and insights @echauchot, it was a pleasure as always 🎉

lukecwik · 2022-03-07T17:05:06Z

I believe the flushing logic has a bug where we are outputting to the wrong window: https://issues.apache.org/jira/browse/BEAM-14064

echauchot · 2022-03-09T14:27:16Z

Should be fixed by this PR

timrobertson100 reviewed Sep 1, 2021

Comment thread ...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java Outdated

aaltay requested a review from echauchot September 23, 2021 18:58

echauchot requested changes Sep 28, 2021

echauchot requested changes Oct 11, 2021

egalpin requested a review from echauchot October 18, 2021 18:36

echauchot requested changes Oct 19, 2021

egalpin added 2 commits October 19, 2021 11:03

[BEAM-10990] Adds response filtering for ElasticsearchIO

6bfd83d

[BEAM-5172] Tries to reduce ES uTest flakiness

f7691e6

egalpin force-pushed the BEAM-10990-elasticsearch-response-filtering branch from bea212b to f7691e6 Compare October 19, 2021 15:07

echauchot approved these changes Oct 20, 2021

echauchot changed the title from "[BEAM-10990] Elasticsearch response filtering" to "[BEAM-10990] Elasticsearch response filtering and [BEAM-5172] Tries to reduce ES UTest flakiness" Oct 20, 2021

echauchot merged commit a05aa45 into apache:master Oct 20, 2021

egalpin mentioned this pull request Jun 14, 2022

Exception from ElasticSearch Write Module in DataFlow #21690

Closed
