Details
- Type: Bug
- Status: Open
- Priority: P3
- Resolution: Unresolved
Description
Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside a DStream.transform(), which causes the same (bounded) RDD to be unioned into every micro-batch, multiplying its content in the resulting stream (once per batch).
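To make the duplication concrete, here is a toy model of the micro-batch semantics described above (plain Python, not actual Spark APIs): a DStream is modeled as a list of micro-batches, and transform() runs once per batch, so the bounded RDD gets unioned in every time.

```python
# Toy model of Spark Streaming semantics. All names are illustrative,
# not real Spark APIs: a "DStream" is just a list of micro-batches.

def transform_union(unbounded_batches, bounded_rdd):
    """Mimics DStream.transform(rdd -> sc.union(rdd, boundedRdd)):
    the bounded RDD is unioned into EVERY micro-batch."""
    return [batch + bounded_rdd for batch in unbounded_batches]

stream = [[1, 2], [3], [4, 5]]   # three micro-batches of an unbounded source
bounded = ["a", "b"]             # a bounded PCollection, materialized once

result = transform_union(stream, bounded)
flat = [x for batch in result for x in batch]

# The bounded elements now appear once per batch -- 3 times each here,
# instead of the single occurrence Flatten semantics require.
assert flat.count("a") == len(stream)
```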
Spark does not seem to provide an out-of-the-box implementation for this.
One approach I tried was to create a stream from a Queue (a single-RDD stream), but this is not an option since it fails checkpointing.
Another approach would be to create a custom InputDStream that does this.
An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.
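The custom-InputDStream idea could work roughly as sketched below (again a toy Python model, not Spark code): a stream that emits the bounded data exactly once, in its first batch, and empty batches afterwards, so a per-batch union no longer duplicates it. The checkpointing caveat from above applies: in a real implementation the "already emitted" flag would itself have to survive checkpoint recovery, or the data would be re-emitted (or lost) after a failure.

```python
# Toy sketch of a custom input stream that emits a bounded dataset
# exactly once. "SingleEmitStream" and "compute" are hypothetical names
# modeled loosely on InputDStream.compute(); not a real Spark API.

class SingleEmitStream:
    def __init__(self, data):
        self.data = data
        self.emitted = False  # in real Spark this flag must be checkpointed

    def compute(self):
        """Called once per micro-batch interval."""
        if not self.emitted:
            self.emitted = True
            return list(self.data)
        return []  # empty batch after the first emission

bounded = SingleEmitStream(["a", "b"])
stream = [[1, 2], [3], [4, 5]]

# Per-batch union now yields the bounded elements exactly once overall.
result = [batch + bounded.compute() for batch in stream]
flat = [x for batch in result for x in batch]
assert flat.count("a") == 1
```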