Details
- Type: Bug
- Status: Open
- Priority: P3
- Resolution: Unresolved
Description
Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside a DStream.transform(), which causes the same (bounded) RDD to be unioned into every micro-batch, multiplying its content in the resulting stream (once per batch).
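To make the duplication concrete, here is a toy model of the micro-batch semantics described above (plain Python, not actual Spark APIs): a DStream is modeled as a list of micro-batches, and transform() runs once per batch, so the bounded RDD gets unioned in every time.

```python
# Toy model of Spark Streaming semantics. All names are illustrative,
# not real Spark APIs: a "DStream" is just a list of micro-batches.

def transform_union(unbounded_batches, bounded_rdd):
    """Mimics DStream.transform(rdd -> sc.union(rdd, boundedRdd)):
    the bounded RDD is unioned into EVERY micro-batch."""
    return [batch + bounded_rdd for batch in unbounded_batches]

stream = [[1, 2], [3], [4, 5]]   # three micro-batches of an unbounded source
bounded = ["a", "b"]             # a bounded PCollection, materialized once

result = transform_union(stream, bounded)
flat = [x for batch in result for x in batch]

# The bounded elements now appear once per batch -- 3 times each here,
# instead of the single occurrence Flatten semantics require.
assert flat.count("a") == len(stream)
```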
Spark does not seem to provide an out-of-the-box implementation for this.
One approach I tried was to create a stream from a Queue (a single-RDD stream), but this is not an option since it fails checkpointing.
Another approach would be to create a custom InputDStream that does this.
An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.
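The custom-InputDStream idea could work roughly as sketched below (again a toy Python model, not Spark code): a stream that emits the bounded data exactly once, in its first batch, and empty batches afterwards, so a per-batch union no longer duplicates it. The checkpointing caveat from above applies: in a real implementation the "already emitted" flag would itself have to survive checkpoint recovery, or the data would be re-emitted (or lost) after a failure.

```python
# Toy sketch of a custom input stream that emits a bounded dataset
# exactly once. "SingleEmitStream" and "compute" are hypothetical names
# modeled loosely on InputDStream.compute(); not a real Spark API.

class SingleEmitStream:
    def __init__(self, data):
        self.data = data
        self.emitted = False  # in real Spark this flag must be checkpointed

    def compute(self):
        """Called once per micro-batch interval."""
        if not self.emitted:
            self.emitted = True
            return list(self.data)
        return []  # empty batch after the first emission

bounded = SingleEmitStream(["a", "b"])
stream = [[1, 2], [3], [4, 5]]

# Per-batch union now yields the bounded elements exactly once overall.
result = [batch + bounded.compute() for batch in stream]
flat = [x for batch in result for x in batch]
assert flat.count("a") == 1
```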