Beam / BEAM-1444

Flatten of Bounded and Unbounded repeats the union with the RDD for each micro-batch.


Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: runner-spark
    • Labels: None

    Description

      Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is implemented by applying SparkContext#union(RDD...) inside DStream.transform(), which causes the same bounded RDD to be unioned into every micro-batch, multiplying its content in the resulting stream (once per batch).
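
      A minimal sketch of the pattern in question (illustrative only, not the runner's actual code; the class and variable names are made up for this example):

{code:java}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FlattenRepeatSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("flatten-repeat-sketch");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    JavaSparkContext jsc = jssc.sparkContext();

    // The BOUNDED side, already materialized as an RDD.
    JavaRDD<String> bounded = jsc.parallelize(Arrays.asList("a", "b", "c"));

    // Stand-in for the UNBOUNDED side (a socket source, purely for illustration).
    JavaDStream<String> unbounded = jssc.socketTextStream("localhost", 9999);

    // The problematic pattern: the union runs inside transform(), i.e. once per
    // micro-batch, so "a", "b", "c" re-appear in EVERY batch of the result.
    JavaDStream<String> flattened = unbounded.transform(batchRdd -> batchRdd.union(bounded));

    flattened.print();
    jssc.start();
    jssc.awaitTermination();
  }
}
{code}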

      Spark does not seem to provide an out-of-the-box way to union a single RDD into a DStream only once.

      One approach I tried was creating a stream from a queue (a single-RDD stream), but this is not an option since such a stream fails checkpointing.
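
      For reference, a rough sketch of that queue-based attempt (the helper name is hypothetical). The queue stream emits the RDD in the first batch only, which is the desired semantics, but queue input streams do not support checkpointing:

{code:java}
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SingleRddStreams {
  /**
   * Wraps a single bounded RDD as a "one-shot" stream: the RDD is emitted in the
   * first micro-batch only, and later batches are empty. This gives the right
   * Flatten semantics, but queue streams cannot be checkpointed, so the resulting
   * pipeline cannot recover from failure.
   */
  public static <T> JavaDStream<T> singleRddStream(JavaStreamingContext jssc, JavaRDD<T> rdd) {
    Queue<JavaRDD<T>> queue = new LinkedList<>();
    queue.add(rdd);
    return jssc.queueStream(queue, /* oneAtATime = */ true);
  }
}
{code}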

      Another approach would be to implement a custom InputDStream that emits the bounded RDD exactly once.
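
      A rough outline of what such a custom InputDStream might look like (class and member names are hypothetical, not an existing implementation). The naive emitted flag below is exactly the state that would have to survive checkpointing and driver recovery, which this sketch does not solve:

{code:java}
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.dstream.InputDStream;

import scala.Option;
import scala.reflect.ClassTag;

/** Emits the given bounded RDD in the first micro-batch only, then empty RDDs. */
public class SingleEmitInputDStream<T> extends InputDStream<T> {

  private final SparkContext sc;
  private final RDD<T> boundedRdd;
  private final ClassTag<T> tag;
  private boolean emitted = false; // naive state; lost on driver failure/recovery

  public SingleEmitInputDStream(StreamingContext ssc, RDD<T> boundedRdd, ClassTag<T> tag) {
    super(ssc, tag);
    this.sc = ssc.sparkContext();
    this.boundedRdd = boundedRdd;
    this.tag = tag;
  }

  @Override
  public void start() {}

  @Override
  public void stop() {}

  @Override
  public Option<RDD<T>> compute(Time validTime) {
    if (emitted) {
      // Subsequent batches contribute nothing.
      RDD<T> empty = sc.emptyRDD(tag);
      return Option.apply(empty);
    }
    emitted = true;
    return Option.apply(boundedRdd);
  }
}
{code}

      For use from the Java API it could be wrapped with JavaDStream.fromDStream(...), but the checkpointing/recovery question below remains open.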

      An important note: the challenge is to find a solution that holds up under checkpointing and recovery from failure.


          People

            Assignee: Unassigned
            Reporter: Amit Sela (amitsela)
            Votes: 0
            Watchers: 2
