Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-1329

ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.9.0
    • 1.0.0
    • GraphX
    • None

    Description

      To reproduce, let's look at a graph with two nodes in one partition, and two edges between them split across two partitions:
      scala> val vs = sc.makeRDD(Seq(1L->null, 2L->null), 1)
      scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, null)), 2)
      scala> val g = graphx.Graph(vs, es)

      Everything seems fine, until GraphX needs to join the two RDDs:
      scala> g.triplets.collect
      [...]
      java.lang.ArrayIndexOutOfBoundsException: 1
      at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76)
      at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75)
      at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75)
      at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73)
      at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
      at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
      at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
      at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
      at org.apache.spark.scheduler.Task.run(Task.scala:53)
      at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
      at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:744)

      The bug is fairly obvious in RoutingTable.createPid2Vid() – it creates an array of length vertices.partitions.size, and then looks up partition IDs from the edges.partitionsRDD in it.

      A graph usually has more edges than nodes. So it is natural to have more edge partitions than node partitions.

      Attachments

        Issue Links

          Activity

            People

              ankurd Ankur Dave
              darabos Daniel Darabos
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: