[SPARK-1329] ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions - ASF Jira

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 1.0.0
Component/s: GraphX
Labels:
None

Description

To reproduce, let's look at a graph with two nodes in one partition, and two edges between them split across two partitions:
scala> val vs = sc.makeRDD(Seq(1L->null, 2L->null), 1)
scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, null)), 2)
scala> val g = graphx.Graph(vs, es)

Everything seems fine, until GraphX needs to join the two RDDs:
scala> g.triplets.collect
[...]
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76)
at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75)
at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

The bug is fairly obvious in RoutingTable.createPid2Vid() – it creates an array of length vertices.partitions.size, and then looks up partition IDs from the edges.partitionsRDD in it.

A graph usually has more edges than nodes. So it is natural to have more edge partitions than node partitions.

Attachments

Issue Links

is related to

SPARK-5480 GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:

Resolved

Activity

People

Assignee:: Ankur Dave

Reporter:: Daniel Darabos

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 26/Mar/14 10:25

Updated:: 15/Feb/15 11:53

Resolved:: 26/May/14 20:19