Case Study: Flight Data Analysis
using Spark GraphX
Big Data Computing Vu Pham GraphX
Problem Statement
To analyze Real-Time Flight data using Spark GraphX,
provide near real-time computation results and
visualize the results.
Big Data Computing Vu Pham GraphX
Flight Data Analysis using Spark GraphX
Dataset Description:
The data has been collected from U.S. Department of
Transportation's (DOT) Bureau of Transportation
Statistics site which tracks the on-time performance of
domestic flights operated by large air carriers. The
Summary information on the number of on-time,
delayed, canceled, and diverted flights is published in
DOT's monthly Air Travel Consumer Report and in this
dataset of January 2014 flight delays and cancellations.
Big Data Computing Vu Pham GraphX
Big Data Pipeline for Flight Data Analysis using
Spark GraphX
Big Data Computing Vu Pham GraphX
UseCases
I. Monitoring Air traffic at airports
II. Monitoring of flight delays
III. Analysis overall routes and airports
IV. Analysis of routes and airports per airline
Big Data Computing Vu Pham GraphX
Objectives
Compute the total number of airports
Compute the total number of flight routes
Compute and sort the longest flight routes
Display the airports with the highest incoming flights
Display the airports with the highest outgoing flights
List the most important airports according to PageRank
List the routes with the lowest flight costs
Display the airport codes with the lowest flight costs
List the routes with airport codes which have the lowest
flight costs
Big Data Computing Vu Pham GraphX
Features
17 Attributes
Attribute Name Attribute Description
dOfM Day of Month
dOfW Day of Week
carrier Unique Airline Carrier Code
tailNum Tail Number
fNum Flight Number Reporting Airline
origin_id Origin Airport Id
origin Origin Airport
dest_id Destination Airport Id
dest Destination Airport
crsdepttime CRS Departure Time (local time: hhmm)
deptime Actual Departure Time
Big Data Computing Vu Pham GraphX
Features
Attribute Name Attribute Description
depdelaymins Difference in minutes between scheduled
and actual departure time. Early
departures set to 0.
crsarrtime CRS Arrival Time (local time: hhmm)
arrtime Actual Arrival Time (local time: hhmm)
arrdelaymins Difference in minutes between scheduled
and actual arrival time. Early arrivals set to
0.
crselapsedtime CRS Elapsed Time of Flight, in Minutes
dist Distance between airports (miles)
Big Data Computing Vu Pham GraphX
Sample Dataset
Big Data Computing Vu Pham GraphX
Spark Implementation
Big Data Computing Vu Pham GraphX
Spark Implementation
(“/home/iitp/spark-2.2.0-bin-hadoop-2.6/flights/airportdataset.csv")
Big Data Computing Vu Pham GraphX
Graph Operations
Big Data Computing Vu Pham GraphX
Graph Operations
Big Data Computing Vu Pham GraphX
Graph Operations
Big Data Computing Vu Pham GraphX
Graph Operations
what airport has the most in degrees or unique flights into it?
Big Data Computing Vu Pham GraphX
Graph Operations
What are our most important airports ?
Big Data Computing Vu Pham GraphX
Graph Operations
Output the routes where the distance between airports exceeds
1000 miles
graph.edges.filter {
case (Edge(org_id, dest_id, distance)) => distance > 1000
}.take(5).foreach(println)
Big Data Computing Vu Pham GraphX
Graph Operations
Output the airport with maximum incoming flight
// Define a reduce operation to compute the highest degree vertex
def max: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
if (a._2 > b._2) a else b
}
// Compute the max degrees
val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
Big Data Computing Vu Pham GraphX
Graph Operations
Output the airports with maximum incoming flights
val maxIncoming=graph.inDegrees.collect.sortWith (_._2 >
_._2).map(x => (airportMap(x._1),
x._2)).take(10).foreach(println)
Big Data Computing Vu Pham GraphX
Graph Operations
Output the longest routes
graph.triplets.sortBy(_.attr, ascending = false).map(triplet =>
"There were " + triplet.attr.toString + " flights from " + triplet.srcAttr
+ " to " + triplet.dstAttr + ".").take(20).foreach(println)
Big Data Computing Vu Pham GraphX
Graph Operations
Output the cheapest airfare routes
val gg = graph.mapEdges(e => 50.toDouble + e.attr.toDouble / 20)
//Call pregel on graph
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
//vertex program
(id, distCost, newDistCost) => math.min(distCost, newDistCost),triplet => {
//send message
if(triplet.srcAttr + triplet.attr < triplet.dstAttr)
{
Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
}
else
{
Iterator.empty
}
},
//Merge Messages
(a,b) => math.min(a,b)
)
//print routes with lowest flight cost
print("routes with lowest flight cost")
println(sssp.edges.take(10).mkString("\n"))
Big Data Computing Vu Pham GraphX
Graph Operations (PageRank)
Output the most influential airports using PageRank
val rank = graph.pageRank(0.001).vertices
val temp = rank.join(airports)
temp.take(10).foreach(println)
Output the most influential airports from most influential to latest
val temp2 = temp.sortBy(_._2._1,false)
Big Data Computing Vu Pham GraphX
Conclusion
The growing scale and importance of graph data has driven the
development of specialized graph computation engines
capable of inferring complex recursive properties of graph-
structured data.
In this lecture we have discussed GraphX, a distributed graph
processing framework that unifies graph-parallel and data-
parallel computation in a single system and is capable of
succinctly expressing and efficiently executing the entire graph
analytics pipeline.
Big Data Computing Vu Pham GraphX