High Performance Computing
using Apache Spark
Eliezer Beczi
December 7, 2020
Introduction
● More data means more computational challenges.
● Single machines can no longer handle today's data sizes.
● Computation therefore needs to be extended to multiple nodes.
PySpark
Why Apache Spark?
● Open-source.
● General-purpose.
● Fast.
● APIs in multiple languages (Python, Scala, Java, R).
● Built-in libraries for SQL, machine learning (MLlib), streaming, and graph processing (GraphX).
Spark essentials
● SparkSession:
○ the main entry point to all Spark functionality.
● SparkContext:
○ connects to a cluster manager;
○ acquires executors;
○ sends app code to executors;
○ sends tasks for the executors to run.
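A minimal PySpark sketch of creating a SparkSession and reaching the underlying SparkContext (the app name and local master are placeholder choices, not part of the original slides):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; local[*] runs Spark on all local cores.
spark = SparkSession.builder \
    .appName("hpc-demo") \
    .master("local[*]") \
    .getOrCreate()

# The SparkContext that talks to the cluster manager is exposed on the session.
sc = spark.sparkContext
print(sc.version)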
Spark essentials
● RDD (Resilient Distributed Datasets):
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.
● RDD operations:
○ transformations;
○ actions.
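A small sketch of creating an RDD and operating on it in parallel (local mode, illustrative values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Distribute a local Python list as an RDD.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)         # transformation: returns a new RDD
print(squared.reduce(lambda a, b: a + b))  # action: returns a plain Python int (55)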
Spark essentials
● Transformations:
○ produce new RDDs;
○ lazy, not executed until an action is performed.
● The laziness of transformations allows Spark to boost performance by optimizing how a sequence of transformations is executed at runtime.
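A sketch of this laziness: the transformations below return immediately, and no work happens until the action at the end triggers the job (assumes a local SparkSession, values are illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

nums = spark.sparkContext.parallelize(range(100_000))

evens = nums.filter(lambda x: x % 2 == 0)  # transformation, nothing executed yet
doubled = evens.map(lambda x: x * 2)       # transformation, still nothing executed

print(doubled.count())                     # action: Spark now plans and runs the whole chain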
Spark essentials
● Actions:
○ return non-RDD values to the driver and trigger execution of the pending transformations.
● Spark generalizes the classic Map-Reduce processing technique, as sketched below.
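The classic word count illustrates the Map-Reduce style in Spark (the input strings are made up for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lines = spark.sparkContext.parallelize(["spark is fast", "spark is general purpose"])

counts = (lines.flatMap(lambda line: line.split())  # map: split lines into words
               .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

print(counts.collect())                             # action: e.g. [('spark', 2), ('is', 2), ...]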
Spark SQL
● DataFrames:
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.
● DataFrames are organized into named columns.
● Conceptually equivalent to a table in a relational database.
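A small sketch of a DataFrame with named columns, built from local data (column names and rows are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],  # named columns, much like a relational table
)

df.printSchema()
df.show()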
Spark SQL
● DataFrames can be easily queried using SQL operations.
● Spark allows you to run queries directly on DataFrames, similar to how transformations are performed on RDDs.
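A sketch of both styles on the same data: registering the DataFrame as a temporary view and querying it with SQL, next to the equivalent DataFrame operations (the view, column names, and rows are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# SQL on a temporary view backed by the DataFrame.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

# The equivalent query expressed with DataFrame operations.
df.filter(df.age > 40).select("name").show()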
Thank you for your attention!