High Performance Computing
using Apache Spark
Eliezer Beczi
December 7, 2020
Introduction
● More data means more computational challenges.
● Single machines can no longer handle today's data sizes.
● Computation therefore needs to be extended to multiple nodes.
PySpark
Why Apache Spark?
● Open-source.
● General-purpose.
● Fast.
● APIs in multiple languages (Python, Scala, Java, R).
● Built-in libraries for SQL, machine learning (MLlib), streaming, and graph processing (GraphX).
Spark essentials
● SparkSession:
○ the main entry point to all Spark functionality.
● SparkContext:
○ connects to a cluster manager;
○ acquires executors;
○ sends app code to executors;
○ sends tasks for the executors to run.
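A minimal PySpark sketch of creating a SparkSession and reaching the underlying SparkContext (the app name and local master are placeholder choices, not part of the original slides):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; local[*] runs Spark on all local cores.
spark = SparkSession.builder \
    .appName("hpc-demo") \
    .master("local[*]") \
    .getOrCreate()

# The SparkContext that talks to the cluster manager is exposed on the session.
sc = spark.sparkContext
print(sc.version)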
Spark essentials
● RDD (Resilient Distributed Datasets):
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.
● RDD operations:
○ transformations;
○ actions.
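A small sketch of creating an RDD and operating on it in parallel (local mode, illustrative values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Distribute a local Python list as an RDD.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)         # transformation: returns a new RDD
print(squared.reduce(lambda a, b: a + b))  # action: returns a plain Python int (55)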
Spark essentials
● Transformations:
○ produce new RDDs;
○ lazy, not executed until an action is performed.
● The laziness of transformations allows Spark to boost performance by optimizing how a sequence of transformations is executed at runtime.
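A sketch of this laziness: the transformations below return immediately, and no work happens until the action at the end triggers the job (assumes a local SparkSession, values are illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

nums = spark.sparkContext.parallelize(range(100_000))

evens = nums.filter(lambda x: x % 2 == 0)  # transformation, nothing executed yet
doubled = evens.map(lambda x: x * 2)       # transformation, still nothing executed

print(doubled.count())                     # action: Spark now plans and runs the whole chain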
Spark essentials
● Actions:
○ return non-RDD values to the driver and trigger execution of the pending transformations.
● Spark generalizes the classic Map-Reduce processing technique, as sketched below.
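The classic word count illustrates the Map-Reduce style in Spark (the input strings are made up for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lines = spark.sparkContext.parallelize(["spark is fast", "spark is general purpose"])

counts = (lines.flatMap(lambda line: line.split())  # map: split lines into words
               .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

print(counts.collect())                             # action: e.g. [('spark', 2), ('is', 2), ...]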
Spark SQL
● DataFrames:
○ immutable and fault-tolerant collection of elements that can be operated on in parallel.
● DataFrames are organized into named columns.
● Conceptually equivalent to a table in a relational database.
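A small sketch of a DataFrame with named columns, built from local data (column names and rows are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],  # named columns, much like a relational table
)

df.printSchema()
df.show()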
Spark SQL
● DataFrames can be easily queried using SQL operations.
● Spark allows you to run queries directly on DataFrames, similar to how transformations are performed on RDDs.
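A sketch of both styles on the same data: registering the DataFrame as a temporary view and querying it with SQL, next to the equivalent DataFrame operations (the view, column names, and rows are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# SQL on a temporary view backed by the DataFrame.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

# The equivalent query expressed with DataFrame operations.
df.filter(df.age > 40).select("name").show()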
Thank you for your attention!