Apache Spark’s Distributed Parallel Processing Components
Spark is a distributed data processing engine that usually works on a cluster of machines. Let’s understand
how all the components of Spark’s distributed architecture work together and communicate. We will
also look at the different modes in which clusters can be deployed.
Let’s start by looking at each of the individual components in Spark architecture.
Spark Driver:
Every Spark application, i.e. the Spark program or Spark job, has a Spark driver associated with it.
The Spark driver has the following roles:
1. Communicates with the cluster manager.
2. Requests resources (CPU, memory) from the cluster manager for the Spark executors.
3. Transforms all the Spark operations into DAG computations.
4. Distributes the tasks to the executors.
5. Communicates with the executors directly to collect the status of the tasks.
The driver process is absolutely essential – it’s the heart of a Spark Application and maintains all relevant
information during the lifetime of the application.
The Driver is the JVM in which our application runs.
The secret to Spark’s awesome performance is parallelism:
Scaling vertically (i.e. making a single computer more powerful by adding physical
hardware) is limited to a finite amount of RAM, threads, and CPU speed, because
motherboards in data centers and desktops have a limited number of physical slots.
Scaling horizontally (i.e. adding more identical machines to the cluster) means we
can simply keep adding new “nodes” almost endlessly, because a data center can
theoretically interconnect a near-unlimited number of machines.
We parallelize at two levels:
The first level of parallelization is the Executor: a JVM running on a node, typically one
executor instance per node.
The second level of parallelization is the Slot: the number of slots is determined by
the number of cores and CPUs of each node/executor (see the configuration sketch below).
Figure 2: Spark Driver and executor
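To make the two levels concrete, here is a minimal PySpark sketch of how both levels might be configured when the SparkSession is built. The property names spark.executor.instances and spark.executor.cores are standard Spark settings (spark.executor.instances applies on YARN/Kubernetes deployments), but the values of 3 executors and 8 cores are illustrative assumptions only.

```python
from pyspark.sql import SparkSession

# Sketch only: configure both levels of parallelism up front.
# 3 executors and 8 cores per executor are assumed example values.
spark = (
    SparkSession.builder
    .appName("parallelism-demo")
    .config("spark.executor.instances", "3")  # first level: number of executors
    .config("spark.executor.cores", "8")      # second level: slots (cores) per executor
    .getOrCreate()
)

# Total task slots the driver can fill = executors x cores per executor
print("Total slots:", 3 * 8)
```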
SparkSession
The SparkSession is a unified object through which you perform all Spark operations. In the earlier Spark 1.x
versions there were separate objects such as SparkContext, SQLContext, HiveContext, SparkConf, and StreamingContext.
With Spark 2.x, all of these were combined into one object, the SparkSession, and you can
perform all of those operations using the SparkSession object itself.
This unification has made life simpler for Spark developers.
In a standalone Spark application you have to create the SparkSession object manually; in the
interactive spark-shell it is provided automatically as the global variable ‘spark’.
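As a minimal sketch, this is how a standalone application might create the SparkSession and reach the older entry points through it; the application name is just a placeholder, and in spark-shell or a Databricks notebook the `spark` object already exists.

```python
from pyspark.sql import SparkSession

# Create (or reuse) the unified entry point.
spark = (
    SparkSession.builder
    .appName("my-spark-app")
    .getOrCreate()
)

# The older entry points are still reachable through the session if needed.
sc = spark.sparkContext            # former SparkContext
df = spark.sql("SELECT 1 AS id")   # former SQLContext/HiveContext functionality
df.show()
```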
Cluster Manager
As the name suggests, the cluster manager is responsible for managing the cluster. It also allocates the resources
(CPU, memory) for the nodes available in the cluster.
Different types of cluster managers are available (a sketch of how each is selected follows the list):
1. Built-in standalone cluster manager
2. Apache Hadoop YARN,
3. Apache Mesos
4. Kubernetes
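The cluster manager is usually selected through the master URL when the application is submitted or the session is built. A minimal sketch follows; all host names and ports are placeholders, not real endpoints.

```python
from pyspark.sql import SparkSession

# The master URL decides which cluster manager the application talks to.
builder = SparkSession.builder.appName("cluster-manager-demo")

# 1. Built-in standalone cluster manager
# builder = builder.master("spark://standalone-master-host:7077")

# 2. Apache Hadoop YARN (resolved from HADOOP_CONF_DIR / YARN_CONF_DIR)
# builder = builder.master("yarn")

# 3. Apache Mesos
# builder = builder.master("mesos://mesos-master-host:5050")

# 4. Kubernetes
# builder = builder.master("k8s://https://kubernetes-api-host:6443")

# Local mode (no cluster manager) is handy for testing:
spark = builder.master("local[*]").getOrCreate()
print(spark.sparkContext.master)
```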
Spark Executor
A Spark executor is a process that runs on each worker node in the cluster. Executors communicate
with the driver program and are responsible for executing tasks on the workers. In most deployment
modes, only a single executor runs per node. In a nutshell, executors:
1. Execute the code assigned to them by the driver.
2. Report the state of the computation on that executor back to the driver node.
Each executor has a number of slots, to which the driver can assign parallelized tasks.
So, for example:
If we have 3 identical home desktops (nodes) connected together, each with an
8-core processor, then that’s a 3-node cluster:
1 driver node
2 executor nodes
The 8 cores per executor node mean 8 slots, so the driver can assign
each executor up to 8 tasks.
The idea is that a multi-core processor is built so that each core can
execute its own task independently of the other cores, so 8 cores = 8
slots = 8 tasks in parallel.
How to set the number of slots and tasks based on the number of cores?
All processors today have multiple cores (e.g. 1 CPU = 8 cores).
Most processors today are multi-threaded (e.g. 1 core = 2 threads, so 8 cores = 16 threads).
A Spark task runs on a slot, and one thread can do one task at a time. To make use of all the
threads on the CPU, we assign the number of slots to correspond to a multiple of the
number of cores (which translates to the number of threads).
For example, assume that we have 4 nodes in the cluster:
Driver node: 1
Worker nodes: 3 (i.e. we have 3 executor nodes)
Assuming each machine has an 8-core processor, slots = 3 × 8 = 24.
Assuming 2 threads per core (hyper-threading), slots = 3 × 8 × 2 = 48.
Hence, in this cluster environment we can run 48 tasks in parallel on 48 partitions.
You will generally try to keep the number of tasks equal to the number of available slots to avoid waiting time (see the sketch below).
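The calculation above can be sketched in PySpark as follows. The executor, core, and thread counts are the assumed numbers from the example rather than values read from a real cluster, and aligning spark.sql.shuffle.partitions with the slot count is one common (but not mandatory) tuning choice.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("slots-demo").getOrCreate()
sc = spark.sparkContext

# Worked example from above (illustrative numbers only):
executors = 3           # worker nodes, one executor each
cores_per_executor = 8
threads_per_core = 2    # hyper-threading assumption
slots = executors * cores_per_executor * threads_per_core
print("Task slots:", slots)   # 48

# At runtime Spark exposes a related figure: the default parallelism it uses
# when no partition count is given (on a cluster, roughly the total core count).
print("defaultParallelism:", sc.defaultParallelism)

# A common tuning step is to align shuffle partitions with the available slots.
spark.conf.set("spark.sql.shuffle.partitions", str(slots))
```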
DataFrames
A DataFrame is the most common abstraction for data in Spark. It is Spark’s Structured API,
which represents the data as a table
with rows and columns. The list of columns and the data types of those columns is called the schema.
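As a small illustration, the sketch below creates a DataFrame with an explicit schema; the column names and sample rows are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# An explicit schema: the list of columns and their data types.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# A tiny in-memory DataFrame with made-up sample rows.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], schema=schema)
df.printSchema()
df.show()
```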
Partitions
A DataFrame holds the data on which you apply the various operations (filter, join, group by,
etc.); however, under the hood the DataFrame stores the data in multiple partitions.
Spark splits the data into multiple chunks, called partitions, and stores them
physically on multiple machines.
A file gets divided into multiple chunks and stored as partitions on multiple machines. This has two
advantages:
1. A very big file can be stored across the cluster, which would otherwise be difficult to store on one machine.
2. Each task works on one partition and runs in parallel, which is how parallelism is achieved.
You can change the number of partitions as per your needs. You don’t access partitions on an
individual basis; instead, you work with the DataFrame and perform operations on it.
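A minimal sketch of inspecting and changing the partition count, assuming the `df` DataFrame created in the previous example:

```python
# Inspect how many partitions the DataFrame currently has.
print("Partitions:", df.rdd.getNumPartitions())

# Increase partitions (full shuffle) or reduce them without a full shuffle.
df_more = df.repartition(8)
df_fewer = df_more.coalesce(2)
print("After repartition:", df_more.rdd.getNumPartitions())
print("After coalesce:", df_fewer.rdd.getNumPartitions())
```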
Job
Invoking an action inside a Spark application triggers the launch of a job to fulfill it. One Spark
application can have multiple jobs, depending on the code written.
Stage
Each job gets divided into smaller sets of tasks called stages that depend on each other. A stage is a
collection of tasks that run the same code, each on a different subset of the data.
Tasks
Each stage consists of multiple Spark tasks (a unit of execution), which run on the Spark executors.
Each task maps to a single core and works on a single partition of data.
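Putting the three terms together, the sketch below (again assuming the `df` from the DataFrame example) runs a grouped aggregation: the shuffle caused by groupBy splits the resulting job into stages, and each stage runs one task per partition. The job only appears (for example in the Spark UI) when the action is invoked.

```python
from pyspark.sql import functions as F

counts = (
    df.groupBy("name")                    # transformation: no job launched yet
      .agg(F.count("*").alias("n"))
)

counts.show()                             # action: triggers the job
```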
Apache Spark Documentation Link
Final Thoughts
In this series of the Azure Databricks Spark tutorial we have covered the Apache Spark core concepts. We
have learned:
Cluster: a cluster is a set of nodes (machines).
SparkSession: the SparkSession object is the main object used to run all Spark operations.
Spark Driver: a Spark driver is associated with every Spark application and takes care of the whole application.
Spark Application: a Spark application is divided into Spark jobs, which in turn are divided into Spark stages and further into Spark tasks.
DataFrame: Spark stores the data as DataFrames, which internally split the data into chunks stored as partitions.