Data Engineers Cheat Sheet - 21 Must-Know PySpark Questions

The document outlines key interview questions and answers related to PySpark, focusing on differences between Hadoop MapReduce and Spark, Spark architecture, and various components like RDDs, DataFrames, and SparkSession. It also discusses query optimization techniques, the importance of partitions, and the use of caching and broadcast variables. Additionally, it highlights the differences between narrow and wide transformations, as well as methods like coalesce() and repartition().


PySpark Most Important Interview Questions
1. What is the key difference between Hadoop MapReduce and Spark in terms of performance and scalability?

Feature | Hadoop MapReduce | Apache Spark
Processing Model | Disk I/O-heavy batch jobs | In-memory DAG, efficient transformations
Speed & Performance | Slower, especially for iterative jobs | Fast—thanks to caching and optimized execution
APIs & Development | Java-centric, verbose | High-level APIs across Scala, Python, Java, R
Workload Support | Batch-only | Batch, streaming, SQL, graph, ML in a unified engine
Fault Tolerance | Re-runs failed tasks from disk | Lineage tracking and recomputation
Resource & Ecosystem | Hadoop-dependent | Flexible deployment, extensive integration
Ideal Scenarios | Simple ETL and batch workloads | Real-time, iterative, multi-paradigm workloads

Follow up: Spark writes data to disk. Why, and what's the difference w.r.t. MapReduce?

Spark writes data to disk only when we instruct it to (via storage levels in persist()) or when data doesn't fit into memory and spilling to disk is required, which makes it scenario-specific. Even though Spark and MapReduce look similar when they write data to disk, the difference shows up in performance: Spark performs much better than MapReduce thanks to the Catalyst optimizer and dynamic optimization techniques like AQE.

2.​ What is the role of SparkContext in PySpark?


It’s the entry point to Spark’s execution engine. It connects your application to the Spark
cluster, manages resources, distributes tasks, and coordinates execution.

In modern Spark, it's created automatically inside a SparkSession (spark.sparkContext). Without it, no Spark jobs can run.
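For illustration, a minimal sketch of accessing the SparkContext through a SparkSession (the app name and sample data are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkcontext-demo").getOrCreate()
sc = spark.sparkContext  # the SparkContext behind this session

# Use the SparkContext directly for low-level RDD work
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]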
3.​ Explain Spark Architecture.
Components of Spark Architecture

●​ Driver Program
○​ The main process that runs the Spark application.
○​ Converts user code (in Python, Scala, Java, etc.) into a DAG (Directed
Acyclic Graph) of stages.
○​ Schedules tasks for execution across the cluster.
○​ Contains the SparkContext which is the gateway to the Spark cluster.​

●​ Cluster Manager
○​ Responsible for allocating resources to Spark applications.
○​ Can be Standalone, YARN, or Mesos (Kubernetes in modern setups).
●​ Executors
○​ Worker processes running on cluster nodes.
○​ Execute tasks assigned by the driver.
○​ Store data in memory or on disk for caching.
Execution Flow

1.​ User Code → RDD/Dataset/DataFrame Transformations
○​ Code gets translated into a logical plan.
2.​ Logical Plan → Physical Plan
○​ Optimized by the Catalyst optimizer.
3.​ Stages and Tasks
○​ Spark breaks the job into stages (based on shuffle boundaries).
○​ Each stage has multiple tasks (same operation on different partitions).
4.​ Cluster Manager Allocation
○​ The driver requests executors from the cluster manager.
5.​ Execution on Executors
○​ Executors run tasks in parallel and send results back to the driver.
Key Points

●​ Spark uses lazy evaluation — transformations aren't executed until an action is triggered.
●​ Executors are long-lived processes (unlike Hadoop MapReduce, which starts fresh JVMs per task).
●​ Spark stores data in memory for speed, reducing disk I/O compared to MapReduce.
●​ The driver monitors execution and re-runs failed tasks if needed.
4.​ What is the difference between RDDs, DataFrames, and Datasets?

Feature | RDD (Resilient Distributed Dataset) | DataFrame | Dataset
Abstraction Level | Low-level | High-level | High-level
Type Safety | Type-safe (in Scala/Java) | Not type-safe | Type-safe (only in Scala, not in PySpark)
Ease of Use | More complex APIs | Easier, SQL-like API | Similar to DataFrame
Performance | Less optimized (no Catalyst/Tungsten) | Optimized using Catalyst & Tungsten | Optimized (like DataFrame)
Serialization | Java serialization | Tungsten binary format | Encoders (efficient)
Use Case | Complex, fine-grained transformations | SQL operations, analytics | When compile-time type safety is required (Scala only)
Support in PySpark | ✅ Yes | ✅ Yes | ❌ Not supported (only in Scala/Java)

6.​ What is query optimization in PySpark?


Spark uses lazy evaluation — transformations aren't executed until an action is triggered — which lets the Catalyst optimizer analyze and optimize the entire query plan before anything runs.

Catalyst Optimizer Steps


1.​ Analysis – Checks syntax, resolves column names, validates the schema.

2.​ Logical Plan – Represents the user's query logically (no execution yet).

3.​ Optimization – Applies rules like:

○​ Predicate pushdown (filtering early)
○​ Constant folding (evaluating constant expressions before execution)
○​ Projection pruning (selecting only required columns)


4.​ Physical Plan – Decides execution strategy (e.g., broadcast join vs shuffle join).
5.​ Code Generation – Uses whole-stage codegen for faster execution.

Common Optimization Techniques in PySpark

●​ Use DataFrame API instead of RDDs – DataFrames get Catalyst/Tungsten benefits.


●​ Avoid UDFs unless necessary – UDFs are slower; use Spark SQL functions.
●​ Column Pruning – Select only the columns you need.
●​ Predicate Pushdown – Filter data early so less data flows through stages.
●​ Broadcast Join – For small tables, use broadcast() to avoid shuffling.
●​ Cache/Persist – For reused DataFrames, cache results in memory.
●​ Partitioning – Repartition or coalesce wisely to balance parallelism.
●​ File Format – Use columnar formats like Parquet/ORC for faster reads.
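A minimal sketch combining a few of these techniques (column pruning, early filtering, and a broadcast join); the paths, tables, and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

# Columnar source + select only the needed columns (column pruning)
orders = spark.read.parquet("/data/orders").select("order_id", "store_id", "amount")

# Filter early so less data flows through later stages (predicate pushdown)
orders = orders.filter(col("amount") > 0)

# Broadcast the small dimension table to avoid shuffling the large one
stores = spark.read.parquet("/data/stores")
result = orders.join(broadcast(stores), on="store_id", how="left")

# Inspect the physical plan chosen by Catalyst
result.explain()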

7.​ What is SparkSession in PySpark?

Introduced in Spark 2.0 to replace the older separate contexts. It's a unified object that combines:

●​ SparkContext (for RDDs)
●​ SQLContext (for SQL queries)
●​ HiveContext (for Hive support)

Every PySpark program starts by creating a SparkSession. It tells Spark:

●​ How to run (local or cluster)
●​ App name
●​ Configurations (memory, cores, etc.)

Without it, you can't access the DataFrame APIs.


Creating a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RetailDataAnalysis") \
    .master("local[*]") \
    .getOrCreate()

Parameters:

●​ appName() → Name for the Spark job (visible in UI).


●​ master() → Execution mode:

○​ "local[*]" → Run locally using all cores.


○​ "yarn", "mesos", "spark://..." → For cluster modes.
●​ .getOrCreate() → Creates a new session or returns an existing one.
8.​ What is the difference between wide and narrow transformations in PySpark?

1. Narrow Transformations

●​ Definition: Transformations where each input partition contributes to only one output partition.
●​ No shuffling of data across the network.
●​ Faster because computation stays local to the partition.
●​ Examples:
○​ map()
○​ filter()
○​ union()
○​ coalesce() (when not increasing partitions)
●​ When to use: If you can process data within the same partition without needing data from others.

2. Wide Transformations

●​ Definition: Transformations where input partitions contribute to multiple output partitions.
●​ Involves shuffling data across nodes in the cluster (network + disk I/O).
●​ Slower because the shuffle is expensive.
●​ Examples:
○​ groupByKey()
○​ reduceByKey()
○​ join()
○​ distinct()
○​ repartition()
●​ Why slower: Data is redistributed to new partitions → shuffle → sort → aggregation.

3. Shuffle Process in Wide Transformations


●​ Steps:
○​ Map stage – Data is prepared for shuffle.
○​ Shuffle stage – Data is moved across the cluster.
○​ Reduce stage – Data is aggregated/processed after shuffle.
●​ Optimization Tip: Reduce shuffles as much as possible by using (see the sketch below):
○​ reduceByKey() instead of groupByKey()
○​ mapPartitions() for custom processing.
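As referenced above, a minimal RDD sketch contrasting the two approaches (assumes an existing SparkSession named spark; the word-count data is made up):

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey() shuffles every individual value, then aggregates on the reduce side
counts_slow = rdd.groupByKey().mapValues(sum)

# reduceByKey() combines values within each partition first (map-side aggregation),
# so far less data crosses the network during the shuffle
counts_fast = rdd.reduceByKey(lambda a, b: a + b)

print(counts_fast.collect())  # e.g. [('a', 2), ('b', 1)]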
4. Execution Plan

●​ Narrow transformations → executed in a single stage.


●​ Wide transformations → cause a stage boundary (new stage in DAG).
●​ Spark’s DAG scheduler splits jobs into stages based on these transformations.

5. Key Interview Points

Feature | Narrow | Wide
Data movement | No shuffle | Shuffle
Speed | Faster | Slower
Stage creation | Same stage | New stage
Examples | map, filter | join, groupByKey

9.​ What is the use of coalesce() and repartition() in PySpark?

Feature | repartition() | coalesce()
Purpose | Increases or decreases the number of partitions. | Decreases the number of partitions only.
Shuffling | Full shuffle of data across all nodes. | Minimizes shuffle; avoids a full shuffle.
Use Case | When increasing partitions or evenly redistributing data. | When reducing the number of partitions for optimized output.
Performance Cost | Expensive due to full shuffle. | Inexpensive due to limited movement.
Typical Scenario | Before wide transformations like join, groupBy. | Before writing data to disk.
Data Redistribution | Even redistribution across partitions. | Narrows existing partitions (merging adjacent ones).
10.​ When should you use cache() and persist() in PySpark, and what is the difference between the two?

We use them when:

●​ The same DataFrame/RDD is used multiple times in a Spark application.


●​ We want to avoid recomputation each time it's used.
●​ The dataset is expensive to compute (e.g., involves joins, aggregations, filtering large
data).

●​ We want to improve performance in iterative algorithms (e.g., ML model training, graph processing).
Example scenario:
If you perform an expensive transformation like df.filter(...).groupBy(...).agg(...) and plan to use the result in multiple actions (count(), show(), write()), you should cache/persist it.

Feature | cache() | persist()
Default Storage Level | Stores data in MEMORY_ONLY. | Allows specifying a storage level (default: MEMORY_ONLY).
Flexibility | Fixed storage level. | Flexible – can store as MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, etc.
When to Use | When the dataset fits in memory and is reused multiple times. | When memory is limited or you need custom storage behavior.
Example | df.cache() | df.persist(StorageLevel.MEMORY_AND_DISK)

from pyspark import StorageLevel

# Using cache (MEMORY_ONLY)
df_cached = df.cache()

# Using persist with a custom storage level
df_persisted = df.persist(StorageLevel.MEMORY_AND_DISK)

# Trigger actions
df_cached.count()
df_cached.show()
💡 Key Points for Interviews
●​ cache() is just shorthand for persist() with the default storage level — MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames since Spark 2.0.
●​ Use persist() if the data might not fit in memory or you want to store it serialized or on disk.
●​ Both are lazily evaluated — data is stored only after the first action.
●​ Always unpersist() when the cached/persisted dataset is no longer needed to free
up resources.​

11.​ What is the importance of partitions in PySpark?

1️⃣ Parallelism & Performance

●​ Spark processes data in parallel by splitting it into partitions (massively parallel processing, MPP).
●​ Each partition is processed independently by a task on a Spark executor.

●​ More partitions → more tasks → better parallelism (up to a point).
●​ Too few partitions → underutilized CPU cores.
●​ Too many partitions → overhead in task scheduling.

2️⃣ Data Distribution
●​ Partitions determine how data is distributed across the cluster.
●​ Well-distributed partitions → balanced workload.
●​ Skewed partitions (uneven data) → data skew → performance bottlenecks.

3️⃣ Memory Management

●​ If a partition is too large → may cause OutOfMemoryError on executors.


●​ Proper partitioning ensures each partition fits into executor memory.

4️⃣ Shuffle Optimization

●​ During wide transformations (groupBy, join), Spark shuffles data between partitions.
●​ The number and size of partitions after the shuffle impact performance.

●​ Setting the right spark.sql.shuffle.partitions is critical for big datasets.

5️⃣ ETL Efficiency

●​ In data engineering pipelines:


○​ Before joins/aggregations → increase partitions for parallelism.
○​ Before writing to storage (like ADLS, S3) → reduce partitions to avoid too
many small files.​
# Check partitions​
df.rdd.getNumPartitions()​

# Repartition for parallelism​
df = df.repartition(8)​

# Coalesce to reduce partitions before writing​
df = df.coalesce(2)

12.​ What are broadcast variables and why are they used?

●​ Read-only shared variables that are cached on every executor node in a Spark cluster.
●​ They allow you to avoid sending the same data repeatedly with each task.
●​ Instead, Spark broadcasts the data once to each node, significantly reducing network I/O and serialization overhead.

Why Use Them?

●​ Ideal for small lookup tables, configuration settings, or reference data used across many tasks.
●​ Especially useful when performing operations that need frequent access to the same static dataset.

Suppose you're processing a massive dataset of user activity and need to enrich it with state names based on state codes:

states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcast_states = spark.sparkContext.broadcast(states)
df = spark.read.parquet("...")

def lookup_state(code):
    return broadcast_states.value.get(code, "Unknown")

df_rdd = df.rdd.map(lambda row: (row.user_id, lookup_state(row.state_code), row.activity))

Use broadcast variables when:

●​ You have a small-read-only dataset that needs to be reused across many tasks or
stages.
●​ You want to reduce shuffle and serialization overhead.​
Avoid them when:​

●​ The dataset is too large to fit in executor memory—could lead to OOM errors.
●​ The broadcasted data isn't used by enough tasks to justify the overhead.

13.​ What is the difference between df.show() and df.collect()?

Feature | df.show() | df.collect()
Purpose | Displays data in a tabular format in the console for a quick preview. | Retrieves all rows from the DataFrame to the driver as a Python list of Row objects.
Default Behavior | Shows the top 20 rows by default; the number of rows can be specified with df.show(n). | Always returns all rows in the DataFrame unless filtered before calling.
Output Location | Prints to the console only (doesn't return data). | Returns a list that you can store in a variable and process further.
Memory Impact | Lightweight; doesn't load the full dataset into driver memory. | Can be dangerous for large datasets – may cause OutOfMemoryError on the driver.
Use Case | For quick inspection or debugging. | When you need to programmatically work with all the data in the driver.

For large datasets, avoid collect() unless you are absolutely sure the data fits in driver memory — instead, use take(n) or limit(n).collect().
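A small sketch of the safer alternatives (assuming an existing DataFrame df):

# Preview without pulling the full dataset to the driver
df.show(5)

# Bring back only a bounded number of rows
first_rows = df.take(10)               # list of up to 10 Row objects
sample_rows = df.limit(100).collect()  # limit first, then collect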

14.​ What is lazy evaluation in PySpark?



Lazy Evaluation in PySpark means that Spark does not execute transformations immediately
when you call them.
Instead, it builds a logical execution plan (DAG) and only runs the computation when an
action (like show(), collect(), count(), etc.) is triggered.
How it works:

1.​ Transformations (select(), filter(), map(), etc.) → Only recorded in a plan; no


actual execution.
2.​ Action (show(), count(), collect()) → Triggers Spark to optimize the plan and
execute it on the cluster.

Advantages:

●​ Optimization: Spark can combine multiple transformations into a single stage (pipelining).
●​ Efficiency: Avoids unnecessary computation.
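A minimal sketch of the behaviour (assumes an existing SparkSession spark; the path and column names are hypothetical):

df = spark.read.parquet("/data/events")            # only schema work; no data processed yet
active = df.filter(df.status == "ACTIVE")          # transformation: recorded in the plan
projected = active.select("user_id", "status")     # still no execution

projected.count()  # action: triggers optimization and execution of the whole plan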

15.​ What are advantages of Delta Lake over traditional file formats?

Feature | Traditional File Formats (CSV, Parquet, etc.) | Delta Lake
ACID Transactions | ❌ Not supported (risk of partial writes/inconsistent data) | ✅ Guaranteed consistency with ACID transactions
Schema Enforcement | ❌ Data with the wrong schema may get written silently | ✅ Rejects writes with a schema mismatch
Schema Evolution | ⚠ Limited (Parquet allows it but without validation) | ✅ Add/modify columns with proper versioning
Data Versioning (Time Travel) | ❌ Not available | ✅ Access older versions of data for rollback, audit, or debugging
Upserts & Deletes | ❌ Difficult (requires full file rewrite) | ✅ Native support via MERGE INTO, DELETE, UPDATE
Performance (Indexing & Caching) | ⚠ Reads entire files, no transaction log optimization | ✅ Optimized reads/writes with the transaction log (_delta_log)
Streaming + Batch Unification | ❌ Usually separate pipelines | ✅ The same Delta table can be used for batch & streaming
File Compaction (Optimize) | ❌ Manual and complex | ✅ Built-in commands to compact small files for better performance
Delta Lake turns your data lake into a transactional, reliable, and high-performance
lakehouse, solving common issues like inconsistent data, slow updates, and schema drift.
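For illustration, a minimal upsert sketch using the delta-spark package (assumes Delta Lake is installed; the path, DataFrame names, and key column are hypothetical):

from delta.tables import DeltaTable

# Write a DataFrame out as a Delta table
new_orders.write.format("delta").mode("append").save("/data/delta/orders")

# Upsert (MERGE INTO) incoming records into the existing table
target = DeltaTable.forPath(spark, "/data/delta/orders")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())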

16.​ What happens when a PySpark job runs OOM?


When a PySpark job runs Out of Memory (OOM), it means the executors or driver have
exhausted their allocated memory.

Stage | What Happens
1. Task Execution | Executors start processing data partitions in memory.
2. Memory Exhaustion | If the data, shuffle operations, caching, or joins exceed the executor's memory limit, Spark tries to spill data to disk (temporary files) to free memory.
3. Spill to Disk | Spark writes intermediate data to disk to avoid OOM. This slows down the job but prevents immediate failure—if spill space is enough.
4. GC Pressure | The JVM garbage collector runs frequently to reclaim memory. If it spends too much time in GC (> ~98% of the time), Spark throws a GC overhead error.
5. OOM Failure | If neither GC nor spilling can free enough space, you get an OutOfMemoryError (e.g., Java heap space or GC overhead limit exceeded). The stage fails.
6. Job Retry / Abort | Spark retries the failed stage (default 4 times). If it fails every time, the job aborts.

Common Causes

●​ Very large partitions (too much data for one executor).


●​ Wide transformations (e.g., large shuffles from groupBy, join).
●​ Caching huge datasets without enough memory.
●​ Skewed data causing one executor to handle much more data than others.

Prevention / Fixes

●​ Increase memory: --executor-memory and --driver-memory.


●​ Increase partitions: repartition() to reduce data per partition.
●​ Use broadcast joins for small datasets.
●​ Persist wisely: Use persist(StorageLevel.DISK_ONLY) if RAM is limited.
●​ Enable spilling configs: Tune spark.shuffle.spill and
spark.memory.fraction.
●​ Handle skew: Use salting or skew join optimization.

In PySpark, if a job runs Out of Memory (OOM), Spark first tries to spill intermediate data to disk to free up RAM. If spilling and garbage collection can't help, the executor throws an OutOfMemoryError and the stage fails. Spark retries the stage (default 4 times), but if all retries fail, the job aborts.

Common causes are large partitions, skewed data, or caching huge datasets. To fix it: increase executor/driver memory, repartition to reduce data per task, use broadcast joins for small datasets, and persist to disk instead of memory when RAM is limited.
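A hedged sketch of the tuning knobs mentioned above, set when building the session (the values are arbitrary starting points, not recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("oom-tuning-sketch")
    .config("spark.executor.memory", "8g")          # more memory per executor
    .config("spark.driver.memory", "4g")            # more memory for the driver
    .config("spark.sql.shuffle.partitions", "400")  # smaller partitions after shuffles
    .config("spark.sql.adaptive.enabled", "true")   # let AQE handle skew and partition sizing
    .getOrCreate())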

17.​ What is AQE in PySpark and why is it useful?

AQE (Adaptive Query Execution) is a Spark feature (enabled by spark.sql.adaptive.enabled=true) that dynamically optimizes query plans at runtime based on the actual data being processed, rather than relying only on static plans created at compile time.

Why it’s useful:


1.​ Handles Data Skew – Detects uneven partition sizes and splits/rebalances them to avoid slow tasks.

2.​ Optimizes Join Strategies – Can switch from a shuffle join to a broadcast join if a dataset turns out to be smaller than expected.

3.​ Reduces Shuffle Partitions – Automatically merges small shuffle partitions to avoid excessive tasks and overhead.
excessive tasks and overhead.
Example:
In a retail analytics pipeline, if the orders dataset becomes small after filtering, AQE can switch to a broadcast join automatically—saving shuffle time and boosting performance.
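A quick sketch of the relevant settings (Spark 3.x; AQE is enabled by default in recent releases):

# Enable AQE and its sub-features
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions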



18.​ How would you handle skewed data in PySpark?


When data is skewed, some partitions hold far more records than others, causing certain tasks to run much longer. This leads to slow stages and possible OOM errors.

Technique | How It Works | Example
Salting | Add a random "salt" key to the join column to spread skewed keys across multiple partitions, then remove the salt after processing. | For a heavy customer_id=123, create customer_id_salted = concat(customer_id, rand_int) before the join.
Broadcast Join | If one dataset is small, broadcast it to all executors to avoid shuffles. | broadcast(df_small) in the join.
AQE Skew Join Optimization | Enable spark.sql.adaptive.enabled to let Spark split skewed partitions dynamically. | spark.conf.set("spark.sql.adaptive.enabled", "true").
Filter Early | Reduce data size before joins or aggregations. | Apply where() before the join.
Repartition by Key | Increase parallelism for the skewed key. | df.repartition(200, "key").

💡 Retail Example:
If 70% of your transactions come from one store ID, salting that store ID before joins will ensure the workload is spread evenly across executors.
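For illustration, a minimal salting sketch for the skewed side of a join (the DataFrames transactions and stores, the key store_id, and the salt factor are hypothetical):

from pyspark.sql import functions as F

SALT_BUCKETS = 10

# Add a random salt to the skewed key on the large side
tx_salted = transactions.withColumn(
    "store_id_salted",
    F.concat_ws("_", F.col("store_id").cast("string"),
                F.floor(F.rand() * SALT_BUCKETS).cast("string")))

# Replicate the small side once per salt value so every salted key finds a match
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
stores_salted = stores.crossJoin(salts).withColumn(
    "store_id_salted",
    F.concat_ws("_", F.col("store_id").cast("string"), F.col("salt").cast("string")))

result = tx_salted.join(stores_salted, on="store_id_salted", how="inner")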

19.​ What is broadcast join and when should you use it?
a-

A broadcast join sends a small DataFrame to all worker nodes so they can join it locally with their partition of the big DataFrame, avoiding a costly shuffle.

When to use:
●​ One DataFrame is small enough to fit in each executor's memory (rule of thumb: < 10 MB, but this can be adjusted via spark.sql.autoBroadcastJoinThreshold).
●​ To speed up joins by avoiding a network shuffle of large datasets.
●​ Especially useful in star-schema or lookup-table joins (fact table + small dimension table).
from pyspark.sql.functions import broadcast

df_result = df_large.join(broadcast(df_small), on="store_id", how="left")
df_result.show()

20.​ What is a spill in Spark and why does it happen?

In Spark, spill happens when data being processed does not fit entirely into memory, so Spark
temporarily writes (or spills) it to disk to continue processing.
Why does the spill happen?

●​ Spark operations like shuffle, sort, groupBy, join, or aggregation store intermediate
data in memory.
●​ If there’s insufficient executor memory (based on spark.executor.memory,
spark.memory.fraction), Spark can’t hold all intermediate data in RAM.
●​ To avoid OutOfMemory (OOM) errors, Spark spills this extra data to disk.

Common scenarios that cause spill

1.​ Large shuffles → e.g., groupByKey, large joins, repartitions.
2.​ Wide transformations with huge intermediate datasets.
3.​ Data skew → one partition has too much data to fit in memory.
4.​ Too many tasks competing for limited executor memory.

Impact of spilling

●​ Performance degradation — disk I/O is much slower than memory.
●​ However, it allows Spark jobs to complete successfully instead of failing with OOM.

How to reduce spilling

●​ Increase memory per executor (spark.executor.memory).
●​ Increase the shuffle memory fraction (spark.memory.fraction).
●​ Use map-side aggregation (reduceByKey instead of groupByKey).
●​ Avoid data skew (salting, repartitioning).


●​ Broadcast small tables instead of shuffling large ones.

21.​ What are Delta Lake's time travel features and how do they work?
Delta Lake Time Travel allows you to query, restore, or roll back data to a previous version of a Delta table. It's essentially like having built-in version control for your data.

How it works
●​ Every write operation in Delta Lake creates a new version of the table (stored in the
_delta_log transaction log).
●​ The transaction log stores metadata + commit history for each version.
●​ You can query by version number or by timestamp.
1️⃣ Query by version number

df = spark.read.format("delta")\
.option("versionAsOf", 5).load("/path/to/table")

2️⃣ Query by timestamp

df = spark.read.format("delta")\
    .option("timestampAsOf", "2025-08-01 12:00:00").load("/path/to/table")

Key points

●​ Time travel works as long as old data files are retained (controlled by delta.logRetentionDuration & delta.deletedFileRetentionDuration).
●​ Default log retention is 30 days (deleted data files are kept for 7 days by default).
●​ Does not duplicate data — it uses copy-on-write storage to maintain versions.
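A short sketch of inspecting history and rolling back with the delta-spark API (RESTORE requires a reasonably recent Delta Lake release; the path is hypothetical):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/path/to/table")

# Inspect the commit history recorded in _delta_log
dt.history().select("version", "timestamp", "operation").show()

# Roll the table back to an earlier version
dt.restoreToVersion(5)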
81
a-
pt
gu
a-
ik
sh
an

You might also like