PySpark: Most Important Interview Questions
1. What is the key difference between Hadoop MapReduce and Spark in terms of performance and scalability?
Feature | Hadoop MapReduce | Apache Spark
Processing Model | Disk I/O-heavy batch jobs | In-memory DAG, efficient transformations
Speed & Performance | Slower, especially for iterative jobs | Fast, thanks to caching and optimized execution
APIs & Development | Java-centric, verbose | High-level APIs across Scala, Python, Java, R
Workload Support | Batch only | Unified engine for batch, streaming, SQL, graph, ML
Fault Tolerance | Re-runs failed tasks from disk | Lineage tracking and recomputation
Resource & Ecosystem | Hadoop-dependent | Flexible deployment, extensive integration
Ideal Scenarios | Simple ETL and batch workloads | Real-time, iterative, multi-paradigm workloads
Follow-up: Spark also writes data to disk. Why, and how does it differ from MapReduce?
Spark writes data to disk only when we instruct it to (via storage levels in persist()) or when data doesn't fit into memory and spilling to disk is required, which makes it scenario-specific.
Even though Spark and MapReduce look similar when they write data to disk, the difference shows up in performance: Spark performs much better than MapReduce thanks to the Catalyst optimizer and dynamic optimization techniques like AQE.
2. What is the role of SparkContext in PySpark?
It’s the entry point to Spark’s execution engine. It connects your application to the Spark
cluster, manages resources, distributes tasks, and coordinates execution.
In modern Spark, it’s created automatically inside a SparkSession
(spark.sparkContext). Without it, no Spark jobs can run.
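A minimal sketch of how this looks in practice (the app name is illustrative): the SparkContext is obtained from the SparkSession and used for low-level RDD work.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkContextDemo").getOrCreate()

# The SparkContext is created automatically and exposed on the session
sc = spark.sparkContext
print(sc.master, sc.applicationId)

# The low-level RDD API goes through the SparkContext
rdd = sc.parallelize(range(10), 4)
print(rdd.map(lambda x: x * x).sum())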
3. Explain Spark Architecture.
Components of Spark Architecture
● Driver Program
○ The main process that runs the Spark application.
○ Converts user code (in Python, Scala, Java, etc.) into a DAG (Directed
Acyclic Graph) of stages.
○ Schedules tasks for execution across the cluster.
○ Contains the SparkContext which is the gateway to the Spark cluster.
● Cluster Manager
○ Responsible for allocating resources to Spark applications.
○ Can be Standalone, YARN, or Mesos (Kubernetes in modern setups).
● Executors
○ Worker processes running on cluster nodes.
○ Execute tasks assigned by the driver.
○ Store data in memory or on disk for caching.
Execution Flow
1. User Code → RDD/Dataset/DataFrame Transformations
○ Code gets translated into a logical plan.
2. Logical Plan → Physical Plan
○ Optimized by the Catalyst optimizer.
3. Stages and Tasks
○ Spark breaks the job into stages (based on shuffle boundaries).
○ Each stage has multiple tasks (same operation on different partitions).
4. Cluster Manager Allocation
○ Driver requests executors from the cluster manager.
5. Execution on Executors
○ Executors run tasks in parallel and send results back to the driver.
Key Points
● Spark uses lazy evaluation — transformations aren’t executed until an action is
triggered.
● Executors are long-lived processes (unlike Hadoop MapReduce which starts
fresh JVMs per task).
● Spark stores data in memory for speed, reducing disk I/O compared to
MapReduce.
● The driver monitors execution and re-runs failed tasks if needed.
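To see this flow in practice, a minimal sketch (the dataset is synthetic): explain() prints the physical plan, and the Exchange node marks the shuffle boundary where a new stage begins.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ArchitectureDemo").getOrCreate()

df = spark.range(1_000_000)                                # driver builds the logical plan
agg = df.groupBy((F.col("id") % 10).alias("bucket")).count()

agg.explain()   # physical plan: the Exchange node is the shuffle / stage boundary
agg.collect()   # action: the driver schedules tasks, executors run them in parallel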
4. What is the difference between RDDs, DataFrames, and Datasets?
Feature | RDD (Resilient Distributed Dataset) | DataFrame | Dataset
Abstraction Level | Low-level | High-level | High-level
Type Safety | Type-safe (in Scala/Java) | Not type-safe | Type-safe (only in Scala, not in PySpark)
Ease of Use | More complex APIs | Easier, SQL-like API | Similar to DataFrame
Performance | Less optimized (no Catalyst/Tungsten) | Optimized using Catalyst & Tungsten | Optimized (like DataFrame)
Serialization | Java serialization | Tungsten binary format | Encoders (efficient)
Use Case | Complex, fine-grained transformations | SQL operations, analytics | When compile-time type safety is required (Scala only)
Support in PySpark | ✅ Yes | ✅ Yes | ❌ Not supported (only in Scala/Java)
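A small sketch contrasting the RDD and DataFrame APIs on the same aggregation (the data and column names are made up):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("RddVsDataFrame").getOrCreate()

# RDD API: low-level, manual tuple handling, no Catalyst optimization
rdd = spark.sparkContext.parallelize([("NY", 10), ("CA", 5), ("NY", 7)])
rdd_totals = rdd.reduceByKey(lambda a, b: a + b).collect()

# DataFrame API: declarative and optimized by Catalyst/Tungsten
df = spark.createDataFrame([("NY", 10), ("CA", 5), ("NY", 7)], ["state", "amount"])
df_totals = df.groupBy("state").agg(F.sum("amount").alias("total")).collect()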
5. What is query optimization in PySpark?
Spark uses lazy evaluation, so transformations aren't executed until an action is triggered; this lets the Catalyst optimizer see and optimize the whole query plan before execution.
Catalyst Optimizer Steps
1. Analysis – Checks syntax, resolves column names, validates schema.
2. Logical Plan – Represents the user’s query logically (no execution yet).
3. Optimization – Applies rules like:
○ Predicate pushdown (filtering early)
○ Constant folding (evaluating constant expressions before execution)
○ Projection pruning (selecting only required columns)
4. Physical Plan – Decides execution strategy (e.g., broadcast join vs shuffle join).
5. Code Generation – Uses whole-stage codegen for faster execution.
Common Optimization Techniques in PySpark
● Use DataFrame API instead of RDDs – DataFrames get Catalyst/Tungsten benefits.
● Avoid UDFs unless necessary – UDFs are slower; use Spark SQL functions.
● Column Pruning – Select only the columns you need.
● Predicate Pushdown – Filter data early so less data flows through stages.
● Broadcast Join – For small tables, use broadcast() to avoid shuffling.
● Cache/Persist – For reused DataFrames, cache results in memory.
● Partitioning – Repartition or coalesce wisely to balance parallelism.
● File Format – Use columnar formats like Parquet/ORC for faster reads.
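A quick way to see Catalyst at work is explain(); in this sketch (the Parquet path and column names are hypothetical) the filter is pushed down to the scan and only the selected columns are read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CatalystDemo").getOrCreate()

orders = spark.read.parquet("/data/orders")           # hypothetical dataset

result = (orders
          .filter(orders.order_status == "COMPLETE")  # predicate pushdown
          .select("order_id", "order_total"))         # projection / column pruning

# Prints the parsed, analyzed, optimized, and physical plans
result.explain(mode="formatted")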
6. What is SparkSession in PySpark?
Introduced in Spark 2.0 to replace the older separate contexts. It's a unified object that combines:
● SparkContext (for RDDs)
● SQLContext (for SQL queries)
● HiveContext (for Hive support)
Every PySpark program starts by creating a SparkSession. It tells Spark:
● How to run (local or cluster)
● App name
● Configurations (memory, cores, etc.)
Without it, you can’t access DataFrame APIs.
Creating a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("RetailDataAnalysis") \
    .master("local[*]") \
    .getOrCreate()
Parameters:
● appName() → Name for the Spark job (visible in UI).
● master() → Execution mode:
○ "local[*]" → Run locally using all cores.
○ "yarn", "mesos", "spark://..." → For cluster modes.
● .getOrCreate() → Creates a new session or returns an existing one.
7. What is the difference between wide and narrow transformations in PySpark?
1. Narrow Transformations
● Definition: Transformations where each input partition contributes to only one output
partition.
● No shuffling of data across the network.
● Faster because computation stays local to the partition.
● Examples:
○ map()
○ filter()
○ union()
○ coalesce() (without increasing partitions)
● When to use: If you can process data within the same partition without needing data
from others.
2. Wide Transformations
● Definition: Transformations where input partitions contribute to multiple output
partitions.
● Involves shuffling data across nodes in the cluster (network + disk I/O).
● Slower because shuffle is expensive.
● Examples:
○ groupByKey()
○ reduceByKey()
○ join()
○ distinct()
○ repartition()
● Why slower: Data is redistributed to new partitions → shuffle → sort → aggregation.
3. Shuffle Process in Wide Transformations
● Steps:
○ Map stage – Data is prepared for shuffle.
○ Shuffle stage – Data is moved across the cluster.
○ Reduce stage – Data is aggregated/processed after shuffle.
● Optimization Tip: Reduce shuffles as much as possible by using:
○ reduceByKey() instead of groupByKey()
○ mapPartitions() for custom processing.
4. Execution Plan
● Narrow transformations → executed in a single stage.
● Wide transformations → cause a stage boundary (new stage in DAG).
● Spark’s DAG scheduler splits jobs into stages based on these transformations.
5. Key Interview Points
Feature | Narrow | Wide
Data movement | No shuffle | Shuffle
Speed | Faster | Slower
Stage creation | Same stage | New stage
Examples | map, filter | join, groupByKey
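A minimal sketch of the difference (synthetic data); calling explain() on the wide result shows the Exchange node that introduces a new stage.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("NarrowVsWide").getOrCreate()
df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# Narrow: processed within each partition, no shuffle, same stage
narrow = df.filter(F.col("id") % 2 == 0).select("id", "key")

# Wide: requires a shuffle, creating a stage boundary
wide = df.groupBy("key").count()
wide.explain()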
8. What is the use of coalesce() and repartition() in PySpark?
Feature | repartition() | coalesce()
Purpose | Increases or decreases the number of partitions | Decreases the number of partitions only
Shuffling | Full shuffle of data across all nodes | Minimizes shuffle; avoids a full shuffle
Use Case | When increasing partitions or evenly redistributing data | When reducing the number of partitions for optimized output
Performance Cost | Expensive due to the full shuffle | Inexpensive due to limited data movement
Typical Scenario | Before wide transformations like join, groupBy | Before writing data to disk
Data Redistribution | Even redistribution across partitions | Narrows existing partitions (merging adjacent ones)
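A short sketch of both in a typical pipeline (the input/output paths and partition counts are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitioningDemo").getOrCreate()
df = spark.read.parquet("/data/transactions")          # hypothetical input
print(df.rdd.getNumPartitions())

# Full shuffle: redistribute evenly by key before a heavy join/groupBy
df_wide = df.repartition(200, "store_id")

# No full shuffle: merge down to a few partitions before writing
df_wide.coalesce(8).write.mode("overwrite").parquet("/data/transactions_out")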
9. When should you use cache() and persist() in PySpark, and what is the difference between the two?
We use them when:
● The same DataFrame/RDD is used multiple times in a Spark application.
● We want to avoid recomputation each time it's used.
● The dataset is expensive to compute (e.g., involves joins, aggregations, filtering large
data).
● We want to improve performance in iterative algorithms (e.g., ML model training, graph
processing).
Example scenario:
If you perform an expensive transformation like
39
df.filter(...).groupBy(...).agg(...) and plan to use the result in multiple actions
(count(), show(), write()), you should cache/persist it.
Feature | cache() | persist()
Default Storage Level | Uses the default level: MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames | Lets you specify a storage level explicitly (same default as cache() if none is given)
Flexibility | Fixed storage level | Flexible: MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, etc.
When to Use | When the dataset fits in memory and is reused multiple times | When memory is limited or you need custom storage behavior
Example | df.cache() | df.persist(StorageLevel.MEMORY_AND_DISK)
from pyspark import StorageLevel
# Using cache (MEMORY_ONLY)
an
df_cached = df.cache()
# Using persist with custom storage level
df_persisted = df.persist(StorageLevel.MEMORY_AND_DISK)
# Trigger actions
df_cached.count()
df_cached.show()
💡 Key Points for Interviews
● cache() is just shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
● Use persist() if data might not fit in memory or you want to store serialized or on
disk.
● Both are lazily evaluated — data is stored only after the first action.
● Always unpersist() when the cached/persisted dataset is no longer needed to free
up resources.
10. What is the importance of partitions in PySpark?
1️⃣ Parallelism & Performance
● Spark processes data in parallel by splitting it into partitions (massively parallel processing, MPP).
● Each partition is processed independently by a task on a Spark executor.
● More partitions → more tasks → better parallelism (up to a point).
● Too few partitions → underutilized CPU cores.
● Too many partitions → overhead in task scheduling.
2️⃣ Data Distribution
● Partitions determine how data is distributed across the cluster.
● Well-distributed partitions → balanced workload.
● Skewed partitions (uneven data) → data skew → performance bottlenecks.
3️⃣ Memory Management
● If a partition is too large → may cause OutOfMemoryError on executors.
● Proper partitioning ensures each partition fits into executor memory.
4️⃣ Shuffle Optimization
● During wide transformations (groupBy, join), Spark shuffles data between partitions.
● The number and size of partitions after shuffle impacts performance.
● Setting the right spark.sql.shuffle.partitions is critical for big datasets.
5️⃣ ETL Efficiency
● In data engineering pipelines:
○ Before joins/aggregations → increase partitions for parallelism.
○ Before writing to storage (like ADLS, S3) → reduce partitions to avoid too
many small files.
# Check partitions
df.rdd.getNumPartitions()
# Repartition for parallelism
df = df.repartition(8)
# Coalesce to reduce partitions before writing
df = df.coalesce(2)
11. What are broadcast variables and why are they used?
● Read-only shared variables that are cached on every executor node in a Spark
cluster.
● They allow you to avoid sending the same data repeatedly with each task.
● Instead, Spark broadcasts the data once to each node, significantly reducing network
I/O and serialization overhead.
Why Use Them?
● Ideal for small lookup tables, configuration settings, or reference data used across
many tasks.
● Especially useful when performing operations that need frequent access to the same
static dataset.
Suppose you're processing a massive dataset of user activity and need to enrich it with state names based on state codes:
states = {"NY": "New York", "CA": "California", "FL": "Florida"}

# Broadcast the small lookup dict once to every executor
broadcast_states = spark.sparkContext.broadcast(states)

df = spark.read.parquet("...")

def lookup_state(code):
    # Each task reads the broadcast value locally instead of shipping the dict with every task
    return broadcast_states.value.get(code, "Unknown")

df_rdd = df.rdd.map(lambda row: (row.user_id, lookup_state(row.state_code), row.activity))
Use broadcast variables when:
● You have a small, read-only dataset that needs to be reused across many tasks or
stages.
● You want to reduce shuffle and serialization overhead.
Avoid them when:
● The dataset is too large to fit in executor memory—could lead to OOM errors.
● The broadcasted data isn't used by enough tasks to justify the overhead.
12. What is the difference between df.show() and df.collect()?
Feature | df.show() | df.collect()
Purpose | Displays data in a tabular format in the console for a quick preview | Retrieves all rows from the DataFrame to the driver as a Python list of Row objects
Default Behavior | Shows the top 20 rows by default; the number of rows can be set with df.show(n) | Always returns all rows in the DataFrame unless it is filtered before calling
Output Location | Prints to the console only (doesn't return data) | Returns a list that you can store in a variable and process further
Memory Impact | Lightweight; doesn't load the full dataset into driver memory | Can be dangerous for large datasets; may cause OutOfMemoryError on the driver
Use Case | Quick inspection or debugging | When you need to work programmatically with all the data on the driver
For large datasets, avoid collect() unless you are absolutely sure the data fits in memory —
instead, use take(n) or limit(n).collect().
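A tiny sketch of those safer alternatives (the DataFrame is synthetic):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShowVsCollect").getOrCreate()
df = spark.range(1_000_000)

df.show(5)                          # prints 5 rows to the console, returns nothing
preview = df.take(5)                # at most 5 Row objects on the driver
sample = df.limit(100).collect()    # cap the rows before collecting to the driver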
13. What is lazy evaluation in PySpark?
Lazy Evaluation in PySpark means that Spark does not execute transformations immediately
when you call them.
Instead, it builds a logical execution plan (DAG) and only runs the computation when an
action (like show(), collect(), count(), etc.) is triggered.
How it works:
1. Transformations (select(), filter(), map(), etc.) → Only recorded in a plan; no
actual execution.
2. Action (show(), count(), collect()) → Triggers Spark to optimize the plan and
execute it on the cluster.
Advantages:
● Optimization: Spark can combine multiple transformations into a single stage
(pipelining).
● Efficiency: Avoids unnecessary computation.
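A minimal sketch (synthetic data): the transformations only build the plan, and the count() at the end is what actually runs the job.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

df = spark.range(10_000_000)
filtered = df.filter(F.col("id") % 2 == 0)                 # transformation: recorded, not executed
doubled = filtered.withColumn("double", F.col("id") * 2)   # still nothing executed

print(doubled.count())                                     # action: plan is optimized and run now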
14. What are the advantages of Delta Lake over traditional file formats?
Feature | Traditional File Formats (CSV, Parquet, etc.) | Delta Lake
ACID Transactions | ❌ Not supported (risk of partial writes/inconsistent data) | ✅ Guaranteed consistency with ACID transactions
Schema Enforcement | ❌ Data with the wrong schema may get written silently | ✅ Rejects writes with a schema mismatch
Schema Evolution | ⚠ Limited (Parquet allows it but without validation) | ✅ Add/modify columns with proper versioning
Data Versioning (Time Travel) | ❌ Not available | ✅ Access older versions of data for rollback, audit, or debugging
Upserts & Deletes | ❌ Difficult (requires full file rewrite) | ✅ Native support via MERGE INTO, DELETE, UPDATE
Performance (Indexing & Caching) | ⚠ Reads entire files, no transaction log optimization | ✅ Optimized reads/writes with a transaction log (_delta_log)
Streaming + Batch Unification | ❌ Usually separate pipelines | ✅ The same Delta table can be used for batch & streaming
File Compaction (Optimize) | ❌ Manual and complex | ✅ Built-in commands to compact small files for better performance
Delta Lake turns your data lake into a transactional, reliable, and high-performance
lakehouse, solving common issues like inconsistent data, slow updates, and schema drift.
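A minimal sketch of what this enables in code, assuming a SparkSession configured with the delta-spark package; the table path, columns, and upsert logic are made up.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeltaDemo").getOrCreate()   # assumes Delta extensions are enabled

customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
updates = spark.createDataFrame([(2, "Bobby"), (3, "Cara")], ["customer_id", "name"])

# ACID write to a Delta table
customers.write.format("delta").mode("overwrite").save("/data/customers_delta")

# Upsert with MERGE: not possible on plain Parquet without rewriting files
target = DeltaTable.forPath(spark, "/data/customers_delta")
(target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())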
15. What happens when a PySpark job runs out of memory (OOM)?
When a PySpark job runs Out of Memory (OOM), it means the executors or driver have
exhausted their allocated memory.
Stage | What Happens
1. Task Execution | Executors start processing data partitions in memory.
2. Memory Exhaustion | If the data, shuffle operations, caching, or joins exceed the executor's memory limit, Spark tries to spill data to disk (temporary files) to free memory.
3. Spill to Disk | Spark writes intermediate data to disk to avoid OOM. This slows the job down but prevents immediate failure, provided spill space is sufficient.
4. GC Pressure | The JVM garbage collector runs frequently to reclaim memory. If it spends too much time in GC (> ~98% of the time), Spark throws a GC overhead error.
5. OOM Failure | If neither GC nor spilling can free enough space, you get an OutOfMemoryError (e.g., "Java heap space" or "GC overhead limit exceeded") and the stage fails.
6. Job Retry / Abort | Spark retries the failed stage (default 4 times). If it fails every time, the job aborts.
Common Causes
● Very large partitions (too much data for one executor).
● Wide transformations (e.g., large shuffles from groupBy, join).
● Caching huge datasets without enough memory.
● Skewed data causing one executor to handle much more data than others.
Prevention / Fixes
● Increase memory: --executor-memory and --driver-memory.
● Increase partitions: repartition() to reduce data per partition.
● Use broadcast joins for small datasets.
● Persist wisely: Use persist(StorageLevel.DISK_ONLY) if RAM is limited.
● Tune memory configs: adjust spark.memory.fraction and spark.memory.storageFraction (note that spark.shuffle.spill has been ignored since Spark 1.6; spilling is always enabled).
● Handle skew: Use salting or skew join optimization.
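A sketch of how a few of these fixes look in code; the memory sizes, partition counts, paths, and DataFrame names are illustrative, not recommendations.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
    .appName("MemoryTunedJob")
    .config("spark.executor.memory", "8g")              # more heap per executor
    .config("spark.driver.memory", "4g")
    .config("spark.sql.shuffle.partitions", "400")      # smaller shuffle partitions per task
    .getOrCreate())

df_large = spark.read.parquet("/data/events")           # hypothetical large input
df_small = spark.read.parquet("/data/lookup")           # hypothetical small lookup

df = df_large.repartition(400)                          # less data per task
result = df.join(broadcast(df_small), "id", "left")     # avoid shuffling the large side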
In PySpark, if a job runs Out of Memory (OOM), Spark first tries to spill intermediate data to disk
to free up RAM. If spilling and garbage collection can’t help, the executor throws an
OutOfMemoryError and the stage fails. Spark retries the stage (default 4 times), but if all
retries fail, the job aborts.
Common causes are large partitions, skewed data, or caching huge datasets.
To fix it: increase executor/driver memory, repartition to reduce data per task, use broadcast
joins for small datasets, and persist to disk instead of memory when RAM is limited.
16. What is AQE in PySpark and why is it useful?
AQE (Adaptive Query Execution) is a Spark feature (enabled by spark.sql.adaptive.enabled=true) that dynamically optimizes query plans at runtime based on the actual data being processed, rather than relying only on static plans created at compile time.
Why it’s useful:
1. Handles Data Skew – Detects uneven partition sizes and splits/rebalances them to
avoid slow tasks.
2. Optimizes Join Strategies – Can switch from shuffle join to broadcast join if a dataset
turns out smaller than expected.
3. Reduces Shuffle Partitions – Automatically merges small shuffle partitions to avoid
excessive tasks and overhead.
Example:
In a retail analytics pipeline, if the orders dataset becomes small after filtering, AQE can switch to a broadcast join automatically, saving shuffle time and boosting performance.
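A sketch of the relevant settings, assuming an existing SparkSession named spark; these are standard Spark SQL config keys, and the values shown are typical defaults in recent Spark versions.
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # turn AQE on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")           # small-table broadcast threshold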
17. How would you handle skewed data in PySpark?
When data is skewed, some partitions hold far more records than others, causing certain tasks
to run much longer. This leads to slow stages and possible OOM errors.
Technique | How It Works | Example
Salting | Add a random "salt" key to the join column to spread skewed keys across multiple partitions, then remove the salt after processing | For a heavy customer_id=123, create customer_id_salted = concat(customer_id, rand_int) before the join
Broadcast Join | If one dataset is small, broadcast it to all executors to avoid shuffles | broadcast(df_small) in the join
AQE Skew Join Optimization | Enable spark.sql.adaptive.enabled to let Spark split skewed partitions dynamically | spark.conf.set("spark.sql.adaptive.enabled", "true")
Filter Early | Reduce data size before joins or aggregations | Apply where() before the join
Repartition by Key | Increase parallelism for the skewed key | df.repartition(200, "key")
💡 Retail Example:
If 70% of your transactions come from one store ID, salting that store ID before joins will ensure
the workload is spread evenly across executors.
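A minimal salting sketch, assuming an existing SparkSession (spark), a large skewed DataFrame df_sales and a small dimension df_stores that both have a store_id column; the salt range and names are made up.
from pyspark.sql import functions as F

NUM_SALTS = 16

# Large, skewed side: add a random salt so a hot store_id is spread over NUM_SALTS sub-keys
sales_salted = (df_sales
    .withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
    .withColumn("join_key", F.concat_ws("_", F.col("store_id").cast("string"), F.col("salt").cast("string")))
    .drop("salt"))

# Small side: replicate every row once per possible salt so each salted key finds its match
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
stores_salted = (df_stores
    .crossJoin(salts)
    .withColumn("join_key", F.concat_ws("_", F.col("store_id").cast("string"), F.col("salt").cast("string")))
    .drop("salt", "store_id"))

result = sales_salted.join(stores_salted, "join_key").drop("join_key")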
18. What is a broadcast join and when should you use it?
A broadcast join sends a small DataFrame to all worker nodes so they can join it locally with
their partition of the big DataFrame, avoiding a costly shuffle.
When to use:
● One DataFrame is small enough to fit in each executor’s memory (rule of thumb: < 10
MB, but can be adjusted via spark.sql.autoBroadcastJoinThreshold).
● To speed up joins by avoiding network shuffle of large datasets.
● Especially useful in star-schema or lookup table joins (fact table + small dimension
table).
from pyspark.sql.functions import broadcast

df_result = df_large.join(broadcast(df_small), on="store_id", how="left")
df_result.show()
19. What is spill in Spark and why does it happen?
In Spark, spill happens when data being processed does not fit entirely into memory, so Spark
temporarily writes (or spills) it to disk to continue processing.
Why does the spill happen?
● Spark operations like shuffle, sort, groupBy, join, or aggregation store intermediate
data in memory.
● If there’s insufficient executor memory (based on spark.executor.memory,
spark.memory.fraction), Spark can’t hold all intermediate data in RAM.
● To avoid OutOfMemory (OOM) errors, Spark spills this extra data to disk.
Common scenarios that cause spill:
1. Large shuffles → e.g., groupByKey, large joins, repartitions.
2. Wide transformations with huge intermediate datasets.
3. Data skew → one partition has too much data to fit in memory.
4. Too many tasks competing for limited executor memory.
Impact of spilling
● Performance degradation — disk I/O is much slower than memory.
● However, it allows Spark jobs to complete successfully instead of failing with OOM.
How to reduce spilling
● Increase memory per executor (spark.executor.memory).
● Increase shuffle memory fraction (spark.memory.fraction).
● Use map-side aggregation (reduceByKey instead of groupByKey).
● Avoid data skew (salting, repartitioning).
● Broadcast small tables instead of shuffling large ones.
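For the map-side aggregation point above, a small RDD sketch (assuming an existing SparkSession named spark; the data is synthetic):
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)] * 10000)

# groupByKey ships every (key, value) pair across the network before aggregating: more shuffle, more spill
grouped_counts = pairs.groupByKey().mapValues(sum).collect()

# reduceByKey combines values inside each partition first, so far less data is shuffled (and spilled)
reduced_counts = pairs.reduceByKey(lambda a, b: a + b).collect()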
20. What are Delta Lake's time travel features and how do they work?
Delta Lake Time Travel allows you to query, restore, or roll back data to a previous version of
a Delta table.
It’s essentially like having built-in version control for your data.
How it works
● Every write operation in Delta Lake creates a new version of the table (stored in the
_delta_log transaction log).
● The transaction log stores metadata + commit history for each version.
● You can query by version number or by timestamp.
1️⃣ Query by version number
df = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/path/to/table")
2️⃣ Query by timestamp
df = spark.read.format("delta") \
    .option("timestampAsOf", "2025-08-01 12:00:00") \
    .load("/path/to/table")
Key points
● Time travel works as long as old data files are retained (controlled by
delta.logRetentionDuration & delta.deletedFileRetentionDuration).
● By default, the transaction log is retained for 30 days and deleted data files for 7 days.
● Does not duplicate data — it uses copy-on-write storage to maintain versions.
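Beyond reading old versions, you can also roll a table back; a sketch (the path is illustrative, and it assumes an existing SparkSession plus Delta Lake 1.2+ where the restore API is available):
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/path/to/table")
table.history().show()                        # inspect the commit history and versions
table.restoreToVersion(5)                     # roll the table back to version 5
# table.restoreToTimestamp("2025-08-01")      # or roll back to a timestamp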