Data Engineers Cheat Sheet - 21 Must-Know PySpark Questions

The document outlines key interview questions and answers related to PySpark, focusing on differences between Hadoop MapReduce and Spark, Spark architecture, and various components like RDDs, DataFrames, and SparkSession. It also discusses query optimization techniques, the importance of partitions, and the use of caching and broadcast variables. Additionally, it highlights the differences between narrow and wide transformations, as well as methods like coalesce() and repartition().


PySpark Most Important Interview Questions
1. What is the key difference between Hadoop MapReduce and Spark in terms of performance and scalability?

Feature | Hadoop MapReduce | Apache Spark
Processing Model | Disk I/O-heavy batch jobs | In-memory DAG, efficient transformations
Speed & Performance | Slower, especially for iterative jobs | Fast—thanks to caching and optimized execution
APIs & Development | Java-centric, verbose | High-level APIs across Scala, Python, Java, R
Workload Support | Batch-only | Batch, streaming, SQL, graph, ML in a unified engine
Fault Tolerance | Re-runs failed tasks from disk | Lineage tracking and recomputation
Resource & Ecosystem | Hadoop-dependent | Flexible deployment, extensive integration
Ideal Scenarios | Simple ETL and batch workloads | Real-time, iterative, multi-paradigm workloads

Follow up: Spark writes data to disk. Why, and what's the difference w.r.t. MapReduce?

Spark writes data to disk only when we instruct it to (via storage levels in persist()) or when data doesn't fit into memory and spilling to disk is required, which makes it scenario-specific. Even though Spark and MapReduce look similar when they write data to disk, the difference shows up in performance: Spark performs much better than MapReduce thanks to the Catalyst optimizer and dynamic optimization techniques like AQE.

2.​ What is the role of SparkContext in PySpark?


It’s the entry point to Spark’s execution engine. It connects your application to the Spark
cluster, manages resources, distributes tasks, and coordinates execution.

In modern Spark, it's created automatically inside a SparkSession (spark.sparkContext). Without it, no Spark jobs can run.
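For illustration, a minimal sketch of accessing the SparkContext through a SparkSession (the app name and sample data are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkcontext-demo").getOrCreate()
sc = spark.sparkContext  # the SparkContext behind this session

# Use the SparkContext directly for low-level RDD work
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]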
3.​ Explain Spark Architecture.
Components of Spark Architecture

●​ Driver Program
○​ The main process that runs the Spark application.
○​ Converts user code (in Python, Scala, Java, etc.) into a DAG (Directed
Acyclic Graph) of stages.
○​ Schedules tasks for execution across the cluster.
○​ Contains the SparkContext which is the gateway to the Spark cluster.​

●​ Cluster Manager
○​ Responsible for allocating resources to Spark applications.
○​ Can be Standalone, YARN, or Mesos (Kubernetes in modern setups).
●​ Executors
○​ Worker processes running on cluster nodes.
○​ Execute tasks assigned by the driver.
○​ Store data in memory or on disk for caching.
Execution Flow

1.​ User Code → RDD/Dataset/DataFrame Transformations
○​ Code gets translated into a logical plan.
2.​ Logical Plan → Physical Plan
○​ Optimized by the Catalyst optimizer.
3.​ Stages and Tasks
○​ Spark breaks the job into stages (based on shuffle boundaries).
○​ Each stage has multiple tasks (same operation on different partitions).
4.​ Cluster Manager Allocation
○​ The driver requests executors from the cluster manager.
5.​ Execution on Executors
○​ Executors run tasks in parallel and send results back to the driver.
Key Points

●​ Spark uses lazy evaluation — transformations aren't executed until an action is triggered.
●​ Executors are long-lived processes (unlike Hadoop MapReduce, which starts fresh JVMs per task).
●​ Spark stores data in memory for speed, reducing disk I/O compared to MapReduce.
●​ The driver monitors execution and re-runs failed tasks if needed.
4.​ What is the difference between RDDs, DataFrames, and Datasets?

Feature | RDD (Resilient Distributed Dataset) | DataFrame | Dataset
Abstraction Level | Low-level | High-level | High-level
Type Safety | Type-safe (in Scala/Java) | Not type-safe | Type-safe (only in Scala, not in PySpark)
Ease of Use | More complex APIs | Easier, SQL-like API | Similar to DataFrame
Performance | Less optimized (no Catalyst/Tungsten) | Optimized using Catalyst & Tungsten | Optimized (like DataFrame)
Serialization | Java serialization | Tungsten binary format | Encoders (efficient)
Use Case | Complex, fine-grained transformations | SQL operations, analytics | When compile-time type safety is required (Scala only)
Support in PySpark | ✅ Yes | ✅ Yes | ❌ Not supported (only in Scala/Java)

6.​ What is query optimization in PySpark?


Spark uses lazy evaluation — transformations aren't executed until an action is triggered — which lets the Catalyst optimizer analyze and optimize the entire query plan before anything runs.

Catalyst Optimizer Steps


1.​ Analysis – Checks syntax, resolves column names, validates the schema.

2.​ Logical Plan – Represents the user's query logically (no execution yet).

3.​ Optimization – Applies rules like:

○​ Predicate pushdown (filtering early)
○​ Constant folding (evaluating constant expressions before execution)
○​ Projection pruning (selecting only required columns)


4.​ Physical Plan – Decides execution strategy (e.g., broadcast join vs shuffle join).
5.​ Code Generation – Uses whole-stage codegen for faster execution.

Common Optimization Techniques in PySpark

●​ Use DataFrame API instead of RDDs – DataFrames get Catalyst/Tungsten benefits.


●​ Avoid UDFs unless necessary – UDFs are slower; use Spark SQL functions.
●​ Column Pruning – Select only the columns you need.
●​ Predicate Pushdown – Filter data early so less data flows through stages.
●​ Broadcast Join – For small tables, use broadcast() to avoid shuffling.
●​ Cache/Persist – For reused DataFrames, cache results in memory.
●​ Partitioning – Repartition or coalesce wisely to balance parallelism.
●​ File Format – Use columnar formats like Parquet/ORC for faster reads.
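A minimal sketch combining a few of these techniques (column pruning, early filtering, and a broadcast join); the paths, tables, and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

# Columnar source + select only the needed columns (column pruning)
orders = spark.read.parquet("/data/orders").select("order_id", "store_id", "amount")

# Filter early so less data flows through later stages (predicate pushdown)
orders = orders.filter(col("amount") > 0)

# Broadcast the small dimension table to avoid shuffling the large one
stores = spark.read.parquet("/data/stores")
result = orders.join(broadcast(stores), on="store_id", how="left")

# Inspect the physical plan chosen by Catalyst
result.explain()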

7.​ What is SparkSession in PySpark?

Introduced in Spark 2.0 to replace the older separate contexts. It's a unified object that combines:

●​ SparkContext (for RDDs)
●​ SQLContext (for SQL queries)
●​ HiveContext (for Hive support)

Every PySpark program starts by creating a SparkSession. It tells Spark:

●​ How to run (local or cluster)
●​ App name
●​ Configurations (memory, cores, etc.)

Without it, you can't access the DataFrame APIs.


Creating a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RetailDataAnalysis") \
    .master("local[*]") \
    .getOrCreate()

Parameters:

●​ appName() → Name for the Spark job (visible in UI).


●​ master() → Execution mode:

○​ "local[*]" → Run locally using all cores.


○​ "yarn", "mesos", "spark://..." → For cluster modes.
●​ .getOrCreate() → Creates a new session or returns an existing one.
8.​ What is the difference between wide and narrow transformations in PySpark?

1. Narrow Transformations

●​ Definition: Transformations where each input partition contributes to only one output partition.
●​ No shuffling of data across the network.
●​ Faster because computation stays local to the partition.
●​ Examples:
○​ map()
○​ filter()
○​ union()
○​ coalesce() (when not increasing partitions)
●​ When to use: If you can process data within the same partition without needing data from others.

2. Wide Transformations

●​ Definition: Transformations where input partitions contribute to multiple output partitions.
●​ Involves shuffling data across nodes in the cluster (network + disk I/O).
●​ Slower because the shuffle is expensive.
●​ Examples:
○​ groupByKey()
○​ reduceByKey()
○​ join()
○​ distinct()
○​ repartition()
●​ Why slower: Data is redistributed to new partitions → shuffle → sort → aggregation.

3. Shuffle Process in Wide Transformations


●​ Steps:
○​ Map stage – Data is prepared for shuffle.
○​ Shuffle stage – Data is moved across the cluster.
○​ Reduce stage – Data is aggregated/processed after shuffle.
●​ Optimization Tip: Reduce shuffles as much as possible by using (see the sketch below):
○​ reduceByKey() instead of groupByKey()
○​ mapPartitions() for custom processing.
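As referenced above, a minimal RDD sketch contrasting the two approaches (assumes an existing SparkSession named spark; the word-count data is made up):

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey() shuffles every individual value, then aggregates on the reduce side
counts_slow = rdd.groupByKey().mapValues(sum)

# reduceByKey() combines values within each partition first (map-side aggregation),
# so far less data crosses the network during the shuffle
counts_fast = rdd.reduceByKey(lambda a, b: a + b)

print(counts_fast.collect())  # e.g. [('a', 2), ('b', 1)]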
4. Execution Plan

●​ Narrow transformations → executed in a single stage.


●​ Wide transformations → cause a stage boundary (new stage in DAG).
●​ Spark’s DAG scheduler splits jobs into stages based on these transformations.

5. Key Interview Points

Feature | Narrow | Wide
Data movement | No shuffle | Shuffle
Speed | Faster | Slower
Stage creation | Same stage | New stage
Examples | map, filter | join, groupByKey

9.​ What is the use of coalesce() and repartition() in PySpark?

Feature | repartition() | coalesce()
Purpose | Increases or decreases the number of partitions. | Decreases the number of partitions only.
Shuffling | Full shuffle of data across all nodes. | Minimizes shuffle; avoids a full shuffle.
Use Case | When increasing partitions or evenly redistributing data. | When reducing the number of partitions for optimized output.
Performance Cost | Expensive due to full shuffle. | Inexpensive due to limited movement.
Typical Scenario | Before wide transformations like join, groupBy. | Before writing data to disk.
Data Redistribution | Even redistribution across partitions. | Narrows existing partitions (merging adjacent ones).
10.​ When should you use cache() and persist() in PySpark, and what is the difference between the two?

We use them when:

●​ The same DataFrame/RDD is used multiple times in a Spark application.


●​ We want to avoid recomputation each time it's used.
●​ The dataset is expensive to compute (e.g., involves joins, aggregations, filtering large
data).

●​ We want to improve performance in iterative algorithms (e.g., ML model training, graph processing).
Example scenario:
If you perform an expensive transformation like df.filter(...).groupBy(...).agg(...) and plan to use the result in multiple actions (count(), show(), write()), you should cache/persist it.

Feature | cache() | persist()
Default Storage Level | Stores data in MEMORY_ONLY. | Allows specifying a storage level (default: MEMORY_ONLY).
Flexibility | Fixed storage level. | Flexible – can store as MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, etc.
When to Use | When the dataset fits in memory and is reused multiple times. | When memory is limited or you need custom storage behavior.
Example | df.cache() | df.persist(StorageLevel.MEMORY_AND_DISK)

from pyspark import StorageLevel

# Using cache (MEMORY_ONLY)
df_cached = df.cache()

# Using persist with a custom storage level
df_persisted = df.persist(StorageLevel.MEMORY_AND_DISK)

# Trigger actions
df_cached.count()
df_cached.show()
💡 Key Points for Interviews
●​ cache() is just shorthand for persist() with the default storage level — MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames since Spark 2.0.
●​ Use persist() if the data might not fit in memory or you want to store it serialized or on disk.
●​ Both are lazily evaluated — data is stored only after the first action.
●​ Always unpersist() when the cached/persisted dataset is no longer needed to free
up resources.​

11.​ What is the importance of partitions in PySpark?

1️⃣ Parallelism & Performance

●​ Spark processes data in parallel by splitting it into partitions (massively parallel processing, MPP).
●​ Each partition is processed independently by a task on a Spark executor.

●​ More partitions → more tasks → better parallelism (up to a point).
●​ Too few partitions → underutilized CPU cores.
●​ Too many partitions → overhead in task scheduling.

2️⃣ Data Distribution
●​ Partitions determine how data is distributed across the cluster.
●​ Well-distributed partitions → balanced workload.
●​ Skewed partitions (uneven data) → data skew → performance bottlenecks.

3️⃣ Memory Management

●​ If a partition is too large → may cause OutOfMemoryError on executors.


●​ Proper partitioning ensures each partition fits into executor memory.

4️⃣ Shuffle Optimization

●​ During wide transformations (groupBy, join), Spark shuffles data between partitions.
●​ The number and size of partitions after the shuffle impact performance.

●​ Setting the right spark.sql.shuffle.partitions is critical for big datasets.

5️⃣ ETL Efficiency

●​ In data engineering pipelines:


○​ Before joins/aggregations → increase partitions for parallelism.
○​ Before writing to storage (like ADLS, S3) → reduce partitions to avoid too
many small files.​
# Check partitions​
df.rdd.getNumPartitions()​

# Repartition for parallelism​
df = df.repartition(8)​

# Coalesce to reduce partitions before writing​
df = df.coalesce(2)

12.​ What are broadcast variables and why are they used?

●​ Read-only shared variables that are cached on every executor node in a Spark cluster.
●​ They allow you to avoid sending the same data repeatedly with each task.
●​ Instead, Spark broadcasts the data once to each node, significantly reducing network I/O and serialization overhead.

Why Use Them?

●​ Ideal for small lookup tables, configuration settings, or reference data used across many tasks.
●​ Especially useful when performing operations that need frequent access to the same static dataset.

Suppose you're processing a massive dataset of user activity and need to enrich it with state names based on state codes:

states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcast_states = spark.sparkContext.broadcast(states)
df = spark.read.parquet("...")

def lookup_state(code):
    return broadcast_states.value.get(code, "Unknown")

df_rdd = df.rdd.map(lambda row: (row.user_id, lookup_state(row.state_code), row.activity))

Use broadcast variables when:

●​ You have a small-read-only dataset that needs to be reused across many tasks or
stages.
●​ You want to reduce shuffle and serialization overhead.​
Avoid them when:​

●​ The dataset is too large to fit in executor memory—could lead to OOM errors.
●​ The broadcasted data isn't used by enough tasks to justify the overhead.

13.​ What is the difference between df.show() and df.collect()?

Feature | df.show() | df.collect()
Purpose | Displays data in a tabular format in the console for a quick preview. | Retrieves all rows from the DataFrame to the driver as a Python list of Row objects.
Default Behavior | Shows the top 20 rows by default; the number of rows can be specified with df.show(n). | Always returns all rows in the DataFrame unless filtered before calling.
Output Location | Prints to the console only (doesn't return data). | Returns a list that you can store in a variable and process further.
Memory Impact | Lightweight; doesn't load the full dataset into driver memory. | Can be dangerous for large datasets – may cause OutOfMemoryError on the driver.
Use Case | For quick inspection or debugging. | When you need to programmatically work with all the data in the driver.

For large datasets, avoid collect() unless you are absolutely sure the data fits in driver memory — instead, use take(n) or limit(n).collect().
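A small sketch of the safer alternatives (assuming an existing DataFrame df):

# Preview without pulling the full dataset to the driver
df.show(5)

# Bring back only a bounded number of rows
first_rows = df.take(10)               # list of up to 10 Row objects
sample_rows = df.limit(100).collect()  # limit first, then collect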

14.​ What is lazy evaluation in PySpark?



Lazy Evaluation in PySpark means that Spark does not execute transformations immediately
when you call them.
Instead, it builds a logical execution plan (DAG) and only runs the computation when an
action (like show(), collect(), count(), etc.) is triggered.
How it works:

1.​ Transformations (select(), filter(), map(), etc.) → Only recorded in a plan; no


actual execution.
2.​ Action (show(), count(), collect()) → Triggers Spark to optimize the plan and
execute it on the cluster.

Advantages:

●​ Optimization: Spark can combine multiple transformations into a single stage (pipelining).
●​ Efficiency: Avoids unnecessary computation.
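A minimal sketch of the behaviour (assumes an existing SparkSession spark; the path and column names are hypothetical):

df = spark.read.parquet("/data/events")            # only schema work; no data processed yet
active = df.filter(df.status == "ACTIVE")          # transformation: recorded in the plan
projected = active.select("user_id", "status")     # still no execution

projected.count()  # action: triggers optimization and execution of the whole plan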

15.​ What are advantages of Delta Lake over traditional file formats?

Feature | Traditional File Formats (CSV, Parquet, etc.) | Delta Lake
ACID Transactions | ❌ Not supported (risk of partial writes/inconsistent data) | ✅ Guaranteed consistency with ACID transactions
Schema Enforcement | ❌ Data with the wrong schema may get written silently | ✅ Rejects writes with a schema mismatch
Schema Evolution | ⚠ Limited (Parquet allows it but without validation) | ✅ Add/modify columns with proper versioning
Data Versioning (Time Travel) | ❌ Not available | ✅ Access older versions of data for rollback, audit, or debugging
Upserts & Deletes | ❌ Difficult (requires full file rewrite) | ✅ Native support via MERGE INTO, DELETE, UPDATE
Performance (Indexing & Caching) | ⚠ Reads entire files, no transaction log optimization | ✅ Optimized reads/writes with the transaction log (_delta_log)
Streaming + Batch Unification | ❌ Usually separate pipelines | ✅ The same Delta table can be used for batch & streaming
File Compaction (Optimize) | ❌ Manual and complex | ✅ Built-in commands to compact small files for better performance
Delta Lake turns your data lake into a transactional, reliable, and high-performance
lakehouse, solving common issues like inconsistent data, slow updates, and schema drift.
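For illustration, a minimal upsert sketch using the delta-spark package (assumes Delta Lake is installed; the path, DataFrame names, and key column are hypothetical):

from delta.tables import DeltaTable

# Write a DataFrame out as a Delta table
new_orders.write.format("delta").mode("append").save("/data/delta/orders")

# Upsert (MERGE INTO) incoming records into the existing table
target = DeltaTable.forPath(spark, "/data/delta/orders")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())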

16.​ What happens when a PySpark job runs OOM?


When a PySpark job runs Out of Memory (OOM), it means the executors or driver have
exhausted their allocated memory.

Stage | What Happens
1. Task Execution | Executors start processing data partitions in memory.
2. Memory Exhaustion | If the data, shuffle operations, caching, or joins exceed the executor's memory limit, Spark tries to spill data to disk (temporary files) to free memory.
3. Spill to Disk | Spark writes intermediate data to disk to avoid OOM. This slows down the job but prevents immediate failure—if spill space is enough.
4. GC Pressure | The JVM garbage collector runs frequently to reclaim memory. If it spends too much time in GC (> ~98% of the time), Spark throws a GC overhead error.
5. OOM Failure | If neither GC nor spilling can free enough space, you get an OutOfMemoryError (e.g., Java heap space or GC overhead limit exceeded). The stage fails.
6. Job Retry / Abort | Spark retries the failed stage (default 4 times). If it fails every time, the job aborts.

Common Causes

●​ Very large partitions (too much data for one executor).


●​ Wide transformations (e.g., large shuffles from groupBy, join).
●​ Caching huge datasets without enough memory.
●​ Skewed data causing one executor to handle much more data than others.

Prevention / Fixes

●​ Increase memory: --executor-memory and --driver-memory.


●​ Increase partitions: repartition() to reduce data per partition.
●​ Use broadcast joins for small datasets.
●​ Persist wisely: Use persist(StorageLevel.DISK_ONLY) if RAM is limited.
●​ Enable spilling configs: Tune spark.shuffle.spill and
spark.memory.fraction.
●​ Handle skew: Use salting or skew join optimization.

In PySpark, if a job runs Out of Memory (OOM), Spark first tries to spill intermediate data to disk to free up RAM. If spilling and garbage collection can't help, the executor throws an OutOfMemoryError and the stage fails. Spark retries the stage (default 4 times), but if all retries fail, the job aborts.

Common causes are large partitions, skewed data, or caching huge datasets. To fix it: increase executor/driver memory, repartition to reduce data per task, use broadcast joins for small datasets, and persist to disk instead of memory when RAM is limited.
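A hedged sketch of the tuning knobs mentioned above, set when building the session (the values are arbitrary starting points, not recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("oom-tuning-sketch")
    .config("spark.executor.memory", "8g")          # more memory per executor
    .config("spark.driver.memory", "4g")            # more memory for the driver
    .config("spark.sql.shuffle.partitions", "400")  # smaller partitions after shuffles
    .config("spark.sql.adaptive.enabled", "true")   # let AQE handle skew and partition sizing
    .getOrCreate())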

17.​ What is AQE in PySpark and why is it useful?

AQE (Adaptive Query Execution) is a Spark feature (enabled by spark.sql.adaptive.enabled=true) that dynamically optimizes query plans at runtime based on the actual data being processed, rather than relying only on static plans created at compile time.

Why it’s useful:


1.​ Handles Data Skew – Detects uneven partition sizes and splits/rebalances them to avoid slow tasks.

2.​ Optimizes Join Strategies – Can switch from a shuffle join to a broadcast join if a dataset turns out to be smaller than expected.

3.​ Reduces Shuffle Partitions – Automatically merges small shuffle partitions to avoid excessive tasks and overhead.
excessive tasks and overhead.
Example:
In a retail analytics pipeline, if the orders dataset becomes small after filtering, AQE can switch to a broadcast join automatically—saving shuffle time and boosting performance.
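A quick sketch of the relevant settings (Spark 3.x; AQE is enabled by default in recent releases):

# Enable AQE and its sub-features
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions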



18.​ How would you handle skewed data in PySpark?


When data is skewed, some partitions hold far more records than others, causing certain tasks to run much longer. This leads to slow stages and possible OOM errors.

Technique | How It Works | Example
Salting | Add a random "salt" key to the join column to spread skewed keys across multiple partitions, then remove the salt after processing. | For a heavy customer_id=123, create customer_id_salted = concat(customer_id, rand_int) before the join.
Broadcast Join | If one dataset is small, broadcast it to all executors to avoid shuffles. | broadcast(df_small) in the join.
AQE Skew Join Optimization | Enable spark.sql.adaptive.enabled to let Spark split skewed partitions dynamically. | spark.conf.set("spark.sql.adaptive.enabled", "true").
Filter Early | Reduce data size before joins or aggregations. | Apply where() before the join.
Repartition by Key | Increase parallelism for the skewed key. | df.repartition(200, "key").

💡 Retail Example:
If 70% of your transactions come from one store ID, salting that store ID before joins will ensure the workload is spread evenly across executors.
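For illustration, a minimal salting sketch for the skewed side of a join (the DataFrames transactions and stores, the key store_id, and the salt factor are hypothetical):

from pyspark.sql import functions as F

SALT_BUCKETS = 10

# Add a random salt to the skewed key on the large side
tx_salted = transactions.withColumn(
    "store_id_salted",
    F.concat_ws("_", F.col("store_id").cast("string"),
                F.floor(F.rand() * SALT_BUCKETS).cast("string")))

# Replicate the small side once per salt value so every salted key finds a match
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
stores_salted = stores.crossJoin(salts).withColumn(
    "store_id_salted",
    F.concat_ws("_", F.col("store_id").cast("string"), F.col("salt").cast("string")))

result = tx_salted.join(stores_salted, on="store_id_salted", how="inner")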

19.​ What is broadcast join and when should you use it?
a-

A broadcast join sends a small DataFrame to all worker nodes so they can join it locally with their partition of the big DataFrame, avoiding a costly shuffle.

When to use:
●​ One DataFrame is small enough to fit in each executor's memory (rule of thumb: < 10 MB, but this can be adjusted via spark.sql.autoBroadcastJoinThreshold).
●​ To speed up joins by avoiding a network shuffle of large datasets.
●​ Especially useful in star-schema or lookup-table joins (fact table + small dimension table).
from pyspark.sql.functions import broadcast

df_result = df_large.join(broadcast(df_small), on="store_id", how="left")
df_result.show()

20.​ What is a spill in Spark and why does it happen?

In Spark, spill happens when data being processed does not fit entirely into memory, so Spark
temporarily writes (or spills) it to disk to continue processing.
Why does the spill happen?

●​ Spark operations like shuffle, sort, groupBy, join, or aggregation store intermediate
data in memory.
●​ If there’s insufficient executor memory (based on spark.executor.memory,
spark.memory.fraction), Spark can’t hold all intermediate data in RAM.
●​ To avoid OutOfMemory (OOM) errors, Spark spills this extra data to disk.

Common scenarios that cause spill

1.​ Large shuffles → e.g., groupByKey, large joins, repartitions.
2.​ Wide transformations with huge intermediate datasets.
3.​ Data skew → one partition has too much data to fit in memory.
4.​ Too many tasks competing for limited executor memory.

Impact of spilling

●​ Performance degradation — disk I/O is much slower than memory.
●​ However, it allows Spark jobs to complete successfully instead of failing with OOM.

How to reduce spilling

●​ Increase memory per executor (spark.executor.memory).
●​ Increase the shuffle memory fraction (spark.memory.fraction).
●​ Use map-side aggregation (reduceByKey instead of groupByKey).
●​ Avoid data skew (salting, repartitioning).


●​ Broadcast small tables instead of shuffling large ones.

21.​ What are Delta Lake's time travel features and how do they work?
Delta Lake Time Travel allows you to query, restore, or roll back data to a previous version of a Delta table. It's essentially like having built-in version control for your data.

How it works
●​ Every write operation in Delta Lake creates a new version of the table (stored in the
_delta_log transaction log).
●​ The transaction log stores metadata + commit history for each version.
●​ You can query by version number or by timestamp.
1️⃣ Query by version number

df = spark.read.format("delta")\
.option("versionAsOf", 5).load("/path/to/table")

2️⃣ Query by timestamp

df = spark.read.format("delta")\
    .option("timestampAsOf", "2025-08-01 12:00:00").load("/path/to/table")

Key points

●​ Time travel works as long as old data files are retained (controlled by delta.logRetentionDuration & delta.deletedFileRetentionDuration).
●​ Default log retention is 30 days (deleted data files are kept for 7 days by default).
●​ Does not duplicate data — it uses copy-on-write storage to maintain versions.
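A short sketch of inspecting history and rolling back with the delta-spark API (RESTORE requires a reasonably recent Delta Lake release; the path is hypothetical):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/path/to/table")

# Inspect the commit history recorded in _delta_log
dt.history().select("version", "timestamp", "operation").show()

# Roll the table back to an earlier version
dt.restoreToVersion(5)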
81
a-
pt
gu
a-
ik
sh
an

You might also like