
Wipro PySpark Interview Questions for Data Engineers (3-5 years of experience)

General PySpark Concepts


1. What is the difference between transformations and actions in PySpark? Provide examples.
2. Explain how PySpark handles lazy evaluation and its benefits.
3. What is the difference between `repartition()` and `coalesce()`? When would you use each?
4. Explain how `cache()` and `persist()` work in PySpark and their impact on performance.
5. What are the advantages of using PySpark over Pandas for big data processing?

PySpark SQL & DataFrame API

6. How does PySpark handle schema enforcement? Explain `inferSchema` and `StructType`.
7. How can you read a CSV file in PySpark with custom delimiters and headers?
8. What are the different ways to remove duplicate records in PySpark?
9. Explain `groupBy()`, `agg()`, and `pivot()` functions with an example (see the short sketch below).
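
A minimal illustration for question 9, assuming a hypothetical DataFrame `sales_df` with columns `region`, `quarter`, and `amount` (names invented for this sketch only):

from pyspark.sql import functions as F

# groupBy + agg: total and average amount per region
sales_df.groupBy("region").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount")
).show()

# groupBy + pivot + agg: one output column per quarter, each holding the summed amount
sales_df.groupBy("region").pivot("quarter").agg(F.sum("amount")).show()

Note that `groupBy()` on its own only returns a GroupedData object; nothing is computed until an aggregation and an action such as `show()` are applied.
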
-------------------------------
Practical PySpark Coding Tasks
1. Grouping and Aggregation
You are given a dataset of employees in an Indian company with columns: `emp_id`,
`name`, `department`,
`salary`, and `city`. Write a PySpark program to find the total salary paid in each
department.
Sample Data:
emp_id, name, department, salary, city
101, Rajesh, IT, 75000, Bangalore
102, Priya, HR, 60000, Mumbai
103, Anil, IT, 80000, Hyderabad
104, Sneha, HR, 62000, Pune
105, Manish, Finance, 90000, Chennai
106, Suresh, IT, 78000, Bangalore
Expected Output:
IT, 233000
HR, 122000
Finance, 90000
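
A possible solution sketch, assuming an active SparkSession named `spark` and the sample rows hard-coded for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dept_salary").getOrCreate()

data = [
    (101, "Rajesh", "IT", 75000, "Bangalore"),
    (102, "Priya", "HR", 60000, "Mumbai"),
    (103, "Anil", "IT", 80000, "Hyderabad"),
    (104, "Sneha", "HR", 62000, "Pune"),
    (105, "Manish", "Finance", 90000, "Chennai"),
    (106, "Suresh", "IT", 78000, "Bangalore"),
]
df = spark.createDataFrame(data, ["emp_id", "name", "department", "salary", "city"])

# Total salary per department
df.groupBy("department").agg(F.sum("salary").alias("total_salary")).show()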

2. Handling Null Values


Question:
You are given a dataset containing customer transactions in an Indian e-commerce
platform with columns: `cust_id`, `cust_name`, `city`, `purchase_amount`, and
`product_category`. Some records have missing `purchase_amount`. Write a PySpark
program to fill missing `purchase_amount` values with the average purchase amount
of that product category.

Sample Data:
cust_id, cust_name, city, purchase_amount, product_category
201, Aman, Delhi, 1500, Electronics
202, Kiran, Mumbai, , Fashion
203, Ravi, Bangalore, 2000, Electronics
204, Simran, Hyderabad, , Fashion
205, Vinay, Pune, 1800, Electronics
206, Pooja, Chennai, 1300, Grocery
Expected Output (assuming average for Fashion = 2000):
201, Aman, Delhi, 1500, Electronics
202, Kiran, Mumbai, 2000, Fashion
203, Ravi, Bangalore, 2000, Electronics
204, Simran, Hyderabad, 2000, Fashion
205, Vinay, Pune, 1800, Electronics
206, Pooja, Chennai, 1300, Grocery
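
One way to approach this, assuming the data is already loaded into a DataFrame `df` with `purchase_amount` cast to a numeric type (a sketch, not the only valid answer). Note that in this sample both Fashion rows are missing `purchase_amount`, so a pure per-category average would itself be null for Fashion, which is why the expected output assumes a value; in practice you might fall back to an overall average for such categories.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-category average (avg() ignores nulls), computed without collapsing rows
w = Window.partitionBy("product_category")

filled = df.withColumn(
    "purchase_amount",
    F.coalesce(F.col("purchase_amount"), F.avg("purchase_amount").over(w))
)
filled.show()
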
3. Window Function - Ranking
You have a dataset of students from different Indian states with columns:
`student_id`, `student_name`, `state`, `score`.
Write a PySpark program to rank students within each state based on their scores in
descending order.
Sample Data:
student_id, student_name, state, score
301, Rohit, Maharashtra, 85
302, Sneha, Karnataka, 92
303, Amit, Maharashtra, 90
304, Kunal, Karnataka, 88
305, Nidhi, Maharashtra, 78
306, Pavan, Karnataka, 80
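
A sketch of one possible answer, assuming the sample rows are already loaded into a DataFrame `df`; `rank()` is used here, and `dense_rank()` or `row_number()` are common follow-up variations:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank students within each state, highest score first
w = Window.partitionBy("state").orderBy(F.col("score").desc())

ranked = df.withColumn("rank", F.rank().over(w))
ranked.show()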

Scenario-based interview question for the Azure Data Engineer role:

Q#6:
You have a pipeline that ingests event data in near real-time. Occasionally, some
records arrive late —
maybe by a few hours or even a day. How would you design your Delta Lake pipeline
to handle these late-arriving records effectively?

Answer:
Late-arriving data is a real challenge, especially when you’re building fact tables
or time-sensitive aggregations. Here’s how I deal with it:

Step 1: Identify late data using event timestamps


Use the event timestamp (not ingestion time) to detect if a record arrived late:

late_data = df.filter("event_time < current_date()")
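
A slightly fuller sketch of the same idea, assuming each record also carries an `ingest_time` column (the column name and the one-day threshold are illustrative, not part of the original pipeline):

from pyspark.sql import functions as F

# Treat a record as late if its event timestamp lags ingestion by more than a day
late_data = df.filter(
    (F.unix_timestamp("ingest_time") - F.unix_timestamp("event_time")) > 24 * 3600
)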

Step 2: Use MERGE to upsert late records


Delta Lake’s MERGE command is your best friend for gracefully handling late data:

MERGE INTO silver.events AS target
USING new_events AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

This way, if the record already exists, it’s updated; otherwise, it’s inserted.
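
If you prefer to stay in the DataFrame API rather than registering `new_events` as a temp view for SQL (the SQL form above assumes such a view or table exists), the same upsert can be written with the Delta Lake Python API; a sketch, assuming the delta-spark package is available and `new_events` is a DataFrame:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.events")

(target.alias("target")
    .merge(new_events.alias("source"), "target.event_id = source.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())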

Step 3: Partition wisely


Partitioning by event_date helps isolate the affected partitions and speeds up writes when handling late data.

df.write.partitionBy("event_date").format("delta").mode("append").save("/mnt/silver/events")

Step 4: Recalculate aggregations


If you’re pushing data into Gold tables (aggregations), make sure to reprocess the
impacted time windows (e.g., last 1-2 days) to ensure consistency.
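
For example, a minimal sketch of re-aggregating the last two days into a Gold table (the Gold path, the daily-count aggregate, and the 2-day window are assumptions for illustration):

import datetime
from pyspark.sql import functions as F

cutoff = (datetime.date.today() - datetime.timedelta(days=2)).isoformat()

# Re-aggregate only the impacted event dates from the Silver table
impacted = (
    spark.read.format("delta").load("/mnt/silver/events")
    .filter(F.col("event_date") >= F.lit(cutoff))
)
daily_counts = impacted.groupBy("event_date").agg(F.count("*").alias("event_count"))

# Overwrite just the affected partitions in the Gold table using replaceWhere
(daily_counts.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"event_date >= '{cutoff}'")
    .save("/mnt/gold/daily_event_counts"))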

Final Thought:
Late data doesn’t mean bad data — it just missed the first train. Build your
pipelines to welcome it with open arms!
