
Wipro PySpark Interview Questions for Data Engineers (3-5 years of experience)

General PySpark Concepts


1. What is the difference between transformations and actions in PySpark? Provide examples.
2. Explain how PySpark handles lazy evaluation and its benefits.
3. What is the difference between `repartition()` and `coalesce()`? When would you use each?
4. Explain how `cache()` and `persist()` work in PySpark and their impact on performance.
5. What are the advantages of using PySpark over Pandas for big data processing?

PySpark SQL & DataFrame API

6. How does PySpark handle schema enforcement? Explain `inferSchema` and `StructType`.
7. How can you read a CSV file in PySpark with custom delimiters and headers?
8. What are the different ways to remove duplicate records in PySpark?
9. Explain `groupBy()`, `agg()`, and `pivot()` functions with an example (see the short sketch below).
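
A minimal illustration for question 9, assuming a hypothetical DataFrame `sales_df` with columns `region`, `quarter`, and `amount` (names invented for this sketch only):

from pyspark.sql import functions as F

# groupBy + agg: total and average amount per region
sales_df.groupBy("region").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount")
).show()

# groupBy + pivot + agg: one output column per quarter, each holding the summed amount
sales_df.groupBy("region").pivot("quarter").agg(F.sum("amount")).show()

Note that `groupBy()` on its own only returns a GroupedData object; nothing is computed until an aggregation and an action such as `show()` are applied.
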
-------------------------------
Practical PySpark Coding Tasks
1. Grouping and Aggregation
You are given a dataset of employees in an Indian company with columns: `emp_id`,
`name`, `department`,
`salary`, and `city`. Write a PySpark program to find the total salary paid in each
department.
Sample Data:
emp_id, name, department, salary, city
101, Rajesh, IT, 75000, Bangalore
102, Priya, HR, 60000, Mumbai
103, Anil, IT, 80000, Hyderabad
104, Sneha, HR, 62000, Pune
105, Manish, Finance, 90000, Chennai
106, Suresh, IT, 78000, Bangalore
Expected Output:
IT, 233000
HR, 122000
Finance, 90000
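
A possible solution sketch, assuming an active SparkSession named `spark` and the sample rows hard-coded for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dept_salary").getOrCreate()

data = [
    (101, "Rajesh", "IT", 75000, "Bangalore"),
    (102, "Priya", "HR", 60000, "Mumbai"),
    (103, "Anil", "IT", 80000, "Hyderabad"),
    (104, "Sneha", "HR", 62000, "Pune"),
    (105, "Manish", "Finance", 90000, "Chennai"),
    (106, "Suresh", "IT", 78000, "Bangalore"),
]
df = spark.createDataFrame(data, ["emp_id", "name", "department", "salary", "city"])

# Total salary per department
df.groupBy("department").agg(F.sum("salary").alias("total_salary")).show()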

2. Handling Null Values


Question:
You are given a dataset containing customer transactions in an Indian e-commerce
platform with columns: `cust_id`, `cust_name`, `city`, `purchase_amount`, and
`product_category`. Some records have missing `purchase_amount`. Write a PySpark
program to fill missing `purchase_amount` values with the average purchase amount
of that product category.

Sample Data:
cust_id, cust_name, city, purchase_amount, product_category
201, Aman, Delhi, 1500, Electronics
202, Kiran, Mumbai, , Fashion
203, Ravi, Bangalore, 2000, Electronics
204, Simran, Hyderabad, , Fashion
205, Vinay, Pune, 1800, Electronics
206, Pooja, Chennai, 1300, Grocery
Expected Output (assuming average for Fashion = 2000):
201, Aman, Delhi, 1500, Electronics
202, Kiran, Mumbai, 2000, Fashion
203, Ravi, Bangalore, 2000, Electronics
204, Simran, Hyderabad, 2000, Fashion
205, Vinay, Pune, 1800, Electronics
206, Pooja, Chennai, 1300, Grocery
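
One way to approach this, assuming the data is already loaded into a DataFrame `df` with `purchase_amount` cast to a numeric type (a sketch, not the only valid answer). Note that in this sample both Fashion rows are missing `purchase_amount`, so a pure per-category average would itself be null for Fashion, which is why the expected output assumes a value; in practice you might fall back to an overall average for such categories.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-category average (avg() ignores nulls), computed without collapsing rows
w = Window.partitionBy("product_category")

filled = df.withColumn(
    "purchase_amount",
    F.coalesce(F.col("purchase_amount"), F.avg("purchase_amount").over(w))
)
filled.show()
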
3. Window Function - Ranking
You have a dataset of students from different Indian states with columns:
`student_id`, `student_name`, `state`, `score`.
Write a PySpark program to rank students within each state based on their scores in
descending order.
Sample Data:
student_id, student_name, state, score
301, Rohit, Maharashtra, 85
302, Sneha, Karnataka, 92
303, Amit, Maharashtra, 90
304, Kunal, Karnataka, 88
305, Nidhi, Maharashtra, 78
306, Pavan, Karnataka, 80
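
A sketch of one possible answer, assuming the sample rows are already loaded into a DataFrame `df`; `rank()` is used here, and `dense_rank()` or `row_number()` are common follow-up variations:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank students within each state, highest score first
w = Window.partitionBy("state").orderBy(F.col("score").desc())

ranked = df.withColumn("rank", F.rank().over(w))
ranked.show()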

Scenario-based interview question for the Azure Data Engineer role:

Q#6:
You have a pipeline that ingests event data in near real-time. Occasionally, some
records arrive late —
maybe by a few hours or even a day. How would you design your Delta Lake pipeline
to handle these late-arriving records effectively?

Answer:
Late-arriving data is a real challenge, especially when you’re building fact tables
or time-sensitive aggregations. Here’s how I deal with it:

Step 1: Identify late data using event timestamps


Use the event timestamp (not ingestion time) to detect if a record arrived late:

late_data = df.filter("event_time < current_date()")
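
A slightly fuller sketch of the same idea, assuming each record also carries an `ingest_time` column (the column name and the one-day threshold are illustrative, not part of the original pipeline):

from pyspark.sql import functions as F

# Treat a record as late if its event timestamp lags ingestion by more than a day
late_data = df.filter(
    (F.unix_timestamp("ingest_time") - F.unix_timestamp("event_time")) > 24 * 3600
)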

Step 2: Use MERGE to upsert late records


Delta Lake’s MERGE command is your best friend for gracefully handling late data:

MERGE INTO silver.events AS target
USING new_events AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

This way, if the record already exists, it’s updated; otherwise, it’s inserted.
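
If you prefer to stay in the DataFrame API rather than registering `new_events` as a temp view for SQL (the SQL form above assumes such a view or table exists), the same upsert can be written with the Delta Lake Python API; a sketch, assuming the delta-spark package is available and `new_events` is a DataFrame:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.events")

(target.alias("target")
    .merge(new_events.alias("source"), "target.event_id = source.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())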

Step 3: Partition wisely


Partitioning by event_date helps isolate the affected partitions and speeds up writes when handling late data.

df.write.partitionBy("event_date").format("delta").mode("append").save("/mnt/silver/events")

Step 4: Recalculate aggregations


If you’re pushing data into Gold tables (aggregations), make sure to reprocess the
impacted time windows (e.g., last 1-2 days) to ensure consistency.
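
For example, a minimal sketch of re-aggregating the last two days into a Gold table (the Gold path, the daily-count aggregate, and the 2-day window are assumptions for illustration):

import datetime
from pyspark.sql import functions as F

cutoff = (datetime.date.today() - datetime.timedelta(days=2)).isoformat()

# Re-aggregate only the impacted event dates from the Silver table
impacted = (
    spark.read.format("delta").load("/mnt/silver/events")
    .filter(F.col("event_date") >= F.lit(cutoff))
)
daily_counts = impacted.groupBy("event_date").agg(F.count("*").alias("event_count"))

# Overwrite just the affected partitions in the Gold table using replaceWhere
(daily_counts.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"event_date >= '{cutoff}'")
    .save("/mnt/gold/daily_event_counts"))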

Final Thought:
Late data doesn’t mean bad data — it just missed the first train. Build your
pipelines to welcome it with open arms!
