Below are the answers to 45 highly probable Senior Data Engineer
interview questions tailored for the Go Digital Technology Consulting LLP
role:
🔸 PySpark & Spark SQL
1. Read CSV from S3, remove duplicates, write Parquet partitioned:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DedupCSV").getOrCreate()
df = spark.read.option("header", True).csv("s3://bucket/input.csv")
dedup = df.dropDuplicates()
dedup.write.partitionBy("date").parquet("s3://bucket/output/")
"Dedup" short for deduplication, is a data compression technique that
eliminates redundant copies of data, thereby reducing storage space
and bandwidth usage.
2. Handle skewed joins in PySpark: Use salting:
from pyspark.sql.functions import rand, explode, array, lit
big_df = big_df.withColumn("salt", (rand() * 10).cast("int"))
small_df = small_df.withColumn("salt", explode(array([lit(i) for i in range(10)])))
joined = big_df.join(small_df, ["key", "salt"])
pyspark.sql.functions.rand is a function within PySpark's SQL module that
generates a column of random numbers.
In PySpark, explode is a function from pyspark.sql.functions used to
transform a column containing arrays or maps into multiple rows. It
effectively "flattens" nested data structures within a DataFrame.
3. Repartition vs coalesce:
repartition(n): full shuffle; can increase or decrease the number of partitions.
coalesce(n): merges existing partitions to reduce their count without a full shuffle (cannot increase partitions).
df = df.repartition(10) # Use before wide transformations
out = df.coalesce(1) # Use before writing small outputs
4. PySpark UDF to mask PII:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def mask_email(email):
    return email[0] + "***@" + email.split("@")[1] if email else None
mask_udf = udf(mask_email, StringType())
df = df.withColumn("email", mask_udf(df.email))
5. Flatten nested JSON:
from pyspark.sql.functions import col
flat = df.select(col("name"), col("address.city"), col("address.zip"))
6. Cache & persist:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action triggers the actual caching
cache() stores data at the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
persist() lets you choose the storage level explicitly: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.
Both avoid recomputing the lineage when the dataset is reused.
7. Join two datasets and filter CDC records:
cdc_df = new_df.join(old_df, "id", "left_anti")  # keeps rows in new_df whose id is absent from old_df
8. Parse log files:
from pyspark.sql.functions import regexp_extract
pattern = r"(?P<ip>\d+\.\d+\.\d+\.\d+).+?(?P<status>\d{3})"
df = raw_df.select(
    regexp_extract("value", pattern, 1).alias("ip"),
    regexp_extract("value", pattern, 2).alias("status"),
)
9. Window function to get latest record per group:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
w = Window.partitionBy("user_id").orderBy(col("event_time").desc())
latest = df.withColumn("rn", row_number().over(w)).filter("rn = 1")
10. Optimize long PySpark jobs:
Reduce shuffle, use broadcast joins
Persist intermediate data
Monitor Spark UI DAGs
Tune spark.sql.shuffle.partitions, executor memory
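A minimal sketch of two of these optimizations, assuming a large fact_df joined to a small dim_df (names are illustrative):
from pyspark.sql.functions import broadcast
# Lower the shuffle partition count for modest data volumes (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")
# Broadcast the small dimension table so the large side is not shuffled.
joined = fact_df.join(broadcast(dim_df), "customer_id")
# Persist an intermediate result that several downstream branches reuse.
joined.persist()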
🔙 Python + Pandas
11. Parse JSON & extract list:
import json
data = json.loads(json_string)
my_list = data["items"]
12. Regex email validation:
import re
def validate(email):
    return re.match(r"[^@]+@[^@]+\.[^@]+", email) is not None
13. Top 3 frequent elements:
from collections import Counter
Counter(my_list).most_common(3)
14. Read Excel and convert to Parquet:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.read_excel("data.xlsx")
table = pa.Table.from_pandas(df)
pq.write_table(table, "data.parquet", compression='snappy')
15. Rename CSV files:
import os, time
for f in os.listdir("./data"):
    if f.endswith(".csv"):
        # Keep the original name in the new one so files renamed in the same second don't collide.
        new_name = f"data_{int(time.time())}_{f}"
        os.rename(os.path.join("./data", f), os.path.join("./data", new_name))
16. Count nulls & data types in Pandas:
print(df.isnull().sum())
print(df.dtypes)
SQL + Hive
17. Remove duplicates using ROW_NUMBER:
WITH dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_ts DESC) AS rn
  FROM my_table
)
SELECT * FROM dedup WHERE rn = 1;
18. Second highest salary per department:
SELECT dept, MAX(salary) AS second_salary
FROM emp
WHERE (dept, salary) NOT IN (
SELECT dept, MAX(salary) FROM emp GROUP BY dept
)
GROUP BY dept;
19. Partition by month & year:
CREATE TABLE sales_part (...)
PARTITIONED BY (year INT, month INT);
SET hive.exec.dynamic.partition.mode = nonstrict;  -- required for fully dynamic partition inserts
INSERT INTO TABLE sales_part PARTITION (year, month)
SELECT *, YEAR(ts) AS year, MONTH(ts) AS month FROM sales;
20. Optimize OR conditions in Hive:
Rewrite using UNION ALL
Filter by partition column first
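A sketch of the UNION ALL rewrite, expressed here through spark.sql (table and column names are illustrative):
# Instead of: SELECT * FROM sales WHERE region = 'US' OR category = 'books',
# split the OR into two scans; the second branch excludes the first to avoid duplicate rows.
result = spark.sql("""
    SELECT * FROM sales WHERE region = 'US'
    UNION ALL
    SELECT * FROM sales WHERE category = 'books' AND (region <> 'US' OR region IS NULL)
""")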
21. Pivot/Unpivot Hive:
Use CASE WHEN for pivot:
SELECT id,
MAX(CASE WHEN type='A' THEN val END) as val_a,
MAX(CASE WHEN type='B' THEN val END) as val_b
FROM my_table
GROUP BY id;
☁️ AWS + ETL
22. Boto3 list buckets and size:
import boto3
s3 = boto3.client('s3')
for b in s3.list_buckets()['Buckets']:
    size = 0
    # Sum object sizes page by page; for very large buckets the CloudWatch BucketSizeBytes metric is cheaper.
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=b['Name']):
        size += sum(obj['Size'] for obj in page.get('Contents', []))
    print(b['Name'], size)
23. Fault-tolerant PySpark pipeline:
Use try/except blocks around each stage
Write checkpoints to S3
Use Glue job bookmarks or Delta format
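A minimal sketch of the try/except pattern, assuming a simple batch job with illustrative S3 paths:
import logging

def run_batch(spark, input_path, output_path):
    df = spark.read.parquet(input_path)
    df.dropDuplicates().write.mode("overwrite").parquet(output_path)

try:
    run_batch(spark, "s3://bucket/raw/", "s3://bucket/curated/")
except Exception:
    # Log and re-raise so the orchestrator (Glue retry, Airflow, Step Functions) can rerun the step.
    logging.exception("Batch failed")
    raise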
24. Streaming from Kafka to S3 in PySpark:
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker address
      .option("subscribe", "topic")
      .load())
query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream.format("parquet")
         .option("checkpointLocation", "/chkpt")
         .start("s3://output"))
25. Trigger Lambda via Python:
import boto3
client = boto3.client('lambda')
resp = client.invoke(FunctionName="my-func",
InvocationType='RequestResponse')
print(resp['Payload'].read())
🧠 PySpark Internals + Optimizations
26. PySpark lifecycle:
DAG -> Stages -> Tasks -> Executors -> Results collected
27. What happens on action:
Triggers job execution, builds DAG, schedules tasks
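A small illustration of lazy transformations vs. an action, assuming a DataFrame df with user_id and amount columns:
# Transformations are lazy -- Spark only records the lineage (the logical plan / DAG).
filtered = df.filter(df.amount > 100).select("user_id", "amount")
# The action triggers a job: the DAG is split into stages, tasks run on executors,
# and the result is returned to the driver.
print(filtered.count())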
28. Debug via Spark UI:
Use Jobs tab to check failed stage
Look for skew, spill, GC issues
29. map vs flatMap vs mapPartitions:
map: element-wise
flatMap: flattens iterables
mapPartitions: operates on whole partition, better for bulk
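A quick comparison on a toy RDD:
rdd = spark.sparkContext.parallelize(["a b", "c d e"])
# map: one output element per input element
rdd.map(lambda line: line.split()).collect()        # [['a', 'b'], ['c', 'd', 'e']]
# flatMap: the returned iterables are flattened into one list of elements
rdd.flatMap(lambda line: line.split()).collect()    # ['a', 'b', 'c', 'd', 'e']
# mapPartitions: the function receives an iterator over a whole partition, so per-partition
# setup (e.g. one DB connection) is done once, not once per record
rdd.mapPartitions(lambda it: [sum(len(x.split()) for x in it)]).collect()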
30. Fix stage failures:
Check logs for errors (e.g., memory, skew)
Use retries, increase executor memory, repartition
🚀 Data Modeling + Scheduling
31. Upload to S3 with retry:
import boto3
from botocore.exceptions import ClientError
s3 = boto3.client('s3')
for attempt in range(3):
    try:
        s3.upload_file("file.csv", "bucket", "file.csv")
        break
    except ClientError as e:
        print(f"Attempt {attempt + 1} failed: {e}")
else:
    raise RuntimeError("Upload failed after 3 attempts")
32. Submit Spark job to EMR:
aws emr add-steps --cluster-id j-XXXX --steps Type=Spark,Name=MyJob,...
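The same step can also be submitted from Python with boto3 (cluster ID, script path and arguments below are placeholders):
import boto3
emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXX",
    Steps=[{
        "Name": "MyJob",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # command-runner lets the step call spark-submit
            "Args": ["spark-submit", "--deploy-mode", "cluster", "s3://bucket/jobs/my_job.py"],
        },
    }],
)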
33. IAM for S3 access:
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-bucket/*"
34. Automate pipeline:
Step Functions for flow
Lambda triggers
SNS for failure alerts
35. Modular ETL pipeline:
Use functions/stages for ingest, transform, validate, write
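One way to sketch that structure (column names and paths are illustrative):
def ingest(spark, path):
    return spark.read.parquet(path)

def transform(df):
    return df.dropDuplicates().filter(df.amount > 0)

def validate(df):
    assert df.filter(df.id.isNull()).count() == 0, "null ids found"
    return df

def write(df, path):
    df.write.mode("overwrite").parquet(path)

def main(spark):
    write(validate(transform(ingest(spark, "s3://bucket/raw/"))), "s3://bucket/curated/")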
36. Schema evolution handling:
Use Spark’s mergeSchema or Delta Lake’s schema evolution
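For example (paths are illustrative):
# mergeSchema reconciles Parquet files written with different but compatible schemas.
df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/events/")
# With Delta Lake, new columns can be added on append by enabling schema merging on write.
df.write.format("delta").mode("append").option("mergeSchema", "true").save("s3://bucket/delta/events/")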
37. ETL vs ELT:
ETL: transform before load (good for raw-to-model)
ELT: load raw, transform in DWH (good for flexibility)
38. Hourly CDC ingestion:
Use last_updated timestamp
Incremental filter like WHERE update_ts > last_run_ts
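A minimal incremental-pull sketch, assuming last_run_ts comes from a small control table or job metadata store:
last_run_ts = "2024-01-01 10:00:00"   # illustrative; normally fetched, not hard-coded
incremental = (spark.read.table("source_db.orders")
               .filter(f"update_ts > to_timestamp('{last_run_ts}')"))
incremental.write.mode("append").parquet("s3://bucket/cdc/orders/")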
🔧 DevOps + Agile
39. SCD Type 2 logic in SQL:
Compare hash of current vs incoming record
Close old record, insert new
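A hedged sketch of the hash comparison in PySpark (column names are illustrative; the close-and-insert step is usually a MERGE in the warehouse):
from pyspark.sql.functions import sha2, concat_ws

tracked = ["name", "address", "plan"]
incoming = incoming_df.withColumn("row_hash", sha2(concat_ws("||", *tracked), 256))
current = (dim_df.filter("is_current = true")
           .withColumn("row_hash", sha2(concat_ws("||", *tracked), 256)))
# Same business key, different hash => attribute change: expire the old row, insert the new version.
changed = (incoming.alias("n").join(current.alias("c"), "customer_id")
           .filter("n.row_hash <> c.row_hash"))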
40. Hive bucketing vs partitioning:
Partitioning splits data into directories by column value; bucketing hashes rows into a fixed number of files within each partition
Useful for joins and sampling on high-cardinality keys that make poor partition columns
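A short example of writing a bucketed table from PySpark (table and key names are illustrative):
(df.write
   .bucketBy(16, "customer_id")
   .sortBy("customer_id")
   .mode("overwrite")
   .saveAsTable("curated.orders_bucketed"))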
41. Optimize Hive joins:
Use MAPJOIN hint
Sort-merge joins
Broadcast small tables
42. Schema-on-read:
Hive reads structure at runtime
Use a SerDe for semi-structured formats (e.g., JSON, CSV, Avro)
43. Multi-env pipeline mgmt:
Use environment variables or config files (YAML/JSON)
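A minimal config-driven sketch, assuming per-environment files like config/dev.yaml and config/prod.yaml:
import os
import yaml  # PyYAML

env = os.environ.get("ENV", "dev")
with open(f"config/{env}.yaml") as fh:
    cfg = yaml.safe_load(fh)

input_path = cfg["paths"]["input"]    # e.g. different S3 buckets per environment
output_path = cfg["paths"]["output"]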
44. CI/CD for PySpark:
Use Jenkins/GitHub Actions
Run unit tests with pytest
Package PySpark code with setup.py
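A minimal pytest example that CI could run on every pull request (a local-mode SparkSession keeps it self-contained):
# test_transform.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_dropduplicates(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
    assert df.dropDuplicates().count() == 2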
45. Version control with Git:
Use feature branches
PRs + code review
Tags for releases
Let me know which ones you want to practice live, build a portfolio
project from, or convert into a demo for your interview!