Senior Data Engineer Qs

The document provides answers to 45 common Senior Data Engineer interview questions tailored for Go Digital Technology Consulting LLP, covering topics such as PySpark, Python, SQL, AWS, and DevOps. Each question includes code snippets and explanations for concepts like data deduplication, handling skewed joins, and optimizing ETL processes. It serves as a comprehensive guide for candidates preparing for interviews in data engineering roles.


Below are the answers to 45 highly probable Senior Data Engineer interview questions tailored for the Go Digital Technology Consulting LLP role:

🔸 PySpark & Spark SQL


1. Read CSV from S3, remove duplicates, write Parquet partitioned:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DedupCSV").getOrCreate()
df = spark.read.option("header", True).csv("s3://bucket/input.csv")
dedup = df.dropDuplicates()
dedup.write.partitionBy("date").parquet("s3://bucket/output/")

"Dedup" short for deduplication, is a data compression technique that


eliminates redundant copies of data, thereby reducing storage space
and bandwidth usage.

2. Handle skewed joins in PySpark: Use salting:

from pyspark.sql.functions import rand, explode, array, lit

big_df = big_df.withColumn("salt", (rand() * 10).cast("int"))
small_df = small_df.withColumn("salt", explode(array([lit(i) for i in range(10)])))
joined = big_df.join(small_df, ["key", "salt"])

pyspark.sql.functions.rand generates a column of random numbers, and explode transforms a column of arrays or maps into one row per element, effectively "flattening" the nested structure. Together they spread each skewed key across 10 salt buckets so the join work is distributed evenly across executors.

3. Repartition vs coalesce:
 repartition(n): performs a full shuffle; can increase or decrease the number of partitions.
 coalesce(n): merges existing partitions without a full shuffle; can only decrease the count.
df = df.repartition(10) # Use before wide transformations
out = df.coalesce(1) # Use before writing small outputs
4. PySpark UDF to mask PII:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mask_email(email):
    return email[0] + "***@" + email.split("@")[1] if email else None

mask_udf = udf(mask_email, StringType())
df = df.withColumn("email", mask_udf(df.email))

5. Flatten nested JSON:


from pyspark.sql.functions import col
flat = df.select(col("name"), col("address.city"), col("address.zip"))

6. Cache & persist:


df.persist(StorageLevel.MEMORY_AND_DISK)
df.count() # triggers caching

Cache is used to store data only in the memory


Persist– and this used to Store data on
MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY

Both of this is used for memory optimization

7. Join two datasets and filter CDC records:

cdc_df = new_df.join(old_df, "id", "left_anti")

A left_anti join keeps only the rows of new_df whose id does not exist in old_df, i.e. the records that are new since the last load.

8. Parse log files:

from pyspark.sql.functions import regexp_extract

pattern = r"(?P<ip>\d+\.\d+\.\d+\.\d+).+?(?P<status>\d{3})"
df = raw_df.select(
    regexp_extract("value", pattern, 1).alias("ip"),
    regexp_extract("value", pattern, 2).alias("status"))

9. Window function to get latest record per group:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

w = Window.partitionBy("user_id").orderBy(col("event_time").desc())
latest = df.withColumn("rn", row_number().over(w)).filter("rn = 1")

10. Optimize long PySpark jobs:


 Reduce shuffle, use broadcast joins
 Persist intermediate data
 Monitor Spark UI DAGs
 Tune spark.sql.shuffle.partitions, executor memory
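
A minimal sketch of two of these levers, assuming a large fact_df, a small dim_df, and an existing SparkSession spark (all names illustrative):

from pyspark.sql.functions import broadcast

# Tune shuffle parallelism for the data volume (default is 200 partitions)
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Broadcast the small dimension table so the large fact table is not shuffled
joined = fact_df.join(broadcast(dim_df), "customer_id")

# Persist an intermediate result that is reused by several downstream actions
joined.persist()
joined.count()  # materializes the persisted data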

🔙 Python + Pandas
11. Parse JSON & extract list:
import json
data = json.loads(json_string)
my_list = data["items"]

12. Regex email validation:

import re

def validate(email):
    return re.match(r"[^@]+@[^@]+\.[^@]+", email)

13. Top 3 frequent elements:


from collections import Counter
Counter(my_list).most_common(3)

14. Read Excel and convert to Parquet:


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_excel("data.xlsx")
table = pa.Table.from_pandas(df)
pq.write_table(table, "data.parquet", compression='snappy')

15. Rename CSV files:

import os, time

for i, f in enumerate(os.listdir("./data")):
    if f.endswith(".csv"):
        new_name = f"data_{int(time.time())}_{i}.csv"  # index avoids same-second name clashes
        os.rename(os.path.join("./data", f), os.path.join("./data", new_name))

16. Count nulls & data types in Pandas:


print(df.isnull().sum())
print(df.dtypes)

SQL + Hive
17. Remove duplicates using ROW_NUMBER:
WITH dedup AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id ORDER BY update_ts DESC)
AS rn
FROM my_table
)
SELECT * FROM dedup WHERE rn = 1;

18. Second highest salary per department:


SELECT dept, MAX(salary) AS second_salary
FROM emp
WHERE (dept, salary) NOT IN (
SELECT dept, MAX(salary) FROM emp GROUP BY dept
)
GROUP BY dept;

19. Partition by month & year:

CREATE TABLE sales_part (...)
PARTITIONED BY (year INT, month INT);

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales_part PARTITION (year, month)
SELECT *, YEAR(ts), MONTH(ts) FROM sales;

20. Optimize OR conditions in Hive:


 Rewrite using UNION ALL
 Filter by partition column first
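
For instance, a hedged Spark SQL sketch of the UNION ALL rewrite (table and column names are illustrative):

# Instead of: SELECT * FROM sales WHERE region = 'EU' OR channel = 'web'
# split the OR into two simpler scans; the second branch excludes rows already
# returned by the first so the union stays duplicate-free (NULL handling omitted)
result = spark.sql("""
    SELECT * FROM sales WHERE region = 'EU'
    UNION ALL
    SELECT * FROM sales WHERE channel = 'web' AND region <> 'EU'
""")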
21. Pivot/Unpivot Hive:
 Use CASE WHEN for pivot:
SELECT id,
MAX(CASE WHEN type='A' THEN val END) as val_a,
MAX(CASE WHEN type='B' THEN val END) as val_b
FROM my_table
GROUP BY id;

☁️AWS + ETL
22. Boto3 list buckets and size:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for b in s3.list_buckets()['Buckets']:
    size = sum(obj['Size']
               for page in paginator.paginate(Bucket=b['Name'])
               for obj in page.get('Contents', []))
    print(b['Name'], size)
23. Fault-tolerant PySpark pipeline:
 Use try/except around risky steps and fail with clear errors
 Write checkpoints to S3
 Use Glue job bookmarks or Delta format
24. Streaming from Kafka to S3 in PySpark:
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker list
      .option("subscribe", "topic")
      .load())
query = (df.writeStream.format("parquet")
         .option("checkpointLocation", "/chkpt")
         .start("s3://output"))

25. Trigger Lambda via Python:


import boto3
client = boto3.client('lambda')
resp = client.invoke(FunctionName="my-func",
InvocationType='RequestResponse')
print(resp['Payload'].read())

🧠 PySpark Internals + Optimizations


26. PySpark lifecycle:
 DAG -> Stages -> Tasks -> Executors -> Results collected
27. What happens on action:
 Triggers job execution, builds DAG, schedules tasks
28. Debug via Spark UI:
 Use Jobs tab to check failed stage
 Look for skew, spill, GC issues
29. map vs flatMap vs mapPartitions:
 map: element-wise
 flatMap: flattens iterables
 mapPartitions: operates on whole partition, better for bulk
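
A minimal RDD sketch contrasting the three (assumes an existing SparkSession spark):

rdd = spark.sparkContext.parallelize(["a b", "c d e"])

# map: exactly one output element per input element
lengths = rdd.map(lambda line: len(line.split()))        # [2, 3]

# flatMap: each input can yield many output elements, flattened into one RDD
words = rdd.flatMap(lambda line: line.split())           # ['a','b','c','d','e']

# mapPartitions: the function receives an iterator over a whole partition,
# handy for per-partition setup such as opening a connection once
def word_count_per_partition(partition):
    yield sum(len(line.split()) for line in partition)

counts = rdd.mapPartitions(word_count_per_partition)
print(words.collect(), counts.collect())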

30. Fix stage failures:


 Check logs for errors (e.g., memory, skew)
 Use retries, increase executor memory, repartition
🚀 Data Modeling + Scheduling
31. Upload to S3 with retry:
import boto3, botocore

s3 = boto3.client('s3')
for attempt in range(3):
    try:
        s3.upload_file("file.csv", "bucket", "file.csv")
        break
    except botocore.exceptions.ClientError as e:
        print(f"Attempt {attempt + 1} failed: {e}")

32. Submit Spark job to EMR:

aws emr add-steps --cluster-id j-XXXX --steps Type=Spark,Name=MyJob,...

33. IAM for S3 access:


"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-bucket/*"

34. Automate pipeline:


 Step Functions for flow
 Lambda triggers
 SNS for failure alerts
35. Modular ETL pipeline:
 Use functions/stages for ingest, transform, validate, write
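
A hedged skeleton of that structure (function names, paths, and columns are illustrative):

def ingest(spark, path):
    return spark.read.option("header", True).csv(path)

def transform(df):
    return df.dropDuplicates()

def validate(df):
    # fail fast if a mandatory column contains nulls
    assert df.filter("id IS NULL").count() == 0, "null ids found"
    return df

def write(df, path):
    df.write.mode("overwrite").parquet(path)

def run_pipeline(spark):
    write(validate(transform(ingest(spark, "s3://bucket/raw/"))), "s3://bucket/curated/")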
36. Schema evolution handling:
 Use Spark’s mergeSchema or Delta Lake’s schema evolution
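
For example, a one-line hedged sketch for Parquet data whose schema has grown over time (path is illustrative):

# mergeSchema reconciles columns added across files/partitions at read time
df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/events/")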
37. ETL vs ELT:
 ETL: transform before load (good for raw-to-model)
 ELT: load raw, transform in DWH (good for flexibility)
38. Hourly CDC ingestion:
 Use last_updated timestamp
 Incremental filter like WHERE update_ts > last_run_ts
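
A minimal sketch of the hourly incremental pull, assuming the last successful watermark is read from a control table or job parameter (here a hard-coded last_run_ts placeholder):

last_run_ts = "2024-01-01 10:00:00"  # would normally come from a control table

incremental = (spark.read.table("source_db.orders")
               .filter(f"update_ts > '{last_run_ts}'"))

incremental.write.mode("append").parquet("s3://bucket/cdc/orders/")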

🔧 DevOps + Agile
39. SCD Type 2 logic in SQL:
 Compare hash of current vs incoming record
 Close old record, insert new
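
A hedged Spark SQL sketch of the close-and-insert pattern, assuming a table format that supports MERGE (e.g. Delta Lake) and hypothetical tables dim_customer(id, name, attrs_hash, effective_from, effective_to, is_current) and staging_customer(id, name, attrs_hash):

# Step 1: close the current version of any record whose hash has changed
spark.sql("""
    MERGE INTO dim_customer t
    USING staging_customer s
    ON t.id = s.id AND t.is_current = true
    WHEN MATCHED AND t.attrs_hash <> s.attrs_hash THEN
      UPDATE SET t.is_current = false, t.effective_to = current_date()
""")

# Step 2: insert a new current version for changed and brand-new ids
spark.sql("""
    INSERT INTO dim_customer
    SELECT s.id, s.name, s.attrs_hash, current_date(), CAST(NULL AS DATE), true
    FROM staging_customer s
    LEFT JOIN dim_customer t
      ON t.id = s.id AND t.is_current = true
    WHERE t.id IS NULL OR t.attrs_hash <> s.attrs_hash
""")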
40. Hive bucketing vs partitioning:
 Partitioning splits data into directories by column value (coarse pruning)
 Bucketing hashes rows into a fixed number of files within each partition
 Useful for joins and sampling on a non-partition key
41. Optimize Hive joins:
 Use MAPJOIN hint
 Sort-merge joins
 Broadcast small tables
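
For instance, a hedged hint-based sketch (Spark SQL accepts MAPJOIN as an alias for the broadcast hint; table names are illustrative):

joined = spark.sql("""
    SELECT /*+ MAPJOIN(d) */ f.order_id, d.customer_name
    FROM fact_orders f
    JOIN dim_customer d ON f.customer_id = d.id
""")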
42. Schema-on-read:
 Hive applies the table structure when the data is read, not when it is loaded
 Use a SerDe for flexible formats (e.g., JSON, CSV, Avro)
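
A small hedged illustration of the same idea in Spark: the schema is only projected onto the raw files at query time (path is illustrative):

events = spark.read.json("s3://bucket/raw/events/")  # schema inferred at read time
events.printSchema()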
43. Multi-env pipeline mgmt:
 Use environment variables or config files (YAML/JSON)
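
A small hedged sketch of config-driven environments (file layout and keys are illustrative; requires PyYAML):

import os
import yaml

env = os.environ.get("ENV", "dev")   # dev / qa / prod
with open(f"config/{env}.yaml") as f:
    cfg = yaml.safe_load(f)

input_path = cfg["input_path"]       # e.g. an environment-specific S3 prefix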
44. CI/CD for PySpark:
 Use Jenkins/GitHub Actions
 Run unit tests with pytest
 Package PySpark code with setup.py
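
For instance, a hedged pytest sketch for a pure transformation function (a hypothetical dedup_orders), runnable in CI against a local SparkSession:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # local[1] keeps the test suite lightweight on CI runners
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def dedup_orders(df):
    return df.dropDuplicates(["order_id"])

def test_dedup_orders(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["order_id", "item"])
    assert dedup_orders(df).count() == 2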
45. Version control with Git:
 Use feature branches
 PRs + code review
 Tags for releases

Let me know which ones you want to practice live, build a portfolio
project from, or convert into a demo for your interview!
