Below are the answers to 45 highly probable Senior Data Engineer
interview questions tailored for the Go Digital Technology Consulting LLP
role:
🔸 PySpark & Spark SQL
1. Read CSV from S3, remove duplicates, write Parquet partitioned:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DedupCSV").getOrCreate()
df = spark.read.option("header", True).csv("s3://bucket/input.csv")
dedup = df.dropDuplicates()
dedup.write.partitionBy("date").parquet("s3://bucket/output/")
"Dedup" short for deduplication, is a data compression technique that
eliminates redundant copies of data, thereby reducing storage space
and bandwidth usage.
2. Handle skewed joins in PySpark: Use salting:
from pyspark.sql.functions import rand, explode, array, lit
big_df = big_df.withColumn("salt", (rand() * 10).cast("int"))
small_df = small_df.withColumn("salt", explode(array([lit(i) for i in range(10)])))
joined = big_df.join(small_df, ["key", "salt"])
pyspark.sql.functions.rand is a function within PySpark's SQL module that
generates a column of random numbers.
In PySpark, explode is a function from pyspark.sql.functions used to
transform a column containing arrays or maps into multiple rows. It
effectively "flattens" nested data structures within a DataFrame.
3. Repartition vs coalesce:
repartition(n): full shuffle; can increase or decrease the number of partitions.
coalesce(n): merges existing partitions to reduce their count without a full shuffle (cannot increase partitions).
df = df.repartition(10) # Use before wide transformations
out = df.coalesce(1) # Use before writing small outputs
4. PySpark UDF to mask PII:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def mask_email(email):
    return email[0] + "***@" + email.split("@")[1] if email else None
mask_udf = udf(mask_email, StringType())
df = df.withColumn("email", mask_udf(df.email))
5. Flatten nested JSON:
from pyspark.sql.functions import col
flat = df.select(col("name"), col("address.city"), col("address.zip"))
6. Cache & persist:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action triggers the actual caching
cache() stores data at the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
persist() lets you choose the storage level explicitly: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.
Both avoid recomputing the lineage when the dataset is reused.
7. Join two datasets and filter CDC records:
cdc_df = new_df.join(old_df, "id", "left_anti")  # keeps rows in new_df whose id is absent from old_df
8. Parse log files:
from pyspark.sql.functions import regexp_extract
pattern = r"(?P<ip>\d+\.\d+\.\d+\.\d+).+?(?P<status>\d{3})"
df = raw_df.select(
    regexp_extract("value", pattern, 1).alias("ip"),
    regexp_extract("value", pattern, 2).alias("status"),
)
9. Window function to get latest record per group:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
w = Window.partitionBy("user_id").orderBy(col("event_time").desc())
latest = df.withColumn("rn", row_number().over(w)).filter("rn = 1")
10. Optimize long PySpark jobs:
Reduce shuffle, use broadcast joins
Persist intermediate data
Monitor Spark UI DAGs
Tune spark.sql.shuffle.partitions, executor memory
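A minimal sketch of two of these optimizations, assuming a large fact_df joined to a small dim_df (names are illustrative):
from pyspark.sql.functions import broadcast
# Lower the shuffle partition count for modest data volumes (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")
# Broadcast the small dimension table so the large side is not shuffled.
joined = fact_df.join(broadcast(dim_df), "customer_id")
# Persist an intermediate result that several downstream branches reuse.
joined.persist()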
🔙 Python + Pandas
11. Parse JSON & extract list:
import json
data = json.loads(json_string)
my_list = data["items"]
12. Regex email validation:
import re
def validate(email):
    return re.match(r"[^@]+@[^@]+\.[^@]+", email) is not None
13. Top 3 frequent elements:
from collections import Counter
Counter(my_list).most_common(3)
14. Read Excel and convert to Parquet:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.read_excel("data.xlsx")
table = pa.Table.from_pandas(df)
pq.write_table(table, "data.parquet", compression='snappy')
15. Rename CSV files:
import os, time
for f in os.listdir("./data"):
    if f.endswith(".csv"):
        # Keep the original name in the new one so files renamed in the same second don't collide.
        new_name = f"data_{int(time.time())}_{f}"
        os.rename(os.path.join("./data", f), os.path.join("./data", new_name))
16. Count nulls & data types in Pandas:
print(df.isnull().sum())
print(df.dtypes)
SQL + Hive
17. Remove duplicates using ROW_NUMBER:
WITH dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_ts DESC) AS rn
  FROM my_table
)
SELECT * FROM dedup WHERE rn = 1;
18. Second highest salary per department:
SELECT dept, MAX(salary) AS second_salary
FROM emp
WHERE (dept, salary) NOT IN (
SELECT dept, MAX(salary) FROM emp GROUP BY dept
)
GROUP BY dept;
19. Partition by month & year:
CREATE TABLE sales_part (...)
PARTITIONED BY (year INT, month INT);
SET hive.exec.dynamic.partition.mode = nonstrict;  -- required for fully dynamic partition inserts
INSERT INTO TABLE sales_part PARTITION (year, month)
SELECT *, YEAR(ts) AS year, MONTH(ts) AS month FROM sales;
20. Optimize OR conditions in Hive:
Rewrite using UNION ALL
Filter by partition column first
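A sketch of the UNION ALL rewrite, expressed here through spark.sql (table and column names are illustrative):
# Instead of: SELECT * FROM sales WHERE region = 'US' OR category = 'books',
# split the OR into two scans; the second branch excludes the first to avoid duplicate rows.
result = spark.sql("""
    SELECT * FROM sales WHERE region = 'US'
    UNION ALL
    SELECT * FROM sales WHERE category = 'books' AND (region <> 'US' OR region IS NULL)
""")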
21. Pivot/Unpivot Hive:
Use CASE WHEN for pivot:
SELECT id,
MAX(CASE WHEN type='A' THEN val END) as val_a,
MAX(CASE WHEN type='B' THEN val END) as val_b
FROM my_table
GROUP BY id;
☁️ AWS + ETL
22. Boto3 list buckets and size:
import boto3
s3 = boto3.client('s3')
for b in s3.list_buckets()['Buckets']:
    size = 0
    # Sum object sizes page by page; for very large buckets the CloudWatch BucketSizeBytes metric is cheaper.
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=b['Name']):
        size += sum(obj['Size'] for obj in page.get('Contents', []))
    print(b['Name'], size)
23. Fault-tolerant PySpark pipeline:
Use try/except blocks around each stage
Write checkpoints to S3
Use Glue job bookmarks or Delta format
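A minimal sketch of the try/except pattern, assuming a simple batch job with illustrative S3 paths:
import logging

def run_batch(spark, input_path, output_path):
    df = spark.read.parquet(input_path)
    df.dropDuplicates().write.mode("overwrite").parquet(output_path)

try:
    run_batch(spark, "s3://bucket/raw/", "s3://bucket/curated/")
except Exception:
    # Log and re-raise so the orchestrator (Glue retry, Airflow, Step Functions) can rerun the step.
    logging.exception("Batch failed")
    raise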
24. Streaming from Kafka to S3 in PySpark:
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker address
      .option("subscribe", "topic")
      .load())
query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream.format("parquet")
         .option("checkpointLocation", "/chkpt")
         .start("s3://output"))
25. Trigger Lambda via Python:
import boto3
client = boto3.client('lambda')
resp = client.invoke(FunctionName="my-func",
InvocationType='RequestResponse')
print(resp['Payload'].read())
🧠 PySpark Internals + Optimizations
26. PySpark lifecycle:
DAG -> Stages -> Tasks -> Executors -> Results collected
27. What happens on action:
Triggers job execution, builds DAG, schedules tasks
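A small illustration of lazy transformations vs. an action, assuming a DataFrame df with user_id and amount columns:
# Transformations are lazy -- Spark only records the lineage (the logical plan / DAG).
filtered = df.filter(df.amount > 100).select("user_id", "amount")
# The action triggers a job: the DAG is split into stages, tasks run on executors,
# and the result is returned to the driver.
print(filtered.count())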
28. Debug via Spark UI:
Use Jobs tab to check failed stage
Look for skew, spill, GC issues
29. map vs flatMap vs mapPartitions:
map: element-wise
flatMap: flattens iterables
mapPartitions: operates on whole partition, better for bulk
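A quick comparison on a toy RDD:
rdd = spark.sparkContext.parallelize(["a b", "c d e"])
# map: one output element per input element
rdd.map(lambda line: line.split()).collect()        # [['a', 'b'], ['c', 'd', 'e']]
# flatMap: the returned iterables are flattened into one list of elements
rdd.flatMap(lambda line: line.split()).collect()    # ['a', 'b', 'c', 'd', 'e']
# mapPartitions: the function receives an iterator over a whole partition, so per-partition
# setup (e.g. one DB connection) is done once, not once per record
rdd.mapPartitions(lambda it: [sum(len(x.split()) for x in it)]).collect()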
30. Fix stage failures:
Check logs for errors (e.g., memory, skew)
Use retries, increase executor memory, repartition
🚀 Data Modeling + Scheduling
31. Upload to S3 with retry:
import boto3
from botocore.exceptions import ClientError
s3 = boto3.client('s3')
for attempt in range(3):
    try:
        s3.upload_file("file.csv", "bucket", "file.csv")
        break
    except ClientError as e:
        print(f"Attempt {attempt + 1} failed: {e}")
else:
    raise RuntimeError("Upload failed after 3 attempts")
32. Submit Spark job to EMR:
aws emr add-steps --cluster-id j-XXXX --steps Type=Spark,Name=MyJob,...
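The same step can also be submitted from Python with boto3 (cluster ID, script path and arguments below are placeholders):
import boto3
emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXX",
    Steps=[{
        "Name": "MyJob",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # command-runner lets the step call spark-submit
            "Args": ["spark-submit", "--deploy-mode", "cluster", "s3://bucket/jobs/my_job.py"],
        },
    }],
)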
33. IAM for S3 access:
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-bucket/*"
34. Automate pipeline:
Step Functions for flow
Lambda triggers
SNS for failure alerts
35. Modular ETL pipeline:
Use functions/stages for ingest, transform, validate, write
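One way to sketch that structure (column names and paths are illustrative):
def ingest(spark, path):
    return spark.read.parquet(path)

def transform(df):
    return df.dropDuplicates().filter(df.amount > 0)

def validate(df):
    assert df.filter(df.id.isNull()).count() == 0, "null ids found"
    return df

def write(df, path):
    df.write.mode("overwrite").parquet(path)

def main(spark):
    write(validate(transform(ingest(spark, "s3://bucket/raw/"))), "s3://bucket/curated/")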
36. Schema evolution handling:
Use Spark’s mergeSchema or Delta Lake’s schema evolution
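For example (paths are illustrative):
# mergeSchema reconciles Parquet files written with different but compatible schemas.
df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/events/")
# With Delta Lake, new columns can be added on append by enabling schema merging on write.
df.write.format("delta").mode("append").option("mergeSchema", "true").save("s3://bucket/delta/events/")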
37. ETL vs ELT:
ETL: transform before load (good for raw-to-model)
ELT: load raw, transform in DWH (good for flexibility)
38. Hourly CDC ingestion:
Use last_updated timestamp
Incremental filter like WHERE update_ts > last_run_ts
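A minimal incremental-pull sketch, assuming last_run_ts comes from a small control table or job metadata store:
last_run_ts = "2024-01-01 10:00:00"   # illustrative; normally fetched, not hard-coded
incremental = (spark.read.table("source_db.orders")
               .filter(f"update_ts > to_timestamp('{last_run_ts}')"))
incremental.write.mode("append").parquet("s3://bucket/cdc/orders/")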
🔧 DevOps + Agile
39. SCD Type 2 logic in SQL:
Compare hash of current vs incoming record
Close old record, insert new
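A hedged sketch of the hash comparison in PySpark (column names are illustrative; the close-and-insert step is usually a MERGE in the warehouse):
from pyspark.sql.functions import sha2, concat_ws

tracked = ["name", "address", "plan"]
incoming = incoming_df.withColumn("row_hash", sha2(concat_ws("||", *tracked), 256))
current = (dim_df.filter("is_current = true")
           .withColumn("row_hash", sha2(concat_ws("||", *tracked), 256)))
# Same business key, different hash => attribute change: expire the old row, insert the new version.
changed = (incoming.alias("n").join(current.alias("c"), "customer_id")
           .filter("n.row_hash <> c.row_hash"))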
40. Hive bucketing vs partitioning:
Partitioning splits data into directories by column value; bucketing hashes rows into a fixed number of files within each partition
Useful for joins and sampling on high-cardinality keys that make poor partition columns
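A short example of writing a bucketed table from PySpark (table and key names are illustrative):
(df.write
   .bucketBy(16, "customer_id")
   .sortBy("customer_id")
   .mode("overwrite")
   .saveAsTable("curated.orders_bucketed"))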
41. Optimize Hive joins:
Use MAPJOIN hint
Sort-merge joins
Broadcast small tables
42. Schema-on-read:
Hive reads structure at runtime
Use a SerDe for semi-structured formats (e.g., JSON, CSV, Avro)
43. Multi-env pipeline mgmt:
Use environment variables or config files (YAML/JSON)
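A minimal config-driven sketch, assuming per-environment files like config/dev.yaml and config/prod.yaml:
import os
import yaml  # PyYAML

env = os.environ.get("ENV", "dev")
with open(f"config/{env}.yaml") as fh:
    cfg = yaml.safe_load(fh)

input_path = cfg["paths"]["input"]    # e.g. different S3 buckets per environment
output_path = cfg["paths"]["output"]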
44. CI/CD for PySpark:
Use Jenkins/GitHub Actions
Run unit tests with pytest
Package PySpark code with setup.py
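A minimal pytest example that CI could run on every pull request (a local-mode SparkSession keeps it self-contained):
# test_transform.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_dropduplicates(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
    assert df.dropDuplicates().count() == 2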
45. Version control with Git:
Use feature branches
PRs + code review
Tags for releases
Let me know which ones you want to practice live, build a portfolio
project from, or convert into a demo for your interview!