Scenario-Based Interview Questions
www.prominentacademy.in
+91 98604 38743
Question : How can you implement automated testing for
PySpark pipelines across CI/CD stages?
Answer :
Use pytest + chispa (for DataFrame comparison)
Run tests in GitHub Actions or Azure DevOps
Example test:
python
from chispa import assert_df_equality
def test_cleaning_logic(spark):
    # cleaning_func is the pipeline transformation under test
    input_df = spark.createDataFrame([...])
    expected_df = spark.createDataFrame([...])
    result_df = cleaning_func(input_df)
    assert_df_equality(result_df, expected_df)
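The spark fixture used by the test isn't shown; a minimal conftest.py sketch, assuming a local SparkSession is acceptable for CI runs:
python
# conftest.py: hypothetical fixture supplying the local SparkSession used by the test above
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[2]")
               .appName("pipeline-tests")
               .getOrCreate())
    yield session
    session.stop()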
⚠️ Common Struggles:
❌ No mock data
❌ Only testing schema, not content
Question : How do you perform a real-time join between
a large fact stream and a small dimension table while
minimizing skew?
Answer :
Use a broadcast join for small, static dimensions (see the sketch after the salting example below)
Use a salting technique if both sides are large
python
from pyspark.sql.functions import expr, explode, lit, sequence

# Replicate the small dimension across every salt value (0-9) so all fact rows find a match
dim_df = (spark.read.parquet("/mnt/dim")
          .withColumn("salt", explode(sequence(lit(0), lit(9)))))
# Randomly salt the large fact stream to spread skewed keys across partitions
fact_df = spark.readStream... \
    .withColumn("salt", expr("CAST(rand() * 10 AS INT)"))
joined = fact_df.join(dim_df, ["join_key", "salt"])
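For the broadcast case, a minimal sketch (assuming the dimension fits comfortably in executor memory; join_key mirrors the example above):
python
from pyspark.sql.functions import broadcast

# Small static dimension: replicate it to every executor and avoid the shuffle entirely
dim_df = spark.read.parquet("/mnt/dim")
joined = fact_df.join(broadcast(dim_df), "join_key")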
⚠️ Common Struggles:
❌ Joining large tables directly without optimization
❌ Not monitoring shuffle partitions and skew
Question : How do you design a snapshot table to track
incremental daily changes without full reload?
Answer :
Use MERGE INTO + effective_date columns:
Maintain history using versioned Delta + Z-Ordering (see the sketch after the MERGE example)
sql
MERGE INTO snapshot_table tgt
USING daily_input src
ON tgt.id = src.id AND tgt.effective_date = current_date()
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
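For the Z-Ordering and versioned-history part, a hedged sketch using Delta SQL through PySpark (the choice of id as the Z-Order column is an assumption):
python
# Compact files and co-locate data by id so incremental lookups and merges stay fast
spark.sql("OPTIMIZE snapshot_table ZORDER BY (id)")
# Inspect the table's Delta versions, which enable time travel over historical snapshots
spark.sql("DESCRIBE HISTORY snapshot_table")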
⚠️ Common Struggles:
❌ Overwriting historical data
❌ No time partition → slow queries
Question : How do you build a serverless ingestion
pattern triggered on file arrival in Azure?
Answer :
Configure Event Grid on ADLS Gen2 → Webhook to
Azure Function
Azure Function calls Databricks job using REST API
python
import requests
# token is a Databricks PAT or Azure AD token (placeholder)
requests.post("https://<workspace>/api/2.1/jobs/run-now",
              headers={"Authorization": f"Bearer {token}"}, json={"job_id": job_id})
⚠️ Common Struggles:
❌ Polling ADLS instead of using events
❌ No retry mechanism on Event Grid webhook failure
Question : How do you design a tokenization strategy for
PCI/PII data in Delta Lake while keeping raw access
restricted?
Answer :
Use hashing or encryption on sensitive fields before
writing:
Store the mapping table (ID ↔ token) in a secure, access-restricted zone
python
from cryptography.fernet import Fernet
from pyspark.sql.functions import sha2

cipher = Fernet(key)
# One-way token via hashing; reversible encryption would need a UDF wrapping cipher.encrypt(...)
df = df.withColumn("ssn_token", sha2("ssn", 256))
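If reversible tokens are required, a hedged sketch of an encryption UDF (the helper fernet_encrypt_udf is hypothetical; assume key is pulled from a secret scope, never hard-coded):
python
from cryptography.fernet import Fernet
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def fernet_encrypt_udf(key: bytes):
    # Build the cipher inside the UDF so only the key bytes are shipped to executors
    return udf(
        lambda v: Fernet(key).encrypt(v.encode()).decode() if v is not None else None,
        StringType(),
    )

df = df.withColumn("ssn_token", fernet_encrypt_udf(key)(col("ssn"))).drop("ssn")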
⚠️ Common Struggles:
❌ Storing raw + token in the same Delta table
❌ Using reversible masking without encryption
Question : How do you implement upsert logic in a
streaming write with dynamic partition overwrite?
Answer :
Use foreachBatch with MERGE INTO inside:
python
def upsert_to_delta(batch_df, batch_id):
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql("""
        MERGE INTO silver_table tgt
        USING updates src
        ON tgt.id = src.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

stream_df.writeStream.foreachBatch(upsert_to_delta).start()
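A micro-batch can carry several updates for the same id; a hedged sketch of deduplicating inside foreachBatch before the MERGE (the updated_at ordering column is an assumption):
python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

def dedup_latest(batch_df):
    # Keep only the most recent row per id so the MERGE sees a single match per key
    w = Window.partitionBy("id").orderBy(col("updated_at").desc())
    return (batch_df
            .withColumn("rn", row_number().over(w))
            .filter(col("rn") == 1)
            .drop("rn"))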
⚠️ Common Struggles:
❌ Using append mode with duplicates
❌ Overwriting full partitions — causes high I/O
Question : How do you ingest data from Kafka, Azure SQL,
and Blob Storage into a unified Delta Lake model?
Answer :
Kafka → Structured Streaming
Azure SQL → incremental copy using ADF or JDBC
Blob → AutoLoader for schema inference
Then merge:
python
df_kafka = spark.readStream...
df_sql = spark.read.format("jdbc")...
df_blob = spark.readStream.format("cloudFiles")...
# A static JDBC read cannot be unioned with streaming DataFrames; union the streams and land the SQL extract in the same Delta model via a separate batch write
merged_df = df_kafka.unionByName(df_blob)
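For the Blob path, a minimal Auto Loader sketch (file format, schema location and paths are placeholders):
python
# Incrementally pick up new files from Blob/ADLS with schema inference and evolution
df_blob = (spark.readStream.format("cloudFiles")
           .option("cloudFiles.format", "json")
           .option("cloudFiles.schemaLocation", "/mnt/schemas/blob_source")
           .load("/mnt/raw/blob_source/"))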
⚠️ Common Struggles:
❌ Schema mismatch across sources
❌ Inconsistent ingestion frequency or deduplication
#AzureSynapse #DataEngineering
#InterviewPreparation #JobReady
#MockInterviews #Deloitte #CareerSuccess
#ProminentAcademy
❌ Think your skills are enough? Think again: these scenario-based questions could cost you your data engineering job.
In recent interviews at big MNCs, our students faced scenario-based data engineering questions, and many candidates struggled to answer them correctly. These questions are designed to test your real-world knowledge and your ability to solve complex data engineering problems.
Unfortunately, many students failed to answer these questions confidently. The truth is, preparation is key, and that's where Prominent Academy comes in!
We specialize in preparing you for Spark and data engineering interviews by:
✅ Offering scenario-based mock interviews
✅ Providing hands-on training with data engineering features
✅ Optimizing your resume & LinkedIn profile
✅ Giving personalized interview coaching to ensure you're job-ready
Don’t leave your future to chance!
📞 Call us at +91 98604 38743 and get the interview prep you need to succeed.