Databricks

The document provides a series of scenario-based interview questions and answers related to data engineering, specifically focusing on PySpark pipelines, real-time joins, snapshot tables, serverless ingestion patterns, tokenization strategies, upsert logic, and data ingestion from various sources. It highlights common struggles faced by candidates and emphasizes the importance of preparation for interviews in the field. Prominent Academy offers services such as mock interviews, hands-on training, and personalized coaching to help candidates succeed in their job search.


Scenario-Based Interview Questions
www.prominentacademy.in
+91 98604 38743
Question : How can you implement automated testing for
PySpark pipelines across CI/CD stages?

Answer :
Use pytest + chispa (for DataFrame comparison)
Run the tests in GitHub Actions or Azure DevOps as part of the CI/CD pipeline
Example test:

python

from chispa import assert_df_equality

def test_cleaning_logic(spark):
    # Arrange: small input and expected DataFrames for the transformation under test
    input_df = spark.createDataFrame([...])
    expected_df = spark.createDataFrame([...])
    # Act + assert: chispa compares both schema and row content
    result_df = cleaning_func(input_df)
    assert_df_equality(result_df, expected_df)
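
The test above receives `spark` as a pytest fixture; a minimal conftest.py sketch that provides a local session could look like this:

python

# conftest.py (sketch): one local SparkSession shared across the test session
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return (SparkSession.builder
            .master("local[2]")
            .appName("pipeline-tests")
            .getOrCreate())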

⚠️ Common Struggles:
❌ No mock data
❌ Only testing schema, not content

Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
Question : How do you perform a real-time join between
a large fact stream and a small dimension table while
minimizing skew?

Answer :
Use a broadcast join for small static dimensions
Use the salting technique if both tables are large (replicate the dimension across the salt values, as below)

python

from pyspark.sql.functions import expr, explode, sequence, lit

# Replicate each dimension row for salts 0-9 so every salted fact row finds a match
dim_df = (spark.read.parquet("/mnt/dim")
          .withColumn("salt", explode(sequence(lit(0), lit(9)))))
# Assign each fact row a random salt so hot keys spread across partitions
fact_df = spark.readStream... \
    .withColumn("salt", expr("CAST(rand() * 10 AS INT)"))
joined = fact_df.join(dim_df, ["join_key", "salt"])
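
For the small static dimension case mentioned above, a broadcast join avoids the shuffle entirely; a minimal sketch:

python

from pyspark.sql.functions import broadcast

# Ship the small dimension to every executor; no shuffle, no skew on the join key
joined = fact_df.join(broadcast(dim_df), "join_key")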

⚠️ Common Struggles:
❌ Joining large tables directly without optimization
❌ Not monitoring shuffle partitions and skew

Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
Question : How do you design a snapshot table to track
incremental daily changes without full reload?

Answer :
Use MERGE INTO + effective_date columns:
Maintain history using versioned Delta + Z-Ordering

sql

MERGE INTO snapshot_table tgt
USING daily_input src
ON tgt.id = src.id AND tgt.effective_date = current_date()
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
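
The note on versioned Delta + Z-Ordering can be applied as a periodic maintenance step; a minimal sketch on Databricks, reusing the table name from the example above:

python

# Periodic maintenance: co-locate rows by id so lookups and merges prune files efficiently
spark.sql("OPTIMIZE snapshot_table ZORDER BY (id)")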

⚠️ Common Struggles:
❌ Overwriting historical data
❌ No time partition → slow queries

Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
Question : How do you build a serverless ingestion
pattern triggered on file arrival in Azure?

Answer :
Configure an Event Grid subscription on ADLS Gen2 that calls a webhook on an Azure Function
The Azure Function triggers the Databricks job via the REST API

python

requests.post("https://<workspace>/api/2.1/jobs/run-now", json=
{"job_id": job_id})

⚠️ Common Struggles:
❌ Polling ADLS instead of using events
❌ No retry mechanism on Event Grid webhook failure

Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
Question : How do you design a tokenization strategy for
PCI/PII data in Delta Lake while keeping raw access
restricted?

Answer :
Use hashing or encryption on sensitive fields before writing
Store the mapping table (ID ↔ token) in a secure, access-restricted zone

python

from pyspark.sql.functions import sha2, col
from cryptography.fernet import Fernet

cipher = Fernet(key)  # key loaded from a secret scope, never hard-coded

# One-way hash of the sensitive field; reversible encryption needs a UDF around cipher.encrypt
df = df.withColumn("ssn_token", sha2(col("ssn"), 256))
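
If reversible tokens are required, the cipher call has to run inside a UDF; a hedged sketch (assumes `key` comes from a secret scope, and note that Fernet output is non-deterministic, so the ID ↔ token mapping table stays the lookup path):

python

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
from cryptography.fernet import Fernet

@udf(returnType=StringType())
def encrypt_value(value):
    # key is a small bytes value captured by the closure; nulls pass through untouched
    return Fernet(key).encrypt(value.encode()).decode() if value is not None else None

df = df.withColumn("ssn_token", encrypt_value(col("ssn")))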

⚠️ Common Struggles:
❌ Storing raw + token in the same Delta table
❌ Using reversible masking without encryption

Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
Question : How do you implement upsert logic in a
streaming write with dynamic partition overwrite?

Answer :
Use foreachBatch with MERGE INTO inside:

python

def upsert_to_delta(batch_df, batch_id):
    # Expose the micro-batch as a temp view, then MERGE it into the Delta target
    batch_df.createOrReplaceTempView("updates")
    # Use the micro-batch's own session so the temp view is visible to the MERGE
    batch_df.sparkSession.sql("""
        MERGE INTO silver_table tgt
        USING updates src
        ON tgt.id = src.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

stream_df.writeStream.foreachBatch(upsert_to_delta).start()
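
When wiring the stream in, a checkpoint location makes the upsert recoverable across restarts; a minimal sketch (the path is illustrative):

python

(stream_df.writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", "/mnt/checkpoints/silver_table")  # illustrative path
    .start())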

⚠️ Common Struggles:
❌ Using append mode with duplicates
❌ Overwriting full partitions — causes high I/O

Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
Question : How do you ingest data from Kafka, Azure SQL,
and Blob Storage into a unified Delta Lake model?

Answer :
Kafka → Structured Streaming
Azure SQL → incremental copy using ADF or JDBC
Blob → Auto Loader for schema inference
Then merge the sources:

python

df_kafka = spark.readStream...                        # Structured Streaming from Kafka
df_sql = spark.read.format("jdbc")...                 # incremental batch pull from Azure SQL
df_blob = spark.readStream.format("cloudFiles")...    # Auto Loader from Blob Storage

# unionByName assumes aligned schemas; streaming and batch frames cannot be unioned
# directly, so in practice each source is landed in a bronze Delta table first
merged_df = df_kafka.unionByName(df_sql).unionByName(df_blob)
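
The Auto Loader read mentioned above typically needs a schema location for inference and evolution; a hedged sketch (the file format and paths are assumptions):

python

df_blob = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                               # assumed source format
    .option("cloudFiles.schemaLocation", "/mnt/schemas/blob_source")   # illustrative path
    .load("/mnt/landing/blob_source"))                                 # illustrative path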

⚠️ Common Struggles:
❌ Schema mismatch across sources
❌ Inconsistent ingestion frequency or deduplication

Your next opportunity is closer than you think. Let’s get you there!
📞 Don’t wait—call us at +91 98604 38743 today
#AzureSynapse #DataEngineering
#InterviewPreparation #JobReady
#MockInterviews #Deloitte #CareerSuccess
#ProminentAcademy

❌ Think your skills are enough?

Think again: these data engineer scenario-based questions could cost you your data engineering job.
In recent interviews at big MNCs, our students faced scenario-based data engineering questions, and many candidates struggled to answer them correctly. These questions are designed to test your real-world knowledge and ability to solve complex data engineering problems.

Unfortunately, many students failed to answer these questions confidently. The truth is, preparation is key, and that’s where Prominent Academy comes in!

We specialize in preparing you for Spark and data engineering interviews by:

Offering scenario-based mock interviews
Providing hands-on training with data engineering features
Optimizing your resume & LinkedIn profile
Giving personalized interview coaching to ensure you’re job-ready

Don’t leave your future to chance!

📞 Call us at +91 98604 38743 and get the interview prep you need to succeed
