Question 1:
A data engineer is preparing a Databricks asset bundle to define a job that executes a notebook located at ./src/my_notebook.py.
While writing the YAML configuration, the engineer drafts the following:
. bundle:
.   name: my_bundle
. ____________:
.   jobs:
.     my_job:
.       name: my_notebook_job
.       tasks:
.         - task_key: test_task
.           existing_cluster_id: 1234-911320-xyzwpm999
.           notebook_task:
.             notebook_path: './src/my_notebook.py'
To ensure the bundle is valid and deployable, which key should correctly fill in the blank above?
Correct answer
resources
Overall explanation
The resources section is where users define deployable objects, such as jobs, pipelines, notebooks, clusters, and more. These
resources form the core of what is deployed when a bundle is applied to a target environment.
Options B, C, and D are incorrect:
settings is not a valid top-level section in a Databricks Asset Bundle.
pipelines is used to define Lakeflow Declarative Pipelines (Delta Live Tables, or DLT), but it belongs under resources rather than at the top level of databricks.yml.
workflows is not a valid key in the databricks.yml structure.
Question 2:
Fill in the blank below to successfully create a table in Databricks using data from an existing PostgreSQL database:
. CREATE TABLE employees
. USING ____________
. OPTIONS (
. url "jdbc:postgresql:dbserver",
. dbtable "employees"
. )
Correct answer
org.apache.spark.sql.jdbc
Overall explanation
Using the JDBC library, Spark SQL can extract data from any existing relational database that supports JDBC. Examples include MySQL, PostgreSQL, SQLite, and more.
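For context, the same PostgreSQL source can also be read through the DataFrame API in PySpark; a minimal sketch, with the hostname, database name, and credentials used here as placeholder values:
. # Hypothetical connection details; replace with the real host, database, and credentials
. employees_df = (spark.read
.     .format("jdbc")
.     .option("url", "jdbc:postgresql://dbserver:5432/company")
.     .option("dbtable", "employees")
.     .option("user", "reader")
.     .option("password", "secret")
.     .load()
. )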
Question 3:
For production jobs, which of the following cluster types is recommended to use?
Your answer is correct
Job clusters
Overall explanation
Job clusters are dedicated clusters for a job or task run. A job cluster auto-terminates once the job is completed, which saves cost compared to all-purpose clusters.
In addition, Databricks recommends using job clusters in production so that each job runs in a fully isolated environment.
Question 4:
The data engineering team has a DLT pipeline that updates all the tables once and then stops. The compute resources of the pipeline continue running to allow for quick testing.
Which of the following best describes the execution modes of this DLT pipeline?
Correct answer
The DLT pipeline executes in Triggered Pipeline mode under Development mode.
Overall explanation
Triggered pipelines update each table with whatever data is currently available and then they shut down.
In Development mode, the Delta Live Tables system eases the development process by:
Reusing a cluster to avoid the overhead of restarts. The cluster runs for two hours when development mode is enabled.
Disabling pipeline retries so you can immediately detect and fix errors.
Question 5:
A data engineer has the following query in a Delta Live Tables pipeline:
. CREATE STREAMING TABLE sales_silver
. AS
. SELECT store_id, total + tax AS total_after_tax
. FROM sales_bronze
The pipeline is failing to start due to an error in this query.
Which of the following changes should be made to this query to successfully start the DLT pipeline?
Correct answer
. CREATE STREAMING TABLE sales_silver
. AS
. SELECT store_id, total + tax AS total_after_tax
. FROM STREAM(sales_bronze)
Overall explanation
In DLT pipelines, you can stream data from other tables in the same pipeline by using the STREAM() function**. In this case, you must define a streaming table using the CREATE STREAMING TABLE syntax.
. CREATE STREAMING TABLE table_name
. AS SELECT *
. FROM STREAM(source_table)
** A recent syntax update now also supports the use of STREAM as a keyword.
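The same pattern can also be expressed in Python with the dlt module; a minimal sketch, assuming sales_bronze is defined elsewhere in the same pipeline and using the spark.readStream.table() form of the streaming read:
. import dlt
.
. @dlt.table(name="sales_silver")
. def sales_silver():
.     # Reading the bronze table as a stream makes sales_silver a streaming table
.     return (spark.readStream.table("sales_bronze")
.             .selectExpr("store_id", "total + tax AS total_after_tax"))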
Question 6:
A data engineering team is processing a large-scale ETL pipeline that involves joining multiple large datasets, each containing
hundreds of columns and billions of records. During the join phase, they notice that the Spark executors are repeatedly spilling
data to disk, and performance significantly degrades due to excessive shuffling.
What type of resource optimization should the team prioritize to improve the performance of this job?
Your answer is correct
Memory Optimized
Overall explanation
Spark joins, particularly when dealing with large datasets or complex join conditions, can be highly memory-intensive. This is
due to the need to shuffle and potentially store data in memory during the join operation.
Memory-optimized resources ensure that more in-memory operations can be completed without disk spills, thereby significantly reducing shuffle overhead and improving performance.
Question 7:
A data engineer noticed that there are unused data files in the directory of a Delta table. They executed the VACUUM command
on this table; however, only some of those unused data files have been deleted.
Which of the following could explain why only some of the unused data files have been deleted after running the VACUUM command?
Correct answer
The deleted data files were older than the default retention threshold, while the remaining files are newer than the default retention threshold and cannot be deleted.
Overall explanation
Running the VACUUM command on a Delta table deletes the unused data files older than a specified data retention period.
Unused files newer than the default retention threshold are kept untouched.
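As an illustration of the retention behavior, the command also accepts an explicit retention period; a minimal sketch against a hypothetical table named my_delta_table:
. # Default retention is 7 days (168 hours); files newer than that are kept
. spark.sql("VACUUM my_delta_table")
. # An explicit retention period (in hours) can be supplied instead
. spark.sql("VACUUM my_delta_table RETAIN 240 HOURS")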
Question 8:
A data engineer wants to create a relational object by pulling data from two tables. The relational object must be used by other data engineers in other sessions on the same cluster only. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
Correct answer
Global Temporary view
Overall explanation
In order to avoid copying and storing physical data, the data engineer must create a view object. A view in Databricks is a virtual table that has no physical data; it is just a saved SQL query against actual tables.
The view type should be a Global Temporary view, which can be accessed from other sessions on the same cluster. Global Temporary views are tied to a cluster-scoped temporary schema called global_temp.
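A minimal PySpark sketch of this pattern, assuming hypothetical source tables orders and customers joined on customer_id:
. # Build the combined result without materializing any data
. joined_df = spark.table("orders").join(spark.table("customers"), "customer_id")
. # Register it as a global temporary view, visible to other sessions on the same cluster
. joined_df.createOrReplaceGlobalTempView("orders_enriched")
. # Other sessions query it through the global_temp schema
. spark.sql("SELECT * FROM global_temp.orders_enriched").show()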
Question 9:
A healthcare organization stores sensitive patient data within Databricks Unity Catalog. They need to share this data with an
external analytics vendor, who does not use Databricks.
What is the most secure and efficient method to enable this data access?
Your answer is correct
Using Delta Sharing with the open sharing protocol
Overall explanation
Delta Sharing is designed to securely share data across platforms using an open protocol. Since the vendor does not use
Databricks, Delta Sharing ensures secure, real-time access without manual exports or third-party workarounds.
Question 10:
In PySpark, which of the following commands can you use to query the Delta table employees created in Spark SQL?
Correct answer
spark.table("employees")
Overall explanation
The spark.table() function returns the specified Spark SQL table as a PySpark DataFrame.
Question 11:
Given the following Structured Streaming query:
. (spark.readStream
. .table("cleanedOrders")
. .groupBy("productCategory")
. .agg(sum("totalWithTax"))
. .writeStream
. .option("checkpointLocation", checkpointPath)
. .outputMode("complete")
. .table("aggregatedOrders")
. )
Which of the following best describes the purpose of this query in a Medallion Architecture?
Your answer is correct
The query is performing a hop from Silver layer to a Gold table
Overall explanation
The above Structured Streaming query creates business-level aggregates from clean orders data in the silver table cleanedOrders, and loads them into the gold table aggregatedOrders.
Question 12:
In Databricks Jobs, which of the following approaches can a data engineer use to configure a linear dependency between Task A and Task B?
Correct answer
They can select Task A in the Depends On field of the Task B configuration
Overall explanation
You can define the order of execution of tasks in a job using the Depends on dropdown menu. You can set this field to one or
more tasks in the job.
Question 13:
A data engineer has developed a code block to completely reprocess data based on the following if condition in Python:
. if process_mode = "init" and not is_table_exist:
. print("Start processing ...")
This code block is returning an invalid syntax error.
Which of the following changes should be made to the code block to fix this error?
Correct answer
. if process_mode == "init" and not is_table_exist:
. print("Start processing ...")
Overall explanation
A Python if statement looks like this in its simplest form:
. if <expr>:
. <statement>
Python supports the usual logical conditions from mathematics:
Equals: a == b
Not Equals: a != b
<, <=, >, >=
To combine conditional statements, you can use the following logical operators:
and
or
The negation operator in Python is: not
Question 14:
A data engineer has been tasked with consuming a stream of events from a Kafka topic named events_topic, which is hosted on a
remote Kafka broker at host:port.
Which of the following code snippets correctly establishes a streaming DataFrame in PySpark to read from this Kafka topic?
Your answer is correct
. eventsStream = (spark.readStream
. .format("kafka")
. .option("kafka.bootstrap.servers", "host:port")
. .option("subscribe", "events_topic")
. .option("startingOffsets", "latest")
. .load()
. )
Overall explanation
The correct approach uses PySpark’s readStream with format "kafka", and .option() methods to specify the necessary
configurations, such as:
kafka.bootstrap.servers to identify the Kafka cluster
subscribe to define the target Kafka topic, and
startingOffsets to control the consumption starting point. This can be set to "latest" or "earliest" to determine where to begin
reading.
Question 15:
A junior data engineer usually uses the INSERT INTO command to write data into a Delta table. A senior data engineer suggested using another command that avoids writing duplicate records.
Which of the following commands is the one suggested by the senior data engineer?
Correct answer
MERGE INTO
Overall explanation
MERGE INTO allows you to merge a set of updates, insertions, and deletions based on a source table into a target Delta table. With MERGE INTO, you can avoid inserting duplicate records when writing into Delta tables.
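A minimal sketch of such a merge, issued through spark.sql() against hypothetical target_table and updates tables keyed on id:
. # Rows already present in the target (matched on id) are updated rather than
. # inserted again, so no duplicate records are written
. spark.sql("""
.     MERGE INTO target_table AS t
.     USING updates AS s
.     ON t.id = s.id
.     WHEN MATCHED THEN UPDATE SET *
.     WHEN NOT MATCHED THEN INSERT *
. """)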
Question 16:
An organization plans to use Delta Sharing for enabling large dataset access by multiple clients across AWS, Azure, and GCP. A
senior data engineer has recommended migrating the dataset to Cloudflare R2 object storage prior to initiating the data sharing
process.
Which benefit does Cloudflare R2 offer in this Delta Sharing setup?
Correct answer
Eliminates cloud provider egress cost for outbound data transfers
Overall explanation
Cloudflare R2 removes egress costs, which helps significantly lower expenses when sharing data across cloud environments.
Question 17:
A data engineer has a large unpartitioned Delta table that was experiencing performance issues. They ran the OPTIMIZE command, which successfully compacted many small files into larger ones. However, despite this compaction, subsequent queries on the table continue to run slowly. To improve performance further, they re-run the OPTIMIZE command but notice no change in data file sizes.
What most likely explains the behavior during the second run?
Correct answer
The OPTIMIZE command does not alter already compacted files
Overall explanation
The behavior observed by the team is best explained by the idempotent nature of the OPTIMIZE command in Delta Lake.
When OPTIMIZE is executed, it identifies small files within the specified table or partition and merges them into larger, more
efficient files to reduce file overhead and improve I/O performance.
Once files have already been compacted and no new data has been added or modified, re-running OPTIMIZE does not trigger any further changes. This is by design: OPTIMIZE avoids redundant computation and does not modify files that already meet optimization thresholds.
So, in this case, since no new data was ingested and the files are already in optimal form, the second execution has no effect on
file sizes or structure. This explains why there is no noticeable change.
However, the persistence of slow query performance suggests that the root cause is likely not related to file size.
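For reference, a minimal sketch against a hypothetical table named events; the ZORDER BY variant is one option for addressing the remaining query slowness, since it changes the data layout rather than just file sizes:
. # Plain compaction: idempotent, so a second run on unchanged data does nothing
. spark.sql("OPTIMIZE events")
. # Co-locating data on a frequently filtered column may help the slow queries
. spark.sql("OPTIMIZE events ZORDER BY (event_date)")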
Question 18:
Which of the following best describes a Data Lakehouse?
Correct answer
Single, flexible, high-performance system that supports data engineering, analytics, and machine learning workloads.
Overall explanation
The Databricks Intelligence Platform is a unified lakehouse analytics platform that combines the best elements of data lakes and
data warehouses. So, in the lakehouse, you can work on data engineering, analytics, and AI, all in one platform.
Question 19:
A data engineer needs to determine whether to use Auto Loader or the COPY INTO command in order to load input data files incrementally.
In which of the following scenarios should the data engineer use Auto Loader over the COPY INTO command?
Correct answer
If they are going to ingest files in the order of millions or more over time
Overall explanation
Here are a few things to consider when choosing between Auto Loader and COPY INTO command:
If you’re going to ingest files in the order of thousands, you can use COPY INTO. If you are expecting files in the order of
millions or more over time, use Auto Loader.
If your data schema is going to evolve frequently, Auto Loader provides better primitives around schema inference and
evolution.
Question 20:
In the Medallion Architecture, which of the following statements best describes the Bronze layer?
Correct answer
It maintains raw data ingested from various sources
Overall explanation
Bronze tables contain data in its rawest format, ingested from various sources (e.g., JSON files, operational databases, Kafka streams, ...)
Question 21:
A data engineer uses the following SQL query:
GRANT USE SCHEMA ON SCHEMA sales_db TO finance_team
Which of the following is the benefit of the USE SCHEMA privilege?
Correct answer
It's a prerequisite to perform any action on the database
Overall explanation
The USE SCHEMA privilege (formerly known as USAGE) does not give any direct permissions. However, it serves as a
prerequisite for performing any actions on the objects within the schema. Additionally, it is necessary to grant the USE
CATALOG privilege on the parent catalog to enable access in Unity Catalog.
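A minimal sketch of the full set of grants a group such as finance_team typically needs before it can query a table, assuming a hypothetical parent catalog named main and a hypothetical table named transactions:
. # USE CATALOG and USE SCHEMA are prerequisites; SELECT grants the actual read access
. spark.sql("GRANT USE CATALOG ON CATALOG main TO finance_team")
. spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales_db TO finance_team")
. spark.sql("GRANT SELECT ON TABLE main.sales_db.transactions TO finance_team")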
Question 22:
A data engineering team is discussing the optimal data layout strategy on a growing managed Delta table in Unity Catalog. They
are considering partitioning, Z-ordering, and Liquid Clustering to improve query performance.
Which scenario best indicates that Automatic Liquid Clustering is the recommended choice?
Your answer is correct
The table experiences diverse, frequently changing query filters across multiple columns, with unpredictable access
patterns.
Overall explanation
Automatic Liquid Clustering is designed to dynamically adapt to evolving and unpredictable query patterns by continuously
reorganizing data based on recent query filters. This is especially beneficial when query predicates frequently change across
multiple columns, making static strategies like partitioning or Z-ordering less effective.
Partitioning works best when filters are stable and predictable, often on date/time columns. Z-ordering optimizes clustering for
known high-cardinality columns with consistent filtering. When query filters are varied and unpredictable, Automatic Liquid
Clustering provides the agility to improve performance without manual tuning.
Question 23:
The data engineering team has a DLT pipeline that updates all the tables once and then stops. The compute resources of the pipeline terminate when the pipeline is stopped.
Which of the following best describes the execution modes of this DLT pipeline?
Correct answer
The DLT pipeline executes in Triggered Pipeline mode under Production mode.
Overall explanation
Triggered pipelines update each table with whatever data is currently available and then they shut down.
In Production mode, the Delta Live Tables system:
Terminates the cluster immediately when the pipeline is stopped.
Restarts the cluster for recoverable errors (e.g., memory leak or stale credentials).
Retries execution in case of specific errors (e.g., a failure to start a cluster)
Question 24:
A data engineer is developing an automated data pipeline on Databricks, which includes several notebooks executed as part of a
scheduled job. One of these notebooks performs extensive data profiling and generates a large amount of textual output for
validation purposes. After multiple successful runs, the job suddenly begins to fail without any changes to the code logic or input
data volume.
Upon investigation, the engineer suspects that the issue may be related to excessive output generated during notebook execution,
which is causing the job cluster to reach a system-imposed limit.
To prevent notebook execution failure due to output size, what is the maximum amount of output a job cluster can handle?
Correct answer
30 MB
Overall explanation
Job clusters have a maximum notebook output size of 30 MB.
Question 25:
Which scenario is best suited for using spot instances in Databricks?
Correct answer
Non-critical batch processing jobs with retry capability
Overall explanation
Spot instances are ideal for stateless, fault-tolerant workloads—such as batch processing jobs that can be retried without
significant data loss or user impact. These jobs are often time-flexible and can tolerate interruptions, making them a good match
for spot instances.
In contrast, use cases that demand low latency, real-time processing, or guaranteed uptime—such as interactive queries or
transactional systems—are generally unsuited for spot instances due to the inherent risk of unexpected termination.
Question 26:
Which part of the Databricks platform can a data engineer use to grant permissions on tables to users?
Correct answer
Catalog Explorer
Overall explanation
Catalog Explorer allows you to manage data object permissions. This includes granting privileges on tables and databases to
users or groups of users.
Question 27:
Which of the following technologies is used by Auto Loader to load data incrementally?
Your answer is correct
Spark Structured Streaming
Overall explanation
Auto Loader is based on Spark Structured Streaming. It provides a Structured Streaming source called cloudFiles.
Question 28:
Which of the following code blocks can a data engineer use to create a Python function to multiply two integers and return the
result?
Your answer is correct
. def multiply_numbers(num1, num2):
. return num1 * num2
Overall explanation
In Python, a function is defined using the def keyword. Here, we used the return keyword since the question clearly asks to return the result rather than printing the output.
Syntax:
. def function_name(params):
. return params
Question 29:
A data engineer is using Databricks Delta Live Tables (DLT) to define a view of recent orders. The engineer has written the
following Python code to create the view:
. @dlt.view
. def recent_orders():
. return spark.read.table("orders").filter("year > 2025")
Based on this implementation, what type of object will be created when this code is executed?
Correct answer
Temporary view
Overall explanation
In Delta Live Tables (DLT), views declared using the @dlt.view decorator are temporary objects. They only exist during a
single pipeline run and are not saved to the catalog like tables. Because of this, views are useful for handling intermediate steps
in data processing or performing quality checks when the results don’t need to be kept permanently.
In SQL, it is created using the following syntax:
. CREATE TEMPORARY VIEW recent_orders
. AS SELECT * FROM orders WHERE year > 2025;
Question 30:
How does Automatic Liquid Clustering determine which columns to use as clustering keys in a Unity Catalog-managed Delta
table?
Correct answer
It leverages Predictive Optimization to choose optimal clustering keys based on observed query behavior.
Overall explanation
Automatic Liquid Clustering in Databricks is a feature designed to automatically optimize the physical layout of data in Delta
tables based on the access patterns and metadata statistics. It leverages Predictive Optimization, which uses query behavior
analytics to select clustering keys dynamically.
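A minimal sketch of enabling this on a hypothetical managed table named sales, assuming the CLUSTER BY AUTO clause available in recent Databricks runtimes with Predictive Optimization enabled:
. # Enable automatic liquid clustering on an existing Unity Catalog managed table
. spark.sql("ALTER TABLE sales CLUSTER BY AUTO")
. # Or declare it at table creation time
. spark.sql("CREATE TABLE sales_new (id BIGINT, region STRING, amount DOUBLE) CLUSTER BY AUTO")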
Question 31:
In Delta Lake tables, which of the following is the primary format for the data files?
Correct answer
Parquet
Overall explanation
Delta Lake builds upon standard data formats. A Delta Lake table is stored as one or more data files in Parquet format, along with a transaction log in JSON format.
Question 32:
A data engineer is configuring a Databricks Auto Loader stream to ingest JSON data from an S3 bucket. The pipeline should fail
when new columns are detected in the incoming data, but those new columns should still be added to the schema so that
subsequent runs can resume successfully with the updated schema. Existing columns must retain their data types.
. spark.readStream \
. .format("cloudFiles") \
. .option("cloudFiles.format", "json") \
. .option("cloudFiles.schemaLocation", "s3://checkpoints/orders")
. .option("cloudFiles.schemaEvolutionMode", "_______________") \
. .load("s3://shop/raw/orders/json/") \
. .writeStream \
. .option("checkpointLocation", "s3://checkpoints/orders") \
. .start("orders_table")
Which option correctly fills in the blank to meet the specified requirement ?
Your answer is correct
addNewColumns
Overall explanation
The addNewColumns mode is the default schema evolution behavior in Auto Loader. In this mode, when a new column is
detected, the stream fails, but the new column is added to the schema. This allows the job to be restarted and continue processing
with the updated schema. Importantly, existing columns' data types are not changed.
Question 33:
Which of the following alert destinations is Not supported in Databricks SQL?
Correct answer
SMS
Overall explanation
SMS is not supported as an alert destination in Databricks SQL, while email, webhook, Slack, and Microsoft Teams are supported alert destinations in Databricks SQL.
Question 34:
A data engineer wants to increase the cluster size of an existing Databricks SQL warehouse.
Which of the following is the benefit of increasing the cluster size of Databricks SQL warehouses?
Correct answer
Reduces the latency of query execution
Overall explanation
Cluster Size represents the number of cluster workers and the size of compute resources available to run your queries and dashboards. To reduce query latency, you can increase the cluster size.
Question 35:
Databricks provides a declarative ETL framework for building reliable and maintainable data processing pipelines, while
maintaining table dependencies and data quality.
Which of the following technologies is being described above?
Correct answer
Delta Live Tables
Overall explanation
Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the
transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring,
data quality, and error handling.
Question 36:
A data architect is designing a hybrid data platform that must securely connect to on-premises databases. These databases need to
be queried using SQL from within Databricks, and the platform must remain within the enterprise's custom-defined network for
compliance reasons.
Which SQL Warehouse type should the architect choose to support this architecture?
Correct answer
Pro SQL Warehouse
Overall explanation
Pro SQL Warehouses support secure connectivity to custom-defined networks, including hybrid architectures where data resides
both in the cloud and on-premises. This setup is ideal for scenarios requiring private network access to databases or event-driven
architectures.
Serverless options are managed by Databricks and typically do not operate within custom VPCs, making them unsuitable for
such scenarios.
Question 37:
Which of the following tasks is not supported by Git folders, and must be performed in your Git provider?
Correct answer
Delete branches
Overall explanation
The following tasks are not supported by Databricks Git folders (formerly Repos), and must be performed in your Git provider:
Create a pull request
Delete branches
Question 38:
Which of the following operations can a data engineer use to update a Git folder from its remote Git repository?
Correct answer
Pull
Overall explanation
The Git pull operation is used to fetch and download content from a remote repository and immediately update the local repository to match that content.
Question 39:
A data engineer wants to validate that all Spark and Pandas DataFrames in their notebook are correctly defined. They aim to
inspect the structure and column names at a glance, with immediate access to the full schema details for each DataFrame.
Which of the following features would best support this goal?
Correct answer
Notebook Variable Explorer
Overall explanation
The Variable Explorer in Databricks Notebooks provides a convenient view of variables defined in a notebook session. It allows users to view all the variables at a glance, including their names, data types, and current values.
The Variable Explorer also presents additional metadata for Spark and Pandas DataFrames. The shape and column names are available at a glance, and a full view of the schema is available on hover.
Question 40:
Which of the following statements best describes the usage of the CREATE SCHEMA command?
Your answer is correct
It’s used to create a database
Overall explanation
CREATE SCHEMA is an alias for CREATE DATABASE statement. While usage of SCHEMA and DATABASE is
interchangeable, SCHEMA is preferred.
Question 41:
A data engineer is tasked with calculating both the total and average salary of employees, grouped by their department. They use
PySpark’s groupBy and wish to apply multiple aggregations to each group.
Complete the following code snippet:
. result_df = df.groupBy("department").____________(
. sum("salary").alias("total_salary"),
. avg("salary").alias("average_salary")
. )
What function should be used to complete the code?
Correct answer
agg
Overall explanation
The agg() function is used after a groupBy() operation when applying one or more aggregate functions to grouped data. In this
scenario, the data engineer is calculating both the sum and average of salaries, so agg() is the correct function.
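The completed snippet, including the imports needed so that sum and avg resolve to the PySpark aggregate functions rather than the Python built-ins (df is the DataFrame from the question):
. from pyspark.sql.functions import sum, avg
.
. result_df = df.groupBy("department").agg(
.     sum("salary").alias("total_salary"),
.     avg("salary").alias("average_salary")
. )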
Question 42:
Which of the following statements is Not true about CTAS statements?
Correct answer
CTAS statements support manual schema declaration
Overall explanation
CREATE TABLE AS SELECT statements, or CTAS statements create and populate Delta tables using the output of a
SELECT query. CTAS statements automatically infer schema information from query results and do not support manual schema
declaration.
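A minimal sketch of a CTAS statement issued from a notebook, assuming a hypothetical source table raw_events; note that no column list or schema is declared because it is inferred from the SELECT:
. spark.sql("""
.     CREATE TABLE events_clean
.     AS SELECT event_id, CAST(event_time AS TIMESTAMP) AS event_time, payload
.     FROM raw_events
.     WHERE event_id IS NOT NULL
. """)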
Question 43:
A data engineer has a Job with multiple tasks that takes more than 2 hours to complete. In the last run, the final task
unexpectedly failed.
Which of the following actions can the data engineer perform to complete this Job Run while minimizing the execution time?
Correct answer
They can repair this Job Run so only the failed tasks will be re-executed
Overall explanation
You can repair failed multi-task jobs by running only the subset of unsuccessful tasks and any dependent tasks. Because
successful tasks are not re-run, this feature reduces the time and resources required to recover from unsuccessful job runs.
Question 44:
Which of the following locations completely hosts the customer data?
Correct answer
Customer's cloud account
Overall explanation
According to the Databricks Lakehouse architecture, the storage account hosting the customer data is provisioned in the data
plane in the Databricks customer's cloud account.
Question 45:
A data engineer has defined the following data quality constraint in a Delta Live Tables pipeline:
CONSTRAINT valid_id EXPECT (id IS NOT NULL) _____________
Fill in the blank above so that records violating this constraint cause the pipeline to fail.
Correct answer
ON VIOLATION FAIL UPDATE
Overall explanation
With ON VIOLATION FAIL UPDATE, records that violate the expectation will cause the pipeline to fail. When a pipeline fails
because of an expectation violation, you must fix the pipeline code to handle the invalid data correctly before re-running the
pipeline.
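The same behavior can be expressed with the Python expectations API; a minimal sketch, assuming hypothetical bronze and silver table names:
. import dlt
.
. @dlt.table(name="orders_silver")
. @dlt.expect_or_fail("valid_id", "id IS NOT NULL")  # a violating record fails the update
. def orders_silver():
.     return spark.readStream.table("orders_bronze")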