
CertyIQ

Premium exam material

Get certified quickly with the CertyIQ Premium exam material.
Everything you need to prepare, learn, and pass your certification exam easily. Lifetime free updates.
Guaranteed success on your first attempt.
https://www.CertyIQ.com

Databricks Certified Data Engineer Associate

Total: 130 Questions


Link: https://certyiq.com/papers/databricks/certified-data-engineer-associate
Question: 1 CertyIQ
A data organization leader is upset about the data analysis team’s reports being different from the data
engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data
analysis architectures is to blame.
Which of the following describes how a data lakehouse could alleviate this issue?

A. Both teams would autoscale their work as data size evolves
B. Both teams would use the same source of truth for their work
C. Both teams would reorganize to report to the same department
D. Both teams would be able to collaborate on projects in real-time
E. Both teams would respond more quickly to ad-hoc requests

Answer: B

Explanation:

Unity Catalog in Databricks helps eliminate data silos in an organization by providing a single source of
truth for the data.

Question: 2 CertyIQ
Which of the following describes a scenario in which a data team will want to utilize cluster pools?

A. An automated report needs to be refreshed as quickly as possible.
B. An automated report needs to be made reproducible.
C. An automated report needs to be tested to identify errors.
D. An automated report needs to be version-controlled across multiple collaborators.
E. An automated report needs to be runnable by all stakeholders.

Answer: A

Explanation:

Using cluster pools reduces cluster start-up time. So in this case, the reports can be refreshed quickly
without having to wait long for the cluster to start.

Question: 3 CertyIQ
Which of the following is hosted completely in the control plane of the classic Databricks architecture?

A. Worker node
B. JDBC data source
C. Databricks web application
D. Databricks Filesystem
E. Driver node

Answer: C

Explanation:
C. Databricks web application.

In the classic Databricks architecture, the control plane includes components like the Databricks web
application, the Databricks REST API, and the Databricks workspace. These components are responsible for
managing and controlling the Databricks environment, including cluster provisioning, notebook management,
access control, and job scheduling.

The other options, such as worker nodes, JDBC data sources, the Databricks Filesystem (DBFS), and driver nodes,
are typically part of the data plane or the execution environment, which is separate from the control plane.
Worker nodes are responsible for executing tasks and computations, JDBC data sources are used to connect
to external databases, DBFS is a distributed file system for data storage, and driver nodes are responsible for
coordinating the execution of Spark jobs.

Question: 4 CertyIQ
Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta Lake?

A. The ability to manipulate the same data using a variety of languages
B. The ability to collaborate in real time on a single notebook
C. The ability to set up alerts for query failures
D. The ability to support batch and streaming workloads
E. The ability to distribute complex data operations

Answer: D

Explanation:

D. The ability to support batch and streaming workloads.

Delta Lake is a key component of the Databricks Lakehouse Platform that provides several benefits, and one
of the most significant is its ability to support both batch and streaming workloads seamlessly. Delta
Lake allows you to process and analyze data in real time (streaming) as well as in batch, making it a versatile
choice for various data processing needs.

While the other options may be benefits or capabilities of Databricks or the Lakehouse Platform in general,
they are not specifically associated with Delta Lake.

Question: 5 CertyIQ
Which of the following describes the storage organization of a Delta table?

A. Delta tables are stored in a single file that contains data, history, metadata, and other attributes.
B. Delta tables store their data in a single file and all metadata in a collection of files in a separate location.
C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.
D. Delta tables are stored in a collection of files that contain only the data stored within the table.
E. Delta tables are stored in a single file that contains only the data stored within the table.

Answer: C

Explanation:

C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.
Delta tables store data in a structured manner using Parquet files, and they also maintain metadata and
transaction logs in separate directories. This organization allows for versioning, transactional capabilities, and
metadata tracking in Delta Lake.

Question: 6 CertyIQ
Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the
existing Delta table my_table and save the updated table?

A. SELECT * FROM my_table WHERE age > 25;
B. UPDATE my_table WHERE age > 25;
C. DELETE FROM my_table WHERE age > 25;
D. UPDATE my_table WHERE age <= 25;
E. DELETE FROM my_table WHERE age <= 25;

Answer: C

Explanation:

DELETE FROM my_table WHERE age > 25;

Question: 7 CertyIQ
A data engineer has realized that they made a mistake when making a daily update to a table. They need to use
Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to
time travel to the older version, they are unable to restore the data because the data files have been deleted.
Which of the following explains why the data files are no longer present?

A. The VACUUM command was run on the table
B. The TIME TRAVEL command was run on the table
C. The DELETE HISTORY command was run on the table
D. The OPTIMIZE command was run on the table
E. The HISTORY command was run on the table

Answer: A

Explanation:

A. The VACUUM command was run on the table.

The VACUUM command in Delta Lake is used to clean up and remove unnecessary data files that are no
longer needed for time travel or query purposes. When you run VACUUM with certain retention settings, it can
delete older data files, which might include versions of data that are older than the specified retention period.
If the data engineer is unable to restore the table to a version that is 3 days old because the data files have
been deleted, it's likely because the VACUUM command was run on the table, removing the older data files as
part of data cleanup.
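
As an illustration of how this happens, here is a minimal sketch in PySpark; the table name and version number are hypothetical, and the retention value shown is the default (7 days / 168 hours):

# Hypothetical sketch: inspect the table history, then clean up old files.
# VACUUM removes data files that are no longer referenced by table versions
# newer than the retention threshold (168 hours by default).
spark.sql("DESCRIBE HISTORY my_table").show()
spark.sql("VACUUM my_table RETAIN 168 HOURS")

# If VACUUM was run with a retention shorter than 3 days, time travel to the
# 3-day-old version fails because its underlying files have been deleted.
spark.sql("SELECT * FROM my_table VERSION AS OF 5").show()  # may now error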

Question: 8 CertyIQ
Which of the following Git operations must be performed outside of Databricks Repos?

A. Commit
B. Pull
C. Push
D. Clone
E. Merge

Answer: E

Explanation:

MERGE is the only Git operation listed in the options that cannot be performed with Databricks Repos;
CLONE is absolutely possible.

Cloning can be done in Databricks Repos, but merging cannot; it needs to be done in your Git provider.

https://learn.microsoft.com/en-us/azure/databricks/repos/

Question: 9 CertyIQ
Which of the following data lakehouse features results in improved data quality over a traditional data lake?

A. A data lakehouse provides storage solutions for structured and unstructured data.
B. A data lakehouse supports ACID-compliant transactions.
C. A data lakehouse allows the use of SQL queries to examine data.
D. A data lakehouse stores data in open formats.
E. A data lakehouse enables machine learning and artificial Intelligence workloads.

Answer: B

Explanation:

B. A data lakehouse supports ACID-compliant transactions.

One of the key features of a data lakehouse that results in improved data quality over a traditional data lake
is its support for ACID (Atomicity, Consistency, Isolation, Durability) transactions. ACID transactions provide
data integrity and consistency guarantees, ensuring that operations on the data are reliable and that data is
not left in an inconsistent state due to failures or concurrent access.

In a traditional data lake, such transactional guarantees are often lacking, making it challenging to maintain
data quality, especially in scenarios involving multiple data writes, updates, or complex transformations. A
data lakehouse, by offering ACID compliance, helps maintain data quality by providing strong consistency
and reliability, which is crucial for data pipelines and analytics.

Question: 10 CertyIQ
A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their
project using Databricks Repos.
Which of the following is an advantage of using Databricks Repos over the Databricks Notebooks versioning?

A. Databricks Repos automatically saves development progress
B. Databricks Repos supports the use of multiple branches
C. Databricks Repos allows users to revert to previous versions of a notebook
D. Databricks Repos provides the ability to comment on specific changes
E. Databricks Repos is wholly housed within the Databricks Lakehouse Platform

Answer: B

Explanation:

B. Databricks Repos supports the use of multiple branches.

An advantage of using Databricks Repos over the built-in Databricks Notebooks versioning is the ability to
work with multiple branches. Branching is a fundamental feature of version control systems like Git, which
Databricks Repos is built upon. It allows you to create separate branches for different tasks, features, or
experiments within your project. This separation helps in parallel development and experimentation without
affecting the main branch or the work of other team members.

Branching provides a more organized and collaborative development environment, making it easier to merge
changes and manage different development efforts. While Databricks Notebooks versioning also allows you
to track versions of notebooks, it may not provide the same level of flexibility and collaboration as branching
in Databricks Repos.

Question: 11 CertyIQ
A data engineer has left the organization. The data team needs to transfer ownership of the data engineer’s Delta
tables to a new data engineer. The new data engineer is the lead engineer on the data team.
Assuming the original data engineer no longer has access, which of the following individuals must be the one to
transfer ownership of the Delta tables in Data Explorer?

A. Databricks account representative
B. This transfer is not possible
C. Workspace administrator
D. New lead data engineer
E. Original data engineer

Answer: C

Explanation:

The correct answer is C: Workspace administrator.

Reference:

https://www.databricks.com/blog/2022/08/26/databricks-workspace-administration-best-practices-for-
account-workspace-and-metastore-admins.html

Question: 12 CertyIQ
A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from
the data engineering team to implement a series of tests to ensure the data is clean. However, the data
engineering team uses Python for its tests rather than SQL.
Which of the following commands could the data engineering team use to access sales in PySpark?
A. SELECT * FROM sales
B. There is no way to share data between PySpark and SQL.
C. spark.sql("sales")
D. spark.delta.table("sales")
E. spark.table("sales")

Answer: E

Explanation:

The correct answer is E: spark.table("sales").

Reference:

https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.SparkSession.table.html
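
A minimal PySpark sketch of such a test; the column name and the null-check rule are assumptions made for illustration:

from pyspark.sql import functions as F

# Load the Delta table registered in the metastore as a DataFrame.
sales_df = spark.table("sales")

# Hypothetical data-quality check: no sale should have a null customer_id.
null_count = sales_df.filter(F.col("customer_id").isNull()).count()
assert null_count == 0, f"Found {null_count} sales rows with a null customer_id"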

Question: 13 CertyIQ
Which of the following commands will return the location of database customer360?

A. DESCRIBE LOCATION customer360;
B. DROP DATABASE customer360;
C. DESCRIBE DATABASE customer360;
D. ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');
E. USE DATABASE customer360;

Answer: C

Explanation:

C. DESCRIBE DATABASE customer360;

To retrieve the location of a database named "customer360" in a database management system like Hive or
Databricks, you can use the DESCRIBE DATABASE command followed by the database name. This command
will provide information about the database, including its location.

Question: 14 CertyIQ
A data engineer wants to create a new table containing the names of customers that live in France.
They have written the following command:

A senior data engineer mentions that it is organization policy to include a table property indicating that the new
table includes personally identifiable information (PII).
Which of the following lines of code fills in the above blank to successfully complete the task?
A. There is no way to indicate whether a table contains PII.
B. "COMMENT PII"
C. TBLPROPERTIES PII
D. COMMENT "Contains PII"
E. PII

Answer: D

Explanation:

The correct answer is D: COMMENT "Contains PII".
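
Since the original command is shown as an image, here is a hedged sketch of what the completed statement could look like; the table and column names are hypothetical:

# Hypothetical CTAS with the blank filled by the COMMENT clause.
spark.sql("""
    CREATE TABLE customers_fr
    COMMENT "Contains PII"
    AS SELECT customer_name
       FROM customers
       WHERE country = 'France'
""")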

Question: 15 CertyIQ
Which of the following benefits is provided by the array functions from Spark SQL?

A. An ability to work with data in a variety of types at once
B. An ability to work with data within certain partitions and windows
C. An ability to work with time-related data in specified intervals
D. An ability to work with complex, nested data ingested from JSON files
E. An ability to work with an array of tables for procedural automation

Answer: D

Explanation:

D. An ability to work with complex, nested data ingested from JSON files.

Array functions in Spark SQL are primarily used for working with arrays and complex, nested data structures,
such as those often encountered when ingesting JSON files. These functions allow you to manipulate and
query nested arrays and structures within your data, making it easier to extract and work with specific
elements or values within complex data formats.

While some of the other options (such as option A for working with different data types) are features of Spark
SQL or SQL in general, array functions specifically excel at handling complex, nested data structures like
those found in JSON files.
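
A short sketch of a few Spark SQL array functions applied to nested, JSON-like data; the sample rows and column names are made up for illustration:

from pyspark.sql import functions as F

# Hypothetical nested records, similar to what JSON ingestion produces.
orders = spark.createDataFrame(
    [("order1", ["apple", "banana"]), ("order2", ["cherry"])],
    ["order_id", "items"],
)

# explode() turns each array element into its own row; size() and
# array_contains() query the array column directly.
orders.select("order_id", F.explode("items").alias("item")).show()
orders.select(
    "order_id",
    F.size("items").alias("item_count"),
    F.array_contains("items", "apple").alias("has_apple"),
).show()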

Question: 16 CertyIQ
Which of the following commands can be used to write data into a Delta table while avoiding the writing of
duplicate records?

A. DROP
B. IGNORE
C. MERGE
D. APPEND
E. INSERT

Answer: C

Explanation:
C. MERGE

The MERGE command is used to write data into a Delta table while avoiding the writing of duplicate records. It
allows you to perform an "upsert" operation, which means that it will insert new records and update existing
records in the Delta table based on a specified condition. This helps maintain data integrity and avoid
duplicates when adding new data to the table.
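
A sketch of such an upsert; the table names and the join key are assumptions:

# Hypothetical MERGE: records that already exist (matched on id) are updated,
# new records are inserted, so reloading the same source does not create
# duplicates in the Delta table.
spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")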

Question: 17 CertyIQ
A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to
apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF?

A.

B.

C.

D.

E.

Answer: A

Explanation:
Answer E is incorrect: a user-defined function is never written as CREATE UDF; the correct syntax is
CREATE FUNCTION. That leaves choices A and D, and the body of D is not valid SQL UDF syntax, so the
correct answer is A.
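
Since the answer options are images, here is a generic sketch of the CREATE FUNCTION shape for a SQL UDF; the function name and logic are hypothetical, not the exact text of option A:

# Hypothetical SQL UDF applying custom logic to the string column city.
spark.sql("""
    CREATE FUNCTION clean_city(city STRING)
    RETURNS STRING
    RETURN upper(trim(city))
""")

# Once created, it can be used like any built-in function:
spark.sql("SELECT store_id, clean_city(city) AS city FROM stores").show()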

Question: 18 CertyIQ
A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day.
They only want the final query in the program to run on Sundays. They ask for help from the data engineering team
to complete this task.
Which of the following approaches could be used by the data engineering team to complete this task?

A. They could submit a feature request with Databricks to add this functionality.
B. They could wrap the queries using PySpark and use Python’s control flow system to determine when to run
the final query.
C. They could only run the entire program on Sundays.
D. They could automatically restrict access to the source table in the final query so that it is only accessible on
Sundays.
E. They could redesign the data model to separate the data used in the final query into a new table.

Answer: B

Explanation:

They could wrap the queries using PySpark and use Python’s control flow system to determine when to run
the final query.
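
A minimal sketch of that approach; the table names are placeholders and the Sunday check uses Python's weekday() convention (Monday = 0, Sunday = 6):

import datetime

# Daily queries run unconditionally (hypothetical statements).
spark.sql("INSERT INTO daily_summary SELECT * FROM staging_daily")

# Python control flow decides whether the final query runs.
if datetime.date.today().weekday() == 6:  # Sunday
    spark.sql("INSERT INTO weekly_summary SELECT * FROM daily_summary")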

Question: 19 CertyIQ
A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s
sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:

After running the command today, the data engineer notices that the number of records in table transactions has
not changed.
Which of the following describes why the statement might not have copied any new records into the table?

A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
B. The names of the files to be copied were not included with the FILES keyword.
C. The previous day’s file has already been copied into the table.
D. The PARQUET file format does not support COPY INTO.
E. The COPY INTO statement requires the table to be refreshed to view the copied rows.

Answer: C

Explanation:

The previous day’s file has already been copied into the table.

Reference:
https://docs.databricks.com/ingestion/copy-into/tutorial-notebook.html

Question: 20 CertyIQ
A data engineer needs to create a table in Databricks using data from their organization’s existing SQLite
database.
They run the following command:

Which of the following lines of code fills in the above blank to successfully complete the task?

A. org.apache.spark.sql.jdbc
B. autoloader
C. DELTA
D. sqlite
E. org.apache.spark.sql.sqlite

Answer: A

Explanation:

The correct answer is A: org.apache.spark.sql.jdbc.
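
A sketch of the completed statement; the JDBC URL and source table name are hypothetical, and the SQLite JDBC driver would need to be installed on the cluster:

# Hypothetical table backed by a SQLite database via the JDBC data source.
spark.sql("""
    CREATE TABLE customers_jdbc
    USING org.apache.spark.sql.jdbc
    OPTIONS (
        url 'jdbc:sqlite:/dbfs/tmp/company.db',
        dbtable 'customers'
    )
""")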

Question: 21 CertyIQ
A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions
in the month of March. The second table april_transactions is a collection of all retail transactions in the month of
April. There are no duplicate records between the tables.
Which of the following commands should be run to create a new table all_transactions that contains all records
from march_transactions and april_transactions without duplicate records?

A. CREATE TABLE all_transactions AS
   SELECT * FROM march_transactions
   INNER JOIN SELECT * FROM april_transactions;
B. CREATE TABLE all_transactions AS
   SELECT * FROM march_transactions
   UNION SELECT * FROM april_transactions;
C. CREATE TABLE all_transactions AS
   SELECT * FROM march_transactions
   OUTER JOIN SELECT * FROM april_transactions;
D. CREATE TABLE all_transactions AS
   SELECT * FROM march_transactions
   INTERSECT SELECT * FROM april_transactions;
E. CREATE TABLE all_transactions AS
   SELECT * FROM march_transactions
   MERGE SELECT * FROM april_transactions;
Answer: B

Explanation:

CREATE TABLE all_transactions AS

SELECT * FROM march_transactions

UNION SELECT * FROM april_transactions;

Question: 22 CertyIQ
A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is
equal to 1 and the Python variable review_period is True.
Which of the following control flow statements should the data engineer use to begin this conditionally executed
code block?

A. if day_of_week = 1 and review_period:
B. if day_of_week = 1 and review_period = "True":
C. if day_of_week == 1 and review_period == "True":
D. if day_of_week == 1 and review_period:
E. if day_of_week = 1 & review_period: = "True":

Answer: D

Explanation:

D. if day_of_week == 1 and review_period:

This statement will check if the variable day_of_week is equal to 1 and if the variable review_period evaluates
to a truthy value. The use of the double equal sign (==) in the comparison of day_of_week is important, as a
single equal sign (=) would be used to assign a value to the variable instead of checking its value. The use of a
single ampersand (&) instead of the keyword and is not valid syntax in Python. The use of quotes around True
in options B and C will result in a string comparison, which will not evaluate to True even if the value of
review_period is True.

Question: 23 CertyIQ
A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table
metadata and data.
They run the following command:

DROP TABLE IF EXISTS my_table;

While the object no longer appears when they run SHOW TABLES, the data files still exist.
Which of the following describes why the data files still exist and the metadata files were deleted?

A. The table’s data was larger than 10 GB
B. The table’s data was smaller than 10 GB
C. The table was external
D. The table did not have a location
E. The table was managed
Answer: C

Explanation:

C is the correct answer. For external tables, DROP TABLE removes only the table metadata; the data files remain
at the table’s external location. To remove them, find the location (for example with DESCRIBE TABLE EXTENDED)
and delete the files there.
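
A sketch of this behavior; the location is hypothetical, and dbutils is the Databricks notebook utility assumed to be available:

# External table: created with an explicit LOCATION (hypothetical path).
spark.sql("""
    CREATE TABLE my_table (id INT, name STRING)
    LOCATION '/mnt/raw/my_table'
""")

# DROP TABLE removes only the metastore entry for an external table.
spark.sql("DROP TABLE IF EXISTS my_table")

# The data files remain at the external location and must be deleted
# separately if they are no longer needed.
dbutils.fs.rm("/mnt/raw/my_table", True)  # True = recursive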

Question: 24 CertyIQ
A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data
engineers in other sessions. It also must be saved to a physical location.
Which of the following data entities should the data engineer create?

A. Database
B. Function
C. View
D. Temporary view
E. Table

Answer: E

Explanation:

E. Table

To create a data entity that can be used by other data engineers in other sessions and must be saved to a
physical location, you should create a table. Tables in a database are physical storage structures that hold
data, and they can be accessed and shared by multiple users and sessions. By creating a table, you provide a
permanent and structured storage location for the data entity that can be used across different sessions and
by other users as needed.

Options like databases (A) can be used to organize tables, views (C) can provide virtual representations of
data, and temporary views (D) are temporary in nature and don't save data to a physical location. Functions (B)
are typically used for processing data or performing calculations, not for storing data.

Question: 25 CertyIQ
A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data
is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the
quality level.
Which of the following tools can the data engineer use to solve this problem?

A. Unity Catalog
B. Data Explorer
C. Delta Lake
D. Delta Live Tables
E. Auto Loader

Answer: D

Explanation:
D. Delta Live Tables

Delta Live Tables is a tool provided by Databricks that can help data engineers automate the monitoring of
data quality. It is designed for managing data pipelines, monitoring data quality, and automating workflows.
With Delta Live Tables, you can set up data quality checks and alerts to detect issues and anomalies in your
data as it is ingested and processed in real-time. It provides a way to ensure that the data quality meets your
desired standards and can trigger actions or notifications when issues are detected.

While the other tools mentioned may have their own purposes in a data engineering environment, Delta Live
Tables is specifically designed for data quality monitoring and automation within the Databricks ecosystem.

Question: 26 CertyIQ
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are
defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Production mode using the Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after
clicking Start to update the pipeline?

A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will
persist to allow for additional testing.
B. All datasets will be updated once and the pipeline will persist without any processing. The compute
resources will persist but go unused.
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be
deployed for the update and terminated when the pipeline is stopped.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
E. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow
for additional testing.

Answer: C

Explanation:

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be
deployed for the update and terminated when the pipeline is stopped.

Question: 27 CertyIQ
In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any
kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record
the offset range of the data being processed in each trigger?

A. Checkpointing and Write-ahead Logs
B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
C. Replayable Sources and Idempotent Sinks
D. Write-ahead Logs and Idempotent Sinks
E. Checkpointing and Idempotent Sinks

Answer: A

Explanation:

A. Checkpointing and Write-ahead Logs.


To reliably track the exact progress of processing and handle failures in Spark Structured Streaming, Spark
uses both checkpointing and write-ahead logs. Checkpointing allows Spark to periodically save the state of
the streaming application to a reliable distributed file system, which can be used for recovery in case of
failures. Write-ahead logs are used to record the offset range of data being processed, ensuring that the
system can recover and reprocess data from the last known offset in the event of a failure.
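
A minimal sketch showing where checkpointing is configured in a Structured Streaming query; the table names and checkpoint path are hypothetical:

# The checkpoint location is where Spark persists offsets, write-ahead log
# entries, and state, so the query can restart from the last committed
# offset range after a failure.
(spark.readStream
    .table("bronze_events")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_to_silver")
    .outputMode("append")
    .toTable("silver_events"))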

Question: 28 CertyIQ
Which of the following describes the relationship between Gold tables and Silver tables?

A. Gold tables are more likely to contain aggregations than Silver tables.
B. Gold tables are more likely to contain valuable data than Silver tables.
C. Gold tables are more likely to contain a less refined view of data than Silver tables.
D. Gold tables are more likely to contain more data than Silver tables.
E. Gold tables are more likely to contain truthful data than Silver tables.

Answer: A

Explanation:

A. Gold tables are more likely to contain aggregations than Silver tables.

In some data processing pipelines, especially those following a typical "Bronze-Silver-Gold" data lakehouse
architecture, Silver tables are often considered a more refined version of the raw or Bronze data. Silver tables
may include data cleansing, schema enforcement, and some initial transformations.

Gold tables, on the other hand, typically represent a stage where data is further enriched, aggregated, and
processed to provide valuable insights for analytical purposes. This could indeed involve more aggregations
compared to Silver tables.

Question: 29 CertyIQ
Which of the following describes the relationship between Bronze tables and raw data?

A. Bronze tables contain less data than raw data files.
B. Bronze tables contain more truthful data than raw data.
C. Bronze tables contain aggregates while raw data is unaggregated.
D. Bronze tables contain a less refined view of data than raw data.
E. Bronze tables contain raw data with a schema applied.

Answer: E

Explanation:

E. Bronze tables contain raw data with a schema applied.

In a typical data processing pipeline following a "Bronze-Silver-Gold" data lakehouse architecture, Bronze
tables are the initial stage where raw data is ingested and transformed into a structured format with a schema
applied. The schema provides structure and meaning to the raw data, making it more usable and accessible
for downstream processing.

Therefore, Bronze tables contain the raw data but in a structured and schema-enforced format, which makes
them distinct from the unprocessed, unstructured raw data files.

Question: 30 CertyIQ
Which of the following tools is used by Auto Loader to process data incrementally?

A. Checkpointing
B. Spark Structured Streaming
C. Data Explorer
D. Unity Catalog
E. Databricks SQL

Answer: B

Explanation:

B. Spark Structured Streaming.

The Auto Loader process in Databricks is typically used in conjunction with Spark Structured Streaming to
process data incrementally. Spark Structured Streaming is a real-time data processing framework that allows
you to process data streams incrementally as new data arrives. Auto Loader is a feature in Databricks
that works with Structured Streaming to automatically detect and process new data files as they are added to
a specified data source location. It allows for incremental data processing without the need for manual
intervention.
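
A brief sketch of Auto Loader running on top of Structured Streaming; the input path, file format, and schema/checkpoint locations are assumptions:

# Auto Loader is the "cloudFiles" source for Structured Streaming; it keeps
# track of which files have already been processed and reads only new ones.
raw = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/raw_events")
        .load("/mnt/landing/raw_events"))

(raw.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/raw_events")
    .toTable("bronze_events"))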

Question: 31 CertyIQ
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then
perform a streaming write into a new table.
The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the
following lines of code should the data engineer use to fill in the blank?

A. trigger("5 seconds")
B. trigger()
C. trigger(once="5 seconds")
D. trigger(processingTime="5 seconds")
E. trigger(continuous="5 seconds")

Answer: D
Explanation:

trigger(processingTime="5 seconds")
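
Since the original code block is an image, here is a hedged sketch of an equivalent query with the blank filled in; the table names and checkpoint path are placeholders:

(spark.readStream
    .table("source_table")
    .writeStream
    # Micro-batch every 5 seconds via a processing-time trigger.
    .trigger(processingTime="5 seconds")
    .option("checkpointLocation", "/mnt/checkpoints/source_to_target")
    .outputMode("append")
    .toTable("target_table"))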

Question: 32 CertyIQ
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW
What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
B. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added
to the target dataset.
C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event
log.
D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
E. Records that violate the expectation cause the job to fail.

Answer: C

Explanation:

C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the
event log.

Invalid rows will be dropped as requested by the constraint and flagged as such in log files. If you need a
quarantine table, you'll have to write more code.

With the defined constraint and expectation clause, when a batch of data is processed, any records that
violate the expectation (in this case, where the timestamp is not greater than '2020-01-01') will be dropped
from the target dataset. These dropped records will also be recorded as invalid in the event log, allowing for
auditing and tracking of the data quality issues without causing the entire job to fail.

Reference:

https://docs.databricks.com/en/delta-live-tables/expectations.html

Question: 33 CertyIQ
Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE
INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT)
tables using SQL?

A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
C. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.
D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated
aggregations.
E. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.

Answer: B

Explanation:
B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally. The
CREATE STREAMING LIVE TABLE syntax is used to create tables that read data incrementally, while the
CREATE LIVE TABLE syntax is used to create tables that read data in batch mode. Delta Live Tables support
both streaming and batch modes of processing data. When the data is streamed and needs to be processed
incrementally, CREATE STREAMING LIVE TABLE should be used.
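
For illustration, the Python DLT counterparts of the two forms (a sketch; the table names are hypothetical): returning a streaming DataFrame produces a streaming (incremental) table, while returning a batch DataFrame produces a live table that is recomputed on each update.

import dlt

@dlt.table
def orders_bronze():
    # Streaming live table: only new data is processed on each update.
    return spark.readStream.table("raw_orders")

@dlt.table
def orders_by_status():
    # Live table: recomputed from the full input on each update.
    return dlt.read("orders_bronze").groupBy("status").count()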

Question: 34 CertyIQ
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also
used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data
engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only
ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?

A. Unity Catalog
B. Delta Lake
C. Databricks SQL
D. Data Explorer
E. Auto Loader

Answer: E

Explanation:

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any
additional setup.

Reference:

https://docs.databricks.com/en/ingestion/auto-loader/index.html

Question: 35 CertyIQ
Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?

A.

B.
C.

D.

E.

Answer: E

Explanation:

E is the right answer. The "gold layer" is used to store aggregated clean data, E is the only answer in which
aggregation is performed.

Question: 36 CertyIQ
A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop
invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in
the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.
Which of the following approaches can the data engineer take to identify the table that is dropping the records?

A. They can set up separate expectations for each table when developing their DLT pipeline.
B. They cannot determine which table is dropping the records.
C. They can set up DLT to notify them via email when records are dropped.
D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
E. They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.

Answer: D

Explanation:

D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

To identify the table in a Delta Live Tables (DLT) pipeline where data is being dropped due to quality concerns,
the data engineer can navigate to the DLT pipeline page, click on each table in the pipeline, and view the data
quality statistics. These statistics often include information about records dropped, violations of expectations,
and other data quality metrics. By examining the data quality statistics for each table in the pipeline, the data
engineer can determine at which table the data is being dropped.

Question: 37 CertyIQ
A data engineer has a single-task Job that runs each morning before they begin working. After identifying an
upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which of the following approaches can the data engineer use to set up the new task?

A. They can clone the existing task in the existing Job and update it to run the new notebook.
B. They can create a new task in the existing Job and then add it as a dependency of the original task.
C. They can create a new task in the existing Job and then add the original task as a dependency of the new
task.
D. They can create a new job from scratch and add both tasks to run concurrently.
E. They can clone the existing task to a new Job and then edit it to run the new notebook.

Answer: B

Explanation:

B. They can create a new task in the existing Job and then add it as a dependency of the original task. Adding
a new task as a dependency to an existing task in the same Job allows the new task to run before the original
task is executed. This ensures that the data engineer can run the new notebook prior to the original task
without having to create a new Job from scratch. Cloning the existing task or creating a new Job would add
unnecessary complexity to the pipeline.

Question: 38 CertyIQ
An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For
the first week following the project’s release, the manager wants the query results to be updated every minute.
However, the manager is concerned that the compute resources used for the query will be left running and cost
the organization a lot of money beyond the first week of the project’s release.
Which of the following approaches can the engineering team use to ensure the query does not cost the
organization any money beyond the first week of the project’s release?

A. They can set a limit to the number of DBUs that are consumed by the SQL Endpoint.
B. They can set the query’s refresh schedule to end after a certain number of refreshes.
C. They cannot ensure the query does not cost the organization money beyond the first week of the project’s
release.
D. They can set a limit to the number of individuals that are able to manage the query’s refresh schedule.
E. They can set the query’s refresh schedule to end on a certain date in the query scheduler.

Answer: E

Explanation:

The correct answer is E: they can set the query’s refresh schedule to end on a certain date in the query
scheduler. Databricks SQL supports a query scheduler that enables users to schedule SQL queries to run at
defined intervals. By default, scheduled queries run indefinitely. However, users can configure the scheduler
to stop running queries at a specific time or after a specific number of runs. In this scenario, the engineering
team can set the query's refresh schedule to end on a certain date, ensuring that the query does not run
beyond the first week of the project's release and potentially cost the organization more money.

Question: 39 CertyIQ
A data analysis team has noticed that their Databricks SQL queries are running too slowly when connected to their
always-on SQL endpoint. They claim that this issue is present when many members of the team are running small
queries simultaneously. They ask the data engineering team for help. The data engineering team notices that each
of the team’s queries uses the same SQL endpoint.
Which of the following approaches can the data engineering team use to improve the latency of the team’s
queries?

A. They can increase the cluster size of the SQL endpoint.
B. They can increase the maximum bound of the SQL endpoint’s scaling range.
C. They can turn on the Auto Stop feature for the SQL endpoint.
D. They can turn on the Serverless feature for the SQL endpoint.
E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to
“Reliability Optimized.”

Answer: B

Explanation:

They can increase the maximum bound of the SQL endpoint’s scaling range.

Thank you
Thank you for being so interested in the premium exam material.
I'm glad to hear that you found it informative and helpful.

But Wait

I wanted to let you know that there is more content available in the full version.
The full paper contains additional sections and information that you may find helpful,
and I encourage you to download it to get a more comprehensive and detailed view of
all the subject matter.

Download Full Version Now

Total: 130 Questions


Link: https://certyiq.com/papers/databricks/certified-data-engineer-associate
