Databricks Notes and Questions

The document provides an overview of Databricks, a cloud data platform that integrates data engineering, analytics, and machine learning using Apache Spark. It introduces the Lakehouse architecture, which combines the benefits of data lakes and warehouses, and details Delta Lake's features, including ACID transactions and time travel. Additionally, it covers various tools and best practices for data ingestion, processing, and governance within Databricks, along with practical examples and exam preparation questions.


1 Fundamentals: Lakehouse, Databricks, and Architecture

What is Databricks

Databricks is a cloud data platform built around Apache Spark that provides a unified
environment for data engineering, analytics, and machine learning. It combines compute,
storage, and management features with collaborative tools (notebooks, repos) and
governance (Unity Catalog).

Lakehouse concept

A Lakehouse merges the flexibility and low cost of data lakes with the transactional
guarantees and performance features of data warehouses.

Key properties

• Single storage layer for raw, curated, and aggregated data.

• ACID transactions and time travel for reliable reads/writes.

• Schema enforcement and schema evolution to manage data quality.

• Support for BI and ML without copying data between systems.

Control plane vs Data plane

• Control plane: Databricks-managed services (UI, job orchestration, metadata). Not stored in your cloud account.

• Data plane: Customer-managed cloud resources (object storage such as S3/ADLS/GCS, VMs, cluster compute). Your data and compute run here.

Exam fact: Data (and compute VMs) live in the customer cloud account; the Databricks
application runs in the control plane.

2 Delta Lake — Transactional Storage Layer

Core concepts

• Files + transaction log: Delta stores data as Parquet files and maintains a transaction log (_delta_log) that records atomic operations.

• ACID transactions: Ensures consistent reads/writes even with concurrent operations.

• Time travel: Query previous table versions by version number or timestamp.

• Schema enforcement: Rejects writes that violate the table schema.

• Schema evolution: Allows safe addition of columns or compatible type changes.

Common operations

• Create table

sql

CREATE OR REPLACE TABLE employees (
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT
) USING DELTA;

• Time travel

sql

SELECT * FROM sales VERSION AS OF 5;

SELECT * FROM sales TIMESTAMP AS OF '2025-01-01';

• Optimize and Z-order

sql

OPTIMIZE sales;

OPTIMIZE sales ZORDER BY (customer_id, date);

o OPTIMIZE compacts small files into larger ones.

o ZORDER BY reorders data files to colocate rows with similar values for
faster predicate queries.

• VACUUM

sql

VACUUM sales RETAIN 168 HOURS;

o Removes stale files older than the retention period. Caution: VACUUM can permanently delete data needed for time travel.

Best practices

• Partition large tables carefully; avoid high-cardinality partition columns that produce many tiny partitions (see the sketch after this list).

• Run OPTIMIZE on frequently queried tables with many small files.

• Use ZORDER on columns used in filters and joins.

• Keep VACUUM retention aligned with your time travel needs; default safety
retention exists to prevent accidental data loss.
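As an illustration of these practices, here is a minimal PySpark sketch that writes a partitioned Delta table, compacts it, and vacuums it. The table, column names, and data are hypothetical, assuming a Databricks notebook where `spark` is available.

python

# Minimal sketch (assumed table/column names).
from pyspark.sql import functions as F

df = (
    spark.range(0, 100_000)
    .select(
        F.col("id").alias("customer_id"),
        (F.col("id") % 30).cast("int").alias("day_offset"),
        (F.rand() * 100).alias("amount"),
    )
    .withColumn("event_date", F.expr("date_add('2025-01-01', day_offset)"))
)

# Partition on a low-cardinality column (date), not on customer_id.
(df.write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("sales_bronze"))

# Compact small files and colocate rows on a frequently filtered column.
spark.sql("OPTIMIZE sales_bronze ZORDER BY (customer_id)")

# Keep retention aligned with time travel needs (168 hours = 7 days).
spark.sql("VACUUM sales_bronze RETAIN 168 HOURS")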

3 Databricks Workspace & Development Tools

Notebooks

• Support Python, SQL, Scala, R.

• Use magic commands: %sql, %python, %scala, %r.

• Built-in visualizations and basic version history.


Repos vs Notebook versioning

• Repos: Git-based; support multiple branches, external IDE workflows, and code reviews.

Exam tip: Repos advantage → multiple branches.

Databricks Connect

• Run code from local IDE (VS Code, PyCharm) against remote Databricks clusters.

• Enables local debugging while using remote compute.
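A minimal sketch of targeting a remote cluster from a local IDE with Databricks Connect (v2, Runtime 13+); exact builder options vary by version, and the host, token, and cluster ID below are placeholders.

python

# Assumes `pip install databricks-connect` matching the cluster's runtime.
from databricks.connect import DatabricksSession

spark = (
    DatabricksSession.builder
    .remote(
        host="https://<workspace-url>",    # placeholder
        token="<personal-access-token>",   # placeholder
        cluster_id="<cluster-id>",         # placeholder
    )
    .getOrCreate()
)

# This DataFrame is evaluated on the remote cluster, not locally.
spark.range(10).show()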

4 Ingestion: Auto Loader and Streaming

Auto Loader overview

Auto Loader (format("cloudFiles")) is Databricks’ recommended incremental file ingestion mechanism for cloud object storage.

Modes

• Directory listing: Polls directories for new files. Simpler but less scalable.

• File notification: Uses cloud event systems (SQS/Event Grid/PubSub) for high
scale and low latency.

Basic streaming example (PySpark)

python

stream_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
        .load("/mnt/raw/input")
)

Key options

• cloudFiles.format — file format (json, csv, parquet).

• cloudFiles.schemaLocation — checkpoint path for schema inference and evolution.

• cloudFiles.useNotifications / cloudFiles.useManagedFileEvents — enable file notification.

Schema management
• Schema enforcement: prevents writes that violate schema.

• Schema evolution: allow new columns to be added with options like mergeSchema or Auto Loader’s schema evolution settings (see the sketch below).
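For batch writes to Delta, schema evolution is typically enabled per write with mergeSchema; a minimal sketch (the events table and its columns are hypothetical):

python

# Append a DataFrame that has an extra column to an existing Delta table,
# allowing the new column to be added (hypothetical table `events`).
new_batch = spark.createDataFrame(
    [("e1", 42, "mobile")],
    "event_id STRING, value INT, channel STRING",  # `channel` is the new column
)

(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # evolve the table schema on write
    .saveAsTable("events"))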

Best practices

• Use file notification for production ingestion at scale.

• Configure checkpointing for exactly-once semantics.

• Grant appropriate cloud permissions for event notifications.
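Putting these pieces together, a minimal end-to-end Auto Loader sketch with checkpointing might look like this (paths and the target table name are placeholders):

python

# Auto Loader read plus a checkpointed streaming write (placeholder paths/table).
stream_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
        .load("/mnt/raw/input")
)

(stream_df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_input")  # progress/offsets for recovery
    .trigger(availableNow=True)   # process all available files, then stop
    .toTable("bronze_input"))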

5 Data Processing: Medallion Architecture & Delta Live Tables

Medallion architecture

• Bronze: Raw ingested data (unprocessed).

• Silver: Cleaned, deduplicated, standardized data.

• Gold: Business-level aggregates and curated datasets for analytics.

Delta Live Tables (DLT)

DLT is a declarative framework for building reliable ETL pipelines with built-in monitoring
and data quality.

Key constructs

• CREATE LIVE TABLE <name> AS SELECT ... — creates a managed table.

• CREATE LIVE VIEW <name> AS SELECT ... — logical view.

• Expectations: data quality rules with actions on violation (DROP ROW, FAIL UPDATE, or the default warn behavior that records violations but keeps the rows).

DLT SQL example

sql

CREATE OR REFRESH STREAMING LIVE TABLE bronze_sales AS
SELECT * FROM cloud_files('/mnt/raw/sales', 'json');

CREATE LIVE TABLE silver_sales

TBLPROPERTIES ("quality" = "silver")

AS

SELECT CAST(id AS STRING) AS id, amount

FROM LIVE.bronze_sales;

DLT expectation example (SQL)

sql
CREATE LIVE TABLE customers_silver (
  CONSTRAINT valid_email EXPECT (email LIKE '%@%') ON VIOLATION DROP ROW
)
AS SELECT * FROM LIVE.customers_bronze;

DLT Python example

python

import dlt

@dlt.table
@dlt.expect("valid_amount", "amount >= 0")
def silver_sales():
    return dlt.read("bronze_sales").select("id", "amount")

Benefits

• Automatic dependency tracking and lineage.

• Built-in monitoring and metrics for data quality.

• Simplifies pipeline maintenance and observability.

6 PySpark Transformations & Patterns

Common operations

• Filtering

python

from pyspark.sql import functions as F

df.filter(F.col("amount") > 100)

• Aggregations

python

df.groupBy("customer").agg(F.sum("amount").alias("total"))

• Window functions

python

from pyspark.sql.window import Window

w = Window.partitionBy("cust").orderBy("date")

df.withColumn("rn", F.row_number().over(w))

• Explode

python

df.withColumn("item", F.explode("items"))

Performance tips
• Push filters early to reduce data shuffled.

• Repartition before heavy shuffles when necessary.

• Cache intermediate results reused multiple times.

• Avoid wide transformations on very large datasets without partitioning.
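A small, self-contained sketch tying these patterns together (the column names and threshold are illustrative):

python

# Filter early, reuse a cached intermediate result, then aggregate and rank.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

orders = spark.createDataFrame(
    [("c1", "2025-01-01", 120.0), ("c1", "2025-01-02", 80.0), ("c2", "2025-01-01", 300.0)],
    "customer STRING, date STRING, amount DOUBLE",
)

# Filter as early as possible so less data is shuffled downstream.
large_orders = orders.filter(F.col("amount") > 100).cache()

totals = large_orders.groupBy("customer").agg(F.sum("amount").alias("total"))

w = Window.partitionBy("customer").orderBy(F.col("date"))
ranked = large_orders.withColumn("rn", F.row_number().over(w))

totals.show()
ranked.show()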

7 DDL & DML for Delta Tables

DDL examples

• Create table

sql

CREATE OR REPLACE TABLE employees (
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT
) USING DELTA;

• Create if not exists

sql

CREATE TABLE IF NOT EXISTS t (id STRING) USING DELTA;

• CTAS

sql

CREATE TABLE gold.daily AS
SELECT date, SUM(amount) AS total_amount FROM silver.sales GROUP BY date;

DML examples

• Insert

sql

INSERT INTO my_table VALUES ('a1', 6, 9.4);

• Update

sql

UPDATE t SET value = 10 WHERE id = 'a1';

• Delete

sql

DELETE FROM t WHERE value < 0;

• Merge (Upsert)
sql

MERGE INTO target t

USING source s

ON t.id = s.id

WHEN MATCHED THEN UPDATE SET t.val = s.val

WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);
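The same upsert can be expressed with the Delta Lake Python API; a minimal sketch, assuming a Delta table named target and a DataFrame source_df with matching id and val columns:

python

# Programmatic upsert with the delta-spark Python API (assumed `target` table
# and `source_df` DataFrame).
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target")

(target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdate(set={"val": "s.val"})
    .whenNotMatchedInsert(values={"id": "s.id", "val": "s.val"})
    .execute())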

8 Productionizing: Jobs, DAB, and CI/CD

Jobs & Workflows

• Tasks: notebooks, SQL, DLT pipelines, Python scripts.

• Features: DAG dependencies, schedules, retries, parameters, alerts.

• Repair: rerun only failed tasks (repair job run).

Databricks Asset Bundles (DAB)

• DAB: Infrastructure-as-code for Databricks assets (jobs, pipelines, repos).

• Use DAB to define assets declaratively and deploy via CI/CD pipelines.

Serverless Job Compute

• Fully managed compute for jobs with autoscaling and minimal cluster
management.

9 Optimization & Debugging

Spark UI

• Use Spark UI to inspect stages, tasks, shuffle, and executor metrics.

• Identify skewed partitions, long-running tasks, and high shuffle costs.

Fixes

• Repartition to balance partitions.

• Broadcast joins for small dimension tables.

• OPTIMIZE and ZORDER for Delta tables to reduce IO.
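A minimal sketch of two of these fixes in PySpark; the fact and dim DataFrames are hypothetical:

python

# Hint a broadcast join for a small dimension table and enable adaptive
# query execution (hypothetical `fact` and `dim` DataFrames).
from pyspark.sql import functions as F

spark.conf.set("spark.sql.adaptive.enabled", "true")  # runtime shuffle/join tuning

fact = spark.range(0, 100_000).withColumn("dim_id", (F.col("id") % 100).cast("int"))
dim = spark.createDataFrame([(i, f"name_{i}") for i in range(100)], "dim_id INT, name STRING")

# Broadcasting the small side avoids shuffling the large fact table.
joined = fact.join(F.broadcast(dim), "dim_id")
joined.groupBy("name").count().show()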

10 Governance, Unity Catalog, and Sharing

Unity Catalog (UC)

• Centralized governance layer with hierarchy: Metastore → Catalog → Schema → Table.

• Roles: Account Admin, Metastore Admin, Catalog Owner, Schema Owner, Table
Owner.

• Privileges: SELECT, MODIFY, CREATE, OWNERSHIP, USE CATALOG, USE SCHEMA.


• Catalog Explorer: UI to inspect permissions and lineage.

Managed vs External tables

• Managed table: Databricks controls data location; DROP TABLE removes data and
metadata.

• External table: Data stored at explicit LOCATION; DROP TABLE removes metadata
only.
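A short sketch of the difference using spark.sql; the catalog/schema names and the external LOCATION path are placeholders:

python

# Managed vs. external Delta tables (placeholder names and path).
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.analytics.managed_events (id STRING, ts TIMESTAMP)
  USING DELTA
""")  # managed: Databricks controls the location; DROP TABLE deletes data and metadata

spark.sql("""
  CREATE TABLE IF NOT EXISTS main.analytics.external_events (id STRING, ts TIMESTAMP)
  USING DELTA
  LOCATION 'abfss://container@account.dfs.core.windows.net/events'
""")  # external: data stays at the given path; DROP TABLE removes metadata only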

Delta Sharing

• Open protocol for secure, no-copy data sharing.

• Types: Databricks-to-Databricks and external recipients.

• Consider egress costs when sharing across clouds/regions.

Lakehouse Federation

• Query external systems (Snowflake, BigQuery, PostgreSQL) through Unity Catalog.

• Benefits: unified access without moving data; tradeoffs: latency and remote
execution semantics.

11 Data Quality & Constraints

DLT Expectations

• Define rules with actions: DROP ROW, FAIL UPDATE, or warn (the default, which records violations but keeps the rows).

• Monitor metrics for dropped/failed rows.

Delta Constraints

sql

CREATE TABLE accounts (
  id STRING,
  balance DOUBLE
) USING DELTA;

ALTER TABLE accounts ADD CONSTRAINT positive_balance CHECK (balance >= 0);

12 Hands-On Labs (high-level steps)

1. Create a Delta table

o Create a small Parquet dataset, write it as Delta, run DESCRIBE HISTORY, and query time travel (see the sketch after this list).

2. Auto Loader ingestion

o Configure Auto Loader to ingest JSON files from a mounted cloud storage
path; enable notifications and checkpointing.

3. DLT pipeline
o Build a simple DLT pipeline: bronze ingestion → silver cleaning with
expectations → gold aggregation.

4. Optimize & ZORDER

o Insert many small files into a Delta table, run OPTIMIZE and OPTIMIZE ...
ZORDER BY (...), measure query times before/after.

5. Unity Catalog

o Create a catalog/schema/table (if you have permissions), grant SELECT to a group, and inspect via Catalog Explorer.
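For lab 1, a minimal sketch of the flow (the table name lab_events is hypothetical):

python

# Write a small dataset as Delta, inspect history, query an earlier version.
df_v0 = spark.createDataFrame([(1, "a"), (2, "b")], "id INT, label STRING")
df_v0.write.format("delta").mode("overwrite").saveAsTable("lab_events")

# A second write creates a new table version.
df_v1 = spark.createDataFrame([(3, "c")], "id INT, label STRING")
df_v1.write.format("delta").mode("append").saveAsTable("lab_events")

spark.sql("DESCRIBE HISTORY lab_events").show(truncate=False)
spark.sql("SELECT * FROM lab_events VERSION AS OF 0").show()   # time travel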

13 100 Practice Questions (with answers)

Below are 100 practice questions covering the exam scope. Answers are provided
immediately after each question.

1. What is a Lakehouse? Answer: A unified architecture combining data lake flexibility with data warehouse features.

2. Which file format underlies Delta tables? Answer: Parquet.

3. What does the Delta transaction log folder _delta_log contain? Answer:
JSON/Parquet transaction files recording commits and metadata.

4. How do you query a Delta table at a previous version? Answer: SELECT * FROM
table VERSION AS OF <n>;

5. What command compacts small files in Delta? Answer: OPTIMIZE.

6. What does ZORDER BY do? Answer: Reorders data files to colocate similar values
for faster predicate queries.

7. What is the risk of running VACUUM with a short retention? Answer: Permanently deleting files needed for time travel.

8. Which compute type is best for development notebooks? Answer: All-Purpose (Interactive) clusters.

9. Which compute type is best for many short jobs requiring fast startup?
Answer: Cluster Pools.

10. Which compute type is fully managed and autoscaling? Answer: Serverless
Compute.

11. What is Auto Loader used for? Answer: Incremental ingestion of files from cloud
storage.

12. Name the two Auto Loader discovery modes. Answer: Directory listing and file
notification.

13. Which Auto Loader mode scales to millions of files per hour? Answer: File
notification.

14. What option sets the file format for Auto Loader? Answer: cloudFiles.format.
15. Where should Auto Loader store schema inference metadata? Answer: In a
checkpoint/schema location (cloudFiles.schemaLocation).

16. What is Databricks Repos used for? Answer: Git integration and multi-branch
development.

17. What is the advantage of Repos over notebook versioning? Answer: Supports
multiple Git branches.

18. What is Databricks Connect? Answer: A tool to run code from a local IDE on
remote Databricks clusters.

19. What are the three Medallion layers? Answer: Bronze, Silver, Gold.

20. Which Medallion layer contains cleaned, deduplicated data? Answer: Silver.

21. Which Medallion layer contains raw ingested data? Answer: Bronze.

22. Which Medallion layer contains business aggregates? Answer: Gold.

23. What is Delta Live Tables (DLT)? Answer: A declarative framework for building
ETL pipelines with monitoring and data quality.

24. How do you define a DLT table in SQL? Answer: CREATE LIVE TABLE <name> AS
SELECT ...

25. What are DLT expectations? Answer: Data quality rules that can drop rows, fail the update, or warn on violations.

26. Give a DLT expectation action that drops invalid rows. Answer: ON VIOLATION
DROP ROW.

27. How do you create a Delta table if it may already exist? Answer: CREATE OR
REPLACE TABLE ... USING DELTA;

28. How do you create a table only if it does not exist? Answer: CREATE TABLE IF
NOT EXISTS ...

29. Which SQL statement performs an upsert? Answer: MERGE.

30. How do you append a single row to a table? Answer: INSERT INTO my_table
VALUES (...)

31. What is the purpose of DESCRIBE HISTORY? Answer: View transaction history
and versions of a Delta table.

32. What is the purpose of DESCRIBE DETAIL? Answer: Show table metadata and
storage details.

33. What is a managed table? Answer: Databricks controls the data location;
dropping the table removes data and metadata.

34. What is an external table? Answer: Data stored at an explicit location; dropping
the table removes metadata only.

35. Where do you check table permissions in Databricks? Answer: Catalog Explorer.
36. Name two common privileges in Unity Catalog. Answer: SELECT, MODIFY (also
CREATE, OWNERSHIP).

37. What role manages the Unity Catalog metastore? Answer: Metastore Admin.

38. What role controls catalog-level permissions? Answer: Catalog Owner.

39. What is Delta Sharing? Answer: An open protocol to share data securely without
copying.

40. What cost should you consider when sharing across clouds? Answer: Egress
fees.

41. What is Lakehouse Federation? Answer: Query external systems through Unity
Catalog as if they were local.

42. Name a common external system used with federation. Answer: Snowflake,
BigQuery, PostgreSQL, or MySQL.

43. What is the Spark UI used for? Answer: Performance debugging (stages, tasks,
shuffle, skew).

44. What is a common fix for skewed partitions? Answer: Repartition or salting.

45. When should you cache a DataFrame? Answer: When it is reused multiple times
in a job.

46. What is predicate pushdown? Answer: Pushing filters to the data source to
reduce data read.

47. What is a broadcast join? Answer: A join strategy that sends a small table to all
executors to avoid shuffle.

48. When is a broadcast join appropriate? Answer: When one table is small enough
to fit in memory.

49. What is the purpose of cluster pools? Answer: Keep pre-warmed VMs to reduce
cluster startup time.

50. What is a job cluster? Answer: A cluster created for a job run and terminated
afterward.

51. What is an all-purpose cluster? Answer: A long-lived cluster for interactive use.

52. What is serverless compute best used for? Answer: SQL warehouses and
simple pipelines with minimal infra management.

53. What is Databricks Asset Bundles (DAB)? Answer: Infrastructure-as-code for Databricks assets to enable reproducible deployments.

54. Why use DAB? Answer: For version-controlled, automated CI/CD deployments.

55. What is a repair job run? Answer: Rerun only failed tasks in a job run.

56. What is the default behavior of schema enforcement? Answer: Reject writes
that violate the schema.
57. How do you allow schema evolution on write? Answer: Use options like
mergeSchema or Auto Loader schema evolution settings.

58. What is the effect of OPTIMIZE on small files? Answer: Compacts them into
larger files to improve read performance.

59. What is the effect of ZORDER on query performance? Answer: Improves performance for queries filtering on ZORDER columns by reducing IO.

60. What is the recommended place to store audit logs? Answer: Customer cloud
storage (S3/ADLS/GCS).

61. What is lineage used for? Answer: Impact analysis, compliance, and debugging.

62. What is the difference between CREATE LIVE TABLE and CREATE LIVE VIEW?
Answer: Live table is a physical managed table; live view is a logical view.

63. What DLT property marks a table as silver? Answer: TBLPROPERTIES ("quality" =
"silver").

64. What are DLT metrics used for? Answer: Monitoring row counts, expectation
failures, and pipeline health.

65. What is the correct Python conditional to check two conditions? Answer: if
day_of_week == 1 and review_period:

66. What is the purpose of MERGE in Delta? Answer: Perform upserts (update
existing rows, insert new rows).

67. What is the recommended way to handle many small files from ingestion?
Answer: Use OPTIMIZE and consider batching writes.

68. What is the effect of VACUUM on time travel? Answer: Removes old files and
limits time travel to retained versions.

69. How do you view the current schema of a Delta table? Answer: DESCRIBE
TABLE <table> or DESCRIBE DETAIL <table>.

70. What is a checkpoint in streaming? Answer: A location storing progress and offsets for fault tolerance.

71. Why is checkpointing required for streaming? Answer: To ensure exactly-once processing and recovery after failures.

72. What is the difference between append and overwrite modes? Answer:
Append adds data; overwrite replaces existing data.

73. How do you write a DataFrame to Delta in append mode? Answer: df.write.format("delta").mode("append").save(path)

74. How do you write a DataFrame to Delta and create a table? Answer:
df.write.format("delta").saveAsTable("schema.table")

75. What is the purpose of mergeSchema? Answer: Allow adding new columns
during write.
76. What is a common cause of long shuffle times? Answer: Skewed keys or
insufficient partitioning.

77. What is the recommended action when a join causes OOM? Answer: Use
broadcast join if appropriate or increase shuffle partitions.

78. What is spark.sql.shuffle.partitions used for? Answer: Controls the number of shuffle partitions.

79. What is the effect of too many partitions? Answer: Small tasks overhead and
poor throughput.

80. What is the effect of too few partitions? Answer: Long-running tasks and poor
parallelism.

81. What is the recommended way to handle slowly changing dimensions? Answer: Use MERGE with appropriate logic or maintain SCD tables.

82. What is the purpose of EXPLAIN in Spark SQL? Answer: Show the physical and
logical plan for query optimization.

83. What is a broadcast variable? Answer: A read-only variable cached on executors to avoid shipping large data repeatedly.

84. What is the difference between cache() and persist()? Answer: cache() is
shorthand for persist() with default storage level; persist() allows specifying
storage level.

85. What is the effect of coalesce(n)? Answer: Reduce number of partitions without
full shuffle.

86. What is the effect of repartition(n)? Answer: Repartition with a full shuffle to
evenly distribute data.

87. What is a watermark in streaming? Answer: A threshold to handle late data and
state cleanup.

88. Why use watermarking? Answer: To bound state size and handle late-arriving
events.

89. What is the difference between micro-batch and continuous streaming? Answer: Micro-batch processes data in small batches; continuous aims for lower latency (limited support).

90. What is the recommended way to test DLT pipelines locally? Answer: Use
small sample datasets and DLT development mode or unit tests.

91. What is the purpose of GRANT SELECT ON TABLE? Answer: Give SELECT privilege
to a user or group.

92. What is OWNERSHIP privilege? Answer: Full control over an object, including
granting privileges.

93. What is the recommended way to share data with external partners? Answer:
Use Delta Sharing.
94. What is the main benefit of Delta Sharing? Answer: Share data without copying
it.

95. What is the main tradeoff when using Lakehouse Federation? Answer: Latency
and egress costs vs no data movement.

96. What is the recommended approach for CI/CD with Databricks? Answer: Use
Databricks Asset Bundles (DAB) or infrastructure-as-code and Git-based
pipelines.

97. What is the purpose of a repair run in Jobs? Answer: Re-run only failed tasks to
save time and resources.

98. What is the recommended way to handle secrets in Databricks? Answer: Use
Databricks Secrets (secret scopes) or cloud provider secret stores.

99. What is the effect of SET spark.sql.adaptive.enabled=true? Answer: Enables adaptive query execution to optimize shuffle partitions and join strategies at runtime.

100. What is the best way to prepare for the Databricks Data Engineer
Associate exam? Answer: Hands-on practice with Delta tables, Auto Loader, DLT,
PySpark transformations, Jobs, Unity Catalog, and timed practice exams.

14 Short explanations for selected answers (common pitfalls)

• VACUUM caution: VACUUM deletes files; if you set retention too low you lose time
travel history.

• ZORDER cost: ZORDER improves reads but increases write cost; use on frequently
filtered columns.

• Auto Loader notifications: Requires cloud event permissions and setup; directory
listing is simpler but less scalable.

• DLT expectations: Choose action (DROP ROW vs FAIL UPDATE) based on whether
you want to preserve or reject bad data.

• Cluster Pools: Useful when many short jobs create and destroy clusters
frequently; pools reduce startup latency.

15 Study plan & next steps (concise)

• Week 1: Lakehouse & Delta fundamentals; create and query Delta tables; time
travel; OPTIMIZE/VACUUM.

• Week 2: Auto Loader and streaming; build ingestion pipelines; checkpointing and
schema evolution.

• Week 3: PySpark transformations, window functions, joins, and performance tuning; Spark UI practice.

• Week 4: DLT pipelines, Unity Catalog basics, Jobs & Workflows, DAB overview;
take multiple timed practice exams.
