Delta Live Tables for Data Engineering

Delta Live Tables (DLT) is a Databricks feature for building production ETL pipelines in a simplified way. Users declare their ETL workflows in SQL or Python, and DLT automatically orchestrates the DAG, handles retries, and adapts to changing data. It also manages infrastructure tasks such as recovery, auto-scaling, and performance optimization. DLT enforces high data quality and unifies batch and streaming processing behind a single API.

Simplify Your Streaming: Delta Live Tables

©2021 Databricks Inc. — All rights reserved


Housekeeping
● Your connection will be muted
● We will share recording with all attendees after the session
● Submit questions in the Q&A panel
● If we do not answer your question during the event, we will follow up with you to get you the information you need!



What’s the problem with Data Engineering?



We know data is critical to business outcomes
[Diagram: business objectives (customer experience, product/service innovation, operational efficiency, revenue growth, self-service analytics, predictive analytics, data-driven decisions, digital modernization, data warehouse/lake convergence, data migrations) all converge on DATA and Analytics & AI, governed by compliance requirements such as HIPAA, BCBS 239, CCPA, and GDPR.]
But there is complexity in the data delivery…
[Diagram: data sources (streaming sources, cloud object stores, SaaS applications, NoSQL, relational databases, on-premises systems) supply unstructured, semi-structured, and structured data that flows through a patchwork of home-grown ETL, Azure Data Factory, Azure Synapse, code-generated ETL, AWS EMR, AWS Glue, and cloud data lake task flows before reaching analytics, business insights, machine learning, and data sharing.]


How does Databricks Help?



Lakehouse Platform
The Databricks Lakehouse Platform is the foundation for Data Engineering.

● Workloads: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML
● Unity Catalog: fine-grained governance for data and AI
● Delta Lake: data reliability and performance
● Cloud Data Lake: all structured and unstructured data


Delta Live Tables
The best way to do ETL on the lakehouse

● Accelerate ETL development: declare SQL or Python and DLT automatically orchestrates the DAG, handles retries and changing data
● Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
● Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
● Unify batch and streaming: get the simplicity of SQL with the freshness of streaming in one unified API

CREATE STREAMING TABLE raw_data
AS SELECT *
FROM cloud_files("/raw_data", "json")

CREATE LIVE TABLE clean_data
AS SELECT …
FROM LIVE.raw_data


Build Production ETL Pipelines with DLT
[Diagram: on the Databricks Lakehouse Platform (Unity Catalog, Photon), continuous or scheduled ingest feeds Bronze, Silver, and Gold zones for data transformation, data quality, and business-level aggregates, with continuous or batch processing, error handling and automatic recovery, data pipeline observability, automatic deployments and operations, and data pipeline orchestration, serving business insights, analytics, machine learning, and operational apps.]


Key Differentiators



Continuous or scheduled data ingestion
● Incrementally and efficiently process new data files as they arrive in cloud storage using Auto Loader
● Automatically infer the schema of incoming files, or superimpose what you know with schema hints (sketched below)
● Automatic schema evolution
● Rescued data column: never lose data again

Schema evolution: ✅ JSON ✅ CSV ✅ AVRO ✅ PARQUET
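A minimal sketch of this ingestion pattern in DLT SQL, assuming a hypothetical landing path and schema-hint columns (cloudFiles.schemaHints and cloudFiles.schemaEvolutionMode are standard Auto Loader options; records that cannot be parsed against the schema are preserved in the _rescued_data column rather than lost):

CREATE STREAMING TABLE raw_orders
AS SELECT *
FROM cloud_files(
  "/landing/orders", "json",
  map("cloudFiles.schemaHints", "order_id BIGINT, order_ts TIMESTAMP",
      "cloudFiles.schemaEvolutionMode", "addNewColumns"))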


Declarative SQL & Python APIs

● Use intent-driven declarative development to abstract away the "how" and define "what" to solve
● Automatically generate lineage based on table dependencies across the data pipeline
● Automatically check for errors, missing dependencies, and syntax errors

/* Source: create a temp view on the accounts table */
CREATE STREAMING VIEW account_raw AS
SELECT * FROM cloud_files("/data", "csv");

/* Stage 1: Bronze table, drop invalid rows */
CREATE STREAMING TABLE account_bronze
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM LIVE.account_raw ...

/* Stage 2: send rows to Silver, run validation rules */
CREATE STREAMING TABLE account_silver
COMMENT "Silver accounts table with validation checks"
AS SELECT * FROM LIVE.account_bronze ...
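A Gold stage follows the same pattern; a minimal sketch, with hypothetical aggregation columns:

/* Stage 3: Gold table with business-level aggregates */
CREATE LIVE TABLE account_gold
COMMENT "Gold table with business-level aggregates"
AS SELECT region, count(*) AS account_count
FROM LIVE.account_silver
GROUP BY region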


Change data capture (CDC)

● Stream change records (inserts, updates, deletes) from any data source supported by DBR, cloud storage, or DBFS
● Simple, declarative APPLY CHANGES INTO API for SQL or Python (sketched below)
● Handles out-of-order events
● Schema evolution
● SCD2 support

[Diagram: streaming sources, cloud object stores, data migration services, and structured, semi-structured, and unstructured data are upserted via CDC into Bronze and Silver tables.]
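A minimal sketch of the APPLY CHANGES INTO pattern, assuming hypothetical table and column names: the target streaming table is declared first, then change records from the bronze stream are merged into it by key and ordered by a sequence column.

CREATE STREAMING TABLE accounts_silver;

APPLY CHANGES INTO LIVE.accounts_silver
FROM STREAM(LIVE.accounts_bronze)
KEYS (account_id)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY event_ts
STORED AS SCD TYPE 2;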


Data quality validation and monitoring

● Define data quality and integrity controls within the pipeline with data expectations
● Address data quality errors with flexible policies: fail, drop, alert, quarantine (future); see the sketch below
● All data pipeline runs and quality metrics are captured, tracked, and reported

/* Stage 1: Bronze table, drop invalid rows */
CREATE STREAMING LIVE TABLE fire_account_bronze (
  CONSTRAINT valid_account_open_dt EXPECT (account_open_dt IS NOT NULL AND (account_close_dt > account_open_dt)) ON VIOLATION DROP ROW
)
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM fire_account_raw ...
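The other policies can be sketched the same way (hypothetical column names): omitting the ON VIOLATION clause retains violating rows while still reporting them in the pipeline's quality metrics, and FAIL UPDATE stops the update when a violation occurs.

CREATE STREAMING LIVE TABLE fire_account_silver (
  CONSTRAINT valid_balance EXPECT (balance >= 0),
  CONSTRAINT valid_account_id EXPECT (account_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.fire_account_bronze)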


Data pipeline observability

• High-quality, high-fidelity lineage diagram that provides visibility into how data flows, for impact analysis
• Granular logging of operational, governance, quality, and status information for the data pipeline, down to the row level (see the query sketch below)
• Continuously monitor data pipeline jobs to ensure continued operation
• Notifications using Databricks SQL
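These logs are captured in the pipeline's event log, which is stored as a Delta table and can be queried directly. A minimal sketch, assuming a hypothetical pipeline storage location and the default system/events layout (the exact path and event_type values may vary by workspace and DLT version):

SELECT timestamp, event_type, details
FROM delta.`/pipelines/my-pipeline-storage/system/events`
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC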


Automated ETL development lifecycle

• Develop in environment(s) separate from production, with the ability to easily test before deploying, entirely in SQL
• Deploy and manage environments using parameterization (see the sketch below)
• Unit testing and documentation
• Enables a metadata-driven ability to programmatically scale to 100s of tables/pipelines dynamically

[Diagram: lineage information is captured and used to keep data fresh anywhere; the same raw, clean, and scored tables are promoted from Development to Staging to Production.]
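A minimal sketch of environment parameterization, assuming a hypothetical pipeline configuration key mypipeline.input_path that is set to a different path in each environment and substituted into the SQL at run time:

CREATE STREAMING TABLE raw_events
AS SELECT *
FROM cloud_files("${mypipeline.input_path}", "json")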


Automated ETL operations

• Reduce downtime with automatic error handling and easy replay
• Eliminate maintenance with automatic optimizations of all Delta Live Tables
• Auto-scaling adds more resources automatically when needed


Enhanced Autoscaling
Save infrastructure costs while maintaining end-to-end latency SLAs for streaming workloads

Problem: optimizing infrastructure spend when making scaling decisions for streaming workloads

• Built to handle streaming workloads, which are spiky and unpredictable
• Shuts down nodes when utilization is low while guaranteeing task execution
• Only scales up to the needed number of nodes

[Diagram: backlog monitoring of the streaming source and utilization monitoring of the Spark executors trigger scale-down when there is no or only a small backlog and utilization is low.]

Availability: AWS Generally Available | Azure Generally Available | GCP Public Preview (GA coming soon)


Databricks Workflows
Unified orchestration for Delta Live Tables pipelines and more

Simple workflow authoring for all data practitioners
Data practitioners can easily orchestrate DLT pipelines and other tasks from inside their Databricks workspace. Advanced users can use their favorite IDEs with full support for CI/CD.

Actionable insights from real-time monitoring
Full visibility into every task in every workflow. See the health of all your production workloads in real time with detailed metrics and analytics to identify, troubleshoot, and fix issues fast.

Proven reliability for production workloads
A fully managed orchestration service with serverless data processing and a history of 99.95% uptime. Trusted by thousands of Databricks customers running millions of production workloads.

[Diagram: on the Lakehouse Platform (Unity Catalog, Delta Lake), an example workflow chains DLT pipelines (Orders, Sessions) with tasks such as Aggregate, Analyze, and Train across Pipeline 1 and Pipeline 2.]
Upcoming Features



Accelerate development
Improve time to insight with improved navigation and data management tools

• Logical schema grouping: group tables into different schemas depending on their contents
• Data sample previews: easily preview data directly in the DLT console
• Notebook integration: run DLT pipelines and see pipeline update status from the notebook

Availability: AWS In development | Azure In development | GCP In development (Preview coming soon)


Change-data-capture (CDC) enhancements
Problem: it is hard to ingest and track changes with CDC from streaming or batch sources

• CDC from Full Snapshot: perform change data capture from any source by providing full snapshots. Supports both SCD type 1 and type 2.
• SCD type 2 GA: track changes and retain the full history of values. When the value of an attribute changes, the current record is closed and a new record is created with the changed data values.

Example: applying city_updates change records to a cities table with SCD type 2:

city_updates
{"id": 1, "ts": 1, "city": "Bekerly, CA"}
{"id": 1, "ts": 2, "city": "Berkeley, CA"}

cities
id | city         | __starts_at | __ends_at
1  | Bekerly, CA  | 1           | 2
1  | Berkeley, CA | 2           | null

APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts
STORED AS SCD TYPE 2

Availability: CDC from Full Snapshot: AWS Preview, Azure Preview, GCP Preview (GA coming soon); CDC SCD type 2: AWS Preview, Azure Preview, GCP Preview (GA coming soon)
Seamlessly add new streaming sources
Add new sources to a streaming table without a full refresh

• Declarative API to define sources in DLT
• Add a new streaming source without doing a full refresh
• Works with any streaming source supported by DLT

[Diagram: streaming sources, cloud object stores, data migration services, and structured, semi-structured, and unstructured data feeding a single streaming table.]

Availability: AWS Preview | Azure Preview | GCP Preview (GA coming soon)


Simplify Operations

• Email notifications: configure detailed notifications at the individual table level for continuous and triggered pipelines
• Improved errors: debug issues more quickly with detailed and accurate error reports in DLT

Availability: AWS In development | Azure In development | GCP In development (Preview coming soon)


Enzyme performance optimization for MVs
Improve end-to-end SLAs and reduce infrastructure costs by incrementally updating materialized views.

Problem: achieving near real-time data freshness for streaming pipelines with large data volumes requires significant infrastructure spend or complex hand-coding.

[Diagram: using Delta tracked changes, a cost model, and the optimal query plan, the update analysis chooses an update technique per materialized view: monotonic append, partition recompute, MERGE updates, or full recompute.]

Availability: AWS Private Preview | Azure Private Preview | GCP Private Preview
Data Samples
Easily preview data directly in the DLT console

• Single click to view a selection of rows for any table in your pipeline
• Easily perform ad hoc queries with the Databricks SQL warehouse query editor
• View detailed schema information

Availability: AWS In development | Azure In development | GCP In development (Preview coming soon)


Thank you
