Simplify your Streaming: Delta Live Tables
Housekeeping
● Your connection will be muted
● We will share the recording with all attendees after the session
● Submit questions in the Q&A panel
● If we do not answer your question during the event, we will follow up with you to get you the information you need!
What’s the problem with Data Engineering?
We know data is critical to business outcomes
[Diagram: business objectives, governance requirements, and analytics & AI all revolving around a central DATA hub]
● Business objectives: customer experience, product/service innovation, operational efficiency, revenue growth, self-service analytics, predictive analytics, data-driven decisions, digital modernization, data warehouse / lake convergence, data migrations
● Governance & compliance: HIPAA, BCBS 239, CCPA, GDPR
But there is complexity in data delivery…
[Diagram: a typical enterprise stack, with many point-to-point task flows between sources, ETL tools, and consumers]
● Data sources: streaming sources, cloud object stores, SaaS applications, NoSQL, relational databases, on-premises systems (unstructured, semi-structured, and structured data)
● A patchwork of ETL tools and task flows: home-grown ETL, code-generated ETL, Azure Data Factory, AWS EMR, AWS Glue, cloud data lakes
● Consumers: analytics (e.g., Azure Synapse), business insights, machine learning, data sharing
How does Databricks Help?
Lakehouse Platform
The Databricks Lakehouse Platform is the foundation for Data Engineering:
● Workloads: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML
● Unity Catalog: fine-grained governance for data and AI
● Delta Lake: data reliability and performance
● Cloud Data Lake: all structured and unstructured data
Delta Live Tables
The best way to do ETL on the lakehouse

CREATE STREAMING TABLE raw_data
AS SELECT *
FROM cloud_files("/raw_data", "json")

CREATE LIVE TABLE clean_data
AS SELECT …
FROM LIVE.raw_data

● Accelerate ETL development: declare SQL or Python and DLT automatically orchestrates the DAG, handles retries, and adapts to changing data
● Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
● Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
● Unify batch and streaming: get the simplicity of SQL with the freshness of streaming in one unified API
Build Production ETL Pipelines with DLT
[Diagram: a DLT pipeline on the Databricks Lakehouse Platform, running on Photon and governed by Unity Catalog]
● Continuous or scheduled ingest into a Bronze zone, data transformation and data quality checks into a Silver zone, and business-level aggregates in a Gold zone
● Continuous or batch processing, error handling and automatic recovery, data pipeline observability, automatic deployments and operations, and orchestration of data pipelines
● Feeds business insights, analytics, machine learning, and operational apps
Key Differentiators
Continuous or scheduled data ingestion
● Incrementally and efficiently process new
data files as they arrive in cloud storage
using Auto Loader
● Automatically infer schema of incoming files
or superimpose what you know with Schema
Hints
● Automatic schema evolution
● Rescued data column: never lose data again
● Schema evolution across file formats: JSON ✅ CSV ✅ AVRO ✅ PARQUET ✅ (see the ingestion sketch below)
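As a concrete sketch of these capabilities (the path, table name, and hint below are illustrative, not from this deck), an Auto Loader ingestion in DLT SQL could look like:

/* Hypothetical example: incrementally ingest JSON files with Auto Loader.
   "id BIGINT" is a schema hint; remaining columns are inferred, and
   values that do not match the schema land in the _rescued_data column. */
CREATE STREAMING TABLE orders_raw
AS SELECT *
FROM cloud_files(
  "/landing/orders",
  "json",
  map("cloudFiles.schemaHints", "id BIGINT")
)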
Declarative SQL & Python APIs

/* Source: create a streaming view on the accounts data */
CREATE STREAMING VIEW account_raw AS
SELECT * FROM cloud_files("/data", "csv");

/* Stage 1: Bronze table drops invalid rows */
CREATE STREAMING TABLE account_bronze
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM account_raw ...

/* Stage 2: send rows to Silver, run validation rules */
CREATE STREAMING TABLE account_silver
COMMENT "Silver accounts table with validation checks"
AS SELECT * FROM account_bronze ...

(The Gold stage follows the same pattern.)

● Use intent-driven declarative development to abstract away the “how” and define “what” to solve
● Automatically generate lineage based on table dependencies across the data pipeline
● Automatically check for errors, missing dependencies, and syntax errors
Change data capture (CDC)
[Diagram: streaming sources, cloud object stores, structured, semi-structured, and unstructured data, and data migration services upserted via CDC into Bronze and Silver tables]
● Stream change records (inserts, updates, deletes) from any data source supported by DBR, cloud storage, or DBFS
● Simple, declarative “APPLY CHANGES INTO” API for SQL or Python (sketched below)
● Handles out-of-order events
● Schema evolution
● SCD2 support
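A minimal sketch of the API (table and column names are hypothetical; the DELETE clause and SEQUENCE BY show how deletes and out-of-order events are declared):

/* Hypothetical example: upsert change records into a target table,
   ordering by event_ts so late-arriving changes apply correctly. */
APPLY CHANGES INTO LIVE.account_silver
FROM STREAM(LIVE.account_changes)
KEYS (account_id)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY event_ts
COLUMNS * EXCEPT (operation, event_ts)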
Data quality validation and monitoring

/* Stage 1: Bronze table drops invalid rows */
CREATE STREAMING LIVE TABLE fire_account_bronze (
  CONSTRAINT valid_account_open_dt EXPECT (account_open_dt IS NOT NULL
    AND (account_close_dt > account_open_dt)) ON VIOLATION DROP ROW
)
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM fire_account_raw ...

● Define data quality and integrity controls within the pipeline with data expectations
● Address data quality errors with flexible policies: fail, drop, alert, quarantine (future); see the sketch below
● All data pipeline runs and quality metrics are captured, tracked, and reported
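To illustrate the policy options side by side (constraint and column names are hypothetical), a single table can mix monitoring-only, drop, and fail behaviors:

/* Hypothetical example: three expectation policies on one table.
   - EXPECT alone records violations in metrics but keeps the rows
   - ON VIOLATION DROP ROW removes offending rows
   - ON VIOLATION FAIL UPDATE stops the pipeline update */
CREATE STREAMING LIVE TABLE account_silver (
  CONSTRAINT has_email        EXPECT (email IS NOT NULL),
  CONSTRAINT valid_balance    EXPECT (balance >= 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_account_id EXPECT (account_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.fire_account_bronze)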
Data pipeline observability
• High-quality, high-fidelity lineage diagram
that provides visibility into how data flows
for impact analysis
• Granular logging of the operational, governance, quality, and status details of the data pipeline, down to the row level (queryable from the event log, sketched below)
• Continuously monitor data pipeline jobs to
ensure continued operation
• Notifications using Databricks SQL
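These metrics are backed by the pipeline event log, a Delta table that can be queried directly; a sketch, assuming the default location under the pipeline's storage path (the placeholder is not from this deck):

/* Hypothetical example: inspect recent pipeline events, including
   data-quality metrics recorded in the details column. */
SELECT timestamp, event_type, message, details
FROM delta.`<storage-location>/system/events`
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC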
Automated ETL development lifecycle
[Diagram: the same raw → clean → scored pipeline promoted across Development, Staging, and Production environments, with lineage information captured and used to keep data fresh anywhere]
• Develop in environment(s) separate from production, with the ability to easily test before deploying, entirely in SQL
• Deploy and manage environments using parameterization (see the sketch below)
• Unit testing and documentation
• Enables metadata-driven ability to programmatically scale to 100s of tables/pipelines dynamically
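A sketch of parameterization (the key and paths are hypothetical): a pipeline configuration value is substituted into the SQL, so the same source code can point at dev, staging, or production data:

/* Hypothetical example: the pipeline setting "source.path" is
   "/dev/landing/orders" in development and "/prod/landing/orders"
   in production; the SQL itself never changes. */
CREATE STREAMING TABLE orders_raw
AS SELECT *
FROM cloud_files("${source.path}", "json")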
Automated ETL operations
• Reduce downtime with automatic error handling and easy replay
• Eliminate maintenance with automatic optimizations of all Delta Live Tables
• Auto-scaling adds more resources automatically when needed
Enhanced Autoscaling
Save infrastructure costs while maintaining end-to-end latency SLAs for streaming workloads
Problem: optimizing infrastructure spend when making scaling decisions for streaming workloads
[Diagram: backlog monitoring on the streaming source plus utilization monitoring on the Spark executors; with no or a small backlog and low utilization, the cluster scales down]
• Built to handle streaming workloads, which are spiky and unpredictable
• Shuts down nodes when utilization is low while guaranteeing task execution
• Only scales up to the needed number of nodes; see the settings sketch below
Availability: AWS Generally Available; Azure Generally Available; GCP Public Preview (GA coming soon)
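Enhanced autoscaling is selected per cluster in the pipeline settings; a minimal sketch (worker counts are illustrative):

{
  "clusters": [{
    "label": "default",
    "autoscale": {
      "min_workers": 1,
      "max_workers": 8,
      "mode": "ENHANCED"
    }
  }]
}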
Databricks Workflows
Unified orchestration for Delta Live Tables pipelines and more, built into the Lakehouse Platform
[Diagram: a workflow in which DLT Pipeline 1 (Orders) and a Sessions task feed DLT Pipeline 2, followed by Aggregate, Analyze, and Train tasks; Workflows sits alongside BI & Data Warehousing, Data Engineering, Data Streaming, and Data Science & ML, on Unity Catalog and Delta Lake]
• Simple workflow authoring for all data practitioners: easily orchestrate DLT pipelines and other tasks from inside the Databricks workspace (see the job sketch below); advanced users can use their favorite IDEs with full support for CI/CD
• Actionable insights from real-time monitoring: full visibility into every task in every workflow, with the health of all production workloads shown in real time and detailed metrics and analytics to identify, troubleshoot, and fix issues fast
• Proven reliability for production workloads: a fully managed orchestration service with serverless data processing and a history of 99.95% uptime, trusted by thousands of Databricks customers running millions of production workloads
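A workflow runs a DLT pipeline as a first-class task; a minimal sketch of a Jobs API job definition (names, task keys, IDs, and paths are placeholders):

{
  "name": "orders_workflow",
  "tasks": [
    {
      "task_key": "run_orders_pipeline",
      "pipeline_task": { "pipeline_id": "<pipeline-id>" }
    },
    {
      "task_key": "train_model",
      "depends_on": [{ "task_key": "run_orders_pipeline" }],
      "notebook_task": { "notebook_path": "/Repos/ml/train" }
    }
  ]
}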
Upcoming Features
Accelerate development
Improve time to insight with better navigation and data management tools
• Logical schema grouping: group tables into different schemas depending on their contents
• Data sample previews: easily preview data directly in the DLT console
• Notebook integration: run DLT pipelines and see update status from the notebook
Availability: AWS, Azure, and GCP in development (preview coming soon)
Change-data-capture (CDC) enhancements
Problem: it’s hard to ingest and track changes with CDC from streaming or batch sources

/* city_updates: incoming change records */
{"id": 1, "ts": 1, "city": "Bekerly, CA"}
{"id": 1, "ts": 2, "city": "Berkeley, CA"}

APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts
STORED AS SCD TYPE 2

/* resulting cities table */
id | city         | __starts_at | __ends_at
1  | Bekerly, CA  | 1           | 2
1  | Berkeley, CA | 2           | null

• CDC from full snapshot: perform change data capture from any source by providing full snapshots; supports both SCD type 1 and type 2
• SCD type 2 GA: track changes and retain the full history of values; when the value of an attribute changes, the current record is closed and a new record is created with the changed data values
Availability: CDC from full snapshot: preview on AWS, Azure, and GCP (GA coming soon); SCD type 2: preview on AWS, Azure, and GCP (GA coming soon)
Seamlessly add new streaming sources
Add new sources to a streaming table without a full refresh
[Diagram: streaming sources, cloud object stores, structured, semi-structured, and unstructured data, and data migration services all feeding a single streaming table]
• Declarative API to define sources in DLT (one possible shape is sketched below)
• Add a new streaming source without doing a full refresh
• Works with any streaming source supported by DLT
Availability: preview on AWS, Azure, and GCP (GA coming soon)
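The deck does not show the API itself; one plausible shape, assuming an append-flow style syntax and hypothetical table, topic, and path names, is:

/* Hypothetical sketch: all_events keeps its existing data while a new
   Kafka-backed source is added as an additional flow, so no full
   refresh of the streaming table is required. */
CREATE STREAMING TABLE all_events;

CREATE FLOW events_from_files AS
INSERT INTO all_events BY NAME
SELECT * FROM cloud_files("/landing/events", "json");

CREATE FLOW events_from_kafka AS
INSERT INTO all_events BY NAME
SELECT * FROM read_kafka(bootstrapServers => "host:9092", subscribe => "events");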
Simplify Operations
• Email notifications: configure detailed notifications at the individual table level for continuous and triggered pipelines (today's pipeline-level settings are sketched below)
• Improved errors: debug issues more quickly with detailed and accurate error reports in DLT
Availability: AWS, Azure, and GCP in development (preview coming soon)
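For comparison, pipeline-level email notifications can already be declared in the pipeline settings; a minimal sketch (the recipient address is a placeholder, and the alert names follow the documented settings shape):

{
  "notifications": [
    {
      "email_recipients": ["data-team@example.com"],
      "alerts": ["on-update-failure", "on-flow-failure"]
    }
  ]
}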
Enzyme: performance optimization for materialized views (MVs)
Improve end-to-end SLAs and reduce infrastructure costs by incrementally updating materialized views.
Problem: achieving near-real-time data freshness for streaming pipelines with large data volumes requires significant infrastructure spend or complex hand-coding.
[Diagram: a cost model combines Delta tracked changes with the optimal query plan to choose an update technique per update: monotonic append, partition recompute, MERGE updates, or full recompute]
Availability: private preview on AWS, Azure, and GCP
Data Samples
Easily preview data directly in the DLT console
• Single-click to view a selection of
rows for any table in your pipeline
• Easily perform ad-hoc queries with the Databricks SQL warehouse query editor
• View detailed schema information
Availability: AWS, Azure, and GCP in development (preview coming soon)
Thank you
©2021 Databricks Inc. — All rights reserved