Introduction to SQL on Apache Flink®
Flink SQL Training
https://github.com/ververica/sql-training
© 2019 Ververica
Apache Flink is a Distributed Data Processing System
© 2019 Ververica
Scalable and Consistent Data Processing
● Flexible and expressive APIs
● Guaranteed correctness
○ Exactly-once state consistency
○ Event-time semantics
● In-memory processing at massive scale
○ Runs on 10000s of cores
○ Manages 10s TBs of state
© 2019 Ververica
Powered By Apache Flink
Details about their use cases and more users are listed on Flink’s website at https://flink.apache.org/poweredby.html
Also check out the Flink Forward YouTube channel with more than 350 recorded talks at https://www.youtube.com/channel/UCY8_lgiZLZErZPF47a2hXMA
© 2019 Ververica
Why SQL for Stream Processing?
• Implementing Flink stream processing apps requires special skills
○ Java/Scala experience
○ In-depth knowledge of streaming concepts like time and state
○ Knowledge of distributed data processing
• Everybody knows and uses SQL
• SQL queries are optimized and efficiently executed
• Unified syntax and semantics for batch & streaming data
© 2019 Ververica
Flink SQL in a Nutshell
A standard-compliant SQL service
to query static and streaming data alike
that leverages the performance, scalability, and consistency
of Apache Flink.
6 © 2019 Ververica
How is streaming SQL different from traditional SQL?
● Basically all tables that are processed with SQL queries change over time
○ Transactions from applications
○ Bulk inserts from ETL processes
● Traditional processors run SQL queries on static snapshots of the tables
○ The query input is finite
○ The query result is final and finite
● Stream SQL processors run continuous queries on changing (dynamic) tables
○ The query input is unbounded
○ The query result is never final, continuously updated, and potentially unbounded
● The semantics of a query are the same regardless whether it is executed one-
time on a table snapshot or continuously on a changing table
7 © 2019 Ververica
Running a One-time Query on a Changing Table
Take a snapshot when the A final result is
query starts produced
user cTime url user cnt
SELECT
Mary 12:00:00 https://… user, Mary 2
COUNT(url) as cnt
Bob 12:00:00 https://… Bob 1
FROM clicks
Mary 12:00:02 https://… GROUP BY user
Liz 12:00:03 https://…
The query
A row that was added after the query terminates
8 © 2019 Ververica was started is not considered
Running a Continuous Query on a Changing Table
Ingest all changes Continuously update
as they happen the result
user cTime url
user cnt
Mary 12:00:00 https://… SELECT
Mary 1
2
Bob 12:00:00 https://… user,
COUNT(url) as cnt Bob 1
Mary 12:00:02 https://… FROM clicks
GROUP BY user Liz 1
Liz 12:00:03 https://…
The result is identical to the one-time query (at this point)
9 © 2019 Ververica
SQL Feature Set in Flink 1.11
STREAMING & BATCH STREAMING ONLY
● SELECT FROM WHERE ● OVER / WINDOW
● GROUP BY [HAVING] ○ UNBOUNDED + BOUNDED PRECEDING
○ Non-windowed
● INNER JOIN with
○ Time-versioned table
○ TUMBLE, HOP, SESSION windows
○ External lookup table
● JOIN ● MATCH_RECOGNIZE
○ Time-Windowed INNER + OUTER JOIN ○ Pattern Matching/CEP (SQL:2016)
○ Non-windowed INNER + OUTER JOIN
● User-Defined Functions BATCH ONLY
○ Scalar
○ Aggregation ● Full TPC-DS support
○ Table-valued
10 © 2019 Ververica
Data Pipelines
● Transform, aggregate, and move events in real-time
● Low-latency ETL
o Convert and write streams to file systems, DBMS, K-V stores, indexes, …
o Ingest appearing files to produce streams
© 2019 Ververica
Stream & Batch Analytics
● Stream & Batch Analytics
o Run analytical queries over bounded and unbounded data
o Query and compare historic and real-time data
o Compute and update data to visualize in real-time
© 2019 Ververica
Training Environment
https://github.com/ververica/sql-training/
© 2019 Ververica
What You Will Learn in This Training?
● Querying streaming data with SQL
● Expressing common stream processing operations with SQL
o Window aggregations, stream joins, and pattern matching
● Piping the results of continuous queries into Kafka and S3
● Materializing the results of continuous queries in MySQL
● Using Flink's SQL CLI client
14 © 2019 Ververica
Training Scenario: Taxi Ride Data
● We are working with data about taxi rides in New York
● Three tables
o Rides One start and one end event for each ride
o Fares One payment event for each ride
o DriverChanges One event for each driver change of a taxi
o All tables are registered and available in the environment
o Each tables is backed by a Kafka topic
© 2019 Ververica
Training Scenario: Taxi Ride Data
Flink SQL> SELECT * FROM Rides;
rideId taxiId isStart lon lat rideTime psgCnt
1 2013000001 true -73.99078 40.76088 2013-01-01T00:00 1
2 2013000002 true -73.978325 40.77809 2013-01-01T00:00 5
3 2013000003 true -73.98962 40.72999 2013-01-01T00:00 1
Flink SQL> SELECT * FROM Fares;
rideId payTime payMethod tip toll fare
65 2013-01-01T00:00:36 CSH 0.0 0.0 3.5
137 2013-01-01T00:01 CSH 0.0 0.0 3.5
77 2013-01-01T00:01:22 CSH 0.0 0.0 4.0
Flink SQL> SELECT * FROM DriverChanges;
taxiId driverId usageStartTime
2013000061 2013000061 2013-01-01T00:00:02
2013000062 2013000062 2013-01-01T00:00:03
2013000063 2013000063 2013-01-01T00:00:08
16 © 2019 Ververica
Our Training Environment
Assign & monitor
Submit query query tasks TaskManager
SQL Client JobManager Execute
query tasks
Read & write data
WebUI: http://localhost:8081
Push events at 10x speed
Coordinate
© 2019 Ververica
Introduction to SQL Client
© 2019 Ververica
Interactive Query Submission via SQL Client
Event Log
SQL Client
Database /
SELECT HDFS
user, Catalog
COUNT(url) AS cnt Submit Query
FROM clicks
GROUP BY user
CLI Optimizer Submit Job
Query
Changelog Results
or Table Result Server State
Results
Gateway
© 2019 Ververica
Detached Query Submission via SQL Client
Event Log
Database /
HDFS
SQL Client
INSERT INTO dashboard
SELECT
Catalog
user, Submit Query
COUNT(url) AS cnt
FROM clicks
GROUP BY user CLI Optimizer Submit Job
Query
Result Server State
Cluster ID &
Job ID
Gateway
© 2019 Ververica
Hands On Exercises
© 2019 Ververica
Introduction to SQL on Flink
Continue with the “Introduction to the Training Environment”
in “Introduction to SQL on Flink”
https://github.com/ververica/sql-training/wiki/Introduction-to-SQL-on-Flink
We are here to help!
© 2019 Ververica
www.ververica.com @VervericaData
© 2019 Ververica