MODERN DATA ARCHITECTURES
FOR BIG DATA II
APACHE SPARK
STREAMING API
Agenda
● Spark Streaming
● API
● Summary
Where are we?
Spark Streaming is built on top of Spark’s APIs.
1.
SPARK
STREAMING
Spark Streaming Features
Spark Streaming’s design choices are:
● Declarative API:
the application specifies what to compute
instead of how to compute it.
● Event time and processing time:
the timestamp of the event when it was created
at the source, or the time when it arrived at Spark.
● Micro-batch and continuous execution:
until Spark 2.3 only micro-batching was available
(↑throughput vs ↓latency).
Spark Streaming Features
Spark Streaming has two streaming APIs:
● DStreams API (RDDs):
very low level and no longer recommended,
as it has several limitations (micro-batching
only, processing time only, low-level interaction
with RDDs and Java/Python objects, …).
● Structured Streaming (DataFrames):
built upon Spark’s Structured APIs, with
multiple optimizations (event time,
continuous processing, …).
1.1
STRUCTURED
STREAMING
BASICS
Structured Streaming Basics
Structured Streaming uses the Structured APIs
in Spark: DataFrames, Datasets and SQL.
All the operations we’ve seen so far are supported
→ that is what makes it a unified computing engine.
Structured Streaming Basics
The execution of a streaming application can be
summarized as follows (see the sketch below):
● Write the code for your processing.
● Specify a destination (file, database, Kafka, …).
● The Structured Streaming engine will run the
code incrementally and continuously as new
data arrives in the system.
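A rough sketch of these three steps in PySpark (the socket source on localhost:9999 and the word-count logic are illustrative, not from the slides):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# 1) Read a stream (socket source; host/port are illustrative)
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# 2) Write the processing code with the usual DataFrame API
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# 3) Specify a destination and start; the engine runs the code
#    incrementally and continuously as new data arrives
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()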
Structured Streaming Basics
A stream of data is abstracted as a table to which
new data is continuously appended.
Continuous Application
An end-to-end application that reacts to data in
real time by combining a variety of tools:
● streaming jobs
● batch jobs
● joins between streaming and offline data
● interactive ad-hoc queries.
Continuous Application
Spark’s uniqueness shows in a scenario like the
following one:
“Structured Streaming to (1) continuously update
a table that (2) users query interactively with
Spark SQL, (3) serve a machine learning model
trained by MLlib, or (4) join streams with offline
data in any of Spark’s data sources.”
1.2
CORE
CONCEPTS
Core components
The following are the core components in a
Structured Streaming job:
● Transformations & Actions
● Input sources
● Output Sinks
● Output modes
● Triggers
● Event-time processing
Core components
Transformations & Actions
Structured Streaming maintains the same
concept of transformations and actions.
The same transformations are available, but with
some restrictions.
Only one action is available: starting a stream,
which will then run continuously and output
results.
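A minimal sketch, assuming logs is an already defined streaming DataFrame with level, timestamp and message columns (hypothetical names):

# Transformations are declared exactly as in batch code...
errors = logs.filter(logs.level == "ERROR").select("timestamp", "message")

# ...but nothing runs until the single streaming action: starting the query
query = errors.writeStream.format("console").start()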
Core components
Input Sources
Specifies the source of the data stream
● Kafka
● File
● Socket*
● Rate**
* used for testing
** used for benchmarking
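A sketch of how each source can be declared (the broker address, topic, paths and event_schema are illustrative assumptions):

# Kafka source
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())

# File source: picks up new files in a directory (a schema must be provided)
file_df = (spark.readStream
           .format("json")
           .schema(event_schema)
           .load("/data/incoming/"))

# Socket source, used for testing
socket_df = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

# Rate source, used for benchmarking: generates rows at a fixed rate
rate_df = (spark.readStream
           .format("rate")
           .option("rowsPerSecond", 10)
           .load())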
Core components
Output Sinks
Specifies the destination of the processing
results
● Kafka
● File
● Console *
● Memory *
● Foreach
● ForeachBatch
* used for debugging
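A sketch of some of these sinks, assuming counts, events and alerts are existing streaming DataFrames and that paths, topics and table names are illustrative:

# Console sink, used for debugging
counts.writeStream.outputMode("complete").format("console").start()

# File sink (a checkpoint location is required)
(events.writeStream
 .format("parquet")
 .option("path", "/data/output/")
 .option("checkpointLocation", "/data/checkpoints/output/")
 .start())

# Kafka sink: expects a 'value' (and optionally a 'key') column
(alerts.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
 .writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")
 .option("topic", "alerts")
 .option("checkpointLocation", "/data/checkpoints/alerts/")
 .start())

# foreachBatch sink: run arbitrary batch logic on each micro-batch
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("events_table")

events.writeStream.foreachBatch(write_batch).start()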
Core components
Output Modes
Specifies how we want to save the data to the
sink:
● Append: only add new records to the sink.
● Update: update changed records in place.
● Complete: rewrite the full output.
* Certain sinks only support certain output modes.
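A sketch of the three modes, assuming parsed is a non-aggregated and counts an aggregated streaming DataFrame (both hypothetical); as noted above, not every sink/query combination accepts every mode:

# Append (default): only new result rows are written to the sink
parsed.writeStream.outputMode("append").format("console").start()

# Update: only the rows that changed since the last trigger are written
counts.writeStream.outputMode("update").format("console").start()

# Complete: the full (aggregated) result is rewritten on every trigger
counts.writeStream.outputMode("complete").format("console").start()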
Core components
Triggers
Specifies when the engine should check for new
input data and update the result:
● Micro-batch mode (default)
● Fixed interval micro-batches
● One-time micro-batch
● Continuous with fixed checkpoint interval
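A sketch of the four options on the DataStreamWriter, assuming parsed is a non-aggregated streaming DataFrame (continuous mode only supports a restricted set of queries and sinks):

# Micro-batch mode (default): a new batch starts as soon as the previous one finishes
parsed.writeStream.format("console").start()

# Fixed-interval micro-batches
parsed.writeStream.trigger(processingTime="10 seconds").format("console").start()

# One-time micro-batch: process what is available, then stop
parsed.writeStream.trigger(once=True).format("console").start()

# Continuous processing with a fixed checkpoint interval (experimental)
parsed.writeStream.trigger(continuous="1 second").format("console").start()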
Core components
Event-Time Processing
Structured Streaming supports event-time
processing (i.e., processing data based on
timestamps included in the records, which may
arrive out of order).
Core components
Event-Time Processing
● Event-time data:
Event time refers to time fields that are
embedded in the data.
Rather than processing data according to the
time it reaches your system, you process it
according to the time it was generated.
Handy when events arrive out of order (e.g.,
due to network delays).
Core components
Event-Time Processing
● Watermarks:
Allow you to specify how late data is expected
to arrive in event time (to limit how long the
engine needs to remember old data).
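A minimal sketch, assuming a streaming DataFrame events with an eventTime timestamp column and a word column (hypothetical names):

from pyspark.sql.functions import window

# Accept events up to 10 minutes late with respect to the latest event time seen
windowed_counts = (events
                   .withWatermark("eventTime", "10 minutes")
                   .groupBy(window(events.eventTime, "5 minutes"), events.word)
                   .count())

query = (windowed_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())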
2.
API
Input Source
SparkSession.readStream
Returns a DataStreamReader that can be used to
read a data stream as a streaming DataFrame.
Input Source
DataStreamReader
The input source is set up through its methods:
● .format(source) Specifies the input data
source format.
● .option(key, value) Adds an input option for
the underlying data source.
● .load() Loads a data stream from a data
source and returns it as a DataFrame.
● ...
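A sketch chaining these methods (the JSON format, the options, the path and event_schema are illustrative):

stream_df = (spark.readStream
             .format("json")                   # input data source format
             .option("maxFilesPerTrigger", 1)  # an option for the underlying source
             .schema(event_schema)             # schema of the incoming data
             .load("/data/incoming/"))         # returns a streaming DataFrame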
Output Sink
DataFrame.writeStream
Returns a DataStreamWriter that is used to save
the content of the streaming DataFrame.
Actions
DataStreamWriter.start
Starts processing the contents of the streaming
DataFrame and returns a StreamingQuery.
Actions
StreamingQuery.awaitTermination(timeout)
In addition to the start action, we need to keep the
driver process running “forever” while the streaming
query executes in the background, which is what
this method does.
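Putting the writer and the two actions together (sink and options are illustrative):

# start() returns a StreamingQuery handle; the query runs in the background
query = (stream_df.writeStream
         .format("console")
         .outputMode("append")
         .start())

# Block the driver until the query stops (or until the optional timeout, in seconds)
query.awaitTermination()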
3.
SUMMARY
Summary
Structured Streaming presents a powerful way to
write streaming applications.
Taking a batch job you already run and turning it
into a streaming job with almost no code changes
is both simple and extremely helpful.
Summary
Structured Streaming Programming Guide
PySpark Structured Streaming