Chapter 5: Stream Processing
Big Data Management and Analytics, Database Systems Group
Today's Lesson
• Data Streams & Data Stream Management Systems
• Data Stream Models
  – Insert-Only
  – Insert-Delete
  – Additive
• Streaming Methods
  – Sliding Windows & Ageing
  – Data Synopsis
• Stream Processing – Concepts & Tools
  – Micro-Batching with Apache Spark Streaming
  – Real-time Stream Processing with Apache Storm
Data Streams
• Definition: A data stream can be seen as a continuous and potentially infinite stochastic process in which events occur independently of one another
• Huge amount of data → data objects cannot be stored
• Single scan: each element can be processed only once
Data Streams – Key Characteristics
• The data elements in the stream arrive on-line
• The system has no control over the order in which data elements arrive (either within a data stream or across multiple data streams)
• Data streams are potentially unbounded in size
• Once an element has been processed, it is discarded or archived
Data Stream Management System
[Figure: DSMS architecture: data streams enter the stream processor, which evaluates standing queries as well as ad-hoc queries, produces output streams, and uses a limited working storage plus an archival storage.]
Data Stream Models – Insert-Only Model
• Once an element is seen, it cannot be changed
[Figure: elements flow through the stream processor over time; once an element has been seen, it remains unchanged.]
Data Stream Models – Insert-Delete Model
• Elements can be deleted or updated
[Figure: elements flow through the stream processor over time; a later arrival can update or delete an earlier element.]
Data Stream Models – Additive Model
• Each element is an increment to the previous version of the given data object
[Figure: increments for the same data object arrive over time and are accumulated by the stream processor.]
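The additive model can be sketched in a few lines of Python (an illustrative helper of our own, not from the lecture): each stream element is a (key, increment) pair, and the processor only keeps a running total per data object instead of the elements themselves.

```python
from collections import defaultdict

def additive_aggregate(stream):
    """Additive model: each element (key, delta) is an increment to the
    previous version of the data object identified by key."""
    state = defaultdict(int)
    for key, delta in stream:
        state[key] += delta  # fold the increment into the current version
    return dict(state)

# e.g. three increments for object 'x' and one for object 'y'
totals = additive_aggregate([("x", 1), ("x", 5), ("y", 11), ("x", -2)])
```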
Streaming Methods
• Huge amount of data vs. limited resources in space → impractical to store all data
• Solutions:
  – Storing summaries of previously seen data
  – "Forgetting" stale data
• But: trade-off between storage space and the ability to provide precise query answers
Streaming Methods – Sliding Windows
• Idea: Keep most recent stream elements in main
memory and discard older ones
• Timestamp-based:
[Figure: a timestamp-based sliding window moves over the data stream; the window length and the sliding interval are given as time spans.]
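A timestamp-based window can be sketched with a double-ended queue (a sketch of ours, not part of the lecture material): new elements enter at the back, and elements older than the window length are evicted from the front.

```python
from collections import deque

def timestamp_window(events, window_length):
    """Yield the window contents after each arrival. `events` are
    (timestamp, value) pairs arriving in timestamp order."""
    window = deque()
    for ts, value in events:
        window.append((ts, value))
        # evict elements that have fallen out of the window
        while window[0][0] <= ts - window_length:
            window.popleft()
        yield list(window)

# three events; the third arrival pushes the first two out of a length-3 window
snapshots = list(timestamp_window([(1, "a"), (2, "b"), (5, "c")], 3))
```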
Streaming Methods – Sliding Windows
• Idea: Keep most recent stream elements in main
memory and discard older ones
• Sequence-based:
[Figure: a sequence-based sliding window moves over the data stream; the window length and the sliding interval are given as numbers of elements.]
Streaming Methods – Ageing
• Idea: Keep only the summary in main memory and discard objects as soon as they are processed
• Multiply the summary by a decay factor after each time epoch, or after a certain number of elements has arrived
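The ageing idea can be sketched as follows (illustrative, with an assumed decay factor of 0.5): after each epoch the summary, here a simple count, is multiplied by the decay factor before the new epoch's contribution is added, so stale data gradually loses influence.

```python
def aged_count(epoch_counts, decay=0.5):
    """Exponentially aged count: the summary is multiplied by `decay`
    after each time epoch, then the new epoch's count is added."""
    summary = 0.0
    for count in epoch_counts:
        summary = summary * decay + count  # old contributions fade away
    return summary

result = aged_count([1, 1, 1])  # = (1 * 0.5 + 1) * 0.5 + 1
```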
Streaming Methods
• High velocity of incoming data vs. limited resources in time → impossible to process all data
• Solutions:
• Data reduction
• Data approximation
• But: Trade-off between processing speed and the
ability to provide precise query answers
Streaming Methods – Sampling
• Select a subset of the data → reduce the amount of data to process
• Difficulty: obtaining a representative sample
• Simplest form: random sampling
  – Reservoir Sampling
  – Min-Wise Sampling

Reservoir Sampling Algorithm
  input: stream S, size k of the reservoir
  begin
    Insert the first k objects into the reservoir;
    foreach o ∈ S do
      Let i be the position of o;
      r := random integer in range 1..i;
      if r ≤ k then
        Delete the instance at position r from the reservoir;
        Insert o into the reservoir;
  end

• Load Shedding: discard some fraction of the data if the arrival rate of the stream might overload the system
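The reservoir sampling algorithm translates directly to Python (a sketch; the function name is ours): every element of a stream of unknown length ends up in the sample with the same probability k/n.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform random sample of k elements over a stream."""
    reservoir = []
    for i, item in enumerate(stream, start=1):  # i = 1-based position
        if i <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            r = rng.randint(1, i)        # random position in 1..i
            if r <= k:
                reservoir[r - 1] = item  # replace a random reservoir slot
    return reservoir

sample = reservoir_sample(range(1000), 10)
```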
Streaming Methods – Data Synopsis & Histograms
• Summaries of data objects are often used to reduce the amount of data
  – e.g. microclusters that describe groups of similar objects
• Histograms are used to approximate the frequency distribution of element values
  – Commonly used by query optimizers (e.g. for range queries)
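As a small illustration (our own sketch, not from the lecture), an equal-width histogram can be maintained over a stream in constant space: only the bin counters are stored, never the elements themselves.

```python
def equal_width_histogram(stream, lo, hi, bins):
    """Approximate the frequency distribution of values in [lo, hi)
    with `bins` equal-width counters."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in stream:
        if lo <= x < hi:  # values outside the range are ignored here
            counts[int((x - lo) / width)] += 1
    return counts

# five values, two bins of width 5 over [0, 10)
hist = equal_width_histogram([1, 2, 3, 7, 8], 0, 10, 2)
```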
• Overview of techniques to build a summary (reduced representation) of a sequence of numeric attributes:
[Figure: the same sequence reduced with DFT, DWT, SVD, APCA, PAA, and PLA.]
Discrete Wavelet Transformation (DWT)
• Idea:
  – A sequence is represented as a linear combination of basic wavelet functions
  – The wavelet transformation decomposes a signal into several groups of coefficients at different scales
  – Small coefficients can be eliminated → small errors when reconstructing the signal
  – Take only the first few function coefficients
• Often, Haar wavelets are used (easy to implement)
[Figure: a sequence X, its approximation X' after the DWT, and the Haar basis functions Haar 0 to Haar 7.]
Example: step-wise transformation of the sequence (stream) X = <8, 4, 1, 3> into the Haar-wavelet representation H = [4, 2, 2, -1]:
  h1 = 4  = mean(8, 4, 1, 3)
  h2 = 2  = (mean(8, 4) - mean(1, 3)) / 2
  h3 = 2  = (8 - 4) / 2
  h4 = -1 = (1 - 3) / 2
(Lossless) reconstruction of the original sequence (stream) from the Haar-wavelet representation: h1 = 4 gives the constant sequence <4, 4, 4, 4>; adding/subtracting h2 = 2 yields the half means <6, 6, 2, 2>; adding/subtracting h3 = 2 within the first half yields 8 and 4; adding/subtracting h4 = -1 within the second half yields 1 and 3, i.e. X = <8, 4, 1, 3>.
Haar Wavelet Transformation

Haar Wavelet Transform Algorithm
  input: sequence S = (s_1, s_2, ..., s_{2n}) of even length
  output: sequence of wavelet coefficients
  begin
    Transform S into a sequence of two-component vectors ((a_1, d_1), ..., (a_n, d_n)), where a_i = (s_{2i-1} + s_{2i}) / 2 and d_i = (s_{2i-1} - s_{2i}) / 2;
    Separate the sequences A = (a_1, ..., a_n) and D = (d_1, ..., d_n);
    Recursively transform the sequence A;
  end

Example for the input sequence (2, 5, 8, 9, 7, 4, 1, -1):
Step 1:
  averages: ((2+5)/2, (8+9)/2, (7+4)/2, (1+(-1))/2) = (3.5, 8.5, 5.5, 0)
  details:  ((2-5)/2, (8-9)/2, (7-4)/2, (1-(-1))/2) = (-1.5, -0.5, 1.5, 1)
Step 2 (on the averages):
  averages: ((3.5+8.5)/2, (5.5+0)/2) = (6, 2.75)
  details:  ((3.5-8.5)/2, (5.5-0)/2) = (-2.5, 2.75)
Step 3 (on the averages):
  averages: ((6+2.75)/2) = (4.375)
  details:  ((6-2.75)/2) = (1.625)
→ Wavelet coefficients: (4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, 1)
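The recursive algorithm above can be sketched in Python (helper names are ours); the inverse simply replays the averaging in reverse, which makes the reconstruction lossless.

```python
def haar_transform(s):
    """Haar wavelet transform of a sequence whose length is a power of two."""
    if len(s) == 1:
        return list(s)
    # pairwise averages a_i and detail coefficients d_i
    averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
    details = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
    return haar_transform(averages) + details  # recurse on the averages

def haar_inverse(h):
    """Lossless reconstruction of the original sequence."""
    if len(h) == 1:
        return list(h)
    half = len(h) // 2
    averages, details = haar_inverse(h[:half]), h[half:]
    out = []
    for a, d in zip(averages, details):
        out += [a + d, a - d]  # undo one averaging level
    return out

coeffs = haar_transform([2, 5, 8, 9, 7, 4, 1, -1])
```

Dropping the smallest entries of `coeffs` before calling `haar_inverse` gives the lossy, space-saving approximation described above.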
Spark Streaming
• Spark's Streaming framework is built on top of Spark's Core API
• Data ingestion from several different data sources
• Stream processing can be combined with other Spark libraries (e.g. Spark MLlib)
Spark Streaming
• Spark's streaming workflow:
• Streaming engine receives data from input streams
• Data stream is divided into several microbatches, i.e.
sequences of RDDs
• Microbatches are processed by Spark engine
• The result is a data stream of batches of processed data
Spark Streaming
• DStreams (Discretized Streams) as basic abstraction
• Any operation applied on a DStream translates to
operations on the underlying RDDs (computed by Spark
Engine)
• StreamingContext objects as starting points
sc = SparkContext(master, appName)
ssc = StreamingContext(sc, 1) # params: SparkContext, batch interval in seconds
Spark Streaming
General schedule for a Spark Streaming application:
1. Define the StreamingContext ssc
2. Define the input sources by creating input DStreams
3. Define the streaming computations by applying transformations and output operations to DStreams
4. Start receiving data and processing it using ssc.start()
5. Wait for the processing to be stopped (manually or due to
any error) using ssc.awaitTermination()
6. The processing can be manually stopped using ssc.stop()
Spark Streaming
#Create a local StreamingContext with two working threads and a batch
#interval of 1 sec
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

#Create a DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

#Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

#Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

#Print the first ten elements of each RDD of this DStream to the console
wordCounts.pprint()

#Start the computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()
Spark Streaming
• Supports window operations
• Two basic parameters:
  – windowLength
  – slideInterval
• Supports many transformations on windowed DStreams
#Reduce the last 30 sec of data, every 10 sec (in the Python API the
#second argument is an optional inverse reduce function)
winWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, None, 30, 10)
Apache Storm
• Alternative to Spark Streaming
• Supports real-time processing
• Three abstractions:
– Spouts
– Bolts
– Topologies
Apache Storm
• Spouts:
– Source of streams
– Typically reads from queuing brokers (e.g. Kafka, RabbitMQ)
– Can also generate its own data or read from external sources (e.g.
Twitter)
• Bolts:
– Processes any number of input streams
– Produces any number of output streams
– Holds most of the logic of the computations (functions, filters,…)
Apache Storm
• Topologies:
– Network of spouts and bolts
– Each edge represents a bolt subscribing to the output stream of
some other spout or bolt
– A topology is an arbitrarily complex multi-stage stream computation
Apache Storm
• Streams:
– Core abstraction in Storm
– A stream is an unbounded sequence of tuples that is
processed and created in parallel in a distributed fashion
– Tuples can contain standard types like integers, floats,
shorts, booleans, strings and so on
– Custom types can be used if a serializer is defined for them
– A stream grouping defines how that stream should be
partitioned among the bolt's tasks
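As an illustration of stream grouping (a Python sketch of the idea, not Storm's actual implementation), a fields grouping hash-partitions tuples by one field, so all tuples with the same value of that field reach the same task:

```python
def fields_grouping(tuples, field_index, num_tasks):
    """Partition tuples among tasks by hashing one field, so equal field
    values always map to the same task (the idea behind fields grouping)."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        tasks[hash(t[field_index]) % num_tasks].append(t)
    return tasks

# both ("a", ...) tuples land in the same partition
partitions = fields_grouping([("a", 1), ("b", 2), ("a", 3)], 0, 2)
```

With shuffle grouping, by contrast, tuples would be distributed randomly, giving each task roughly the same load.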
Apache Storm
[Figure: a topology of one spout ("blue-spout") followed by two bolts ("green-bolt", "yellow-bolt").]

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
    .setNumTasks(4)
    .shuffleGrouping("blue-spout");
// 4 tasks spread across 2 executors; the tuples are randomly
// distributed across the bolt's tasks, so each task gets
// roughly an equal number of tuples

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
    .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
    "mytopology",
    conf,
    topologyBuilder.createTopology()
);

[Figure: the resulting runtime layout: the topology's tasks distributed over two worker processes, each worker running several executors that in turn run one or more tasks.]
Further Reading
• Joao Gama: Knowledge Discovery from Data Streams
(http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf)
• Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of
Massive Datasets
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei
Zaharia: Learning Spark - Lightning-Fast Big Data Analysis
• http://spark.apache.org/docs/latest/streaming-programming-guide.html
• http://storm.apache.org/documentation/Concepts.html