Technical Principles of
Streaming
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.
Objectives
Upon completion of this course, you will be able to know:
Real-time stream processing
System architecture of Streaming
Key features of Streaming
Basic CQL concepts
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Contents
1. Introduction to Streaming
2. System Architecture
3. Key Features
4. Introduction to StreamCQL
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 3
Streaming Overview
Streaming is a distributed real-time computing framework based on
the open source Storm.with the following features:
Real-time response with low delay
Data computing before storing
Continuous query
No waiting; Results delivered in-flight
Event-driven Event Alerts
Data Actions
Queries
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 4
Application Scenarios of Streaming
Streaming is applicable to the following scenarios:
Real-time analysis: real-time log processing and vehicle traffic analysis
Real-time statistics: real-time website access statistics and sorting
Real-time recommendation: real-time advertisement positioning and event
marketing
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 5
Position of Streaming in FusionInsight
Application service layer
OpenAPI/SDK REST/SNMP/Syslog
Data Information Knowledge Wisdom
DataFarm Porter Miner Farmer Manager
System
management
Hadoop API Plugin API
Service
governance
HIVE M/R Spark Streaming Flink
Hadoop LibrA
YARN/ ZooKeeper Security
management
HDFS/HBase
Streaming is a distributed real-time computing framework, widely used in real-
time business.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 6
Comparison with Spark Streaming
Micro-batch processing by Spark Streaming Stream processing by Streaming
Spark Streaming Streaming
Task execution Instant execution logic startup and Execution logic startup before execution, and
mode reclamation upon completion logic persistence
Event processing Processing started upon accumulation of a Real-time event processing
mode certain number of event batches
Delay Second-level Millisecond-level
Throughput High (2 to 5 times that of Streaming) Average
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 7
Comparison of Application Scenario
Real-time Performance
Streaming
Spark Streaming
Time
milliseconds seconds
Streaming applies to delay-sensitive services.
Spark Streaming applies to delay-insensitive services.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 8
Contents
1. Introduction to Streaming
2. System Architecture
3. Key Features
4. Introduction to StreamCQL
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 9
Basic Concepts (1)
Topology: a real-time application in Streaming.
Nimbus: assigns resources and schedules tasks.
Supervisor: receives tasks assigned by Nimbus, and starts/stops
Worker processes.
Worker: runs component logic processes.
Spout: generates source data flows in a topology.
Bolt: receives and processes data in a topology.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 10
Basic Concepts (2)
Task: a Spout or Bolt thread of Worker.
Tuple: core data structure of Streaming. It is basic message delivery
unit in key-value pairs, which can be created and processed in a
distributed way.
Stream: an infinite continuous Tuple sequence.
Zookeeper: provides distributed collaboration services for processes.
Active/Standby Nimbus, Supervisor, and Worker register their
information in ZooKeeper. This enables Nimbus to detect the health
status of all roles.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 11
System Architecture
Submits a Monitors the heartbeat
topology. and assigns tasks.
Client Nimbus
ZooKeeper
Downloads a JAR package.
ZooKeeper
Obtains tasks.
Supervisor Supervisor
ZooKeeper
Starts Worker.
Worker
Executor
Worker
Executor Reports the heartbeat.
Worker
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 12
Topology
A topology is a directed acyclic graph (DAG) consisting of Spout (data source) and Bolt
(for logical processing). Spout and Bolt are connected through Stream Groupings.
Service processing logic is encapsulated in topologies in Streaming.
Filters data.
Obtains stream data
from external data
sources Bolt A Bolt B
Spout
Triggers external
messages.
Bolt C
Persistent archiving
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 13
Worker
Worker: A Worker is a JVM process and Worker Process
a topology runs in one or more Workers. Executor
Executor
A started Worker runs all the way Task
unless manually stopped. The number of Task
Worker processes depends on the Executor
Task Task
topology setting, and has no upper
limit. The number of Worker processes
that can be scheduled and started
depends on the number of slots configured in Supervisor.
Executor: In a Worker process runs one or more Executor threads. Each Executor can run one
or more task instances of either Spout or Bolt.
Task: a unit that processes data
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 14
Task
Both Spout and Bolt in a topology support concurrent running. In the topology, you can
specify the number of concurrently running tasks on each node. Streaming assigns tasks
in the cluster to enable simultaneous calculation and enhance processing capability of
the system.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 15
Message Delivery Policies
Grouping Mode Description
Delivers messages in groups to tasks of the target
fieldsGrouping (field grouping)
Bolt according to message hash values.
Delivers all messages to a fixed task of the target
globalGrouping (global grouping)
Bolt.
Delivers messages to a random task of the target
shuffleGrouping (shuffle grouping)
Bolt.
Delivers messages randomly to tasks if one or more
localOrShuffleGrouping (local or shuffle grouping) tasks exist in the target Bolt process, or delivers
messages in shuffle grouping mode.
allGrouping (broadcast grouping) Delivers messages to all tasks of the target Bolt.
Delivers messages to the task of the target Bolt
specified by the data producer. The task ID needs
directGrouping (direct grouping)
to be specified by using the emitDirect (taskID,
tuple) interface.
partialKeyGrouping (partial field grouping) Balanced field grouping.
noneGrouping (no grouping) Same as shuffle grouping.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 16
Contents
1. Introduction to Streaming
2. System Architecture
3. Key Features
4. Introduction to StreamCQL
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 17
Nimbus HA
ZooKeeper cluster
Streaming cluster
Active Standby
Nimbus Nimbus
Supervisor Supervisor Supervisor
…
worker worker worker worker worker worker
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 18
Disaster Recovery
Services are automatically migrated from faulty nodes to normal ones, preventing
service interruptions.
Node1 Node2 Node3
Topo1 Topo1 Topo1
Topo2 Topo3 Topo4
Zero
manual
operation
Node1 Node2 Node3
Topo1 Topo1 Topo1
Topo2 Topo3 Topo4
Topo1 Topo3
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 19
Message Reliability
Reliability Processing
Description
Level Mechanism
This mode involves the highest throughput and applies to
At Most Once None
messages with low reliability requirements.
This mode involves low throughput and applies to messages
At Least Once Ack with high reliability problems. All data must be completely
processed.
Trident is a special transactional API provided by Storm and
Exactly Once Trident
involves the lowest throughput.
When a tuple is completely processed in Streaming, the tuple and all its derived tuples are successfully
processed. A tuple fails to be processed if the processing is not complete within the timeout period.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 20
Ack Mechanism
When Spout sends a tuple, it notifies Acker
that a new root message is generated. Acker Spout Ack6
will create a tuple tree and initializes the Ac
k1
checksum to 0.
Bolt1 Bolt2 Ack
When Bolt sends a message, it sends an 2 Acker
anchor tuple to Acker to refresh the tuple Ack3
tree, and reports the result to Acker after k4
Bolt3 Bolt4 Ac
the message is sent successfully. If the
message is sent successfully, Acker refreshes
the checksum. If the message fails to be sent, Ack5
Acker immediately notifies Spout of the failure.
When a tuple tree is completely processed (checksum = 0), Acker notifies Spout of the
result.
Spout provides ack() and fail() functions to process Acker results. The fail() function
implements message resending logic.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 21
Reliability Level Setting
If not every message is required to be processed (allowing some
message loss), the reliability mechanism can be disabled to ensure
better performance.
The reliability mechanism can be disabled in the following ways:
Setting Config.TOPOLOGY_ACKERS to 0.
Using Spout to send messages through interfaces that do not restrict
message IDs.
Using Bolt to send messages in Unanchor mode.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 23
Streaming and Other Components
HDFS, HBase, Kafka…
Streaming
HDFS
Kafka
Topology1
Topic1 Redis
Topic2 Topology2
HBase
Topic N
Topology N Kafka
……
External components such as HDFS and HBase are integrated to
facilitate real-time offline analysis.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 24
Contents
1. Introduction to Streaming
2. System Architecture
3. Key Features
4. Introduction to StreamCQL
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 25
StreamCQL Overview
StreamCQL(Stream Continuous Query Language) is a query language based on the
distributed stream processing platform based on and can be built on various stream
processing engines (mainly Apache Storm).
Currently, most stream processing platforms provide only distributed processing
capabilities but involve complex service logic development and poor stream computing
capabilities. The development efficiency is low due to low reuse and repeated
development. StreamCQL provides various distributed stream computing functions,
including traditional SQL functions such as filtering and conversion, and new functions
such as stream-based time window computing, window data statistics, and stream
data splitting and merging.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 26
StreamCQL Easy to Develop
//Def Input:
public void open(Map conf,
TopologyContext context,
SpoutOutputCollector collector) {…} --Def Input:
public void nextTuple() {…} CREATE INPUT STREAM S1 …
public void ack(Object id) { …}
public void
declareOutputFields(OutputFieldsDeclar
er declarer) {…} --Def logic:
//Def logic: INSERT INTO STREAM filterstr SELECT *
public void execute(Tuple tuple, FROM S1 WHERE name="HUAWEI";
BasicOutputCollector collector) {…}
public void
declareOutputFields(OutputFieldsDeclar --Def Output:
er ofd) {…} CREATE OUTPUT STREAM S2…
//Def Output:
public void execute(Tuple tuple,
BasicOutputCollector collector) {…} --Def Topology:
public void SUBMIT APPLICATION test;
declareOutputFields(OutputFieldsDeclar
er ofd) {…} StreamCQL
//Def Topology:
public static void main(String[] args)
throws Exception {…}
Native Storm API
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 28
StreamCQL and Stream Processing
Platform
Service interface CQL IDE
Function
Join Aggregate Split Merge Pattern Matching
Stream Window
Engine
Other stream
Storm processing engines
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 29
Summary
This module describes the following information about
Streaming:
Definition
Application Scenarios
Position of Streaming in FusionInsight
System architecture of Streaming
Key features of Streaming
Introduction to StreamCQL
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 30
Quiz
1. How is message reliability guaranteed in Streaming?
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 31
Quiz
1. Which of the following statements about Supervisor is CORRECT? ( )
A. Nimbus HA supports hot failover and eliminates single points of
failure.
B. Supervisor faults can be automatically recovered without affecting
running services.
C. Worker faults can be automatically recovered.
D. Tasks on a faulty node of the cluster will be re-assigned to other
normal nodes.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 32
Quiz
2. Which of the following statements about Supervisor is
CORRECT? ( )
A. Supervisor assigns resources and schedules tasks.
B. Supervisor receives tasks assigned by Nimbus, and starts/stops
Worker processes.
C. Supervisor runs component logic processes.
D. Supervisor receives and processes data in a topology.
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 33
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term100002
5450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node10
00011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node10000
11798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 34
Thank You
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 35