Stream Computing
Data Science Foundations
© Copyright IBM Corporation 2018
Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit 11 Stream Computing
Unit objectives
• Define streaming data
• Describe IBM as a pioneer in streaming data, with System S and IBM Streams
• Explain streaming data - concepts & terminology
• Compare and contrast batch data vs streaming data
• List and explain streaming components & Streaming Data Engines (SDEs)
Stream Computing © Copyright IBM Corporation 2018
Generations of analytics processing
• From Relational Databases (OLTP) → Data Warehouses (OLAP) → Real-time Analytic Processing (RTAP) on Streaming Data
Generations of analytics processing
Graphic from:
• Ebbers, M., Abdel-Gayed, A., Budhi, V. B., Dolot, F., Kamat, V., Picone, R., &
Trevelin, J. (2013). Addressing data volume, velocity, and variety with IBM
InfoSphere Streams V3.0. Raleigh, NC: IBM International Technical Support
Organization (ITSO), p. 26.
• This eBook is available as a free download from:
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf.
• Note that the current version of the IBM Streams product is v4.2 (as of February 2018).
Extract from this book (pp. 26-27):
Hierarchical databases were invented in the 1960s and still serve as the foundation for
online transaction processing (OLTP) systems for all forms of business and government
that drive trillions of transactions today. Consider a bank as an example. It is likely that
even today in many banks that information is entered into an OLTP system, possibly by
employees or by a web application that captures and stores that data in hierarchical
databases. This information then appears in daily reports and graphical dashboards to
demonstrate the current state of the business and to enable and support appropriate
actions. Analytical processing here is limited to capturing and understanding what
happened.
Relational databases brought with them the concept of data warehousing, which
extended the use of databases from OLTP to online analytic processing (OLAP). By
using our example of the bank, the transactions that are captured by the OLTP system
were stored over time and made available to the various business analysts in the
organization. With OLAP, the analysts can now use the stored data to determine trends
in loan defaults, overdrawn accounts, income growth, and so on. By combining and
enriching the data with the results of their analysis, they could do even more complex
analysis to forecast future economic trends or make recommendations on new
investment areas. Additionally, they can mine the data and look for patterns to help
them be more proactive in predicting potential future problems in areas such as
foreclosures. The business can then analyze the recommendations to decide whether
they must take action. The core value of OLAP is focused on understanding why things
happened to make more informed recommendations.
A key component of OLTP and OLAP is that the data is stored. Now, some new
applications require faster analytics than is possible when you have to wait until the
data is retrieved from storage. To meet the needs of these new dynamic applications,
you must take advantage of the increase in the availability of data before storage,
otherwise known as streaming data. This need is driving the next evolution in analytic
processing called real-time analytic processing (RTAP). RTAP focuses on taking the
proven analytics that are established in OLAP to the next level. Data in motion and
unstructured data might be able to provide actual data where OLAP had to settle for
assumptions and hunches. The speed of RTAP allows for the potential of action in
place of making recommendations.
So, what type of analysis makes sense to do in real time? Key types of RTAP include,
but are not limited to, the following analyses:
• Alerting
• Feedback
• Detecting failures
[These uses are discussed in more detail on pp. 26-27.]
What is streaming data?
• Applications that require on-the-fly processing, filtering, and analysis
of streaming data
▪ Sensors: environmental, industrial, surveillance video, GPS, …
▪ “Data exhaust”: network/system/webserver log files
▪ High-rate transaction data: financial transactions, call detail records
• Criteria: two or more of the following
▪ Messages are processed in isolation or in limited data windows
▪ Sources include non-traditional data (spatial, imagery, text, …)
▪ Sources vary in connection methods, data rates, and processing
requirements, presenting integration challenges
▪ Data rates / volumes require the resources of multiple processing nodes
▪ Analysis and response are needed with sub-millisecond latency
▪ Data rates and volumes are too great for store-and-mine approaches
What is streaming data?
Reading:
What is Streaming Data?
(excellent article written from the perspective of Amazon’s AWS, but with general
applicability) https://aws.amazon.com/streaming-data/
• Benefits of Streaming Data
• Streaming Data Examples
• Comparison between Batch Process and Stream Processing
• Challenges in Working with Streaming Data
IBM is a pioneer in streaming analytics
• Book:
▪ Roy, J. (2016). Streaming analytics with IBM Streams: Analyze more, act
faster, and get continuous insights. Indianapolis, IN: Wiley.
▪ Available as a free PDF download from IBM at:
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMM14184USEN&attachment=IMM14184USEN.PDF
• IBM researchers spent a decade transforming the vision of stream
computing into a product - a new programming language, IBM Streams
Processing Language (SPL), was even built just for streaming systems
IBM is a pioneer in streaming analytics
References:
• The Invention of Stream Computing
http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/streamcomputing/
• Research articles (2004-2015)
https://researcher.watson.ibm.com/researcher/view_group_pubs.php?grp=2531&t=1
IBM System S
▪ System S provides a programming model
and an execution platform for user-
developed applications that ingest, filter,
analyze, and correlate potentially massive
volumes of continuous data streams
▪ Source Adapters = an operator that
connects to a specific type of input (e.g.,
Stock Exchange, Weather, file system, …)
▪ Sink Adapters = an operator that connects
to a specific type of output external to the
streams processing system (RDBMS, …)
▪ Operator = software processing unit
▪ Stream = flow of tuples from one operator
to the next operator (they do not traverse
operators)
IBM System S
References:
• Stream Computing Platforms, Applications, and Analytics (Tab 4: System S)
https://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534&t=4
System S: While at IBM, Dr. Ted Codd invented the relational database. In the defining
IBM Research project, it was referred to as System R, which stood for Relational. The
relational database is the foundation for data warehousing that launched the highly
successful Client/Server and On-Demand informational eras. One of the cornerstones
of that success was the capability of OLAP products that are still used in critical
business processes today.
When the IBM Research division again set its sights on developing something to
address the next evolution of analysis (RTAP) for the Smarter Planet evolution, they set
their sights on developing a platform with the same level of world-changing success,
and decided to call their effort System S, which stood for Streams. Like System R,
System S was founded on the promise of a revolutionary change to the analytic
paradigm. The research of the Exploratory Stream Processing Systems team at T.J.
Watson Research Center, which focused on advanced topics in highly scalable stream-
processing applications for the System S project, is the heart and soul of Streams.
Ebbers, M., Abdel-Gayed, A., Budhi, V. B., Dolot, F., Kamat, V., Picone, R., & Trevelin,
J. (2013). Addressing data volume, velocity, and variety with IBM InfoSphere Streams
V3.0. Raleigh, NC: IBM International Technical Support Organization (ITSO), p. 27.
Streaming data - concepts & terminology (1)
Streaming data - concepts & terminology
Reference:
Stream Computing Platforms, Applications, and Analytics (Overview)
https://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534&t=1
Streaming data - concepts & terminology (2)
• Directed Acyclic Graph (DAG) - a directed graph that contains no cycles
• Hardware Node - one computer in a cluster of computers that can
work as a single processing unit
• Operator (or Software Node ) - a software process that handles
one unit of the processing that needs to be performed on the data
• Tuple - an ordered set of elements
(equivalent to the concept of a record)
• Source - an operator where tuples are ingested
• Target (or Sink) - an operator where tuples are consumed (and
usually made available outside the streams processing environment)
Wikipedia:
“In mathematics and computer science, a directed acyclic graph is a finite directed
graph with no directed cycles. That is, it consists of finitely many vertices and edges,
with each edge directed from one vertex to another, such that there is no way to start at
any vertex v and follow a consistently-directed sequence of edges that eventually loops
back to v again. Equivalently, a DAG is a directed graph that has a topological ordering,
a sequence of the vertices such that every edge is directed from earlier to later in the
sequence.”
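The topological-ordering property quoted above is also how acyclicity is checked in practice. A minimal sketch in Python, using hypothetical operator names for a streaming topology (this is illustrative code, not part of any SPE):

```python
from collections import deque

def topological_order(nodes, edges):
    """Return a topological ordering of the graph, or None if it has a cycle."""
    indegree = {n: 0 for n in nodes}
    adjacency = {n: [] for n in nodes}
    for src, dst in edges:
        adjacency[src].append(dst)
        indegree[dst] += 1
    # Start from vertices with no incoming edges
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in adjacency[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # If some vertex was never reached, the graph contains a cycle
    return order if len(order) == len(nodes) else None

# A small streaming topology: source -> functor -> sink (hypothetical names)
nodes = ["source", "functor", "sink"]
edges = [("source", "functor"), ("functor", "sink")]
print(topological_order(nodes, edges))  # → ['source', 'functor', 'sink']
```

Because a valid ordering exists, the topology is a DAG; a graph with a cycle would return None.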
Streaming data - concepts & terminology (3)
Types of Operators:
• Source - Reads the input data in the form of streams.
• Sink - Writes the data of the output streams to external storage or
systems.
• Functor - An operator that filters, transforms, and performs
functions on the data of the input stream.
• Sort - Sorts streams data on defined keys.
• Split - Splits the input streams data into multiple output streams.
• Join - Joins the input streams data on defined keys.
• Aggregate - Aggregates streams data on defined keys.
• Barrier - Combines and coordinates streams data.
• Delay - Delays a stream data flow.
• Punctor - Identifies groups of data that should be processed
together.
The terminology used here is that of IBM Streams, but similar terminology applies to
other Streams Processing Engines (SPE).
References:
• Glossary of terms for IBM Streams (formerly known as IBM InfoSphere Streams)
https://www.ibm.com/support/knowledgecenter/en/SSCRJU_4.2.1/com.ibm.streams.glossary.doc/doc/glossary_streams.html
Examples of other terminology:
• Apache Storm: http://storm.apache.org/releases/current/Concepts.html
o A spout is a source of streams in a topology. Generally spouts will read
tuples from an external source and emit them into the topology.
o All processing in topologies is done in bolts. Bolts can do anything from
filtering, functions, aggregations, joins, talking to databases, and more.
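The source/functor/sink roles listed above can be sketched as chained Python generators. This is only an analogy to show how tuples flow through operators; it is not IBM Streams or Storm code, and the names and values are hypothetical:

```python
def source(records):
    """Source: ingest raw records as a stream of tuples."""
    for rec in records:
        yield rec

def functor(stream, threshold):
    """Functor: filter and transform tuples in flight."""
    for value in stream:
        if value >= threshold:       # filter step
            yield value * 2          # transform step

def sink(stream):
    """Sink: deliver tuples to an external system (here, just a list)."""
    return list(stream)

# Wire the operators into a simple linear topology
result = sink(functor(source([1, 5, 3, 8]), threshold=4))
print(result)  # → [10, 16]
```

Each operator sees one tuple at a time, which mirrors the "messages processed in isolation" criterion for streaming workloads.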
Streaming data - concepts & terminology (4)
• Topologies
▪ Generally defined in terms of a Directed Acyclic Graph (DAG)
• Reliability
▪ Different Stream Processing Engines (SPE) provide reliability differently:
− Some take full responsibility for guaranteeing that no tuple is lost (but that can
slow the engine down)
− Some require that the developer program for reliability using, for example,
redundant processing streams in parallel
• Operators / Software Nodes / Workers
▪ To handle volume and velocity of data, most SPEs divide the work among
multiple software operators (= typically JVM tasks) that work in parallel
▪ The software operators are often spread over multiple Hardware Nodes in
a cluster
Reliability, using the example of Apache Storm
(http://storm.apache.org/releases/current/Concepts.html):
Storm guarantees that every spout tuple will be fully processed by the topology. It does
this by tracking the tree of tuples triggered by every spout tuple and determining when
that tree of tuples has been successfully completed. Every topology has a "message
timeout" associated with it. If Storm fails to detect that a spout tuple has been
completed within that timeout, then it fails the tuple and replays it later.
To take advantage of Storm's reliability capabilities, you must tell Storm when new
edges in a tuple tree are being created and tell Storm whenever you've finished
processing an individual tuple. These are done using the OutputCollector object that
bolts use to emit tuples. Anchoring is done in the emit method, and you declare that
you're finished with a tuple using the ack method.
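The timeout-and-replay mechanism described above can be illustrated with a toy tracker. This is a conceptual sketch of the at-least-once idea, not Storm's actual OutputCollector API; class and method names are hypothetical:

```python
import time

class PendingTracker:
    """Track in-flight tuples; any tuple not acked within the message
    timeout is considered failed and must be replayed."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.pending = {}  # tuple_id -> timestamp when emitted

    def emit(self, tuple_id, now=None):
        """Record that a tuple has entered the topology."""
        self.pending[tuple_id] = now if now is not None else time.time()

    def ack(self, tuple_id):
        """Mark a tuple as fully processed."""
        self.pending.pop(tuple_id, None)

    def expired(self, now=None):
        """Return tuple ids whose processing timed out (candidates for replay)."""
        now = now if now is not None else time.time()
        return [t for t, emitted in self.pending.items()
                if now - emitted > self.timeout]

tracker = PendingTracker(timeout_seconds=30)
tracker.emit("t1", now=0)
tracker.emit("t2", now=0)
tracker.ack("t1")               # t1 completed its tuple tree
print(tracker.expired(now=60))  # → ['t2']  (t2 timed out, replay it)
```

Note the trade-off this slide mentions: tracking every tuple buys reliability at the cost of extra bookkeeping in the engine.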
Batch processing - classic approach
• With batch processing, the data is stationary (“at rest”) in a database or
an application store
• Operations are performed over all the data that is stored
Batch processing - classic approach
We are familiar with these operations in classic batch SQL. In the case of batch
processing, all that data is present at the time the SQL statement is processed.
But, with streaming data, the data is constantly flowing and thus other techniques need
to be performed. With streaming data, operations will be performed on windows of
data.
Stream processing - the real-time data approach
• Moving from a multi‐threaded program to one that takes advantage of
multiple nodes is a major rewrite of the application: we are now dealing
with multiple processes as well as the communication between them,
that is, message queues moving data as multiple streams
• Instead of all data, only windows of data are visible for processing
Stream processing - the real-time data approach
Sometimes a stream processing job needs to do something in regular time intervals,
regardless of how many incoming messages the job is processing. For example, say
you want to report the number of page views per minute. To do this, you increment a
counter every time you see a page view event. Once per minute, you send the current
counter value to an output stream and reset the counter to zero. This is a fixed time
window and is useful for reporting, for instance, sales that occurred during a clock-
hour.
Other methods of windowing of streaming data use a sliding window or a session
window (both illustrated above).
Microsoft Azure offers tumbling, hopping, and sliding windows to perform temporal
operations on streaming data. See Microsoft’s article: Introduction to Stream
Analytics Window functions, https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
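The per-minute page-view counter described above is a tumbling (fixed) window. A minimal sketch in Python; the timestamps and event names are hypothetical:

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Group (timestamp, event) pairs into fixed windows and count per window."""
    counts = defaultdict(int)
    for ts, _event in events:
        # Each event falls into exactly one non-overlapping window
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Page-view events at various timestamps (in seconds)
events = [(3, "view"), (15, "view"), (61, "view"), (62, "view"), (125, "view")]
print(tumbling_counts(events, window_seconds=60))
# → {0: 2, 60: 2, 120: 1}
```

A sliding window would instead let windows overlap, and a session window would close a window after a gap of inactivity; both change only how window_start is assigned.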
Streaming components & Streaming Data Engines
Open Source:
• Hortonworks HDF / NiFi
• Apache Storm
• Apache Flink
• Apache Kafka
• Apache Samza
• Apache Beam (an SDK)
• Apache Spark Streaming
Proprietary:
• IBM Streams (full SDE)
• Amazon Kinesis
• Microsoft Azure Stream Analytics
Streaming components & Streaming Data Engines
To work with streaming analytics, it is important to understand what the various
available components are and how they relate. The topic itself is quite complex and
deserving of a full workshop in itself - thus, here, we can only provide an introduction.
A full data pipeline (i.e., Streaming Application) involves:
• Accessing data at its source (“source operator” components) - Kafka can be used
here
• Processing data (serializing data, merging & joining individual streams,
referencing static data from in-memory stores and databases, transforming data,
performing aggregation and analytics, … ) - Storm is a component that is
sometimes used here
• Delivering data to long-term persistence and to on-the-fly visualization
(“sink operators”)
IBM Streams can handle all these operations using standard and custom-built
operators and is thus a full Streaming Data Engine (SDE), but open source
components can be used to build somewhat equivalent systems.
We are only going to look at HDF / NiFi and IBM Streams (data pipeline) in detail -
and separately; they are bolded in the list above.
Open source references
• http://storm.apache.org/
• http://flink.apache.org/
• http://kafka.apache.org
• http://samza.apache.org
Apache Samza is a distributed stream processing framework. It uses Apache
Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance,
processor isolation, security, and resource management.
• https://beam.apache.org/
Apache Beam is an advanced, unified programming model for implementing batch
and streaming data processing jobs that run on any execution engine.
• https://spark.apache.org/streaming/
Spark Streaming makes it easy to build scalable fault-tolerant streaming
applications
• Introduction to Spark Streaming (Hortonworks Tutorial)
https://hortonworks.com/hadoop-tutorial/introduction-spark-streaming/
Proprietary SDE references:
• IBM Streams
https://www.ibm.com/ms-en/marketplace/stream-computing
• Try IBM Streams (basic) for free
https://console.bluemix.net/catalog/services/streaming-analytics
• Amazon Kinesis
https://aws.amazon.com/kinesis/
• Microsoft Azure Stream Analytics
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
• Stock-trading analysis and alerts
• Fraud detection, data, and identity protections
• Embedded sensor and actuator analysis
• Web clickstream analytics.
Hortonworks HDF / NiFi
• Hortonworks DataFlow (HDF) provides the only end-to-end platform
that collects, curates, analyzes, and acts on data in real time, on-
premises or in the cloud
• With version 3.x, HDF offers a drag-and-drop visual interface
• HDF is an integrated solution that uses Apache NiFi/MiNiFi, Apache
Kafka, Apache Storm, Superset, and Druid components where
appropriate
• The HDF streaming real-time data analytics platform includes Data
Flow Management Systems, Stream Processing, and Enterprise
Services
• The newest additions to HDF include a Schema Repository
Hortonworks HDF / NiFi
What is Hortonworks DataFlow (HDF)?
• HDF is a collection of products for data movement, streaming
analytics, and the management and governance of streaming data
What is Hortonworks DataFlow (HDF)?
Hortonworks DataFlow (HDF) is a collection of products for data movement, streaming
analytics, and the management and governance of streaming data:
• Flow management: NiFi/MiNiFi
• Stream processing: Storm, Kafka, Streaming Analytics Manager
• Enterprise services: Schema Registry, Apache Ranger, Ambari
The Streams Processing element is the only component that competes with
IBM Streams.
There is a more complete competitive presentation on Storm available, but we will
cover the basics here.
Components of the HDF Platform
Components of the HDF Platform
Reference for this graphic: https://hortonworks.com/products/data-platforms/hdf/
Hortonworks provides a 7-part Data-in-Motion webinar series (available on demand) at:
https://hortonworks.com/blog/harness-power-data-motion/
NiFi & MiNiFi
• NiFi is a disk-based, microbatch ETL tool that can duplicate
the same processing on multiple hosts for scalability
▪ Web-based user interface
▪ Highly configurable / Loss tolerant vs guaranteed delivery / Low latency vs
high throughput
▪ Tracks dataflow from beginning to end
• MiNiFi - a subproject of Apache NiFi - is a complementary data collection
approach that supplements the core tenets of NiFi in dataflow
management, focusing on the collection of data at the source of its
creation
▪ Small size and low resource consumption
▪ Central management of agents
▪ Generation of data provenance (full chain of custody of information)
▪ Integration with NiFi for follow-on dataflow management
NiFi & MiNiFi
References:
• https://nifi.apache.org
• https://nifi.apache.org/minifi/
NiFi background:
• Originated at the National Security Agency (NSA) - more than eight years of
development as a closed-source product
• Apache incubator November 2014 as part of NSA technology transfer program
• Apache top-level project in July 2015
• Java based, running on a JVM
Wikipedia (“Apache NiFi”):
• Apache NiFi (short for NiagaraFiles) is a software project from the Apache
Software Foundation designed to automate the flow of data between software
systems. It is based on the "NiagaraFiles" software previously developed by the
NSA and open-sourced as a part of its technology transfer program in 2014
• The software design is based on the flow-based programming model and offers
features which prominently include the ability to operate within clusters, security
using TLS encryption, extensibility (users can write their own software to extend
its abilities) and improved usability features like a portal which can be used to
view and modify behavior visually.
NiFi is written in Java and runs within a Java virtual machine on the server
that hosts it. The main components of NiFi are:
• Web Server - the HTTP-based component used to visually control the software
and monitor events
• Flow Controller - serves as the brains of NiFi's behavior and controls the running
of NiFi extensions, scheduling the allocation of resources for this to happen
• Extensions - various plugins that allow NiFi to interact with different kinds of
systems
• FlowFile repository - used by NiFi to maintain and track status of the currently
active FlowFile, including the information that NiFi is helping move between
systems
• Cluster manager - an instance of NiFi that provides the sole management point
for the cluster; the same data flow runs on all the nodes of the cluster
• Content repository - where data in transit is maintained
• Provenance repository - metadata detailing the provenance of the data flowing
through the system
Software development and commercial support is offered by Hortonworks. The
software is fully open source. This software is also sold and supported by IBM.
MiNiFi is a subproject of NiFi designed to solve the difficulties of managing and
transmitting data feeds to and from the source of origin, often the first/last mile of
the digital signal, enabling edge intelligence to adjust flow behavior and provide
bi-directional communication.
Reference:
• https://hortonworks.com/blog/edge-intelligence-iot-apache-minifi/
• Free eBook (PDF, 52 pp., authored by Hortonworks & Attunity):
http://discover.attunity.com/apache-nifi-for-dummies-en-report-go-c-lp8558.html
Edge Intelligence for IoT with Apache MiNiFi
• Since the first mile of data collection (the far edge) is very distributed
and likely involves a very large number of end devices (i.e., IoT), MiNiFi
carries over all the main capabilities of NiFi
Edge Intelligence for IoT with Apache MiNiFi
Reference: https://hortonworks.com/blog/edge-intelligence-iot-apache-minifi/
MiNiFi is ideal for situations where there are a large number of distributed, connected
devices. For example, first-mile data collection from routers, security cameras, cable
modems, ATMs, security appliances, point-of-sale systems, weather detection systems,
fleets of trucks/trains/planes/ships, thermostats, and utility or power meters are all good
use cases for MiNiFi due to the distributed nature of their physical locations. MiNiFi then
enables multiple applications. For example, by deploying MiNiFi on point-of-sale
terminals, retailers can gather data from hundreds or thousands of terminals.
This edge collection role was often played in the past by Flume, but Flume has been
marked as deprecated. See:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_release-notes/content/deprecated_items.html
HDP dashboard - example
• Hortonworks provides a full
tutorial - Trucking IoT on HDF -
that covers the core concepts of
Storm and the role it plays in an
environment where real-time,
low-latency and distributed data
processing are important
▪ Continuous IoT data (Kafka)
▪ Real-time stream processing
(Storm)
▪ Visualization (Web application)
• https://hortonworks.com/hadoop-tutorial/trucking-iot-hdf/
HDP dashboard - example
This tutorial discusses a real-world use case and provides an understanding of the role
Storm plays within HDF.
• https://hortonworks.com/hadoop-tutorial/trucking-iot-hdf/
The use case is a trucking company that dispatches trucks across the country. The
trucks are outfitted with sensors that collect data (IoT, Internet of Things). The data
includes:
• The name of the driver
• The route the truck is using and intends to use
• The speed of the truck
• Events that recently occurred (speeding, the truck weaving out of its lane, following too
closely, etc.)
Data like this is generated often - perhaps several times per second - and is
streamed back to the company’s servers.
Components of the tutorial are:
• Demonstration
• A dive into the Storm internals and how to build a topology from scratch
• A discussion of how to package the Storm code and deploy onto a cluster
Controlling HDF components from Ambari
Example: Storm
Controlling HDF components from Ambari
The Ambari interface provides:
• Open source management platform
• Ongoing cluster management and maintenance (cluster size irrelevant) via Web
UI and REST API (cluster operations automation)
• Visualization of cluster health and extended cluster capability by wrapping custom
services under management
• Centralized security setup
IBM Streams
• IBM Streams is an advanced computing platform that allows user-
developed applications to quickly ingest, analyze, and correlate
information as it arrives from thousands of real-time sources
• The solution can handle very high data throughput rates, up to millions
of events or messages per second
• IBM Streams provides:
▪ Development support - Rich Eclipse-based, visual IDE lets solution
architects visually build applications or use familiar programming languages
like Java, Scala or Python
▪ Rich data connections - Connect with virtually any data source whether
structured, unstructured or streaming, and integrate with Hadoop, Spark
and other data infrastructures
▪ Analysis and visualization - Integrate with business solutions. Built-in
domain analytics like machine learning, natural language, spatial-temporal,
text, acoustic, and more, to create adaptive streams applications
IBM Streams
References:
• https://www.ibm.com/cloud/streaming-analytics
• Streaming Analytics: Resources
https://www.ibm.com/cloud/streaming-analytics/resources
• Free eBook: Addressing Data Volume, Velocity, and Variety with IBM InfoSphere
Streams V3.0
IBM Redbook series, PDF, 326 pp.
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf
• Toolkits, Sample, and Tutorials for IBM Streams
https://github.com/IBMStreams
• Forrester Total Economic Impact of IBM Streams:
This study provides readers with a framework to evaluate the potential financial
impact of Streams on their organizations
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMC15131USEN
Streams is based on nearly two decades of effort by the IBM Research team to extend
computing technology to handle advanced analysis of high volumes of data quickly.
How important is their research? Consider how it would help crime investigation to
analyze the output of any video cameras in the area that surrounds the scene of a
crime to identify specific faces of any persons of interest in the crowd and relay that
information to the unit that is responding. Similarly, what a competitive edge it could
provide by analyzing 6 million stock market messages per second and execute trades
with an average trade latency of only 25 microseconds (far faster than a hummingbird
flaps its wings). Think about how much time, money, and resources could be saved by
analyzing test results from chip-manufacturing wafer testers in real time to determine
whether there are defective chips before they leave the line.
System S: While at IBM, Dr. Edgar F. "Ted" Codd invented the relational database. In the
defining IBM Research project, it was referred to as System R, which stood for
Relational. The relational database is the foundation for data warehousing that
launched the highly successful Client/Server and On-Demand informational eras.
One of the cornerstones of that success was the capability of OLAP products that
are still used in critical business processes today.
When the IBM Research division again turned to the next evolution of analysis
(RTAP) for the Smarter Planet era, they set their sights on developing a
platform with the same level of world-changing success, and decided to call
their effort System S, which stood for
Streams. Like System R, System S was founded on the promise of a
revolutionary change to the analytic paradigm. The research of the Exploratory
Stream Processing Systems team at T.J. Watson Research Center, which was
set on advanced topics in highly scalable stream-processing applications for the
System S project, is the heart and soul of Streams.
Critical intelligence, informed actions, and operational efficiencies that are all available
in real time is the promise of Streams.
Comparison of IBM Streams vs NiFi
IBM Streams:
▪ Stream Data Engine (SDE)
▪ Complete cluster support
▪ C++ engine - performance & scalability
▪ Mature & proven product (v4.2)
▪ Memory-based
▪ Streaming analytics
▪ Many analytics operators
▪ Enterprise data source / sink support
▪ Web-based, command-line based, and REST-based monitoring tooling
▪ Drag-and-drop development environment + Streams Processing Language (SPL)
NiFi:
▪ Microbatch engine
▪ Master-slave, duplicate processing
▪ Java engine
▪ Evolving product
▪ Disk-based
▪ Extract-Transform-Load (ETL) oriented
▪ No analytics Processors
▪ Limited data source / sink support
▪ Web-based monitoring tooling
▪ Web-based development environment
▪ Needs Storm or Spark to provide the analytics
Comparison of IBM Streams vs NiFi
NiFi itself is not a streaming analytics engine. HDF uses Storm for the analytic
processing.
Apache Storm is a distributed stream processing computation framework written
predominantly in the Clojure programming language. Originally created by Nathan Marz
and team at BackType, the project was open sourced after being acquired by Twitter.
Why use Storm? http://storm.apache.org/
• Apache Storm is a free and open source distributed realtime computation system.
Storm makes it easy to reliably process unbounded streams of data, doing for
realtime processing what Hadoop did for batch processing. Storm is simple, can
be used with any programming language, and is a lot of fun to use!
• Storm has many use cases: realtime analytics, online machine learning,
continuous computation, distributed RPC, ETL, and more. Storm is fast: a
benchmark clocked it at over a million tuples processed per second per node. It is
scalable, fault-tolerant, guarantees your data will be processed, and is easy to set
up and operate.
• Storm integrates with the queueing and database technologies you already use.
A Storm topology consumes streams of data and processes those streams in
arbitrarily complex ways, repartitioning the streams between each stage of the
computation however needed.
• The reference provides a tutorial.
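The continuous, per-tuple computation Storm performs on unbounded streams can be illustrated with a rolling average. This is a hedged sketch in plain Python, not Storm code; the window size and readings are arbitrary:

```python
# Continuously emit the mean of the last `window` values -- the kind
# of rolling computation a stream-processing bolt performs on an
# unbounded feed, producing a result per tuple rather than per batch.
from collections import deque

def sliding_average(stream, window=3):
    buf = deque(maxlen=window)      # oldest value drops out automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

readings = [10, 20, 30, 40, 50]
print(list(sliding_average(readings)))
# [10.0, 15.0, 20.0, 30.0, 40.0]
```

Because the generator never materializes the full input, the same code would run unchanged over a feed that never ends, which is the defining property of stream processing.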
Apache Storm with v1.0 works with Apache Trident. Trident is a high-level
abstraction for doing realtime computing on top of Storm. Trident provides
microbatching.
• Trident allows you to seamlessly intermix high-throughput (millions of messages
per second), stateful stream processing with low latency distributed querying. If
you're familiar with high level batch processing tools like Pig or Cascading, the
concepts of Trident will be very familiar – Trident has joins, aggregations,
grouping, functions, and filters. In addition to these, Trident adds primitives for
doing stateful, incremental processing on top of any database or persistence
store. Trident has consistent, exactly-once semantics, so it is easy to reason
about Trident topologies.
• Refer: http://storm.apache.org/releases/current/Trident-tutorial.html
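Microbatching, the core idea Trident adds on top of Storm's tuple-at-a-time model, amounts to grouping the stream into small fixed-size batches before processing. A minimal sketch in plain Python (the batch size is chosen arbitrarily):

```python
# Group an incoming stream into fixed-size batches, the way a
# microbatch engine such as Trident (or NiFi) processes records in
# small groups rather than one tuple at a time.
def microbatches(stream, size):
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                       # flush the final partial batch
        yield batch

print(list(microbatches(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

The trade-off this makes visible: each batch boundary adds latency, but per-batch processing amortizes overhead and simplifies exactly-once bookkeeping.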
References:
• Apache Storm: Overview
https://hortonworks.com/apache/storm/
• The Future of Apache Storm (YouTube, June 2016)
https://youtu.be/_Q2uzRIkTd8
• Storm or Spark: Choose your real-time weapon
https://www.infoworld.com/article/2854894/application-development/spark-and-
storm-for-real-time-computation.html
Advantages of IBM Streams & Streams Studio
Streams Studio provides the ability to create stream processing
topologies without programming through a comprehensive set of
capabilities:
• Pre-built Sources (input): Kafka, Event Hubs, HDFS
• Pre-built Processing (operators): Aggregate, Branch, Join, PMML,
Projection Bolt, Rule
• Pre-built Sink (output): Cassandra, Druid, Hive, HBase, HDFS,
JDBC, Kafka
• Notification (email), Open TSDB, Solr
• Pre-built visualization: 30+ business visualizations - and these can
be organized into a dashboard
• Extensible: Ability to add user-defined functions (UDF) and
aggregates (UDA) through jar files
Advantages of IBM Streams & Streams Studio
Being able to create stream processing topologies without programming is a worthwhile
goal. This is something that is possible already with IBM Streams using Streams
Studio.
IBM Streams is a complete, off-the-shelf Streaming Data Engine (SDE) that is ready to
run immediately after installation. In addition, you have all the tools to develop custom
source and sink adapters.
PMML stands for "Predictive Model Markup Language." It is the de-facto standard to
represent predictive solutions. A PMML file may contain a myriad of data
transformations (pre- and post-processing) as well as one or more predictive models.
IBM Streams comes with the ability to cross-integrate with IBM SPSS Statistics to
provide PMML capability and work with R, the open source statistical package that
supports PMML.
What is PMML?
https://www.ibm.com/developerworks/library/ba-ind-PMML1/
Where does IBM Streams fit in the processing cycle?
Where does IBM Streams fit in the processing cycle?
What if you wanted to add in fraud detection to your processing cycle before authorizing
the transaction and committing it to the database? Fraud detection has to happen in
real-time in order for it to truly be of a benefit. So instead of taking our data and running
it directly into the authorization / OLTP process, we process it as streaming data
using IBM Streams. The results of this process can be directed to our transactional
system or data warehouse or to a Data Lake or Hadoop system storage (e.g., HDFS).
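The idea of scoring each transaction in flight, before the OLTP commit, can be sketched as a simple routing rule. The field names, threshold, and checks below are invented for illustration; a production system would apply a trained model rather than two hand-written rules:

```python
# Score each transaction as it arrives and route it before the OLTP
# commit: suspicious ones go to fraud review, the rest straight to
# authorization. A stand-in for in-flight scoring, not a real model.
def authorize(transactions, limit=5000.0):
    for txn in transactions:
        if txn["amount"] > limit or txn["country"] != txn["home_country"]:
            yield {**txn, "route": "fraud_review"}
        else:
            yield {**txn, "route": "authorize"}

feed = [
    {"id": 1, "amount": 120.0, "country": "US", "home_country": "US"},
    {"id": 2, "amount": 9800.0, "country": "US", "home_country": "US"},
    {"id": 3, "amount": 75.0, "country": "BR", "home_country": "US"},
]
for decision in authorize(feed):
    print(decision["id"], decision["route"])
# 1 authorize
# 2 fraud_review
# 3 fraud_review
```

The point is the placement: the decision is made on the stream itself, so only cleared transactions ever reach the authorization step.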
Real-time processing to find new insights
Multi-channel customer
sentiment and experience
analysis
Detect life-threatening
conditions at hospitals in
time to intervene
Predict weather patterns to plan
optimal wind turbine usage, and
optimize capital expenditure on
asset placement
Make risk decisions based on
real-time transactional data
Identify criminals and threats
from disparate video, audio,
and data feeds
Real-time processing to find new insights
The graphic represents some of the situations in which IBM Streams has been applied
to perform real-time analytics on streaming data.
Today organizations are only tapping in to a small fraction of the data that is available to
them. The challenge is figuring out how to analyze ALL of the data, and find insights in
these new and unconventional data types. Imagine if you could analyze the 7 terabytes
of tweets being created each day to figure out what people are saying about your
products, and figure out who the key influencers are within your target demographics.
Can you imagine being able to mine this data to identify new market opportunities?
What if hospitals could take the thousands of sensor readings collected every hour per
patient in ICUs to identify subtle indications that the patient is becoming unwell, days
earlier than is allowed by traditional techniques? Imagine if a green energy company
could use petabytes of weather data along with massive volumes of operational data to
optimize asset location and utilization, making these environmentally friendly energy
sources more cost competitive with traditional sources.
What if you could make risk decisions, such as whether or not someone qualifies for a
mortgage, in minutes, by analyzing many sources of data, including real-time
transactional data, while the client is still on the phone or in the office? Or if law
enforcement agencies could analyze audio and video feeds in real-time without human
intervention to identify suspicious activity?
As new sources of data continue to grow in volume, variety and velocity, so too does
the potential of this data to revolutionize the decision-making processes in every
industry.
Components of IBM Streams
Application graph of an IBM Streams application
Application graph of an IBM Streams application
Applications can be developed in the IBM Streams Studio by using IBM Streams
Processing Language (SPL), a declarative language that is customized for stream
computing. Applications with the latest release are, however, generally developed using
a drag-and-drop graphical approach.
After the applications are developed, they are deployed to Streams Runtime
environment. By using Streams Live Graph, you can monitor the performance of the
runtime cluster from the perspective of individual machines and the communications
between them.
Virtually any device, sensor, or application system can be defined by using the
language. But, there are predefined source and output adapters that can further simplify
application development. As examples, IBM delivers the following adapters, among
many others:
• TCP/IP, UDP/IP, and files
• IBM WebSphere Front Office, which delivers stock feeds from major exchanges
worldwide
• IBM solidDB® includes an in-memory, persistent database that uses the Solid
Accelerator API
• Relational databases, which are supported by using industry standard ODBC
Applications, such as the one shown above, almost always feature multiple steps.
For example, some utilities began paying customers who sign up for a particular usage
plan to have their air conditioning units turned off for a short time, allowing the
temperature to be changed. An application to implement this plan collects data from
meters and might apply a filter to monitor only for those customers who selected this
service. Then, a usage model must be applied that was selected for that company.
Up-to-date usage contracts then need to be applied by retrieving them, extracting text,
filtering on key words, and possibly applying a seasonal adjustment. Current weather
information can be collected and parsed from the US National Oceanic &
Atmospheric Administration (NOAA), which has weather stations across the United
States. After the correct location is parsed for, text can be extracted and temperature
history can be read from a database and compared to historical information.
Optionally, the latest temperature history could be stored in a warehouse for future use.
Finally, the three streams (meter information, usage contract, and current weather
comparison to historical weather) can be used to take actions.
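The final step, combining the three streams to take an action, can be sketched as a keyed join. The field names, the opt-in rule, and the weather factor are all invented for illustration:

```python
# Join per-customer meter readings with usage contracts and a weather
# adjustment to decide which A/C units may be cycled off. Customers
# who did not opt in to the plan are excluded from the decision.
def combine(meters, contracts, weather_adjust):
    decisions = {}
    for cust, kw in meters.items():
        contract = contracts.get(cust)
        if contract and contract["opted_in"]:
            # apply the seasonal/weather factor before comparing to the plan
            effective = kw * weather_adjust
            decisions[cust] = ("cycle_off" if effective > contract["threshold_kw"]
                               else "leave_on")
    return decisions

meters = {"c1": 4.0, "c2": 1.5, "c3": 6.0}
contracts = {
    "c1": {"opted_in": True, "threshold_kw": 3.0},
    "c2": {"opted_in": True, "threshold_kw": 3.0},
    "c3": {"opted_in": False, "threshold_kw": 3.0},
}
print(combine(meters, contracts, weather_adjust=1.1))
# {'c1': 'cycle_off', 'c2': 'leave_on'}
```

In a real Streams application each of the three inputs would arrive as a continuous stream and the join would be windowed; the dictionary lookup here stands in for that keyed correlation.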
This example is taken from the free IBM RedBook (PDF): Ebbers, M., Abdel-Gayed, A.,
Budhi, V. B., Dolot, F., Kamat, V., Picone, R., & Trevelin, J. (2013). Addressing Data
Volume, Velocity, and Variety with IBM InfoSphere Streams V3.0. Raleigh, NC: IBM
International Technical Support Organization (ITSO), p. 31.
This RedBook is available from:
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf
Demo: “Learn IBM Streams in 5 minutes” (5:53)
Play at full-screen, the YouTube Video:
https://www.ibm.com/ms-
en/marketplace/stream-computing
Demo: “Learn IBM Streams in 5 minutes” (5:53)
Example exploration: How to capture data streams and analyze them faster with
IBM Streams.
YouTube URL: https://www.youtube.com/watch?v=HLHGRy7Hif4
Demo: “IBM Streams Designer Overview” (2:54)
Play at full-screen, the YouTube Video:
https://youtu.be/i6nBhFoXZqo
Demo: “IBM Streams Designer Overview” (2:54)
This video provides an overview of IBM Streams Designer, part of IBM Watson Data
Platform.
Find more videos in the IBM Streams Designer Learning Center at:
http://ibm.biz/streams-designer-learning.
Unit summary
• Define streaming data
• Describe IBM as a pioneer in streaming data - with System S
IBM Streams
• Explain streaming data - concepts & terminology
• Compare and contrast batch data vs streaming data
• List and explain streaming components & Streaming Data Engines
(SDEs)
Unit summary