The document discusses streaming data ingestion in Big Data and IoT applications, highlighting the importance of real-time data processing and integration. It covers various technologies such as Apache NiFi, StreamSets Data Collector, and Kafka Connect, and emphasizes the need for efficient event hubs and stream analytics. The author, Guido Schmutz, provides insights into the challenges of data ingestion and the architectural frameworks necessary for modern data analytics solutions.

Streaming Data Ingestion in Big Data and IoT Applications

Guido Schmutz – 27.9.2018

@gschmutz guidoschmutz.wordpress.com
Guido Schmutz

• Working at Trivadis for more than 21 years
• Oracle ACE Director for Fusion Middleware and SOA
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Head of Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 30 years of software development experience

Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Trivadis – with over 600 specialists and IT experts in your region

• 14 Trivadis branches and more than 600 employees
• 200 Service Level Agreements
• Over 4,000 training participants
• Research and development budget: CHF 5.0 million
• Financially self-supporting and sustainably profitable
• Experience from more than 1,900 projects per year at over 800 customers

Locations: Basel, Bern, Brugg, Copenhagen, Düsseldorf, Frankfurt, Freiburg, Geneva, Hamburg, Lausanne, Munich, Stuttgart, Vienna, Zurich
Agenda

1. Big Data and IoT Reference Architecture
2. Event Hub
3. Stream Data Integration
   • Apache NiFi
   • StreamSets Data Collector
   • Kafka Connect
4. Summary
Big Data and IoT Reference Architecture
Big Data solves Volume and Variety – not Velocity

[Diagram: bulk sources (files, databases) are loaded with high latency via file/SQL import into a Big Data platform (Hadoop cluster) with raw and refined storage and parallel processing; results are extracted to the enterprise data warehouse, BI tools, search/explore, and enterprise apps via APIs.]


Big Data solves Volume and Variety – not Velocity

[Diagram: the same batch architecture, now joined by event sources (mobile apps, IoT data, location, social, telemetry) emitting event streams that the high-latency batch pipeline cannot serve.]


Big Data solves Volume and Variety – not Velocity

[Diagram: an Event Hub collects the event streams and feeds them into the Big Data platform for machine learning, graph algorithms, and natural language processing; the path remains high latency.]


Stream Processing Architecture solves Velocity

[Diagram: event sources publish event streams to an Event Hub; a Stream Analytics platform processes them against reference models with low(est) latency but no history, feeding results to dashboards, search/explore, and enterprise apps, while bulk sources still flow to the enterprise data warehouse and BI tools.]


Big Data for all historical data analysis

[Diagram: the Event Hub additionally forwards event streams via a data flow into the Big Data platform's raw and refined storage, so stream analytics covers the low-latency path while Hadoop-based parallel processing covers all historical analysis.]


Integrate existing systems through CDC

[Diagram: a traditional silo-based system (user interface, logic, data store) is tapped by a CDC connector that publishes change events as an event stream to the Event Hub, from which consuming systems (state, logic) read.]

• Capture changes directly on the database
• Change Data Capture (CDC) => think of it like a global database trigger
• Transform existing systems into event producers
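One common way to implement this against the Event Hub is a Kafka Connect source connector such as Debezium; a hypothetical sketch (connector name, Connect endpoint, database coordinates, and credentials are illustrative; the deck itself does not prescribe Debezium):

#!/bin/bash
# register a Debezium MySQL CDC connector with Kafka Connect
curl -X "POST" "http://connect:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}'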


Integrate existing systems with lower latency through CDC

[Diagram: the same architecture, with existing databases now feeding the Event Hub through CDC instead of periodic file/SQL import, reducing ingestion latency.]


New systems participate in event-oriented fashion

[Diagram: new microservice platforms (service logic, state, APIs) consume and produce event streams via the Event Hub, alongside the stream analytics platform with its event stream processors and the Big Data platform.]
Edge computing allows processing close to data sources

[Diagram: an edge node with its own event hub, storage, and rules processes IoT sensor data close to the source and forwards event streams to the central Event Hub and the rest of the architecture.]
Unified Architecture for Modern Data Analytics Solutions

[Diagram: the complete picture: bulk sources flow into the Big Data platform, event sources flow through edge nodes and the Event Hub into stream analytics and microservices, and results reach the enterprise data warehouse, BI tools, search/explore, and enterprise apps.]
Two Types of Stream Processing (from Gartner)

Stream Data Integration
• primarily focuses on the ingestion and processing of data sources targeting real-time extract-transform-load (ETL) and data integration use cases
• filter and enrich the data
• optionally calculate time-windowed aggregations before storing the results in a database or file system

Stream Analytics
• targets analytics use cases
• calculating aggregates and detecting patterns to generate higher-level, more relevant summary information (complex events)
• complex events may signify threats or opportunities that require a response from the business through real-time dashboards, alerts or decision automation


Event Hub
Implementing "Event Hub"
Enterprise Data
Bulk Source Hadoop Clusterd Warehouse
Hadoop Cluster
Cluster Infrastructure
File File Import / SQL Import

Refined
DB
Extract Results
Storage SQL BI Tools
Replay
DB

Raw
Parallel
Processing
Storage
Event Source Big Data Batch Analytics
Event Stream Event Search / Explore
Mobile Hub Event
Apps Stream Search
{ }
IoT Edge Node
Stream State
Data API
Processor
Event Hub Stream Analytics Enterprise Apps
Location

{ }
{ } Service
Social
Storage API Logic
Microservice State API
Weather Rules Event
Stream Modern Applications
Introduction to Stream Processing
Apache Kafka – A Streaming Platform

[Diagram panels: High-Level Architecture; Scale-Out Architecture; Distributed Log at the Core; Logs do not (necessarily) forget.]


Hold Data for Long-Term – Data Retention

1. Never
2. Time based (TTL): log.retention.{ms | minutes | hours}
3. Size based: log.retention.bytes
4. Log compaction based (entries with the same key are removed):

kafka-topics.sh --zookeeper zk:2181 \
  --create --topic customers \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact
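The same topic-level mechanism covers the other policies; a minimal sketch, assuming an illustrative topic name vehicle-positions and the same ZooKeeper address, creating a topic whose entries expire after seven days:

# time-based retention: keep messages for 7 days (604800000 ms)
kafka-topics.sh --zookeeper zk:2181 \
  --create --topic vehicle-positions \
  --replication-factor 1 \
  --partitions 1 \
  --config retention.ms=604800000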


Keep Topics in Compacted Form

[Diagram: a topic with entries at offsets 0–11 over keys K1–K6, several keys written multiple times with values V1–V11; after compaction only the latest entry per key survives, leaving offsets 3, 4, 6, 8, 9, 10 with keys K1, K3, K4, K5, K2, K6.]
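Compaction can also be switched on for an existing topic by altering its configuration; a minimal sketch using the ZooKeeper-based tooling of the same era, against the customers topic created above:

# enable log compaction on an existing topic
kafka-configs.sh --zookeeper zk:2181 --alter \
  --entity-type topics --entity-name customers \
  --add-config cleanup.policy=compact

# verify the topic-level overrides
kafka-configs.sh --zookeeper zk:2181 --describe \
  --entity-type topics --entity-name customers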


Stream Data Integration
Implementing "Stream Data Integration"
Enterprise Data
Bulk Source Hadoop Clusterd Warehouse
Hadoop Cluster
Cluster Infrastructure
File File Import / SQL Import

Refined
DB
Extract Results
Storage SQL BI Tools
Replay
DB

Raw
Parallel
Processing
Storage
Event Source Big Data Batch Analytics
Event Stream Event Search / Explore
Mobile Hub Event
Apps Stream Search
{ }
IoT Edge Node
Stream State
Data API
Processor
Event Hub Stream Analytics Enterprise Apps
Location

{ }
{ } Service
Social
Storage API Logic
Microservice State API
Weather Rules Event
Stream Modern Applications
Introduction to Stream Processing
Integrating (Streaming) Data Sources

• File Polling
• SQL Polling
• File Stream (File Tailing)
• File Stream (Appender)
• Change Data Capture (CDC)
• Sensor Stream


IoT devices will often not be able to talk to Kafka directly

[Diagram: sources reach Event Hub topics through different bridges: file sources via a dataflow gateway, database sources via CDC (natively or through a CDC gateway), social feeds via native connectors, and IoT sensors via REST, an MQTT messaging broker, or an IoT gateway; the topics are then consumed by big data, connect, dataflow, and stream processing components.]
Why is Data Ingestion Difficult?

Infrastructure Drift
• Physical and logical infrastructure changes rapidly
• Key challenges: infrastructure automation, edge deployment

Structure Drift
• Data structures and formats evolve and change unexpectedly
• Key challenges: consumption readiness, corruption and loss

Semantic Drift
• Data semantics change with evolving applications
• Key challenges: timely intervention, system consistency

Source: StreamSets
Integration with or without Transformation?

Zero Transformation
• No transformation, plain ingest, no schema validation
• Keep the original format – Text, CSV, …
• Allows storing data that may have errors in the schema

Format Transformation
• Better named "Format Translation"
• Simply change the format, e.g. from Text to Avro
• Does schema validation

Enrichment Transformation
• Add new data to the message
• Do not change existing values
• Convert a value from one system to another and add it to the message

Value Transformation
• Replaces values in the message
• Convert a value from one system to another and change the value in-place
• Destroys the raw data!


Demo Case

[Diagram: Truck-1 through Truck-5 publish position messages to MQTT topics truck/nn/position; an ingestion component (to be chosen) moves them into a raw truck position topic and a Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
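A single position event can be simulated with the Mosquitto client tools; a minimal sketch, assuming a broker on localhost:1883 (host and port are illustrative):

# publish one sample position for truck 57
mosquitto_pub -h localhost -p 1883 -t "truck/57/position" \
  -m '{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}'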
Stream Data Integration: Apache NiFi
Apache NiFi

• Originated at NSA as Niagarafiles – developed behind closed doors for 8 years
• Open sourced December 2014, Apache Top Level Project July 2015
• Look-and-feel modernized in 2016
• Opaque, "file-oriented" payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data provenance and data lineage
• Web-based user interface
Processors for Source and Sink

• ConsumeXXXX (AMQP, EWS, IMAP, JMS, Kafka, MQTT, POP3, …)
• DeleteXXXX (DynamoDB, Elasticsearch, HDFS, RethinkDB, S3, SQS, ...)
• FetchXXXX (AzureBlobStorage, ElasticSearch, File, FTP, HBase, HDFS, S3 ...)
• ExecuteXXXX (FlumeSink, FlumeSource, Script, SQL, ...)
• GetXXXX (AzureEventHub, Couchbase, DynamoDB, File, FTP, HBase, HDFS, HTTP, Ignite, JMSQueue, JMSTopic, Kafka, Mongo, Solr, Splunk, SQS, TCP, ...)
• ListenXXXX (HTTP, RELP, SMTP, Syslog, TCP, UDP, WebSocket, ...)
• PublishXXXX (Kafka, MQTT)
• PutXXXX (AzureBlobStorage, AzureEventHub, CassandraQL, CloudWatchMetric, Couchbase, DynamoDB, Elasticsearch, Email, FTP, File, HBase, HDFS, HiveQL, Kudu, Lambda, Mongo, Parquet, Slack, SQL, TCP, ....)
• QueryXXXX (Cassandra, DatabaseTable, DNS, Elasticsearch)
Processors for Processing

• ConvertXxxxToYyyy
• ConvertRecord
• EnforceOrder
• EncryptContent
• ExtractXXXX (AvroMetadata, EmailAttachments, Grok, HL7Attributes, ImageMetadata, ...)
• GeoEnrichIP
• JoltTransformJSON
• MergeContent
• ReplaceText
• ResizeImage
• SplitXXXX (Avro, Content, JSON, Record, Xml, ...)
• TailFile
• TransformXML
• UpdateAttribute
Demo Case

[Diagram: Truck-1 through Truck-5 publish to MQTT topics truck/nn/position on brokers at ports 1883 and 1884; a NiFi "MQTT to Kafka" flow writes into the raw truck position Kafka topic, and a "Kafka to Raw" flow persists into the Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
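Once the flow runs, the raw topic can be inspected from the command line; a minimal sketch, assuming the topic name truck_position (as configured in the Kafka Connect demo later) and a broker at localhost:9092 (illustrative):

# tail the raw truck positions arriving in Kafka
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic truck_position --from-beginning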
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Processor
Demo: Kafka Processor
Demo: Masking Field with ReplaceText Processor
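The ReplaceText processor performs a regular-expression search-and-replace on the flow file content. A minimal sketch of the masking idea, shown here with sed purely for illustration (in the demo the equivalent regex lives in the processor's Search Value / Replacement Value properties; field name and mask are illustrative):

# mask the latitude field in the JSON payload
echo '{"truckid":"57","latitude":"38.65","longitude":"-90.21"}' \
  | sed -E 's/"latitude":"[^"]*"/"latitude":"xx.xx"/'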
Stream Data Integration: StreamSets Data Collector
StreamSets Data Collector

• Founded by ex-Cloudera, Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by 'civilians'
• Relatively new – first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
StreamSets Origins

• An origin stage represents the source for the pipeline; you can use a single origin stage in a pipeline
• Many origins are available out of the box
• API for writing custom origins

Source: https://streamsets.com/connectors
StreamSets Processors

• A processor stage represents a type of data processing that you want to perform; use as many processors in a pipeline as you need
• Programming languages supported: Java, JavaScript, Jython, Groovy, Java Expression Language (EL), Spark

Some of the processors available out of the box:
• Expression Evaluator
• Field Flattener
• Field Hasher
• Field Masker
• Field Merger
• Field Order
• Field Splitter
• Field Zip
• Groovy Evaluator
• JDBC Lookup
• JSON Parser
• Spark Evaluator
• …
StreamSets Destinations

• A destination stage represents the target for a pipeline; you can use one or more destinations in a pipeline
• Many destinations are available out of the box
• API for writing custom destinations

Source: https://streamsets.com/connectors
Demo Case

[Diagram: StreamSets Edge pipelines MQTT-1 and MQTT-2 ("to Kafka") read truck/nn/position from brokers on ports 1883 and 1884 and write into the raw truck position Kafka topic; a "Kafka to Raw" pipeline persists into the Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
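To check what the edge pipelines receive, the MQTT topics can be watched directly; a minimal sketch, assuming a broker on localhost:1883 (illustrative):

# subscribe to all truck position topics, printing topic names alongside payloads
mosquitto_sub -h localhost -p 1883 -t "truck/+/position" -v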
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Source
Demo: Kafka Sink
Demo: Dataflow for MQTT to Kafka
Demo: Masking fields
Demo: Sending Message to Kafka in Avro
StreamSets Dataflow Performance Manager

• Map dataflows to topologies, manage releases & track changes
• Measure KPIs and establish baselines for data availability and accuracy
• Master dataflow operations through Data SLAs

Source: https://streamsets.com/connectors
Stream Data Integration: Kafka Connect
Kafka Connect – Overview

[Diagram: source connectors pull data from external systems into Kafka topics; sink connectors push data from Kafka topics into external systems.]


Kafka Connect – Single Message Transforms (SMT)

• Simple transformations for a single message
• Defined as part of Kafka Connect
  • some useful transforms provided out-of-the-box
  • easily implement your own
• Optionally deploy 1+ transforms with each connector (see the sketch below)
  • modify messages produced by a source connector
  • modify messages sent to sink connectors
• Makes it much easier to mix and match connectors

Some of the currently available transforms: InsertField, ReplaceField, MaskField, ValueToKey, ExtractField, TimestampRouter, RegexRouter, SetSchemaMetaData, Flatten, TimestampConverter
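A hypothetical configuration sketch attaching one of the out-of-the-box transforms (RegexRouter, which rewrites the destination topic name) to the MQTT source connector used in the demo that follows; connector name, REST endpoint, and topic names are illustrative:

#!/bin/bash
# route messages from the MQTT source into a "_raw" topic via an SMT
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "mqtt-source-routed",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "mqtt.server.uri": "tcp://mosquitto:1883",
    "mqtt.topics": "truck/+/position",
    "kafka.topic": "truck_position",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "truck_position",
    "transforms.route.replacement": "truck_position_raw"
  }
}'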
Kafka Connect – Many Connectors

• Certified Connectors: 20+ from Confluent and partners, with Confluent-supported connectors
• Community Connectors: 60+ since the first release (Kafka 0.9+)

Source: http://www.confluent.io/product/connectors
Demo Case

[Diagram: Truck-1 through Truck-5 publish to MQTT topics truck/nn/position on brokers at ports 1883 and 1884; "MQTT to Kafka" bridges move the messages into the raw truck position Kafka topic and onward into the Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
Demo (II) – devices send to MQTT instead of Kafka
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
-H "Content-Type: application/json" \
-d $'{
"name": "mqtt-source",
"config": {
"connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
"tasks.max": "1",
"name": "mqtt-source",
"mqtt.server.uri": "tcp://mosquitto:1883",
"mqtt.topics": "truck/+/position",
"kafka.topic":"truck_position",
"mqtt.clean.session.enabled":"true",
"mqtt.connect.timeout.seconds":"30",
"mqtt.keepalive.interval.seconds":"60",
"mqtt.qos":"0"
}
}'
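After registration, the Connect REST API reports whether the connector and its tasks are running; a minimal sketch against the same (illustrative) endpoint:

#!/bin/bash
# check connector and task state
curl -s "http://192.168.69.138:8083/connectors/mqtt-source/status"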
Summary

Apache NiFi
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for Edge computing
• data lineage and data provenance
• support for backpressure
• no transport mechanism (DEV/TST/PROD)
• custom processors
• supported by Hortonworks

StreamSets
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for Edge computing
• data lineage and data provenance
• no transport mechanism
• custom sources, sinks, processors
• supported by StreamSets

Kafka Connect
• declarative style data flows
• simplicity – "simple things done simple"
• very well integrated with Kafka – comes with Kafka
• Single Message Transforms (SMT)
• use Kafka Streams for complex data flows
• custom connectors
• supported by Confluent
Technology on its own won't help you.
You need to know how to use it properly.
