Streaming Data Ingestion v1 181001151203
Streaming Data Ingestion v1 181001151203
und IoT-Anwendungen
Guido Schmutz – 27.9.2018
@gschmutz guidoschmutz.wordpress.com
Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
With over 600 specialists and IT experts in your region.
COPENHAGEN
Hadoop Clusterd
File Hadoop Cluster
Big Data Platform
DB
Refined
Extract
BI Tools
SQL
File Import / SQL Import
Results
DB Storage
Raw
Parallel
Processing Search / Explore
Storage
Enterprise Apps
{ }
API Logic
Hadoop Clusterd
File Hadoop Cluster
Big Data Platform
DB
Refined
Extract
BI Tools
SQL
File Import / SQL Import
Results
DB Storage
Raw
Parallel
Event Source
Processing Search / Explore
Mobile Storage
Apps
IoT
Data Event Stream
Enterprise Apps
Location
{ }
Social
API Logic
Telemetry
Hadoop Clusterd
File Hadoop Cluster
Big Data Platform
DB
Refined
Extract
BI Tools
SQL
File Import / SQL Import
Results
DB Storage
Raw
Parallel
Event Source high latency
Processing Search / Explore
Mobile Storage
Apps Event
Event
Event
Hub • Machine Learning
Hub
IoT Hub • Graph Algorithms
Data Event Stream
• Natural Language Processing
Enterprise Apps
Location
{ }
Social
API Logic
Telemetry
File
DB
Extract
BI Tools
DB
Enterprise Apps
Location
Results
Event Stream Analytics { }
Stream
Social
API Logic
Reference / Dashboard
Telemetry Models
Refined
DB
Extract Results
Storage
BI Tools
DB Event Data Flow
Raw
Hub Parallel
Processing
Storage
Event Source
Search / Explore
Mobile Search
Hadoop Clusterd
Apps Hadoop Cluster
Stream Analytics
IoT Event Event Platform
Data Stream Stream
Enterprise Apps
Location
Results
Event Stream Analytics { }
Stream
Social
API Logic
Reference / Dashboard
Telemetry Models
CDC
CDC Connector
User Interface Logic Data
Event
Stream
Refined
DB
Extract Results
Storage
BI Tools
DB Event Data Flow
Raw
Hub Parallel
Processing
Storage
Event Source
Search / Explore
Mobile Search
Hadoop Clusterd
Apps Hadoop Cluster
Stream Analytics
IoT Event Event Platform
Data Stream Stream
Enterprise Apps
Location
Results
Event Stream Analytics { }
Stream
Social
API Logic
Reference / Dashboard
Telemetry Models
Refined
DB
Extract Results
Storage SQL BI Tools
Raw
Hub Parallel
Processing
Storage
Event Source
Search / Explore
Mobile Event
Apps Stream Stream Analytics Platform Search
IoT Event
Stream
{ }
Data
Stream State
Event API
Stream
Processor Enterprise Apps
Location
Event
Stream { }
Microservice Platform Service
Social
API Logic
{ }
Telemetry Event
Microservice State API
Stream
Introduction to Stream Processing
Edge computing allows processing close to data sources
Enterprise Data
Bulk Source Hadoop Clusterd Warehouse
Hadoop Cluster
Big Data Platform
File File Import / SQL Import
Refined
DB
Extract Results
Storage SQL BI Tools
DB
Raw
Parallel
Processing
Storage
Event Source
Event Stream Event Search / Explore
Mobile Hub Event
Apps Stream Stream Analytics Platform Search
{ }
Microservice Platfrom Service
Social
Storage API Logic
{ }
Telemetry Event
Microservice State API
Rules Stream
Introduction to Stream Processing
Unified Architecture for Modern Data Analytics Solutions
Enterprise Data
Hadoop Clusterd Warehouse
Bulk Source Hadoop Cluster
Big Data
File File Import / SQL Import
Refined
DB
Extract Results
Storage SQL BI Tools
DB
Raw
Parallel
Processing
Storage
Event Source Event Stream Event Search / Explore
Stream Analytics
Mobile Hub Event
Apps Stream Search
Refined
DB
Extract Results
Storage SQL BI Tools
Replay
DB
Raw
Parallel
Processing
Storage
Event Source Big Data Batch Analytics
Event Stream Event Search / Explore
Mobile Hub Event
Apps Stream Search
{ }
IoT Edge Node
Stream State
Data API
Processor
Event Hub Stream Analytics Enterprise Apps
Location
{ }
{ } Service
Social
Storage API Logic
Microservice State API
Weather Rules Event
Stream Modern Applications
Introduction to Stream Processing
Apache Kafka – A Streaming Platform
1. Never Broker 1
0 1 2 3 4 5 6 7 8 9 10 11 Offset
K1 K2 K1 K1 K3 K2 K4 K5 K5 K2 K6 K2 Key
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 Value
Compaction
K1 V1 V3 V4
Offset 3 4 6 8 9 10 V1
K2 V2 V6
Key 0
K1 K3 K4 K5 K2 K6
Value K3 V5
V4 V5 V7 V9 V10 V11
K4 V7
K5 V8 V9
V1
K6
1
Refined
DB
Extract Results
Storage SQL BI Tools
Replay
DB
Raw
Parallel
Processing
Storage
Event Source Big Data Batch Analytics
Event Stream Event Search / Explore
Mobile Hub Event
Apps Stream Search
{ }
IoT Edge Node
Stream State
Data API
Processor
Event Hub Stream Analytics Enterprise Apps
Location
{ }
{ } Service
Social
Storage API Logic
Microservice State API
Weather Rules Event
Stream Modern Applications
Introduction to Stream Processing
Integrating (Streaming) Data Sources
DB Source
Lo CDC
g
Lo Event Hub
DB Source g CDC GW
Native
Lo Lo Topic
g g CDC
Big Data
Topic
Connect
Social
Native
Topic
Dataflow Stream
DataflowGW Topic Processing
REST
IoT Sensor
REST
Topic
MQTT
Messaging
IoT Sensor IoT GW Topic
Broker
GW
IoT GW
Queue
IoT Sensor
Why is Data Ingestion Difficult?
Source: Streamsets
Integration with or without
Transformation? d
Zero Transformation
• No transformation, plain ingest, no
schema validation Enrichment Transformation
• Keep the original format – Text, • Add new data to the message
CSV, … • Do not change existing values
• Allows to store data that may have • Convert a value from one system to
errors in the schema another and add it to the message
Truck-1
truck/nn/
position
Truck-2
? Raw Data
?
truck
position raw Store
Truck-3
truck/nn/
Truck-4 position
Truck-5
{"truckid":"57","driverid":"15","routeid":"192762466
2","eventtype":"Normal","latitude":"38.65","longitu
de":"-
90.21","correlationId":"4412891759760421296"}
Stream Data Integration: Apache
NiFi
Apache NiFi
• ConvertXxxxToYyyy • MergeContent
• ConvertRecord • ReplaceText
• EnforceOrder • ResizeImage
• EncryptContent • SplitXXXX (Avro, Content, JSON,
• ExtractXXXX (AvroMetdata, Record, Xml, ...)
EmailAttachments, Grok, • TailFile
HL7Attributes, ImageMetadata, ...) • TransformXML
• GeoEnrichIP • UpdateAttribute
• JoltTransformJSON
Demo Case
Truck-1
truck/nn/
position
Truck-2
Port: 1883
MQTT truck Kafka to Raw Data
to Kafka position raw Raw
Truck-3
Store
truck/nn/
Truck-4 position
Port: 1884
Truck-5
{"truckid":"57","driverid":"15","routeid":"192762466
2","eventtype":"Normal","latitude":"38.65","longitu
de":"-
90.21","correlationId":"4412891759760421296"}
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Processor
Demo: Kafka Processor
Demo: Masking Field with ReplaceText Processor
Stream Data Integration:
StreamSets DataCollector
StreamSets Data Collector
Source: https://streamsets.com/connectors
StreamSets Processors
Source: https://streamsets.com/connectors
Demo Case
Edge
Truck-1
truck/nn/
position
Truck-2 MQTT-1
Port: 1883 to Kafka
truck Kafka to Raw Data
position raw Raw
Truck-3
MQTT-2 Store
to Kafka
truck/nn/
Truck-4 position
Port: 1884
Truck-5
{"truckid":"57","driverid":"15","routeid":"192762466
2","eventtype":"Normal","latitude":"38.65","longitu
de":"-
90.21","correlationId":"4412891759760421296"}
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Source
Demo: Kafka Sink
Demo: Dataflow for MQTT to Kafka
Demo: Masking fields
Demo: Sending Message to Kafka in Avro
StreamSets Dataflow Performance Manager
Source: https://streamsets.com/connectors
Stream Data Integration: Kafka
Connect
Kafka Connect - Overview
Source Sink
Connecto Connecto
r r
Source: http://www.confluent.io/product/connectors
Demo Case
Truck-1
truck/nn/
position
Truck-2 MQTT-1
Port: 1883 to Kafka
truck Kafka to Raw Data
position raw Raw
Truck-3
MQTT-2 Store
to Kafka
truck/nn/
Truck-4 position
Port: 1884
Truck-5
{"truckid":"57","driverid":"15","routeid":"192762466
2","eventtype":"Normal","latitude":"38.65","longitu
de":"-
90.21","correlationId":"4412891759760421296"}
Demo (II) – devices send to MQTT instead of Kafka
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
-H "Content-Type: application/json" \
-d $'{
"name": "mqtt-source",
"config": {
"connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
"tasks.max": "1",
"name": "mqtt-source",
"mqtt.server.uri": "tcp://mosquitto:1883",
"mqtt.topics": "truck/+/position",
"kafka.topic":"truck_position",
"mqtt.clean.session.enabled":"true",
"mqtt.connect.timeout.seconds":"30",
"mqtt.keepalive.interval.seconds":"60",
"mqtt.qos":"0"
}
}'
Summary
Summary
• visual dataflow modelling • visual dataflow modelling • declarative style data flows
• very powerful – “with power • very powerful – “with power • simplicity - “simple things
comes responsibility” comes responsibility” done simple”
• special package for Edge • special package for Edge • very well integrated with
computing computing Kafka – comes with Kafka
• data lineage and data • data lineage and data • Single Message Transforms
provenance provenance (SMT)
• supports for backpressure • no transport mechanism • use Kafka Streams for
• no transport mechanism • custom sources, sinks, complex data flows
(DEV/TST/PROD) processors • custom connectors
• custom processors • supported by StreamSets • supported by Confluent
• supported by Hortonworks
Technology on its own won't help you.
You need to know how to use it properly.