The document discusses streaming data ingestion in Big Data and IoT applications, highlighting the importance of real-time data processing and integration. It covers various technologies such as Apache NiFi, StreamSets Data Collector, and Kafka Connect, and emphasizes the need for efficient event hubs and stream analytics. The author, Guido Schmutz, provides insights into the challenges of data ingestion and the architectural frameworks necessary for modern data analytics solutions.

Streaming Data Ingestion in Big Data and IoT Applications

Guido Schmutz – 27.9.2018

@gschmutz guidoschmutz.wordpress.com
Guido Schmutz

• Working at Trivadis for more than 21 years
• Oracle ACE Director for Fusion Middleware and SOA
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Head of Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 30 years of software development experience

Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Trivadis – with over 600 specialists and IT experts in your region

• 14 Trivadis branches and more than 600 employees
• 200 Service Level Agreements
• Over 4,000 training participants
• Research and development budget: CHF 5.0 million
• Financially self-supporting and sustainably profitable
• Experience from more than 1,900 projects per year at over 800 customers

Locations: Basel, Bern, Brugg, Copenhagen, Düsseldorf, Frankfurt, Freiburg, Geneva, Hamburg, Lausanne, Munich, Stuttgart, Vienna, Zurich
Agenda

1. Big Data and IoT Reference Architecture
2. Event Hub
3. Stream Data Integration
   • Apache NiFi
   • StreamSets Data Collector
   • Kafka Connect
4. Summary
Big Data and IoT Reference Architecture
Big Data solves Volume and Variety – not Velocity

[Diagram: bulk sources (files, databases) are loaded with high latency via file/SQL import into a Big Data platform (Hadoop cluster) with raw and refined storage and parallel processing; results are extracted to the enterprise data warehouse, BI tools, search/explore, and enterprise apps via APIs.]


Big Data solves Volume and Variety – not Velocity

[Diagram: the same batch architecture, now joined by event sources (mobile apps, IoT data, location, social, telemetry) emitting event streams that the high-latency batch pipeline cannot serve.]


Big Data solves Volume and Variety – not Velocity

[Diagram: an Event Hub collects the event streams and feeds them into the Big Data platform for machine learning, graph algorithms, and natural language processing; the path remains high latency.]


Stream Processing Architecture solves Velocity

[Diagram: event sources publish event streams to an Event Hub; a Stream Analytics platform processes them against reference models with low(est) latency but no history, feeding results to dashboards, search/explore, and enterprise apps, while bulk sources still flow to the enterprise data warehouse and BI tools.]


Big Data for all historical data analysis

[Diagram: the Event Hub additionally forwards event streams via a data flow into the Big Data platform's raw and refined storage, so stream analytics covers the low-latency path while Hadoop-based parallel processing covers all historical analysis.]


Integrate existing systems through CDC

[Diagram: a traditional silo-based system (user interface, logic, data store) is tapped by a CDC connector that publishes change events as an event stream to the Event Hub, from which consuming systems (state, logic) read.]

• Capture changes directly on the database
• Change Data Capture (CDC) => think of it like a global database trigger
• Transform existing systems into event producers
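One common way to implement this against the Event Hub is a Kafka Connect source connector such as Debezium; a hypothetical sketch (connector name, Connect endpoint, database coordinates, and credentials are illustrative; the deck itself does not prescribe Debezium):

#!/bin/bash
# register a Debezium MySQL CDC connector with Kafka Connect
curl -X "POST" "http://connect:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}'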


Integrate existing systems with lower latency through CDC

[Diagram: the same architecture, with existing databases now feeding the Event Hub through CDC instead of periodic file/SQL import, reducing ingestion latency.]


New systems participate in event-oriented fashion

[Diagram: new microservice platforms (service logic, state, APIs) consume and produce event streams via the Event Hub, alongside the stream analytics platform with its event stream processors and the Big Data platform.]
Edge computing allows processing close to data sources

[Diagram: an edge node with its own event hub, storage, and rules processes IoT sensor data close to the source and forwards event streams to the central Event Hub and the rest of the architecture.]
Unified Architecture for Modern Data Analytics Solutions

[Diagram: the complete picture: bulk sources flow into the Big Data platform, event sources flow through edge nodes and the Event Hub into stream analytics and microservices, and results reach the enterprise data warehouse, BI tools, search/explore, and enterprise apps.]
Two Types of Stream Processing (from Gartner)

Stream Data Integration
• primarily focuses on the ingestion and processing of data sources targeting real-time extract-transform-load (ETL) and data integration use cases
• filter and enrich the data
• optionally calculate time-windowed aggregations before storing the results in a database or file system

Stream Analytics
• targets analytics use cases
• calculating aggregates and detecting patterns to generate higher-level, more relevant summary information (complex events)
• complex events may signify threats or opportunities that require a response from the business through real-time dashboards, alerts or decision automation


Event Hub
Implementing "Event Hub"
Enterprise Data
Bulk Source Hadoop Clusterd Warehouse
Hadoop Cluster
Cluster Infrastructure
File File Import / SQL Import

Refined
DB
Extract Results
Storage SQL BI Tools
Replay
DB

Raw
Parallel
Processing
Storage
Event Source Big Data Batch Analytics
Event Stream Event Search / Explore
Mobile Hub Event
Apps Stream Search
{ }
IoT Edge Node
Stream State
Data API
Processor
Event Hub Stream Analytics Enterprise Apps
Location

{ }
{ } Service
Social
Storage API Logic
Microservice State API
Weather Rules Event
Stream Modern Applications
Introduction to Stream Processing
Apache Kafka – A Streaming Platform

[Diagram panels: High-Level Architecture; Scale-Out Architecture; Distributed Log at the Core; Logs do not (necessarily) forget.]


Hold Data for Long-Term – Data Retention

1. Never
2. Time based (TTL): log.retention.{ms | minutes | hours}
3. Size based: log.retention.bytes
4. Log compaction based (entries with the same key are removed):

kafka-topics.sh --zookeeper zk:2181 \
  --create --topic customers \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact
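The same topic-level mechanism covers the other policies; a minimal sketch, assuming an illustrative topic name vehicle-positions and the same ZooKeeper address, creating a topic whose entries expire after seven days:

# time-based retention: keep messages for 7 days (604800000 ms)
kafka-topics.sh --zookeeper zk:2181 \
  --create --topic vehicle-positions \
  --replication-factor 1 \
  --partitions 1 \
  --config retention.ms=604800000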


Keep Topics in Compacted Form

[Diagram: a topic with entries at offsets 0–11 over keys K1–K6, several keys written multiple times with values V1–V11; after compaction only the latest entry per key survives, leaving offsets 3, 4, 6, 8, 9, 10 with keys K1, K3, K4, K5, K2, K6.]
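Compaction can also be switched on for an existing topic by altering its configuration; a minimal sketch using the ZooKeeper-based tooling of the same era, against the customers topic created above:

# enable log compaction on an existing topic
kafka-configs.sh --zookeeper zk:2181 --alter \
  --entity-type topics --entity-name customers \
  --add-config cleanup.policy=compact

# verify the topic-level overrides
kafka-configs.sh --zookeeper zk:2181 --describe \
  --entity-type topics --entity-name customers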


Stream Data Integration
Implementing "Stream Data Integration"
Enterprise Data
Bulk Source Hadoop Clusterd Warehouse
Hadoop Cluster
Cluster Infrastructure
File File Import / SQL Import

Refined
DB
Extract Results
Storage SQL BI Tools
Replay
DB

Raw
Parallel
Processing
Storage
Event Source Big Data Batch Analytics
Event Stream Event Search / Explore
Mobile Hub Event
Apps Stream Search
{ }
IoT Edge Node
Stream State
Data API
Processor
Event Hub Stream Analytics Enterprise Apps
Location

{ }
{ } Service
Social
Storage API Logic
Microservice State API
Weather Rules Event
Stream Modern Applications
Introduction to Stream Processing
Integrating (Streaming) Data Sources

• File Polling
• SQL Polling
• File Stream (File Tailing)
• File Stream (Appender)
• Change Data Capture (CDC)
• Sensor Stream


IoT devices will often not be able to talk to Kafka directly

[Diagram: sources reach Event Hub topics through different bridges: file sources via a dataflow gateway, database sources via CDC (natively or through a CDC gateway), social feeds via native connectors, and IoT sensors via REST, an MQTT messaging broker, or an IoT gateway; the topics are then consumed by big data, connect, dataflow, and stream processing components.]
Why is Data Ingestion Difficult?

Infrastructure Drift
• Physical and logical infrastructure changes rapidly
• Key challenges: infrastructure automation, edge deployment

Structure Drift
• Data structures and formats evolve and change unexpectedly
• Key challenges: consumption readiness, corruption and loss

Semantic Drift
• Data semantics change with evolving applications
• Key challenges: timely intervention, system consistency

Source: StreamSets
Integration with or without Transformation?

Zero Transformation
• No transformation, plain ingest, no schema validation
• Keep the original format – Text, CSV, …
• Allows storing data that may have errors in the schema

Format Transformation
• Better named "Format Translation"
• Simply change the format, e.g. from Text to Avro
• Does schema validation

Enrichment Transformation
• Add new data to the message
• Do not change existing values
• Convert a value from one system to another and add it to the message

Value Transformation
• Replaces values in the message
• Convert a value from one system to another and change the value in-place
• Destroys the raw data!


Demo Case

[Diagram: Truck-1 through Truck-5 publish position messages to MQTT topics truck/nn/position; an ingestion component (to be chosen) moves them into a raw truck position topic and a Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
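A single position event can be simulated with the Mosquitto client tools; a minimal sketch, assuming a broker on localhost:1883 (host and port are illustrative):

# publish one sample position for truck 57
mosquitto_pub -h localhost -p 1883 -t "truck/57/position" \
  -m '{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}'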
Stream Data Integration: Apache NiFi
Apache NiFi

• Originated at NSA as Niagarafiles – developed behind closed doors for 8 years
• Open sourced December 2014, Apache Top Level Project July 2015
• Look-and-feel modernized in 2016
• Opaque, "file-oriented" payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data provenance and data lineage
• Web-based user interface
Processors for Source and Sink

• ConsumeXXXX (AMQP, EWS, IMAP, JMS, Kafka, MQTT, POP3, …)
• DeleteXXXX (DynamoDB, Elasticsearch, HDFS, RethinkDB, S3, SQS, ...)
• FetchXXXX (AzureBlobStorage, ElasticSearch, File, FTP, HBase, HDFS, S3 ...)
• ExecuteXXXX (FlumeSink, FlumeSource, Script, SQL, ...)
• GetXXXX (AzureEventHub, Couchbase, DynamoDB, File, FTP, HBase, HDFS, HTTP, Ignite, JMSQueue, JMSTopic, Kafka, Mongo, Solr, Splunk, SQS, TCP, ...)
• ListenXXXX (HTTP, RELP, SMTP, Syslog, TCP, UDP, WebSocket, ...)
• PublishXXXX (Kafka, MQTT)
• PutXXXX (AzureBlobStorage, AzureEventHub, CassandraQL, CloudWatchMetric, Couchbase, DynamoDB, Elasticsearch, Email, FTP, File, HBase, HDFS, HiveQL, Kudu, Lambda, Mongo, Parquet, Slack, SQL, TCP, ....)
• QueryXXXX (Cassandra, DatabaseTable, DNS, Elasticsearch)
Processors for Processing

• ConvertXxxxToYyyy
• ConvertRecord
• EnforceOrder
• EncryptContent
• ExtractXXXX (AvroMetadata, EmailAttachments, Grok, HL7Attributes, ImageMetadata, ...)
• GeoEnrichIP
• JoltTransformJSON
• MergeContent
• ReplaceText
• ResizeImage
• SplitXXXX (Avro, Content, JSON, Record, Xml, ...)
• TailFile
• TransformXML
• UpdateAttribute
Demo Case

[Diagram: Truck-1 through Truck-5 publish to MQTT topics truck/nn/position on brokers at ports 1883 and 1884; a NiFi "MQTT to Kafka" flow writes into the raw truck position Kafka topic, and a "Kafka to Raw" flow persists into the Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
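Once the flow runs, the raw topic can be inspected from the command line; a minimal sketch, assuming the topic name truck_position (as configured in the Kafka Connect demo later) and a broker at localhost:9092 (illustrative):

# tail the raw truck positions arriving in Kafka
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic truck_position --from-beginning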
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Processor
Demo: Kafka Processor
Demo: Masking Field with ReplaceText Processor
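The ReplaceText processor performs a regular-expression search-and-replace on the flow file content. A minimal sketch of the masking idea, shown here with sed purely for illustration (in the demo the equivalent regex lives in the processor's Search Value / Replacement Value properties; field name and mask are illustrative):

# mask the latitude field in the JSON payload
echo '{"truckid":"57","latitude":"38.65","longitude":"-90.21"}' \
  | sed -E 's/"latitude":"[^"]*"/"latitude":"xx.xx"/'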
Stream Data Integration: StreamSets Data Collector
StreamSets Data Collector

• Founded by ex-Cloudera, Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by 'civilians'
• Relatively new – first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
StreamSets Origins

• An origin stage represents the source for the pipeline; you can use a single origin stage in a pipeline
• Many origins are available out of the box
• API for writing custom origins

Source: https://streamsets.com/connectors
StreamSets Processors

• A processor stage represents a type of data processing that you want to perform; use as many processors in a pipeline as you need
• Programming languages supported: Java, JavaScript, Jython, Groovy, Java Expression Language (EL), Spark

Some of the processors available out of the box:
• Expression Evaluator
• Field Flattener
• Field Hasher
• Field Masker
• Field Merger
• Field Order
• Field Splitter
• Field Zip
• Groovy Evaluator
• JDBC Lookup
• JSON Parser
• Spark Evaluator
• …
StreamSets Destinations

• A destination stage represents the target for a pipeline; you can use one or more destinations in a pipeline
• Many destinations are available out of the box
• API for writing custom destinations

Source: https://streamsets.com/connectors
Demo Case

[Diagram: StreamSets Edge pipelines MQTT-1 and MQTT-2 ("to Kafka") read truck/nn/position from brokers on ports 1883 and 1884 and write into the raw truck position Kafka topic; a "Kafka to Raw" pipeline persists into the Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
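To check what the edge pipelines receive, the MQTT topics can be watched directly; a minimal sketch, assuming a broker on localhost:1883 (illustrative):

# subscribe to all truck position topics, printing topic names alongside payloads
mosquitto_sub -h localhost -p 1883 -t "truck/+/position" -v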
Demo: Dataflow for MQTT to Kafka
Demo: MQTT Source
Demo: Kafka Sink
Demo: Dataflow for MQTT to Kafka
Demo: Masking fields
Demo: Sending Message to Kafka in Avro
StreamSets Dataflow Performance Manager

• Map dataflows to topologies, manage releases & track changes
• Measure KPIs and establish baselines for data availability and accuracy
• Master dataflow operations through Data SLAs

Source: https://streamsets.com/connectors
Stream Data Integration: Kafka Connect
Kafka Connect – Overview

[Diagram: source connectors pull data from external systems into Kafka topics; sink connectors push data from Kafka topics into external systems.]


Kafka Connect – Single Message Transforms (SMT)

• Simple transformations for a single message
• Defined as part of Kafka Connect
  • some useful transforms provided out-of-the-box
  • easily implement your own
• Optionally deploy 1+ transforms with each connector (see the sketch below)
  • modify messages produced by a source connector
  • modify messages sent to sink connectors
• Makes it much easier to mix and match connectors

Some of the currently available transforms: InsertField, ReplaceField, MaskField, ValueToKey, ExtractField, TimestampRouter, RegexRouter, SetSchemaMetaData, Flatten, TimestampConverter
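A hypothetical configuration sketch attaching one of the out-of-the-box transforms (RegexRouter, which rewrites the destination topic name) to the MQTT source connector used in the demo that follows; connector name, REST endpoint, and topic names are illustrative:

#!/bin/bash
# route messages from the MQTT source into a "_raw" topic via an SMT
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "mqtt-source-routed",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "mqtt.server.uri": "tcp://mosquitto:1883",
    "mqtt.topics": "truck/+/position",
    "kafka.topic": "truck_position",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "truck_position",
    "transforms.route.replacement": "truck_position_raw"
  }
}'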
Kafka Connect – Many Connectors

• Certified Connectors: 20+ from Confluent and partners, with Confluent-supported connectors
• Community Connectors: 60+ since the first release (Kafka 0.9+)

Source: http://www.confluent.io/product/connectors
Demo Case

[Diagram: Truck-1 through Truck-5 publish to MQTT topics truck/nn/position on brokers at ports 1883 and 1884; "MQTT to Kafka" bridges move the messages into the raw truck position Kafka topic and onward into the Raw Data Store.]

Sample message:

{"truckid":"57","driverid":"15","routeid":"1927624662","eventtype":"Normal","latitude":"38.65","longitude":"-90.21","correlationId":"4412891759760421296"}
Demo (II) – devices send to MQTT instead of Kafka
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
-H "Content-Type: application/json" \
-d $'{
"name": "mqtt-source",
"config": {
"connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
"tasks.max": "1",
"name": "mqtt-source",
"mqtt.server.uri": "tcp://mosquitto:1883",
"mqtt.topics": "truck/+/position",
"kafka.topic":"truck_position",
"mqtt.clean.session.enabled":"true",
"mqtt.connect.timeout.seconds":"30",
"mqtt.keepalive.interval.seconds":"60",
"mqtt.qos":"0"
}
}'
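After registration, the Connect REST API reports whether the connector and its tasks are running; a minimal sketch against the same (illustrative) endpoint:

#!/bin/bash
# check connector and task state
curl -s "http://192.168.69.138:8083/connectors/mqtt-source/status"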
Summary

Apache NiFi
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for Edge computing
• data lineage and data provenance
• support for backpressure
• no transport mechanism (DEV/TST/PROD)
• custom processors
• supported by Hortonworks

StreamSets
• visual dataflow modelling
• very powerful – "with power comes responsibility"
• special package for Edge computing
• data lineage and data provenance
• no transport mechanism
• custom sources, sinks, processors
• supported by StreamSets

Kafka Connect
• declarative style data flows
• simplicity – "simple things done simple"
• very well integrated with Kafka – comes with Kafka
• Single Message Transforms (SMT)
• use Kafka Streams for complex data flows
• custom connectors
• supported by Confluent
Technology on its own won't help you.
You need to know how to use it properly.
