The Compute Engine
• At the center of a Fast Data architecture, we
find compute engines.
These engines are in charge of transforming the
data flowing into the system into valuable
insights through the application of business
logic encoded in their specific model.
• There are two main stream processing paradigms: micro-batch processing and one-at-a-time message processing.
Micro-Batch Processing
• Micro-batching refers to the method of accumulating input data until a certain threshold is reached, typically a time interval, in order to process all of those messages together.
• Compare it to a bus waiting at a terminal until departure time. The bus can deliver many passengers to their destinations, all sharing the same transport and fuel.
Micro-Batch Processing
• Micro-batching enjoys mechanical
sympathy with the network and
storage systems, where it is optimal
to send packets of certain sizes that
can be processed all at once.
• In the micro-batch department, the leading
framework is Apache Spark.
• Apache Spark is a general-purpose distributed
computing framework with libraries for
machine learning, streaming, and graph
analytics.
• Spark provides high-level structured abstractions
that let us operate on the data viewed as records
that follow a certain schema.
• This concept ties together a high-level API that
offers bindings in Scala, Java, Python, and R with
a low-level execution engine that translates the
high-level structure-oriented constructs into
query and execution plans that can be optimally
pushed toward the data sources.
• With the recent introduction of Structured Streaming, Spark aims to unify the data analytics model for batch and streaming.
• The streaming case is implemented as a recurrent
application of the different queries we define on
the streaming data, and enriched with
event-time support, windows, different output
modes, and triggers to differentiate the ingest
interval from the time that the output is
produced.
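To make these constructs concrete, here is a minimal Structured Streaming sketch in Scala. It is only an illustration: the broker address, the topic name sensor-readings, and the CSV payload layout are assumptions, while the watermark, event-time window, output mode, and trigger correspond to the features described above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object SensorAverages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical Kafka topic "sensor-readings" carrying CSV payloads: "sensorId,value,timestamp".
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sensor-readings")
      .load()

    val readings = raw
      .selectExpr("CAST(value AS STRING) AS csv")
      .select(split($"csv", ",").as("f"))
      .select(
        $"f".getItem(0).as("sensorId"),
        $"f".getItem(1).cast("double").as("value"),
        $"f".getItem(2).cast("timestamp").as("eventTime"))

    // Event-time window with a watermark; the same query is applied recurrently to new data.
    val averages = readings
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window($"eventTime", "5 minutes"), $"sensorId")
      .agg(avg($"value").as("avgValue"))

    averages.writeStream
      .outputMode("update")                        // output mode
      .trigger(Trigger.ProcessingTime("1 minute")) // decouples ingest from when output is produced
      .format("console")
      .start()
      .awaitTermination()
  }
}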
One-at-a-Time Processing
• One-at-a-time message processing ensures that each
message is processed as soon as it arrives in the
engine; hence, it delivers results with minimal delay.
• At the same time, shipping small messages individually
increases the overall overhead of the system and
therefore reduces the number of messages we can
process per unit of time.
Following the transportation analogy, one-at-a-time processing is like a taxi: an individual transport that can take a single passenger to their destination as fast as possible.
One-at-a-Time Processing
• The leading engine in this category is Apache
Flink.
• Flink is a one-at-a-time streaming framework, also offering snapshots to protect results from machine failures.
• This comprehensive framework provides a lower-level API than Structured Streaming, but it is a competitive alternative to Spark Streaming when low latency is the key focus of interest. Flink offers APIs in Scala and Java.
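For comparison, the following is a minimal sketch written against Flink's Scala DataStream API (deprecated in recent Flink releases in favor of the Java API). The socket source on localhost:9999 is an assumption for illustration; the checkpoint interval enables the snapshotting mechanism mentioned above, and each record flows through the operators as soon as it arrives.

import org.apache.flink.streaming.api.scala._

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Enable periodic snapshots (checkpoints) so results survive machine failures.
    env.enableCheckpointing(10000L) // every 10 seconds

    // Hypothetical source: lines of text arriving on a local socket.
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999)

    // Each record is processed as soon as it arrives (one-at-a-time).
    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .keyBy(_._1)
      .sum(1)

    counts.print()
    env.execute("one-at-a-time word count")
  }
}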
One-at-a-Time Processing
• flink.apache.org/flink-architecture.html
https://flink.apache.org/usecases.html
• https://engineering.zalando.com/posts/2016/03/apache-showdown-flink-vs.-spark.html
• https://romanglushach.medium.com/flink-spark-storm-kafka-a-comparative-analysis-of-big-data-stream-processing-frameworks-for-dab3dd42fc16
One-at-a-Time Processing
• In this space, we find Kafka Streams and Akka
Streams. These are not frameworks, but libraries
that can be used to build data-oriented
applications with a focus on data analytics.
• Both offer low-latency, one-at-a-time message
processing. Their APIs include projections,
grouping, windows, aggregations, and some
forms of joins.
• While Kafka Streams comes as a standalone
library that can be integrated into applications,
Akka Streams is part of the larger Reactive
Platform with a focus on microservices.
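As a sketch of this library style, the pipeline below uses the kafka-streams-scala wrapper (a recent Kafka 3.x release is assumed for TimeWindows.ofSizeWithNoGrace). The topics clicks and click-counts-per-minute are hypothetical; the pipeline exercises the grouping, windowing, aggregation, and projection operations mentioned above.

import java.time.Duration
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object ClickCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counts")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Hypothetical topic "clicks", keyed by user id, valued by page name.
  builder
    .stream[String, String]("clicks")
    .groupByKey                                                        // grouping
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))  // windowing
    .count()                                                           // aggregation
    .toStream
    .map((windowedKey, count) => (windowedKey.key, count.toString))    // projection
    .to("click-counts-per-minute")

  new KafkaStreams(builder.build(), props).start()
}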
• In this category, we also find Apache Beam, which provides a high-level model for defining parallel processing pipelines.
• This definition is then executed on a runner, the Apache Beam term for an execution engine.
• Apache Beam can use Apache Apex, Apache Flink, Apache Gearpump, and the proprietary Google Cloud Dataflow as runners.
How to Choose
• The choice of a processing engine is largely driven
by the throughput and latency requirements of
the use case at hand.
• If we need the lowest response time possible—
for example, an industrial sensor alert, alarms, or
an anomaly detection system—then we should
look into the one-at-a-time processing engines.
How to Choose
• If, on the other hand, we are implementing a massive ingest and the data needs to be processed along different data lines to produce, for example, a registry of every record and aggregated reports, as well as to train a machine learning model, then a micro-batch system will be best suited to handle the workload.
How to Choose
• In practice, we observe that this choice is also influenced by
the existing practices in the enterprise. Preferences for
specific programming languages and DevOps processes will
certainly be influential in the selection process.
• While software development teams might prefer compiled
languages in a stricter CI/CD pipeline, data science teams are
often driven by availability of libraries and individual language
preferences (R versus Python) that create challenges on the
operational side.
How to Choose
• Luckily, general-purpose distributed processing engines such as Apache Spark offer bindings in different languages, such as Scala and Java for the discerning developer, and Python and R for the data science practitioner.
Storage
• Storage runs across the entire data engineering
lifecycle, often occurring in multiple places in a data
pipeline, with storage systems crossing over with
source systems, ingestion, transformation, and
serving.
• In many ways, the way data is stored impacts how it
is used in all of the stages of the data engineering
lifecycle.
• For example, cloud data warehouses can store data,
process data in pipelines, and serve it to analysts.
• Streaming frameworks such as Apache Kafka and
Pulsar can function simultaneously as ingestion,
storage, and query systems for messages, with object
storage being a standard layer for data transmission.
Storage
• When Hadoop-based architectures became popular, the challenge that they were solving was two-fold:
• how to reliably store large amounts of data, and
• how to process it.
• These systems thought of data as being at rest.
• HDFS – replicated blocks provided reliability.
• MapReduce – moved parallel computations to where the data was stored, to remove the overhead of moving data over the network.
Storage as the Fast Data Borders
• In the particular case of Fast Data
architectures, storage usually
demarks the transition boundaries
between the Fast Data core and the
traditional applications that might
consume the produced data.
• The choice of storage technology
is driven by the particular
requirements of this transition
between moving and resting
data.
Key Considerations for Batch Versus Stream Ingestion
• Should you go streaming-first? Despite the attractiveness of a streaming-first approach, there are many trade-offs to understand and think about.
• If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
• Do I need millisecond real-time data ingestion?
Or would a micro-batch approach work,
accumulating and ingesting data, say, every
minute?
• What are my use cases for streaming ingestion?
What specific benefits do I realize by implementing
streaming?
• If I get data in real time, what actions can I take on that data that would be an improvement upon batch?
• Will my streaming-first approach cost more in
terms of time, money, maintenance, downtime,
and opportunity cost than simply doing batch?
• Are my streaming pipeline and system reliable
and redundant if infrastructure fails?
• What tools are most appropriate for the use case?
Should I use a managed service (Amazon Kinesis,
Google Cloud Pub/Sub, Google Cloud Dataflow) or
stand up my own instances of Kafka, Flink, Spark,
Pulsar, etc.? If I do the latter, who will manage it?
What are the costs and trade-offs?
• If I’m deploying an ML model, what benefits do I
have with online predictions and possibly continuous
training?
• Am I getting data from a live production instance? If
so, what’s the impact of my ingestion process on this
source system?
Push versus pull
• In the push model of data ingestion, a source
system writes data out to a target, whether a
database, object store, or filesystem.
• In the pull model, data is retrieved from the
source system.
• The line between the push and pull paradigms can
be quite blurry; data is often pushed and pulled as
it works its way through the various stages of a
data pipeline.
• One common method triggers a message
every time a row is changed in the source
database.
• This message is pushed to a queue, where the
ingestion system picks it up.
• Another common CDC method uses binary logs, which record every commit to the database.
• The database pushes to its logs. The ingestion
system reads the logs but doesn’t directly
interact with the database otherwise. This
adds little to no additional load to the source
database.
• Some versions of batch CDC use the pull
pattern. For example, in timestamp-based
CDC, an ingestion system queries the source
database and pulls the rows that have
changed since the previous update.
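A minimal sketch of this timestamp-based pull pattern using plain JDBC from Scala; the orders table, its updated_at column, and the watermark handling are assumptions made for illustration.

import java.sql.{DriverManager, Timestamp}

object TimestampCdcPull {
  // Hypothetical source table `orders` with an `updated_at` column.
  // `lastWatermark` would normally be persisted by the ingestion system between runs.
  def pullChanges(jdbcUrl: String, lastWatermark: Timestamp): Vector[(Long, Timestamp)] = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.prepareStatement(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at")
      stmt.setTimestamp(1, lastWatermark)
      val rs = stmt.executeQuery()
      val changed = Vector.newBuilder[(Long, Timestamp)]
      while (rs.next()) {
        changed += ((rs.getLong("id"), rs.getTimestamp("updated_at")))
      }
      // Rows changed since the previous update; the max timestamp becomes the new watermark.
      changed.result()
    } finally conn.close()
  }
}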
• Change Data Capture (CDC) is a continuous
data capture process that helps data scientists
and analysts get near-real-time access to
data. CDC is a software design pattern that
detects and captures data changes in a
database and then delivers them to a
downstream system or process.
• With streaming ingestion, data bypasses a
backend database and is pushed directly
to an endpoint, typically with data buffered by
an event-streaming platform.
• This pattern is useful with fleets of IoT sensors
emitting sensor data.
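A sketch of this push-to-endpoint pattern using the standard Kafka producer API from Scala; the broker address, the topic name sensor-readings, and the JSON payload are assumptions made for illustration.

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SensorIngest extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // Hypothetical reading from an IoT device, pushed straight to the event-streaming
  // platform; no backend database sits in the write path.
  val reading = """{"sensorId":"s-42","temperature":21.7,"ts":"2024-01-01T00:00:00Z"}"""
  producer.send(new ProducerRecord[String, String]("sensor-readings", "s-42", reading))

  producer.flush()
  producer.close()
}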
• If we need to store the complete data stream as it comes in, and we need access to each individual record or to sequential slices of them, we need a highly scalable backend with low-latency writes and key-based query capabilities.
• Apache Cassandra is a great choice in such a scenario, as it offers linear scalability and a limited but powerful query language (Cassandra Query Language, or CQL).
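A small sketch using the DataStax Java driver from Scala, assuming a local Cassandra node and a pre-created keyspace named fastdata. The table layout illustrates how a partition key plus a clustering column supports both per-record lookups and sequential slices.

import com.datastax.oss.driver.api.core.CqlSession

object SensorStore extends App {
  // Assumes a local Cassandra node and a pre-created keyspace named "fastdata".
  val session = CqlSession.builder().withKeyspace("fastdata").build()

  // One row per reading, partitioned by sensor and clustered by time,
  // so individual records and sequential slices can be fetched by key.
  session.execute(
    """CREATE TABLE IF NOT EXISTS readings (
      |  sensor_id text,
      |  ts timestamp,
      |  value double,
      |  PRIMARY KEY (sensor_id, ts)
      |) WITH CLUSTERING ORDER BY (ts DESC)""".stripMargin)

  session.execute(
    "INSERT INTO readings (sensor_id, ts, value) VALUES ('s-42', toTimestamp(now()), 21.7)")

  // CQL restricts queries to the partition key, which keeps lookups fast and scalable.
  val rows = session.execute("SELECT ts, value FROM readings WHERE sensor_id = 's-42' LIMIT 10")
  rows.forEach(r => println(s"${r.getInstant("ts")} -> ${r.getDouble("value")}"))

  session.close()
}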
• Data could also then be loaded into a
traditional data warehouse or
could be used to build a data lake
that can support different
capabilities including machine
learning, reporting, or ad hoc
analysis.
• On the other side of the spectrum, we have
predigested aggregates that are requested by
a frontend visualization system.
• Here, we probably want the full SQL query
and indexing support to quickly locate those
records for display.
• A more classical PostgreSQL, MySQL, or their commercial Relational Database Management System (RDBMS) counterparts would be a reasonable choice.
• Between these two cases is a whole
range of options, ranging from
specialized databases (such as InfluxDB
for time series or Redis for
fast in-memory lookups), to raw storage
(such as on-premises HDFS) or the cloud
storage offerings (Amazon S3, Azure
Storage, Google Cloud Storage, and
more).
The Message Backbone as Transition Point
• In some cases, it is even possible to use the message backbone as the data hand-over point.
• We can exploit the capabilities of a persistent event log, such as Apache Kafka, to transition data between the Fast Data applications and clients with different runtime characteristics.
• https://www.confluent.io/blog/okay-store-data-apache-kafka/
• Data in Kafka is persisted to disk,
checksummed, and replicated for fault
tolerance. Accumulating more stored data
doesn’t make it slower. There are Kafka
clusters running in production with over a
petabyte of stored data.
• Stream processing jobs do computation off a
stream of data coming via Kafka. When the logic of
the stream processing code changes, you often
want to recompute your results.
• A very simple way to do this is just to reset the
offset for the program to zero to recompute the
results with the new code.
• This sometimes goes by the somewhat grandiose
name of The Kappa Architecture.
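A minimal sketch of this reset-to-zero replay using the plain Kafka consumer API from Scala; the topic events and the group id are hypothetical. The same effect can also be achieved with Kafka's consumer-group offset reset tooling; the point is simply to rewind to the beginning of the retained log and let the new code recompute its results.

import java.time.Duration
import java.util.Properties
import java.util.{Collection => JCollection}

import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object ReplayFromZero extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "recompute-v2") // a fresh group id for the new version of the logic
  props.put("key.deserializer", classOf[StringDeserializer].getName)
  props.put("value.deserializer", classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)

  // Rewind to offset zero as soon as partitions are assigned, so the new code
  // recomputes its results over the full retained log.
  consumer.subscribe(List("events").asJava, new ConsumerRebalanceListener {
    def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit = ()
    def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
      consumer.seekToBeginning(partitions)
  })

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.asScala.foreach(r => println(s"reprocessing offset ${r.offset()}: ${r.value()}"))
  }
}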
• Kafka is often used to capture and distribute a
stream of database updates (this is often called
Change Data Capture or CDC).
• Applications that consume this data in steady state just need the newest changes; however, new applications need to start with a full dump or snapshot of the data.
• However, performing a full dump of a large production database is often a very delicate and time-consuming operation.
• Enabling log compaction on the topic containing the stream of changes allows consumers of this data to simply reload by resetting to offset zero.
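One way to switch a topic to compaction, sketched with the Kafka AdminClient from Scala; the topic name orders-changelog is hypothetical, and the same cleanup.policy setting can equally be applied with the command-line topic configuration tools.

import java.util.Properties

import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object EnableCompaction extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  val admin = AdminClient.create(props)

  // Switch the hypothetical CDC topic "orders-changelog" to log compaction, so it always
  // retains at least the latest value per key and new consumers can rebuild state by
  // resetting to offset zero instead of dumping the source database.
  val topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders-changelog")
  val setCompact = new AlterConfigOp(
    new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET)

  admin.incrementalAlterConfigs(Map(topic -> List(setCompact).asJavaCollection).asJava)
    .all()
    .get()

  admin.close()
}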
• If your messaging system is not good at storing
messages then it isn’t good at “queuing” messages
either. You might think that this doesn’t matter
because you don’t plan on storing messages for very
long.
• But no matter how briefly messages are stored in
the messaging system, if the messaging system is
under continuous load, there will always be some
unconsumed messages being stored.
• So whenever it does fail, if it has no capability of
providing fault tolerant storage, it is going to lose
some data. Doing messaging right requires getting
storage right. This point seems obvious but is often
missed when people evaluate messaging systems.
• So storage is important for messaging use cases.
But in fact, Kafka isn’t really a message queue in
the traditional sense at all—in implementation it
looks nothing like RabbitMQ or other similar
technologies.
• https://aws.amazon.com/compare/the-difference-between-rabbitmq-and-kafka/
• It is much closer in architecture to a distributed filesystem or database than to a traditional message queue.
There are three primary differences between Kafka
and traditional messaging systems:
• As we described, Kafka stores a persistent log
which can be re-read and kept indefinitely.
• Kafka is built as a modern distributed system: it runs as a cluster, can expand or contract elastically, and replicates data internally for fault tolerance and high availability.
• Kafka is built to allow real-time stream
processing, not just processing of a single
message at a time. This allows working with data
streams at a much higher level of abstraction.
• We think of these differences as being enough
to make it pretty inaccurate to think of Kafka
as a message queue, and instead categorize it
as a Streaming Platform.
• Kafka’s core abstraction is a continuous time-
ordered log. The key to this abstraction is that it is
a kind of structured “file” that doesn’t end when
you come to the last byte, but rather, at least
logically, goes on forever.
• A program written to work with a log therefore doesn’t need to differentiate between data that already occurred and data that is going to happen in the future; it all appears as one continuous stream. This combination of storing the past and propagating the future behind a single uniform protocol and API is exactly what makes Kafka work well for stream processing.
• This stored log is very much like a file in a distributed file system: it is replicated across machines, persisted to disk, and supports high-throughput linear reads and writes. But it is also like a messaging system, in that it allows many high-throughput concurrent writes and has a very precise definition of when a message is published, allowing cheap, low-latency propagation of changes to lots of consumers. In this sense, it is the best of both worlds.
• When dealing with storage choices, there is no one-size-fits-all solution. Every Fast Data application will probably require a specific storage solution for each integration path with other applications in the enterprise ecosystem.
• https://www.confluent.io/confluent-cloud/tryfree/