The Compute Engine
• At the center of a Fast Data architecture, we
find compute engines.
These engines are in charge of transforming the
data flowing into the system into valuable
insights through the application of business
logic encoded in their specific model.
• There are two main stream processing paradigms: micro-batch processing and one-at-a-time message processing.
Micro-Batch Processing
• Micro-batching refers to the method of accumulating input data until a certain threshold is reached, typically a time interval, in order to process all of those messages together.
• Compare it to a bus waiting at a terminal until departure time. The bus can deliver many passengers to their destinations, all sharing the same transport and fuel.
Micro-Batch Processing
• Micro-batching enjoys mechanical
sympathy with the network and
storage systems, where it is optimal
to send packets of certain sizes that
can be processed all at once.
• In the micro-batch department, the leading
framework is Apache Spark.
• Apache Spark is a general-purpose distributed
computing framework with libraries for
machine learning, streaming, and graph
analytics.
• Spark provides high-level structured abstractions
that let us operate on the data viewed as records
that follow a certain schema.
• This concept ties together a high-level API that
offers bindings in Scala, Java, Python, and R with
a low-level execution engine that translates the
high-level structure-oriented constructs into
query and execution plans that can be optimally
pushed toward the data sources.
• With the recent introduction of Structured Streaming, Spark aims to unify the data analytics model for batch and streaming.
• The streaming case is implemented as a recurrent
application of the different queries we define on
the streaming data, and enriched with
event-time support, windows, different output
modes, and triggers to differentiate the ingest
interval from the time that the output is
produced.
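To make these constructs concrete, here is a minimal Structured Streaming sketch in Scala. It is only an illustration: the broker address, the topic name sensor-readings, and the CSV payload layout are assumptions, while the watermark, event-time window, output mode, and trigger correspond to the features described above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object SensorAverages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical Kafka topic "sensor-readings" carrying CSV payloads: "sensorId,value,timestamp".
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sensor-readings")
      .load()

    val readings = raw
      .selectExpr("CAST(value AS STRING) AS csv")
      .select(split($"csv", ",").as("f"))
      .select(
        $"f".getItem(0).as("sensorId"),
        $"f".getItem(1).cast("double").as("value"),
        $"f".getItem(2).cast("timestamp").as("eventTime"))

    // Event-time window with a watermark; the same query is applied recurrently to new data.
    val averages = readings
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window($"eventTime", "5 minutes"), $"sensorId")
      .agg(avg($"value").as("avgValue"))

    averages.writeStream
      .outputMode("update")                        // output mode
      .trigger(Trigger.ProcessingTime("1 minute")) // decouples ingest from when output is produced
      .format("console")
      .start()
      .awaitTermination()
  }
}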
One-at-a-Time Processing
• One-at-a-time message processing ensures that each
message is processed as soon as it arrives in the
engine; hence, it delivers results with minimal delay.
• At the same time, shipping small messages individually
increases the overall overhead of the system and
therefore reduces the number of messages we can
process per unit of time.
Following the transportation analogy, one-at-a-time processing is like a taxi: an individual transport that can take a single passenger to their destination as fast as possible.
One-at-a-Time Processing
• The leading engine in this category is Apache
Flink.
• Flink is a one-at-a-time streaming framework, also offering snapshots to protect results from machine failures.
• This comprehensive framework provides a lower-level API than Structured Streaming, but it is a competitive alternative to Spark Streaming when low latency is the key focus of interest. Flink offers APIs in Scala and Java.
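For comparison, the following is a minimal sketch written against Flink's Scala DataStream API (deprecated in recent Flink releases in favor of the Java API). The socket source on localhost:9999 is an assumption for illustration; the checkpoint interval enables the snapshotting mechanism mentioned above, and each record flows through the operators as soon as it arrives.

import org.apache.flink.streaming.api.scala._

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Enable periodic snapshots (checkpoints) so results survive machine failures.
    env.enableCheckpointing(10000L) // every 10 seconds

    // Hypothetical source: lines of text arriving on a local socket.
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999)

    // Each record is processed as soon as it arrives (one-at-a-time).
    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .keyBy(_._1)
      .sum(1)

    counts.print()
    env.execute("one-at-a-time word count")
  }
}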
One-at-a-Time Processing
• flink.apache.org/flink-architecture.html
https://flink.apache.org/usecases.html
• https://engineering.zalando.com/posts/2016/03/apache-showdown-flink-vs.-spark.html
• https://romanglushach.medium.com/flink-spark-storm-kafka-a-comparative-analysis-of-big-data-stream-processing-frameworks-for-dab3dd42fc16
One-at-a-Time Processing
• In this space, we find Kafka Streams and Akka
Streams. These are not frameworks, but libraries
that can be used to build data-oriented
applications with a focus on data analytics.
• Both offer low-latency, one-at-a-time message
processing. Their APIs include projections,
grouping, windows, aggregations, and some
forms of joins.
• While Kafka Streams comes as a standalone
library that can be integrated into applications,
Akka Streams is part of the larger Reactive
Platform with a focus on microservices.
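As a sketch of this library style, the pipeline below uses the kafka-streams-scala wrapper (a recent Kafka 3.x release is assumed for TimeWindows.ofSizeWithNoGrace). The topics clicks and click-counts-per-minute are hypothetical; the pipeline exercises the grouping, windowing, aggregation, and projection operations mentioned above.

import java.time.Duration
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object ClickCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counts")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Hypothetical topic "clicks", keyed by user id, valued by page name.
  builder
    .stream[String, String]("clicks")
    .groupByKey                                                        // grouping
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))  // windowing
    .count()                                                           // aggregation
    .toStream
    .map((windowedKey, count) => (windowedKey.key, count.toString))    // projection
    .to("click-counts-per-minute")

  new KafkaStreams(builder.build(), props).start()
}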
• In this category, we also find Apache Beam, which provides a high-level model for defining parallel processing pipelines.
• This definition is then executed on a runner, the Apache Beam term for an execution engine.
• Apache Beam can use Apache Apex, Apache Flink, Apache Gearpump, and the proprietary Google Cloud Dataflow as runners.
How to Choose
• The choice of a processing engine is largely driven
by the throughput and latency requirements of
the use case at hand.
• If we need the lowest response time possible—
for example, an industrial sensor alert, alarms, or
an anomaly detection system—then we should
look into the one-at-a-time processing engines.
How to Choose
• If, on the other hand, we are implementing a massive ingest and the data needs to be processed along different data lines to produce, for example, a registry of every record and aggregated reports, as well as to train a machine learning model, then a micro-batch system will be best suited to handle the workload.
How to Choose
• In practice, we observe that this choice is also influenced by
the existing practices in the enterprise. Preferences for
specific programming languages and DevOps processes will
certainly be influential in the selection process.
• While software development teams might prefer compiled
languages in a stricter CI/CD pipeline, data science teams are
often driven by availability of libraries and individual language
preferences (R versus Python) that create challenges on the
operational side.
How to Choose
• Luckily, general-purpose distributed processing engines such as Apache Spark offer bindings in different languages, such as Scala and Java for the discerning developer, and Python and R for the data science practitioner.
Storage
• Storage runs across the entire data engineering
lifecycle, often occurring in multiple places in a data
pipeline, with storage systems crossing over with
source systems, ingestion, transformation, and
serving.
• In many ways, the way data is stored impacts how it
is used in all of the stages of the data engineering
lifecycle.
• For example, cloud data warehouses can store data,
process data in pipelines, and serve it to analysts.
• Streaming frameworks such as Apache Kafka and
Pulsar can function simultaneously as ingestion,
storage, and query systems for messages, with object
storage being a standard layer for data transmission.
Storage
• When Hadoop-based architectures became popular, the challenge that they were solving was two-fold:
• how to reliably store large amounts of data, and
• how to process it.
• These systems thought of data as being at rest.
• HDFS – replicated blocks provided reliability.
• MapReduce – moved parallel computations to where the data was stored, to remove the overhead of moving data over the network.
Storage as the Fast Data Borders
• In the particular case of Fast Data
architectures, storage usually
demarks the transition boundaries
between the Fast Data core and the
traditional applications that might
consume the produced data.
• The choice of storage technology
is driven by the particular
requirements of this transition
between moving and resting
data.
Key Considerations for Batch Versus Stream Ingestion
• Should you go streaming-first? Despite the attractiveness of a streaming-first approach, there are many trade-offs to understand and think about.
• If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
• Do I need millisecond real-time data ingestion?
Or would a micro-batch approach work,
accumulating and ingesting data, say, every
minute?
• What are my use cases for streaming ingestion?
What specific benefits do I realize by implementing
streaming?
• If I get data in real time, what actions can I take on that data that would be an improvement upon batch?
• Will my streaming-first approach cost more in
terms of time, money, maintenance, downtime,
and opportunity cost than simply doing batch?
• Are my streaming pipeline and system reliable
and redundant if infrastructure fails?
• What tools are most appropriate for the use case?
Should I use a managed service (Amazon Kinesis,
Google Cloud Pub/Sub, Google Cloud Dataflow) or
stand up my own instances of Kafka, Flink, Spark,
Pulsar, etc.? If I do the latter, who will manage it?
What are the costs and trade-offs?
• If I’m deploying an ML model, what benefits do I
have with online predictions and possibly continuous
training?
• Am I getting data from a live production instance? If
so, what’s the impact of my ingestion process on this
source system?
Push versus pull
• In the push model of data ingestion, a source
system writes data out to a target, whether a
database, object store, or filesystem.
• In the pull model, data is retrieved from the
source system.
• The line between the push and pull paradigms can
be quite blurry; data is often pushed and pulled as
it works its way through the various stages of a
data pipeline.
• One common method triggers a message
every time a row is changed in the source
database.
• This message is pushed to a queue, where the
ingestion system picks it up.
• Another common CDC method uses binary logs, which record every commit to the database.
• The database pushes to its logs. The ingestion
system reads the logs but doesn’t directly
interact with the database otherwise. This
adds little to no additional load to the source
database.
• Some versions of batch CDC use the pull
pattern. For example, in timestamp-based
CDC, an ingestion system queries the source
database and pulls the rows that have
changed since the previous update.
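A minimal sketch of this timestamp-based pull pattern using plain JDBC from Scala; the orders table, its updated_at column, and the watermark handling are assumptions made for illustration.

import java.sql.{DriverManager, Timestamp}

object TimestampCdcPull {
  // Hypothetical source table `orders` with an `updated_at` column.
  // `lastWatermark` would normally be persisted by the ingestion system between runs.
  def pullChanges(jdbcUrl: String, lastWatermark: Timestamp): Vector[(Long, Timestamp)] = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.prepareStatement(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at")
      stmt.setTimestamp(1, lastWatermark)
      val rs = stmt.executeQuery()
      val changed = Vector.newBuilder[(Long, Timestamp)]
      while (rs.next()) {
        changed += ((rs.getLong("id"), rs.getTimestamp("updated_at")))
      }
      // Rows changed since the previous update; the max timestamp becomes the new watermark.
      changed.result()
    } finally conn.close()
  }
}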
• Change Data Capture (CDC) is a continuous
data capture process that helps data scientists
and analysts get near-real-time access to
data. CDC is a software design pattern that
detects and captures data changes in a
database and then delivers them to a
downstream system or process.
• With streaming ingestion, data bypasses a
backend database and is pushed directly
to an endpoint, typically with data buffered by
an event-streaming platform.
• This pattern is useful with fleets of IoT sensors
emitting sensor data.
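A sketch of this push-to-endpoint pattern using the standard Kafka producer API from Scala; the broker address, the topic name sensor-readings, and the JSON payload are assumptions made for illustration.

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SensorIngest extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // Hypothetical reading from an IoT device, pushed straight to the event-streaming
  // platform; no backend database sits in the write path.
  val reading = """{"sensorId":"s-42","temperature":21.7,"ts":"2024-01-01T00:00:00Z"}"""
  producer.send(new ProducerRecord[String, String]("sensor-readings", "s-42", reading))

  producer.flush()
  producer.close()
}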
• If we need to store the complete data stream as it comes in, and we need access to each individual record or to sequential slices of them, we need a highly scalable backend with low-latency writes and key-based query capabilities.
• Apache Cassandra is a great choice in such a scenario, as it offers linear scalability and a limited but powerful query language (Cassandra Query Language, or CQL).
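A small sketch using the DataStax Java driver from Scala, assuming a local Cassandra node and a pre-created keyspace named fastdata. The table layout illustrates how a partition key plus a clustering column supports both per-record lookups and sequential slices.

import com.datastax.oss.driver.api.core.CqlSession

object SensorStore extends App {
  // Assumes a local Cassandra node and a pre-created keyspace named "fastdata".
  val session = CqlSession.builder().withKeyspace("fastdata").build()

  // One row per reading, partitioned by sensor and clustered by time,
  // so individual records and sequential slices can be fetched by key.
  session.execute(
    """CREATE TABLE IF NOT EXISTS readings (
      |  sensor_id text,
      |  ts timestamp,
      |  value double,
      |  PRIMARY KEY (sensor_id, ts)
      |) WITH CLUSTERING ORDER BY (ts DESC)""".stripMargin)

  session.execute(
    "INSERT INTO readings (sensor_id, ts, value) VALUES ('s-42', toTimestamp(now()), 21.7)")

  // CQL restricts queries to the partition key, which keeps lookups fast and scalable.
  val rows = session.execute("SELECT ts, value FROM readings WHERE sensor_id = 's-42' LIMIT 10")
  rows.forEach(r => println(s"${r.getInstant("ts")} -> ${r.getDouble("value")}"))

  session.close()
}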
• Data could also then be loaded into a
traditional data warehouse or
could be used to build a data lake
that can support different
capabilities including machine
learning, reporting, or ad hoc
analysis.
• On the other side of the spectrum, we have
predigested aggregates that are requested by
a frontend visualization system.
• Here, we probably want the full SQL query
and indexing support to quickly locate those
records for display.
• A more classical PostgreSQL, MySQL, or their commercial Relational Database Management System (RDBMS) counterparts would be a reasonable choice.
• Between these two cases is a whole
range of options, ranging from
specialized databases (such as InfluxDB
for time series or Redis for
fast in-memory lookups), to raw storage
(such as on-premises HDFS) or the cloud
storage offerings (Amazon S3, Azure
Storage, Google Cloud Storage, and
more).
The Message Backbone as Transition Point
• In some cases, it is even possible to use the message backbone as the data hand-over point.
• We can exploit the capabilities of a persistent event log, such as Apache Kafka, to transition data between the Fast Data applications and clients with different runtime characteristics.
• https://www.confluent.io/blog/okay-store-data-apache-kafka/
• Data in Kafka is persisted to disk,
checksummed, and replicated for fault
tolerance. Accumulating more stored data
doesn’t make it slower. There are Kafka
clusters running in production with over a
petabyte of stored data.
• Stream processing jobs do computation off a
stream of data coming via Kafka. When the logic of
the stream processing code changes, you often
want to recompute your results.
• A very simple way to do this is just to reset the
offset for the program to zero to recompute the
results with the new code.
• This sometimes goes by the somewhat grandiose
name of The Kappa Architecture.
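A minimal sketch of this reset-to-zero replay using the plain Kafka consumer API from Scala; the topic events and the group id are hypothetical. The same effect can also be achieved with Kafka's consumer-group offset reset tooling; the point is simply to rewind to the beginning of the retained log and let the new code recompute its results.

import java.time.Duration
import java.util.Properties
import java.util.{Collection => JCollection}

import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object ReplayFromZero extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "recompute-v2") // a fresh group id for the new version of the logic
  props.put("key.deserializer", classOf[StringDeserializer].getName)
  props.put("value.deserializer", classOf[StringDeserializer].getName)

  val consumer = new KafkaConsumer[String, String](props)

  // Rewind to offset zero as soon as partitions are assigned, so the new code
  // recomputes its results over the full retained log.
  consumer.subscribe(List("events").asJava, new ConsumerRebalanceListener {
    def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit = ()
    def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
      consumer.seekToBeginning(partitions)
  })

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.asScala.foreach(r => println(s"reprocessing offset ${r.offset()}: ${r.value()}"))
  }
}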
• Kafka is often used to capture and distribute a
stream of database updates (this is often called
Change Data Capture or CDC).
• Applications that consume this data in steady state just need the newest changes; however, new applications need to start with a full dump or snapshot of the data.
• However, performing a full dump of a large production database is often a very delicate and time-consuming operation.
• Enabling log compaction on the topic containing the stream of changes allows consumers of this data to simply reload by resetting to offset zero.
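One way to switch a topic to compaction, sketched with the Kafka AdminClient from Scala; the topic name orders-changelog is hypothetical, and the same cleanup.policy setting can equally be applied with the command-line topic configuration tools.

import java.util.Properties

import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object EnableCompaction extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  val admin = AdminClient.create(props)

  // Switch the hypothetical CDC topic "orders-changelog" to log compaction, so it always
  // retains at least the latest value per key and new consumers can rebuild state by
  // resetting to offset zero instead of dumping the source database.
  val topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders-changelog")
  val setCompact = new AlterConfigOp(
    new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET)

  admin.incrementalAlterConfigs(Map(topic -> List(setCompact).asJavaCollection).asJava)
    .all()
    .get()

  admin.close()
}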
• If your messaging system is not good at storing
messages then it isn’t good at “queuing” messages
either. You might think that this doesn’t matter
because you don’t plan on storing messages for very
long.
• But no matter how briefly messages are stored in
the messaging system, if the messaging system is
under continuous load, there will always be some
unconsumed messages being stored.
• So whenever it does fail, if it has no capability of
providing fault tolerant storage, it is going to lose
some data. Doing messaging right requires getting
storage right. This point seems obvious but is often
missed when people evaluate messaging systems.
• So storage is important for messaging use cases.
But in fact, Kafka isn’t really a message queue in
the traditional sense at all—in implementation it
looks nothing like RabbitMQ or other similar
technologies.
• https://aws.amazon.com/compare/the-difference-between-rabbitmq-and-kafka/
• It is much closer in architecture to a distributed filesystem or database than to a traditional message queue.
There are three primary differences between Kafka
and traditional messaging systems:
• As we described, Kafka stores a persistent log
which can be re-read and kept indefinitely.
• Kafka is built as a modern distributed system: it runs as a cluster, can expand or contract elastically, and replicates data internally for fault tolerance and high availability.
• Kafka is built to allow real-time stream
processing, not just processing of a single
message at a time. This allows working with data
streams at a much higher level of abstraction.
• We think of these differences as being enough
to make it pretty inaccurate to think of Kafka
as a message queue, and instead categorize it
as a Streaming Platform.
• Kafka’s core abstraction is a continuous time-
ordered log. The key to this abstraction is that it is
a kind of structured “file” that doesn’t end when
you come to the last byte, but rather, at least
logically, goes on forever.
• A program written to work with a log therefore doesn’t need to differentiate between data that already occurred and data that is going to happen in the future; it all appears as one continuous stream. This combination of storing the past and propagating the future behind a single uniform protocol and API is exactly what makes Kafka work well for stream processing.
• This stored log is very much like a file in a distributed file system: it is replicated across machines, persisted to disk, and supports high-throughput linear reads and writes. But it is also like a messaging system, in that it allows many high-throughput concurrent writes and has a very precise definition of when a message is published, allowing cheap, low-latency propagation of changes to lots of consumers. In this sense, it is the best of both worlds.
• When dealing with storage choices, there is no one-size-fits-all solution. Every Fast Data application will probably require a specific storage solution for each integration path with other applications in the enterprise ecosystem.
• https://www.confluent.io/confluent-cloud/tryfree/