Interactive Analytics with RADStack
Fangjin Yang, Gian Merlino, Nelson Ray, Xavier Léauté, Himanshu Gupta, Eric Tschetter
{fangjinyang, gianmerlino, ncray86, xavier.leaute, g.himanshu, echeddar}@gmail.com
model for unifying real-time and historical workflows.

The structure of the paper is as follows: Section 2 describes the problems and use cases that led to the creation of the RADStack. Section 3 describes Druid, the serving layer of the stack, and how Druid is built for real-time and batch data ingestion, as well as exploratory analytics. Section 4 covers the role of Samza and Hadoop for data processing, and Section 5 describes the role of Kafka for event delivery. In Section 6, we present our production metrics. Section 7 presents our experiences with running the RADStack in production, and in Section 8 we discuss related solutions.

2. BACKGROUND

The RADStack was first developed to address problems in online advertising. In online advertising, automated systems from different organizations place bids against one another to show users ads in the milliseconds before a webpage loads. These actions generate a tremendous volume of data. The data shown in Table 1 is an example of such data. Each event is comprised of three components: a timestamp indicating when the event occurred; a set of dimensions indicating various attributes about the event; and a set of metrics concerning the event. Organizations frequently serve insights on this data to ad publishers through visualizations and data applications. These applications must rapidly compute drill-downs and aggregates over this data, and answer questions such as "How many clicks occurred over the span of one week for publisher google.com?" or "How many impressions were seen over the last quarter in San Francisco?". Queries over any arbitrary number of dimensions should return in a few hundred milliseconds.

As an additional requirement, user-facing applications often face highly concurrent workloads, and good applications need to provide relatively consistent performance to all users. Of course, backend infrastructure also needs to be highly available. Downtime is costly, and many businesses cannot afford to wait if a system is unavailable in the face of software upgrades or network failure.

To address these requirements of scale, stability, and performance, we created Druid. Druid was designed from the ground up to provide arbitrary data exploration, low latency aggregations, and fast data ingestion. Druid was also designed to accept fully denormalized data, and moves away from the traditional relational model. Since most raw data is not denormalized, it must be processed before it can be ingested and queried. Multiple streams of data had to be joined, cleaned up, and transformed before they were usable in Druid, but that was the trade-off we were willing to make in order to get the performance necessary to power an interactive data application. We introduced stream processing to our stack to provide the processing required before raw data could be loaded into Druid. Our stream processing jobs range from simple data transformations, such as id to name lookups, up to complex operations such as multi-stream joins. Pairing Druid with a stream processor enabled flexible data processing and querying, but we still had problems with event delivery. Our events were delivered from many different locations and sources, and peaked at several million events per second. We required a high throughput message bus that could hold these events for consumption by our stream processor. To simplify data transmission for our clients, we wanted the message bus to be the single delivery endpoint for events entering our cluster.

Our stack would be complete here if real-time processing were perfect, but the open source stream processing space is still young. Processing jobs can go down for extended periods of time, and events may be delivered more than once. These are realities of any production data pipeline. To overcome these issues, we included Hadoop in our stack to periodically clean up any data generated by the real-time pipeline. We stored a copy of the raw events we received in a distributed file system, and periodically ran batch processing jobs over this data. The high level architecture of our setup is shown in Figure 1. Each component is designed to do a specific set of things well, and there is isolation in terms of functionality. Individual components can entirely fail without impacting the services of the other components.

Figure 1: The components of the RADStack. Kafka acts as the event delivery endpoint. Samza and Hadoop process data to load it into Druid. Druid acts as the endpoint for queries.

3. THE SERVING LAYER

Druid is a column-oriented data store designed for exploratory analytics and is the serving layer in the RADStack. A Druid cluster consists of different types of nodes and, similar to the overall design of the RADStack, each node type is instrumented to perform a specific set of things well. We believe this design separates concerns and simplifies the complexity of the overall system. To solve complex data analysis problems, the different node types come together to form a fully working system. The composition of and flow of data in a Druid cluster are shown in Figure 2.

3.1 Segments

Data tables in Druid (called "data sources") are collections of timestamped events partitioned into a set of segments, where each segment is typically 5–10 million rows. Segments represent the fundamental storage unit in Druid, and Druid queries only understand how to scan segments.

Druid always requires a timestamp column as a method of simplifying data distribution policies, data retention policies, and first-level query pruning. Druid partitions its data sources into well defined time intervals, typically an hour or a day, and may further partition on values from other columns to achieve the desired segment size. The time granularity to partition segments is a function of data volume and time range. A data set with timestamps spread over a year is better partitioned by day, and a data set with timestamps spread over a day is better partitioned by hour.

Segments are uniquely identified by a data source identifier, the time interval of the data, and a version string
Timestamp             Publisher          Advertiser  Gender  City           Click  Price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    San Francisco  0      0.65
2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    Waterloo       0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    Calgary        1      0.45
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  Taiyuan        0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  New York       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  Vancouver      1      1.53

Table 1: Sample ad data. These events are created when users view or click on ads.
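The drill-down questions above amount to filtered aggregations over events shaped like Table 1. A minimal sketch of such an aggregation in Python, over a hypothetical in-memory row set rather than an actual Druid query:

```python
from collections import defaultdict

# Hypothetical in-memory rows shaped like Table 1; in the RADStack this
# aggregation would be expressed as a Druid query instead.
events = [
    {"timestamp": "2011-01-01T01:01:35Z", "publisher": "bieberfever.com", "click": 0, "price": 0.65},
    {"timestamp": "2011-01-01T01:04:51Z", "publisher": "bieberfever.com", "click": 1, "price": 0.45},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "ultratrimfast.com", "click": 1, "price": 1.53},
]

def clicks_by_publisher(rows, start, end):
    """Count clicks per publisher for events with start <= timestamp < end.

    ISO-8601 UTC timestamps of equal length compare correctly as strings.
    """
    totals = defaultdict(int)
    for row in rows:
        if start <= row["timestamp"] < end:
            totals[row["publisher"]] += row["click"]
    return dict(totals)
```

A question like "How many clicks occurred over the span of one week for publisher google.com?" is then a lookup into `clicks_by_publisher(events, week_start, week_end)`.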
[Figure 2 diagram: streaming data flows into real-time nodes and batch data into deep storage; historical nodes load segments from deep storage; broker nodes route client queries to real-time and historical nodes; coordinator nodes, metadata storage, and distributed coordination are external dependencies.]

Figure 2: An overview of a Druid cluster and the flow of data through the cluster.
that increases whenever a new segment is created. The version string indicates the freshness of segment data; segments with later versions have newer views of data (over some time range) than segments with older versions. This segment metadata is used by the system for concurrency control; read operations always access data in a particular time range from the segments with the latest version identifiers for that time range.

Druid segments are stored in a column orientation. Given that Druid is best used for aggregating event streams, the advantages of storing aggregate information as columns rather than rows are well documented [2]. Column storage allows for more efficient CPU usage, as only what is needed is actually loaded and scanned. In a row oriented data store, all columns associated with a row must be scanned as part of an aggregation. The additional scan time can introduce significant performance degradations [2].

Druid nodes use one thread to scan one segment at a time, and the amount of data that can be scanned in parallel is directly correlated to the number of available cores in the cluster. Segments are immutable, and hence there is no contention between reads and writes in a segment.

A single query may scan thousands of segments concurrently, and many queries may run at the same time. We want to ensure that the entire cluster is not starved out while a single expensive query is executing. Thus, segments have an upper limit on how much data they can hold, and are sized to be scanned in a few milliseconds. By keeping segment computation very fast, cores and other resources are constantly being yielded. This ensures segments from different queries are always being scanned.

Druid segments are very self-contained for the time interval of data that they hold. Column data is stored directly in the segment. Druid has multiple column types to represent various data formats. Timestamps are stored in long columns, dimensions are stored in string columns, and metrics are stored in int, float, long, or double columns. Depending on the column type, different compression methods may be used. Metric columns are compressed using LZ4 [3] compression. String columns are dictionary encoded, similar to other data stores such as PowerDrill [8]. Additional indexes may be created for particular columns. For example, Druid will by default create inverted indexes for string columns.

3.2 Streaming Data Ingestion

Druid real-time nodes encapsulate the functionality to ingest, query, and create segments from event streams. Events indexed via these nodes are immediately available for querying. The nodes are only concerned with events for a relatively small time range (e.g. hours) and periodically hand off immutable batches of events they have collected over this small time range to other nodes in the Druid cluster that are specialized in dealing with batches of immutable events. The nodes announce their online state and the data they serve using a distributed coordination service (currently Zookeeper [10]).

Real-time nodes employ a log structured merge tree [14] for recently ingested data. Incoming events are first stored in an in-memory buffer. The in-memory buffer is directly queryable, and Druid behaves as a key/value store for queries on events that exist in this JVM heap-based store. The in-memory buffer is heavily write optimized, and given that Druid is really designed for heavy concurrent reads, events do not remain in the in-memory buffer for very long. Real-time nodes persist their in-memory indexes to disk either periodically or after some maximum row limit is reached. This persist process converts data stored in the in-memory buffer to the column oriented storage format described in Section 3.1.
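The persist behavior described above can be sketched as follows; the class, parameter names, and thresholds are illustrative and are not Druid's actual configuration keys:

```python
import time

class RealtimeBuffer:
    """Sketch of a write-optimized in-memory buffer that is flushed to an
    immutable persisted index either periodically or at a max row count."""

    def __init__(self, persist_period_s=600, max_rows=500_000, clock=time.monotonic):
        self.rows = []               # heap-resident, directly queryable
        self.persisted = []          # stand-in for immutable persisted indexes
        self.persist_period_s = persist_period_s
        self.max_rows = max_rows
        self.clock = clock
        self.last_persist = clock()

    def add(self, event):
        self.rows.append(event)
        # Persist when either the row limit or the persist period is reached.
        if len(self.rows) >= self.max_rows or \
           self.clock() - self.last_persist >= self.persist_period_s:
            self.persist()

    def persist(self):
        if self.rows:
            self.persisted.append(tuple(self.rows))  # immutable snapshot
            self.rows = []
        self.last_persist = self.clock()
```

Because persisted snapshots are immutable, queries can read them without coordinating with concurrent writes, mirroring the read/write isolation the text describes.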
[Figure 3 diagram: queries are served from both the heap-resident in-memory index and the off-heap persisted indexes; the persist operation moves events from the former to the latter.]

For further clarification, consider Figure 4. Figure 4 illustrates the operations of a real-time node. The node starts at 13:37 and, with a 10 minute window period, will only accept events for a window between 13:27 and 13:47. When the first events are ingested, the node announces that it is serving a segment of data for an interval from 13:00 to 14:00. Every 10 minutes (the persist period is configurable), the node will flush and persist its in-memory buffer to disk. Near the end of the hour, the node will likely begin to see events for the next hour. The node then announces that it is also serving a segment from 14:00 to 15:00. At 14:10, which is the end of the hour plus the window period, the node begins the hand off process.
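The window-period check in this walkthrough can be sketched as below, with times taken from the Figure 4 example; the function name is illustrative:

```python
from datetime import datetime, timedelta

def accepts(event_time, now, window=timedelta(minutes=10)):
    """A real-time node with a window period only accepts events whose
    timestamps fall within `window` of the current time (illustrative)."""
    return now - window <= event_time <= now + window

# Node time is 13:37, so the acceptance window is 13:27 to 13:47.
now = datetime(2011, 1, 1, 13, 37)
assert accepts(datetime(2011, 1, 1, 13, 30), now)      # inside the window
assert not accepts(datetime(2011, 1, 1, 13, 20), now)  # older than 13:27
```

Events outside the window are left for the batch pipeline to pick up later, which is one reason the stack keeps raw events in deep storage.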
[Figure 4 timeline: persist data for 13:00-14:00 at 13:47, 13:57, and 14:07; announce segment for data 14:00-15:00 at ~14:00; unannounce segment for data 13:00-14:00 at ~14:11.]

Figure 4: The node starts, ingests data, persists, and periodically hands data off. This process repeats indefinitely. The time periods between different real-time node operations are configurable.
…compare the result with what is actually loaded on those nodes […] across nodes.

Druid historical nodes are very simple in operation. They know how to load, drop, and serve segments. Historical nodes typically store all the data that is no longer recent enough to be served by real-time nodes. A historical node must first load and begin serving queries for a segment before that segment can be dropped from the real-time node. Since segments are immutable, the same copy of a segment can exist on multiple historical nodes and real-time nodes. Most nodes in typical production Druid clusters are historical nodes.

[Figure 5 diagram: segments across Day 1, Day 2, and Day 3, with Segment_v3 overshadowing Segment_v1 for an overlapping interval.]

Figure 5: Druid utilizes multi-version concurrency control and reads data from segments with the latest version for a given interval. Segments that are completely overshadowed are ignored and eventually automatically dropped from the cluster.

To consolidate results from historical and real-time nodes, Druid has a set of broker nodes which act as the client query endpoint. Broker nodes in part function as query routers to historical and real-time nodes. Broker nodes understand the metadata published in the distributed coordination service (Zookeeper) about what segments are queryable and where those segments are located. Broker nodes route incoming queries such that the queries hit the right historical or real-time nodes. Broker nodes also merge partial results from historical and real-time nodes before returning a final consolidated result to the caller.

Broker nodes maintain a segment timeline containing information about what segments exist in the cluster and the version of those segments. Druid uses multi-version concurrency control to manage how data is extracted from segments. Segments with higher version identifiers have precedence over segments with lower version identifiers. If two segments exactly overlap for an interval, Druid only considers the data from the segment with the higher version. This is illustrated in Figure 5.

Segments are inserted into the timeline as they are announced. The timeline sorts the segments based on their data interval in a data structure similar to an interval tree. Lookups in the timeline will return all segments with intervals that overlap the lookup interval, along with interval ranges for which the data in a segment is valid.

Brokers extract the interval of a query and use it for lookups into the timeline. The result of the timeline is used to remap the original query into a set of specific queries for the actual historical and real-time nodes that hold the pertinent query data. The results from the historical and real-time nodes are finally merged by the broker, which returns the final result to the caller.

The coordinator node also builds a segment timeline for the segments in the cluster. If a segment is completely overshadowed by one or more segments, it will be flagged in this timeline. When the coordinator notices overshadowed segments, it tells historical nodes to drop these segments from the cluster.

4. THE PROCESSING LAYER

Although Druid can ingest events that are streamed in one at a time, data must be denormalized beforehand, as Druid cannot yet support join queries. Furthermore, real world data must often be transformed before it is usable by an application.
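A minimal sketch of the simplest kind of transformation involved, the id-to-name lookups mentioned in Section 2, which in our stack run as Samza jobs; the field names and lookup table here are hypothetical:

```python
# Hypothetical id-to-name lookup applied to each event before loading
# into Druid, which expects fully denormalized rows.
ADVERTISER_NAMES = {1: "google.com", 2: "ultratrimfast.com"}

def denormalize(event, lookup=ADVERTISER_NAMES):
    """Replace a foreign-key style id with its display name."""
    out = dict(event)
    out["advertiser"] = lookup.get(out.pop("advertiser_id"), "unknown")
    return out

assert denormalize({"advertiser_id": 1, "click": 0}) == {"click": 0, "advertiser": "google.com"}
```

Multi-stream joins, the complex end of the spectrum, require the co-partitioning shuffle discussed below rather than a per-event map like this one.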
Figure 6: Ad impressions and clicks are recorded in two separate streams. An event we want to join is located in two different Kafka partitions on two different topics.

Figure 7: A shuffle operation ensures events to be joined are stored in the same Kafka partition.
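The shuffle in Figure 7 amounts to re-keying both streams by a shared join key so that hash partitioning sends matching events to the same partition. A sketch, with illustrative key names and partition count:

```python
# Sketch of the Figure 7 shuffle: both the impression stream and the
# click stream are re-keyed by the join key, so matching events hash to
# the same partition and can be joined there. Names are illustrative.
NUM_PARTITIONS = 8

def partition_for(join_key, num_partitions=NUM_PARTITIONS):
    """Deterministically map a join key to a partition."""
    return hash(join_key) % num_partitions

impression = {"impression_id": "abc123", "type": "impression"}
click = {"impression_id": "abc123", "type": "click"}

# After the shuffle, both events land in the same partition.
assert partition_for(impression["impression_id"]) == partition_for(click["impression_id"])
```

Because partitioning is deterministic in the key, a single stream-processing task owns all events for a given impression id, regardless of which topic they arrived on.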
The final job of our processing pipeline is to deliver data to Druid. For high availability, processed events from Samza are transmitted concurrently to two real-time nodes. Both nodes receive the same copy of the data, and effectively act as replicas of each other. The Druid broker can query either copy of the data. When handoff occurs, both real-time nodes race to hand off the segments they've created. The segment that is pushed into deep storage first will be the one that is used for historical querying, and once that segment is loaded on the historical nodes, both real-time nodes will drop their versions of the same segment.

Data Source   Dimensions   Metrics
a             25           21
b             30           26
c             71           35
d             60           19
e             29            8
f             30           16
g             26           18
h             78           14

Table 2: Characteristics of production data sources.
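The handoff race between the two replica real-time nodes is safe because pushing a segment into deep storage is effectively first-write-wins for a given segment identifier. A sketch, with illustrative identifiers:

```python
# Sketch of the handoff race: two replica real-time nodes push the same
# segment; the first copy into deep storage wins and the second push is
# a no-op. Segment identifiers and payloads are illustrative.
deep_storage = {}

def hand_off(segment_id, payload):
    """Return True if this node's copy won the race."""
    if segment_id in deep_storage:
        return False          # the other replica already pushed this segment
    deep_storage[segment_id] = payload
    return True

won_a = hand_off("events_2011-01-01T01/v1", b"replica-a-bytes")
won_b = hand_off("events_2011-01-01T01/v1", b"replica-b-bytes")
assert (won_a, won_b) == (True, False)
assert deep_storage["events_2011-01-01T01/v1"] == b"replica-a-bytes"
```

Since both replicas built the segment from the same input stream, either copy is acceptable; the race only decides which identical copy historical nodes will load.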
4.2 Batch Processing

Our batch processing pipeline is composed of a multi-stage MapReduce [4] pipeline. The first set of jobs mirrors our stream processing pipeline in that it transforms data and joins relevant events in preparation for the data to be loaded into Druid. The second set of jobs is responsible for directly creating immutable Druid segments. The indexing code for both streaming and batch ingestion in Druid is shared between the two modes of ingestion. These segments are then uploaded to deep storage and registered with the metadata store. Druid will proceed to load the batch generated segments.

The batch process typically runs much less frequently than the real-time process, and may run many hours or even days after raw events have been delivered. The wait is necessary for severely delayed events, and to ensure that the raw data is indeed complete.

Segments generated by the batch process are versioned by the start time of the process. Hence, segments created by batch processing will have a version identifier greater than that of segments created by real-time processing. When these batch created segments are loaded in the cluster, they atomically replace segments created by real-time processing for their processed interval. Hence, soon after batch processing completes, Druid queries begin reflecting batch-originated data rather than real-time-originated data.

We use the streaming data pipeline described in Section 4.1 to deliver immediate insights on events as they are occurring, and the batch data pipeline described in this section to provide an accurate copy of the data.

5. THE DELIVERY LAYER

In our stack, events are delivered over HTTP to Kafka. Events are transmitted via POST requests to a receiver that acts as a front for a Kafka producer. Kafka is a distributed messaging system with a publish and subscribe model. At a high level, Kafka maintains events or messages in categories called topics. A distributed Kafka cluster consists of numerous brokers, which store messages in a replicated commit log. Kafka consumers subscribe to topics and process feeds of published messages.

Kafka provides functionality isolation between producers of data and consumers of data. The publish/subscribe model works well for our use case, as multiple consumers can subscribe to the same topic and process the same set of events. We have two primary Kafka consumers. The first is a Samza job that reads messages from Kafka for stream processing as described in Section 4.1. Topics in Kafka map to pipelines in Samza, and pipelines in Samza map to data sources in Druid. The second consumer reads messages from Kafka and stores them in a distributed file system. This file system is the same as the one used for Druid's deep storage, and also acts as a backup for raw events. The purpose of storing raw events in deep storage is so that we can run batch processing jobs over them at any given time. For example, our stream processing layer may choose to not include certain columns when it first processes a new pipeline. If we want to include these columns in the future, we can reprocess the raw data to generate new Druid segments.

Kafka is the single point of delivery for events entering our system, and must have the highest availability. We replicate our Kafka producers across multiple datacenters. In the event that Kafka brokers and consumers become unresponsive, as long as our HTTP endpoints are still available, we can buffer events on the producer side while recovering the system. Similarly, if our processing and serving layers completely fail, we can recover by replaying events from Kafka.

6. PERFORMANCE

Druid runs in production at several organizations, and to briefly demonstrate its performance, we have chosen to share some real world numbers for one of the larger production clusters. We also include results from synthetic workloads on TPC-H data.

6.1 Query Performance in Production

Druid query performance can vary significantly depending on the query being issued. For example, sorting the values of a high cardinality dimension based on a given metric is much more expensive than a simple count over a time range. To showcase the average query latencies in a production Druid cluster, we selected 8 frequently queried data sources, described in Table 2. The queries range from standard aggregates involving different types of metrics and filters, to ordered group bys over one or more dimensions with aggregates, to search queries and metadata retrieval queries. Queries involving a single column are very frequent, and queries involving all columns are very rare.

• There were approximately 50 total data sources in this particular cluster and several hundred users issuing queries.

• There was approximately 10.5TB of RAM available in this cluster and approximately 10TB of segments loaded. Collectively, there are about 50 billion Druid rows in this cluster. Results for every data source are not shown.
[Figure 10: mean, 90%ile, and 99%ile query latencies by data source (a, b, d, e, f, h), plotted over time from Feb 03 to Feb 24.]

Figure 10: Query latencies of production data sources.

d4   40   17   94064
d5   41   23   68104
d6   31   31   64222
d7   29    8   30048

Table 3: Characteristics of production data sources.

…time interval and 36,246,530 rows/second/core for a select sum(float) type query.
…down a view of data until an interesting observation is made. Users tend to explore short intervals of recent data. In the reporting use case, users query for much longer data intervals, and the volume of queries is generally much less. The insights that users are looking for are often pre-determined.

…query speed degradations, less than optimally tuned hardware, and various other system bottlenecks.

8. RELATED WORK
11. REFERENCES

[1] Apache samza. http://samza.apache.org/, 2014.
[2] D. J. Abadi, S. R. Madden, and N. Hachem. Column-stores vs. row-stores: How different are they really? In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 967–980. ACM, 2008.
[3] Y. Collet. Lz4: Extremely fast compression algorithm. code.google.com, 2013.
[4] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.
[6] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. Sap hana database: data management for modern business applications. ACM Sigmod Record, 40(4):45–51, 2012.
[7] B. Fink. Distributed computation on dynamo-style distributed storage: riak pipe. In Proceedings of the eleventh ACM SIGPLAN workshop on Erlang workshop, pages 43–50. ACM, 2012.
[8] A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. Processing a trillion cells per mouse click. Proceedings of the VLDB Endowment, 5(11):1436–1446, 2012.
[9] M. Hausenblas and N. Bijnens. Lambda architecture. http://lambda-architecture.net/, 2014.
[10] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for internet-scale systems. In USENIX ATC, volume 10, 2010.
[11] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In Proceedings of 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece, 2011.
[12] N. Marz. Storm: Distributed and fault-tolerant realtime computation. http://storm-project.net/, February 2013.
[13] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.
[14] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (lsm-tree). Acta Informatica, 33(4):351–385, 1996.
[15] Paraccel analytic database. http://www.paraccel.com/resources/Datasheets/ParAccel-Core-Analytic-Database.pdf, March 2013.
[16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.
[17] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. Mapreduce and parallel dbmss: friends or foes? Communications of the ACM, 53(1):64–71, 2010.
[18] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-store: a column-oriented dbms. In Proceedings of the 31st international conference on Very large data bases, pages 553–564. VLDB Endowment, 2005.
[19] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik. Requirements for science data bases and scidb. In CIDR, volume 7, pages 173–184, 2009.
[20] E. Tschetter. Introducing druid: Real-time analytics at a billion rows per second. http://druid.io/blog/2011/04/30/introducing-druid.html, April 2011.
[21] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing, page 5. ACM, 2013.
[22] L. VoltDB. Voltdb technical overview. https://voltdb.com/, 2010.
[23] F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli. Druid: a real-time analytical data store. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 157–168. ACM, 2014.
[24] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.
[25] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, pages 10–10. USENIX Association, 2012.