DAG vs MapReduce
The new generation of Big Data tools largely focus on improving support for real-time (or near-real-time) computation and interactive applications by reducing the latency involved in processing jobs.
If you look at Storm, Spark, Tez, and other newer tools, you will frequently encounter the term "DAG," or Directed Acyclic Graph. This article will explain why traditional MapReduce is subject to undesirable latencies, what a DAG is, and why these new systems use this approach.
Hadoop, which began life specifically as an implementation of the MapReduce paradigm, has traditionally relied on MapReduce as its primary programming model. Hadoop MapReduce jobs display high latencies as a result of the programming model of traditional MapReduce, in which jobs follow a stock structure of "map," followed by "shuffle," followed by "reduce" steps. Even single-step jobs under MapReduce tend to feature high latencies. This problem is exacerbated for more complex processing involving "chaining" successive MapReduce jobs. In multi-step jobs, each job is blocked from beginning until all of the preceding jobs have finished. As a result of this model, complex computations can require time on the order of minutes, hours, or longer, even with fairly small data volumes.
A Directed Acyclic Graph, in this context, refers to a model for scheduling work in which jobs are represented as vertices in a graph, where the order of execution is specified by the directionality of the edges in the graph. The "acyclic" part just means that there are no loops ("cycles") in the graph. In a system which schedules jobs using a DAG, independent nodes (computational steps) in the graph can run in parallel, rather than sequentially. This approach makes it easier for programmers to build more complex multi-step computations, and avoids the scheduling overhead imposed by traditional MapReduce.
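To make this concrete, here is a minimal sketch in plain Python (not any particular framework's API; the job names and the run_dag helper are purely illustrative) of scheduling a small DAG of jobs: each job declares the jobs it depends on, and any job whose dependencies have finished can run in parallel with the other ready jobs.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# A toy job DAG: each job maps to the set of jobs it depends on.
# extract_a and extract_b are independent, so they may run in parallel;
# "join" must wait for both of them; "report" must wait for "join".
jobs = {
    "extract_a": set(),
    "extract_b": set(),
    "join":      {"extract_a", "extract_b"},
    "report":    {"join"},
}

def run_job(name):
    print("running", name)   # stand-in for real work
    return name

def run_dag(jobs):
    remaining = {name: set(deps) for name, deps in jobs.items()}
    finished = set()
    with ThreadPoolExecutor() as pool:
        futures = {}
        while remaining or futures:
            # Submit every job whose dependencies are all satisfied.
            ready = [n for n, deps in remaining.items() if deps <= finished]
            for name in ready:
                futures[pool.submit(run_job, name)] = name
                del remaining[name]
            # Wait for at least one running job to finish, then loop.
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                finished.add(futures.pop(fut))

run_dag(jobs)
```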
Of course, simply switching to a DAG for scheduling does not alleviate the high latencies associated with single-step Hadoop MapReduce jobs. This is why even workflows constructed as DAGs that link Hadoop MapReduce jobs still suffer in the latency area. An example of this problem would be using an external scheduler like Oozie to control a series of MapReduce jobs. Each workflow still has to pay the cost of high startup times and high latencies for individual jobs. So in order to achieve low overall latency, systems such as Spark, Storm, Samza, and others have also added other optimizations, primarily copying data into memory and performing substantially less disk I/O.
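As a rough illustration of the in-memory point, the PySpark sketch below (assuming a local PySpark installation; the data and lambdas are made up) chains several transformations into one DAG and caches the intermediate result, so later actions reuse it from memory rather than re-reading from disk the way chained MapReduce jobs would.

```python
from pyspark import SparkContext

# A minimal sketch, assuming a local PySpark installation.
sc = SparkContext("local[*]", "dag-example")

events = sc.parallelize(range(1000000))

# Each transformation just adds a node to Spark's DAG; nothing runs yet.
cleaned = (events
           .map(lambda x: x * 2)
           .filter(lambda x: x % 3 == 0)
           .cache())          # keep the intermediate result in memory

# Both actions below reuse the cached, in-memory result instead of
# re-materializing it from disk, which is where much of the latency
# win over chained MapReduce jobs comes from.
print(cleaned.count())
print(cleaned.take(5))

sc.stop()
```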
Aside from improving latency, DAG-based systems have other advantages. For example, it is simpler to implement a fault-tolerant approach using a DAG. In the event of a job failure, you can easily backtrack through the graph and re-execute any failed jobs, even at intermediate stages of a computation. The enforced order of the graph always allows you to walk through the graph from any node to the eventual end.
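A toy example of that recovery idea, again in plain Python rather than any specific engine: each node records its parents and how to compute its output from theirs, so a lost intermediate result can be rebuilt by walking back through the graph and re-running only the missing pieces.

```python
# Illustrative lineage-style recovery: node names and functions are made up.
graph = {
    "load":   {"parents": [],         "fn": lambda: list(range(10))},
    "double": {"parents": ["load"],   "fn": lambda xs: [x * 2 for x in xs]},
    "total":  {"parents": ["double"], "fn": lambda xs: sum(xs)},
}

results = {}

def compute(node):
    if node in results:                  # already materialized
        return results[node]
    parents = [compute(p) for p in graph[node]["parents"]]
    results[node] = graph[node]["fn"](*parents)
    return results[node]

print(compute("total"))      # 90

# Simulate losing intermediate results (e.g. a failed worker), then
# rebuild them by backtracking through the graph.
del results["double"]
del results["total"]
print(compute("total"))      # recomputed from "load", still 90
```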
Finally, we would be remiss in not pointing out that Hadoop has also moved beyond its historical reliance on simple MapReduce. The Hadoop 2.x series has refactored the resource allocation and scheduling components to support a much more flexible architecture, which allows the implementation of new, non-MapReduce programming models. With Hadoop 2, other processing engines can layer on top of YARN and provide low-latency, real-time processing while living side-by-side with jobs written for MapReduce, MPI, BSP, or other execution models. Spark, in fact, can be deployed onto an existing Hadoop cluster and take advantage of YARN for scheduling and resource allocation.
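For instance, a PySpark application can request its resources from YARN simply by pointing its master at the cluster. The sketch below uses the Spark 1.x style ("yarn-client") that was current when this was written and assumes HADOOP_CONF_DIR points at the cluster configuration; the executor count and memory settings are arbitrary examples.

```python
from pyspark import SparkConf, SparkContext

# A sketch of running Spark on an existing Hadoop cluster: with
# HADOOP_CONF_DIR set, requesting a YARN master lets YARN handle
# scheduling and resource allocation for the Spark executors.
conf = (SparkConf()
        .setAppName("spark-on-yarn-example")
        .setMaster("yarn-client")                # Spark 1.x style; "yarn" in Spark 2+
        .set("spark.executor.instances", "4")    # example values only
        .set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())
sc.stop()
```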
As you can see, a Directed Acyclic Graph approach is a key element of most next-generation, real-time Big Data platforms. These tools, including Storm, Spark, Samza, and Tez, offer amazing new capabilities for building highly interactive, real-time computing systems to power your real-time BI, predictive analytics, real-time marketing, and other critical systems.
Are you looking to incorporate a new generation of Big Data tools to support real-time computation and interactive applications? Interested in Hadoop, or in expanding into the Hadoop ecosystem to give your organization the data-driven success stories it needs? Give us a call at 919.321.0119 or send us an email to get started.
- Phil Rhodes, Senior Consultant