
Module 2: Hadoop HDFS and MapReduce

2.1 Distributed File Systems


Hadoop Distributed File System
It is the most important component of the Hadoop ecosystem. HDFS is the primary storage
system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
HDFS is a distributed file system that runs on commodity hardware. HDFS ships with a
default configuration that suits many installations. Users interact with HDFS through
shell-like commands.
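For illustration, a few of those shell-like commands are shown below; the paths and file names are hypothetical, but the commands themselves are standard HDFS shell operations.

hdfs dfs -mkdir /user/demo                 # create a directory in HDFS
hdfs dfs -put localfile.txt /user/demo/    # copy a local file into HDFS
hdfs dfs -ls /user/demo                    # list the contents of the directory
hdfs dfs -cat /user/demo/localfile.txt     # print the file stored in HDFS
hdfs dfs -get /user/demo/localfile.txt .   # copy the file back to the local disk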

Hadoop Distributed File System is a block-structured file system where each file is
divided into blocks of a pre-determined size. These blocks are stored across a cluster of
one or more machines.

HDFS Components

There are two major components of Hadoop HDFS.

• NameNode (Master Node)

• DataNode (Slave Node)

• NameNode

It is also known as the Master node. The NameNode does not store the actual data or dataset.
It stores metadata, i.e. the number of blocks, their locations, the rack and the DataNode
on which the data is stored, and other details. The namespace it manages consists of files and directories.

Tasks of HDFS NameNode

• Manages the file system namespace.

• Regulates clients' access to files.

• Executes file system operations such as naming, closing and opening files and
directories.

• DataNode
It is also known as a Slave node. The HDFS DataNode is responsible for storing the actual
data in HDFS. A DataNode performs read and write operations as per the requests of the
clients. A block replica on a DataNode consists of two files on the local file system: the
first file holds the data, and the second records the block's metadata, which includes
checksums for the data. At startup, each DataNode connects to its corresponding NameNode
and performs a handshake, which verifies the namespace ID and the software version of the
DataNode. If a mismatch is found, the DataNode shuts down automatically.

Tasks of HDFS DataNode:

• DataNode performs operations like block replica creation, deletion and replication
according to the instructions of the NameNode.
• DataNode manages the data storage of the system.
Hadoop HDFS Daemons
There are two daemons which run in HDFS for data storage:

• Namenode:
This is the daemon that runs on the master. The NameNode stores metadata like the file
name, the number of blocks, the number of replicas, the locations of blocks, block IDs,
etc. This metadata is kept in memory on the master for faster retrieval of data; a copy
of the metadata is also kept on the local disk for persistence. The NameNode's memory
should therefore be sized according to the requirements.

• Datanode:
This is the daemon that runs on the slaves. These are the actual worker nodes that store
the data.
HDFS Architecture

• This architecture gives you a complete picture of Hadoop Distributed File System.
There is a single namenode which stores metadata and there are multiple datanodes
which do actual storage work.
• Nodes are arranged in racks and Replicas of data blocks are stored on different racks in
the cluster to provide fault tolerance.
• To read or write a file in HDFS, the client needs to interact with the Namenode. HDFS
applications need a write-once-read-many access model for files: a file once created
and written cannot be edited.
• The Namenode stores metadata while the Datanodes store the actual data. The client
interacts with the Namenode for any task to be performed, as the Namenode is the
centerpiece of the cluster.
• There are several Datanodes in the cluster which store HDFS data on their local disks.
Each Datanode sends a heartbeat message to the Namenode periodically to indicate that it
is alive, and it replicates data to other Datanodes as per the replication factor.
Data storage in HDFS

Whenever a file has to be written to HDFS, it is broken into small pieces of data
known as blocks. HDFS has a default block size of 128 MB, which can be increased as
per the requirements. These blocks are stored in the cluster in a distributed manner on
different nodes. This provides a mechanism for MapReduce to process the data in
parallel in the cluster.
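As a small illustration of block-level storage, the block size can be overridden when a file is written and the resulting blocks inspected with fsck; the path and size below are only examples.

hdfs dfs -D dfs.blocksize=268435456 -put bigfile.dat /data/bigfile.dat   # write the file with a 256 MB block size
hdfs fsck /data/bigfile.dat -files -blocks -locations                    # list the blocks and the DataNodes holding them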

2.1.1 Physical Organization of Compute Nodes

• Compute nodes are stored on racks, perhaps 8-64 on a rack.

• Nodes on a single rack are connected by a network.

• There can be many racks of compute nodes, and racks are connected by
another level of network or a switch.
• The bandwidth of inter-rack communication is somewhat greater than that of the
intra-rack Ethernet, but given the number of pairs of nodes that might need to
communicate between racks, this bandwidth may be essential.
Rack Awareness in Hadoop HDFS

• Hadoop runs on a cluster of computers which are commonly spread across many racks.
• The NameNode places replicas of a block on multiple racks for improved fault
tolerance.
• The NameNode tries to place at least one replica of a block in each rack, so that even
if a complete rack goes down the system remains highly available.
• Optimizing replica placement distinguishes HDFS from most other
distributed file systems.
• The purpose of a rack-aware replica placement policy is to improve data
reliability, availability, and network bandwidth utilization.
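A minimal configuration sketch of how rack awareness is typically enabled, assuming an administrator-written topology script; the file path and values are illustrative only.

core-site.xml (maps each DataNode address to a rack via the script):

    <property>
        <name>net.topology.script.file.name</name>
        <value>/etc/hadoop/conf/rack-topology.sh</value>
    </property>

hdfs-site.xml (replication factor; with three replicas the default policy keeps two on one rack and one on another):

    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>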

2.1.2 Large Scale File System Organization

• This new file system is often called a Distributed File System or DFS.
• A DFS is useful when files can be enormous, possibly a terabyte in size, and are rarely
updated.
• Files are divided into chunks, which are typically 64 megabytes in size.
• Chunks are replicated, perhaps three times, at three different compute nodes.
• The nodes holding copies of one chunk should be located on different racks so as to
avoid losing all copies due to a rack failure.
• The user can decide the chunk size and the degree of replication.

• A master node is used to find the chunks of a file. This node is itself replicated,
and a directory for the file system as a whole knows where to find its copies. The
directory itself can be replicated, and all participants using the DFS know where the
directory copies are.
• DFS implementations include:

▪ The Google File System (GFS)

▪ Hadoop Distributed File System (HDFS)

▪ CloudStore

2.2 Map-Reduce:

• MapReduce is the data processing component of Hadoop.

• MapReduce is a software framework which helps in writing applications that process
large data sets using distributed and parallel algorithms inside the Hadoop environment.

• MapReduce works in two phases: the Map phase and the Reduce phase.

• Map Phase:

▪ This phase takes input as key-value pairs and produces output as key-value pairs.
Custom business logic can be written in this phase.

▪ The Map function performs actions like filtering, grouping and sorting.

▪ The result generated by the Map function is a key-value pair (K, V) which acts as
the input for the Reduce function.
• Reduce Phase:

▪ This phase aggregates and summarizes the results produced by the Map function.

2.2.1 The Map Tasks

• A Map function is written to convert input elements to key-value pairs.

• The types of keys and values are each arbitrary.

• Keys are not unique. A Map task can produce several key-value pairs with the same key,
even from the same element. An element can be a tuple or a document.

Example: Counting the number of occurrences of each word in a collection of documents.

• The input file is a repository of documents, and each document is an element.

• The Map function uses keys that are of type String (the words) and values that are
integers.

• The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn.

• It then emits a sequence of key-value pairs where the value is always 1. That is, the
output of the Map task for this document is the sequence of key-value pairs:
(w1, 1), (w2, 1), ..., (wn, 1).
2.2.2 Grouping by key

• After all the Map tasks have completed successfully, the master controller merges the
files from each Map task that are destined for a particular Reduce task and feeds the
merged file to that process as a sequence of key-list-of-values pairs.

• For each key k, the input to the Reduce task that handles key k is a
pair of the form (k, [v1, v2,..., vn]), where (k, v1), (k, v2),..., (k, vn)
are all the key-value pairs with key k coming from all the Map
tasks.

2.2.3 The Reduce Tasks

• The input to a Reduce task is a set of pairs, each consisting of a key and its list of
associated values; the Reduce task combines those values in some way.

• The output of a Reduce task is a sequence of key-value pairs consisting of each input
key k (the word) that the Reduce task received, paired with the combined value
constructed from the list of values that the Reduce task received along with the key k.

• The outputs from all the Reduce tasks are merged into a single file.

Example: The word-count example

• The Reduce function adds up all the values.

• The output of a Reduce task is a sequence of (w, m) pairs, where w is a word that
appears at least once among all the input documents and m is the total number of
occurrences of w across all those documents.
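A minimal runnable sketch of the word-count job in Python, simulating the Map phase, the group-by-key (shuffle) step, and the Reduce phase in memory; the documents and function names are illustrative, not the Hadoop API.

from collections import defaultdict

def map_task(doc_id, text):
    # emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_task(word, counts):
    # sum the 1s collected for a single word
    yield (word, sum(counts))

def run_word_count(documents):
    # Map phase
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_task(doc_id, text))
    # Group by key (the framework's shuffle/sort step)
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)
    # Reduce phase
    result = {}
    for word, counts in grouped.items():
        for w, total in reduce_task(word, counts):
            result[w] = total
    return result

docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
print(run_word_count(docs))   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}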

2.2.4 Combiners

• A Hadoop Combiner, also known as a "Mini-Reducer", summarizes the Mapper output records
with the same key before they are passed to the Reducer.

• When a MapReduce job runs on a large dataset, the Mapper generates large chunks of
intermediate data, and this intermediate data is passed on to the Reducer for further
processing, which leads to enormous network congestion.

• The MapReduce framework provides a function known as the Hadoop Combiner that plays a
key role in reducing this network congestion.
Working of the MapReduce Combiner:
• Without a combiner, the input is split across two mappers and 9 keys are generated by
the mappers. Hence (9 key/value) intermediate data is generated by the mappers, and the
mappers send this data directly to the reducer, consuming network bandwidth while doing
so. It takes more time to transfer the data to the reducer if the data is big.

• Now, if a Hadoop Combiner is placed between the mapper and the reducer, the combiner
aggregates the intermediate data (9 key/value pairs) before sending it to the reducer
and generates 4 key/value pairs as output.

• The reducer now needs to process only the 4 key/value pairs generated by the 2
combiners. Thus the reducer is executed only 4 times to produce the final output, which
increases the overall performance.
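A hedged sketch of the effect of a combiner in the same word-count setting: the combiner applies the reducer's summing logic locally to each mapper's output before anything crosses the network (a plain-Python simulation with made-up splits, not the Hadoop Combiner API).

from collections import defaultdict

def map_task(text):
    for word in text.split():
        yield (word, 1)

def combine(pairs):
    # local, per-mapper aggregation using the same logic as the reducer
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

splits = ["the quick brown fox the fox", "the lazy dog"]
# without a combiner, every (word, 1) pair is shuffled to the reducers
without = [pair for text in splits for pair in map_task(text)]
# with a combiner, each mapper sends only one pair per distinct word
with_combiner = [pair for text in splits for pair in combine(map_task(text))]
print(len(without), len(with_combiner))   # 9 intermediate pairs without a combiner, 7 with it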

2.2.5 Details of Map-Reduce Execution


• The user program first forks the master controller and worker processes at different
compute nodes.

• A worker takes up the job of a mapper or a reducer, but not both simultaneously.

• The master creates Map tasks by considering the chunks of an input file, where usually
one or more chunks are handled by one Map task, and creates Reduce tasks as dictated by
the user program.

• The number of Reduce tasks is usually smaller than the number of Map tasks.

• The master assigns tasks to workers.

• The master keeps track of the status of each Map and Reduce task; it could be idle,
executing, or completed.

• The job of a worker is to inform the master about the progress of the job being
executed.

• A Map task creates intermediate files on the local disk of its worker for the assigned
operation. These are later passed to the Reduce tasks.

• A Reduce task performs the reduction based on the code supplied by the user and
produces output accordingly.
2.2.6 Coping with Node Failures

• The compute node at which the Master is executing fails

▪ The master node, which controls everything, fails; this brings the entire process
down, and the entire MapReduce job must be restarted.

▪ Failures anywhere other than at the master node are managed by the master, and the
MapReduce job will complete eventually.

• The compute node at which a Map worker resides fails

▪ This failure will be detected by the Master, because it periodically pings the
Worker processes.

▪ All the Map tasks that were assigned to this Worker will have
to be redone, even if they had completed. The reason for
redoing completed Map tasks is that their output destined for
the Reduce tasks resides at that compute node, and is now
unavailable to the Reduce tasks.

▪ The Master sets the status of each of these Map tasks to idle
and will schedule them on a Worker when one becomes
available.
▪ The Master must also inform each Reduce task that the
location of its input from that Map task has changed.

• The compute node at which a Reduce worker resides fails

▪ The Master simply sets the status of its currently executing Reduce tasks to idle.
These will be rescheduled on another Reduce worker later.

2.3 Algorithms Using Map-Reduce


2.3.1 Matrix-Vector Multiplication by Map-Reduce

Suppose M is an n x n matrix whose element in row i and column j is denoted mij, and v is
a vector of length n whose jth element is vj. Then the matrix-vector product is the
vector x of length n, whose ith element xi is given by

xi = Σ j=1..n  mij * vj

Pseudo code:

map(key, value):
    for (i, j, mij) in value:
        emit(i, mij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
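A runnable Python sketch of the same computation, with the matrix stored as (i, j, mij) triples and the vector v assumed small enough to sit in memory at every Map task; the data values are arbitrary examples.

from collections import defaultdict

v = [1.0, 2.0, 3.0]                              # the vector, length n = 3
M = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 4.0),      # sparse matrix as (i, j, mij) triples
     (2, 0, 5.0), (2, 2, 6.0)]

def map_task(triples):
    # one (i, mij * vj) pair per stored matrix element
    for i, j, m_ij in triples:
        yield (i, m_ij * v[j])

def reduce_task(i, partial_terms):
    # xi is the sum of all terms that share the key i
    return (i, sum(partial_terms))

grouped = defaultdict(list)
for i, term in map_task(M):
    grouped[i].append(term)

x = dict(reduce_task(i, terms) for i, terms in grouped.items())
print(x)   # {0: 5.0, 1: 8.0, 2: 23.0}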

If the vector v Cannot Fit in Main Memory

• If n = 100, then there is no need to use a DFS or MapReduce for this calculation. But
this sort of calculation is at the heart of the ranking of Web pages that goes on at
search engines, and there, n is in the tens of billions.

• Assume first that n is large, but not so large that the vector v cannot fit in main
memory and thus be available to every Map task. The matrix M and the vector v will each
be stored in a file of the DFS. We assume that the row-column coordinates of each matrix
element are discoverable, either from its position in the file or because it is stored
with explicit coordinates, as a triple (i, j, mij). We also assume that the position of
element vj in the vector v is discoverable in the analogous way.
1. The Map Function: The Map function is written to apply to one element of M.
However, if v is not already read into main memory at the compute node
executing a Map task, then v is first read, in its entirety, and subsequently will
be available to all applications of the Map function performed at this Map task.
Each Map task will operate on a chunk of the matrix M. From each matrix
element mij it produces the key-value pair (i, mij * vj ). Thus, all terms of
the sum that make up the component xi of the matrix-vector product will get
the same key, i.
2. The Reduce Function: The Reduce function simply sums all the values
associated with a given key i. The result will be a pair (i, xi).
3. Striping: when v cannot fit in main memory, the matrix is divided into vertical
stripes of equal width and the vector is divided into an equal number of horizontal
stripes of the same height. The goal is to use enough stripes so that the portion of
the vector in one stripe can fit conveniently into main memory at a compute node. The
accompanying figure (not reproduced here) suggests what the partition looks like if the
matrix and vector are each divided into five stripes. A sketch of this idea follows below.
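A brief sketch of the striping idea using NumPy, only to make the partition concrete: each map task gets one vertical stripe of M together with the matching horizontal stripe of v, and the per-stripe partial results are summed per row index, exactly as the Reduce function above does; the sizes are arbitrary.

import numpy as np

n, stripes = 10, 5
width = n // stripes
M = np.random.rand(n, n)
v = np.random.rand(n)

# one (stripe of M, stripe of v) pair per map task; each stripe of v fits in memory
partials = []
for s in range(stripes):
    cols = slice(s * width, (s + 1) * width)
    partials.append(M[:, cols] @ v[cols])   # partial contribution of this stripe to every xi

x = sum(partials)                           # reduce step: sum the partial results per row index
assert np.allclose(x, M @ v)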

2.3.2 Relational Algebra Operations


Relational Databases vs. MapReduce

• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in "native format"
– Supports many query languages
– Programmers retain control over performance
– Open source
Relational Algebra Operations
1. Selection.
2. Projection.
3. Union & Intersection.
4. Natural Join.
5. Grouping & Aggregation.

Selection:
Selection: σC(R)
– Apply condition C to each tuple of relation R
– Produce as output a relation containing only the tuples that satisfy C

A MapReduce implementation of σC(R):

Map(key, value):
    for tuple t in value:
        if tuple t satisfies C:
            emit(t, t)

Reduce(key, values):
    emit(key, key)
Projection:
Projection: πS(R)
– Given a subset S of the attributes of relation R
– Produce as output a relation containing only the components for the attributes in S

Map(key, value):
    for tuple t in value:
        ts = tuple with only the components for the attributes in S
        emit(ts, ts)

Reduce(key, values):
    emit(key, key)

The Reduce operation is duplicate elimination.
– This operation is associative and commutative; hence it is possible to optimize
MapReduce by using a Combiner in each mapper.

Union

Suppose relations R and S have the same schema.
– Map tasks will be assigned chunks from either R or S
– Mappers don't do much, just pass tuples through to the reducers
– Reducers do the duplicate elimination

2.3.3 Computing Selections by Map-Reduce


Selections can be done mostly in the map portion alone.

The Map Function:


For each tuple t in R, test whether it satisfies condition C. If so, produce
the key-value pair (t, t). That is, both the key and the value are t.

The Reduce Function:


The reduce function is the identity. It simply passes each key-value
pair to the output.

Pseudo Code:

map(key, value):
    for tuple in value:
        if tuple satisfies C:
            emit(tuple, tuple)

reduce(key, values):
    emit(key, key)
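A small runnable sketch of the selection σC(R) in Python; the relation R and the condition C (age >= 30) are invented for illustration.

def C(t):
    # the selection condition: age >= 30
    return t[1] >= 30

# toy relation R(name, age)
R = [("alice", 34), ("bob", 25), ("carol", 41)]

def map_task(tuples):
    # pass through only the tuples that satisfy C, keyed by the tuple itself
    for t in tuples:
        if C(t):
            yield (t, t)

def reduce_task(key, values):
    # the identity reducer simply forwards the key (the tuple)
    yield key

pairs = list(map_task(R))
selected = [next(reduce_task(k, [v])) for k, v in pairs]
print(selected)   # [('alice', 34), ('carol', 41)]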

2.3.4 Computing Projections by Map-Reduce


The Reduce function is used to eliminate duplicates since projection
may cause the same tuple to appear several times.

The Map Function:


For each tuple t in R, construct a tuple ts by eliminating from t those
components whose attributes are not in S. Output the key-value pair
(ts, ts).

The Reduce Function:


For each key ts produced by any of the Map tasks, there will be one or more key-value
pairs (ts, ts). The Reduce function turns [ts, ts, ..., ts] into (ts, ts), so it produces
exactly one pair (ts, ts) for this key ts.
Pseudo Code:

map(key, value):
    for tuple in value:
        ts = tuple with only the components for the attributes in S
        emit(ts, ts)

reduce(key, values):
    emit(key, key)
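A matching sketch of πS(R), projecting a toy relation R(name, dept, age) onto S = (dept); grouping the (ts, ts) pairs by key is what eliminates the duplicates.

from collections import defaultdict

# toy relation R(name, dept, age); project onto S = (dept,)
R = [("alice", "hr", 34), ("bob", "it", 25), ("carol", "hr", 41)]

def map_task(tuples):
    for t in tuples:
        ts = (t[1],)            # keep only the dept component
        yield (ts, ts)

grouped = defaultdict(list)
for ts, value in map_task(R):
    grouped[ts].append(value)

# reduce: one output tuple per distinct key, which removes duplicates
projected = list(grouped)
print(projected)   # [('hr',), ('it',)]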

2.3.5 Union, Intersection, and Difference by Map-Reduce


2.3.5.1 Union
For the union operation, both relations R and S need to have the same schema.
Map tasks will be assigned chunks from either the R or the S relation.
The Map tasks only pass their input tuples as key-value pairs to the Reduce tasks.

Mappers are fed all the tuples of the two sets to be united.

The Reducer is used to eliminate duplicates.

The Map Function:


Turn each input tuple t into a key-value pair (t, t).

The Reduce Function:


Associated with each key t there will be either one or two
values. Produce output (t, t) in either case.

Pseudo Code:

map(key, value):
    for tuple in value:
        emit(tuple, tuple)

reduce(key, values):
    emit(key, key)

2.3.5.2 Intersection
Mappers are fed all the tuples of both relations R and S to be intersected.

The Reducer emits only the tuples that occurred twice.


The Reduce function must produce a tuple only if both relations
have the tuple. If the key t has a list of two values [t, t]
associated with it, then the Reduce task for t should produce (t,
t).
If the value-list associated with key t is just [t], then one of R
and S is missing t, so we do not want to produce a tuple for the
intersection.

The Map Function:


Turn each tuple t into a key-value pair (t, t).

The Reduce Function:


If key t has value list [t, t], then produce (t, t). Otherwise,
produce nothing.

Pseudo Code:

map(key, value):
    for tuple in value:
        emit(tuple, tuple)

reduce(key, values):
    if values == [key, key]:
        emit(key, key)

2.3.5.3 Difference
For the difference R – S, a tuple t can appear in the output only if it is in
relation R but not in relation S.
The Map function can pass tuples from R and S through, but must inform the
Reduce function whether the tuple came from R or S.

The Map Function:


For a tuple t in R, produce the key-value pair (t, R), and for a
tuple t in S, produce the key-value pair (t, S).
The Reduce Function:
For each key t, if the associated value list is [R], then
produce (t, t). Otherwise, produce nothing.

Pseudo Code:

map(key, value):
    if key == R:
        for tuple in value:
            emit(tuple, R)
    else:
        for tuple in value:
            emit(tuple, S)

reduce(key, values):
    if values == [R]:
        emit(key, key)
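A combined Python sketch of union, intersection, and difference under the same group-by-key framework; R and S are toy single-attribute relations, and tagging each tuple with the relation it came from plays the role of the Map functions above.

from collections import defaultdict

R = [("a",), ("b",), ("c",)]
S = [("b",), ("c",), ("d",)]

grouped = defaultdict(list)
for name, rel in (("R", R), ("S", S)):
    for t in rel:                 # map: tag every tuple with its source relation
        grouped[t].append(name)

union        = list(grouped)                                              # every key once, duplicates collapsed
intersection = [t for t, tags in grouped.items() if set(tags) == {"R", "S"}]
difference   = [t for t, tags in grouped.items() if tags == ["R"]]        # R - S: tuples seen only in R

print(union)          # [('a',), ('b',), ('c',), ('d',)]
print(intersection)   # [('b',), ('c',)]
print(difference)     # [('a',)]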

2.4 Hadoop Limitations

• Issues with Small Files

The main problem with Hadoop is that it is not suitable for small data. HDFS lacks the
ability to support random reading of small files because of its high-capacity design.

Small files are files smaller than the HDFS block size (default 128 MB). HDFS cannot
handle huge numbers of such small files: it was designed to work with a small number of
large files for storing large data sets rather than a large number of small files. If
there are very many small files, the NameNode will be overloaded, since it stores the
namespace of HDFS.

Solution:

Simply merge the small files to create bigger files and then copy the bigger files to HDFS.

Hadoop Archives (HAR files) deal with the problem of lots of small files. Hadoop
Archives work by building a layered filesystem on top of HDFS. HAR files are created
with the Hadoop archive command, which runs a MapReduce job to pack the files being
archived into a small number of HDFS files. Reading files through a HAR is no more
efficient than reading through HDFS; in fact it is slower, because each HAR file access
requires reading two index files as well as the data file.
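For illustration, a HAR file can be created and then read back through the har:// filesystem; the paths below are hypothetical.

hadoop archive -archiveName logs.har -p /user/demo/smallfiles /user/demo/archives   # pack the directory into one HAR (runs a MapReduce job)
hdfs dfs -ls har:///user/demo/archives/logs.har                                     # browse the archived files through the HAR layer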

Sequence files also overcome the small file problem: the filename is used as the key and
the file contents as the value. With a small program, many small files (say, around
100 KB each) can be packed into a single Sequence file, which can then be processed in a
streaming fashion. MapReduce in Hadoop can break a Sequence file into chunks and operate
on each chunk independently, because Sequence files are splittable.

Storing the files in HBase is another way to overcome the small file problem. We are not
actually storing millions of small files in HBase; rather, we add the binary content of
each file to a cell.

• Slow Processing Speed

MapReduce processes a huge amount of data. In Hadoop, MapReduce works by breaking the
processing into phases: Map and Reduce. MapReduce requires a lot of time to perform
these tasks, which increases latency and reduces processing speed.

Solution:

Apache Spark overcomes this issue by processing data in memory. With in-memory
processing, no time is spent moving data and processes in and out of the disk, which
makes it faster. Apache Spark is up to 100 times faster than MapReduce because it
processes everything in memory.

Flink can also overcome this issue. Flink processes faster than Spark because of its
streaming architecture.

• Support for Batch Processing only

Hadoop only supports batch processing; it is not suitable for streaming data, so overall
performance is slower. The MapReduce framework does not leverage the memory of the
Hadoop cluster to the maximum.

Solution

Apache Spark solves this problem as it supports stream processing. However, Spark's
stream processing is not as efficient as Flink's, because it uses micro-batch processing.
Apache Flink improves the overall performance as it provides a single run-time for
streaming as well as batch processing.
• No Real-time Processing

Apache Hadoop is a batch processing framework: it takes a huge amount of data as input,
processes it and produces the result. Batch processing is very efficient for processing
a high volume of data, but depending on the size of the data being processed and the
computational power of the system, the output can be delayed significantly. Apache
Hadoop is not suitable for real-time processing.

Solution:

Spark is suitable for stream processing. Stream processing provides continuous
input/output and processes data within a small amount of time.
Flink provides a single run-time for both streaming and batch processing.

• Iterative Processing

Apache Hadoop is not very efficient for iterative processing, as Hadoop does not support
cyclic data flow (i.e. a chain of stages in which the output of each stage is the input
to the next stage).

Solution:

Spark overcomes this issue, as Apache Spark accesses data from RAM instead of the disk.
This dramatically improves the performance of an iterative algorithm that accesses the
same dataset repeatedly. In Apache Spark, however, each iteration still has to be
scheduled and executed separately.

• Latency

MapReduce in Hadoop is slower because it supports different formats and structured, huge
amounts of data. In MapReduce, Map takes a set of data and converts it into another set
of data, where individual elements are broken down into key-value pairs. Reduce takes
the output from the Map as input and processes it further. MapReduce requires a lot of
time to perform these tasks, thereby increasing latency.

Solution:

Apache Spark can reduce this issue. Although Spark is also a batch system, it is
relatively faster because it caches much of the input data in memory using RDDs. Apache
Flink's data streaming achieves low latency and high throughput.

• No Ease of Use

A MapReduce developer in Hadoop needs to hand-code each and every operation, which makes
it very difficult to work with. In Hadoop, MapReduce has no interactive mode, but adding
Hive and Pig makes working with MapReduce a little easier.

Solution:

Spark has overcome this issue, as Spark has an interactive mode, so developers and users
alike can get intermediate feedback for queries and other activities. Since Spark has
plenty of high-level operators, it is easy to program. One can also use Apache Flink, as
it too has high-level operators.

• Security Issue

Maintaining complex applications in Apache Hadoop is challenging. Hadoop is missing
encryption at the storage and network levels, which is a major point of concern. Apache
Hadoop supports Kerberos authentication, which is hard to manage.

Solution:

Apache Spark provides a security bonus: if you run Apache Spark on HDFS, it can use
HDFS ACLs and file-level permissions.

• Vulnerable by Nature

Apache Hadoop is written in Java. Java is one of the most popular languages, and hence
is heavily exploited by cybercriminals.

• No Caching

Apache Hadoop is not efficient for caching. MapReduce cannot cache intermediate data in
memory for further use, and this diminishes the performance of Hadoop.

Solution:
Spark and Flink overcome this issue: they cache data in memory for further iterations,
which enhances the overall performance.

• Lengthy Code

Apache Hadoop has about 1,20,000 (i.e. 120,000) lines of code. More lines of code mean
more bugs, and it takes more time to execute the programs.

Solution:

Spark and Flink are written in Scala and Java, but the core implementation is in Scala,
so the number of lines of code is smaller than in Hadoop. Thus, it takes less time to
execute the programs.
