Module 2
Hadoop Distributed File System is a block-structured file system where each file is
divided into blocks of a pre-determined size. These blocks are stored across a cluster of
one or several machines.
HDFS Components
• NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. It stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. The namespace it manages consists of files and directories.
• It executes file system namespace operations such as opening, closing, and renaming files and directories.
• DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the requests of the clients. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data itself and the second file records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, during which the namespace ID and the software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
• Namenode:
This is the daemon that runs on the master node(s). The NameNode stores metadata such as the filename, the number of blocks, the number of replicas, the locations of blocks, block IDs, etc. This metadata is kept in memory on the master for faster retrieval of data, and a copy is persisted on the local disk. The NameNode's memory should therefore be sized according to the requirement.
• Datanode:
This is the daemon that runs on the slave. These are actual worker nodes that store
the data.
HDFS Architecture
• This architecture gives you a complete picture of Hadoop Distributed File System.
There is a single namenode which stores metadata and there are multiple datanodes
which do actual storage work.
• Nodes are arranged in racks and Replicas of data blocks are stored on different racks in
the cluster to provide fault tolerance.
• To read or write a file in HDFS, the client needs to interact with Namenode. HDFS
applications need a write-once-read-many access model for files. A file once created
and written cannot be edited.
• The NameNode stores the metadata, while the DataNodes store the actual data. The client interacts with the NameNode for any task to be performed, as the NameNode is the centerpiece of the cluster.
There are several DataNodes in the cluster which store HDFS data on their local disks. Each DataNode sends a heartbeat message to the NameNode periodically to indicate that it is alive, and it replicates data to other DataNodes as per the replication factor.
Data Storage in HDFS
Whenever a file has to be written to HDFS, it is broken into small pieces of data known as blocks. HDFS has a default block size of 128 MB, which can be increased as per requirements. These blocks are stored across the cluster in a distributed manner on different nodes. This provides a mechanism for MapReduce to process the data in parallel in the cluster.
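For example, under the default 128 MB block size, a 500 MB file is stored as four blocks: three full 128 MB blocks plus one 116 MB block. A minimal Python sketch of this calculation (illustrative only, not an HDFS API):

def hdfs_block_sizes(file_size_mb, block_size_mb=128):
    # Split a file of the given size into HDFS-style blocks.
    full_blocks = file_size_mb // block_size_mb
    sizes = [block_size_mb] * full_blocks
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes.append(remainder)   # the last block is smaller than the block size
    return sizes

print(hdfs_block_sizes(500))   # [128, 128, 128, 116]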
• There can be many racks of compute nodes, and racks are connected by
another level of network or a switch.
• The bandwidth of inter-rack communication is somewhat greater than the intra-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth can still become a bottleneck.
Rack Awareness in Hadoop HDFS
• This new file system is often called a Distributed File System or DFS.
• DFS is useful when files can be enormous, possibly a terabyte in size, and are rarely updated.
• Files are divided into chunks, which are typically 64 megabytes in size.
• Chunks are replicated, perhaps three times, at three different compute nodes.
• The nodes holding copies of one chunk should be located on different racks, so as to avoid losing all copies due to a rack failure.
• The user can decide the chunk size and the degree of replication.
• A master node is used to find the chunks of a file. This node is itself replicated, and a directory for the file system as a whole knows where to find its copies. The directory itself can be replicated, and all participants using the DFS know where the directory copies are.
• DFS implementations include:
▪ CloudStore
2.2 Map-Reduce:
• Map Phase:
▪ This phase takes key-value pairs as input and produces key-value pairs as output. Custom business logic can be written in this phase.
▪ The Map function performs actions like filtering, grouping and sorting.
▪ The result generated by the Map function is a key value pair (K, V)
which acts as the input for Reduce function.
• Reduce Phase:
▪ This phase takes as input the key-value pairs produced by the Map phase, grouped by key, and combines the values associated with each key to produce the final output.
2.2.1 The Map Task
• Keys are not Unique. A Map task can produce several key-value
pairs with the same key, even from the same element. Element can
be a tuple or a document.
• The Map function uses keys that are of type String (the words) and values that are integers.
• The Map Task reads a document and breaks it into its sequence of
words W1, W2,..., Wn.
• It then emits a sequence of key-value pairs where the value is
always 1. That is, the output of Map Task for this document is the
sequence of Key-Value pairs: (W1, 1), (W2, 1),...,(Wn, 1)
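As an illustration, the following Python sketch mimics this Map task for the word-count example (the function name and the plain list that stands in for Hadoop's emit mechanism are illustrative):

def map_task(document):
    # Break the document into its sequence of words and emit (word, 1) for each.
    return [(word, 1) for word in document.split()]

print(map_task("to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]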
2.2.2 Grouping by key
• After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs.
• For each key k, the input to the Reduce task that handles key k is a
pair of the form (k, [v1, v2,..., vn]), where (k, v1), (k, v2),..., (k, vn)
are all the key-value pairs with key k coming from all the Map
tasks.
• The input to the Reduce task is thus a set of pairs, each consisting of a key and its list of associated values, and the Reduce task combines those values in some way.
• The outputs from all the Reduce tasks are merged into a single file.
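The grouping step and the word-count Reduce function can be sketched in Python as follows (group_by_key stands in for the master controller's merge; the names are illustrative, not the Hadoop API):

from collections import defaultdict

def group_by_key(pairs):
    # Merge (k, v) pairs into (k, [v1, v2, ..., vn]), as the master controller does.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def reduce_task(key, values):
    # Word count: the values are all 1s, so summing them counts the occurrences of the word.
    return (key, sum(values))

pairs = [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
for key, values in group_by_key(pairs).items():
    print(reduce_task(key, values))
# ('to', 2), ('be', 2), ('or', 1), ('not', 1)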
2.2.4 Combiners
• If a Hadoop combiner is used between the mapper and the reducer, the combiner aggregates the intermediate data locally (in this example, 9 key/value pairs) before it is sent to the reducer, and produces 4 key/value pairs as output (see the sketch after this list).
• The reducer now needs to process only the 4 key/value pairs generated by the 2 combiners. Thus the reducer needs to be executed only 4 times to produce the final output, which increases the overall performance.
• The master keeps track of the status of each Map and Reduce task; it can be idle, executing, or completed.
• The reduce task does reduction based on the code supplied by the
user and gives output accordingly.
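For word count, the combiner can simply be the Reduce function applied to each mapper's local output, which is legal because addition is associative and commutative. A minimal Python sketch (not the Hadoop Combiner API; the names are illustrative):

from collections import defaultdict

def combine(mapper_output):
    # Pre-aggregate one mapper's (word, 1) pairs locally, before the shuffle.
    counts = defaultdict(int)
    for word, n in mapper_output:
        counts[word] += n
    return list(counts.items())

# One mapper emitted 6 intermediate pairs; the combiner sends only 3 to the reducers.
local = [('a', 1), ('b', 1), ('a', 1), ('a', 1), ('c', 1), ('b', 1)]
print(combine(local))   # [('a', 3), ('b', 2), ('c', 1)]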
2.2.6 Coping with Node Failures
▪ If a Worker node fails, all the Map tasks that were assigned to this Worker will have to be redone, even if they had completed. The reason for redoing completed Map tasks is that their output destined for the Reduce tasks resides at that compute node and is now unavailable to the Reduce tasks.
▪ The Master sets the status of each of these Map tasks to idle
and will schedule them on a Worker when one becomes
available.
▪ The Master must also inform each Reduce task that the
location of its input from that Map task has changed.
Pseudo code (word count):
map(key, value):
    for word in value: emit(word, 1)
reduce(key, values):
    emit(key, sum(values))
Matrix-Vector Multiplication by MapReduce
• Suppose we have an n × n matrix M whose element in row i and column j is m_ij, and a vector v of length n whose j-th component is v_j; the matrix-vector product is the vector x of length n whose i-th component is x_i = Σ_j m_ij v_j.
• If n = 100, then there is no need to use a DFS or MapReduce for this calculation. But this sort of calculation is at the heart of the ranking of Web pages that goes on at search engines, and there, n is in the tens of billions.
• We assume that n is large, but not so large that the vector v cannot fit in main memory and thus be available to every Map task. The matrix M and the vector v will each be stored in a file of the DFS. We assume that the row-column coordinates of each matrix element are discoverable, either from its position in the file or because they are stored with explicit coordinates, as a triple (i, j, m_ij). We also assume that the position of element v_j in the vector v is discoverable in an analogous way.
1. The Map Function: The Map function is written to apply to one element of M. However, if v is not already read into main memory at the compute node executing a Map task, then v is first read, in its entirety, and subsequently will be available to all applications of the Map function performed at this Map task. Each Map task will operate on a chunk of the matrix M. From each matrix element m_ij it produces the key-value pair (i, m_ij * v_j). Thus, all terms of the sum that make up the component x_i of the matrix-vector product will get the same key, i.
2. The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, x_i).
3. If v is too large to fit in main memory, the matrix is divided into vertical stripes of equal width and the vector is divided into an equal number of horizontal stripes of the same height. The goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into main memory at a compute node. The figure (not shown) suggests what the partition looks like if the matrix and vector are each divided into five stripes.
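Putting the Map and Reduce functions of steps 1 and 2 together, here is a minimal Python sketch (not the Hadoop API; the names are illustrative) that stores M as (i, j, m_ij) triples, assumes v fits in memory, and simulates the group-by-key step locally:

from collections import defaultdict

def map_element(i, j, m_ij, v):
    # Map: from matrix element m_ij produce the key-value pair (i, m_ij * v_j).
    return (i, m_ij * v[j])

def reduce_row(i, values):
    # Reduce: sum all values with key i, giving the component (i, x_i).
    return (i, sum(values))

def matrix_vector(M_triples, v):
    grouped = defaultdict(list)            # stands in for the MapReduce shuffle
    for (i, j, m_ij) in M_triples:
        key, value = map_element(i, j, m_ij, v)
        grouped[key].append(value)
    return [reduce_row(i, vals) for i, vals in sorted(grouped.items())]

# M = [[1, 2], [3, 4]] as triples, v = [10, 20]; the product x is [50, 110].
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
print(matrix_vector(M, [10, 20]))   # [(0, 50), (1, 110)]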
• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Relational Algebra Operations
1. Selection.
2. Projection.
3. Union & Intersection.
4. Natural Join.
5. Grouping & Aggregation.
Selection:
Selection: σC(R)
– Apply condition C to each tuple of relation R.
– Produce in output a relation containing only the tuples that satisfy C.
A MapReduce implementation of σC(R):
Pseudo Code:
map(key, value):
    for tuple in value:
        if tuple satisfies C:
            emit(tuple, tuple)
reduce(key, values):
    emit(key, key)
2.3.4.1 Union
Mappers are fed by all tuples of both relations R and S to be united; the Reduce function emits each distinct tuple once.
Pseudo Code:
map(key, value):
    for tuple in value:
        emit(tuple, tuple)
reduce(key, values):
    emit(key, key)
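The two implementations above can be mimicked in Python as follows (the relations, the condition C, and the helper group_by_key that stands in for the MapReduce shuffle are all illustrative):

from collections import defaultdict

def group_by_key(pairs):
    grouped = defaultdict(list)    # stands in for the MapReduce shuffle
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def selection(R, C):
    # sigma_C(R): Map emits (t, t) only for tuples satisfying C; Reduce emits each key once.
    pairs = [(t, t) for t in R if C(t)]
    return list(group_by_key(pairs).keys())

def union(R, S):
    # R union S: Map emits (t, t) for every tuple of R and S; Reduce emits each distinct key once.
    pairs = [(t, t) for t in R] + [(t, t) for t in S]
    return list(group_by_key(pairs).keys())

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'b'), (4, 'd')]
print(selection(R, lambda t: t[0] > 1))   # [(2, 'b'), (3, 'c')]
print(union(R, S))                        # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]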
2.3.4.2 Intersection
Mappers are fed by all tuples of both relations R and S to be intersected.
Pseudo Code:
map(key, value):
    for tuple in value:
        emit(tuple, tuple)
reduce(key, values):
    if values == [key, key]:
        emit(key, key)
2.3.4.3 Difference
For the difference R – S, a tuple t can appear in the output only if it is in relation R but not in relation S.
The Map function can pass tuples from R and S through, but must inform the Reduce function whether the tuple came from R or from S.
Pseudo Code:
map(key, value):
    if key == R:
        for tuple in value:
            emit(tuple, R)
    else:
        for tuple in value:
            emit(tuple, S)
reduce(key, values):
    if values == [R]:
        emit(key, key)
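Likewise, the intersection and difference implementations can be sketched in Python (the relations and helper names are illustrative; group_by_key again stands in for the shuffle):

from collections import defaultdict

def group_by_key(pairs):
    grouped = defaultdict(list)    # stands in for the MapReduce shuffle
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def intersection(R, S):
    # Reduce emits t only if it received t from both relations (values == [t, t]).
    pairs = [(t, t) for t in R] + [(t, t) for t in S]
    return [t for t, vals in group_by_key(pairs).items() if len(vals) == 2]

def difference(R, S):
    # Map tags each tuple with its relation name; Reduce keeps t only if values == ['R'].
    pairs = [(t, 'R') for t in R] + [(t, 'S') for t in S]
    return [t for t, vals in group_by_key(pairs).items() if vals == ['R']]

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'b'), (4, 'd')]
print(intersection(R, S))   # [(2, 'b')]
print(difference(R, S))     # [(1, 'a'), (3, 'c')]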
Limitations of Hadoop
• Issue with Small Files
The main problem with Hadoop is that it is not suitable for small data. HDFS lacks the ability to support random reading of small files because of its high-capacity design.
Small files are files smaller than the HDFS block size (default 128 MB). If you store a huge number of such small files, HDFS cannot handle them well, because HDFS was designed to work with a small number of large files for storing large data sets, rather than with a large number of small files. If there are a lot of small files, the NameNode will be overloaded, since it stores the namespace of HDFS.
Solution:
Simply merge the small files to create bigger files and then copy the bigger files to HDFS.
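As a rough sketch of this idea, the following Python code concatenates many small local files into larger bundles of roughly one block size before they are copied to HDFS; the directory layout and bundle naming are hypothetical:

import os

def write_bundle(paths, output_dir, bundle_id):
    # Concatenate the given small files into one bundle file.
    out_path = os.path.join(output_dir, "bundle_%04d.dat" % bundle_id)
    with open(out_path, "wb") as dst:
        for p in paths:
            with open(p, "rb") as src:
                dst.write(src.read())

def merge_small_files(input_dir, output_dir, target_mb=128):
    # Group small files into bundles of roughly target_mb megabytes each.
    os.makedirs(output_dir, exist_ok=True)
    bundle, size, bundle_id = [], 0, 0
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        bundle.append(path)
        size += os.path.getsize(path)
        if size >= target_mb * 1024 * 1024:
            write_bundle(bundle, output_dir, bundle_id)
            bundle, size, bundle_id = [], 0, bundle_id + 1
    if bundle:
        write_bundle(bundle, output_dir, bundle_id)

# The bundles can then be uploaded with, e.g., "hdfs dfs -put bundles/ /data/"
# instead of uploading millions of tiny files individually.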
Hadoop Archives (HAR files) deal with the problem of lots of small files. Hadoop Archives work by building a layered filesystem on top of HDFS. HAR files are created with the Hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. Reading files through a HAR is no more efficient than reading through HDFS; in fact it is slower, since each HAR file access requires reading two index files as well as the data file.
Sequence files also overcome the small-file problem. Here we use the filename as the key and the file contents as the value. By writing a program, we can pack a large number of small files (say, 100 KB each) into a single SequenceFile and then process them in a streaming fashion, operating on the SequenceFile. MapReduce in Hadoop can break a SequenceFile into chunks and operate on each chunk independently, because a SequenceFile is splittable.
By storing files in HBase we can also overcome the small-file problem. We are not actually storing millions of small files in HBase; rather, we add the binary content of each file to a cell.
Solution:
Flink can also overcome this issue. Flink processes faster than Spark because of its
streaming architecture.
• Support for Batch Processing Only
Hadoop only supports batch processing; it is not suitable for streaming data, so overall performance is slower. The MapReduce framework does not leverage the memory of the Hadoop cluster to the maximum.
Solution:
Apache Spark solves this problem, as it supports stream processing. However, Spark's stream processing is not as efficient as Flink's, since it uses micro-batch processing. Apache Flink improves the overall performance, as it provides a single run-time for streaming as well as batch processing.
• No Real-time Processing
Apache Hadoop is a batch processing framework. It takes a huge amount of data as input, processes it, and produces the result. Batch processing is very efficient for processing a high volume of data but, depending on the size of the data being processed and the computational power of the system, the output can be delayed significantly. Apache Hadoop is not suitable for real-time processing.
Solution:
Spark and Flink address this through stream processing: Spark Streaming processes data in micro-batches, while Flink offers true low-latency stream processing.
• Iterative Processing
Apache Hadoop is not very efficient for iterative processing, as Hadoop does not support cyclic data flow (i.e. a chain of stages in which each output of the previous stage is the input to the next stage).
Solution:
Spark overcomes this issue, as Apache Spark accesses data from RAM instead of the disk, which dramatically improves the performance of an iterative algorithm that accesses the same dataset repeatedly. Note, however, that in Apache Spark each iteration still has to be scheduled and executed separately.
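For example, with PySpark an iterative job can cache its input RDD so that every iteration after the first reads it from memory instead of HDFS (a minimal sketch; the file path and the per-iteration computation are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="IterativeSketch")

# Read the dataset once and keep it in memory across iterations.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: [float(x) for x in line.split(",")]) \
           .cache()

for i in range(10):
    # Each pass re-uses the cached RDD instead of re-reading it from disk.
    total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
    print(i, total)

sc.stop()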
• Latency
MapReduce in Hadoop is slower because it supports different formats of structured and huge amounts of data. In MapReduce, Map takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs. Reduce takes the output from the Map as input and processes it further. MapReduce requires a lot of time to perform these tasks, thereby increasing latency.
Solution:
Apache Spark can reduce this issue. Although Spark is also a batch system, it is relatively faster because it caches much of the input data in memory via RDDs. Apache Flink's data streaming achieves low latency and high throughput.
• No Ease of Use
A MapReduce developer in Hadoop needs to hand-code each and every operation, which makes it very difficult to work with. In Hadoop, MapReduce has no interactive mode, but adding Hive and Pig makes working with MapReduce a little easier.
Solution:
Spark has overcome this issue, as Spark has an interactive mode, so that developers and users alike can have intermediate feedback for queries and other activities. Since Spark has many high-level operators, it is easy to program. One can also use Apache Flink, as it too has high-level operators.
• Security Issue
Hadoop can be challenging to secure: by default it lacks encryption at the storage and network levels, which is a major concern when handling sensitive data.
Solution:
Apache Spark provides a security bonus. If you run Apache Spark on HDFS, it can use HDFS ACLs and file-level permissions.
• Vulnerable by Nature
Apache Hadoop is written in Java. Java is one of the most popular languages and hence is heavily exploited by cybercriminals, which makes Hadoop vulnerable to security breaches.
• No Caching
Apache Hadoop is not efficient at caching. MapReduce cannot cache the intermediate data in memory for further requirements, and this diminishes the performance of Hadoop.
Solution:
Spark and Flink overcome this issue: they cache data in memory for further iterations, which enhances the overall performance.
• Lengthy Code
Apache Hadoop has about 1,20,000 lines of code. A greater number of lines leads to a greater number of bugs, and it takes more time to execute the programs.
Solution:
Spark and Flink are written in Scala and Java, but the core implementation is in Scala, so the number of lines of code is smaller than in Hadoop. Thus, it takes less time to execute the programs.