
CSE6001 - BIG DATA FRAMEWORKS

Module 2: Hadoop Components and Design Principles of Hadoop

Prof. Lokeshkumar R
Hadoop Core Components

• HDFS
• MapReduce (distributed processing)
• YARN
• Common libraries
Hadoop Common
• Contains the libraries and utilities shared by the other Hadoop modules
• Interfaces for DFS and I/O
• Serialization and Java RPC
• File-based data structures
HDFS
• Hadoop Distributed File System
• Java-based
• Stores all kinds of data on disks across the cluster
MapReduce
MapReduce v1
• Software programming model used in Hadoop 1 versions
• Mapper and Reducer functions
• Processes large sets of data
• In parallel
• In batch mode

MapReduce v2
• Software programming model used in Hadoop 2 versions
• YARN-based system for parallel processing
YARN
• Software layer for managing compute resources
• Tasks run in parallel across the Hadoop cluster
• Distributed scheduling
• Supports interactive queries, text analytics, and streaming analysis
Hadoop Ecosystem - 1
[Layer diagram of the Hadoop ecosystem]
Hadoop Ecosystem - 2
There are so many different ways in which you can organize these
systems, and that is why you’ll see multiple images of the ecosystem all
over the Internet.

HDFS Apache Spark


YARN Tez
MapReduce Apache HBase
Apache Pig Apache Storm
Apache Hive Oozie
Apache Ambari ZooKeeper
Mesos Data Ingestion
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Other modules: ZooKeeper, Impala, Oozie, etc.

Layer diagram (top to bottom):
• Pig (scripting), Hive (SQL-like query), HBase (non-relational database)
• MapReduce and others (Spark, Storm, Tez, etc.): distributed processing
• YARN: resource manager
• HDFS: distributed file system (storage)
Hadoop HDFS

• Hadoop Distributed File System
• Serves as the distributed file system for most tools in the Hadoop ecosystem
• Scalability for large data sets
• Reliability to cope with hardware failures
• HDFS good for:
• Large files
• Streaming data
• Not good for:
• Lots of small files
• Random access to files
• Low latency access
Design of Hadoop Distributed File System (HDFS)
• Master-Slave design
• Master Node
• Single NameNode for managing metadata
• Slave Nodes
• Multiple DataNodes for storing data
• Other
• Secondary NameNode, which periodically checkpoints the NameNode's metadata (often loosely described as a backup)
HDFS Architecture
NameNode keeps the metadata: file names, block locations, and the directory tree
DataNodes provide storage for blocks of data

[Diagram: a Client exchanges commands and data with the NameNode, which is assisted by a Secondary NameNode; file blocks are read from and written to multiple DataNodes.]
HDFS
What happens if a node fails?
Replication of Blocks for fault tolerance

[Diagram: a file is split into blocks B1-B4; each block is replicated on three different nodes across the cluster, so the file remains available if nodes fail.]
HDFS

• HDFS files are divided into blocks
• The block is the basic unit of read/write
• Default block size is 64 MB in Hadoop 1, and it can be configured larger (e.g. 128 MB)
• This makes HDFS well suited to storing large files
• HDFS blocks are replicated multiple times
• Each block is stored at multiple locations, including on different racks (usually 3 replicas)
• This makes HDFS storage fault tolerant and faster to read (a small worked example follows this list)
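As a quick illustration of how block size and replication interact, here is a minimal Python sketch. The 1 GB file size, 128 MB block size, and replication factor of 3 are assumed example values, not figures taken from this slide.

    import math

    # Assumed example values (not from the slides)
    file_size_mb = 1024        # a 1 GB file
    block_size_mb = 128        # HDFS block size
    replication_factor = 3     # HDFS replication factor

    # Number of HDFS blocks the file is split into (the last block may be partial)
    num_blocks = math.ceil(file_size_mb / block_size_mb)

    # Total block copies stored across the cluster
    total_copies = num_blocks * replication_factor

    print(num_blocks, "blocks,", total_copies, "block copies stored in total")
    # -> 8 blocks, 24 block copies stored in total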
HBase

• NoSQL data store built on top of HDFS
• Based on the Google BigTable paper (2006)
• Can handle various types of data
• Stores large amounts of data (TB, PB)
• Column-oriented data store
• Handles big data with random reads and writes (a minimal client sketch follows this list)
• Horizontally scalable
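To make the "random reads and writes" point concrete, here is a minimal sketch using the third-party happybase Python client. The host, port, table name 'users', and column family 'info' are all assumptions for illustration, and an HBase Thrift server must be running.

    import happybase

    # Connect to an (assumed) HBase Thrift server
    connection = happybase.Connection('localhost', port=9090)
    table = connection.table('users')   # assumed existing table with column family 'info'

    # Random write: put a single cell, addressed by row key
    table.put(b'user-001', {b'info:name': b'Sam'})

    # Random read: fetch that row back directly by its key
    row = table.row(b'user-001')
    print(row)   # {b'info:name': b'Sam'}

    connection.close()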
HBase, when not to use it
• Not a replacement for a traditional RDBMS (relational database management system)
• Transactional applications
• Data analytics
• Not efficient for text searching and processing


MapReduce: Simple Programming for Big Data

• MapReduce is a simple programming paradigm for the Hadoop ecosystem
• Traditional parallel programming requires expertise in various computing/systems concepts
• Examples: multithreading and synchronization mechanisms (locks, semaphores, and monitors)
• Incorrect use can crash your program, produce incorrect results, or severely impact performance
• It is usually not fault tolerant to hardware failure
• The MapReduce programming model greatly simplifies running code in parallel
• You do not have to deal with any of the above issues
• You only need to write map and reduce functions
Map Reduce Paradigm
• Map and Reduce are based on functional programming

Map: apply a function to all the elements of a list
  list1 = [1,2,3,4,5]
  square x = x * x
  list2 = map square list1
  print list2
  -> [1,4,9,16,25]

Reduce: combine all the elements of a list into a summary value
  list1 = [1,2,3,4,5]
  A = reduce (+) list1
  print A
  -> 15

Input -> Map -> Reduce -> Output
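The same idea in runnable Python, using the built-in map and functools.reduce; this is a minimal sketch, and the squaring and addition functions are only illustrative.

    from functools import reduce
    import operator

    list1 = [1, 2, 3, 4, 5]

    # Map: apply a function to every element of the list
    list2 = list(map(lambda x: x * x, list1))
    print(list2)             # [1, 4, 9, 16, 25]

    # Reduce: combine all elements into a single summary value
    total = reduce(operator.add, list1)
    print(total)             # 15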


MapReduce Word Count Example
[Diagram: an input file is split across nodes; each Map task emits (word, 1) pairs such as (I,1), (am,1), (Sam,1); after the shuffle and sort phase, Reduce tasks sum the counts per word, e.g. (I,2), (am,2), (Sam,2).]
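This word count can be written as two small Python scripts and run through Hadoop Streaming. The sketch below is only an illustration: the file names mapper.py and reducer.py are assumptions, and the exact path of the streaming jar varies by installation.

    # mapper.py -- reads input lines from stdin and emits one "word<TAB>1" pair per word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # reducer.py -- input arrives sorted by word (the shuffle/sort phase),
    # so counts for the same word are adjacent and can be summed in one pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

The two scripts are then passed to the Hadoop Streaming jar as the -mapper and -reducer options, with -input and -output pointing at HDFS paths.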
Shortcomings of MapReduce
• Forces your data processing into Map and Reduce
• Other workflows are missing: join, filter, flatMap, groupByKey, union, intersection, … (see the sketch after this list)
• Based on "acyclic data flow" from disk to disk (HDFS)
• Reads from and writes to disk before and after Map and Reduce (stateless)
• Not efficient for iterative tasks, e.g. machine learning
• Only Java is natively supported
• Support for other languages is needed
• Only batch processing
• No interactivity or streaming data
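For contrast, engines that run alongside or on top of Hadoop, such as Apache Spark, expose the richer operators listed above and keep intermediate data in memory. Below is a minimal PySpark sketch; the pyspark package must be installed, and the input path hdfs:///data/input.txt is an assumed placeholder.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")

    counts = (sc.textFile("hdfs:///data/input.txt")
                .flatMap(lambda line: line.split())     # one record per word
                .map(lambda word: (word, 1))            # (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))       # sum counts per word

    for word, count in counts.take(10):
        print(word, count)

    sc.stop()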
Features of Hadoop - Compared with Other Systems
Features - Hadoop
Open Source
• Apache Hadoop is an open source project.
• Its code can be modified according to business requirements

Distributed Processing
• Data is stored in a distributed manner in HDFS across the cluster, and it is processed in parallel on a cluster of nodes.
Features – Hadoop (1)
Fault Tolerance
• One of the most important features of Hadoop
• By default, 3 replicas of each block are stored across the cluster, and this number can be changed as per requirements (see the sketch below).
• Failures of nodes or tasks are recovered automatically by the framework.
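As an illustration of changing the replication factor, here is a minimal Python sketch that invokes the standard HDFS shell command hdfs dfs -setrep; the path /data/input.txt is an assumed placeholder and the hdfs binary must be on the PATH.

    import subprocess

    # Set the replication factor of one (assumed) file to 2 and wait
    # with -w until the new replication level is reached.
    subprocess.run(
        ["hdfs", "dfs", "-setrep", "-w", "2", "/data/input.txt"],
        check=True,
    )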

Reliability
• Due to replication of data in the cluster, data is reliably stored on the cluster of machines despite machine failures
Features - Hadoop (2)
High Availability
• Data is highly available and accessible despite hardware failures because multiple copies of the data exist. If a machine or a piece of hardware crashes, the data can be accessed from another path.

Scalability
• Hadoop is highly scalable in that new hardware can easily be added to the nodes.
• Hadoop also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
Features - Hadoop (3)
Easy to use
• The client does not need to deal with distributed computing; the framework takes care of everything, which makes Hadoop easy to use.

Data Locality
• This is a unique feature of Hadoop that allows it to handle Big Data easily.
• Hadoop works on the data locality principle, which states: move the computation to the data instead of moving the data to the computation.
Design Principles of Hadoop
System shall manage and heal itself
• Automatically and transparently route around failure (Fault Tolerance)
• Speculatively execute redundant tasks if certain nodes are detected to be slow

Performance shall scale linearly
• Proportional change in capacity with resource change (Scalability)

Computation should move to data
• Lower latency, lower bandwidth (Data Locality)

Simple core, modular and extensible (Economical)
