CSE6001
BIG DATA FRAMEWORKS
Module - 2
Hadoop Component - Design Principles of Hadoop
Prof. Lokeshkumar R
Hadoop Core Components
• HDFS
• Common Libraries
• MapReduce (Distributed Processing)
• YARN
Hadoop Common
• Contains:
• Libraries and utilities used by the other Hadoop modules
• Interfaces for DFS and I/O
• Serialization and Java RPC
• File-based data structures
HDFS
• Hadoop Distributed File System
• Java-based
• Stores all kinds of data on disks across the cluster
MapReduce
MapReduce v1
• Software programming model for Hadoop 1.x versions
• Mapper and Reducer
• Processes Large sets of Data
• Parallel
• Batch
MapReduce v2
• Software programming model for Hadoop 2.x versions
• YARN-based system for parallel processing
YARN
• Software for managing the resources used for computing
• Tasks run in parallel on the Hadoop cluster
• Distributed scheduling
• Supports interactive queries, text analytics, and streaming analysis
Hadoop Ecosystem - 1
A layer diagram of the Hadoop ecosystem (figure).
Hadoop Ecosystem - 2
There are so many different ways in which you can organize these
systems, and that is why you’ll see multiple images of the ecosystem all
over the Internet.
• HDFS
• YARN
• MapReduce
• Apache Pig
• Apache Hive
• Apache Ambari
• Mesos
• Apache Spark
• Tez
• Apache HBase
• Apache Storm
• Oozie
• ZooKeeper
• Data ingestion tools
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
• Other modules: ZooKeeper, Impala, Oozie, etc.
Layer diagram (top to bottom):
• Spark, Storm, Tez, etc.
• Pig (scripting), Hive (SQL-like query), HBase (non-relational database)
• MapReduce and others (distributed processing)
• YARN (resource manager)
• HDFS: distributed file system (storage)
Hadoop HDFS
• Hadoop Distributed File System
• Serves as the distributed file system for most tools in the Hadoop ecosystem
• Scalability for large data sets
• Reliability to cope with hardware failures
• HDFS good for:
• Large files
• Streaming data
• Not good for:
• Lots of small files
• Random access to files
• Low latency access
Design of Hadoop Distributed File System
(HDFS)
• Master-Slave design
• Master Node
• Single NameNode for managing metadata
• Slave Nodes
• Multiple DataNodes for storing data
• Other
• Secondary NameNode for periodic checkpointing of the NameNode's metadata (not a hot backup)
HDFS Architecture
The NameNode keeps the metadata: file names, block locations, and the directory tree
DataNodes provide storage for blocks of data
Architecture diagram: a client exchanges commands and data with the NameNode (with a Secondary NameNode alongside) and with a cluster of DataNodes that hold the blocks.
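To make the client/NameNode split concrete, the following is a minimal sketch of a client asking the NameNode for file metadata over the WebHDFS REST API (the host name, the port, and the path are assumptions; 9870 is the Hadoop 3 default, older versions used 50070):

# Minimal sketch: a client asks the NameNode for file metadata via WebHDFS.
# "namenode-host" and port 9870 are placeholders for a real cluster.
import requests

NAMENODE = "http://namenode-host:9870"

def list_directory(path):
    # LISTSTATUS returns the metadata the NameNode keeps (names, sizes, replication);
    # the block data itself is read from DataNodes, not from the NameNode.
    resp = requests.get(NAMENODE + "/webhdfs/v1" + path, params={"op": "LISTSTATUS"})
    resp.raise_for_status()
    for item in resp.json()["FileStatuses"]["FileStatus"]:
        print(item["pathSuffix"], item["type"], item["length"], item["replication"])

list_directory("/user")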
HDFS
What happens if node(s) fail?
Replication of Blocks for fault tolerance
Diagram: a file is split into blocks B1–B4, and each block is replicated on several different nodes.
HDFS
• HDFS files are divided into blocks
• A block is the basic unit of read/write
• Default size is 64 MB (Hadoop 1); it can be larger, e.g. 128 MB (the Hadoop 2 default)
• This makes HDFS well suited to storing large files
• HDFS blocks are replicated multiple times
• Each block is stored at multiple locations, including on different racks (usually 3 replicas)
• This makes HDFS storage fault tolerant and faster to read
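A quick back-of-the-envelope calculation in Python of what block size and replication mean in practice; the 1 GB file size, 128 MB block size, and 3 replicas are illustrative assumptions:

# Back-of-the-envelope: how a file maps onto HDFS blocks and replicas.
# The numbers (1 GB file, 128 MB blocks, 3 replicas) are illustrative assumptions.
import math

file_size_mb  = 1024   # a 1 GB file
block_size_mb = 128    # HDFS block size
replication   = 3      # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)   # 8 blocks, readable/writable in parallel
raw_storage_mb = file_size_mb * replication        # ~3 GB of raw disk across the cluster
print(blocks, "blocks,", raw_storage_mb, "MB of raw storage with", replication, "replicas")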
HBase
• NoSQL data store built on top of HDFS
• Based on the Google BigTable paper (2006)
• Can handle various types of data
• Stores large amounts of data (TB, PB)
• Column-oriented data store
• Big Data with random reads and writes
• Horizontally scalable
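As a sketch of what random reads and writes look like, here is a minimal example using the third-party happybase Python client; the Thrift server host, the table name, and the column family are assumptions, and the table is presumed to already exist:

# Minimal sketch of HBase random read/write via the happybase client.
# Assumptions: an HBase Thrift server at "hbase-host", an existing table 'users'
# with a column family 'cf'.
import happybase

connection = happybase.Connection("hbase-host")
table = connection.table("users")

# Random write: put one row, columns addressed as family:qualifier.
table.put(b"user42", {b"cf:name": b"Sam", b"cf:city": b"Chennai"})

# Random read: fetch a single row by its key.
row = table.row(b"user42")
print(row[b"cf:name"])
connection.close()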
HBase, not to use for
• Not a replacement for a traditional RDBMS (Relational Database Management System)
• Transactional applications
• Data Analytics
• Not efficient for text searching and processing
MapReduce: Simple Programming for Big Data
• MapReduce is a simple programming paradigm for the Hadoop ecosystem
• Traditional parallel programming requires expertise in various computing/systems concepts
• examples: multithreading, synchronization mechanisms (locks, semaphores, and monitors)
• incorrect use can crash your program, produce incorrect results, or severely impact performance
• usually not fault tolerant to hardware failure
• The MapReduce programming model greatly simplifies running code in parallel
• you don't have to deal with any of the above issues
• you only need to write map and reduce functions, as sketched below
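In generic terms, the two user-supplied functions have the following shape (a sketch in Python, not any specific framework API):

# Generic shape of the two user-supplied functions (a sketch, not a specific API).
def map_fn(key, value):
    # Emit zero or more intermediate (key, value) pairs for each input record.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Combine all values that share the same key into a summary value.
    yield (key, sum(values))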
Map Reduce Paradigm
• Map and Reduce are based on functional programming
Map: apply a function to all the elements of a list
  list1 = [1,2,3,4,5]
  square x = x * x
  list2 = map square list1
  print list2 -> [1,4,9,16,25]
Reduce: combine all the elements of a list into a summary
  list1 = [1,2,3,4,5]
  A = reduce (+) list1
  print A -> 15
Input -> Map -> Reduce -> Output
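The same two operations in runnable Python, where functools.reduce plays the role of reduce:

# The map/reduce example above written in plain Python.
from functools import reduce

list1 = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list.
list2 = list(map(lambda x: x * x, list1))
print(list2)    # [1, 4, 9, 16, 25]

# Reduce: combine all elements of the list into one summary value.
total = reduce(lambda a, b: a + b, list1)
print(total)    # 15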
MapReduce Word Count Example
Diagram: an input file with lines such as "I am Sam" and "Sam I am" is split across nodes (A–D); each Map task emits pairs such as (I,1), (am,1), (Sam,1); the pairs are shuffled and sorted by key; Reduce tasks on other nodes (E–H) sum the counts to produce (I,2), (am,2), (Sam,2), ...
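A sketch of how the word-count Map and Reduce steps might be written as Python scripts for Hadoop Streaming (MapReduce itself is natively Java; the file names are illustrative):

# mapper.py -- emits a (word, 1) pair for every word read from standard input,
# one tab-separated key/value pair per line (the Hadoop Streaming convention).
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- after the shuffle, input arrives sorted by key; sum counts per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))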
Shortcoming of MapReduce
• Forces your data processing into Map and Reduce
• Other workflows missing include join, filter, flatMap,
groupByKey, union, intersection, …
• Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
• Read and write to Disk before and after Map and Reduce
(stateless machine)
• Not efficient for iterative tasks, e.g. machine learning
• Only Java natively supported
• Support for other languages is needed
• Only for batch processing
• No support for interactivity or streaming data
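For contrast, a sketch of how the richer operators listed above (flatMap, reduceByKey, filter) look in PySpark, which keeps intermediate results in memory instead of writing to HDFS between every step; the input path and application name are placeholders:

# Sketch: word count using the richer operators that plain MapReduce lacks (PySpark).
# "hdfs:///input.txt" and the application name are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")
counts = (sc.textFile("hdfs:///input.txt")
            .flatMap(lambda line: line.split())     # flatMap: one line -> many words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)        # shuffle and sum, kept in memory
            .filter(lambda kv: kv[1] > 1))          # filter: keep words seen more than once
print(counts.take(10))
sc.stop()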
Features of Hadoop
- Compared with other Systems
Features - Hadoop
Open Source
• Apache Hadoop is an open source project.
• Code can be modified according to business requirements
Distributed Processing
• Data is stored in a distributed manner in HDFS across the cluster, and it is
processed in parallel on a cluster of nodes.
Features – Hadoop (1)
Fault Tolerance
• A very important feature of Hadoop
• By default, 3 replicas of each block are stored across the cluster, and this
can be changed as per requirements.
• Failures of nodes or tasks are recovered automatically by the
framework.
Reliability
• Due to replication of data in the cluster, data is reliably stored on the
cluster of machines despite machine failures
Features - Hadoop (2)
High Availability
• Data is highly available and accessible despite hardware failures due to
multiple copies of the data. If a machine or some hardware crashes, the
data can be accessed from another path (replica).
Scalability
• Hadoop is highly scalable: new hardware can easily be added to the nodes.
• Hadoop also provides horizontal scalability, which means new nodes can be
added on the fly without any downtime.
Features - Hadoop (3)
Easy to use
• The client does not need to deal with distributed computing; the framework
takes care of it all, which makes Hadoop easy to use.
Data Locality
• This is a unique feature of Hadoop that allows it to handle Big Data easily.
• Hadoop works on the data locality principle, which states: move the
computation to the data instead of moving the data to the computation.
Design Principles of
Hadoop
System shall manage and heal itself
• Automatically and transparently route around failure (Fault Tolerant)
• Speculatively execute redundant tasks if certain nodes are detected to
be slow
Performance shall scale linearly
• Proportional change in capacity with resource change (Scalability)
Computation should move to data
• Lower latency, lower bandwidth (Data Locality)
Simple core, modular and extensible
(Economical)