CSE6001
BIG DATA FRAMEWORKS
Module - 2
Hadoop Component - Design Principles of Hadoop
Prof. Lokeshkumar R
Hadoop Core Components
• HDFS
• Common Libraries
• MapReduce (Distributed Processing)
• YARN
Hadoop Common
• Contains:
• Libraries and utilities used by the other Hadoop modules
• Interfaces for DFS and I/O
• Serialization and Java RPC
• File-based data structures
HDFS
• Hadoop Distributed File System
• Java-based
• Stores all kinds of data on disks across the cluster
MapReduce
MapReduce v1
• Software programming model for Hadoop 1.x versions
• Mapper and Reducer
• Processes Large sets of Data
• Parallel
• Batch
MapReduce v2
• Software programming model for Hadoop 2.x versions
• YARN-based system for parallel processing
YARN
• Software for managing the resources used for computing
• Tasks run in parallel on the Hadoop cluster
• Distributed scheduling
• Supports interactive queries, text analytics, and streaming analysis
Hadoop Ecosystem - 1
A layer diagram of the Hadoop ecosystem (figure).
Hadoop Ecosystem - 2
There are so many different ways in which you can organize these
systems, and that is why you’ll see multiple images of the ecosystem all
over the Internet.
• HDFS
• YARN
• MapReduce
• Apache Pig
• Apache Hive
• Apache Ambari
• Mesos
• Apache Spark
• Tez
• Apache HBase
• Apache Storm
• Oozie
• ZooKeeper
• Data ingestion tools
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
• Other modules: ZooKeeper, Impala, Oozie, etc.
Layer diagram (top to bottom):
• Spark, Storm, Tez, etc.
• Pig (scripting), Hive (SQL-like query), HBase (non-relational database)
• MapReduce and others (distributed processing)
• YARN (resource manager)
• HDFS: distributed file system (storage)
Hadoop HDFS
• Hadoop Distributed File System
• Serves as the distributed file system for most tools in the Hadoop ecosystem
• Scalability for large data sets
• Reliability to cope with hardware failures
• HDFS good for:
• Large files
• Streaming data
• Not good for:
• Lots of small files
• Random access to files
• Low latency access
Design of Hadoop Distributed File System
(HDFS)
• Master-Slave design
• Master Node
• Single NameNode for managing metadata
• Slave Nodes
• Multiple DataNodes for storing data
• Other
• Secondary NameNode for periodic checkpointing of the NameNode's metadata (not a hot backup)
HDFS Architecture
The NameNode keeps the metadata: file names, block locations, and the directory tree
DataNodes provide storage for blocks of data
Architecture diagram: a client exchanges commands and data with the NameNode (with a Secondary NameNode alongside) and with a cluster of DataNodes that hold the blocks.
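To make the client/NameNode split concrete, the following is a minimal sketch of a client asking the NameNode for file metadata over the WebHDFS REST API (the host name, the port, and the path are assumptions; 9870 is the Hadoop 3 default, older versions used 50070):

# Minimal sketch: a client asks the NameNode for file metadata via WebHDFS.
# "namenode-host" and port 9870 are placeholders for a real cluster.
import requests

NAMENODE = "http://namenode-host:9870"

def list_directory(path):
    # LISTSTATUS returns the metadata the NameNode keeps (names, sizes, replication);
    # the block data itself is read from DataNodes, not from the NameNode.
    resp = requests.get(NAMENODE + "/webhdfs/v1" + path, params={"op": "LISTSTATUS"})
    resp.raise_for_status()
    for item in resp.json()["FileStatuses"]["FileStatus"]:
        print(item["pathSuffix"], item["type"], item["length"], item["replication"])

list_directory("/user")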
HDFS
What happens if node(s) fail?
Replication of Blocks for fault tolerance
Diagram: a file is split into blocks B1–B4, and each block is replicated on several different nodes.
HDFS
• HDFS files are divided into blocks
• A block is the basic unit of read/write
• Default size is 64 MB (Hadoop 1); it can be larger, e.g. 128 MB (the Hadoop 2 default)
• This makes HDFS well suited to storing large files
• HDFS blocks are replicated multiple times
• Each block is stored at multiple locations, including on different racks (usually 3 replicas)
• This makes HDFS storage fault tolerant and faster to read
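A quick back-of-the-envelope calculation in Python of what block size and replication mean in practice; the 1 GB file size, 128 MB block size, and 3 replicas are illustrative assumptions:

# Back-of-the-envelope: how a file maps onto HDFS blocks and replicas.
# The numbers (1 GB file, 128 MB blocks, 3 replicas) are illustrative assumptions.
import math

file_size_mb  = 1024   # a 1 GB file
block_size_mb = 128    # HDFS block size
replication   = 3      # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)   # 8 blocks, readable/writable in parallel
raw_storage_mb = file_size_mb * replication        # ~3 GB of raw disk across the cluster
print(blocks, "blocks,", raw_storage_mb, "MB of raw storage with", replication, "replicas")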
HBase
• NoSQL data store built on top of HDFS
• Based on the Google BigTable paper (2006)
• Can handle various types of data
• Stores large amounts of data (TB, PB)
• Column-oriented data store
• Big Data with random reads and writes
• Horizontally scalable
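As a sketch of what random reads and writes look like, here is a minimal example using the third-party happybase Python client; the Thrift server host, the table name, and the column family are assumptions, and the table is presumed to already exist:

# Minimal sketch of HBase random read/write via the happybase client.
# Assumptions: an HBase Thrift server at "hbase-host", an existing table 'users'
# with a column family 'cf'.
import happybase

connection = happybase.Connection("hbase-host")
table = connection.table("users")

# Random write: put one row, columns addressed as family:qualifier.
table.put(b"user42", {b"cf:name": b"Sam", b"cf:city": b"Chennai"})

# Random read: fetch a single row by its key.
row = table.row(b"user42")
print(row[b"cf:name"])
connection.close()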
HBase, not to use for
• Not a replacement for a traditional RDBMS (Relational Database Management System)
• Transactional applications
• Data Analytics
• Not efficient for text searching and processing
MapReduce: Simple Programming for Big Data
• MapReduce is a simple programming paradigm for the Hadoop ecosystem
• Traditional parallel programming requires expertise in various computing/systems concepts
• examples: multithreading, synchronization mechanisms (locks, semaphores, and monitors)
• incorrect use can crash your program, produce incorrect results, or severely impact performance
• usually not fault tolerant to hardware failure
• The MapReduce programming model greatly simplifies running code in parallel
• you don't have to deal with any of the above issues
• you only need to write map and reduce functions, as sketched below
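In generic terms, the two user-supplied functions have the following shape (a sketch in Python, not any specific framework API):

# Generic shape of the two user-supplied functions (a sketch, not a specific API).
def map_fn(key, value):
    # Emit zero or more intermediate (key, value) pairs for each input record.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Combine all values that share the same key into a summary value.
    yield (key, sum(values))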
Map Reduce Paradigm
• Map and Reduce are based on functional programming
Map: apply a function to all the elements of a list
  list1 = [1,2,3,4,5]
  square x = x * x
  list2 = map square list1
  print list2 -> [1,4,9,16,25]
Reduce: combine all the elements of a list into a summary
  list1 = [1,2,3,4,5]
  A = reduce (+) list1
  print A -> 15
Input -> Map -> Reduce -> Output
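The same two operations in runnable Python, where functools.reduce plays the role of reduce:

# The map/reduce example above written in plain Python.
from functools import reduce

list1 = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list.
list2 = list(map(lambda x: x * x, list1))
print(list2)    # [1, 4, 9, 16, 25]

# Reduce: combine all elements of the list into one summary value.
total = reduce(lambda a, b: a + b, list1)
print(total)    # 15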
MapReduce Word Count Example
Diagram: an input file with lines such as "I am Sam" and "Sam I am" is split across nodes (A–D); each Map task emits pairs such as (I,1), (am,1), (Sam,1); the pairs are shuffled and sorted by key; Reduce tasks on other nodes (E–H) sum the counts to produce (I,2), (am,2), (Sam,2), ...
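A sketch of how the word-count Map and Reduce steps might be written as Python scripts for Hadoop Streaming (MapReduce itself is natively Java; the file names are illustrative):

# mapper.py -- emits a (word, 1) pair for every word read from standard input,
# one tab-separated key/value pair per line (the Hadoop Streaming convention).
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- after the shuffle, input arrives sorted by key; sum counts per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))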
Shortcoming of MapReduce
• Forces your data processing into Map and Reduce
• Other workflows missing include join, filter, flatMap,
groupByKey, union, intersection, …
• Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
• Read and write to Disk before and after Map and Reduce
(stateless machine)
• Not efficient for iterative tasks, e.g. machine learning
• Only Java natively supported
• Support for other languages is needed
• Only for batch processing
• No support for interactivity or streaming data
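For contrast, a sketch of how the richer operators listed above (flatMap, reduceByKey, filter) look in PySpark, which keeps intermediate results in memory instead of writing to HDFS between every step; the input path and application name are placeholders:

# Sketch: word count using the richer operators that plain MapReduce lacks (PySpark).
# "hdfs:///input.txt" and the application name are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")
counts = (sc.textFile("hdfs:///input.txt")
            .flatMap(lambda line: line.split())     # flatMap: one line -> many words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)        # shuffle and sum, kept in memory
            .filter(lambda kv: kv[1] > 1))          # filter: keep words seen more than once
print(counts.take(10))
sc.stop()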
Features of Hadoop
- Compared with other Systems
Features - Hadoop
Open Source
• Apache Hadoop is an open source project.
• Code can be modified according to business requirements
Distributed Processing
• Data is stored in a distributed manner in HDFS across the cluster, and it is
processed in parallel on a cluster of nodes.
Features – Hadoop (1)
Fault Tolerance
• A very important feature of Hadoop
• By default, 3 replicas of each block are stored across the cluster, and this
can be changed as per requirements.
• Failures of nodes or tasks are recovered automatically by the
framework.
Reliability
• Due to replication of data in the cluster, data is reliably stored on the
cluster of machines despite machine failures
Features - Hadoop (2)
High Availability
• Data is highly available and accessible despite hardware failures due to
multiple copies of the data. If a machine or some hardware crashes, the
data can be accessed from another path (replica).
Scalability
• Hadoop is highly scalable: new hardware can easily be added to the nodes.
• Hadoop also provides horizontal scalability, which means new nodes can be
added on the fly without any downtime.
Features - Hadoop (3)
Easy to use
• The client does not need to deal with distributed computing; the framework
takes care of it all, which makes Hadoop easy to use.
Data Locality
• This is a unique feature of Hadoop that allows it to handle Big Data easily.
• Hadoop works on the data locality principle, which states: move the
computation to the data instead of moving the data to the computation.
Design Principles of
Hadoop
System shall manage and heal itself
• Automatically and transparently route around failure (Fault Tolerant)
• Speculatively execute redundant tasks if certain nodes are detected to
be slow
Performance shall scale linearly
• Proportional change in capacity with resource change (Scalability)
Computation should move to data
• Lower latency, lower bandwidth (Data Locality)
Simple core, modular and extensible
(Economical)