The Basic Hadoop Components
Hadoop Common - libraries and utilities
Hadoop Distributed File System (HDFS) a
distributed file-system
Hadoop YARN a resource-management platform,
scheduling
Hadoop MapReduce a programming
model for large scale data processing
Hadoop Stack Transition
Applications and Frameworks
HBase a scalable data warehouse with support for large
tables.
Hive a data warehouse infrastructure that provides data
summarization and ad hoc querying
Pig A high-level data-flow language and execution framework
for parallel computation
Spark a fast and general compute
engine for Hadoop data. Wide range
of applications ETL, Machine Learning,
stream processing, and graph analytics.
Distributed Filesystem
Resource Management, Scheduling
YARN-based system for
parallel processing of data
HDFS and HDFS2
Original HDFS Design Goals
Resilience to hardware failure
Streaming data access
Support for large dataset, scalability to
hundreds/thousands of nodes with high
aggregate bandwidth
Application locality to data
Portability across heterogeneous
hardware and software platforms
Original HDFS Design
Single NameNode - a master server that manages
the file system namespace and regulates access to
files by clients.
Multiple DataNodes typically one per node in the
cluster. Functions:
Manage storage
Serving read/write requests from clients
Block creation, deletion, replication
based on instructions from NameNode
HDFS in Hadoop 2
HDFS Federation
Multiple Namenode servers
Multiple namespaces
High Availability redundant
NameNodes
Heterogeneous Storage and
Archival Storage
ARCHIVE, DISK, SSD, RAM_DISK
Federation
Federation: Block Pools
Federation: Benefits
Allows namespace scaling
Scales up filesystem read/write
throughput
Isolation
MapReduce Framework
Software framework for writing parallel data
processing applications
MapReduce job splits data into chunks
Map tasks process data chunks
Framework sorts map output
Reduce tasks use sorted map
data as input
MapReduce Framework
Typically compute and storage nodes are
the same.
MapReduce tasks and HDFS running on
the same nodes
Can schedule tasks on nodes
with data already present.
Original MapReduce Framework
Single master JobTracker
JobTracker schedules, monitors,
and re-executes failed tasks.
One slave TaskTracker per
cluster node
TaskTracker executes tasks per
JobTracker requests.
Original Hadoop Architecture
Master Node
Jobtracker
Namenode
Network/
Switching
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Original Hadoop Architecture
Master Node
Jobtracker
Namenode
Network/
Switching
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Original Hadoop Architecture
Master Node
Jobtracker
Namenode
Network/
Switching
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
Compute/Datanode
Tasktracker
YARN: NexGen MapReduce
Main idea Separate resource
management and job
scheduling/monitoring.
Global ResourceManager (RM)
NodeManager on each node
ApplicationMaster one for
each application
YARN Architecture
YARN Architecture
YARN Architecture
Additional YARN Features
High Availability ResourceManager
Timeline Server
Use of Cgroups
Secure Containers
YARN web services
REST APIs