
Big Data (BCS-061)

Unit II
Hadoop
• Hadoop is a platform that provides both distributed
storage and computational capabilities.
• Hadoop is a well-adopted, standards-based, open-
source software framework
• Hadoop is designed to abstract away much of the
complexity of distributed processing
• The not-for-profit Apache Software Foundation has
taken over maintenance of Hadoop, with Yahoo!
making significant contributions.
Hadoop
• Hadoop is the Apache Software Foundation top-level
project that holds the various Hadoop subprojects that
graduated from the Apache Incubator.
• The Hadoop project provides and supports the
development of open source software that supplies a
framework for the development of highly scalable
distributed computing applications.
• The Hadoop framework handles the processing
details, leaving developers free to focus on application
logic.
Hadoop: In Industry
✓ Social media (e.g., Facebook, Twitter)
✓ Life sciences
✓ Financial services
✓ Retail
✓ Government
Hadoop: History
• In 2002, when the WWW was relatively new and before
things were “Googled”, Doug Cutting and Mike Cafarella
wanted to crawl the Web and index the content so that they
could produce an Internet search engine.
• They began a project called Nutch to do this but needed a
scalable method to store the content of their indexing.
• The standard method to organize and store data in 2002
was by means of relational database management systems
(RDBMS), which were accessed in a language called SQL.
• But almost all SQL and relational stores were not appropriate for Internet search engine storage and retrieval.
Hadoop: History
• They were costly, not terribly scalable, not as tolerant
to failure as required, and possibly not as performant
as desired.
• In 2003 and 2004, Google released two important
papers, one on the Google File System and the other
on a programming model on clustered servers called
MapReduce.
• Cutting and Cafarella incorporated these technologies
into their project, and eventually Hadoop was born.
• Yahoo! began using Hadoop as the basis of its search engine, and soon its use spread to many other organizations.
Hadoop: Name Origin
• The name Hadoop is not an acronym; it’s a made-up
name
• It was the name that the child of Doug Cutting, the project’s creator, gave to a stuffed yellow elephant.
• Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere
Hadoop: Features
• It is optimized to handle massive quantities of all three kinds of data (structured, semi-structured, and unstructured) using commodity hardware, that is, relatively inexpensive computers
• It has a shared-nothing architecture
• It replicates data across multiple computers
• It is designed for high throughput rather than low latency
• It complements OLAP and OLTP
• It is not good when work cannot be parallelized or when there are data dependencies
• It is not good for processing small files.
Hadoop: Advantage
• Stores data in its native format
• Scalable
• Cost- effective
• Resilient to failures
• Flexibility
• Fast
Evolution of Hadoop
Hadoop: Versions
• Versions of Hadoop:
– Hadoop 1.0
– Hadoop 2.0
– Hadoop 3.0
Hadoop 1.0

• It has two main parts:


– Data Storage Framework
• It is a general-purpose file system called HDFS
• HDFS is schema-less
• It simply stores data files of any format as close to their
original form as possible
– Data Processing Framework
• This is a simple functional programming model initially popularized by Google as MapReduce
• It uses two functions: Map and Reduce
Hadoop 1.0: Limitations
• Required MapReduce programming expertise, along with proficiency in a language such as Java
• It supported only batch processing
• Its computation was tightly coupled to MapReduce
Hadoop 2.0
• In Hadoop 2.0, HDFS continues to be the data storage framework
• YARN (Yet Another Resource Negotiator), a separate resource management framework, was added
• Any application capable of dividing itself into parallel
tasks is supported by YARN
• MapReduce programming expertise was no longer
required
• It also supports real-time processing along with batch processing
YARN
• The idea of YARN is to split up the functionalities of resource
management and job scheduling/monitoring into separate
daemons.
• The idea is to have a global ResourceManager (RM) and per-
application ApplicationMaster (AM).
• An application is either a single job or a DAG of jobs.
• The ResourceManager and the NodeManager form the data-
computation framework.
• The RM is the ultimate authority that arbitrates resources
among all the applications in the system.
• The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
Hadoop 3.0
• Apache Hadoop 3.0.0 incorporates a
number of significant enhancements over
the previous major release line (hadoop-
2.x).
Hadoop 2.0 vs Hadoop 3.0
• License: Apache 2.0, open source (both)
• Minimum supported Java version: Java 7 (2.0) vs Java 8 (3.0)
• Fault tolerance: handled by replication (2.0) vs erasure coding (3.0)
• Data balancing: HDFS balancer (2.0) vs intra-DataNode balancer (3.0)
• Storage scheme: uses a 3x replication scheme (2.0) vs support for erasure coding in HDFS (3.0)
• Storage overhead: 200% (2.0) vs 50% (3.0)
• YARN Timeline Service: old timeline service with scalability issues (2.0) vs improved timeline service with better scalability and reliability (3.0)
• Compatible file systems: HDFS (default FS) and FTP file system (2.0) vs the previous ones as well as the Microsoft Azure Data Lake filesystem (3.0)
• Single point of failure: has features to overcome SPOF (2.0) vs needs no manual intervention to overcome it (3.0)
• Scalability: up to 10,000 nodes per cluster (2.0) vs more than 10,000 nodes per cluster (3.0)
• NameNodes: 1 active and 1 passive NameNode (2.0) vs 1 active and more than 1 passive NameNode (3.0)
Hadoop: Ecosystem
• Apache HDFS: The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
  References: 1. hadoop.apache.org 2. Google File System (GFS) paper 3. Cloudera: Why HDFS 4. Hortonworks: Why HDFS
• Apache MapReduce: MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google’s “MapReduce: Simplified Data Processing on Large Clusters” paper. The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for “Yet Another Resource Negotiator”.
  References: 1. Apache MapReduce 2. Google MapReduce paper 3. Writing YARN applications
Hadoop: Includes
• Hadoop Core, our flagship sub-project, provides a distributed
filesystem (HDFS) and support for the MapReduce distributed
computing metaphor.
• HBase builds on Hadoop Core to provide a scalable, distributed
database.
• Pig is a high-level data-flow language and execution framework for
parallel computation. It is built on top of Hadoop Core.
• ZooKeeper is a highly available and reliable coordination system.
Distributed applications use ZooKeeper to store and mediate
updates for critical shared state.
• Hive is a data warehouse infrastructure built on Hadoop Core that
provides data summarization, ad hoc querying and analysis of
datasets.
Hadoop: Resources
• The Hadoop Distributed File System (HDFS)
• The MapReduce programming platform
• The Hadoop ecosystem
The Hadoop Distributed File
System (HDFS)
HDFS
• The Hadoop Distributed File System (HDFS) is the place in a
Hadoop cluster where the data is stored.
• Built for data-intensive applications, the HDFS is designed to
run on clusters of inexpensive commodity servers.
• HDFS is optimized for high-performance, read-intensive operations, and is resilient to failures in the cluster.
• It does not prevent failures, but is unlikely to lose data,
because HDFS by default makes multiple copies of each of its
data blocks.
HDFS
• Moreover, HDFS is a write once, read many (or WORM-ish)
filesystem: once a file is created, the filesystem API only
allows you to append to the file, not to overwrite it.
• As a result, HDFS is usually inappropriate for normal online
transaction processing (OLTP) applications.
• Most uses of HDFS are for sequential reads of large files.
• These files are broken into large blocks, usually 64 MB or larger in size, and these blocks are distributed among the nodes in the cluster.
HDFS
• Blocks in HDFS are mapped to files in the host’s underlying
filesystem, often ext3 in Linux systems.
• HDFS does not assume that the underlying disks in the host
are RAID protected, so by default, three copies of each block
are made and are placed on different nodes in the cluster.
• This provides protection against lost data when nodes or disks
fail and assists in Hadoop’s notion of accessing data where it
resides, rather than moving it through a network to access it.
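Applications see none of this block management directly; they read and write through the filesystem API. Below is a minimal sketch using Hadoop’s Java FileSystem API to write a file, request a replication factor, and read it back; the path and replication value are illustrative assumptions, not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS decides which cluster the client talks to.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/demo/blocks.txt");   // hypothetical path

        // Write a file; HDFS splits it into blocks and replicates each block.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("sample content stored as replicated blocks");
        }

        // Ask HDFS to keep 3 copies of each block of this file (the usual default).
        fs.setReplication(file, (short) 3);

        // Read the file back sequentially, the access pattern HDFS is optimized for.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}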
The Design of HDFS
• HDFS is a filesystem designed for storing very large
files with streaming data access patterns, running on
clusters of commodity hardware.
Hadoop High Level Architecture
Hadoop High Level Architecture
• Hadoop is a distributed Master-slave architecture
• MasterNode is known as NameNode
• SlaveNode is known as DataNode
• Components of NameNode:
– Master HDFS
• Responsible for partitioning the data storage across the slave nodes
• Keeps track of the location of DataNodes
– Master MapReduce
• Decides and schedules computation task on slave nodes.
Hadoop Distributions
• APACHE: Apache is the organization that maintains the core Hadoop
code and distribution.
– Hadoop 1.0, Hadoop 2.0
• CLOUDERA: Cloudera is the most tenured Hadoop distribution, and it
employs a large number of Hadoop (and Hadoop ecosystem)
committers. Doug Cutting, who along with Mike Cafarella originally created Hadoop, is the chief architect at Cloudera.
– CDH 4.0, CDH 5.0
• HORTONWORKS: Hortonworks offers the same advantages as
Cloudera in terms of the ability to quickly address problems and
feature requests in core Hadoop and its ecosystem projects.
– HDP 1.0, HDP 2.0
• MAPR: M3, M5, M8
An HDFS client communicating with the master
NameNode and slave DataNodes
Daemons
• NameNode (1 per cluster): stores filesystem metadata, stores the file-to-block map, and provides a global picture of the filesystem
• Secondary NameNode (1 per cluster): performs internal NameNode transaction log checkpointing
• DataNode (many per cluster): stores block data (file contents)
Daemons: NameNode
• NameNode stores HDFS namespace
• Hadoop’s NameNode keeps all the HDFS metadata in memory
for fast metadata operations.
• The early releases in Hadoop 1.X had a single NameNode
• Any failure of the NameNode’s hardware meant the entire cluster became unusable.
• Subsequent releases had a cold standby in the form of a
secondary NameNode.
• The secondary NameNode merged the edit logs and
NameNode image files
• HDFS breaks a large file into smaller pieces called blocks.
Daemons: NameNode
• The NameNode uses a rack ID to identify DataNodes in the rack
• A rack is a collection of DataNodes within the cluster
• The NameNode keeps track of the blocks of a file as they are placed on various DataNodes
• It manages operations such as read, write, create & delete
• The NameNode uses an EditLog to record every transaction that happens to the file system metadata
• The main role of the NameNode is to manage the File System Namespace
• The File System Namespace is a collection of files in the cluster
• The File System Namespace includes the mapping of blocks to files and file properties, and is stored in a file called FsImage
Daemons: DataNode
• The DataNode stores each HDFS data block in a separate file
on its local filesystem with no knowledge about the HDFS files
themselves.
• The DataNode does not create all files in the same directory.
• Instead, it uses heuristics to determine the optimal number of
files per directory, and creates subdirectories appropriately.
• The DataNode stores the received blocks in a local filesystem,
and forwards that portion of data to the next DataNode in the
list.
• There are multiple DataNodes per cluster
Daemons: DataNode
• The DataNode periodically sends heartbeat messages to the NameNode to confirm the connectivity between the NameNode and the DataNode.

Daemons: Secondary NameNode
• The secondary NameNode periodically merged the edit logs and the NameNode image file, bringing two benefits.
• One, the primary NameNode startup time was reduced as the
NameNode did not have to do the entire merge on startup.
• Two, the secondary NameNode acted as a replica that could
minimize data loss on NameNode disasters.
• However, the secondary NameNode is not a backup node for
NameNode
• It was still not a hot standby, leading to high failover and
recovery times and affecting cluster availability.
DATA FORMAT
• A data/file format defines how information is stored in HDFS.
• Hadoop does not have a default file format and the choice of
a format depends on its use.
• The main performance concern for applications that use HDFS is the time taken to search (read) and to write information.
• Managing the processing and storage of large volumes of information is very complex, which is why an appropriate data format is required.
DATA FORMAT
• The choice of an appropriate file format can
produce the following benefits:
– Optimum writing time
– Optimum reading time
– File divisibility
DATA FORMAT
• Some of the most commonly used formats of the Hadoop
ecosystem are :
– Text/CSV: A plain text file or CSV is the most common format both
outside and within the Hadoop ecosystem.
– SequenceFile: The SequenceFile format stores the data in binary form; it supports compression but does not store metadata (a short write example follows these format slides).
– Avro: Avro is a row-based storage format. This format includes the definition of the schema of your data in JSON format. Avro allows block compression along with its divisibility, making it a good choice for most cases when using Hadoop.
– Parquet: Parquet is a column-based binary storage format that can store
nested data structures. This format is very efficient in terms of disk
input/output operations when the necessary columns to be used are
specified.
DATA FORMAT
– RCFile (Record Columnar File): RCFile is a columnar
format that divides data into groups of rows, and
inside it, data is stored in columns.
– ORC (Optimized Row Columnar): ORC is
considered an evolution of the RCFile format and
has all its benefits alongside some improvements
such as better compression, allowing faster
queries
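As a concrete illustration of the SequenceFile format mentioned above, the following sketch writes binary key-value records with Hadoop’s SequenceFile writer; the output path and sample records are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/demo/words.seq");   // hypothetical output path

        // SequenceFile stores binary key-value records; keys and values are Writables.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("apple"), new IntWritable(3));
            writer.append(new Text("red"), new IntWritable(1));
        }
    }
}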
Anatomy of File Read
Anatomy of File Write
Processing Data With Hadoop
• MapReduce Programming is a software framework
• It helps in processing massive amount of data in parallel
• In this, the input dataset is split into independent chunks
• Map tasks process these chunks in a parallel manner
• The output produced by the map tasks is intermediate data, which is stored on the local disk of the node running the task.
• The outputs of the mappers are automatically shuffled and sorted by the framework
• MapReduce Framework sorts the output based on keys
• This sorted output becomes input to the reduce tasks
Processing Data With Hadoop
• Reduce tasks produce the reduced output by combining the outputs of the various mappers.
• Job input and output are stored in a file system
• The MapReduce framework also takes care of other tasks such as scheduling and monitoring.
• MapReduce framework has two daemons:
– JobTracker
– TaskTracker
MapReduce Daemons: JobTracker
• JobTracker is a master daemon responsible for executing
overall MapReduce Job
• It provides connectivity between Hadoop and the application.
• When a job is submitted, the JobTracker creates the execution
plan
• When a task fails, it reschedules the task on a different node after a number of retries.
• There is one JobTracker per cluster
MapReduce Daemons: TaskTracker
• It is responsible for executing individual tasks assigned by the
JobTracker
• There is a single TaskTracker per slave node, and it spawns multiple JVMs to handle multiple map or reduce tasks in parallel
• The TaskTracker periodically sends heartbeats to the JobTracker
Interaction between Daemons
MapReduce Programming Architecture
MapReduce Programming Workflow
MapReduce Working
MapReduce’s shuffle and sort phases
MapReduce Example:
WordCount Problem
• This is an apple
• Apple is red in colour
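A rough walk-through of how WordCount would process these two lines (assuming words are lower-cased before counting):
• Map output: (this,1) (is,1) (an,1) (apple,1) and (apple,1) (is,1) (red,1) (in,1) (colour,1)
• After shuffle and sort, values are grouped by key: (an,[1]) (apple,[1,1]) (colour,[1]) (in,[1]) (is,[1,1]) (red,[1]) (this,[1])
• Reduce output: (an,1) (apple,2) (colour,1) (in,1) (is,2) (red,1) (this,1)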
Developing a MapReduce Application
• Write the map, reduce, and driver functions.
• Test with a small subset of the dataset.
• If it fails, use the IDE’s debugger to identify and solve the problem.
• Run on the full dataset and, if it fails, debug it using Hadoop debugging tools.
• Do profiling to tune the performance of the
program.
MAPPER
• Reads in an input pair <Key, Value>
• Outputs a pair <K’, V’>
• Let’s count the number of occurrences of each word in user queries (or tweets/blogs)
• The input to the mapper will be <queryID,
QueryText>
REDUCER
• Accepts the Mapper output, and aggregates the values for each key
Shuffle and Sort
• Probably the most complex aspect of MapReduce, and the heart of the framework!
• Map side
– Map outputs are buffered in memory in a circular buffer.
– When the buffer reaches a threshold, its contents are “spilled” to disk.
– Spills are merged into a single, partitioned file (sorted within each partition); the combiner runs here first.
• Reduce side
– First, map outputs are copied over to reducer machine.
– “Sort” is a multi-pass merge of map outputs (happens in memory
and on disk): combiner runs here again.
– Final merge pass goes directly into reducer
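The map-side buffer and spill behaviour described above can be tuned per job. A minimal sketch, assuming Hadoop 2.x property names (the values shown are only illustrative):

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    // Returns a Configuration with map-side sort buffer settings (Hadoop 2.x property names).
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // size of the in-memory circular buffer, in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // buffer fill level that triggers a spill to disk
        conf.setInt("mapreduce.task.io.sort.factor", 50);         // number of spill segments merged in one pass
        return conf;
    }
}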
Mapper Phase Code
• The first stage in development of MapReduce
Application is the Mapper Class.
• Here, the RecordReader processes each input record and generates the corresponding key-value pair.
• Hadoop saves this intermediate data produced by the mapper on the local disk
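The slide above describes the mapper in prose only, so here is a minimal WordCount Mapper sketch for the earlier example, assuming the org.apache.hadoop.mapreduce (new) API; the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself (supplied by the RecordReader).
        for (String token : value.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit the intermediate (word, 1) pair
            }
        }
    }
}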
Reducer Phase Code
• The Intermediate output generated from the mapper
is fed to the reducer which processes it and
generates the final output which is then saved in the
HDFS.
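A matching WordCount Reducer sketch (same assumptions as the mapper above; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All counts for one word arrive together after the shuffle and sort.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result);   // the final (word, total) pair is written to HDFS
    }
}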
Driver code
• The major component in a MapReduce job is a Driver
Class.
• It is responsible for setting up a MapReduce job to run in Hadoop.
• We specify the names of the Mapper and Reducer classes along with the data types and the respective job names.
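A minimal driver sketch that wires the illustrative WordCountMapper and WordCountReducer together; input and output paths are taken from the command line, and the combiner line simply reuses the reducer (optional, as discussed in the Shuffle and Sort slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)

        // submit() would return immediately; waitForCompletion() polls progress until the job ends.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}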
Debugging a MapReduce Application
• Log files are essential for the process of debugging.
• Log files can be found on the local filesystem of each TaskTracker, and if JVM reuse is enabled, each log accumulates the entire JVM run.
• Anything written to standard output or error is directed to
the relevant logfile.
Unit tests with MRUnit
• Apache MRUnit is a unit testing framework designed specifically for Apache
Hadoop map-reduce jobs. It allows developers to write and run tests locally
without requiring a Hadoop cluster, saving both time and resources. MRUnit is
used in Data Science to ensure the correctness of map-reduce logic before it’s
deployed to a production environment.
• History
• Apache MRUnit was developed as part of the larger Apache Hadoop project,
allowing for map-reduce job testing within the Hadoop ecosystem. It was adopted
by the Apache Software Foundation in early 2012 and has since seen regular
updates and improvements.
• Functionality and Features
• Apache MRUnit offers the following key features:
• Local execution of map-reduce tests without needing a Hadoop cluster.
• Support for multiple input and output data types.
• Mock object interfaces for simulating complex object interactions.
Unit tests with MRUnit
• Benefits and Use Cases
• Apache MRUnit provides several benefits:
• It speeds up the development process by providing a local testing
framework.
• It aids in debugging by allowing developers to step through their code.
• It provides robust testing capabilities ensuring quality before deployment.
• Challenges and Limitations
• However, Apache MRUnit also has certain limitations:
• It does not fully replicate the Hadoop runtime environment, which may
lead to some inconsistencies.
• It lacks support for testing the entire job flow including both Mapper and
Reducer portions simultaneously.
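A minimal MRUnit test sketch for the illustrative WordCount mapper and reducer used elsewhere in this unit, assuming MRUnit 1.x and the new-API drivers in org.apache.hadoop.mrunit.mapreduce:

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {
    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("this is an apple"))
                .withOutput(new Text("this"), new IntWritable(1))
                .withOutput(new Text("is"), new IntWritable(1))
                .withOutput(new Text("an"), new IntWritable(1))
                .withOutput(new Text("apple"), new IntWritable(1))
                .runTest();   // runs the mapper locally, no Hadoop cluster needed
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new WordCountReducer())
                .withInput(new Text("apple"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("apple"), new IntWritable(2))
                .runTest();
    }
}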
Anatomy of MapReduce
Anatomy of MapReduce
• There are five independent entities:
– The client, which submits the MapReduce job.
– The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
– The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
– The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
– The distributed filesystem, which is used for sharing job files between the other entities.
Anatomy of MapReduce
• Job Submission :
• The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it.
• Having submitted the job, waitForCompletion polls the job’s progress once per second and
reports the progress to the console if it has changed since the last report.
• When the job completes successfully, the job counters are displayed. Otherwise, the error
that caused the job to fail is logged to the console.
• The job submission process implemented by JobSubmitter does the following:
• Asks the resource manager for a new application ID, used for the MapReduce job ID.
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file,
and the computed input splits, to the shared filesystem in a directory named after the job ID.
• Submits the job by calling submitApplication() on the resource manager.
Anatomy of MapReduce

• Job Initialization :
• When the resource manager receives a call to its submitApplication() method, it
hands off the request to the YARN scheduler.
• The scheduler allocates a container, and the resource manager then launches
the application master’s process there, under the node manager’s management.
• The application master for MapReduce jobs is a Java application whose main
class is MRAppMaster .
• It initializes the job by creating a number of bookkeeping objects to keep track of
the job’s progress, as it will receive progress and completion reports from the
tasks.
• It retrieves the input splits computed in the client from the shared filesystem.
• It then creates a map task object for each split, as well as a number of reduce
task objects determined by the mapreduce.job.reduces property (set by the
setNumReduceTasks() method on Job).
Anatomy of MapReduce
• Task Assignment:
• If the job does not qualify for running as an uber task (a small job run in the same JVM as the application master), then the application master requests containers for all the map and reduce tasks in the job from the resource manager.
• Requests for map tasks are made first and with a higher priority than those for reduce tasks,
since all the map tasks must complete before the sort phase of the reduce can start.
• Requests for reduce tasks are not made until 5% of map tasks have completed.

• Task Execution:
• Once a task has been assigned resources for a container on a particular node by the
resource manager’s scheduler, the application master starts the container by contacting the
node manager.
• The task is executed by a Java application whose main class is YarnChild.
• Before it can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache.
• Finally, it runs the map or reduce task.
Failures in MapReduce
• Worker Failures:
• If a worker node (where map or reduce tasks are executed)
fails, the tasks assigned to that node need to be rescheduled
and re-executed on other nodes.
• Master Failures:
• If the master node (JobTracker) fails, the entire MapReduce
job might be restarted by a different master, requiring a new
job submission.
• Resilience:
• MapReduce frameworks are designed to be resilient to
failures, with mechanisms to detect and recover from them.
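Worker-level recovery is largely automatic; the number of attempts per task before the whole job is failed can be adjusted. A minimal sketch, assuming Hadoop 2.x property names (the values are illustrative):

import org.apache.hadoop.conf.Configuration;

public class RetryConfig {
    // Controls how many times a failed map or reduce task attempt is rescheduled
    // before the whole job is declared failed (Hadoop 2.x property names).
    public static Configuration withRetries(Configuration conf) {
        conf.setInt("mapreduce.map.maxattempts", 4);      // attempts per map task (4 is the usual default)
        conf.setInt("mapreduce.reduce.maxattempts", 4);   // attempts per reduce task (4 is the usual default)
        return conf;
    }
}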
Job Scheduling in MapReduce
• Scheduling Algorithms:
• MapReduce frameworks employ scheduling algorithms
(e.g., Capacity Scheduler, Fair Scheduler, FIFO Scheduler)
to determine which tasks to run on which nodes.
• Task Allocation:
• The scheduler allocates tasks to available nodes, taking
into account resource availability and task dependencies.
• Failure Handling:
• The scheduler also plays a crucial role in rescheduling
failed tasks and ensuring that the job progresses despite
failures.
Shuffle and Sort
• Shuffle Phase:
• In the shuffle phase, the intermediate data produced by the map tasks is partitioned and
distributed to the appropriate reduce tasks.
• Sort Phase:
• The data is then sorted by key, ensuring that all data with the same key is grouped together for the
reduce phase.
• Reduce Stage:
• The reduce tasks then process the sorted data, combining values for each key to produce the final
output.
• Efficiency:
• The efficiency of the shuffle and sort phases significantly impacts the overall performance of the
MapReduce job.
• External Merge Sort:
• When the intermediate data is too large to fit in memory, an external merge sort algorithm is used
during the sort phase.
• Data Distribution:
• The shuffle phase aims to distribute data evenly among the reduce tasks to prevent data skew,
where some reduce tasks receive significantly more data than others.
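Which reduce task an intermediate key goes to is decided by the job’s Partitioner (a hash of the key by default); plugging in a custom Partitioner is one way to fight skew. A minimal sketch whose routing rule is purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes intermediate (word, count) pairs to reducers by the word's first letter,
// only to show how the key-to-reducer assignment can be customized.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        // Must return a value in the range [0, numPartitions).
        return first % numPartitions;
        // Registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).
    }
}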
