
Course Code : BAD601
Semester / Year : VI / III
Course Title : Big Data Analytics
Academic Year : 2024-25
Module 2

Introduction to Hadoop: Introducing Hadoop, Why Hadoop, Why not
RDBMS, RDBMS Vs Hadoop, History of Hadoop, Hadoop overview,
Use case of Hadoop, HDFS (Hadoop Distributed File System),
Processing data with Hadoop, Managing resources and
applications with Hadoop YARN (Yet Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction, Mapper,
Reducer, Combiner, Partitioner, Searching, Sorting, Compression.

Introducing Hadoop
Introducing Hadoop
 Big Data Growth and Statistics:
 Enterprises are realizing the potential of untapped data, which is growing rapidly.
 Data exists in structured, semi-structured, and unstructured forms.
 Examples of Big Data Growth:
 Every day:
 The New York Stock Exchange (NYSE) generates trade data on about 1.5 billion shares.
 Facebook stores 2.7 billion comments and likes.
 Google processes 24 petabytes of data.
 Every minute:
 Facebook users share 2.5 million pieces of content.
 Twitter sees 300,000 new tweets.
 Instagram users upload 220,000 photos.
 YouTube users upload 72 hours of new video.
 Apple users download 50,000 apps.
 Over 200 million email messages are sent.
Introducing Hadoop
 Amazon generates over $80,000 in online sales.
 Google receives over 4 million search queries.
 Every second:
 Banking applications process over 10,000 credit card transactions.
 Data: The Treasure Trove
 Business advantages: Helps in generating product recommendations, developing new products, analyzing the
market, etc.
 Early key indicators: Data insights can shape the future of businesses.
 Precision analysis: More data leads to better decision-making and accuracy.
 Challenges of Big Data
 High volume: Managing large-scale data efficiently.
 Variety: Handling structured, semi-structured, and unstructured data from different sources.
 Velocity: Processing data quickly for timely decision-making.
Why Hadoop
Why Hadoop
 Key Consideration:
 Hadoop is highly popular due to its capability to handle massive amounts of data, of many different
categories, fairly quickly.
 Other Key Considerations:
 Low Cost
 Open-source framework.
 Uses commodity hardware (cheap, widely available hardware) to store large amounts of data.
 Computing Power
 Based on a distributed computing model for handling large data volumes efficiently.
 More nodes = more computing power.
 Scalability
 Easily scalable by adding more nodes to the system.
 Requires minimal administration.
Why Hadoop
 Storage Flexibility
 Unlike relational databases, data doesn't need pre-processing before storing in Hadoop.
 Can store both structured and unstructured data (e.g., text, images, videos).
 Inherent Data Protection
 Automatic failure recovery: If a node fails, the task is reassigned to another node.
 Data replication: Multiple copies of data are stored across nodes to prevent data loss.
 Hadoop Framework
 Uses commodity hardware, distributed file system, and distributed computing.
 Groups of machines work together in a Cluster.
Why Hadoop
 How Hadoop Manages Data:
 Data Distribution & Duplication
 Each file is split into chunks and distributed across several nodes.
 Parallel Processing
 Local computing resources process each data chunk simultaneously for efficiency.
 Fault Tolerance
 Detects and handles hardware failures automatically.
Why not RDBMS
Why not RDBMS
 Not Suitable for Large Files & Unstructured Data
 RDBMS struggles with large files, images, videos, and semi-structured or unstructured data (JSON, XML,
logs, etc.).
 Designed primarily for structured, tabular data.
 High Cost of Scaling
 Scaling an RDBMS requires expensive hardware upgrades (vertical scaling).
 As data volume increases, costs rise exponentially (as shown in Figure 5.4).
 Scaling beyond terabytes (TB) to petabytes (PB) requires massive investment.
 Limited Performance for Massive Data Processing
 Relational databases require complex indexing and transactions, which slow down performance.
 Joins and queries become inefficient with large datasets.
 Not Designed for Distributed Computing
 RDBMS primarily operates on a single machine or a small cluster.
 Less Flexibility
 Schema changes in RDBMS can be complex and time-consuming.
RDBMS Vs Hadoop
RDBMS Vs Hadoop
History of Hadoop
History of Hadoop
 Hadoop was created by Doug Cutting, who also developed Apache Lucene, a widely used text search library.
 Hadoop originated as part of the Apache Nutch project, an open-source web search engine and a subset of the
Lucene project; it was later developed further at Yahoo.
 The Name "Hadoop"
 The name Hadoop is not an acronym; it's a made-up name.
 Doug Cutting named the project after his child's stuffed yellow elephant.
 He preferred a name that was short, easy to spell, meaningless, unique, and not used elsewhere.
 Similar to Google’s naming approach (Googol is a kid’s term for a very large number).
Hadoop overview
Hadoop overview
 Hadoop is an open-source framework designed to store and process massive amounts of data in a distributed
fashion on clusters of commodity hardware.
 It primarily achieves two objectives:
 Massive data storage.
 Faster data processing.
 Key Aspects of Hadoop
 Open Source Software: Free to download, use, and contribute to.
 Framework: Provides necessary tools and programs for application development.
 Distributed Architecture: Data is divided and stored across multiple connected nodes.
 Massive Storage: Uses low-cost commodity hardware to store colossal amounts of data.
 Faster Processing: Parallel processing of large datasets yields quick responses.
Hadoop overview
 Hadoop Components
 Hadoop consists of core components and an ecosystem of tools that enhance its functionality.
 HDFS (Hadoop Distributed File System):
 Acts as the storage component.
 Distributes data across multiple nodes.
 Ensures redundancy for fault tolerance.
 MapReduce:
 Provides a computational framework.
 Splits tasks into smaller parallel processes across nodes.
 Efficiently processes large-scale data.
Hadoop overview
 Hadoop Ecosystem
 Several support projects enhance Hadoop's capabilities, including:
 HIVE: Data warehousing tool for querying and managing large datasets.
 PIG: High-level scripting language for data transformation.
 SQOOP: Facilitates data transfer between RDBMS and Hadoop.
 HBASE: NoSQL database for real-time data access.
 FLUME: Handles data ingestion from multiple sources.
 OOZIE: Workflow scheduler for managing Hadoop jobs.
 MAHOUT: Machine learning library for Hadoop.
Hadoop overview
 Hadoop Conceptual Layer
 Hadoop consists of two primary layers:
 Data Storage Layer: Manages storage of large volumes of data.
 Data Processing Layer: Processes data in parallel for insights
and analytics.
 High-Level Architecture of Hadoop
 Hadoop follows a Master-Slave Architecture, where:
 Master Node:
 NameNode (Master HDFS): Manages file system metadata and partitions data across DataNodes.
 JobTracker (Master MapReduce): Assigns and schedules computational tasks to worker nodes.
 Slave Nodes:
 DataNodes store actual data.
 TaskTrackers execute assigned computation tasks.
 This architecture ensures fault tolerance, scalability, and efficient distributed processing, making Hadoop a
preferred choice for big data applications.
Hadoop overview
Use case of Hadoop
HDFS (Hadoop Distributed
File System)
HDFS (Hadoop Distributed File System)
 HDFS is the primary storage system of Hadoop, designed to store and manage vast amounts of data efficiently
across distributed clusters. Below are some key aspects of HDFS:
 Storage Component of Hadoop: HDFS is the fundamental storage layer of the Hadoop ecosystem.
 Distributed File System: It distributes data across multiple nodes in a cluster to ensure fault tolerance and
scalability.
 Modeled after Google File System (GFS): HDFS follows principles similar to GFS, enabling high
performance and robustness.
 Optimized for High Throughput: HDFS leverages large block sizes and moves computation closer to data,
reducing network congestion and enhancing performance.
 Data Replication: Files in HDFS are replicated multiple times (as per the configured replication factor)
across different nodes, making it resilient to hardware and software failures.
 Automatic Block Replication: If a node fails, HDFS automatically replicates data blocks to maintain
availability and integrity.
 Efficient Handling of Large Files: HDFS is optimized for storing and processing large files (in the range of
gigabytes and terabytes), rather than small files.
 Runs on Native File Systems: HDFS sits on top of traditional file systems like ext3 and ext4, providing
abstraction and enhanced capabilities.
HDFS (Hadoop Distributed File System)
 Example of File Storage in HDFS
 Assume a file named Sample.txt is of size 192MB. Given the default block size of 64MB, this file will be split
into three blocks and stored across different nodes. Each block is replicated based on the default replication
factor to ensure data reliability and fault tolerance.
HDFS (Hadoop Distributed File System)
 HDFS Daemons
 NameNode
 HDFS breaks large files into smaller pieces called blocks.
 The NameNode uses a rack ID to identify DataNodes within a rack (a rack is a collection of DataNodes in the
cluster).
 The NameNode tracks blocks of a file and manages file-related operations such as reading, writing,
creating, and deleting files.
 The primary function of the NameNode is to manage the File System Namespace, which maps file blocks
to DataNodes and stores this mapping in the FsImage file.
 The EditLog records all filesystem metadata changes.
 When the NameNode starts, it applies all transactions from the EditLog to the FsImage in memory,
flushes the updated FsImage to disk, and truncates the old EditLog.
 There is only one NameNode per cluster.
HDFS (Hadoop Distributed File System)
 HDFS Daemons
 DataNode
 Multiple DataNodes exist per cluster.
 During pipeline read and write operations, DataNodes communicate with each other.
 Each DataNode sends a heartbeat to the NameNode to confirm connectivity.
 If a DataNode fails to send a heartbeat, the NameNode replicates its data to another DataNode to maintain
cluster integrity.
 Secondary NameNode
 The Secondary NameNode periodically takes snapshots of HDFS metadata to reduce the burden on the
primary NameNode.
 It does not process real-time metadata updates but can assist in cluster recovery by providing recent
snapshots.
HDFS (Hadoop Distributed File System)
HDFS (Hadoop Distributed File System)
HDFS (Hadoop Distributed File System)
 Anatomy of File Read
 The file read process in HDFS follows these steps:
 The client opens a file using open() in the DistributedFileSystem.
 The DistributedFileSystem contacts the NameNode to get block locations.
 The FSDataInputStream is returned to the client.
 The client reads data from the closest DataNode.
 When a block is fully read, the stream connects to the next DataNode for subsequent blocks.
 After reading the entire file, the client closes the stream.
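A minimal Java sketch of this read path, using the standard FileSystem API (the path /sample/test.txt and the default configuration are assumptions for illustration):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // DistributedFileSystem when fs.defaultFS is hdfs://
            // open() asks the NameNode for block locations and returns an
            // FSDataInputStream that reads each block from the closest DataNode.
            try (FSDataInputStream in = fs.open(new Path("/sample/test.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }   // closing the stream ends the read, as in the final step above
        }
    }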
HDFS (Hadoop Distributed File System)
HDFS (Hadoop Distributed File System)
 Anatomy of File Write
 The file write process in HDFS follows these steps:
 The client calls create() on DistributedFileSystem.
 An RPC request is sent to the NameNode, which verifies if the file exists and creates an entry without
assigning blocks.
 The client writes data using DFSOutputStream, which splits data into packets stored in an internal queue.
 Packets are streamed to the first DataNode, which forwards them to the second and third DataNodes,
forming a pipeline.
 An Ack queue ensures packets are acknowledged before they are removed from the queue.
 Once writing is complete, the client closes the stream.
 The final acknowledgments are received before notifying the NameNode of the file's successful creation.
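A matching write-path sketch with the same FileSystem API (the path and content are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // create() sends the RPC to the NameNode; the returned stream splits
            // writes into packets and pushes them through the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/sample/output.txt"))) {
                out.writeUTF("Hello HDFS");
            }   // close() waits for the final acknowledgments before the file is finalized
        }
    }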
HDFS (Hadoop Distributed File System)
HDFS (Hadoop Distributed File System)
 Replica Placement Strategy
 HDFS follows a default Replica Placement Strategy:
 The first replica is placed on the same node as the client.
 The second replica is placed on a node in a different rack.
 The third replica is placed on a different node in the same rack as the second replica.
 This strategy enhances reliability and fault tolerance.
HDFS (Hadoop Distributed File System)
 Working with HDFS Commands
 List files and directories at the root of HDFS:
 hadoop fs -ls /
 List all directories and files recursively:
 hadoop fs -ls -R /
 Create a directory in HDFS:
 hadoop fs -mkdir /sample
 Copy a local file to HDFS:
 hadoop fs -put /root/sample/test.txt /sample/test.txt
 Copy a file from HDFS to the local filesystem:
 hadoop fs -get /sample/test.txt /root/sample/test.txt
 Copy a file using copyFromLocal:
 hadoop fs -copyFromLocal /root/sample/test.txt /sample/test.txt
 Copy a file from HDFS to the local filesystem using copyToLocal:
 hadoop fs -copyToLocal /sample/test.txt /root/sample/test.txt
HDFS (Hadoop Distributed File System)
 Working with HDFS Commands
 Display the contents of an HDFS file:
 hadoop fs -cat /sample/test.txt
 Copy a file within HDFS:
 hadoop fs -cp /sample/test.txt /sample1
 Remove a directory from HDFS:
 hadoop fs -rm -r /sample1
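For completeness, the same operations can also be performed programmatically through the FileSystem API; a brief sketch (paths reused from the commands above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsShellEquivalents {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            fs.mkdirs(new Path("/sample"));                            // hadoop fs -mkdir /sample
            fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                                 new Path("/sample/test.txt"));        // -put / -copyFromLocal
            fs.copyToLocalFile(new Path("/sample/test.txt"),
                               new Path("/root/sample/test.txt"));     // -get / -copyToLocal
            fs.delete(new Path("/sample1"), true);                     // hadoop fs -rm -r /sample1
        }
    }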
 Special Features of HDFS
 Data Replication: Ensures high availability by directing clients to the nearest replica.
 Data Pipeline: Enables efficient data writing by streaming packets through multiple DataNodes in a pipeline.
PROCESSING DATA WITH
HADOOP
PROCESSING DATA WITH HADOOP
 MapReduce Programming is a software framework.
 MapReduce Programming helps you to process massive amounts of data in parallel.
 In MapReduce Programming, the input dataset is split into independent chunks.
 Map tasks process these independent chunks completely in a parallel manner.
 The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server.
 The output of the mappers is automatically shuffled and sorted by the framework.
 MapReduce Framework sorts the output based on keys.
 This sorted output becomes the input to the reduce tasks. Reduce task provides reduced output by combining the
output of the various mappers.
 Job inputs and outputs are stored in a file system. MapReduce framework also takes care of other tasks such as
scheduling, monitoring, re-executing failed tasks, etc.
 Hadoop Distributed File System and MapReduce Framework run on the same set of nodes.
 This configuration allows effective scheduling of tasks on the nodes where data is present (Data Locality).
 This in turn results in very high throughput.
PROCESSING DATA WITH HADOOP
 There are two daemons associated with MapReduce Programming.
 A single master JobTracker per cluster and one slave TaskTracker per cluster-node.
 The JobTracker is responsible for scheduling tasks to the TaskTrackers, monitoring the task, and re-executing the
task just in case the TaskTracker fails.
 The TaskTracker executes the task.
 MapReduce applications specify the input/output locations and supply the map and reduce functions.
 These applications use suitable interfaces to construct the job.
 The application and the job parameters together are known as job configuration. Hadoop job client submits the
job (jar/executable, etc.) to the JobTracker.
 Then it is the responsibility of the JobTracker to schedule tasks to the slaves.
 In addition to scheduling, it also monitors the task and provides status information to the job-client.
PROCESSING DATA WITH HADOOP
 MapReduce Daemons
 JobTracker:
 It provides connectivity between Hadoop and your application.
 When you submit code to the cluster, JobTracker creates the execution plan by deciding which task to
assign to which node. It also monitors all the running tasks.
 When a task fails, it automatically re-schedules the task to a different node after a predefined number of
retries. JobTracker is a master daemon responsible for executing the overall MapReduce job. There is a
single JobTracker per Hadoop cluster.
 TaskTracker:
 This daemon is responsible for executing individual tasks that are assigned by the JobTracker.
 There is one TaskTracker per slave node, and it spawns multiple Java Virtual Machines (JVMs) to handle
multiple map or reduce tasks in parallel. TaskTracker continuously sends heartbeat messages to
JobTracker.
 When the JobTracker fails to receive a heartbeat from a TaskTracker, it assumes that the TaskTracker has
failed and resubmits the task to another available node in the cluster.
PROCESSING DATA WITH HADOOP
 MapReduce Daemons
PROCESSING DATA WITH HADOOP
 How Does MapReduce Work?
PROCESSING DATA WITH HADOOP
 How Does MapReduce Work?
 Steps in MapReduce Programming
 First, the input dataset is split into multiple pieces of data (several small subsets).
 The framework creates a master and multiple worker processes, which execute the tasks remotely.
 Several map tasks run in parallel, processing data assigned to them.
 Each map worker extracts relevant data from its node.
 The map function generates key/value pairs from the extracted data.
 Map worker uses a partitioner function to divide data into regions.
 The partitioner determines which reducer receives which part of the output.
 Once map tasks are completed, the master instructs the reducers to begin processing.
 Reducers fetch key/value data from mappers.
 The data is shuffled and sorted by keys before processing.
 The reduce function is executed for every unique key, and the final output is written to the file system.
 After all reduce workers finish, the master transfers control back to the user program.
PROCESSING DATA WITH HADOOP
 How Does MapReduce Work?
PROCESSING DATA WITH HADOOP
 MapReduce Example
 The famous example for MapReduce Programming is Word Count.
 For example, consider you need to count the occurrences of similar words across 50 files.
 You can achieve this using MapReduce Programming.
 Word Count MapReduce Programming using Java
 The MapReduce Programming requires three components:
 Driver Class: This class specifies Job Configuration details.
 Mapper Class: This class overrides the Map Function based on the problem statement.
 Reducer Class: This class overrides the Reduce Function based on the problem statement.
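For reference, a minimal sketch of these three components for Word Count, using the standard org.apache.hadoop.mapreduce API (class names and paths are illustrative, not taken from the slides):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper Class: emits (word, 1) for every word in an input line.
        public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer Class: sums the counts for each word.
        public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver Class: specifies the job configuration details.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }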
PROCESSING DATA WITH HADOOP
 MapReduce Example
Managing resources and
applications with Hadoop
YARN(Yet Another Resource
Negotiator)
Managing resources and applications with Hadoop
YARN
 Apache Hadoop YARN is a sub-project of Hadoop 2.x.
 Hadoop 2.x has a YARN-based architecture. YARN is a general-purpose processing platform.
 YARN is not constrained to MapReduce only.
 You can run multiple applications in Hadoop 2.x, all of which share common resource management.
 Now Hadoop can be used for various types of processing such as Batch, Interactive, Online, Streaming, Graph,
and others.
Managing resources and applications with Hadoop
YARN
 Limitations of Hadoop 1.0 Architecture
 In Hadoop 1.0, HDFS and MapReduce are Core Components, while other components are built around them.
 Single NameNode manages the entire namespace for the Hadoop Cluster.
 It has a restricted processing model, which is suitable only for batch-oriented MapReduce jobs.
 Hadoop MapReduce is not suitable for interactive analysis.
 Hadoop 1.0 is not suitable for machine learning, graphs, and other memory-intensive algorithms.
 In Hadoop 1.0, MapReduce is responsible for both cluster resource management and data processing.
 In this architecture, map slots might be "full", while the reduce slots are empty and vice versa.
 This causes resource utilization issues, requiring improvements for better resource utilization.
 HDFS Limitation
 The NameNode stores all file metadata in main memory. Although modern systems have large memory, there
is still a limit on how many objects a single NameNode can handle.
 The NameNode can become overwhelmed as the system load increases.
 In Hadoop 2.x, this issue is resolved using HDFS Federation.
Managing resources and applications with Hadoop
YARN
 Hadoop 2: HDFS
 HDFS 2 consists of two major components:
 Namespace Service – Manages file-related operations (creating, modifying files, directories).
 Block Storage Service – Manages data storage, replication, and DataNode cluster management.
 HDFS 2 Features
 Horizontal scalability
 High availability
 HDFS Federation allows multiple independent NameNodes for horizontal scalability.
 NameNodes do not need to coordinate with each other.
 Instead, DataNodes store blocks shared across all NameNodes in the cluster.
 High availability in Hadoop 2.x is achieved using Passive Standby NameNode.
 The Active-Passive NameNode system automatically handles failover.
 All metadata edits are recorded to shared NFS storage, ensuring a single writer at any point in time, while the
Passive NameNode reads edits from shared storage.
Managing resources and applications with Hadoop
YARN
 Hadoop 2: HDFS
 In case of Active NameNode failure, the Passive NameNode becomes an Active NameNode automatically.
Then it starts writing to the shared storage.
 Figure describes the Active-Passive NameNode interaction.

 Active NameNode writes to Shared Edit Logs.


 Passive NameNode reads from these logs and takes over upon failure.
Managing resources and applications with Hadoop
YARN
 Hadoop 2: HDFS
 Hadoop 1.0: Uses MapReduce for both resource management and data processing.
 Hadoop 2.0: Introduces YARN, separating resource management from data processing. YARN allows other
processing frameworks beyond MapReduce, making Hadoop more versatile.
Managing resources and applications with Hadoop
YARN
 Hadoop 2 YARN: Taking Hadoop beyond Batch
 YARN helps us store all data in one place and interact with it in multiple ways for predictable performance
and service quality. Yahoo originally architected this concept.

 Supports Batch (MR), Interactive (TEZ), Online (HBASE), Streaming (Storm), In-Memory (SPARK), Graph
Processing, and Search.
 Cluster Resource Management is handled by YARN on top of HDFS 2.
Managing resources and applications with Hadoop
YARN
Managing resources and applications with Hadoop
YARN
 Hadoop 2 YARN: Taking Hadoop beyond Batch
 Fundamental Idea
 NodeManager:
 This is a per-machine slave daemon.
 NodeManager's responsibility is launching the application containers for application execution.
 NodeManager monitors resource usage such as memory, CPU, disk, network, etc.
 It then reports the usage of resources to the global ResourceManager.
 Per-application ApplicationMaster:
 This is an application-specific entity.
 Its responsibility is to negotiate required resources for execution from the ResourceManager.
 It works along with the NodeManager for executing and monitoring component tasks.
Managing resources and applications with Hadoop
YARN
 Hadoop 2 YARN: Taking Hadoop beyond Batch
 Basic Concepts
 Application:
 Application is a job submitted to the framework.
 Example - MapReduce Job.
 Container:
 Basic unit of allocation.
 Fine-grained resource allocation across multiple resource types (Memory, CPU, disk, network, etc.)
 (a) container_0 = 2GB, 1 CPU
 (b) container_1 = 1GB, 6 CPU
 Replaces the fixed map/reduce slots.
Managing resources and applications with Hadoop
YARN
 Hadoop 2 YARN: Taking Hadoop beyond Batch
 YARN Architecture:
 A client program submits the application which includes the necessary specifications to launch the
application-specific ApplicationMaster itself.
 The ResourceManager launches the ApplicationMaster by assigning some container.
 The ApplicationMaster, on boot-up, registers with the ResourceManager.
 This helps the client program to query the ResourceManager directly for details.
 During the normal course, ApplicationMaster negotiates appropriate resource containers via the resource-
request protocol.
 On successful container allocations, the ApplicationMaster launches the container by providing the
container launch specification to the NodeManager.
 The NodeManager executes the application code and provides necessary information such as progress,
status, etc., to its ApplicationMaster via an application-specific protocol.
 During the application execution, the client that submitted the job directly communicates with the
ApplicationMaster to get status, progress updates, etc., via an application-specific protocol.
Managing resources and applications with Hadoop
YARN
 Hadoop 2 YARN: Taking Hadoop beyond Batch
 YARN Architecture:
 Once the application has been processed completely, the ApplicationMaster deregisters with the
ResourceManager and shuts down, allowing its own container to be repurposed.
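As a rough illustration of the client-side part of this flow, below is a condensed sketch using the YarnClient API (the application name, ApplicationMaster command, and resource sizes are placeholders; a real client would also set up local resources, the environment, and security tokens):

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class YarnSubmitSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("demo-app");                       // placeholder name

            // Container launch specification for the ApplicationMaster (command is a placeholder).
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList("java MyApplicationMaster"));
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(1024, 1));           // 1 GB, 1 vcore for the AM

            // Submit; the ResourceManager then launches the ApplicationMaster in the assigned container.
            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted application " + appId);
        }
    }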
Introduction to Map Reduce
Programming
Introduction
 In MapReduce Programming, Jobs (Applications) are split into a set of map tasks and reduce tasks.
 Then these tasks are executed in a distributed fashion on a Hadoop cluster.
 Each task processes a small subset of data that has been assigned to it.
 This way, Hadoop distributes the load across the cluster.
 A MapReduce job takes a set of files stored in HDFS (Hadoop Distributed File System) as input.
 Map task takes care of loading, parsing, transforming, and filtering.
 The responsibility of the reduce task is grouping and aggregating data that is produced by map tasks to generate
the final output.
 Each map task is broken into the following phases:
 RecordReader
 Mapper
 Combiner
 Partitioner
 The output produced by the map task is known as intermediate keys and values.
Introduction
 These intermediate keys and values are sent to the reducer.
 The reduce tasks are broken into the following phases:
 Shuffle
 Sort
 Reducer
 Output Format
 Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
 This way, Hadoop ensures data locality.
 Data locality means that data is not moved over the network; only computational code is moved to process data,
which saves network bandwidth.
MAPPER
Mapper
 A mapper maps the input key-value pairs into a set of intermediate key-value pairs.
 Maps are individual tasks that have the responsibility of transforming input records into intermediate key-value
pairs.
 RecordReader:
 RecordReader converts a byte-oriented view of the input (as generated by the InputSplit) into a record-
oriented view and presents it to the Mapper tasks.
 It presents the tasks with keys and values. Generally, the key is the positional information, and the value
is a chunk of data that constitutes the record.
 Map:
 The Map function works on the key-value pair produced by RecordReader and generates zero or more
intermediate key-value pairs.
 The MapReduce framework decides the key-value pair based on the context.
Mapper
 Combiner:
 It is an optional component, but it improves performance by saving network bandwidth and disk space.
 It takes the intermediate key-value pairs produced by the mapper and applies a user-specified aggregate
function to the output of only that mapper.
 It is also known as a local reducer.
 Partitioner:
 The partitioner takes the intermediate key-value pairs produced by the mapper, splits them into shards,
and sends each shard to a particular reducer as per the user-specified code.
 Usually, records with the same key go to the same reducer.
 The partitioned data of each map task is written to the local disk of that machine and pulled by the
respective reducer.
REDUCER
REDUCER
 The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share a common key) to
a smaller set of values.
 The Reducer has three primary phases: Shuffle and Sort, Reduce, and Output Format.
 Shuffle and Sort:
 This phase takes the output of all the partitioners and downloads them into the local machine where the
reducer is running.
 Then these individual data pipes are sorted by key, producing one larger data list.
 The main purpose of this sort is to group equivalent keys together so that their values can be easily
iterated over by the reduce task.
 Reduce:
 The reducer takes the grouped data produced by the shuffle and sort phase, applies the reduce function, and
processes one group at a time.
 The reduce function iterates over all the values associated with that key.
 The reducer function provides various operations such as aggregation, filtering, and combining data.
 Once it is done, the output (zero or more key-value pairs) of the reducer is sent to the output format.
REDUCER
 Output Format:
 The output format separates key-value pairs with a tab (default) and writes them out to a file using a record
writer.
 Figure 8.1 describes the chores of Mapper, Combiner, Partitioner, and Reducer for the Word Count problem. The
Word Count problem has been discussed under “Combiner” and “Partitioner”.
COMBINER
COMBINER
 It is an optimization technique for a MapReduce Job.
 Generally, the reducer class is set to be the combiner class.
 The difference between the combiner class and the reducer class is as follows:
 Output generated by the combiner is intermediate data and is passed to the reducer.
 Output of the reducer is passed to the output file on disk.
 The sections have been designed as follows:
 Objective: What is it that we are trying to achieve here?
 Input Data: What is the input that has been given to us to act upon?
 Act: The actual statement/command to accomplish the task at hand.
 Output: The result/output as a consequence of executing the statement.
COMBINER
 Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use a combiner for
optimization.
 Input Data
 Welcome to Hadoop Session
 Introduction to Hadoop
 Introducing Hive
 Hive Session
 Pig Session
 Act
 In the driver program, set the combiner class as shown below:
 job.setCombinerClass(WordCounterRed.class);
 // Input and Output Path
 FileInputFormat.addInputPath(job, new Path("/mapreducedemos/lines.txt"));
 FileOutputFormat.setOutputPath(job, new Path("/mapreducedemos/output/wordcount/"));
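Putting the Act statements in context, a possible full driver might look like the sketch below. WordCounterRed is the reducer class named above; WordCounterMap is an assumed name for the matching mapper class, so this is illustrative rather than the exact listing from the slides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCounterDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count with combiner");
            job.setJarByClass(WordCounterDriver.class);
            job.setMapperClass(WordCounterMap.class);        // assumed mapper class name
            job.setCombinerClass(WordCounterRed.class);      // reducer reused as a local (per-mapper) combiner
            job.setReducerClass(WordCounterRed.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and Output Path
            FileInputFormat.addInputPath(job, new Path("/mapreducedemos/lines.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/mapreducedemos/output/wordcount/"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }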
COMBINER
COMBINER
PARTITIONER
PARTITIONER
 The partitioning phase happens after the map phase and before the reduce phase.
 Usually, the number of partitions is equal to the number of reducers.
 The default partitioner is the hash partitioner.
 Objective
 Write a MapReduce program to count the occurrence of similar words in a file. Use a partitioner to partition
the keys alphabetically, by their first letter (see the sketch after the input data below).
 Input Data
 Welcome to Hadoop Session
 Introduction to Hadoop
 Introducing Hive
 Hive Session
 Pig Session
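One possible sketch of an alphabet-based partitioner for this objective (the class name and the first-letter scheme are illustrative; it assumes (word, count) pairs coming out of the mapper):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each word to a reducer based on its first letter, so every reducer
    // receives an alphabetical range of keys.
    public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String word = key.toString();
            if (word.isEmpty() || numPartitions == 0) {
                return 0;
            }
            char first = Character.toUpperCase(word.charAt(0));
            if (first < 'A' || first > 'Z') {
                return 0;                                  // non-alphabetic keys go to partition 0
            }
            return (first - 'A') % numPartitions;
        }
    }

The driver would register the class with job.setPartitionerClass(AlphabetPartitioner.class) and set a matching number of reducers with job.setNumReduceTasks(...).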
PARTITIONER
PARTITIONER
SEARCHING
SEARCHING
 Objective
 Write a MapReduce program to search for a specific keyword in a file (a sketch follows the input data below).
 Input Data
 Welcome to Hadoop Session
 Introduction to Hadoop
 Introducing Hive
 Hive Session
 Pig Session
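A minimal map-only sketch for this task: the keyword is read from the job configuration under an assumed property name (search.keyword), and the mapper emits every line that contains it:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits each input line that contains the keyword passed in the job configuration.
    public class SearchMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private String keyword;

        @Override
        protected void setup(Context context) {
            keyword = context.getConfiguration().get("search.keyword", "");   // assumed property name
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(keyword)) {
                context.write(value, NullWritable.get());
            }
        }
    }

In the driver, the keyword would be set with conf.set("search.keyword", "Hive") and the job made map-only with job.setNumReduceTasks(0), so matching lines are written straight to the output.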
SORTING
SORTING
 Objective
 Write a MapReduce program to sort the data by student name (a sketch follows the input data below).
 Input Data
 1001,John,45
 1002,Aby,40
 1003,Beu,48
 1004,Ravi,45
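A minimal sketch that leans on the framework's shuffle-and-sort to do the ordering: the mapper emits the student name as the key, so records arrive at the reducer sorted by name (class names are illustrative; with a single reducer the final output is totally ordered):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SortByName {

        // Mapper: emits (name, whole record) so the shuffle sorts records by name.
        public static class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");   // id,name,marks
                if (fields.length == 3) {
                    context.write(new Text(fields[1]), value);
                }
            }
        }

        // Reducer: writes the records back out, now sorted and grouped by name.
        public static class SortReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text name, Iterable<Text> records, Context context)
                    throws IOException, InterruptedException {
                for (Text record : records) {
                    context.write(name, record);
                }
            }
        }
    }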
COMPRESSION
COMPRESSION
 In MapReduce programming, the MapReduce output file can be compressed as well. This provides two benefits:
 Reduces the space required to store files.
 Speeds up the data transfer across the network.
 Set the compression format in the driver program as shown below:
 conf.setBoolean("mapred.output.compress", true);
 conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
 Here, codec is the implementation of a compression and decompression algorithm.
 GzipCodec is the compression algorithm for gzip. This compresses the output file.
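With the newer org.apache.hadoop.mapreduce API, the equivalent settings are usually made through FileOutputFormat in the driver; a small fragment, assuming a Job object named job:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Compress the job output with gzip (new-API equivalent of the settings above).
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);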
