Hadoop Framework (MapReduce)
The Outline
1 What is MapReduce?
2 Examples of MapReduce use cases
3 The map and reduce functions
4 Architecture of MapReduce
5 MapReduce workflow
6 Execution of a MapReduce job: Yarn / Hadoop v1
7 More examples
Let’s explore: MAPREDUCE
Hadoop general architecture (2.0)
(Figure: HDFS is the storage unit of Hadoop, YARN (Resource Manager) is the resource management unit of Hadoop, and MapReduce is the processing unit of Hadoop.)
What is MapReduce?
It is the processing unit of Hadoop.
It was introduced by Google in 2004.
It is a programming technique where huge data is processed in a parallel and distributed manner.
The objective of this module is to reduce data into information that people can understand and that is valuable.
(Figure: Big Data → processing unit of Hadoop (processor) → Output.)
MapReduce Analogy: example of counting votes in elections
MapReduce Analogy: example of counting votes in elections (continued)
MapReduce vs Traditional approach?
(Figure: traditional approach – data is processed at the Master node; MapReduce approach – data is processed at the Slave nodes.)
What is MapReduce?
It was designed in the early 2000s by engineers at Google Research.
Within this module, processing is done at the slave nodes and the final output is sent to the Master node.
This overcomes the traditional approach shown previously, in which processing is done at the Master node.
In that approach, a large amount of bandwidth is consumed transferring the needed data from the DataNodes to the program running on the Master node.
In MapReduce, parallel processing is implemented: jobs are processed in parallel by the slave nodes.
What is MapReduce?
It is an illustration of applying functional programming on a distributed cluster: several functions work independently on different chunks of data, and their results are then aggregated.
MapReduce is the processing engine of Hadoop that processes and computes large volumes of data (terabytes, petabytes). This processing unit implements a Mapping and Reducing mechanism that executes in parallel. Here is the global overview:
(Figure: the processing unit of Hadoop consists of Map and Reduce, each a user-defined code.)
Applications of MapReduce?
MR is commonly preferred for the following applications:
Searching, sorting, grouping.
Simple statistics: counting, ranking.
Complex statistics: covariance.
Pre-processing huge data to apply machine learning algorithms.
Text processing, index building, graph creation and analysis, pattern recognition,
collaborative filtering, sentiment analysis.
What is MapReduce? the map function
Here is the idea of the map function: it takes two arguments, a key and a value. After performing some analysis or transformation on the input data, it may then output zero or more resulting key/value pairs.
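A minimal Python sketch of such a map function, applied to word counting (the function name and the generator-based signature are assumptions for illustration, not the Hadoop API):

def map_function(key, value):
    # Map: the key (e.g. a line offset) is ignored here; the value is one
    # line of text. Emit a (word, 1) pair for every word found.
    for word in value.split():
        yield (word, 1)

# Example: list(map_function(0, "I am a big data expert"))
# -> [('I', 1), ('am', 1), ('a', 1), ('big', 1), ('data', 1), ('expert', 1)]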
The parallelization of map function
The key point is the possibility of parallelizing these functions in order to compute much faster on a multi-core machine or on a cluster of machines.
The map function is parallelizable, because the calculations are independent.
The map function must be a pure function of its parameters: it must have no side effects, such as modifying a global variable or memorizing its previous values.
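As a hedged illustration of this idea (plain Python, not Hadoop), a pure map function can be applied to independent chunks in parallel with the standard multiprocessing module; the records and the extract_price helper are assumptions for the sketch:

from multiprocessing import Pool

def extract_price(record):
    # Pure function: it depends only on its argument and has no side effects,
    # so calls can run in any order and on any worker process.
    return record['Price']

if __name__ == '__main__':
    records = [{'Price': 4200}, {'Price': 8840}, {'Price': 10240}, {'Price': 4300}]
    with Pool(processes=4) as pool:
        prices = pool.map(extract_price, records)
    print(prices)  # [4200, 8840, 10240, 4300]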
What is MapReduce? the reduce function
The reducer is intended to aggregate the many values that are output from the map phase, in order to transform a large volume of data into a smaller, more manageable set of summary data.
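A minimal Python sketch of such a reduce function for word counting, assuming it receives one key together with an iterable of all the values emitted for that key (the name and signature are illustrative, not the Hadoop API):

def reduce_function(key, values):
    # Reduce: aggregate all values observed for a single key
    # into one summary value (here, their sum).
    yield (key, sum(values))

# Example: list(reduce_function('big', [1, 1]))  ->  [('big', 2)]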
The parallelization of reduce function
The reduce function is only partially parallelizable, in a hierarchical form; for example:
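As a hedged sketch of this hierarchical scheme: when the reduce operation is associative (such as max or sum), partial reductions can be computed independently on sub-lists and then combined in a second level. The chunking below is an assumption for illustration; the values are those of the cars example used later:

from functools import reduce

def max_reduce(a, b):
    # Associative reduce operation: keep the larger of two values.
    return a if a >= b else b

values = [4200, 8200, 8840, 10240, 4300, 6140]

# Level 1: each worker reduces its own chunk independently.
chunks = [values[0:2], values[2:4], values[4:6]]
partials = [reduce(max_reduce, chunk) for chunk in chunks]  # [8200, 10240, 6140]

# Level 2: the partial results are reduced into the final answer.
print(reduce(max_reduce, partials))  # 10240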
Architecture of MapReduce?
MapReduce is the processing engine of Hadoop that processes and computes large volumes of data
(terabytes, petabytes). This processing unit implements a Mapping and Reducing mechanism. Here is
the global architecture:
(Figure: the input data is fed to several Map() tasks in parallel; their outputs feed the Reduce() tasks, which produce the output.)
What is MapReduce? (con’t)
Hadoop MapReduce jobs are divided into a set of map tasks and reduce tasks that run in a
distributed fashion on a cluster of computers.
Each task works on the small subset of the data it has been assigned so that the load is
spread across the cluster.
The map tasks generally load, parse, transform, and filter data.
Each reduce task is responsible for handling a subset of the map task output.
Intermediate data is then copied from mapper tasks by the reducer tasks in order to group
and aggregate the data.
MapReduce: Implemented on a Cluster
Each node gets a copy of the mapper operation, and applies the mapper to the key/value
pairs that are stored in the blocks of data of the local HDFS data nodes.
There can be any number of mappers working independently on as much data as
possible.
The mappers’ output is not dependent on anything but the incoming values, and
therefore failed mappers can be reattempted on another node.
Reducers require as input the output of the mappers on a per-key basis; so, reducer
computation can also be distributed such that there can be as many reduce operations as
there are keys available from the mapper output.
A shuffle and sort operation is required to coordinate the map and reduce phases, so that each reducer sees all the values for a single, unique key.
MapReduce: Implemented on a Cluster
MapReduce is implemented as a staged framework in which a map phase is coordinated with a reduce phase via an intermediate shuffle and sort phase.
Workflow of MapReduce? Basic example
A MapReduce word count job is defined by the following steps (see the sketch after this list):
1. The input data is split into records.
2. Map functions process these records and produce a key/value pair for each word.
3. All key/value pairs output by the map functions are merged together, grouped by key, shuffled and sorted.
4. The intermediate results are transmitted to the reduce function, which produces the final output.
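Here is a minimal, self-contained Python simulation of these four steps for word counting (a sketch of the logic only, not the Hadoop API; the input lines are those used in the following slides):

from collections import defaultdict

records = ["I am a big data expert",
           "I can handle big data efficiently"]       # 1. input split into records

def map_fn(record):                                   # 2. map: emit (word, 1) pairs
    return [(word, 1) for word in record.split()]

def shuffle_and_sort(pairs):                          # 3. merge, group by key, sort
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):                           # 4. reduce: sum the counts
    return (key, sum(values))

mapped = [pair for record in records for pair in map_fn(record)]
print([reduce_fn(key, values) for key, values in shuffle_and_sort(mapped)])
# [('I', 2), ('a', 1), ('am', 1), ('big', 2), ('can', 1), ('data', 2), ...]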
Workflow of MapReduce? More details
The input data (stored in HDFS) is split by an InputFormat before the Map operation starts.
A map task is assigned to one or more nodes in the cluster that contain local blocks of the data specified as input to the map operation.
A common optimization at this point is to apply a combiner: an aggregation of the mapping output of a single mapper, which facilitates the reducers' work.
The intermediate keys are pulled from the map processes to a partitioner. The partitioner decides how to allocate the keys to the reducers; a hash function is used to divide the keys evenly among the reducers.
The partitioner also sorts the key/value pairs, so that the full "shuffle and sort" phase is implemented.
The reducers pull an iterator of data for each key and perform a reduce operation such as an aggregation. Their output key/value pairs are then written back to HDFS using an OutputFormat class.
Workflow of MapReduce?
The input to a MapReduce job is a set of files in the data store that are spread out over the Hadoop
Distributed File System (HDFS).
In Hadoop, these files are split with an input format, which defines how to separate a file into input splits. The split size (and hence the number of splits) is generally controlled in mapred-site.xml by mapred.min.split.size and mapred.max.split.size.
An input split is a byte-oriented view of a chunk of the file to be loaded by a map task.
Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and partitioner. The output of the map tasks, called the intermediate keys and values, is sent to the reducers.
The reduce tasks are broken into the following phases: shuffle, sort, reducer, and output format.
The map tasks optimally run on the nodes where the data rests. This way, the data typically does not have to move over the network and can be processed on the local machine.
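As a rough illustration of how these settings interact, the split size is derived from the configured minimum/maximum split sizes and the HDFS block size roughly as follows (a Python re-implementation of the usual formula, for illustration only; the variable names are assumptions):

def compute_split_size(block_size, min_split_size, max_split_size):
    # Split size is the block size, clamped between the configured
    # minimum and maximum split sizes.
    return max(min_split_size, min(max_split_size, block_size))

block = 128 * 1024 * 1024                        # 128 MB HDFS block
split = compute_split_size(block, 1, 2**63 - 1)  # with defaults: split == block size
print((1024 * 1024 * 1024) // split)             # a 1 GB file -> about 8 splits / map tasks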
MapReduce workflow: illustration of word count within a document
MapReduce workflow: I- Input Split
Input is stored in the Hadoop Distributed File System. It is split based on the input format, which specifies how the input will be read and split. Each split will be handled by an individual mapper.
(Figure, word count example: the input data stored in HDFS, "I am a big data expert / I can handle big data efficiently / …", is split by the text input format into input splits, one per line.)
MapReduce workflow: II-Record Reader
The record reader communicates with the input split and converts the data into key-value pairs that are suitable to be read by the mapper.
(Figure, word count example: each input split, e.g. "I am a big data expert" and "I can handle big data efficiently", is handled by its own RecordReader.)
MapReduce workflow: III- the mapper
The mapper processes the input records coming from the RecordReader and generates intermediate key-value pairs.
(Figure: each RecordReader feeds one Mapper, which processes the input records and generates intermediate key-value pairs.)
MapReduce workflow: IV- the combiner
The combiner is a mini reducer: there is one combiner per mapper, and it performs a local aggregation of that mapper's output (a sketch is given below).
(Figure: each Mapper's intermediate key-value pairs are passed to its own Combiner.)
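For word counting, a combiner can simply pre-sum the (word, 1) pairs produced by one mapper before they are sent over the network. A minimal sketch (illustrative function names, not the Hadoop API):

from collections import Counter

def combine(mapper_output):
    # Mini-reduce: locally aggregate one mapper's (word, 1) pairs,
    # so that fewer pairs have to be shuffled to the reducers.
    counts = Counter()
    for word, count in mapper_output:
        counts[word] += count
    return list(counts.items())

one_mapper = [('big', 1), ('data', 1), ('big', 1), ('data', 1)]
print(combine(one_mapper))  # [('big', 2), ('data', 2)]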
MapReduce workflow: V- the partitioner
The partitioner decides how the outputs of the combiners are distributed to the reducers (a sketch is given below).
(Figure: the combiners' intermediate key-value pairs are passed to the partitioners, which route them to the reducers.)
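A common default partitioning scheme hashes the key modulo the number of reducers, so that the same key always goes to the same reducer. A hedged Python sketch of the idea (note that Python's built-in hash is salted per run, whereas Hadoop uses a deterministic hash):

def partition(key, num_reducers):
    # Equal keys always map to the same reducer; the hash spreads
    # distinct keys roughly evenly over the reducers.
    return hash(key) % num_reducers

for key, value in [('big', 2), ('data', 2), ('I', 1), ('expert', 1)]:
    print(key, '-> reducer', partition(key, 3))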
MapReduce workflow: VI- Sorting
(Figure, word count example: the partitioners' outputs are shuffled and sorted so that pairs with the same key become adjacent. Result of shuffling and sorting: a, 1; am, 1; big, 1; big, 1; can, 1; data, 1; data, 1; efficiently, 1; expert, 1; I, 1; I, 1; handle, 1.)
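The shuffling and sorting above can be reproduced with the standard sort-then-group idiom; a small sketch using Python's itertools (the pairs mirror the example of these slides):

from itertools import groupby
from operator import itemgetter

mapped = [('I', 1), ('am', 1), ('a', 1), ('big', 1), ('data', 1), ('expert', 1),
          ('I', 1), ('can', 1), ('handle', 1), ('big', 1), ('data', 1), ('efficiently', 1)]

# Sort by key, then group: each reducer sees one key with all of its values.
for key, group in groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0)):
    print(key, [value for _, value in group])
# I [1, 1]   a [1]   am [1]   big [1, 1]   can [1]   data [1, 1]   ...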
MapReduce workflow: VII- the reducer
The reducer combines all the intermediate values for each intermediate key into a list, called a tuple.
(Figure, word count example, at the reducer stage: a, 1; am, 1; big, 1,1; can, 1; data, 1,1; efficiently, 1; expert, 1; I, 1,1; handle, 1.)
MapReduce workflow: VIII- the output
The RecordWriter writes the output key-value pairs from the reducer to the output files.
(Figure, word count example, final output: a, 1; am, 1; big, 2; can, 1; data, 2; efficiently, 1; expert, 1; I, 2; handle, 1.)
Summary of MapReduce phases
Input data: the data stored in blocks in the HDFS DataNodes.
Input split format: the phase where the splitting of the source data is done (the splitting criteria are defined by developers; for example, the splitting can be by lines).
RecordReader: communicates with the input split and converts the data into key-value pairs that are suitable to be read by the mapper.
Map tasks: read the data and process it to generate key-value pairs. Within these tasks, mapping functions are applied to the blocks of data stored in HDFS, depending on the file size (see the HDFS section).
Reduce tasks: perform operations on the output generated by the mapping phase. This is the reducing function: shuffling, sorting, and aggregating the key-value pairs into a smaller set of tuples.
Output: the smaller set of tuples is the final output and gets stored in HDFS.
MapReduce workflow recap
(Figure: input data stored in HDFS → input format → InputSplit → RecordReader → Mapper → Combiner → Partitioner → shuffling and sorting → Reducer → output format → output data stored in HDFS.)
How many mappers and reducers?
They can be set in a configuration file (JobConf file) or on the command line at execution time.
Number of maps: driven by the number of HDFS blocks in the input files. The HDFS block size can be adjusted to tune the number of maps.
Number of reducers: can be user-defined (the default is 1).
Partition function:
The keyspace of the intermediate key-value pairs is evenly distributed over the reducers with a hash function (the same key from different mappers ends up at the same reducer).
The partitioning of the intermediate key-value pairs generated by the maps can be customized.
MapReduce architecture: within Yarn
The input data is split into smaller pieces. These chunks of data will be processed in parallel, in independent tasks, and the output will be a list. This splitting is done based on the input format.
The Resource Manager breaks the client's application into smaller tasks and assigns them to the different nodes. Each node handles its tasks thanks to its Node Manager (responsible for resource allocation and management within that node).
Within each node, the mapping, combining and partitioning functions are executed on the chunk of data. The processing results at each node are stored in the RAM of the corresponding machine.
At the level of the rack, the reduce function is called to produce the outputs, which are written back to HDFS.
Submitting MapReduce job to Yarn
(Figure: the MR client submits a job request to the Resource Manager, which contains the Scheduler and the Applications Manager. An App Master is started in a container on a Node Manager, and further containers on the Node Managers run the tasks. The arrows show job submission, node status, MapReduce status, and resource requests.)
MapReduce architecture: within Yarn
The Yarn layer consists of two components: the Scheduler and the Applications Manager.
The Scheduler allocates resources and does not participate in running or monitoring an application. Resources are allocated as resource containers, with each resource container assigned a specific amount of memory.
The Applications Manager accepts MapReduce job submissions from clients and starts processes called Application Masters to run the jobs.
There is one Application Master daemon, running on a slave node, for each client application.
Submitting a MapReduce job (old version of Hadoop)
(Figure: job submission in Hadoop v1, where the master daemon is the JobTracker and the slaves are TaskTrackers. The master is called the Resource Manager in Hadoop version 2, and the slaves are called Node Managers.)
MapReduce example 1: counting songs
Counting the songs that were played. Use case: Google Music analytics, to find the top ten trending songs (sketched below).
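A hedged Python sketch of this use case (the play log and song names are invented for illustration): count how many times each song was played, then keep the ten most played.

from collections import defaultdict

plays = ["songA", "songB", "songA", "songC", "songB", "songA"]   # play log (illustrative)

# Map: emit a (song, 1) pair for every play event.
mapped = [(song, 1) for song in plays]

# Shuffle/sort + reduce: sum the counts per song.
counts = defaultdict(int)
for song, one in mapped:
    counts[song] += one

# Keep the top ten trending songs.
top_ten = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_ten)  # [('songA', 3), ('songB', 2), ('songC', 1)]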
MapReduce example 2:
Consider the following sample of a cars dataset:
tuples = [
{'id':1, 'Brand':'Renault', 'Model':'Clio', 'Price':4200},
{'id':3, 'Brand':'Fiat','Model':'500', 'Price':8840},
{'id':4, 'Brand':'Fiat','Model':'600', 'Price':10240},
{'id':5, 'Brand':'Peugeot', 'Model':'206', 'Price':4300},
…]
To calculate the maximum or the average price, the Map and Reduce functions can be written as:
• FunctionM: extracts the desired value from the tuples.
• FunctionR: a grouping or aggregation function.
MapReduce example 2: analyzing the cars’ price
To analyze the cars' prices by calculating the average, the minimum and the maximum, the MapReduce program looks like this:
• FunctionM extracts the price of a car from the dataset.
• FunctionR calculates the maximum of the prices.
For efficiency, the intermediate values are not stored but transmitted between the two functions by a kind of pipe (as in Unix). A Python sketch of the MapReduce job follows.
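A minimal Python sketch consistent with this description, in which FunctionM extracts the price, FunctionR keeps the maximum, and the laziness of map() plays the role of the pipe (the exact original listing may differ):

from functools import reduce

tuples = [
    {'id': 1, 'Brand': 'Renault', 'Model': 'Clio', 'Price': 4200},
    {'id': 3, 'Brand': 'Fiat',    'Model': '500',  'Price': 8840},
    {'id': 4, 'Brand': 'Fiat',    'Model': '600',  'Price': 10240},
    {'id': 5, 'Brand': 'Peugeot', 'Model': '206',  'Price': 4300},
]

def FunctionM(car):
    # Map: extract the desired value (the price) from one tuple.
    return car['Price']

def FunctionR(a, b):
    # Reduce: aggregation function, here the maximum of two prices.
    return a if a >= b else b

# map() returns a lazy iterator, so values flow into reduce() like a Unix
# pipe, without being stored in an intermediate list.
print(reduce(FunctionR, map(FunctionM, tuples)))  # 10240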
MapReduce example 2: analyzing the cars' price (con't)
The following depicts the MapReduce job of this example:
(Figure: tuples → map → values = [4200, 8200, 8840, 10240, 4300, 6140] → reduce → maximum price = 10240.)
MapReduce configuration file: mapred-site.xml
(Figure: the mapred-site.xml file specifies the location of the resource manager, specifies that MapReduce will use Yarn for resource management, and sets the location where previous jobs shall be saved.)