Chapter 4 - Understanding MapReduce Fundamentals
MapReduce consists of several phases: Mapping, Shuffling and Sorting, and Reducing, with an optional Combining phase. In the Mapping phase, input records are processed as key-value pairs, and the output is also emitted as key-value pairs. Shuffling and Sorting then groups all values that share a key into a list, ordered by key. The Reducing phase aggregates each key's list of values into the final output. The Combining phase optimizes processing by pre-reducing data at the mapper level. Each phase is crucial for efficient data transformation in a distributed environment.
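To make the phase-by-phase flow concrete, here is a minimal single-process sketch in plain Java. It only simulates the phases in memory and is not the Hadoop framework itself; the sample records and the word-count logic are illustrative assumptions.

import java.util.*;
import java.util.stream.*;

// Simulates the MapReduce phases in a single JVM: map, shuffle/sort, reduce.
public class PhaseWalkthrough {
    public static void main(String[] args) {
        List<String> inputRecords = List.of("to be or not to be", "to do is to be");

        // Mapping phase: each input record becomes a set of intermediate (key, value) pairs.
        List<Map.Entry<String, Integer>> mapped = inputRecords.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Shuffling and Sorting: pairs are grouped by key into (key, list-of-values),
        // and the keys are kept in sorted order.
        SortedMap<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reducing phase: each key's list of values is aggregated into one final (key, value) pair.
        shuffled.forEach((word, counts) ->
                System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}

Running this prints each distinct word with its total count, mirroring the key-value output a real Reducer would write.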
MapReduce processes large datasets by distributing data across thousands of machines for parallel computation. The main components are the Map and Reduce functions. The Map function reads input key-value pairs from disk, processes them, and produces intermediate key-value pairs. The Reduce function takes these intermediate pairs and combines them to form the final output. The entire process moves through the phases described above: Mapping, Shuffling and Sorting, and Reducing. Optionally, the framework can include a Combiner to optimize performance by reducing the data transferred during Shuffling.
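As a sketch of what the Map function looks like in the Hadoop Java API, the following mapper emits an intermediate (word, 1) pair for every token in a line of text. The word-count use case and the class name are assumptions for illustration; only the Mapper API itself comes from Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map function: the input key is the byte offset of a line in the file, the input
// value is the line itself; the output is one intermediate (word, 1) pair per token.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}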
MapReduce addresses several challenges in big data environments, most notably the need to process vast data volumes in parallel across distributed systems. It works with HDFS for efficient data storage and retrieval, where data locality minimizes data movement across the network. MapReduce also simplifies programming for scalability and error recovery, managing tasks across the cluster on the programmer's behalf, which is critical for processing the semi-structured and unstructured data that traditional systems struggle with. Through these mechanisms, it improves data retrieval and processing speed.
A Combiner is employed to optimize overall MapReduce performance by reducing the volume of data that must be moved from the Mapper to the Reducer, thus minimizing data transfer during the Shuffling phase. It acts as a mini-reducer that runs on the output of each Mapper, aggregating repeated keys locally before anything crosses the network. This optimization is particularly beneficial when the function used is both associative and commutative, since such a function can be applied locally without changing the final result.
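A common way to wire in a combiner with the Hadoop Java API is to reuse a sum-style reducer as the combiner, which is valid here precisely because addition is associative and commutative. The driver below is a sketch built on the illustrative TokenizerMapper above and the IntSumReducer sketched after the next paragraph; the job and class names are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver for the word-count job: the reducer class doubles as the combiner, so each
// mapper's output is pre-aggregated locally before it is shuffled across the network.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // mini-reducer run on map-side output
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}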
The Reducer function in MapReduce acts on the output of the Shuffling and Sorting phase, which groups the Mappers' output into keys with their associated lists of values. Each Reducer processes these (key, list-of-values) pairs, applying user-defined logic to combine the values associated with a particular key. The final output is a smaller set of key-value pairs that summarizes the data; the combining logic could be summing, averaging, or another aggregate operation, depending on the application.
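Continuing the illustrative word-count example, a Reducer for it might look like the following sketch; the class name and the summing logic are assumptions, while the Reducer API is Hadoop's.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce function: receives one key together with the Iterable of values grouped for it
// during Shuffling and Sorting, and emits a single aggregated (key, value) pair.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();          // user-defined aggregation: here, a simple sum
        }
        result.set(sum);
        context.write(key, result);      // final, reduced key-value pair
    }
}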
Partitioning in MapReduce organizes the outputs from Mappers into partitions before they are sent to Reducers. Each partition corresponds to a single Reducer, and by default each pair is assigned to a partition based on a hash of its key. The partitioner decides how intermediate key-value pairs are distributed among Reducers. Efficient partitioning ensures balanced workloads among Reducers, improving overall performance by avoiding bottlenecks caused by uneven data distribution. This step is crucial for optimizing data processing efficiency across a distributed system.
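As a sketch, a hash-based partitioner in the Hadoop Java API looks like the following; it mirrors the behavior of the default HashPartitioner, and the class name and key/value types (taken from the word-count example) are assumptions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns every intermediate (key, value) pair to one of numPartitions reducers based on
// the hash of the key, so all pairs that share a key are routed to the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the partition index is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class), with the number of partitions set by job.setNumReduceTasks(n); a skewed getPartition implementation is exactly what produces the uneven reducer workloads described above.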
MapReduce, while efficient for large-scale data processing, has limitations, particularly with structured and semi-structured data, when compared to SQL-based systems. Its programming model is lower-level and less intuitive, requiring more complex code for operations that are simple to express in SQL. Additionally, MapReduce's disk-based approach to intermediate data can be slower than in-memory systems for iterative processing, making it less suited to tasks requiring repeated passes over the same data. It also lacks the interactive querying capabilities inherent in databases designed for structured data handling, limiting its use in real-time data analysis scenarios.
MapReduce is advantageous in scenarios requiring custom data processing algorithms or handling massive volumes of raw data. Unlike higher-level tools such as Hive and Pig, which are optimized for structured, query-style workloads and more limited in what they can express, MapReduce allows for more granular control and customizability in data processing, often necessary for complex tasks like iterative machine learning algorithms or detailed text analysis. Its framework efficiently processes unstructured and semi-structured data distributed across large cluster environments.
A notable real-world application of MapReduce is its use by Twitter to manage its massive data inflow. Twitter processes around 500 million tweets per day using MapReduce, in a workflow that involves tokenizing tweets, filtering out unwanted words, counting tokens, and aggregating the counts into manageable summary data. This workflow partitions the processing across thousands of server nodes running in parallel, demonstrating MapReduce's capacity to handle large-scale data with the rapid processing needed for timely analytics.
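A mapper for the tokenize-and-filter step of such a pipeline might look like the following sketch. This is not Twitter's actual code: the stop-word list, the clean-up regex, and the class name are all illustrative assumptions; counting and aggregation would be handled by a sum-style reducer such as the one sketched earlier.

import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenizes each tweet's text, drops unwanted words, and emits (token, 1) pairs
// for downstream counting and aggregation.
public class TweetTokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "rt");
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String raw : value.toString().toLowerCase().split("\\s+")) {
            String cleaned = raw.replaceAll("[^a-z#@]", "");  // strip punctuation, keep #tags and @handles
            if (!cleaned.isEmpty() && !STOP_WORDS.contains(cleaned)) {
                token.set(cleaned);
                context.write(token, ONE);
            }
        }
    }
}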
MapReduce optimizes data processing by executing logic close to where the data resides, rather than moving data across the network to a processing application. Because the input data lives in HDFS, the framework can schedule each Map task on, or near, a node that already stores the corresponding data block, and Reduce output is written back to HDFS. By minimizing data movement in this way, MapReduce leverages locality to handle data-heavy tasks efficiently.
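One way to see the locality information the scheduler works from is to inspect the input splits and the hosts that hold their underlying HDFS blocks. The snippet below is a sketch using Hadoop's InputSplit.getLocations(); the path argument and the printed format are assumptions.

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Prints each input split together with the hosts that store its data blocks.
// The scheduler uses these location hints to place Map tasks near their data.
public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split locations");
        FileInputFormat.addInputPath(job, new Path(args[0]));  // e.g. an HDFS input directory

        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
        }
    }
}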