Chapter 4 - Understanding Map Reduce Fundamentals

Wollo University, Kombolicha Institute of Technology

Department of Software Engineering

Fundamentals of Big Data Analytics and Business Intelligence

By Ashenafi Workie (MSc.) and Bihonegn Abebe (MSc.)
Major chapters outline

Chapter 1: Introduction to Big Data Analytics
Chapter 2: Introducing Hadoop
Chapter 3: The Hadoop Distributed Filesystem
Chapter 4: Understanding Map Reduce Fundamentals
Chapter 5: Introducing Pig: Pig architecture
Assessment Methods

• Assignment 15%
• Test 20%
• Project & Lab Exercise 20%
• Final Exam 45%

What is MapReduce
• MapReduce is a software framework introduced by Google for processing huge datasets for certain kinds of problems on a distributed system.
• MapReduce is a parallel programming model developed by Google as a mechanism for processing large amounts of raw data, e.g., web pages the search engine has crawled.
• This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time.
• MapReduce is the processing engine of Hadoop that processes and computes large volumes of data.
• MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS).
• It is a core component, integral to the functioning of the Hadoop framework.
• With MapReduce, rather than sending data to where the application or logic resides, the logic is executed on the server where the data already resides, to expedite processing.
• Data access and storage is disk-based: the input is usually stored as files containing
  • structured,
  • semi-structured, or
  • unstructured data, and the output is also stored in files.
• MapReduce was once the only method through which the data stored in HDFS could be retrieved, but that is no longer the case.
• Today, other query-based systems such as Hive and Pig are used to retrieve data from HDFS using SQL-like statements.
• However, these usually run along with jobs that are written using the MapReduce model.
• That's because MapReduce has unique advantages.
Architecture of MapReduce
• The architecture of MapReduce can be depicted as shown in the diagram below.

[Figure: MapReduce architecture diagram]
Phases of MapReduce
• The MapReduce model has three major phases and one optional phase:
  • Mapping
  • Shuffling and Sorting
  • Reducing
  • Combining (optional)

• Mapping: the first phase of MapReduce programming.
  • The Mapping phase accepts key-value pairs (k, v) as input, where the key represents the address (offset) of each record and the value represents the entire record content.
  • The output of the Mapping phase is also in key-value format (k', v'). A minimal sketch of such a map function follows.
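As an illustration, here is a minimal, hypothetical sketch in plain Python (not the Hadoop API) of a word-count map function: it takes a (key, value) pair where the key is the record offset and the value is the record text, and emits intermediate (k', v') pairs.

```python
# Minimal sketch of the Mapping phase (plain Python, not the Hadoop API).
# Input: (k, v) where k is the record offset and v is the record content.
# Output: a list of intermediate (k', v') pairs.
def word_count_map(key, value):
    pairs = []
    for word in value.split():
        pairs.append((word.lower(), 1))  # emit (word, 1) for every word
    return pairs

# Example: one record at offset 0
print(word_count_map(0, "Deer Bear River"))
# [('deer', 1), ('bear', 1), ('river', 1)]
```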
• Shuffling and Sorting: the output of the various map tasks, the (k', v') pairs, then goes into the Shuffling and Sorting phase.
  • Pairs that share a key are grouped together, so each distinct key is collected with all of its values.
  • The output of the Shuffling and Sorting phase is again key-value pairs, now as a key and an array of values (k, v[ ]).

• Reducing: the output of the Shuffling and Sorting phase (k, v[ ]) is the input of the Reducer phase.
  • In this phase the reducer function's logic is executed, and all the values are collected against their corresponding keys.
  • The Reducer consolidates the outputs of the various mappers and computes the final output. A sketch of these two phases follows.
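Continuing the hypothetical plain-Python sketch, shuffle-and-sort can be modelled with a dictionary that groups values by key, and reduce as a function applied to each (key, value-list) pair; none of this is the actual Hadoop API.

```python
from collections import defaultdict

# Sketch of Shuffling and Sorting: group intermediate (k', v') pairs by key.
def shuffle_and_sort(intermediate_pairs):
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)            # collect values per key
    return sorted(groups.items())            # (k, v[]) pairs, sorted by key

# Sketch of the Reducing phase: aggregate each key's list of values.
def word_count_reduce(key, values):
    return (key, sum(values))                # final (k, v) output pair

pairs = [("bear", 1), ("deer", 1), ("bear", 1), ("river", 1)]
for key, values in shuffle_and_sort(pairs):
    print(word_count_reduce(key, values))
# ('bear', 2) ('deer', 1) ('river', 1)
```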
• Combining: an optional phase in the MapReduce model.
  • The combiner phase is used to optimize the performance of the MapReduce phases.
  • It makes the Shuffling and Sorting phase quicker by pre-aggregating each mapper's output locally, so less intermediate data has to be shuffled. A sketch follows.
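A hypothetical sketch of a combiner, again in plain Python rather than the Hadoop API: it applies the same aggregation as the reducer, but only to the pairs produced by a single mapper, before they are shuffled.

```python
from collections import defaultdict

# Sketch of the optional Combining phase: a "mini-reduce" run locally on
# one mapper's output before shuffling, shrinking the data to transfer.
def word_count_combine(mapper_output):
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count                 # aggregate within this mapper only
    return list(local.items())

one_mapper = [("bear", 1), ("bear", 1), ("deer", 1), ("bear", 1)]
print(word_count_combine(one_mapper))        # [('bear', 3), ('deer', 1)]
```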
How MapReduce works

• At the crux of MapReduce are two functions: Map and Reduce. They are sequenced one after the other.
• The Map function takes input from the disk as <key, value> pairs, processes them, and produces another set of intermediate <key, value> pairs as output.
• The Reduce function also takes its input as <key, value> pairs, and produces <key, value> pairs as output.
• Let us now take a close look at each of the phases and try to understand their significance.
• Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys. A sketch of a record reader appears below.
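As an illustration, a hypothetical record reader for a plain text file can be sketched in Python as a generator that yields (byte offset, line) pairs, the usual convention for text input; this is an assumption-laden stand-in, not Hadoop's actual RecordReader.

```python
# Sketch of an Input Phase record reader for a text file: yields one
# (key, value) pair per record, keyed by the record's byte offset.
def text_record_reader(path):
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            yield offset, line.decode("utf-8").rstrip("\n")
            offset += len(line)              # next record's byte offset

# Usage: feed each (offset, line) pair to a map function, e.g.
# for key, value in text_record_reader("input.txt"):
#     word_count_map(key, value)
```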
• Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets.
• It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper.
• It is not a part of the main MapReduce algorithm; it is optional.

• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step.
• It downloads the grouped key-value pairs onto the local machine where the Reducer is running.
• The individual key-value pairs are sorted by key into a larger data list.
• The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
• Let us try to understand the two tasks, Map and Reduce, with the help of a small diagram.

[Figure: Map and Reduce task diagram]
• The types of keys and values differ based on the use case. All inputs and outputs are stored in HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.
• <k1, v1> -> Map() -> list(<k2, v2>)
  <k2, list(v2)> -> Reduce() -> list(<k3, v3>)
• Mappers and Reducers are the Hadoop servers that run the Map and Reduce functions respectively. It doesn't matter whether these are the same or different servers.
[Figures: How MapReduce works; MapReduce workflows; MapReduce with example]
Map

• The input data is first split into smaller blocks. Each block is then assigned to a mapper for processing.
• For example, if a file has 100 records to be processed, 100 mappers can run together to process one record each, or 50 mappers can run together to process two records each. The Hadoop framework decides how many mappers to use, based on the size of the data to be processed and the memory available on each mapper server. A sketch of splitting records among mappers follows.
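A toy illustration in plain Python; the block size is an assumed parameter here, whereas in Hadoop the framework chooses the split count itself.

```python
# Toy sketch: split a file's records into blocks, one block per mapper.
# Hadoop decides the real split count; records_per_mapper is illustrative.
def split_into_blocks(records, records_per_mapper):
    for i in range(0, len(records), records_per_mapper):
        yield records[i:i + records_per_mapper]

records = [f"record {n}" for n in range(1, 101)]   # a file with 100 records
blocks = list(split_into_blocks(records, records_per_mapper=2))
print(len(blocks))        # 50 blocks -> 50 mappers, two records each
```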
Partition

• Partition is the process that translates the <key, value> pairs resulting from mappers to another set of <key, value> pairs to feed into the reducer. It decides how the data has to be presented to the reducer and also assigns it to a particular reducer.
• The default partitioner determines the hash value for the key resulting from the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete, the data from each partition is sent to a specific reducer. A sketch of a default-style hash partitioner follows.
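The following is a minimal sketch of hash partitioning in plain Python; Hadoop's actual default is the Java HashPartitioner, so the hash function here is only illustrative.

```python
# Sketch of the default partitioning scheme: hash the key, then take it
# modulo the number of reducers, so equal keys always reach the same reducer.
# (Python's hash() for strings varies between runs but is stable within one.)
def partition_for(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 2
pairs = [("bear", 2), ("deer", 1), ("river", 2), ("car", 3)]
partitions = [[] for _ in range(num_reducers)]
for key, value in pairs:
    partitions[partition_for(key, num_reducers)].append((key, value))
print(partitions)   # each inner list goes to one specific reducer
```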
Reduce

• After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers.
• A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.
Combine

• There are two intermediate steps between Map and Reduce.
• Combine is an optional process. The combiner is a reducer that runs individually on each mapper server.
• It reduces the data on each mapper further to a simplified form before passing it downstream.
• This makes shuffling and sorting easier, as there is less data to work with. Often, the combiner class is set to the reducer class itself, which works when the reduce function is commutative and associative. However, if needed, the combiner can be a separate class as well.
[Figures: Characteristics of MapReduce; Real-time uses of MapReduce]
Example: MovieLens data (numerical)

Solution:
Step 1 − First we map the values; this happens in the first phase (Mapping) of the MapReduce model, producing key-value pairs:
196:242;
186:302;
196:377;
244:51;
166:346;
186:274;
186:265
Step 2 − After mapping, we shuffle and sort the values, grouping values under the same key:
166:346;
186:302,274,265;
196:242,377;
244:51
Step 3 − After completing steps 1 and 2, we reduce each key's values, counting how many values were collected under each key:
166:1;
186:3;
196:2;
244:1
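Putting the three steps together, a minimal end-to-end sketch in plain Python (with the count-per-key reduce this example uses) reproduces the output above:

```python
from collections import defaultdict

records = ["196:242", "186:302", "196:377", "244:51",
           "166:346", "186:274", "186:265"]

# Step 1 (Map): parse each record into a (key, value) pair.
mapped = [tuple(r.split(":")) for r in records]

# Step 2 (Shuffle and Sort): group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Step 3 (Reduce): count the values collected under each key.
for key in sorted(groups):
    print(f"{key}:{len(groups[key])}")
# 166:1  186:3  196:2  244:1
```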
Real-time examples
• Let us take a real-world example to comprehend the power of MapReduce.
• Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second.
• The following illustration shows how Twitter manages its tweets with the help of MapReduce.
• As shown in the illustration, the MapReduce algorithm performs the following actions (a sketch of this pipeline follows the list):
  • Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
  • Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
  • Count − Generates a token counter per word.
  • Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
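A hypothetical plain-Python sketch of those four actions on a handful of tweets; the tweet texts and the stop-word list are invented for illustration, and "aggregate" is interpreted here as grouping words that share the same counter value.

```python
from collections import Counter

tweets = ["big data is big", "mapreduce makes big data simple"]  # invented sample
stop_words = {"is", "makes"}                                     # invented filter list

# Tokenize: split each tweet into (token, 1) key-value pairs.
tokens = [(word, 1) for tweet in tweets for word in tweet.split()]

# Filter: drop unwanted words from the token stream.
filtered = [(word, n) for word, n in tokens if word not in stop_words]

# Count: generate a token counter per word.
counts = Counter()
for word, n in filtered:
    counts[word] += n

# Aggregate Counters: group words by equal counter values.
aggregated = {}
for word, count in counts.items():
    aggregated.setdefault(count, []).append(word)
print(counts)      # per-word counts
print(aggregated)  # words grouped by equal counter values
```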
Applications of MapReduce

• Applications that are not easy to implement using standard SQL include:
  • Indexing and search
  • Graph analysis
  • Text analysis
  • Machine learning
  • Data transformation, and many more.
End ….
