Problem-Solving Using Mapreduce/Hadoop

This document discusses using MapReduce and Hadoop to solve big data problems. It begins by covering MapReduce concepts like using key-value pairs to store data in a distributed file system. It then provides examples of MapReduce algorithms for common problems like word count and PageRank. The PageRank example is explained in detail, showing the mapper, reducer, and iterative computation required. Overall, the document introduces MapReduce and Hadoop as approaches for parallel processing of large datasets.


Problem-solving using

MapReduce/Hadoop

B. RAMAMURTHY
THIS WORK IS SUPPORTED BY NSF GRANTS
NSF-DUE-TUES-0920335 (PHASE 2) &
NSF-ACI-1041280
Topics for Discussion

 Problem-solving approaches for big data
 Origins of MR/Hadoop
 Algorithms, data structures, and infrastructure
 Hello "Wordcount"
 Wordcount, MapReduce version
 MapReduce
 Hadoop
 Linked structures
 PageRank, MapReduce version
 Infrastructure
 Local: single-node Hadoop
 Local: CCR cluster
 Amazon AWS cloud MR/Hadoop infrastructure
 Google App Engine MapReduce
Big-data Problem Solving Approaches

 Algorithmic: after all, we have been working toward this forever: scalable/tractable algorithms
 High-performance computing (HPC: multi-core); CCR has machines with 16 CPUs, 32 cores, and 128 GB RAM
 GPGPU programming: general-purpose graphics processors (NVIDIA)
 Statistical packages like R running on parallel threads on powerful machines
 Machine learning algorithms on supercomputers
Different Types of Storage

• The Internet introduced a new challenge in the form of web logs and web crawlers' data: large scale, "peta scale".
• But observe that this type of data has a uniquely different character than your transactional "customer order" or "bank account" data:
• The data type is "write once, read many" (WORM):
 • Privacy-protected healthcare and patient information
 • Historical financial data
 • Other historical data
 Relational file systems and tables are insufficient. Instead:
• Large <key, value> stores (files) and a storage management system
• Built-in features for fault tolerance, load balancing, data transfer and aggregation, …
• Clusters of distributed nodes for storage and computing
• Computing is inherently parallel
4/24/2019 4
MR-data Concepts

 The special <key, value> store originated with the Google File System (GFS).
 The Hadoop Distributed File System (HDFS) is the open-source version of this (currently an Apache project).
 The data is processed in parallel using the MapReduce (MR) programming model.
 Challenges:
 Formulation of MR algorithms
 Proper use of the features of the infrastructure (ex: sort)
 Best practices in using MR and HDFS
 An extensive ecosystem consists of other components such as column-based stores (HBase, BigTable), big-data warehousing (Hive), workflow languages, etc.

Hadoop-MapReduce

 MapReduce-like algorithms on Hadoop-like infrastructures: typically batch processing
 Distributed parallelism among commodity machines
 WORM data
 <key, value> pairs
 Challenges:
 Formulation of MR algorithms
 Proper use of the features of the infrastructure (ex: sort)
 Best practices in using MR and HDFS

MapReduce Design

 You focus on the Map function, the Reduce function, and related functions such as the combiner.
 The Mapper and Reducer are designed as classes, and each function is defined as a method.
 Configure the MR "Job" with the location of these functions, the location of the input and output (paths within the local server), and the scale of the cluster in terms of #maps, #reduces, etc., then run the job.
 Thus a complete MapReduce job consists of code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution framework handles everything else.

CSE4/587 4/24/2019
The code

class Mapper
  method Map(docid a, doc d)
    for all term t in doc d do
      Emit(term t, count 1)

class Reducer
  method Reduce(term t, counts [c1, c2, ...])
    sum = 0
    for all count c in counts [c1, c2, ...] do
      sum = sum + c
    Emit(term t, count sum)
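The pseudocode above can be sketched as a runnable, single-process simulation. This is an illustrative Python sketch, not Hadoop's actual Java API: the in-memory dictionary stands in for the framework's shuffle/sort, and `run_job` stands in for the job configuration and execution framework.

```python
from collections import defaultdict

class Mapper:
    # Map(docid a, doc d): for all term t in doc d do Emit(t, 1)
    def map(self, docid, doc):
        for term in doc.lower().split():
            yield term, 1

class Reducer:
    # Reduce(term t, counts [c1, c2, ...]): Emit(t, sum of the counts)
    def reduce(self, term, counts):
        yield term, sum(counts)

def run_job(docs):
    # Map phase: every document goes through the mapper.
    groups = defaultdict(list)
    for docid, doc in docs.items():
        for term, count in Mapper().map(docid, doc):
            groups[term].append(count)   # shuffle/sort: group values by key
    # Reduce phase: one Reduce call per distinct term.
    out = {}
    for term, counts in groups.items():
        for t, total in Reducer().reduce(term, counts):
            out[t] = total
    return out

counts = run_job({1: "This is a cat", 2: "Cat sits on a roof"})
# counts["cat"] == 2, counts["a"] == 2
```

In real Hadoop the mapper and reducer run on different machines and the framework streams the grouped values; the logic per key, however, is exactly this.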

MapReduce Example: Mapper with Combiner

This is a cat
Cat sits on a roof
<this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>

The roof is a tin roof
There is a tin can on the roof
<the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>

Cat kicks the can
It rolls on the roof and falls on the next roof
<cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1>
<falls 1> <next 1>

The cat rolls too
It sits on the can
<the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>
MapReduce Example: Combiner, Reducer,
Shuffle, Sort

<this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>
<the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>
<cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1>
<falls 1> <next 1>
<the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>

Input to the reducers (showing three of the keys):

<cat <1,1,1,1>>
<roof <1,1,1,1,1,1>>
<can <1,1,1>>

Reduce (sum, in this case) the counts; non-traditional methods of summing can also be used:
<cat 4>
<can 3>
<roof 6>
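The combiner-then-reducer flow above can be checked with a short Python simulation. It is a sketch: each two-line block from the example plays the role of one mapper's input, `Counter` plays the role of the combiner (local per-mapper summing) and, when the partial counts are merged, of the reducer.

```python
from collections import Counter

# Each "mapper" processes one two-line block from the example.
blocks = [
    "This is a cat Cat sits on a roof",
    "The roof is a tin roof There is a tin can on the roof",
    "Cat kicks the can It rolls on the roof and falls on the next roof",
    "The cat rolls too It sits on the can",
]

def map_and_combine(block):
    # map: emit (term, 1) per word; combine: sum the 1s within this mapper only
    return Counter(block.lower().split())

# Shuffle + reduce: merge the per-mapper partial counts into global totals.
totals = Counter()
for block in blocks:
    totals.update(map_and_combine(block))
# totals["cat"] == 4, totals["roof"] == 6, totals["can"] == 3
```

The combiner does not change the final answer; it only shrinks the data that crosses the network between the map and reduce phases.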

More on MR

 All mappers work in parallel.
 A barrier enforces that all mappers complete before the reducers start.
 Mappers and reducers typically execute on the same servers.
 You can configure a job to have other combinations besides mapper/reducer: ex: identity mappers/reducers for realizing "sort" (which happens to be a benchmark).
 Mappers and reducers can have side effects; this allows for sharing information between iterations.
Classes of problems "mapreducable"

 Benchmark for comparison: Jim Gray's challenge on data-intensive computing. Ex: "sort"
 Google uses it (we think) for wordcount, AdWords, PageRank, and indexing data.
 Simple algorithms such as grep, text indexing, and reverse indexing
 Bayesian classification: data-mining domain
 Facebook uses it for various operations: demographics
 Financial services use it for analytics
 Astronomy: Gaussian analysis for locating extraterrestrial objects
 Expected to play a critical role in the semantic web and Web 3.0
 Probably many classical math problems

CCSCNE 2009 Plattsburg, April 24 2009 B.Ramamurthy & K.Madurai


Page Rank
General idea

 Consider the World Wide Web with all its links.
 Now imagine a random web surfer who visits a page, clicks a link on the page,
 and repeats this to infinity.
 PageRank is a measure of how frequently a page will be encountered.
 In other words, it is a probability distribution over nodes in the graph, representing the likelihood that a random walk over the linked structure will arrive at a particular node.
PageRank Formula

P(n) = α (1/|G|) + (1 − α) Σ_{m ∈ L(n)} P(m)/C(m)

α is the randomness factor
|G| is the total number of nodes in the graph
L(n) is the set of all pages that link to n
C(m) is the number of outgoing links of page m

Note that PageRank is recursively defined; it is implemented by iterative MRs.
Let's assume α is zero for a simple walk-through.
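The formula can be evaluated directly by fixed-point iteration. The sketch below uses a small hypothetical 4-node graph (not the graph from the walk-through) and a conventional α of 0.15; every symbol maps one-to-one onto the formula above.

```python
# links[m] lists the pages that m points to, so C(m) = len(links[m]).
links = {
    "n1": ["n2", "n4"],
    "n2": ["n3"],
    "n3": ["n1"],
    "n4": ["n3"],
}
alpha = 0.15                      # randomness factor
G = len(links)                    # total number of nodes in the graph
P = {n: 1.0 / G for n in links}   # start from a uniform distribution

def iterate(P):
    new = {}
    for n in links:
        # sum over L(n): every page m that links to n contributes P(m)/C(m)
        incoming = sum(P[m] / len(links[m]) for m in links if n in links[m])
        new[n] = alpha * (1.0 / G) + (1 - alpha) * incoming
    return new

for _ in range(50):               # iterate the recursive definition to a fixed point
    P = iterate(P)
```

After convergence P is a probability distribution (it sums to 1 because this graph has no dangling nodes), and n3, which collects the full rank of both n2 and n4, ends up with the largest value.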
PageRank: Walk Through

[Figure: a five-node graph (n1–n5). Every node starts with PageRank 0.2 (= 1/5); in each iteration a node divides its rank evenly among its outgoing links, and the figure shows the resulting ranks of n1–n5 after successive iterations.]
Mapper for PageRank

class Mapper
  method Map(nid, Node N)
    p ← N.PageRank / |N.AdjacencyList|
    emit(nid, N)
    for all m in N.AdjacencyList
      emit(nid m, p)

The mapper is the "divider".
Reducer for PageRank

class Reducer
  method Reduce(nid m, [p1, p2, p3, ...])
    Node M ← null; s ← 0
    for all p in [p1, p2, ...]
      if p is a Node then M ← p
      else s ← s + p
    M.PageRank ← s
    emit(nid m, Node M)

The reducer is the "aggregator": at the reducer you get two types of items in the list.
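One full MR iteration of this mapper/reducer pair can be simulated in Python. This is a sketch on a hypothetical 3-node graph with α = 0, as in the simple walk-through; a plain dict stands in for the Node structure, and an in-memory grouping stands in for the shuffle.

```python
from collections import defaultdict

nodes = {
    "n1": {"rank": 1/3, "adj": ["n2", "n3"]},
    "n2": {"rank": 1/3, "adj": ["n3"]},
    "n3": {"rank": 1/3, "adj": ["n1"]},
}

def mapper(nid, node):
    p = node["rank"] / len(node["adj"])   # divide rank among the out-links
    yield nid, node                        # pass the graph structure through
    for m in node["adj"]:
        yield m, p                         # the "divider"

def reducer(nid, values):
    node, s = None, 0.0
    for v in values:
        if isinstance(v, dict):
            node = v                       # recover the node structure
        else:
            s += v                         # the "aggregator": sum partial ranks
    node["rank"] = s
    return nid, node

# One MR iteration: map, shuffle by destination node id, reduce.
groups = defaultdict(list)
for nid, node in nodes.items():
    for k, v in mapper(nid, node):
        groups[k].append(v)
nodes = dict(reducer(nid, vals) for nid, vals in groups.items())
```

After this iteration n3 holds 1/2 (1/6 from n1 plus 1/3 from n2), n1 holds 1/3, n2 holds 1/6, and the total rank is still 1; the reducer's value list indeed mixes the two kinds of items, partial ranks and the node itself.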
Issues; Points to ponder

 How to account for dangling nodes: ones that have many incoming links and no outgoing links
 Simply redistribute their pagerank to all nodes
 One iteration requires the pagerank computation + redistribution of the "unused" pagerank
 Pagerank is iterated until convergence: when is convergence reached?
 A probability distribution over a large network means underflow of the pagerank values: use log-based computation
 MR: How do PRAM algorithms translate to MR? How about math algorithms?
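The first two points above — redistributing the rank of dangling nodes and testing for convergence — can be sketched together. A hypothetical 3-node chain with a dangling node "c" is assumed, and the L1 change between successive iterations serves as the convergence test (other norms work too).

```python
# One iteration = pagerank computation + redistribution of the "unused"
# rank from dangling nodes; iterate until the L1 change falls below tol.
links = {"a": ["b"], "b": ["c"], "c": []}   # "c" is a dangling node
alpha, G = 0.15, len(links)
ranks = {n: 1.0 / G for n in links}

def step(ranks):
    dangling = sum(ranks[n] for n in links if not links[n])  # "unused" rank
    new = {}
    for n in links:
        incoming = sum(ranks[m] / len(links[m]) for m in links if n in links[m])
        # the dangling mass is redistributed evenly to all nodes
        new[n] = alpha / G + (1 - alpha) * (incoming + dangling / G)
    return new

iterations = 0
while True:
    new = step(ranks)
    delta = sum(abs(new[n] - ranks[n]) for n in links)  # L1 change
    ranks, iterations = new, iterations + 1
    if delta < 1e-10:    # convergence reached
        break
```

Because the dangling mass is put back in, the ranks still sum to 1 at every iteration; without that correction the total would leak away through "c". The log-based trick against underflow is not shown here.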
Demos

 Single node: Eclipse Helios, Hadoop (MR) 0.2, Hadoop Eclipse plug-in
 Amazon Elastic Compute Cloud: aws.amazon.com
 CCR: video of a 100-node cluster processing a billion-node k-ary tree
Take Home Messages

 The MapReduce (MR) algorithm is for distributed processing of big data.
 Apache Hadoop (open source) provides the distributed infrastructure for MR.
 The most challenging aspect is designing the MR algorithm for solving a problem; it is a different mind-set:
 visualizing data as <key, value> pairs; distributed parallel processing
 Probably, beautiful MR solutions can be designed for classical math problems.
 It is not just the mapper and reducer: other operations such as the combiner and partitioner have to be cleverly used for solving large-scale problems.
