Unit-2 Parallel DataProcessing

Uploaded by

navya sree pinnamaneni AP22110010981

Parallel Data Processing with Hadoop/MapReduce

Overview

• What is MapReduce?
– Example with word counting
• Parallel data processing with MapReduce
– Hadoop file system
• More application examples
Motivations

• Motivations
– Large-scale data processing on clusters
– Massively parallel (hundreds or thousands of CPUs)
– Reliable execution with easy data access
• Functions
– Automatic parallelization & distribution
– Fault-tolerance
– Status and monitoring tools
– A clean abstraction for programmers
» Functional programming meets distributed computing
» A batch data processing system
Parallel Data Processing in a Cluster

• Scalability to large data volumes:

– Scan 100 TB on 1 node @ 50 MB/s ≈ 24 days
– Scan on 1000-node cluster ≈ 35 minutes

• Cost-efficiency:
– Commodity nodes/network
» Cheap, but not high bandwidth, sometimes unreliable
– Automatic fault-tolerance (fewer admins)
– Easy to use (fewer programmers)
Typical Hadoop Cluster

[Figure: racks of nodes, each rack attached to a rack switch; the rack switches are connected through an aggregation switch]

• 40 nodes/rack, 1000-4000 nodes in cluster
• 1 Gbps bandwidth in rack, 8 Gbps out of rack
• Node specs: 8-16 cores, 32 GB RAM, 8×1.5 TB disks
MapReduce Programming Model

• Data: a set of key-value pairs
– Initially, input data is stored in files
• Parallel computation:
– A set of Map tasks and Reduce tasks that access and produce key-value pairs
– Map function: (key1, val1) → (key2, val2)
– Reduce function: (key2, [val2 list]) → [val3]
– Inspired by the map and reduce operations commonly used in functional programming languages like Lisp
• Input/output files are stored in Hadoop: a distributed file system built on a cluster of machines → looks like one machine
Key-Value Pairs Manipulated by Map/Reduce Tasks

[Figure: input files stored in Hadoop → Map tasks → Reduce tasks → output files in Hadoop]

Inspired by Lisp Functional Programming

• Two Lisp functions
• Lisp map function
– Input parameters: a function and a set of values
– The function is applied to each of the values.
• Example:
– (mapcar #'length '(() (a) (a b) (a b c)))
→ (list (length '()) (length '(a)) (length '(a b)) (length '(a b c)))
→ (0 1 2 3)
Lisp Reduce Function

• Lisp reduce function
– Input parameters: a binary function and a set of values.
– It combines all the values together using the binary function.
• Example:
– Use the + (add) function to reduce the list (0 1 2 3)
– (reduce #'+ '(0 1 2 3))
→ 6
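The same two operations can be sketched in Java with the standard streams API. This is only an illustration of the Lisp map and reduce functions above, not Hadoop code; the class and method names (LispStyleMapReduce, lengths, sum) are invented for this sketch.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LispStyleMapReduce {
    // map: apply one function (here List::size) to every element of the input
    static List<Integer> lengths(List<List<String>> lists) {
        return lists.stream().map(List::size).collect(Collectors.toList());
    }

    // reduce: fold all values together with a binary function (here +)
    static int sum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<List<String>> input = List.of(
                List.<String>of(),
                List.of("a"),
                List.of("a", "b"),
                List.of("a", "b", "c"));
        List<Integer> mapped = lengths(input);
        System.out.println(mapped);      // [0, 1, 2, 3]
        System.out.println(sum(mapped)); // 6
    }
}
```

Because the mapped function looks at one element at a time, the map step can be split across machines; the reduce step only needs a binary function to merge partial results.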
Example: Map Processing with Hadoop
• Given a file
– A file may be divided by the system into multiple parts (called splits or shards).
• Each record in a split is processed by a user Map function, which
– takes each record as an input
– produces key/value pairs
Processing of Reducer Tasks

• Given a set of (key, value) records produced by map tasks:
– All the intermediate values for a key are combined together into a list, call it [val2], and given to a reducer.
– A user-defined function is applied to each list [val2] and produces another value.

[Figure: intermediate values grouped under the keys k1, k2, k3]

Put Map and Reduce Tasks Together

[Figure: map and reduce tasks combined into one job, with the user-written parts labeled "User responsibility"]
Example of Word Count Job (WC)

Input → Map → Shuffle & Sort → Reduce → Output

Input splits and map outputs:
• "the quick brown fox" → Map → (the, 1) (quick, 1) (brown, 1) (fox, 1)
• "the fox ate the mouse" → Map → (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
• "how now brown cow" → Map → (how, 1) (now, 1) (brown, 1) (cow, 1)

Reduce outputs after shuffle & sort:
• Reduce → (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
• Reduce → (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
Input/output specification of the WC MapReduce job

Input: a set of (key, value) pairs stored in files
key: document ID
value: a list of words as the content of each document

Output: a set of (key, value) pairs stored in files
key: word ID
value: frequency of the word across all documents

MapReduce function specification:
map(String input_key, String input_value)
reduce(String output_key, Iterator intermediate_values)
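Before looking at the Hadoop version, the specification can be tried out in plain Java with no cluster at all. The sketch below is a single-process simulation under stated assumptions: the class WordCountSketch and its methods are hypothetical names, the "shuffle" is just an in-memory sorted map, and document IDs are arbitrary strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: one (docId, content) record -> a list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String docId, String content) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(content);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    // Reduce: one word and the list of its counts -> the total count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: run map on every document, group values by key
    // (a stand-in for shuffle & sort), then reduce each group.
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
                "doc1", "the quick brown fox",
                "doc2", "the fox ate the mouse",
                "doc3", "how now brown cow");
        // {ate=1, brown=2, cow=1, fox=2, how=1, mouse=1, now=1, quick=1, the=3}
        System.out.println(run(docs));
    }
}
```

In a real job the framework, not user code, performs the grouping step and spreads the map and reduce calls over many machines; only map and reduce are written by the user.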
MapReduce WordCount.java
Hadoop distribution: src/examples/org/apache/hadoop/examples/WordCount.java

// Imports used by the mapper (placed at the top of WordCount.java)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1); // a MapReduce int class
  private Text word = new Text();                            // a MapReduce String class

  // key is the offset of the current record in the file
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {   // loop over each token
      word.set(itr.nextToken());    // copy the token (a String) into the Text object
      context.write(word, one);     // emit a (key, value) pair for the reducer
    }
  }
}
© Spinnaker Labs, Inc.
MapReduce WordCount.java
map() gets a key, a value, and a context
• key - the byte offset of the current line from the beginning of the file
• value - the current line
• In the while loop, each token is a "word" from the current line

Input file (one line per record) and the tokens produced from each line value:

Line value                  Tokens
US history book             US, history, book
School admission records    School, admission, records
iPADs sold in 2012          iPADs, sold, in, 2012

Reduce code in WordCount.java
// also needs: import org.apache.hadoop.mapreduce.Reducer;
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();          // add up all counts for this key
    }
    result.set(sum);             // convert "int" to IntWritable
    context.write(key, result);  // emit the final key-value result
  }
}
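The driver to set things up and start
// Imports used by the driver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Usage: wordcount <in> <out>
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  Job job = new Job(conf, "word count");        // MapReduce job
  job.setJarByClass(WordCount.class);           // set jar file
  job.setMapperClass(TokenizerMapper.class);    // set mapper class
  job.setCombinerClass(IntSumReducer.class);    // set combiner class
  job.setReducerClass(IntSumReducer.class);     // set reducer class
  job.setOutputKeyClass(Text.class);            // output key class
  job.setOutputValueClass(IntWritable.class);   // output value class
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // job input path
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // job output path
  System.exit(job.waitForCompletion(true) ? 0 : 1);             // exit status
}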
Systems Support for MapReduce

[Figure: software stack, top to bottom]
Applications
MapReduce
Distributed File Systems (Hadoop, Google FS)
Distributed Filesystems
• The interface is the same as a single-machine file system
– create(), open(), read(), write(), close()
• Distribute file data to a number of machines (storage units).
– Support replication
• Support concurrent data access
– Fetch content from remote servers. Local caching
• Different implementations sit in different places on the complexity/feature scale
– Google File System and Hadoop HDFS
» Highly scalable for large data-intensive applications.
» Provide redundant storage of massive amounts of data on cheap and unreliable machines
Assumptions of GFS/Hadoop DFS

• High component failure rates

– Inexpensive commodity components fail all the time
• “Modest” number of HUGE files
– Just a few million
– Each is 100MB or larger; multi-GB files typical
• Files are write-once, mostly appended to
– Perhaps concurrently
• Large streaming reads
• High sustained throughput favored over low latency
Hadoop Distributed File System

• Files are split into 64 MB blocks
• Blocks are replicated across several datanodes (3 replicas)
• Namenode stores metadata (file names, locations, etc.)
• Files are append-only. Optimized for large files, sequential reads
– Read: use any copy
– Write: append to 3 replicas

[Figure: the namenode records that File 1 consists of blocks 1, 2, 3, 4; each datanode stores replicas of some of these blocks]
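A quick way to get a feel for these numbers is to compute how many blocks and block replicas a file occupies. The sketch below uses the 64 MB block size and 3 replicas stated on the slide; the class HdfsBlockSketch and its method names are hypothetical and are not part of any Hadoop API.

```java
public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, as on the slide
    static final int REPLICAS = 3;                     // replicas per block, as on the slide

    // Number of blocks needed for a file (ceiling division; the last block may be partial)
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total number of block replicas stored across all datanodes
    static long replicaCount(long fileSizeBytes) {
        return blockCount(fileSizeBytes) * REPLICAS;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));   // 16
        System.out.println(replicaCount(oneGiB)); // 48
    }
}
```

Under these assumptions a 1 GB file becomes 16 blocks and 48 stored replicas, which is why the namenode only keeps metadata while the datanodes hold the data itself.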
Shell Commands for Hadoop File System

[Figure: a user working with the Hadoop file system and the local Linux file system]

• mkdir, ls, cat, cp
– hadoop fs -mkdir /user/deepak/dir1
– hadoop fs -ls /user/deepak
– hadoop fs -cat /user/deepak/[Link]
– hadoop fs -cp /user/deepak/dir1/[Link] /user/deepak/dir2
• Copy data from the local file system to HDFS
– hadoop fs -copyFromLocal <src:localFileSystem> <dest:Hdfs>
– Ex: hadoop fs -copyFromLocal /home/hduser/[Link] /user/deepak/dir1
• Copy data from HDFS to the local file system
– hadoop fs -copyToLocal <src:Hdfs> <dest:localFileSystem>

[Link]
Hadoop DFS with MapReduce
Daemons for Hadoop/MapReduce

• The following daemons must be running (use jps to show these Java processes)
• Hadoop
– Name node (master)
– Secondary name node
– Data nodes
• MapReduce
– Task tracker
– Job tracker
Hadoop Cluster with MapReduce
Execute MapReduce on a cluster of machines with
Hadoop DFS
MapReduce: Execution Details

• Input reader
– Divide input into splits, assign each split to a Map task
• Map task for data parallelism
– Apply the Map function to each record in the split
– Each Map function returns a list of (key, value) pairs
• Shuffle/Partition and Sort
– Shuffle distributes sorting & aggregation to many reducers
– All records for key k are directed to the same reduce processor
– Sort groups the same keys together, and prepares for aggregation
• Reduce task for data parallelism
– Apply the Reduce function to each key
– The result of the Reduce function is a list of (key, value) pairs
• Performance consideration in mappers/reducers: Too many key-value pairs? Not enough pairs?
How to create and execute map tasks?

• The system spawns a number of mapper processes and reducer processes
– A typical/default setting: 2 mappers and 1 reducer per core.
– The user can specify/change the setting
• Input reader
– Input is typically a directory of files.
– Divide each input file into splits
– Assign each split to a Map task
• Map task
– Executed by a mapper process
– Apply the user-defined map function to each record in the split
– Each Map function returns a list of (key, value) pairs
How to create and execute reduce tasks?

• Shuffle/partition outputs of map tasks

– Sort keys and group values of the same key together.
– Direct (key, values) pairs to the partitions, and then distribute to
the right destinations.
• Reduce task
– Apply the Reduce function to the list of each key
• Multiple map tasks → one reduce task
Multiple map tasks and multiple reduce tasks

• When there are multiple reducers, the map tasks partition their
output, each creating one partition for each reduce task. There can
be many keys (and their associated values) in each partition, but the
records for any given key are all in a single partition
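The rule that all records for a key land in one partition is usually enforced by hashing the key. The sketch below applies the hash-modulo formula used by Hadoop's default HashPartitioner to plain strings; the class HashPartitionSketch and the method partitionFor are hypothetical names, and a real job would partition Writable keys inside the framework rather than in user code.

```java
public class HashPartitionSketch {
    // Pick the reduce task for a key: clear the sign bit so the result is
    // non-negative, then take the remainder by the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // assumed number of reducers for this example
        String[] keys = {"the", "quick", "brown", "fox", "the"};
        for (String k : keys) {
            // The two occurrences of "the" always print the same partition number.
            System.out.println(k + " -> partition " + partitionFor(k, numReduceTasks));
        }
    }
}
```

Since the partition depends only on the key, every map task sends a given key to the same reducer without any coordination between map tasks.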

MapReduce: Fault Tolerance
• Handled via re-execution of tasks.
– Task completion is committed through the master
• Mappers save outputs to local disk before serving them to reducers
– Allows recovery if a reducer crashes
– Allows running more reducers than # of nodes
1. If a task crashes:
– Retry on another node
» OK for a map because it had no dependencies
» OK for a reduce because map outputs are on disk
– If the same task repeatedly fails, fail the job or ignore that input block
– Note: for the fault tolerance to work, user tasks must be deterministic and side-effect-free
2. If a node crashes:
– Relaunch its current tasks on other nodes
– Relaunch any maps the node previously ran
» Necessary because their output files were lost along with the crashed node
MapReduce: Redundant Execution
• Slow workers are a source of bottlenecks and may delay job completion.
• Spawn backup copies of slow tasks; whichever copy finishes first wins.
• Makes effective use of spare computing power and can noticeably reduce job completion time.
User Code Optimization: Combining Phase

• Run on map machines after the map phase
– "Mini-reduce," only on local map output
– E.g. job.setCombinerClass(IntSumReducer.class);
• Saves bandwidth before sending data to the full reduce tasks
• Requirement: the combining operation must be commutative and associative

[Figure: on one mapper machine, the combiner replaces the raw map output with locally combined pairs before they are sent to the reducers]
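The effect of a combiner can be seen on the output of a single map task. The following plain-Java sketch is an illustration only: CombinerSketch and combine are hypothetical names, and the input pairs are what a word-count mapper would emit for the line "the fox ate the mouse".

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Locally sum the (word, count) pairs produced on ONE mapper machine,
    // so that each distinct word is sent to the reducers only once.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("the", 1), Map.entry("fox", 1), Map.entry("ate", 1),
                Map.entry("the", 1), Map.entry("mouse", 1));
        // Five pairs go in, four come out: {ate=1, fox=1, mouse=1, the=2}
        System.out.println(combine(mapOutput));
    }
}
```

Summation is commutative and associative, so the reducer gets the same totals whether or not the combiner ran; an operation such as averaging would not be safe to use this way without carrying extra state.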
MapReduce Applications (I)

• Distributed grep (search for words)
– Map: emit a line if it matches a given pattern
• URL access frequency
– Map: process logs of web page accesses; output (URL, 1)
– Reduce: add all values for the same URL
MapReduce Applications (II)

• Map only parallel processing

• Count word usage for each document
• Map-reduce two-stage processing
• Count word usage for the entire
document collection
• Multiple map-reduce stages
1. Count word usage in a document set
[Link] most frequent words in each
document, but exclude those most
MapReduce Job Chaining

• Run a sequence of map-reduce jobs
• Use job.waitForCompletion()
– Define the first job, including input/output directories and map/combiner/reduce classes.
» Run the first job with job.waitForCompletion()
– Define the second job
» Run the second job with job.waitForCompletion()
• Use [Link](job)
Example
Job job = new Job(conf, "word count");                         // first MapReduce job
job.setJarByClass(WordCount.class);                            // set jar file
job.setMapperClass(TokenizerMapper.class);                     // set mapper class
...
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));     // input path
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));   // output path
job.waitForCompletion(true);                                   // run the first job and wait for it

Job job1 = new Job(conf, "word count");                        // second MapReduce job
job1.setJarByClass(WordCount.class);                           // set jar file
job1.setMapperClass(TokenizerMapper.class);                    // set mapper class
...
FileInputFormat.addInputPath(job1, new Path(otherArgs[1]));    // input path = output of the first job
FileOutputFormat.setOutputPath(job1, new Path(otherArgs[2]));  // output path
System.exit(job1.waitForCompletion(true) ? 0 : 1);             // exit status
MapReduce Use Case: Inverted Indexing Preliminaries

Construction of inverted lists for document search
• Input: documents: (docid, [term, term, …]), (docid, [term, …]), …
• Output: (term, [docid, docid, …])
– E.g., (apple, [1, 23, 49, 127, …])
• A document id is an internal document id, e.g., a unique integer
– Not an external document id such as a URL
© 2010, Jamie Callan
Inverted Indexing: Data flow

Document Foo: "This page contains so much text"
Foo map output:
contains: Foo
much: Foo
page: Foo
so: Foo
text: Foo
This: Foo

Document Bar: "My page contains text too"
Bar map output:
contains: Bar
My: Bar
page: Bar
text: Bar
too: Bar

Reduced output:
contains: Foo, Bar
much: Foo
My: Bar
page: Foo, Bar
so: Foo
text: Foo, Bar
This: Foo
too: Bar
Using MapReduce to Construct Inverted Indexes

• Each Map task is a document parser

– Input: A stream of documents
– Output: A stream of (term, docid) tuples
» (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
» We may create internal IDs for words.
• Shuffle sorts tuples by key and routes tuples to Reducers
• Reducers convert streams of keys into streams of inverted lists
– Input: (long, 1) (long, 127) (long, 49) (long, 23) …
– The reducer sorts the values for a key and builds an inverted
list
– Output: (long, [frequency:492, docids:1, 23, 49, 127, …])
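The map, shuffle, and reduce roles described above can be imitated in a few lines of plain Java. This is a single-process sketch under stated assumptions: InvertedIndexSketch and build are hypothetical names, documents are whitespace-tokenized strings keyed by integer docid, and the frequency field of the output is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    static Map<String, List<Integer>> build(Map<Integer, String> docs) {
        // "Map" + "shuffle": emit (term, docid) for every token and group by term.
        // A TreeSet keeps the docids of each term sorted and free of duplicates.
        Map<String, TreeSet<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            StringTokenizer itr = new StringTokenizer(doc.getValue());
            while (itr.hasMoreTokens()) {
                grouped.computeIfAbsent(itr.nextToken(), t -> new TreeSet<>())
                       .add(doc.getKey());
            }
        }
        // "Reduce": turn each group of docids into one inverted list.
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<String, TreeSet<Integer>> g : grouped.entrySet()) {
            index.put(g.getKey(), new ArrayList<>(g.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
                1, "long ago and far away",
                2, "once upon a time long ago");
        Map<String, List<Integer>> index = build(docs);
        System.out.println(index.get("long")); // [1, 2]
        System.out.println(index.get("once")); // [2]
    }
}
```

In the distributed version the grouping is done by the framework's shuffle, and each reducer only sees the terms routed to its partition.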

Combine: Special Local Reduction

• Combine locally if possible
Using Combiner() to Reduce Communication

• Map: (docid_1, content_1) → (t_1, ilist_{1,1}) (t_2, ilist_{2,1}) (t_3, ilist_{3,1}) …
– Each output inverted list covers just one document
• Combine locally
– Sort by t
– Combiner: (t_1, [ilist_{1,2}, ilist_{1,3}, ilist_{1,1}, …]) → (t_1, ilist_{1,27})
– Each output inverted list covers a sequence of documents
• Shuffle and sort by t
– (t_4, ilist_{4,1}) (t_5, ilist_{5,3}) … → (t_4, ilist_{4,2}) (t_4, ilist_{4,4}) (t_4, ilist_{4,1}) …
• Reduce: (t_7, [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → (t_7, ilist_{final})

ilist_{i,j}: the j'th inverted list fragment for term t_i
Hadoop and Tools
• Various Linux Hadoop clusters
– Cluster + Hadoop: [Link]
– Amazon EC2
• Windows and other platforms
– The NetBeans plugin simulates Hadoop
– The workflow view works on Windows
• Hadoop-based tools
– For developing in Java: NetBeans plugin
– Pig Latin: a SQL-like, high-level data processing script language
– Hive: data warehouse, SQL
– HBase: distributed data store as a large table
New Hadoop Development

Cluster resource management

