Ch 8 and Ch 9:
MapReduce Types, Formats
and Features
MapReduce Form Review
General form of Map/Reduce functions:
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
General form with Combiner function:
map: (K1, V1) -> list(K2, V2)
combiner: (K2, list(V2)) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
Partition function:
partition: (K2, V2) -> integer
(in practice, the partition is usually determined by the key alone; the value is ignored)
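A minimal Java sketch of how these forms map onto the API (a word-count-style job; class names are illustrative). Note the reducer can double as the combiner here because its input and output types match:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: (K1=LongWritable, V1=Text) -> list(K2=Text, V2=IntWritable)
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), ONE); // emit (K2, V2)
                }
            }
        }
    }

    // reduce: (K2=Text, list(V2=IntWritable)) -> list(K3=Text, V3=IntWritable)
    // Also usable as the combiner, since (K2, V2) and (K3, V3) are the same types.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emit (K3, V3)
        }
    }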
Input Formats - Basics
Input split - a chunk of the input that is processed by a single map
Each map processes a single split, which is divided into records (key-value pairs) that are
individually processed by the map
Represented by the Java class InputSplit
Has a set of storage locations (hostname strings)
Contains a reference to the data, not the actual data
InputFormat - responsible for creating the input splits and dividing them into records,
so you will not usually deal with the InputSplit class directly
Controlling split size - the split size is computed as max(minimumSize, min(maximumSize, blockSize)),
so by default it equals the HDFS block size; the minimum and maximum are set with the
mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties
Input Formats - Basics
Avoid small files - storing a large number of small files increases the number of
seeks needed to run the job
A sequence file can be used to merge many small files into fewer, larger files
Preventing splitting - you might want to prevent splitting if you want a single
mapper to process each input file as an entire file
1. Increase the minimum split size to be larger than the largest file in the system
2. Subclass the concrete subclass of FileInputFormat you want to use and override its isSplitable() method to return false
Reading an entire file as a record: prevent splitting and supply a RecordReader that delivers the file contents as a single record, as in the sketch below
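A sketch of that pattern, modeled on the WholeFileInputFormat example from Hadoop: The Definitive Guide:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split: each mapper gets a whole file
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }

        // Delivers exactly one record: the entire file's contents as the value.
        static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
            private FileSplit fileSplit;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        }
    }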
Input Formats - File Input
FileInputFormat - the base class for all implementations of InputFormat that use
files as their data source
Provides a place to define what files are included as input to a job and an implementation for
generating splits for the input files
Input is often specified as a collection of paths (see the sketch at the end of this slide)
Splits files that are larger than an HDFS block
CombineFileInputFormat - Java class designed to work well with small files in
Hadoop
Each split will contain many of the small files so that each mapper has more to process
Takes node and rack locality into account when deciding what blocks to place into the same split
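A driver sketch of specifying input paths with FileInputFormat (all paths are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class InputPathsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "input paths example");
            // A path can be a single file, a directory (all files in it), or a glob:
            FileInputFormat.addInputPath(job, new Path("/data/2024"));      // directory
            FileInputFormat.addInputPath(job, new Path("/data/extra.txt")); // single file
            FileInputFormat.addInputPaths(job, "/data/a,/data/b");          // comma-separated list
            // FileInputFormat.setInputPathFilter(...) can exclude files such as temporaries
        }
    }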
Input Formats - Text Input
TextInputFormat - default InputFormat where each record is a line of input
Key - byte offset within the file of the beginning of the line; Value - the contents of the line, not
including any line terminators, packaged as a Text object
mapreduce.input.linerecordreader.line.maxlength - can be used to set a maximum expected line
length
Safeguards against corrupted files (corruption often shows up as a very long line)
KeyValueTextInputFormat - interprets each line as a key-value pair separated by a
delimiter (e.g., for reading back the output of TextOutputFormat, the default output format)
mapreduce.input.keyvaluelinerecordreader.key.value.separator - used to specify the
delimiter/separator which is tab by default
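A driver sketch switching the separator from the default tab to a comma:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class KeyValueInputExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // change the separator from the default tab to a comma
            conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
            Job job = Job.getInstance(conf, "kv input example");
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            // a line "station1,22.5" becomes key=Text("station1"), value=Text("22.5")
        }
    }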
Input Formats - Binary Input, Multiple Inputs, and Database I/O
Binary Input:
SequenceFileInputFormat - reads sequence files, which store sequences of binary key-value pairs
SequenceFileAsTextInputFormat - converts sequence file’s keys and values to Text objects
SequenceFileAsBinaryInputFormat - retrieves the sequence file’s keys and values as binary
objects
FixedLengthInputFormat - reading fixed-width binary records from a file where the records are not
separated by delimiters
Multiple Inputs:
By default, all input is interpreted by a single InputFormat and a single Mapper
The MultipleInputs class lets you specify a different InputFormat and Mapper for each input path (see the sketch below)
Database I/O:
DBInputFormat and DBOutputFormat read and write data from and to relational databases over JDBC; best suited to relatively small datasets, since many concurrent tasks can overwhelm the database
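A sketch of MultipleInputs (paths are hypothetical, and the two mappers are empty stubs standing in for real per-format mappers):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class MultipleInputsExample {
        // Stub mappers; real ones would parse each dataset's own format
        static class LogMapper extends Mapper<LongWritable, Text, Text, Text> { }
        static class MetadataMapper extends Mapper<Text, Text, Text, Text> { }

        public static void configure(Job job) {
            // each path gets its own InputFormat and Mapper
            MultipleInputs.addInputPath(job, new Path("/data/logs"),
                TextInputFormat.class, LogMapper.class);
            MultipleInputs.addInputPath(job, new Path("/data/station-metadata"),
                SequenceFileInputFormat.class, MetadataMapper.class);
        }
    }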
Output Formats
Text Output: TextOutputFormat - default output format; writes records as lines
of text (keys and values are turned into strings)
Since its output is lines of delimiter-separated key-value pairs, it can be read back in with KeyValueTextInputFormat, which breaks lines into key-value pairs based on a configurable separator
Binary Output:
SequenceFileOutputFormat - writes sequence files as output
SequenceFileAsBinaryOutputFormat - writes keys and values in binary format into a sequence
file container
MapFileOutputFormat - writes map files as output; the keys in a map file must be added in order, so the reducer must emit keys in sorted order
Multiple Outputs:
MultipleOutputs - lets each reducer (or mapper) write to multiple output files, with names derived from the output keys and values or from arbitrary strings; see the sketch below
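A reducer sketch using MultipleOutputs to write each key's records to its own file (class names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class PartitionByKeyReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> multipleOutputs;

        @Override
        protected void setup(Context context) {
            multipleOutputs = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // the third argument is a base output path, resolved relative to
                // the job output directory, so each key gets its own file(s)
                multipleOutputs.write(key, value, key.toString());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            multipleOutputs.close(); // flush and close all open outputs
        }
    }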
Counters
Useful for gathering statistics about a job, for quality control, and for problem diagnosis
Built-in Counter Types:
Task Counters - gather info about tasks as they are executed and results are aggregated over all
job tasks
Maintained by each task attempt and sent to the application master on a regular basis
to be globally aggregated
Counts may go down if a task fails
Job Counters - measure job-level statistics and are maintained by the application master so they
do not need to be sent across the network
User-Defined Counters: users can define their own sets of counters (typically with a Java enum) to be incremented as desired in the mapper or reducer; see the sketch below
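A mapper sketch with an enum-defined counter (the tab-separated record layout and names are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        enum Quality { MALFORMED_RECORDS }  // counters are grouped by the enum's class name

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) {
                context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
                return;  // skip the bad record, but account for it
            }
            context.write(new Text(fields[0]), new LongWritable(1));
        }
    }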
Sorting
Partial Sort - each reducer's output file is sorted, but it does not produce a globally-sorted output file
Total Sort - produces a globally-sorted output file
Produces a set of sorted files that can be concatenated to form a globally-sorted file
To do this: use a partitioner that respects the total order of the output, and keep the partition sizes fairly even (see the sketch below)
Secondary Sort - sorts the values for each key (values are usually not sorted by MapReduce)
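A driver sketch of a total sort using TotalOrderPartitioner, with an InputSampler to pick fairly even partition boundaries (paths and job settings are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalSortExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "total sort example");
            job.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/sorted-output"));
            job.setNumReduceTasks(4);

            // partitioner that sends key ranges to reducers in total order
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions"));

            // sample the input to choose partition boundaries:
            // 10% sampling probability, at most 10,000 samples from at most 10 splits
            InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
            InputSampler.writePartitionFile(job, sampler);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }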
Joins
MapReduce can perform joins between large datasets.
Joins - Map-Side vs Reduce-Side
Map-Side Join:
The inputs must be divided into the same number of partitions and sorted by the same key (the join key)
All the records for a particular key must reside in the same partition
CompositeInputFormat can be used to run a map-side join
Reduce-Side Join:
Input datasets do not have to be structured in any particular way
Records with the same key are brought together in the reducer function
Uses MultipleInputs and a secondary sort (see the sketch below)
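A simplified reduce-side join reducer (the "S:"/"R:" source tags and record layout are hypothetical; the two mappers are assumed to prefix each value with its source). A production version would use a secondary sort so the metadata record arrives first, avoiding buffering values in memory:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String metadata = null;                    // the "S:" side (at most one record)
            List<String> records = new ArrayList<>();  // the "R:" side
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("S:")) {
                    metadata = v.substring(2);
                } else if (v.startsWith("R:")) {
                    records.add(v.substring(2));
                }
            }
            if (metadata != null) {                    // inner join: drop unmatched keys
                for (String record : records) {
                    context.write(key, new Text(metadata + "\t" + record));
                }
            }
        }
    }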
Side Data Distribution
Side Data - extra read-only data needed by a job to process the main dataset
The main challenge is to make side data available to all the map or reduce tasks (which are
spread across the cluster) in a way that is convenient and efficient
Using the Job Configuration
The various setter methods on Configuration can be used to set arbitrary key-value pairs in the job configuration
Useful for passing small amounts of metadata to tasks (see the sketch below)
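A sketch of passing a value through the job configuration (the property name is hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver side:  conf.set("myjob.reference.date", "2024-01-01");
    public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String referenceDate;

        @Override
        protected void setup(Context context) {
            // read once per task in setup(), not once per record in map()
            referenceDate = context.getConfiguration().get("myjob.reference.date", "1970-01-01");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // referenceDate is now available to every record
            context.write(new Text(referenceDate), value);
        }
    }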
Distributed Cache
Instead of serializing side data in the job configuration, it is preferable to distribute larger
datasets using Hadoop's distributed cache, which copies files and archives to the task nodes
in time for tasks to use them when they run (see the sketch below)
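A sketch of the distributed cache (the file path, fragment name, and record layout are hypothetical):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver side:  job.addCacheFile(new URI("/data/lookup/stations.txt#stations"));
    public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> stationNames = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // the cached file is linked into the task's working directory
            // under its URI fragment name ("stations")
            try (BufferedReader in = new BufferedReader(new FileReader("stations"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2); // hypothetical "id<TAB>name" layout
                    if (parts.length == 2) {
                        stationNames.put(parts[0], parts[1]);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String stationId = value.toString().split("\t")[0];
            String name = stationNames.getOrDefault(stationId, "UNKNOWN");
            context.write(new Text(name), value); // enrich the record with side data
        }
    }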
MapReduce Library Classes
Mappers/Reducers for commonly used functions, e.g., InverseMapper (swaps keys and values), TokenCounterMapper (emits each token with a count of 1), RegexMapper (emits matches of a regular expression), IntSumReducer/LongSumReducer (sum the values for each key), and ChainMapper/ChainReducer (run a chain of mappers within a single task)
Video – Example MapReduce WordCount
Video: https://youtu.be/aelDuboaTqA