Hadoop and Pig Problem Solving Guide

The document provides an agenda for a training on practical problem solving with Hadoop and Pig. The morning session will cover introductions, motivating examples, the Hadoop distributed file system, and Hadoop MapReduce. The afternoon session will cover performance tuning, Hadoop examples, Pig, the Pig Latin language and examples, Pig architecture, and Q&A.

Practical Problem Solving with Hadoop and Pig

Milind Bhandarkar (milindb@[Link])

Agenda
Introduction
Hadoop Distributed File System
Map-Reduce
Pig
Q & A
Middleware 2009 2

Agenda: Morning (8.30 - 12.00)

Introduction
Motivating Examples
Hadoop Distributed File System
Hadoop Map-Reduce
Q & A

Agenda: Afternoon (1.30 - 5.00)

Performance Tuning
Hadoop Examples
Pig
Pig Latin Language & Examples
Architecture
Q & A

About Me
Lead Yahoo! Grid Solutions Team since June 2005
Contributor to Hadoop since January 2006
Trained 1000+ Hadoop users at Yahoo! & elsewhere
20+ years of experience in Parallel Programming

Hadoop At Yahoo!

Hadoop At Yahoo! (Some Statistics)

25,000+ machines in 10+ clusters
Largest cluster is 3,000 machines
3 Petabytes of data (compressed, unreplicated)
700+ users
10,000+ jobs/week

Sample Applications
Data analysis is the inner loop of Web 2.0
  Data -> Information -> Value
Log processing: reporting, buzz
Search index
Machine learning: Spam filters
Competitive intelligence

Prominent Hadoop Users

Yahoo!
[Link]
EHarmony
Facebook
Fox Interactive Media
IBM
Quantcast
Joost
[Link]
Powerset
New York Times
Rackspace

Yahoo! Search Assist

Search Assist
Insight: Related concepts appear close together in text corpus
Input: Web pages
  1 Billion Pages, 10K bytes each
  10 TB of input data
Output: List(word, List(related words))

Search Assist
// Input: List(URL, Text)
foreach URL in Input :
    Tokens = Tokenize(Text(URL));
    foreach word in Tokens :
        Insert (word, Next(word, Tokens)) in Pairs;
        Insert (word, Previous(word, Tokens)) in Pairs;
// Result: Pairs = List (word, RelatedWord)
Group Pairs by word;
// Result: GroupedPairs = List (word, List(RelatedWords))
foreach word in GroupedPairs :
    Count RelatedWords in GroupedPairs;
// Result: CountedPairs = List (word, List(RelatedWords, count))
foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    choose Top 5 Pairs;
// Result: List (word, Top5(RelatedWords))
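The Search Assist pseudocode can be tried on a single machine before thinking about distribution. The following is a minimal Python sketch, assuming whitespace tokenization and lower-casing (neither is specified on the slide); the function name `related_words` and the sample pages are hypothetical.

```python
from collections import Counter, defaultdict

def related_words(pages, top_n=5):
    """pages: iterable of (url, text). Returns {word: [(related_word, count), ...]}."""
    counts = defaultdict(Counter)
    for _url, text in pages:
        tokens = text.lower().split()      # assumption: whitespace tokenization
        for left, right in zip(tokens, tokens[1:]):
            counts[left][right] += 1       # "next" neighbour of left
            counts[right][left] += 1       # "previous" neighbour of right
    # Keep only the top_n most frequent neighbours per word
    return {word: c.most_common(top_n) for word, c in counts.items()}

pages = [("u1", "hadoop pig hadoop pig"), ("u2", "hadoop hdfs")]
result = related_words(pages)
```

Everything here lives in one in-memory dictionary, which is exactly what stops working at 10 TB of input and motivates the grouping step being done by the framework.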

You Might Also Know

Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith
  if you don't know Joe Smith already
Numbers:
  300 MM users
  Average connections per user is 100

You Might Also Know

// Input: List(UserName, List(Connections))
foreach u in UserList :                    // 300 MM
    foreach x in Connections(u) :          // 100
        foreach y in Connections(x) :      // 100
            if (y not in Connections(u)) :
                Count(u, y)++;             // 3 Trillion Iterations
    Sort (u,y) in descending order of Count(u,y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;
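The triple loop above translates almost line for line into Python. This is a small sketch on a toy graph, not a scalable implementation; the names `suggest` and `graph` are hypothetical, and the `y != u` check is an addition so that a user is never suggested to themselves.

```python
from collections import Counter

def suggest(connections, top_n=3):
    """connections: {user: set of users}. Returns {user: [suggested users]}."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:
            for y in connections.get(x, ()):
                # Skip people u already knows, and u themselves (added check)
                if y != u and y not in friends:
                    counts[y] += 1
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions

graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}
top = suggest(graph)
```

For "ann", "dan" is reachable through two friends and "eve" through one, so "dan" ranks first.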

Performance
101 Random accesses for each user
  Assume 1 ms per random access
  100 ms per user
300 MM users
  300 days on a single machine
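The estimate can be checked with a few lines of arithmetic. This sketch multiplies the slide's figures without rounding; the constant names are hypothetical.

```python
USERS = 300_000_000          # 300 MM users
ACCESSES_PER_USER = 101      # figure from the slide
SECONDS_PER_ACCESS = 0.001   # assumed 1 ms per random access

total_seconds = USERS * ACCESSES_PER_USER * SECONDS_PER_ACCESS
days = total_seconds / 86_400
print(f"{days:.0f} days on a single machine")
```

Unrounded, the product comes to roughly 350 days, the same order of magnitude as the slide's round figure of 300 days.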

MapReduce Paradigm

Map & Reduce

Primitives in Lisp (& other functional languages), 1970s
Google Paper 2004
  [Link]
[Link]

Map
Output_List = Map (Input_List)

Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

Reduce
Output_Element = Reduce (Input_List)

Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
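The same two primitives exist in Python, which makes the Square and Sum examples easy to reproduce. A minimal sketch using the built-in `map` and `functools.reduce`:

```python
from functools import reduce
from operator import add

# Map: apply a function to every element independently
squares = list(map(lambda x: x * x, range(1, 11)))

# Reduce: fold the whole list into a single element
total = reduce(add, squares)
```

Each call to the lambda depends only on its own element, which is why the map step parallelizes trivially, while the fold consumes the list as a whole.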

Parallelism
Map is inherently parallel
  Each list element processed independently
Reduce is inherently sequential
  Unless processing multiple lists
Grouping to produce multiple lists

Search Assist Map

// Input: [Link]
Pairs = Tokenize_And_Pair ( Text ( Input ) )

Output = {
  (apache, hadoop) (hadoop, mapreduce) (hadoop, streaming)
  (hadoop, pig) (apache, pig) (hadoop, DFS)
  (streaming, commandline) (hadoop, java) (DFS, namenode)
  (datanode, block) (replication, default) ...
}

Search Assist Reduce

// Input: GroupedList (word, GroupedList(words))
CountedPairs = CountOccurrences (word, RelatedWords)

Output = {
  (hadoop, apache, 7) (hadoop, DFS, 3)
  (hadoop, streaming, 4) (hadoop, mapreduce, 9) ...
}

Issues with Large Data

Map Parallelism: Splitting input data
Shipping input data
Reduce Parallelism: Grouping related data
Dealing with failures
Load imbalance

Apache Hadoop
January 2006: Subproject of Lucene
January 2008: Top-level Apache project
Latest Version: 0.21
Stable Version: 0.20.x
Major contributors: Yahoo!, Facebook, Powerset

Apache Hadoop
Reliable, Performant Distributed file system
MapReduce Programming framework
Sub-Projects: HBase, Hive, Pig, Zookeeper, Chukwa, Avro
Related Projects: Mahout, Hama, Cascading, Scribe, Cassandra, Dumbo, Hypertable, KosmosFS

Problem: Bandwidth to Data

Scan 100TB Datasets on 1000 node cluster
  Remote storage @ 10MB/s = 165 mins
  Local storage @ 50-200MB/s = 33-8 mins
Moving computation is more efficient than moving data
  Need visibility into data placement
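The scan times follow from dividing the data evenly across nodes and then by the per-node read rate. A small sketch, assuming decimal units (1 TB = 10^6 MB) and perfectly even distribution; `scan_minutes` is a hypothetical helper.

```python
def scan_minutes(dataset_tb, nodes, mb_per_sec):
    """Minutes for every node to scan its share of the dataset in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes   # assumes 1 TB = 10^6 MB
    return mb_per_node / mb_per_sec / 60

remote = scan_minutes(100, 1000, 10)       # remote storage
local_slow = scan_minutes(100, 1000, 50)   # local disk, low end
local_fast = scan_minutes(100, 1000, 200)  # local disk, high end
```

Each node handles 100 GB, so the result is about 167 minutes remotely versus about 33 down to 8 minutes locally, in line with the slide's rounded numbers.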

Problem: Scaling Reliably

Failure is not an option, it's a rule!
  1000 nodes, MTBF < 1 day
  4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMs (16TB RAM)
Need fault tolerant store with reasonable availability guarantees
  Handle hardware faults transparently

Hadoop Goals

Scalable: Petabytes (10^15 Bytes) of data on thousands of nodes
Economical: Commodity components only
Reliable
  Engineering reliability into every application is expensive

Hadoop Distributed File System

HDFS
Data is organized into files and directories
Files are divided into uniform sized blocks (default 64MB) and distributed across cluster nodes
HDFS exposes block placement so that computation can be migrated to data

HDFS
Blocks are replicated (default 3) to handle hardware failure
Replication for performance and fault tolerance (Rack-Aware placement)
HDFS keeps checksums of data for corruption detection and recovery

HDFS
Master-Worker Architecture
Single NameNode
Many (Thousands) DataNodes

HDFS Master (NameNode)

Manages filesystem namespace
File metadata (i.e. inode)
Mapping inode to list of blocks + locations
Authorization & Authentication
Checkpoint & journal namespace changes

Namenode
Mapping of datanode to list of blocks
Monitor datanode health
Replicate missing blocks
Keeps ALL namespace in memory
60M objects (File/Block) in 16GB

Datanodes
Handle block storage on multiple volumes & block integrity
Clients access the blocks directly from data nodes
Periodically send heartbeats and block reports to Namenode
Blocks are stored as underlying OS's files

HDFS Architecture

Replication
A file's replication factor can be changed dynamically (default 3)
Block placement is rack aware
Block under-replication & over-replication is detected by Namenode
Balancer application rebalances blocks to balance datanode utilization

Accessing HDFS
hadoop fs [-fs <local | file system URI>] [-conf <configuration file>]
    [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>]
    [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>]
    [-rmr <src>] [-put <localsrc> ... <dst>]
    [-copyFromLocal <localsrc> ... <dst>]
    [-moveFromLocal <localsrc> ... <dst>]
    [-get [-ignoreCrc] [-crc] <src> <localdst>]
    [-getmerge <src> <localdst> [addnl]] [-cat <src>]
    [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
    [-moveToLocal <src> <localdst>] [-mkdir <path>] [-report]
    [-setrep [-R] [-w] <rep> <path/file>] [-touchz <path>]
    [-test -[ezd] <path>] [-stat [format] <path>] [-tail [-f] <path>]
    [-text <path>]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...] [-chgrp [-R] GROUP PATH...]
    [-count[-q] <path>] [-help [cmd]]

HDFS Java API

// Get default file system instance
fs = [Link](new Configuration());

// Or Get file system instance from URI
fs = [Link]([Link](uri), new Configuration());

// Create, open, list, ...
OutputStream out = [Link](path, ...);
InputStream in = [Link](path, ...);
boolean isDone = [Link](path, recursive);
FileStatus[] fstat = [Link](path);

libHDFS
#include "hdfs.h"

hdfsFS fs = hdfsConnectNewInstance("default", 0);

hdfsFile writeFile = hdfsOpenFile(fs, "/tmp/[Link]", O_WRONLY|O_CREAT, 0, 0, 0);
tSize num_written = hdfsWrite(fs, writeFile, (void*)buffer, sizeof(buffer));
hdfsCloseFile(fs, writeFile);

hdfsFile readFile = hdfsOpenFile(fs, "/tmp/[Link]", O_RDONLY, 0, 0, 0);
tSize num_read = hdfsRead(fs, readFile, (void*)buffer, sizeof(buffer));
hdfsCloseFile(fs, readFile);

hdfsDisconnect(fs);

Installing Hadoop
Check requirements
  Java 1.6+
  bash (Cygwin on Windows)
Download Hadoop release
Change configuration
Launch daemons

Download Hadoop
$ wget [Link] hadoop-0.18.3/[Link]
$ tar zxvf [Link]
$ cd hadoop-0.18.3
$ ls -cF conf
[Link] [Link] [Link] [Link] [Link] masters [Link] slaves [Link] [Link]

Set Environment
# Modify conf/[Link]
$ export JAVA_HOME=....
$ export HADOOP_HOME=....
$ export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
$ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf

# Enable password-less ssh
# Assuming $HOME is shared across all nodes
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Make Directories
# On Namenode, create metadata storage and tmp space
$ mkdir -p /home/hadoop/dfs/name
$ mkdir -p /tmp/hadoop

# Create slaves file
$ cat > conf/slaves
slave00
slave01
slave02
...
^D

# Create data directories on each slave
$ bin/[Link] "mkdir -p /tmp/hadoop"
$ bin/[Link] "mkdir -p /home/hadoop/dfs/data"

Start Daemons
# Modify [Link] with appropriate
# [Link], [Link], etc.
$ mv ~/[Link] conf/[Link]

# On Namenode
$ bin/hadoop namenode -format

# Start all daemons
$ bin/[Link]

# Done !

Check Namenode

Cluster Summary

Browse Filesystem

Questions ?

Hadoop MapReduce

Think MR
Record = (Key, Value)
Key: Comparable, Serializable
Value: Serializable
Input, Map, Shuffle, Reduce, Output

Seems Familiar ?

cat /var/log/[Link]* | \
grep "session opened" | cut -d' ' -f10 | \
sort | \
uniq -c > \
~/userlist

Map
Input: (Key1, Value1)
Output: List(Key2, Value2)
Projections, Filtering, Transformation

Shuffle
Input: List(Key2, Value2)
Output: Sort(Partition(List(Key2, List(Value2))))
Provided by Hadoop

Reduce
Input: List(Key2, List(Value2))
Output: List(Key3, Value3)
Aggregation

Example: Unigrams
Input: Huge text corpus
  Wikipedia Articles (40GB uncompressed)
Output: List of words sorted in descending order of frequency

MR for Unigrams
mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)

MR for Unigrams

mapper (word, frequency):
    emit (frequency, word)

reducer (frequency, words):
    for each word in words:
        emit (word, frequency)
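The two unigram stages can be simulated in memory to see how the shuffle ties the mapper and reducer together. This is a toy sketch, not Hadoop code: `run_mapreduce` is a hypothetical stand-in for the framework that groups map output by key and feeds groups to the reducer in ascending key order.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for map, shuffle (group + sort), reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):               # reducers see keys in sorted order
        output.extend(reducer(k, groups[k]))
    return output

# Stage 1: count each word
def count_mapper(filename, contents):
    for word in contents.split():
        yield word, 1

def count_reducer(word, values):
    yield word, sum(values)

# Stage 2: swap to (frequency, word) so the sort is by frequency
def swap_mapper(word, frequency):
    yield frequency, word

def swap_reducer(frequency, words):
    for word in words:
        yield word, frequency

corpus = [("a.txt", "pig hadoop pig"), ("b.txt", "pig latin")]
counts = run_mapreduce(corpus, count_mapper, count_reducer)
by_freq = run_mapreduce(counts, swap_mapper, swap_reducer)
by_freq.reverse()   # the toy shuffle sorts ascending; reverse for descending
```

The final `reverse()` is a shortcut for this sketch; on a real cluster the descending order would have to come from the job's sort configuration instead.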

Dataflow

MR Dataflow

Unigrams: Java Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = [Link]();
    StringTokenizer itr = new StringTokenizer(line);
    while ([Link]()) {
      Text word = new Text([Link]());
      [Link](word, new IntWritable(1));
    }
  }
}

Unigrams: Java Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while ([Link]()) {
      sum += [Link]().get();
    }
    [Link](key, new IntWritable(sum));
  }
}

Unigrams: Driver
public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf([Link]);
  [Link]("wordcount");
  [Link]([Link]);
  [Link]([Link]);
  [Link](conf, new Path(inputPath));
  [Link](conf, new Path(outputPath));
  [Link](conf);
}

MapReduce Pipeline

Pipeline Details

Configuration
Unified Mechanism for
  Configuring Daemons
  Runtime environment for Jobs/Tasks
Defaults: *-[Link]
Site-Specific: *-[Link]
final parameters

Example
<configuration>
  <property>
    <name>[Link]</name>
    <value>[Link]</value>
  </property>
  <property>
    <name>[Link]</name>
    <value>hdfs://[Link]</value>
  </property>
  <property>
    <name>[Link]</name>
    <value>-Xmx512m</value>
    <final>true</final>
  </property>
  ....
</configuration>

InputFormats
Format                      Key Type          Value Type
TextInputFormat (Default)   File Offset       Text Line
KeyValueInputFormat         Text (upto \t)    Remaining Text
SequenceFileInputFormat     User-Defined      User-Defined

OutputFormats
Format                       Description
TextOutputFormat (default)   Key \t Value \n
SequenceFileOutputFormat     Binary Serialized keys and values
NullOutputFormat             Discards Output

Hadoop Streaming
Hadoop is written in Java
  Java MapReduce code is native
What about Non-Java Programmers ?
  Perl, Python, Shell, R
  grep, sed, awk, uniq as Mappers/Reducers
Text Input and Output

Hadoop Streaming
Thin Java wrappers for Map & Reduce Tasks
Forks actual Mapper & Reducer
IPC via stdin, stdout, stderr
[Link]() \t [Link]() \n
Slower than Java programs
  Allows for quick prototyping / debugging

Hadoop Streaming
$ bin/hadoop jar [Link] \
    -input in-files -output out-dir \
    -mapper [Link] -reducer [Link]

# [Link]
sed -e 's/ /\n/g' | grep .

# [Link]
uniq -c | awk '{print $2 "\t" $1}'
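Since Streaming only requires a program that reads lines on stdin and writes tab-separated key/value lines on stdout, the same word count can be written in Python. The sketch below is not the deck's script: it emits `word\t1` pairs from the mapper and sums them in the reducer, and it assumes, as Streaming provides, that reducer input arrives sorted by key. The single-file layout with a `map`/`reduce` argument is a hypothetical convenience.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: one 'word<TAB>1' line per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hypothetical usage: "script.py map" or "script.py reduce"
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role == "map":
        sys.stdout.writelines(out + "\n" for out in map_lines(sys.stdin))
    elif role == "reduce":
        sys.stdout.writelines(out + "\n" for out in reduce_lines(sys.stdin))
```

Locally, piping the mapper output through `sort` and into the reducer mimics what the framework does between the two phases, which is a convenient way to debug before submitting a job.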

Hadoop Pipes
Library for C/C++
Key & Value are std::string (binary)
Communication through Unix pipes
High numerical performance
Legacy C/C++ code (needs modification)

Pipes Program
#include "hadoop/[Link]"
#include "hadoop/[Link]"
#include "hadoop/[Link]"

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
}

Pipes Mapper
class WordCountMap: public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext& context){}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
        HadoopUtils::splitString([Link](), " ");
    for(unsigned int i=0; i < [Link](); ++i) {
      [Link](words[i], "1");
    }
  }
};

Pipes Reducer
class WordCountReduce: public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext& context){}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while ([Link]()) {
      sum += HadoopUtils::toInt([Link]());
    }
    [Link]([Link](), HadoopUtils::toString(sum));
  }
};

Running Pipes
# upload executable to HDFS
$ bin/hadoop fs -put wordcount /examples/bin

# Specify configuration
$ vi /tmp/[Link]
...
// Set the binary path on DFS
<property>
  <name>[Link]</name>
  <value>/examples/bin/wordcount</value>
</property>
...

# Execute job
$ bin/hadoop pipes -conf /tmp/[Link] \
    -input in-dir -output out-dir

MR Architecture

Job Submission

Initialization

Scheduling

Execution

Map Task

Sort Buffer

Reduce Task

Questions ?

Running Hadoop Jobs

Running a Job
[milindb@gateway ~]$ hadoop jar \
    $HADOOP_HOME/[Link] wordcount \
    /data/newsarchive/20080923 /tmp/newsout
[Link]: Total input paths to process : 4
[Link]: Running job: job_200904270516_5709
[Link]: map 0% reduce 0%
[Link]: map 3% reduce 0%
[Link]: map 7% reduce 0%
....
[Link]: map 100% reduce 21%
[Link]: map 100% reduce 31%
[Link]: map 100% reduce 33%
[Link]: map 100% reduce 66%
[Link]: map 100% reduce 100%
[Link]: Job complete: job_200904270516_5709

Running a Job
[Link]: Counters: 18
[Link]: Job Counters
[Link]: Launched reduce tasks=1
[Link]: Rack-local map tasks=10
[Link]: Launched map tasks=25
[Link]: Data-local map tasks=1
[Link]: FileSystemCounters
[Link]: FILE_BYTES_READ=491145085
[Link]: HDFS_BYTES_READ=3068106537
[Link]: FILE_BYTES_WRITTEN=724733409
[Link]: HDFS_BYTES_WRITTEN=377464307

Running a Job
[Link]: Map-Reduce Framework
[Link]: Combine output records=73828180
[Link]: Map input records=36079096
[Link]: Reduce shuffle bytes=233587524
[Link]: Spilled Records=78177976
[Link]: Map output bytes=4278663275
[Link]: Combine input records=371084796
[Link]: Map output records=313041519
[Link]: Reduce input records=15784903

JobTracker WebUI

JobTracker Status

Jobs Status

Job Details

Job Counters

Job Progress

All Tasks

Task Details

Task Counters

Task Logs

Debugging
Run job with the Local Runner
  Set [Link] to local
  Runs application in a single thread
Run job on a small data set on a 1 node cluster

Debugging
Set [Link] to keep files from failed tasks
  Use the IsolationRunner to run just the failed task
Java Debugging hints
  Send a kill -QUIT to the Java process to get the call stack, locks held, deadlocks

Hadoop Performance Tuning


Example
Bob wants to count records in AdServer logs (several hundred GB)
Used Identity Mapper & Single counting reducer
What is he doing wrong ?
This happened, really !

MapReduce Performance
Reduce intermediate data size
  map outputs + reduce inputs
Maximize map input transfer rate
Pipelined writes from reduce
Opportunity to load balance

Shuffle
Often the most expensive component
M * R Transfers over the network
Sort map outputs (intermediate data)
Merge reduce inputs

Improving Shuffle
Avoid shuffling/sorting if possible
Minimize redundant transfers
Compress intermediate data

Avoid Shuffle
Set [Link] to zero
  Known as map-only computations
Filters, Projections, Transformations
Number of output files = number of input splits = number of input blocks
  May overwhelm namenode

Minimize Redundant Transfers

Combiners
Intermediate data compression

Combiners
When Maps produce many repeated keys
Combiner: Local aggregation after Map & before Reduce
Side-effect free
Same interface as Reducers, and often the same class
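The effect of a combiner is easiest to see by counting how many records cross the shuffle with and without one. This is a toy in-memory sketch for word count, where the combiner and the reducer perform the same summation; all names are hypothetical.

```python
from collections import Counter

def map_words(text):
    """Mapper: one (word, 1) record per token."""
    return [(word, 1) for word in text.split()]

def sum_by_key(pairs):
    """Used both as combiner (per split) and as reducer (globally)."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

splits = ["pig pig pig hadoop", "pig hadoop hadoop"]
raw = [map_words(s) for s in splits]
combined = [sum_by_key(pairs) for pairs in raw]      # local aggregation per map

records_without_combiner = sum(len(p) for p in raw)
records_with_combiner = sum(len(p) for p in combined)

final_without = dict(sum_by_key([kv for p in raw for kv in p]))
final_with = dict(sum_by_key([kv for p in combined for kv in p]))
```

The final counts are identical either way; only the number of shuffled records drops. This works because addition is associative and commutative, so applying it early on partial data does not change the result.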

Compression
Often yields huge performance gains
Set [Link] to true to compress job output
Set [Link] to true to compress map outputs
Codecs: Java zlib (default), LZO, bzip2, native gzip

Load Imbalance
Inherent in application
  Imbalance in input splits
  Imbalance in computations
  Imbalance in partitions
Heterogeneous hardware
  Degradation over time

Optimal Number of Nodes

T = Map slots per TaskTracker
N = optimal number of nodes
S = N * T = Total Map slots in cluster
M = Map tasks in application
Rule of thumb: 5*S < M < 10*S
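Substituting S = N * T into the rule of thumb and solving for N gives M / (10 * T) < N < M / (5 * T). A small sketch of that rearrangement; `node_range` is a hypothetical helper and the example numbers are made up.

```python
def node_range(map_tasks, slots_per_tracker):
    """Bounds on node count N implied by 5*S < M < 10*S with S = N*T."""
    low = map_tasks / (10 * slots_per_tracker)
    high = map_tasks / (5 * slots_per_tracker)
    return low, high

# Example: 10,000 map tasks, 4 map slots per TaskTracker
low, high = node_range(10_000, 4)
```

With these example inputs the rule suggests somewhere between 250 and 500 nodes, so each slot runs between 5 and 10 map tasks over the life of the job.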

Configuring Task Slots

[Link]
[Link]
Tradeoffs: Number of cores, RAM, number and size of disks
Also consider resources consumed by TaskTracker & DataNode

Speculative Execution
Runs multiple instances of slow tasks
Instance that finishes first, succeeds
[Link]=true
[Link]=true
Can dramatically bring in long tails on jobs

Hadoop Examples


Example: Standard Deviation

Takeaway: Changing algorithm to suit architecture yields the best implementation

Implementation 1
Two Map-Reduce stages
First stage computes Mean
Second stage computes standard deviation

Stage 1: Compute Mean

Map Input ($x_i$ for $i = 1..N_m$)
Map Output ($N_m$, Mean($x_{1..N_m}$))
Single Reducer
Reduce Input (Group(Map Output))
Reduce Output (Mean($x_{1..N}$))

Stage 2: Compute Standard Deviation

Map Input ($x_i$ for $i = 1..N_m$) & Mean($x_{1..N}$)
Map Output (Sum($(x_i - \mathrm{Mean}(x))^2$) for $i = 1..N_m$)
Single Reducer
Reduce Input (Group(Map Output)) & $N$
Reduce Output ($\sigma$)

Standard Deviation

Algebraically equivalent
Be careful about numerical accuracy, though

Implementation 2
Map Input ($x_i$ for $i = 1..N_m$)
Map Output ($N_m$, [Sum($x^2_{1..N_m}$), Mean($x_{1..N_m}$)])
Single Reducer
Reduce Input (Group(Map Output))
Reduce Output ($\sigma$)
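The single-pass version relies on the identity $\sigma^2 = \frac{1}{N}\sum x_i^2 - \mathrm{Mean}(x)^2$, so each map only needs to emit its count, its sum of squares, and its mean. A minimal Python sketch of that map output and the single reducer, computing the population standard deviation; the function names and sample data are hypothetical.

```python
import math

def map_split(xs):
    """Per-split map output: (N_m, sum of squares, mean)."""
    n = len(xs)
    return n, sum(x * x for x in xs), sum(xs) / n

def reduce_stats(partials):
    """Single reducer: combine per-split triples into sigma."""
    n = sum(count for count, _, _ in partials)
    mean = sum(count * m for count, _, m in partials) / n
    mean_of_squares = sum(sq for _, sq, _ in partials) / n
    # Clamp at zero: rounding can push the difference slightly negative
    return math.sqrt(max(mean_of_squares - mean * mean, 0.0))

splits = [[2, 4, 4, 4], [5, 5, 7, 9]]
sigma = reduce_stats([map_split(s) for s in splits])
```

The subtraction of two large, nearly equal numbers is where the accuracy warning applies: for data with a large mean and small spread, this form can lose many significant digits compared with the two-stage version.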

NGrams

Bigrams
Input: A large text corpus
Output: List(word1, TopK(word2))
Two Stages:
  Generate all possible bigrams
  Find most frequent K bigrams for each word

Bigrams: Stage 1 Map

Generate all possible Bigrams
Map Input: Large text corpus
Map computation
  In each sentence, for each word1 word2
  Output (word1, word2), (word2, word1)
Partition & Sort by (word1, word2)

[Link]
while(<STDIN>) {
    chomp;
    $_ =~ s/[^a-zA-Z]+/ /g ;
    $_ =~ s/^\s+//g ;
    $_ =~ s/\s+$//g ;
    $_ =~ tr/A-Z/a-z/;
    my @words = split(/\s+/, $_);
    # $#words is the last index, so this bound visits every adjacent pair
    for (my $i = 0; $i < $#words; ++$i) {
        print "$words[$i]:$words[$i+1]\n";
        print "$words[$i+1]:$words[$i]\n";
    }
}

Bigrams: Stage 1 Reduce

Input: List(word1, word2) sorted and partitioned
Output: List(word1, [freq, word2])
Counting similar to Unigrams example

[Link]
$_ = <STDIN>;
chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while(<STDIN>) {
    chomp;
    my ($w1, $w2) = split(/:/, $_);
    if ($w1 eq $pw1 && $w2 eq $pw2) {
        $count++;
    } else {
        print "$pw1:$count:$pw2\n";
        $pw1 = $w1;
        $pw2 = $w2;
        $count = 1;
    }
}
print "$pw1:$count:$pw2\n";

Bigrams: Stage 2 Map

Input: List(word1, [freq, word2])
Output: List(word1, [freq, word2])
Identity Mapper (/bin/cat)
Partition by word1
Sort descending by (word1, freq)

Bigrams: Stage 2 Reduce

Input: List(word1, [freq, word2])
  partitioned by word1
  sorted descending by (word1, freq)
Output: TopK(List(word1, [freq, word2]))
For each word, throw away after K records

[Link]
$N = 5;
$_ = <STDIN>;
chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1\t$pw2,$count;";
while(<STDIN>) {
    chomp;
    my ($w1, $c, $w2) = split(/:/, $_);
    if ($w1 eq $pw1) {
        if ($idx < $N) {
            $out .= "$w2,$c;";
            $idx++;
        }
    } else {
        print "$out\n";
        $pw1 = $w1;
        $idx = 1;
        $out = "$pw1\t$w2,$c;";
    }
}
print "$out\n";

Partitioner
By default, evenly distributes keys
  hashcode(key) % NumReducers
Overriding partitioner
  Skew in map-outputs
  Restrictions on reduce outputs
    All URLs in a domain together

Partitioner
// [Link](className)
public interface Partitioner <K, V> extends JobConfigurable {
  int getPartition(K key, V value, int maxPartitions);
}
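The partitioning idea itself is independent of Java. The following Python sketch shows a default-style hash partitioner and a custom one that keeps all URLs of a domain together. It uses `zlib.crc32` as a stand-in hash because Python's built-in `hash()` for strings varies between processes; both function names are hypothetical.

```python
import zlib
from urllib.parse import urlparse

def hash_partition(key, num_reducers):
    """Default-style partitioning: hash(key) % number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def domain_partition(url, num_reducers):
    """Custom partitioning: hash only the host, so a domain stays together."""
    host = urlparse(url).netloc
    return hash_partition(host, num_reducers)

p1 = domain_partition("http://example.com/a", 8)
p2 = domain_partition("http://example.com/b?x=1", 8)
```

Both URLs land in the same partition because only the host contributes to the hash. The price is potential skew: one very large domain sends all of its records to a single reducer.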

Fully Sorted Output

By contract, reducer gets input sorted on key
Typically reducer output order is the same as input order
  Each output file (part file) is sorted
How to make sure that Keys in part i are all less than keys in part i+1 ?

Fully Sorted Output

Use single reducer for small output
Insight: Reducer input must be fully sorted
Partitioner should provide fully sorted reduce input
  Sampling + Histogram equalization
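One way to read "sampling + histogram equalization" is: sample the keys, pick split points that divide the sample into equal-sized buckets, and route each key to the bucket its value falls in. The sketch below illustrates that reading with integer keys; it is an assumption about the technique, not code from the deck, and the helper names are hypothetical.

```python
import bisect
import random

def pick_split_points(sample, num_reducers):
    """Choose num_reducers - 1 boundaries that split the sample evenly."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def range_partition(key, split_points):
    """Partition index = number of boundaries less than or equal to key."""
    return bisect.bisect_right(split_points, key)

random.seed(0)
keys = [random.randint(0, 9999) for _ in range(1000)]
points = pick_split_points(random.sample(keys, 100), 4)

parts = [[] for _ in range(4)]
for k in keys:
    parts[range_partition(k, points)].append(k)
```

Every key in part i is smaller than every key in part i+1, so concatenating the individually sorted part files yields a fully sorted output. How balanced the parts are depends on how representative the sample is.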

Number of Maps
Number of Input Splits
  Number of HDFS blocks
[Link]
Minimum Split Size ([Link])
split_size = max(min(hdfs_block_size, data_size/#maps), min_split_size)
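The split-size formula is easier to reason about with numbers plugged in. A small sketch that evaluates it directly; the parameter names mirror the formula, integer division is an assumption, and the sizes are made-up examples.

```python
MB = 1024 * 1024

def split_size(hdfs_block_size, data_size, requested_maps, min_split_size):
    """max(min(block size, data size / #maps), minimum split size)."""
    return max(min(hdfs_block_size, data_size // requested_maps), min_split_size)

data = 10 * 1024 * MB   # 10 GB of input

few_maps = split_size(64 * MB, data, 10, 1)           # capped by the block size
many_maps = split_size(64 * MB, data, 1000, 1)        # smaller than a block
forced_big = split_size(64 * MB, data, 10, 128 * MB)  # minimum split size wins
```

Asking for fewer maps than there are blocks has no effect, because the block size caps the split; asking for more maps shrinks the splits; and raising the minimum split size is the only knob here that makes splits larger than a block.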

Parameter Sweeps
External program processes data based on command-line parameters
./prog params=0.1,0.3 < [Link] > [Link]
Objective: Run an instance of ./prog for each parameter combination
Number of Mappers = Number of different parameter combinations

Parameter Sweeps
Input File: [Link]
  Each line contains one combination of parameters
Input format is NLineInputFormat (N=1)
Number of maps = Number of splits = Number of lines in [Link]

Auxiliary Files
-file [Link]
Job submitter adds file to [Link]
Unjarred on the task tracker
Available to task as $cwd/[Link]
Not suitable for large / frequently used files

Auxiliary Files
Tasks need to access side files
  Read-only Dictionaries (such as for porn filtering)
  Dynamically linked libraries
Tasks themselves can fetch files from HDFS
  Not Always ! (Hint: Unresolved symbols)

Distributed Cache
Specify side files via -cacheFile
If lot of such files needed
  Create a [Link] archive
  Upload to HDFS
  Specify via -cacheArchive

Distributed Cache
TaskTracker downloads these files once
Untars archives
Accessible in task's $cwd before task starts
Cached across multiple tasks
Cleaned up upon exit

Joining Multiple Datasets

Datasets are streams of key-value pairs
Could be split across multiple files in a single directory
Join could be on Key, or any field in Value
Join could be inner, outer, left outer, cross product etc
Join is a natural Reduce operation

Example
A = (id, name), B = (name, address)
A is in /path/to/A/part-*
B is in /path/to/B/part-*
Select [Link], [Link] where [Link] == [Link]

Map in Join
Input: (Key1, Value1) from A or B
  [Link] indicates A or B
  MAP_INPUT_FILE in Streaming
Output: (Key2, [Value2, A|B])
  Key2 is the Join Key

Reduce in Join
Input: Groups of [Value2, A|B] for each Key2
Operation depends on which kind of join
  Inner join checks if key has values from both A & B
Output: (Key2, JoinFunction(Value2, ...))
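The map and reduce sides of the join can be sketched in memory using the A = (id, name), B = (name, address) example. The mappers tag each value with its source, the shuffle groups by the join key, and the reducer pairs up values from both sides. All data and function names below are hypothetical.

```python
from collections import defaultdict

A = [(1, "alice"), (2, "bob")]                         # (id, name)
B = [("alice", "12 Oak St"), ("carol", "9 Elm St")]    # (name, address)

def map_a(record):
    id_, name = record
    return name, (id_, "A")          # join key first, value tagged with source

def map_b(record):
    name, address = record
    return name, (address, "B")

def reduce_inner(name, tagged_values):
    """Inner join: emit rows only when both A and B contributed values."""
    ids = [v for v, source in tagged_values if source == "A"]
    addresses = [v for v, source in tagged_values if source == "B"]
    return [(i, a) for i in ids for a in addresses]

# Stand-in for the shuffle: group tagged values by join key
groups = defaultdict(list)
for key, value in [map_a(r) for r in A] + [map_b(r) for r in B]:
    groups[key].append(value)

joined = [row for key in sorted(groups) for row in reduce_inner(key, groups[key])]
```

Only "alice" appears in both datasets, so the inner join yields a single row. Switching to an outer join would only change `reduce_inner`, which is why the slide says the operation depends on the kind of join.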

MR Join Performance
Map Input = Total of A & B
Map output = Total of A & B
Shuffle & Sort
Reduce input = Total of A & B
Reduce output = Size of Joined dataset
Filter and Project in Map

Join Special Cases

Fragment-Replicate
  100GB dataset with 100 MB dataset
Equipartitioned Datasets
  Identically Keyed
  Equal Number of partitions
  Each partition locally sorted

Fragment-Replicate
Fragment larger dataset
  Specify as Map input
Replicate smaller dataset
  Use Distributed Cache
Map-Only computation
  No shuffle / sort
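A fragment-replicate join can be sketched as a map-only lookup: every map task holds the whole small dataset in memory and streams its own fragment of the large one past it. In this toy version a plain dictionary stands in for the file shipped through the Distributed Cache; the data and names are hypothetical.

```python
# Small dataset, replicated to every map task (stand-in for a cached file)
small = {"alice": "12 Oak St", "carol": "9 Elm St"}    # name -> address

def map_join(fragment, lookup):
    """Map-only inner join of one (id, name) fragment against the lookup table."""
    for id_, name in fragment:
        if name in lookup:
            yield id_, lookup[name]

# Large dataset, split into fragments (one per map task)
fragments = [[(1, "alice")], [(2, "bob"), (3, "carol")]]
joined = [row for fragment in fragments for row in map_join(fragment, small)]
```

No grouping by key is needed, so nothing is shuffled or sorted. The approach only applies while the replicated side fits comfortably in each task's memory.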

Equipartitioned Join
Available since Hadoop 0.16
Datasets joined before input to mappers
Input format: CompositeInputFormat
[Link]
Simpler to use in Java, but can be used in Streaming

Example
[Link] = inner (
  tbl (
    ....[Link],
    "hdfs://namenode:8020/path/to/data/A"
  ),
  tbl (
    ....[Link],
    "hdfs://namenode:8020/path/to/data/B"
  )
)

Questions ?

Apache Pig

What is Pig?
System for processing large semi-structured data sets using Hadoop MapReduce platform
Pig Latin: High-level procedural language
Pig Engine: Parser, Optimizer and distributed query execution

Pig vs SQL

Pig                                SQL
Pig is procedural                  SQL is declarative
Nested relational data model       Flat relational data model
Schema is optional                 Schema is required
Scan-centric analytic workloads    OLTP + OLAP workloads
Limited query optimization         Significant opportunity for query optimization

Pig vs Hadoop
Increases programmer productivity
Decreases duplication of effort
Insulates against Hadoop complexity
  Version Upgrades
  JobConf configuration tuning
  Job Chains

Example

Input: User profiles, Page visits
Find the top 5 most visited pages by users aged 18-25

In Native Hadoop

In Pig
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';

Natural Fit

Comparison

Flexibility & Control

Easy to plug-in user code
Metadata is not mandatory
Does not impose a data model
Fine grained control
Complex data types

Pig Data Types

Tuple: Ordered set of fields
  Field can be simple or complex type
  Nested relational model
Bag: Collection of tuples
  Can contain duplicates
Map: Set of (key, value) pairs

Simple data types

int : 42
long : 42L
float : 3.1415f
double : 2.7182818
chararray : UTF-8 String
bytearray : blob

Expressions
A = LOAD '[Link]' AS (f1:int, f2:{t:(n1:int, n2:int)}, f3:map[])

A = { ( 1,                     -- A.f1 or A.$0
        { (2, 3), (4, 6) },    -- A.f2 or A.$1
        [ yahoo#mail ] ) }     -- A.f3 or A.$2

Pig Unigrams
Input: Large text document
Process:
  Load the file
  For each line, generate word tokens
  Group by word
  Count words in each group
Load
myinput = load '/user/milindb/[Link]' USING TextLoader() as (myword:chararray);

{
(program program)
(pig pig)
(program pig)
(hadoop pig)
(latin latin)
(pig latin)
}

Tokenize
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*));

{
(program) (program) (pig) (pig) (program) (pig)
(hadoop) (pig) (latin) (latin) (pig) (latin)
}

Group
grouped = GROUP words BY $0;

{
(pig, {(pig), (pig), (pig), (pig), (pig)})
(latin, {(latin), (latin), (latin)})
(hadoop, {(hadoop)})
(program, {(program), (program), (program)})
}

Count
counts = FOREACH grouped GENERATE group, COUNT(words);

{
(pig, 5L)
(latin, 3L)
(hadoop, 1L)
(program, 3L)
}

Store
store counts into '/user/milindb/output' using PigStorage();

pig      5
latin    3
hadoop   1
program  3

Example: Log Processing

-- use a custom loader
Logs = load '/var/log/access_log' using CommonLogLoader()
    as (addr, logname, user, time, method, uri, p, bytes);
-- apply your own function
Cleaned = foreach Logs generate addr, canonicalize(uri) as url;
Grouped = group Cleaned by url;
-- run the result through a binary
Analyzed = stream Grouped through '[Link]';
store Analyzed into 'analyzedurls';

Schema on the fly
-- declare your types
Grades = load 'studentgrades' as (name: chararray, age: int, gpa: double);
Good = filter Grades by age > 18 and gpa > 3.0;
-- ordering will be by type
Sorted = order Good by gpa;
store Sorted into 'smartgrownups';

Nested Data
Logs = load 'weblogs' as (url, userid);
Grouped = group Logs by url;
-- Code inside {} will be applied to each
-- value in turn.
DistinctCount = foreach Grouped {
    Userid = [Link];
    DistinctUsers = distinct Userid;
    generate group, COUNT(DistinctUsers);
}
store DistinctCount into 'distinctcount';

Pig Architecture

Pig Stages

Logical Plan
Directed Acyclic Graph
  Logical Operator as Node
  Data flow as edges
Logical Operators
  One per Pig statement
Type checking with Schema

Pig Statements
Load     Read data from the file system
Store    Write data to the file system
Dump     Write data to stdout

Pig Statements
Foreach..Generate    Apply expression to each record and generate one or more records
Filter               Apply predicate to each record and remove records where false
Stream..through      Stream records through user-provided binary

Pig Statements
Group/CoGroup    Collect records with the same key from one or more inputs
Join             Join two or more inputs based on a key
Order..by        Sort records based on a key

Physical Plan
Pig supports two back-ends
  Local
  Hadoop MapReduce
1:1 correspondence with most logical operators
  Except Distinct, Group, Cogroup, Join etc

MapReduce Plan
Detect Map-Reduce boundaries
  Group, Cogroup, Order, Distinct
Coalesce operators into Map and Reduce stages
[Link] is created and submitted to Hadoop JobControl

Lazy Execution
Nothing really executes until you request output
  Store, Dump, Explain, Describe, Illustrate
Advantages
  In-memory pipelining
  Filter re-ordering across multiple commands

Parallelism
Split-wise parallelism on Map-side operators
By default, 1 reducer
PARALLEL keyword
  group, cogroup, cross, join, distinct, order

Running Pig
$ pig
grunt > A = load 'students' as (name, age, gpa);
grunt > B = filter A by gpa > 3.5;
grunt > store B into 'good_students';
grunt > dump A;
(jessica thompson, 73, 1.63)
(victor zipper, 23, 2.43)
(rachel hernandez, 40, 3.60)
grunt > describe A;
A: (name, age, gpa )

Running Pig
Batch mode $ pig [Link] Local mode $ pig x local Java mode (embed pig statements in java) Keep [Link] in the class path

PigPen

Pig for SQL Programmers


SQL to Pig
SQL: ...FROM MyTable...
Pig: A = LOAD 'MyTable' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);

SQL: ...WHERE col2 > 2
Pig: B = FILTER A BY col2 > 2;

SQL: SELECT col1 + col2, col3 ...
Pig: C = FOREACH B GENERATE col1 + col2, col3;

SQL to Pig
SQL: SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2
Pig: D = GROUP A BY (col1, col2);
     E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);

SQL: ...HAVING sum(col3) > 5
Pig: F = FILTER E BY $2 > 5;

SQL: ...ORDER BY col1
Pig: G = ORDER F BY $0;
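The GROUP BY, HAVING and ORDER BY mappings above can be traced step by step with a small Python sketch. The sample rows and the function name group_sum_having_order are illustrative assumptions; the comments show which Pig statement each step stands in for, including the positional references $0 and $2.

```python
from collections import defaultdict


def group_sum_having_order(rows, threshold):
    # D = GROUP A BY (col1, col2)
    groups = defaultdict(list)
    for col1, col2, col3 in rows:
        groups[(col1, col2)].append(col3)
    # E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3)
    e = [(c1, c2, sum(vals)) for (c1, c2), vals in groups.items()]
    # F = FILTER E BY $2 > threshold   ($2 is the third field, the sum)
    f = [t for t in e if t[2] > threshold]
    # G = ORDER F BY $0                ($0 is the first field, col1)
    return sorted(f, key=lambda t: t[0])


# Hypothetical (col1, col2, col3) rows.
rows = [(2, 1, 4), (2, 1, 3), (1, 5, 6), (3, 0, 1)]
result = group_sum_having_order(rows, 5)
```

After FLATTEN(group) the tuple layout is (col1, col2, sum), which is why the HAVING clause becomes a filter on $2 and the ORDER BY clause sorts on $0.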

SQL to Pig
SQL: SELECT DISTINCT col1 from X
Pig: I = FOREACH A GENERATE col1;
     J = DISTINCT I;

SQL: SELECT col1, count(DISTINCT col2) FROM X GROUP BY col1
Pig: K = GROUP A BY col1;
     L = FOREACH K {
         M = DISTINCT A.col2;
         GENERATE FLATTEN(group), COUNT(M);
     }

SQL to Pig
SQL: SELECT A.col1, B.col3 FROM A JOIN B USING (col1)
Pig: N = JOIN A BY col1, B BY col1;
     O = FOREACH N GENERATE A::col1, B::col3;
     -- Or
     N = COGROUP A BY col1 INNER, B BY col1 INNER;
     O = FOREACH N GENERATE FLATTEN(A), FLATTEN(B);
     P = FOREACH O GENERATE A::col1, B::col3;

Questions?
