
04 Combiners and Partition Functions 12-17 Advanced


Mining of Massive Datasets
Leskovec, Rajaraman, and Ullman
Stanford University

• Often a Map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k
  ◦ E.g., popular words in the word count example

• Can save network time by pre-aggregating values in the mapper:
  ◦ combine(k, list(v1)) → v2
  ◦ Combiner is usually same as the reduce function

• Back to our word counting example:
  ◦ Combiner combines, for each key, the values emitted by a single mapper (single node)
  ◦ Much less data needs to be copied and shuffled!

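A minimal sketch of the word count example with and without a combiner, simulated in a single Python process. The input splits and the names map_words and sum_by_key are made up for illustration; a real MapReduce system would run the combiner on each mapper's node before the shuffle.

```python
from collections import Counter

def map_words(text):
    """Map: emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in text.split()]

def sum_by_key(pairs):
    """Sum the values for each key; serves as both combiner and reducer."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical input splits, one per mapper.
splits = ["the cat saw the dog", "the dog saw the bird the end"]

mapped = [map_words(s) for s in splits]
# Combiner step: pre-aggregate each mapper's output locally.
combined = [sum_by_key(pairs) for pairs in mapped]

# Number of (key, value) pairs that would cross the network.
pairs_without_combiner = sum(len(p) for p in mapped)
pairs_with_combiner = sum(len(p) for p in combined)

# Reduce step: the final counts are the same either way.
counts_without = sum_by_key(pair for pairs in mapped for pair in pairs)
counts_with = sum_by_key(pair for pairs in combined for pair in pairs)

print(pairs_without_combiner, pairs_with_combiner)  # 12 9
print(counts_with == counts_without)                # True
```

The saving grows with how often a key repeats within one mapper's input, which is why popular words benefit most.
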
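• Combiner trick works only if the reduce function is commutative and associative
• Sum: works, since addition is commutative and associative
• Average: does not work directly, since an average of per-mapper averages is not the overall average; works if the combiner emits (sum, count) pairs and the reducer divides at the end
• Median: does not work, since the overall median cannot be recovered from per-mapper medians
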
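The difference between these three cases can be checked on a small example. The sketch below uses made-up values for two mappers and hypothetical helper names (combine_avg, reduce_avg); it illustrates the (sum, count) approach to averaging and is not code from any particular framework.

```python
from statistics import median

def combine_avg(values):
    """Combiner for averages: emit a (sum, count) pair, not an average."""
    return (sum(values), len(values))

def reduce_avg(partials):
    """Reducer: add up the partial sums and counts, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Hypothetical values for one key, seen by two different mappers.
mapper_a = [1, 2, 3, 4]
mapper_b = [10]
all_values = mapper_a + mapper_b

# Sum: combining partial sums gives the exact total.
sum_via_combiner = sum([sum(mapper_a), sum(mapper_b)])

# Average: averaging the per-mapper averages gives the wrong answer.
true_avg = sum(all_values) / len(all_values)
naive_avg = (sum(mapper_a) / len(mapper_a) + sum(mapper_b) / len(mapper_b)) / 2
pair_avg = reduce_avg([combine_avg(mapper_a), combine_avg(mapper_b)])

# Median: the median of per-mapper medians is not the overall median.
true_median = median(all_values)
median_of_medians = median([median(mapper_a), median(mapper_b)])

print(sum_via_combiner == sum(all_values))  # True
print(true_avg, naive_avg, pair_avg)        # 4.0 6.25 4.0
print(true_median, median_of_medians)       # 3 6.25
```
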
• Want to control how keys get partitioned
  ◦ The set of keys that go to a single reduce worker

• System uses a default partition function:
  ◦ hash(key) mod R

• Sometimes useful to override the hash function:
  ◦ E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file

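A sketch of the two partition functions, assuming zlib.crc32 as a stand-in for the unspecified hash and a made-up list of URLs. The function names and the value of R are illustrative. Python's built-in hash() is avoided here because it is salted per process for strings, so it would not give stable assignments across workers.

```python
import zlib
from urllib.parse import urlparse

def default_partition(key, num_reducers):
    """Default: hash(key) mod R, with CRC32 as an assumed stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def host_partition(url, num_reducers):
    """Override: hash(hostname(URL)) mod R, so one host maps to one reducer."""
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_reducers

R = 4  # number of reduce workers (illustrative)

# Hypothetical URL keys.
urls = [
    "http://example.com/index.html",
    "http://example.com/about.html",
    "http://example.com/news/today.html",
    "http://example.org/index.html",
]

for url in urls:
    print(url, default_partition(url, R), host_partition(url, R))

# Every example.com URL lands in the same partition under the override.
example_com_parts = {host_partition(u, R) for u in urls if "example.com" in u}
print(len(example_com_parts))  # 1
```

Under the default function, URLs from one host may be spread over several reducers, since each full URL hashes independently.
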
• Google MapReduce
  ◦ Uses Google File System (GFS) for stable storage
  ◦ Not available outside Google

• Hadoop
  ◦ Open-source implementation in Java
  ◦ Uses HDFS for stable storage
  ◦ Download: http://lucene.apache.org/hadoop/

• Hive, Pig
  ◦ Provide SQL-like abstractions on top of Hadoop Map-Reduce layer

• Ability to rent computing by the hour
  ◦ Additional services, e.g., persistent storage

• E.g., Amazon’s “Elastic Compute Cloud” (EC2)
  ◦ S3 (stable storage)
  ◦ Elastic Map Reduce (EMR)

• Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
  ◦ http://labs.google.com/papers/mapreduce.html

• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
  ◦ http://labs.google.com/papers/gfs.html

• Hadoop Wiki
  ◦ Introduction
    ▪ http://wiki.apache.org/lucene-hadoop/
  ◦ Getting Started
    ▪ http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
  ◦ Map/Reduce Overview
    ▪ http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
    ▪ http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
  ◦ Eclipse Environment
    ▪ http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
• Javadoc
  ◦ http://lucene.apache.org/hadoop/docs/api/

• Releases from Apache download mirrors
  ◦ http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
• Nightly builds of source
  ◦ http://people.apache.org/dist/lucene/hadoop/nightly/
• Source code from subversion
  ◦ http://lucene.apache.org/hadoop/version_control.html
