{"id":2045,"date":"2012-10-17T10:00:00","date_gmt":"2012-10-17T10:00:00","guid":{"rendered":"http:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-intensive-text-processing-local-aggregation-part-ii.html"},"modified":"2012-10-28T23:34:59","modified_gmt":"2012-10-28T23:34:59","slug":"mapreduce-working-through-data-2","status":"publish","type":"post","link":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html","title":{"rendered":"MapReduce: Working Through Data-Intensive Text Processing &#8211; Local Aggregation Part II"},"content":{"rendered":"<div dir=\"ltr\" style=\"text-align: left\">This post continues with the series on implementing algorithms found in the <a href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\" title=\"Working Through Data-Intensive Text Processing with MapReduce\">Data Intensive Processing with MapReduce<\/a> book. Part one can be found <a href=\"http:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html\" target=\"_blank\" title=\"Working Through Data-Intensive Text Processing with MapReduce\">here<\/a>. In the previous post, we discussed using the technique of local aggregation as a means of reducing the amount of data shuffled and transferred across the network. Reducing the amount of data transferred is one of the top ways to improve the efficiency of a MapReduce job. A word-count MapReduce job was used to demonstrate local aggregation. Since the results only require a total count, we could re-use the same reducer for our combiner as changing the order or groupings of the addends will not affect the sum.<\/p>\n<p>But what if you wanted an <i>average<\/i>? Then the same approach would not work because calculating an average of averages is not equal to the average of the original set of numbers. With a little bit of insight though, we can still use local aggregation. 
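To see concretely why an average of averages goes wrong, here is a small plain-Java illustration with made-up numbers (two hypothetical mapper partitions of unequal size):

```java
// Why a reducer that averages cannot be reused as a combiner:
// averaging the partial averages weights each partition equally,
// regardless of how many values each partition contributed.
public class AverageOfAverages {
    public static void main(String[] args) {
        // Hypothetical values: mapper 1 sees {10, 20, 30}, mapper 2 sees {40}.
        // True average over all four values:
        double trueAverage = (10 + 20 + 30 + 40) / 4.0;

        // Averaging the two partition averages gives the wrong answer:
        double avg1 = (10 + 20 + 30) / 3.0;     // 20.0
        double avg2 = 40 / 1.0;                 // 40.0
        double averageOfAverages = (avg1 + avg2) / 2.0;

        // Carrying (sum, count) pairs through the combine step instead
        // preserves correctness, since sums and counts add freely:
        int sum = (10 + 20 + 30) + 40;
        int count = 3 + 1;
        double pairwiseAverage = (double) sum / count;

        System.out.println(trueAverage);        // 25.0
        System.out.println(averageOfAverages);  // 30.0 -- wrong
        System.out.println(pairwiseAverage);    // 25.0 -- correct
    }
}
```

This is exactly the insight the rest of the post builds on: emit (sum, count) pairs instead of averages.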
For these examples we will be using a sample of the <a href=\"https:\/\/github.com\/tomwhite\/hadoop-book\/tree\/master\/input\/ncdc\/all\" target=\"_blank\">NCDC weather dataset<\/a> used in the book <a href=\"http:\/\/www.amazon.com\/Hadoop-Definitive-Guide-Tom-White\/dp\/1449311520\" target=\"_blank\">Hadoop: The Definitive Guide<\/a>. We will calculate the average temperature for each month in the year 1901. The averages algorithm for the combiner and the in-mapper combining option can be found in chapter 3.1.3 of Data-Intensive Processing with MapReduce. <\/p>\n<p><strong>One Size Does Not Fit All<\/strong><\/p>\n<p>Last time we described two approaches for reducing data in a MapReduce job: Hadoop Combiners and the in-mapper combining approach. Combiners are considered an optimization by the Hadoop framework and there are no guarantees on how many times they will be called, if at all. As a result, mappers must emit data in the form expected by the reducers, so that if combiners aren't invoked, the final result is unchanged. To adjust for calculating averages, we need to go back to the mapper and change its output.<br \/>\n<strong><br \/>\n<\/strong><strong>Mapper Changes<\/strong>        <\/p>\n<p>In the word-count example, the non-optimized mapper simply emitted the word and the count of 1. The combiner and in-mapper combining mapper optimized this output by keeping each word as a key in a hash map with the total count as the value. Each time a word was seen, the count was incremented by 1. With this setup, if the combiner was not called, the reducer would receive the word as a key and a long list of 1's to add together, resulting in the same output (of course, the in-mapper combining mapper avoided this issue because it is guaranteed to combine results, since combining is part of the mapper code). 
To compute an average, we will have our base mapper emit a string key (the year and month of the weather observation concatenated together) and a <a href=\"http:\/\/hadoop.apache.org\/docs\/r0.20.2\/api\/org\/apache\/hadoop\/io\/Writable.html\" target=\"_blank\">custom writable<\/a> object, called TemperatureAveragingPair. The <a href=\"https:\/\/github.com\/bbejeck\/hadoop-algorithms\/blob\/master\/src\/bbejeck\/mapred\/aggregation\/TemperatureAveragingPair.java\" target=\"_blank\">TemperatureAveragingPair<\/a> object will contain two numbers (IntWritables), the temperature taken and a count of one. We will take the MaximumTemperatureMapper from Hadoop: The Definitive Guide and use it as inspiration for creating an AverageTemperatureMapper:         <\/p>\n<pre class=\"brush:java\">public class AverageTemperatureMapper extends Mapper&lt;LongWritable, Text, Text, TemperatureAveragingPair&gt; {\r\n \/\/sample line of weather data\r\n \/\/0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999\r\n\r\n\r\n    private Text outText = new Text();\r\n    private TemperatureAveragingPair pair = new TemperatureAveragingPair();\r\n    private static final int MISSING = 9999;\r\n\r\n    @Override\r\n    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {\r\n        String line = value.toString();\r\n        String yearMonth = line.substring(15, 21);\r\n\r\n        int tempStartPosition = 87;\r\n\r\n        if (line.charAt(tempStartPosition) == '+') {\r\n            tempStartPosition += 1;\r\n        }\r\n\r\n        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));\r\n\r\n        if (temp != MISSING) {\r\n            outText.set(yearMonth);\r\n            pair.set(temp, 1);\r\n            context.write(outText, pair);\r\n        }\r\n    }\r\n}<\/pre>\n<p>By having the mapper output a key and TemperatureAveragingPair object our 
MapReduce program is guaranteed to have the correct results regardless of whether the combiner is called.<br \/>\n<strong><br \/>\n<\/strong><strong>Combiner<\/strong><\/p>\n<p>We need to reduce the amount of data sent, so we will sum the temperatures and the counts and store them separately. By doing so we will reduce the data sent, but preserve the format needed for calculating correct averages. If\/when the combiner is called, it will take all the TemperatureAveragingPair objects passed in and emit a single TemperatureAveragingPair object for the same key, containing the summed temperatures and counts. Here is the code for the combiner:          <\/p>\n<pre class=\"brush:java\">public class AverageTemperatureCombiner extends Reducer&lt;Text,TemperatureAveragingPair,Text,TemperatureAveragingPair&gt; {\r\n    private TemperatureAveragingPair pair = new TemperatureAveragingPair();\r\n\r\n    @Override\r\n    protected void reduce(Text key, Iterable&lt;TemperatureAveragingPair&gt; values, Context context) throws IOException, InterruptedException {\r\n        int temp = 0;\r\n        int count = 0;\r\n        for (TemperatureAveragingPair value : values) {\r\n             temp += value.getTemp().get();\r\n             count += value.getCount().get();\r\n        }\r\n        pair.set(temp,count);\r\n        context.write(key,pair);\r\n    }\r\n}<\/pre>\n<p>But we are really interested in a guarantee that we have reduced the amount of data sent to the reducers, so we'll have a look at how to achieve that next.<br \/>\n<strong><br \/>\n<\/strong><strong>In-Mapper Combining Averages<\/strong>        <\/p>\n<p>Similar to the word-count example, for calculating averages the in-mapper-combining mapper will use a hash map with the concatenated year+month as a key and a TemperatureAveragingPair as the value. 
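The TemperatureAveragingPair itself is only linked above, not shown. As a rough, dependency-free sketch of its shape (the real class implements Hadoop's Writable interface and wraps two IntWritables, hence the getTemp().get() calls in the mapper code; the plain int fields here are a simplification):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified stand-in for the real TemperatureAveragingPair Writable.
// write and readFields carry the same signatures Hadoop's Writable
// interface requires, so the serialization shape matches.
public class TemperatureAveragingPairSketch {
    private int temp;   // summed temperature readings (Celsius * 10)
    private int count;  // number of readings contributing to the sum

    public void set(int temp, int count) {
        this.temp = temp;
        this.count = count;
    }

    public int getTemp()  { return temp; }
    public int getCount() { return count; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(temp);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        temp = in.readInt();
        count = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        // Round-trip the pair the way the framework would during the shuffle.
        TemperatureAveragingPairSketch pair = new TemperatureAveragingPairSketch();
        pair.set(-78, 1);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        pair.write(new DataOutputStream(bytes));

        TemperatureAveragingPairSketch copy = new TemperatureAveragingPairSketch();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.getTemp() + " " + copy.getCount()); // -78 1
    }
}
```

The write and readFields methods are what Hadoop invokes to serialize the value between the map and reduce phases, which is why every custom value type must implement them symmetrically.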
Each time we get the same year+month combination, we'll take the pair object out of the map, add the temperature, and increase the count by one. When the cleanup method is called, we'll emit all pairs with their respective keys:         <\/p>\n<pre class=\"brush:java\">public class AverageTemperatureCombiningMapper extends Mapper&lt;LongWritable, Text, Text, TemperatureAveragingPair&gt; {\r\n \/\/sample line of weather data\r\n \/\/0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999\r\n\r\n\r\n    private static final int MISSING = 9999;\r\n    private Map&lt;String,TemperatureAveragingPair&gt; pairMap = new HashMap&lt;String,TemperatureAveragingPair&gt;();\r\n\r\n\r\n    @Override\r\n    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {\r\n        String line = value.toString();\r\n        String yearMonth = line.substring(15, 21);\r\n\r\n        int tempStartPosition = 87;\r\n\r\n        if (line.charAt(tempStartPosition) == '+') {\r\n            tempStartPosition += 1;\r\n        }\r\n\r\n        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));\r\n\r\n        if (temp != MISSING) {\r\n            TemperatureAveragingPair pair = pairMap.get(yearMonth);\r\n            if(pair == null){\r\n                pair = new TemperatureAveragingPair();\r\n                pairMap.put(yearMonth,pair);\r\n            }\r\n            int temps = pair.getTemp().get() + temp;\r\n            int count = pair.getCount().get() + 1;\r\n            pair.set(temps,count);\r\n        }\r\n    }\r\n\r\n\r\n    @Override\r\n    protected void cleanup(Context context) throws IOException, InterruptedException {\r\n        Set&lt;String&gt; keys = pairMap.keySet();\r\n        Text keyText = new Text();\r\n        for (String key : keys) {\r\n             keyText.set(key);\r\n             
context.write(keyText,pairMap.get(key));\r\n        }\r\n    }\r\n}<\/pre>\n<p>By following the same pattern of keeping track of data between map calls, we can achieve reliable data reduction by implementing an in-mapper combining strategy. The same caveats about keeping state across all calls to the mapper apply, but given the gains in processing efficiency, this approach merits consideration.<br \/>\n<strong><br \/>\n<\/strong><strong>Reducer<\/strong>        <\/p>\n<p>At this point, writing our reducer is easy: take the list of pairs for each key, sum all the temperatures and counts, then divide the sum of the temperatures by the sum of the counts.         <\/p>\n<pre class=\"brush:java\">public class AverageTemperatureReducer extends Reducer&lt;Text, TemperatureAveragingPair, Text, IntWritable&gt; {\r\n    private IntWritable average = new IntWritable();\r\n\r\n    @Override\r\n    protected void reduce(Text key, Iterable&lt;TemperatureAveragingPair&gt; values, Context context) throws IOException, InterruptedException {\r\n        int temp = 0;\r\n        int count = 0;\r\n        for (TemperatureAveragingPair pair : values) {\r\n            temp += pair.getTemp().get();\r\n            count += pair.getCount().get();\r\n        }\r\n        average.set(temp \/ count);\r\n        context.write(key, average);\r\n    }\r\n}\r\n<\/pre>\n<p><strong><br \/>\n<\/strong><strong>Results<\/strong>        <\/p>\n<p>The results are predictable, with the combiner and in-mapper-combining options showing substantially reduced data output.<br \/>\nNon-Optimized Mapper Option:         <\/p>\n<pre class=\"brush:java\">12\/10\/10 23:05:28 INFO mapred.JobClient:     Reduce input groups=12\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Combine output records=0\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Map input records=6565\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Reduce shuffle bytes=111594\r\n12\/10\/10 23:05:28 INFO 
mapred.JobClient:     Reduce output records=12\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Spilled Records=13128\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Map output bytes=98460\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Combine input records=0\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Map output records=6564\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108\r\n12\/10\/10 23:05:28 INFO mapred.JobClient:     Reduce input records=6564<\/pre>\n<p>Combiner Option:         <\/p>\n<pre class=\"brush:java\">12\/10\/10 23:07:19 INFO mapred.JobClient:     Reduce input groups=12\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Combine output records=12\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Map input records=6565\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Reduce shuffle bytes=210\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Reduce output records=12\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Spilled Records=24\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Map output bytes=98460\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Combine input records=6564\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Map output records=6564\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108\r\n12\/10\/10 23:07:19 INFO mapred.JobClient:     Reduce input records=12<\/pre>\n<p>In-Mapper-Combining Option:         <\/p>\n<pre class=\"brush:java\">12\/10\/10 23:09:09 INFO mapred.JobClient:     Reduce input groups=12\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Combine output records=0\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Map input records=6565\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Reduce shuffle bytes=210\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Reduce output 
records=12\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Spilled Records=24\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Map output bytes=180\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Combine input records=0\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Map output records=12\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108\r\n12\/10\/10 23:09:09 INFO mapred.JobClient:     Reduce input records=12<\/pre>\n<p>Calculated Results:<br \/>\n(NOTE: the temperatures in the sample file are in Celsius * 10)         <\/p>\n<table border=\"1\" cellpadding=\"5\" cellspacing=\"5\">\n<tbody>\n<tr>\n<td>Non-Optimized<\/td>\n<td>Combiner      <\/td>\n<td>In-Mapper-Combining Mapper<\/td>\n<\/tr>\n<tr>\n<td>190101 -25<br \/>\n190102 -91<br \/>\n190103 -49<br \/>\n190104 22<br \/>\n190105 76<br \/>\n190106 146<br \/>\n190107 192<br \/>\n190108 170<br \/>\n190109 114<br \/>\n190110 86<br \/>\n190111 -16<br \/>\n190112 -77 <\/td>\n<td>190101 -25<br \/>\n190102 -91<br \/>\n190103 -49<br \/>\n190104 22<br \/>\n190105 76<br \/>\n190106 146<br \/>\n190107 192<br \/>\n190108 170<br \/>\n190109 114<br \/>\n190110 86<br \/>\n190111 -16<br \/>\n190112 -77 <\/td>\n<td>190101 -25<br \/>\n190102 -91<br \/>\n190103 -49<br \/>\n190104 22<br \/>\n190105 76<br \/>\n190106 146<br \/>\n190107 192<br \/>\n190108 170<br \/>\n190109 114<br \/>\n190110 86<br \/>\n190111 -16<br \/>\n190112 -77 <\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong><br \/>\n<\/strong><strong>Conclusion<\/strong>        <\/p>\n<p>We have covered local aggregation both for the simple case, where we could reuse the reducer as a combiner, and for a more complicated case, where some insight into how to structure the data was needed to still gain the benefits of local aggregation for increased processing efficiency.<br \/>\n<strong><br \/>\nFurther Reading<\/strong><br \/>\n<strong><br 
\/>\n<\/strong><\/p>\n<ul>\n<li> <a href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\" title=\"Data-Intensive Text Processing with MapReduce\">Data-Intensive Processing with MapReduce<\/a> by Jimmy Lin and Chris Dyer<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Hadoop-Definitive-Guide-Tom-White\/dp\/1449311520\/ref=tmm_pap_title_0?ie=UTF8&amp;qid=1347589052&amp;sr=1-1\" target=\"_blank\">Hadoop: The Definitive Guide<\/a> by Tom White<\/li>\n<li><a href=\"https:\/\/github.com\/bbejeck\/hadoop-algorithms\" target=\"_blank\" title=\"Source Code\">Source Code<\/a> from blog<\/li>\n<li><a href=\"http:\/\/hadoop.apache.org\/docs\/r0.20.2\/api\/index.html\">Hadoop API<\/a><\/li>\n<li><a href=\"http:\/\/mrunit.apache.org\/\" target=\"_blank\">MRUnit<\/a> for unit testing Apache Hadoop map reduce jobs<\/li>\n<li><a href=\"http:\/\/www.gutenberg.org\/\" target=\"_blank\" title=\"Project Gutenberg\">Project Gutenberg<\/a> a great source of books in plain text format, great for testing Hadoop jobs locally.<\/li>\n<\/ul>\n<p><strong><i>Reference: <\/i><\/strong><a href=\"http:\/\/codingjunkie.net\/text-processing-with-mapreduce-part-2\/\">Working Through Data-Intensive Text Processing with MapReduce \u2013 Local Aggregation Part II<\/a> from our <a href=\"http:\/\/www.javacodegeeks.com\/p\/jcg.html\">JCG partner<\/a> Bill Bejeck at the <a href=\"http:\/\/codingjunkie.net\/\">Random Thoughts On Coding<\/a> blog.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. In the previous post, we discussed using the technique of local aggregation as a means of reducing the amount of data shuffled and transferred across the network. 
Reducing the amount of &hellip;<\/p>\n","protected":false},"author":110,"featured_media":63,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[184,372,183],"class_list":["post-2045","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-enterprise-java","tag-apache-hadoop","tag-big-data","tag-mapreduce"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>MapReduce: Working Through Data-Intensive Text Processing - Local Aggregation Part II - Java Code Geeks<\/title>\n<meta name=\"description\" content=\"This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. In\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"MapReduce: Working Through Data-Intensive Text Processing - Local Aggregation Part II - Java Code Geeks\" \/>\n<meta property=\"og:description\" content=\"This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. 
In\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html\" \/>\n<meta property=\"og:site_name\" content=\"Java Code Geeks\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/javacodegeeks\" \/>\n<meta property=\"article:published_time\" content=\"2012-10-17T10:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2012-10-28T23:34:59+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"150\" \/>\n\t<meta property=\"og:image:height\" content=\"150\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Bill Bejeck\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:site\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Bill Bejeck\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html\"},\"author\":{\"name\":\"Bill Bejeck\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/69f9f11896bf9cfd7278b440efeda646\"},\"headline\":\"MapReduce: Working Through Data-Intensive Text Processing &#8211; Local Aggregation Part II\",\"datePublished\":\"2012-10-17T10:00:00+00:00\",\"dateModified\":\"2012-10-28T23:34:59+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html\"},\"wordCount\":980,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"keywords\":[\"Apache Hadoop\",\"Big Data\",\"MapReduce\"],\"articleSection\":[\"Enterprise Java\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html\",\"name\":\"MapReduce: Working Through Data-Intensive Text Processing - Local Aggregation Part II - Java Code 
Geeks\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"datePublished\":\"2012-10-17T10:00:00+00:00\",\"dateModified\":\"2012-10-28T23:34:59+00:00\",\"description\":\"This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. In\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#primaryimage\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"width\":150,\"height\":150},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/10\\\/mapreduce-working-through-data-2.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Enterprise 
Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\\\/enterprise-java\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"MapReduce: Working Through Data-Intensive Text Processing &#8211; Local Aggregation Part II\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"name\":\"Java Code Geeks\",\"description\":\"Java Developers Resource Center\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"alternateName\":\"JCG\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.javacodegeeks.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\",\"name\":\"Exelixis Media P.C.\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"width\":864,\"height\":246,\"caption\":\"Exelixis Media P.C.\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/javacodegeeks\",\"https:\\\/\\\/x.com\\\/javacodegeeks\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/69f9f11896bf9cfd7278b440efeda646\",\"name\":\"Bill 
Bejeck\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"caption\":\"Bill Bejeck\"},\"description\":\"Husband, father of 3, passionate about software development.\",\"sameAs\":[\"http:\\\/\\\/codingjunkie.net\\\/\"],\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/author\\\/Bill-Bejeck\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"MapReduce: Working Through Data-Intensive Text Processing - Local Aggregation Part II - Java Code Geeks","description":"This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. In","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html","og_locale":"en_US","og_type":"article","og_title":"MapReduce: Working Through Data-Intensive Text Processing - Local Aggregation Part II - Java Code Geeks","og_description":"This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. 
In","og_url":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html","og_site_name":"Java Code Geeks","article_publisher":"https:\/\/www.facebook.com\/javacodegeeks","article_published_time":"2012-10-17T10:00:00+00:00","article_modified_time":"2012-10-28T23:34:59+00:00","og_image":[{"width":150,"height":150,"url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","type":"image\/jpeg"}],"author":"Bill Bejeck","twitter_card":"summary_large_image","twitter_creator":"@javacodegeeks","twitter_site":"@javacodegeeks","twitter_misc":{"Written by":"Bill Bejeck","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#article","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html"},"author":{"name":"Bill Bejeck","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/69f9f11896bf9cfd7278b440efeda646"},"headline":"MapReduce: Working Through Data-Intensive Text Processing &#8211; Local Aggregation Part II","datePublished":"2012-10-17T10:00:00+00:00","dateModified":"2012-10-28T23:34:59+00:00","mainEntityOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html"},"wordCount":980,"commentCount":0,"publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","keywords":["Apache Hadoop","Big Data","MapReduce"],"articleSection":["Enterprise 
Java"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html","url":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html","name":"MapReduce: Working Through Data-Intensive Text Processing - Local Aggregation Part II - Java Code Geeks","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#primaryimage"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","datePublished":"2012-10-17T10:00:00+00:00","dateModified":"2012-10-28T23:34:59+00:00","description":"This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. 
In","breadcrumb":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#primaryimage","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","width":150,"height":150},{"@type":"BreadcrumbList","@id":"https:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-2.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.javacodegeeks.com\/"},{"@type":"ListItem","position":2,"name":"Java","item":"https:\/\/www.javacodegeeks.com\/category\/java"},{"@type":"ListItem","position":3,"name":"Enterprise Java","item":"https:\/\/www.javacodegeeks.com\/category\/java\/enterprise-java"},{"@type":"ListItem","position":4,"name":"MapReduce: Working Through Data-Intensive Text Processing &#8211; Local Aggregation Part II"}]},{"@type":"WebSite","@id":"https:\/\/www.javacodegeeks.com\/#website","url":"https:\/\/www.javacodegeeks.com\/","name":"Java Code Geeks","description":"Java Developers Resource Center","publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"alternateName":"JCG","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.javacodegeeks.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.javacodegeeks.com\/#organization","name":"Exelixis Media 
P.C.","url":"https:\/\/www.javacodegeeks.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","width":864,"height":246,"caption":"Exelixis Media P.C."},"image":{"@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/javacodegeeks","https:\/\/x.com\/javacodegeeks"]},{"@type":"Person","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/69f9f11896bf9cfd7278b440efeda646","name":"Bill Bejeck","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","caption":"Bill Bejeck"},"description":"Husband, father of 3, passionate about software 
development.","sameAs":["http:\/\/codingjunkie.net\/"],"url":"https:\/\/www.javacodegeeks.com\/author\/Bill-Bejeck"}]}},"_links":{"self":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/2045","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/users\/110"}],"replies":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/comments?post=2045"}],"version-history":[{"count":0,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/2045\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media\/63"}],"wp:attachment":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media?parent=2045"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/categories?post=2045"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/tags?post=2045"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}