{"id":1750,"date":"2012-09-28T16:00:00","date_gmt":"2012-09-28T16:00:00","guid":{"rendered":"http:\/\/www.javacodegeeks.com\/2012\/10\/mapreduce-working-through-data-intensive-text-processing.html"},"modified":"2012-10-22T06:40:46","modified_gmt":"2012-10-22T06:40:46","slug":"mapreduce-working-through-data","status":"publish","type":"post","link":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html","title":{"rendered":"MapReduce: Working Through Data-Intensive Text Processing"},"content":{"rendered":"<div dir=\"ltr\" style=\"text-align: left\">It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by <a href=\"https:\/\/www.coursera.org\/\" target=\"_blank\" title=\"Coursera\">Coursera<\/a>. There are some very interesting offerings, and the site is worth a look. Some time ago, I purchased <a href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\" title=\"Data-Intensive Text Processing with MapReduce\">Data-Intensive Processing with MapReduce<\/a> by Jimmy Lin and Chris Dyer. The book presents several key MapReduce algorithms, but in pseudo-code format. My goal is to take the algorithms presented in chapters 3-6 and implement them in Hadoop, using <a href=\"http:\/\/www.amazon.com\/Hadoop-Definitive-Guide-Tom-White\/dp\/1449311520\/ref=tmm_pap_title_0?ie=UTF8&amp;qid=1347589052&amp;sr=1-1\" target=\"_blank\">Hadoop: The Definitive Guide<\/a> by Tom White as a reference. I\u2019m going to assume familiarity with Hadoop and MapReduce and not cover any introductory material. So let\u2019s jump into chapter 3 \u2013 MapReduce Algorithm Design, starting with local aggregation. <\/p>\n<p><strong>Local Aggregation<\/strong><\/p>\n<p>At a very high level, when Mappers emit data, the intermediate results are written to disk then sent across the network to Reducers for final processing. 
The latency of writing to disk and then transferring data across the network is an expensive part of processing a MapReduce job. So it stands to reason that, whenever possible, reducing the amount of data sent from mappers would increase the speed of the MapReduce job. Local aggregation is a technique used to reduce the amount of data and improve the efficiency of our MapReduce job. Local aggregation cannot take the place of reducers, as we still need a way to gather results with the same key from different mappers. We are going to consider three ways of achieving local aggregation:         <\/p>\n<ol>\n<li>Using Hadoop Combiner functions.<\/li>\n<li>Two approaches to \u201cin-mapper\u201d combining presented in the Text Processing with MapReduce book.<\/li>\n<\/ol>\n<p>Of course, any optimization is going to have trade-offs, and we\u2019ll discuss those as well.<br \/>\nTo demonstrate local aggregation, we will run the ubiquitous word count job on a plain text version of <a href=\"http:\/\/www.gutenberg.org\/cache\/epub\/46\/pg46.txt\">A Christmas Carol<\/a> by Charles Dickens (downloaded from <a href=\"http:\/\/www.gutenberg.org\/wiki\/Main_Page\" target=\"_blank\">Project Gutenberg<\/a>) on a pseudo-distributed cluster installed on my MacBook Pro, using the hadoop-0.20.2-cdh3u3 distribution from <a href=\"http:\/\/www.cloudera.com\/\" target=\"_blank\">Cloudera<\/a>. I plan in a future post to run the same experiment on an EC2 cluster with more realistically sized data.<br \/>\n<strong><br \/>\n<\/strong><strong>Combiners<\/strong>        <\/p>\n<p>A combiner function is an object that extends the Reducer class. In fact, for our examples here, we are going to re-use the same reducer used in the word count job. 
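<\/p>\n<p>For reference, here is a sketch of the baseline word count mapper that emits each token with a count of 1 (the class name is an assumption; the post refers to it only as the per-token mapper in the results):<\/p>\n<pre class=\"brush:java\">\/\/ Hypothetical baseline mapper: emits one (word, 1) record per token\r\npublic class TokenCountMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {\r\n    private static final IntWritable ONE = new IntWritable(1);\r\n    private final Text word = new Text();\r\n\r\n    @Override\r\n    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {\r\n        StringTokenizer tokenizer = new StringTokenizer(value.toString());\r\n        while (tokenizer.hasMoreTokens()) {\r\n            word.set(tokenizer.nextToken());\r\n            context.write(word, ONE);   \/\/ no local aggregation; the combiner sums these\r\n        }\r\n    }\r\n}<\/pre>\n<p>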
A combiner function is specified when setting up the MapReduce job like so:         <\/p>\n<pre class=\"brush:java\"> job.setCombinerClass(TokenCountReducer.class);<\/pre>\n<p>Here is the reducer code:         <\/p>\n<pre class=\"brush:java\">public class TokenCountReducer extends Reducer&lt;Text,IntWritable,Text,IntWritable&gt;{\r\n    @Override\r\n    protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {\r\n        int count = 0;\r\n        for (IntWritable value : values) {\r\n            count += value.get();\r\n        }\r\n        context.write(key, new IntWritable(count));\r\n    }\r\n}<\/pre>\n<p>The job of a combiner is to do just what the name implies: aggregate data, with the net result of less data being shuffled across the network, which gives us gains in efficiency. As stated before, keep in mind that reducers are still required to put together results with the same keys coming from different mappers. Since combiner functions are an optimization, the Hadoop framework offers no guarantees on how many times a combiner will be called, if at all.<br \/>\n<strong><br \/>\n<\/strong><strong>In-Mapper Combining Option 1<\/strong>        <\/p>\n<p>The first alternative to using Combiners (figure 3.2, page 41) is very straightforward and makes a slight modification to our original word count mapper:         <\/p>\n<pre class=\"brush:java\">public class PerDocumentMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {\r\n    @Override\r\n    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {\r\n        IntWritable writableCount = new IntWritable();\r\n        Text text = new Text();\r\n        Map&lt;String,Integer&gt; tokenMap = new HashMap&lt;String, Integer&gt;();\r\n 
       StringTokenizer tokenizer = new StringTokenizer(value.toString());\r\n\r\n        while (tokenizer.hasMoreTokens()) {\r\n            String token = tokenizer.nextToken();\r\n            Integer count = tokenMap.get(token);\r\n            if (count == null) count = 0;\r\n            count += 1;\r\n            tokenMap.put(token, count);\r\n        }\r\n\r\n        Set&lt;String&gt; keys = tokenMap.keySet();\r\n        for (String s : keys) {\r\n            text.set(s);\r\n            writableCount.set(tokenMap.get(s));\r\n            context.write(text, writableCount);\r\n        }\r\n    }\r\n}<\/pre>\n<p>As we can see here, instead of emitting each word with a count of 1 as it is encountered, we use a map to keep track of the words already processed. Then, when all of the tokens are processed, we loop through the map and emit the total count for each word encountered in that line.<br \/>\n<strong><br \/>\n<\/strong><strong>In-Mapper Combining Option 2<\/strong>        <\/p>\n<p>The second option of in-mapper combining (figure 3.3, page 41) is very similar to the above example, with two distinctions \u2013 when the hash map is created and when we emit the results contained in the map. In the above example, a map is created and has its contents dumped over the wire for <i>each<\/i> invocation of the map method. In this example, we are going to make the map an instance variable and shift the instantiation of the map to the setup method of our mapper. Likewise, the contents of the map will not be sent out to the reducers until all of the calls to the map method have completed and the cleanup method is called.         
<\/p>\n<pre class=\"brush:java\">public class AllDocumentMapper extends Mapper&lt;LongWritable,Text,Text,IntWritable&gt; {\r\n\r\n    private Map&lt;String,Integer&gt; tokenMap;\r\n\r\n    @Override\r\n    protected void setup(Context context) throws IOException, InterruptedException {\r\n        tokenMap = new HashMap&lt;String, Integer&gt;();\r\n    }\r\n\r\n    @Override\r\n    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {\r\n        StringTokenizer tokenizer = new StringTokenizer(value.toString());\r\n        while (tokenizer.hasMoreTokens()) {\r\n            String token = tokenizer.nextToken();\r\n            Integer count = tokenMap.get(token);\r\n            if (count == null) count = 0;\r\n            count += 1;\r\n            tokenMap.put(token, count);\r\n        }\r\n    }\r\n\r\n    @Override\r\n    protected void cleanup(Context context) throws IOException, InterruptedException {\r\n        IntWritable writableCount = new IntWritable();\r\n        Text text = new Text();\r\n        Set&lt;String&gt; keys = tokenMap.keySet();\r\n        for (String s : keys) {\r\n            text.set(s);\r\n            writableCount.set(tokenMap.get(s));\r\n            context.write(text, writableCount);\r\n        }\r\n    }\r\n}<\/pre>\n<p>As we can see from the above code example, the mapper keeps track of unique word counts across all calls to the map method. By keeping track of unique tokens and their counts, there should be a substantial reduction in the number of records sent to the reducers, which in turn should improve the running time of the MapReduce job. This accomplishes the same effect as using the combiner function option provided by the MapReduce framework, but in this case you are guaranteed that the combining code will be called. But there are some caveats with this approach as well. 
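<\/p>\n<p>One caveat is memory: the instance map grows with the number of unique tokens the mapper sees. A common mitigation (a sketch only, not from the book\u2019s figures; the threshold value and helper name are assumptions) is to emit and clear the map whenever it passes a size threshold, calling a helper like this at the end of each map invocation. Partial counts for the same token are still summed correctly by the reducer:<\/p>\n<pre class=\"brush:java\">\/\/ Hypothetical sketch: bound memory by flushing when the map grows too large\r\nprivate static final int FLUSH_THRESHOLD = 100000; \/\/ assumed limit\r\n\r\nprivate void flushIfNeeded(Context context) throws IOException, InterruptedException {\r\n    if (tokenMap.size() &lt; FLUSH_THRESHOLD) return;\r\n    IntWritable writableCount = new IntWritable();\r\n    Text text = new Text();\r\n    for (Map.Entry&lt;String,Integer&gt; entry : tokenMap.entrySet()) {\r\n        text.set(entry.getKey());\r\n        writableCount.set(entry.getValue());\r\n        context.write(text, writableCount);\r\n    }\r\n    tokenMap.clear(); \/\/ a token\u2019s partial totals may be emitted more than once\r\n}<\/pre>\n<p>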
Keeping state across map calls could prove problematic, and it is certainly a violation of the functional spirit of a \u201cmap\u201d function. Also, depending on the data used in the job, the memory needed to hold state across all map calls could become an issue to contend with. Ultimately, one would have to weigh all of the trade-offs to determine the best approach.<br \/>\n<strong><br \/>\n<\/strong><strong>Results<\/strong>        <\/p>\n<p>Now let\u2019s take a look at some results from the different mappers. Since the job was run in pseudo-distributed mode, actual running times are irrelevant, but we can still infer how using local aggregation could impact the efficiency of a MapReduce job running on a real cluster.          <\/p>\n<p>Per-Token Mapper:         <\/p>\n<pre class=\"brush:java\">12\/09\/13 21:25:32 INFO mapred.JobClient:     Reduce shuffle bytes=366010\r\n12\/09\/13 21:25:32 INFO mapred.JobClient:     Reduce output records=7657\r\n12\/09\/13 21:25:32 INFO mapred.JobClient:     Spilled Records=63118\r\n12\/09\/13 21:25:32 INFO mapred.JobClient:     Map output bytes=302886<\/pre>\n<p>In-Mapper Combining Option 1:         <\/p>\n<pre class=\"brush:java\">12\/09\/13 21:28:15 INFO mapred.JobClient:     Reduce shuffle bytes=354112\r\n12\/09\/13 21:28:15 INFO mapred.JobClient:     Reduce output records=7657\r\n12\/09\/13 21:28:15 INFO mapred.JobClient:     Spilled Records=60704\r\n12\/09\/13 21:28:15 INFO mapred.JobClient:     Map output bytes=293402<\/pre>\n<p>In-Mapper Combining Option 2:         <\/p>\n<pre class=\"brush:java\">12\/09\/13 21:30:49 INFO mapred.JobClient:     Reduce shuffle bytes=105885\r\n12\/09\/13 21:30:49 INFO mapred.JobClient:     Reduce output records=7657\r\n12\/09\/13 21:30:49 INFO mapred.JobClient:     Spilled Records=15314\r\n12\/09\/13 21:30:49 INFO mapred.JobClient:     Map output bytes=90565<\/pre>\n<p>Combiner Option:         <\/p>\n<pre class=\"brush:java\">12\/09\/13 21:22:18 INFO mapred.JobClient:     Reduce shuffle 
bytes=105885\r\n12\/09\/13 21:22:18 INFO mapred.JobClient:     Reduce output records=7657\r\n12\/09\/13 21:22:18 INFO mapred.JobClient:     Spilled Records=15314\r\n12\/09\/13 21:22:18 INFO mapred.JobClient:     Map output bytes=302886\r\n12\/09\/13 21:22:18 INFO mapred.JobClient:     Combine input records=31559\r\n12\/09\/13 21:22:18 INFO mapred.JobClient:     Combine output records=7657<\/pre>\n<p>As expected, the mapper that did no combining had the worst results, followed closely by the first in-mapper combining option (although these results could have been better had the data been cleaned up before running the word count). The second in-mapper combining option and the combiner function had virtually identical results. The significant fact is that both produced roughly <strong><i>two-thirds fewer<\/i><\/strong> reduce shuffle bytes than the first two options. Reducing the number of bytes sent over the network to the reducers by that amount would surely have a positive impact on the efficiency of a MapReduce job. One point to keep in mind here: combiners and in-mapper combining cannot be used in every MapReduce job; in this case, word count lends itself very nicely to such an enhancement, but that might not always be true.<br \/>\n<strong><br \/>\n<\/strong><strong>Conclusion<\/strong>        <\/p>\n<p>As you can see, either in-mapper combining or the Hadoop combiner function deserves serious consideration when looking to improve the performance of your MapReduce jobs. 
As for which approach to choose, it is up to you to weigh the trade-offs of each.<br \/>\n<strong><br \/>\n<\/strong><strong>Related links<\/strong><br \/>\n<strong><br \/>\n<\/strong><\/p>\n<ul>\n<li> <a href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\" title=\"Data-Intensive Text Processing with MapReduce\">Data-Intensive Processing with MapReduce<\/a> by Jimmy Lin and Chris Dyer<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Hadoop-Definitive-Guide-Tom-White\/dp\/1449311520\/ref=tmm_pap_title_0?ie=UTF8&amp;qid=1347589052&amp;sr=1-1\" target=\"_blank\">Hadoop: The Definitive Guide<\/a> by Tom White<\/li>\n<li><a href=\"https:\/\/github.com\/bbejeck\/hadoop-algorithms\" target=\"_blank\" title=\"Source Code\">Source Code<\/a> from the blog<\/li>\n<li><a href=\"http:\/\/mrunit.apache.org\/\" target=\"_blank\">MRUnit<\/a> for unit testing Apache Hadoop MapReduce jobs<\/li>\n<li><a href=\"http:\/\/www.gutenberg.org\/\" target=\"_blank\" title=\"Project Gutenberg\">Project Gutenberg<\/a>, a great source of books in plain text format, great for testing Hadoop jobs locally.<\/li>\n<\/ul>\n<div>\n<\/div>\n<p>Happy coding and don&#8217;t forget to share!<\/p>\n<p><strong><i>Reference: <\/i><\/strong><a href=\"http:\/\/codingjunkie.net\/text-processing-with-mapreduce-part1\/\">Working Through Data-Intensive Text Processing with MapReduce<\/a> from our <a href=\"http:\/\/www.javacodegeeks.com\/p\/jcg.html\">JCG partner<\/a> Bill Bejeck at the <a href=\"http:\/\/codingjunkie.net\/\">Random Thoughts On Coding<\/a> blog.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by Coursera. There are some very interesting offerings and is worth a look. Some time ago, I purchased Data-Intensive Processing with MapReduce by Jimmy Lin and Chris Dyer. 
The book presents several key MapReduce algorithms, but &hellip;<\/p>\n","protected":false},"author":110,"featured_media":63,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[184,372,183],"class_list":["post-1750","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-enterprise-java","tag-apache-hadoop","tag-big-data","tag-mapreduce"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>MapReduce: Working Through Data-Intensive Text Processing - Java Code Geeks<\/title>\n<meta name=\"description\" content=\"It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by Coursera. There are some very interesting offerings and is\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"MapReduce: Working Through Data-Intensive Text Processing - Java Code Geeks\" \/>\n<meta property=\"og:description\" content=\"It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by Coursera. 
There are some very interesting offerings and is\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html\" \/>\n<meta property=\"og:site_name\" content=\"Java Code Geeks\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/javacodegeeks\" \/>\n<meta property=\"article:published_time\" content=\"2012-09-28T16:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2012-10-22T06:40:46+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"150\" \/>\n\t<meta property=\"og:image:height\" content=\"150\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Bill Bejeck\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:site\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Bill Bejeck\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html\"},\"author\":{\"name\":\"Bill Bejeck\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/69f9f11896bf9cfd7278b440efeda646\"},\"headline\":\"MapReduce: Working Through Data-Intensive Text Processing\",\"datePublished\":\"2012-09-28T16:00:00+00:00\",\"dateModified\":\"2012-10-22T06:40:46+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html\"},\"wordCount\":1166,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"keywords\":[\"Apache Hadoop\",\"Big Data\",\"MapReduce\"],\"articleSection\":[\"Enterprise Java\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html\",\"name\":\"MapReduce: Working Through Data-Intensive Text Processing - Java Code 
Geeks\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"datePublished\":\"2012-09-28T16:00:00+00:00\",\"dateModified\":\"2012-10-22T06:40:46+00:00\",\"description\":\"It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by Coursera. There are some very interesting offerings and is\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#primaryimage\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-mapreduce-logo.jpg\",\"width\":150,\"height\":150},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2012\\\/09\\\/mapreduce-working-through-data.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Enterprise 
Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\\\/enterprise-java\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"MapReduce: Working Through Data-Intensive Text Processing\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"name\":\"Java Code Geeks\",\"description\":\"Java Developers Resource Center\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"alternateName\":\"JCG\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.javacodegeeks.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\",\"name\":\"Exelixis Media P.C.\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"width\":864,\"height\":246,\"caption\":\"Exelixis Media P.C.\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/javacodegeeks\",\"https:\\\/\\\/x.com\\\/javacodegeeks\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/69f9f11896bf9cfd7278b440efeda646\",\"name\":\"Bill 
Bejeck\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"caption\":\"Bill Bejeck\"},\"description\":\"Husband, father of 3, passionate about software development.\",\"sameAs\":[\"http:\\\/\\\/codingjunkie.net\\\/\"],\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/author\\\/Bill-Bejeck\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"MapReduce: Working Through Data-Intensive Text Processing - Java Code Geeks","description":"It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by Coursera. There are some very interesting offerings and is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html","og_locale":"en_US","og_type":"article","og_title":"MapReduce: Working Through Data-Intensive Text Processing - Java Code Geeks","og_description":"It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by Coursera. 
There are some very interesting offerings and is","og_url":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html","og_site_name":"Java Code Geeks","article_publisher":"https:\/\/www.facebook.com\/javacodegeeks","article_published_time":"2012-09-28T16:00:00+00:00","article_modified_time":"2012-10-22T06:40:46+00:00","og_image":[{"width":150,"height":150,"url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","type":"image\/jpeg"}],"author":"Bill Bejeck","twitter_card":"summary_large_image","twitter_creator":"@javacodegeeks","twitter_site":"@javacodegeeks","twitter_misc":{"Written by":"Bill Bejeck","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#article","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html"},"author":{"name":"Bill Bejeck","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/69f9f11896bf9cfd7278b440efeda646"},"headline":"MapReduce: Working Through Data-Intensive Text Processing","datePublished":"2012-09-28T16:00:00+00:00","dateModified":"2012-10-22T06:40:46+00:00","mainEntityOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html"},"wordCount":1166,"commentCount":0,"publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","keywords":["Apache Hadoop","Big Data","MapReduce"],"articleSection":["Enterprise 
Java"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html","url":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html","name":"MapReduce: Working Through Data-Intensive Text Processing - Java Code Geeks","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#primaryimage"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","datePublished":"2012-09-28T16:00:00+00:00","dateModified":"2012-10-22T06:40:46+00:00","description":"It has been a while since I last posted, as I\u2019ve been busy with some of the classes offered by Coursera. 
There are some very interesting offerings and is","breadcrumb":{"@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#primaryimage","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","width":150,"height":150},{"@type":"BreadcrumbList","@id":"https:\/\/www.javacodegeeks.com\/2012\/09\/mapreduce-working-through-data.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.javacodegeeks.com\/"},{"@type":"ListItem","position":2,"name":"Java","item":"https:\/\/www.javacodegeeks.com\/category\/java"},{"@type":"ListItem","position":3,"name":"Enterprise Java","item":"https:\/\/www.javacodegeeks.com\/category\/java\/enterprise-java"},{"@type":"ListItem","position":4,"name":"MapReduce: Working Through Data-Intensive Text Processing"}]},{"@type":"WebSite","@id":"https:\/\/www.javacodegeeks.com\/#website","url":"https:\/\/www.javacodegeeks.com\/","name":"Java Code Geeks","description":"Java Developers Resource Center","publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"alternateName":"JCG","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.javacodegeeks.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.javacodegeeks.com\/#organization","name":"Exelixis Media 
P.C.","url":"https:\/\/www.javacodegeeks.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","width":864,"height":246,"caption":"Exelixis Media P.C."},"image":{"@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/javacodegeeks","https:\/\/x.com\/javacodegeeks"]},{"@type":"Person","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/69f9f11896bf9cfd7278b440efeda646","name":"Bill Bejeck","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","caption":"Bill Bejeck"},"description":"Husband, father of 3, passionate about software 
development.","sameAs":["http:\/\/codingjunkie.net\/"],"url":"https:\/\/www.javacodegeeks.com\/author\/Bill-Bejeck"}]}},"_links":{"self":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/1750","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/users\/110"}],"replies":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/comments?post=1750"}],"version-history":[{"count":0,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/1750\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media\/63"}],"wp:attachment":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media?parent=1750"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/categories?post=1750"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/tags?post=1750"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}