{"id":21767,"date":"2014-02-19T22:00:22","date_gmt":"2014-02-19T20:00:22","guid":{"rendered":"http:\/\/www.javacodegeeks.com\/?p=21767"},"modified":"2014-02-19T15:21:27","modified_gmt":"2014-02-19T13:21:27","slug":"mapreduce-algorithms-understanding-data-joins-part-ii","status":"publish","type":"post","link":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html","title":{"rendered":"MapReduce Algorithms &#8211; Understanding Data Joins Part II"},"content":{"rendered":"<p>It\u2019s been a while since I last posted and, like the last time I took a big break, I was taking some classes on Coursera. This time it was <a href=\"https:\/\/www.coursera.org\/course\/progfun\" target=\"_blank\">Functional Programming Principles in Scala<\/a> and <a href=\"https:\/\/www.coursera.org\/course\/reactive\" target=\"_blank\">Principles of Reactive Programming<\/a>. I found both of them to be great courses and would recommend taking either one if you have the time. In this post we resume our series on implementing the algorithms found in <a href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\">Data-Intensive Text Processing with MapReduce<\/a>, this time covering map-side joins. As we can guess from the name, map-side joins join data exclusively during the mapping phase and completely skip the reducing phase. In the last post on data joins we covered <a title=\"MapReduce Algorithms \u2013 Understanding Data Joins Part 1\" href=\"http:\/\/www.javacodegeeks.com\/2013\/07\/mapreduce-algorithms-understanding-data-joins-part-1.html\" target=\"_blank\">reduce-side joins<\/a>. Reduce-side joins are easy to implement, but have the drawback that all data is sent across the network to the reducers. Map-side joins can offer substantial gains in performance because they avoid the cost of sending data across the network. 
However, unlike reduce-side joins, map-side joins require that very specific criteria be met. Today we will discuss the requirements for map-side joins and how we can implement them.<\/p>\n<h2>Map-Side Join Conditions<\/h2>\n<p>To take advantage of map-side joins our data must meet one of the following criteria:<\/p>\n<ol>\n<li>The datasets to be joined are already sorted by the same key and have the same number of partitions<\/li>\n<li>Of the two datasets to be joined, one is small enough to fit into memory<\/li>\n<\/ol>\n<p>We are going to consider the first scenario, where we have two (or more) datasets that need to be joined but are too large to fit into memory. We will assume the worst-case scenario: the files aren\u2019t sorted or partitioned the same way.<\/p>\n<h2>Data Format<\/h2>\n<p>Before we start, let\u2019s take a look at the data we are working with. We will have two datasets:<\/p>\n<ol>\n<li>The first dataset consists of a GUID, First Name, Last Name, Address, City and State<\/li>\n<li>The second dataset consists of a GUID and Employer information<\/li>\n<\/ol>\n<p>Both datasets are comma-delimited and the join-key (GUID) is in the first position. After the join we want the employer information from dataset two to be appended to the end of dataset one. 
Additionally, we want to keep the GUID in the first position of dataset one, but remove the GUID from dataset two.<br \/>\nDataset 1:<\/p>\n<pre class=\" brush:java\">aef9422c-d08c-4457-9760-f2d564d673bc,Linda,Narvaez,3253 Davis Street,Atlanta,GA\r\n08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC\r\nde68186a-1004-4211-a866-736f414eac61,Charles,Arnold,1764 Public Works Drive,Johnson City,TN\r\n6df1882d-4c81-4155-9d8b-0c35b2d34284,John,Schofield,65 Summit Park Avenue,Detroit,MI<\/pre>\n<p>Dataset 2:<\/p>\n<pre class=\" brush:java\">de68186a-1004-4211-a866-736f414eac61,Jacobs\r\n6df1882d-4c81-4155-9d8b-0c35b2d34284,Chief Auto Parts\r\naef9422c-d08c-4457-9760-f2d564d673bc,Earthworks Yard Maintenance\r\n08db7c55-22ae-4199-8826-c67a5689f838,Ellman's Catalog Showrooms<\/pre>\n<p>Joined results:<\/p>\n<pre class=\" brush:java\">08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC,Ellman's Catalog Showrooms\r\n6df1882d-4c81-4155-9d8b-0c35b2d34284,John,Schofield,65 Summit Park Avenue,Detroit,MI,Chief Auto Parts\r\naef9422c-d08c-4457-9760-f2d564d673bc,Linda,Narvaez,3253 Davis Street,Atlanta,GA,Earthworks Yard Maintenance\r\nde68186a-1004-4211-a866-736f414eac61,Charles,Arnold,1764 Public Works Drive,Johnson City,TN,Jacobs<\/pre>\n<p>Now let\u2019s look at how to join our two datasets.<\/p>\n<h2>Map-Side Joins with Large Datasets<\/h2>\n<p>To perform a map-side join, every dataset must be sorted by the same key and split into the same number of partitions, so that all records for a given key sit in the same-numbered partition of each dataset. While this seems to be a tough requirement, it is easily met. Hadoop sorts all keys and guarantees that keys with the same value are sent to the same reducer. So by running a MapReduce job that does nothing more than output the data by the key you want to join on, and by specifying the exact same number of reducers for all datasets, we will get our data in the correct form. 
Considering the gains in efficiency from being able to do a map-side join, it may be worth the cost of running additional MapReduce jobs. It bears repeating that it is crucial for all datasets to specify the <em>exact<\/em> same number of reducers during the \u201cpreparation\u201d phase, when the data will be sorted and partitioned. In this post we will take two datasets, run an initial MapReduce job on each to do the sorting and partitioning, and then run a final job to perform the map-side join. First let\u2019s cover the MapReduce job to sort and partition our data in the same way.<\/p>\n<h2>Step One: Sorting and Partitioning<\/h2>\n<p>First we need to create a <code>Mapper<\/code> that will simply choose the key for sorting by a given index:<\/p>\n<pre class=\" brush:java\">public class SortByKeyMapper extends Mapper&lt;LongWritable, Text, Text, Text&gt; {\r\n\r\n    private int keyIndex;\r\n    private Splitter splitter;\r\n    private Joiner joiner;\r\n    private Text joinKey = new Text();\r\n\r\n    @Override\r\n    protected void setup(Context context) throws IOException, InterruptedException {\r\n        String separator = context.getConfiguration().get(\"separator\");\r\n        keyIndex = Integer.parseInt(context.getConfiguration().get(\"keyIndex\"));\r\n        splitter = Splitter.on(separator);\r\n        joiner = Joiner.on(separator);\r\n    }\r\n\r\n    @Override\r\n    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {\r\n        Iterable&lt;String&gt; values = splitter.split(value.toString());\r\n        joinKey.set(Iterables.get(values,keyIndex));\r\n        if(keyIndex != 0){\r\n            value.set(reorderValue(values,keyIndex));\r\n        }\r\n        context.write(joinKey,value);\r\n    }\r\n\r\n    private String 
reorderValue(Iterable&lt;String&gt; value, int index){\r\n        List&lt;String&gt; temp = Lists.newArrayList(value);\r\n        String originalFirst = temp.get(0);\r\n        String newFirst = temp.get(index);\r\n        temp.set(0,newFirst);\r\n        temp.set(index,originalFirst);\r\n        return joiner.join(temp);\r\n    }\r\n}<\/pre>\n<p>The <code>SortByKeyMapper<\/code> sets the value of the <code>joinKey<\/code> by extracting the field found at the position given by the configuration parameter <code>keyIndex<\/code> from the given line of text. Also, if the <code>keyIndex<\/code> is not equal to zero, we swap the values found in the first position and the <code>keyIndex<\/code> position. Although this is a questionable feature, we\u2019ll discuss why we are doing this later. Next we need a <code>Reducer<\/code>:<\/p>\n<pre class=\" brush:java\">public class SortByKeyReducer extends Reducer&lt;Text,Text,NullWritable,Text&gt; {\r\n\r\n    private static final NullWritable nullKey = NullWritable.get();\r\n\r\n    @Override\r\n    protected void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException {\r\n        for (Text value : values) {\r\n             context.write(nullKey,value);\r\n        }\r\n    }\r\n}<\/pre>\n<p>The <code>SortByKeyReducer<\/code> writes out all values for the given key, but throws out the key and writes a <code>NullWritable<\/code> instead. In the next section we will explain why we are not using the key.<\/p>\n<h2>Step Two: The Map-Side Join<\/h2>\n<p>When performing a map-side join the records are merged <em>before<\/em> they reach the mapper. To achieve this, we use the <a title=\"CompositeInputFormat\" href=\"http:\/\/hadoop.apache.org\/docs\/current2\/api\/org\/apache\/hadoop\/mapreduce\/lib\/join\/CompositeInputFormat.html\" target=\"_blank\">CompositeInputFormat<\/a>. We will also need to set some configuration properties. 
Let\u2019s look at how we will configure our map-side join:<\/p>\n<pre class=\" brush:java\">private static Configuration getMapJoinConfiguration(String separator, String... paths) {\r\n        Configuration config = new Configuration();\r\n        config.set(\"mapreduce.input.keyvaluelinerecordreader.key.value.separator\", separator);\r\n        String joinExpression = CompositeInputFormat.compose(\"inner\", KeyValueTextInputFormat.class, paths);\r\n        config.set(\"mapred.join.expr\", joinExpression);\r\n        config.set(\"separator\", separator);\r\n        return config;\r\n    }<\/pre>\n<p>First, we specify the character that separates the key and values by setting the <code>mapreduce.input.keyvaluelinerecordreader.key.value.separator<\/code> property. Next we use the <code>CompositeInputFormat.compose<\/code> method to create a \u201cjoin expression\u201d. We request an inner join by passing the word \u201cinner\u201d, then name the input format to use, the <a title=\"KeyValueTextInput\" href=\"http:\/\/hadoop.apache.org\/docs\/current2\/api\/org\/apache\/hadoop\/mapreduce\/lib\/input\/KeyValueTextInputFormat.html\" target=\"_blank\">KeyValueTextInputFormat<\/a> class, and finally pass a String varargs representing the paths of the files to join (which are the output paths of the MapReduce jobs run to sort and partition the data). The <code>KeyValueTextInputFormat<\/code> class uses the separator character to split each line at its first occurrence: the text before it becomes the key and the rest becomes the value.<\/p>\n<h2>Mapper for the Join<\/h2>\n<p>Once the values from the source files have been joined, the <code>Mapper.map<\/code> method is called. It receives a <code>Text<\/code> object for the key (the same key across joined records) and a <code>TupleWritable<\/code> that is composed of the values joined from our input files for a given key. 
Remember we want our final output to have the join-key in the first position, followed by all of the joined values in one delimited <code>String<\/code>. To achieve this we have a custom mapper to put our data in the correct format:<\/p>\n<pre class=\" brush:java\">public class CombineValuesMapper extends Mapper&lt;Text, TupleWritable, NullWritable, Text&gt; {\r\n\r\n    private static final NullWritable nullKey = NullWritable.get();\r\n    private Text outValue = new Text();\r\n    private StringBuilder valueBuilder = new StringBuilder();\r\n    private String separator;\r\n\r\n    @Override\r\n    protected void setup(Context context) throws IOException, InterruptedException {\r\n        separator = context.getConfiguration().get(\"separator\");\r\n    }\r\n\r\n    @Override\r\n    protected void map(Text key, TupleWritable value, Context context) throws IOException, InterruptedException {\r\n        valueBuilder.append(key).append(separator);\r\n        for (Writable writable : value) {\r\n            valueBuilder.append(writable.toString()).append(separator);\r\n        }\r\n        valueBuilder.setLength(valueBuilder.length() - 1);\r\n        outValue.set(valueBuilder.toString());\r\n        context.write(nullKey, outValue);\r\n        valueBuilder.setLength(0);\r\n    }\r\n}<\/pre>\n<p>In the <code>CombineValuesMapper<\/code> we append the key and all the joined values into one delimited <code>String<\/code>. Here we can finally see the reason why we threw the join-key away in the previous MapReduce jobs. Because the key sits in the first position of every sorted record, <code>KeyValueTextInputFormat<\/code> splits it off as the key, so the values in the <code>TupleWritable<\/code> no longer contain it and our mapper naturally eliminates the duplicate keys from the joined datasets. All we need to do is insert the given key into a <code>StringBuilder<\/code>, then append the values contained in the <code>TupleWritable<\/code>.<\/p>\n<h2>Putting It All Together<\/h2>\n<p>Now we have all the code in place to run a map-side join on large datasets. 
Let\u2019s take a look at how we will run all the jobs together. As was stated before, we are assuming that our data is not sorted and partitioned the same, so we will need to run N (2 in this case) MapReduce jobs to get the data in the correct format. After the initial sorting\/partitioning jobs run, the final job performing the actual join will run.<\/p>\n<pre class=\" brush:java\">public class MapSideJoinDriver {\r\n\r\n    public static void main(String[] args) throws Exception {\r\n        String separator = \",\";\r\n        String keyIndex = \"0\";\r\n        int numReducers = 10;\r\n        String jobOneInputPath = args[0];\r\n        String jobTwoInputPath = args[1];\r\n        String joinJobOutPath = args[2];\r\n\r\n        String jobOneSortedPath = jobOneInputPath + \"_sorted\";\r\n        String jobTwoSortedPath = jobTwoInputPath + \"_sorted\";\r\n\r\n        Job firstSort = Job.getInstance(getConfiguration(keyIndex, separator));\r\n        configureJob(firstSort, \"firstSort\", numReducers, jobOneInputPath, jobOneSortedPath, SortByKeyMapper.class, SortByKeyReducer.class);\r\n\r\n        Job secondSort = Job.getInstance(getConfiguration(keyIndex, separator));\r\n        configureJob(secondSort, \"secondSort\", numReducers, jobTwoInputPath, jobTwoSortedPath, SortByKeyMapper.class, SortByKeyReducer.class);\r\n\r\n        Job mapJoin = Job.getInstance(getMapJoinConfiguration(separator, jobOneSortedPath, jobTwoSortedPath));\r\n        configureJob(mapJoin, \"mapJoin\", 0, jobOneSortedPath + \",\" + jobTwoSortedPath, joinJobOutPath, CombineValuesMapper.class, Reducer.class);\r\n        mapJoin.setInputFormatClass(CompositeInputFormat.class);\r\n\r\n        List&lt;Job&gt; jobs = Lists.newArrayList(firstSort, secondSort, mapJoin);\r\n        int exitStatus = 0;\r\n        for (Job job : jobs) {\r\n            boolean jobSuccessful = job.waitForCompletion(true);\r\n            if (!jobSuccessful) {\r\n                System.out.println(\"Error with job \" + 
job.getJobName() + \"  \" + job.getStatus().getFailureInfo());\r\n                exitStatus = 1;\r\n                break;\r\n            }\r\n        }\r\n        System.exit(exitStatus);\r\n    }<\/pre>\n<p>The <code>MapSideJoinDriver<\/code> does the basic configuration for running MapReduce jobs. One interesting point is that the sorting\/partitioning jobs specify 10 reducers each, while the final job explicitly sets the number of reducers to 0, since we are joining on the map-side and don\u2019t need a reduce phase. Since we don\u2019t have any complicated dependencies, we put the jobs in an ArrayList and run them in linear order (the <code>for<\/code> loop at the end of <code>main<\/code>).<\/p>\n<h2>Results<\/h2>\n<p>Initially we had two files: name and address information in the first file and employment information in the second. Both files had a unique id in the first column.<br \/>\nFile one:<\/p>\n<pre class=\" brush:java\">...\r\n08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC\r\n...<\/pre>\n<p>File two:<\/p>\n<pre class=\" brush:java\">...\r\n08db7c55-22ae-4199-8826-c67a5689f838,Ellman's Catalog Showrooms\r\n...<\/pre>\n<p>Results:<\/p>\n<pre class=\" brush:java\">08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC,Ellman's Catalog Showrooms<\/pre>\n<p>As we can see here, we\u2019ve successfully joined the records together and maintained the format of the files without duplicate keys in the results.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post we\u2019ve demonstrated how to perform a map-side join when both datasets are large and can\u2019t fit into memory. If you get the feeling this takes a lot of work to pull off, you are correct. While in most cases we would want to use higher-level tools like Pig or Hive, it\u2019s helpful to know the mechanics of performing map-side joins with large datasets. This is especially true on those occasions when you need to write a solution from scratch. 
Thanks for your time.<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li><a title=\"Data-Intensive Text Processing with MapReduce\" href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\">Data-Intensive Text Processing with MapReduce<\/a> by Jimmy Lin and Chris Dyer<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Hadoop-Definitive-Guide-Tom-White\/dp\/1449311520\/ref=tmm_pap_title_0?ie=UTF8&amp;qid=1347589052&amp;sr=1-1\" target=\"_blank\">Hadoop: The Definitive Guide<\/a> by Tom White<\/li>\n<li><a title=\"Source Code\" href=\"https:\/\/github.com\/bbejeck\/hadoop-algorithms\" target=\"_blank\">Source Code and Tests<\/a> from the blog<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Programming-Hive-Edward-Capriolo\/dp\/1449319335\" target=\"_blank\">Programming Hive<\/a> by Edward Capriolo, Dean Wampler and Jason Rutherglen<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Programming-Pig-Alan-Gates\/dp\/1449302645\" target=\"_blank\">Programming Pig<\/a> by Alan Gates<\/li>\n<li><a href=\"http:\/\/hadoop.apache.org\/docs\/current2\/api\/\">Hadoop API<\/a><\/li>\n<li><a href=\"http:\/\/mrunit.apache.org\/\" target=\"_blank\">MRUnit<\/a> for unit testing Apache Hadoop MapReduce jobs<\/li>\n<\/ul>\n<div style=\"border: 1px solid #D8D8D8; background: #FAFAFA; width: 100%; padding-left: 5px;\"><b><i>Reference: <\/i><\/b><a href=\"http:\/\/codingjunkie.net\/mapside-joins\/\">MapReduce Algorithms &#8211; Understanding Data Joins Part II<\/a> from our <a href=\"http:\/\/www.javacodegeeks.com\/jcg\">JCG partner<\/a> Bill Bejeck at the <a href=\"http:\/\/codingjunkie.net\/\">Random Thoughts On Coding<\/a> blog.<\/div>\n","protected":false},"excerpt":{"rendered":"<p>It\u2019s been a while since I last posted and, like the last time I took a big break, I was taking some classes on Coursera. This time it was Functional Programming Principles in Scala and Principles of Reactive Programming. 
I found both of them to be great courses and would recommend taking either one if you have &hellip;<\/p>\n","protected":false},"author":110,"featured_media":62,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[184,183],"class_list":["post-21767","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-enterprise-java","tag-apache-hadoop","tag-mapreduce"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>MapReduce Algorithms - Understanding Data Joins Part II<\/title>\n<meta name=\"description\" content=\"It\u2019s been awhile since I last posted, and like last time I took a big break, I was taking some classes on Coursera. This time it was Functional\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"MapReduce Algorithms - Understanding Data Joins Part II\" \/>\n<meta property=\"og:description\" content=\"It\u2019s been awhile since I last posted, and like last time I took a big break, I was taking some classes on Coursera. 
This time it was Functional\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html\" \/>\n<meta property=\"og:site_name\" content=\"Java Code Geeks\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/javacodegeeks\" \/>\n<meta property=\"article:published_time\" content=\"2014-02-19T20:00:22+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-logo.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"150\" \/>\n\t<meta property=\"og:image:height\" content=\"150\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Bill Bejeck\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:site\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Bill Bejeck\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html\"},\"author\":{\"name\":\"Bill Bejeck\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/69f9f11896bf9cfd7278b440efeda646\"},\"headline\":\"MapReduce Algorithms &#8211; Understanding Data Joins Part II\",\"datePublished\":\"2014-02-19T20:00:22+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html\"},\"wordCount\":1436,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-logo.jpg\",\"keywords\":[\"Apache Hadoop\",\"MapReduce\"],\"articleSection\":[\"Enterprise Java\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html\",\"name\":\"MapReduce Algorithms - Understanding Data Joins Part 
II\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-logo.jpg\",\"datePublished\":\"2014-02-19T20:00:22+00:00\",\"description\":\"It\u2019s been awhile since I last posted, and like last time I took a big break, I was taking some classes on Coursera. This time it was Functional\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/apache-hadoop-logo.jpg\",\"width\":150,\"height\":150},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2014\\\/02\\\/mapreduce-algorithms-understanding-data-joins-part-ii.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\"},{\"@type\":\"ListItem\",\"position\":3
,\"name\":\"Enterprise Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\\\/enterprise-java\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"MapReduce Algorithms &#8211; Understanding Data Joins Part II\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"name\":\"Java Code Geeks\",\"description\":\"Java Developers Resource Center\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"alternateName\":\"JCG\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.javacodegeeks.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\",\"name\":\"Exelixis Media P.C.\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"width\":864,\"height\":246,\"caption\":\"Exelixis Media P.C.\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/javacodegeeks\",\"https:\\\/\\\/x.com\\\/javacodegeeks\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/69f9f11896bf9cfd7278b440efeda646\",\"name\":\"Bill 
Bejeck\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g\",\"caption\":\"Bill Bejeck\"},\"description\":\"Husband, father of 3, passionate about software development.\",\"sameAs\":[\"http:\\\/\\\/codingjunkie.net\\\/\"],\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/author\\\/Bill-Bejeck\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"MapReduce Algorithms - Understanding Data Joins Part II","description":"It\u2019s been awhile since I last posted, and like last time I took a big break, I was taking some classes on Coursera. This time it was Functional","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html","og_locale":"en_US","og_type":"article","og_title":"MapReduce Algorithms - Understanding Data Joins Part II","og_description":"It\u2019s been awhile since I last posted, and like last time I took a big break, I was taking some classes on Coursera. 
This time it was Functional","og_url":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html","og_site_name":"Java Code Geeks","article_publisher":"https:\/\/www.facebook.com\/javacodegeeks","article_published_time":"2014-02-19T20:00:22+00:00","og_image":[{"width":150,"height":150,"url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-logo.jpg","type":"image\/jpeg"}],"author":"Bill Bejeck","twitter_card":"summary_large_image","twitter_creator":"@javacodegeeks","twitter_site":"@javacodegeeks","twitter_misc":{"Written by":"Bill Bejeck","Est. reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#article","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html"},"author":{"name":"Bill Bejeck","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/69f9f11896bf9cfd7278b440efeda646"},"headline":"MapReduce Algorithms &#8211; Understanding Data Joins Part II","datePublished":"2014-02-19T20:00:22+00:00","mainEntityOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html"},"wordCount":1436,"commentCount":1,"publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-logo.jpg","keywords":["Apache Hadoop","MapReduce"],"articleSection":["Enterprise 
Java"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html","url":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html","name":"MapReduce Algorithms - Understanding Data Joins Part II","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-logo.jpg","datePublished":"2014-02-19T20:00:22+00:00","description":"It\u2019s been awhile since I last posted, and like last time I took a big break, I was taking some classes on Coursera. 
This time it was Functional","breadcrumb":{"@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#primaryimage","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-logo.jpg","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-logo.jpg","width":150,"height":150},{"@type":"BreadcrumbList","@id":"https:\/\/www.javacodegeeks.com\/2014\/02\/mapreduce-algorithms-understanding-data-joins-part-ii.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.javacodegeeks.com\/"},{"@type":"ListItem","position":2,"name":"Java","item":"https:\/\/www.javacodegeeks.com\/category\/java"},{"@type":"ListItem","position":3,"name":"Enterprise Java","item":"https:\/\/www.javacodegeeks.com\/category\/java\/enterprise-java"},{"@type":"ListItem","position":4,"name":"MapReduce Algorithms &#8211; Understanding Data Joins Part II"}]},{"@type":"WebSite","@id":"https:\/\/www.javacodegeeks.com\/#website","url":"https:\/\/www.javacodegeeks.com\/","name":"Java Code Geeks","description":"Java Developers Resource Center","publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"alternateName":"JCG","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.javacodegeeks.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.javacodegeeks.com\/#organization","name":"Exelixis Media 
P.C.","url":"https:\/\/www.javacodegeeks.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","width":864,"height":246,"caption":"Exelixis Media P.C."},"image":{"@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/javacodegeeks","https:\/\/x.com\/javacodegeeks"]},{"@type":"Person","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/69f9f11896bf9cfd7278b440efeda646","name":"Bill Bejeck","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","caption":"Bill Bejeck"},"description":"Husband, father of 3, passionate about software 
development.","sameAs":["http:\/\/codingjunkie.net\/"],"url":"https:\/\/www.javacodegeeks.com\/author\/Bill-Bejeck"}]}},"_links":{"self":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/21767","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/users\/110"}],"replies":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/comments?post=21767"}],"version-history":[{"count":0,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/21767\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media\/62"}],"wp:attachment":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media?parent=21767"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/categories?post=21767"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/tags?post=21767"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}