{"id":14873,"date":"2013-07-01T13:00:06","date_gmt":"2013-07-01T10:00:06","guid":{"rendered":"http:\/\/www.javacodegeeks.com\/?p=14873"},"modified":"2013-07-02T12:42:19","modified_gmt":"2013-07-02T09:42:19","slug":"mapreduce-algorithms-understanding-data-joins-part-1","status":"publish","type":"post","link":"https:\/\/www.javacodegeeks.com\/2013\/07\/mapreduce-algorithms-understanding-data-joins-part-1.html","title":{"rendered":"MapReduce Algorithms &#8211; Understanding Data Joins Part 1"},"content":{"rendered":"<p>In this post we continue with our series of implementing the algorithms found in the <a href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\">Data-Intensive Text Processing with MapReduce<\/a> book, this time discussing data joins. While we are going to discuss the techniques for joining data in Hadoop and provide sample code, in most cases you probably won\u2019t be writing code to perform joins yourself. Instead, joining data is better accomplished using tools that work at a higher level of abstraction, such as Hive or Pig. Why take the time to learn how to join data if there are tools that can take care of it for you? Joining data is arguably one of the biggest uses of Hadoop. Gaining a full understanding of how Hadoop performs joins is critical for deciding which join to use and for debugging when trouble strikes. Also, once you fully understand how different joins are performed in Hadoop, you can better leverage tools like Hive and Pig. Finally, there might be the one-off case where a tool just won\u2019t get you what you need and you\u2019ll have to roll up your sleeves and write the code yourself.<\/p>\n<h2>The Need for Joins<\/h2>\n<p>When processing large data sets, joining data by a common key can be very useful, if not essential. By joining data you can gain further insight, for example by joining on timestamps to correlate events with the time of day. 
The needs for joining data are many and varied. We will be covering three types of joins over three separate posts: Reduce-Side joins, Map-Side joins and the Memory-Backed Join. In this installment we will consider working with Reduce-Side joins.<\/p>\n<h2>Reduce Side Joins<\/h2>\n<p>Of the join patterns we will discuss, reduce-side joins are the easiest to implement. What makes reduce-side joins straightforward is the fact that Hadoop sends identical keys to the same reducer, so by default the data is organized for us. To perform the join, we simply need to cache a key and compare it to incoming keys. As long as the keys match, we can join the values from the corresponding keys. The trade-off with reduce-side joins is performance, since all of the data is shuffled across the network. Within reduce-side joins there are two different scenarios we will consider: one-to-one and one-to-many. We\u2019ll also explore options where we don\u2019t need to keep track of the incoming keys; all values for a given key will be grouped together in the reducer.<\/p>\n<h2>One-To-One Joins<\/h2>\n<p>A one-to-one join is the case where a value from dataset \u2018X\u2019 shares a common key with a value from dataset \u2018Y\u2019. Since Hadoop guarantees that equal keys are sent to the same reducer, mapping over the two datasets will take care of the join for us. Because sorting occurs only on keys, however, the order of the values is unknown. We can easily fix the situation by using <a title=\"MapReduce Algorithms \u2013 Secondary Sorting\" href=\"http:\/\/codingjunkie.net\/secondary-sort\/\" target=\"_blank\">secondary sorting<\/a>. Our implementation of secondary sorting will be to tag keys with either a \u201c1\u201d or a \u201c2\u201d to determine the order of the values. 
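<\/p>
<p>Before wiring this into Hadoop, the effect of the tag on sort order can be illustrated with a plain-Java stand-in for our composite key (no Hadoop types; the class names here are just for illustration). Comparing on the join key first and the tag second guarantees that, within each join key, the record tagged \u201c1\u201d always sorts ahead of the record tagged \u201c2\u201d:<\/p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java stand-in for the tagged composite key: sort by join key
// first, then by tag, mirroring what the shuffle/sort phase will do.
record Tagged(String joinKey, int tag) implements Comparable<Tagged> {
    @Override
    public int compareTo(Tagged other) {
        int cmp = joinKey.compareTo(other.joinKey);              // primary: join key
        return cmp != 0 ? cmp : Integer.compare(tag, other.tag); // secondary: tag
    }
}

class SecondarySortDemo {
    public static void main(String[] args) {
        List<Tagged> keys = new ArrayList<>(List.of(
                new Tagged("guid-B", 2),
                new Tagged("guid-A", 2),
                new Tagged("guid-B", 1),
                new Tagged("guid-A", 1)));
        Collections.sort(keys);
        // Keys now group by joinKey, and within each group tag 1 precedes tag 2.
        keys.forEach(System.out::println);
    }
}
```

<p>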
We need to take a couple of extra steps to implement our tagging strategy.<\/p>\n<h2>Implementing a WritableComparable<\/h2>\n<p>First we need to write a class that implements the WritableComparable interface, which will be used to wrap our key.<\/p>\n<pre class=\" brush:java\">public class TaggedKey implements WritableComparable&lt;TaggedKey&gt; {\r\n\r\n    private Text joinKey = new Text();\r\n    private IntWritable tag = new IntWritable();\r\n\r\n    @Override\r\n    public int compareTo(TaggedKey taggedKey) {\r\n        int compareValue = this.joinKey.compareTo(taggedKey.getJoinKey());\r\n        if (compareValue == 0) {\r\n            compareValue = this.tag.compareTo(taggedKey.getTag());\r\n        }\r\n        return compareValue;\r\n    }\r\n    \/\/ Details left out for clarity\r\n}<\/pre>\n<p>When our TaggedKey class is sorted, keys with the same <code>joinKey<\/code> value will have a secondary sort on the value of the <code>tag<\/code> field, ensuring the order we want.<\/p>\n<h2>Writing a Custom Partitioner<\/h2>\n<p>Next we need to write a custom partitioner that will consider only the join key when determining which reducer the composite key and data are sent to:<\/p>\n<pre class=\" brush:java\">public class TaggedJoiningPartitioner extends Partitioner&lt;TaggedKey,Text&gt; {\r\n\r\n    @Override\r\n    public int getPartition(TaggedKey taggedKey, Text text, int numPartitions) {\r\n        \/\/ Mask off the sign bit: hashCode() can be negative, which would\r\n        \/\/ otherwise yield an illegal (negative) partition number.\r\n        return (taggedKey.getJoinKey().hashCode() &amp; Integer.MAX_VALUE) % numPartitions;\r\n    }\r\n}<\/pre>\n<p>At this point we have what we need to join the data and ensure the order of the values. But we don\u2019t want to keep track of the keys as they come into the <code>reduce()<\/code> method. We want all the values grouped together for us. 
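<\/p>
<p>One detail in the partitioner deserves care: <code>hashCode()<\/code> can return a negative number, and a raw <code>hashCode() % numPartitions<\/code> would then produce a negative, illegal partition number at runtime. Masking off the sign bit with <code>Integer.MAX_VALUE<\/code> keeps the partition in range. Here is a small standalone demonstration of that masking (plain Java, no Hadoop types; the helper name is ours):<\/p>

```java
// Demonstrates the sign-bit masking a hash-based partitioner needs so the
// partition index always lands in [0, numPartitions).
class PartitionDemo {
    static int partitionFor(String joinKey, int numPartitions) {
        // & Integer.MAX_VALUE clears the sign bit before taking the modulus.
        return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] joinKeys = {
                "cdd8dde3-0349-4f0d-b97a-7ae84b687f9c",
                "81a43486-07e1-4b92-b92b-03d0caa87b5f",
                "aef52cf1-f565-4124-bf18-47acdac47a0e"};
        for (String key : joinKeys) {
            // Equal keys always map to the same partition, never a negative one.
            System.out.println(key + " -> partition " + partitionFor(key, 10));
        }
    }
}
```

<p>Back to the join itself: we still want all of the values for a given join key handed to a single <code>reduce()<\/code> call without tracking keys ourselves. 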
To accomplish this we will use a <code>Comparator<\/code> that will consider only the join key when deciding how to group the values.<\/p>\n<h2>Writing a Group Comparator<\/h2>\n<p>Our Comparator used for grouping will look like this:<\/p>\n<pre class=\" brush:java\">public class TaggedJoiningGroupingComparator extends WritableComparator {\r\n\r\n    public TaggedJoiningGroupingComparator() {\r\n        super(TaggedKey.class, true);\r\n    }\r\n\r\n    @Override\r\n    public int compare(WritableComparable a, WritableComparable b) {\r\n        TaggedKey taggedKey1 = (TaggedKey)a;\r\n        TaggedKey taggedKey2 = (TaggedKey)b;\r\n        return taggedKey1.getJoinKey().compareTo(taggedKey2.getJoinKey());\r\n    }\r\n}<\/pre>\n<h2>Structure of the Data<\/h2>\n<p>Now we need to determine what we will use for our key to join the data. For our sample data we will be using a CSV file generated from the <a href=\"http:\/\/www.fakenamegenerator.com\/order.php\" target=\"_blank\">Fake Name Generator<\/a>. The first column is a GUID, and that will serve as our join key. Our sample data contains information like name, address, email, job information, credit cards and automobiles owned. 
For the purposes of our demonstration we will take the GUID, name and address fields and place them in one file that will be structured like this:<\/p>\n<pre class=\" brush:bash\">cdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Esther,Garner,4071 Haven Lane,Okemos,MI\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,Timothy,Duncan,753 Stadium Drive,Taunton,MA\r\naef52cf1-f565-4124-bf18-47acdac47a0e,Brett,Ramsey,4985 Shinn Street,New York,NY<\/pre>\n<p>Then we will take the GUID, phone number, email address, username, password and credit card fields and place them in another file that will look like:<\/p>\n<pre class=\" brush:bash\">cdd8dde3-0349-4f0d-b97a-7ae84b687f9c,517-706-9565,EstherJGarner@teleworm.us,Waskepter38,noL2ieghie,MasterCard,5305687295670850\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,508-307-3433,TimothyDDuncan@einrot.com,Conerse,Gif4Edeiba,MasterCard,5265896533330445\r\naef52cf1-f565-4124-bf18-47acdac47a0e,212-780-4015,BrettMRamsey@dayrep.com,Subjecall,AiKoiweihi6,MasterCard,524<\/pre>\n<p>Now we need a Mapper that knows how to work with our data, extracting the correct key for joining and also setting the proper tag.<\/p>\n<h2>Creating the Mapper<\/h2>\n<p>Here is our Mapper code:<\/p>\n<pre class=\" brush:java\">public class JoiningMapper extends Mapper&lt;LongWritable, Text, TaggedKey, Text&gt; {\r\n\r\n    private int keyIndex;\r\n    private Splitter splitter;\r\n    private Joiner joiner;\r\n    private TaggedKey taggedKey = new TaggedKey();\r\n    private Text data = new Text();\r\n    private int joinOrder;\r\n\r\n    @Override\r\n    protected void setup(Context context) throws IOException, InterruptedException {\r\n        keyIndex = Integer.parseInt(context.getConfiguration().get(\"keyIndex\"));\r\n        String separator = context.getConfiguration().get(\"separator\");\r\n        splitter = 
Splitter.on(separator).trimResults();\r\n        joiner = Joiner.on(separator);\r\n        FileSplit fileSplit = (FileSplit)context.getInputSplit();\r\n        joinOrder = Integer.parseInt(context.getConfiguration().get(fileSplit.getPath().getName()));\r\n    }\r\n\r\n    @Override\r\n    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {\r\n        List&lt;String&gt; values = Lists.newArrayList(splitter.split(value.toString()));\r\n        String joinKey = values.remove(keyIndex);\r\n        String valuesWithOutKey = joiner.join(values);\r\n        taggedKey.set(joinKey, joinOrder);\r\n        data.set(valuesWithOutKey);\r\n        context.write(taggedKey, data);\r\n    }\r\n\r\n}<\/pre>\n<p>Let\u2019s review what is going on in the <code>setup()<\/code> method.<\/p>\n<ol>\n<li>First we get the index of our join key and the separator used in the text from values set in the Configuration when the job was launched.<\/li>\n<li>Then we create a <a href=\"http:\/\/docs.guava-libraries.googlecode.com\/git-history\/release\/javadoc\/com\/google\/common\/base\/Splitter.html\" target=\"_blank\">Guava Splitter<\/a> used to split the data on the separator we retrieved from the call to <code>context.getConfiguration().get(\"separator\")<\/code>. We also create a <a href=\"http:\/\/docs.guava-libraries.googlecode.com\/git-history\/release\/javadoc\/com\/google\/common\/base\/Joiner.html\" target=\"_blank\">Guava Joiner<\/a> used to put the data back together once the key has been extracted.<\/li>\n<li>Next we get the name of the file that this mapper will be processing. 
We use the filename to pull the join order for this file, which was stored in the configuration.<\/li>\n<\/ol>\n<p>We should also discuss what\u2019s going on in the <code>map()<\/code> method:<\/p>\n<ol>\n<li>Split our data and create a List of the values<\/li>\n<li>Remove the join key from the list<\/li>\n<li>Re-join the data back into a single String<\/li>\n<li>Set the join key, join order and the remaining data<\/li>\n<li>Write out the data<\/li>\n<\/ol>\n<p>So we have read in our data, extracted the key, set the join order and written our data back out. Let\u2019s take a look at how we will join the data.<\/p>\n<h2>Joining the Data<\/h2>\n<p>Now let\u2019s look at how the data is joined in the reducer:<\/p>\n<pre class=\" brush:java\">public class JoiningReducer extends Reducer&lt;TaggedKey, Text, NullWritable, Text&gt; {\r\n\r\n    private Text joinedText = new Text();\r\n    private StringBuilder builder = new StringBuilder();\r\n    private NullWritable nullKey = NullWritable.get();\r\n\r\n    @Override\r\n    protected void reduce(TaggedKey key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException {\r\n        builder.append(key.getJoinKey()).append(\",\");\r\n        for (Text value : values) {\r\n            builder.append(value.toString()).append(\",\");\r\n        }\r\n        builder.setLength(builder.length() - 1);\r\n        joinedText.set(builder.toString());\r\n        context.write(nullKey, joinedText);\r\n        builder.setLength(0);\r\n    }\r\n}<\/pre>\n<p>Since the key with the tag of \u201c1\u201d reaches the reducer first, we know that the name and address data is the first value and the email, username, password and credit card data is second. So we don\u2019t need to keep track of any keys. 
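<\/p>
<p>The end-to-end flow can also be traced in plain Java, without any Hadoop types. This is only a sketch of the mechanics, not the actual job: the \u201cmap\u201d step splits a record and pulls out the join key, the sort mimics the shuffle ordering on (joinKey, tag), and the \u201creduce\u201d step concatenates the join key with the values in tag order:<\/p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java trace of the reduce-side join mechanics for a single join key.
class ReduceSideJoinSketch {
    record Tagged(String joinKey, int tag, String value) {}

    // Mimics JoiningMapper.map(): split, remove the join key, re-join the rest.
    static Tagged map(String line, int keyIndex, int joinOrder) {
        List<String> values = new ArrayList<>(Arrays.asList(line.split(",")));
        String joinKey = values.remove(keyIndex);
        return new Tagged(joinKey, joinOrder, String.join(",", values));
    }

    // Mimics JoiningReducer.reduce(): join key first, then the values in order.
    static String reduce(String joinKey, List<String> values) {
        StringBuilder builder = new StringBuilder(joinKey);
        for (String value : values) {
            builder.append(",").append(value);
        }
        return builder.toString();
    }

    static String join(String recordFromFile1, String recordFromFile2) {
        List<Tagged> emitted = new ArrayList<>(List.of(
                map(recordFromFile2, 0, 2),   // emission order before the sort is arbitrary
                map(recordFromFile1, 0, 1)));
        // The shuffle/sort phase: order by (joinKey, tag).
        emitted.sort(Comparator.comparing(Tagged::joinKey).thenComparingInt(Tagged::tag));
        return reduce(emitted.get(0).joinKey(),
                emitted.stream().map(Tagged::value).toList());
    }

    public static void main(String[] args) {
        System.out.println(join(
            "cdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Esther,Garner,4071 Haven Lane,Okemos,MI",
            "cdd8dde3-0349-4f0d-b97a-7ae84b687f9c,517-706-9565,EstherJGarner@teleworm.us,Waskepter38,noL2ieghie,MasterCard,5305687295670850"));
    }
}
```

<p>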
We simply loop over the values and concatenate them together.<\/p>\n<h2>One-To-One Join Results<\/h2>\n<p>Here are the results from running our One-To-One MapReduce job:<\/p>\n<pre class=\" brush:bash\">cdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Esther,Garner,4071 Haven Lane,Okemos,MI,517-706-9565,EstherJGarner@teleworm.us,Waskepter38,noL2ieghie,MasterCard,5305687295670850\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,Timothy,Duncan,753 Stadium Drive,Taunton,MA,508-307-3433,TimothyDDuncan@einrot.com,Conerse,Gif4Edeiba,MasterCard,5265896533330445\r\naef52cf1-f565-4124-bf18-47acdac47a0e,Brett,Ramsey,4985 Shinn Street,New York,NY,212-780-4015,BrettMRamsey@dayrep.com,Subjecall,AiKoiweihi6,MasterCard,5243379373546690<\/pre>\n<p>As we can see, the two records from our sample data above have been merged into a single record. We have successfully joined the GUID, name, address, phone number, email address, username, password and credit card fields together into one file.<\/p>\n<h2>Specifying Join Order<\/h2>\n<p>At this point we may be asking: how do we specify the join order for multiple files? 
The answer lies in our <code>ReduceSideJoinDriver<\/code> class, which serves as the driver for our MapReduce program.<\/p>\n<pre class=\" brush:java\">public class ReduceSideJoinDriver {\r\n\r\n    public static void main(String[] args) throws Exception {\r\n        Splitter splitter = Splitter.on('\/');\r\n        StringBuilder filePaths = new StringBuilder();\r\n\r\n        Configuration config = new Configuration();\r\n        config.set(\"keyIndex\", \"0\");\r\n        config.set(\"separator\", \",\");\r\n\r\n        for (int i = 0; i &lt; args.length - 1; i++) {\r\n            String fileName = Iterables.getLast(splitter.split(args[i]));\r\n            config.set(fileName, Integer.toString(i + 1));\r\n            filePaths.append(args[i]).append(\",\");\r\n        }\r\n\r\n        filePaths.setLength(filePaths.length() - 1);\r\n        Job job = Job.getInstance(config, \"ReduceSideJoin\");\r\n        job.setJarByClass(ReduceSideJoinDriver.class);\r\n\r\n        FileInputFormat.addInputPaths(job, filePaths.toString());\r\n        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));\r\n\r\n        job.setMapperClass(JoiningMapper.class);\r\n        job.setReducerClass(JoiningReducer.class);\r\n        job.setPartitionerClass(TaggedJoiningPartitioner.class);\r\n        job.setGroupingComparatorClass(TaggedJoiningGroupingComparator.class);\r\n        job.setOutputKeyClass(TaggedKey.class);\r\n        job.setOutputValueClass(Text.class);\r\n        System.exit(job.waitForCompletion(true) ? 0 : 1);\r\n\r\n    }\r\n}<\/pre>\n<ol>\n<li>First we create a Guava Splitter on line 5 that will split strings by a &#8220;\/&#8221;.<\/li>\n<li>Then on lines 8-10 we set the index of our join key and the separator used in the files.<\/li>\n<li>In lines 12-17 we set the tags for the input files to be joined. The order of the file names on the command line determines their position in the join. As we loop over the file names from the command line, we split the whole file name and retrieve the last value (the base filename) via the Guava <a href=\"http:\/\/docs.guava-libraries.googlecode.com\/git-history\/release\/javadoc\/com\/google\/common\/collect\/Iterables.html\" target=\"_blank\"><code>Iterables.getLast()<\/code><\/a> method. We then call <code>config.set()<\/code> with the filename as the key, using <code>i + 1<\/code> as the value, which sets the tag or join order. The last value in the <code>args<\/code> array is skipped in the loop, as that is used for the output path of our MapReduce job on line 23. On the last line of the loop we append each file path to a StringBuilder, which is used later (line 22) to set the input paths for the job.<\/li>\n<li>We only need to use one mapper for all files, the JoiningMapper, which is set on line 25.<\/li>\n<li>Lines 27 and 28 set our custom partitioner and group comparator (respectively), which ensure the arrival order of keys and values at the reducer and properly group the values with the correct key.<\/li>\n<\/ol>\n<p>By using the partitioner and the grouping comparator we know the first value belongs to the first key and can be used to join with every other value contained in the <code>Iterable<\/code> sent to the <code>reduce()<\/code> method for a given key. Now it&#8217;s time to consider the one-to-many join.<\/p>\n<h2>One-To-Many Join<\/h2>\n<p>The good news is that with all the work we have done up to this point, we can actually use the code as it stands to perform a one-to-many join. There are two approaches we can consider for the one-to-many join: 1) a small file with the single records and a second file with many records for the same key, or 2) again a small file with the single records, but N separate files, each containing a record that matches one in the first file. 
The main difference is that with the first approach, the order of the values beyond the first joined value will be unknown. With the second approach, however, we will &#8220;tag&#8221; each join file so we can control the order of all the joined values. For our example the first file will remain our GUID-name-address file, and we will have 3 additional files that will contain automobile, employer and job description records. This is probably not the most realistic scenario, but it will serve for the purposes of demonstration. Here&#8217;s a sample of how the data will look before we do the join:<\/p>\n<pre class=\" brush:bash\">\r\n\/\/The single person records\r\ncdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Esther,Garner,4071 Haven Lane,Okemos,MI\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,Timothy,Duncan,753 Stadium Drive,Taunton,MA\r\naef52cf1-f565-4124-bf18-47acdac47a0e,Brett,Ramsey,4985 Shinn Street,New York,NY\r\n\/\/Automobile records\r\ncdd8dde3-0349-4f0d-b97a-7ae84b687f9c,2003 Holden Cruze\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,2012 Volkswagen T5\r\naef52cf1-f565-4124-bf18-47acdac47a0e,2009 Renault Trafic\r\n\/\/Employer records\r\ncdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Creative Wealth\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,Susie's Casuals\r\naef52cf1-f565-4124-bf18-47acdac47a0e,Super Saver Foods\r\n\/\/Job Description records\r\ncdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Data entry clerk\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,Precision instrument and equipment repairer\r\naef52cf1-f565-4124-bf18-47acdac47a0e,Gas and water service dispatcher\r\n<\/pre>\n<h2>One-To-Many Join Results<\/h2>\n<p>Now let&#8217;s look at a sample of the results of our one-to-many joins (using the same values from above to aid in the comparison):<\/p>\n<pre class=\" brush:bash\">\r\ncdd8dde3-0349-4f0d-b97a-7ae84b687f9c,Esther,Garner,4071 Haven Lane,Okemos,MI,2003 Holden Cruze,Creative Wealth,Data entry clerk\r\n81a43486-07e1-4b92-b92b-03d0caa87b5f,Timothy,Duncan,753 Stadium 
Drive,Taunton,MA,2012 Volkswagen T5,Susie's Casuals,Precision instrument and equipment repairer\r\naef52cf1-f565-4124-bf18-47acdac47a0e,Brett,Ramsey,4985 Shinn Street,New York,NY,2009 Renault Trafic,Super Saver Foods,Gas and water service dispatcher\r\n<\/pre>\n<p>As the results show, we have been able to successfully join several values in a specified order.<\/p>\n<h2>Conclusion<\/h2>\n<p>We have successfully demonstrated how we can perform reduce-side joins in MapReduce. Even though the approach is not overly complicated, we can see that performing joins in Hadoop can involve writing a fair amount of code. While learning how joins work is a useful exercise, in most cases we are much better off using tools like Hive or Pig for joining data. Thanks for your time.<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li><a title=\"Data-Intensive Text Processing with MapReduce\" href=\"http:\/\/www.amazon.com\/Data-Intensive-Processing-MapReduce-Synthesis-Technologies\/dp\/1608453421\" target=\"_blank\">Data-Intensive Text Processing with MapReduce<\/a> by Jimmy Lin and Chris Dyer<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Hadoop-Definitive-Guide-Tom-White\/dp\/1449311520\/ref=tmm_pap_title_0?ie=UTF8&amp;qid=1347589052&amp;sr=1-1\" target=\"_blank\">Hadoop: The Definitive Guide<\/a> by Tom White<\/li>\n<li><a title=\"Source Code\" href=\"https:\/\/github.com\/bbejeck\/hadoop-algorithms\" target=\"_blank\">Source Code and Tests<\/a> from the blog<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Programming-Hive-Edward-Capriolo\/dp\/1449319335\" target=\"_blank\">Programming Hive<\/a> by Edward Capriolo, Dean Wampler and Jason Rutherglen<\/li>\n<li><a href=\"http:\/\/www.amazon.com\/Programming-Pig-Alan-Gates\/dp\/1449302645\" target=\"_blank\">Programming Pig<\/a> by Alan Gates<\/li>\n<li><a href=\"http:\/\/hadoop.apache.org\/docs\/r0.20.2\/api\/index.html\">Hadoop API<\/a><\/li>\n<li><a href=\"http:\/\/mrunit.apache.org\/\" target=\"_blank\">MRUnit<\/a> for unit testing Apache Hadoop map reduce jobs<\/li>\n<\/ul>\n<div style=\"border: 1px solid #D8D8D8; background: #FAFAFA; width: 100%; padding-left: 5px;\"><b><i>Reference: <\/i><\/b><a href=\"http:\/\/codingjunkie.net\/mapreduce-reduce-joins\/\">MapReduce Algorithms &#8211; Understanding Data Joins Part 1<\/a> from our <a href=\"http:\/\/www.javacodegeeks.com\/jcg\">JCG partner<\/a> Bill Bejeck at the <a href=\"http:\/\/codingjunkie.net\/\">Random Thoughts On Coding<\/a> blog.<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In this post we continue with our series of implementing the algorithms found in the Data-Intensive Text Processing with MapReduce book, this time discussing data joins. While we are going to discuss the techniques for joining data in Hadoop and provide sample code, in most cases you probably won\u2019t be writing code to perform joins &hellip;<\/p>\n","protected":false},"author":110,"featured_media":63,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[184,183],"class_list":["post-14873","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-enterprise-java","tag-apache-hadoop","tag-mapreduce"]}
time","breadcrumb":{"@id":"https:\/\/www.javacodegeeks.com\/2013\/07\/mapreduce-algorithms-understanding-data-joins-part-1.html#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.javacodegeeks.com\/2013\/07\/mapreduce-algorithms-understanding-data-joins-part-1.html"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/2013\/07\/mapreduce-algorithms-understanding-data-joins-part-1.html#primaryimage","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/apache-hadoop-mapreduce-logo.jpg","width":150,"height":150},{"@type":"BreadcrumbList","@id":"https:\/\/www.javacodegeeks.com\/2013\/07\/mapreduce-algorithms-understanding-data-joins-part-1.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.javacodegeeks.com\/"},{"@type":"ListItem","position":2,"name":"Java","item":"https:\/\/www.javacodegeeks.com\/category\/java"},{"@type":"ListItem","position":3,"name":"Enterprise Java","item":"https:\/\/www.javacodegeeks.com\/category\/java\/enterprise-java"},{"@type":"ListItem","position":4,"name":"MapReduce Algorithms &#8211; Understanding Data Joins Part 1"}]},{"@type":"WebSite","@id":"https:\/\/www.javacodegeeks.com\/#website","url":"https:\/\/www.javacodegeeks.com\/","name":"Java Code Geeks","description":"Java Developers Resource Center","publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"alternateName":"JCG","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.javacodegeeks.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.javacodegeeks.com\/#organization","name":"Exelixis Media 
P.C.","url":"https:\/\/www.javacodegeeks.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","width":864,"height":246,"caption":"Exelixis Media P.C."},"image":{"@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/javacodegeeks","https:\/\/x.com\/javacodegeeks"]},{"@type":"Person","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/69f9f11896bf9cfd7278b440efeda646","name":"Bill Bejeck","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/6f0ab8cd639470515ff498599471cc60f21b2d0b14301ff22cadc708dc19c8be?s=96&d=mm&r=g","caption":"Bill Bejeck"},"description":"Husband, father of 3, passionate about software 
development.","sameAs":["http:\/\/codingjunkie.net\/"],"url":"https:\/\/www.javacodegeeks.com\/author\/Bill-Bejeck"}]}},"_links":{"self":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/14873","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/users\/110"}],"replies":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/comments?post=14873"}],"version-history":[{"count":0,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/14873\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media\/63"}],"wp:attachment":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media?parent=14873"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/categories?post=14873"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/tags?post=14873"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
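The one-to-one reduce-side join described above tags each composite key with a "1" or a "2" so that, after the sort, the record from the first dataset always reaches the reducer before its match from the second. The following is a minimal plain-Java sketch of that idea; it simulates the shuffle/sort in memory with a sorted map rather than using the actual Hadoop API, and all names and sample data are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simulation of a reduce-side one-to-one join: composite keys of the form
// "<joinKey>-<tag>" sort so that dataset X ("1") precedes dataset Y ("2"),
// mirroring the secondary-sort ordering Hadoop would provide.
public class ReduceSideJoinSketch {

    // The "shuffle": tag each record's key and let the TreeMap sort them,
    // just as Hadoop sorts composite keys before the reduce phase.
    static SortedMap<String, String> shuffle(Map<String, String> x,
                                             Map<String, String> y) {
        SortedMap<String, String> sorted = new TreeMap<>();
        x.forEach((k, v) -> sorted.put(k + "-1", v));
        y.forEach((k, v) -> sorted.put(k + "-2", v));
        return sorted;
    }

    // The "reducer": cache the value from the first-tagged record and emit
    // the joined pair when the second tag for the same natural key arrives.
    static Map<String, String> join(SortedMap<String, String> shuffled) {
        Map<String, String> joined = new LinkedHashMap<>();
        String cachedKey = null;
        String cachedValue = null;
        for (Map.Entry<String, String> e : shuffled.entrySet()) {
            // strip the trailing "-1" / "-2" tag to recover the natural key
            String naturalKey = e.getKey().substring(0, e.getKey().length() - 2);
            if (naturalKey.equals(cachedKey)) {
                joined.put(naturalKey, cachedValue + "," + e.getValue());
            } else {
                cachedKey = naturalKey;
                cachedValue = e.getValue();
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> names = Map.of("100", "Bob Smith", "200", "Ann Jones");
        Map<String, String> cities = Map.of("100", "Reston", "200", "Dulles");
        join(shuffle(names, cities))
                .forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Because equal natural keys arrive consecutively and the "1"-tagged record always comes first, the reducer only ever needs to cache a single value, which is exactly what keeps the real Hadoop implementation memory-safe.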