Why is Object Serialization Essential in Hadoop MapReduce?
Hadoop MapReduce processes data in a distributed environment, meaning that data must be
transferred across nodes in a cluster.
Serialization is essential because:
Efficient Data Transfer: Converts in-memory objects into a compact byte stream that can be
sent across the network.
Persistence & Storage: Stores intermediate results on disk between the map and reduce
phases.
Interoperability: Guarantees that data written on one node can be read back identically on
any other node.
Example: Processing Student Exam Data
Imagine we have a large dataset of students' exam scores stored in multiple files across different
servers. Our goal is to find the top-scoring student per subject using MapReduce.
Step 1: Input Data (CSV format)
StudentID, Name, Subject, Marks
101, Alice, Math, 85
102, Bob, Math, 90
103, Charlie, Science, 78
104, David, Science, 92
Step 2: Map Phase (Serialization)
Each StudentRecord (Java object) is serialized and sent to the reducer.
Serialized format reduces network bandwidth usage.
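A minimal sketch of what such a class might look like using Hadoop's Writable interface (the StudentRecord class, its fields, and its getters are illustrative assumptions for this walkthrough, not an existing API):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value class for the exam-score walkthrough
public class StudentRecord implements Writable {
    private String name;
    private int marks;

    // Default constructor (required by Hadoop for deserialization)
    public StudentRecord() {}

    public StudentRecord(String name, int marks) {
        this.name = name;
        this.marks = marks;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);   // fields are written in a fixed order...
        out.writeInt(marks);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();  // ...and read back in exactly the same order
        marks = in.readInt();
    }

    public String getName() { return name; }
    public int getMarks() { return marks; }
}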
Step 3: Shuffle & Sort Phase (Intermediate Data Storage)
Data is serialized and stored on disk before reducing.
Example intermediate record, shown here as JSON for readability (Hadoop's Writable
serialization actually produces a compact binary format, not JSON):
{"StudentID":101, "Name":"Alice", "Subject":"Math", "Marks":85}
Step 4: Reduce Phase (Deserialization)
Serialized data is deserialized to Java objects.
The reducer identifies the top scorer per subject.
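A sketch of such a reducer, assuming the mapper emits (subject, StudentRecord) pairs using the hypothetical StudentRecord class from Step 2:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopScorerReducer extends Reducer<Text, StudentRecord, Text, Text> {
    @Override
    protected void reduce(Text subject, Iterable<StudentRecord> records, Context context)
            throws IOException, InterruptedException {
        String topName = null;
        int topMarks = Integer.MIN_VALUE;
        // Hadoop deserializes each StudentRecord from the shuffled bytes as we iterate;
        // we copy the fields out because Hadoop reuses the value object between iterations.
        for (StudentRecord r : records) {
            if (r.getMarks() > topMarks) {
                topMarks = r.getMarks();
                topName = r.getName();
            }
        }
        context.write(subject, new Text(topName + " (" + topMarks + ")"));
    }
}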
Final Output (Top Scorer per Subject)
Math: Bob (90)
Science: David (92)
Note:
i. Without serialization, in-memory StudentRecord objects could not be transferred between
nodes at all, and without a compact format the transfers would waste network bandwidth.
ii. Hadoop’s Writable interface ensures fast and compact serialization.
iii. Serialization allows Hadoop to process massive datasets across multiple machines
efficiently.
WritableComparable and Comparators in Hadoop
In Hadoop's MapReduce framework, WritableComparable<T> is an interface that extends both
Writable (for serialization) and Comparable<T> (for sorting). It is used to define custom keys
that are both writable and comparable.
WritableComparable<T>
Used when a key needs to be both serialized and sorted.
Requires implementing write(DataOutput out) and readFields(DataInput in) for serialization.
Requires implementing compareTo(T o) to define the natural sort order.
Custom Comparators:
WritableComparator: A base class for key comparators; by default it deserializes both keys
and delegates to compareTo(), and subclasses can override its byte-level compare() to skip
deserialization entirely.
Comparators in MapReduce: Keys can be ordered in different ways at different stages.
KeyComparator (sorts keys during the shuffle phase; set via Job.setSortComparatorClass)
GroupingComparator (groups values for a single reduce() call; set via
Job.setGroupingComparatorClass)
Example: Sorting Students by Marks using WritableComparable
Consider a scenario where we need to process student data, with each student having a name and
marks. We want to sort students by marks in descending order.
Step 1: Define Student Key as WritableComparable
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class Student implements WritableComparable<Student> {
    private String name;
    private int marks;

    // Default constructor (required for Hadoop serialization)
    public Student() {}

    public Student(String name, int marks) {
        this.name = name;
        this.marks = marks;
    }

    // Getter used by StudentComparator below (the field itself is private)
    public int getMarks() {
        return marks;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(marks);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        marks = in.readInt();
    }

    @Override
    public int compareTo(Student other) {
        return Integer.compare(other.marks, this.marks); // Descending order of marks
    }

    @Override
    public String toString() {
        return name + "\t" + marks;
    }
}
Step 2: Custom Comparator for Sorting
We may need a custom comparator for sorting in a different order (e.g., ascending order of
marks).
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class StudentComparator extends WritableComparator {
    protected StudentComparator() {
        super(Student.class, true); // true = instantiate keys so compare() receives objects
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Student s1 = (Student) a;
        Student s2 = (Student) b;
        // Use the getter: the marks field is private to the Student class
        return Integer.compare(s1.getMarks(), s2.getMarks()); // Ascending order
    }
}
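The real optimization WritableComparator enables is overriding its byte-level compare(),
which sorts keys without deserializing them at all. A sketch of such a method for
StudentComparator, assuming Student's serialized layout is writeUTF(name) (a 2-byte length
prefix plus the string bytes) followed by a 4-byte int of marks:

@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Skip the UTF-encoded name: a 2-byte length prefix plus that many string bytes
    int nameLen1 = readUnsignedShort(b1, s1);
    int nameLen2 = readUnsignedShort(b2, s2);
    // Read marks directly from the serialized bytes -- no Student objects are created
    int marks1 = readInt(b1, s1 + 2 + nameLen1);
    int marks2 = readInt(b2, s2 + 2 + nameLen2);
    return Integer.compare(marks1, marks2); // Ascending order
}

readUnsignedShort() and readInt() are static helpers inherited from WritableComparator; the
method above would go inside the StudentComparator class.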
Working in Hadoop
Using Student as the map output key means Hadoop sorts records by its compareTo() during the
shuffle.
The custom comparator (StudentComparator) can be plugged in through the job configuration, as
sketched below.
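A minimal driver fragment showing where the comparator is registered (the driver class name,
the mapper/reducer wiring, and the Text value type are assumptions for this sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class StudentSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort students by marks");
        job.setJarByClass(StudentSortDriver.class);
        job.setMapOutputKeyClass(Student.class);   // shuffle sorts by Student.compareTo()
        job.setMapOutputValueClass(Text.class);    // assumed value type for this sketch
        // Replaces the natural (descending) order with StudentComparator's ascending order
        job.setSortComparatorClass(StudentComparator.class);
        // ... set mapper, reducer, and input/output paths here, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}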
Note:
Use WritableComparable for keys that need sorting in Hadoop.
Define a comparator when sorting logic differs from the natural ordering.
Ensure efficient serialization to optimize performance.