Distributed File Systems
A Distributed File System (DFS) is a type of file system that allows multiple computers to
access and manage files stored on a network as if they were located on a local disk. This enables
efficient data sharing, scalability, and fault tolerance.
1. Physical Organization of Compute Nodes
In a DFS, compute nodes are the physical machines responsible for storing and processing data.
The organization of these nodes determines system performance, reliability, and fault tolerance.
Key Aspects of Physical Organization
1. Cluster-Based Architecture:
○ DFS typically uses clusters of machines connected via a high-speed network.
○ Each node can store data and perform computations in parallel.
2. Data Partitioning:
○ Files are split into smaller chunks or blocks.
○ These chunks are distributed across multiple nodes to balance load and improve
access speed.
3. Replication:
○ To ensure reliability, data is replicated across different nodes.
○ If a node fails, replicas ensure data availability.
4. Metadata and Name Nodes:
○ Metadata contains information about file locations and structure.
○ Some DFS (e.g., Hadoop HDFS) use a dedicated NameNode to manage
metadata, while others distribute it across nodes.
5. Networking:
○ High-speed communication between nodes is crucial.
○ Technologies like InfiniBand, Ethernet, and Remote Direct Memory Access
(RDMA) are used for efficient data transfer.
2. Large-Scale File-System Organization
Managing files efficiently at scale requires a well-structured DFS design.
Components of Large-Scale File-System Organization
1. File Distribution Strategies:
○ Hash-Based: Uses a hashing function to determine file placement.
○ Range-Based: Assigns file ranges to different nodes.
○ Dynamic Allocation: Adjusts file placement based on system load.
2. Namespace Management:
○ Single namespace (centralized) vs. distributed namespace (scalable).
○ Distributed namespace ensures no single point of failure.
3. Consistency and Fault Tolerance:
○ DFS must handle failures without data corruption.
○ Techniques like RAID, Replication, and Erasure Coding are used.
4. Caching and Prefetching:
○ To reduce latency, frequently accessed files are cached.
○ Prefetching loads data before it's needed to improve performance.
5. Security and Access Control:
○ Authentication: Ensuring only authorized users access files.
○ Encryption: Protecting data in transit and at rest.
○ Access Control Lists (ACLs): Managing permissions.
Examples of Distributed File Systems
● HDFS (Hadoop Distributed File System) – Used in big data applications.
● Google File System (GFS) – Foundation for Google’s data storage.
● CephFS – Open-source scalable DFS.
● Amazon S3 – Object storage with distributed architecture.
Conclusion: A well-organized DFS efficiently manages large-scale data storage and retrieval
across multiple compute nodes. By employing strategies like replication, partitioning, and
caching, DFS ensures high availability, fault tolerance, and scalability, making it ideal for cloud
computing, big data processing, and large-scale enterprise applications.
MapReduce Framework
MapReduce is a parallel processing model for efficiently processing large-scale datasets across
distributed computing environments. It consists of two main phases:
● Map Phase (Processes input data in parallel)
● Reduce Phase (Aggregates and summarizes results)
1. The Map Tasks
The Map phase transforms input data into key-value pairs.
Steps in the Map Phase:
1. The input dataset is split into chunks (usually 64MB or 128MB).
2. Each chunk is processed by a Map function running on different nodes in parallel.
3. The Map function processes each input record and emits intermediate (key, value) pairs.
Example:
For a word count application:
● Input: ["Hello World", "Hello MapReduce"]
Mapper Output:
("Hello", 1)
("World", 1)
("Hello", 1)
("MapReduce", 1)
2. Grouping by Key
After the Map phase, intermediate key-value pairs are shuffled and grouped by key before
being passed to the Reduce phase.
Key Features:
● The framework automatically groups all values for the same key together.
● Uses a process called shuffle and sort, ensuring that all occurrences of a key are sent to
the same reducer.
Example: Mapper Output:
("Hello", 1), ("World", 1), ("Hello", 1), ("MapReduce", 1)
Grouped Output (after shuffle and sort):
("Hello", [1, 1])
("World", [1])
("MapReduce", [1])
3. The Reduce Tasks: The Reduce phase aggregates or processes the grouped key-value pairs.
Steps in Reduce Phase:
1. The grouped (key, list of values) pairs are assigned to a reducer.
2. The Reduce function processes these values to generate a final output.
3. The output is stored in HDFS or another distributed storage system.
Example:
Reduce Function for Word Count:
● Input: ("Hello", [1, 1]), ("World", [1]), ("MapReduce", [1])
Reduce Output:
("Hello", 2)
("World", 1)
("MapReduce", 1)
4. Combiners
A Combiner is an optional optimization step that performs local aggregation before sending
data to the reducers.
Why Use Combiners?
● Reduces data transfer between mappers and reducers.
● Decreases network congestion and improves efficiency.
Example (Word Count with Combiner):
Mapper Output (before combiner):
("Hello", 1), ("Hello", 1), ("World", 1)
Combiner Output (before sending to reducer):
("Hello", 2), ("World", 1)
Note: Not all applications can use combiners effectively. They are mainly useful for
associative and commutative operations like sum, count, max, min.
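As a rough illustration, a word-count combiner can reuse the same summing logic as the reducer to collapse one mapper's local output before it is shuffled (in Hadoop the combiner is usually the same logic as the reducer). This is a sketch only; names are illustrative.

```python
from collections import defaultdict

def combine(mapper_output):
    """Local aggregation on one mapper's output before the shuffle.
    Valid here because addition is associative and commutative."""
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return list(local.items())

# One mapper's raw output, then its combined form.
raw = [("Hello", 1), ("Hello", 1), ("World", 1)]
print(combine(raw))   # [('Hello', 2), ('World', 1)]
```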
5. Details of MapReduce Execution
The MapReduce job execution involves multiple stages:
Execution Steps:
1. Input Splitting: The dataset is divided into chunks.
2. Map Task Execution: Mappers process each chunk and emit intermediate key-value
pairs.
3. Shuffling and Sorting: Key-value pairs are grouped by key.
4. Reduce Task Execution: Reducers aggregate and generate final output.
5. Output Storage: The final results are written to HDFS or another storage system.
MapReduce Execution Workflow:
1. Job Submission: The user submits a job to the JobTracker.
2. Job Initialization: The framework divides the job into multiple tasks.
3. Task Execution: Mappers and reducers process data in parallel.
4. Monitoring & Fault Tolerance: Failed tasks are re-executed.
5. Completion: The final output is stored.
6. Coping with Node Failure
Since MapReduce runs in distributed environments, handling node failures is essential.
Fault Tolerance Mechanisms:
1. Task Re-execution:
○ If a node fails, incomplete tasks are reassigned to other nodes.
2. Data Replication:
○ HDFS stores multiple replicas of data blocks to ensure availability.
3. Speculative Execution:
○ If a task is slow due to hardware issues, a duplicate task is launched on another
node.
4. Heartbeat Monitoring:
○ Nodes send periodic signals (heartbeats) to the JobTracker.
○ If a node stops responding, it is marked as dead and tasks are reassigned.
Conclusion
MapReduce efficiently processes large-scale data using parallel computing. The framework
handles fault tolerance, optimization with combiners, and automatic load balancing. By
distributing computations across multiple nodes, it provides scalability and reliability, making it
suitable for big data applications like log analysis, machine learning, and data mining.
Algorithms Using MapReduce
MapReduce can be used to implement various algorithms for processing large-scale datasets.
Some key applications include matrix-vector multiplication, relational algebra operations,
and computing selections.
1. Matrix-Vector Multiplication Using MapReduce
Problem Statement:
Given a matrix M of size m×n and a vector V of size n×1, compute the matrix-vector
product R = M × V.
MapReduce Approach:
1. Mapper Phase:
○ Each mapper reads a matrix entry M(i,j) and the corresponding vector element V(j).
○ The mapper emits intermediate key-value pairs (row index, partial product).
2. Shuffle & Sort:
○ The framework groups all partial results for each row index.
3. Reducer Phase:
○ The reducer sums up the partial products for each row index to compute the final
result.
Example: Let
M = | 2 3 |      V = | 1 |
    | 4 5 |          | 2 |
Mapper Output:
(0, 2×1) → (0, 2)
(0, 3×2) → (0, 6)
(1, 4×1) → (1, 4)
(1, 5×2) → (1, 10)
Reducer Output:
(0, 8) // Row 0 result (2 + 6)
(1, 14) // Row 1 result (4 + 10)
Final Output:
R = | 8 |
    | 14 |
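The same computation can be sketched compactly in Python, assuming the vector fits in memory on every mapper. The names and the in-memory shuffle below are illustrative, not any particular framework's API.

```python
from collections import defaultdict

M = [[2, 3],
     [4, 5]]
V = [1, 2]

def map_fn(i, j, m_ij, v):
    """Emit (row index, partial product M[i][j] * V[j])."""
    yield (i, m_ij * v[j])

def reduce_fn(i, partials):
    """Sum the partial products for row i."""
    return (i, sum(partials))

grouped = defaultdict(list)
for i, row in enumerate(M):
    for j, m_ij in enumerate(row):
        for key, value in map_fn(i, j, m_ij, V):
            grouped[key].append(value)

print(sorted(reduce_fn(i, p) for i, p in grouped.items()))
# [(0, 8), (1, 14)]
```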
2. Relational Algebra Operations Using MapReduce
Relational Algebra Basics:
Relational operations like selection, projection, join, and aggregation can be implemented
using MapReduce.
Types of Operations:
1. Selection (σ condition (R)) → Filters rows based on conditions.
2. Projection (π attributes (R)) → Selects specific columns from a relation.
3. Join (R ⨝ S) → Combines two relations based on a common attribute.
4. Grouping & Aggregation (SUM, COUNT, AVG, MAX, MIN).
3. Computing Selections by MapReduce:
Selection (σ condition (R))
Selection involves filtering rows from a relation R based on a condition.
MapReduce Implementation:
1. Mapper Phase:
○ Reads each record from the dataset.
○ Emits the record only if it satisfies the selection condition.
2. Reducer Phase:
○ Simply passes the records to the output.
Example:
Consider a Student dataset: (ID, Name, Age, Grade)
(1, Alice, 21, A)
(2, Bob, 22, B)
(3, Charlie, 20, A)
Query: Find all students with Grade = A.
Selection Condition: σ Grade = 'A' (Student)
Mapper Output:
(1, Alice, 21, A)
(3, Charlie, 20, A)
Reducer Output:
(1, Alice, 21, A)
(3, Charlie, 20, A)
Optimization: Since selection is a filtering operation, it can be performed entirely
in the mapper phase without using reducers.
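A map-only selection can be sketched as below; since no grouping is needed, the number of reduce tasks can simply be set to zero. This is an in-memory sketch with illustrative names.

```python
def select_map(record):
    """Map phase of the selection σ Grade = 'A' (Student):
    emit a record only if it satisfies the condition; no reducer is needed."""
    if record[3] == "A":          # Grade is the fourth field
        yield record

students = [(1, "Alice", 21, "A"),
            (2, "Bob", 22, "B"),
            (3, "Charlie", 20, "A")]

selected = [r for rec in students for r in select_map(rec)]
print(selected)   # [(1, 'Alice', 21, 'A'), (3, 'Charlie', 20, 'A')]
```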
Conclusion: MapReduce can efficiently process matrix operations, relational algebra
operations, and selection queries on large-scale data. By leveraging parallel computation, it
improves scalability and performance in distributed computing environments like Hadoop and
Spark.
Algorithms Using MapReduce (Continued)
MapReduce can be used for various relational algebra operations, including projection, set
operations (union, intersection, difference), natural join, and grouping & aggregation.
1. Computing Projections by MapReduce
Projection (π attributes (R)): Projection selects specific columns (attributes) from a relation R,
eliminating duplicates.
MapReduce Approach:
1. Mapper Phase:
○ Reads each record.
○ Emits only the required attributes as key-value pairs.
2. Reducer Phase:
○ Eliminates duplicate records.
Example:
Consider a Student dataset:
(ID, Name, Age, Grade)
(1, Alice, 21, A)
(2, Bob, 22, B)
(3, Charlie, 20, A)
(4, Alice, 21, A)
Query: Find all unique student names → π Name (Student)
Mapper Output:
(Alice, 1)
(Bob, 1)
(Charlie, 1)
(Alice, 1)
Reducer Output (Removing Duplicates):
(Alice)
(Bob)
(Charlie)
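A sketch of π Name (Student): the mapper emits only the projected attribute as the key, and the reducer emits each key exactly once, which removes duplicates. The in-memory grouping below stands in for the shuffle; names are illustrative.

```python
from collections import defaultdict

def project_map(record):
    """Emit (projected attribute, None); here we keep only Name."""
    _, name, _, _ = record
    yield (name, None)

def project_reduce(key, _values):
    """Emitting each key exactly once eliminates duplicates."""
    return key

students = [(1, "Alice", 21, "A"), (2, "Bob", 22, "B"),
            (3, "Charlie", 20, "A"), (4, "Alice", 21, "A")]

grouped = defaultdict(list)
for rec in students:
    for k, v in project_map(rec):
        grouped[k].append(v)

print([project_reduce(k, v) for k, v in grouped.items()])
# ['Alice', 'Bob', 'Charlie']
```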
2. Union, Intersection, and Difference by MapReduce
Union (R ∪ S): The union operation combines tuples from two relations R and S, removing
duplicates.
MapReduce Approach:
1. Mapper Phase:
○ Reads records from both R and S.
○ Emits (record, 1) as key-value pairs.
2. Reducer Phase:
○ Outputs unique records.
Example:
Two datasets:
R (Student1):
(1, Alice)
(2, Bob)
(3, Charlie)
S (Student2):
(2, Bob)
(4, David)
(5, Eve)
Mapper Output:
(Alice, 1)
(Bob, 1)
(Charlie, 1)
(Bob, 1)
(David, 1)
(Eve, 1)
Reducer Output (Unique Values):
(Alice)
(Bob)
(Charlie)
(David)
(Eve)
Intersection (R ∩ S): The intersection operation finds common elements in R and S.
MapReduce Approach:
1. Mapper Phase:
○ Reads records from both R and S.
○ Emits (record, dataset_id) (e.g., 1 for R, 2 for S).
2. Reducer Phase:
○ Keeps only records appearing in both datasets (having both 1 and 2).
Mapper Output:
(Alice, 1)
(Bob, 1)
(Charlie, 1)
(Bob, 2)
(David, 2)
(Eve, 2)
Reducer Output (Common Values):
(Bob)
Difference (R - S)
The difference operation finds records in R that are not in S.
MapReduce Approach:
1. Mapper Phase:
○ Reads records from both R and S.
○ Emits (record, dataset_id).
2. Reducer Phase:
○ Outputs records only present in R (dataset_id = 1 but not 2).
Reducer Output (R - S):
(Alice)
(Charlie)
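All three set operations share the same mapper (emit each record tagged with its source relation); only the reducer's decision rule changes. The sketch below simulates this in memory with illustrative names.

```python
from collections import defaultdict

R = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
S = [(2, "Bob"), (4, "David"), (5, "Eve")]

def tag_map(record, dataset_id):
    """Emit (record, dataset_id), where dataset_id is 1 for R and 2 for S."""
    yield (record, dataset_id)

pairs = [kv for r in R for kv in tag_map(r, 1)] + \
        [kv for s in S for kv in tag_map(s, 2)]

grouped = defaultdict(set)                  # stand-in for shuffle & sort
for record, tag in pairs:
    grouped[record].add(tag)

# Reducer decision rules for the three set operations:
union        = [rec for rec, tags in grouped.items()]                  # any tag
intersection = [rec for rec, tags in grouped.items() if tags == {1, 2}]
difference   = [rec for rec, tags in grouped.items() if tags == {1}]   # R - S

print(sorted(union))        # [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David'), (5, 'Eve')]
print(intersection)         # [(2, 'Bob')]
print(difference)           # [(1, 'Alice'), (3, 'Charlie')]
```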
3. Computing Natural Join by MapReduce
Natural Join (R ⨝ S)
A natural join combines two relations R(A, B) and S(B, C) based on a common attribute (B).
MapReduce Approach:
1. Mapper Phase:
○ Emits (B, (Relation Name, Attributes)).
2. Reducer Phase:
○ Joins records with the same B value from both relations.
Example:
R (Employee):
(ID, Name, Dept_ID)
(1, Alice, 101)
(2, Bob, 102)
(3, Charlie, 103)
S (Department):
(Dept_ID, Dept_Name)
(101, HR)
(102, IT)
(104, Sales)
Mapper Output:
(101, ("R", 1, Alice))
(102, ("R", 2, Bob))
(103, ("R", 3, Charlie))
(101, ("S", HR))
(102, ("S", IT))
(104, ("S", Sales))
Reducer Output (Joined Table):
(1, Alice, 101, HR)
(2, Bob, 102, IT)
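The reduce-side join above can be sketched as follows: both relations are mapped to the join key Dept_ID, and the reducer pairs every Employee tuple with every Department tuple sharing that key. This is a simplified in-memory equi-join; names are illustrative.

```python
from collections import defaultdict

employees   = [(1, "Alice", 101), (2, "Bob", 102), (3, "Charlie", 103)]
departments = [(101, "HR"), (102, "IT"), (104, "Sales")]

def join_map(record, relation):
    """Key every record by the join attribute Dept_ID and tag its relation."""
    if relation == "R":                      # Employee(ID, Name, Dept_ID)
        emp_id, name, dept_id = record
        yield (dept_id, ("R", emp_id, name))
    else:                                    # Department(Dept_ID, Dept_Name)
        dept_id, dept_name = record
        yield (dept_id, ("S", dept_name))

def join_reduce(dept_id, values):
    """Pair each Employee with each Department sharing the key."""
    emps  = [v[1:] for v in values if v[0] == "R"]
    depts = [v[1]  for v in values if v[0] == "S"]
    return [(emp_id, name, dept_id, dept_name)
            for (emp_id, name) in emps for dept_name in depts]

grouped = defaultdict(list)
for rec in employees:
    for k, v in join_map(rec, "R"):
        grouped[k].append(v)
for rec in departments:
    for k, v in join_map(rec, "S"):
        grouped[k].append(v)

for k in sorted(grouped):
    for row in join_reduce(k, grouped[k]):
        print(row)
# (1, 'Alice', 101, 'HR')
# (2, 'Bob', 102, 'IT')
```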
4. Grouping and Aggregation by MapReduce
Grouping & Aggregation (SUM, COUNT, AVG, MAX, MIN)
Grouping and aggregation operations involve computing functions like SUM, COUNT, AVG,
MAX, and MIN for groups of records.
MapReduce Approach:
1. Mapper Phase:
○ Reads each record.
○ Emits (group_key, value).
2. Reducer Phase:
○ Aggregates values using SUM, COUNT, AVG, MAX, MIN.
Example: Compute Total Sales Per Product
Sales Data:
(Product, Sales)
(A, 100)
(A, 200)
(B, 300)
(B, 400)
(A, 150)
Mapper Output:
(A, 100)
(A, 200)
(B, 300)
(B, 400)
(A, 150)
Reducer Output (SUM Aggregation):
(A, 450)
(B, 700)
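A sketch of the SUM aggregation: the mapper keys each sale by product, and the reducer sums the values per group. The in-memory grouping and names below are illustrative.

```python
from collections import defaultdict

sales = [("A", 100), ("A", 200), ("B", 300), ("B", 400), ("A", 150)]

def agg_map(record):
    """Emit (group_key, value); here the group key is the product."""
    product, amount = record
    yield (product, amount)

def agg_reduce(product, amounts):
    """SUM aggregation; swap in max(), min(), or len() for MAX, MIN, COUNT."""
    return (product, sum(amounts))

grouped = defaultdict(list)
for rec in sales:
    for k, v in agg_map(rec):
        grouped[k].append(v)

print([agg_reduce(k, v) for k, v in grouped.items()])
# [('A', 450), ('B', 700)]
```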
Conclusion
MapReduce enables efficient relational algebra operations at scale. By leveraging parallel
processing, it handles projection, set operations, joins, and aggregation efficiently, making it
suitable for big data analytics in frameworks like Hadoop and Spark.
Matrix Multiplication Using MapReduce
Matrix multiplication is a fundamental operation in big data analytics and can be efficiently
implemented using MapReduce. There are two approaches:
1. General Matrix Multiplication using Multiple MapReduce Steps
2. Optimized Matrix Multiplication using One MapReduce Step
1. General Matrix Multiplication Using MapReduce
Problem Statement:
Given two matrices A (m × n) and B (n × p), compute the product C = A × B, where C (m × p)
is the resultant matrix.
MapReduce Approach:
We need to compute each element of C as:
C(i, j) = Σₖ A(i, k) × B(k, j), where k runs from 1 to n
Steps in MapReduce:
Step 1: Mapper Phase (first pass)
● Read elements from both A and B.
● Emit each element keyed by the shared index k, so that matching entries of A and B reach the same reducer.
Step 2: Reducer Phase (first pass) and Shuffle
● For each k, pair every A(i, k) with every B(k, j) and emit the partial product (i, j, A(i, k) × B(k, j)).
● The partial products are then grouped by (i, j).
Step 3: Summation (second pass)
● A second MapReduce step sums all partial products for each (i, j) to compute C(i, j).
Example: Compute C = A × B
Input Matrices:
A (2×3):
2 3 4
1 5 6
B (3×2):
1 2
3 4
5 6
Mapper Output (first pass):
Each matrix element is emitted as a key-value pair keyed by the shared index k.
For A(i,k), emit (k, (A, i, A(i,k))):
(0,0) → (0, (A, 0, 2))
(0,1) → (1, (A, 0, 3))
(0,2) → (2, (A, 0, 4))
(1,0) → (0, (A, 1, 1))
(1,1) → (1, (A, 1, 5))
(1,2) → (2, (A, 1, 6))
For B(k,j), emit (k, (B, j, B(k,j))):
(0,0) → (0, (B, 0, 1))
(0,1) → (0, (B, 1, 2))
(1,0) → (1, (B, 0, 3))
(1,1) → (1, (B, 1, 4))
(2,0) → (2, (B, 0, 5))
(2,1) → (2, (B, 1, 6))
Reducer Output:
The first-pass reducer pairs every (A, i, a) with every (B, j, b) for its key k and emits ((i, j), a×b);
the second pass sums these partial products for each (i, j) key:
C(0,0) = (2×1) + (3×3) + (4×5) = 2 + 9 + 20 = 31
C(0,1) = (2×2) + (3×4) + (4×6) = 4 + 12 + 24 = 40
C(1,0) = (1×1) + (5×3) + (6×5) = 1 + 15 + 30 = 46
C(1,1) = (1×2) + (5×4) + (6×6) = 2 + 20 + 36 = 58
Final Output C (2×2):
31 40
46 58
2. Matrix Multiplication with One MapReduce Step
Optimization Idea:
Instead of using multiple MapReduce steps, we can directly compute and reduce the output
matrix elements in a single MapReduce step.
Modified MapReduce Approach:
1. Mapper Phase:
○ Emit each A(i,k) with key (i, j) for every column j of C, and each B(k,j) with key
(i, j) for every row i of C, so that a single reducer receives everything needed to compute C(i,j).
2. Reducer Phase:
○ Compute C(i,j) by summing all partial products.
Example: One-Step MapReduce Matrix Multiplication
Mapper Output:
For A(i,k), emit ((i, j), (A, k, A(i,k))) for every column j of C:
((0,0), (A, 0, 2))   ((0,1), (A, 0, 2))
((0,0), (A, 1, 3))   ((0,1), (A, 1, 3))
((0,0), (A, 2, 4))   ((0,1), (A, 2, 4))
((1,0), (A, 0, 1))   ((1,1), (A, 0, 1))
((1,0), (A, 1, 5))   ((1,1), (A, 1, 5))
((1,0), (A, 2, 6))   ((1,1), (A, 2, 6))
For B(k,j), emit ((i, j), (B, k, B(k,j))) for every row i of C:
((0,0), (B, 0, 1))   ((1,0), (B, 0, 1))
((0,0), (B, 1, 3))   ((1,0), (B, 1, 3))
((0,0), (B, 2, 5))   ((1,0), (B, 2, 5))
((0,1), (B, 0, 2))   ((1,1), (B, 0, 2))
((0,1), (B, 1, 4))   ((1,1), (B, 1, 4))
((0,1), (B, 2, 6))   ((1,1), (B, 2, 6))
Reducer Output:
(0,0) → 31
(0,1) → 40
(1,0) → 46
(1,1) → 58
This method ensures that C(i, j) is computed directly in a single step, reducing the need for
extra MapReduce passes.
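The one-step scheme can be sketched in Python as below: each A(i,k) is replicated to every key (i, j), each B(k,j) to every key (i, j), and one reducer per (i, j) computes the dot product. This is a small in-memory simulation under illustrative names, not tied to any particular Hadoop API.

```python
from collections import defaultdict

A = [[2, 3, 4],
     [1, 5, 6]]            # m x n
B = [[1, 2],
     [3, 4],
     [5, 6]]               # n x p
m, n, p = len(A), len(B), len(B[0])

def map_fn():
    """Replicate each matrix element to every C(i, j) it contributes to."""
    for i in range(m):
        for k in range(n):
            for j in range(p):
                yield ((i, j), ("A", k, A[i][k]))
    for k in range(n):
        for j in range(p):
            for i in range(m):
                yield ((i, j), ("B", k, B[k][j]))

def reduce_fn(key, values):
    """Match A and B entries on k and sum the products."""
    a = {k: v for tag, k, v in values if tag == "A"}
    b = {k: v for tag, k, v in values if tag == "B"}
    return key, sum(a[k] * b[k] for k in a)

grouped = defaultdict(list)                  # stand-in for shuffle & sort
for key, value in map_fn():
    grouped[key].append(value)

print(sorted(reduce_fn(k, v) for k, v in grouped.items()))
# [((0, 0), 31), ((0, 1), 40), ((1, 0), 46), ((1, 1), 58)]
```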
Comparison of Two Approaches

| Approach | Steps Required | Data Transfer Overhead | Performance |
|---|---|---|---|
| General Matrix Multiplication | Multiple MapReduce steps | Higher (intermediate outputs) | Slower |
| Optimized One-Step Multiplication | Single MapReduce step | Lower (direct computation) | Faster |
The optimized approach is preferred for big data applications like Hadoop and Spark, as it
minimizes network traffic and computation time.
Conclusion: Matrix multiplication using MapReduce can be performed efficiently in two ways:
1. Traditional Multi-Step Approach: Uses multiple MapReduce phases but is
straightforward.
2. Optimized One-Step Approach: Directly computes C(i, j) in a single MapReduce
step, reducing overhead and improving performance.
The One-Step approach is more efficient and widely used in big data frameworks for
handling large-scale matrix operations.
With the help of a sequence diagram, explain how job processing requests from the user are
handled in the Hadoop framework.
The Hadoop framework follows a Master-Slave architecture where the Client, JobTracker (or
ResourceManager in YARN), TaskTrackers (or NodeManagers in YARN), and DataNodes
work together to process job requests.
Steps in Hadoop Job Processing:
1. User submits a job:
○ The Client writes a job using the Hadoop API and submits it to the JobTracker (in
Hadoop 1.x) or ResourceManager (in YARN).
○ The job includes the MapReduce logic and input file location in HDFS.
2. JobTracker/ResourceManager initializes the job:
○ It checks for resource availability in the cluster.
○ The job is split into smaller Map tasks based on input splits.
3. Task Assignment to TaskTrackers/NodeManagers:
○ The JobTracker (or ApplicationMaster in YARN) assigns tasks to available
TaskTrackers (or NodeManagers in YARN).
○ Each TaskTracker fetches the required data from HDFS.
4. Map Tasks Execution:
○ The assigned nodes execute Mapper functions, transforming input data into
key-value pairs.
○ Output of the Mapper phase is stored locally before being shuffled.
5. Shuffling and Sorting:
○ The intermediate key-value pairs are grouped by key and shuffled to
corresponding reducers.
6. Reduce Tasks Execution:
○ The Reducer processes grouped key-value pairs and generates the final output.
7. Job Completion:
○ The JobTracker monitors task progress.
○ Upon completion, the output is stored in HDFS, and the Client is notified.
Explain how a program using MapReduce is executed, elaborating on the interaction of processes, tasks,
and files. How are node failures handled? Explain.
How a MapReduce Program is Executed
A MapReduce program is executed in four major stages:
1. Input Splitting
● The input file (stored in HDFS) is divided into fixed-size chunks (default: 128 MB in
Hadoop 2.x; 64 MB in Hadoop 1.x).
● Each split is processed independently by a Mapper.
2. Map Phase
● A Mapper function is applied to each split, converting it into key-value pairs.
Example: Consider an input file with the following text:
apple banana apple mango banana apple
The Mapper processes it and generates output like:
("apple", 1), ("banana", 1), ("apple", 1), ("mango", 1),
("banana", 1), ("apple", 1)
3. Shuffling & Sorting
● The intermediate key-value pairs are grouped and sorted based on keys.
This ensures that all values for a particular key are sent to the same Reducer.
apple → [1, 1, 1]
banana → [1, 1]
mango → [1]
4. Reduce Phase
The Reducer aggregates values for each key.
apple → 3
banana → 2
mango → 1
● The final output is written to HDFS.
Handling Node Failures in Hadoop
Hadoop is designed to handle failures using replication, checkpointing, and task
reassignment.
1. TaskTracker/NodeManager Failure:
○ The JobTracker (or ResourceManager) detects missing heartbeats.
○ It reassigns tasks to a healthy node.
2. DataNode Failure:
○ Hadoop replicates data across multiple DataNodes.
○ If a node fails, another replica is used automatically.
3. JobTracker/ResourceManager Failure:
○ In Hadoop 1.x, the JobTracker failure causes job failure.
○ In Hadoop 2.x (YARN), high availability mechanisms restart the
ResourceManager.
With a neat diagram, explain the notion of MapReduce computation. Illustrate the MapReduce
approach to count the number of occurrences of a given word in a data file. (Assume suitable data
wherever necessary.)
MapReduce follows a divide-and-conquer approach to process large datasets in parallel. It
consists of two main functions:
● Map: Processes input data and emits intermediate key-value pairs.
● Reduce: Aggregates the key-value pairs and generates final output.
Example: Counting Word Occurrences
Consider the input file:
apple banana apple mango banana apple
Step 1: Input Splitting
The file is split into chunks and sent to multiple Mappers.
Step 2: Map Phase
Each Mapper processes a split and generates:
("apple", 1) ("banana", 1) ("apple", 1) ("mango", 1)
("banana", 1) ("apple", 1)
Step 3: Shuffling and Sorting
Key-value pairs from different Mappers are grouped:
apple → [1, 1, 1]
banana → [1, 1]
mango → [1]
Step 4: Reduce Phase
The Reducer sums up the occurrences:
apple → 3
banana → 2
mango → 1
MapReduce Computation Flow Diagram
This parallel processing approach makes MapReduce ideal for large-scale data processing. 🚀
1. Matrix-Vector Multiplication (MapReduce)
Problem: Given a matrix A (m x n) and a vector v (n x 1), compute the product Av.
Case 1: Vector Fits Properly (Small Vector)
Matrix A:
A = | 1 2 3 |
    | 4 5 6 |
● (m = 2, n = 3)
Vector v:
v = | 7 |
    | 8 |
    | 9 |
● (n = 3)
● Map Function:
○ Input: (row_index, row of A) and v
○ For each element A[i][j] in the row, emit (row_index, A[i][j] * v[j])
● Reduce Function:
○ Input: (row_index, list of partial products)
○ Sum the partial products to get the element in the result vector.
● Example Map Output:
○ (0, 1 * 7) = (0, 7)
○ (0, 2 * 8) = (0, 16)
○ (0, 3 * 9) = (0, 27)
○ (1, 4 * 7) = (1, 28)
○ (1, 5 * 8) = (1, 40)
○ (1, 6 * 9) = (1, 54)
● Example Reduce Output:
○ (0, [7, 16, 27]) -> (0, 7 + 16 + 27) = (0, 50)
○ (1, [28, 40, 54]) -> (1, 28 + 40 + 54) = (1, 122)
Result Vector:
Av = | 50 |
| 122 |
Case 2: Vector Too Large to Fit (Large Vector)
● This scenario requires partitioning the vector. We'll partition the vector and process it in
chunks.
Matrix A:
A = | 1 2 3 4 |
    | 5 6 7 8 |
    | 9 1 2 3 |
● (m = 3, n = 4)
Vector v:
v = | 1 |
    | 2 |
    | 3 |
    | 4 |
● (n=4)
● Map Function:
○ Input: (row_index, row of A), v
○ emit (row_index, A[row_index][column_index] * v[column_index])
● Reduce Function:
○ Input: (row_index, list of partial products)
○ Sum the partial products.
● Example Map Output:
○ (0, 1*1)= (0,1)
○ (0, 2*2)= (0,4)
○ (0, 3*3)= (0,9)
○ (0, 4*4)= (0,16)
○ (1, 5*1)= (1,5)
○ (1, 6*2)= (1,12)
○ (1, 7*3)= (1,21)
○ (1, 8*4)= (1,32)
○ (2, 9*1)= (2,9)
○ (2, 1*2)= (2,2)
○ (2, 2*3)= (2,6)
○ (2, 3*4)= (2,12)
● Example Reduce Output:
○ (0, [1,4,9,16]) = (0, 30)
○ (1, [5,12,21,32]) = (1, 70)
○ (2, [9,2,6,12]) = (2, 29)
Result Vector:
Av = | 30 |
| 70 |
| 29 |
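One common way to handle a vector that does not fit in a mapper's memory is to split v into horizontal stripes and M into matching vertical stripes, so each mapper only needs one stripe of v at a time. The sketch below simulates that idea; the stripe size and all names are illustrative assumptions, not a prescribed scheme.

```python
from collections import defaultdict

M = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 1, 2, 3]]
v = [1, 2, 3, 4]
STRIPE = 2                      # columns handled per stripe (illustrative)

def map_stripe(stripe_start, v_stripe):
    """One mapper per stripe: it only holds one chunk of v and the matching
    columns of M, emitting (row index, partial product)."""
    for i, row in enumerate(M):
        for offset, v_j in enumerate(v_stripe):
            j = stripe_start + offset
            yield (i, row[j] * v_j)

grouped = defaultdict(list)     # stand-in for shuffle & sort
for start in range(0, len(v), STRIPE):
    for key, value in map_stripe(start, v[start:start + STRIPE]):
        grouped[key].append(value)

print([(i, sum(parts)) for i, parts in grouped.items()])
# [(0, 30), (1, 70), (2, 29)]
```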
2. Matrix Multiplication (MapReduce)
Problem: Given matrices A (m x k) and B (k x n), compute the product C = AB (m x n).
Matrix A:
A = | 1 2 |
    | 3 4 |
● (m = 2, k = 2)
Matrix B:
B = | 5 6 |
    | 7 8 |
● (k = 2, n = 2)
● Map Function:
○ For each element A[i][j], emit ((i, k), ('A', j, A[i][j])) for every column k of C.
○ For each element B[j][k], emit ((i, k), ('B', j, B[j][k])) for every row i of C.
● Reduce Function:
○ Input: ((i, k), list of values).
○ For each pair ('A', j, A[i][j]) and ('B', j, B[j][k]), compute A[i][j] * B[j][k].
○ Sum all products with the same (i, k).
● Example Map Output (each element is emitted once for every cell of C it contributes to):
○ A[0][0]=1 → ((0, 0), ('A', 0, 1)), ((0, 1), ('A', 0, 1))
○ A[0][1]=2 → ((0, 0), ('A', 1, 2)), ((0, 1), ('A', 1, 2))
○ A[1][0]=3 → ((1, 0), ('A', 0, 3)), ((1, 1), ('A', 0, 3))
○ A[1][1]=4 → ((1, 0), ('A', 1, 4)), ((1, 1), ('A', 1, 4))
○ B[0][0]=5 → ((0, 0), ('B', 0, 5)), ((1, 0), ('B', 0, 5))
○ B[0][1]=6 → ((0, 1), ('B', 0, 6)), ((1, 1), ('B', 0, 6))
○ B[1][0]=7 → ((0, 0), ('B', 1, 7)), ((1, 0), ('B', 1, 7))
○ B[1][1]=8 → ((0, 1), ('B', 1, 8)), ((1, 1), ('B', 1, 8))
● Example Reduce Output:
○ ((0, 0), [('A', 0, 1), ('A', 1, 2), ('B', 0, 5), ('B', 1, 7)]) -> (0, 0, 1×5 + 2×7) = (0, 0, 19)
○ ((0, 1), [('A', 0, 1), ('A', 1, 2), ('B', 0, 6), ('B', 1, 8)]) -> (0, 1, 1×6 + 2×8) = (0, 1, 22)
○ ((1, 0), [('A', 0, 3), ('A', 1, 4), ('B', 0, 5), ('B', 1, 7)]) -> (1, 0, 3×5 + 4×7) = (1, 0, 43)
○ ((1, 1), [('A', 0, 3), ('A', 1, 4), ('B', 0, 6), ('B', 1, 8)]) -> (1, 1, 3×6 + 4×8) = (1, 1, 50)
Result Matrix C:
C = | 19 22 |
| 43 50 |
Matrix Multiplication with One MapReduce Step
Matrix multiplication using one MapReduce step means performing the entire multiplication
process in a single map and reduce phase rather than iterative MapReduce steps. This approach is
often used in distributed computing frameworks like Hadoop.
Problem Statement
Given two matrices:
A = | A11 A12 |      B = | B11 B12 |
    | A21 A22 |          | B21 B22 |
The result of matrix multiplication C = A × B is:
C(i, j) = Σₖ A(i, k) × B(k, j)
MapReduce Steps
1. Mapper
Each element from A and B is emitted with a key that represents the corresponding position in
matrix C.
For matrix A (element A(i,k)): emit key-value pairs, one for every column j of C:
(i, j) → (A, k, A(i,k))
For matrix B (element B(k,j)): emit key-value pairs, one for every row i of C:
(i, j) → (B, k, B(k,j))
2. Reducer
● The reducer groups values by key (i, j), multiplies the corresponding A and B elements
that share the same k, and sums the products.
Example
Consider the matrices:
A = | 1 2 |      B = | 5 6 |
    | 3 4 |          | 7 8 |
Mapper Output
Each element is emitted once for every cell of C it contributes to.
For matrix A:
(0,0) → ("A",0,1)   (0,1) → ("A",0,1)
(0,0) → ("A",1,2)   (0,1) → ("A",1,2)
(1,0) → ("A",0,3)   (1,1) → ("A",0,3)
(1,0) → ("A",1,4)   (1,1) → ("A",1,4)
For matrix B:
(0,0) → ("B",0,5)   (1,0) → ("B",0,5)
(0,1) → ("B",0,6)   (1,1) → ("B",0,6)
(0,0) → ("B",1,7)   (1,0) → ("B",1,7)
(0,1) → ("B",1,8)   (1,1) → ("B",1,8)
Shuffling and Sorting
The pairs are grouped by key (i, j):
(0,0) → [("A",0,1), ("A",1,2), ("B",0,5), ("B",1,7)]
(0,1) → [("A",0,1), ("A",1,2), ("B",0,6), ("B",1,8)]
(1,0) → [("A",0,3), ("A",1,4), ("B",0,5), ("B",1,7)]
(1,1) → [("A",0,3), ("A",1,4), ("B",0,6), ("B",1,8)]
Reducer Processing
Each reducer computes:
C(0,0) = (1×5) + (2×7) = 5 + 14 = 19
C(0,1) = (1×6) + (2×8) = 6 + 16 = 22
C(1,0) = (3×5) + (4×7) = 15 + 28 = 43
C(1,1) = (3×6) + (4×8) = 18 + 32 = 50
So, the final result:
C = | 19 22 |
    | 43 50 |
This approach ensures that the entire multiplication is done in one MapReduce step.
Hadoop Architecture
Hadoop follows a Master-Slave Architecture, primarily consisting of four core components:
1. Hadoop Distributed File System (HDFS) – Storage Layer
2. MapReduce – Processing Layer
3. Yet Another Resource Negotiator (YARN) – Resource Management Layer
4. Common Utilities – Libraries and utilities used across Hadoop modules
1. HDFS (Hadoop Distributed File System)
HDFS is the storage component of Hadoop. It splits large data files into smaller blocks and
distributes them across multiple nodes in a cluster.
HDFS Architecture
● NameNode (Master Node): Manages metadata (file locations, block mapping, etc.) and
regulates access to files.
● DataNodes (Slave Nodes): Store the actual data blocks and respond to read/write
requests from clients.
● Secondary NameNode: Periodically saves snapshots of the NameNode’s metadata to
help in recovery.
Key Features of HDFS
● Stores large files by breaking them into blocks (default block size: 128 MB in Hadoop 2.x; 64 MB in Hadoop 1.x)
● Replicates data (default: 3 copies) for fault tolerance
● Works on commodity hardware
2. MapReduce (Processing Layer)
MapReduce is a programming model for processing large datasets in parallel across a Hadoop
cluster. It consists of two phases:
1. Map Phase: Processes input data and transforms it into key-value pairs.
2. Reduce Phase: Aggregates the mapped data to generate the final output.
3. YARN (Yet Another Resource Negotiator)
YARN is responsible for resource management and job scheduling in Hadoop. It has two main
components:
● ResourceManager (Master Node): Allocates resources across applications.
● NodeManager (Slave Nodes): Monitors resource usage and reports to the
ResourceManager.
4. Common Utilities
These are shared libraries required for different Hadoop modules to work together.
Hadoop Cluster Architecture
A Hadoop cluster consists of:
● Master Node: Runs NameNode and ResourceManager
● Slave Nodes: Run DataNodes and NodeManagers
● Client Node: Submits jobs and interacts with the cluster
MapReduce Architecture
MapReduce is a programming model in Hadoop used for processing large-scale data in a
distributed and parallel manner across a cluster. It follows a Master-Slave Architecture
consisting of two main phases:
1. Map Phase (Processing & Sorting the Input Data)
2. Reduce Phase (Aggregating & Generating the Final Output)
1. Components of MapReduce Architecture
MapReduce consists of the following key components:
a) JobTracker (Master Node) [In Traditional Hadoop 1.0]
● Assigns tasks to TaskTrackers (slave nodes).
● Monitors task execution and handles failures.
● In Hadoop 2.x, JobTracker is replaced by YARN’s ResourceManager.
b) TaskTracker (Slave Node) [In Traditional Hadoop 1.0]
● Runs Map and Reduce tasks as per JobTracker’s instructions.
● Reports task progress to JobTracker.
● In Hadoop 2.x, TaskTracker is replaced by NodeManager.
c) ResourceManager (Master Node) [In Hadoop 2.0+]
● Manages cluster resources and schedules tasks.
d) NodeManager (Slave Node) [In Hadoop 2.0+]
● Monitors resource usage on each node and reports to ResourceManager.
2. MapReduce Execution Flow
Step 1: Input Splitting
● The input file is divided into fixed-size splits (default: 128 MB in Hadoop 2.x; 64 MB in Hadoop 1.x).
● Each split is processed independently by a separate Mapper task.
Step 2: Mapping (Map Phase)
● Each Mapper processes one split and converts it into intermediate key-value pairs.
● Example:
○ Input: "Hello World Hello"
○ Output from Mapper: (Hello, 1), (World, 1), (Hello, 1)
Step 3: Shuffling and Sorting
● Intermediate key-value pairs are grouped, sorted, and partitioned.
● The keys are assigned to different Reducers.
Step 4: Reducing (Reduce Phase)
● The Reducer processes the grouped key-value pairs to generate the final output.
● Example:
○ Input from Shuffle & Sort: {Hello: [1,1], World: [1]}
○ Output: {Hello: 2, World: 1}
Step 5: Writing Output
● The final reduced output is written to HDFS.
3. Diagram of MapReduce Architecture
4. Key Features of MapReduce
✅ Parallel Processing – Multiple nodes process data in parallel.
✅ Fault Tolerance – Automatic re-execution of failed tasks.
✅ Scalability – Handles petabytes of data.
✅ Data Locality – Processes data where it is stored (reducing network traffic).