Distributed File Systems
A Distributed File System (DFS) is a type of file system that allows multiple computers to
access and manage files stored on a network as if they were located on a local disk. This enables
efficient data sharing, scalability, and fault tolerance.
1. Physical Organization of Compute Nodes
In a DFS, compute nodes are the physical machines responsible for storing and processing data.
The organization of these nodes determines system performance, reliability, and fault tolerance.
Key Aspects of Physical Organization
1. Cluster-Based Architecture:
○ DFS typically uses clusters of machines connected via a high-speed network.
○ Each node can store data and perform computations in parallel.
2. Data Partitioning:
○ Files are split into smaller chunks or blocks.
○ These chunks are distributed across multiple nodes to balance load and improve
access speed.
3. Replication:
○ To ensure reliability, data is replicated across different nodes.
○ If a node fails, replicas ensure data availability.
4. Metadata and Name Nodes:
○ Metadata contains information about file locations and structure.
○ Some DFS (e.g., Hadoop HDFS) use a dedicated NameNode to manage
metadata, while others distribute it across nodes.
5. Networking:
○ High-speed communication between nodes is crucial.
○ Technologies like InfiniBand, Ethernet, and Remote Direct Memory Access
(RDMA) are used for efficient data transfer.
2. Large-Scale File-System Organization
Managing files efficiently at scale requires a well-structured DFS design.
Components of Large-Scale File-System Organization
1. File Distribution Strategies:
○ Hash-Based: Uses a hashing function to determine file placement.
○ Range-Based: Assigns file ranges to different nodes.
○ Dynamic Allocation: Adjusts file placement based on system load.
2. Namespace Management:
○ Single namespace (centralized) vs. distributed namespace (scalable).
○ Distributed namespace ensures no single point of failure.
3. Consistency and Fault Tolerance:
○ DFS must handle failures without data corruption.
○ Techniques like RAID, Replication, and Erasure Coding are used.
4. Caching and Prefetching:
○ To reduce latency, frequently accessed files are cached.
○ Prefetching loads data before it's needed to improve performance.
5. Security and Access Control:
○ Authentication: Ensuring only authorized users access files.
○ Encryption: Protecting data in transit and at rest.
○ Access Control Lists (ACLs): Managing permissions.
Examples of Distributed File Systems
● HDFS (Hadoop Distributed File System) – Used in big data applications.
● Google File System (GFS) – Foundation for Google’s data storage.
● CephFS – Open-source scalable DFS.
● Amazon S3 – Object storage with distributed architecture.
Conclusion: A well-organized DFS efficiently manages large-scale data storage and retrieval
across multiple compute nodes. By employing strategies like replication, partitioning, and
caching, DFS ensures high availability, fault tolerance, and scalability, making it ideal for cloud
computing, big data processing, and large-scale enterprise applications.
MapReduce Framework
MapReduce is a parallel processing model for efficiently processing large-scale datasets across
distributed computing environments. It consists of two main phases:
● Map Phase (Processes input data in parallel)
● Reduce Phase (Aggregates and summarizes results)
1. The Map Tasks
The Map phase transforms input data into key-value pairs.
Steps in the Map Phase:
1. The input dataset is split into chunks (usually 64MB or 128MB).
2. Each chunk is processed by a Map function running on different nodes in parallel.
3. The Map function processes each input record and emits intermediate (key, value) pairs.
Example:
For a word count application:
● Input: ["Hello World", "Hello MapReduce"]
Mapper Output:
("Hello", 1)
("World", 1)
("Hello", 1)
("MapReduce", 1)
2. Grouping by Key
After the Map phase, intermediate key-value pairs are shuffled and grouped by key before
being passed to the Reduce phase.
Key Features:
● The framework automatically groups all values for the same key together.
● Uses a process called shuffle and sort, ensuring that all occurrences of a key are sent to
the same reducer.
Example: Mapper Output:
("Hello", 1), ("World", 1), ("Hello", 1), ("MapReduce", 1)
Grouped Output (after shuffle and sort):
("Hello", [1, 1])
("World", [1])
("MapReduce", [1])
3. The Reduce Tasks: The Reduce phase aggregates or processes the grouped key-value pairs.
Steps in Reduce Phase:
1. The grouped (key, list of values) pairs are assigned to a reducer.
2. The Reduce function processes these values to generate a final output.
3. The output is stored in HDFS or another distributed storage system.
Example:
Reduce Function for Word Count:
● Input: ("Hello", [1, 1]), ("World", [1]), ("MapReduce", [1])
Reduce Output:
("Hello", 2)
("World", 1)
("MapReduce", 1)
4. Combiners
A Combiner is an optional optimization step that performs local aggregation before sending
data to the reducers.
Why Use Combiners?
● Reduces data transfer between mappers and reducers.
● Decreases network congestion and improves efficiency.
Example (Word Count with Combiner):
Mapper Output (before combiner):
("Hello", 1), ("Hello", 1), ("World", 1)
Combiner Output (before sending to reducer):
("Hello", 2), ("World", 1)
Note: Not all applications can use combiners effectively. They are mainly useful for
associative and commutative operations like sum, count, max, min.
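As a rough illustration, a word-count combiner can reuse the same summing logic as the reducer to collapse one mapper's local output before it is shuffled (in Hadoop the combiner is usually the same logic as the reducer). This is a sketch only; names are illustrative.

```python
from collections import defaultdict

def combine(mapper_output):
    """Local aggregation on one mapper's output before the shuffle.
    Valid here because addition is associative and commutative."""
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return list(local.items())

# One mapper's raw output, then its combined form.
raw = [("Hello", 1), ("Hello", 1), ("World", 1)]
print(combine(raw))   # [('Hello', 2), ('World', 1)]
```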
5. Details of MapReduce Execution
The MapReduce job execution involves multiple stages:
Execution Steps:
1. Input Splitting: The dataset is divided into chunks.
2. Map Task Execution: Mappers process each chunk and emit intermediate key-value
pairs.
3. Shuffling and Sorting: Key-value pairs are grouped by key.
4. Reduce Task Execution: Reducers aggregate and generate final output.
5. Output Storage: The final results are written to HDFS or another storage system.
MapReduce Execution Workflow:
1. Job Submission: The user submits a job to the JobTracker.
2. Job Initialization: The framework divides the job into multiple tasks.
3. Task Execution: Mappers and reducers process data in parallel.
4. Monitoring & Fault Tolerance: Failed tasks are re-executed.
5. Completion: The final output is stored.
6. Coping with Node Failure
Since MapReduce runs in distributed environments, handling node failures is essential.
Fault Tolerance Mechanisms:
1. Task Re-execution:
○ If a node fails, incomplete tasks are reassigned to other nodes.
2. Data Replication:
○ HDFS stores multiple replicas of data blocks to ensure availability.
3. Speculative Execution:
○ If a task is slow due to hardware issues, a duplicate task is launched on another
node.
4. Heartbeat Monitoring:
○ Nodes send periodic signals (heartbeats) to the JobTracker.
○ If a node stops responding, it is marked as dead and tasks are reassigned.
Conclusion
MapReduce efficiently processes large-scale data using parallel computing. The framework
handles fault tolerance, optimization with combiners, and automatic load balancing. By
distributing computations across multiple nodes, it provides scalability and reliability, making it
suitable for big data applications like log analysis, machine learning, and data mining.
Algorithms Using MapReduce
MapReduce can be used to implement various algorithms for processing large-scale datasets.
Some key applications include matrix-vector multiplication, relational algebra operations,
and computing selections.
1. Matrix-Vector Multiplication Using MapReduce
Problem Statement:
Given a matrix M of size m×n and a vector V of size n×1, compute the matrix-vector
product R = M × V.
MapReduce Approach:
1. Mapper Phase:
○ Each mapper reads a matrix entry M(i,j) and the corresponding vector element V(j).
○ The mapper emits intermediate key-value pairs (row index, partial product).
2. Shuffle & Sort:
○ The framework groups all partial results for each row index.
3. Reducer Phase:
○ The reducer sums up the partial products for each row index to compute the final
result.
Example: Let
M = | 2 3 |      V = | 1 |
    | 4 5 |          | 2 |
Mapper Output:
(0, 2×1) → (0, 2)
(0, 3×2) → (0, 6)
(1, 4×1) → (1, 4)
(1, 5×2) → (1, 10)
Reducer Output:
(0, 8) // Row 0 result (2 + 6)
(1, 14) // Row 1 result (4 + 10)
Final Output:
R = | 8 |
    | 14 |
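The same computation can be sketched compactly in Python, assuming the vector fits in memory on every mapper. The names and the in-memory shuffle below are illustrative, not any particular framework's API.

```python
from collections import defaultdict

M = [[2, 3],
     [4, 5]]
V = [1, 2]

def map_fn(i, j, m_ij, v):
    """Emit (row index, partial product M[i][j] * V[j])."""
    yield (i, m_ij * v[j])

def reduce_fn(i, partials):
    """Sum the partial products for row i."""
    return (i, sum(partials))

grouped = defaultdict(list)
for i, row in enumerate(M):
    for j, m_ij in enumerate(row):
        for key, value in map_fn(i, j, m_ij, V):
            grouped[key].append(value)

print(sorted(reduce_fn(i, p) for i, p in grouped.items()))
# [(0, 8), (1, 14)]
```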
2. Relational Algebra Operations Using MapReduce
Relational Algebra Basics:
Relational operations like selection, projection, join, and aggregation can be implemented
using MapReduce.
Types of Operations:
1. Selection (σ condition (R)) → Filters rows based on conditions.
2. Projection (π attributes (R)) → Selects specific columns from a relation.
3. Join (R ⨝ S) → Combines two relations based on a common attribute.
4. Grouping & Aggregation (SUM, COUNT, AVG, MAX, MIN).
3. Computing Selections by MapReduce:
Selection (σ condition (R))
Selection involves filtering rows from a relation R based on a condition.
MapReduce Implementation:
1. Mapper Phase:
○ Reads each record from the dataset.
○ Emits the record only if it satisfies the selection condition.
2. Reducer Phase:
○ Simply passes the records to the output.
Example:
Consider a Student dataset: (ID, Name, Age, Grade)
(1, Alice, 21, A)
(2, Bob, 22, B)
(3, Charlie, 20, A)
Query: Find all students with Grade = A.
Selection Condition: σ Grade = 'A' (Student)
Mapper Output:
(1, Alice, 21, A)
(3, Charlie, 20, A)
Reducer Output:
(1, Alice, 21, A)
(3, Charlie, 20, A)
Optimization: Since selection is a filtering operation, it can be performed entirely
in the mapper phase without using reducers.
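A map-only selection can be sketched as below; since no grouping is needed, the number of reduce tasks can simply be set to zero. This is an in-memory sketch with illustrative names.

```python
def select_map(record):
    """Map phase of the selection σ Grade = 'A' (Student):
    emit a record only if it satisfies the condition; no reducer is needed."""
    if record[3] == "A":          # Grade is the fourth field
        yield record

students = [(1, "Alice", 21, "A"),
            (2, "Bob", 22, "B"),
            (3, "Charlie", 20, "A")]

selected = [r for rec in students for r in select_map(rec)]
print(selected)   # [(1, 'Alice', 21, 'A'), (3, 'Charlie', 20, 'A')]
```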
Conclusion: MapReduce can efficiently process matrix operations, relational algebra
operations, and selection queries on large-scale data. By leveraging parallel computation, it
improves scalability and performance in distributed computing environments like Hadoop and
Spark.
Algorithms Using MapReduce (Continued)
MapReduce can be used for various relational algebra operations, including projection, set
operations (union, intersection, difference), natural join, and grouping & aggregation.
1. Computing Projections by MapReduce
Projection (π attributes (R)): Projection selects specific columns (attributes) from a relation R,
eliminating duplicates.
MapReduce Approach:
1. Mapper Phase:
○ Reads each record.
○ Emits only the required attributes as key-value pairs.
2. Reducer Phase:
○ Eliminates duplicate records.
Example:
Consider a Student dataset:
(ID, Name, Age, Grade)
(1, Alice, 21, A)
(2, Bob, 22, B)
(3, Charlie, 20, A)
(4, Alice, 21, A)
Query: Find all unique student names → π Name (Student)
Mapper Output:
(Alice, 1)
(Bob, 1)
(Charlie, 1)
(Alice, 1)
Reducer Output (Removing Duplicates):
(Alice)
(Bob)
(Charlie)
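A sketch of π Name (Student): the mapper emits only the projected attribute as the key, and the reducer emits each key exactly once, which removes duplicates. The in-memory grouping below stands in for the shuffle; names are illustrative.

```python
from collections import defaultdict

def project_map(record):
    """Emit (projected attribute, None); here we keep only Name."""
    _, name, _, _ = record
    yield (name, None)

def project_reduce(key, _values):
    """Emitting each key exactly once eliminates duplicates."""
    return key

students = [(1, "Alice", 21, "A"), (2, "Bob", 22, "B"),
            (3, "Charlie", 20, "A"), (4, "Alice", 21, "A")]

grouped = defaultdict(list)
for rec in students:
    for k, v in project_map(rec):
        grouped[k].append(v)

print([project_reduce(k, v) for k, v in grouped.items()])
# ['Alice', 'Bob', 'Charlie']
```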
2. Union, Intersection, and Difference by MapReduce
Union (R ∪ S): The union operation combines tuples from two relations R and S, removing
duplicates.
MapReduce Approach:
1. Mapper Phase:
○ Reads records from both R and S.
○ Emits (record, 1) as key-value pairs.
2. Reducer Phase:
○ Outputs unique records.
Example:
Two datasets:
R (Student1):
(1, Alice)
(2, Bob)
(3, Charlie)
S (Student2):
(2, Bob)
(4, David)
(5, Eve)
Mapper Output:
(Alice, 1)
(Bob, 1)
(Charlie, 1)
(Bob, 1)
(David, 1)
(Eve, 1)
Reducer Output (Unique Values):
(Alice)
(Bob)
(Charlie)
(David)
(Eve)
Intersection (R ∩ S): The intersection operation finds common elements in R and S.
MapReduce Approach:
1. Mapper Phase:
○ Reads records from both R and S.
○ Emits (record, dataset_id) (e.g., 1 for R, 2 for S).
2. Reducer Phase:
○ Keeps only records appearing in both datasets (having both 1 and 2).
Mapper Output:
(Alice, 1)
(Bob, 1)
(Charlie, 1)
(Bob, 2)
(David, 2)
(Eve, 2)
Reducer Output (Common Values):
(Bob)
Difference (R - S)
The difference operation finds records in R that are not in S.
MapReduce Approach:
1. Mapper Phase:
○ Reads records from both R and S.
○ Emits (record, dataset_id).
2. Reducer Phase:
○ Outputs records only present in R (dataset_id = 1 but not 2).
Reducer Output (R - S):
(Alice)
(Charlie)
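All three set operations share the same mapper (emit each record tagged with its source relation); only the reducer's decision rule changes. The sketch below simulates this in memory with illustrative names.

```python
from collections import defaultdict

R = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
S = [(2, "Bob"), (4, "David"), (5, "Eve")]

def tag_map(record, dataset_id):
    """Emit (record, dataset_id), where dataset_id is 1 for R and 2 for S."""
    yield (record, dataset_id)

pairs = [kv for r in R for kv in tag_map(r, 1)] + \
        [kv for s in S for kv in tag_map(s, 2)]

grouped = defaultdict(set)                  # stand-in for shuffle & sort
for record, tag in pairs:
    grouped[record].add(tag)

# Reducer decision rules for the three set operations:
union        = [rec for rec, tags in grouped.items()]                  # any tag
intersection = [rec for rec, tags in grouped.items() if tags == {1, 2}]
difference   = [rec for rec, tags in grouped.items() if tags == {1}]   # R - S

print(sorted(union))        # [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie'), (4, 'David'), (5, 'Eve')]
print(intersection)         # [(2, 'Bob')]
print(difference)           # [(1, 'Alice'), (3, 'Charlie')]
```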
3. Computing Natural Join by MapReduce
Natural Join (R ⨝ S)
A natural join combines two relations R(A, B) and S(B, C) based on a common attribute (B).
MapReduce Approach:
1. Mapper Phase:
○ Emits (B, (Relation Name, Attributes)).
2. Reducer Phase:
○ Joins records with the same B value from both relations.
Example:
R (Employee):
(ID, Name, Dept_ID)
(1, Alice, 101)
(2, Bob, 102)
(3, Charlie, 103)
S (Department):
(Dept_ID, Dept_Name)
(101, HR)
(102, IT)
(104, Sales)
Mapper Output:
(101, ("R", 1, Alice))
(102, ("R", 2, Bob))
(103, ("R", 3, Charlie))
(101, ("S", HR))
(102, ("S", IT))
(104, ("S", Sales))
Reducer Output (Joined Table):
(1, Alice, 101, HR)
(2, Bob, 102, IT)
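The reduce-side join above can be sketched as follows: both relations are mapped to the join key Dept_ID, and the reducer pairs every Employee tuple with every Department tuple sharing that key. This is a simplified in-memory equi-join; names are illustrative.

```python
from collections import defaultdict

employees   = [(1, "Alice", 101), (2, "Bob", 102), (3, "Charlie", 103)]
departments = [(101, "HR"), (102, "IT"), (104, "Sales")]

def join_map(record, relation):
    """Key every record by the join attribute Dept_ID and tag its relation."""
    if relation == "R":                      # Employee(ID, Name, Dept_ID)
        emp_id, name, dept_id = record
        yield (dept_id, ("R", emp_id, name))
    else:                                    # Department(Dept_ID, Dept_Name)
        dept_id, dept_name = record
        yield (dept_id, ("S", dept_name))

def join_reduce(dept_id, values):
    """Pair each Employee with each Department sharing the key."""
    emps  = [v[1:] for v in values if v[0] == "R"]
    depts = [v[1]  for v in values if v[0] == "S"]
    return [(emp_id, name, dept_id, dept_name)
            for (emp_id, name) in emps for dept_name in depts]

grouped = defaultdict(list)
for rec in employees:
    for k, v in join_map(rec, "R"):
        grouped[k].append(v)
for rec in departments:
    for k, v in join_map(rec, "S"):
        grouped[k].append(v)

for k in sorted(grouped):
    for row in join_reduce(k, grouped[k]):
        print(row)
# (1, 'Alice', 101, 'HR')
# (2, 'Bob', 102, 'IT')
```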
4. Grouping and Aggregation by MapReduce
Grouping & Aggregation (SUM, COUNT, AVG, MAX, MIN)
Grouping and aggregation operations involve computing functions like SUM, COUNT, AVG,
MAX, and MIN for groups of records.
MapReduce Approach:
1. Mapper Phase:
○ Reads each record.
○ Emits (group_key, value).
2. Reducer Phase:
○ Aggregates values using SUM, COUNT, AVG, MAX, MIN.
Example: Compute Total Sales Per Product
Sales Data:
(Product, Sales)
(A, 100)
(A, 200)
(B, 300)
(B, 400)
(A, 150)
Mapper Output:
(A, 100)
(A, 200)
(B, 300)
(B, 400)
(A, 150)
Reducer Output (SUM Aggregation):
(A, 450)
(B, 700)
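A sketch of the SUM aggregation: the mapper keys each sale by product, and the reducer sums the values per group. The in-memory grouping and names below are illustrative.

```python
from collections import defaultdict

sales = [("A", 100), ("A", 200), ("B", 300), ("B", 400), ("A", 150)]

def agg_map(record):
    """Emit (group_key, value); here the group key is the product."""
    product, amount = record
    yield (product, amount)

def agg_reduce(product, amounts):
    """SUM aggregation; swap in max(), min(), or len() for MAX, MIN, COUNT."""
    return (product, sum(amounts))

grouped = defaultdict(list)
for rec in sales:
    for k, v in agg_map(rec):
        grouped[k].append(v)

print([agg_reduce(k, v) for k, v in grouped.items()])
# [('A', 450), ('B', 700)]
```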
Conclusion
MapReduce enables efficient relational algebra operations at scale. By leveraging parallel
processing, it handles projection, set operations, joins, and aggregation efficiently, making it
suitable for big data analytics in frameworks like Hadoop and Spark.
Matrix Multiplication Using MapReduce
Matrix multiplication is a fundamental operation in big data analytics and can be efficiently
implemented using MapReduce. There are two approaches:
1. General Matrix Multiplication using Multiple MapReduce Steps
2. Optimized Matrix Multiplication using One MapReduce Step
1. General Matrix Multiplication Using MapReduce
Problem Statement:
Given two matrices A (m × n) and B (n × p), compute the product C = A × B, where C (m × p)
is the resultant matrix.
MapReduce Approach:
We need to compute each element of C as:
C(i, j) = Σₖ A(i, k) × B(k, j), where k runs from 1 to n
Steps in MapReduce:
Step 1: Mapper Phase (first pass)
● Read elements from both A and B.
● Emit each element keyed by the shared index k, so that matching entries of A and B reach the same reducer.
Step 2: Reducer Phase (first pass) and Shuffle
● For each k, pair every A(i, k) with every B(k, j) and emit the partial product (i, j, A(i, k) × B(k, j)).
● The partial products are then grouped by (i, j).
Step 3: Summation (second pass)
● A second MapReduce step sums all partial products for each (i, j) to compute C(i, j).
Example: Compute C = A × B
Input Matrices:
A (2×3):
2 3 4
1 5 6
B (3×2):
1 2
3 4
5 6
Mapper Output (first pass):
Each matrix element is emitted as a key-value pair keyed by the shared index k.
For A(i,k), emit (k, (A, i, A(i,k))):
(0,0) → (0, (A, 0, 2))
(0,1) → (1, (A, 0, 3))
(0,2) → (2, (A, 0, 4))
(1,0) → (0, (A, 1, 1))
(1,1) → (1, (A, 1, 5))
(1,2) → (2, (A, 1, 6))
For B(k,j), emit (k, (B, j, B(k,j))):
(0,0) → (0, (B, 0, 1))
(0,1) → (0, (B, 1, 2))
(1,0) → (1, (B, 0, 3))
(1,1) → (1, (B, 1, 4))
(2,0) → (2, (B, 0, 5))
(2,1) → (2, (B, 1, 6))
Reducer Output:
The first-pass reducer pairs every (A, i, a) with every (B, j, b) for its key k and emits ((i, j), a×b);
the second pass sums these partial products for each (i, j) key:
C(0,0) = (2×1) + (3×3) + (4×5) = 2 + 9 + 20 = 31
C(0,1) = (2×2) + (3×4) + (4×6) = 4 + 12 + 24 = 40
C(1,0) = (1×1) + (5×3) + (6×5) = 1 + 15 + 30 = 46
C(1,1) = (1×2) + (5×4) + (6×6) = 2 + 20 + 36 = 58
Final Output C (2×2):
31 40
46 58
2. Matrix Multiplication with One MapReduce Step
Optimization Idea:
Instead of using multiple MapReduce steps, we can directly compute and reduce the output
matrix elements in a single MapReduce step.
Modified MapReduce Approach:
1. Mapper Phase:
○ Emit each A(i,k) with key (i, j) for every column j of C, and each B(k,j) with key
(i, j) for every row i of C, so that a single reducer receives everything needed to compute C(i,j).
2. Reducer Phase:
○ Compute C(i,j) by summing all partial products.
Example: One-Step MapReduce Matrix Multiplication
Mapper Output:
For A(i,k), emit ((i, j), (A, k, A(i,k))) for every column j of C:
((0,0), (A, 0, 2))   ((0,1), (A, 0, 2))
((0,0), (A, 1, 3))   ((0,1), (A, 1, 3))
((0,0), (A, 2, 4))   ((0,1), (A, 2, 4))
((1,0), (A, 0, 1))   ((1,1), (A, 0, 1))
((1,0), (A, 1, 5))   ((1,1), (A, 1, 5))
((1,0), (A, 2, 6))   ((1,1), (A, 2, 6))
For B(k,j), emit ((i, j), (B, k, B(k,j))) for every row i of C:
((0,0), (B, 0, 1))   ((1,0), (B, 0, 1))
((0,0), (B, 1, 3))   ((1,0), (B, 1, 3))
((0,0), (B, 2, 5))   ((1,0), (B, 2, 5))
((0,1), (B, 0, 2))   ((1,1), (B, 0, 2))
((0,1), (B, 1, 4))   ((1,1), (B, 1, 4))
((0,1), (B, 2, 6))   ((1,1), (B, 2, 6))
Reducer Output:
(0,0) → 31
(0,1) → 40
(1,0) → 46
(1,1) → 58
This method ensures that C(i, j) is computed directly in a single step, reducing the need for
extra MapReduce passes.
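The one-step scheme can be sketched in Python as below: each A(i,k) is replicated to every key (i, j), each B(k,j) to every key (i, j), and one reducer per (i, j) computes the dot product. This is a small in-memory simulation under illustrative names, not tied to any particular Hadoop API.

```python
from collections import defaultdict

A = [[2, 3, 4],
     [1, 5, 6]]            # m x n
B = [[1, 2],
     [3, 4],
     [5, 6]]               # n x p
m, n, p = len(A), len(B), len(B[0])

def map_fn():
    """Replicate each matrix element to every C(i, j) it contributes to."""
    for i in range(m):
        for k in range(n):
            for j in range(p):
                yield ((i, j), ("A", k, A[i][k]))
    for k in range(n):
        for j in range(p):
            for i in range(m):
                yield ((i, j), ("B", k, B[k][j]))

def reduce_fn(key, values):
    """Match A and B entries on k and sum the products."""
    a = {k: v for tag, k, v in values if tag == "A"}
    b = {k: v for tag, k, v in values if tag == "B"}
    return key, sum(a[k] * b[k] for k in a)

grouped = defaultdict(list)                  # stand-in for shuffle & sort
for key, value in map_fn():
    grouped[key].append(value)

print(sorted(reduce_fn(k, v) for k, v in grouped.items()))
# [((0, 0), 31), ((0, 1), 40), ((1, 0), 46), ((1, 1), 58)]
```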
Comparison of Two Approaches

| Approach | Steps Required | Data Transfer Overhead | Performance |
|---|---|---|---|
| General Matrix Multiplication | Multiple MapReduce steps | Higher (intermediate outputs) | Slower |
| Optimized One-Step Multiplication | Single MapReduce step | Lower (direct computation) | Faster |
The optimized approach is preferred for big data applications like Hadoop and Spark, as it
minimizes network traffic and computation time.
Conclusion: Matrix multiplication using MapReduce can be performed efficiently in two ways:
1. Traditional Multi-Step Approach: Uses multiple MapReduce phases but is
straightforward.
2. Optimized One-Step Approach: Directly computes C(i, j) in a single MapReduce
step, reducing overhead and improving performance.
The One-Step approach is more efficient and widely used in big data frameworks for
handling large-scale matrix operations.
With the help of a sequence diagram, explain how job processing requests from the user are
handled in the Hadoop framework.
The Hadoop framework follows a Master-Slave architecture where the Client, JobTracker (or
ResourceManager in YARN), TaskTrackers (or NodeManagers in YARN), and DataNodes
work together to process job requests.
Steps in Hadoop Job Processing:
1. User submits a job:
○ The Client writes a job using the Hadoop API and submits it to the JobTracker (in
Hadoop 1.x) or ResourceManager (in YARN).
○ The job includes the MapReduce logic and input file location in HDFS.
2. JobTracker/ResourceManager initializes the job:
○ It checks for resource availability in the cluster.
○ The job is split into smaller Map tasks based on input splits.
3. Task Assignment to TaskTrackers/NodeManagers:
○ The JobTracker (or ApplicationMaster in YARN) assigns tasks to available
TaskTrackers (or NodeManagers in YARN).
○ Each TaskTracker fetches the required data from HDFS.
4. Map Tasks Execution:
○ The assigned nodes execute Mapper functions, transforming input data into
key-value pairs.
○ Output of the Mapper phase is stored locally before being shuffled.
5. Shuffling and Sorting:
○ The intermediate key-value pairs are grouped by key and shuffled to
corresponding reducers.
6. Reduce Tasks Execution:
○ The Reducer processes grouped key-value pairs and generates the final output.
7. Job Completion:
○ The JobTracker monitors task progress.
○ Upon completion, the output is stored in HDFS, and the Client is notified.
Explain how a program using MapReduce is executed, elaborating on the interaction of processes, tasks,
and files. How are node failures handled? Explain.
How a MapReduce Program is Executed
A MapReduce program is executed in four major stages:
1. Input Splitting
● The input file (stored in HDFS) is divided into fixed-size chunks (default: 128 MB in
Hadoop 2.x; 64 MB in Hadoop 1.x).
● Each split is processed independently by a Mapper.
2. Map Phase
● A Mapper function is applied to each split, converting it into key-value pairs.
Example: Consider an input file with the following text:
apple banana apple mango banana apple
The Mapper processes it and generates output like:
("apple", 1), ("banana", 1), ("apple", 1), ("mango", 1),
("banana", 1), ("apple", 1)
3. Shuffling & Sorting
● The intermediate key-value pairs are grouped and sorted based on keys.
This ensures that all values for a particular key are sent to the same Reducer.
apple → [1, 1, 1]
banana → [1, 1]
mango → [1]
4. Reduce Phase
The Reducer aggregates values for each key.
apple → 3
banana → 2
mango → 1
● The final output is written to HDFS.
Handling Node Failures in Hadoop
Hadoop is designed to handle failures using replication, checkpointing, and task
reassignment.
1. TaskTracker/NodeManager Failure:
○ The JobTracker (or ResourceManager) detects missing heartbeats.
○ It reassigns tasks to a healthy node.
2. DataNode Failure:
○ Hadoop replicates data across multiple DataNodes.
○ If a node fails, another replica is used automatically.
3. JobTracker/ResourceManager Failure:
○ In Hadoop 1.x, the JobTracker failure causes job failure.
○ In Hadoop 2.x (YARN), high availability mechanisms restart the
ResourceManager.
With a neat diagram, explain the notion of MapReduce computation. Illustrate the MapReduce
approach to count the number of occurrences of a given word in a data file. (Assume suitable data
wherever necessary.)
MapReduce follows a divide-and-conquer approach to process large datasets in parallel. It
consists of two main functions:
● Map: Processes input data and emits intermediate key-value pairs.
● Reduce: Aggregates the key-value pairs and generates final output.
Example: Counting Word Occurrences
Consider the input file:
apple banana apple mango banana apple
Step 1: Input Splitting
The file is split into chunks and sent to multiple Mappers.
Step 2: Map Phase
Each Mapper processes a split and generates:
("apple", 1) ("banana", 1) ("apple", 1) ("mango", 1)
("banana", 1) ("apple", 1)
Step 3: Shuffling and Sorting
Key-value pairs from different Mappers are grouped:
apple → [1, 1, 1]
banana → [1, 1]
mango → [1]
Step 4: Reduce Phase
The Reducer sums up the occurrences:
apple → 3
banana → 2
mango → 1
MapReduce Computation Flow Diagram
This parallel processing approach makes MapReduce ideal for large-scale data processing. 🚀
1. Matrix-Vector Multiplication (MapReduce)
Problem: Given a matrix A (m x n) and a vector v (n x 1), compute the product Av.
Case 1: Vector Fits Properly (Small Vector)
Matrix A:
A = | 1 2 3 |
    | 4 5 6 |
● (m = 2, n = 3)
Vector v:
v = | 7 |
    | 8 |
    | 9 |
● (n = 3)
● Map Function:
○ Input: (row_index, row of A) and v
○ For each element A[i][j] in the row, emit (row_index, A[i][j] * v[j])
● Reduce Function:
○ Input: (row_index, list of partial products)
○ Sum the partial products to get the element in the result vector.
● Example Map Output:
○ (0, 1 * 7) = (0, 7)
○ (0, 2 * 8) = (0, 16)
○ (0, 3 * 9) = (0, 27)
○ (1, 4 * 7) = (1, 28)
○ (1, 5 * 8) = (1, 40)
○ (1, 6 * 9) = (1, 54)
● Example Reduce Output:
○ (0, [7, 16, 27]) -> (0, 7 + 16 + 27) = (0, 50)
○ (1, [28, 40, 54]) -> (1, 28 + 40 + 54) = (1, 122)
Result Vector:
Av = | 50 |
| 122 |
Case 2: Vector Too Large to Fit (Large Vector)
● This scenario requires partitioning the vector. We'll partition the vector and process it in
chunks.
Matrix A:
A = | 1 2 3 4 |
    | 5 6 7 8 |
    | 9 1 2 3 |
● (m = 3, n = 4)
Vector v:
v = | 1 |
    | 2 |
    | 3 |
    | 4 |
● (n=4)
● Map Function:
○ Input: (row_index, row of A), v
○ emit (row_index, A[row_index][column_index] * v[column_index])
● Reduce Function:
○ Input: (row_index, list of partial products)
○ Sum the partial products.
● Example Map Output:
○ (0, 1*1)= (0,1)
○ (0, 2*2)= (0,4)
○ (0, 3*3)= (0,9)
○ (0, 4*4)= (0,16)
○ (1, 5*1)= (1,5)
○ (1, 6*2)= (1,12)
○ (1, 7*3)= (1,21)
○ (1, 8*4)= (1,32)
○ (2, 9*1)= (2,9)
○ (2, 1*2)= (2,2)
○ (2, 2*3)= (2,6)
○ (2, 3*4)= (2,12)
● Example Reduce Output:
○ (0, [1,4,9,16]) = (0, 30)
○ (1, [5,12,21,32]) = (1, 70)
○ (2, [9,2,6,12]) = (2, 29)
Result Vector:
Av = | 30 |
| 70 |
| 29 |
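One common way to handle a vector that does not fit in a mapper's memory is to split v into horizontal stripes and M into matching vertical stripes, so each mapper only needs one stripe of v at a time. The sketch below simulates that idea; the stripe size and all names are illustrative assumptions, not a prescribed scheme.

```python
from collections import defaultdict

M = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 1, 2, 3]]
v = [1, 2, 3, 4]
STRIPE = 2                      # columns handled per stripe (illustrative)

def map_stripe(stripe_start, v_stripe):
    """One mapper per stripe: it only holds one chunk of v and the matching
    columns of M, emitting (row index, partial product)."""
    for i, row in enumerate(M):
        for offset, v_j in enumerate(v_stripe):
            j = stripe_start + offset
            yield (i, row[j] * v_j)

grouped = defaultdict(list)     # stand-in for shuffle & sort
for start in range(0, len(v), STRIPE):
    for key, value in map_stripe(start, v[start:start + STRIPE]):
        grouped[key].append(value)

print([(i, sum(parts)) for i, parts in grouped.items()])
# [(0, 30), (1, 70), (2, 29)]
```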
2. Matrix Multiplication (MapReduce)
Problem: Given matrices A (m x k) and B (k x n), compute the product C = AB (m x n).
Matrix A:
A = | 1 2 |
    | 3 4 |
● (m = 2, k = 2)
Matrix B:
B = | 5 6 |
    | 7 8 |
● (k = 2, n = 2)
● Map Function:
○ For each element A[i][j], emit ((i, k), ('A', j, A[i][j])) for every column k of C.
○ For each element B[j][k], emit ((i, k), ('B', j, B[j][k])) for every row i of C.
● Reduce Function:
○ Input: ((i, k), list of values).
○ For each pair ('A', j, A[i][j]) and ('B', j, B[j][k]), compute A[i][j] * B[j][k].
○ Sum all products with the same (i, k).
● Example Map Output (each element is emitted once for every cell of C it contributes to):
○ A[0][0]=1 → ((0, 0), ('A', 0, 1)), ((0, 1), ('A', 0, 1))
○ A[0][1]=2 → ((0, 0), ('A', 1, 2)), ((0, 1), ('A', 1, 2))
○ A[1][0]=3 → ((1, 0), ('A', 0, 3)), ((1, 1), ('A', 0, 3))
○ A[1][1]=4 → ((1, 0), ('A', 1, 4)), ((1, 1), ('A', 1, 4))
○ B[0][0]=5 → ((0, 0), ('B', 0, 5)), ((1, 0), ('B', 0, 5))
○ B[0][1]=6 → ((0, 1), ('B', 0, 6)), ((1, 1), ('B', 0, 6))
○ B[1][0]=7 → ((0, 0), ('B', 1, 7)), ((1, 0), ('B', 1, 7))
○ B[1][1]=8 → ((0, 1), ('B', 1, 8)), ((1, 1), ('B', 1, 8))
● Example Reduce Output:
○ ((0, 0), [('A', 0, 1), ('A', 1, 2), ('B', 0, 5), ('B', 1, 7)]) -> (0, 0, 1×5 + 2×7) = (0, 0, 19)
○ ((0, 1), [('A', 0, 1), ('A', 1, 2), ('B', 0, 6), ('B', 1, 8)]) -> (0, 1, 1×6 + 2×8) = (0, 1, 22)
○ ((1, 0), [('A', 0, 3), ('A', 1, 4), ('B', 0, 5), ('B', 1, 7)]) -> (1, 0, 3×5 + 4×7) = (1, 0, 43)
○ ((1, 1), [('A', 0, 3), ('A', 1, 4), ('B', 0, 6), ('B', 1, 8)]) -> (1, 1, 3×6 + 4×8) = (1, 1, 50)
Result Matrix C:
C = | 19 22 |
| 43 50 |
Matrix Multiplication with One MapReduce Step
Matrix multiplication using one MapReduce step means performing the entire multiplication
process in a single map and reduce phase rather than iterative MapReduce steps. This approach is
often used in distributed computing frameworks like Hadoop.
Problem Statement
Given two matrices:
A = | A11 A12 |      B = | B11 B12 |
    | A21 A22 |          | B21 B22 |
The result of matrix multiplication C = A × B is:
C(i, j) = Σₖ A(i, k) × B(k, j)
MapReduce Steps
1. Mapper
Each element from A and B is emitted with a key that represents the corresponding position in
matrix C.
For matrix A (element A(i,k)): emit key-value pairs, one for every column j of C:
(i, j) → (A, k, A(i,k))
For matrix B (element B(k,j)): emit key-value pairs, one for every row i of C:
(i, j) → (B, k, B(k,j))
2. Reducer
● The reducer groups values by key (i, j), multiplies the corresponding A and B elements
that share the same k, and sums the products.
Example
Consider the matrices:
A = | 1 2 |      B = | 5 6 |
    | 3 4 |          | 7 8 |
Mapper Output
Each element is emitted once for every cell of C it contributes to.
For matrix A:
(0,0) → ("A",0,1)   (0,1) → ("A",0,1)
(0,0) → ("A",1,2)   (0,1) → ("A",1,2)
(1,0) → ("A",0,3)   (1,1) → ("A",0,3)
(1,0) → ("A",1,4)   (1,1) → ("A",1,4)
For matrix B:
(0,0) → ("B",0,5)   (1,0) → ("B",0,5)
(0,1) → ("B",0,6)   (1,1) → ("B",0,6)
(0,0) → ("B",1,7)   (1,0) → ("B",1,7)
(0,1) → ("B",1,8)   (1,1) → ("B",1,8)
Shuffling and Sorting
The pairs are grouped by key (i, j):
(0,0) → [("A",0,1), ("A",1,2), ("B",0,5), ("B",1,7)]
(0,1) → [("A",0,1), ("A",1,2), ("B",0,6), ("B",1,8)]
(1,0) → [("A",0,3), ("A",1,4), ("B",0,5), ("B",1,7)]
(1,1) → [("A",0,3), ("A",1,4), ("B",0,6), ("B",1,8)]
Reducer Processing
Each reducer computes:
C(0,0) = (1×5) + (2×7) = 5 + 14 = 19
C(0,1) = (1×6) + (2×8) = 6 + 16 = 22
C(1,0) = (3×5) + (4×7) = 15 + 28 = 43
C(1,1) = (3×6) + (4×8) = 18 + 32 = 50
So, the final result:
C = | 19 22 |
    | 43 50 |
This approach ensures that the entire multiplication is done in one MapReduce step.
Hadoop Architecture
Hadoop follows a Master-Slave Architecture, primarily consisting of four core components:
1. Hadoop Distributed File System (HDFS) – Storage Layer
2. MapReduce – Processing Layer
3. Yet Another Resource Negotiator (YARN) – Resource Management Layer
4. Common Utilities – Libraries and utilities used across Hadoop modules
1. HDFS (Hadoop Distributed File System)
HDFS is the storage component of Hadoop. It splits large data files into smaller blocks and
distributes them across multiple nodes in a cluster.
HDFS Architecture
● NameNode (Master Node): Manages metadata (file locations, block mapping, etc.) and
regulates access to files.
● DataNodes (Slave Nodes): Store the actual data blocks and respond to read/write
requests from clients.
● Secondary NameNode: Periodically saves snapshots of the NameNode’s metadata to
help in recovery.
Key Features of HDFS
● Stores large files by breaking them into blocks (default block size: 128 MB in Hadoop 2.x; 64 MB in Hadoop 1.x)
● Replicates data (default: 3 copies) for fault tolerance
● Works on commodity hardware
2. MapReduce (Processing Layer)
MapReduce is a programming model for processing large datasets in parallel across a Hadoop
cluster. It consists of two phases:
1. Map Phase: Processes input data and transforms it into key-value pairs.
2. Reduce Phase: Aggregates the mapped data to generate the final output.
3. YARN (Yet Another Resource Negotiator)
YARN is responsible for resource management and job scheduling in Hadoop. It has two main
components:
● ResourceManager (Master Node): Allocates resources across applications.
● NodeManager (Slave Nodes): Monitors resource usage and reports to the
ResourceManager.
4. Common Utilities
These are shared libraries required for different Hadoop modules to work together.
Hadoop Cluster Architecture
A Hadoop cluster consists of:
● Master Node: Runs NameNode and ResourceManager
● Slave Nodes: Run DataNodes and NodeManagers
● Client Node: Submits jobs and interacts with the cluster
MapReduce Architecture
MapReduce is a programming model in Hadoop used for processing large-scale data in a
distributed and parallel manner across a cluster. It follows a Master-Slave Architecture
consisting of two main phases:
1. Map Phase (Processing & Sorting the Input Data)
2. Reduce Phase (Aggregating & Generating the Final Output)
1. Components of MapReduce Architecture
MapReduce consists of the following key components:
a) JobTracker (Master Node) [In Traditional Hadoop 1.0]
● Assigns tasks to TaskTrackers (slave nodes).
● Monitors task execution and handles failures.
● In Hadoop 2.x, JobTracker is replaced by YARN’s ResourceManager.
b) TaskTracker (Slave Node) [In Traditional Hadoop 1.0]
● Runs Map and Reduce tasks as per JobTracker’s instructions.
● Reports task progress to JobTracker.
● In Hadoop 2.x, TaskTracker is replaced by NodeManager.
c) ResourceManager (Master Node) [In Hadoop 2.0+]
● Manages cluster resources and schedules tasks.
d) NodeManager (Slave Node) [In Hadoop 2.0+]
● Monitors resource usage on each node and reports to ResourceManager.
2. MapReduce Execution Flow
Step 1: Input Splitting
● The input file is divided into fixed-size splits (default: 128 MB in Hadoop 2.x; 64 MB in Hadoop 1.x).
● Each split is processed independently by a separate Mapper task.
Step 2: Mapping (Map Phase)
● Each Mapper processes one split and converts it into intermediate key-value pairs.
● Example:
○ Input: "Hello World Hello"
○ Output from Mapper: (Hello, 1), (World, 1), (Hello, 1)
Step 3: Shuffling and Sorting
● Intermediate key-value pairs are grouped, sorted, and partitioned.
● The keys are assigned to different Reducers.
Step 4: Reducing (Reduce Phase)
● The Reducer processes the grouped key-value pairs to generate the final output.
● Example:
○ Input from Shuffle & Sort: {Hello: [1,1], World: [1]}
○ Output: {Hello: 2, World: 1}
Step 5: Writing Output
● The final reduced output is written to HDFS.
3. Diagram of MapReduce Architecture
4. Key Features of MapReduce
✅ Parallel Processing – Multiple nodes process data in parallel.
✅ Fault Tolerance – Automatic re-execution of failed tasks.
✅ Scalability – Handles petabytes of data.
✅ Data Locality – Processes data where it is stored (reducing network traffic).