Big Data

Introduction to Big Data Architecture

Big data architecture is a framework designed to manage the ingestion,


processing, storage, and analysis of data that is too large, fast, or
complex for traditional database systems. It provides the infrastructure
necessary to derive insights from vast datasets, often combining both
batch and real-time processing.

The definition of big data is evolving: It's no longer just about the
volume of data, but also the variety, velocity, and the value
derived from advanced analytics and machine learning.

✅ Why Big Data Architecture is Needed


Organizations deal with:

●​ Massive volumes of data (from GBs to PBs)​

●​ Diverse formats (structured, semi-structured, unstructured)​


●​ High-speed data generation from IoT, apps, websites, sensors​

●​ Need for real-time decisions and predictions​

Big data architectures help:

●​ Store and manage huge datasets​

●​ Support real-time and historical analysis​

●​ Apply AI and ML for forecasting and automation​

●​ Enable scalable and fault-tolerant data pipelines​

✅ Common Big Data Workloads


1.​ Batch Processing​

○​ Handles data at rest.​

○​ Involves scheduled processing of large files.​

○​ Example: Processing daily logs every night.​

2.​ Stream (Real-Time) Processing​

○​ Handles data in motion.​

○​ Involves real-time ingestion and transformation.​

○​ Example: Monitoring financial transactions or sensor data.​


3.​ Interactive Data Exploration​

○​ Ad-hoc queries and exploratory analysis by analysts.​

○​ Useful for building models, discovering patterns.​

4.​ Machine Learning and Predictive Analytics​

○​ Uses processed data to train models and make predictions.​

○​ Helps with tasks like classification, recommendation, forecasting.​

✅ When to Use Big Data Architecture


Consider using big data architecture when:

●​ Data volume exceeds the capability of traditional systems.​

●​ You need to process unstructured or semi-structured data.​

●​ You want real-time insights or alerts.​

●​ You aim to train AI/ML models on large datasets.​

●​ There’s a need to integrate data from multiple sources.​

✅ Key Components of Big Data Architecture


🔹 1. Data Sources
All architectures begin with data sources, which may include:
●​ Application Data Stores: SQL, NoSQL databases.​

●​ Static Files: Log files, CSV, JSON, XML, images, etc.​

●​ Real-time Sources: IoT devices, sensors, social media feeds.​

●​ External APIs: Cloud services, 3rd-party integrations.​

🔹 2. Data Storage (Data Lake)


Stores raw or processed data at scale. Usually supports varied file formats
(Parquet, Avro, CSV, etc.) and large file sizes.

●​ Acts as the central repository for structured and unstructured data.​

●​ Known as a Data Lake.​

Tools:

●​ Azure Data Lake Store​

●​ Azure Blob Storage​

●​ OneLake in Microsoft Fabric​

●​ Hadoop Distributed File System (HDFS)​

🔹 3. Batch Processing
Used for scheduled, large-scale processing of data files.
●​ Reads from storage → Processes → Writes output back.​

●​ Suitable for ETL (Extract, Transform, Load) operations.​

●​ Jobs may run for minutes or hours.​

Tools:

●​ Azure Data Lake Analytics (U-SQL)​

●​ Hive, Pig, MapReduce on HDInsight​

●​ Apache Spark (Azure Databricks, Fabric)​

●​ Notebooks: Python, Scala, SQL-based scripting​

🔹 4. Real-Time Message Ingestion


Captures continuous streams of incoming data, like messages, events, or
logs.

●​ Acts as a buffer before real-time processing.​

●​ Supports scale-out, reliable delivery, and event ordering.​

Tools:

●​ Azure Event Hubs​

●​ Azure IoT Hub​

●​ Apache Kafka​
🔹 5. Stream Processing
Processes live data streams in real-time. Performs tasks like:

●​ Filtering, aggregating, transforming​

●​ Joining streams​

●​ Detecting anomalies or trends​

Tools:

●​ Azure Stream Analytics (SQL-like queries on data streams)​

●​ Apache Spark Streaming (Databricks, HDInsight)​

●​ Azure Functions (event-driven processing)​

●​ Fabric real-time event streams​

🔹 6. Machine Learning (ML)


Once the data is prepared (from batch or stream), ML models can be trained
to:

●​ Predict outcomes​

●​ Recommend actions​

●​ Automate decisions​

Use Cases:
●​ Fraud detection​

●​ Product recommendations​

●​ Predictive maintenance​

Tools:

●​ Azure Machine Learning (model training, deployment)​

●​ Azure AI Services (vision, language, speech APIs)​

●​ ML on Databricks, Spark MLlib​

🔹 7. Orchestration
Manages workflows that automate:

●​ Data ingestion​

●​ Batch jobs​

●​ Data movement between components​

●​ Loading data into reports​

Tools:

●​ Azure Data Factory (pipelines, triggers)​

●​ Apache Oozie​
●​ Fabric orchestration​

●​ Apache Sqoop (for data import/export from relational DBs)​

Hadoop - Architecture
Hadoop is a framework written in Java that uses a large cluster of
commodity hardware to store and manage very large volumes of data. Hadoop is
built on the MapReduce programming model introduced by Google. Today
many large companies, e.g. Facebook, Yahoo, Netflix, and eBay, use Hadoop
to deal with big data. The Hadoop architecture mainly consists of 4 components:

●​ MapReduce

●​ HDFS(Hadoop Distributed File System)

●​ YARN(Yet Another Resource Negotiator)

●​ Common Utilities or Hadoop Common

Apache Pig Components


The Apache Pig framework is made up of several components. Let us take a
look at the major components.

Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of
the parser will be a DAG (directed acyclic graph), which represents the Pig
Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes
and the data flows are represented as edges.

Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out
logical optimizations such as projection pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.

Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and
executed on Hadoop, producing the desired results.

Pig Latin Data Model


The data model of Pig Latin is fully nested, allowing complex non-atomic
data types such as map and tuple. The elements of Pig Latin's data model are
described below.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as an
Atom. It is stored as a string and can be used as a string or a number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece
of data or a simple atomic value is known as a field.

Example − raja or 30

Tuple
A record that is formed by an ordered set of fields is known as a tuple; the
fields can be of any type. A tuple is similar to a row in a table of an RDBMS.

Example − (Raja, 30)

Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by {}. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − {Raja, 30, {9848022338, [email protected],}}

Map
A map (or data map) is a set of key-value pairs. The key must be of type
chararray and should be unique. The value may be of any type. A map is
represented by [].

Example − [name#Raja, age#30]


MapReduce – Detailed Explanation
1. MapReduce Framework and Basics
MapReduce is a programming model developed by Google for processing large data sets with
a distributed algorithm on a cluster. The model simplifies data processing across massive
clusters by using two main functions: Map and Reduce.

●​ Map Function: Processes input data and produces a set of intermediate key-value
pairs.​

●​ Reduce Function: Merges all intermediate values associated with the same key to
produce a final result.​

MapReduce is the core of Hadoop, enabling it to handle big data workloads. The framework
automatically handles parallelization, fault tolerance, data distribution, and load balancing.

2. How MapReduce Works


The execution of a MapReduce job involves several phases:

1.​ Input Splitting: The InputFormat divides the input data into splits, by default one
split per HDFS block (128 MB).​

2.​ Mapping: Each block is processed by a separate Map task that converts input data into
intermediate key-value pairs.​

3.​ Shuffling and Sorting: The Map output is shuffled to send all values with the same key
to a single Reducer. Sorting ensures keys are ordered before reducing.​

4.​ Reducing: Reduce tasks aggregate or process the list of values for each key.​

5.​ Output: Final results are written to HDFS by the OutputFormat.
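
To make these phases concrete, the sketch below shows how a driver class wires them together. It is a minimal, illustrative example (the class names WordCountDriver, WordCountMapper, and WordCountReducer are assumptions; the mapper and reducer themselves are sketched later in the WordCount example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // names the job for the cluster UI
        job.setJarByClass(WordCountDriver.class);

        // Map and Reduce classes (hypothetical; sketched later in this document)
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the final (Reducer) output key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are created from these paths; output is written back to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job and waits; YARN handles scheduling and fault tolerance
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}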

Developing a MapReduce application in big data involves breaking down a large task
into smaller, parallelizable chunks, which are then processed by individual machines (map
phase) and finally combined (reduce phase) to produce the final result. This approach is
particularly effective for tasks requiring massive data processing and analysis, such as training
machine learning models or analyzing social media data.

Understanding MapReduce:

1.​ Map Phase: The input data is divided into smaller, manageable chunks and distributed
across a cluster of machines. Each machine processes its assigned chunk
independently, applying a user-defined function (the "map" function) to extract relevant
information or transform the data.

2.​ Shuffle and Sort: The intermediate key-value pairs generated by the map phase are
shuffled and sorted based on the keys, ensuring that all values associated with a
particular key are grouped together on the same machine.

3.​ Reduce Phase: The sorted key-value pairs are then processed by another user-defined
function (the "reduce" function). This function combines the values associated with each
key to produce the final output.

Key Considerations for Development:

1.​ Data Input and Output: MapReduce typically works with data stored in a distributed file
system like the Hadoop Distributed File System (HDFS).

2.​ Programming Model: The core of MapReduce is a programming model that specifies how
the map and reduce functions are defined and executed.

3.​ Scalability: MapReduce is designed to be highly scalable, allowing it to process massive
datasets across a large number of machines.

4.​ Fault Tolerance: MapReduce can automatically handle machine failures and continue
processing, ensuring high reliability.

Examples of MapReduce Applications:


1.​ Machine Learning: Training machine learning models on large datasets for tasks like
linear regression, k-means clustering, and collaborative filtering.

2.​ Web Analytics: Analyzing web server logs to extract insights into user behavior, traffic
patterns, and other metrics.

3.​ Social Media Analysis: Analyzing social media data to identify trends, sentiment, and
other patterns.

MRUnit is a JUnit-based library specifically designed for unit testing Apache Hadoop
MapReduce jobs. It allows developers to test individual mappers and reducers, or the entire
MapReduce computation, without needing a full Hadoop cluster. This makes it easier to develop
and maintain Hadoop MapReduce code.

Key Features and Benefits:

1.​ Local Testing: MRUnit enables you to run unit tests locally without relying on a Hadoop
cluster, saving time and resources.

2.​ Simulated Environment: It provides a mock environment that simulates Hadoop's
MapReduce framework, allowing you to test your code in isolation.

3.​ Input and Output Control: You can define specific input data and then verify the expected
output from your mappers and reducers.

4.​ Integration with JUnit: MRUnit is built on top of JUnit, a popular Java unit testing
framework, making it familiar to developers.

5.​ Focus on Functionality: It allows you to focus on testing the core logic of your
MapReduce jobs without dealing with complex Hadoop configurations.
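
A minimal sketch of an MRUnit test, assuming the MRUnit 1.x API for the new MapReduce API and a word-count style mapper (the hypothetical WordCountMapper sketched later in this document):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // WordCountMapper is the hypothetical mapper that splits a line into (word, 1) pairs
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapperEmitsOnePairPerWord() throws Exception {
        mapDriver
            .withInput(new LongWritable(0), new Text("big data"))
            .withOutput(new Text("big"), new IntWritable(1))
            .withOutput(new Text("data"), new IntWritable(1))
            .runTest();   // runs the mapper locally; no Hadoop cluster is needed
    }
}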

Anatomy of a MapReduce Job Run

A MapReduce job run involves several key components: the client submitting the job, the YARN
resource manager coordinating resource allocation, YARN node managers launching and
monitoring tasks, a MapReduce application master managing the job, and a distributed
filesystem for data sharing. The job itself consists of two main phases: a map phase where input
data is processed and intermediate key-value pairs are generated, and a reduce phase where
these pairs are combined to produce the final output.

Here's a more detailed breakdown of the anatomy of a MapReduce job run:

1. Client:
Submits the MapReduce job, including the application code, configuration, and input data.
Interacts with the JobTracker (in older Hadoop versions) or the YARN Resource Manager (in
newer versions) to initiate the job.

2. YARN (Yet Another Resource Negotiator) Resource Manager:


Coordinates the allocation of compute resources (CPU, memory, etc.) on the Hadoop cluster.
Receives requests from the MapReduce Application Master (MRAppMaster) for containers
(bundles of CPU and memory on cluster nodes) in which to run map and reduce tasks.
Negotiates with YARN Node Managers to determine which nodes can host the tasks.
3. YARN Node Managers:
Launch and monitor compute containers on the individual nodes in the cluster.
Manage the resources allocated to each container, ensuring that tasks have the necessary
resources.

4. MapReduce Application Master (MRAppMaster):


Manages the execution of the MapReduce job.
Requests containers from the YARN Resource Manager to run map and reduce tasks.
Splits the input data into smaller chunks (input splits) and assigns them to different map tasks.
Coordinates the execution of the reduce tasks.

5. Distributed Filesystem (e.g., HDFS):


Used for storing the input data, intermediate data, and the final output of the MapReduce job.
Facilitates data sharing between different components of the MapReduce framework.

YARN Child (Implicit):


When an application master or a container is created and managed by YARN, it's often referred
to as a "YARN child" in a simplified way. It means that these components are subordinate to
YARN's resource management and scheduling control.

6. Map and Reduce Tasks:


Map Tasks: Process input data in parallel, generating intermediate key-value pairs.
Reduce Tasks: Combine the intermediate key-value pairs from the map tasks to produce the
final output.

In summary: The MapReduce job run involves a client submitting the job, the YARN Resource
Manager allocating resources, YARN Node Managers managing tasks, the MapReduce
Application Master coordinating the job, and the distributed filesystem handling data storage
and sharing. The job is executed in two main phases: map and reduce, where the input data is
processed and aggregated to produce the final results.
In MapReduce, job scheduling determines how tasks are allocated and executed within a
distributed computing framework like Hadoop. Effective scheduling is crucial for optimizing
resource utilization, ensuring fairness among users, and managing dependencies between Map
and Reduce tasks.

Here's a breakdown of job scheduling in MapReduce:

MapReduce Scheduling Process:

1.​ Job Submission: Users submit jobs to a queue, and the cluster processes them in a
defined order.

2.​ Task Distribution: The master node (JobTracker) distributes Map and Reduce tasks to
different worker nodes (TaskTrackers).

3.​ Data Processing: Map tasks read data splits, apply the map function, and write
intermediate results locally.

4.​ Data Shuffling and Reducing: Reduce tasks read intermediate results, apply the reduce
function, and write final results to output files.

Key Scheduling Issues:

1.​ Locality: Prioritizing the assignment of tasks to nodes where the data is stored (data
locality) minimizes data transfer overhead.

2.​ Synchronization: Coordinating the execution of Map and Reduce tasks to ensure that
Reduce tasks don't start until all Map tasks are completed or have written intermediate
data.

3.​ Fairness: Equitably allocating resources among different users or groups, especially in
shared cluster environments.

Common Scheduling Algorithms:

1.​ FIFO (First-In, First-Out): Executes tasks in the order they are submitted.

2.​ Capacity Scheduler: Divides cluster capacity among queues, each with a guaranteed
share of resources; different queues can be given different resource allocations.

3.​ Fair Scheduler: Ensures that jobs receive a proportional share of cluster resources
based on their queue, promoting fairness.
Failures in MapReduce, a common big data processing framework, can be
broadly categorized into worker failures and master failures. Worker failures
occur when a task assigned to a worker node fails, while master failures
happen when the JobTracker or ApplicationMaster fails, causing the entire
MapReduce job to be restarted. MapReduce provides mechanisms to handle
these failures, such as task retries, task isolation, and speculative execution.

Here's a more detailed breakdown:

1. Worker Failures:

​ Task Failure:​
Tasks can fail due to various reasons, including runtime exceptions, sudden JVM
exits, or network issues.
​ TaskTracker Failure:​
The entire TaskTracker, which manages tasks on a worker node, can fail, leading to
the need to re-execute tasks.
​ Heartbeat Monitoring:​
The MapReduce framework uses heartbeats to monitor the health of worker nodes.
If a node stops responding, the framework assumes it has failed and re-assigns its
tasks.
​ Retry Mechanism:​
The framework attempts to retry failed tasks a certain number of times before
declaring the job failed.
​ Speculative Execution:​
To mitigate the impact of slow nodes, the framework can run duplicate tasks on
different nodes. The first task to finish provides the result.
​ Task Isolation:​
If a node consistently causes failures, the framework will try to avoid assigning new
tasks to that node.

2. Master Failures:

●​ JobTracker/ApplicationMaster Failure: If the master component fails, the
MapReduce job needs to be restarted.

●​ Fault Tolerance: MapReduce employs techniques like checkpointing and
execution logs to resume the job from the last known state after a master failure.

3. Handling Failures:

●​ Task Retries: The framework attempts to rerun failed tasks, by default up to
four attempts per task (see the configuration sketch after this list).

●​ Task Isolation: Failed nodes are isolated to prevent further failures.

●​ Speculative Execution: Backup copies of tasks are run to speed up processing
if a node is slow or fails.

●​ Skipping Bad Records: If a task fails for a specific input record, the framework
can be configured to skip that record and continue processing.

●​ Failure Resilience: Techniques like Task Failure Resilience (TFR) can help
improve performance by allowing tasks to resume from the point of interruption
after a failure.
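
The retry and speculation behaviour is controlled by ordinary job configuration properties. The following is a hedged sketch using the standard Hadoop 2.x/3.x property names (defaults can differ between distributions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTuning {

    static Job configure(Configuration conf) throws Exception {
        // Maximum attempts per task before the whole job is declared failed
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Speculative execution: run backup copies of slow tasks on other nodes
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        // Percentage of failed map tasks the job is allowed to tolerate
        conf.setInt("mapreduce.map.failures.maxpercent", 0);

        return Job.getInstance(conf, "failure-tuning-example");
    }
}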

9. Shuffle and Sort Phase

This is one of the most critical phases in MapReduce:

●​ Shuffle: Intermediate key-value pairs from Mapper tasks are transferred to
the appropriate Reducer tasks.​

●​ Sort: Data is sorted based on keys before reaching the Reducer.​

Sorting ensures all values for a given key are grouped together, which simplifies
reduction logic.

Issues in this phase can significantly affect performance and scalability of the
application.
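
During the shuffle, the Partitioner decides which Reducer each key is sent to (HashPartitioner by default). The sketch below is an illustrative custom partitioner, assuming Text keys and IntWritable values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys that start with a digit go to reducer 0,
// everything else is spread over the remaining reducers by hash.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;                          // only one reducer, nothing to route
        }
        String k = key.toString();
        if (!k.isEmpty() && Character.isDigit(k.charAt(0))) {
            return 0;                          // numeric keys all land on reducer 0
        }
        // keep the hash non-negative and within [1, numPartitions - 1]
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

A driver would enable it with job.setPartitionerClass(FirstCharPartitioner.class) and choose the number of reducers with job.setNumReduceTasks(n).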

10. Task Execution


Each job consists of multiple Map and Reduce tasks:

●​ Map Task Execution:​

○​ Reads input using RecordReader​

○​ Emits intermediate key-value pairs​

○​ Writes to local disk​

●​ Reduce Task Execution:​

○​ Fetches Map output from different nodes​


○​ Sorts and merges data​

○​ Executes Reducer function​

○​ Writes output to HDFS​

Both types of tasks are managed by NodeManager and executed in containers (YARN).

MapReduce Types in Big Data


MapReduce is a core component of big data processing frameworks like Hadoop. It allows
developers to process vast amounts of data using a simple programming model based on Map
and Reduce functions. Depending on the data processing requirements, different types of
MapReduce jobs can be implemented. Each type serves specific use cases and can be
optimized for performance and resource usage.

1. Map-Only Job
Description:

●​ A Map-only job uses only the Mapper phase. There is no Reducer involved.​

●​ Useful when the processing doesn't require aggregation or grouping of data.​

Use Cases:

●​ Filtering data (e.g., removing null records)​

●​ Data formatting or transformations​

●​ Log parsing​

●​ Data sampling​

Example:
Reading a log file and extracting all records with HTTP status code 500.


Mapper: Input (line of text) → Output (filtered line)

Reducer: Not used

Benefits:

●​ Faster execution since the Reduce phase is skipped​

●​ Reduced overhead and network I/O​

2. Map + Reduce Job


Description:

●​ This is the most common and widely used MapReduce job type.​

●​ It includes both Mapper and Reducer phases:​

○​ Mapper emits intermediate key-value pairs.​

○​ Reducer processes these pairs and outputs final results.​

Use Cases:

●​ Word Count​

●​ Aggregation (sum, average, min, max)​

●​ Grouping and classification​

●​ Joining datasets​
Example:

Counting the frequency of each word in a text file:


Mapper: (line) → (word, 1)

Reducer: (word, list of counts) → (word, total_count)
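
In Java, this corresponds roughly to the following Mapper and Reducer (a standard word-count sketch using the org.apache.hadoop.mapreduce API; class names are illustrative, and each class would live in its own .java file):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (byte offset, line of text) -> (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // emit (word, 1)
        }
    }
}

// Reducer: (word, [1, 1, ...]) -> (word, total_count)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);               // emit (word, total_count)
    }
}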

3. MapReduce with Combiner


Description:

●​ The Combiner is an optional component that acts as a mini-Reducer at the Mapper


node.​

●​ It performs local aggregation before data is sent to the Reducer.​

Use Cases:

●​ When reduce operations are associative and commutative (e.g., summing values)​

Example:

In a word count job, the Combiner sums counts at the local node before passing them to the
Reducer.

Benefits:

●​ Reduces data transfer across the network​

●​ Improves job performance
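
Because word-count summation is associative and commutative, the Reducer class itself can be reused as the Combiner. Assuming the WordCountReducer sketched above, enabling it is a single additional line in the driver:

// In the WordCountDriver sketched earlier, after job.setReducerClass(...):
job.setCombinerClass(WordCountReducer.class);   // local, per-mapper aggregation before the shuffle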


In MapReduce, input and output formats determine how data is read into and
written from the framework. Input formats handle data partitioning and
parsing, while output formats manage the final data storage. They both work
with key-value pairs, a core concept in MapReduce.

1. InputFormat in MapReduce
The InputFormat defines:

●​ How input files are split.​

●​ How data is read and transformed into key-value pairs for the Mapper.​

Each InputFormat provides:

●​ A method to split the data (getSplits()).​

●​ A RecordReader to convert split data into key-value pairs.​

Common InputFormats

1.1 TextInputFormat (Default)

●​ Splits input files by lines.​

●​ Each line is read as a value with the key as the byte offset of the line.​

●​ Suitable for processing plain text files.​

Example:

Key = byte offset, Value = “This is a line of text”

1.2 KeyValueTextInputFormat
●​ Treats each line as a key-value pair split by a delimiter (default: tab \t).​

●​ Useful when data is already in a key-value format.​

Example:

Input Line (tab-separated): ID001\tJohn

Output: Key = "ID001", Value = "John"

1.3 SequenceFileInputFormat

●​ Reads data stored in Hadoop’s binary SequenceFile format.​

●​ More efficient than text formats for intermediate data.

1.4 NLineInputFormat

●​ Assigns a fixed number of lines (N) to each Mapper.​

●​ Useful when each Mapper should receive a small, fixed amount of input, for example when every input line triggers an expensive computation.​

1.5 DBInputFormat

●​ Reads data from a relational database using JDBC.​

●​ Data is divided into InputSplits using a SQL query.​
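
Choosing an InputFormat is a job-level setting. The sketch below illustrates two of the formats above (class and property names are the standard ones from org.apache.hadoop.mapreduce.lib.input; the delimiter and line count are illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatSetup {

    // Read key-value lines, using a comma instead of the default tab delimiter
    static void useKeyValueInput(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.getConfiguration().set(
                "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    }

    // Hand each mapper exactly 100 input lines
    static void useNLineInput(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100);
    }
}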

2. OutputFormat in MapReduce
The OutputFormat defines:

●​ How the results from Reducers are written to storage.​


●​ How the key-value pairs produced by Reducers are serialized and
stored.

Common OutputFormats

2.1 TextOutputFormat (Default)

●​ Writes results in plain text files.​

●​ Each key-value pair is written on a new line separated by a tab.​

Example Output:

apple​ 3
banana​ 7

2.2 SequenceFileOutputFormat

●​ Stores output in a binary format using Hadoop’s SequenceFile format.​

●​ Used when output needs to be consumed by other MapReduce jobs.​


Benefits:

●​ Faster processing.​

●​ Supports compression.​

2.3 MultipleOutputs

●​ Allows writing data to multiple files from a single MapReduce job.​

●​ Useful when you want to separate output based on some condition


(e.g., error logs vs. success logs).​

2.4 LazyOutputFormat

●​ Prevents the creation of empty output files when no output is produced.​

●​ Useful for optimizing storage and reducing clutter.

2.5 DBOutputFormat

●​ Used to write output directly to a relational database using JDBC.​

●​ Common in data warehousing and integration with traditional RDBMS.
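
OutputFormats are configured the same way. A hedged sketch combining a few of the options above (classes from org.apache.hadoop.mapreduce.lib.output; the Gzip codec is an illustrative choice):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatSetup {

    // Binary, compressible output intended for consumption by other MapReduce jobs
    static void useSequenceFileOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }

    // Plain-text output, but empty part files are not created (LazyOutputFormat)
    static void useLazyTextOutput(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}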

14. MapReduce Features


Key features that make MapReduce powerful and scalable:

●​ Scalability: Handles petabytes of data across thousands of machines.​

●​ Fault Tolerance: Recovers from task and node failures automatically.​

●​ Parallelism: Breaks down tasks into smaller units for concurrent execution.​

●​ Data Locality: Moves computation closer to the data to reduce network I/O.​

●​ Resource Optimization: Efficient use of hardware with job scheduling and task management.

15. Real-world MapReduce Applications


MapReduce has been used in various domains:

●​ Search Engines: Indexing the web (e.g., Google, Bing).​

●​ E-commerce: Customer analytics and recommendation systems.​

●​ Log Analysis: Web server log processing for monitoring and alerts.​

●​ Social Media: Processing social graphs and user behavior analysis.​

●​ Financial Systems: Fraud detection and real-time analytics.​

●​ Scientific Research: DNA sequencing, simulations, and climate


modeling.
Design of HDFS in Big Data
The Hadoop Distributed File System (HDFS) is the primary storage system
used by the Hadoop framework. It is designed to store and manage massive
volumes of data across a distributed cluster of commodity machines, offering
high throughput, scalability, fault tolerance, and cost efficiency. HDFS is a
fundamental component of big data architecture.

1. Objectives of HDFS Design


●​ High fault tolerance​

●​ Scalability to petabytes of data​

●​ Streaming data access​

●​ Simple and cost-effective hardware​


●​ Write-once-read-many model​

2. Architecture of HDFS
HDFS follows a master-slave architecture consisting of:

2.1 NameNode (Master)

●​ Maintains the metadata of the file system.​

●​ Keeps track of the directory structure, file names, and the mapping
of blocks to DataNodes.​

●​ Stores metadata in RAM for fast access and on disk for recovery.​

●​ There is typically one active NameNode and one standby NameNode


for high availability.​

2.2 DataNode (Slaves)

●​ Responsible for storing the actual data blocks.​

●​ Each DataNode manages the storage on its local disk and periodically
sends heartbeats and block reports to the NameNode.​

●​ Reads/writes requests from clients are handled directly by DataNodes.​

2.3 Secondary NameNode

●​ Often misunderstood — it does not act as a backup NameNode.​


●​ Periodically merges the FSImage and EditLogs from the NameNode
to reduce the size of EditLogs.​

●​ Helps in faster NameNode restarts by producing a new checkpoint of


metadata.

1. FSImage (File System Image)


Definition:

FSImage is a persistent checkpoint of the HDFS namespace. It stores the


entire file system metadata in a snapshot format.

Details:

●​ Created and maintained by the NameNode.​

●​ Contains metadata such as:​

○​ Directory structure​

○​ File-to-block mapping​

○​ Permissions, ownership​

○​ Replication factors​

Purpose:

●​ To persist the current state of the file system.​

●​ Loaded into memory when the NameNode starts up.​


2. EditLogs: EditLogs are transaction logs that record all the
changes made to the HDFS namespace after the last FSImage was
created.

Details:

●​ Every time a file is added, deleted, or renamed, it is logged in the


EditLogs.​

●​ Stored separately from FSImage.​

Purpose:

●​ To ensure that all modifications to metadata are not lost.​

●​ During recovery/startup, the NameNode loads FSImage first, then


replays EditLogs to reach the current state.​

Example Flow:

1.​ FSImage has snapshot up to time t1.​

2.​ EditLogs record all changes from t1 to t2.​

3.​ When NameNode restarts, it combines FSImage + EditLogs = current


namespace.

3. Rack Awareness: Rack Awareness is the HDFS feature where the physical
rack location of DataNodes is considered while placing blocks and replicas.

Why Important?

●​ Enhances fault tolerance.​

●​ Reduces network traffic during read/write.​

How It Works:

●​ Hadoop uses a topology script to determine the rack of each node.​

●​ With the default replication factor of 3, replicas are placed as:​

○​ 1st replica on the local node (or a node on the local rack)​

○​ 2nd replica on a node in a different (remote) rack​

○​ 3rd replica on a different node in the same remote rack as the second​

Benefits:

●​ Protects against rack failure.​

●​ Optimizes bandwidth usage.​

4. Secondary NameNode: The Secondary NameNode is a helper node to the
primary NameNode, used to periodically merge FSImage and EditLogs.

Important Note:

●​ Not a backup NameNode!​

●​ Doesn't take over if the NameNode crashes.​

Function:
1.​ Downloads FSImage and EditLogs from the NameNode.​

2.​ Applies EditLogs to FSImage to create a new merged FSImage.​

3.​ Sends the updated FSImage back to NameNode.​

Why Needed?

●​ Prevents EditLogs from growing too large.​

●​ Helps in quicker NameNode restart.​

5. Namespace in HDFS
Definition:

HDFS Namespace is the hierarchical structure of directories and files in the


distributed file system — much like a traditional file system.

Details:

●​ Managed entirely by the NameNode.​

●​ Does not store actual data, only metadata.​

Includes:

●​ File and folder names​

●​ Ownership, permissions​

●​ Mapping of files to blocks and blocks to DataNodes​


Key Point:

It allows users to perform file system operations like:

●​ Create/Delete files​

●​ Rename files/folders​

●​ Manage access controls​

6. Replication in HDFS: Replication in HDFS refers to storing


multiple copies of each data block across different DataNodes to
ensure reliability and fault tolerance.

Default Replication Factor: 3

Strategy:

1.​ First replica on the local node (where client writes).​

2.​ Second replica on a different rack.​

3.​ Third on a different node in the same rack as the second.​

Benefits:

●​ Tolerates node/rack failures.​

●​ Ensures high data availability.​

●​ Enables load balancing during read operations.​

Management:
●​ If a replica is lost, the NameNode detects it and triggers re-replication.​

●​ Admins can configure replication factor per file.​
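
Programmatically, the per-file replication factor can be inspected or changed through the FileSystem API; the following is a small sketch (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/archive/old-logs.txt");

        // Current replication factor recorded in the NameNode metadata
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Lower the replication factor for cold data to save disk space;
        // the NameNode schedules deletion of the excess replicas.
        fs.setReplication(file, (short) 2);
    }
}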

3. HDFS File Storage Design


3.1 Blocks

●​ Files in HDFS are split into fixed-size blocks (default: 128 MB).​

●​ Each block is replicated across multiple DataNodes (default: 3


replicas).​

●​ Replication ensures fault tolerance and data availability.​

3.2 Replication

●​ Default replication factor is 3:​

○​ One replica on the node where the client writes (local rack)​

○​ One on a node in a different (remote) rack​

○​ One on another node in that same remote rack​

●​ Balances performance and fault tolerance.​

4. Data Flow in HDFS


4.1 Write Operation
●​ Client contacts the NameNode to get block locations.​

●​ Client directly writes blocks to the selected DataNodes in a pipeline


fashion.​

●​ Acks are sent back once replication is complete.​

4.2 Read Operation

●​ Client contacts the NameNode for the block locations.​

●​ Client fetches blocks directly from the nearest DataNode.​

●​ Supports parallel reading of blocks for faster performance.​

5. Fault Tolerance and Recovery


●​ If a DataNode fails, the NameNode detects it via missing heartbeats
and initiates replication of its blocks to maintain the replication factor.​

●​ If a NameNode fails, a Standby NameNode can take over (in HA


setup).​

●​ HDFS can automatically re-replicate lost blocks.​


HDFS (Hadoop Distributed File System) offers significant benefits for big data
storage and processing, including fault tolerance, scalability, and
cost-effectiveness. However, it also faces challenges like low latency for
small file access and the need for careful cluster management.

Benefits:

​ Fault Tolerance:​
HDFS automatically detects and recovers from hardware failures, ensuring data
availability and reliability.
​ Scalability:​
HDFS can handle massive datasets and easily scale to accommodate growing data
volumes.
​ Cost-Effective:​
HDFS utilizes inexpensive, off-the-shelf hardware, making it a cost-effective solution
for big data storage.
​ Data Locality:​
Data is stored directly on the DataNodes, reducing network traffic and improving
processing speed.
​ Open Source:​
HDFS is open source, allowing for community contributions and customization.
​ Diverse Data Formats:​
HDFS can store a wide range of data formats, including structured, semi-structured,
and unstructured data.
​ High Throughput:​
Optimized for high data throughput, making it suitable for streaming data
applications.

Challenges:

​ Small File Problem:​


HDFS can be inefficient when dealing with a large number of small files, as each
small file requires a separate seek operation.
​ Low Latency:​
HDFS is primarily designed for high throughput and may not be ideal for
applications requiring low-latency access to data.
​ Cluster Management:​
Managing and maintaining a large HDFS cluster can be complex, requiring
expertise in Hadoop administration and resource management.
​ Security:​
While HDFS has basic security measures, comprehensive security implementation
may require additional configurations and integrations.
​ Data Quality:​
Ensuring data quality in a large-scale Hadoop environment can be a challenge,
requiring appropriate data cleansing and validation procedures

Benefits of HDFS
1. Fault Tolerance

●​ HDFS automatically replicates data blocks across multiple DataNodes.​

●​ If a node fails, data is retrieved from another replica.​


●​ The default replication factor is 3, ensuring data reliability and
availability.​

2. Scalability

●​ HDFS can scale horizontally by simply adding more nodes to the


cluster.​

●​ Handles petabytes of data efficiently.​

●​ Designed to grow with increasing data demands.​

3. High Throughput

●​ Optimized for batch processing rather than random reads.​

●​ Enables fast data transfer rates between Hadoop jobs and the file
system.​

●​ Suited for applications with large data sets.​

4. Data Locality

●​ HDFS moves computation to where data resides, reducing network


I/O.​

●​ Enhances processing speed by leveraging local disk reads instead of


network reads.​

5. Cost-Effective

●​ Runs on commodity hardware, reducing infrastructure cost.​


●​ Open-source and maintained by the Apache Foundation, eliminating
license costs.​

6. Write Once, Read Many Model

●​ Files are written once and read many times, which simplifies data
coherency and eliminates complex locking mechanisms.​

●​ Ideal for data archival and log storage.​

7. Support for Large Files

●​ HDFS is capable of storing very large files (hundreds of gigabytes to


terabytes).​

●​ Splits files into blocks (default: 128 MB or 256 MB) and distributes them
across the cluster.​

8. Integration with Hadoop Ecosystem

●​ Seamlessly works with other tools like MapReduce, Hive, Pig, HBase,
and Spark.​

●​ Makes it a complete Big Data processing platform.

Challenges of HDFS
1. Not Suitable for Small Files

●​ Each file and block has metadata stored in NameNode’s RAM.​

●​ Too many small files can overload the NameNode, degrading


performance.​
2. Single Point of Failure (Without HA)

●​ The NameNode is the only master managing the metadata.​

●​ If the NameNode crashes (and HA is not configured), the entire HDFS


becomes unavailable.​

3. Latency in Real-Time Applications

●​ HDFS is designed for high throughput, not low latency.​

●​ Not ideal for real-time data processing or transactional systems.​

4. Append-Only File System

●​ HDFS does not support updating files in place.​

●​ Supports only appending to files — limits use in dynamic data storage.​

5. Complex Management

●​ Requires skilled personnel for cluster management, monitoring, and


tuning.​

●​ Managing node failures, data balancing, and resource optimization can


be complex.​

6. Data Replication Overhead

●​ Storing multiple copies (replicas) of data increases storage


requirements.​

●​ May lead to higher costs in terms of disk space usage.​


7. Security Concerns

●​ Basic HDFS does not offer strong security.​

●​ Needs integration with Kerberos, Ranger, or Sentry for access control


and auditing.​

8. Heavy Memory Usage

●​ NameNode stores the entire namespace and metadata in memory.​

●​ Requires large RAM capacity as the number of files and blocks


increases.

🔹 1. File Sizes in HDFS


📌 Definition:
In Big Data systems, the files to be processed are often very large — ranging
from hundreds of megabytes to several terabytes.

🧾 Characteristics:
●​ HDFS is optimized for large file sizes rather than numerous small files.​

●​ Typical file types stored:​

○​ Log files​

○​ Sensor data​

○​ Audio/video data​

○​ Scientific datasets​
🚫 Small File Problem:
●​ Storing many small files (less than block size) is inefficient, as each file
consumes NameNode memory for metadata.​

●​ Too many small files can overload the NameNode and degrade cluster
performance.​

🔹 2. Block Sizes in HDFS


📌 Definition:
HDFS splits large files into fixed-size blocks, which are stored across
different DataNodes.

📐 Default Block Size:


●​ Hadoop 1.x: 64 MB​

●​ Hadoop 2.x and above: 128 MB​

●​ Can be configured per file or system-wide​

✅ Advantages of Large Block Sizes:


●​ Reduces seek time and disk I/O.​

●​ Improves data transfer rates.​

●​ Minimizes metadata overhead on the NameNode.​

📦 Example:
If a 512 MB file is stored and block size is 128 MB:
●​ File will be split into 4 blocks:​

○​ Block 1: 128 MB​

○​ Block 2: 128 MB​

○​ Block 3: 128 MB​

○​ Block 4: 128 MB
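
The block size is an ordinary configuration property (dfs.blocksize in Hadoop 2.x and above). A hedged sketch of overriding it and inspecting the block size recorded for an existing file (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Override the default block size (128 MB) with 256 MB for files
        // created through this configuration.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // Inspect the block size actually recorded for an existing file
        Path file = new Path("/user/hadoop/big-input.log");
        long blockSize = fs.getFileStatus(file).getBlockSize();
        System.out.println("Block size (bytes): " + blockSize);
    }
}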

🔹 3. Block Abstraction in HDFS


📌 What is a Block?
A block in HDFS is a logical representation of data.

●​ Unlike traditional file systems, where block size is small (4 KB or 8 KB),


HDFS blocks are much larger.​

●​ A block in HDFS does not represent a fixed-size chunk on disk, but


rather a unit of storage abstraction.

In Hadoop Distributed File System (HDFS), block abstraction refers to the


way large files are split into smaller, manageable chunks called blocks, which
are then stored across multiple nodes (DataNodes) in the cluster. This
method simplifies file management, allows for data locality, and enables
efficient parallel processing.

Here's a more detailed explanation:

Key aspects of block abstraction in HDFS:

​ Splitting files into blocks:​


HDFS divides files into blocks, with a default block size of 128 MB.
​ Storing blocks across DataNodes:​
These blocks are then distributed and stored across the DataNodes within the
HDFS cluster.
​ Data locality:​
HDFS aims to store data blocks on the same nodes where computation is
performed, minimizing network traffic and improving performance.
​ Replication for fault tolerance:​
Each block is replicated multiple times (typically three copies) across different
DataNodes to ensure data availability and fault tolerance.
​ NameNode's role:​
The NameNode manages block metadata, including their location and replication,
and communicates with DataNodes to handle block-related operations.
​ Benefits of block abstraction:
●​ Simplified file management: Breaking files into blocks makes it easier to
manage large files in a distributed environment.
●​ Improved storage utilization: HDFS can efficiently utilize storage space by
distributing blocks across different nodes.
●​ Enhanced parallel processing: HDFS can parallelize data processing by
allowing multiple mappers to process different blocks simultaneously.
●​ Fault tolerance and availability: Replication of blocks across multiple nodes
ensures that data is not lost if a node fails.
Data Replication in HDFS
Hadoop Distributed File System (HDFS) is designed to reliably store very large files across a distributed environment. To
ensure data availability, fault tolerance, and reliability, HDFS implements a powerful mechanism known as data
replication.

🔹 What is Data Replication?


Replication in HDFS refers to the process of storing multiple copies of each data block across different DataNodes in
the Hadoop cluster.

●​ This redundancy ensures that data is not lost even if a node fails.​

●​ Every file in HDFS is divided into blocks (default: 128 MB or 256 MB).​

●​ Each block is replicated and stored on multiple nodes.​

🔹 Default Replication Factor


●​ The default replication factor is 3.​

●​ This means each block of a file is stored on three different DataNodes.​

●​ Can be configured at:​

○​ Cluster level (default for all files)​

○​ File level (per-file basis during creation)​

🔹 Replica Placement Policy


HDFS uses a rack-aware replica placement strategy to ensure fault tolerance and data availability.

📌 Typical Placement (Replication Factor = 3):


1.​ First replica is placed on the local node (where the client is writing).​

2.​ Second replica is placed on a node in a different rack.​

3.​ Third replica is placed on a different node in the same rack as the second.​

This ensures:
●​ One local copy​

●​ One remote rack copy (for disaster recovery)​

●​ One intra-rack copy (to balance network bandwidth)​

🔹 Importance of Data Replication


✅ 1. Fault Tolerance
●​ If one or two nodes fail, the data can still be retrieved from other replicas.​

●​ Ensures high data reliability.​

✅ 2. Data Availability
●​ Multiple replicas mean more chances of data being available at all times.​

●​ Crucial in large-scale distributed environments.​

✅ 3. Load Balancing
●​ Read operations can be served by any replica, helping in load distribution.​

●​ HDFS selects the closest replica to minimize read latency.​

✅ 4. Cluster Performance
●​ Replication ensures parallel data processing across different nodes.​

●​ Enhances performance of MapReduce and other jobs.​

🔹 Handling Block Failures


HDFS uses the Block Scanner and DataNode heartbeat mechanisms:

●​ The NameNode continuously monitors the health of DataNodes through heartbeat signals.​

●​ If a block becomes under-replicated (due to node failure), the NameNode schedules replication of the missing
block on a healthy node.​

●​ If a block is over-replicated, the NameNode may delete excess replicas to save space.

How Does HDFS Store Data?


Step-by-Step: How HDFS Stores Data

1. File Division into Blocks

●​ A file to be stored in HDFS is split into fixed-size blocks (default size: 128 MB
or 256 MB).​

●​ For example, a 512 MB file with 128 MB block size is split into 4 blocks.​

2. Block Abstraction

●​ Each block is treated as an independent unit of storage.​

●​ Blocks are identified by block IDs and are not stored contiguously.​

3. Block Replication

●​ Each block is replicated (default: 3 copies) and distributed across different


DataNodes for fault tolerance.​

●​ Replica placement strategy:​

○​ One replica on local node (if possible)​

○​ One on a remote rack​

○​ One on another node in the same rack as the second​

4. Metadata Handling by NameNode

●​ The NameNode does not store the actual data.​

●​ It stores metadata like:​

○​ File-to-block mapping​

○​ Block-to-DataNode mapping​
○​ Access permissions​

●​ All this metadata is kept in memory (RAM) for fast access.​

5. Block Storage in DataNodes

●​ DataNodes store the blocks on their local disk.​

●​ They also send heartbeat and block reports periodically to the NameNode.​

6. Replication Management

●​ If the NameNode detects a block is under-replicated, it instructs other


DataNodes to create new replicas.​

●​ If a DataNode fails, replication is triggered automatically.

Example

Suppose a 300 MB file is uploaded to HDFS with a block size of 128 MB and
replication factor of 3:

●​ It is divided into:​

○​ Block A: 128 MB​

○​ Block B: 128 MB​

○​ Block C: 44 MB​

●​ Each block is stored as 3 replicas:​

○​ A1, A2, A3 (on 3 different DataNodes)​

○​ B1, B2, B3​

○​ C1, C2, C3​


Client uploads file → the HDFS client splits it into blocks

File.txt → [Block A] [Block B] [Block C]

Each block is then replicated across DataNodes, e.g.
Block A → DataNode1 (A1), DataNode2 (A2), DataNode3 (A3)

How HDFS Writes a File

📌 Step-by-Step Write Process:


1.​ Client sends write request to the NameNode:​

○​ The client contacts the NameNode to create a file.​

○​ NameNode checks if the file already exists and permissions are valid.​

2.​ NameNode allocates blocks:​

○​ The NameNode responds with the list of DataNodes to store the first
block and its replicas.​

3.​ Client writes data to DataNodes:​

○​ The client divides the file into packets and writes them to the first
DataNode.​

○​ The first DataNode streams the data to the second, and then to the
third (pipeline replication).​

4.​ Acknowledgement:​
○​ Once all DataNodes have received the block, they send
acknowledgments back through the pipeline.​

○​ Only after all replicas are stored, the write is considered complete.​

5.​ Next block is written:​

○​ Process repeats for the next block until the entire file is
written.​

🔹 3. How HDFS Reads a File


📌 Step-by-Step Read Process:
1.​ Client sends read request to the NameNode:​

○​ The client contacts the NameNode for the file metadata.​

○​ NameNode returns a list of DataNodes holding the blocks of


the file.​

2.​ Client reads data from DataNodes:​

○​ The client selects the closest replica (based on network


proximity).​

○​ Data is read block by block, directly from the DataNodes.​

3.​ Streaming read:​

○​ HDFS supports streaming access, i.e., data is processed as


it is read.​
4.​ No contact with NameNode for each block:​

○​ After getting the block locations at the beginning, the client


communicates directly with DataNodes.

Java Interfaces to HDFS


The Hadoop Distributed File System (HDFS) provides a set of Java APIs that
allow developers to programmatically interact with HDFS for file operations such
as reading, writing, deleting, listing files, and managing directories.

These interfaces are part of the Hadoop Common library and are essential for
building Java-based applications that work directly with HDFS.

🔹 1. org.apache.hadoop.fs.FileSystem Class
This is the core interface in Hadoop to interact with file systems
(including HDFS).
📌 Purpose:
●​ Abstracts the interaction between applications and various file
systems (e.g., HDFS, local FS, S3).​

📦 Key Methods:
Method                      Description
create(Path path)           Creates a new file and returns an FSDataOutputStream
open(Path path)             Opens a file for reading (FSDataInputStream)
delete(Path path, boolean)  Deletes a file or directory
mkdirs(Path path)           Creates a directory recursively
listStatus(Path path)       Lists the contents of a directory
exists(Path path)           Checks if a file exists

🔹 2. FSDataInputStream and FSDataOutputStream


These classes handle streaming I/O for files in HDFS.

🧾 FSDataInputStream:
●​ Used to read data from HDFS.​

●​ Extends DataInputStream.​

●​ Supports seek() operation to jump to any position in a file.​

🧾 FSDataOutputStream:
●​ Used to write data to HDFS.​

●​ Extends DataOutputStream.​

●​ Supports writeBytes(), writeInt(), etc.​

🔹 3. Path Class
Represents the location of a file or directory in HDFS.

●​ Similar to java.io.File, but adapted for HDFS URIs.​

●​ Example: Path p = new Path("/user/data/input.txt");​

🔹 4. Accessing HDFS via Java Code (Example)


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HDFSExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/myfile.txt");

        // Writing to HDFS
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Hello, HDFS!");
        out.close();

        // Reading from HDFS
        FSDataInputStream in = fs.open(file);
        String content = in.readUTF();
        System.out.println("Read from HDFS: " + content);
        in.close();
    }
}

🔹 5. Other Important Classes


Class           Role
FileStatus      Contains metadata about files (length, permissions)
RemoteIterator  For iterating over file listings
FileContext     Alternative API to FileSystem for advanced use cases
Configuration   Holds configuration parameters (like the HDFS URI)
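
A small sketch of how these classes fit together when listing a directory (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListingExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hadoop");

        // Immediate children of the directory, with their metadata
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Recursive listing of files only, streamed via RemoteIterator
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
        while (it.hasNext()) {
            System.out.println(it.next().getPath());
        }
    }
}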

Hadoop CLI
The Hadoop Command Line Interface (CLI) provides an essential bridge
between users and the Hadoop ecosystem. It is a Unix-style command-line
tool that lets users execute commands directly on HDFS or perform
administrative tasks such as starting and stopping daemons, copying files,
and checking job statuses. It supports file operations, job submission, and
system monitoring through simple, shell-like commands. Mastery of CLI
commands enables developers, administrators, and analysts to efficiently
interact with HDFS, manage data, and run MapReduce jobs in a scalable and
scriptable way. These commands are typically executed in a Linux terminal
or through a script using:

hadoop fs <command>

hdfs dfs <command>   (the older form hadoop dfs is deprecated)

2. Common HDFS Commands

Command                            Description
hdfs dfs -ls /                     Lists all files/directories in the root directory
hdfs dfs -mkdir /user/hadoop/dir   Creates a new directory
hdfs dfs -put file.txt /user/      Uploads a local file to HDFS
hdfs dfs -get /user/file.txt .     Downloads a file from HDFS to the local file system
hdfs dfs -cat /user/file.txt       Displays the contents of a file
hdfs dfs -rm /user/file.txt        Deletes a file from HDFS
hdfs dfs -du -h /user/             Shows disk usage for a directory in HDFS
hdfs dfs -copyFromLocal            Copies files from the local file system to HDFS
hdfs dfs -copyToLocal              Copies files from HDFS to the local file system

🔹 3. Administrative Hadoop Commands


These are used to manage the cluster services:
Command Description

start-dfs.sh Starts the HDFS daemons (NameNode, DNs)

stop-dfs.sh Stops the HDFS daemons

start-yarn.sh Starts the YARN ResourceManager and NodeManagers

stop-yarn.sh Stops the YARN services

jps Displays running Java processes (to verify Hadoop daemons)

5. CLI Configuration

CLI commands rely on the Hadoop configuration files such as:

●​ core-site.xml​

●​ hdfs-site.xml​

●​ mapred-site.xml​

●​ yarn-site.xml​

These files are located in the $HADOOP_HOME/etc/hadoop directory and define


properties such as fs.defaultFS, block size, replication factor, etc.
🔹 6. Features of Hadoop CLI
●​ ✅ Simple and intuitive​

●​ ✅ Supports most file system operations​

●​ ✅ Can be used in shell scripts for automation​

●​ ✅ Integrates with Linux permissions and file patterns​

●​ ✅ Can be extended with custom commands​

🔹 7. Limitations of Hadoop CLI


●​ ❌ Not suitable for real-time monitoring​

●​ ❌ Lacks GUI for beginners​

●​ ❌ Complex cluster operations may need scripting or UI (like Ambari or


Cloudera Manager)​

Hadoop File System Interfaces


The Hadoop File System (HDFS) is the primary storage system of the
Hadoop ecosystem. It is designed to store large datasets across multiple
machines with fault tolerance and high throughput. To interact with HDFS and
other file systems, Hadoop provides several interfaces.

These interfaces allow users and applications to perform file operations,


access data, and manage file system metadata both programmatically and
through command-line tools.
Types of Hadoop File System Interfaces

Hadoop supports multiple types of interfaces for different users and use
cases:

Interface Type        Purpose
CLI (Command Line)    For interacting with HDFS from the terminal
Java API              For developers to build Hadoop-enabled apps
Web UI                For monitoring and browsing files via a browser
REST API (WebHDFS)    For accessing HDFS over HTTP from remote clients
FUSE                  To mount HDFS as a local file system
Streaming API         For MapReduce programs that process streaming data

Command Line Interface (CLI)


The most commonly used interface for administrators and developers.

Commands like:

hdfs dfs -ls /user

hdfs dfs -put file.txt /user/data/


hdfs dfs -get /user/data/file.txt .

Let users:

●​ List directories​

●​ Upload/download files​

●​ Create/remove directories​

●​ Check disk usage​

Useful for scripting and automation of HDFS operations.

🔹 3. Java API (Programmatic Access)


Used by developers to write applications that interact with HDFS.

Key Classes:

●​ org.apache.hadoop.fs.FileSystem: Core class to interact with


HDFS.​

●​ FSDataInputStream, FSDataOutputStream: For reading/writing


files.​

●​ Path: Represents file or directory paths.​

Example:

FileSystem fs = FileSystem.get(new Configuration());


Path path = new Path("/user/data.txt");

FSDataInputStream in = fs.open(path);

This is the most powerful interface, used by internal Hadoop components


and custom applications.

🔹 4. Web Interface (HDFS Web UI)


HDFS provides a browser-based UI (usually at http://namenode:9870/)
that allows users to:

●​ View file system hierarchy​

●​ Monitor file blocks, replication​

●​ Track data nodes and their status​

●​ Diagnose NameNode metrics and logs​

It is mainly for administrators and debugging.

What is Data Flow in Hadoop?


Data flow in Hadoop refers to the path taken by data from its source to its
destination (usually HDFS). It involves multiple components:

1.​ Data Sources – Web servers, databases, social media, IoT devices,
etc.​

2.​ Ingestion Tools – Flume, Sqoop, Kafka, NiFi​


3.​ Storage Systems – HDFS, HBase, Hive​

4.​ Processing Frameworks – MapReduce, Spark​

5.​ Output / Reporting – Dashboards, ML models, business reports

Data Flow and Data Ingest with Flume and


Sqoop in Big Data
In Big Data systems, data ingestion refers to the process of collecting and
importing data for immediate use or storage in a database or data warehouse.
Two common tools used for ingesting data into the Hadoop ecosystem are
Apache Flume and Apache Sqoop. These tools are designed for efficient
and reliable data flow from various sources into the Hadoop Distributed File
System (HDFS).

Apache Flume – For Streaming Data Ingestion


🔹 What is Flume?
Apache Flume is a distributed, reliable, and available system for efficiently
collecting, aggregating, and moving large amounts of log data or event
data into HDFS or HBase.

🔹 Architecture:
Flume is built around a simple and flexible architecture based on streaming
data flow.

Key components:

●​ Source: Accepts data (e.g., Avro source, Netcat source, exec source)​

●​ Channel: Temporary store (e.g., memory, file-based)​


●​ Sink: Destination (e.g., HDFS, HBase)​

🔹 Data Flow in Flume:

The Flume agent runs a configuration of these three components. Multiple agents can be used in a multi-hop or fan-in/fan-out configuration.
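A minimal agent configuration sketch for this data flow (the agent, source, channel, and sink names a1, r1, c1, and k1 are illustrative):

# flume.conf – one agent with a netcat source, memory channel, and HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent would then be started with flume-ng agent --name a1 --conf-file flume.conf.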

🔹 Example Use Case:


●​ Ingesting real-time logs from a web server to HDFS for later analysis
using Hive or Spark.​

🔹 Benefits of Flume:
●​ Handles streaming data​

●​ Supports reliability (event delivery guarantees)​

●​ Customizable (interceptors, serializers)

Apache Sqoop – For RDBMS to HDFS Data Transfer


🔹 What is Sqoop?
Apache Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational
databases (e.g., MySQL, PostgreSQL, Oracle).

🔹 Sqoop Architecture:
Sqoop uses MapReduce jobs internally to parallelize the import/export
process, thus making data transfers fast and scalable.

🔹 Data Flow in Sqoop:

Sqoop connects to the source database over JDBC, reads the table metadata, and launches parallel map tasks; each task imports a slice of the table and writes it to files in HDFS (or to Hive/HBase). Exports follow the reverse path, reading HDFS files and writing rows back to the database.

🔹 Flume vs. Sqoop Comparison:

Feature         | Flume                         | Sqoop
Data Type       | Semi-structured / log / event | Structured (RDBMS)
Source          | Logs, real-time streams       | Relational databases
Use Case        | Streaming ingestion           | Bulk data transfer
Integration     | HDFS, HBase                   | HDFS, Hive, HBase
Scheduling      | Continuous                    | Batch
Underlying Tech | Streaming pipeline            | MapReduce
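As an illustration, a typical Sqoop import (the connection string, table name, and target directory are hypothetical) looks like:

sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --table customers \
  --target-dir /user/data/customers \
  --num-mappers 4

Sqoop turns this into four parallel map tasks, each importing a slice of the customers table into files under /user/data/customers.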

Hadoop Archives (HAR)


In Hadoop, managing millions of small files is inefficient and burdensome. Each file,
regardless of its size, stores metadata in the NameNode, which is kept in memory.
As a result, too many small files can overload the NameNode, reducing cluster
efficiency and increasing load times.

To solve this problem, Hadoop provides a special storage format called Hadoop
Archives (HAR files). These archives help in logically grouping multiple small files
into a larger archive, thereby optimizing storage and access.

🗂️ What is a Hadoop Archive (HAR)?


●​ A HAR file is a container for a set of files in HDFS, packaged into a single
archive.​

●​ It helps reduce the metadata burden on the NameNode.​

●	While individual files remain accessible, they are no longer treated as separate files in HDFS.

🧱 Structure of HAR Files


A HAR file contains:

1.​ _index – Stores metadata (file names, offsets, and lengths)​

2.​ _masterindex – Stores archive-level information​

3.​ Data files – Contain the actual contents of the original files​

Together, these files make it possible to retrieve individual files from the archive
without unarchiving everything.
🛠️ Creating a HAR File
HAR files are created using the Hadoop command-line interface:

hadoop archive -archiveName archive.har -p /input_dir /output_dir

Example:

hadoop archive -archiveName logs.har -p /logs/raw /logs/archive

This creates an archive named logs.har under /logs/archive/ containing all files from /logs/raw.
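Files inside the archive remain readable through the har:// URI scheme; for example (server1.log is a hypothetical file from the original /logs/raw directory):

hdfs dfs -ls har:///logs/archive/logs.har
hdfs dfs -cat har:///logs/archive/logs.har/server1.log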

Hadoop I/O: Compression, Serialization, Avro, and


File-Based Data Structures
In the Hadoop ecosystem, handling large volumes of data efficiently is crucial.
Hadoop I/O operations involve the movement, storage, and processing of data.
Optimizing Hadoop I/O helps reduce disk I/O, speed up processing, and lower
storage costs.

The major components of Hadoop I/O include:

●​ Data Compression​

●​ Serialization​

●​ Apache Avro​

●​ File-based Data Structures​

1️⃣ Compression in Hadoop


🔹 What is Compression?
Compression is the process of encoding data to reduce its storage size. In Hadoop,
compression is vital to:

●​ Save storage space in HDFS.​

●​ Reduce disk I/O and network bandwidth during data movement.​

●	Speed up MapReduce jobs.

Compression can be applied at several stages of a MapReduce pipeline:

●	Input files
●	Intermediate map outputs
●	Output files


🔹 Enabling Compression:

For example, MapReduce job output compression can be switched on in mapred-site.xml:
<property>

<name>mapreduce.output.fileoutputformat.compress</name>

<value>true</value>

</property>
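The codec can be selected with a companion property; Snappy is shown here as one common choice:

<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>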

2️⃣ Serialization in Hadoop

🔹 What is Serialization?
Serialization is the process of converting objects into a stream of bytes for storage
or transmission. Deserialization is the reverse process.

🔹 Importance in Hadoop:
●​ Transferring data between nodes in a distributed system​

●​ Reading/writing data from disk​

●​ Intermediate data in MapReduce jobs​

🔹 Hadoop’s Default Serialization Framework:


Hadoop uses Writable objects for serialization.
Example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A simplified version of Hadoop's built-in IntWritable
public class IntWritable implements Writable {
    private int value;

    // Serialize the value to the output stream
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Deserialize the value from the input stream
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }
}

🔹 Limitations of Writable:
●​ Tightly coupled to Java​

●​ Verbose and not portable​

●​ Not ideal for data exchange between languages

3️⃣ Avro in Hadoop

🔹 What is Avro?
Apache Avro is a language-neutral, schema-based serialization system
developed as part of the Apache Hadoop project. It solves the limitations of Writable.

🔹 Features of Avro:
●​ Compact and fast​

●​ Supports dynamic schemas​

●​ Suitable for cross-language data exchange​


●​ Stores schema with the data​

🔹 Avro Data Format:


●​ Data is stored in .avro files​

●​ Contains a schema in JSON format​

●​ Supports both row-based and block-based serialization​
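For instance, a small Avro schema in JSON form (the record and field names are hypothetical):

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}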

🔹 Avro Use Cases:


●​ Data serialization in MapReduce​

●​ Interoperable data exchange between languages (Java, Python, C++)​

●​ Storage format for Kafka, Hive, HDFS

4️⃣ File-Based Data Structures in Hadoop

Hadoop supports several structured file formats that are optimized for big data
workloads:

a) SequenceFile
●​ Binary key-value pair file format​

●​ Used for intermediate output in MapReduce​

●​ Supports compression​

SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, LongWritable.class, Text.class);
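Records are then appended as key-value pairs and the writer is closed (the values below are illustrative):

writer.append(new LongWritable(1), new Text("first record"));
writer.append(new LongWritable(2), new Text("second record"));
writer.close();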

b) MapFile

●​ Sorted version of SequenceFile​

●​ Allows fast lookup using an index​

●​ Suitable for random access​

c) Avro File

●​ Contains serialized data along with schema​

●​ Efficient for data exchange and storage​

●​ Supports schema evolution​

d) Parquet

●​ Columnar storage format​

●​ Ideal for analytical queries​

●​ Supported by Hive, Impala, Spark​

e) ORC (Optimized Row Columnar)


●​ Optimized for Hive​

●​ High compression, low latency​

●​ Efficient storage for structured data

Setting Up a Hadoop Cluster


A Hadoop Cluster is a collection of computers (nodes) working together to store
and process large volumes of data using the Hadoop Distributed File System
(HDFS) and the MapReduce processing model. Setting up a Hadoop cluster
involves configuring both hardware and software components to operate in a
distributed computing environment.

Prerequisites

a) Hardware Requirements:

●​ Minimum 2-4 nodes (one master + multiple slaves)​


●​ At least 8 GB RAM (per node recommended)​

●​ 100 GB+ disk space​

●​ 1 Gbps Network for optimal performance​

b) Software Requirements:

●​ Linux-based OS (e.g., Ubuntu/CentOS)​

●​ Java JDK 8/11 installed​

●​ Hadoop (latest stable version from Apache)​

Steps to Set Up a Hadoop Cluster:

1.​ Install Java​


Hadoop requires Java (usually OpenJDK).

sudo apt update

sudo apt install openjdk-11-jdk

Verify:

java -version

2.​ Create Hadoop User​


Add a dedicated user (e.g., hadoop) for managing services.

sudo adduser hadoop

sudo usermod -aG sudo hadoop​

3.​ Install Hadoop​


Download Hadoop binaries and extract them to all nodes.​
4.	Configure SSH Access​
Enable passwordless SSH from the master to all slave nodes (a sample key setup is shown after this list).​

5.​ Configure Hadoop Files​


Edit these important config files:​

○​ core-site.xml → HDFS settings​

○​ hdfs-site.xml → NameNode, DataNode settings​

○​ mapred-site.xml → JobTracker/ResourceManager settings​

○​ yarn-site.xml → NodeManager settings​

○​ slaves file → List of DataNodes​

6.​ Format NameNode​


Run: hdfs namenode -format​

7.​ Start Hadoop Services​

○​ HDFS: start-dfs.sh​

○​ YARN: start-yarn.sh​

8.​ Verify Web UIs​

○	NameNode: http://<namenode>:9870​

○	ResourceManager: http://<resourcemanager>:8088
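A minimal sketch of the passwordless SSH setup referenced in step 4, assuming the dedicated hadoop user and a worker host named slave1 (both illustrative):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa   # generate a key pair with no passphrase
ssh-copy-id hadoop@slave1                  # copy the public key to each worker node
ssh hadoop@slave1                          # verify that login no longer asks for a password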
Hadoop Cluster Specification
a) Node Types

A typical Hadoop cluster consists of:

Node Type          | Role
NameNode           | Manages the HDFS namespace and metadata
DataNode           | Stores actual data in HDFS blocks
ResourceManager    | Manages system resources and job scheduling (YARN)
NodeManager        | Runs on each node, manages tasks and resources locally
Secondary NameNode | Takes periodic snapshots of NameNode metadata (FSImage + EditLogs)

The master node requires more memory and CPU than the worker nodes.

b) Hardware Requirements

Component | Recommended Spec (Per Node)
CPU       | Quad-core or higher
RAM       | 8 GB minimum (16–64 GB for production)
Storage   | 500 GB to several TB (preferably SSD)
Network   | 1 Gbps Ethernet or better

c) Software Requirements

●​ Linux-based OS (Ubuntu/CentOS)​

●​ Java JDK 8 or 11​

●​ Hadoop (latest version, e.g., Hadoop 3.3.6)​

●​ SSH (for passwordless login)​

●​ Python (optional, for streaming)


Hadoop Configuration
To ensure that Hadoop functions correctly and efficiently, it must be properly configured.
Hadoop configuration involves setting up the environment, defining core functionalities,
tuning performance parameters, and specifying resource allocations through configuration
files.

Types of Configuration Files in Hadoop


Hadoop uses XML-based configuration files located in the HADOOP_HOME/etc/hadoop
directory.

a) core-site.xml

●​ Defines core Hadoop settings like default file system and I/O settings.

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

</property>

</configuration>

b) hdfs-site.xml

●​ Configures HDFS properties such as block size, replication, and storage paths.

<configuration>

<property>

<name>dfs.replication</name>

<value>3</value>

</property>

</configuration>
c) mapred-site.xml

●​ Contains settings for the MapReduce framework, like job tracker, input/output, and
number of reducers.

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

d) yarn-site.xml

●​ Configures YARN resource management and task scheduling.​

<configuration>

<property>

<name>yarn.resourcemanager.hostname</name>

<value>localhost</value>

</property>

</configuration>

🔐 1. Security in Hadoop
Security is crucial in big data systems as they often deal with sensitive data. Hadoop
provides multiple layers of security to protect data and access.
a) Authentication

●​ Hadoop supports Kerberos-based authentication.​

●​ It verifies the identity of users before allowing access.​

●​ Users receive a token from the Key Distribution Center (KDC).​
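As an illustration, on a Kerberos-enabled cluster a user obtains a ticket before running HDFS commands (the principal name is hypothetical):

kinit alice@EXAMPLE.COM      # request a ticket from the KDC
hdfs dfs -ls /user/alice     # subsequent commands authenticate with that ticket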

b) Authorization

●​ Defines what authenticated users are allowed to do.​

●	Implemented through Access Control Lists (ACLs) and file/directory permissions in HDFS.
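For example, permissions and ACLs can be managed from the command line (paths and user names are illustrative; ACLs must be enabled via dfs.namenode.acls.enabled):

hdfs dfs -chmod 750 /data/sales                   # POSIX-style permissions
hdfs dfs -setfacl -m user:alice:r-x /data/sales   # grant read access to one extra user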

c) Encryption

●​ Data-at-Rest: Encrypts stored data blocks.​

●​ Data-in-Transit: Uses SSL/TLS to encrypt network traffic between nodes.​

d) Auditing

●​ Logs user activities and file accesses.​

●​ Tools like Apache Ranger or Apache Sentry provide centralized security policy
management.​

🛠️ 2. Administering Hadoop
Hadoop administration involves configuring, monitoring, and managing cluster operations.

a) Cluster Setup
●​ Configure core-site.xml, hdfs-site.xml, yarn-site.xml, and
mapred-site.xml.​

●​ Setup passwordless SSH and proper user permissions.​

b) Starting/Stopping Services

start-dfs.sh # Starts HDFS

start-yarn.sh # Starts YARN

stop-dfs.sh # Stops HDFS

stop-yarn.sh # Stops YARN

c) Job and Resource Management

●	Use the ResourceManager UI (http://master:8088) to monitor jobs.​

●	Use the HDFS UI (http://master:9870) to check file system status.​

d) User Management

●​ Create users and assign permissions for accessing data.​

●​ Tools like Hue, Apache Knox, or Ranger simplify this process.​

📊 3. HDFS Monitoring & Maintenance


Monitoring ensures that Hadoop performs optimally, and maintenance avoids data loss and
corruption.

a) Monitoring Tools

●​ JPS: Lists running Hadoop daemons.​


●​ Nagios, Ganglia, Ambari: Monitor performance metrics like memory, CPU, disk
usage.​

●​ HDFS Web UI: Shows block distribution, health, under-replicated blocks.​

b) Regular Tasks

●​ Check disk space and system logs.​

●​ Use hdfs fsck / to check file system consistency.​

●​ Run balancer to ensure even data distribution across DataNodes.

hdfs balancer -threshold 10

c) Maintenance Activities

●​ Restart failed nodes.​

●​ Backup FSImage and edit logs.​

●​ Perform upgrades using rolling upgrade feature.​

⚙️ 4. Hadoop Benchmarks
Benchmarks help evaluate the performance of a Hadoop cluster.

a) Teragen and Terasort

●​ Used for generating data and testing sort performance.
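For example, using the standard examples jar shipped with Hadoop (the jar file name varies by version, and the row count and output paths are illustrative):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 /bench/teragen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /bench/teragen /bench/terasort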

b) DFSIO

●​ Measures HDFS I/O throughput.​

c) HiBench
●​ A comprehensive benchmark suite developed by Intel.​

●​ Tests real-world workloads including SQL, streaming, machine learning.

☁️ 5. Hadoop in the Cloud


Using Hadoop in the cloud enables scalable and cost-effective big data processing without
managing physical hardware.

a) Benefits

●​ Elasticity: Resources can be added/removed as needed.​

●​ Cost Efficiency: Pay-as-you-go model.​

●​ High Availability: Managed cloud services offer fault tolerance.​

●​ Easy Setup: No hardware or manual cluster configuration.​

b) Cloud Platforms Supporting Hadoop

Platform            | Service Name
Amazon Web Services | EMR (Elastic MapReduce)
Microsoft Azure     | HDInsight
Google Cloud        | Dataproc
IBM Cloud           | Analytics Engine

c) Cloud Features
●​ Pre-configured Hadoop environment.​

●​ Integration with cloud storage like S3 (AWS), Blob (Azure), or GCS (Google).​

●​ Scalable compute nodes and spot instances to reduce cost.​
