Big Data

Introduction to Big Data Architecture

Big data architecture is a framework designed to manage the ingestion,


processing, storage, and analysis of data that is too large, fast, or
complex for traditional database systems. It provides the infrastructure
necessary to derive insights from vast datasets, often combining both
batch and real-time processing.

The definition of big data is evolving: It's no longer just about the
volume of data, but also the variety, velocity, and the value
derived from advanced analytics and machine learning.

✅ Why Big Data Architecture is Needed


Organizations deal with:

●​ Massive volumes of data (from GBs to PBs)​

●​ Diverse formats (structured, semi-structured, unstructured)​


●​ High-speed data generation from IoT, apps, websites, sensors​

●​ Need for real-time decisions and predictions​

Big data architectures help:

●​ Store and manage huge datasets​

●​ Support real-time and historical analysis​

●​ Apply AI and ML for forecasting and automation​

●​ Enable scalable and fault-tolerant data pipelines​

✅ Common Big Data Workloads


1.​ Batch Processing​

○​ Handles data at rest.​

○​ Involves scheduled processing of large files.​

○​ Example: Processing daily logs every night.​

2.​ Stream (Real-Time) Processing​

○​ Handles data in motion.​

○​ Involves real-time ingestion and transformation.​

○​ Example: Monitoring financial transactions or sensor data.​


3.​ Interactive Data Exploration​

○​ Ad-hoc queries and exploratory analysis by analysts.​

○​ Useful for building models, discovering patterns.​

4.​ Machine Learning and Predictive Analytics​

○​ Uses processed data to train models and make predictions.​

○​ Helps with tasks like classification, recommendation, forecasting.​

✅ When to Use Big Data Architecture


Consider using big data architecture when:

●​ Data volume exceeds the capability of traditional systems.​

●​ You need to process unstructured or semi-structured data.​

●​ You want real-time insights or alerts.​

●​ You aim to train AI/ML models on large datasets.​

●​ There’s a need to integrate data from multiple sources.​

✅ Key Components of Big Data Architecture


🔹 1. Data Sources
All architectures begin with data sources, which may include:
●​ Application Data Stores: SQL, NoSQL databases.​

●​ Static Files: Log files, CSV, JSON, XML, images, etc.​

●​ Real-time Sources: IoT devices, sensors, social media feeds.​

●​ External APIs: Cloud services, 3rd-party integrations.​

🔹 2. Data Storage (Data Lake)


Stores raw or processed data at scale. Usually supports varied file formats
(Parquet, Avro, CSV, etc.) and large file sizes.

●​ Acts as the central repository for structured and unstructured data.​

●​ Known as a Data Lake.​

Tools:

●​ Azure Data Lake Store​

●​ Azure Blob Storage​

●​ OneLake in Microsoft Fabric​

●​ Hadoop Distributed File System (HDFS)​

🔹 3. Batch Processing
Used for scheduled, large-scale processing of data files.
●​ Reads from storage → Processes → Writes output back.​

●​ Suitable for ETL (Extract, Transform, Load) operations.​

●​ Jobs may run for minutes or hours.​

Tools:

●​ Azure Data Lake Analytics (U-SQL)​

●​ Hive, Pig, MapReduce on HDInsight​

●​ Apache Spark (Azure Databricks, Fabric)​

●​ Notebooks: Python, Scala, SQL-based scripting​

🔹 4. Real-Time Message Ingestion


Captures continuous streams of incoming data, like messages, events, or
logs.

●​ Acts as a buffer before real-time processing.​

●​ Supports scale-out, reliable delivery, and event ordering.​

Tools:

●​ Azure Event Hubs​

●​ Azure IoT Hub​

●​ Apache Kafka​
🔹 5. Stream Processing
Processes live data streams in real-time. Performs tasks like:

●​ Filtering, aggregating, transforming​

●​ Joining streams​

●​ Detecting anomalies or trends​

Tools:

●​ Azure Stream Analytics (SQL-like queries on data streams)​

●​ Apache Spark Streaming (Databricks, HDInsight)​

●​ Azure Functions (event-driven processing)​

●​ Fabric real-time event streams​

🔹 6. Machine Learning (ML)


Once the data is prepared (from batch or stream), ML models can be trained
to:

●​ Predict outcomes​

●​ Recommend actions​

●​ Automate decisions​

Use Cases:
●​ Fraud detection​

●​ Product recommendations​

●​ Predictive maintenance​

Tools:

●​ Azure Machine Learning (model training, deployment)​

●​ Azure AI Services (vision, language, speech APIs)​

●​ ML on Databricks, Spark MLlib​

🔹 7. Orchestration
Manages workflows that automate:

●​ Data ingestion​

●​ Batch jobs​

●​ Data movement between components​

●​ Loading data into reports​

Tools:

●​ Azure Data Factory (pipelines, triggers)​

●​ Apache Oozie​
●​ Fabric orchestration​

●​ Apache Sqoop (for data import/export from relational DBs)​

Hadoop - Architecture
Hadoop is a framework written in Java that uses a large cluster of
commodity hardware to store and manage very large volumes of data. Hadoop is
built on the MapReduce programming model introduced by Google. Today
many large companies, e.g. Facebook, Yahoo, Netflix, and eBay, use Hadoop
to deal with big data. The Hadoop architecture mainly consists of 4 components:

●​ MapReduce

●​ HDFS(Hadoop Distributed File System)

●​ YARN(Yet Another Resource Negotiator)

●​ Common Utilities or Hadoop Common

Apache Pig Components


The Apache Pig framework is made up of several components. Let us take a
look at the major components.

Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of
the parser will be a DAG (directed acyclic graph), which represents the Pig
Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes
and the data flows are represented as edges.

Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out
logical optimizations such as projection pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.

Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and
executed on Hadoop, producing the desired results.

Pig Latin Data Model


The data model of Pig Latin is fully nested, allowing complex non-atomic
data types such as map and tuple. The elements of Pig Latin's data model are
described below.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as an
Atom. It is stored as a string and can be used as a string or a number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece
of data or a simple atomic value is known as a field.

Example − raja or 30

Tuple
A record that is formed by an ordered set of fields is known as a tuple; the
fields can be of any type. A tuple is similar to a row in a table of an RDBMS.

Example − (Raja, 30)

Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by {}. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − {Raja, 30, {9848022338, [email protected],}}

Map
A map (or data map) is a set of key-value pairs. The key must be of type
chararray and should be unique. The value may be of any type. A map is
represented by [].

Example − [name#Raja, age#30]


MapReduce – Detailed Explanation
1. MapReduce Framework and Basics
MapReduce is a programming model developed by Google for processing large data sets with
a distributed algorithm on a cluster. The model simplifies data processing across massive
clusters by using two main functions: Map and Reduce.

●​ Map Function: Processes input data and produces a set of intermediate key-value
pairs.​

●​ Reduce Function: Merges all intermediate values associated with the same key to
produce a final result.​

MapReduce is the core of Hadoop, enabling it to handle big data workloads. The framework
automatically handles parallelization, fault tolerance, data distribution, and load balancing.

2. How MapReduce Works


The execution of a MapReduce job involves several phases:

1.​ Input Splitting: The InputFormat divides the input data into splits, by default one
split per HDFS block (128 MB).​

2.​ Mapping: Each block is processed by a separate Map task that converts input data into
intermediate key-value pairs.​

3.​ Shuffling and Sorting: The Map output is shuffled to send all values with the same key
to a single Reducer. Sorting ensures keys are ordered before reducing.​

4.​ Reducing: Reduce tasks aggregate or process the list of values for each key.​

5.​ Output: Final results are written to HDFS by the OutputFormat.
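
To make these phases concrete, the sketch below shows how a driver class wires them together. It is a minimal, illustrative example (the class names WordCountDriver, WordCountMapper, and WordCountReducer are assumptions; the mapper and reducer themselves are sketched later in the WordCount example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // names the job for the cluster UI
        job.setJarByClass(WordCountDriver.class);

        // Map and Reduce classes (hypothetical; sketched later in this document)
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the final (Reducer) output key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are created from these paths; output is written back to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job and waits; YARN handles scheduling and fault tolerance
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}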

Developing a MapReduce application in big data involves breaking down a large task
into smaller, parallelizable chunks, which are then processed by individual machines (map
phase) and finally combined (reduce phase) to produce the final result. This approach is
particularly effective for tasks requiring massive data processing and analysis, such as training
machine learning models or analyzing social media data.

Understanding MapReduce:

1.​ Map Phase: The input data is divided into smaller, manageable chunks and distributed
across a cluster of machines. Each machine processes its assigned chunk
independently, applying a user-defined function (the "map" function) to extract relevant
information or transform the data.

2.​ Shuffle and Sort: The intermediate key-value pairs generated by the map phase are
shuffled and sorted based on the keys, ensuring that all values associated with a
particular key are grouped together on the same machine.

3.​ Reduce Phase: The sorted key-value pairs are then processed by another user-defined
function (the "reduce" function). This function combines the values associated with each
key to produce the final output.

Key Considerations for Development:

1.​ Data Input and Output: MapReduce typically works with data stored in a distributed file
system like the Hadoop Distributed File System (HDFS).

2.​ Programming Model: The core of MapReduce is a programming model that specifies how
the map and reduce functions are defined and executed.

3.​ Scalability: MapReduce is designed to be highly scalable, allowing it to process massive
datasets across a large number of machines.

4.​ Fault Tolerance: MapReduce can automatically handle machine failures and continue
processing, ensuring high reliability.

Examples of MapReduce Applications:


1.​ Machine Learning: Training machine learning models on large datasets for tasks like
linear regression, k-means clustering, and collaborative filtering.

2.​ Web Analytics: Analyzing web server logs to extract insights into user behavior, traffic
patterns, and other metrics.

3.​ Social Media Analysis: Analyzing social media data to identify trends, sentiment, and
other patterns.

MRUnit is a JUnit-based library specifically designed for unit testing Apache Hadoop
MapReduce jobs. It allows developers to test individual mappers and reducers, or the entire
MapReduce computation, without needing a full Hadoop cluster. This makes it easier to develop
and maintain Hadoop MapReduce code.

Key Features and Benefits:

1.​ Local Testing: MRUnit enables you to run unit tests locally without relying on a Hadoop
cluster, saving time and resources.

2.​ Simulated Environment: It provides a mock environment that simulates Hadoop's
MapReduce framework, allowing you to test your code in isolation.

3.​ Input and Output Control: You can define specific input data and then verify the expected
output from your mappers and reducers.

4.​ Integration with JUnit: MRUnit is built on top of JUnit, a popular Java unit testing
framework, making it familiar to developers.

5.​ Focus on Functionality: It allows you to focus on testing the core logic of your
MapReduce jobs without dealing with complex Hadoop configurations.
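
A minimal sketch of an MRUnit test, assuming the MRUnit 1.x API for the new MapReduce API and a word-count style mapper (the hypothetical WordCountMapper sketched later in this document):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // WordCountMapper is the hypothetical mapper that splits a line into (word, 1) pairs
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapperEmitsOnePairPerWord() throws Exception {
        mapDriver
            .withInput(new LongWritable(0), new Text("big data"))
            .withOutput(new Text("big"), new IntWritable(1))
            .withOutput(new Text("data"), new IntWritable(1))
            .runTest();   // runs the mapper locally; no Hadoop cluster is needed
    }
}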

Anatomy of a MapReduce Job Run

A MapReduce job run involves several key components: the client submitting the job, the YARN
resource manager coordinating resource allocation, YARN node managers launching and
monitoring tasks, a MapReduce application master managing the job, and a distributed
filesystem for data sharing. The job itself consists of two main phases: a map phase where input
data is processed and intermediate key-value pairs are generated, and a reduce phase where
these pairs are combined to produce the final output.

Here's a more detailed breakdown of the anatomy of a MapReduce job run:

1. Client:
Submits the MapReduce job, including the application code, configuration, and input data.
Interacts with the JobTracker (in older Hadoop versions) or the YARN Resource Manager (in
newer versions) to initiate the job.

2. YARN (Yet Another Resource Negotiator) Resource Manager:


Coordinates the allocation of compute resources (CPU, memory, etc.) on the Hadoop cluster.
Receives requests from the MapReduce Application Master (MRAppMaster) for containers
(bundles of CPU and memory on cluster nodes) in which to run map and reduce tasks.
Negotiates with YARN Node Managers to determine which nodes can host the tasks.
3. YARN Node Managers:
Launch and monitor compute containers on the individual nodes in the cluster.
Manage the resources allocated to each container, ensuring that tasks have the necessary
resources.

4. MapReduce Application Master (MRAppMaster):


Manages the execution of the MapReduce job.
Requests containers from the YARN Resource Manager to run map and reduce tasks.
Splits the input data into smaller chunks (input splits) and assigns them to different map tasks.
Coordinates the execution of the reduce tasks.

5. Distributed Filesystem (e.g., HDFS):


Used for storing the input data, intermediate data, and the final output of the MapReduce job.
Facilitates data sharing between different components of the MapReduce framework.

YARN Child (Implicit):


When an application master or a container is created and managed by YARN, it's often referred
to as a "YARN child" in a simplified way. It means that these components are subordinate to
YARN's resource management and scheduling control.

6. Map and Reduce Tasks:


Map Tasks: Process input data in parallel, generating intermediate key-value pairs.
Reduce Tasks: Combine the intermediate key-value pairs from the map tasks to produce the
final output.

In summary: The MapReduce job run involves a client submitting the job, the YARN Resource
Manager allocating resources, YARN Node Managers managing tasks, the MapReduce
Application Master coordinating the job, and the distributed filesystem handling data storage
and sharing. The job is executed in two main phases: map and reduce, where the input data is
processed and aggregated to produce the final results.
In MapReduce, job scheduling determines how tasks are allocated and executed within a
distributed computing framework like Hadoop. Effective scheduling is crucial for optimizing
resource utilization, ensuring fairness among users, and managing dependencies between Map
and Reduce tasks.

Here's a breakdown of job scheduling in MapReduce:

MapReduce Scheduling Process:

1.​ Job Submission: Users submit jobs to a queue, and the cluster processes them in a
defined order.

2.​ Task Distribution: The master node (JobTracker) distributes Map and Reduce tasks to
different worker nodes (TaskTrackers).

3.​ Data Processing: Map tasks read data splits, apply the map function, and write
intermediate results locally.

4.​ Data Shuffling and Reducing: Reduce tasks read intermediate results, apply the reduce
function, and write final results to output files.

Key Scheduling Issues:

1.​ Locality: Prioritizing the assignment of tasks to nodes where the data is stored (data
locality) minimizes data transfer overhead.

2.​ Synchronization: Coordinating the execution of Map and Reduce tasks to ensure that
Reduce tasks don't start until all Map tasks are completed or have written intermediate
data.

3.​ Fairness: Equitably allocating resources among different users or groups, especially in
shared cluster environments.

Common Scheduling Algorithms:

1.​ FIFO (First-In, First-Out): Executes tasks in the order they are submitted.

2.​ Capacity Scheduler: Divides cluster capacity among queues, each with a guaranteed
share of resources; different queues can be given different resource allocations.

3.​ Fair Scheduler: Ensures that jobs receive a proportional share of cluster resources
based on their queue, promoting fairness.
Failures in MapReduce, a common big data processing framework, can be
broadly categorized into worker failures and master failures. Worker failures
occur when a task assigned to a worker node fails, while master failures
happen when the JobTracker or ApplicationMaster fails, causing the entire
MapReduce job to be restarted. MapReduce provides mechanisms to handle
these failures, such as task retries, task isolation, and speculative execution.

Here's a more detailed breakdown:

1. Worker Failures:

​ Task Failure:​
Tasks can fail due to various reasons, including runtime exceptions, sudden JVM
exits, or network issues.
​ TaskTracker Failure:​
The entire TaskTracker, which manages tasks on a worker node, can fail, leading to
the need to re-execute tasks.
​ Heartbeat Monitoring:​
The MapReduce framework uses heartbeats to monitor the health of worker nodes.
If a node stops responding, the framework assumes it has failed and re-assigns its
tasks.
​ Retry Mechanism:​
The framework attempts to retry failed tasks a certain number of times before
declaring the job failed.
​ Speculative Execution:​
To mitigate the impact of slow nodes, the framework can run duplicate tasks on
different nodes. The first task to finish provides the result.
​ Task Isolation:​
If a node consistently causes failures, the framework will try to avoid assigning new
tasks to that node.

2. Master Failures:

●​ JobTracker/ApplicationMaster Failure: If the master component fails, the
MapReduce job needs to be restarted.

●​ Fault Tolerance: MapReduce employs techniques like checkpointing and
execution logs to resume the job from the last known state after a master failure.

3. Handling Failures:

●​ Task Retries: The framework attempts to rerun failed tasks, by default up to
four attempts per task (see the configuration sketch after this list).

●​ Task Isolation: Failed nodes are isolated to prevent further failures.

●​ Speculative Execution: Backup copies of tasks are run to speed up processing
if a node is slow or fails.

●​ Skipping Bad Records: If a task fails for a specific input record, the framework
can be configured to skip that record and continue processing.

●​ Failure Resilience: Techniques like Task Failure Resilience (TFR) can help
improve performance by allowing tasks to resume from the point of interruption
after a failure.
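
The retry and speculation behaviour is controlled by ordinary job configuration properties. The following is a hedged sketch using the standard Hadoop 2.x/3.x property names (defaults can differ between distributions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTuning {

    static Job configure(Configuration conf) throws Exception {
        // Maximum attempts per task before the whole job is declared failed
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Speculative execution: run backup copies of slow tasks on other nodes
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        // Percentage of failed map tasks the job is allowed to tolerate
        conf.setInt("mapreduce.map.failures.maxpercent", 0);

        return Job.getInstance(conf, "failure-tuning-example");
    }
}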

9. Shuffle and Sort Phase

This is one of the most critical phases in MapReduce:

●​ Shuffle: Intermediate key-value pairs from Mapper tasks are transferred to
the appropriate Reducer tasks.​

●​ Sort: Data is sorted based on keys before reaching the Reducer.​

Sorting ensures all values for a given key are grouped together, which simplifies
reduction logic.

Issues in this phase can significantly affect performance and scalability of the
application.
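
During the shuffle, the Partitioner decides which Reducer each key is sent to (HashPartitioner by default). The sketch below is an illustrative custom partitioner, assuming Text keys and IntWritable values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys that start with a digit go to reducer 0,
// everything else is spread over the remaining reducers by hash.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;                          // only one reducer, nothing to route
        }
        String k = key.toString();
        if (!k.isEmpty() && Character.isDigit(k.charAt(0))) {
            return 0;                          // numeric keys all land on reducer 0
        }
        // keep the hash non-negative and within [1, numPartitions - 1]
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

A driver would enable it with job.setPartitionerClass(FirstCharPartitioner.class) and choose the number of reducers with job.setNumReduceTasks(n).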

10. Task Execution


Each job consists of multiple Map and Reduce tasks:

●​ Map Task Execution:​

○​ Reads input using RecordReader​

○​ Emits intermediate key-value pairs​

○​ Writes to local disk​

●​ Reduce Task Execution:​

○​ Fetches Map output from different nodes​


○​ Sorts and merges data​

○​ Executes Reducer function​

○​ Writes output to HDFS​

Both types of tasks are managed by NodeManager and executed in containers (YARN).

MapReduce Types in Big Data


MapReduce is a core component of big data processing frameworks like Hadoop. It allows
developers to process vast amounts of data using a simple programming model based on Map
and Reduce functions. Depending on the data processing requirements, different types of
MapReduce jobs can be implemented. Each type serves specific use cases and can be
optimized for performance and resource usage.

1. Map-Only Job
Description:

●​ A Map-only job uses only the Mapper phase. There is no Reducer involved.​

●​ Useful when the processing doesn't require aggregation or grouping of data.​

Use Cases:

●​ Filtering data (e.g., removing null records)​

●​ Data formatting or transformations​

●​ Log parsing​

●​ Data sampling​

Example:
Reading a log file and extracting all records with HTTP status code 500.


Mapper: Input (line of text) → Output (filtered line)

Reducer: Not used

Benefits:

●​ Faster execution since the Reduce phase is skipped​

●​ Reduced overhead and network I/O​

2. Map + Reduce Job


Description:

●​ This is the most common and widely used MapReduce job type.​

●​ It includes both Mapper and Reducer phases:​

○​ Mapper emits intermediate key-value pairs.​

○​ Reducer processes these pairs and outputs final results.​

Use Cases:

●​ Word Count​

●​ Aggregation (sum, average, min, max)​

●​ Grouping and classification​

●​ Joining datasets​
Example:

Counting the frequency of each word in a text file:


Mapper: (line) → (word, 1)

Reducer: (word, list of counts) → (word, total_count)
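
In Java, this corresponds roughly to the following Mapper and Reducer (a standard word-count sketch using the org.apache.hadoop.mapreduce API; class names are illustrative, and each class would live in its own .java file):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (byte offset, line of text) -> (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // emit (word, 1)
        }
    }
}

// Reducer: (word, [1, 1, ...]) -> (word, total_count)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);               // emit (word, total_count)
    }
}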

3. MapReduce with Combiner


Description:

●​ The Combiner is an optional component that acts as a mini-Reducer at the Mapper


node.​

●​ It performs local aggregation before data is sent to the Reducer.​

Use Cases:

●​ When reduce operations are associative and commutative (e.g., summing values)​

Example:

In a word count job, the Combiner sums counts at the local node before passing them to the
Reducer.

Benefits:

●​ Reduces data transfer across the network​

●​ Improves job performance
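
Because word-count summation is associative and commutative, the Reducer class itself can be reused as the Combiner. Assuming the WordCountReducer sketched above, enabling it is a single additional line in the driver:

// In the WordCountDriver sketched earlier, after job.setReducerClass(...):
job.setCombinerClass(WordCountReducer.class);   // local, per-mapper aggregation before the shuffle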


In MapReduce, input and output formats determine how data is read into and
written from the framework. Input formats handle data partitioning and
parsing, while output formats manage the final data storage. They both work
with key-value pairs, a core concept in MapReduce.

1. InputFormat in MapReduce
The InputFormat defines:

●​ How input files are split.​

●​ How data is read and transformed into key-value pairs for the Mapper.​

Each InputFormat provides:

●​ A method to split the data (getSplits()).​

●​ A RecordReader to convert split data into key-value pairs.​

Common InputFormats

1.1 TextInputFormat (Default)

●​ Splits input files by lines.​

●​ Each line is read as a value with the key as the byte offset of the line.​

●​ Suitable for processing plain text files.​

Example:

Key = byte offset, Value = “This is a line of text”

1.2 KeyValueTextInputFormat
●​ Treats each line as a key-value pair split by a delimiter (default: tab \t).​

●​ Useful when data is already in a key-value format.​

Example:

Input Line (tab-separated): ID001\tJohn

Output: Key = "ID001", Value = "John"

1.3 SequenceFileInputFormat

●​ Reads data stored in Hadoop’s binary SequenceFile format.​

●​ More efficient than text formats for intermediate data.

1.4 NLineInputFormat

●​ Assigns a fixed number of lines (N) to each Mapper.​

●​ Useful when each Mapper should receive a small, fixed amount of input, for example when every input line triggers an expensive computation.​

1.5 DBInputFormat

●​ Reads data from a relational database using JDBC.​

●​ Data is divided into InputSplits using a SQL query.​
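
Choosing an InputFormat is a job-level setting. The sketch below illustrates two of the formats above (class and property names are the standard ones from org.apache.hadoop.mapreduce.lib.input; the delimiter and line count are illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatSetup {

    // Read key-value lines, using a comma instead of the default tab delimiter
    static void useKeyValueInput(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.getConfiguration().set(
                "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    }

    // Hand each mapper exactly 100 input lines
    static void useNLineInput(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100);
    }
}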

2. OutputFormat in MapReduce
The OutputFormat defines:

●​ How the results from Reducers are written to storage.​


●​ How the key-value pairs produced by Reducers are serialized and
stored.

Common OutputFormats

2.1 TextOutputFormat (Default)

●​ Writes results in plain text files.​

●​ Each key-value pair is written on a new line separated by a tab.​

Example Output:

apple​ 3
banana​ 7

2.2 SequenceFileOutputFormat

●​ Stores output in a binary format using Hadoop’s SequenceFile format.​

●​ Used when output needs to be consumed by other MapReduce jobs.​


Benefits:

●​ Faster processing.​

●​ Supports compression.​

2.3 MultipleOutputs

●​ Allows writing data to multiple files from a single MapReduce job.​

●​ Useful when you want to separate output based on some condition


(e.g., error logs vs. success logs).​

2.4 LazyOutputFormat

●​ Prevents the creation of empty output files when no output is produced.​

●​ Useful for optimizing storage and reducing clutter.

2.5 DBOutputFormat

●​ Used to write output directly to a relational database using JDBC.​

●​ Common in data warehousing and integration with traditional RDBMS.
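
OutputFormats are configured the same way. A hedged sketch combining a few of the options above (classes from org.apache.hadoop.mapreduce.lib.output; the Gzip codec is an illustrative choice):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatSetup {

    // Binary, compressible output intended for consumption by other MapReduce jobs
    static void useSequenceFileOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }

    // Plain-text output, but empty part files are not created (LazyOutputFormat)
    static void useLazyTextOutput(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}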

14. MapReduce Features


Key features that make MapReduce powerful and scalable:

●​ Scalability: Handles petabytes of data across thousands of machines.​

●​ Fault Tolerance: Recovers from task and node failures automatically.​

●​ Parallelism: Breaks down tasks into smaller units for concurrent execution.​

●​ Data Locality: Moves computation closer to the data to reduce network I/O.​

●​ Resource Optimization: Efficient use of hardware with job scheduling and task management.

15. Real-world MapReduce Applications


MapReduce has been used in various domains:

●​ Search Engines: Indexing the web (e.g., Google, Bing).​

●​ E-commerce: Customer analytics and recommendation systems.​

●​ Log Analysis: Web server log processing for monitoring and alerts.​

●​ Social Media: Processing social graphs and user behavior analysis.​

●​ Financial Systems: Fraud detection and real-time analytics.​

●​ Scientific Research: DNA sequencing, simulations, and climate


modeling.
Design of HDFS in Big Data
The Hadoop Distributed File System (HDFS) is the primary storage system
used by the Hadoop framework. It is designed to store and manage massive
volumes of data across a distributed cluster of commodity machines, offering
high throughput, scalability, fault tolerance, and cost efficiency. HDFS is a
fundamental component of big data architecture.

1. Objectives of HDFS Design


●​ High fault tolerance​

●​ Scalability to petabytes of data​

●​ Streaming data access​

●​ Simple and cost-effective hardware​


●​ Write-once-read-many model​

2. Architecture of HDFS
HDFS follows a master-slave architecture consisting of:

2.1 NameNode (Master)

●​ Maintains the metadata of the file system.​

●​ Keeps track of the directory structure, file names, and the mapping
of blocks to DataNodes.​

●​ Stores metadata in RAM for fast access and on disk for recovery.​

●​ There is typically one active NameNode and one standby NameNode


for high availability.​

2.2 DataNode (Slaves)

●​ Responsible for storing the actual data blocks.​

●​ Each DataNode manages the storage on its local disk and periodically
sends heartbeats and block reports to the NameNode.​

●​ Reads/writes requests from clients are handled directly by DataNodes.​

2.3 Secondary NameNode

●​ Often misunderstood — it does not act as a backup NameNode.​


●​ Periodically merges the FSImage and EditLogs from the NameNode
to reduce the size of EditLogs.​

●​ Helps in faster NameNode restarts by producing a new checkpoint of


metadata.

1. FSImage (File System Image)


Definition:

FSImage is a persistent checkpoint of the HDFS namespace. It stores the


entire file system metadata in a snapshot format.

Details:

●​ Created and maintained by the NameNode.​

●​ Contains metadata such as:​

○​ Directory structure​

○​ File-to-block mapping​

○​ Permissions, ownership​

○​ Replication factors​

Purpose:

●​ To persist the current state of the file system.​

●​ Loaded into memory when the NameNode starts up.​


2. EditLogs: EditLogs are transaction logs that record all the
changes made to the HDFS namespace after the last FSImage was
created.

Details:

●​ Every time a file is added, deleted, or renamed, it is logged in the


EditLogs.​

●​ Stored separately from FSImage.​

Purpose:

●​ To ensure that all modifications to metadata are not lost.​

●​ During recovery/startup, the NameNode loads FSImage first, then


replays EditLogs to reach the current state.​

Example Flow:

1.​ FSImage has snapshot up to time t1.​

2.​ EditLogs record all changes from t1 to t2.​

3.​ When NameNode restarts, it combines FSImage + EditLogs = current


namespace.

3. Rack Awareness: Rack Awareness is the HDFS feature where the physical
rack location of DataNodes is considered while placing blocks and replicas.

Why Important?

●​ Enhances fault tolerance.​

●​ Reduces network traffic during read/write.​

How It Works:

●​ Hadoop uses a topology script to determine the rack of each node.​

●​ With the default replication factor of 3, replicas are placed as:​

○​ 1st replica on the local node (or a node on the local rack)​

○​ 2nd replica on a node in a different (remote) rack​

○​ 3rd replica on a different node in the same remote rack as the second​

Benefits:

●​ Protects against rack failure.​

●​ Optimizes bandwidth usage.​

4. Secondary NameNode: The Secondary NameNode is a helper node to the
primary NameNode, used to periodically merge FSImage and EditLogs.

Important Note:

●​ Not a backup NameNode!​

●​ Doesn't take over if the NameNode crashes.​

Function:
1.​ Downloads FSImage and EditLogs from the NameNode.​

2.​ Applies EditLogs to FSImage to create a new merged FSImage.​

3.​ Sends the updated FSImage back to NameNode.​

Why Needed?

●​ Prevents EditLogs from growing too large.​

●​ Helps in quicker NameNode restart.​

5. Namespace in HDFS
Definition:

HDFS Namespace is the hierarchical structure of directories and files in the


distributed file system — much like a traditional file system.

Details:

●​ Managed entirely by the NameNode.​

●​ Does not store actual data, only metadata.​

Includes:

●​ File and folder names​

●​ Ownership, permissions​

●​ Mapping of files to blocks and blocks to DataNodes​


Key Point:

It allows users to perform file system operations like:

●​ Create/Delete files​

●​ Rename files/folders​

●​ Manage access controls​

6. Replication in HDFS: Replication in HDFS refers to storing


multiple copies of each data block across different DataNodes to
ensure reliability and fault tolerance.

Default Replication Factor: 3

Strategy:

1.​ First replica on the local node (where client writes).​

2.​ Second replica on a different rack.​

3.​ Third on a different node in the same rack as the second.​

Benefits:

●​ Tolerates node/rack failures.​

●​ Ensures high data availability.​

●​ Enables load balancing during read operations.​

Management:
●​ If a replica is lost, the NameNode detects it and triggers re-replication.​

●​ Admins can configure replication factor per file.​
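
Programmatically, the per-file replication factor can be inspected or changed through the FileSystem API; the following is a small sketch (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/archive/old-logs.txt");

        // Current replication factor recorded in the NameNode metadata
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Lower the replication factor for cold data to save disk space;
        // the NameNode schedules deletion of the excess replicas.
        fs.setReplication(file, (short) 2);
    }
}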

3. HDFS File Storage Design


3.1 Blocks

●​ Files in HDFS are split into fixed-size blocks (default: 128 MB).​

●​ Each block is replicated across multiple DataNodes (default: 3


replicas).​

●​ Replication ensures fault tolerance and data availability.​

3.2 Replication

●​ Default replication factor is 3:​

○​ One replica on the node where the client writes (local rack)​

○​ One on a node in a different (remote) rack​

○​ One on another node in that same remote rack​

●​ Balances performance and fault tolerance.​

4. Data Flow in HDFS


4.1 Write Operation
●​ Client contacts the NameNode to get block locations.​

●​ Client directly writes blocks to the selected DataNodes in a pipeline


fashion.​

●​ Acks are sent back once replication is complete.​

4.2 Read Operation

●​ Client contacts the NameNode for the block locations.​

●​ Client fetches blocks directly from the nearest DataNode.​

●​ Supports parallel reading of blocks for faster performance.​

5. Fault Tolerance and Recovery


●​ If a DataNode fails, the NameNode detects it via missing heartbeats
and initiates replication of its blocks to maintain the replication factor.​

●​ If a NameNode fails, a Standby NameNode can take over (in HA


setup).​

●​ HDFS can automatically re-replicate lost blocks.​


HDFS (Hadoop Distributed File System) offers significant benefits for big data
storage and processing, including fault tolerance, scalability, and
cost-effectiveness. However, it also faces challenges like low latency for
small file access and the need for careful cluster management.

Benefits:

​ Fault Tolerance:​
HDFS automatically detects and recovers from hardware failures, ensuring data
availability and reliability.
​ Scalability:​
HDFS can handle massive datasets and easily scale to accommodate growing data
volumes.
​ Cost-Effective:​
HDFS utilizes inexpensive, off-the-shelf hardware, making it a cost-effective solution
for big data storage.
​ Data Locality:​
Data is stored directly on the DataNodes, reducing network traffic and improving
processing speed.
​ Open Source:​
HDFS is open source, allowing for community contributions and customization.
​ Diverse Data Formats:​
HDFS can store a wide range of data formats, including structured, semi-structured,
and unstructured data.
​ High Throughput:​
Optimized for high data throughput, making it suitable for streaming data
applications.

Challenges:

​ Small File Problem:​


HDFS can be inefficient when dealing with a large number of small files, as each
small file requires a separate seek operation.
​ Low Latency:​
HDFS is primarily designed for high throughput and may not be ideal for
applications requiring low-latency access to data.
​ Cluster Management:​
Managing and maintaining a large HDFS cluster can be complex, requiring
expertise in Hadoop administration and resource management.
​ Security:​
While HDFS has basic security measures, comprehensive security implementation
may require additional configurations and integrations.
​ Data Quality:​
Ensuring data quality in a large-scale Hadoop environment can be a challenge,
requiring appropriate data cleansing and validation procedures

Benefits of HDFS
1. Fault Tolerance

●​ HDFS automatically replicates data blocks across multiple DataNodes.​

●​ If a node fails, data is retrieved from another replica.​


●​ The default replication factor is 3, ensuring data reliability and
availability.​

2. Scalability

●​ HDFS can scale horizontally by simply adding more nodes to the


cluster.​

●​ Handles petabytes of data efficiently.​

●​ Designed to grow with increasing data demands.​

3. High Throughput

●​ Optimized for batch processing rather than random reads.​

●​ Enables fast data transfer rates between Hadoop jobs and the file
system.​

●​ Suited for applications with large data sets.​

4. Data Locality

●​ HDFS moves computation to where data resides, reducing network


I/O.​

●​ Enhances processing speed by leveraging local disk reads instead of


network reads.​

5. Cost-Effective

●​ Runs on commodity hardware, reducing infrastructure cost.​


●​ Open-source and maintained by the Apache Foundation, eliminating
license costs.​

6. Write Once, Read Many Model

●​ Files are written once and read many times, which simplifies data
coherency and eliminates complex locking mechanisms.​

●​ Ideal for data archival and log storage.​

7. Support for Large Files

●​ HDFS is capable of storing very large files (hundreds of gigabytes to


terabytes).​

●​ Splits files into blocks (default: 128 MB or 256 MB) and distributes them
across the cluster.​

8. Integration with Hadoop Ecosystem

●​ Seamlessly works with other tools like MapReduce, Hive, Pig, HBase,
and Spark.​

●​ Makes it a complete Big Data processing platform.

Challenges of HDFS
1. Not Suitable for Small Files

●​ Each file and block has metadata stored in NameNode’s RAM.​

●​ Too many small files can overload the NameNode, degrading


performance.​
2. Single Point of Failure (Without HA)

●​ The NameNode is the only master managing the metadata.​

●​ If the NameNode crashes (and HA is not configured), the entire HDFS


becomes unavailable.​

3. Latency in Real-Time Applications

●​ HDFS is designed for high throughput, not low latency.​

●​ Not ideal for real-time data processing or transactional systems.​

4. Append-Only File System

●​ HDFS does not support updating files in place.​

●​ Supports only appending to files — limits use in dynamic data storage.​

5. Complex Management

●​ Requires skilled personnel for cluster management, monitoring, and


tuning.​

●​ Managing node failures, data balancing, and resource optimization can


be complex.​

6. Data Replication Overhead

●​ Storing multiple copies (replicas) of data increases storage


requirements.​

●​ May lead to higher costs in terms of disk space usage.​


7. Security Concerns

●​ Basic HDFS does not offer strong security.​

●​ Needs integration with Kerberos, Ranger, or Sentry for access control


and auditing.​

8. Heavy Memory Usage

●​ NameNode stores the entire namespace and metadata in memory.​

●​ Requires large RAM capacity as the number of files and blocks


increases.

🔹 1. File Sizes in HDFS


📌 Definition:
In Big Data systems, the files to be processed are often very large — ranging
from hundreds of megabytes to several terabytes.

🧾 Characteristics:
●​ HDFS is optimized for large file sizes rather than numerous small files.​

●​ Typical file types stored:​

○​ Log files​

○​ Sensor data​

○​ Audio/video data​

○​ Scientific datasets​
🚫 Small File Problem:
●​ Storing many small files (less than block size) is inefficient, as each file
consumes NameNode memory for metadata.​

●​ Too many small files can overload the NameNode and degrade cluster
performance.​

🔹 2. Block Sizes in HDFS


📌 Definition:
HDFS splits large files into fixed-size blocks, which are stored across
different DataNodes.

📐 Default Block Size:


●​ Hadoop 1.x: 64 MB​

●​ Hadoop 2.x and above: 128 MB​

●​ Can be configured per file or system-wide​

✅ Advantages of Large Block Sizes:


●​ Reduces seek time and disk I/O.​

●​ Improves data transfer rates.​

●​ Minimizes metadata overhead on the NameNode.​

📦 Example:
If a 512 MB file is stored and block size is 128 MB:
●​ File will be split into 4 blocks:​

○​ Block 1: 128 MB​

○​ Block 2: 128 MB​

○​ Block 3: 128 MB​

○​ Block 4: 128 MB
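
The block size is an ordinary configuration property (dfs.blocksize in Hadoop 2.x and above). A hedged sketch of overriding it and inspecting the block size recorded for an existing file (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Override the default block size (128 MB) with 256 MB for files
        // created through this configuration.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // Inspect the block size actually recorded for an existing file
        Path file = new Path("/user/hadoop/big-input.log");
        long blockSize = fs.getFileStatus(file).getBlockSize();
        System.out.println("Block size (bytes): " + blockSize);
    }
}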

🔹 3. Block Abstraction in HDFS


📌 What is a Block?
A block in HDFS is a logical representation of data.

●​ Unlike traditional file systems, where block size is small (4 KB or 8 KB),


HDFS blocks are much larger.​

●​ A block in HDFS does not represent a fixed-size chunk on disk, but


rather a unit of storage abstraction.

In Hadoop Distributed File System (HDFS), block abstraction refers to the


way large files are split into smaller, manageable chunks called blocks, which
are then stored across multiple nodes (DataNodes) in the cluster. This
method simplifies file management, allows for data locality, and enables
efficient parallel processing.

Here's a more detailed explanation:

Key aspects of block abstraction in HDFS:

​ Splitting files into blocks:​


HDFS divides files into blocks, with a default block size of 128 MB.
​ Storing blocks across DataNodes:​
These blocks are then distributed and stored across the DataNodes within the
HDFS cluster.
​ Data locality:​
HDFS aims to store data blocks on the same nodes where computation is
performed, minimizing network traffic and improving performance.
​ Replication for fault tolerance:​
Each block is replicated multiple times (typically three copies) across different
DataNodes to ensure data availability and fault tolerance.
​ NameNode's role:​
The NameNode manages block metadata, including their location and replication,
and communicates with DataNodes to handle block-related operations.
​ Benefits of block abstraction:
●​ Simplified file management: Breaking files into blocks makes it easier to
manage large files in a distributed environment.
●​ Improved storage utilization: HDFS can efficiently utilize storage space by
distributing blocks across different nodes.
●​ Enhanced parallel processing: HDFS can parallelize data processing by
allowing multiple mappers to process different blocks simultaneously.
●​ Fault tolerance and availability: Replication of blocks across multiple nodes
ensures that data is not lost if a node fails.
Data Replication in HDFS
Hadoop Distributed File System (HDFS) is designed to reliably store very large files across a distributed environment. To
ensure data availability, fault tolerance, and reliability, HDFS implements a powerful mechanism known as data
replication.

🔹 What is Data Replication?


Replication in HDFS refers to the process of storing multiple copies of each data block across different DataNodes in
the Hadoop cluster.

●​ This redundancy ensures that data is not lost even if a node fails.​

●​ Every file in HDFS is divided into blocks (default: 128 MB or 256 MB).​

●​ Each block is replicated and stored on multiple nodes.​

🔹 Default Replication Factor


●​ The default replication factor is 3.​

●​ This means each block of a file is stored on three different DataNodes.​

●​ Can be configured at:​

○​ Cluster level (default for all files)​

○​ File level (per-file basis during creation)​

🔹 Replica Placement Policy


HDFS uses a rack-aware replica placement strategy to ensure fault tolerance and data availability.

📌 Typical Placement (Replication Factor = 3):


1.​ First replica is placed on the local node (where the client is writing).​

2.​ Second replica is placed on a node in a different rack.​

3.​ Third replica is placed on a different node in the same rack as the second.​

This ensures:
●​ One local copy​

●​ One remote rack copy (for disaster recovery)​

●​ One intra-rack copy (to balance network bandwidth)​

🔹 Importance of Data Replication


✅ 1. Fault Tolerance
●​ If one or two nodes fail, the data can still be retrieved from other replicas.​

●​ Ensures high data reliability.​

✅ 2. Data Availability
●​ Multiple replicas mean more chances of data being available at all times.​

●​ Crucial in large-scale distributed environments.​

✅ 3. Load Balancing
●​ Read operations can be served by any replica, helping in load distribution.​

●​ HDFS selects the closest replica to minimize read latency.​

✅ 4. Cluster Performance
●​ Replication ensures parallel data processing across different nodes.​

●​ Enhances performance of MapReduce and other jobs.​

🔹 Handling Block Failures


HDFS uses the Block Scanner and DataNode heartbeat mechanisms:

●​ The NameNode continuously monitors the health of DataNodes through heartbeat signals.​

●​ If a block becomes under-replicated (due to node failure), the NameNode schedules replication of the missing
block on a healthy node.​

●​ If a block is over-replicated, the NameNode may delete excess replicas to save space.

How Does HDFS Store Data?


Step-by-Step: How HDFS Stores Data

1. File Division into Blocks

●​ A file to be stored in HDFS is split into fixed-size blocks (default size: 128 MB
or 256 MB).​

●​ For example, a 512 MB file with 128 MB block size is split into 4 blocks.​

2. Block Abstraction

●​ Each block is treated as an independent unit of storage.​

●​ Blocks are identified by block IDs and are not stored contiguously.​

3. Block Replication

●​ Each block is replicated (default: 3 copies) and distributed across different


DataNodes for fault tolerance.​

●​ Replica placement strategy:​

○​ One replica on local node (if possible)​

○​ One on a remote rack​

○​ One on another node in the same rack as the second​

4. Metadata Handling by NameNode

●​ The NameNode does not store the actual data.​

●​ It stores metadata like:​

○​ File-to-block mapping​

○​ Block-to-DataNode mapping​
○​ Access permissions​

●​ All this metadata is kept in memory (RAM) for fast access.​

5. Block Storage in DataNodes

●​ DataNodes store the blocks on their local disk.​

●​ They also send heartbeat and block reports periodically to the NameNode.​

6. Replication Management

●​ If the NameNode detects a block is under-replicated, it instructs other


DataNodes to create new replicas.​

●​ If a DataNode fails, replication is triggered automatically.

Example

Suppose a 300 MB file is uploaded to HDFS with a block size of 128 MB and
replication factor of 3:

●​ It is divided into:​

○​ Block A: 128 MB​

○​ Block B: 128 MB​

○​ Block C: 44 MB​

●​ Each block is stored as 3 replicas:​

○​ A1, A2, A3 (on 3 different DataNodes)​

○​ B1, B2, B3​

○​ C1, C2, C3​


Client uploads file → the HDFS client splits it into blocks

File.txt → [Block A] [Block B] [Block C]

Each block is then replicated across DataNodes, e.g.
Block A → DataNode1 (A1), DataNode2 (A2), DataNode3 (A3)

How HDFS Writes a File

📌 Step-by-Step Write Process:


1.​ Client sends write request to the NameNode:​

○​ The client contacts the NameNode to create a file.​

○​ NameNode checks if the file already exists and permissions are valid.​

2.​ NameNode allocates blocks:​

○​ The NameNode responds with the list of DataNodes to store the first
block and its replicas.​

3.​ Client writes data to DataNodes:​

○​ The client divides the file into packets and writes them to the first
DataNode.​

○​ The first DataNode streams the data to the second, and then to the
third (pipeline replication).​

4.​ Acknowledgement:​
○​ Once all DataNodes have received the block, they send
acknowledgments back through the pipeline.​

○​ Only after all replicas are stored, the write is considered complete.​

5.​ Next block is written:​

○​ Process repeats for the next block until the entire file is
written.​

🔹 3. How HDFS Reads a File


📌 Step-by-Step Read Process:
1.​ Client sends read request to the NameNode:​

○​ The client contacts the NameNode for the file metadata.​

○​ NameNode returns a list of DataNodes holding the blocks of


the file.​

2.​ Client reads data from DataNodes:​

○​ The client selects the closest replica (based on network


proximity).​

○​ Data is read block by block, directly from the DataNodes.​

3.​ Streaming read:​

○​ HDFS supports streaming access, i.e., data is processed as


it is read.​
4.​ No contact with NameNode for each block:​

○​ After getting the block locations at the beginning, the client


communicates directly with DataNodes.

Java Interfaces to HDFS


The Hadoop Distributed File System (HDFS) provides a set of Java APIs that
allow developers to programmatically interact with HDFS for file operations such
as reading, writing, deleting, listing files, and managing directories.

These interfaces are part of the Hadoop Common library and are essential for
building Java-based applications that work directly with HDFS.

🔹 1. org.apache.hadoop.fs.FileSystem Class
This is the core interface in Hadoop to interact with file systems
(including HDFS).
📌 Purpose:
●​ Abstracts the interaction between applications and various file
systems (e.g., HDFS, local FS, S3).​

📦 Key Methods:
Method                      Description
create(Path path)           Creates a new file and returns an FSDataOutputStream
open(Path path)             Opens a file for reading (FSDataInputStream)
delete(Path path, boolean)  Deletes a file or directory
mkdirs(Path path)           Creates a directory recursively
listStatus(Path path)       Lists the contents of a directory
exists(Path path)           Checks if a file exists

🔹 2. FSDataInputStream and FSDataOutputStream


These classes handle streaming I/O for files in HDFS.

🧾 FSDataInputStream:
●​ Used to read data from HDFS.​

●​ Extends DataInputStream.​

●​ Supports seek() operation to jump to any position in a file.​

🧾 FSDataOutputStream:
●​ Used to write data to HDFS.​

●​ Extends DataOutputStream.​

●​ Supports writeBytes(), writeInt(), etc.​

🔹 3. Path Class
Represents the location of a file or directory in HDFS.

●​ Similar to java.io.File, but adapted for HDFS URIs.​

●​ Example: Path p = new Path("/user/data/input.txt");​

🔹 4. Accessing HDFS via Java Code (Example)


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HDFSExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/myfile.txt");

        // Writing to HDFS
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Hello, HDFS!");
        out.close();

        // Reading from HDFS
        FSDataInputStream in = fs.open(file);
        String content = in.readUTF();
        System.out.println("Read from HDFS: " + content);
        in.close();
    }
}

🔹 5. Other Important Classes


Class           Role
FileStatus      Contains metadata about files (length, permissions)
RemoteIterator  For iterating over file listings
FileContext     Alternative API to FileSystem for advanced use cases
Configuration   Holds configuration parameters (like the HDFS URI)
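
A small sketch of how these classes fit together when listing a directory (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListingExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hadoop");

        // Immediate children of the directory, with their metadata
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Recursive listing of files only, streamed via RemoteIterator
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
        while (it.hasNext()) {
            System.out.println(it.next().getPath());
        }
    }
}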

Hadoop CLI
The Hadoop Command Line Interface (CLI) provides an essential bridge
between users and the Hadoop ecosystem. It is a Unix-style command-line
tool that lets users execute commands directly on HDFS or perform
administrative tasks such as starting and stopping daemons, copying files,
and checking job statuses. It supports file operations, job submission, and
system monitoring through simple, shell-like commands. Mastery of CLI
commands enables developers, administrators, and analysts to efficiently
interact with HDFS, manage data, and run MapReduce jobs in a scalable and
scriptable way. These commands are typically executed in a Linux terminal
or through a script using:

hadoop fs <command>

hdfs dfs <command>   (the older form hadoop dfs is deprecated)

2. Common HDFS Commands

Command                            Description
hdfs dfs -ls /                     Lists all files/directories in the root directory
hdfs dfs -mkdir /user/hadoop/dir   Creates a new directory
hdfs dfs -put file.txt /user/      Uploads a local file to HDFS
hdfs dfs -get /user/file.txt .     Downloads a file from HDFS to the local file system
hdfs dfs -cat /user/file.txt       Displays the contents of a file
hdfs dfs -rm /user/file.txt        Deletes a file from HDFS
hdfs dfs -du -h /user/             Shows disk usage for a directory in HDFS
hdfs dfs -copyFromLocal            Copies files from the local file system to HDFS
hdfs dfs -copyToLocal              Copies files from HDFS to the local file system

🔹 3. Administrative Hadoop Commands


These are used to manage the cluster services:
Command Description

start-dfs.sh Starts the HDFS daemons (NameNode, DNs)

stop-dfs.sh Stops the HDFS daemons

start-yarn.sh Starts the YARN ResourceManager and NodeManagers

stop-yarn.sh Stops the YARN services

jps Displays running Java processes (to verify Hadoop daemons)

5. CLI Configuration

CLI commands rely on the Hadoop configuration files such as:

●​ core-site.xml​

●​ hdfs-site.xml​

●​ mapred-site.xml​

●​ yarn-site.xml​

These files are located in the $HADOOP_HOME/etc/hadoop directory and define


properties such as fs.defaultFS, block size, replication factor, etc.
🔹 6. Features of Hadoop CLI
●​ ✅ Simple and intuitive​

●​ ✅ Supports most file system operations​

●​ ✅ Can be used in shell scripts for automation​

●​ ✅ Integrates with Linux permissions and file patterns​

●​ ✅ Can be extended with custom commands​

🔹 7. Limitations of Hadoop CLI


●​ ❌ Not suitable for real-time monitoring​

●​ ❌ Lacks GUI for beginners​

●​ ❌ Complex cluster operations may need scripting or UI (like Ambari or


Cloudera Manager)​

Hadoop File System Interfaces


The Hadoop File System (HDFS) is the primary storage system of the
Hadoop ecosystem. It is designed to store large datasets across multiple
machines with fault tolerance and high throughput. To interact with HDFS and
other file systems, Hadoop provides several interfaces.

These interfaces allow users and applications to perform file operations,


access data, and manage file system metadata both programmatically and
through command-line tools.
Types of Hadoop File System Interfaces

Hadoop supports multiple types of interfaces for different users and use
cases:

Interface Type        Purpose
CLI (Command Line)    For interacting with HDFS from the terminal
Java API              For developers to build Hadoop-enabled apps
Web UI                For monitoring and browsing files via a browser
REST API (WebHDFS)    For accessing HDFS over HTTP from remote clients
FUSE                  To mount HDFS as a local file system
Streaming API         For MapReduce programs that process streaming data

Command Line Interface (CLI)


The most commonly used interface for administrators and developers.

Commands like:

hdfs dfs -ls /user

hdfs dfs -put file.txt /user/data/


hdfs dfs -get /user/data/file.txt .

Let users:

●​ List directories​

●​ Upload/download files​

●​ Create/remove directories​

●​ Check disk usage​

Useful for scripting and automation of HDFS operations.

🔹 3. Java API (Programmatic Access)


Used by developers to write applications that interact with HDFS.

Key Classes:

●​ org.apache.hadoop.fs.FileSystem: Core class to interact with


HDFS.​

●​ FSDataInputStream, FSDataOutputStream: For reading/writing


files.​

●​ Path: Represents file or directory paths.​

Example:

FileSystem fs = FileSystem.get(new Configuration());


Path path = new Path("/user/data.txt");

FSDataInputStream in = fs.open(path);

This is the most powerful interface, used by internal Hadoop components


and custom applications.

🔹 4. Web Interface (HDFS Web UI)


HDFS provides a browser-based UI (usually at http://namenode:9870/)
that allows users to:

●​ View file system hierarchy​

●​ Monitor file blocks, replication​

●​ Track data nodes and their status​

●​ Diagnose NameNode metrics and logs​

It is mainly for administrators and debugging.

What is Data Flow in Hadoop?


Data flow in Hadoop refers to the path taken by data from its source to its
destination (usually HDFS). It involves multiple components:

1.​ Data Sources – Web servers, databases, social media, IoT devices,
etc.​

2.​ Ingestion Tools – Flume, Sqoop, Kafka, NiFi​


3.​ Storage Systems – HDFS, HBase, Hive​

4.​ Processing Frameworks – MapReduce, Spark​

5.​ Output / Reporting – Dashboards, ML models, business reports

Data Flow and Data Ingest with Flume and


Sqoop in Big Data
In Big Data systems, data ingestion refers to the process of collecting and
importing data for immediate use or storage in a database or data warehouse.
Two common tools used for ingesting data into the Hadoop ecosystem are
Apache Flume and Apache Sqoop. These tools are designed for efficient
and reliable data flow from various sources into the Hadoop Distributed File
System (HDFS).

Apache Flume – For Streaming Data Ingestion


🔹 What is Flume?
Apache Flume is a distributed, reliable, and available system for efficiently
collecting, aggregating, and moving large amounts of log data or event
data into HDFS or HBase.

🔹 Architecture:
Flume is built around a simple and flexible architecture based on streaming
data flow.

Key components:

●​ Source: Accepts data (e.g., Avro source, Netcat source, exec source)​

●​ Channel: Temporary store (e.g., memory, file-based)​


●​ Sink: Destination (e.g., HDFS, HBase)​

🔹 Data Flow in Flume:

The Flume agent runs a configuration of these three components. Multiple agents can be used in a multi-hop or fan-in/fan-out configuration.
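A minimal agent configuration sketch for this data flow (the agent, source, channel, and sink names a1, r1, c1, and k1 are illustrative):

# flume.conf – one agent with a netcat source, memory channel, and HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent would then be started with flume-ng agent --name a1 --conf-file flume.conf.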

🔹 Example Use Case:


●​ Ingesting real-time logs from a web server to HDFS for later analysis
using Hive or Spark.​

🔹 Benefits of Flume:
●​ Handles streaming data​

●​ Supports reliability (event delivery guarantees)​

●​ Customizable (interceptors, serializers)

Apache Sqoop – For RDBMS to HDFS Data Transfer


🔹 What is Sqoop?
Apache Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational
databases (e.g., MySQL, PostgreSQL, Oracle).

🔹 Sqoop Architecture:
Sqoop uses MapReduce jobs internally to parallelize the import/export
process, thus making data transfers fast and scalable.

🔹 Data Flow in Sqoop:

Sqoop connects to the source database over JDBC, reads the table metadata, and launches parallel map tasks; each task imports a slice of the table and writes it to files in HDFS (or to Hive/HBase). Exports follow the reverse path, reading HDFS files and writing rows back to the database.

🔹 Flume vs. Sqoop Comparison:

Feature         | Flume                         | Sqoop
Data Type       | Semi-structured / log / event | Structured (RDBMS)
Source          | Logs, real-time streams       | Relational databases
Use Case        | Streaming ingestion           | Bulk data transfer
Integration     | HDFS, HBase                   | HDFS, Hive, HBase
Scheduling      | Continuous                    | Batch
Underlying Tech | Streaming pipeline            | MapReduce
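As an illustration, a typical Sqoop import (the connection string, table name, and target directory are hypothetical) looks like:

sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --table customers \
  --target-dir /user/data/customers \
  --num-mappers 4

Sqoop turns this into four parallel map tasks, each importing a slice of the customers table into files under /user/data/customers.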

Hadoop Archives (HAR)


In Hadoop, managing millions of small files is inefficient and burdensome. Each file,
regardless of its size, stores metadata in the NameNode, which is kept in memory.
As a result, too many small files can overload the NameNode, reducing cluster
efficiency and increasing load times.

To solve this problem, Hadoop provides a special storage format called Hadoop
Archives (HAR files). These archives help in logically grouping multiple small files
into a larger archive, thereby optimizing storage and access.

🗂️ What is a Hadoop Archive (HAR)?


●​ A HAR file is a container for a set of files in HDFS, packaged into a single
archive.​

●​ It helps reduce the metadata burden on the NameNode.​

●	While individual files remain accessible, they are no longer treated as separate files in HDFS.

🧱 Structure of HAR Files


A HAR file contains:

1.​ _index – Stores metadata (file names, offsets, and lengths)​

2.​ _masterindex – Stores archive-level information​

3.​ Data files – Contain the actual contents of the original files​

Together, these files make it possible to retrieve individual files from the archive
without unarchiving everything.
🛠️ Creating a HAR File
HAR files are created using the Hadoop command-line interface:

hadoop archive -archiveName archive.har -p /input_dir /output_dir

Example:

hadoop archive -archiveName logs.har -p /logs/raw /logs/archive

This creates an archive named logs.har under /logs/archive/ containing all files from /logs/raw.
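Files inside the archive remain readable through the har:// URI scheme; for example (server1.log is a hypothetical file from the original /logs/raw directory):

hdfs dfs -ls har:///logs/archive/logs.har
hdfs dfs -cat har:///logs/archive/logs.har/server1.log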

Hadoop I/O: Compression, Serialization, Avro, and


File-Based Data Structures
In the Hadoop ecosystem, handling large volumes of data efficiently is crucial.
Hadoop I/O operations involve the movement, storage, and processing of data.
Optimizing Hadoop I/O helps reduce disk I/O, speed up processing, and lower
storage costs.

The major components of Hadoop I/O include:

●​ Data Compression​

●​ Serialization​

●​ Apache Avro​

●​ File-based Data Structures​

1️⃣ Compression in Hadoop


🔹 What is Compression?
Compression is the process of encoding data to reduce its storage size. In Hadoop,
compression is vital to:

●​ Save storage space in HDFS.​

●​ Reduce disk I/O and network bandwidth during data movement.​

●	Speed up MapReduce jobs.

Compression can be applied at several stages of a MapReduce pipeline:

●	Input files
●	Intermediate map outputs
●	Output files


🔹 Enabling Compression:

For example, MapReduce job output compression can be switched on in mapred-site.xml:
<property>

<name>mapreduce.output.fileoutputformat.compress</name>

<value>true</value>

</property>
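The codec can be selected with a companion property; Snappy is shown here as one common choice:

<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>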

2️⃣ Serialization in Hadoop

🔹 What is Serialization?
Serialization is the process of converting objects into a stream of bytes for storage
or transmission. Deserialization is the reverse process.

🔹 Importance in Hadoop:
●​ Transferring data between nodes in a distributed system​

●​ Reading/writing data from disk​

●​ Intermediate data in MapReduce jobs​

🔹 Hadoop’s Default Serialization Framework:


Hadoop uses Writable objects for serialization.
Example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A simplified version of Hadoop's built-in IntWritable
public class IntWritable implements Writable {
    private int value;

    // Serialize the value to the output stream
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Deserialize the value from the input stream
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }
}

🔹 Limitations of Writable:
●​ Tightly coupled to Java​

●​ Verbose and not portable​

●​ Not ideal for data exchange between languages

3️⃣ Avro in Hadoop

🔹 What is Avro?
Apache Avro is a language-neutral, schema-based serialization system
developed as part of the Apache Hadoop project. It solves the limitations of Writable.

🔹 Features of Avro:
●​ Compact and fast​

●​ Supports dynamic schemas​

●​ Suitable for cross-language data exchange​


●​ Stores schema with the data​

🔹 Avro Data Format:


●​ Data is stored in .avro files​

●​ Contains a schema in JSON format​

●​ Supports both row-based and block-based serialization​
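For instance, a small Avro schema in JSON form (the record and field names are hypothetical):

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}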

🔹 Avro Use Cases:


●​ Data serialization in MapReduce​

●​ Interoperable data exchange between languages (Java, Python, C++)​

●​ Storage format for Kafka, Hive, HDFS

4️⃣ File-Based Data Structures in Hadoop

Hadoop supports several structured file formats that are optimized for big data
workloads:

a) SequenceFile
●​ Binary key-value pair file format​

●​ Used for intermediate output in MapReduce​

●​ Supports compression​

SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, LongWritable.class, Text.class);
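Records are then appended as key-value pairs and the writer is closed (the values below are illustrative):

writer.append(new LongWritable(1), new Text("first record"));
writer.append(new LongWritable(2), new Text("second record"));
writer.close();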

b) MapFile

●​ Sorted version of SequenceFile​

●​ Allows fast lookup using an index​

●​ Suitable for random access​

c) Avro File

●​ Contains serialized data along with schema​

●​ Efficient for data exchange and storage​

●​ Supports schema evolution​

d) Parquet

●​ Columnar storage format​

●​ Ideal for analytical queries​

●​ Supported by Hive, Impala, Spark​

e) ORC (Optimized Row Columnar)


●​ Optimized for Hive​

●​ High compression, low latency​

●​ Efficient storage for structured data

Setting Up a Hadoop Cluster


A Hadoop Cluster is a collection of computers (nodes) working together to store
and process large volumes of data using the Hadoop Distributed File System
(HDFS) and the MapReduce processing model. Setting up a Hadoop cluster
involves configuring both hardware and software components to operate in a
distributed computing environment.

Prerequisites

a) Hardware Requirements:

●​ Minimum 2-4 nodes (one master + multiple slaves)​


●​ At least 8 GB RAM (per node recommended)​

●​ 100 GB+ disk space​

●​ 1 Gbps Network for optimal performance​

b) Software Requirements:

●​ Linux-based OS (e.g., Ubuntu/CentOS)​

●​ Java JDK 8/11 installed​

●​ Hadoop (latest stable version from Apache)​

Steps to Set Up a Hadoop Cluster:

1.​ Install Java​


Hadoop requires Java (usually OpenJDK).

sudo apt update

sudo apt install openjdk-11-jdk

Verify:

java -version

2.​ Create Hadoop User​


Add a dedicated user (e.g., hadoop) for managing services.

sudo adduser hadoop

sudo usermod -aG sudo hadoop​

3.​ Install Hadoop​


Download Hadoop binaries and extract them to all nodes.​
4.	Configure SSH Access​
Enable passwordless SSH from the master to all slave nodes (a sample key setup is shown after this list).​

5.​ Configure Hadoop Files​


Edit these important config files:​

○​ core-site.xml → HDFS settings​

○​ hdfs-site.xml → NameNode, DataNode settings​

○​ mapred-site.xml → JobTracker/ResourceManager settings​

○​ yarn-site.xml → NodeManager settings​

○​ slaves file → List of DataNodes​

6.​ Format NameNode​


Run: hdfs namenode -format​

7.​ Start Hadoop Services​

○​ HDFS: start-dfs.sh​

○​ YARN: start-yarn.sh​

8.​ Verify Web UIs​

○	NameNode: http://<namenode>:9870​

○	ResourceManager: http://<resourcemanager>:8088
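A minimal sketch of the passwordless SSH setup referenced in step 4, assuming the dedicated hadoop user and a worker host named slave1 (both illustrative):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa   # generate a key pair with no passphrase
ssh-copy-id hadoop@slave1                  # copy the public key to each worker node
ssh hadoop@slave1                          # verify that login no longer asks for a password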
Hadoop Cluster Specification
a) Node Types

A typical Hadoop cluster consists of:

Node Type          | Role
NameNode           | Manages the HDFS namespace and metadata
DataNode           | Stores actual data in HDFS blocks
ResourceManager    | Manages system resources and job scheduling (YARN)
NodeManager        | Runs on each node, manages tasks and resources locally
Secondary NameNode | Takes periodic snapshots of NameNode metadata (FSImage + EditLogs)

The master node requires more memory and CPU than the worker nodes.

b) Hardware Requirements

Component | Recommended Spec (Per Node)
CPU       | Quad-core or higher
RAM       | 8 GB minimum (16–64 GB for production)
Storage   | 500 GB to several TB (preferably SSD)
Network   | 1 Gbps Ethernet or better

c) Software Requirements

●​ Linux-based OS (Ubuntu/CentOS)​

●​ Java JDK 8 or 11​

●​ Hadoop (latest version, e.g., Hadoop 3.3.6)​

●​ SSH (for passwordless login)​

●​ Python (optional, for streaming)


Hadoop Configuration
To ensure that Hadoop functions correctly and efficiently, it must be properly configured.
Hadoop configuration involves setting up the environment, defining core functionalities,
tuning performance parameters, and specifying resource allocations through configuration
files.

Types of Configuration Files in Hadoop


Hadoop uses XML-based configuration files located in the HADOOP_HOME/etc/hadoop
directory.

a) core-site.xml

●​ Defines core Hadoop settings like default file system and I/O settings.

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

</property>

</configuration>

b) hdfs-site.xml

●​ Configures HDFS properties such as block size, replication, and storage paths.

<configuration>

<property>

<name>dfs.replication</name>

<value>3</value>

</property>

</configuration>
c) mapred-site.xml

●​ Contains settings for the MapReduce framework, like job tracker, input/output, and
number of reducers.

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

d) yarn-site.xml

●​ Configures YARN resource management and task scheduling.​

<configuration>

<property>

<name>yarn.resourcemanager.hostname</name>

<value>localhost</value>

</property>

</configuration>

🔐 1. Security in Hadoop
Security is crucial in big data systems as they often deal with sensitive data. Hadoop
provides multiple layers of security to protect data and access.
a) Authentication

●​ Hadoop supports Kerberos-based authentication.​

●​ It verifies the identity of users before allowing access.​

●​ Users receive a token from the Key Distribution Center (KDC).​
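As an illustration, on a Kerberos-enabled cluster a user obtains a ticket before running HDFS commands (the principal name is hypothetical):

kinit alice@EXAMPLE.COM      # request a ticket from the KDC
hdfs dfs -ls /user/alice     # subsequent commands authenticate with that ticket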

b) Authorization

●​ Defines what authenticated users are allowed to do.​

●	Implemented through Access Control Lists (ACLs) and file/directory permissions in HDFS.
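For example, permissions and ACLs can be managed from the command line (paths and user names are illustrative; ACLs must be enabled via dfs.namenode.acls.enabled):

hdfs dfs -chmod 750 /data/sales                   # POSIX-style permissions
hdfs dfs -setfacl -m user:alice:r-x /data/sales   # grant read access to one extra user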

c) Encryption

●​ Data-at-Rest: Encrypts stored data blocks.​

●​ Data-in-Transit: Uses SSL/TLS to encrypt network traffic between nodes.​

d) Auditing

●​ Logs user activities and file accesses.​

●​ Tools like Apache Ranger or Apache Sentry provide centralized security policy
management.​

🛠️ 2. Administering Hadoop
Hadoop administration involves configuring, monitoring, and managing cluster operations.

a) Cluster Setup
●​ Configure core-site.xml, hdfs-site.xml, yarn-site.xml, and
mapred-site.xml.​

●​ Setup passwordless SSH and proper user permissions.​

b) Starting/Stopping Services

start-dfs.sh # Starts HDFS

start-yarn.sh # Starts YARN

stop-dfs.sh # Stops HDFS

stop-yarn.sh # Stops YARN

c) Job and Resource Management

●	Use the ResourceManager UI (http://master:8088) to monitor jobs.​

●	Use the HDFS UI (http://master:9870) to check file system status.​

d) User Management

●​ Create users and assign permissions for accessing data.​

●​ Tools like Hue, Apache Knox, or Ranger simplify this process.​

📊 3. HDFS Monitoring & Maintenance


Monitoring ensures that Hadoop performs optimally, and maintenance avoids data loss and
corruption.

a) Monitoring Tools

●​ JPS: Lists running Hadoop daemons.​


●​ Nagios, Ganglia, Ambari: Monitor performance metrics like memory, CPU, disk
usage.​

●​ HDFS Web UI: Shows block distribution, health, under-replicated blocks.​

b) Regular Tasks

●​ Check disk space and system logs.​

●​ Use hdfs fsck / to check file system consistency.​

●​ Run balancer to ensure even data distribution across DataNodes.

hdfs balancer -threshold 10

c) Maintenance Activities

●​ Restart failed nodes.​

●​ Backup FSImage and edit logs.​

●​ Perform upgrades using rolling upgrade feature.​

⚙️ 4. Hadoop Benchmarks
Benchmarks help evaluate the performance of a Hadoop cluster.

a) Teragen and Terasort

●​ Used for generating data and testing sort performance.
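For example, using the standard examples jar shipped with Hadoop (the jar file name varies by version, and the row count and output paths are illustrative):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 /bench/teragen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /bench/teragen /bench/terasort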

b) DFSIO

●​ Measures HDFS I/O throughput.​

c) HiBench
●​ A comprehensive benchmark suite developed by Intel.​

●​ Tests real-world workloads including SQL, streaming, machine learning.

☁️ 5. Hadoop in the Cloud


Using Hadoop in the cloud enables scalable and cost-effective big data processing without
managing physical hardware.

a) Benefits

●​ Elasticity: Resources can be added/removed as needed.​

●​ Cost Efficiency: Pay-as-you-go model.​

●​ High Availability: Managed cloud services offer fault tolerance.​

●​ Easy Setup: No hardware or manual cluster configuration.​

b) Cloud Platforms Supporting Hadoop

Platform            | Service Name
Amazon Web Services | EMR (Elastic MapReduce)
Microsoft Azure     | HDInsight
Google Cloud        | Dataproc
IBM Cloud           | Analytics Engine

c) Cloud Features
●​ Pre-configured Hadoop environment.​

●​ Integration with cloud storage like S3 (AWS), Blob (Azure), or GCS (Google).​

●​ Scalable compute nodes and spot instances to reduce cost.​
