Hadoop is an open-source framework designed to process and store large volumes of data in a
distributed and scalable manner. It was originally developed by Doug Cutting and Mike Cafarella,
and it's based on the MapReduce programming model introduced by Google. Hadoop is widely
used in big data applications and allows organizations to process and analyze vast amounts of data
efficiently. The core components of Hadoop architecture include:
1. Hadoop Distributed File System (HDFS):
HDFS is a distributed file system designed to store massive amounts of data across multiple nodes in
a Hadoop cluster. It uses a master/slave architecture, where the NameNode acts as the master and
manages the metadata, while DataNodes serve as the slaves responsible for storing the actual data.
The data is distributed across DataNodes, providing fault tolerance and scalability.
2. MapReduce:
MapReduce is a programming model that enables distributed processing of large datasets. It divides
data processing tasks into two phases: the Map phase, where data is processed in parallel across
nodes, and the Reduce phase, where the results from the Map phase are aggregated to produce the
final output.
3. YARN (Yet Another Resource Negotiator):
YARN is the resource management layer in Hadoop, responsible for managing and allocating
resources (CPU, memory, etc.) to applications running in the cluster. It decouples the resource
management capabilities from MapReduce, allowing other data processing engines to run on
Hadoop, making the framework more versatile.
4. Hadoop Common:
Hadoop Common includes libraries and utilities that provide necessary support services for Hadoop
components. It comprises the shared utilities used by all the Hadoop modules.
5. Hadoop Ecosystem:
Hadoop's ecosystem consists of various projects that extend and complement its functionalities.
Some popular components of the Hadoop ecosystem include:
- Apache Hive: A data warehouse system that provides an SQL-like query language (HiveQL) for Hadoop.
- Apache Pig: A high-level platform for creating MapReduce programs using a language called Pig
Latin.
- Apache HBase: A distributed, scalable NoSQL database that runs on top of HDFS.
- Apache Spark: An in-memory data processing engine that provides faster data processing
compared to traditional MapReduce.
- Apache Kafka: A distributed streaming platform for handling real-time data feeds.
- Apache ZooKeeper: A centralized service for maintaining configuration information,
synchronization, and coordination in distributed systems.
Overall, Hadoop's architecture allows for the distributed storage and processing of large datasets,
providing fault tolerance and scalability while enabling developers to build powerful big data
applications. However, it's worth noting that the big data landscape is continually evolving, and
other technologies like Apache Spark and cloud-based solutions have gained popularity for specific
use cases as well.
HDFS:
HDFS stands for Hadoop Distributed File System, and it is a fundamental component of the Hadoop
ecosystem. HDFS is designed to store and manage large volumes of data in a distributed and fault-
tolerant manner, making it an essential part of big data processing and analytics applications. Here
are the key features and components of HDFS:
1. **Distributed Storage:**
HDFS is designed to store vast amounts of data across multiple machines (nodes) in a Hadoop
cluster. Data is distributed across these nodes, allowing the system to scale horizontally as more
data is added.
2. **Fault Tolerance:**
HDFS achieves fault tolerance by replicating data across different nodes. By default, each data block
is replicated three times (configurable), with the replicas placed on different racks to protect against
hardware failures and network issues.
3. **NameNode:**
The NameNode is a critical component of HDFS and acts as the master node in the Hadoop cluster. It
stores all the metadata about the files and directories in the file system, such as the file hierarchy,
block locations, and permissions. As the central authority, the NameNode manages file system
operations like opening, closing, and renaming files.
4. **DataNodes:**
DataNodes are the worker nodes in the Hadoop cluster. These nodes store the actual data blocks
and are responsible for reading and writing data upon the request of clients and the NameNode.
They regularly send heartbeats to the NameNode to report their health status.
5. **Block Size:**
HDFS stores data in large blocks, 128 MB by default in Hadoop 2 and later (64 MB in older versions, and configurable). This block size is much larger than in traditional file systems, which use smaller block sizes. The larger block size reduces the NameNode's overhead, since it needs to manage metadata for far fewer blocks.
6. **Rack Awareness:**
HDFS is rack-aware, meaning it is aware of the physical network topology of the cluster. Data
replication is performed such that copies of the same data block are stored on different racks to
enhance fault tolerance and data locality.
7. **Read and Write Operations:**
When a client wants to read data from HDFS, it contacts the NameNode to get information about
the block locations. The client can then directly access the DataNodes that hold the data blocks it
needs. Similarly, when writing data, the client communicates with the NameNode to determine
where to store the data, and the data is written to multiple DataNodes for replication.
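To make this read path concrete, here is a minimal sketch using the Java FileSystem API (org.apache.hadoop.fs); the file path is a made-up example, and the NameNode address is assumed to come from the standard configuration files on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // which tell the client where the NameNode is.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used for illustration.
        Path file = new Path("/data/sample.txt");

        // Metadata comes from the NameNode: replication factor and block size.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size : " + status.getBlockSize() + " bytes");

        // The actual bytes are streamed directly from the DataNodes
        // that hold the file's blocks.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Note that the client only asks the NameNode for metadata; the file's bytes are streamed directly from the DataNodes.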
HDFS is an essential component of the Hadoop ecosystem because it provides a scalable and reliable
storage infrastructure for big data applications. Its fault-tolerant design, along with the ability to
parallelize data processing with the MapReduce framework, allows Hadoop to efficiently handle and
analyze vast amounts of data on commodity hardware.
MapReduce & HDFS:
MapReduce and HDFS are two core components of the Hadoop ecosystem, working together to
process and store large-scale data efficiently. Let's explore each of them in more detail:
**HDFS (Hadoop Distributed File System):**
HDFS is a distributed file system designed to store vast amounts of data across multiple nodes in a
Hadoop cluster. Its primary purpose is to provide a fault-tolerant and scalable storage solution for
big data applications. Key features of HDFS include:
1. **Distributed Storage:** HDFS distributes data across multiple DataNodes in the cluster. Each file
is divided into fixed-size blocks, and these blocks are replicated across DataNodes to ensure fault
tolerance and data availability.
2. **Replication:** By default, HDFS replicates each data block three times (configurable). The
replicas are stored on different nodes and, ideally, different racks, ensuring that data remains
available even if some nodes or racks fail.
3. **NameNode and DataNodes:** The NameNode is the master node that manages the file
system's metadata, while DataNodes are the worker nodes responsible for storing the actual data
blocks.
4. **High Throughput:** HDFS is optimized for sequential read and write operations, making it well-
suited for data-intensive workloads like batch processing and analytics.
**MapReduce:**
MapReduce is a programming model and processing engine for distributed data processing on
Hadoop. It allows developers to write parallelizable algorithms to process vast amounts of data
efficiently. The MapReduce process consists of two main stages: Map and Reduce.
1. **Map Stage:** During the Map stage, input data is divided into splits, and multiple mappers
process these splits independently in parallel. Each mapper applies a user-defined map function to
the input data and generates intermediate key-value pairs (a code sketch follows this list).
2. **Shuffle and Sort:** After the Map stage, the MapReduce framework performs a Shuffle and
Sort phase, where the intermediate key-value pairs are sorted and grouped by keys across all
mappers. This ensures that all values for a specific key are brought together and passed to the same
reducer.
3. **Reduce Stage:** In the Reduce stage, reducers process the intermediate data generated by the
mappers. Each reducer applies a user-defined reduce function to the grouped data, producing the
final output.
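To illustrate the Map and Reduce stages described above, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API; the class names TokenMapper and SumReducer are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: for each input line, emit (word, 1) intermediate pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}

// Reduce stage: all counts for the same word arrive together after
// the shuffle-and-sort phase; sum them to get the final count.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```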
**Integration between MapReduce and HDFS:**
The integration between MapReduce and HDFS is fundamental to Hadoop's capabilities. When a
MapReduce job is executed, it reads data from HDFS for processing. Here's how the integration
works:
1. **Data Input:** MapReduce jobs read data from HDFS. The data is divided into splits, and each
split is processed by an individual mapper.
2. **Intermediate Data:** The mappers generate intermediate key-value pairs during the Map
stage, which are then stored temporarily in local storage on the respective DataNodes.
3. **Data Output:** After the Reduce stage, the final output of the MapReduce job is typically
written back to HDFS, where it can be used as input for subsequent jobs or accessed for analysis and
reporting.
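A minimal driver sketch that ties a mapper and reducer (such as the TokenMapper and SumReducer sketched earlier) to HDFS input and output paths; the class name and paths are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input is read from HDFS and split among the mappers;
        // the final output is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this would typically be launched with something like `hadoop jar wordcount.jar WordCountDriver /input /output`, where /input and /output are HDFS paths.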
In summary, HDFS provides the storage infrastructure for Hadoop, allowing it to handle large
datasets efficiently. MapReduce enables distributed data processing by breaking down complex
tasks into smaller parallel tasks that can be executed across the cluster, leveraging the capabilities of
HDFS for data input and output. Together, they form the core foundation for big data processing in
the Hadoop ecosystem.
MapReduce diagram:
What is MapReduce in Hadoop?
MapReduce is a software framework and programming model used for processing huge amounts of
data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with the splitting
and mapping of data, while Reduce tasks shuffle and reduce the data.
Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and
C++. MapReduce programs are parallel in nature and are therefore very useful for performing
large-scale data analysis using multiple machines in the cluster.
The input to each phase is key-value pairs. In addition, the programmer needs to specify two
functions: the map function and the reduce function.
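Conceptually, the two functions have the following key-value signatures (k1, v1, and so on are just type placeholders, following the common MapReduce notation):

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)

In the word-count example that follows, k1 is a byte offset into the file, v1 is a line of text, k2 is a word, v2 is a count of 1, and (k3, v3) is the final (word, total) pair.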
MapReduce Architecture in Big Data explained with Example
The whole process goes through four phases of execution namely, splitting, mapping, shuffling, and
reducing.
Now in this MapReduce tutorial, let's understand it with a MapReduce example.
Consider you have the following input data for your MapReduce in Big Data program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
The final output of the MapReduce task is:
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
The data goes through the following phases of MapReduce in Big Data:
Input Splits:
An input to a MapReduce in Big Data job is divided into fixed-size pieces called input splits. An input
split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is
passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count
the number of occurrences of each word in the input splits (more details about input splits are given below) and
prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from
the Mapping phase output. In our example, the same words are clubbed together along with their respective
frequencies.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines the values from the
Shuffling phase and returns a single output value for each key. In short, this phase summarizes the complete dataset.
In our example, this phase aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of
each word.
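Tracing the sample input above through these phases: splitting yields one split per line; the mapping phase emits the intermediate pairs (Welcome, 1), (to, 1), (Hadoop, 1), (Class, 1), (Hadoop, 1), (is, 1), (good, 1), (Hadoop, 1), (is, 1), (bad, 1); shuffling groups them by key into (bad, [1]), (Class, [1]), (good, [1]), (Hadoop, [1, 1, 1]), (is, [1, 1]), (to, [1]), (Welcome, [1]); and reducing sums each list of values to produce the final counts shown earlier.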
MapReduce Architecture explained in detail
One map task is created for each split, which then executes the map function for each record in the
split.
It is always beneficial to have multiple splits, because the time taken to process a split is small
compared to the time taken to process the whole input. When the splits are smaller, the
processing is better load-balanced, since we are processing the splits in parallel.
However, it is also not desirable to have splits that are too small. When splits are too small, the
overhead of managing the splits and of map task creation begins to dominate the total job execution
time.
For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default
in older Hadoop versions, 128 MB in Hadoop 2 and later). Both the block size and the split size are
configurable; a small configuration sketch follows at the end of this section.
Map tasks write their output to the local disk of the respective node, not to HDFS.
The local disk is chosen over HDFS in order to avoid the replication that takes place on every HDFS
store operation.
Map output is intermediate output which is processed by reduce tasks to produce the final output.
Once the job is complete, the map output can be thrown away. So, storing it in HDFS with
replication becomes overkill.
In the event of a node failure, if the map output has not yet been consumed by the reduce task, Hadoop reruns
the map task on another node and re-creates the map output.
Reduce tasks do not benefit from data locality. The output of every map task is fed to the
reduce task, so map output is transferred to the machine where the reduce task is running.
On this machine, the outputs are merged and then passed to the user-defined reduce function.
Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node
and the other replicas are stored on off-rack nodes). So, writing the reduce output does consume network
bandwidth, but only as much as a normal HDFS write.
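The split size and the number of reduce tasks mentioned above can be controlled from the driver. The following is a minimal sketch using methods from the org.apache.hadoop.mapreduce API; the sizes and reducer count are only illustrative:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TuningExample {
    public static void configure(Job job) {
        // Keep splits close to the HDFS block size: bounding the split
        // size guards against splits that are too large or too small.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB

        // The number of reduce tasks determines how many output files
        // are written back to HDFS.
        job.setNumReduceTasks(2);
    }
}
```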
How Does MapReduce Organize Work?
Now in this MapReduce tutorial, we will learn how MapReduce works
Hadoop divides the job into tasks. As mentioned above, there are two types of tasks:
1. Map tasks (splits & mapping)
2. Reduce tasks (shuffling & reducing)
The complete execution process (the execution of both Map and Reduce tasks) is controlled by two types of
entities:
1. JobTracker: acts like a master (responsible for the complete execution of the submitted job)
2. Multiple TaskTrackers: act like slaves, each of them performing part of the job
For every job submitted for execution in the system, there is one JobTracker, which typically runs alongside
the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes. (This JobTracker/TaskTracker
model is the classic MapReduce v1 design; in YARN-based clusters, the ResourceManager and NodeManagers play
these roles.)
How Hadoop MapReduce Works
A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different
data nodes.
Execution of each individual task is then looked after by the task tracker, which resides on every data node
executing part of the job.
The task tracker's responsibility is to send the progress report to the job tracker.
In addition, the task tracker periodically sends a 'heartbeat' signal to the JobTracker to notify it
of the current state of the system.
Thus, the job tracker keeps track of the overall progress of each job. In the event of a task failure, the job
tracker can reschedule it on a different task tracker.
FIRST STEPS IN HADOOP:
Getting started with Hadoop can be exciting but may seem overwhelming at first. Here are the first
steps to help you begin your journey with Hadoop:
1. **Understand Hadoop Concepts:**
Before diving into the technical aspects, it's essential to understand the fundamental concepts of
Hadoop, including HDFS, MapReduce, YARN, and the overall Hadoop ecosystem. Familiarize
yourself with distributed computing, data storage, and parallel processing principles.
2. **Set Up a Hadoop Cluster:**
You can create a small Hadoop cluster on your local machine using virtualization tools like
VirtualBox or by using Hadoop distributions like Cloudera QuickStart VM, Hortonworks Sandbox,
or Apache Hadoop in pseudo-distributed mode. This will enable you to experiment and practice
with Hadoop components in a controlled environment.
3. **Install and Configure Hadoop:**
Follow the official documentation of the Hadoop distribution you choose to install. Properly
configure HDFS, YARN, and other necessary components. Make sure you understand the
configuration files and how they affect the behavior of the Hadoop cluster.
4. **Explore HDFS:**
Learn how to interact with HDFS, the Hadoop Distributed File System. Practice commands for
uploading, downloading, and managing files and directories (a Java FileSystem sketch appears after
this list). Understand how data is distributed across DataNodes and the replication mechanisms.
5. **Write and Run a MapReduce Job:**
Write a simple MapReduce program in a programming language like Java or use higher-level
abstractions like Apache Pig or Apache Hive to process data. Execute the MapReduce job on your
Hadoop cluster and observe the results.
6. **Learn Hadoop Ecosystem Components:**
Explore popular Hadoop ecosystem components like Apache Hive, Apache Pig, Apache Spark,
and Apache HBase. Each component serves different purposes and can help you in various big
data processing scenarios.
7. **Practice with Sample Data:**
Work with some sample datasets to gain hands-on experience with Hadoop's capabilities. There
are various public datasets available that you can use for learning and experimentation.
8. **Study Advanced Concepts:**
Once you are comfortable with the basics, delve into more advanced topics like data partitioning,
performance tuning, data compression, and high availability configurations.
9. **Join Hadoop Communities:**
Join online Hadoop communities, forums, and mailing lists to interact with experienced Hadoop
users and seek help if you encounter any issues.
10. **Explore Real-World Use Cases:**
Look into real-world use cases where Hadoop is used effectively, such as data analytics, log
processing, recommendation systems, and more. Understanding practical applications will deepen
your understanding of Hadoop's potential.
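As a companion to step 4 above, here is a minimal sketch of uploading, listing, and downloading files with the Java FileSystem API; the local and HDFS paths are made-up examples, and the Hadoop configuration files are assumed to be on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Upload a local file into an HDFS directory (paths are illustrative).
        fs.mkdirs(new Path("/user/demo/input"));
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/demo/input/sample.txt"));

        // List the directory contents.
        for (FileStatus status : fs.listStatus(new Path("/user/demo/input"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Download the file back to the local file system.
        fs.copyToLocalFile(new Path("/user/demo/input/sample.txt"),
                           new Path("/tmp/sample-copy.txt"));

        fs.close();
    }
}
```

The same operations can also be performed from the command line with the hdfs dfs utility (for example -mkdir, -put, -ls, and -get).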
Remember, learning Hadoop takes time and practice. Start small, experiment, and gradually build
your expertise. As you become more proficient with Hadoop, you can explore other big data
technologies and frameworks to expand your knowledge in the ever-evolving field of data
engineering and analytics.