7) Intro to Hadoop and MapReduce


What is Hadoop?

Hadoop is an open-source software framework for storing and processing large
amounts of data in a distributed computing environment. The framework is written
mainly in Java, with some native code in C and shell scripts. It is designed to
handle big data and is based on the MapReduce programming model, which allows
large datasets to be processed in parallel.
Hadoop has two main components:
● HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows large amounts of data to be stored across multiple
machines. It is designed to work with commodity hardware, which makes it
cost-effective.
● YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as
CPU and memory) for processing the data stored in HDFS. A small
configuration sketch showing how a client is pointed at both components
appears at the end of this section.
Hadoop also includes several modules that provide additional functionality, such
as Hive (a SQL-like query language), Pig (a high-level platform for creating
MapReduce programs), and HBase (a non-relational, distributed database).
Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It is also used for data processing,
data analysis, and data mining. It enables the distributed processing of large
data sets across clusters of computers using a simple programming model.
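
As a rough illustration of how these two components surface to a client program, the short Java sketch below sets the standard configuration keys for the HDFS NameNode and the YARN ResourceManager. The host names and port are placeholders, not values taken from this document.

import org.apache.hadoop.conf.Configuration;

public class ClusterConfigSketch {
    public static Configuration clusterConf() {
        Configuration conf = new Configuration();
        // HDFS: where the data is stored (the NameNode address is an assumed placeholder).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // YARN: where resources (CPU, memory) for processing are negotiated.
        conf.set("yarn.resourcemanager.hostname", "resourcemanager");
        return conf;
    }
}

In practice these values normally come from core-site.xml and yarn-site.xml on the cluster; setting them in code here is just a compact way to show which component each key belongs to.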

History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders
are Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy
elephant. In October 2003, Google released its paper on the Google File System
(GFS), which, together with the MapReduce programming model Google later
described, formed the basis of the project. In January 2006, MapReduce
development started on Apache Nutch, with around 6,000 lines of code written for
MapReduce and around 5,000 lines for HDFS. In April 2006, Hadoop 0.1.0 was
released.
The Hadoop framework allows the distributed processing of large data sets across
clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and
storage. It is used by many organizations, including Yahoo, Facebook, and IBM,
for a variety of purposes such as data warehousing, log processing, and research,
and it has been widely adopted in industry as a key technology for big data
processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It offers huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited for big data
processing:

● Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
● Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
● Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it
can continue to operate even in the presence of hardware failures.
● Data Locality: Hadoop stores data on the same node where it will be
processed, which reduces network traffic and improves performance.
● High Availability: Hadoop provides high-availability features that help
ensure the data is always available and is not lost.
● Flexible Data Processing: Hadoop’s MapReduce programming model allows
for the processing of data in a distributed fashion, making it easy to implement
a wide variety of data processing tasks.
● Data Integrity: Hadoop provides a built-in checksum feature, which helps
ensure that the stored data is consistent and correct.
● Data Replication: Hadoop provides a data replication feature, which replicates
data across the cluster for fault tolerance.
● Data Compression: Hadoop provides built-in data compression, which helps
reduce storage space and improve performance.
● YARN: A resource management platform that allows multiple data processing
engines, such as real-time streaming, batch processing, and interactive SQL, to
run and process data stored in HDFS.

Hadoop Distributed File System


Hadoop has a distributed file system known as HDFS, which splits files into
blocks and distributes them across the nodes of large clusters. In the case of a
node failure, the system keeps operating, and HDFS facilitates the transfer of data
between the remaining nodes.
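
To make the block-and-node description above concrete, the sketch below uses the standard org.apache.hadoop.fs.FileSystem Java API to write a small file into HDFS, request three replicas of its blocks, and read it back. The hdfs:// address and the path are assumptions chosen for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write: HDFS splits the file into blocks and spreads them across DataNodes.
        Path file = new Path("/user/demo/hello.txt");        // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Ask for three replicas of each block so a single node failure does not lose data.
        fs.setReplication(file, (short) 3);

        // Read the file back; the client is served blocks from whichever nodes hold them.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}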

Advantages of HDFS: It is inexpensive, immutable in nature, stores data reliably,
tolerates faults, scales well, is block structured, can process a large amount of
data simultaneously, and more.
Disadvantages of HDFS: Its biggest disadvantage is that it is not a good fit for
small quantities of data. It also has potential stability issues and can be restrictive
and rough to work with.
Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm,
Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.

Some common frameworks of Hadoop


1. Hive - It uses HiveQL for structuring data and for writing complicated
MapReduce jobs over data in HDFS.
2. Drill - It supports user-defined functions and is used for data exploration.
3. Storm - It allows real-time processing and streaming of data.
4. Spark - It contains a Machine Learning Library (MLlib) for enhanced machine
learning and is widely used for data processing. It also supports Java, Python,
and Scala.
5. Pig - It provides Pig Latin, a SQL-like language, and performs data
transformation on unstructured data.
6. Tez - It reduces the complexities of Hive and Pig and helps their jobs run
faster.

The Hadoop framework is made up of the following modules:


1. Hadoop MapReduce - a MapReduce programming model for handling and
processing large amounts of data.
2. Hadoop Distributed File System (HDFS) - a file system that distributes files
across the nodes of a cluster.
3. Hadoop YARN - a platform that manages computing resources.
4. Hadoop Common - packages and libraries that are used by the other modules.
Advantages and Disadvantages of Hadoop
Advantages:
● Ability to store a large amount of data.

● High flexibility.

● Cost effective.

● High computational power.

● Tasks are independent.

● Linear scaling.
Hadoop has several advantages that make it a popular choice for big data
processing:

● Scalability: Hadoop can easily scale to handle large amounts of data by adding
more nodes to the cluster.
● Cost-effective: Hadoop is designed to work with commodity hardware, which
makes it a cost-effective option for storing and processing large amounts of
data.
● Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-
tolerance, which means that if one node in the cluster goes down, the data can
still be processed by the other nodes.
● Flexibility: Hadoop can process structured, semi-structured, and unstructured
data, which makes it a versatile option for a wide range of big data scenarios.
● Open-source: Hadoop is open-source software, which means that it is free to
use and modify. This also allows developers to access the source code and
make improvements or add new features.
● Large community: Hadoop has a large and active community of developers
and users who contribute to the development of the software, provide support,
and share best practices.
● Integration: Hadoop is designed to work with other big data technologies such
as Spark, Storm, and Flink, which allows for integration with a wide range of
data processing and analysis tools.

Disadvantages:

● Not very effective for small data.


● Hard cluster management.
● Has stability issues.
● Security concerns.
● Complexity: Hadoop can be complex to set up and maintain, especially for
organizations without a dedicated team of experts.
● Latency: Hadoop is not well-suited for low-latency workloads and may not be
the best choice for real-time data processing.
● Limited Support for Real-time Processing: Hadoop’s batch-oriented nature
makes it less suited for real-time streaming or interactive data processing use
cases.
● Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data, so it is not well-suited for structured data
processing.
● Data Security: Hadoop's built-in security features, such as data encryption and
user authentication, are limited out of the box, which can make it difficult to
secure sensitive data.
● Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming
model is not well-suited for ad-hoc queries, making it difficult to perform
exploratory data analysis.
● Limited Support for Graph and Machine Learning: Hadoop's core
components, HDFS and MapReduce, are not well-suited for graph and machine
learning workloads; specialized components such as Apache Giraph and Mahout
are available but have some limitations.
● Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
● Data Loss: In the event of a hardware failure, the data stored in a single node
may be lost permanently.
● Data Governance: Data governance is a critical aspect of data management,
and Hadoop does not provide built-in features for managing data lineage, data
quality, data cataloging, and data auditing.
What is MapReduce?

MapReduce is a parallel, distributed programming model in the Hadoop
framework that can be used to process the extensive data stored in the Hadoop
Distributed File System (HDFS). Hadoop is capable of running MapReduce
programs written in various languages such as Java, Ruby, and Python. One of the
benefits of MapReduce is that its programs are inherently parallel, which makes
very large-scale data analysis easier.
Because MapReduce programs run in parallel, the overall processing is much
faster. The process of running MapReduce programs is explained below.
● Dividing the input into fixed-size chunks: The work is first divided into
pieces. Simply splitting the work into equal-sized pieces is not always the best
approach, because when file sizes vary, some processes finish much earlier than
others while some take a very long time to complete their work. A better
approach is to split the input into fixed-size chunks and assign each chunk to a
process.
● Combining the results: Combining the results from the independent processes
is a crucial task in MapReduce programming, because it often needs additional
processing, such as aggregating and finalizing the results. A simple illustration
of this chunk-and-combine pattern follows this list.
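
The chunk-and-combine idea can be illustrated without Hadoop at all. The plain-Java sketch below, which is only an illustration and not part of the framework, splits a list of lines into fixed-size chunks, counts words in each chunk independently, and then merges the partial counts; this is exactly the shape of a map step followed by a combine step.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkAndCombineSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do is to be");
        int chunkSize = 1;                       // one line per chunk, for illustration
        Map<String, Integer> total = new HashMap<>();

        for (int i = 0; i < lines.size(); i += chunkSize) {
            List<String> chunk = lines.subList(i, Math.min(i + chunkSize, lines.size()));

            // "Map" step: count words inside this chunk only (independent work).
            Map<String, Integer> partial = new HashMap<>();
            for (String line : chunk) {
                for (String word : line.split("\\s+")) {
                    partial.merge(word, 1, Integer::sum);
                }
            }

            // "Combine" step: merge the partial result into the overall result.
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        System.out.println(total);               // e.g. {to=4, be=3, or=1, not=1, do=1, is=1}
    }
}

In real MapReduce, each chunk would be an input split processed by a separate mapper task, possibly on a separate node, and the merging would be done by combiners and reducers rather than in a single loop.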

Key components of MapReduce


MapReduce has two key components, corresponding to its two primary phases:
the map phase and the reduce phase. Each phase takes key-value pairs as its input
and output, and each has its own function, the map function or the reduce function.
A minimal code sketch of both appears after this list.
● Mapper: The Mapper is the first phase of MapReduce. It is responsible for
processing each input record; its input key-value pairs are generated by the
InputSplit and RecordReader. The key-value pairs emitted by the mapper can be
completely different from the input pairs, and the map output is the collection of
all these key-value pairs.
● Reducer: The reducer phase is the second phase of MapReduce. It is
responsible for processing the output of the mapper. Once it completes
processing the mapper's output, the reducer generates a new set of output that is
stored in HDFS as the final output data.
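
Here is the minimal word-count Mapper and Reducer mentioned above, written against the standard org.apache.hadoop.mapreduce API. The class names and the word-count logic are illustrative assumptions, not something prescribed by this document.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input record (byte offset, line text) into (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // intermediate key-value pair
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and writes (word, total) as the final output.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}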

Execution workflow of MapReduce


Now let's understand how MapReduce job execution works and which components
it involves. Generally, MapReduce processes the data in different phases with the
help of its different components. The steps below describe the job execution
workflow of MapReduce in Hadoop.

● Input Files: The data for MapReduce tasks is kept in input files, which reside
in HDFS. The format of the input files is arbitrary; line-based log files and
binary formats can also be used.

● InputFormat: The InputFormat is used to define how the input files are split
and read. It selects the files or objects that are used for input. In general, the
InputFormat is used to create the Input Split.

● Record Reader: The RecordReader communicates with the InputSplit in
Hadoop MapReduce and converts the data into key-value pairs that the mapper
can read. By default, the TextInputFormat is used to convert the data into
key-value pairs. The RecordReader communicates with the InputSplit until the
file reading is completed, assigning a byte offset (a unique number) as the key
for each line in the file. These key-value pairs are then sent to the mapper for
further processing.

● Mapper: The mapper receives the input records from the RecordReader,
processes them, and generates new key-value pairs. The key-value pairs
generated by the mapper can be completely different from the input pairs. The
output of the mapper, known as the intermediate output, is stored on the local
disk, since it is temporary data.
● Combiner: The Combiner in MapReduce is also known as Mini-reducer. The
Hadoop MapReduce combiner performs the local aggregation on the mapper’s
output which minimizes the data transfer between the mapper and reducer.
Once the Combiner completes its process, the output of the combiner is passed
to the partitioner for further work.

● Partitioner: In Hadoop MapReduce, the partitioner is used when we are
working with more than one reducer. The partitioner takes the output from the
combiner and partitions it. The partitioning of the output takes place based on
the key, and the output is then sorted. With the help of a hash function, the key
(or a subset of the key) is used to derive the partition. Since MapReduce
execution is key-value based, each combiner output record is partitioned so that
records with the same key move into the same partition, and each partition is
then sent to a reducer. Partitioning the output of the combiner allows the map
output to be distributed evenly over the reducers.

● Shuffling and Sorting: Shuffling moves the mapper's output to the reducer
nodes before the reducer phase begins. Once all the mappers have completed
their work and their output has been shuffled onto the reducer nodes, this
intermediate output is merged and sorted. The sorted output is then passed as
input to the reducer phase.

● Reducer: The reducer takes the set of intermediate key-value pairs from the
mapper as input and runs the reducer function on each group of key-value pairs
to generate the output. The output of the reducer phase is the final output, and it
is stored in HDFS.

● Output format: The OutputFormat determines how the output values are
written to the output files by the RecordWriter. The OutputFormat instances
provided by Hadoop are generally used to write files to either HDFS or the
local disk. Thus, the final output of the reducer is written to HDFS by an
OutputFormat instance.

Example:
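
As a worked example, the driver sketch below wires the workflow components together for a word-count job. It assumes the WordCountMapper and WordCountReducer classes from the earlier sketch and hypothetical HDFS input and output paths; it is a minimal illustration rather than a definitive program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // InputFormat / RecordReader: TextInputFormat hands the mapper
        // (byte offset, line) pairs, as described in the workflow above.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical path

        // Mapper, Combiner (mini-reducer), and Reducer from the earlier sketch.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the final (key, value) pairs written by the OutputFormat.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // hypothetical path

        // Submit the job and wait; partitioning, shuffling, and sorting
        // happen inside the framework between the map and reduce phases.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run against a small text input, the job writes part files under the output path, with one tab-separated word and count per line.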
