7) Intro to Hadoop and MapReduce


What is Hadoop?

Hadoop is an open-source software framework for storing and processing large
amounts of data in a distributed computing environment. The framework is written
mainly in Java, with some native code in C and shell scripts. It is designed to
handle big data and is based on the MapReduce programming model, which allows
large datasets to be processed in parallel.
Hadoop has two main components:
● HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows large amounts of data to be stored across multiple
machines. It is designed to work with commodity hardware, which makes it
cost-effective.
● YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as
CPU and memory) for processing the data stored in HDFS. A small
configuration sketch showing how a client is pointed at both components
appears at the end of this section.
Hadoop also includes several modules that provide additional functionality, such
as Hive (a SQL-like query language), Pig (a high-level platform for creating
MapReduce programs), and HBase (a non-relational, distributed database).
Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It is also used for data processing,
data analysis, and data mining. It enables the distributed processing of large
data sets across clusters of computers using a simple programming model.
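
As a rough illustration of how these two components surface to a client program, the short Java sketch below sets the standard configuration keys for the HDFS NameNode and the YARN ResourceManager. The host names and port are placeholders, not values taken from this document.

import org.apache.hadoop.conf.Configuration;

public class ClusterConfigSketch {
    public static Configuration clusterConf() {
        Configuration conf = new Configuration();
        // HDFS: where the data is stored (the NameNode address is an assumed placeholder).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // YARN: where resources (CPU, memory) for processing are negotiated.
        conf.set("yarn.resourcemanager.hostname", "resourcemanager");
        return conf;
    }
}

In practice these values normally come from core-site.xml and yarn-site.xml on the cluster; setting them in code here is just a compact way to show which component each key belongs to.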

History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders
are Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy
elephant. In October 2003, Google released its paper on the Google File System
(GFS), which, together with the MapReduce programming model Google later
described, formed the basis of the project. In January 2006, MapReduce
development started on Apache Nutch, with around 6,000 lines of code written for
MapReduce and around 5,000 lines for HDFS. In April 2006, Hadoop 0.1.0 was
released.
The Hadoop framework allows the distributed processing of large data sets across
clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and
storage. It is used by many organizations, including Yahoo, Facebook, and IBM,
for a variety of purposes such as data warehousing, log processing, and research,
and it has been widely adopted in industry as a key technology for big data
processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It offers huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited for big data
processing:

● Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
● Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
● Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it
can continue to operate even in the presence of hardware failures.
● Data Locality: Hadoop stores data on the same node where it will be
processed, which reduces network traffic and improves performance.
● High Availability: Hadoop provides high-availability features that help
ensure the data is always available and is not lost.
● Flexible Data Processing: Hadoop’s MapReduce programming model allows
for the processing of data in a distributed fashion, making it easy to implement
a wide variety of data processing tasks.
● Data Integrity: Hadoop provides a built-in checksum feature, which helps
ensure that the stored data is consistent and correct.
● Data Replication: Hadoop provides a data replication feature, which replicates
data across the cluster for fault tolerance.
● Data Compression: Hadoop provides built-in data compression, which helps
reduce storage space and improve performance.
● YARN: A resource management platform that allows multiple data processing
engines, such as real-time streaming, batch processing, and interactive SQL, to
run and process data stored in HDFS.

Hadoop Distributed File System


Hadoop has a distributed file system known as HDFS, which splits files into
blocks and distributes them across the nodes of large clusters. In the case of a
node failure, the system keeps operating, and HDFS facilitates the transfer of data
between the remaining nodes.
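
To make the block-and-node description above concrete, the sketch below uses the standard org.apache.hadoop.fs.FileSystem Java API to write a small file into HDFS, request three replicas of its blocks, and read it back. The hdfs:// address and the path are assumptions chosen for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write: HDFS splits the file into blocks and spreads them across DataNodes.
        Path file = new Path("/user/demo/hello.txt");        // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Ask for three replicas of each block so a single node failure does not lose data.
        fs.setReplication(file, (short) 3);

        // Read the file back; the client is served blocks from whichever nodes hold them.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}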

Advantages of HDFS: It is inexpensive, immutable in nature, stores data reliably,
tolerates faults, scales well, is block structured, can process a large amount of
data simultaneously, and more.
Disadvantages of HDFS: Its biggest disadvantage is that it is not a good fit for
small quantities of data. It also has potential stability issues and can be restrictive
and rough to work with.
Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm,
Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.

Some common frameworks of Hadoop


1. Hive - It uses HiveQL for structuring data and for writing complicated
MapReduce jobs over data in HDFS.
2. Drill - It supports user-defined functions and is used for data exploration.
3. Storm - It allows real-time processing and streaming of data.
4. Spark - It contains a Machine Learning Library (MLlib) for enhanced machine
learning and is widely used for data processing. It also supports Java, Python,
and Scala.
5. Pig - It provides Pig Latin, a SQL-like language, and performs data
transformation on unstructured data.
6. Tez - It reduces the complexities of Hive and Pig and helps their jobs run
faster.

The Hadoop framework is made up of the following modules:


1. Hadoop MapReduce - a MapReduce programming model for handling and
processing large amounts of data.
2. Hadoop Distributed File System (HDFS) - a file system that distributes files
across the nodes of a cluster.
3. Hadoop YARN - a platform that manages computing resources.
4. Hadoop Common - packages and libraries that are used by the other modules.
Advantages and Disadvantages of Hadoop
Advantages:
● Ability to store a large amount of data.

● High flexibility.

● Cost effective.

● High computational power.

● Tasks are independent.

● Linear scaling.
Hadoop has several advantages that make it a popular choice for big data
processing:

● Scalability: Hadoop can easily scale to handle large amounts of data by adding
more nodes to the cluster.
● Cost-effective: Hadoop is designed to work with commodity hardware, which
makes it a cost-effective option for storing and processing large amounts of
data.
● Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-
tolerance, which means that if one node in the cluster goes down, the data can
still be processed by the other nodes.
● Flexibility: Hadoop can process structured, semi-structured, and unstructured
data, which makes it a versatile option for a wide range of big data scenarios.
● Open-source: Hadoop is open-source software, which means that it is free to
use and modify. This also allows developers to access the source code and
make improvements or add new features.
● Large community: Hadoop has a large and active community of developers
and users who contribute to the development of the software, provide support,
and share best practices.
● Integration: Hadoop is designed to work with other big data technologies such
as Spark, Storm, and Flink, which allows for integration with a wide range of
data processing and analysis tools.

Disadvantages:

● Not very effective for small data.


● Hard cluster management.
● Has stability issues.
● Security concerns.
● Complexity: Hadoop can be complex to set up and maintain, especially for
organizations without a dedicated team of experts.
● Latency: Hadoop is not well-suited for low-latency workloads and may not be
the best choice for real-time data processing.
● Limited Support for Real-time Processing: Hadoop’s batch-oriented nature
makes it less suited for real-time streaming or interactive data processing use
cases.
● Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data, so it is not well-suited for structured data
processing.
● Data Security: Hadoop's built-in security features, such as data encryption and
user authentication, are limited out of the box, which can make it difficult to
secure sensitive data.
● Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming
model is not well-suited for ad-hoc queries, making it difficult to perform
exploratory data analysis.
● Limited Support for Graph and Machine Learning: Hadoop's core
components, HDFS and MapReduce, are not well-suited for graph and machine
learning workloads; specialized components such as Apache Giraph and Mahout
are available but have some limitations.
● Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
● Data Loss: In the event of a hardware failure, the data stored in a single node
may be lost permanently.
● Data Governance: Data governance is a critical aspect of data management,
and Hadoop does not provide built-in features for managing data lineage, data
quality, data cataloging, and data auditing.
What is MapReduce?

MapReduce is a parallel, distributed programming model in the Hadoop
framework that can be used to process the extensive data stored in the Hadoop
Distributed File System (HDFS). Hadoop is capable of running MapReduce
programs written in various languages such as Java, Ruby, and Python. One of the
benefits of MapReduce is that its programs are inherently parallel, which makes
very large-scale data analysis easier.
Because MapReduce programs run in parallel, the overall processing is much
faster. The process of running MapReduce programs is explained below.
● Dividing the input into fixed-size chunks: The work is first divided into
pieces. Simply splitting the work into equal-sized pieces is not always the best
approach, because when file sizes vary, some processes finish much earlier than
others while some take a very long time to complete their work. A better
approach is to split the input into fixed-size chunks and assign each chunk to a
process.
● Combining the results: Combining the results from the independent processes
is a crucial task in MapReduce programming, because it often needs additional
processing, such as aggregating and finalizing the results. A simple illustration
of this chunk-and-combine pattern follows this list.
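
The chunk-and-combine idea can be illustrated without Hadoop at all. The plain-Java sketch below, which is only an illustration and not part of the framework, splits a list of lines into fixed-size chunks, counts words in each chunk independently, and then merges the partial counts; this is exactly the shape of a map step followed by a combine step.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkAndCombineSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do is to be");
        int chunkSize = 1;                       // one line per chunk, for illustration
        Map<String, Integer> total = new HashMap<>();

        for (int i = 0; i < lines.size(); i += chunkSize) {
            List<String> chunk = lines.subList(i, Math.min(i + chunkSize, lines.size()));

            // "Map" step: count words inside this chunk only (independent work).
            Map<String, Integer> partial = new HashMap<>();
            for (String line : chunk) {
                for (String word : line.split("\\s+")) {
                    partial.merge(word, 1, Integer::sum);
                }
            }

            // "Combine" step: merge the partial result into the overall result.
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        System.out.println(total);               // e.g. {to=4, be=3, or=1, not=1, do=1, is=1}
    }
}

In real MapReduce, each chunk would be an input split processed by a separate mapper task, possibly on a separate node, and the merging would be done by combiners and reducers rather than in a single loop.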

Key components of MapReduce


MapReduce has two key components, corresponding to its two primary phases:
the map phase and the reduce phase. Each phase takes key-value pairs as its input
and output, and each has its own function, the map function or the reduce function.
A minimal code sketch of both appears after this list.
● Mapper: The Mapper is the first phase of MapReduce. It is responsible for
processing each input record; its input key-value pairs are generated by the
InputSplit and RecordReader. The key-value pairs emitted by the mapper can be
completely different from the input pairs, and the map output is the collection of
all these key-value pairs.
● Reducer: The reducer phase is the second phase of MapReduce. It is
responsible for processing the output of the mapper. Once it completes
processing the mapper's output, the reducer generates a new set of output that is
stored in HDFS as the final output data.
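
Here is the minimal word-count Mapper and Reducer mentioned above, written against the standard org.apache.hadoop.mapreduce API. The class names and the word-count logic are illustrative assumptions, not something prescribed by this document.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input record (byte offset, line text) into (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // intermediate key-value pair
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and writes (word, total) as the final output.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}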

Execution workflow of MapReduce


Now let's understand how MapReduce job execution works and which components
it involves. Generally, MapReduce processes the data in different phases with the
help of its different components. The steps below describe the job execution
workflow of MapReduce in Hadoop.

● Input Files: The data for MapReduce tasks is kept in input files, which reside
in HDFS. The format of the input files is arbitrary; line-based log files and
binary formats can also be used.

● InputFormat: The InputFormat is used to define how the input files are split
and read. It selects the files or objects that are used for input. In general, the
InputFormat is used to create the Input Split.

● Record Reader: The RecordReader communicates with the InputSplit in
Hadoop MapReduce and converts the data into key-value pairs that the mapper
can read. By default, the TextInputFormat is used to convert the data into
key-value pairs. The RecordReader communicates with the InputSplit until the
file reading is completed, assigning a byte offset (a unique number) as the key
for each line in the file. These key-value pairs are then sent to the mapper for
further processing.

● Mapper: The mapper receives the input records from the RecordReader,
processes them, and generates new key-value pairs. The key-value pairs
generated by the mapper can be completely different from the input pairs. The
output of the mapper, known as the intermediate output, is stored on the local
disk, since it is temporary data.
● Combiner: The Combiner in MapReduce is also known as Mini-reducer. The
Hadoop MapReduce combiner performs the local aggregation on the mapper’s
output which minimizes the data transfer between the mapper and reducer.
Once the Combiner completes its process, the output of the combiner is passed
to the partitioner for further work.

● Partitioner: In Hadoop MapReduce, the partitioner is used when we are
working with more than one reducer. The partitioner takes the output from the
combiner and partitions it. The partitioning of the output takes place based on
the key, and the output is then sorted. With the help of a hash function, the key
(or a subset of the key) is used to derive the partition. Since MapReduce
execution is key-value based, each combiner output record is partitioned so that
records with the same key move into the same partition, and each partition is
then sent to a reducer. Partitioning the output of the combiner allows the map
output to be distributed evenly over the reducers.

● Shuffling and Sorting: Shuffling moves the mapper's output to the reducer
nodes before the reducer phase begins. Once all the mappers have completed
their work and their output has been shuffled onto the reducer nodes, this
intermediate output is merged and sorted. The sorted output is then passed as
input to the reducer phase.

● Reducer: The reducer takes the set of intermediate key-value pairs from the
mapper as input and runs the reducer function on each group of key-value pairs
to generate the output. The output of the reducer phase is the final output, and it
is stored in HDFS.

● Output format: The OutputFormat determines how the output values are
written to the output files by the RecordWriter. The OutputFormat instances
provided by Hadoop are generally used to write files to either HDFS or the
local disk. Thus, the final output of the reducer is written to HDFS by an
OutputFormat instance.

Example:
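
As a worked example, the driver sketch below wires the workflow components together for a word-count job. It assumes the WordCountMapper and WordCountReducer classes from the earlier sketch and hypothetical HDFS input and output paths; it is a minimal illustration rather than a definitive program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // InputFormat / RecordReader: TextInputFormat hands the mapper
        // (byte offset, line) pairs, as described in the workflow above.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical path

        // Mapper, Combiner (mini-reducer), and Reducer from the earlier sketch.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the final (key, value) pairs written by the OutputFormat.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // hypothetical path

        // Submit the job and wait; partitioning, shuffling, and sorting
        // happen inside the framework between the map and reduce phases.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run against a small text input, the job writes part files under the output path, with one tab-separated word and count per line.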
