UNIT - 4
BASICS OF HADOOP
• Contents
• Data format
• Features of Hadoop
• Analyzing data with Hadoop
• Design of Hadoop Distributed File System (HDFS)
• HDFS concepts
• Hadoop related tools
About Hadoop
• Apache Hadoop is an open source software framework
used to develop data processing applications which are
executed in a distributed computing environment.
• In Hadoop, data resides in a distributed file system called the Hadoop Distributed File System (HDFS). The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) that contain the data. This computational logic is simply a compiled version of a program written in a high-level language such as Java; such a program processes data stored in HDFS.
Types of Hadoop File Formats
• Text files: A text file is the most basic and human-readable format. It can be created in any editor or written from any programming language such as C or Java.
• Sequence files: The sequence file format stores key-value pairs in a binary container format and can be used to hold binary data such as images. Sequence files are more efficient than text files, but they are not human-readable.
Types of Hadoop File Formats
• Avro data files: Apache Avro is a language-neutral data serialization system developed by Doug Cutting, the father of Hadoop. The Avro file format is ideal for long-term storage of important data and can be read and written from many languages, such as Java and Scala.
• Parquet file format: Parquet is a columnar format developed by Cloudera and Twitter. It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on. Like Avro, schema metadata is embedded in the file.
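To make the "columnar" idea concrete, here is a minimal sketch that writes and selectively reads a Parquet file using the pyarrow library. pyarrow is not part of Hadoop itself; the library choice, column names, and file name are assumptions made only for illustration.

```python
# Minimal Parquet sketch using pyarrow (illustrative assumption; Hadoop jobs
# would more commonly read/write Parquet through Hive, Spark, Impala, etc.).
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table with two columns.
table = pa.table({
    "dept": ["CSE", "ECE", "CSE"],
    "students": [120, 90, 80],
})

# Write it out in the columnar Parquet format.
pq.write_table(table, "depts.parquet")

# Because the layout is columnar, a reader can pull just one column
# without scanning the whole file.
only_depts = pq.read_table("depts.parquet", columns=["dept"])
print(only_depts)
```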
SOME OTHER DATA FORMATS
• Photos – Pixel data - petabytes of storage
• Trade data – Time-series data – terabytes of data
per day
• Gene data – Sequential data – petabytes of data
• Internet Archive – Unstructured data – terabytes to petabytes of data
• There’s a lot of data out there.
• But you are probably wondering how it affects you.
• Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals?
• Yes. More data usually beats better algorithms.
Hadoop Features
• Suitable for Big Data Analysis
• As Big Data tends to be distributed and
unstructured in nature, HADOOP clusters are
best suited for analysis of Big Data. Since it is the
processing logic (not the actual data) that
flows to the computing nodes, less network
bandwidth is consumed. This concept is called
data locality, and it helps increase the
efficiency of Hadoop-based applications.
Hadoop Features
• Scalability : HADOOP clusters can easily be
scaled to any extent by adding additional cluster
nodes and thus allows for the growth of Big Data.
• Fault Tolerance : The HADOOP ecosystem has a
provision to replicate the input data onto other
cluster nodes. That way, in the event of a cluster
node failure, data processing can still proceed
using the data stored on another cluster node.
• 5.25’’ floppy disk – 1.2 MB – ~100 KB/sec ≈ 15 minutes to read in full.
• 5.25’’ hard disk – 1.3 GB – ~400 KB/sec ≈ 8 minutes to read in full.
• 5.25’’ compact disk – 700 MB / 4.7 GB – [CD: 150 KB/sec at 1x speed], [DVD: 1,350 KB/sec at 1x speed] ≈ 3 minutes to read in full.
• Today, terabyte hard disk drives have a data transfer speed of around 100 MB/sec, so it takes about two and a half hours to read all the data off the disk.
• This is a long time to read all data on a single drive and
writing is even slower.
• The obvious way to reduce the time is to read from
multiple disks at once.
• Imagine if we had 100 drives, each holding one
hundredth of the data.
• Working in parallel, we could read the data in under two
minutes.
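A quick back-of-the-envelope check of that claim, as a small Python sketch (the 1 TB size and 100 MB/sec transfer speed are the figures from the slides above):

```python
# Rough arithmetic behind "one drive: hours, 100 drives in parallel: minutes".
DATA_MB = 1_000_000        # 1 TB expressed in MB
SPEED_MB_PER_SEC = 100     # transfer speed per drive

one_drive_secs = DATA_MB / SPEED_MB_PER_SEC            # read everything from one drive
parallel_secs = (DATA_MB / 100) / SPEED_MB_PER_SEC     # 100 drives, 1/100 of the data each

print(one_drive_secs / 3600, "hours")    # ~2.8 hours on a single drive
print(parallel_secs / 60, "minutes")     # ~1.7 minutes across 100 drives
```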
• The first problem to solve is hardware failure.
– Replication of data is solution (HDFS).
• The second problem is that most analysis tasks
need to be able to combine the data in some
way.
– Various distributed systems allow data to be
combined from multiple sources, but doing this
correctly is notoriously challenging (MapReduce).
– MapReduce provides a programming model that
abstracts the problem from disk reads and writes,
transforming it into a computation over sets of keys
and values.
MapReduce
• MapReduce is the core processing component of the
Hadoop Ecosystem, as it provides the logic of
data processing. It is a software framework
that helps in writing applications that
process large data sets using distributed and
parallel algorithms inside the Hadoop
environment.
• In a MapReduce program, Map() and
Reduce() are the two functions.
MapReduce
– The Map function performs actions like filtering, grouping, and sorting.
– The Reduce function aggregates and summarizes the results produced by the Map function.
– The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
– Example: we want to calculate the number of students in each department. The Map program executes first and emits the students appearing in each department as key-value pairs of the form described above. These key-value pairs are the input to the Reduce function, which then aggregates the values for each department, calculates the total number of students per department, and produces the final result (a sketch of this example follows).
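A minimal sketch of this department-count example in plain Python, mimicking the map and reduce phases in a single process rather than using the real Hadoop API; the record format and the helper names (map_fn, reduce_fn, run) are assumptions made for illustration only.

```python
from collections import defaultdict

def map_fn(record):
    # Each input record is assumed to look like "student_name,department".
    # Emit one (department, 1) key-value pair per student.
    _, dept = record.split(",")
    yield (dept.strip(), 1)

def reduce_fn(dept, counts):
    # Sum all the 1s emitted for a department.
    yield (dept, sum(counts))

def run(records):
    # The framework normally performs the shuffle/sort between the phases;
    # here we simply group the map output by key ourselves.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    results = {}
    for dept, counts in grouped.items():
        for key, total in reduce_fn(dept, counts):
            results[key] = total
    return results

print(run(["alice,CSE", "bob,CSE", "carol,ECE"]))   # {'CSE': 2, 'ECE': 1}
```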
Comparison with other systems
A brief history of Hadoop
Hadoop Ecosystem and Components
Hadoop Ecosystem
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Data processing using programming
• Spark -> In-memory Data Processing
• PIG, HIVE-> Data Processing Services using Query
(SQL-like)
• HBase -> NoSQL Database
Hadoop Ecosystem
• Mahout, Spark MLlib -> Machine Learning
• Apache Drill -> SQL on Hadoop
• Zookeeper -> Managing Cluster
• Oozie -> Job Scheduling
• Flume, Sqoop -> Data Ingesting Services
• Solr & Lucene -> Searching & Indexing
• Ambari -> Provision, Monitor and Maintain
cluster
YARN
• Think of YARN as the brain of your Hadoop Ecosystem. It performs all
of your processing activities by allocating resources and
scheduling tasks.
• It has two major components: the ResourceManager and the
NodeManager.
– The ResourceManager is the main node on the processing side.
– It receives the processing requests and then passes parts of
the requests to the corresponding NodeManagers,
where the actual processing takes place.
– NodeManagers are installed on every DataNode and are
responsible for the execution of tasks on every single
DataNode.
PIG
• PIG has two parts: Pig Latin, the language, and the
Pig runtime, the execution environment. The relationship
is similar to that between Java and the JVM.
• Pig Latin has a SQL-like command structure.
• In PIG, the LOAD command loads the data first. Then
we perform various functions on it, such as grouping,
filtering, joining, and sorting. Finally, you can either
dump the data to the screen or store the
result back in HDFS.
APACHE HIVE
• Facebook created HIVE for people who are
fluent with SQL. Thus, HIVE makes them feel
at home while working in a Hadoop
Ecosystem.
• Basically, HIVE is a data warehousing
component that performs reading, writing,
and managing of large data sets in a distributed
environment using a SQL-like interface.
• HIVE + SQL = HQL
Mahout
• Mahout provides an environment for creating
machine learning applications which are
scalable.
• It performs collaborative filtering, clustering,
and classification. Some people also
consider frequent itemset mining to be a
Mahout function. Let us understand them
individually:
Mahout
• Collaborative filtering: Mahout mines user
behaviours, their patterns, and their
characteristics, and based on these it predicts
and makes recommendations to users. A
typical use case is an e-commerce website.
• Clustering: It organizes similar groups of data
together; for example, articles can be grouped into
blogs, news, research papers, etc.
Mahout
• Classification: It means classifying and
categorizing data into various sub-categories;
for example, articles can be categorized
into blogs, news, essays, research papers, and
other categories.
• Frequent itemset mining: Here Mahout
checks which objects are likely to
appear together and makes suggestions if
one of them is missing.
Spark
• Apache Spark is a framework for real-time
data analytics in a distributed computing
environment.
• Spark is written in Scala and was originally
developed at the University of California,
Berkeley.
Hadoop Architecture
NameNode and DataNodes
• NameNode: the main node; it does not store the
actual data. It contains metadata, like a log file or a
table of contents. Therefore, it requires less
storage but high computational resources.
• DataNodes: the actual data is stored here, so they require
more storage resources. DataNodes are
commodity hardware (like your laptops and desktops) in
the distributed environment. You always communicate with
the NameNode when writing data; it then tells the client
on which DataNodes to store and replicate the data.
Hadoop Architecture
• Hadoop has a Master-Slave Architecture for
data storage and distributed data processing
using MapReduce and HDFS methods.
• NameNode: The NameNode represents every
file and directory used in the
namespace.
• DataNode: A DataNode helps you manage
the state of an HDFS node and allows you to
interact with its blocks.
DIFFERENT NODES IN HADOOP
• MasterNode: The master node allows you to
conduct parallel processing of data using Hadoop
MapReduce.
• Slave node: The slave nodes are the additional
machines in the Hadoop cluster that allow you
to store data and conduct complex calculations.
Moreover, every slave node comes with a TaskTracker
and a DataNode, which allow it to
synchronize its processes with the JobTracker
and the NameNode, respectively.
Apache Hadoop and the Hadoop Ecosystem
The Hadoop projects
• MapReduce - A distributed data processing
model and execution environment that runs on
large clusters of commodity machines.
• HDFS - A distributed filesystem that runs on large
clusters of commodity machines.
Scaling out
• You’ve seen how MapReduce works for small inputs.
• Now it’s time to take a bird’s-eye view of the system and look at
the data flow for large inputs.
• For simplicity, the examples so far have used files on the local file
system.
• However, to scale out, we need to store the data in a distributed
file system, typically HDFS, to allow Hadoop to move the
MapReduce computation to each machine hosting a part of the data.
Figure legend: the dotted boxes indicate nodes, the light arrows show data transfers within a node, and the heavy arrows show data transfers between nodes.
Hadoop Streaming
• Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java.
• Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you
to create and run Map/Reduce jobs with any executable or script as the mapper and/or the
reducer.
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your
program, so you can use any language that can read standard input and write to standard
output to write your MapReduce program.
• Streaming is naturally suited for text processing (although, as of version 0.21.0, it can handle
binary streams, too), and when used in text mode, it has a line-oriented view of data.
• Map input data is passed over standard input to your map function, which
processes it line by line and writes lines to standard output.
• The mapper and the reducer can be, for example, Python scripts that read input from
standard input and emit output to standard output. The utility creates a
Map/Reduce job, submits the job to an appropriate cluster, and monitors the
progress of the job until it completes.
• A map output key-value pair is written as a single tab-delimited line.
• Input to the reduce function is in the same format: a tab-separated key-value
pair passed over standard input.
• The reduce function reads lines from standard input, which the framework
guarantees are sorted by key, and writes its results to standard output.
• Languages other than Java for writing map and reduce functions include:
– Ruby
– Python (a sketch follows below)
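As an illustration, here is a minimal word-count pair of Streaming scripts in Python. The file names (mapper.py, reducer.py) and the word-count task are assumptions for this sketch; any executable that reads standard input and writes tab-delimited lines to standard output would work the same way.

```python
import sys

# ---- mapper.py ----
def mapper():
    # Read raw input lines from standard input and emit one
    # tab-delimited "word<TAB>1" pair per word.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# ---- reducer.py ----
def reducer():
    # The framework guarantees reducer input is sorted by key, so all
    # lines for a given word arrive contiguously and can be summed on the fly.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
```

In practice each function would live in its own script and be submitted with the hadoop-streaming JAR that ships with Hadoop, passing the scripts via the -mapper and -reducer options; the exact JAR path varies by distribution.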
Hadoop pipes
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
• Unlike Streaming, which uses standard input and output to communicate with
the map and reduce code, Pipes uses sockets as the channel over which the
tasktracker communicates with the process running the C++ map or reduce
function.
• JNI is not used.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• HDFS is a filesystem designed for
– storing very large files with streaming data access
patterns
– running on clusters of commodity hardware.
Very large files
• “Very large” in this context means files that are
hundreds of megabytes, gigabytes, or terabytes
in size.
• There are Hadoop clusters running today that
store petabytes of data.
Streaming data access
• HDFS is built around the idea that the most efficient
data processing pattern is a write-once, read-many-
times pattern.
• A dataset is typically generated or copied from a source, and then
various analyses are performed on it over time. Each analysis will
involve a large proportion, if not all, of the dataset, so
the time to read the whole dataset is more important
than the latency in reading the first record.
Commodity hardware
• Hadoop doesn’t require expensive, highly reliable hardware to run on.
• It’s designed to run on clusters of commodity hardware (commonly available
hardware available from multiple vendors) for which the chance of node
failure across the cluster is high, at least for large clusters.
• HDFS is designed to carry on working without a noticeable interruption to the
user in the face of such failure.
• It is also worth examining the applications for which using HDFS does not work
so well.
• While this may change in the future, these are areas where HDFS is not a good
fit today.
Low-latency data access
• Applications that require low-latency access to data, in the
tens of milliseconds range, will not work well with HDFS.
• Remember, HDFS is optimized for delivering a high
throughput of data, and this may be at the expense of
latency.
• HBase is currently a better choice for low-latency access.
Lots of small files
• Since the namenode holds filesystem metadata in memory, the
limit to the number of files in a filesystem is governed by the
amount of memory on the namenode.
• As a rule of thumb, each file, directory, and block takes about 150
bytes (a rough worked estimate follows this list).
• Files in HDFS may be written to by a single writer. Writes are
always made at the end of the file. There is no support for
multiple writers, or for modifications at arbitrary offsets in the
file.
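A rough, illustrative estimate of namenode memory usage based on the 150-byte rule of thumb above; the file count and the one-block-per-file assumption are hypothetical numbers chosen only to show the arithmetic.

```python
BYTES_PER_OBJECT = 150       # rule of thumb: per file, directory, or block
num_files = 10_000_000       # hypothetical: 10 million small files
objects = num_files * 2      # one file object + one block object each (assumed)

memory_bytes = objects * BYTES_PER_OBJECT
print(f"~{memory_bytes / 1e9:.1f} GB of namenode memory")   # ~3.0 GB
```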
HDFS Concepts
• Blocks
• Namenodes and Datanodes