
Big Data

Unit- III

HDFS:-
The Hadoop Distributed File System was developed using distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed to use low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data,
the files are stored across multiple machines. These files are stored in a redundant fashion to
protect the system against data loss in case of failure. HDFS also makes applications
available for parallel processing.
An HDFS instance may consist of hundreds of server machines, each storing part of the file system
data; with so many components, the failure of at least one server at any time is practically
inevitable. HDFS has been built to detect these failures and recover from them automatically and quickly.
HDFS follows a master/slave architecture with the NameNode as master and the DataNodes as slaves.
Each cluster comprises a single master node and multiple slave nodes. Internally, files are
divided into one or more blocks, and each block is stored on different slave machines
depending on the replication factor.
HDFS Architecture:-

NameNode
 It is the machine that runs the operating system (GNU/Linux) and the
NameNode software. It is responsible for serving the client's read/write
requests.
 It stores metadata such as the number of blocks, their locations, permissions,
replicas and other details on the local disk in the form of two files:
- FSImage (File System Image): contains the complete namespace of the
Hadoop file system since the NameNode's creation.
- Edit log: contains all changes made to the file system namespace
since the most recent FSImage.
 The NameNode maintains and manages the slave nodes and assigns tasks to
them. It also tracks the status of each DataNode and makes sure it is alive.
 It manages files, controls a client's access to files, and oversees file
operations such as renaming, opening, and closing files.
DataNode

 Every node in the HDFS cluster runs a DataNode: commodity hardware with the
operating system (GNU/Linux) and the DataNode software, which manages the data
stored on that node and performs operations on the file system when the client
requests them.

 It creates, replicates, and deletes blocks when the NameNode instructs it to.

 It sends heartbeats and block reports to the NameNode to report its health and
the list of blocks it contains, respectively.

Blocks
Internally, HDFS splits each file into multiple chunks called blocks; the default
block size is 128 MB.
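As a rough illustration of how block size surfaces in the client API, the sketch below uses the Java FileSystem interface (covered in detail later in this unit) to query block sizes. It is a minimal sketch: the NameNode address and file path are assumptions made only for the example.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode URI is an assumption; use your cluster's fs.defaultFS value.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/user/hadoop/sample.txt");   // hypothetical file
        // Default block size used for newly created files (128 MB unless reconfigured)
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
        // Block size recorded in the metadata of an existing file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size of file: " + status.getBlockSize());
    }
}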

Rack
A rack is a collection of around 40-50 machines (DataNodes) connected
through the same network switch. If the switch goes down, the whole rack
becomes unavailable.

Rack Awareness in Hadoop is the policy of choosing DataNodes based on the
rack information in a large Hadoop cluster. It reduces network traffic while
reading/writing HDFS files and governs where replicas are stored, providing
low latency and fault tolerance. For the default replication factor of 3, the
rack awareness algorithm stores the first replica on the local node, the second
replica on a DataNode in a different rack, and the third replica on a different
DataNode in the same rack as the second.

Goals of HDFS
 Fault detection and recovery − Since HDFS includes a large number of
commodity hardware components, failures are frequent. Therefore HDFS
should have mechanisms for quick, automatic fault detection and
recovery.
 Huge datasets − HDFS should scale to hundreds of nodes per cluster to
manage applications having huge datasets.
 Hardware at data − A requested task can be done efficiently when the
computation takes place near the data. Especially where huge datasets are
involved, this reduces network traffic and increases throughput.
 Streaming data access − Applications that run on HDFS need streaming
access to their data sets; they are not the general-purpose applications that
typically run on general-purpose file systems.
 Coherence model − Applications that run on HDFS follow a
write-once-read-many access model. A file, once created, need not
be changed, although it can be appended to or truncated.

Features of HDFS

o Highly scalable - HDFS is highly scalable, as a single cluster can scale to
hundreds of nodes.
o Replication - Under unfavourable conditions, the node containing the data
may be lost. To overcome such problems, HDFS always maintains copies of
the data on different machines.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of
the system in the event of failure. HDFS is highly fault tolerant: if any
machine fails, another machine containing a copy of that data
automatically takes over.
o Distributed data storage - This is one of the most important features of
HDFS and is what makes Hadoop so powerful. Data is divided into
multiple blocks and stored across nodes.
o Portable - HDFS is designed so that it can easily be ported from one
platform to another.

Benefits And Challenges Of HDFS: -


1. Cost: Hadoop is open source and uses cost-effective commodity hardware,
which provides a cost-efficient model, unlike traditional relational databases
that require expensive hardware and high-end processors to deal with Big
Data. The problem with traditional relational databases is that storing
massive volumes of data is not cost-effective, so companies started to
discard raw data, which may leave them without a complete picture of their
business. Hadoop thus provides two main cost benefits: it is open source and
therefore free to use, and it runs on commodity hardware, which is also
inexpensive.
2. Scalability: Hadoop is a highly scalable model. A large amount of data is
divided across multiple inexpensive machines in a cluster and processed in
parallel. The number of these machines or nodes can be increased or
decreased as per the enterprise's requirements. Traditional
RDBMSs (Relational Database Management Systems) cannot be scaled in
this way to handle large amounts of data.
3. Flexibility: Hadoop is designed so that it can deal with any kind of
dataset very efficiently, whether structured (MySQL data), semi-structured
(XML, JSON) or unstructured (images and videos). This means it can process
any kind of data independently of its structure, which makes it highly
flexible. This is very useful for enterprises, as they can process large
datasets easily and analyze data from sources like social media, email, etc.
With this flexibility Hadoop can be used for log processing, data
warehousing, fraud detection, and so on.
4. Speed: Hadoop manages its storage through a distributed file system,
HDFS (Hadoop Distributed File System). In a DFS (Distributed File System), a
large file is broken into small file blocks that are distributed among the
nodes available in a Hadoop cluster. Because this massive number of file
blocks is processed in parallel, Hadoop is fast and delivers high performance
compared to traditional database management systems. When you are dealing
with a large amount of unstructured data, speed is an important factor; with
Hadoop you can easily access terabytes of data in just a few minutes.
5. Fault Tolerance: Hadoop uses commodity hardware (inexpensive systems)
which can crash at any moment. In Hadoop, data is replicated on multiple
DataNodes in the cluster, which ensures the availability of data even if one
of the systems crashes. If one machine faces a technical issue, the data can
be read from other nodes in the Hadoop cluster, because the data is copied
or replicated by default: Hadoop makes 3 copies of each file block and
stores them on different nodes.
6. High Throughput: Hadoop works on a distributed file system where jobs
are assigned to the various DataNodes in a cluster and the data is
processed in parallel, which produces high throughput.
Throughput is simply the amount of work done per unit time.
7. Minimum Network Traffic: In Hadoop, each task is divided into small
sub-tasks, which are then assigned to the DataNodes available in the
Hadoop cluster. Each DataNode processes a small amount of data, which leads
to low traffic in the Hadoop cluster.

Challenges: -
1. Problem with small files: Hadoop performs efficiently over a small number of
large files. Hadoop stores files in the form of file blocks, which are 128 MB
(by default) to 256 MB in size. Hadoop struggles when it needs to access a large
number of small files: so many small files overload the NameNode and
make it difficult to work with.
2. Vulnerability: Hadoop is a framework written in Java, one of the most
commonly used programming languages, which makes it a frequent target and
means it can be exploited by cyber criminals.
3. Low performance in small-data environments: Hadoop is mainly designed for
dealing with large datasets, so it can be efficiently utilized by organizations that
generate a massive volume of data. Its efficiency decreases when working with
small amounts of data.
4. Lack of security: Data is everything for an organization, yet by default the security
features in Hadoop are disabled, so the data administrator needs to be aware of
this and take appropriate action. Hadoop uses Kerberos for security, which is not
easy to manage, and storage and network encryption are missing in Kerberos,
which raises further concerns.
5. Processing overhead: Read/write operations in Hadoop are expensive, since we are
dealing with data of terabyte or petabyte scale. In Hadoop, data is read from and
written to disk, which makes in-memory computation difficult and leads to
processing overhead.
6. Supports only batch processing: A batch process runs in the background and does
not have any kind of interaction with the user. The engines used for these processes
inside the Hadoop core are not very efficient, and producing output with low
latency is not possible with them.

Design & Working Of HDFS: -


HDFS is built using the Java language and enables the rapid transfer of data between
compute nodes. At its outset, it was closely coupled with MapReduce, a framework for data
processing that filters and divides up work among the nodes in a cluster and organizes and
condenses the results into a cohesive answer to a query. Similarly, when HDFS takes in data,
it breaks the information down into separate blocks and distributes them to different nodes in
a cluster.

The following describes how HDFS works:


 With HDFS, data is written on the server once and read and reused numerous times.
 HDFS has a primary NameNode, which keeps track of where file data is kept in the
cluster.
 HDFS has multiple DataNodes on a commodity hardware cluster -- typically one per
node in a cluster. The DataNodes are generally organized within the same rack in the
data center. Data is broken down into separate blocks and distributed among the
various DataNodes for storage. Blocks are also replicated across nodes, enabling
highly efficient parallel processing.
 The NameNode knows which DataNode contains which blocks and where the
DataNodes reside within the machine cluster. The NameNode also manages access to
the files, including reads, writes, creates, deletes and the data block replication across
the DataNodes.
 The NameNode operates together with the DataNodes. As a result, the cluster can
dynamically adapt to server capacity demands in real time by adding or subtracting
nodes as necessary.
 The DataNodes are in constant communication with the NameNode to determine if
the DataNodes need to complete specific tasks. Consequently, the NameNode is
always aware of the status of each DataNode. If the NameNode realizes that one
DataNode isn't working properly, it can immediately reassign that DataNode's task to
a different node containing the same data block. DataNodes also communicate with
each other, which enables them to cooperate during normal file operations.
 The HDFS is designed to be highly fault tolerant. The file system replicates -- or
copies -- each piece of data multiple times and distributes the copies to individual
nodes, placing at least one copy on a different server rack than the other copies.
HDFS Storage Daemons:-
Hadoop's MapReduce layer follows a master-slave architecture, and HDFS works in a
similar pattern with its two daemons:
1. NameNode (Master)
2. DataNode (Slave)
HDFS uses a primary/secondary architecture where each HDFS cluster comprises many
worker nodes and one primary node, the NameNode. The NameNode is the controller node,
as it knows the metadata and status of all files including file permissions, names and location
of each block. An application or user can create directories and then store files inside these
directories. The file system namespace hierarchy is like most other file systems, as a user can
create, remove, rename or move files from one directory to another.
The HDFS cluster's NameNode is the primary server that manages the file system namespace
and controls client access to files. As the central component of the Hadoop Distributed File
System, the NameNode maintains and manages the file system namespace and provides
clients with the right access permissions. The system's DataNodes manage the storage that's
attached to the nodes they run on.
Name Node:
The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
The NameNode is mainly used for storing metadata, i.e. data about the data.
This metadata includes the transaction logs that keep track of user activity in the Hadoop
cluster, as well as file names, sizes, and information about the location (block numbers,
block IDs) of blocks on the DataNodes, which the NameNode uses to find the closest DataNode
for faster communication. The NameNode instructs the DataNodes to perform operations such as
delete, create, replicate, etc.
Because the NameNode works as the master, it should have a large amount of RAM and
processing power in order to maintain and guide all the slaves in the Hadoop cluster. The
NameNode receives heartbeat signals and block reports from all the slaves, i.e. the DataNodes.

The NameNode performs the following key functions:

 The NameNode performs file system namespace operations, including opening,
closing and renaming files and directories.
 The NameNode governs the mapping of blocks to the DataNodes.
 The NameNode records any changes to the file system namespace or its properties.
An application can stipulate the number of replicas of a file that the HDFS should
maintain.
 The NameNode stores the number of copies of a file, called the replication factor of
that file.
 To ensure that the DataNodes are alive, the NameNode gets block reports and
heartbeat data.
 In case of a DataNode failure, the NameNode selects new DataNodes for replica
creation.

Data Node:
DataNodes work as slaves. DataNodes are mainly used for storing data in a Hadoop
cluster; the number of DataNodes can range from 1 to 500 or even more, and the more
DataNodes a Hadoop cluster has, the more data it can store. It is therefore advised that
DataNodes have high storage capacity to hold a large number of file blocks.
A DataNode performs operations like creation, deletion, etc. according to the instructions
provided by the NameNode.

In HDFS, DataNodes function as worker nodes or Hadoop daemons and are typically made
of low-cost off-the-shelf hardware. A file is split into one or more of the blocks that are
stored in a set of DataNodes. Based on their replication factor, the files are internally
partitioned into many blocks that are kept on separate DataNodes.
The DataNodes perform the following key functions:
 The DataNodes serve read and write requests from the clients of the file system.
 The DataNodes perform block creation, deletion and replication when the NameNode
instructs them to do so.
 The DataNodes transfer periodic heartbeat signals to the NameNode to help keep
HDFS health in check.
 The DataNodes provide block reports to NameNode to help keep track of the blocks
included within the DataNodes. For redundancy and higher availability, each block is
copied onto two extra DataNodes by default.

File size: -
 HDFS supports large files and large numbers of files.
 A typical HDFS cluster has tens of millions of files.
 A typical file is 100 MB or larger.
 The NameNode maintains the namespace metadata of each file, such as the filename,
directory name, user, permissions, etc.
 A file is considered small if its size is much less than the block size. For example, if
the block size is 128 MB and the file size is 1 MB to 50 MB, the file is considered a
small file.

Block size: -
 HDFS stores files into fixed-size blocks.
 The default HDFS block size is 128 MB.
 The block size can be configured as per our requirements.
 Hadoop distributes these blocks on different slave machines, and the master machine
stores the metadata about blocks location.

Issues with Block Size:


Issues with small block size:
 A small block size is also a problem for the NameNode, since it keeps the metadata
of all blocks in memory; with a small block size the NameNode can run out of
memory.
 Too small a block size would also result in many unnecessary splits, and hence
many tasks, which might be beyond the capacity of the cluster.
Issues with large block size:
 With a large block size the cluster would be underutilized: there would be
fewer splits and in turn fewer map tasks, which slows down the job.
 A large block size would decrease parallelism.
Advantages of Hadoop Data Blocks
1. No limitation on the file size: - A file can be larger than any single disk in the
network.
2. Simplicity of the storage subsystem: - Since blocks are of a fixed size, we can easily
calculate the number of blocks that can be stored on a given disk, which keeps the
storage subsystem simple.
3. Fit well with replication for providing fault tolerance and high availability: - Blocks
are easy to replicate between DataNodes, thus providing fault tolerance and high
availability.
4. Eliminating metadata concerns: - Since blocks are just chunks of data to be stored,
file metadata (such as permission information) does not need to be stored with the
blocks; another system can handle metadata separately.
Conclusion
We can conclude that HDFS data blocks are fixed-size chunks of 128 MB by default.
We can configure this size as per our requirements. Files smaller than the block
size do not occupy a full block. The HDFS block size is kept large in order to
reduce the cost of seeks and network traffic.

Block abstraction: -
o The HDFS block size is usually 64 MB-128 MB and, unlike in other filesystems, a file
smaller than the block size does not occupy a full block's worth of underlying
storage.
o The block size is kept large so that the time spent seeking on disk is small
compared to the time spent transferring the data.

Why do we need block abstraction?


o Files can be bigger than individual disks.
o Filesystem metadata does not need to be associated with each and every block.
o It simplifies storage management - it is easy to figure out the number of blocks
which can be stored on each disk.
o Fault tolerance and storage replication can be easily done on a per-block basis.
Data Replication: -
Data replication is an important part of the HDFS format as it ensures data remains
available if there's a node or hardware failure. As previously mentioned, the data is
divided into blocks and replicated across numerous nodes in the cluster and, except
for the last block in a file, all block sizes are the same. Therefore, when one node goes
down, the user can access the data that was on that node from other machines. HDFS
maintains the replication process at regular intervals.
The following are some key functions and benefits HDFS data replication provides:
1. Replication factor: The replication factor determines the number of copies that are
made of each data block. Since the replication factor in HDFS is set to 3 by default,
each data block is replicated three times. This guarantees that, in the event of a node
failure or data corruption, several copies of the data block will be available.
2. Data availability: HDFS replication makes data more available by enabling the
storage of several copies of a given data block on various nodes. This guarantees that
data can be accessible from other nodes even in the event of a temporary node outage.
3. Placement policy: HDFS replicates data blocks according to a placement policy.
When the replication factor is 3, HDFS places one replica on the local machine if the
writer is on a DataNode, the second replica on a DataNode in a different (remote)
rack, and the third replica on a different DataNode in that same remote rack. This
placement policy provides data locality, which reduces network traffic.
4. Rack awareness: As HDFS is designed with rack awareness, it considers the cluster's
network structure. This helps minimize the effects of rack failures on data availability
by ensuring that duplicates of data blocks are distributed across different racks.
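The replication factor can also be inspected and changed per file through the Java FileSystem API. The following is only a minimal sketch; the NameNode address and file path are assumptions made for the example.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode URI is an assumption; use your cluster's fs.defaultFS value.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/user/hadoop/sample.txt");   // hypothetical file
        // Replication factor currently recorded in the NameNode metadata
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + current);
        // Request a different replication factor; the NameNode schedules the extra copies
        fs.setReplication(file, (short) 2);
    }
}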

HDFS Read and Write Architecture:-


HDFS follows a write-once-read-many model, so we cannot edit files already stored
in HDFS, but we can append data by reopening a file. In a read or write operation
the client first interacts with the NameNode. The NameNode checks the client's
privileges, after which the client can read and write data blocks directly to/from
the respective DataNodes. This section discusses the internals of HDFS data read
and write operations: how a client reads and writes data, and how the client
interacts with the master and slave nodes during these operations.
Read Operation In HDFS:
A data read request is served by the NameNode and the DataNodes. To read a file from
HDFS, a client needs to interact with the NameNode (master), as the NameNode is the
centerpiece of the Hadoop cluster (it stores all the metadata, i.e. data about the data).
The NameNode checks for the required privileges; if the client has sufficient privileges,
the NameNode provides the addresses of the slaves where the file is stored. The client
then interacts directly with the respective DataNodes to read the data blocks.
Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System (DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For each
block, the name node returns the addresses of the data nodes that have a copy of that
block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node
and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the DataNode addresses for the first few blocks in the file, connects to the first
(closest) DataNode holding the first block in the file.
Step 4: Data is streamed from the DataNode back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection
to that DataNode and finds the best DataNode for the next block. This happens
transparently to the client, which from its point of view is simply reading a
continuous stream. Blocks are read in order, with the DFSInputStream opening new
connections to DataNodes as the client reads through the stream. It will also call
the NameNode to retrieve the DataNode locations for the next batch of blocks as
needed.
Step 6: When the client has finished reading the file, it calls close() on
the FSDataInputStream.
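The client-side calls in the read path map onto a few lines of Java. The sketch below is only illustrative (the NameNode address and file path are assumptions); internally, open() and read() trigger the NameNode and DataNode interactions described in the steps above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode URI is an assumption for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Steps 1-2: open() asks the NameNode for block locations and returns an FSDataInputStream
        FSDataInputStream in = fs.open(new Path("/user/hadoop/sample.txt")); // hypothetical file
        byte[] buffer = new byte[4096];
        int bytesRead;
        // Steps 3-5: read() streams data from the DataNodes holding each block in turn
        while ((bytesRead = in.read(buffer)) > 0) {
            System.out.write(buffer, 0, bytesRead);
        }
        // FSDataInputStream also supports seek(), so a client can re-read from an earlier offset
        in.seek(0);
        // Step 6: close the stream when finished
        in.close();
    }
}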

Write Operation In HDFS:-


To write a file in HDFS, a client needs to interact with the master, i.e. the NameNode.
The NameNode provides the addresses of the DataNodes (slaves) on which the client will
start writing the data. The client writes data directly to the DataNodes, and the
DataNodes then form a data write pipeline.
The first DataNode copies the block to another DataNode, which in turn copies it to the
third DataNode. Once the replicas of the block have been created, the acknowledgment is
sent back.

Step 1: The client creates the file by calling create() on DistributedFileSystem(DFS).


Step 2: DFS makes an RPC call to the name node to create a new file in the file
system's namespace, with no blocks associated with it. The name node performs
various checks to make sure the file doesn't already exist and that the client has the
right permissions to create the file. If these checks pass, the name node makes a
record of the new file; otherwise, the file can't be created and the client is thrown
an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the
client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets,
which it writes to an internal queue called the data queue. The data queue is consumed
by the DataStreamer, which is responsible for asking the name node to allocate new blocks
by picking a list of suitable data nodes to store the replicas. The list of data
nodes forms a pipeline, and here we'll assume the replication level is three, so there
are three nodes in the pipeline. The DataStreamer streams the packets to the first
data node in the pipeline, which stores each packet and forwards it to the second
data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third
(and last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting
to be acknowledged by data nodes, called the "ack queue".
Step 6: When the client has finished writing data, it calls close() on the stream. This
flushes all the remaining packets to the data node pipeline and waits for
acknowledgments before contacting the name node to signal that the file is complete.
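From the client's point of view the write path also reduces to a few calls, with the pipeline and acknowledgment handling happening behind the scenes. This is a minimal sketch; the NameNode address and output path are assumptions made for the example.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode URI is an assumption for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Steps 1-2: create() asks the NameNode to record the new file and returns an output stream
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/output.txt")); // hypothetical path
        // Steps 3-5: data written here is split into packets and pushed down the DataNode pipeline
        out.writeUTF("hello hdfs");
        // hflush() forces buffered packets out to the pipeline (an optional durability call)
        out.hflush();
        // Step 6: close() waits for acknowledgments and tells the NameNode the file is complete
        out.close();
    }
}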
JAVA INTERFACE IN HDFS: -
Hadoop has an abstract notion of filesystems, of which HDFS is just one
implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents
the client interface to a filesystem in Hadoop, and there are several concrete
implementations. Hadoop is written in Java, so most Hadoop filesystem interactions
are mediated through the Java API. The filesystem shell, for example, is a Java
application that uses the Java FileSystem class to provide filesystem operations. By
exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-
Java applications to access HDFS. A file in a Hadoop filesystem is represented by a
Hadoop Path object. FileSystem is a general filesystem API, so the first step is to
retrieve an instance for the filesystem we want to use—HDFS, in this case. There are
several static factory methods for getting a FileSystem instance
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws
IOException
public static LocalFileSystem getLocal(Configuration conf) throws IOException
A Configuration object encapsulates a client or server's configuration, which is set
using configuration files read from the classpath, such as core-site.xml. The first
method returns the default filesystem (as specified in core-site.xml, or the default
local filesystem if not specified there). The second uses the given URI's scheme and
authority to determine the filesystem to use, falling back to the default filesystem if no
scheme is specified in the given URI. The third retrieves the filesystem as the given
user, which is important in the context of security. The fourth retrieves a local
filesystem instance. With a FileSystem instance in hand, we invoke an open() method to
get the input stream for a file. The first open() method uses a default buffer size of
4 KB; the second lets the caller specify the buffer size:
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
A Path object can be created by using a designated URI
Path f = new Path(uri)
Putting together, we can create the following file reading application
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                       // HDFS URI of the file to read
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        FSDataInputStream in = fs.open(path);       // open an input stream to the file
        IOUtils.copyBytes(in, System.out, 4096, true);  // copy to stdout and close the stream
    }
}
The compilation simply uses the javac command, but it needs the Hadoop dependencies on
the class path:
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
javac -cp $HADOOP_CLASSPATH FileSystemCat.java
Then a jar file is created and run as follows:
jar cvf FileSystemCat.jar FileSystemCat*.class
hadoop jar FileSystemCat.jar FileSystemCat input/README.txt
The output is the same as that of the command hadoop fs -cat input/README.txt.
To write a file on HDFS, the simplest way is to take a Path object for the file to be
created and get an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
Suppose an input stream has been created to read a local file; then we can simply copy
the input stream to the output stream. Another, more flexible, way is to read the input
stream into a buffer and then write the buffer to the output stream.
A file writing application
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemPut {
    public static void main(String[] args) throws Exception {
        String localStr = args[0];                  // path of the local source file
        String hdfsStr = args[1];                   // HDFS URI of the destination file
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(URI.create(hdfsStr), conf);
        Path localFile = new Path(localStr);
        Path hdfsFile = new Path(hdfsStr);
        FSDataInputStream in = local.open(localFile);    // read from the local filesystem
        FSDataOutputStream out = hdfs.create(hdfsFile);  // create the file on HDFS
        IOUtils.copyBytes(in, out, 4096, true);          // copy and close both streams
    }
}
Another file writing application
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemPutAlt {
    public static void main(String[] args) throws Exception {
        String localStr = args[0];                  // path of the local source file
        String hdfsStr = args[1];                   // HDFS URI of the destination file
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(URI.create(hdfsStr), conf);
        Path localFile = new Path(localStr);
        Path hdfsFile = new Path(hdfsStr);
        FSDataInputStream in = local.open(localFile);
        FSDataOutputStream out = hdfs.create(hdfsFile);
        // Copy through an explicit buffer instead of using IOUtils.copyBytes()
        byte[] buffer = new byte[256];
        int bytesRead = 0;
        while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);
        }
        in.close();
        out.close();
    }
}
Other FileSystem API methods:
The method mkdirs() creates a directory.
The method getFileStatus() gets the metadata for a single file or directory.
The method listStatus() lists the contents of a directory.
The method exists() checks whether a file exists.
The method delete() removes a file or directory.
The Java API enables the implementation of customised applications to interact with
HDFS.
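A small sketch combining the methods listed above (mkdirs(), exists(), listStatus(), delete()) is given below; the NameNode address and directory path are assumptions made only for the example.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode URI is an assumption for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path dir = new Path("/user/hadoop/demo");   // hypothetical directory
        fs.mkdirs(dir);                              // create the directory
        if (fs.exists(dir)) {                        // check that it exists
            for (FileStatus status : fs.listStatus(dir)) {   // list its contents
                System.out.println(status.getPath() + " " + status.getLen());
            }
        }
        fs.delete(dir, true);                        // remove it recursively
    }
}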

Command Line Interface: -


HDFS is the primary or major component of the Hadoop ecosystem which is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files. To use the HDFS commands, first you need
to start the Hadoop services using the following command:
sbin/start-all.sh
To check that the Hadoop services are up and running, use the following command:
jps
Commands:
1. ls: This command is used to list all the files. Use lsr for recursive approach. It is
useful when we want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
It will print all the directories present in HDFS. bin directory contains
executables so, bin/hdfs means we want the executables of hdfs particularly
dfs(Distributed File System) commands.
2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So
let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <directory path>
creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer
3. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
4. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store.
This is the most important command. Local filesystem means the files present on the
OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest (present on hdfs)>
5. cat: To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <src file (on hdfs)> <local file dest>
7. moveFromLocal: This command will move file from local to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
8. cp: This command is used to copy files within hdfs. Lets copy folder geeks to
geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
9. mv: This command is used to move files within hdfs. Lets cut-paste a file myfile.txt
from geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
10. rmr: This command deletes a file from HDFS recursively. It is very useful command
when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
11. du: It will give the size of each file in directory.
Syntax:
bin/hdfs dfs -du <dirName>
12. dus: This command will give the total size of directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
13. stat: It will give the last modified time of directory or path. In short it will give stats
of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
14. setrep: This command is used to change the replication factor of a file/directory in
HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in
hdfs-site.xml).

HADOOP FILESYSTEM INTERFACE: -


Hadoop is capable of working with a variety of file systems, and HDFS is just one
implementation among them. The Java abstract class org.apache.hadoop.fs.FileSystem
represents a file system in Hadoop, and it has several concrete implementations. Some of
them are listed below (the Java implementation classes are all under org.apache.hadoop):
Local - URI scheme: file; Java implementation: fs.LocalFileSystem. A filesystem for
locally connected disks.
HDFS - URI scheme: hdfs; Java implementation: hdfs.DistributedFileSystem. HDFS stands
for Hadoop Distributed File System and is designed to work efficiently with MapReduce.
HFTP - URI scheme: hftp; Java implementation: hdfs.HftpFileSystem. The HFTP filesystem
provides read-only access to HDFS over HTTP. It has no connection with FTP. This
filesystem is commonly used with distcp to share data between HDFS clusters running
different versions.
HSFTP - URI scheme: hsftp; Java implementation: hdfs.HsftpFileSystem. The HSFTP
filesystem provides read-only access to HDFS over HTTPS. It also has no connection
with FTP.
HAR - URI scheme: har; Java implementation: fs.HarFileSystem. The HAR filesystem is
mainly used to reduce the memory usage of the NameNode by archiving files in Hadoop
HDFS. It is layered on another filesystem for archiving purposes.
KFS (CloudStore) - URI scheme: kfs; Java implementation: fs.kfs.KosmosFileSystem.
CloudStore or KFS (Kosmos File System) is a distributed filesystem written in C++,
very similar to HDFS and GFS (Google File System).
FTP - URI scheme: ftp; Java implementation: fs.ftp.FTPFileSystem. The FTP filesystem is
backed by an FTP server.
S3 (native) - URI scheme: s3n; Java implementation: fs.s3native.NativeS3FileSystem.
This filesystem is backed by Amazon S3.
S3 (block-based) - URI scheme: s3; Java implementation: fs.s3.S3FileSystem. An
S3-backed filesystem that stores files in blocks (similar to HDFS) to overcome S3's
5 GB file size limit.

HADOOP DATA FLOW: -


A basic data flow of the Hadoop system can be divided into four phases:
1. Capture Big Data: The sources can be extensive: structured, semi-structured, and
unstructured data, streaming and real-time data sources, sensors, devices,
machine-captured data, and many others. For data capture and storage, the Hadoop
ecosystem provides different data integrators such as Flume, Sqoop, Storm, and so
on, depending on the type of data.
2. Process and Structure: We will be cleansing, filtering, and transforming the data by
using a MapReduce-based framework or some other frameworks which can perform
distributed programming in the Hadoop ecosystem. The frameworks available
currently are MapReduce, Hive, Pig, Spark and so on.
3. Distribute Results: The processed data can be used by the BI and analytics system or
the big data analytics system for performing analysis or visualization.
4. Feedback and Retain: The data analysed can be fed back to Hadoop and used for
improvements.

Data Ingestion: -
In big data, data ingestion refers to the process of collecting, importing, and loading
large amounts of data from various sources into a big data infrastructure for storage,
processing, and analysis.
What it is:
Data ingestion is the first step in a big data pipeline, where raw data is gathered from
diverse sources, such as databases, applications, web servers, and IoT devices.
Why it's important:
A robust data ingestion process is crucial for enabling businesses to effectively
leverage their big data assets for analytics, business intelligence, and machine
learning.
Process:
 Collection: Data is gathered from various sources, which can be structured or
unstructured.
 Import: The collected data is then imported into a big data storage system,
like a data lake or data warehouse.
 Loading: The data is loaded into the storage system, often after undergoing
transformation and cleaning to ensure data quality and consistency.
Benefits:
 Centralized Data: Data from multiple sources is consolidated into a single
location, making it easier to access and analyze.
 Improved Data Quality: Data transformation and cleaning processes ensure
data accuracy and consistency.
 Enables Analytics: The ingested data can be used for various analytical
purposes, such as identifying trends, patterns, and insights.
Challenges:
 Data Volume: Dealing with large volumes of data can be challenging.
 Data Variety: Different data sources and formats can complicate the ingestion
process.
 Data Velocity: Real-time data streams require fast and efficient ingestion
capabilities.
 Data Consistency: Ensuring data consistency across different sources and
formats can be difficult.

Apache Sqoop: -
In the context of big data, Apache Sqoop is a tool designed for efficiently transferring
bulk data between Apache Hadoop and external structured datastores, like relational
databases, enabling data ingestion and extraction for various analytical purposes.
Need for Apache Sqoop?
With increasing number of business organizations adopting Hadoop to analyse huge
amounts of structured or unstructured data, there is a need for them to transfer
petabytes or exabytes of data between their existing relational databases, data sources,
data warehouses and the Hadoop environment. Accessing huge amounts of
unstructured data directly from MapReduce applications running on large Hadoop
clusters or loading it from production systems is a complex task because data transfer
using scripts is often not effective and time consuming.

How Apache Sqoop works?


Sqoop is an effective Hadoop tool for non-programmers. It works by looking at
the databases that need to be imported and choosing a relevant import function for the
source data. Once the input is recognized by Sqoop, the metadata for the table
is read and a class definition is created for the input requirements.
Sqoop can be made to work selectively by importing only the columns needed instead of
importing the entire input and searching for the data in it, which saves a considerable
amount of time. Under the hood, the import from the database to HDFS is accomplished by
a MapReduce job that Apache Sqoop creates in the background.
Key features and functionality:
 Data Transfer:
Sqoop facilitates the movement of data between Hadoop's Distributed File System
(HDFS) and external relational databases (RDBMS) like MySQL, Oracle, and
PostgreSQL.
 Import and Export:
It supports both importing data from external databases into HDFS and exporting
data from HDFS to external databases.
 ETL (Extract, Transform, Load):
Sqoop plays a crucial role in the ETL process, enabling the extraction of data from
various sources and loading it into Hadoop for further processing and analysis.
 Data Integrity:
Sqoop ensures data integrity during the transfer process, preventing data loss or
corruption.
 Scalability:
Sqoop is designed to handle large volumes of data, making it suitable for big data
environments.
 Automation:
Sqoop can be used to automate data transfer tasks, reducing manual effort and
improving efficiency.
 Integration:
Sqoop integrates well with other Hadoop ecosystem components like Hive and
HBase, allowing for seamless data processing and analysis.
 Data Ingestion:
Sqoop is a critical tool for data ingestion, enabling organizations to bring data
from various sources into Hadoop for analysis.
 Structured Data Focus:
Sqoop is primarily designed for working with structured data, making it a good fit
for relational databases.
 SQL-to-Hadoop:
Sqoop's name, "SQL-to-Hadoop," reflects its core functionality of transferring
data between SQL databases and Hadoop.

Features of Apache Sqoop: -


Apache Sqoop supports bulk import, i.e. it can import a complete database or individual
tables into HDFS, storing the imported data as files in HDFS directories. Sqoop
parallelizes data transfer for optimal system utilization and fast performance. Apache
Sqoop also provides direct input, i.e. it can map relational tables and import them
directly into HBase and Hive.
 Sqoop makes data analysis efficient.
 Sqoop helps in mitigating the excessive loads to external systems.
 Sqoop provides data interaction programmatically by generating Java classes.
Companies Using Apache Sqoop
 The Apollo Group education company uses Sqoop to extract data from external
databases and inject results of Hadoop jobs back into the RDBMS’s.
 Coupons.com uses Sqoop tool for data transfer between its IBM Netezza data
warehouse and the Hadoop environment.

Flume: -
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data, especially log data, into
Hadoop for storage and analysis. Flume is a highly reliable, distributed, and configurable
tool. It is principally designed to copy streaming data (log data) from various web servers to
HDFS.
Need for Flume:-
Logs are usually a source of stress and argument in most of the big data companies. Logs are
one of the most painful resources to manage for the operations team as they take up huge
amount of space. Logs are rarely present at places on the disk where someone in the company
can make effective use of them or Hadoop developers can access them. Many big data
companies wind up building tools and processes to collect logs from application servers,
transfer them to some repository so that they can control the lifecycle without consuming
unnecessary disk space. This frustrates developers, as the logs are often not present at
a location where they can view them easily, they have a limited number of tools available
for processing logs, and they have limited means of intelligently managing the log
lifecycle. Apache Flume is designed to address the difficulties of both the operations
group and developers by providing an easy-to-use tool that can push logs from a bunch of
application servers to various repositories via a highly configurable agent.
How Apache Flume works?
Flume has a simple event driven pipeline architecture with 3 important roles Source, Channel
and Sink.
 Source defines where the data is coming from, for instance a message queue or a file.
 Sinks define the destination of the data pipelined from the various sources.
 Channels are pipes which establish the connection between sources and sinks.
Apache flume works on two important concepts-
 The master acts like a reliable configuration service which is used by nodes for
retrieving their configuration.
 If the configuration for a particular node changes on the master then it will
dynamically be updated by the master.
A node is generally an event pipe in Hadoop Flume which reads from the source and writes
to the sink. The characteristics and role of a Flume node are determined by the behaviour
of its sources and sinks. Apache Flume comes with several source and sink options, but if
none of them fits your requirements, developers can write their own. A Flume node can also
be configured with the help of a sink decorator, which can interpret an event and
transform it as it passes through. With these basic primitives, developers can create
different topologies to collect data on any application server and direct it to any log
repository.

Advantages of Flume: -
 Using Apache Flume we can store the data in to any of the centralized stores (HBase,
HDFS).
 When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized
stores and provides a steady flow of data between them.
 Flume provides the feature of contextual routing.
 The transactions in Flume are channel-based where two transactions (one sender and
one receiver) are maintained for each message. It guarantees reliable message
delivery.

 Flume is reliable, fault tolerant, scalable, manageable, and customizable.

Features of Flume: -
 Flume ingests log data from multiple web servers into a centralized store (HDFS,
HBase) efficiently.
 Using Flume, we can get the data from multiple servers immediately into Hadoop.
 Along with the log files, Flume is also used to import huge volumes of event data
produced by social networking sites like Facebook and Twitter, and e-commerce
websites like Amazon and Flipkart.
 Flume supports a large set of sources and destinations types.
 Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
 Flume can be scaled horizontally.
Companies Using Apache Flume
 Goibibo uses Hadoop flume to transfer logs from the production systems into HDFS.
 Mozilla uses flume Hadoop for the BuildBot project along with Elastic Search.
 Capillary technologies uses Flume for aggregating logs from 25 machines in
production.

Comparison between the Sqoop and Flume: -


Hadoop Archives:-
Hadoop Archive (HAR) is a facility that packs small files into larger archive files to
avoid wasting NameNode memory.
 Name node stores the metadata information of the HDFS data.
 If a 1 GB file is broken into 1000 pieces, the NameNode has to store metadata
about all those 1000 small files.
 In that manner, NameNode memory is wasted in storing and managing a lot of
metadata.
 A HAR is created from a collection of files, and the archiving tool runs a MapReduce
job.
 The map tasks of this job process the input files in parallel to create the archive file.
 Hadoop is built to deal with large files, so small files are problematic and need to be
handled efficiently.
 To handle this problem, Hadoop Archives were created; they pack HDFS files into
archives, and these archive files can be used directly as input to MapReduce jobs.
 An archive always has the *.har extension.
 HAR syntax: hadoop archive -archiveName NAME -p <parent path> <src>* <dest>


Limitations of HAR Files:

1. Creating a HAR file creates a copy of the original files, so we need as much
additional disk space as the size of the original files being archived. We can
delete the original files after the archive has been created to release disk space.
2. Once an archive is created, we need to recreate the archive to add or remove
files.
3. Using a HAR file as MapReduce input still results in many map tasks (the archived
files are processed individually), which is inefficient.

HADOOP I/O:-
1. COMPRESSION:-
File compression brings two major benefits: it reduces the space needed to store files,
and it speeds up data transfer across the network or to or from disk. When dealing
with large volumes of data, both of these savings can be significant, so it pays to
carefully consider how to use compression in Hadoop.
I. Compressing input files:- If the input file is compressed, then the bytes read
in from HDFS is reduced, which means less time to read data. This time
conservation is beneficial to the performance of job execution. If the input
files are compressed, they will be decompressed automatically as they are read
by MapReduce, using the filename extension to determine which codec to use.
For example, a file ending in .gz can be identified as gzip-compressed file and
thus read with GzipCodec.
II. Compressing output files:- Often we need to store the output as history files.
If the amount of output per day is extensive, and we often need to store history
results for future use, then these accumulated results will take extensive
amount of HDFS space. However, these history files may not be used very
frequently, resulting in a waste of HDFS space. Therefore, it is necessary to
compress the output before storing on HDFS.
III. Compressing map output:- Even if your MapReduce application reads and
writes uncompressed data, it may benefit from compressing the intermediate
output of the map phase. Since the map output is written to disk and
transferred across the network to the reducer nodes, by using a fast compressor
such as LZO or Snappy, you can get performance gains simply because the
volume of data to transfer is reduced.
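As an illustration, compression of map output and of final job output can be enabled through job configuration. The sketch below assumes Hadoop 2.x property names and a MapReduce Job object; the codec choices (Snappy for intermediate data, gzip for final output) are just examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output with a fast codec such as Snappy
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compressed job");  // hypothetical job
        // Compress the final job output with gzip
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}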

SERIALIZATION:-

Data serialization is the process of converting data objects present in complex
data structures into a byte stream for storage, transfer and distribution
purposes on physical devices.
Once the serialized data is transmitted the reverse process of creating objects
from the byte sequence called deserialization. Serialization is termed as
marshalling and deserialization is termed as unmarshalling.
In Hadoop, different components talk to each other via Remote Procedure
Calls (RPCs). A caller process serializes the desired function name and its
arguments as a byte stream before sending it to the called process. The called
process deserializes this byte stream, interprets the function type, and executes
it using the arguments that were supplied. The results are serialized and sent
back to the caller. This workflow naturally calls for fast serialization and
deserialization.
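Hadoop's own Writable types follow exactly this pattern: write() serializes an object's state to a byte stream and readFields() deserializes it back. A minimal sketch with IntWritable:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableSerialization {
    public static void main(String[] args) throws Exception {
        // Serialize: write the object's state into a byte stream
        IntWritable original = new IntWritable(42);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize: rebuild an object from the byte stream
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(restored.get()); // prints 42
    }
}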

Advantages of Serialization

1. To save/persist state of an object.


2. To travel an object across a network.

AVRO:-

Apache Avro is a language-neutral data serialization system. It was developed by Doug
Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro
becomes quite helpful, as it deals with data formats that can be processed by multiple
languages. Avro is a preferred tool to serialize data in Hadoop. Avro has a schema-based
system. A language-independent schema is associated with its read and write operations.
Avro serializes the data which has a built-in schema. Avro serializes the data into a compact
binary format, which can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as
Java, C, C++, C#, Python, and Ruby.
Avro Schemas:- Avro depends heavily on its schema. It serializes quickly and the
resulting serialized data is compact, since the schema accompanies the data rather than
being repeated in every record. The schema is stored along with the Avro data in a file
for any further processing. In RPC, the client and the server exchange schemas during the
connection; this exchange helps in resolving same-named fields, missing fields, extra
fields, etc.

Avro schemas are defined with JSON that simplifies its implementation in languages with
JSON libraries. Like Avro, there are other serialization mechanisms in Hadoop such as
Sequence Files, Protocol Buffers, and Thrift.

Features of Avro:-
Listed below are some of the prominent features of Avro –
 Avro is a language-neutral data serialization system.
 It can be processed by many languages (currently C, C++, C#, Java, Python, and
Ruby).
 Avro creates binary structured format that is both compressible and splittable. Hence
it can be efficiently used as the input to Hadoop MapReduce jobs.
 Avro provides rich data structures. For example, you can create a record that contains
an array, an enumerated type, and a sub record. These datatypes can be created in any
language, can be processed in Hadoop, and the results can be fed to a third language.
 Avro schemas defined in JSON, facilitate implementation in the languages that
already have JSON libraries.
 Avro creates a self-describing file named Avro Data File, in which it stores data along
with its schema in the metadata section.
 Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server
exchange schemas in the connection handshake.
General Working of Avro:-
To use Avro, you need to follow the given workflow –
Step 1 − Create schemas. Here you need to design Avro schema according to your data.
Step 2 − Read the schemas into your program. It is done in two ways –
o By Generating a Class Corresponding to Schema − Compile the schema using
Avro. This generates a class file corresponding to the schema
o By Using Parsers Library − You can directly read the schema using parsers
library.
Step 3 − Serialize the data using the serialization API provided for Avro, which is found in
the package org.apache.avro.specific.
Step 4 − Deserialize the data using deserialization API provided for Avro, which is found in
the package org.apache.avro.specific.
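A small end-to-end sketch of this workflow is shown below using Avro's generic API (org.apache.avro.generic) with a schema parsed at runtime rather than a generated class; the schema, field values, and output filename are made up for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Step 1-2: define a JSON schema and read it with the parsers library
        String schemaJson = "{\"type\":\"record\",\"name\":\"Employee\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"id\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Asha");   // hypothetical values
        record.put("id", 1);

        // Step 3: serialize it into an Avro data file, which stores the schema in its metadata
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("employee.avro"));
        writer.append(record);
        writer.close();
    }
}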
FILE-BASED DATA STRUCTURES: -
There are several Hadoop-specific file formats that were specifically created to work well
with MapReduce. These Hadoop-specific file formats include file-based data structures such
as sequence files, serialization formats like Avro, and columnar formats such as RCFile and
Parquet. These file formats have differing strengths and weaknesses, but all share the
following characteristics that are important for Hadoop applications:
 Splittable compression:- These formats support common compression formats and
are also splittable. We’ll discuss splittability more in the section “Compression”, but
note that the ability to split files can be a key consideration for storing data in Hadoop
because it allows large files to be split for input to MapReduce and other types of
jobs. The ability to split a file for processing by multiple tasks is of course a
fundamental part of parallel processing, and is also key to leveraging Hadoop’s data
locality feature.
 Agnostic compression:- The file can be compressed with any compression codec,
without readers having to know the codec. This is possible because the codec is stored
in the header metadata of the file format.
The SequenceFile format is one of the most commonly used file-based formats in Hadoop,
but other file-based formats are available, such as MapFiles, SetFiles, ArrayFiles, and
BloomMapFiles. Because these formats were specifically designed to work with MapReduce,
they offer a high level of integration for all forms of MapReduce jobs, including those run via
Pig and Hive. We’ll cover the SequenceFile format here, because that’s the format most
commonly employed in implementing Hadoop jobs.
SequenceFiles store data as binary key-value pairs. There are three formats available for
records stored within SequenceFiles:
 Uncompressed:- For the most part, uncompressed SequenceFiles don’t provide any
advantages over their compressed alternatives, since they’re less efficient for
input/output (I/O) and take up more space on disk than the same data in compressed
form.
 Record-compressed:- This format compresses each record as it’s added to the file.
 Block-compressed:- This format waits until data reaches block size to compress,
rather than as each record is added. Block compression provides better compression
ratios compared to record-compressed SequenceFiles, and is generally the preferred
compression option for SequenceFiles. Also, the reference to block here is unrelated
to the HDFS or filesystem block. A block in block compression refers to a group of
records that are compressed together within a single HDFS block.
Regardless of format, every SequenceFile uses a common header format containing basic
metadata about the file, such as the compression codec used, key and value class names,
user-defined metadata, and a randomly generated sync marker. This sync marker is also
written into the body of the file to allow for seeking to random points in the file, and is
key to facilitating splittability. For example, in the case of block compression, this sync
marker will be written before every block in the file.
SequenceFiles are well supported within the Hadoop ecosystem; however, their support
outside of the ecosystem is limited. They are also only supported in Java. A common use
case for SequenceFiles is as a container for smaller files. Storing a large number of small
files in Hadoop can cause a couple of issues. One is excessive memory use for the
NameNode, because metadata for each file stored in HDFS is held in memory. Another
potential issue is in processing data in these files—many small files can lead to many
processing tasks, causing excessive overhead in processing. Because Hadoop is optimized
for large files, packing smaller files into a SequenceFile makes the storage and processing
of these files much more efficient.
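As a rough sketch of the small-files use case described above, the following Java fragment
packs a set of local files into a block-compressed SequenceFile, using the file name as the
key and the raw bytes as the value; the output path and the choice of Text/BytesWritable are
illustrative assumptions, not the only option:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/packed.seq");   // hypothetical output path

        // Block compression: groups of records are compressed together
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (String localFile : args) {
                byte[] bytes = Files.readAllBytes(Paths.get(localFile));
                // key = original file name, value = raw file contents
                writer.append(new Text(localFile), new BytesWritable(bytes));
            }
        }
    }
}

Block compression is chosen here because, as noted above, it generally gives the best
compression ratios for SequenceFiles.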

HADOOP CLUSTERS:-

A cluster is a collection of things of the same kind; a simple computer cluster is a group of
computers connected to one another through a LAN (Local Area Network). The nodes in a
cluster share data, work on the same task, and together behave as a single unit. Similarly, a
Hadoop cluster is a collection of commodity hardware (devices that are inexpensive and
amply available) whose components work together as a single unit. A Hadoop cluster
contains many nodes (computers and servers) organised into Masters and Slaves: the
NameNode and ResourceManager act as the Masters, while the DataNodes and
NodeManagers act as the Slaves. The purpose of the Master nodes is to coordinate the Slave
nodes within a single Hadoop cluster. We design Hadoop clusters for storing, analysing, and
understanding data, and for finding the facts hidden in datasets that contain crucial
information. A Hadoop cluster stores and processes different types of data:
o Structured Data: data with a well-defined structure, such as rows in a MySQL table.
o Semi-Structured Data: data that has some structure but no rigid schema, such as
XML or JSON (JavaScript Object Notation).
o Unstructured Data: data without any predefined structure, such as audio and video.
Hadoop Clusters Properties:

1. Scalability: Hadoop clusters can readily scale up or scale down the number of
nodes, i.e. servers or commodity hardware. To see what this scalability means in
practice, suppose an organization wants to analyse and maintain around 5 PB of data
over the next two months and uses 10 nodes (servers) in its Hadoop cluster to hold it.
If the organization then receives an extra 2 PB of data during that period, it simply
upgrades the cluster from 10 to, say, 12 servers to accommodate the new data. This
process of increasing or decreasing the number of servers in the Hadoop cluster is
called scalability.
2. Flexibility: This is one of the important properties of a Hadoop cluster. It means the
cluster can handle any kind of data irrespective of its type and structure, which lets
Hadoop process data of every kind coming from online web platforms.
3. Speed: Hadoop clusters process data very quickly, because the data is distributed
across the cluster and because of the data-mapping capability, i.e. the MapReduce
architecture, which works on the master/slave principle.
4. No Data Loss: Data is not lost when a node fails, because a Hadoop cluster
replicates each piece of data onto other nodes. In case of a node failure, the data
remains available from those replicas.
5. Economical: Hadoop clusters are cost-efficient because they use distributed storage,
i.e. the data is spread across all the nodes of the cluster. To increase storage capacity,
we only need to add more inexpensive commodity hardware.
Types of Hadoop clusters:
i. Single Node Hadoop Cluster
ii. Multiple Node Hadoop Cluster

Single Node Hadoop Cluster: In a Single Node Hadoop Cluster, as the name suggests, the
cluster consists of only one node, which means that all the Hadoop daemons, i.e. NameNode,
DataNode, Secondary NameNode, ResourceManager and NodeManager, run on the same
machine. In the simplest configuration, all of this processing can even be handled by a single
JVM (Java Virtual Machine) process instance.

Multiple Node Hadoop Cluster: A multiple-node Hadoop cluster, as the name suggests,
contains many nodes. In this kind of setup, the Hadoop daemons run on different nodes
within the same cluster. In general, in a multiple-node Hadoop cluster we use the more
powerful machines for the Master daemons, i.e. the NameNode and ResourceManager, and
the cheaper systems for the Slave daemons, i.e. the NodeManager and DataNode.

Cluster Specification: -
 Hadoop is designed to run on commodity hardware. That means that you are
not tied to expensive, proprietary offerings from a single vendor; rather, you
can choose standardized, commonly available hardware from any of a large
range of vendors to build your cluster.
 "Commodity" does not mean "low-end." Low-end machines often have cheap
components, which have higher failure rates than more expensive (but still
commodity-class) machines. When you are operating tens, hundreds, or
thousands of machines, cheap components turn out to be a false economy, as
the higher failure rate incurs a greater maintenance cost. • Hardware
specification for each cluster is different.
 Hadoop is designed to use multiple cores and disks, so it will be able to take
full advantage of more powerful hardware.
 The bulk of Hadoop is written in Java, and can therefore run on any platform
with a JVM.

Cluster Setup and Installation:-

After the hardware is set up, the next step is to install the software needed to run Hadoop.

There are various ways to install and configure Hadoop; the basic manual steps are outlined below.

Installing Java:

o Java 6 or later is required to run Hadoop.
o The latest stable Sun JDK is the preferred option, although Java
distributions from other vendors may work, too.
o The following command confirms that Java was installed correctly: java -version

Creating a Hadoop User:

It's good practice to create a dedicated Hadoop user account to separate the Hadoop
installation from other services running on the same machine.
Installing Hadoop:

o Download the Hadoop package.
o Extract the Hadoop tar file.
o Change the owner of the Hadoop files to be the Hadoop user and group.
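Pulling the user-creation and installation steps above together, a typical manual install might
look roughly like the following shell session (the user name, version number, and paths are
placeholders, not fixed requirements):

# create a dedicated user and group for Hadoop (names are illustrative)
sudo groupadd hadoop
sudo useradd -g hadoop -m hduser

# download and unpack the Hadoop release, then move it into place
sudo tar xzf hadoop-x.y.z.tar.gz -C /usr/local
sudo mv /usr/local/hadoop-x.y.z /usr/local/hadoop

# make the Hadoop user and group the owner of the installation
sudo chown -R hduser:hadoop /usr/local/hadoop

# confirm Java is available
java -version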

Testing the Installation:

o Once you've created the installation file, you are ready to test it by installing it
on the machines in your cluster.
o This will probably take a few iterations as you discover kinks in the install.
o When it's working, you can proceed to configure Hadoop and give it a test run.

Security in Hadoop:-

Hadoop security focuses on protecting data and ensuring the integrity of the Hadoop
ecosystem through authentication, authorization, auditing, and data protection measures like
encryption. Key aspects include securing access to the Hadoop cluster, managing permissions
for users and services, and encrypting data both at rest and in transit.

 Authentication: Verifying the identity of users and services attempting to access the
cluster. This is often achieved using Kerberos, a network authentication protocol.
 Authorization: Determining which users and services are authorized to access
specific resources (data, applications, etc.). This is often managed through access
control lists (ACLs) and role-based access controls (RBAC).
 Auditing: Recording and monitoring user and service activities within the cluster
for security analysis and compliance purposes.
 Data Protection: Implementing encryption to safeguard data both at rest (stored on
disk) and in transit (being transferred over the network). HDFS provides encryption
features for at-rest encryption, while SSL/TLS can be used for in-transit encryption.
 Encryption: Securing data at rest and in transit using various encryption methods,
including HDFS Transparent Data Encryption (TDE) and SSL/TLS.
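As a minimal sketch of what some of these measures look like in practice (the property names
are the standard Hadoop ones, but a real Kerberos deployment also needs principals, keytabs,
and per-service settings, and HDFS ACLs must be enabled with dfs.namenode.acls.enabled in
hdfs-site.xml; the user name and paths below are illustrative):

<!-- core-site.xml: switch authentication from 'simple' to Kerberos and enable service-level authorization -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

# grant a single user read/execute access to a directory through an HDFS ACL
hdfs dfs -setfacl -m user:alice:r-x /data/secure
hdfs dfs -getfacl /data/secure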

Hadoop Security Best Practices:


 Change Default Passwords and Ports: Modify default passwords and
communication ports for Hadoop services and components.
 Use a Private Network: Isolate the Hadoop cluster on a private network to limit
external access.
 Implement Kerberos: Utilize Kerberos for robust authentication of users and
services.
 Enforce Access Control Lists (ACLs): Implement ACLs and RBAC to control
access to data and applications.
 Regularly Audit Access Logs: Analyze access logs to identify suspicious activity
and track user access patterns.
 Monitor for Threats: Implement monitoring tools to detect and respond to potential
security threats.
 Regularly Update Software: Keep Hadoop software and components up to date to
patch security vulnerabilities.

Administering Hadoop:-

Hadoop administration involves managing and maintaining Hadoop clusters, ensuring they
run efficiently and securely. This includes tasks like installing, configuring, and monitoring
Hadoop components, as well as troubleshooting issues and optimizing performance. Hadoop
administrators also manage user access and permissions, and may manage other Hadoop
ecosystem resources like Hive, Pig, and HBase.

Key Aspects of Hadoop Administration:


 Cluster Management: This includes installing, configuring, and maintaining
Hadoop clusters and related components like HDFS, YARN, MapReduce, and
various ecosystem tools (e.g., Hive, Pig, Spark).
 Performance Monitoring and Optimization: Hadoop administrators monitor
cluster performance, identify bottlenecks, and implement optimizations to ensure
optimal efficiency and stability.
 Security and Access Control: They manage user access, permissions, and
implement security measures to protect the Hadoop cluster and its data.
 Troubleshooting and Problem Solving: Hadoop administrators troubleshoot issues,
identify root causes, and implement solutions to ensure the smooth operation of the
cluster.
 Backup and Recovery: They are responsible for backing up data and ensuring the
cluster can be recovered in case of failures.
 Data Management: Hadoop administrators manage data flow, ensuring it is stored
and processed efficiently.
 Ecosystem Management: Beyond the core Hadoop components, they may manage
other ecosystem resources like Hive, Pig, and HBase, ensuring their proper
integration and functioning.
 Monitoring and Alerting: Hadoop administrators use monitoring tools to track the
health and performance of the cluster, setting up alerts to notify them of any issues
that need attention.
 Capacity Planning: They plan for future growth and capacity needs, ensuring the
cluster can handle increasing workloads.
 Staying Updated: Hadoop administrators need to stay current with new versions of
Hadoop and its ecosystem, as well as with security best practices.
HDFS monitoring and maintenance:-
HDFS monitoring and maintenance involve ensuring the Hadoop Distributed File System is
healthy, efficient, and available. This includes monitoring key metrics like NameNode and
DataNode health, storage usage, and performance, as well as performing routine maintenance
tasks like upgrades, decommissioning nodes, and managing the maintenance state.
Monitoring HDFS:
 NameNode Health: Monitor the NameNode JVM and OS status, capacity, and
usage trends.
 DataNode Health: Track the status of individual DataNodes, including heartbeats, and
receive instant notifications when nodes go down.
 Storage Usage: Monitor HDFS storage utilization, including used space, free space,
and total capacity at both cluster and node levels.
 Block-Level Details: Use the hdfs fsck command to analyze block-level details,
including block locations, replication factors, and any potential issues.
 Performance: Monitor HDFS memory usage, throughput, and other performance
metrics to ensure efficient resource utilization.
 Alerting: Set up automated alerts for critical events to take proactive actions and
address potential issues promptly.
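Two commonly used commands for the checks listed above (the path is an example):

# overall cluster report: capacity, remaining space, and per-DataNode status
hdfs dfsadmin -report

# file system health check, including block locations and replication details
hdfs fsck / -files -blocks -locations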

HDFS Maintenance:
 Rolling Upgrades: Perform upgrades to the HDFS software without downtime.
 Decommissioning DataNodes: Safely remove DataNodes from the cluster, ensuring
data replication before shutdown.
 Maintenance State: Temporarily take DataNodes out of service for short-term
maintenance, minimizing the need for full re-replication.
 Data Replication: Monitor and manage data replication factors to ensure fault
tolerance and data availability.
 Performance Tuning: Regularly tune and optimize HDFS for optimal performance,
including adjusting replication factors, block sizes, and other parameters.
 Configuration Management: Manage HDFS configuration settings and ensure they
are consistent across the cluster.
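For example, decommissioning a DataNode is typically driven by an exclude file plus a
refresh of the NameNode; the file location and hostname below are illustrative assumptions:

<!-- hdfs-site.xml: point the NameNode at an exclude file -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>

# list the node to be decommissioned, then ask the NameNode to re-read the file
echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes

The NameNode then re-replicates the node's blocks elsewhere before marking it as
decommissioned.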
Best Practices:
 Centralized Monitoring: Implement a centralized monitoring solution for easy
access to cluster health and performance data.
 Regular Review: Regularly review and update monitoring configurations to adapt
to changing workload requirements.
 Proactive Actions: Take proactive actions based on monitoring alerts and
performance data to prevent potential issues.
 Documentation: Maintain comprehensive documentation of HDFS configurations,
monitoring setup, and maintenance procedures.

Hadoop benchmarking:-
Hadoop benchmarking in big data involves using specific tests to evaluate the performance of
Hadoop clusters, particularly focusing on storage (HDFS) and processing (MapReduce)
capabilities. These benchmarks help measure factors like throughput, latency, and scalability
under different workloads, allowing for performance tuning and comparison of different
configurations.
Here's a more detailed look at Hadoop benchmarking in big data:
Benchmarking Tools and Techniques:
 HiBench: A widely used big data benchmark suite that includes various Hadoop, Spark,
and streaming workloads.
 DFSIO: Evaluates HDFS throughput by running many map tasks in parallel that write and
then read large files, helping identify potential I/O bottlenecks.
 TeraSort: A standard benchmark that assesses the performance of MapReduce and
HDFS by sorting a large dataset as quickly as possible, measuring how well the cluster can
distribute data and run map and reduce tasks over it.
 Namenode Benchmarking: Measures the number of operations performed by the
Hadoop NameNode per second, reporting metrics like total execution time, throughput,
and average time per operation.
 MapReduce Benchmarking: Evaluates the performance of MapReduce jobs, focusing
on execution time, throughput, and resource utilization.
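Typical invocations of these benchmarks look roughly like the following (jar names, dataset
sizes, and paths vary by Hadoop version and are placeholders here):

# TeraGen / TeraSort / TeraValidate: generate, sort, and verify a large dataset
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 /bench/tera-in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /bench/tera-in /bench/tera-out
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate /bench/tera-out /bench/tera-report

# TestDFSIO: measure HDFS write and read throughput with parallel map tasks
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1GB
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1GB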

Why Benchmark Hadoop?


 Performance Tuning: Benchmarks help identify bottlenecks and areas for optimization
within the Hadoop cluster.
 Capacity Planning: Benchmarking provides insights into the cluster's capacity to handle
different workloads, allowing for better resource allocation.
 Cluster Comparison: Benchmarks enable comparing the performance of different
Hadoop distributions and configurations.
 Scalability Assessment: Benchmarking can assess the scalability of Hadoop clusters as
the amount of data and the complexity of jobs increase.

Key Metrics Measured during Benchmarking:


 Throughput: Measures the amount of data processed per unit of time (e.g.,
operations per second).
 Latency: Measures the time it takes for a specific operation to complete.
 Execution Time: The total time taken for a job or workload to complete.
 Resource Utilization: Measures how efficiently the cluster's resources (CPU,
memory, network) are being used.
 Scalability: Evaluates how well the cluster can handle increasing amounts of data and
workload complexity.

Examples of Benchmarking in Practice:


 Hadoop Performance Evaluation: Using benchmarks like TeraSort and DFSIO to
evaluate HDFS and MapReduce performance.
 Comparing Hadoop and Spark: Benchmarking frameworks like Spark and Hadoop
MapReduce on tasks like classification to evaluate their performance and scalability.
 Evaluating Big Data Workloads: Using benchmarks like BigBench, which tests a
complete big data analytics workload, including MapReduce, Hive, and Mahout.

Hadoop in the Cloud:-

 "Hadoop in the cloud" means running Hadoop clusters on resources offered by a cloud
provider.
 This practice is normally compared with running Hadoop clusters on your own
hardware, called on-premises clusters or "on-prem."
 A cloud provider does not do everything for you; there are many choices and a variety
of provider features to understand and consider.
Reasons to Run Hadoop in the Cloud:
 Lack of space: Your organization may need Hadoop clusters, but you don't have
anywhere to keep racks of physical servers, along with the necessary power and
cooling.
 Flexibility: Without physical servers to rack up or cables to run, everything is
controlled through cloud provider APIs and web consoles.
 Speed of change: It is much faster to launch new cloud instances or allocate new
database servers than to purchase, unpack, rack, and configure physical computers.
 Lower risk: How much on-prem hardware should you buy? If you don't have
enough, the entire business slows down. If you buy too much, you've wasted money
and have idle hardware that continues to waste money. In the cloud, you can quickly
and easily change how many resources you use, so there is little risk of under
commitment or over commitment.
 Focus: An organization using a cloud provider to rent resources, instead of spending
time and effort on the logistics of purchasing and maintaining its own physical
hardware and networks, is free to focus on its core competencies, like using Hadoop
clusters to carry out its business. This is a compelling advantage for a tech startup.
 Worldwide availability
 Capacity
Reasons to Not Run Hadoop in the Cloud:
 Simplicity
 High levels of control
 Unique hardware needs
 Saving money
