UNIT - 4
BASICS OF HADOOP
• Contents
• Data format
• Features of Hadoop
• Analyzing data with Hadoop
• Design of Hadoop Distributed File System (HDFS)
• HDFS concepts
• Hadoop related tools
About Hadoop
• Apache Hadoop is an open source software framework
used to develop data processing applications which are
executed in a distributed computing environment.
• In Hadoop, data resides in a distributed file system called the Hadoop Distributed File System (HDFS). The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) that contain the data. This computational logic is simply a compiled version of a program written in a high-level language such as Java; such a program processes data stored in HDFS.
Types of Hadoop File Formats
• Text files: A text file is the most basic and human-readable format. It can be created in any editor or written from any programming language such as C or Java.
• Sequence files: The sequence file format stores key-value pairs in a binary container format and can be used to hold binary data such as images. Sequence files are more efficient than text files, but they are not human-readable.
Types of Hadoop File Formats
• Avro data files: Apache Avro is a language-neutral data serialization system developed by Doug Cutting, the father of Hadoop. The Avro file format is ideal for long-term storage of important data and can be read and written from many languages, such as Java and Scala.
• Parquet file format: Parquet is a columnar format developed by Cloudera and Twitter. It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on. Like Avro, schema metadata is embedded in the file.
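To make the "columnar" idea concrete, here is a minimal sketch that writes and selectively reads a Parquet file using the pyarrow library. pyarrow is not part of Hadoop itself; the library choice, column names, and file name are assumptions made only for illustration.

```python
# Minimal Parquet sketch using pyarrow (illustrative assumption; Hadoop jobs
# would more commonly read/write Parquet through Hive, Spark, Impala, etc.).
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table with two columns.
table = pa.table({
    "dept": ["CSE", "ECE", "CSE"],
    "students": [120, 90, 80],
})

# Write it out in the columnar Parquet format.
pq.write_table(table, "depts.parquet")

# Because the layout is columnar, a reader can pull just one column
# without scanning the whole file.
only_depts = pq.read_table("depts.parquet", columns=["dept"])
print(only_depts)
```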
SOME OTHER DATA FORMATS
• Photos – Pixel data - petabytes of storage
• Trade data – Time-series data – terabytes of data
per day
• Gene data – Sequential data – petabytes of data
• Internet Archive – Unstructured data – terabytes to petabytes of data
• There’s a lot of data out there.
• But you are probably wondering how it affects you.
• Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals?
• Yes. More data usually beats better algorithms.
Hadoop Features
• Suitable for Big Data Analysis
• As Big Data tends to be distributed and
unstructured in nature, HADOOP clusters are
best suited for analysis of Big Data. Since it is the
processing logic (not the actual data) that
flows to the computing nodes, less network
bandwidth is consumed. This concept is called
data locality, and it helps increase the
efficiency of Hadoop-based applications.
Hadoop Features
• Scalability : HADOOP clusters can easily be
scaled to any extent by adding additional cluster
nodes and thus allows for the growth of Big Data.
• Fault Tolerance : The HADOOP ecosystem has a
provision to replicate the input data onto other
cluster nodes. That way, in the event of a cluster
node failure, data processing can still proceed
using the data stored on another cluster node.
• 5.25’’ floppy disk – 1.2 MB – ~100 KB/sec ≈ 15 minutes to read in full.
• 5.25’’ hard disk – 1.3 GB – ~400 KB/sec ≈ 8 minutes to read in full.
• 5.25’’ compact disk – 700 MB / 4.7 GB – [CD: 150 KB/sec at 1x speed], [DVD: 1,350 KB/sec at 1x speed] ≈ 3 minutes to read in full.
• Today, terabyte hard disk drives have a data transfer speed of around 100 MB/sec, so it takes about two and a half hours to read all the data off the disk.
• This is a long time to read all data on a single drive and
writing is even slower.
• The obvious way to reduce the time is to read from
multiple disks at once.
• Imagine if we had 100 drives, each holding one
hundredth of the data.
• Working in parallel, we could read the data in under two
minutes.
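A quick back-of-the-envelope check of that claim, as a small Python sketch (the 1 TB size and 100 MB/sec transfer speed are the figures from the slides above):

```python
# Rough arithmetic behind "one drive: hours, 100 drives in parallel: minutes".
DATA_MB = 1_000_000        # 1 TB expressed in MB
SPEED_MB_PER_SEC = 100     # transfer speed per drive

one_drive_secs = DATA_MB / SPEED_MB_PER_SEC            # read everything from one drive
parallel_secs = (DATA_MB / 100) / SPEED_MB_PER_SEC     # 100 drives, 1/100 of the data each

print(one_drive_secs / 3600, "hours")    # ~2.8 hours on a single drive
print(parallel_secs / 60, "minutes")     # ~1.7 minutes across 100 drives
```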
• The first problem to solve is hardware failure.
– Replication of data is solution (HDFS).
• The second problem is that most analysis tasks
need to be able to combine the data in some
way.
– Various distributed systems allow data to be
combined from multiple sources, but doing this
correctly is notoriously challenging (MapReduce).
– MapReduce provides a programming model that
abstracts the problem from disk reads and writes,
transforming it into a computation over sets of keys
and values.
MapReduce
• MapReduce is the core processing component of the
Hadoop Ecosystem, as it provides the logic of
data processing. It is a software framework
that helps in writing applications that
process large data sets using distributed and
parallel algorithms inside the Hadoop
environment.
• In a MapReduce program, Map() and
Reduce() are the two functions.
MapReduce
– The Map function performs actions like filtering, grouping, and sorting.
– The Reduce function aggregates and summarizes the results produced by the Map function.
– The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
– Example: we want to calculate the number of students in each department. The Map program executes first and emits the students appearing in each department as key-value pairs of the form described above. These key-value pairs are the input to the Reduce function, which then aggregates the values for each department, calculates the total number of students per department, and produces the final result (a sketch of this example follows).
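A minimal sketch of this department-count example in plain Python, mimicking the map and reduce phases in a single process rather than using the real Hadoop API; the record format and the helper names (map_fn, reduce_fn, run) are assumptions made for illustration only.

```python
from collections import defaultdict

def map_fn(record):
    # Each input record is assumed to look like "student_name,department".
    # Emit one (department, 1) key-value pair per student.
    _, dept = record.split(",")
    yield (dept.strip(), 1)

def reduce_fn(dept, counts):
    # Sum all the 1s emitted for a department.
    yield (dept, sum(counts))

def run(records):
    # The framework normally performs the shuffle/sort between the phases;
    # here we simply group the map output by key ourselves.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    results = {}
    for dept, counts in grouped.items():
        for key, total in reduce_fn(dept, counts):
            results[key] = total
    return results

print(run(["alice,CSE", "bob,CSE", "carol,ECE"]))   # {'CSE': 2, 'ECE': 1}
```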
Comparison with other systems
A brief history of Hadoop
Hadoop Ecosystem and Components
Hadoop Ecosystem
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Data processing using programming
• Spark -> In-memory Data Processing
• PIG, HIVE-> Data Processing Services using Query
(SQL-like)
• HBase -> NoSQL Database
Hadoop Ecosystem
• Mahout, Spark MLlib -> Machine Learning
• Apache Drill -> SQL on Hadoop
• Zookeeper -> Managing Cluster
• Oozie -> Job Scheduling
• Flume, Sqoop -> Data Ingesting Services
• Solr & Lucene -> Searching & Indexing
• Ambari -> Provision, Monitor and Maintain
cluster
YARN
• Think of YARN as the brain of your Hadoop Ecosystem. It performs all
of your processing activities by allocating resources and
scheduling tasks.
• It has two major components: the ResourceManager and the
NodeManager.
– The ResourceManager is the main node on the processing side.
– It receives the processing requests and then passes parts of
the requests to the corresponding NodeManagers,
where the actual processing takes place.
– NodeManagers are installed on every DataNode and are
responsible for the execution of tasks on every single
DataNode.
PIG
• PIG has two parts: Pig Latin, the language, and the
Pig runtime, the execution environment. The relationship
is similar to that between Java and the JVM.
• Pig Latin has a SQL-like command structure.
• In PIG, the LOAD command loads the data first. Then
we perform various functions on it, such as grouping,
filtering, joining, and sorting. Finally, you can either
dump the data to the screen or store the
result back in HDFS.
APACHE HIVE
• Facebook created HIVE for people who are
fluent with SQL. Thus, HIVE makes them feel
at home while working in a Hadoop
Ecosystem.
• Basically, HIVE is a data warehousing
component that performs reading, writing,
and managing of large data sets in a distributed
environment using a SQL-like interface.
• HIVE + SQL = HQL
Mahout
• Mahout provides an environment for creating
machine learning applications which are
scalable.
• It performs collaborative filtering, clustering,
and classification. Some people also
consider frequent itemset mining to be a
Mahout function. Let us understand them
individually:
Mahout
• Collaborative filtering: Mahout mines user
behaviours, their patterns, and their
characteristics, and based on these it predicts
and makes recommendations to users. A
typical use case is an e-commerce website.
• Clustering: It organizes similar groups of data
together; for example, articles can be grouped into
blogs, news, research papers, etc.
Mahout
• Classification: It means classifying and
categorizing data into various sub-categories;
for example, articles can be categorized
into blogs, news, essays, research papers, and
other categories.
• Frequent itemset mining: Here Mahout
checks which objects are likely to
appear together and makes suggestions if
one of them is missing.
Spark
• Apache Spark is a framework for real-time
data analytics in a distributed computing
environment.
• Spark is written in Scala and was originally
developed at the University of California,
Berkeley.
Hadoop Architecture
NameNode and DataNodes
• NameNode: the main node; it does not store the
actual data. It contains metadata, like a log file or a
table of contents. Therefore, it requires less
storage but high computational resources.
• DataNodes: the actual data is stored here, so they require
more storage resources. DataNodes are
commodity hardware (like your laptops and desktops) in
the distributed environment. You always communicate with
the NameNode when writing data; it then tells the client
on which DataNodes to store and replicate the data.
Hadoop Architecture
• Hadoop has a Master-Slave Architecture for
data storage and distributed data processing
using MapReduce and HDFS methods.
• NameNode: The NameNode represents every
file and directory used in the
namespace.
• DataNode: A DataNode helps you manage
the state of an HDFS node and allows you to
interact with its blocks.
DIFFERENT NODES IN HADOOP
• MasterNode: The master node allows you to
conduct parallel processing of data using Hadoop
MapReduce.
• Slave node: The slave nodes are the additional
machines in the Hadoop cluster that allow you
to store data and conduct complex calculations.
Moreover, every slave node comes with a TaskTracker
and a DataNode, which allow it to
synchronize its processes with the JobTracker
and the NameNode, respectively.
Apache Hadoop and the Hadoop Ecosystem
The Hadoop projects
• MapReduce - A distributed data processing
model and execution environment that runs on
large clusters of commodity machines.
• HDFS - A distributed filesystem that runs on large
clusters of commodity machines.
Scaling out
• You’ve seen how MapReduce works for small inputs.
• Now it’s time to take a bird’s-eye view of the system and look at
the data flow for large inputs.
• For simplicity, the examples so far have used files on the local file
system.
• However, to scale out, we need to store the data in a distributed
file system, typically HDFS, to allow Hadoop to move the
MapReduce computation to each machine hosting a part of the data.
Figure legend: the dotted boxes indicate nodes, the light arrows show data transfers within a node, and the heavy arrows show data transfers between nodes.
Hadoop Streaming
• Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java.
• Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you
to create and run Map/Reduce jobs with any executable or script as the mapper and/or the
reducer.
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your
program, so you can use any language that can read standard input and write to standard
output to write your MapReduce program.
• Streaming is naturally suited for text processing (although, as of version 0.21.0, it can handle
binary streams, too), and when used in text mode, it has a line-oriented view of data.
• Map input data is passed over standard input to your map function, which
processes it line by line and writes lines to standard output.
• The mapper and the reducer can be, for example, Python scripts that read input from
standard input and emit output to standard output. The utility creates a
Map/Reduce job, submits the job to an appropriate cluster, and monitors the
progress of the job until it completes.
• A map output key-value pair is written as a single tab-delimited line.
• Input to the reduce function is in the same format: a tab-separated key-value
pair passed over standard input.
• The reduce function reads lines from standard input, which the framework
guarantees are sorted by key, and writes its results to standard output.
• Languages other than Java for writing map and reduce functions include:
– Ruby
– Python (a sketch follows below)
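As an illustration, here is a minimal word-count pair of Streaming scripts in Python. The file names (mapper.py, reducer.py) and the word-count task are assumptions for this sketch; any executable that reads standard input and writes tab-delimited lines to standard output would work the same way.

```python
import sys

# ---- mapper.py ----
def mapper():
    # Read raw input lines from standard input and emit one
    # tab-delimited "word<TAB>1" pair per word.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# ---- reducer.py ----
def reducer():
    # The framework guarantees reducer input is sorted by key, so all
    # lines for a given word arrive contiguously and can be summed on the fly.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
```

In practice each function would live in its own script and be submitted with the hadoop-streaming JAR that ships with Hadoop, passing the scripts via the -mapper and -reducer options; the exact JAR path varies by distribution.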
Hadoop pipes
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
• Unlike Streaming, which uses standard input and output to communicate with
the map and reduce code, Pipes uses sockets as the channel over which the
tasktracker communicates with the process running the C++ map or reduce
function.
• JNI is not used.
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• HDFS is a filesystem designed for
– storing very large files with streaming data access
patterns
– running on clusters of commodity hardware.
Very large files
• “Very large” in this context means files that are
hundreds of megabytes, gigabytes, or terabytes
in size.
• There are Hadoop clusters running today that
store petabytes of data.
Streaming data access
• HDFS is built around the idea that the most efficient
data processing pattern is a write-once, read-many-
times pattern.
• A dataset is typically generated or copied from a source, and then
various analyses are performed on it over time. Each analysis will
involve a large proportion, if not all, of the dataset, so
the time to read the whole dataset is more important
than the latency in reading the first record.
Commodity hardware
• Hadoop doesn’t require expensive, highly reliable hardware to run on.
• It’s designed to run on clusters of commodity hardware (commonly available
hardware available from multiple vendors) for which the chance of node
failure across the cluster is high, at least for large clusters.
• HDFS is designed to carry on working without a noticeable interruption to the
user in the face of such failure.
• It is also worth examining the applications for which using HDFS does not work
so well.
• While this may change in the future, these are areas where HDFS is not a good
fit today.
Low-latency data access
• Applications that require low-latency access to data, in the
tens of milliseconds range, will not work well with HDFS.
• Remember, HDFS is optimized for delivering a high
throughput of data, and this may be at the expense of
latency.
• HBase is currently a better choice for low-latency access.
Lots of small files
• Since the namenode holds filesystem metadata in memory, the
limit to the number of files in a filesystem is governed by the
amount of memory on the namenode.
• As a rule of thumb, each file, directory, and block takes about 150
bytes (a rough worked estimate follows this list).
• Files in HDFS may be written to by a single writer. Writes are
always made at the end of the file. There is no support for
multiple writers, or for modifications at arbitrary offsets in the
file.
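A rough, illustrative estimate of namenode memory usage based on the 150-byte rule of thumb above; the file count and the one-block-per-file assumption are hypothetical numbers chosen only to show the arithmetic.

```python
BYTES_PER_OBJECT = 150       # rule of thumb: per file, directory, or block
num_files = 10_000_000       # hypothetical: 10 million small files
objects = num_files * 2      # one file object + one block object each (assumed)

memory_bytes = objects * BYTES_PER_OBJECT
print(f"~{memory_bytes / 1e9:.1f} GB of namenode memory")   # ~3.0 GB
```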
HDFS Concepts
• Blocks
• Namenodes and Datanodes