Big Data Fundamentals and Hadoop Overview
Unit – I
Introduction to Big Data: Data, Types of Data, Big Data – 3 Vs of Big Data, Analytics, Types of Analytics, Need for
Big Data Analytics. Introduction to Apache Hadoop: Invention of Hadoop, Hadoop Architecture, Hadoop
Components, Hadoop Eco Systems, Hadoop Distributions, Benefits of Hadoop
Data: Data is a collection of raw facts and figures. Data is unprocessed, which is why it is called a
collection of raw facts and figures. We collect data from different sources. After collection, data is
entered into a machine for processing. Data may be a collection of words, numbers, pictures, sounds,
and so on.
Examples of Data:
• Student data on admission forms – a bundle of admission forms contains name, father’s name, address,
photograph, etc.
• Students’ examination data – in the examination system of a college/school, data about the marks
obtained in different subjects by all students, the exam schedule, etc. is collected.
• Census data of citizens – during a census, data about all citizens is collected, such as the number of
persons living in a home, literate or illiterate, number of children, caste, religion, etc.
• Survey data – companies collect data through surveys to learn people’s opinions about their products
(whether they like or dislike them). They also collect data about their competitor companies in a particular area.
Information: Processed data is called information. When raw facts and figures are processed and
arranged in some proper order, they become information. Information has proper meaning.
Information is useful in decision-making. In other words, information is data that has been processed in
such a way as to be meaningful to the person who receives it.
Examples of information:
• Students’ address labels – stored student data can be used to print address labels, which are used to
send any intimation/information to students at their home addresses.
• Students’ examination results – in the examination system, the collected data (marks obtained in each
subject) is processed to get the total marks obtained by a student. The total marks are information, and
they are also used to prepare the student’s result card.
• Census report, total population – census data is used to produce reports/information about the total
population of a country, the literacy rate, the total population of males, females, children and aged
persons, and the number of persons in different categories like caste, religion, age groups, etc.
• Survey report – survey data is summarized into reports/information to present to the management of
the company. The management takes important decisions on the basis of the data collected through
surveys.
Ex: The data collected in a survey report is: ‘HYD20M’
If we process the above data, we understand that the code gives information about a person as follows:
HYD is the city name ‘Hyderabad’, 20 is the age, and M represents ‘MALE’.
Units of data: When dealing with big data, we work with quantities expressed in units such as megabytes,
gigabytes, terabytes, and so on. Here is the system of units used to represent data:
Kilobyte (KB) – 10^3 bytes
Megabyte (MB) – 10^6 bytes
Gigabyte (GB) – 10^9 bytes
Terabyte (TB) – 10^12 bytes
Petabyte (PB) – 10^15 bytes
Exabyte (EB) – 10^18 bytes
Zettabyte (ZB) – 10^21 bytes
Yottabyte (YB) – 10^24 bytes
Big Data:
• Big Data is a term used for collections of data sets that are so large and complex that they are difficult to
store and process using available database management tools or traditional data processing
applications. The quantity of data on planet earth is growing exponentially for many reasons. Various
sources and our day to day activities generate lots of data. With the smart objects going online, the
data growth rate has increased rapidly.
• The definition of Big Data, given by Gartner, is: “Big data is high-volume, high-velocity
and/or high-variety information assets that demand cost-effective, innovative forms of
information processing that enable enhanced insight, decision making, and process
automation”.
• The major sources of Big Data: social media sites, sensor networks, digital images/videos, cell
phones, purchase transaction records, web logs, medical records, archives, military surveillance,
ecommerce, complex scientific research and so on.
• By 2020, the data volumes will be around 40 Zettabytes which is equivalent to adding every single
grain of sand on the planet multiplied by seventy-five.
• Examples of Bigdata:
1. The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Social media impact: statistics show that 500+ terabytes of new data get ingested into the
databases of the social media site Facebook every day. This data is mainly generated from
photo and video uploads, message exchanges, putting up comments, etc.
3. Walmart handles more than 1 million customer transactions every hour.
4. Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
5. 230+ million tweets are created every day.
6. More than 5 billion people are calling, texting, tweeting and browsing on mobile phones
worldwide.
7. YouTube users upload 48 hours of new video every minute of the day.
8. Amazon handles 15 million customer click stream user data per day to recommend products.
9. 294 billion emails are sent every day. Email services analyze this data to find spam.
10. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands
of flights per day, data generation reaches many petabytes.
Types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
1. Structured Data:
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
Over time, computer science talent has achieved great success in developing techniques for working
with such data (where the format is well known in advance) and deriving value from it. However,
nowadays we are foreseeing issues as the size of such data grows to a huge extent; typical sizes are in
the range of multiple zettabytes.
Eg: Data stored in a relational database management system is one example of 'structured' data.
2. Unstructured Data:
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size,
unstructured data poses multiple challenges in terms of processing it to derive value. A typical example
of unstructured data is a heterogeneous data source containing a combination of simple text files,
images, videos, etc. Nowadays organizations have a wealth of data available with them, but
unfortunately they don't know how to derive value out of it, since this data is in its raw,
unstructured form.
3. Semi-Structured Data:
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in
form, but it is actually not defined with, for example, a table definition as in a relational DBMS.
Examples of semi-structured data: XML files or JSON documents are examples of semi-structured data.
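As a small illustration (the records below are made up), the following Python sketch parses two JSON documents: both are loosely structured as key/value pairs, but they do not share one fixed schema, which is what makes them semi-structured rather than structured:

import json

# Two hypothetical records: both are valid JSON, but their fields differ,
# so they cannot be forced into one fixed relational schema without transformation.
records = [
    '{"id": 1, "name": "Asha", "email": "asha@example.com"}',
    '{"id": 2, "name": "Ravi", "phones": ["9999900000", "8888800000"], "city": "Hyderabad"}',
]

for raw in records:
    doc = json.loads(raw)                      # parse the semi-structured record
    print(doc.get("name"), "->", sorted(doc))  # each record carries its own set of fields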
3 V’s of Big Data: Three V’s of Big Data / Characteristics of 'Big Data' are as follows:
1. Volume – Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace. The
size of data generated by humans, machines and their interactions on social media is itself massive. The
size of data plays a very crucial role in determining its value. Also, whether particular data can
actually be considered Big Data or not depends on the volume of data. Hence,
'Volume' is one characteristic which needs to be considered while dealing with Big Data. Researchers
have predicted that 40 Zettabytes (40,000 Exabytes) will be generated by 2020, which is an increase of
300 times from 2005.
2. Variety: Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. During earlier days, spreadsheets and databases were the only sources of data considered
by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices,
PDFs, audio, etc. is also being considered in the analysis applications. This variety of unstructured data
poses certain issues for storage, mining and analyzing data.
3. Velocity: The term 'velocity' refers to the speed of generation of data. How fast the data is generated
and processed to meet the demands, determines real potential in the data. Big Data Velocity deals with
the speed at which data flows in from sources like business processes, application logs, networks and
social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous. If you are
able to handle the velocity, you will be able to generate insights and take decisions based on real-time
data.
Data Analytics:
• Data Analytics is the process of examining raw data (data sets) with the purpose of drawing
conclusions about that information, increasingly with the aid of specialized systems and
software.
• Data Analytics involves applying an algorithmic or mechanical process to derive insights. For
example, running through a number of data sets to look for meaningful correlations between
each other.
• It is used in a number of industries to allow organizations and companies to make better
decisions as well as verify and disprove existing theories or models.
• The focus of Data Analytics lies in inference, which is the process of deriving conclusions that are
solely based on what the researcher already knows.
• Data analytics initiatives can help businesses increase revenues, improve operational efficiency,
optimize marketing campaigns and customer service efforts, respond more quickly to emerging
market trends and gain a competitive edge -- all with the ultimate goal of boosting business
performance.
Types of Analytics:
There are 4 types of analytics. Here, we start with the simplest one and move to the more sophisticated.
As it happens, the more complex an analysis is, the more value it brings.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive analytics:
• The simplest way to define descriptive analytics is that it answers the question “What has
happened?”
• This type of analytics analyses data coming in real time and historical data for insights on how to
approach the future.
• The main objective of descriptive analytics is to find out the reasons behind previous success or
failure in the past.
• The ‘past’ here refers to any particular time at which an event occurred; this could be a month ago
or even just a minute ago.
• The vast majority of big data analytics used by organizations falls into the category of descriptive
analytics. 90% of organizations today use descriptive analytics which is the most basic form of
analytics.
• Descriptive analytics juggles raw data from multiple data sources to give valuable insights into the
past. However, these findings simply signal that something is wrong or right, without explaining why.
For this reason, highly data-driven companies do not content themselves with descriptive analytics
only, and prefer combining it with other types of data analytics.
• Eg: A manufacturer was able to decide on focus product categories based on the analysis of revenue,
monthly revenue per product group, income by product group, and the total quantity of metal parts
produced per month.
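A minimal Python sketch of descriptive analytics (the product groups and revenue figures are invented for illustration): it only summarizes what has already happened, by totalling revenue per product group.

from collections import defaultdict

# Hypothetical historical sales records: (month, product_group, revenue)
sales = [
    ("Jan", "metal parts", 120000), ("Jan", "plastics", 45000),
    ("Feb", "metal parts", 135000), ("Feb", "plastics", 40000),
]

revenue_by_group = defaultdict(int)
for month, group, revenue in sales:
    revenue_by_group[group] += revenue      # describe the past: total revenue per group

for group, total in sorted(revenue_by_group.items(), key=lambda kv: -kv[1]):
    print(group, total)                     # e.g. which product categories to focus on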
2. Diagnostic analytics:
• At this stage, historical data can be measured against other data to answer the question of why
something happened.
• Companies go for diagnostic analytics, as it gives a deep insight into a particular problem. At the
same time, a company should have detailed information at their disposal; otherwise data collection
may turn out to be individual for every issue and time-consuming.
• Eg: Let’s take another look at the examples from different industries: a healthcare provider compares
patients’ response to a promotional campaign in different regions; a retailer drills the sales down to
subcategories.
3. Predictive analytics:
• Predictive analytics tells what is likely to happen. It uses the findings of descriptive and diagnostic
analytics to detect tendencies, clusters and exceptions, and to predict future trends, which makes it a
valuable tool for forecasting.
• Despite numerous advantages that predictive analytics brings, it is essential to understand that
forecasting is just an estimate, the accuracy of which highly depends on data quality and stability of
the situation, so it requires a careful treatment and continuous optimization.
• Eg: A management team can weigh the risks of investing in their company’s expansion based on cash
flow analysis and forecasting. Organizations like Walmart, Amazon and other retailers leverage
predictive analytics to identify trends in sales based on purchase patterns of customers, forecasting
customer behavior, forecasting inventory levels, predicting what products customers are likely to
purchase together so that they can offer personalized recommendations, predicting the amount of
sales at the end of the quarter or year.
4. Prescriptive analytics
• The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate a future
problem or take full advantage of a promising trend.
• Prescriptive analytics is a combination of data, mathematical models and various business rules.
• The data for prescriptive analytics can be both internal (within the organization) and external (like
social media data).
• Besides, prescriptive analytics uses sophisticated tools and technologies, like machine learning,
business rules and algorithms, which make it sophisticated to implement and manage. That is why,
before deciding to adopt prescriptive analytics, a company should compare the required effort against
the expected added value.
• Prescriptive analytics is comparatively complex in nature and many companies are not yet using it
in day-to-day business activities, as it becomes difficult to manage. Large-scale organizations
use prescriptive analytics for scheduling inventory in the supply chain, optimizing production,
etc., to optimize the customer experience.
• An example of prescriptive analytics: a multinational company was able to identify opportunities for
repeat purchases based on customer analytics and sales history.
*Descriptive and diagnostic analytics help you construct a narrative of the past while predictive and
prescriptive analytics help you envision a possible future.
Need for Big Data Analytics:
The new benefits that big data analytics brings to the table, however, are speed and efficiency.
Whereas a few years ago a business would have gathered information, run analytics and unearthed
information that could be used for future decisions, today that business can identify insights for
immediate decisions. The ability to work faster – and stay agile – gives organizations a competitive
edge they didn’t have before.
Big data analytics helps organizations harness their data and use it to identify new opportunities.
That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier
customers in the following ways:
1. Cost reduction: Big data technologies such as Hadoop and cloud-based analytics bring significant
cost advantages when it comes to storing large amounts of data – plus they can identify more
efficient ways of doing business.
2. Faster, better decision making: With the speed of Hadoop and in-memory analytics, combined with
the ability to analyze new sources of data, businesses are able to analyze information immediately –
and make decisions based on what they’ve learned.
3. New products and services: With the ability to gauge customer needs and satisfaction through
analytics comes the power to give customers what they want. Davenport points out that with big
data analytics, more companies are creating new products to meet customers’ needs.
4. End Users Can Visualize Data: While the business intelligence software market is relatively mature, a
big data initiative is going to require next-level data visualization tools, which present BI data in easy-
to-read charts, graphs and slideshows. Due to the vast quantities of data being examined, these
applications must be able to offer processing engines that let end users query and manipulate
information quickly—even in real time in some cases.
Problem: An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of 100$ to
its top 10 customers who have spent the most in the previous year. Moreover, they want to find the
buying trend of these customers so that the company can suggest more items relevant to them.
Issues: the data of 100 million users is too large to store and process efficiently with traditional database
tools within a reasonable time and cost.
Solution: Apache Hadoop is not only a storage system but a platform for data storage as well as
processing.
➢ Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System),
which uses commodity hardware to form clusters and stores data in a distributed fashion. It works
on the write once, read many times principle.
➢ Processing: The MapReduce paradigm is applied to the data distributed over the network to find the
required output.
➢ Analyze: Pig and Hive can be used to analyze the data.
➢ Cost: Hadoop is open source, so cost is no longer an issue.
Hadoop Introduction:
➢ Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is licensed under
Apache Software Foundation (ASF).
➢ Hadoop is written in the Java programming language and ranks among the highest-level Apache
projects.
➢ Doug Cutting and Mike J. Cafarella developed Hadoop.
➢ Getting inspiration from Google, Hadoop uses technologies like the MapReduce
programming model as well as the Google File System (GFS).
➢ It is optimized to handle massive quantities of data, which could be structured, unstructured or
semi-structured, using commodity hardware, that is, relatively inexpensive computers.
➢ It is intended to scale from a single server to thousands of machines, each offering local
computation and storage. It supports large collections of data sets in a distributed computing
environment.
History:
Let us see how Hadoop came into existence and why it is so popular in the industry nowadays. It all started
with two people, Mike Cafarella and Doug Cutting, who were in the process of building a search engine system
that could index 1 billion pages. After their research, they estimated that such a system would cost around
half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive.
However, they soon realized that their architecture would not be capable enough to work with
billions of pages on the web.
Doug and Mike came across a paper, published in 2003, that described the architecture of Google’s
distributed file system, called GFS, which was being used in production at Google. Now, this paper on
GFS proved to be something that they were looking for, and soon, they realized that it would solve all
their problems of storing very large files that are generated as a part of the web crawl and indexing
process. Later in 2004, Google published one more paper that introduced MapReduce to the world.
Finally, these two papers led to the foundation of the framework called “Hadoop“. Doug Cutting, Mike
Cafarella and team took the solution provided by Google and started an open source project called Hadoop in
2005, and Doug named it after his son's toy elephant. Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
Hadoop Architecture:
➢ Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework for
a cluster of systems with storage capacity and local computing power by leveraging commodity
hardware.
➢ Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on
different CPU nodes. In short, the Hadoop framework is capable of developing applications that run
on clusters of computers and perform complete statistical analysis of huge amounts of data.
➢ Hadoop follows a Master Slave architecture for the transformation and analysis of large datasets
using Hadoop MapReduce paradigm.
➢ The 3 important Hadoop core components that play a vital role in the Hadoop architecture are -
1. Hadoop Distributed File System (HDFS)
2. Hadoop MapReduce
3. Yet Another Resource Negotiator (YARN)
Hadoop ecosystem:
➢ Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework
which solves big data problems. You can consider it as a suite which encompasses a number of
services (ingesting, storing, analyzing and maintaining) inside it. Let us discuss and get a brief idea
about how the services work individually and in collaboration.
➢ The Hadoop ecosystem provides the furnishings that turn the framework into a comfortable home for
big data activity that reflects your specific needs and
tastes.
➢ The Hadoop ecosystem includes both official Apache open source projects and a wide range of
commercial tools and solutions.
➢ Below are the Hadoop components, that together form a Hadoop ecosystem,
✓ HDFS -> Hadoop Distributed File System
✓ YARN -> Yet Another Resource Negotiator
✓ MapReduce -> Data processing using programming
✓ Spark -> In-memory data processing
✓ PIG, HIVE -> Data processing services using SQL-like queries
✓ HBase -> NoSQL database
✓ Mahout, Spark MLlib -> Machine learning
✓ Apache Drill -> SQL on Hadoop
✓ Zookeeper -> Managing the cluster
✓ Oozie -> Job scheduling
✓ Flume, Sqoop -> Data ingesting services
✓ Solr & Lucene -> Searching & indexing
• Pig: Pig is a high-level data flow platform whose compiler generates MapReduce functions. Pig
includes Pig Latin, which is a scripting language. Pig translates Pig Latin scripts into MapReduce,
which can then run on YARN and process data in the HDFS cluster.
• HBase: HBase is a scalable, distributed, NoSQL database that sits atop HDFS. It was
designed to store structured data in tables that could have billions of rows and millions of
columns. It has been deployed to power historical searches through large data sets, especially
when the desired data is contained within a large amount of unimportant or irrelevant data (also
known as sparse data sets).
• Oozie: Oozie is the workflow scheduler that was developed as part of the Apache Hadoop
project. It
manages how workflows start and execute, and also controls the execution path. Oozie is a
server- based Java web application that uses workflow definitions written in hPDL, which is an
XML Process Definition Language similar to JBOSS JBPM jPDL.
• Sqoop: Sqoop is a bi-directional data ingestion tool. Think of Sqoop as a front-end loader for big
data.
Sqoop is a command-line interface that facilitates moving bulk data from Hadoop into relational
databases and other structured data stores. Using Sqoop replaces the need to develop scripts to
export and import data. One common use case is to move data from an enterprise data
warehouse to a Hadoop cluster for ETL processing. Performing ETL on the commodity Hadoop
cluster is resource efficient, while Sqoop provides a practical transfer method.
• Ambari – A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper,
Oozie, Pig, and Sqoop.
• Flume – A distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of streaming event data.
• Mahout – A scalable machine learning and data mining library.
• Zookeeper – A high-performance coordination service for distributed applications.
The ecosystem elements described above are all open source Apache Hadoop projects. There
are numerous commercial solutions that use or support the open source Hadoop projects.
Hadoop Distributions:
➢ Hadoop is an open-source, catch-all technology solution with incredible scalability, low
cost storage systems and fast paced big data analytics with economical server costs.
➢ Hadoop vendor distributions overcome the drawbacks and issues of the open source edition of
Hadoop. These distributions have added functionalities that focus on:
• Support:
Most of the Hadoop vendors provide technical guidance and assistance that makes it easy for
customers to adopt Hadoop for enterprise level tasks and mission critical applications.
• Reliability:
Hadoop vendors promptly act in response whenever a bug is detected. With the intent to
make commercial solutions more stable, patches and fixes are deployed immediately.
• Completeness:
Hadoop vendors couple their distributions with various other add-on tools which help
customers customize the Hadoop application to address their specific tasks.
• Fault tolerance:
Since the data has a default replication factor of three, it is highly available and fault-tolerant.
Here is a list of top Hadoop vendors who play a key role in big data market growth:
➢ Amazon Elastic MapReduce
➢ Cloudera CDH Hadoop Distribution
➢ Hortonworks Data Platform (HDP)
➢ MapR Hadoop Distribution
➢ IBM Open Platform (IBM InfoSphere BigInsights)
➢ Microsoft Azure's HDInsight -Cloud based Hadoop Distribution
Advantages of Hadoop:
The increase in the requirement of computing resources has made Hadoop a viable and extensively
used programming framework. Modern-day organizations can learn Hadoop and leverage it to manage
the processing power of their businesses.
1. Scalable: Hadoop is a highly scalable storage platform, because it can store and distribute very
large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional
relational database systems (RDBMS) that can’t scale to process large amounts of data, Hadoop
enables businesses to run applications on thousands of nodes involving many thousands of terabytes
of data.
2. Cost effective: Hadoop also offers a cost-effective storage solution for businesses’ exploding data sets.
The problem with traditional relational database management systems is that it is extremely cost
prohibitive to scale to such a degree in order to process such massive volumes of data. In an effort to
reduce costs, many companies in the past would have had to down-sample data and classify it based on
certain assumptions as to which data was the most valuable. The raw data would be deleted, as it
would be too cost-prohibitive to keep. While this approach may have worked in the short term, this
meant that when business priorities changed, the complete raw data set was not available, as it was too
expensive to store.
3. Flexible: Hadoop enables businesses to easily access new data sources and tap into different types of
data (both structured and unstructured) to generate value from that data. This means businesses can
use Hadoop to derive valuable business insights from data sources such as social media,
email conversations. Hadoop can be used for a wide variety of purposes, such as log processing,
recommendation systems, data warehousing, market campaign analysis and fraud detection.
4. Speed of Processing: Hadoop’s unique storage method is based on a distributed file system that
basically ‘maps’ data wherever it is located on a cluster. The tools for data processing are often on the
same servers where the data is located, resulting in much faster data processing. If you’re dealing with
large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just
minutes, and petabytes in hours.
5. Resilient to failure: A key advantage of using Hadoop is its fault tolerance. When data is sent to an
individual node, that data is also replicated to other nodes in the cluster, which means that in the event
of failure, there is another copy available for use.
Summary of UNIT – I:
• Part – I - BigData
1. Data, Information, Knowledge application on data
2. What is Bigdata, examples of Bigdata (TBs &PBs of data)
3. Sources of Bigdata, Facts about Bigdata (ex: Aircrafts, social Media, tweets, e-commerce)
4. Types of Bigdata (structured, unstructured, semi-structured)
5. 3 Vs of Bigdata (properties of bigdata – volume, velocity, variety)
6. Data Analytics, Types of Data Analytics
7. Need of Data Analytics
• Part – II - Hadoop
1. What is Hadoop
2. History of Hadoop (Doug Cutting -2005)
3. Evolution of Hadoop
4. Basic Architecture of Hadoop (High level architecture also)
5. Hadoop eco-system (various tools- HDFS, Mapreduce, spark, Hbase, pig, Hive, Zoo
Keeper, Sqoop, Flume, Oozie etc).
6. Hadoop distributions
7. Benefits of Hadoop
Activities:
8. Sketch Noting
9. One Minute Chat
10. Quescussion
11. Presentations
12. Question/Answers
Unit – II HDFS & Map Reduce
Unit - II
Hadoop Distributed File System (HDFS): Introduction, Design Principles, Components of HDFS – Name
Node, Secondary Name Node, Data Nodes, HDFS File Blocks, Storing File Blocks into HDFS from Client
Machine, Rack Awareness, HDFS File Commands.
Map Reduce: Introduction, Architectural Components of Map Reduce, Functional Components of Map
Reduce, Heartbeat Signal, Speculative Execution
HDFS:
➢ The Hadoop Distributed File System is a Java-based distributed file system that allows you to
store large data across multiple nodes in a Hadoop cluster. So, if you install Hadoop, you get HDFS
as the underlying storage system for storing data in the distributed environment.
➢ Hadoop File System was developed using distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and
designed using low-cost hardware.
➢ HDFS holds very large amount of data and provides easier access. To store such huge data,
the files are stored across multiple machines. These files are stored in redundant fashion to
rescue the system from possible data losses in case of failure. HDFS also makes applications
available to parallel processing.
➢ HDFS runs on top of the existing file systems on each node in a Hadoop cluster. It is a block-
structured file system where each file is divided into blocks of a pre-determined size. These
blocks are stored across a cluster of one or several machines.
Features of HDFS:
• It is suitable for the distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of namenode and datanode help users to easily check the status of
cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
Basic terminology:
Node: A node is simply a computer. This is typically non-enterprise, commodity hardware for
nodes that contain data.
Rack: Collection of nodes is called as a rack. A rack is a collection of 30 or 40 nodes that are
physically stored close together and are all connected to the same network switch.
Network bandwidth between any two nodes in the same rack is greater than bandwidth
between two nodes on different racks.
Cluster: A Hadoop Cluster (or just cluster from now on) is a collection of racks.
File Blocks:
Blocks are nothing but the smallest continuous locations on your hard drive where data is stored.
In general, in any file system, you store data as a collection of blocks. Similarly, HDFS
stores each file as blocks, which are scattered throughout the Apache Hadoop cluster. The
default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x) which
you can configure as per your requirement. All blocks of the file are the same size except the
last block, which can be either the same size or smaller. The files are split into 128 MB
blocks and then
stored into the Hadoop file system. The Hadoop application is responsible for distributing the
data block across multiple nodes.
Let’s take an example where we have a file “example.txt” of size 514 MB. Suppose that we are using
the default configuration of block size, which is 128 MB. Then 5 blocks will be created: the first four
blocks will be 128 MB each, but the last block will be only 2 MB in size.
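The block arithmetic above can be sketched in a few lines of Python (an illustration using the numbers from the example, not actual HDFS code):

import math

BLOCK_SIZE_MB = 128                         # default HDFS block size in Apache Hadoop 2.x
FILE_SIZE_MB = 514                          # size of the example file "example.txt"

num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
last_block_mb = FILE_SIZE_MB - (num_blocks - 1) * BLOCK_SIZE_MB

print(num_blocks)                           # 5 blocks in total
print(last_block_mb)                        # the last block holds only 2 MB; the other four are 128 MB each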
Components of HDFS:
HDFS is a block-structured file system where each file is divided into blocks of a pre-determined
size. These blocks are stored across a cluster of one or several machines. Apache Hadoop HDFS
Architecture follows a Master/Slave architecture, where a cluster comprises a single
NameNode (Master node) and all the other nodes are DataNodes (Slave nodes). HDFS can be
deployed on a broad spectrum of machines that support Java. Though one can run several
DataNodes on a single machine, but in the practical world, these DataNodes are spread across
various machines.
NameNode:
NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and
manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly
available server that manages the File System Namespace and controls access to files by clients.
The HDFS architecture is built in such a way that the user data never resides on the NameNode.
Name node contains metadata and the data resides on DataNodes only.
Functions of NameNode:
• It is the master daemon that maintains and manages the DataNodes (slave nodes).
Functions of DataNode:
• DataNodes are the slave daemons that perform read-write operations on the file system, as per
client requests.
• They also perform operations such as block creation, deletion, and replication according
to the instructions of the NameNode.
• They send heartbeats to the NameNode periodically to report the overall health of
HDFS; by default, this frequency is set to 3 seconds.
Secondary NameNode:
It is a separate physical machine that acts as a helper to the NameNode. It performs periodic
checkpoints. It communicates with the NameNode and takes snapshots of the metadata, which
helps minimize downtime and loss of data. The Secondary NameNode works concurrently with
the primary NameNode as a helper daemon.
MapReduce:
MapReduce is a programming framework that allows us to perform parallel and distributed
processing on huge data sets in a distributed environment.
Map Reduce can be implemented by its components:
1. Architectural components
a) Job Tracker
b) Task Trackers
2. Functional components
a) Mapper (map)
b) Combiner (Shuffler)
c) Reducer (Reduce)
Architectural Components:
➢ The complete execution process (execution of Map and Reduce tasks, both) is controlled by
two types of entities called:
1. Job Tracker: acts as the master (responsible for complete execution of the submitted job)
2. Multiple Task Trackers: act as slaves, each of them performing part of the job
➢ For every job submitted for execution in the system, there is one Job tracker that resides
on Name node and there are multiple task trackers which reside on Data node.
• A job is divided into multiple tasks, which are then run on multiple data nodes in a
cluster.
• It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run
on different data nodes.
• Execution of an individual task is then looked after by the task tracker, which resides on every
data node executing part of the job.
• The task tracker's responsibility is to send the progress report to the job tracker.
• In addition, the task tracker periodically sends a 'heartbeat' signal to the job tracker so as to
notify it of the current state of the system.
• Thus job tracker keeps track of overall progress of each job. In the event of task failure,
the job tracker can reschedule it on a different task tracker.
Functional Components:
A job (the complete work) is submitted to the master; Hadoop divides the job into phases: a map
phase and a reduce phase. In between Map and Reduce, there is a small phase called Shuffle &
Sort.
1. Map tasks
2. Reduce tasks
Map Phase: This is the very first phase in the execution of a map-reduce program. In this phase, the data
in each split is passed to a mapping function to produce output values. The map takes a key/value
pair as input: the key is a reference to the input, and the value is the data set on which to operate. The map
function applies the business logic to every value in the input. Map produces an output called the
intermediate output. The output of the map is stored on the local disk, from where it is shuffled to
the reduce nodes.
Reduce Phase: In MapReduce, Reduce takes intermediate key/value pairs as input and processes
the output of the mapper. Key/value pairs provided to reduce are sorted by key. Usually, in the
reducer, we do aggregation or a summation sort of computation. The values for a given key are
supplied to the user-defined Reduce function. Reduce produces a final output as a
list of key/value pairs. This final output is stored in HDFS and replication is done as usual.
Shuffling: This phase consumes output of mapping phase. Its task is to consolidate the relevant
records from Mapping phase output.
A Word Count Example of MapReduce:
Let us understand, how a MapReduce works by taking an example where I have a text file called
example.txt whose contents are as follows:
Deer, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we need to perform a word count on example.txt using MapReduce. So, we
will be finding the unique words and the number of occurrences of those unique words.
• First, we divide the input in three splits as shown in the figure. This will distribute the
work among all the map nodes.
• Then, we tokenize the words in each of the mapper and give a hardcoded value (1) to
each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is
that every word, will occur once.
• Now, a list of key-value pair will be created where the key is the individual word and
value is one. So, for the first line (Deer Bear River) we have 3 key-value pairs – Deer, 1;
Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
• Now, each Reducer counts the values which are present in that list of values. As shown
in the figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts
the number of ones in the very list and gives the final output as – Bear, 2.
• Finally, all the output key/value pairs are then collected and written in the output file.
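The same word-count logic can be written as a self-contained Python sketch (an illustration of the map and reduce functions, not actual Hadoop code): the mapper emits (word, 1) pairs, and the reducer sums the counts for each key after the pairs have been sorted and grouped, mimicking the shuffle-and-sort phase.

from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every token in the input splits."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped by key; sum the counts per word."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

text = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
for word, total in reducer(mapper(text)):
    print(word, total)                      # Bear 2, Car 3, Deer 2, River 2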
Heartbeat Signal:
➢ HDFS follows a master slave architecture. Namenode (master) stores metadata about the
data and Datanodes store/process the actual data (and its replications).
➢ Now the namenode should know if any datanode in a cluster is down (power failure/network
failure) otherwise it will continue assigning tasks or sending data/replications to that dead
datanode.
➢ Heartbeat is a mechanism for detecting datanode failure and ensuring that the link between
datanodes and namenode is intact. In Hadoop , Name node and data node do communicate
using Heartbeat. Therefore, Heartbeat is the signal that is sent by the datanode to the
namenode after the regular interval of time to indicate its presence, i.e. to indicate that it is
available.
➢ The default heartbeat interval is 3 seconds. If the DataNode in HDFS does not send
heartbeat to NameNode in ten minutes, then NameNode considers the DataNode to be out
of service and the Blocks replicas hosted by that DataNode to be unavailable.
➢ Hence, once the heartbeat stops arriving, the NameNode performs certain tasks, such as
replicating the blocks present on that DataNode to other DataNodes, to keep the data highly
available and ensure data reliability.
➢ The heartbeat that the NameNode receives from a DataNode also carries information like total
storage capacity, the fraction of storage in use, and the number of data transfers currently in
progress. The NameNode uses these statistics for its block allocation and load balancing
decisions.
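The following Python sketch is only a conceptual illustration of heartbeat-based failure detection (it is not NameNode code; the 3-second interval and 10-minute timeout are the defaults mentioned above):

import time

DEAD_TIMEOUT = 10 * 60          # treat a DataNode as out of service after ~10 minutes of silence

last_heartbeat = {}             # datanode id -> timestamp of its most recent heartbeat

def receive_heartbeat(datanode_id):
    """Record the arrival time of a heartbeat from a DataNode (sent every ~3 seconds)."""
    last_heartbeat[datanode_id] = time.time()

def dead_datanodes():
    """DataNodes whose heartbeat has not been seen within the timeout."""
    now = time.time()
    return [dn for dn, ts in last_heartbeat.items() if now - ts > DEAD_TIMEOUT]

receive_heartbeat("dn1")
print(dead_datanodes())         # [] -- dn1 reported recently, so its block replicas stay available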
Speculative Execution
➢ In Hadoop, MapReduce breaks jobs into tasks, and these tasks run in parallel rather than
sequentially, thus reducing overall execution time. This model of execution is sensitive to slow
tasks (even if they are few in number) as they slow down the overall execution of a job.
➢ There may be various reasons for the slowdown of tasks, including hardware degradation or
software misconfiguration, but it may be difficult to detect the causes since the tasks still
complete successfully, although they take more time than expected.
➢ The Hadoop framework does not try to diagnose or fix the slow-running tasks. The
framework
tries to detect the task which is running slower than the expected speed and launches
another task, which is an equivalent task as a backup. The backup task is known as the
speculative task, and this process is known as speculative execution in Hadoop.
➢ As the name suggests, Hadoop tries to speculate the slow running tasks, and runs the same
tasks in the other nodes parallel. Whichever task is completed first, that output is
considered
for proceeding further, and the slow running tasks are killed.
➢ Firstly, all the tasks for the job are launched in Hadoop MapReduce. Speculative tasks are then
launched for those tasks that have been running for some time (at least one minute) and have
not made much progress, on average, compared with other tasks from the job. The
speculative task is killed if the original task completes before it; on the
other hand, the original task is killed if the speculative task finishes first.
➢ In conclusion, we can say that Speculative execution is the key feature of Hadoop that
improves job efficiency. Hence, it reduces the job execution time.
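As a rough conceptual sketch (not the actual Hadoop scheduler logic), speculative execution can be pictured as comparing each running task's progress with the average and launching a backup copy for clear stragglers:

# Hypothetical progress scores (0.0 .. 1.0) reported by the running tasks of one job.
progress = {"task1": 0.92, "task2": 0.88, "task3": 0.15, "task4": 0.90}

SLOWNESS_MARGIN = 0.2           # how far below average a task must lag to count as slow

average = sum(progress.values()) / len(progress)
stragglers = [t for t, p in progress.items() if p < average - SLOWNESS_MARGIN]

for task in stragglers:
    # Hadoop would launch an equivalent backup (speculative) task on another node
    # and keep whichever copy finishes first, killing the other.
    print("launch speculative copy of", task)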
BigData & Hadoop Unit – III
By using BigInsights, users can extract new insights from this data to enhance knowledge of their
business. BigInsights incorporates tooling and value-add services for numerous users, speeding
time to value and simplifying development and maintenance:
• Software developers can use the value-add services that are provided to
develop custom text analytic functions to analyze loosely structured or
largely unstructured text data.
• Data scientists and business analysts can use the data analysis tools within
the value-add services to explore and work with unstructured data in a
familiar spreadsheet- like environment.
• BigInsights provides distinct capabilities for discovering and analyzing
business insights that are hidden in large volumes of data. These
technologies and features combine to help your organization manage data
from the moment that it enters your enterprise.
• Apache Hadoop helps enterprises harness data that was previously
difficult to
manage and analyze. BigInsights features Hadoop and its related
technologies as a core component.
Figure 1 illustrates IBM's big data platform, which includes software for
processing streaming data and persistent data. BigInsights supports the
latter, while InfoSphere Streams supports the former. The two can be
deployed together to support real-time and batch analytics of various forms
of raw data, or they can be
deployed individually to meet specific application objectives.
Figure 1: IBM's big data platform (InfoSphere Streams for streaming data, BigInsights for persistent data).
IBM developed BigInsights to help firms process and analyze the increasing
volume, variety, and velocity of data of interest to many enterprises.
Machine Data Accelerators: Tools aimed at developers that make it easy to develop applications
that process log files, including web logs, mail logs, and various specialized file formats.
Social Data Accelerators: Tools to easily import and analyze social data at scale from multiple
online sources, including tweets, boards, and blogs.
Adaptive MapReduce:
Hadoop is now being used for files that are smaller in size. As queries are run against smaller data
sets, programmers do not want to wait long for the results of a query, and the overhead of Apache
MapReduce becomes more and more noticeable. MapReduce also does not scale well in a
heterogeneous environment (heterogeneous environments are distributed systems with nodes that
vary greatly in hardware). These are some of the computational issues with MapReduce
programming. IBM's BigInsights Adaptive MapReduce is designed to address these types of problems.
The Adaptive MapReduce component can run distributed application services on a scalable, shared,
heterogeneous grid. This low-latency scheduling solution supports sophisticated workload
management capabilities beyond those of standard Hadoop MapReduce.
Features of Adaptive MapReduce:
• Low-latency task scheduling – This is the goal of AMR: reduce task scheduling from
30 seconds to near real time. The adaptive MapReduce approach allows a single task tracker to
start processing and, once started, to keep working on multiple tasks.
• Faster performance for small job workloads – Another goal of AMR is a redesigned shuffling
algorithm, which skips the disk spill before handing data off to the reduce phase. (Spilling happens
at least once, when the mapper finishes, because the output of the mapper has to be sorted and
saved to disk for the reducer processes to read.)
• High-availability MapReduce framework – Two tiers of workload and resource scheduling.
In Apache MapReduce, workload scheduling and resource scheduling are two different layers and
are tightly coupled. Adaptive MapReduce separates these two layers, and there is a concept of
central site management of workloads. Therefore, there is no single point of failure for the job
tracker, and high availability is possible.
• Runs the MapReduce service on available hosts without a fixed hostname for the job tracker.
Adaptive MapReduce Architecture:
There is a resource scheduler called the EGO Master (Enterprise Grid Orchestrator). On each node in
the cluster there is a small process called the LIM (Load Information Manager). The LIM sends its
current state back to the EGO master. Next there is a workload manager, which can be dispatched
to any node in the cluster pool. This sends MapReduce requests to the resource scheduler through
plug-ins like SIM, J2EE, etc.; however, with the current implementation of BigInsights, only
MapReduce task scheduling takes advantage of it.
Compression Techniques:
Even though Hadoop slave nodes are designed to be inexpensive, they are not free, and with large
volumes of data that have a tendency to grow at increasing rates, compression is an obvious tool to
control extreme data volumes. Data coming into Hadoop can be compressed. Mappers can output
compressed data, reducers can read the compressed output from the map task, and the reducers
can be directed to write their results in compressed format. How much the size of the data
decreases depends on the chosen compression algorithm.
➢ File compression brings two major benefits: it reduces the space needed to store files, and it
speeds up data transfer across the network or to and from disk.
Gzip: A compression utility that was adopted by the GNU project, Gzip (short for GNU zip)
generates compressed files that have a .gz extension. You can use the gunzip command to
decompress files that were created by a number of compression utilities, including Gzip.
Bzip2: Bzip2 achieves better compression ratios than Gzip but is considerably slower, so if you are setting
up an archive that you'll rarely need to query and space is at a high premium, then Bzip2 may be
worth considering.
Snappy: The Snappy codec from Google provides modest compression ratios,
but fast compression and decompression speeds. (In fact, it has the fastest
decompression speeds, which makes it highly desirable for data sets that are
likely to be queried often.). The Snappy codec is integrated into Hadoop
Common, a set of common utilities that supports other Hadoop subprojects.
You can use Snappy as an add-on for more recent versions of Hadoop that do
not yet provide Snappy codec support.
LZO: The LZO codec provides fast compression and decompression, but its compressed files are not
splittable by default; an index is required to tell the mapper where it can safely split the
compressed file. LZO is only really desirable if you need to compress text files.
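The ratio-versus-speed trade-off described above can be seen with Python's standard-library codecs (only gzip and bzip2 are shown; Snappy and LZO need third-party packages, so they are left out of this sketch):

import bz2
import gzip
import time

data = b"river car deer bear " * 200000     # roughly 4 MB of repetitive sample text

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    # bzip2 usually compresses more tightly but takes noticeably longer than gzip
    print("%s: ratio %.1fx in %.3f s" % (name, ratio, elapsed))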
Hadoop Codecs
Authentication:
• If you choose PAM with flat file authentication during installation, the
password for the biadmin user must be the password that you enter in the
Secure Shell panel of the InfoSphere BigInsights installation program.
When you choose this option for security, you can either install sample
groups and users or specify the groups and users individually. If you install
the sample groups and users, the installation program generates the
groups and users and maps the roles automatically. The Example section
that follows the Procedure lists the default groups and user roles.
• If you specify the users and groups individually, then you must create
users and groups individually, set the password for each user, and assign
each user to groups.
PAM (Pluggable Authentication Module):
➢ Pluggable authentication modules are a common framework for
authentication
and security. A pluggable authentication module (PAM) is a
mechanism to integrate multiple low-level authentication schemes
into a high-level application programming interface (API). It
allows programs that rely on authentication to be written
independent of the underlying authentication scheme.
➢ PAM uses a pluggable, modular architecture, which affords the
system
administrator a great deal of flexibility in setting authentication policies
for the system.
➢ PAM authentication enables system users to be granted access and
privileges
transparently, so that applications do not need to contain information
about where and how user data is stored.
➢ If you are using PAM authentication methods, it allows you to authenticate
using a Linux shadow password file or by communicating with an LDAP
server. If you are using a Linux shadow password file, then the BigInsights
administration user that you intend to create must be defined in
/etc/shadow along with the encrypted password. The groups that are
associated with the web console roles must be defined in the /etc/groups
file.
➢ The purpose of PAM is to separate the development of privilege-granting
software from the development of secure and appropriate
authentication schemes. This is accomplished by providing a library of
functions that an application may use to request that a user be
authenticated. When a program needs to authenticate a user, PAM
provides a library containing the functions for the proper authentication
scheme. Because this library is loaded dynamically, changing
authentication schemes can be done by simply editing a configuration
file.
➢ Flexibility is one of PAM’s greatest strengths. PAM can be configured to deny
certain programs the right to authenticate users, to only allow certain
users to be authenticated, to warn when certain programs attempt to
authenticate, or even to deprive all users of login privileges. PAM’s
modular design gives you complete control over how users are
authenticated.
Option – Description
• LDAP: Add users and user groups to your LDAP configuration. Ensure that the full
distinguished name for each user is mapped to the group in LDAP.
• Flat file: To add users and associated passwords to the
$BIGINSIGHTS_HOME/conf/install.xml file, run the following commands:
adduser user1
passwd user1
Then run the $BIGINSIGHTS_HOME/console/bin/refresh_security_config.sh
script and restart the console:
./stop.sh console
./start.sh console
Groups
A group associates a set of users who have similar business responsibilities.
Groups separate users at the business level rather than the technical level.
You might define one or more groups that associate users into logical sets
based on business needs and constraints.
Users
A user is an entity that can be authenticated and typically represents an
individual person. Each user authenticates to the system by using credentials,
including a user name and password, and might belong to one or more
groups. A user inherits the roles associated with all groups of which that user
is a member.
IBM General Parallel File System (GPFS) is IBM’s parallel cluster file system. It is a highly scalable,
high-performance file management infrastructure for AIX, Linux and Windows systems.
GPFS Features:
➢ It is adaptable to many user environments by supporting a wide range of
basic configurations and disk technologies.
➢ A highly available cluster architecture, with concurrent shared disk access to a global
namespace.
➢ It has the capabilities for high-performance parallel workloads.
➢ GPFS provides safe, high-bandwidth access using the POSIX I/O API.
➢ Provides good performance for large-volume, I/O-intensive jobs.
➢ It works best for large-record, sequential access patterns, and has optimizations for
other patterns.
➢ Converting to GPFS does not require application code changes, provided
the code works in a POSIX-compatible environment.
➢ No special nodes are required, like the name node in HDFS. It is easy to add or remove
nodes and storage on the fly and to perform rolling upgrades. It has integrated tiered storage,
and administration from any node is possible.
Features: GPFS vs. HDFS
• Hierarchical storage management – GPFS: allows sufficient usage of disk drives with different
performance characteristics.
• High-performance support for MapReduce applications – GPFS: stripes data across disks by using
meta-blocks, which allows a MapReduce split to be spread over local disks. HDFS: places a
MapReduce split on one local disk.
• High-performance support for … – GPFS: manages metadata by using the local node when possible
rather than reading metadata into memory unnecessarily; caches data on the client side to increase
the throughput of random reads; supports concurrent reads and writes by multiple programs;
provides sequential …
Unit – IV Hadoop
Factors to consider when planning a Hadoop cluster include:
➢ The kinds of workloads you have — CPU intensive, i.e. query; I/O intensive,
i.e. ingestion; memory intensive, i.e. Spark processing.
➢ The storage mechanism for the data — plain text/Avro/Parquet/JSON/ORC/etc.,
or compressed with GZIP or Snappy. (For example, 30% container storage, 70% compressed.)
Hadoop Cluster Architecture
A Hadoop cluster architecture consists of a data center, racks and the nodes
that execute the jobs. A data center consists of racks, and racks consist of
nodes. A medium to large cluster consists of a two- or three-level Hadoop
cluster architecture built with rack-mounted servers. Every rack of servers is
interconnected through 1 gigabit Ethernet (1 GigE). Each rack-level switch in a
Hadoop cluster is in turn connected to a cluster-level switch.
In a single node hadoop cluster, all the daemons i.e. Data Node, Name
Node, Task Tracker and Job Tracker run on the same machine/host. In a
single node hadoop cluster setup everything runs on a single JVM instance.
The hadoop user need not make any configuration settings except for
setting the JAVA_HOME variable. For any single node hadoop cluster setup
the default replication factor is 1.
In a multi-node hadoop cluster, all the essential daemons are up and run
on different machines/hosts. A multi-node hadoop cluster setup has a
master-slave architecture wherein one machine acts as the master that
runs the NameNode daemon, while the other machines act as slaves or
worker nodes that run the other hadoop daemons. Usually in a multi-node
hadoop cluster there are cheaper machines (commodity computers)
that run the Task-Tracker and Data Node daemons while other services
are run on powerful servers. For a multi-node hadoop cluster, machines or
computers can be present in any location irrespective of the location of the
physical server.
Capacity Planning:
Apache Hadoop is an open-source software framework that supports data-intensive
distributed applications, licensed under the Apache v2 license. It supports the
running of applications on large clusters of commodity hardware. We need an
efficient, correct approach to building a large hadoop cluster for a large data set
with accuracy and speed. Capacity planning plays an important role in choosing
the right hardware configuration for the hadoop components.
Capacity planning usually flows from a top-down approach of understanding:
➢ Hardware Considerations:
Hardware for Master Nodes:
Data Nodes:
• Data node configuration depends on the workload.
• Significant processing power is required – 2 quad-core 2 – 2.5 GHz CPUs.
• Multiple terabytes of disk drives.
• Memory capacity should be in the range of 16 – 24 GB RAM, with 4 x 1 TB SATA disks.
• A fast network card – Gigabit Ethernet is required.
➢ Network Considerations:
• Network transfer rate plays a key role.
• Slave nodes need at minimum 1 Gb network cards.
• 1 Gbit switches – make sure the total capacity of the switch is not exceeded.
• The HDFS cluster will perform at the speed of the slowest card.
Capacity calculations:
➢ Data Node Requirements: we can plan for the commodity machines required for the cluster. The number of nodes required depends on the data to be stored/analyzed. The amount of space needed on the data nodes depends on the following factors:
• Replication factor (generally three)
• 25% additional space to handle intermediate (shuffle) files
Ex:
Prob: Assume that your initial amount of data is 300 TB and the data growth rate is about 1 TB per day. Each Data Node has 4 disks of 1 TB each. Calculate how many data nodes are required.
Sol:
Current data size: 300 TB
Data growth rate: 1 TB/day
Replication factor: 3
With a replication factor of 3 plus roughly 25% extra space for intermediate (shuffle) files, the overall multiplier is about 4.
Total space required for the initial data: 300 TB * 4 = 1200 TB
Extra space required in a year: 365 TB * 4 = 1460 TB
Total raw storage requirement: 1200 TB + 1460 TB = 2660 TB
Each Data Node provides 4 * 1 TB = 4 TB of raw storage, so roughly 2660 TB / 4 TB ≈ 665 data nodes are required.
Let’s take an example:
Say we have 70 TB of raw data to store on a yearly basis (i.e. a moving window of 1 year). After compression (say, with Gzip at roughly 60%), we get 70 – (70 * 60%) = 28 TB, which we multiply by 3x (replication) to get 84 TB; but we keep only 70% of the capacity in use: 84 TB = x * 70%, thus x = 84 / 70% = 120 TB is the value we need for capacity planning.
Number of nodes: Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
• 12 – 24 hard disks of 1 – 4 TB in a JBOD (Just a Bunch Of Disks) configuration (no RAID, please!)
• multi-core CPUs, running at least 2 – 2.5 GHz
So let’s divide the value we have from capacity planning by the disk capacity per node: 120 TB / (12 * 1 TB) = 10 nodes.
Memory: Now let’s figure out the memory we can assign to these tasks. By default, the TaskTracker and the DataNode each take up 1 GB of RAM. For each task, budget mapred.child.java.opts (200 MB by default) of RAM. In addition, count 2 GB for the OS. So, having 24 GB of memory available, 24 – 2 = 22 GB is left for our 14 task slots; thus we can assign about 1.5 GB to each task (14 * 1.5 = 21 GB).
Configuration Management:
➢ Configuration Management is the practice of handling changes systematically so that a system maintains its integrity over time. Configuration Management (CM) ensures that the current design and build state of the system is known, good and trusted, and does not rely on the knowledge of the development team.
➢ Each Hadoop node in the cluster has its own set of configuration files, and it is possible to maintain a single set of configuration files that is used for all master and worker machines; that is, the following files are configured on each node:
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
Configuration Files:
Hadoop Cluster Configuration Files: The important configuration files are described below. All these files are available under the ‘conf’ directory of the Hadoop installation directory.
hadoop-env.sh
• This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
• As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java installation path on the system.
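For example, a one-line sketch of the relevant setting in hadoop-env.sh (the JDK path shown is a hypothetical placeholder and depends on your system):

# hadoop-env.sh (sketch): point the Hadoop daemons at the local JDK
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64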
The following three files are the important configuration files for the runtime environment settings of a Hadoop cluster.
core-site.xml
• This file informs the Hadoop daemons where the NameNode runs in the cluster.
• It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce, where the hostname and port are the machine and port on which the NameNode daemon runs and listens.
• It also informs the NameNode which IP and port it should bind to. The commonly used port is 8020, and you can also specify an IP address rather than a hostname.
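A minimal sketch of core-site.xml (the hostname namenode.example.com is a hypothetical placeholder; fs.default.name is the classic property for the NameNode address in this generation of Hadoop):

<?xml version="1.0"?>
<!-- core-site.xml (sketch): tells daemons and clients where the NameNode listens -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>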
hdfs-site.xml
• This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
• You can also configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replications can also be specified when a file is created; the default is used if replication is not specified at create time.
• The value “true” for the property ‘dfs.permissions’ enables permission checking in HDFS and the value “false” turns off permission checking.
• Typically, the parameters in this file are marked as final to ensure that they cannot be overwritten.
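A hedged sketch of hdfs-site.xml showing these two properties; the values shown are the usual defaults, and the final flag prevents them from being overridden:

<?xml version="1.0"?>
<!-- hdfs-site.xml (sketch): default block replication and permission checking -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
    <final>true</final>
  </property>
</configuration>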
mapred-site.xml
• This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
• The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker to the TaskTrackers and MapReduce clients.
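A minimal sketch of mapred-site.xml (the hostname jobtracker.example.com is a hypothetical placeholder; 8021 is a commonly used JobTracker port):

<?xml version="1.0"?>
<!-- mapred-site.xml (sketch): where TaskTrackers and clients find the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>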
You can replicate all four of the files explained above to all the Data Nodes and the Secondary NameNode. These files can then be configured for any node-specific configuration, e.g. in the case of a different JAVA_HOME on one of the DataNodes.
The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in a Hadoop cluster.
Masters
This file tells the Hadoop daemons the location of the Secondary NameNode. The ‘masters’ file on the master server contains the hostname of the Secondary NameNode server.
Slaves
The ‘slaves’ file on the master node contains a list of hosts, one per line, that are to host the DataNode and TaskTracker servers. The ‘slaves’ file on a slave server contains the IP address of that slave node; notice that the ‘slaves’ file on a slave node contains only its own IP address and not those of any other Data Nodes in the cluster.
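For illustration, hedged example contents of the two files on the master node (all hostnames are hypothetical):

# masters – hostname of the Secondary NameNode
secondarynamenode.example.com

# slaves – one DataNode/TaskTracker host per line
datanode1.example.com
datanode2.example.com
datanode3.example.com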
Cluster Administration:
• Rack Topology
• Cluster status
• Node Administration
• Balancer
• Safe Mode at start up
• Dashboards
Oozie
Glossary of Oozie terminology
Action
An execution/computation task (a MapReduce job, a Pig job, a shell command). It can also be referred to as a task or ‘action node’.
Workflow
A collection of actions arranged in a control-dependency DAG (Directed Acyclic Graph). “Control dependency” from one action to another means that the second action can’t run until the first action has completed. Workflows are defined using a workflow definition language, hPDL (an XML Process Definition Language), and are stored in a file called workflow.xml.
Workflow Definition
A programmatic description of a workflow that can be executed. Workflow definitions are written in hPDL (an XML Process Definition Language) and are stored in a file called workflow.xml.
Introduction:
Features of Oozie:
1. Oozie Workflow
An Oozie workflow consists of two types of nodes:
1. Action nodes
2. Control-flow nodes
An action node represents a workflow task, e.g. moving files into HDFS, running a MapReduce, Pig or Hive job, importing data using Sqoop, or running a shell script or a program written in Java. It triggers the execution of tasks by running scripts or MapReduce jobs. An action node specifies the type of task that is to be invoked; it could be a MapReduce job or a streaming or pipe script. Oozie supports two mechanisms to detect when a task completes:
Callback – Oozie provides a unique callback URL to the task, which notifies the URL when it is completed.
Polling – if for some reason the task fails to invoke the URL, Oozie polls the task for completion.
A control-flow node governs the execution path between actions, for example by branching depending on the result of an earlier action node. The following nodes are considered control nodes:
• Start Node – designates the start of the workflow job. The start node is the entry point for a workflow job; it indicates the first workflow node the workflow job must transition to. When a workflow is started, it automatically transitions to the node specified in the start. A workflow definition must have one start node.
• End Node – signals the end of the job. The end node is the end for a workflow job; it indicates that the workflow job has completed successfully. If one or more actions started by the workflow job are still executing when the end node is reached, those actions will be killed. A workflow definition must have one end node.
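Putting these pieces together, a minimal, hedged sketch of a workflow.xml in hPDL is shown below; the workflow name, action name, and the ${jobTracker}/${nameNode} parameters are illustrative assumptions, not taken from these notes:

<!-- workflow.xml (sketch): start -> one shell action -> end, with a kill node for errors -->
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="my-shell-action"/>
  <action name="my-shell-action">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>echo</exec>
      <argument>hello</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>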
Example Workflow Diagram
2. Oozie Coordinator
When several coordinator actions are ready to run at the same time, the order in which they are executed can be controlled through the coordinator’s execution setting:
o FIFO (default)
o LIFO
o LAST_ONLY
3. Oozie Bundle
The Oozie Bundle system allows you to define and execute a set of coordinator applications, often called a data pipeline. In an Oozie bundle, there is no explicit dependency among the coordinator applications. However, you can use the data dependencies of coordinator applications to create an implicit data application pipeline. You can start/stop/suspend/resume/rerun the bundle, which gives better and easier operational control.
Kick-off-time − The time when a bundle should start and submit coordinator applications.
Expression Language:
➢ Basic EL Functions:
Some examples of “good” EL functions are ones that perform a simple string or arithmetic operation, or return some useful internal value from Oozie. A few of them are:
urlEncode(String s): It returns the URL UTF-8 encoded value of the given string. A string with a null value is considered an empty string.
timestamp( ): It returns the current UTC date and time in W3C format down to the second (YYYY-MM-DDThh:mm:ss.sZ), e.g. 1997-07-16T19:20:30.45Z
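As a hedged illustration of how EL functions are used, they are referenced with ${...} inside workflow.xml, for example in an action’s configuration block (the property names generated.at and output.dir are hypothetical):

<configuration>
  <property>
    <name>generated.at</name>
    <value>${timestamp()}</value>
  </property>
  <property>
    <name>output.dir</name>
    <value>/user/oozie/output/${wf:id()}</value>
  </property>
</configuration>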
Hadoop EL Constants
Workflow EL Functions:
wf:id( ) – It returns the workflow job ID for the current workflow job.
wf:errorCode(String node) – It returns the error code for the specified action node, or an empty string if the action node has not exited with the ERROR state. Each type of action node must define its complete error code list.
wf:run( ) – It returns the run number for the current workflow job; normally 0 unless the workflow job is re-run, in which case it indicates the current run.
Workflow Job:
If the notification URL contains any of the supported tokens (such as the job ID and the job status), they will be replaced with the actual values by Oozie before making the notification.
PREP: When a workflow job is first created it is in the PREP state. The workflow job is defined but it is not running.
Commonly, workflow jobs are run based on regular time intervals and/or data availability. In some cases, they can be triggered by an external event.
The Oozie coordinator system allows for recurring and dependent workflow jobs. This means that, using the coordinator, you can define triggers to invoke workflows on regular time intervals, on data availability, and for some limited external events. These definitions are written in the coordinator.xml file.
Commonly, the output of one workflow becomes the input to the next workflow. For example, the outputs of the last 4 runs of a workflow that runs every 15 minutes become the input of another workflow that runs every 60 minutes. Chaining these workflows together is referred to as a data application pipeline.
The Oozie Coordinator system allows the user to define and execute recurrent and interdependent workflow jobs (data application pipelines).
Coordinator Application
Types of coordinator applications:
Coordinator Job
When a coordinator job is submitted, Oozie parses the coordinator job XML. Oozie then creates a record for the coordinator with status PREP and returns a unique ID. The coordinator is also started immediately if a pause time is not set.
When a coordinator job starts, Oozie puts the job in status RUNNING and starts materializing workflow jobs based on the job frequency.
When the pause time is reached for a coordinator job that is in RUNNING status, Oozie puts the job in status PAUSED.
When the coordinator job materialization finishes and all workflow jobs finish, Oozie updates the coordinator status accordingly. For example, if all workflows SUCCEEDED, Oozie puts the coordinator job into SUCCEEDED status.
Coordinator Action
Unit V: Schedulers & Moving Data into Hadoop
Managing Job Execution – Schedulers – FIFO, FAIR - equal share, minimum share, minimum
share-no demand , minimum share exceeds available slots, minimum share less than fair share
and weights
Moving Data into Hadoop – Data Load Scenarios – Data at Rest, Streaming Data, and Data from
Data Warehouse, Sqoop , Data in Motion, Apache Flume, Flume Components.
Job Execution:
There are two types of nodes that control job execution:
➢ There is a single JobTracker node, which may run on the NameNode or on a node by itself.
➢ There are TaskTracker nodes that run on the DataNodes.
Each job is broken up into a list of map functions and reduce functions. It is the job of the JobTracker to schedule tasks onto the TaskTrackers; the TaskTrackers are responsible for executing those tasks. Each TaskTracker is assigned a number of slots, which may vary from TaskTracker to TaskTracker. Each map function or reduce function consumes a slot while executing. Each TaskTracker informs the JobTracker of the number of slots that are available; if a slot is available on a TaskTracker, then that TaskTracker is able to accept a task for execution.
Schedulers:
Schedulers are used to control the number of task slots that can be consumed by a user or group. Users submit jobs, these jobs invoke tasks on the data nodes, and the tasks make use of task slots to execute. The job of the scheduler is to see whether any task slots are available on a DataNode before sending a task to that DataNode. The scheduler can also ensure that not all slots are consumed by a particular user or group.
FIFO (First In First Out) scheduler:
The original scheduling algorithm integrated into the JobTracker was called FIFO. In FIFO scheduling, the JobTracker pulls jobs from a work queue, oldest job first. This scheduler has no concept of the priority or size of a job, but the approach is simple to implement and efficient.
The default scheduler for Hadoop is the FIFO scheduler. Jobs are executed based on priority, in the sequence in which they were received.
FAIR scheduler
➢ The fair scheduler was developed by Facebook.
➢ It allows multiple users to run jobs on the cluster at the same time.
➢ The idea behind this scheduler is that all jobs get, on average, an equal share of resources over time. If there is just a single job running, it can use the entire cluster.
➢ When other jobs are submitted, task slots that are freed up are assigned to the new jobs. This allows short jobs to finish in a reasonable amount of time without starving long-running jobs. The FAIR scheduler can also work with priorities.
➢ Jobs are assigned to pools and resources are shared fairly between the pools. By default there is a separate pool for each user, so that each user gets some share of the cluster. Minimum shares and weights per pool are set in an allocation file, as sketched below.
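A hedged sketch of a FAIR scheduler allocation file for this generation of Hadoop (MRv1); the pool names and numbers are hypothetical, minMaps/minReduces give a pool its minimum share, and weight skews its fair share:

<?xml version="1.0"?>
<!-- fair scheduler allocations (sketch): pools with minimum shares and weights -->
<allocations>
  <pool name="analytics">
    <minMaps>10</minMaps>       <!-- minimum map slots guaranteed to this pool -->
    <minReduces>5</minReduces>  <!-- minimum reduce slots guaranteed to this pool -->
    <weight>2.0</weight>        <!-- twice the fair share of a weight-1.0 pool -->
  </pool>
  <pool name="reporting">
    <minMaps>5</minMaps>
    <minReduces>2</minReduces>
  </pool>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>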
Data in Motion:
This is data that is continuously updated. New data might be regularly added to these data sources: data may be appended to a file, or logs may be merged into a single log file. We need the capability of merging the files before copying them into the Hadoop cluster.
For basic Hadoop, the open source product Flume can be used for this type of scenario. There are also several sample applications that come with BigInsights that can be used to load data in motion.
Flume is a great tool for collecting data from a web server and storing it into HDFS. Another option is to use Java Management Extensions (JMX) commands.
Ex: Web Crawler, Board Reader
Streaming Data:
InfoSphere Streams processes continuous data on the fly. This type of data is different from logging data. Although logging data might be continuously updated, it is also being stored on disk, and there is no sense of urgency when processing that data.
InfoSphere Streams is designed to work with data that flows continuously and theoretically may never end, so waiting for this data to be written to disk might not make sense. Once the appropriate analysis has been completed on the streaming data, we may decide that some results should be kept for further analysis. For this reason InfoSphere Streams is able to write its results into HDFS files.
Ex: Data from a security camera or a patient monitoring system.
In the case of patient monitoring data, the data must be analyzed the instant it is generated. Waiting for the data to be accumulated and then written to disk before starting the analysis could be the difference between life and death.
Sqoop connection:
➢ The database connection requirements are the same for import and export of data in Sqoop.
➢ A JDBC (Java Database Connectivity) connection string is required, together with the connection details: the connection path, the username of the DB2 user, and the password.
Ex: sqoop import --connect jdbc:db2://your.db2.com:50000/yourDB \
      --username db2user --password db2password
Sqoop import
➢ The Sqoop import command is used to extract data from an RDBMS table and load the data into Hadoop.
➢ Each row in the table corresponds to a separate record in HDFS. The resulting data in HDFS can be stored as text files or binary files, or imported directly into HBase or Hive.
➢ Sqoop allows you to specify particular columns or a condition to limit the rows, and you can write your own queries to access the relational data (see the sketch after this example).
Ex: sqoop import --connect jdbc:db2://your.db2.com:50000/yourDB \
      --username db2user --password db2password --table db2table \
      --target-dir sqoopdata
The above command extracts all rows and columns from the table db2table in a DB2 database called yourDB. The results are stored in a directory in HDFS called sqoopdata. The command connects to a DB2 system that listens on port 50000 and runs on a host named “your.db2.com”. The connection is made with a user id and password.
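A hedged sketch of limiting an import to particular columns and rows (the column names, the condition, and the target directory are hypothetical):

sqoop import --connect jdbc:db2://your.db2.com:50000/yourDB \
  --username db2user --password db2password --table db2table \
  --columns "empno,salary" \
  --where "salary > 50000" \
  --target-dir sqoopdata_filtered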
Sqoop export
➢ The Sqoop export command reads data in Hadoop and places it into relational tables. The target table must already exist to store the exported data.
➢ By default, Sqoop inserts rows into the relational table. If there are any errors when doing the inserts, the export process will fail.
➢ Another export mode is update mode, which causes Sqoop to generate UPDATE statements. To do updates you must specify the “update-key” argument (see the sketch after this example).
➢ The third mode is call mode. In this mode, Sqoop calls a stored procedure and passes each record to it.
Ex: sqoop export --connect jdbc:db2://your.db2.com:50000/yourDB \
      --username db2user --password db2password --table employee \
      --export-dir /employeedata/processed
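A hedged sketch of an update-mode export (the key column empno is a hypothetical example):

sqoop export --connect jdbc:db2://your.db2.com:50000/yourDB \
  --username db2user --password db2password --table employee \
  --update-key empno \
  --export-dir /employeedata/processed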
Flume:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS. It is a great tool for collecting data from a web server and storing it into HDFS. It can also be used to collect data from sources and transfer it to other agents.
Flume architecture:
Data generators (such as Facebook or Twitter) generate data, which is collected by individual Flume agents running on them. Thereafter, a data collector (which is also an agent) collects the data from the agents, aggregates it, and stores it into a centralized store such as HBase or HDFS.
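A hedged sketch of a single Flume agent configuration (a properties file; the agent name a1, the spool directory, and the HDFS path are hypothetical):

# flume agent (sketch): read files dropped into a spool directory and write them to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/incoming

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode.example.com:8020/flume/events

# wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1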