Module 3
Hadoop & HDFS
Introduction to Hadoop
Hadoop
● Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment.
● It is designed to handle big data and is based on the MapReduce programming
model, which allows for the parallel processing of large datasets.
● Its framework is based on Java programming with some native code in C and
shell scripts.
Disadvantages of Hadoop
1. Not ideal for real-time processing (better suited for batch
processing).
2. Complexity in programming with MapReduce.
3. High latency for certain types of queries.
4. Requires skilled professionals to manage and develop.
Applications
1. Banking: Fraud detection, risk modeling.
2. Retail: Customer behavior analysis, inventory management.
3. Healthcare: Disease prediction, patient record analysis.
4. Telecom: Network performance monitoring.
5. Social Media: Trend analysis, user recommendation engines.
Hadoop Architecture
Hadoop has two main components:
● Hadoop Distributed File System (HDFS): HDFS breaks big files into blocks
and spreads them across a cluster of machines. This ensures data is replicated,
fault-tolerant and easily accessible even if some machines fail.
● MapReduce: MapReduce is the computing engine that processes data in a
distributed manner. It splits large tasks into smaller chunks (map) and then
merges the results (reduce), allowing Hadoop to quickly process massive
datasets.
Hadoop Architecture (cont…)
● The Hadoop framework also includes the following two modules:
Hadoop Common: the Java libraries and utilities required by
other Hadoop modules.
Hadoop YARN: a framework for job scheduling and cluster
resource management.
Hadoop Distributed File System (HDFS)
● HDFS is the storage layer of Hadoop. It breaks large files into smaller
blocks (usually 128 MB or 256 MB) and stores them across multiple
DataNodes.
● Each block is replicated (usually 3 times) to ensure fault tolerance, so that
even if a node fails, the data remains available.
Key features of HDFS:
● Scalability: Easily add more nodes as data grows.
● Reliability: Data is replicated to avoid loss.
● High Throughput: Designed for fast data access and transfer.
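To make the block-and-replica idea above concrete, here is a minimal Python sketch. It is an illustration only, not actual HDFS code: the block size, replication factor, and DataNode names are assumptions chosen for the demo, and real HDFS uses rack-aware placement rather than the simple round-robin scheme shown here.

# hdfs_blocks_sketch.py -- toy model of HDFS block splitting and replication.
# Not real HDFS: block size, replication factor, and node names are assumptions.

BLOCK_SIZE = 128 * 1024 * 1024          # default HDFS block size (128 MB)
REPLICATION = 3                         # default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical DataNodes

def number_of_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    # Ceiling division: a 500 MB file occupies 4 blocks of 128 MB.
    return (file_size_bytes + block_size - 1) // block_size

def place_replicas(num_blocks, nodes=DATANODES, replication=REPLICATION):
    # Simplified round-robin placement; real HDFS is rack-aware.
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + i) % len(nodes)]
                               for i in range(replication)]
    return placement

if __name__ == "__main__":
    file_size = 500 * 1024 * 1024       # a 500 MB file
    blocks = number_of_blocks(file_size)
    for block_id, replicas in place_replicas(blocks).items():
        print(f"block {block_id} -> replicas on {replicas}")

Running the sketch shows each block mapped to three distinct nodes, which is why the loss of any single node never makes a block unavailable.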
MapReduce
MapReduce is the computation layer in Hadoop. It works in two main
phases:
1. Map Phase: Input data is divided into chunks and processed in
parallel. Each mapper processes a chunk and produces key-value
pairs.
2. Reduce Phase: These key-value pairs are then grouped and
combined to generate final results.
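The two phases above can be illustrated with a toy, single-machine word count in Python. This is a sketch of the programming model only, not Hadoop itself; the input lines and function names are made up for the demo. Map emits key-value pairs, a shuffle step groups them by key, and reduce aggregates each group.

# wordcount_model.py -- toy illustration of the MapReduce model on one machine.
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reducer: aggregate all values collected for one key.
    return (word, sum(counts))

lines = ["big data needs big storage", "hadoop stores big data"]  # demo input

# Shuffle & sort: group all intermediate values by their key.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

# Reduce: each key and its grouped values become one final record.
for word in sorted(grouped):
    print(reduce_phase(word, grouped[word]))   # e.g. ('big', 3)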
How Does Hadoop Work?
1. Data is loaded into HDFS, where it's split into blocks and distributed across
DataNodes.
2. MapReduce jobs are submitted to the ResourceManager.
3. The job is divided into map tasks, each working on a block of data.
4. Map tasks produce intermediate results, which are shuffled and sorted.
5. Reduce tasks aggregate results and generate final output.
6. The results are stored back in HDFS or passed to other applications.
Advantages of Hadoop
1. Scalability: Easily scale to thousands of machines.
2. Cost-effective: Uses low-cost hardware to process big data.
3. Fault Tolerance: Automatic recovery from node failures.
4. High Availability: Data replication ensures no loss even if nodes fail.
5. Flexibility: Can handle structured, semi-structured and unstructured data.
6. Open-source and Community-driven: Constant updates and wide
support.
History of Hadoop
Hadoop was invented by Doug Cutting and Mike Cafarella.
Key release milestones:
In December 2011
• The Apache Software Foundation released Apache Hadoop version 1.0.
In August 2013
• Version 2.0.6 became available.
In December 2017
• Apache Hadoop version 3.0 was released, which is the major version currently in use.
In 2002
• Hadoop was started.
• Doug Cutting and Mike Cafarella started to work on the Apache Nutch project.
• After a lot of research on Nutch, they concluded that the project would be very expensive:
• roughly $500,000 for hardware, plus a running cost of approximately $30,000 per month.
In 2003
• They came across Google’s white paper on GFS (the Google File System).
• This paper described the architecture of Google’s distributed file system for storing
large datasets.
• But it was only half the solution to their problem.
In 2004
• Google published one more paper, on the MapReduce technique.
• This paper was the other half of the solution, and they now had a complete solution for
their Nutch project.
• So, they started implementing Google’s techniques (GFS & MapReduce) as open
source in the Apache Nutch project.
In 2005
• Cutting found that Nutch was limited to clusters of only 20 to 40 nodes.
• He soon realized two problems:
✔ Nutch would not achieve its potential until it ran reliably on larger clusters.
✔ And that looked impossible with just two people (Doug Cutting & Mike Cafarella).
In 2006
• Doug Cutting joined Yahoo, taking the Nutch project with him.
• He wanted to provide the world with an open-source, reliable, scalable computing
framework, with the help of Yahoo.
• So, at Yahoo, he first separated the distributed computing parts from Nutch and formed a
new project, “Hadoop”.
In 2007
• Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.
In 2008
• In January, Yahoo released Hadoop as an open-source project to ASF (Apache Software
Foundation).
• In July, Apache Software Foundation successfully tested a 4000-node cluster with
Hadoop.
In 2009
• Hadoop was successfully tested, sorting a petabyte (PB) of data in less than 17 hours.
• Doug Cutting left Yahoo and joined Cloudera to help spread Hadoop to other
industries.
Analysing Data with Unix
Unix tools provide a powerful framework for analysing data.
Essential Unix tools for data analysis:
1. grep: Global regular expression print. Used for searching for patterns in text files.
2. sed: Stream editor. Used for editing and manipulating text files.
3. awk: Pattern-directed scanning and processing language. Used for data extraction, filtering, and reporting.
4. sort: Sort lines of text files. Used for sorting data in ascending or descending order.
5. uniq: Report or omit repeated lines. Used for removing duplicates or counting unique occurrences.
6. cut: Cut out selected fields of each line of a file. Used for extracting specific columns or fields from a file.
7. paste: Merge lines of files. Used for combining data from multiple files.
8. join: Join lines of two files on a common field. Used for combining data from two files based on a common key.
9. wc: Word, line, character, and byte count. Used for counting the number of lines, words, or characters in a file.
10. head and tail: Display first or last part of a file. Used for viewing a portion of a large file.
Some examples of using these tools for data analysis:
• Data filtering: grep "pattern" file.txt to search for lines containing a specific pattern.
• Data sorting: sort -n file.txt to sort a file numerically.
• Data aggregation: awk '{sum+=$1} END {print sum}' file.txt to calculate the sum of the first column.
• Data joining: join -t, -1 1 -2 2 file1.txt file2.txt to join two files based on a common key.
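As a bridge to the Hadoop Streaming material later in this module, the awk aggregation example above can also be written as a short Python script that reads from STDIN, which is exactly how streaming mappers and reducers consume data. The file name sum_column.py and the usage line are illustrative assumptions, and the script assumes the first column is numeric.

# sum_column.py -- Python equivalent of: awk '{sum+=$1} END {print sum}' file.txt
# Usage (illustrative): cat file.txt | python3 sum_column.py
import sys

total = 0.0
for line in sys.stdin:
    fields = line.split()
    if fields:                        # skip blank lines
        total += float(fields[0])     # first column, like $1 in awk (assumed numeric)
print(total)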
Steps for Analysing Data with Hadoop
1. Data Ingestion: Load data into HDFS from various sources, such as log files, databases, or social media
platforms.
2. Data Processing: Use MapReduce or other processing frameworks like Spark, Flink, or Hive to process the
data.
3. Data Storage: Store the processed data in HDFS or other storage systems like HBase or Cassandra.
4. Data Analysis: Use data analytics frameworks like Hive, Pig, or Mahout to analyse the data.
5. Data Visualization: Use data visualization tools like Tableau, Power BI, or D3.js to visualize the insights.
Hadoop Streaming
● Hadoop Streaming is a utility that comes with the Hadoop distribution and allows
developers or programmers to write MapReduce programs in different
programming languages like Ruby, Perl, Python, C++, etc.
● We can use any language that can read from standard input (STDIN)
and write to standard output (STDOUT).
● Although the Hadoop framework is completely written in Java, programs
for Hadoop do not necessarily need to be coded in the Java programming language.
How Hadoop Streaming Works
A basic MapReduce job starts with an input reader, which is responsible for reading the input data and producing a list of
key-value pairs. Data can be read in .csv format, in a delimited format, from a database table, as image data (.jpg, .png), as audio
data, and so on. The only requirement for reading all of these types of data is that a particular input format must be created for
each of them with these input readers; the input reader contains the complete logic for the data it is reading. Suppose we want
to read an image: we specify the logic in the input reader so that it can read the image data and generate key-value pairs for it.
If we are reading image data, we can generate a key-value pair for each pixel, where the key is the location of the pixel and the
value is its color value (0-255 for a colored image). This list of key-value pairs is fed to the Map phase, and the Mapper works on
each key-value pair to generate intermediate key-value pairs. After shuffling and sorting, these are fed to the Reducer, and the
final output produced by the Reducer is written to HDFS. This is how a simple MapReduce job works.
Now let's see how different languages such as Python, C++, or Ruby can be used with Hadoop. An arbitrary language is run as a
separate process: we create an external mapper and run it as a separate external process. These external map processes are not
part of the basic MapReduce flow. The external mapper takes input from STDIN and produces output on STDOUT. As key-value
pairs are passed to the internal mapper, the internal mapper sends them over STDIN to the external mapper, where our code
written in another language (for example, Python) processes them. The external mapper then generates intermediate key-value
pairs and sends them back to the internal mapper over STDOUT.
The Reducer works the same way. Once the intermediate key-value pairs have been through the shuffle and sort process, they
are fed to the internal reducer, which sends these pairs over STDIN to the external reducer process running separately, gathers
the output generated by the external reducer over STDOUT, and finally stores the output in HDFS.
This is how Hadoop Streaming works; it is available in Hadoop by default, and we simply make use of the feature by supplying
our own external mapper and reducer. This makes Hadoop Streaming a powerful feature: anyone can write their code in any
language of their choice.
Some Hadoop Streaming Commands
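The exact option list varies by Hadoop version, so treat the following as an illustrative sketch rather than a definitive reference. A streaming job is typically submitted with the hadoop jar command shown in the comments (the jar path, HDFS paths, and the script name wordcount_streaming.py are assumptions), with the mapper and reducer written as ordinary Python code that reads STDIN and writes STDOUT.

#!/usr/bin/env python3
# wordcount_streaming.py -- one script used as both mapper and reducer,
# selected by a command-line argument.
#
# Illustrative submission (jar path and HDFS paths depend on the installation):
#   hadoop jar hadoop-streaming.jar \
#       -input /user/data/input -output /user/data/output \
#       -mapper "python3 wordcount_streaming.py map" \
#       -reducer "python3 wordcount_streaming.py reduce" \
#       -file wordcount_streaming.py
#
# Local test without a cluster:
#   echo "big data big" | python3 wordcount_streaming.py map \
#       | sort | python3 wordcount_streaming.py reduce
import sys

def run_mapper():
    # Emit "word<TAB>1" for every word read from STDIN.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # Streaming sorts by key before the reduce step, so identical words arrive
    # consecutively: sum until the word changes, then flush the count.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    run_mapper() if mode == "map" else run_reducer()

The local test pipeline mimics what the framework does: the mapper emits pairs on STDOUT, sort plays the role of the shuffle phase, and the reducer sums the grouped counts.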
Hadoop Ecosystem
● Hadoop Ecosystem is a platform or a suite which provides various services to
solve the big data problems.
● It includes Apache projects and various commercial tools and solutions.
● There are four major elements of Hadoop i.e., HDFS, MapReduce, YARN, and
Hadoop Common.
● They work collectively to provide services such as the absorption (ingestion), analysis,
storage, and maintenance of data.
Hadoop Ecosystem
Following are the components that collectively form a Hadoop ecosystem:
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database
● Mahout, Spark MLLib: Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
● Zookeeper: Managing cluster
● Oozie: Job Scheduling
Hadoop Distributed File System(HDFS)
● HDFS is the primary or major component of Hadoop ecosystem.
● It is responsible for storing large data sets of structured or unstructured data
across various nodes and thereby maintaining the metadata in the form of log
files.
● It can manage data with high volume, velocity, and variety.
● HDFS stores data in blocks. Each data block has a default size of 128MB.
This block size is configurable.
Hadoop Distributed File System(HDFS) (cont…)
● HDFS’s block-based structure enables efficient distribution and fault tolerance,
allowing for parallel processing across nodes.
● This design boosts performance and scalability in big data applications.
● HDFS consists of two core components:
• NameNode: the master node, which holds the metadata.
• DataNode: stores the actual data.
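The split between NameNode metadata and DataNode data can be pictured with a toy Python sketch. This is not the real NameNode, and the file path, block IDs, and node names are invented for illustration: the NameNode only answers which blocks make up a file and where their replicas live, while the bytes themselves stay on the DataNodes.

# namenode_sketch.py -- toy model of the NameNode metadata / DataNode data split.
# All names below are invented for illustration.

namenode_metadata = {
    "/logs/clicks.log": [                       # file -> ordered list of blocks
        {"block": "blk_0001", "replicas": ["dn1", "dn2", "dn3"]},
        {"block": "blk_0002", "replicas": ["dn2", "dn4", "dn5"]},
    ],
}

def locate_blocks(path):
    # A client 'open' request returns block locations, never the data itself;
    # the client then reads the bytes directly from the DataNodes.
    return namenode_metadata[path]

for block in locate_blocks("/logs/clicks.log"):
    print(block["block"], "->", block["replicas"])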
Yarn (Yet Another Resource Negotiator)
● YARN performs scheduling and resource allocation for the Hadoop system. It consists of
three major components:
● Resource Manager: has the privilege of allocating resources for
the applications in the system.
● Node Manager: works on the allocation of resources such as CPU,
memory, and bandwidth per machine, and later acknowledges the resource manager.
● Application Manager: works as an interface between the
resource manager and the node manager and performs negotiations as per the
requirements of the two.
MapReduce
● MapReduce is a programming model and processing technique used for
handling and analyzing large datasets in distributed computing environments,
including the cloud.
● It splits a computation task into smaller chunks, distributes them across multiple
machines, and processes them in parallel to improve efficiency and scalability.
● It has two functions:
Map()
Reduce()
MapReduce (cont…)
Map(): It performs sorting and filtering of the data, thereby organizing it into
groups. Map() generates key-value pair results, which are later
processed by the Reduce() method.
Reduce(): It performs summarization by aggregating the mapped data. In
simple terms, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
PIG
● Pig was developed by Yahoo. It works on the Pig Latin language, a
query-based language similar to SQL.
● It is a platform for structuring the data flow and for processing and
analyzing huge data sets.
● After the processing, Pig stores the result in HDFS.
● It helps achieve ease of programming and optimization and hence
is a major segment of the Hadoop ecosystem.
HIVE
● With the help of an SQL methodology and interface, Hive performs
reading and writing of large data sets.
● Its query language is called HQL (Hive Query Language).
● It is highly scalable, as it allows both real-time (interactive) processing and batch
processing.
● All SQL data types are supported by Hive, making
query processing easier.
HIVE (cont…)
● Two components:
JDBC Drivers: the JDBC (Java Database Connectivity) and
ODBC (Open Database Connectivity) drivers establish
the connection and the data storage permissions.
Hive Command Line: helps in the processing of queries.
Mahout
● Mahout brings machine-learning capability to a system or application.
● Machine learning, as the name suggests, helps a system develop
itself based on patterns, user/environment interaction, or
algorithms.
● It provides various libraries and functionalities such as collaborative
filtering, clustering, and classification.
● It allows invoking algorithms as per our needs with the help of its own
libraries.
Apache Spark
● Spark is a platform that handles compute-intensive tasks such as batch
processing, interactive or iterative real-time processing, graph
processing, and visualization.
● It works on in-memory resources and is therefore faster than
MapReduce in terms of optimization.
● Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited
for structured data and batch processing; hence both are used interchangeably in
most companies.
Apache HBase
● HBase is a NoSQL database which supports all kinds of data and is thus
capable of handling anything in a Hadoop database.
● It provides the capabilities of Google’s BigTable, and is thus able to work on big
data sets effectively.
● At times when we need to search for or retrieve the occurrences of
something small in a huge database, the request must be processed
within a very short span of time.
● HBase gives us a fault-tolerant way of storing such limited (sparse) data.
Solr, Lucene
● These are two services that perform the tasks of searching and
indexing with the help of some Java libraries.
● Lucene is a high-performance library written in Java that provides
full-text search, indexing, and retrieval technology, and it also supports
a spell-check mechanism.
● Solr is built on top of Lucene and simplifies the process of
deploying search solutions.
Zookeeper
● There was a huge issue with the management of coordination and
synchronization among the resources and components of Hadoop,
which often resulted in inconsistency.
● Zookeeper overcame all of these problems by performing synchronization,
inter-component communication, grouping, and maintenance.
Oozie
● Oozie simply performs the task of a scheduler: it schedules jobs
and binds them together as a single unit.
● There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator
jobs.
● Oozie workflow jobs are those that need to be executed in a sequentially
ordered manner.
● Oozie coordinator jobs are those that are triggered when some data or an
external stimulus is given to them.
IBM Big Data Strategy
IBM's Big Data strategy is designed to help
organizations effectively manage and analyze large
volumes of structured and unstructured data.
Key Components:
- InfoSphere BigInsights: A platform that leverages Apache Hadoop to manage
and analyze large datasets, providing enterprise-grade features like administrative
tools, workflow management, and security controls.
- BigSheets: A spreadsheet-style tool that allows users to analyze and visualize
big data, making it easier to gain insights and make data-driven decisions.
- IBM Watson: A suite of AI-powered analytics tools that enable organizations to
extract insights from big data.
Strategy:
IBM's Big Data strategy focuses on:
- Data Management: Providing a robust platform for managing large datasets,
ensuring data quality, and security.
- Analytics: Enabling organizations to analyze and visualize big data, gaining
valuable insights and making informed decisions.
- AI-powered Insights: Leveraging AI and machine learning to extract insights
from big data, predicting future outcomes, and identifying opportunities.
Benefits:
- Improved Decision-Making: IBM's Big Data strategy enables organizations to
make data-driven decisions, reducing the risk of errors and improving outcomes.
- Increased Efficiency: Automating data analysis and insights extraction,
freeing up resources for more strategic activities.
- Competitive Advantage: Organizations can gain a competitive advantage by
leveraging big data insights to drive innovation, improve customer experiences,
and optimize operations.
Real-World Applications:
- Customer Relationship Management: Analyzing customer data to improve
relationships, personalize experiences, and increase loyalty.
- Predictive Maintenance: Analyzing sensor data to predict equipment
failures, reducing downtime, and improving maintenance efficiency.
- Healthcare Analytics: Analyzing patient data to improve diagnosis,
treatment, and patient outcomes.
Introduction to InfoSphere BigInsights and BigSheets
IBM InfoSphere BigInsights and BigSheets are powerful
tools for managing and analyzing large volumes of
structured and unstructured data.
InfoSphere BigInsights:
❏ A comprehensive platform for big data analytics
❏ Leverages Apache Hadoop and other open-source technologies
❏ Provides enterprise-grade features for data management, security, and
governance
❏ Enables organizations to analyze and gain insights from large datasets
BigSheets:
❏ A spreadsheet-style tool for analyzing and visualizing big data
❏ Allows users to easily explore and analyze large datasets
❏ Provides a user-friendly interface for data analysis and
visualization
❏ Enables users to gain insights and make data-driven decisions
Key Benefits:
● Improved Data Management: BigInsights provides a robust platform
for managing large datasets.
● Enhanced Analytics: BigSheets enables users to easily analyze and
visualize big data.
● Increased Productivity: Users can quickly gain insights and make
data-driven decisions.
Use Cases:
● Data Analytics: Analyzing large datasets to gain insights and make informed
decisions.
● Business Intelligence: Using big data to drive business intelligence and
analytics.
● Data Science: Enabling data scientists to explore and analyze large
datasets.
By leveraging InfoSphere BigInsights and BigSheets, organizations can
unlock the value of their big data and drive business success.