Understanding Big Data and Hadoop Framework

This document discusses big data and Hadoop. It defines big data as very large amounts of data that are difficult to process using traditional data processing applications. It notes that Hadoop is an open source framework used to store, process, and analyze big data across clusters of commodity hardware. Key aspects of Hadoop include HDFS for storage, YARN for resource management, and MapReduce as a programming model for distributed processing of large datasets.



Omer Gafar Ahmed


What is big data?
• Data that is very large in size is called Big Data.
• 90% of today's data has been generated in the past 3 years.
Where does it come from? What does it look like?

• Structured Data: databases
• Semi-Structured Data: XML files, comma-separated values (CSV) files
• Unstructured Data: text, audio, video
3V's of Big Data
• Volume
• Velocity
• Veracity
Issues

• A huge amount of data needs to be:
• Stored
• Processed
• Analyzed
Solution
Hadoop is an open-source framework from Apache used to store, process, and analyze big data.
How?
• Storage: Hadoop uses HDFS (Hadoop Distributed File System), which forms clusters out of commodity hardware and stores data in a distributed fashion. It works on the write-once, read-many-times principle.
• Processing: The MapReduce paradigm is applied to the data distributed over the network to compute the required output.
• Analysis: Pig and Hive can be used to analyze the data.
• Cost: Hadoop is open source, so licensing cost is no longer an issue.
History
• Oct 2003 - Google publishes the GFS (Google File System) paper
• Dec 2004 - Google publishes the MapReduce paper
• 2006 - Yahoo! creates Hadoop based on GFS and MapReduce
• 2007 - Yahoo! starts using Hadoop on a 1000-node cluster
• Jan 2008 - Hadoop becomes a top-level Apache project
• Dec 2011 - Hadoop 1.0 released
• Aug 2016 - Hadoop 2.7.3 released
Modules of Hadoop
• HDFS
• YARN
• MapReduce
What is HDFS?

• Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability to parallel applications.
• It is cost effective because it uses commodity hardware.
• Name Node: HDFS works in a master-worker pattern where the Name Node acts as the master. The Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS. The HDFS cluster is accessed by multiple clients concurrently, so all this metadata is handled by a single machine. It executes file system operations such as opening, closing, and renaming files and directories.
• Data Node: Data Nodes store and retrieve blocks (see the sketch after this list).
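A minimal sketch of writing and then reading a file through the HDFS Java API. It assumes a running cluster reachable via the default configuration (fs.defaultFS in core-site.xml); the class name and file path are illustrative, not from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);          // talks to the Name Node

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path

        // Write once: the client asks the Name Node where to place blocks,
        // then streams the bytes to the Data Nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read many times: block locations come from the Name Node,
        // the bytes themselves come from the Data Nodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}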
YARN

• YARN (Yet Another Resource Negotiator) handles job scheduling and manages the cluster's resources. A small sketch of talking to YARN from Java follows.
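A minimal sketch of listing the applications the ResourceManager knows about, using the YarnClient API. It assumes a reachable ResourceManager in the default configuration; the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnExample {
    public static void main(String[] args) throws Exception {
        // Create and start a client connected to the ResourceManager.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // List applications currently tracked by the ResourceManager.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName());
        }
        yarn.stop();
    }
}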
MapReduce
• MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes.
• MapReduce provides analytical capabilities for analyzing huge volumes of complex data.
• MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected in one place and integrated to form the result dataset (see the word-count sketch after this list).
• MapReduce jobs are harder to write, so most people use Pig and Hive instead of writing Mappers and Reducers by hand.
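A sketch of the classic word-count job using the Hadoop MapReduce Java API, showing a Mapper, a Reducer, and the job setup. Input and output paths come from the command line; the class and job names are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts gathered for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output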
Hadoop and the cloud
• Microsoft has a Hadoop distribution on their cloud; it's called HDInsight.
• Amazon has a Hadoop distribution on their cloud; it's called EMR (Elastic MapReduce).
Useful Resources
• Official website
• https://hadoop.apache.org
• To download a Linux VM with Hadoop for VMware
• https://www.mapr.com
• https://www.cloudera.com
Where to Find Data?
• www.gutenberg.org (small text books)
• aws.amazon.com/datasets (very large data)
• www.infochimps.com/datasets
• en.wikipedia.org/wiki/wikipedia:database_download
Summary
• Big data is a term, not a technology.
• Hadoop is an open-source framework for working with big data.
• You can use tools like Pig and Hive to analyze big data rather than writing MapReduce code yourself.
Thank you
