
Big Data, MapReduce & Hadoop
By: Surbhi Vyas (7), Varsha (8)
Big Data
Big data is a term that describes the large volume of data, both structured and unstructured, that inundates (floods) a business on a day-to-day basis. But it's not the amount of data that's important; it's what organizations do with the data that matters. It can be analyzed for insights that lead to better decisions and strategic business moves.
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people, all from different sources (e.g. the Web, sales, customer contact centers, social media, mobile data and so on).
Big Data History & Current Considerations
While the term "big data" is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old.
Characteristics:
Volume: Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden.
Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely manner.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Big data has increased the demand for information management specialists so much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics.
In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year.
Who uses big data?
Big data affects organizations across practically every industry:
Banking
Education
Healthcare
Government
Manufacturing
Retail
Before MapReduce
Large scale data processing was difficult!
Managing hundreds or thousands of processors
Managing parallelization and distribution
I/O scheduling
Status and monitoring
Fault/crash tolerance
MapReduce provides all of these, easily!
MapReduce Overview
What is it?
Programming model used by Google
A combination of the Map and Reduce models with an associated implementation
Used for processing and generating large data sets
MapReduce Overview
How does it solve our previously mentioned problems?
MapReduce is highly scalable and can be used across many computers.
Many small machines can be used to process jobs that normally could not be processed by a large machine.
Map Abstraction
Inputs a key/value pair
Key is a reference to the input value
Value is the data set on which to operate
Evaluation
Function defined by the user
Applied to every value in the input
Might need to parse the input
Produces a new list of key/value pairs
Can be of a different type from the input pair
Map Example
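The original slide illustrates this with a figure. As a stand-in, here is a minimal Python sketch of a map function for word counting; the function name `word_count_map` and the (document name, text) input pair are illustrative assumptions, not part of any particular MapReduce API.

```python
# Hypothetical map function for word counting (illustrative sketch only).
def word_count_map(key, value):
    """key: a reference to the input value (e.g. a document name);
    value: the data set on which to operate (the document text)."""
    pairs = []
    for word in value.split():           # parse the input value
        pairs.append((word.lower(), 1))  # emit an intermediate (word, 1) pair
    return pairs

# One input record produces a new list of key/value pairs:
print(word_count_map("doc1.txt", "the quick brown fox jumps over the lazy dog"))
# [('the', 1), ('quick', 1), ('brown', 1), ..., ('the', 1), ('lazy', 1), ('dog', 1)]
```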
Reduce Abstraction
Starts with intermediate key/value pairs
Ends with finalized key/value pairs
Starting pairs are sorted by key
An iterator supplies the values for a given key to the Reduce function.
Reduce Abstraction
Typically a function that:
Starts with a large number of key/value pairs
One key/value for each word in all files being grepped (including multiple entries for the same word)
Ends with very few key/value pairs
One key/value for each unique word across all the files, with the number of instances summed into this entry
Broken up so that a given worker works with input of the same key.
Reduce Example
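As with the map example, a minimal Python sketch of a reduce function for word counting stands in for the slide's figure; the names are again illustrative assumptions rather than any specific MapReduce API.

```python
# Hypothetical reduce function for word counting (illustrative sketch only).
def word_count_reduce(key, values):
    """key: one unique word; values: an iterator over all counts
    emitted for that word by the map phase."""
    return (key, sum(values))   # collapse many (word, 1) pairs into one total

# Several ('the', 1) entries collapse into a single finalized pair:
print(word_count_reduce("the", iter([1, 1, 1])))   # ('the', 3)
```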
How Map and Reduce Work Together
Map returns information; Reduce accepts information. Reduce then applies a user-defined function to reduce the amount of data.
MapReduce workflow
Input data is divided into splits (Split 0, Split 1, Split 2), which worker machines read during the Map phase: Map extracts something you care about from each record and writes the intermediate results to local disk. Reduce workers then remotely read and sort that intermediate data; they aggregate, summarize, filter, or transform it and write the output files (Output File 0, Output File 1).
Example: Word Count
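In place of the slide's figure, here is a self-contained Python sketch of the whole word-count flow from the workflow slide: map over each input split, sort the intermediate pairs by key, then reduce each group. All names and the sample inputs are illustrative; a real MapReduce run distributes these stages across worker machines instead of executing them in one process.

```python
# End-to-end word count in one process, to illustrate the
# map -> sort/shuffle -> reduce flow (illustrative sketch only).
from itertools import groupby
from operator import itemgetter

def map_fn(split_name, text):
    """Map: extract something we care about from each record."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word, counts):
    """Reduce: aggregate the values seen for one key."""
    return (word, sum(counts))

splits = {                                  # input data, already split
    "split0": "the quick brown fox",
    "split1": "the lazy dog and the quick cat",
}

# Map phase over every split
intermediate = []
for name, text in splits.items():
    intermediate.extend(map_fn(name, text))

# Shuffle/sort phase: sort by key so all values for a word are adjacent
intermediate.sort(key=itemgetter(0))

# Reduce phase: one finalized pair per unique key
results = [reduce_fn(word, (count for _, count in group))
           for word, group in groupby(intermediate, key=itemgetter(0))]
print(results)
# [('and', 1), ('brown', 1), ('cat', 1), ('dog', 1), ('fox', 1),
#  ('lazy', 1), ('quick', 2), ('the', 3)]
```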
Applications
MapReduce is built on top of GFS, the Google File System.
Input and output files are stored on GFS.
While MapReduce is heavily used within Google, it has also found use in companies such as Yahoo, Facebook, and Amazon.
The original implementation was done by Google. It is used internally for a large number of Google services.
The Apache Hadoop project built a clone to specs defined by Google. Amazon, in turn, uses Hadoop MapReduce running on its EC2 (Elastic Compute Cloud) computing-on-demand service to offer the Amazon Elastic MapReduce service.
Why is this approach better?
It creates an abstraction for dealing with complex overhead.
The computations are simple; the overhead is messy.
Removing the overhead makes programs much smaller and thus easier to use.
Less testing is required as well: the MapReduce libraries can be assumed to work properly, so only user code needs to be tested.
Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation.
Hadoop
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop's strength lies in its ability to scale across thousands of commodity servers that don't share memory or disk space.
Hadoop delegates tasks across these servers (called worker nodes or slave nodes), essentially harnessing the power of each device and running them simultaneously.
This is what allows massive amounts of data to be analyzed: splitting the tasks across different locations in this manner allows bigger jobs to be completed faster.
Hadoop can be thought of as an ecosystem: it is comprised of many different components that all work together to create a single platform.
There are two key functional components within this ecosystem:
The storage of data (Hadoop Distributed File System, or HDFS)
The framework for running parallel computations on that data (MapReduce)
Imagine, for example, that an entire MapReduce job is the equivalent of building a house.
Each job is broken down into individual tasks (e.g. lay the foundation, put up drywall) and assigned to various workers, or mappers and reducers.
Completing each task results in a single, combined output: the house is complete.
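To make the "many workers, one combined output" idea concrete, the following sketch spreads word-counting tasks across local processes with Python's multiprocessing module; in Hadoop, the same division of labour happens across separate worker nodes rather than local processes. The chunk texts and function name are hypothetical.

```python
# Toy illustration of one job divided into tasks handled by several
# workers (local processes stand in for Hadoop worker nodes).
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    """One task: count the words in a single chunk of the input."""
    return Counter(chunk.split())

if __name__ == "__main__":
    chunks = [
        "lay the foundation",
        "put up drywall",
        "paint the walls and lay the floor",
    ]
    with Pool(processes=3) as pool:            # three parallel workers
        partial_counts = pool.map(count_words, chunks)

    # Combine the partial results into a single finished output
    total = sum(partial_counts, Counter())
    print(total.most_common(3))                # e.g. [('the', 3), ('lay', 2), ...]
```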
History of Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella in 2006. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
It was originally developed to support distribution for the Nutch search engine project.
Apache Nutch is a highly extensible and scalable open-source web crawler software project.
A web crawler systematically browses the World Wide Web, typically for the purpose of Web indexing.
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content.
In the early years, search results were returned by humans. But as the web grew from dozens to millions of pages, automation was needed.
Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch, the brainchild of Doug Cutting and Mike Cafarella.
They wanted to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously.
During this time, another search engine project called Google was in progress. It was based on the same concept: storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.
The Nutch project was divided: the web crawler portion remained as Nutch, and the distributed computing and processing portion became Hadoop.
In 2008, Yahoo released Hadoop as an open-source project.
Today, Hadoop's framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Search engines in the 1990s
[Figure: search engines from 1996 and 1997]
Google search engines
[Figure: Google in 1998 and 2017]
Framework of Hadoop
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing.
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
Hadoop Architecture
At its core, Hadoop has two major layers, namely:
(a) the processing/computation layer (MapReduce), and
(b) the storage layer (Hadoop Distributed File System).
The Hadoop framework also includes the following two modules:
Hadoop Common: Java libraries and utilities required by other Hadoop modules.
Hadoop YARN: a framework for job scheduling and cluster resource management.
HOW DOES HADOOP WORK?
It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many single-CPU commodity computers into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. So the first motivational factor behind using Hadoop is that it runs across clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs (a toy sketch of the splitting and replication steps follows this list):
Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated to handle hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
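Below is the toy sketch referenced above: a file is cut into fixed-size 128 MB blocks and each block is assigned to several nodes. This is a simplified simulation for illustration only, not HDFS's actual block-placement algorithm; the three-copy setting reflects HDFS's commonly cited default replication factor.

```python
# Simplified simulation of splitting a file into blocks and replicating
# each block across cluster nodes (illustrative only; not HDFS's real
# placement logic).
import itertools

BLOCK_SIZE = 128 * 1024 * 1024     # 128 MB blocks, as preferred above
REPLICATION = 3                    # commonly cited HDFS default

def split_into_blocks(file_size_bytes):
    """Number of blocks needed for a file of the given size."""
    return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_blocks(num_blocks, nodes):
    """Assign each block to REPLICATION nodes (toy round-robin policy;
    replicas stay distinct as long as there are at least 3 nodes)."""
    ring = itertools.cycle(nodes)
    return {block: [next(ring) for _ in range(REPLICATION)]
            for block in range(num_blocks)}

nodes = ["node1", "node2", "node3", "node4"]
num_blocks = split_into_blocks(1_000_000_000)      # ~1 GB file -> 8 blocks
for block, replicas in place_blocks(num_blocks, nodes).items():
    print(f"block {block}: stored on {replicas}")
```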
Advantages of Hadoop
Scalable
Cost effective
Flexible
Fast
Resilient to failure
Disadvantages of Hadoop
Security concerns
Not fit for small data
Vulnerable by nature
Potential stability issues
General limitations
Companies Using Hadoop
[The original slide attributes each use to a specific company; the company names are not preserved in this text.]
Used as a source for reporting and machine learning.
Used to store & process tweets and log files.
Its data flows through clusters.
Used for scaling tests.
Client projects in finance, telecom & retail.
Used for search optimization & research.
Client projects.
Designing & building web applications.
Log analysis & machine learning.
Social services to structured data storage.

