Big Data, MapReduce & Hadoop: By Surbhi Vyas (7), Varsha
Technologies like Hadoop and MapReduce have democratized access to data analytics by providing scalable, cost-effective ways to process massive datasets, something previously infeasible for many industries because of high infrastructure costs. Hadoop's ability to run on clusters of inexpensive commodity hardware has lowered the entry barrier for companies adopting big data technologies. Its scalability lets businesses of any size process data in parallel across multiple nodes, enabling quick insights and decision-making. MapReduce simplifies complex data processing with a straightforward programming model, so developers can use it without deep expertise in parallel computing. These advances have broadened the scope of data analytics across sectors such as finance, healthcare, and retail, fostering innovation and strengthening competitive advantage.
Hadoop has revolutionized big data analytics by providing a scalable, cost-effective solution for data storage and processing, in contrast to earlier technologies that relied on expensive, high-end servers. It builds a distributed computing environment out of commodity hardware, which significantly reduces costs. Hadoop addresses scalability by distributing tasks across numerous nodes in a cluster, allowing it to process petabytes of data with improved speed and capacity. Because its architecture processes large datasets in parallel, it is well suited to industries such as banking, healthcare, and retail, where large volumes of data must be analyzed rapidly for strategic decisions. Consequently, Hadoop has become a cornerstone technology for tackling complex data challenges that previous technologies could not manage efficiently.
The concept of big data evolved significantly with technologies like Hadoop, which enabled the storage and analysis of vast amounts of data at low cost using distributed computing systems. Initially, big data was difficult to manage because of limits in processing power and storage capacity. Hadoop addressed these challenges with a framework that can process and manage petabytes of data across clusters of commodity hardware. This evolution has profound implications for businesses, allowing them to derive insights from diverse datasets such as customer interactions, social media, and sensor data in near real time. As a result, businesses can make strategic decisions faster and more accurately, gaining competitive advantages in the marketplace. And as data volumes continue to grow, frameworks like Hadoop let companies scale their operations without exorbitant costs.
Hadoop has certain limitations, particularly in environments with smaller data requirements. It is not well suited to small datasets: its architecture and overhead are designed for large-scale processing, so small workloads incur disproportionate computation time and resource usage. Hadoop can also pose security risks, since it was originally developed without rigorous access controls or encryption, leaving potential data vulnerabilities. In addition, its complexity and potential stability issues can be hard to manage without skilled personnel. These limitations make it a poor fit for use cases where the data volume does not justify the overhead of deploying and managing a Hadoop cluster.
The Hadoop framework consists of four primary components: Hadoop Common, HDFS, Hadoop YARN, and Hadoop MapReduce.
- Hadoop Common provides the shared libraries and utilities required by the other Hadoop modules.
- HDFS (Hadoop Distributed File System) stores data across a cluster of commodity hardware, offering high bandwidth and fault tolerance through data replication.
- Hadoop YARN handles resource management and job scheduling across the cluster.
- Hadoop MapReduce implements the MapReduce programming model to execute data processing tasks in parallel across the cluster.
Together, these components enable scalable, distributed processing of large datasets by dividing work into smaller tasks that run on many nodes, yielding high throughput and resilience.
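To show how application code touches these components in practice, here is a minimal sketch using the HDFS Java API (org.apache.hadoop.fs.FileSystem) to write and read a small file. The NameNode address and file path are placeholder assumptions; in a real deployment the Configuration would normally be populated from the cluster's core-site.xml and hdfs-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Configuration normally picks up core-site.xml / hdfs-site.xml from the
            // classpath; the NameNode address below is a placeholder assumption.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits it into blocks and replicates them.
            Path path = new Path("/tmp/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back from the cluster.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }

            fs.close();
        }
    }

HDFS handles the block placement and replication behind this API, while YARN and MapReduce come into play once a processing job is submitted against the stored data.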
MapReduce offers several benefits within the Hadoop ecosystem, particularly for processing large-scale datasets. Developers write two simple functions, map and reduce, and the framework runs them in parallel across a distributed cluster, which makes complex processing tasks efficient and scalable. This abstraction of complex tasks into simpler computations reduces the effort and complexity of big data processing. There are challenges, however: intermediate data must be handled between the map and reduce stages, and the shuffle phase that moves it can be resource-intensive and become a bottleneck. In addition, MapReduce's batch-oriented model is not well suited to real-time or interactive analytics, limiting its applicability where immediate insights are required. Despite these challenges, MapReduce within Hadoop remains a robust solution for efficiently batch-processing large datasets.
The MapReduce programming model simplifies large-scale data processing by abstracting away the complexity of parallelization, distribution, and fault tolerance. In traditional settings, coordinating hundreds of processors and ensuring fault tolerance was difficult; MapReduce addresses this by letting users define simple map and reduce functions. The map function processes input data and converts it into intermediate key-value pairs, while the reduce function aggregates those pairs to produce a summarized output. The framework takes care of scheduling these functions across multiple nodes, moving the intermediate data, and recovering from failures, which makes large-scale processing feasible on clusters of smaller, cost-effective machines.
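As a concrete illustration, below is a sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API: the map function emits a (word, 1) pair for every token, the shuffle groups the pairs by word, and the reduce function sums the counts. Registering the reducer as a combiner is one common way to shrink the intermediate data moved during the shuffle discussed above; the input and output paths are taken from the command line and are assumptions here.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: turn each input line into (word, 1) intermediate key-value pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: after the shuffle groups pairs by word, sum the counts per word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            // Using the reducer as a combiner cuts down the intermediate data
            // moved across the network during the shuffle phase.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, such a job would typically be launched with something like "hadoop jar wordcount.jar WordCount /input /output", where the two paths refer to directories in HDFS.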
The adoption of big data technologies such as Hadoop has had significant socio-economic effects on industries and employment. As organizations use big data for decision-making, efficiency and innovation have increased across sectors like banking, healthcare, and retail, contributing to economic growth. This transformation has spurred demand for data analytics talent, creating employment opportunities and reshaping job markets around roles in analytics, data science, and IT infrastructure management. The shift also poses challenges: workers must reskill to meet the growing demand for technical competencies tied to big data technologies. And while these technologies improve operational efficiency, they may displace some manual jobs, creating socio-economic disparities that call for focused workforce development and training initiatives.
Hadoop's role in distributed computing is rooted in the Nutch project and its further development at Yahoo. Nutch, an open-source web crawler and search project, needed to distribute data processing across multiple nodes to keep up with the scale and speed demanded by a growing Internet. That experience led Doug Cutting and Mike Cafarella to develop Hadoop as a more general way to handle large datasets efficiently. At Yahoo, Hadoop was used to support search engine functionality, driven by the need for faster, scalable data processing. With Yahoo's backing, Hadoop matured as an open-source Apache project (becoming a top-level project in 2008), which catalyzed adoption across industries and established it as a foundational technology for distributed computing. Its evolution from a web-crawling component into a general-purpose platform for big data processing has made it integral to today's data analytics landscape.
Hadoop ensures data reliability and fault tolerance primarily through the Hadoop Distributed File System (HDFS), which replicates data blocks across multiple nodes in a cluster. Each file is divided into blocks, typically 128 MB, and each block is replicated, by default to three different nodes. If one node fails, the data remains accessible from the other replicas, so processing can continue without data loss. HDFS also monitors node health and recovers automatically by re-replicating the affected blocks from healthy nodes. This architecture provides robust fault tolerance and data reliability in environments where hardware failures are expected.
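As a small sketch of how these properties surface in the HDFS Java API, the snippet below reads a file's block size and replication factor, lists which DataNodes hold each block, and requests a higher replication factor for that one file. The file path is a placeholder, and the cluster-wide defaults referred to above would come from dfs.blocksize and dfs.replication in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder path to an existing HDFS file.
            Path path = new Path("/data/example.txt");
            FileStatus status = fs.getFileStatus(path);

            // Per-file block size and replication factor as stored by the NameNode.
            System.out.println("Block size (bytes): " + status.getBlockSize());
            System.out.println("Replication factor: " + status.getReplication());

            // List which DataNodes hold each block of the file.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + loc.getOffset()
                        + " on hosts: " + String.join(", ", loc.getHosts()));
            }

            // Ask the NameNode to raise this file's replication factor to 4;
            // HDFS then copies the blocks to an additional node in the background.
            fs.setReplication(path, (short) 4);

            fs.close();
        }
    }

Raising the replication factor trades extra storage for greater resilience and read bandwidth, which is why HDFS exposes it per file rather than only as a cluster-wide setting.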