
Big Data (BCS-061)

Unit II
Hadoop
• Hadoop is a platform that provides both distributed
storage and computational capabilities.
• Hadoop is a well-adopted, standards-based, open-
source software framework
• Hadoop is designed to abstract away much of the
complexity of distributed processing
• The not-for-profit Apache Software Foundation has
taken over maintenance of Hadoop, with Yahoo!
making significant contributions.
Hadoop
• Hadoop is the Apache Software Foundation top-level
project that holds the various Hadoop subprojects that
graduated from the Apache Incubator.
• The Hadoop project provides and supports the
development of open source software that supplies a
framework for the development of highly scalable
distributed computing applications.
• The Hadoop framework handles the processing
details, leaving developers free to focus on application
logic.
Hadoop: In Industry
✓ Social media (e.g., Facebook, Twitter)
✓ Life sciences
✓ Financial services
✓ Retail
✓ Government
Hadoop: History
• In 2002, when the WWW was relatively new and before
things were “Googled”, Doug Cutting and Mike Cafarella
wanted to crawl the Web and index the content so that they
could produce an Internet search engine.
• They began a project called Nutch to do this but needed a
scalable method to store the content of their indexing.
• The standard method to organize and store data in 2002
was by means of relational database management systems
(RDBMS), which were accessed in a language called SQL.
• But almost all SQL and relational stores were not appropriate for Internet search engine storage and retrieval.
Hadoop: History
• They were costly, not terribly scalable, not as tolerant
to failure as required, and possibly not as performant
as desired.
• In 2003 and 2004, Google released two important
papers, one on the Google File System and the other
on a programming model on clustered servers called
MapReduce.
• Cutting and Cafarella incorporated these technologies
into their project, and eventually Hadoop was born.
• Yahoo! began using Hadoop as the basis of its search engine, and soon its use spread to many other organizations.
Hadoop: Name Origin
• The name Hadoop is not an acronym; it’s a made-up
name
• It was the name that the child of Doug Cutting, the project’s creator, gave to a stuffed yellow elephant.
• Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere
Hadoop: Features
• It is optimized to handle massive quantities of all three kinds of data (structured, semi-structured, and unstructured) using commodity hardware, that is, relatively inexpensive computers
• It has a shared-nothing architecture
• It replicates data across multiple computers
• It is designed for high throughput rather than low latency
• It complements OLAP and OLTP
• It is not good when work cannot be parallelized or when there are data dependencies
• It is not good for processing small files.
Hadoop: Advantage
• Stores data in its native format
• Scalable
• Cost- effective
• Resilient to failures
• Flexibility
• Fast
Evolution of Hadoop
Hadoop: Versions
• Versions of Hadoop:
– Hadoop 1.0
– Hadoop 2.0
– Hadoop 3.0
Hadoop 1.0

• It has two main parts:


– Data Storage Framework
• It is a general-purpose file system called HDFS
• HDFS is schema-less
• It simply stores data files of any format as close to their
original form as possible
– Data Processing Framework
• This is a simple functional programming model initially popularized by Google as MapReduce
• It uses two functions: Map and Reduce
Hadoop 1.0: Limitations
• Required MapReduce programming expertise, along with proficiency in a language such as Java
• It supported only batch processing
• Its computation was tightly coupled to MapReduce
Hadoop 2.0
• In Hadoop 2.0, HDFS continues to be the data storage framework
• YARN (Yet Another Resource Negotiator), a separate resource management framework, was added
• Any application capable of dividing itself into parallel
tasks is supported by YARN
• MapReduce programming expertise was no longer
required
• It also supports real-time processing along with batch processing
YARN
• The idea of YARN is to split up the functionalities of resource
management and job scheduling/monitoring into separate
daemons.
• The idea is to have a global ResourceManager (RM) and per-
application ApplicationMaster (AM).
• An application is either a single job or a DAG of jobs.
• The ResourceManager and the NodeManager form the data-
computation framework.
• The RM is the ultimate authority that arbitrates resources
among all the applications in the system.
• The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
Hadoop 3.0
• Apache Hadoop 3.0.0 incorporates a
number of significant enhancements over
the previous major release line (hadoop-
2.x).
Hadoop 2.0 vs Hadoop 3.0
• License: Apache 2.0, open source (both)
• Minimum supported Java version: Java 7 (2.0) vs Java 8 (3.0)
• Fault tolerance: handled by replication (2.0) vs erasure coding (3.0)
• Data balancing: HDFS balancer (2.0) vs intra-DataNode balancer (3.0)
• Storage scheme: uses a 3x replication scheme (2.0) vs support for erasure coding in HDFS (3.0)
• Storage overhead: 200% (2.0) vs 50% (3.0)
• YARN Timeline Service: old timeline service with scalability issues (2.0) vs improved timeline service with better scalability and reliability (3.0)
• Compatible file systems: HDFS (default FS) and FTP file system (2.0) vs the previous ones as well as the Microsoft Azure Data Lake filesystem (3.0)
• Single point of failure: has features to overcome SPOF (2.0) vs needs no manual intervention to overcome it (3.0)
• Scalability: up to 10,000 nodes per cluster (2.0) vs more than 10,000 nodes per cluster (3.0)
• NameNodes: 1 active and 1 passive NameNode (2.0) vs 1 active and more than 1 passive NameNode (3.0)
Hadoop: Ecosystem
• Apache HDFS: The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
  References: 1. hadoop.apache.org 2. Google File System (GFS) paper 3. Cloudera: Why HDFS 4. Hortonworks: Why HDFS
• Apache MapReduce: MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google’s “MapReduce: Simplified Data Processing on Large Clusters” paper. The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for “Yet Another Resource Negotiator”.
  References: 1. Apache MapReduce 2. Google MapReduce paper 3. Writing YARN applications
Hadoop: Includes
• Hadoop Core, our flagship sub-project, provides a distributed
filesystem (HDFS) and support for the MapReduce distributed
computing metaphor.
• HBase builds on Hadoop Core to provide a scalable, distributed
database.
• Pig is a high-level data-flow language and execution framework for
parallel computation. It is built on top of Hadoop Core.
• ZooKeeper is a highly available and reliable coordination system.
Distributed applications use ZooKeeper to store and mediate
updates for critical shared state.
• Hive is a data warehouse infrastructure built on Hadoop Core that
provides data summarization, ad hoc querying and analysis of
datasets.
Hadoop: Resources
• The Hadoop Distributed File System (HDFS)
• The MapReduce programming platform
• The Hadoop ecosystem
The Hadoop Distributed File
System (HDFS)
HDFS
• The Hadoop Distributed File System (HDFS) is the place in a
Hadoop cluster where the data is stored.
• Built for data-intensive applications, the HDFS is designed to
run on clusters of inexpensive commodity servers.
• HDFS is optimized for high-performance, read-intensive operations, and is resilient to failures in the cluster.
• It does not prevent failures, but is unlikely to lose data,
because HDFS by default makes multiple copies of each of its
data blocks.
HDFS
• Moreover, HDFS is a write once, read many (or WORM-ish)
filesystem: once a file is created, the filesystem API only
allows you to append to the file, not to overwrite it.
• As a result, HDFS is usually inappropriate for normal online
transaction processing (OLTP) applications.
• Most uses of HDFS are for sequential reads of large files.
• These files are broken into large blocks, usually 64 MB or larger in size, and these blocks are distributed among the nodes in the cluster.
HDFS
• Blocks in HDFS are mapped to files in the host’s underlying
filesystem, often ext3 in Linux systems.
• HDFS does not assume that the underlying disks in the host
are RAID protected, so by default, three copies of each block
are made and are placed on different nodes in the cluster.
• This provides protection against lost data when nodes or disks
fail and assists in Hadoop’s notion of accessing data where it
resides, rather than moving it through a network to access it.
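Applications see none of this block management directly; they read and write through the filesystem API. Below is a minimal sketch using Hadoop’s Java FileSystem API to write a file, request a replication factor, and read it back; the path and replication value are illustrative assumptions, not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS decides which cluster the client talks to.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/demo/blocks.txt");   // hypothetical path

        // Write a file; HDFS splits it into blocks and replicates each block.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("sample content stored as replicated blocks");
        }

        // Ask HDFS to keep 3 copies of each block of this file (the usual default).
        fs.setReplication(file, (short) 3);

        // Read the file back sequentially, the access pattern HDFS is optimized for.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}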
The Design of HDFS
• HDFS is a filesystem designed for storing very large
files with streaming data access patterns, running on
clusters of commodity hardware.
Hadoop High Level Architecture
Hadoop High Level Architecture
• Hadoop is a distributed Master-slave architecture
• MasterNode is known as NameNode
• SlaveNode is known as DataNode
• Components of NameNode:
– Master HDFS
• Responsible for partitioning the data storage across the slave nodes
• Keeps track of the location of DataNodes
– Master MapReduce
• Decides and schedules computation task on slave nodes.
Hadoop Distributions
• APACHE: Apache is the organization that maintains the core Hadoop
code and distribution.
– Hadoop 1.0, Hadoop 2.0
• CLOUDERA: Cloudera is the most tenured Hadoop distribution, and it
employs a large number of Hadoop (and Hadoop ecosystem)
committers. Doug Cutting, who along with Mike Cafarella originally created Hadoop, is the chief architect at Cloudera.
– CDH 4.0, CDH 5.0
• HORTONWORKS: Hortonworks offers the same advantages as
Cloudera in terms of the ability to quickly address problems and
feature requests in core Hadoop and its ecosystem projects.
– HDP 1.0, HDP 2.0
• MAPR: M3, M5, M8
An HDFS client communicating with the master
NameNode and slave DataNodes
Daemons
• NameNode (1 per cluster): stores filesystem metadata, stores the file-to-block map, and provides a global picture of the filesystem
• Secondary NameNode (1 per cluster): performs internal NameNode transaction log checkpointing
• DataNode (many per cluster): stores block data (file contents)
Daemons: NameNode
• NameNode stores HDFS namespace
• Hadoop’s NameNode keeps all the HDFS metadata in memory
for fast metadata operations.
• The early releases in Hadoop 1.X had a single NameNode
• Any failure of the NameNode’s hardware meant the entire cluster became unusable.
• Subsequent releases had a cold standby in the form of a
secondary NameNode.
• The secondary NameNode merged the edit logs and
NameNode image files
• HDFS breaks a large file into smaller pieces called blocks.
Daemons: NameNode
• The NameNode uses a rack ID to identify DataNodes in the rack
• A rack is a collection of DataNodes within the cluster
• The NameNode keeps track of the blocks of a file as they are placed on various DataNodes
• It manages operations such as read, write, create & delete
• The NameNode uses an EditLog to record every transaction that happens to the file system metadata
• The main role of the NameNode is to manage the File System Namespace
• The File System Namespace is a collection of files in the cluster
• The File System Namespace includes the mapping of blocks to files and file properties, and is stored in a file called FsImage
Daemons: DataNode
• The DataNode stores each HDFS data block in a separate file
on its local filesystem with no knowledge about the HDFS files
themselves.
• The DataNode does not create all files in the same directory.
• Instead, it uses heuristics to determine the optimal number of
files per directory, and creates subdirectories appropriately.
• The DataNode stores the received blocks in a local filesystem,
and forwards that portion of data to the next DataNode in the
list.
• There are multiple DataNodes per cluster
Daemons: DataNode
• The DataNode periodically sends heartbeat messages to the NameNode to confirm the connectivity between the NameNode and the DataNode.

Daemons: Secondary NameNode
• The secondary NameNode periodically merged the edit logs and the NameNode image file, bringing two benefits.
• One, the primary NameNode startup time was reduced as the
NameNode did not have to do the entire merge on startup.
• Two, the secondary NameNode acted as a replica that could
minimize data loss on NameNode disasters.
• However, the secondary NameNode is not a backup node for
NameNode
• It was still not a hot standby, leading to high failover and
recovery times and affecting cluster availability.
DATA FORMAT
• A data/file format defines how information is stored in HDFS.
• Hadoop does not have a default file format and the choice of
a format depends on its use.
• The main performance concern for applications that use HDFS is the time taken to search (read) and to write information.
• Managing the processing and storage of large volumes of information is very complex, which is why an appropriate data format is required.
DATA FORMAT
• The choice of an appropriate file format can
produce the following benefits:
– Optimum writing time
– Optimum reading time
– File divisibility
DATA FORMAT
• Some of the most commonly used formats of the Hadoop
ecosystem are :
– Text/CSV: A plain text file or CSV is the most common format both
outside and within the Hadoop ecosystem.
– SequenceFile: The SequenceFile format stores the data in binary form; it supports compression but does not store metadata (a short write example follows these format slides).
– Avro: Avro is a row-based storage format. This format includes the definition of the schema of your data in JSON format. Avro allows block compression along with its divisibility, making it a good choice for most cases when using Hadoop.
– Parquet: Parquet is a column-based binary storage format that can store
nested data structures. This format is very efficient in terms of disk
input/output operations when the necessary columns to be used are
specified.
DATA FORMAT
– RCFile (Record Columnar File): RCFile is a columnar
format that divides data into groups of rows, and
inside it, data is stored in columns.
– ORC (Optimized Row Columnar): ORC is
considered an evolution of the RCFile format and
has all its benefits alongside some improvements
such as better compression, allowing faster
queries
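As a concrete illustration of the SequenceFile format mentioned above, the following sketch writes binary key-value records with Hadoop’s SequenceFile writer; the output path and sample records are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/demo/words.seq");   // hypothetical output path

        // SequenceFile stores binary key-value records; keys and values are Writables.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("apple"), new IntWritable(3));
            writer.append(new Text("red"), new IntWritable(1));
        }
    }
}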
Anatomy of File Read
Anatomy of File Write
Processing Data With Hadoop
• MapReduce Programming is a software framework
• It helps in processing massive amount of data in parallel
• In this, the input dataset is split into independent chunks
• Map tasks process these chunks in a parallel manner
• The output produced by the map tasks is intermediate data, which is stored on the local disk of the node running the task.
• The outputs of the mappers are automatically shuffled and sorted by the framework
• MapReduce Framework sorts the output based on keys
• This sorted output becomes input to the reduce tasks
Processing Data With Hadoop
• Reduce tasks produce the reduced output by combining the outputs of the various mappers.
• Job input and output are stored in a file system
• The MapReduce framework also takes care of other tasks such as scheduling and monitoring.
• MapReduce framework has two daemons:
– JobTracker
– TaskTracker
MapReduce Daemons: JobTracker
• JobTracker is a master daemon responsible for executing
overall MapReduce Job
• It provides connectivity between Hadoop and the application.
• When a job is submitted, the JobTracker creates the execution
plan
• When a task fails, it reschedules the task on a different node after a number of retries.
• There is one JobTracker per cluster
MapReduce Daemons: TaskTracker
• It is responsible for executing individual tasks assigned by the
JobTracker
• There is a single TaskTracker per slave node, and it spawns multiple JVMs to handle multiple map or reduce tasks in parallel
• The TaskTracker periodically sends heartbeats to the JobTracker
Interaction between Daemons
MapReduce Programming Architecture
MapReduce Programming Workflow
MapReduce Working
MapReduce’s shuffle and sort phases
MapReduce Example:
WordCount Problem
• This is an apple
• Apple is red in colour
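A rough walk-through of how WordCount would process these two lines (assuming words are lower-cased before counting):
• Map output: (this,1) (is,1) (an,1) (apple,1) and (apple,1) (is,1) (red,1) (in,1) (colour,1)
• After shuffle and sort, values are grouped by key: (an,[1]) (apple,[1,1]) (colour,[1]) (in,[1]) (is,[1,1]) (red,[1]) (this,[1])
• Reduce output: (an,1) (apple,2) (colour,1) (in,1) (is,2) (red,1) (this,1)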
Developing a MapReduce Application
• Write the map, reduce, and driver functions.
• Test with a small subset of the dataset.
• If it fails, use the IDE’s debugger to identify and solve the problem.
• Run on the full dataset and, if it fails, debug it using Hadoop debugging tools.
• Do profiling to tune the performance of the
program.
MAPPER
• Reads in an input pair <Key, Value>
• Outputs a pair <K’, V’>
• Let’s count the number of occurrences of each word in user queries (or tweets/blogs)
• The input to the mapper will be <queryID,
QueryText>
REDUCER
• Accepts the Mapper output, and aggregates the values for each key
Shuffle and Sort
• Probably the most complex aspect of MapReduce, and the heart of the framework!
• Map side
– Map outputs are buffered in memory in a circular buffer.
– When the buffer reaches a threshold, its contents are “spilled” to disk.
– Spills are merged into a single, partitioned file (sorted within each partition); the combiner runs here first.
• Reduce side
– First, map outputs are copied over to reducer machine.
– “Sort” is a multi-pass merge of map outputs (happens in memory
and on disk): combiner runs here again.
– Final merge pass goes directly into reducer
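The map-side buffer and spill behaviour described above can be tuned per job. A minimal sketch, assuming Hadoop 2.x property names (the values shown are only illustrative):

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    // Returns a Configuration with map-side sort buffer settings (Hadoop 2.x property names).
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // size of the in-memory circular buffer, in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // buffer fill level that triggers a spill to disk
        conf.setInt("mapreduce.task.io.sort.factor", 50);         // number of spill segments merged in one pass
        return conf;
    }
}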
Mapper Phase Code
• The first stage in development of MapReduce
Application is the Mapper Class.
• Here, the RecordReader processes each input record and generates the corresponding key-value pair.
• Hadoop saves this intermediate data produced by the mapper on the local disk
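The slide above describes the mapper in prose only, so here is a minimal WordCount Mapper sketch for the earlier example, assuming the org.apache.hadoop.mapreduce (new) API; the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself (supplied by the RecordReader).
        for (String token : value.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit the intermediate (word, 1) pair
            }
        }
    }
}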
Reducer Phase Code
• The Intermediate output generated from the mapper
is fed to the reducer which processes it and
generates the final output which is then saved in the
HDFS.
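A matching WordCount Reducer sketch (same assumptions as the mapper above; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All counts for one word arrive together after the shuffle and sort.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result);   // the final (word, total) pair is written to HDFS
    }
}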
Driver code
• The major component in a MapReduce job is a Driver
Class.
• It is responsible for setting up a MapReduce job to run in Hadoop.
• We specify the names of the Mapper and Reducer classes along with the data types and the respective job names.
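A minimal driver sketch that wires the illustrative WordCountMapper and WordCountReducer together; input and output paths are taken from the command line, and the combiner line simply reuses the reducer (optional, as discussed in the Shuffle and Sort slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)

        // submit() would return immediately; waitForCompletion() polls progress until the job ends.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}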
Debugging a MapReduce Application
• Log files are essential for the process of debugging.
• Log files can be found on the local filesystem of each TaskTracker, and if JVM reuse is enabled, each log accumulates the entire JVM run.
• Anything written to standard output or error is directed to
the relevant logfile.
Unit tests with MRUnit
• Apache MRUnit is a unit testing framework designed specifically for Apache
Hadoop map-reduce jobs. It allows developers to write and run tests locally
without requiring a Hadoop cluster, saving both time and resources. MRUnit is
used in Data Science to ensure the correctness of map-reduce logic before it’s
deployed to a production environment.
• History
• Apache MRUnit was developed as part of the larger Apache Hadoop project,
allowing for map-reduce job testing within the Hadoop ecosystem. It was adopted
by the Apache Software Foundation in early 2012 and has since seen regular
updates and improvements.
• Functionality and Features
• Apache MRUnit offers the following key features:
• Local execution of map-reduce tests without needing a Hadoop cluster.
• Support for multiple input and output data types.
• Mock object interfaces for simulating complex object interactions.
Unit tests with MRUnit
• Benefits and Use Cases
• Apache MRUnit provides several benefits:
• It speeds up the development process by providing a local testing
framework.
• It aids in debugging by allowing developers to step through their code.
• It provides robust testing capabilities ensuring quality before deployment.
• Challenges and Limitations
• However, Apache MRUnit also has certain limitations:
• It does not fully replicate the Hadoop runtime environment, which may
lead to some inconsistencies.
• It lacks support for testing the entire job flow including both Mapper and
Reducer portions simultaneously.
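A minimal MRUnit test sketch for the illustrative WordCount mapper and reducer used elsewhere in this unit, assuming MRUnit 1.x and the new-API drivers in org.apache.hadoop.mrunit.mapreduce:

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {
    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("this is an apple"))
                .withOutput(new Text("this"), new IntWritable(1))
                .withOutput(new Text("is"), new IntWritable(1))
                .withOutput(new Text("an"), new IntWritable(1))
                .withOutput(new Text("apple"), new IntWritable(1))
                .runTest();   // runs the mapper locally, no Hadoop cluster needed
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new WordCountReducer())
                .withInput(new Text("apple"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("apple"), new IntWritable(2))
                .runTest();
    }
}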
Anatomy of MapReduce
Anatomy of MapReduce
• There are five independent entities:
– The client, which submits the MapReduce job.
– The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
– The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
– The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
– The distributed filesystem, which is used for sharing job files between the other entities.
Anatomy of MapReduce
• Job Submission :
• The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it.
• Having submitted the job, waitForCompletion polls the job’s progress once per second and
reports the progress to the console if it has changed since the last report.
• When the job completes successfully, the job counters are displayed. Otherwise, the error
that caused the job to fail is logged to the console.
• The job submission process implemented by JobSubmitter does the following:
• Asks the resource manager for a new application ID, used for the MapReduce job ID.
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file,
and the computed input splits, to the shared filesystem in a directory named after the job ID.
• Submits the job by calling submitApplication() on the resource manager.
Anatomy of MapReduce

• Job Initialization :
• When the resource manager receives a call to its submitApplication() method, it
hands off the request to the YARN scheduler.
• The scheduler allocates a container, and the resource manager then launches
the application master’s process there, under the node manager’s management.
• The application master for MapReduce jobs is a Java application whose main
class is MRAppMaster .
• It initializes the job by creating a number of bookkeeping objects to keep track of
the job’s progress, as it will receive progress and completion reports from the
tasks.
• It retrieves the input splits computed in the client from the shared filesystem.
• It then creates a map task object for each split, as well as a number of reduce
task objects determined by the mapreduce.job.reduces property (set by the
setNumReduceTasks() method on Job).
Anatomy of MapReduce
• Task Assignment:
• If the job does not qualify for running as an uber task (a small job run in the same JVM as the application master), then the application master requests containers for all the map and reduce tasks in the job from the resource manager.
• Requests for map tasks are made first and with a higher priority than those for reduce tasks,
since all the map tasks must complete before the sort phase of the reduce can start.
• Requests for reduce tasks are not made until 5% of map tasks have completed.

• Task Execution:
• Once a task has been assigned resources for a container on a particular node by the
resource manager’s scheduler, the application master starts the container by contacting the
node manager.
• The task is executed by a Java application whose main class is YarnChild.
• Before it can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache.
• Finally, it runs the map or reduce task.
Failures in MapReduce
• Worker Failures:
• If a worker node (where map or reduce tasks are executed)
fails, the tasks assigned to that node need to be rescheduled
and re-executed on other nodes.
• Master Failures:
• If the master node (JobTracker) fails, the entire MapReduce
job might be restarted by a different master, requiring a new
job submission.
• Resilience:
• MapReduce frameworks are designed to be resilient to
failures, with mechanisms to detect and recover from them.
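Worker-level recovery is largely automatic; the number of attempts per task before the whole job is failed can be adjusted. A minimal sketch, assuming Hadoop 2.x property names (the values are illustrative):

import org.apache.hadoop.conf.Configuration;

public class RetryConfig {
    // Controls how many times a failed map or reduce task attempt is rescheduled
    // before the whole job is declared failed (Hadoop 2.x property names).
    public static Configuration withRetries(Configuration conf) {
        conf.setInt("mapreduce.map.maxattempts", 4);      // attempts per map task (4 is the usual default)
        conf.setInt("mapreduce.reduce.maxattempts", 4);   // attempts per reduce task (4 is the usual default)
        return conf;
    }
}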
Job Scheduling in MapReduce
• Scheduling Algorithms:
• MapReduce frameworks employ scheduling algorithms
(e.g., Capacity Scheduler, Fair Scheduler, FIFO Scheduler)
to determine which tasks to run on which nodes.
• Task Allocation:
• The scheduler allocates tasks to available nodes, taking
into account resource availability and task dependencies.
• Failure Handling:
• The scheduler also plays a crucial role in rescheduling
failed tasks and ensuring that the job progresses despite
failures.
Shuffle and Sort
• Shuffle Phase:
• In the shuffle phase, the intermediate data produced by the map tasks is partitioned and
distributed to the appropriate reduce tasks.
• Sort Phase:
• The data is then sorted by key, ensuring that all data with the same key is grouped together for the
reduce phase.
• Reduce Stage:
• The reduce tasks then process the sorted data, combining values for each key to produce the final
output.
• Efficiency:
• The efficiency of the shuffle and sort phases significantly impacts the overall performance of the
MapReduce job.
• External Merge Sort:
• When the intermediate data is too large to fit in memory, an external merge sort algorithm is used
during the sort phase.
• Data Distribution:
• The shuffle phase aims to distribute data evenly among the reduce tasks to prevent data skew,
where some reduce tasks receive significantly more data than others.
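Which reduce task an intermediate key goes to is decided by the job’s Partitioner (a hash of the key by default); plugging in a custom Partitioner is one way to fight skew. A minimal sketch whose routing rule is purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes intermediate (word, count) pairs to reducers by the word's first letter,
// only to show how the key-to-reducer assignment can be customized.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        // Must return a value in the range [0, numPartitions).
        return first % numPartitions;
        // Registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).
    }
}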
