Module 2
Hadoop Distributed File System is a block-structured file system where each file is
divided into blocks of a pre-determined size. These blocks are stored across a cluster of
one or several machines.
HDFS Components
• NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. It stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. The namespace it manages consists of files and directories.
• It executes file system namespace operations such as opening, closing, and renaming files and directories.
• DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the requests of the clients. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data itself and the second file records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, during which the namespace ID and the software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
• Namenode:
This is the daemon that runs on the master node(s). The NameNode stores metadata such as the filename, the number of blocks, the number of replicas, the locations of blocks, block IDs, etc. This metadata is kept in memory on the master for faster retrieval of data, and a copy is persisted on the local disk. The NameNode's memory should therefore be sized according to the requirement.
• Datanode:
This is the daemon that runs on the slave. These are actual worker nodes that store
the data.
HDFS Architecture
• This architecture gives you a complete picture of Hadoop Distributed File System.
There is a single namenode which stores metadata and there are multiple datanodes
which do actual storage work.
• Nodes are arranged in racks and Replicas of data blocks are stored on different racks in
the cluster to provide fault tolerance.
• To read or write a file in HDFS, the client needs to interact with Namenode. HDFS
applications need a write-once-read-many access model for files. A file once created
and written cannot be edited.
• The NameNode stores the metadata, while the DataNodes store the actual data. The client interacts with the NameNode for any task to be performed, as the NameNode is the centerpiece of the cluster.
There are several DataNodes in the cluster which store HDFS data on their local disks. Each DataNode sends a heartbeat message to the NameNode periodically to indicate that it is alive, and it replicates data to other DataNodes as per the replication factor.
Data Storage in HDFS
Whenever a file has to be written to HDFS, it is broken into small pieces of data known as blocks. HDFS has a default block size of 128 MB, which can be increased as per requirements. These blocks are stored across the cluster in a distributed manner on different nodes. This provides a mechanism for MapReduce to process the data in parallel in the cluster.
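For example, under the default 128 MB block size, a 500 MB file is stored as four blocks: three full 128 MB blocks plus one 116 MB block. A minimal Python sketch of this calculation (illustrative only, not an HDFS API):

def hdfs_block_sizes(file_size_mb, block_size_mb=128):
    # Split a file of the given size into HDFS-style blocks.
    full_blocks = file_size_mb // block_size_mb
    sizes = [block_size_mb] * full_blocks
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes.append(remainder)   # the last block is smaller than the block size
    return sizes

print(hdfs_block_sizes(500))   # [128, 128, 128, 116]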
• There can be many racks of compute nodes, and racks are connected by
another level of network or a switch.
• The bandwidth of inter-rack communication is somewhat greater than the intra-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth can still become a bottleneck.
Rack Awareness in Hadoop HDFS
• This new file system is often called a Distributed File System or DFS.
• DFS is useful when files can be enormous, possibly a terabyte in size, and are rarely updated.
• Files are divided into chunks, which are typically 64 megabytes in size.
• Chunks are replicated, perhaps three times, at three different compute nodes.
• The nodes holding copies of one chunk should be located on different racks, so as to avoid losing all copies due to a rack failure.
• The user can decide the chunk size and the degree of replication.
• A master node is used to find the chunks of a file. This node is itself replicated, and a directory for the file system as a whole knows where to find its copies. The directory itself can be replicated, and all participants using the DFS know where the directory copies are.
• DFS implementations include:
▪ CloudStore
2.2 Map-Reduce:
• Map Phase:
▪ This phase takes key-value pairs as input and produces key-value pairs as output. Custom business logic can be written in this phase.
▪ The Map function performs actions like filtering, grouping and sorting.
▪ The result generated by the Map function is a key value pair (K, V)
which acts as the input for Reduce function.
• Reduce Phase:
▪ This phase takes as input the key-value pairs produced by the Map phase, grouped by key, and combines the values associated with each key to produce the final output.
2.2.1 The Map Task
• Keys are not Unique. A Map task can produce several key-value
pairs with the same key, even from the same element. Element can
be a tuple or a document.
• The Map function uses keys that are of type String (the words) and values that are integers.
• The Map Task reads a document and breaks it into its sequence of
words W1, W2,..., Wn.
• It then emits a sequence of key-value pairs where the value is
always 1. That is, the output of Map Task for this document is the
sequence of Key-Value pairs: (W1, 1), (W2, 1),...,(Wn, 1)
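As an illustration, the following Python sketch mimics this Map task for the word-count example (the function name and the plain list that stands in for Hadoop's emit mechanism are illustrative):

def map_task(document):
    # Break the document into its sequence of words and emit (word, 1) for each.
    return [(word, 1) for word in document.split()]

print(map_task("to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]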
2.2.2 Grouping by key
• After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs.
• For each key k, the input to the Reduce task that handles key k is a
pair of the form (k, [v1, v2,..., vn]), where (k, v1), (k, v2),..., (k, vn)
are all the key-value pairs with key k coming from all the Map
tasks.
• The input to the Reduce task is thus a set of pairs, each consisting of a key and its list of associated values, and the Reduce task combines those values in some way.
• The outputs from all the Reduce tasks are merged into a single file.
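The grouping step and the word-count Reduce function can be sketched in Python as follows (group_by_key stands in for the master controller's merge; the names are illustrative, not the Hadoop API):

from collections import defaultdict

def group_by_key(pairs):
    # Merge (k, v) pairs into (k, [v1, v2, ..., vn]), as the master controller does.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def reduce_task(key, values):
    # Word count: the values are all 1s, so summing them counts the occurrences of the word.
    return (key, sum(values))

pairs = [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
for key, values in group_by_key(pairs).items():
    print(reduce_task(key, values))
# ('to', 2), ('be', 2), ('or', 1), ('not', 1)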
2.2.4 Combiners
• If a Hadoop combiner is used between the mapper and the reducer, the combiner aggregates the intermediate data locally (in this example, 9 key/value pairs) before it is sent to the reducer, and produces 4 key/value pairs as output (see the sketch after this list).
• The reducer now needs to process only the 4 key/value pairs generated by the 2 combiners. Thus the reducer needs to be executed only 4 times to produce the final output, which increases the overall performance.
• The master keeps track of the status of each Map and Reduce task; it can be idle, executing, or completed.
• The reduce task does reduction based on the code supplied by the
user and gives output accordingly.
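For word count, the combiner can simply be the Reduce function applied to each mapper's local output, which is legal because addition is associative and commutative. A minimal Python sketch (not the Hadoop Combiner API; the names are illustrative):

from collections import defaultdict

def combine(mapper_output):
    # Pre-aggregate one mapper's (word, 1) pairs locally, before the shuffle.
    counts = defaultdict(int)
    for word, n in mapper_output:
        counts[word] += n
    return list(counts.items())

# One mapper emitted 6 intermediate pairs; the combiner sends only 3 to the reducers.
local = [('a', 1), ('b', 1), ('a', 1), ('a', 1), ('c', 1), ('b', 1)]
print(combine(local))   # [('a', 3), ('b', 2), ('c', 1)]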
2.2.6 Coping with Node Failures
▪ If a Worker node fails, all the Map tasks that were assigned to this Worker will have to be redone, even if they had completed. The reason for redoing completed Map tasks is that their output destined for the Reduce tasks resides at that compute node and is now unavailable to the Reduce tasks.
▪ The Master sets the status of each of these Map tasks to idle
and will schedule them on a Worker when one becomes
available.
▪ The Master must also inform each Reduce task that the
location of its input from that Map task has changed.
Pseudo code (word count):
map(key, value):
    for word in value: emit(word, 1)
reduce(key, values):
    emit(key, sum(values))
Matrix-Vector Multiplication by MapReduce
• Suppose we have an n × n matrix M whose element in row i and column j is m_ij, and a vector v of length n whose j-th component is v_j; the matrix-vector product is the vector x of length n whose i-th component is x_i = Σ_j m_ij v_j.
• If n = 100, then there is no need to use a DFS or MapReduce for this calculation. But this sort of calculation is at the heart of the ranking of Web pages that goes on at search engines, and there, n is in the tens of billions.
• We assume that n is large, but not so large that the vector v cannot fit in main memory and thus be available to every Map task. The matrix M and the vector v will each be stored in a file of the DFS. We assume that the row-column coordinates of each matrix element are discoverable, either from its position in the file or because they are stored with explicit coordinates, as a triple (i, j, m_ij). We also assume that the position of element v_j in the vector v is discoverable in an analogous way.
1. The Map Function: The Map function is written to apply to one element of M. However, if v is not already read into main memory at the compute node executing a Map task, then v is first read, in its entirety, and subsequently will be available to all applications of the Map function performed at this Map task. Each Map task will operate on a chunk of the matrix M. From each matrix element m_ij it produces the key-value pair (i, m_ij * v_j). Thus, all terms of the sum that make up the component x_i of the matrix-vector product will get the same key, i.
2. The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, x_i).
3. If v is too large to fit in main memory, the matrix is divided into vertical stripes of equal width and the vector is divided into an equal number of horizontal stripes of the same height. The goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into main memory at a compute node. The figure (not shown) suggests what the partition looks like if the matrix and vector are each divided into five stripes.
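Putting the Map and Reduce functions of steps 1 and 2 together, here is a minimal Python sketch (not the Hadoop API; the names are illustrative) that stores M as (i, j, m_ij) triples, assumes v fits in memory, and simulates the group-by-key step locally:

from collections import defaultdict

def map_element(i, j, m_ij, v):
    # Map: from matrix element m_ij produce the key-value pair (i, m_ij * v_j).
    return (i, m_ij * v[j])

def reduce_row(i, values):
    # Reduce: sum all values with key i, giving the component (i, x_i).
    return (i, sum(values))

def matrix_vector(M_triples, v):
    grouped = defaultdict(list)            # stands in for the MapReduce shuffle
    for (i, j, m_ij) in M_triples:
        key, value = map_element(i, j, m_ij, v)
        grouped[key].append(value)
    return [reduce_row(i, vals) for i, vals in sorted(grouped.items())]

# M = [[1, 2], [3, 4]] as triples, v = [10, 20]; the product x is [50, 110].
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
print(matrix_vector(M, [10, 20]))   # [(0, 50), (1, 110)]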
• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Relational Algebra Operations
1. Selection.
2. Projection.
3. Union & Intersection.
4. Natural Join.
5. Grouping & Aggregation.
Selection:
Selection: σC(R)
– Apply condition C to each tuple of relation R.
– Produce in output a relation containing only the tuples that satisfy C.
A MapReduce implementation of σC(R):
Pseudo Code:
map(key, value):
    for tuple in value:
        if tuple satisfies C:
            emit(tuple, tuple)
reduce(key, values):
    emit(key, key)
2.3.4.1 Union
Mappers are fed by all tuples of both relations R and S to be united; the Reduce function emits each distinct tuple once.
Pseudo Code:
map(key, value):
    for tuple in value:
        emit(tuple, tuple)
reduce(key, values):
    emit(key, key)
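The two implementations above can be mimicked in Python as follows (the relations, the condition C, and the helper group_by_key that stands in for the MapReduce shuffle are all illustrative):

from collections import defaultdict

def group_by_key(pairs):
    grouped = defaultdict(list)    # stands in for the MapReduce shuffle
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def selection(R, C):
    # sigma_C(R): Map emits (t, t) only for tuples satisfying C; Reduce emits each key once.
    pairs = [(t, t) for t in R if C(t)]
    return list(group_by_key(pairs).keys())

def union(R, S):
    # R union S: Map emits (t, t) for every tuple of R and S; Reduce emits each distinct key once.
    pairs = [(t, t) for t in R] + [(t, t) for t in S]
    return list(group_by_key(pairs).keys())

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'b'), (4, 'd')]
print(selection(R, lambda t: t[0] > 1))   # [(2, 'b'), (3, 'c')]
print(union(R, S))                        # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]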
2.3.4.2 Intersection
Mappers are fed by all tuples of both relations R and S to be intersected.
Pseudo Code:
map(key, value):
    for tuple in value:
        emit(tuple, tuple)
reduce(key, values):
    if values == [key, key]:
        emit(key, key)
2.3.4.3 Difference
For the difference R – S, a tuple t can appear in the output only if it is in relation R but not in relation S.
The Map function can pass tuples from R and S through, but must inform the Reduce function whether the tuple came from R or from S.
Pseudo Code:
map(key, value):
    if key == R:
        for tuple in value:
            emit(tuple, R)
    else:
        for tuple in value:
            emit(tuple, S)
reduce(key, values):
    if values == [R]:
        emit(key, key)
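Likewise, the intersection and difference implementations can be sketched in Python (the relations and helper names are illustrative; group_by_key again stands in for the shuffle):

from collections import defaultdict

def group_by_key(pairs):
    grouped = defaultdict(list)    # stands in for the MapReduce shuffle
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def intersection(R, S):
    # Reduce emits t only if it received t from both relations (values == [t, t]).
    pairs = [(t, t) for t in R] + [(t, t) for t in S]
    return [t for t, vals in group_by_key(pairs).items() if len(vals) == 2]

def difference(R, S):
    # Map tags each tuple with its relation name; Reduce keeps t only if values == ['R'].
    pairs = [(t, 'R') for t in R] + [(t, 'S') for t in S]
    return [t for t, vals in group_by_key(pairs).items() if vals == ['R']]

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'b'), (4, 'd')]
print(intersection(R, S))   # [(2, 'b')]
print(difference(R, S))     # [(1, 'a'), (3, 'c')]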
Limitations of Hadoop
• Issue with Small Files
The main problem with Hadoop is that it is not suitable for small data. HDFS lacks the ability to support random reading of small files because of its high-capacity design.
Small files are files smaller than the HDFS block size (default 128 MB). If you store a huge number of such small files, HDFS cannot handle them well, because HDFS was designed to work with a small number of large files for storing large data sets, rather than with a large number of small files. If there are a lot of small files, the NameNode will be overloaded, since it stores the namespace of HDFS.
Solution:
Simply merge the small files to create bigger files and then copy the bigger files to HDFS.
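As a rough sketch of this idea, the following Python code concatenates many small local files into larger bundles of roughly one block size before they are copied to HDFS; the directory layout and bundle naming are hypothetical:

import os

def write_bundle(paths, output_dir, bundle_id):
    # Concatenate the given small files into one bundle file.
    out_path = os.path.join(output_dir, "bundle_%04d.dat" % bundle_id)
    with open(out_path, "wb") as dst:
        for p in paths:
            with open(p, "rb") as src:
                dst.write(src.read())

def merge_small_files(input_dir, output_dir, target_mb=128):
    # Group small files into bundles of roughly target_mb megabytes each.
    os.makedirs(output_dir, exist_ok=True)
    bundle, size, bundle_id = [], 0, 0
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        bundle.append(path)
        size += os.path.getsize(path)
        if size >= target_mb * 1024 * 1024:
            write_bundle(bundle, output_dir, bundle_id)
            bundle, size, bundle_id = [], 0, bundle_id + 1
    if bundle:
        write_bundle(bundle, output_dir, bundle_id)

# The bundles can then be uploaded with, e.g., "hdfs dfs -put bundles/ /data/"
# instead of uploading millions of tiny files individually.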
Hadoop Archives (HAR files) deal with the problem of lots of small files. Hadoop Archives work by building a layered filesystem on top of HDFS. HAR files are created with the Hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. Reading files through a HAR is no more efficient than reading through HDFS; in fact it is slower, since each HAR file access requires reading two index files as well as the data file.
Sequence files also overcome the small-file problem. Here we use the filename as the key and the file contents as the value. By writing a program, we can pack a large number of small files (say, 100 KB each) into a single SequenceFile and then process them in a streaming fashion, operating on the SequenceFile. MapReduce in Hadoop can break a SequenceFile into chunks and operate on each chunk independently, because a SequenceFile is splittable.
By storing files in HBase we can also overcome the small-file problem. We are not actually storing millions of small files in HBase; rather, we add the binary content of each file to a cell.
Solution:
Flink can also overcome this issue. Flink processes faster than Spark because of its
streaming architecture.
• Support for Batch Processing Only
Hadoop only supports batch processing; it is not suitable for streaming data, so overall performance is slower. The MapReduce framework does not leverage the memory of the Hadoop cluster to the maximum.
Solution:
Apache Spark solves this problem, as it supports stream processing. However, Spark's stream processing is not as efficient as Flink's, since it uses micro-batch processing. Apache Flink improves the overall performance, as it provides a single run-time for streaming as well as batch processing.
• No Real-time Processing
Apache Hadoop is a batch processing framework. It takes a huge amount of data as input, processes it, and produces the result. Batch processing is very efficient for processing a high volume of data but, depending on the size of the data being processed and the computational power of the system, the output can be delayed significantly. Apache Hadoop is not suitable for real-time processing.
Solution:
Spark and Flink address this through stream processing: Spark Streaming processes data in micro-batches, while Flink offers true low-latency stream processing.
• Iterative Processing
Apache Hadoop is not very efficient for iterative processing, as Hadoop does not support cyclic data flow (i.e. a chain of stages in which each output of the previous stage is the input to the next stage).
Solution:
Spark overcomes this issue, as Apache Spark accesses data from RAM instead of the disk, which dramatically improves the performance of an iterative algorithm that accesses the same dataset repeatedly. Note, however, that in Apache Spark each iteration still has to be scheduled and executed separately.
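For example, with PySpark an iterative job can cache its input RDD so that every iteration after the first reads it from memory instead of HDFS (a minimal sketch; the file path and the per-iteration computation are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="IterativeSketch")

# Read the dataset once and keep it in memory across iterations.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: [float(x) for x in line.split(",")]) \
           .cache()

for i in range(10):
    # Each pass re-uses the cached RDD instead of re-reading it from disk.
    total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
    print(i, total)

sc.stop()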
• Latency
MapReduce in Hadoop is slower because it supports different formats of structured and huge amounts of data. In MapReduce, Map takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs. Reduce takes the output from the Map as input and processes it further. MapReduce requires a lot of time to perform these tasks, thereby increasing latency.
Solution:
Apache Spark can reduce this issue. Although Spark is also a batch system, it is relatively faster because it caches much of the input data in memory via RDDs. Apache Flink's data streaming achieves low latency and high throughput.
• No Ease of Use
A MapReduce developer in Hadoop needs to hand-code each and every operation, which makes it very difficult to work with. In Hadoop, MapReduce has no interactive mode, but adding Hive and Pig makes working with MapReduce a little easier.
Solution:
Spark has overcome this issue, as Spark has an interactive mode, so that developers and users alike can have intermediate feedback for queries and other activities. Since Spark has many high-level operators, it is easy to program. One can also use Apache Flink, as it too has high-level operators.
• Security Issue
Hadoop can be challenging to secure: by default it lacks encryption at the storage and network levels, which is a major concern when handling sensitive data.
Solution:
Apache Spark provides a security bonus. If you run Apache Spark on HDFS, it can use HDFS ACLs and file-level permissions.
• Vulnerable by Nature
Apache Hadoop is written in Java. Java is one of the most popular languages and hence is heavily exploited by cybercriminals, which makes Hadoop vulnerable to security breaches.
• No Caching
Apache Hadoop is not efficient at caching. MapReduce cannot cache the intermediate data in memory for further requirements, and this diminishes the performance of Hadoop.
Solution:
Spark and Flink overcome this issue: they cache data in memory for further iterations, which enhances the overall performance.
• Lengthy Code
Apache Hadoop has about 1,20,000 lines of code. A greater number of lines leads to a greater number of bugs, and it takes more time to execute the programs.
Solution:
Spark and Flink are written in Scala and Java, but the core implementation is in Scala, so the number of lines of code is smaller than in Hadoop. Thus, it takes less time to execute the programs.