Big Data Notes
The most common myth associated with Big Data is that it is just about the size or volume of data. In fact, it is not only about the "big" amounts of data being collected. Big Data refers to the large amounts of data pouring in from various data sources in different formats. Huge amounts of data were stored in databases earlier too, but because of the varied nature of this data, traditional relational database systems are incapable of handling it. Big Data is much more than a collection of datasets with different formats; it is an important asset that can be used to obtain innumerable benefits.
Big Data is commonly characterized by five V's, but as the data keeps evolving, so will the V's, and more may develop gradually over time:
1. VOLUME
Volume refers to the 'amount of data', which is growing day by day at a very fast pace.
The size of data generated by humans, machines and their interactions on social media
itself is massive. Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of 300 times from 2005.
2. VELOCITY
Velocity is defined as the pace at which different sources generate the data every day.
This flow of data is massive and continuous. There are 1.03 billion Daily Active Users
(Facebook DAU) on Mobile as of now, which is an increase of 22% year-over-year. This
shows how fast the number of users is growing on social media and how fast data is
getting generated daily. If you are able to handle the velocity, you will be able to generate
insights and take decisions based on real-time data.
3. VARIETY
As there are many sources which are contributing to Big Data, the type of data they are
generating is different. It can be structured, semi-structured or unstructured. Hence, there
is a variety of data which is getting generated every day. Earlier, we used to get data from Excel sheets and databases; now data comes in the form of images, audio, video, sensor data, etc. Hence, this variety of mostly unstructured data creates problems in capturing, storing, mining and analyzing the data.
4. VERACITY
Veracity refers to data in doubt, i.e. the uncertainty of available data due to inconsistency and incompleteness. For example, in a table of measurements, a few values may be missing and a few may be hard to accept, such as a minimum value of 15000 in a row where that is not possible. This inconsistency and incompleteness is Veracity. Available data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control, for example Twitter posts with hashtags, abbreviations, typos and colloquial speech. The volume is often the reason behind the lack of quality and accuracy in the data.
• Due to uncertainty of data, 1 in 3 business leaders don’t trust the information they use to
make decisions.
• It was found in a survey that 27% of respondents were unsure of how much of their data
was inaccurate.
• Poor data quality costs the US economy around $3.1 trillion a year.
5. VALUE
After discussing Volume, Velocity, Variety and Veracity, there is another V that should be taken into account when looking at Big Data: Value. It is all well and good to have access to big data, but unless we can turn it into value it is useless. Turning it into value means asking: is it adding to the benefits of the organization analyzing the big data? Is the organization working on Big Data achieving a high ROI (Return on Investment)? Unless working on Big Data adds to its profits, it is useless.
Big Data can be found in three forms:
• Structured
• Semi-Structured
• Unstructured
1. Structured
The data that can be stored and processed in a fixed format is called Structured Data. Data stored in a relational database management system (RDBMS) is one example of 'structured' data. It is easy to process structured data as it has a fixed schema. Structured Query Language (SQL) is often used to manage such data.
2. Semi-Structured
Semi-Structured Data is a type of data which does not have the formal structure of a data model, i.e. a table definition as in a relational DBMS, but nevertheless has some organizational properties, like tags and other markers to separate semantic elements, that make it easier to analyze. XML files and JSON documents are examples of semi-structured data.
3. Unstructured
Data which has an unknown form, cannot be stored in an RDBMS and cannot be analyzed unless it is transformed into a structured format is called unstructured data. Text files and multimedia content like images, audio and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that about 80 percent of the data in an organization is unstructured.
Applications of Big Data
• Retail: Retail has some of the tightest margins and is one of the greatest beneficiaries of big data. The beauty of using big data in retail is to understand consumer behavior. Amazon's recommendation engine provides suggestions based on the browsing history of the consumer.
• Traffic control: Traffic congestion is a major challenge for many cities globally. Effective use of data
and sensors will be key to managing traffic better as cities become increasingly densely
populated.
• Manufacturing: Analyzing big data in the manufacturing industry can reduce component defects,
improve product quality, increase efficiency, and save time and money.
• Search Quality: Every time we extract information from Google, we simultaneously generate data for it. Google stores this data and uses it to improve its search quality.
Challenges with Big Data
1. Data Quality – The problem here is the 4th V, i.e. Veracity. The data here is very messy, inconsistent and incomplete. Dirty data costs companies in the United States $600 billion every year.
2. Discovery – Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data using extremely powerful algorithms to find patterns and insights is very difficult.
3. Storage – The more data an organization has, the more complex the problems of managing it can
become. The question that arises here is “Where to store it?”. We need a storage system which
can easily scale up or down on-demand.
4. Analytics – In the case of Big Data, most of the time we are unaware of the kind of data we are
dealing with, so analyzing that data is even more difficult.
5. Security – Since the data is huge in size, keeping it secure is another challenge. It includes user
authentication, restricting access based on a user, recording data access histories, proper use of
data encryption etc.
6. Lack of Talent – There are a lot of Big Data projects in major organizations, but building a sophisticated team of developers, data scientists and analysts who also have a sufficient amount of domain knowledge is still a challenge.
Hadoop, with its distributed processing, handles large volumes of structured and unstructured
data more efficiently than the traditional enterprise data warehouse. Hadoop makes it possible to
run applications on systems with thousands of commodity hardware nodes, and to handle
thousands of terabytes of data. Organizations are adopting Hadoop because it is an open source
software and can run on commodity hardware (your personal computer). The initial cost savings
are dramatic as commodity hardware is very cheap. As the organizational data increases, you
need to add more & more commodity hardware on the fly to store it and hence, Hadoop proves
to be economical. Additionally, Hadoop has a robust Apache community behind it that continues
to contribute to its advancement.
What is Hadoop?
Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is licensed under the
Apache v2 license.
Hadoop was developed based on the paper written by Google on its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is among the top-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella.
Hadoop-as-a-Solution
HDFS provides a distributed way to store Big Data. Your data is stored in blocks across DataNodes, and you can specify the block size. Suppose you have 512 MB of data and you have configured HDFS to create 128 MB data blocks. HDFS will then divide the data into 4 blocks (512/128 = 4) and store them across different DataNodes. While storing these data blocks in DataNodes, the blocks are replicated on different DataNodes to provide fault tolerance.
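To make the arithmetic concrete, here is a minimal Java sketch (not from these notes) that computes the number of blocks for the 512 MB / 128 MB example above and reads the relevant HDFS settings; it assumes the Hadoop client library is on the classpath, and the default values shown are only illustrative.

import org.apache.hadoop.conf.Configuration;

public class BlockCountSketch {
    public static void main(String[] args) {
        long fileSize  = 512L * 1024 * 1024;   // 512 MB of data, as in the example above
        long blockSize = 128L * 1024 * 1024;   // 128 MB block size

        // Number of HDFS blocks = ceiling(fileSize / blockSize) -> 512 / 128 = 4
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("Blocks needed: " + blocks);

        // The block size and replication factor come from the cluster configuration
        // (actual values depend on your hdfs-site.xml; 128 MB and 3 are common defaults).
        Configuration conf = new Configuration();
        System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize", "134217728"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
    }
}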
Hadoop follows horizontal scaling instead of vertical scaling. In horizontal scaling, you can add new nodes to the HDFS cluster on the fly as per requirement, instead of increasing the hardware stack in each node.
In HDFS you can store all kinds of data, whether structured, semi-structured or unstructured. There is no pre-dumping schema validation. HDFS also follows a write-once, read-many model: you write any kind of data once and can read it multiple times to find insights.
Processing such huge volumes of data by moving it all to a central processing unit becomes a bottleneck. In order to solve this, we move the processing unit to the data instead of moving the data to the processing unit.
So, what does it mean to move the computation unit to the data?
It means that instead of moving data from different nodes to a single master node for
processing, the processing logic is sent to the nodes where the data is stored, so that each
node can process a part of data in parallel. Finally, all of the intermediary output produced by
each node is merged together and the final response is sent back to the client.
Features of Hadoop
Reliability
When machines are working as a single unit, if one of the machines fails, another machine will
take over the responsibility and work in a reliable and fault-tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.
Economical
Hadoop uses commodity hardware (like your PC, laptop). For example, in a small Hadoop
cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB
hard disk and Xeon processors.
Had we used hardware-based RAID with Oracle for the same purpose, we would have ended up spending at least five times more. So, the cost of ownership of a Hadoop-based project is minimized. It is easier to maintain a Hadoop environment, and it is economical as well. Also, Hadoop is open-source software and hence there is no licensing cost.
Scalability
Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if
you are installing Hadoop on a cloud, you don’t need to worry about the scalability factor
because you can go ahead and procure more hardware and expand your set up within minutes
whenever required.
Flexibility
Hadoop is very flexible in terms of its ability to deal with all kinds of data. Data can be of any kind and Hadoop can store and process it all, whether it is structured, semi-structured or unstructured.
These 4 characteristics make Hadoop a front-runner as a solution to Big Data challenges. Let
us understand what the core components of Hadoop are.
HDFS
The main components of HDFS are the NameNode and the DataNode. Let us talk about the
roles of these two components in detail.
NameNode
• It is the master daemon that maintains and manages the DataNodes (slave nodes)
• It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
• It records each and every change that takes place to the file system metadata
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
• It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive
• It keeps a record of all the blocks in HDFS and the DataNode in which they are stored
• It has high availability and federation features
DataNode
• It is the slave daemon which runs on each slave machine
• The actual data is stored on the DataNodes
• It serves read and write requests from the clients
• It sends Heartbeats and block reports to the NameNode periodically
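As an illustration of the kind of metadata the NameNode serves, here is a hedged Java sketch using the HDFS client API; the hdfs:// URI and the /user/demo path are placeholders, and it assumes the Hadoop client libraries and a reachable cluster.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());

        // File-level metadata (size, permissions, replication) is answered by the NameNode.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s  size=%d  replication=%d  perms=%s%n",
                    status.getPath(), status.getLen(),
                    status.getReplication(), status.getPermission());

            // Block-level metadata: which DataNodes hold each block of the file.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("  block at offset " + loc.getOffset()
                        + " on " + String.join(", ", loc.getHosts()));
            }
        }
    }
}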
YARN
ResourceManager
• It is a cluster-level (one for each cluster) component and runs on the master machine
• It manages resources and schedules applications running on top of YARN
• It has two components: Scheduler & ApplicationManager
• The Scheduler is responsible for allocating resources to the various running applications
• The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application
• It keeps track of the heartbeats from the Node Manager
NodeManager
• It is a node-level component (one on each node) and runs on each slave machine
• It is responsible for managing containers and monitoring resource utilization in each container
• It also keeps track of node health and log management
• It continuously communicates with the ResourceManager to remain up-to-date
HDFS
• Hadoop Distributed File System is the core component, or you can say, the backbone of the Hadoop Ecosystem.
• HDFS is the one which makes it possible to store different types of large data sets (i.e. structured, semi-structured and unstructured data).
• HDFS creates a level of abstraction over the resources, from where we can see the whole
HDFS as a single unit.
• It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
• HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn't store the actual data. It contains metadata, just like a log file or, you can say, a table of contents. Therefore, it requires less storage but high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it requires
more storage resources. These DataNodes are commodity hardware (like your
laptops and desktops) in the distributed environment. That’s the reason, why
Hadoop solutions are very cost effective.
3. You always communicate with the NameNode while writing data. It then responds to the client with the DataNodes on which the data is to be stored and replicated.
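A minimal sketch of this write-once/read-many flow through the HDFS Java API is shown below; the NameNode URI and file path are placeholders (not values from these notes), and it assumes the Hadoop client libraries and a running cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client first contacts the NameNode (via this URI) for metadata;
        // the actual bytes are then streamed to/from the DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path path = new Path("/user/demo/sample.txt");

        // Write once...
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // ...read many times.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}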
YARN
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing
activities by allocating resources and scheduling tasks.
• YARN has two components, i.e. Schedulers and ApplicationsManager.
1. Schedulers: Based on your application resource requirements, Schedulers perform scheduling algorithms and allocate the resources.
2. ApplicationsManager: The ApplicationsManager accepts the job submission, negotiates the container (i.e. the DataNode environment where the process executes) for executing the application-specific ApplicationMaster and monitors its progress. ApplicationMasters are the daemons which reside on the DataNodes and communicate with the containers for the execution of tasks on each DataNode.
MAPREDUCE
It is the core component of processing in the Hadoop Ecosystem as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
APACHE HIVE
• Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes them feel
at home while working in a Hadoop Ecosystem.
• Basically, HIVE is a data warehousing component which performs reading, writing and
managing large data sets in a distributed environment using SQL-like interface.
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: the Hive Command Line and the JDBC/ODBC driver.
• The Hive Command Line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish a connection from data storage.
• Secondly, Hive is highly scalable, as it can serve both purposes: large data set processing (i.e. batch query processing) and real-time processing (i.e. interactive query processing).
• It supports all primitive data types of SQL.
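To illustrate the JDBC route mentioned above, here is a hedged Java sketch that runs an HQL query through HiveServer2; the host, database, table and credentials are placeholders, and it assumes the hive-jdbc driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Explicitly register the Hive JDBC driver (older driver versions need this).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String url = "jdbc:hive2://hiveserver2-host:10000/default";  // placeholder host
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement();
             // HQL reads very much like SQL
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}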
APACHE MAHOUT
Mahout provides an environment for creating machine learning applications which are scalable.
It performs collaborative filtering, clustering and classification.
APACHE SPARK
• Apache Spark is a framework for real-time data analytics in a distributed computing environment.
• Spark is written in Scala and was originally developed at the University of California, Berkeley.
• It executes in-memory computations to increase the speed of data processing over MapReduce.
• It is up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations. Therefore, it requires higher processing power than MapReduce.
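As a small illustration of Spark's in-memory model (a sketch, not an example from these notes), here is a snippet using Spark's Java API; it assumes the spark-core dependency and runs in local mode.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkInMemorySketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("in-memory-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // cache() keeps the dataset in memory, so repeated actions avoid
            // recomputation - the source of Spark's speed-up over MapReduce.
            JavaRDD<Integer> squares = numbers.map(x -> x * x).cache();

            System.out.println("count = " + squares.count());
            System.out.println("sum   = " + squares.reduce(Integer::sum));
        }
    }
}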
APACHE HBASE
• HBase is an open-source, non-relational, distributed database that runs on top of HDFS and is modeled after Google's BigTable.
• It provides real-time read/write access to large data sets.
APACHE OOZIE
Oozie acts as a scheduler for Hadoop jobs. It schedules Hadoop jobs and binds them together as one logical unit of work.
APACHE FLUME
Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
APACHE SQOOP
Sqoop can import structured data from an RDBMS or enterprise data warehouse into HDFS, and export it back, i.e. vice versa.
Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader processes each input record and generates the respective key-value pair. Hadoop's Mapper stores this intermediate data on the local disk.
• Input Split
It is the logical representation of data. It represents a block of work that contains a single map
task in the MapReduce Program.
• RecordReader
It interacts with the Input Split and converts the obtained data into key-value pairs.
Reducer Class
The intermediate output generated from the mapper is fed to the reducer, which processes it and generates the final output, which is then saved in HDFS.
Driver Class
The major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names. A complete word-count example with the mapper, reducer and driver is sketched after the walkthrough below.
A Word Count Example of MapReduce
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
• First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to
each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is
that every word, in itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the individual words and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
• After the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each reducer counts the values which are present in that list of values. For example, the reducer gets the list of values [1,1] for the key Bear. It then counts the number of ones in the list and gives the final output as Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file.
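The walkthrough above corresponds to the classic Hadoop word-count program. Below is a compact Java sketch of the mapper, reducer and driver using the Hadoop MapReduce API; it assumes the hadoop-client dependency, and the input/output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: tokenize each line and emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), " ,");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // e.g. (Bear, 1)
            }
        }
    }

    // Reducer: after shuffle & sort, each key arrives with its list of 1s;
    // summing them gives the final count, e.g. (Bear, [1,1]) -> (Bear, 2).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the Mapper, Reducer, data types and input/output paths together.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}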
1. Parallel Processing:
In MapReduce, we are dividing the job among multiple nodes and each node works with a part
of the job simultaneously. So, MapReduce is based on Divide and Conquer paradigm which
helps us to process the data using different machines. As the data is processed by multiple
machines instead of a single machine in parallel, the time taken to process the data gets
reduced by a tremendous amount.
2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data in
the MapReduce Framework. In the traditional system, we used to bring data to the processing
unit and process it. But, as the data grew and became very huge, bringing this huge amount of
data to the processing unit posed the following issues:
• Moving huge data to the processing unit is costly and deteriorates the network performance.
• Processing takes time as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. The data is distributed among multiple nodes, where each node processes the part of the data residing on it. This gives us the following advantages: it is cost-effective, it reduces network congestion, and processing happens in parallel, so it is faster.
Components of YARN
You can consider YARN as the brain of your Hadoop Ecosystem. Its main components are described below.
Scheduler
• The scheduler is responsible for allocating resources to the various running applications
subject to constraints of capacities, queues etc.
• It is called a pure scheduler in ResourceManager, which means that it does not perform
any monitoring or tracking of status for the applications.
• If there is an application failure or hardware failure, the Scheduler does not guarantee to
restart the failed tasks.
• It performs scheduling based on the resource requirements of the applications.
• It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various applications. There are two such plug-ins: the Capacity Scheduler and the Fair Scheduler, which are currently used as Schedulers in the ResourceManager.
Application Manager
• It is responsible for accepting job submissions.
• It negotiates the first container for executing the application-specific ApplicationMaster.
• It provides the service for restarting the ApplicationMaster container on failure.
Node Manager
• It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow
on the given node.
• It registers with the Resource Manager and sends heartbeats with the health status of the
node.
• Its primary goal is to manage application containers assigned to it by the resource
manager.
• It keeps up-to-date with the Resource Manager.
• The Application Master requests the assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run. The Node Manager creates the requested container process and starts it.
• It monitors resource usage (memory, CPU) of individual containers.
• It performs log management.
• It also kills the container as directed by the Resource Manager.
Application Master
• An application is a single job submitted to the framework. Each such application has a unique Application Master associated with it, which is a framework-specific entity.
• It is the process that coordinates an application's execution in the cluster and also manages faults.
• Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute and monitor the component tasks.
• It is responsible for negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress.
• Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
The fourth component is:
Container
• It is a collection of physical resources such as RAM, CPU cores, and disks on a single
node.
• YARN containers are managed by a Container Launch Context (CLC), which is the container life-cycle record. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services and the command necessary to create the process.
• It grants rights to an application to use a specific amount of resources (memory, CPU, etc.) on a specific host.
Application Submission in YARN
[Figure: the YARN application submission flow – the client gets an Application ID, resources are allocated, a container is launched and the application is executed.]