Big Data Notes
The most common myth associated with Big Data is that it is just about the size or volume of data. In fact, it is not only about the "big" amounts of data being collected. Big Data refers to the large amounts of data pouring in from various data sources in different formats. Huge amounts of data were stored in databases earlier too, but because of the varied nature of this data, traditional relational database systems are incapable of handling it. Big Data is much more than a collection of datasets with different formats; it is an important asset that can be used to obtain innumerable benefits.
Big Data is commonly characterized by five V's, but as the data keeps evolving, so will the V's, and more may develop gradually over time:
1. VOLUME
Volume refers to the 'amount of data', which is growing day by day at a very fast pace.
The size of data generated by humans, machines and their interactions on social media
itself is massive. Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of 300 times from 2005.
2. VELOCITY
Velocity is defined as the pace at which different sources generate the data every day.
This flow of data is massive and continuous. There are 1.03 billion Daily Active Users
(Facebook DAU) on Mobile as of now, which is an increase of 22% year-over-year. This
shows how fast the number of users is growing on social media and how fast data is
getting generated daily. If you are able to handle the velocity, you will be able to generate
insights and take decisions based on real-time data.
3. VARIETY
As there are many sources which are contributing to Big Data, the type of data they are
generating is different. It can be structured, semi-structured or unstructured. Hence, there
is a variety of data which is getting generated every day. Earlier, we used to get data from Excel sheets and databases; now data comes in the form of images, audio, video, sensor data, etc. Hence, this variety of mostly unstructured data creates problems in capturing, storing, mining and analyzing the data.
4. VERACITY
Veracity refers to data in doubt, i.e. the uncertainty of available data due to inconsistency and incompleteness. For example, in a table of measurements, a few values may be missing and a few may be hard to accept, such as a minimum value of 15000 in a row where that is not possible. This inconsistency and incompleteness is Veracity. Available data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control, for example Twitter posts with hashtags, abbreviations, typos and colloquial speech. The volume is often the reason behind the lack of quality and accuracy in the data.
• Due to uncertainty of data, 1 in 3 business leaders don’t trust the information they use to
make decisions.
• It was found in a survey that 27% of respondents were unsure of how much of their data
was inaccurate.
• Poor data quality costs the US economy around $3.1 trillion a year.
5. VALUE
After discussing Volume, Velocity, Variety and Veracity, there is another V that should be taken into account when looking at Big Data: Value. It is all well and good to have access to big data, but unless we can turn it into value it is useless. Turning it into value means asking: is it adding to the benefits of the organization analyzing the big data? Is the organization working on Big Data achieving a high ROI (Return on Investment)? Unless working on Big Data adds to its profits, it is useless.
Big Data can be found in three forms:
• Structured
• Semi-Structured
• Unstructured
1. Structured
The data that can be stored and processed in a fixed format is called Structured Data. Data stored in a relational database management system (RDBMS) is one example of 'structured' data. It is easy to process structured data as it has a fixed schema. Structured Query Language (SQL) is often used to manage such data.
2. Semi-Structured
Semi-Structured Data is a type of data which does not have the formal structure of a data model, i.e. a table definition as in a relational DBMS, but nevertheless has some organizational properties, like tags and other markers to separate semantic elements, that make it easier to analyze. XML files and JSON documents are examples of semi-structured data.
3. Unstructured
Data which has an unknown form, cannot be stored in an RDBMS and cannot be analyzed unless it is transformed into a structured format is called unstructured data. Text files and multimedia content like images, audio and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that about 80 percent of the data in an organization is unstructured.
Applications of Big Data
• Retail: Retail has some of the tightest margins and is one of the greatest beneficiaries of big data. The beauty of using big data in retail is to understand consumer behavior. Amazon's recommendation engine provides suggestions based on the browsing history of the consumer.
• Traffic control: Traffic congestion is a major challenge for many cities globally. Effective use of data
and sensors will be key to managing traffic better as cities become increasingly densely
populated.
• Manufacturing: Analyzing big data in the manufacturing industry can reduce component defects,
improve product quality, increase efficiency, and save time and money.
• Search Quality: Every time we extract information from Google, we simultaneously generate data for it. Google stores this data and uses it to improve its search quality.
Challenges with Big Data
1. Data Quality – The problem here is the 4th V, i.e. Veracity. The data here is very messy, inconsistent and incomplete. Dirty data costs companies in the United States $600 billion every year.
2. Discovery – Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data using extremely powerful algorithms to find patterns and insights is very difficult.
3. Storage – The more data an organization has, the more complex the problems of managing it can
become. The question that arises here is “Where to store it?”. We need a storage system which
can easily scale up or down on-demand.
4. Analytics – In the case of Big Data, most of the time we are unaware of the kind of data we are
dealing with, so analyzing that data is even more difficult.
5. Security – Since the data is huge in size, keeping it secure is another challenge. It includes user
authentication, restricting access based on a user, recording data access histories, proper use of
data encryption etc.
6. Lack of Talent – There are a lot of Big Data projects in major organizations, but building a sophisticated team of developers, data scientists and analysts who also have a sufficient amount of domain knowledge is still a challenge.
Hadoop, with its distributed processing, handles large volumes of structured and unstructured
data more efficiently than the traditional enterprise data warehouse. Hadoop makes it possible to
run applications on systems with thousands of commodity hardware nodes, and to handle
thousands of terabytes of data. Organizations are adopting Hadoop because it is an open source
software and can run on commodity hardware (your personal computer). The initial cost savings
are dramatic as commodity hardware is very cheap. As the organizational data increases, you
need to add more & more commodity hardware on the fly to store it and hence, Hadoop proves
to be economical. Additionally, Hadoop has a robust Apache community behind it that continues
to contribute to its advancement.
What is Hadoop?
Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is licensed under the
Apache v2 license.
Hadoop was developed based on the paper written by Google on its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is among the top-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella.
Hadoop-as-a-Solution
HDFS provides a distributed way to store Big Data. Your data is stored in blocks across DataNodes, and you can specify the block size. Suppose you have 512 MB of data and you have configured HDFS to create 128 MB data blocks. HDFS will then divide the data into 4 blocks (512/128 = 4) and store them across different DataNodes. While storing these data blocks in DataNodes, the blocks are replicated on different DataNodes to provide fault tolerance.
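To make the arithmetic concrete, here is a minimal Java sketch (not from these notes) that computes the number of blocks for the 512 MB / 128 MB example above and reads the relevant HDFS settings; it assumes the Hadoop client library is on the classpath, and the default values shown are only illustrative.

import org.apache.hadoop.conf.Configuration;

public class BlockCountSketch {
    public static void main(String[] args) {
        long fileSize  = 512L * 1024 * 1024;   // 512 MB of data, as in the example above
        long blockSize = 128L * 1024 * 1024;   // 128 MB block size

        // Number of HDFS blocks = ceiling(fileSize / blockSize) -> 512 / 128 = 4
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("Blocks needed: " + blocks);

        // The block size and replication factor come from the cluster configuration
        // (actual values depend on your hdfs-site.xml; 128 MB and 3 are common defaults).
        Configuration conf = new Configuration();
        System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize", "134217728"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
    }
}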
Hadoop follows horizontal scaling instead of vertical scaling. In horizontal scaling, you can add new nodes to the HDFS cluster on the fly as per requirement, instead of increasing the hardware stack in each node.
In HDFS you can store all kinds of data, whether structured, semi-structured or unstructured. There is no pre-dumping schema validation. HDFS also follows a write-once, read-many model: you write any kind of data once and can read it multiple times to find insights.
Processing such huge volumes of data by moving it all to a central processing unit becomes a bottleneck. In order to solve this, we move the processing unit to the data instead of moving the data to the processing unit.
So, what does it mean to move the computation unit to the data?
It means that instead of moving data from different nodes to a single master node for
processing, the processing logic is sent to the nodes where the data is stored, so that each
node can process a part of data in parallel. Finally, all of the intermediary output produced by
each node is merged together and the final response is sent back to the client.
Features of Hadoop
Reliability
When machines are working as a single unit, if one of the machines fails, another machine will
take over the responsibility and work in a reliable and fault-tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.
Economical
Hadoop uses commodity hardware (like your PC, laptop). For example, in a small Hadoop
cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB
hard disk and Xeon processors.
Had we used hardware-based RAID with Oracle for the same purpose, we would have ended up spending at least five times more. So, the cost of ownership of a Hadoop-based project is minimized. It is easier to maintain a Hadoop environment, and it is economical as well. Also, Hadoop is open-source software and hence there is no licensing cost.
Scalability
Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if
you are installing Hadoop on a cloud, you don’t need to worry about the scalability factor
because you can go ahead and procure more hardware and expand your set up within minutes
whenever required.
Flexibility
Hadoop is very flexible in terms of its ability to deal with all kinds of data. Data can be of any kind and Hadoop can store and process it all, whether it is structured, semi-structured or unstructured.
These 4 characteristics make Hadoop a front-runner as a solution to Big Data challenges. Let
us understand what the core components of Hadoop are.
HDFS
The main components of HDFS are the NameNode and the DataNode. Let us talk about the
roles of these two components in detail.
NameNode
• It is the master daemon that maintains and manages the DataNodes (slave nodes)
• It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
• It records each and every change that takes place to the file system metadata
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
• It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive
• It keeps a record of all the blocks in HDFS and the DataNode in which they are stored
• It has high availability and federation features
DataNode
• It is the slave daemon which runs on each slave machine
• The actual data is stored on the DataNodes
• It serves read and write requests from the clients
• It sends Heartbeats and block reports to the NameNode periodically
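As an illustration of the kind of metadata the NameNode serves, here is a hedged Java sketch using the HDFS client API; the hdfs:// URI and the /user/demo path are placeholders, and it assumes the Hadoop client libraries and a reachable cluster.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());

        // File-level metadata (size, permissions, replication) is answered by the NameNode.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s  size=%d  replication=%d  perms=%s%n",
                    status.getPath(), status.getLen(),
                    status.getReplication(), status.getPermission());

            // Block-level metadata: which DataNodes hold each block of the file.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("  block at offset " + loc.getOffset()
                        + " on " + String.join(", ", loc.getHosts()));
            }
        }
    }
}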
YARN
ResourceManager
• It is a cluster-level (one for each cluster) component and runs on the master machine
• It manages resources and schedules applications running on top of YARN
• It has two components: Scheduler & ApplicationManager
• The Scheduler is responsible for allocating resources to the various running applications
• The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application
• It keeps track of the heartbeats from the Node Manager
NodeManager
• It is a node-level component (one on each node) and runs on each slave machine
• It is responsible for managing containers and monitoring resource utilization in each container
• It also keeps track of node health and log management
• It continuously communicates with the ResourceManager to remain up-to-date
HDFS
• Hadoop Distributed File System is the core component, or you can say, the backbone of the Hadoop Ecosystem.
• HDFS is the one which makes it possible to store different types of large data sets (i.e. structured, semi-structured and unstructured data).
• HDFS creates a level of abstraction over the resources, from where we can see the whole
HDFS as a single unit.
• It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
• HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn't store the actual data. It contains metadata, just like a log file or, you can say, a table of contents. Therefore, it requires less storage but high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it requires
more storage resources. These DataNodes are commodity hardware (like your
laptops and desktops) in the distributed environment. That’s the reason, why
Hadoop solutions are very cost effective.
3. You always communicate with the NameNode while writing data. It then responds to the client with the DataNodes on which the data is to be stored and replicated.
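A minimal sketch of this write-once/read-many flow through the HDFS Java API is shown below; the NameNode URI and file path are placeholders (not values from these notes), and it assumes the Hadoop client libraries and a running cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client first contacts the NameNode (via this URI) for metadata;
        // the actual bytes are then streamed to/from the DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path path = new Path("/user/demo/sample.txt");

        // Write once...
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // ...read many times.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}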
YARN
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing
activities by allocating resources and scheduling tasks.
• YARN has two components, i.e. Schedulers and ApplicationsManager.
1. Schedulers: Based on your application resource requirements, Schedulers perform scheduling algorithms and allocate the resources.
2. ApplicationsManager: The ApplicationsManager accepts the job submission, negotiates the container (i.e. the DataNode environment where the process executes) for executing the application-specific ApplicationMaster and monitors its progress. ApplicationMasters are the daemons which reside on the DataNodes and communicate with the containers for the execution of tasks on each DataNode.
MAPREDUCE
It is the core component of processing in the Hadoop Ecosystem as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
APACHE HIVE
• Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes them feel
at home while working in a Hadoop Ecosystem.
• Basically, HIVE is a data warehousing component which performs reading, writing and
managing large data sets in a distributed environment using SQL-like interface.
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: the Hive Command Line and the JDBC/ODBC driver.
• The Hive Command Line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish a connection from data storage.
• Secondly, Hive is highly scalable, as it can serve both purposes: large data set processing (i.e. batch query processing) and real-time processing (i.e. interactive query processing).
• It supports all primitive data types of SQL.
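To illustrate the JDBC route mentioned above, here is a hedged Java sketch that runs an HQL query through HiveServer2; the host, database, table and credentials are placeholders, and it assumes the hive-jdbc driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Explicitly register the Hive JDBC driver (older driver versions need this).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String url = "jdbc:hive2://hiveserver2-host:10000/default";  // placeholder host
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement();
             // HQL reads very much like SQL
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}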
APACHE MAHOUT
Mahout provides an environment for creating machine learning applications which are scalable.
It performs collaborative filtering, clustering and classification.
APACHE SPARK
• Apache Spark is a framework for real-time data analytics in a distributed computing environment.
• Spark is written in Scala and was originally developed at the University of California, Berkeley.
• It executes in-memory computations to increase the speed of data processing over MapReduce.
• It is up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations. Therefore, it requires higher processing power than MapReduce.
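As a small illustration of Spark's in-memory model (a sketch, not an example from these notes), here is a snippet using Spark's Java API; it assumes the spark-core dependency and runs in local mode.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkInMemorySketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("in-memory-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // cache() keeps the dataset in memory, so repeated actions avoid
            // recomputation - the source of Spark's speed-up over MapReduce.
            JavaRDD<Integer> squares = numbers.map(x -> x * x).cache();

            System.out.println("count = " + squares.count());
            System.out.println("sum   = " + squares.reduce(Integer::sum));
        }
    }
}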
APACHE HBASE
• HBase is an open-source, non-relational, distributed database that runs on top of HDFS and is modeled after Google's BigTable.
• It provides real-time read/write access to large data sets.
APACHE OOZIE
Oozie acts as a scheduler for Hadoop jobs. It schedules Hadoop jobs and binds them together as one logical unit of work.
APACHE FLUME
Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
APACHE SQOOP
Sqoop can import structured data from an RDBMS or enterprise data warehouse into HDFS, and export it back, i.e. vice versa.
Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader processes each input record and generates the respective key-value pair. Hadoop's Mapper stores this intermediate data on the local disk.
• Input Split
It is the logical representation of data. It represents a block of work that contains a single map
task in the MapReduce Program.
• RecordReader
It interacts with the Input Split and converts the obtained data into key-value pairs.
Reducer Class
The intermediate output generated from the mapper is fed to the reducer, which processes it and generates the final output, which is then saved in HDFS.
Driver Class
The major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names. A complete word-count example with the mapper, reducer and driver is sketched after the walkthrough below.
A Word Count Example of MapReduce
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
• First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to
each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is
that every word, in itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the individual words and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
• After the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each reducer counts the values which are present in that list of values. For example, the reducer gets the list of values [1,1] for the key Bear. It then counts the number of ones in the list and gives the final output as Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file.
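The walkthrough above corresponds to the classic Hadoop word-count program. Below is a compact Java sketch of the mapper, reducer and driver using the Hadoop MapReduce API; it assumes the hadoop-client dependency, and the input/output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: tokenize each line and emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), " ,");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // e.g. (Bear, 1)
            }
        }
    }

    // Reducer: after shuffle & sort, each key arrives with its list of 1s;
    // summing them gives the final count, e.g. (Bear, [1,1]) -> (Bear, 2).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the Mapper, Reducer, data types and input/output paths together.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}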
1. Parallel Processing:
In MapReduce, we are dividing the job among multiple nodes and each node works with a part
of the job simultaneously. So, MapReduce is based on Divide and Conquer paradigm which
helps us to process the data using different machines. As the data is processed by multiple
machines instead of a single machine in parallel, the time taken to process the data gets
reduced by a tremendous amount.
2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data in
the MapReduce Framework. In the traditional system, we used to bring data to the processing
unit and process it. But, as the data grew and became very huge, bringing this huge amount of
data to the processing unit posed the following issues:
• Moving huge data to the processing unit is costly and deteriorates the network performance.
• Processing takes time as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. The data is distributed among multiple nodes, where each node processes the part of the data residing on it. This gives us the following advantages: it is cost-effective, it reduces network congestion, and processing happens in parallel, so it is faster.
Components of YARN
You can consider YARN as the brain of your Hadoop Ecosystem. Its main components are described below.
Scheduler
• The scheduler is responsible for allocating resources to the various running applications
subject to constraints of capacities, queues etc.
• It is called a pure scheduler in ResourceManager, which means that it does not perform
any monitoring or tracking of status for the applications.
• If there is an application failure or hardware failure, the Scheduler does not guarantee to
restart the failed tasks.
• It performs scheduling based on the resource requirements of the applications.
• It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various applications. There are two such plug-ins: the Capacity Scheduler and the Fair Scheduler, which are currently used as Schedulers in the ResourceManager.
Application Manager
• It is responsible for accepting job submissions.
• It negotiates the first container for executing the application-specific ApplicationMaster.
• It provides the service for restarting the ApplicationMaster container on failure.
Node Manager
• It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow
on the given node.
• It registers with the Resource Manager and sends heartbeats with the health status of the
node.
• Its primary goal is to manage application containers assigned to it by the resource
manager.
• It keeps up-to-date with the Resource Manager.
• The Application Master requests the assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run. The Node Manager creates the requested container process and starts it.
• It monitors resource usage (memory, CPU) of individual containers.
• It performs log management.
• It also kills the container as directed by the Resource Manager.
Application Master
• An application is a single job submitted to the framework. Each such application has a unique Application Master associated with it, which is a framework-specific entity.
• It is the process that coordinates an application's execution in the cluster and also manages faults.
• Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute and monitor the component tasks.
• It is responsible for negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress.
• Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
The fourth component is:
Container
• It is a collection of physical resources such as RAM, CPU cores, and disks on a single
node.
• YARN containers are managed by a Container Launch Context (CLC), which is the container life-cycle record. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services and the command necessary to create the process.
• It grants rights to an application to use a specific amount of resources (memory, CPU, etc.) on a specific host.
Application Submission in YARN
[Figure: the YARN application submission flow – the client gets an Application ID, resources are allocated, a container is launched and the application is executed.]