
Module-4

Big Data

What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which may be

stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical

recording media.

What is Big Data?

Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, big data is data, but of a huge size.

High volumes of data sets that traditional computing tools cannot process are being collected daily. We refer to these high volumes of data as big data.

Businesses nowadays rely heavily on big data to gain better knowledge about their customers. The process of extracting meaningful insights from such raw big data is big data analytics.

What is an Example of Big Data?

Following are some of the Big Data examples:

The New York Stock Exchange is an example of Big Data that generates about one terabyte of new trade

data per day.

Social Media

Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.

Types Of Big Data


Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data. Over time, talent in computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.

Examples Of Structured Data

An ‘Employee’ table in a database is an example of Structured Data

Employee_ID Employee_Name Gender Department Salary_In_lacs

2365 Rajesh Kulkarni Male Finance 650000

3398 Pratibha Joshi Female Admin 650000

7465 Shushil Roy Male Admin 500000

7500 Shubhojit Das Male Finance 500000


7699 Priya Sane Female Finance 550000

Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value out of it, since this data is in its raw, unstructured form.

Examples Of Un-structured Data

The output returned by ‘Google Search’

Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data appears structured in form, but it is actually not defined by, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
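To make the idea concrete, here is a small illustrative Python sketch (not part of the original notes) that parses records like the ones above with the standard xml.etree.ElementTree module; the <people> wrapper element is added here only so the snippet is well-formed XML.

import xml.etree.ElementTree as ET

# Semi-structured personal data: each record carries its own schema in its tags.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # findtext() pulls the value of a child tag, giving each record a table-like shape.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))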

Characteristics Of Big Data


Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity
 Variability

(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining value out of data. Also, whether particular data can actually be considered Big Data or not depends upon the volume of data. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is

generated and processed to meet the demands, determines real potential in the data.

Big Data Velocity deals with the speed at which data flows in from sources like business processes,

application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is

massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times,
thus hampering the process of being able to handle and manage the data effectively.

Introduction to principles and practice of systems that improve performance

through experience

Principle 1. Design based on your data volume:

Before you start to build any data processes, you need to know the data volume you are working with: what the data volume will be to start with, and what it will grow into. If the data size is always small, the design and implementation can be much more straightforward and faster. If the data starts out large, or starts small but will grow fast, the design needs to take performance optimization into consideration. Applications and processes that perform well for big data usually incur too much overhead for small data and slow the process down. On the other hand, an application designed for small data would take too long to complete for big data. In other words, an application or process should be designed differently for small data vs. big data. The reasons are listed below in detail:

1. Because it is time-consuming to process large datasets from end to end, more breakdowns and checkpoints are required in the middle. The goal is two-fold: first, to allow one to check the intermediate results or raise an exception earlier in the process, before the whole process ends; second, in the case that a job fails, to allow restarting from the last successful checkpoint, avoiding a restart from the beginning, which is more expensive. (A small checkpoint sketch follows this list.)

2. When working with small data, the impact of any inefficiencies in the process also tends to be small, but the same inefficiencies could become a major resource issue for large data sets.

3. Parallel processing and data partitioning (see below) not only require extra design and development time to implement, but also take more resources during run time; they should therefore be skipped for small data.

4. When working with large data, performance testing should be included in the unit testing; this is usually not a
concern for small data.
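The checkpointing idea in reason 1 above can be sketched as a minimal, hypothetical Python example (the step names and the checkpoints directory are illustrative, not from the original text): each step persists its result, so a failed run can resume from the last successful step instead of restarting from the beginning.

import os
import pickle

CHECKPOINT_DIR = "checkpoints"              # illustrative location for intermediate results
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def run_step(name, func, *args):
    """Run a pipeline step, reusing its checkpoint if one already exists."""
    path = os.path.join(CHECKPOINT_DIR, name + ".pkl")
    if os.path.exists(path):                # resume: skip work already done
        with open(path, "rb") as f:
            return pickle.load(f)
    result = func(*args)                    # compute, then checkpoint the result
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Toy pipeline: each step's result can be inspected, and a re-run resumes from the last step.
raw = run_step("extract", lambda: list(range(10)))
clean = run_step("clean", lambda data: [x for x in data if x % 2 == 0], raw)
print(clean)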

Principle 2: Reduce data volume earlier in the process.

When working with large data sets, reducing the data size early in the process is always the most effective way to achieve good performance. There is no silver bullet for the big data issue, no matter how many resources and how much hardware you put in. So always try to reduce the data size before starting the real work. There are many ways to achieve this, depending on the use case. Some common techniques, among many others, are listed below; a small sketch of techniques 2 and 3 follows the list:

1. Do not take up storage (e.g., space in a fixed-length field) when a field has a NULL value.

2. Choose the data type economically. For example, if a number can never be negative, use an unsigned integer type rather than a signed one; if there is no decimal part, do not use float.

3. Encode text data with unique integer identifiers, because text fields take much more space and should be avoided in processing.
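A small pandas-based sketch of techniques 2 and 3 follows; the column names and values are made up for illustration, and pandas is assumed to be available.

import pandas as pd

df = pd.DataFrame({
    "country": ["India", "India", "USA", "India"],   # repeated text values
    "clicks":  [3, 0, 7, 1],                          # never negative, no decimals
})

# Technique 2: downcast the numeric column to the smallest sufficient unsigned integer type.
df["clicks"] = pd.to_numeric(df["clicks"], downcast="unsigned")

# Technique 3: encode repeated text as integer codes (a categorical/dictionary encoding).
df["country"] = df["country"].astype("category")

print(df.dtypes)
print(df.memory_usage(deep=True))   # noticeably smaller than the original object/int64 columns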

Principle 3: Partition the data properly based on processing logic

Enabling data parallelism is the most effective way of fast data processing. As the data volume grows, the number of parallel processes grows; hence, adding more hardware will scale the overall data process without the need to change the code. For data engineers, a common method is data partitioning.

There are many details regarding data partitioning techniques, which are beyond the scope of this article. Generally speaking, effective partitioning should lead to the following results (a small partitioning sketch follows the list):

1. Allow the downstream data processing steps, such as join and aggregation, to happen in the same partition. For example, partitioning by time periods is usually a good idea if the data processing logic is self-contained within a month.

2. The size of each partition should be even, in order to ensure the same amount of time taken to

process each partition.

3. As the data volume grows, the number of partitions should increase, while the processing programs and

logic stay the same.
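As a small illustration of the partitioning idea (hypothetical data, plain Python), the records below are partitioned by month so that the monthly aggregation runs independently, and in parallel if desired, within each partition.

from collections import defaultdict
from datetime import date

records = [
    {"day": date(2024, 1, 3),  "amount": 120},
    {"day": date(2024, 1, 19), "amount": 80},
    {"day": date(2024, 2, 7),  "amount": 200},
]

# Partition by (year, month); each partition can go to a separate process or worker.
partitions = defaultdict(list)
for rec in records:
    partitions[(rec["day"].year, rec["day"].month)].append(rec)

# Downstream aggregation is self-contained within a partition (a monthly total here).
for key, part in sorted(partitions.items()):
    print(key, sum(r["amount"] for r in part))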

Principle 4: Avoid unnecessary resource-expensive processing steps whenever possible

As stated in Principle 1, designing a process for big data is very different from designing for small data.

An important aspect of designing is to avoid unnecessary resource-expensive operations whenever possible.

This requires highly skilled data engineers with not just a good understanding of how the software works with

the operating system and the available hardware resources, but also comprehensive knowledge of the data

and business use cases. I only focus on the top two processes that we should avoid to make a data process

more efficient: data sorting and disk I/O. Sorting the data records in a certain order is often needed when:

1) joining with another dataset;
2) aggregation;
3) scan;
4) deduplication, among other things.

However, sorting is one of the most expensive operations, requiring memory and processors, as well as disks when the input dataset is much larger than the available memory. To get good performance, it is important to be very careful about sorting, following these principles:

1. Do not sort again if the data is already sorted in the upstream or the source system.

2. Usually, a join of two datasets requires both datasets to be sorted and then merged. When joining a large dataset with a small dataset, change the small dataset to a hash lookup. This allows one to avoid sorting the large dataset (a small sketch follows this list).

3. Sort only after the data size has been reduced (Principle 2) and within a partition (Principle 3).
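A minimal sketch of the hash-lookup join from point 2 above (the datasets are made-up examples): the small dataset is loaded into a dictionary, so the large dataset is joined in a single pass without sorting either side.

small = [(1, "Finance"), (2, "Admin")]                  # small lookup dataset
large = [(101, 1), (102, 2), (103, 1), (104, 2)]         # large dataset: (employee_id, dept_id)

dept_by_id = dict(small)                                 # build the hash lookup once

# Join the large dataset against the lookup; no sort of the large dataset is needed.
joined = [(emp_id, dept_by_id.get(dept_id, "UNKNOWN")) for emp_id, dept_id in large]
print(joined)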

Supervised Learning

Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output.

Unsupervised Learning

As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find hidden patterns and insights in the given data.

Difference between supervised v/s unsupervised learning

Supervised machine learning vs. unsupervised machine learning:

Supervised Learning can be used for two types of problems, i.e. regression and classification. Unsupervised Learning can be used for two types of problems, i.e. clustering and association.

In Supervised Learning, input data is provided to the model along with the output. In Unsupervised Learning, only input data is provided.

Output is predicted by the Supervised Learning model. Hidden patterns in the data can be found using the unsupervised learning model.

Labeled data is used to train supervised learning algorithms. Unlabeled data is used to train unsupervised learning algorithms.

Accurate results are produced using a supervised learning model. The accuracy of results produced is lower in unsupervised learning models.

Training the model to predict output when new data is provided is the objective of Supervised Learning. Finding useful insights and hidden patterns from an unknown dataset is the objective of unsupervised learning.

Supervised Learning includes various algorithms such as Bayesian Logic, Decision Tree, Logistic Regression, Linear Regression, Multi-class Classification, Support Vector Machine, etc. Unsupervised Learning includes various algorithms like KNN, Apriori Algorithm, and Clustering.

To assess whether the right output is being predicted, direct feedback is accepted by the Supervised Learning model. No feedback is taken by the unsupervised learning model.

In Supervised Learning, for the right prediction of output, the model has to be trained for each data point; hence Supervised Learning does not have a close resemblance to Artificial Intelligence. Unsupervised Learning has more resemblance to Artificial Intelligence, as it keeps learning new things with more experience.

The number of classes is known in Supervised Learning. The number of classes is not known in Unsupervised Learning.

In scenarios where one is aware of both output and input data, supervised learning can be used. In scenarios where one is not aware of the output data but only of the input data, Unsupervised Learning can be used.
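A minimal scikit-learn sketch (toy data; the library and parameters are illustrative assumptions) contrasts the two settings: the supervised model learns from labels, while the unsupervised model groups the same inputs without any labels.

from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]]   # input data
y = [0, 0, 0, 1, 1, 1]                            # labels, used only by the supervised model

# Supervised: train on labelled data, then predict the output for new inputs.
clf = LogisticRegression().fit(X, y)
print("supervised prediction:", clf.predict([[1.1], [5.1]]))

# Unsupervised: no labels; the model itself finds the hidden structure (two clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print("unsupervised clusters:", km.fit_predict(X))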

Big Data Problems

Lack of Understanding
High Cost of Data Solutions
Too Many Choices
Complex Systems for Managing Data
Security Gaps
Low Quality and Inaccurate Data
Keeping Up with Growth in Data
Accessibility
Lack of Skilled Workers
Data Integration
Processing Large Data Sets
1. Lack of Understanding
Companies can leverage data to boost performance in many areas. Some of the best use cases
for data are to: decrease expenses, create innovation, launch new products, grow the bottom line,
and increase efficiency, to name a few. Despite the benefits, companies have been slow to adopt
data technology or put a plan in place for how to create a data-centric culture. In fact, according
to a Gartner study, out of 196 companies surveyed, 91% say they have yet to reach a
“transformational” level of maturity in their data and analytics.
2. High Cost of Data Solutions
After understanding how your business will benefit most from implementing data solutions,
you’re likely to find that buying and maintaining the necessary components can be expensive.

Along with hardware like servers and storage, and the software itself, there also comes the cost of human resources and time.
3. Too Many Choices
According to psychologist Barry Schwartz, less really can be more. Coined as the “paradox of
choice,” Schwartz explains how option overload can cause inaction on behalf of a buyer.
Instead, by limiting a consumer's choices, anxiety and stress can be lessened. In the world of data and data tools, the options are almost as widespread as the data itself, so it is understandably overwhelming when deciding on the solution that's right for your business, especially when it will likely affect all departments and hopefully be a long-term strategy.
4. Complex Systems for Managing Data
Moving from a legacy data management system and integrating a new solution comes as a
challenge in itself. Furthermore, with data coming from multiple sources, and IT teams creating
their own data while managing data, systems can become complex quickly.
5. Security Gaps
The importance of data security cannot go unnoticed. However, as solutions are being
implemented, it’s not always easy to focus on data security with many moving pieces. Data also
needs to be stored properly, which starts with encryption and constant backups.
6. Low Quality and Inaccurate Data
Having data is only useful when it's accurate. Low-quality data not only serves no purpose, but it also uses unnecessary storage and can harm the ability to gather insights from clean data.
7. Keeping Up with Growth in Data
Like scaling a company, growing with data is a challenge. You want to make sure that you
can scale your solution with the company’s growth so that the costs and quality don’t
decrease as it expands.
8. Accessibility
Sometimes, companies restrict data access to one person or one department. Not only does this put immense responsibility on a select few, but it also creates a lack of accessibility throughout the organization, in departments where the data could be of use to provide a positive impact. Data silos directly inhibit the benefits of collecting data in the first place.

9. Data Integration
Data integration consists of taking data from various sources and combining it to create valuable
and usable information.
10. Processing Large Data Sets

Large data sets are challenging to process and make sense of. The three V’s of big data include
volume, velocity and variety. Volume is the amount of data, velocity is the rate that new data is
created, and variety is the various formats that data exists in like images, videos and text.

Challenges of Big Data

1. Lack of proper understanding of Big Data

Companies fail in their Big Data initiatives due to insufficient understanding. Employees may not know what data is, or its storage, processing, importance, and sources. Data professionals may know what is going on, but others may not have a clear picture. For example, if employees do not understand the importance of data storage, they might not keep a backup of sensitive data. They might not use databases properly for storage. As a result, when this important data is required, it cannot be retrieved easily.
2. Data growth issues

One of the most pressing challenges of Big Data is storing all these huge sets of data properly. The amount of data being stored in data centers and databases of companies is increasing rapidly. As these data sets grow exponentially with time, they get extremely difficult to handle. Most of the data is unstructured and comes from documents, videos, audio, text files and other sources. This means that you cannot find it in databases.
3. Confusion while Big Data tool selection

Companies often get confused while selecting the best tool for Big Data analysis and storage.
Is HBase or Cassandra the best technology for data storage? Is Hadoop MapReduce good enough or will
Spark be a better option for data analytics and storage?

4. Lack of data professionals

To run these modern technologies and Big Data tools, companies need skilled data professionals.
These professionals will include data scientists, data analysts and data engineers who are experienced in
working with the tools and making sense out of huge data sets.
5. Securing data

Securing these huge sets of data is one of the daunting challenges of Big Data. Often companies are so
busy in understanding, storing and analyzing their data sets that they push data security for later stages.
But, this is not a smart move as unprotected data repositories can become breeding grounds for malicious
hackers.

Applications of big data

1. Big Data in Retail

The retail industry faces the fiercest competition of all. Retailers constantly hunt for ways that will give them a competitive edge over others. "The customer is king" rings especially true for the retail industry in particular. For retailers to thrive in this competitive world, they need to understand their customers in a better way. If they are aware of their customers' needs and how to fulfil those needs in the best possible way, then they know everything.
2. Big Data in Healthcare

Big Data and healthcare are an ideal match. It complements the healthcare industry better than anything
ever will. The amount of data the healthcare industry has to deal with is unimaginable.
3. Big Data in Education
When you ask people about the use of the data that an educational institute gathers, the majority of people will give the same answer: that the institute or the student might need it for future reference. Perhaps you had the same perception about this data, didn't you? But the fact is, this data holds enormous importance. Big Data is the key to shaping people's futures and has the power to transform the education system for the better.
4. Big Data in E-commerce

One of the greatest revolutions this generation has seen is that of E-commerce. It is now part and parcel of our routine life. Whenever we need to buy something, the first thought that comes to mind is E-commerce. And, to no one's surprise, Big Data has been the face of it. The fact that some of the biggest E-commerce companies in the world, like Amazon, Flipkart, and Alibaba, are now bound to Big Data and analytics is itself evidence of the level of popularity Big Data has gained in recent times.
5. Big Data in Media and Entertainment

Media and Entertainment industry is all about art and employing Big Data in it is a sheer piece of art.
Art and science are often considered to be the two completely contrasting domains but when employed
together, they do make a deadly duo and Big Data’s endeavors in the media industry are a perfect
example of it.
6. Big Data in Finance

The functioning of any financial organization depends heavily on its data and to safeguard that data is one
of the toughest challenges any financial firm faces. Data has been the second most important commodity for
them after money.
Even before Big Data gained popularity, the finance industry was already conquering the technical
field. In addition to it, financial firms were among the earliest adopters of Big Data and Analytics.
8. Big Data in Telecom

The telecom industry is the soul of every digital revolution that takes place around the world. With the ever-increasing popularity of smartphones, the telecom industry has been flooded with massive amounts of data.
9. Big Data in Automobile

"A business, like an automobile, has to be driven in order to get results," said B.C. Forbes. And Big Data has now taken complete control of the automobile industry and is driving it smoothly. Big Data is driving the automobile industry towards some unbelievable, never-before-seen results.

Types of big data technology:


Big Data Technologies are broadly classified into two categories.

1. Operational Big Data Technologies


2. Analytical Big Data Technologies

Operational Big Data Technologies


Operational Big Data Technologies refers to the volume of data generated every day, such as online transactions, social media, or any information from a particular company, used for analysis by software based on big data technology. It acts as raw data fed to big data analysis technology. A few examples of Operational Big Data Technologies include information on MNC management, Amazon, Flipkart, Walmart, online ticketing for movies, flights, railways and more.
Analytical Big Data Technologies
Analytical Big Data Technologies refers to the advanced adaptation of Big Data Technologies, which is rather more complicated than Operational Big Data. This category includes the real analysis of Big Data that is essential to business decisions. Some examples in this area include stock marketing, weather forecasting, time series and medical-records analysis.
Top Big Data Technologies
Top big data technologies are divided into 4 fields which are classified as follows:

• Data Storage
• Data Mining
• Data Analytics
• Data Visualization

Data Storage:

The Hadoop Framework was designed to store and process data in a Distributed Data Processing Environment on commodity hardware, using a simple programming model. It can store and analyse the data present on different machines at high speed and low cost.

 Developed by: Apache Software Foundation in the year 2011, 10th of Dec.
 Written in: JAVA
 Current stable version: Hadoop 3.11

Companies Using Hadoop:

Rainstor

RainStor is a software company that developed a Database Management System of the same name, designed to Manage and Analyse Big Data for large enterprises. It uses Deduplication Techniques to organize the process of storing large amounts of data for reference.

Developed by: RainStor Software company in the year 2004.

Works like: SQL

Current stable version: RainStor 5.5

Companies Using RainStor:

Data Mining

Presto

Presto is an open source Distributed SQL Query Engine for running Interactive Analytic
Queries against data sources of all sizes ranging from Gigabytes to Petabytes. Presto allows
querying data in Hive, Cassandra, Relational Databases and Proprietary Data Stores.

Developed by: Apache Foundation in the year 2013.

Written in: JAVA

Current stable version: Presto 0.22

Companies Using Presto:

RapidMiner

RapidMiner is a Centralized solution that features a very powerful and robust Graphical User Interface that enables users to Create, Deliver, and maintain Predictive Analytics. It allows the creation of very Advanced Workflows and provides Scripting support in several languages.

 Developed by: RapidMiner in the year 2001


 Written in: JAVA
 Current stable version: RapidMiner 9.2

Companies Using Rapid Miner:

Data Analytics

1. Kafka

Apache Kafka is a Distributed Streaming platform. A streaming platform has Three Key
Capabilities that are as follows:

Publisher

Subscriber

Consumer

This is similar to a Message Queue or an Enterprise Messaging System.

Developed by: Apache Software Foundation in the year 2011

Written in: Scala, JAVA

Current stable version: Apache Kafka 2.2.0

Companies Using Kafka:
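As an illustrative sketch of the publish/subscribe model (this assumes the third-party kafka-python package and a broker running at localhost:9092, neither of which is mentioned in the original text), a producer appends messages to a topic and a consumer reads them back:

from kafka import KafkaProducer, KafkaConsumer

# Publisher side: append a message to a topic (a durable, ordered log).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": 42, "page": "/home"}')
producer.flush()

# Subscriber/consumer side: read the stream of messages from the same topic.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,        # stop polling after 5 seconds of silence (for the demo)
)
for message in consumer:
    print(message.value)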

2. Splunk

Splunk captures, Indexes, and correlates Real-time data in a Searchable Repository from which
it can generate Graphs, Reports, Alerts, Dashboards, and Data Visualizations. It is also used for
Application Management, Security and Compliance, as well as Business and Web Analytics.

Developed by: Splunk Inc. in the year 2014, 6th May

Written in: AJAX, C++, Python, XML

Current stable version: Splunk 7.3

Companies Using Splunk:


3. KNIME

KNIME allows users to visually create Data Flows, Selectively execute some or All Analysis steps, and Inspect the Results, Models, and Interactive views. KNIME is written in Java and based on Eclipse, and makes use of its Extension mechanism to add Plugins providing Additional Functionality.

 Developed by: KNIME in the year 2008


 Written in: JAVA
 Current stable version: KNIME 3.7.2

Companies Using KNIME:


1. Spark

Spark provides In-Memory Computing capabilities to deliver Speed, a Generalized Execution Model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.

Developed by: Apache Software Foundation

Written in: Java, Scala, Python, R

Current stable version: Apache Spark 2.4.3

Companies Using Spark:
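A minimal PySpark word-count sketch (assuming the pyspark package is installed; the input lines are made up) illustrates Spark's in-memory, parallel execution model:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["Deer Bear River", "Car Car River", "Deer Car Bear"])

counts = (lines.flatMap(lambda line: line.split())    # split each line into words
               .map(lambda word: (word, 1))           # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # sum the counts per word in parallel

print(counts.collect())
spark.stop()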

2. R-Language

R is a Programming Language and free software environment for Statistical Computing and
Graphics. The R language is widely used among Statisticians and Data Miners for
developing Statistical Software and majorly in Data Analysis.

Developed by: R-Foundation in the year 2000 29th Feb

Written in: Fortran

Current stable version: R-3.6.0

Companies Using R-Language:

3. BlockChain

BlockChain, used in essential functions such as payment, escrow, and title, can also reduce fraud, increase financial privacy, speed up transactions, and internationalize markets.

BlockChain can be used for achieving the following in a Business Network Environment:

 Shared Ledger: Here we can append the Distributed System of records across a Business network.
 Smart Contract: Business terms are embedded in the transaction Database and Executed with transactions.
 Privacy: Ensuring appropriate Visibility; Transactions are Secure, Authenticated and Verifiable.

 Developed by: Bitcoin
 Written in: JavaScript, C++, Python
 Current stable version: Blockchain 4.0

Companies Using Blockchain:

Data Visualization

1. Tableau

Tableau is a Powerful and Fast-growing Data Visualization tool used in the Business Intelligence Industry. Data analysis is very fast with Tableau, and the Visualizations created are in the form of Dashboards and Worksheets.

 Developed by: TableAU 2013 May 17th


 Written in: JAVA, C++, Python, C
 Current stable version: TableAU 8.2

Companies Using Tableau:

2. Plotly

Plotly is mainly used to make creating Graphs faster and more efficient. It provides API libraries for Python, R, MATLAB, [Link], Julia, and Arduino, and a REST API. Plotly can also be used to style Interactive Graphs within Jupyter notebooks.

 Developed by: Plotly in the year 2012


 Written in: JavaScript
 Current stable version: Plotly 1.47.4
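A small sketch of the Plotly Python API (the package must be installed; the numbers are made up for illustration) showing how an interactive chart is built:

import plotly.graph_objects as go

fig = go.Figure(data=go.Bar(x=["2019", "2020", "2021"], y=[33, 59, 79]))
fig.update_layout(title="Data volume (illustrative)", yaxis_title="Zettabytes")
fig.show()   # renders the interactive figure in a browser or notebook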

3. Beam

Apache Beam provides a Portable API layer for building sophisticated Parallel-Data Processing
Pipelines that may be executed across a diversity of Execution Engines or Runners.

 Developed by: Apache Software Foundation in the year 2016, June 15th
 Written in: JAVA, Python
 Current stable version: Apache Beam 0.1.0 incubating.

Companies Using Beam:

Map Reduce paradigm

What is MapReduce in Hadoop?

MapReduce is a software framework and programming model used for processing huge
amounts of data. MapReduce program work in two phases, namely, Map and Reduce. Map
tasks deal with splitting and mapping of data while Reduce tasks shuffle and reduce the
data.

Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. The programs of MapReduce in cloud computing are parallel in nature, and are thus very useful for performing large-scale data analysis using multiple machines in the cluster.

The input to each phase is key-value pairs. In addition, every programmer needs to specify
two functions: map function and reduce function.

A Word Count Example of MapReduce

Let us understand, how a MapReduce works by taking an example where I have a text file
called [Link] whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose, we have to perform a word count on the [Link] using MapReduce. So, we
will be finding the unique words and the number of occurrences of those unique words.

First, we divide the input into three splits as shown in the figure. This will distribute the work
among all the map nodes.

Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of
the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every
word, in itself, will occur once.

Now, a list of key-value pair will be created where the key is nothing but the individual
words and value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs –
Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.

After the mapper phase, a partition process takes place where sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.

So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values

corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.

Now, each Reducer counts the values which are present in that list of values. As shown in the
figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the number of
ones in the very list and gives the final output as – Bear, 2.

Finally, all the output key/value pairs are then collected and written in the output file.
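The word-count flow described above can be mimicked in a few lines of plain Python (this is an illustrative simulation of the map, shuffle/sort and reduce phases, not the Hadoop runtime itself):

from collections import defaultdict

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]   # the three input splits

# Map phase: every mapper emits (word, 1) for each token in its split.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle/sort phase: group all values belonging to the same key, e.g. "Bear" -> [1, 1].
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: each reducer sums the list of values for its key.
result = {word: sum(values) for word, values in grouped.items()}
print(result)   # e.g. {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}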

Advantages of MapReduce
The two biggest advantages of MapReduce are:

1. Parallel Processing:
2. Data Locality:

In MapReduce, we are dividing the job among multiple nodes and each node works with a part of the job simultaneously. So, MapReduce is based on the Divide and Conquer paradigm, which helps us to process the data using different machines. As the data is processed by multiple machines in parallel, instead of a single machine, the time taken to process the data is reduced by a tremendous amount, as shown in the figure below.

Fig.: Traditional Way Vs. MapReduce Way – MapReduce Tutorial

2. Data Locality:

Instead of moving data to the processing unit, we are moving the processing unit to the data in
the MapReduce Framework. In the traditional system, we used to bring data to the processing
unit and process it. But, as the data grew and became very huge, bringing this huge amount of
data to the processing unit posed the following issues:

Moving huge data to processing is costly and deteriorates the network performance.

Processing takes time as the data is processed by a single unit which becomes the bottleneck.

The master node can get over-burdened and may fail.

Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. So, as you can see in the above image, the data is distributed among multiple nodes, where each node processes the part of the data residing on it. This gives us the following advantages:

It is very cost-effective to move processing unit to the data.

The processing time is reduced as all the nodes are working with their part of the data in parallel.

Every node gets a part of the data to process and therefore, there is no chance of a node getting
overburdened.

Hadoop ecosystem

Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed in an efficient manner with the help of traditional methodology such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to


solve the big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and
Hadoop Common. Most of the tools or solutions are used to supplement or support these
major elements. All these tools work collectively to provide services such as absorption,
analysis, storage and maintenance of data etc.

Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

HDFS:

HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.

HDFS consists of two core components i.e.

 Name node
 Data Node

Name Node is the prime node, which contains metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost effective.

HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.

YARN:

Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to manage
the resources across the clusters. In short, it performs scheduling and resource allocation for
the Hadoop System.

Consists of three major components i.e.

 Resource Manager
 Nodes Manager
 Application Manager

The Resource Manager has the privilege of allocating resources for the applications in a system, whereas Node Managers work on the allocation of resources such as CPU, memory and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Manager and performs negotiations as per the requirement of the two.

MapReduce:

By making the use of distributed and parallel algorithms, MapReduce makes it possible to
carry over the processing’s logic and helps to write applications which transform big data sets
into a manageable one.

MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:

Map() performs sorting and filtering of data, thereby organizing it in the form of groups. Map generates a key-value-pair-based result which is later processed by the Reduce() method.

Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.

PIG:

Pig was basically developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.

It is a platform for structuring the data flow, processing and analyzing huge data sets.

Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.

Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the
way Java runs on the JVM.

Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.

HIVE:

With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).

It is highly scalable as it allows real-time processing and batch processing both. Also, all the
SQL datatypes are supported by Hive thus, making the query processing easier.

Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.

JDBC, along with ODBC drivers work on establishing the data storage permissions and
connection whereas HIVE Command line helps in the processing of queries.

Mahout:

Mahout brings machine learnability to a system or application. Machine Learning, as the name suggests, helps the system to develop itself based on some patterns, user/environmental interaction, or on the basis of algorithms.

It provides various libraries or functionalities such as collaborative filtering, clustering, and


classification which are nothing but concepts of Machine learning. It allows invoking algorithms
as per our need with the help of its own libraries.

Apache Spark:

It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization, etc. It consumes in-memory resources, hence being faster than the prior frameworks in terms of optimization.

Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch
processing, hence both are used in most of the companies interchangeably.

Apache HBase:

It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable, and is thus able to work on Big Data sets effectively.

At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.

Other Components: Apart from all of these, there are some other components too that carry out a
huge task in order to make Hadoop capable of processing large datasets. They are as follows:

Solr, Lucene: These are the two services that perform the task of searching and indexing with the help of some Java libraries. Lucene in particular is based on Java and also provides a spell-check mechanism; Solr is built on top of Lucene.

Zookeeper: There was a huge issue of management of coordination and synchronization


among the resources or the components of Hadoop which resulted in inconsistency, often.

Zookeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance.

Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
