Big Data Analytics(BDA)
GTU #3170722
Unit-1
Introduction to Big Data
Prof. Maulik D. Trivedi
Computer Engineering Department
Darshan Institute of Engineering & Technology, Rajkot
[email protected]
9998 265 805
Outline
• Introduction to Big Data
• Big Data Characteristics
• Challenges of Conventional System
• Types of Big Data
• Intelligent Data Analysis
• Traditional vs. Big Data Business Approach
• Case Study of Big Data Solutions
Introduction to Big Data
Introduction
First, we need to know: what is data?
Data is the quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in the form
of electrical signals and recorded on magnetic, optical, or mechanical
recording media.
[Figures: Data Comes From; Types of Data]
#3170722 (BDA) Unit:1 – Introduction to Big
Prof. Maulik D Trivedi 4
Computer Data as Information
Computer data is information processed or stored by a computer.
This information may be in the form of text documents, images, audio
clips, software programs, or other types of data.
Computer data may be processed by the computer's CPU and is stored
in files and folders on the computer's hard disk.
Definition – Big Data
Big Data is a massive collection of data that
continues to grow dramatically over time.
It is a data set that is so huge and
complicated that no typical data management
technologies can effectively store or process
it.
Big Data is like regular data, but it is much
larger.
It is data which is very large in size.
Normally we work on data of size MB (Word documents, Excel) or at most
GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is
called Big Data.
It is stated that almost 90% of today's data
has been generated in the past 3 years.
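To get a feel for these scales, the arithmetic can be sketched in a few lines of Python. This is only an illustrative sketch: the decimal (power-of-ten) unit table and the helper names `to_bytes` and `how_many_fit` are assumptions made for this example, chosen to match the 10^15 bytes figure above.

```python
# Decimal (SI) storage units, in bytes. Assumed here so that
# 1 petabyte = 10^15 bytes, as stated in the definition above.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def to_bytes(amount, unit):
    """Convert an amount expressed in the given unit to bytes."""
    return amount * UNITS[unit]


def how_many_fit(item_size, item_unit, capacity, capacity_unit):
    """Return how many whole items of a given size fit in a capacity."""
    return to_bytes(capacity, capacity_unit) // to_bytes(item_size, item_unit)


if __name__ == "__main__":
    # A hypothetical 2 GB movie file compared with 1 PB of storage.
    print(how_many_fit(2, "GB", 1, "PB"))  # 500000
```

Under these assumptions, one petabyte holds half a million 2 GB movies, which shows why everyday MB and GB tools stop being practical at this scale.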
Sources of Big Data
• Posts, photos, videos, likes and comments on social media
• Traffic data & GPS signals
• Emails, blogs and e-news
• Software logs, camera and microphone
• Huge data from weather stations and satellites that is stored and
  manipulated for forecasting
• Digital pictures & videos
Big Data Characteristics
• Volume: the amount of data, which is growing at a high rate, i.e. data
  volume in petabytes.
• Value: turning data into value. By turning accessed big data into
  value, businesses may generate revenue.
• Veracity: the uncertainty of available data. It arises because the high
  volume of data brings incompleteness and inconsistency.
• Visualization: the process of displaying data in charts, graphs, maps,
  and other visual forms.
• Variety: the different data types, i.e. various data formats like text,
  audio, video, etc.
• Velocity: the rate at which data grows. Social media plays a major role
  in the velocity of growing data.
• Virality: how quickly information spreads across people-to-people (P2P)
  networks.
Volume [ Data at Rest ]
As it follows from the name, big data is used to refer to enormous amounts
of information.
We are talking about not gigabytes but terabytes and petabytes of data.
The IoT (Internet of Things) is creating exponential growth in data.
The volume of data is projected to change significantly in the coming
years.
Hence, 'Volume' is one characteristic which needs to be considered while
dealing with Big Data.
• Terabytes, Petabytes
• Records/Arch
• Table/Files
• Distributed
Variety [ Data in many Forms ]
Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured.
Data comes in different formats, ranging from structured, numeric data in
traditional databases to unstructured text documents, emails, videos,
audios, stock ticker data and financial transactions.
This variety of unstructured data poses certain issues for storing, mining
and analysing data.
Organizing the data in a meaningful way is no simple task, especially when
the data itself changes rapidly.
A further challenge of Big Data processing, beyond the massive volumes and
increasing velocities of data, is manipulating the enormous variety of
these data.
• Structured
• Unstructured
• Text
• Multimedia
Veracity [ Data in Doubt ]
Veracity describes whether the data can be trusted.
Veracity refers to the uncertainty of available data. It arises because the
high volume of data brings incompleteness and inconsistency.
Hygiene of data in analytics is important because otherwise you cannot
guarantee the accuracy of your results.
Because data comes from so many different sources, it is difficult to link,
match, cleanse and transform data across systems.
However, analysis is useless if the data being analysed are inaccurate or
incomplete.
Veracity is all about making sure the data is accurate, which requires
processes to keep bad data from accumulating in your systems.
• Trustworthiness
• Authenticity
• Accurate
• Availability
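A process that keeps bad data out can be as simple as a validation step in front of storage. The following is a minimal sketch, not a prescribed method: the field names (`id`, `age`, `city`), the age range and the helper names are all hypothetical and chosen only to illustrate checks for incompleteness and inconsistency.

```python
# Hypothetical required fields for a customer record; a real pipeline
# would take these from its own schema.
REQUIRED_FIELDS = ("id", "age", "city")


def find_problems(record):
    """Return a list of veracity problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Incompleteness: the field is absent, None, or an empty string.
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    age = record.get("age")
    # Inconsistency: a value is present but not plausible (assumed range).
    if isinstance(age, (int, float)) and not 0 <= age <= 120:
        problems.append("age out of range")
    return problems


def split_records(records):
    """Separate clean records from bad ones, keeping the reasons."""
    clean, bad = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            bad.append((record, problems))
        else:
            clean.append(record)
    return clean, bad


if __name__ == "__main__":
    sample = [
        {"id": 1, "age": 34, "city": "Rajkot"},
        {"id": 2, "age": None, "city": "Surat"},
        {"id": 3, "age": 250, "city": ""},
    ]
    clean, bad = split_records(sample)
    print(len(clean), "clean record(s)")
    for record, problems in bad:
        print(record["id"], problems)
```

Only the clean records would be passed on to analysis; the bad ones are kept with their reasons so they can be repaired or rejected.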
Velocity [ Data in Motion ]
Velocity is the speed at which data grows, is processed and becomes
accessible.
Data flows in from sources like business processes, application logs,
networks, social media sites, sensors, mobile devices, etc.
The flow of data is massive and continuous.
Although most data are warehoused before analysis, there is an increasing
need for real-time processing of these enormous volumes.
Real-time processing reduces storage requirements while providing more
responsive, accurate and profitable responses.
Data should be processed fast, by batch or in a stream-like manner, because
it just keeps growing every year.
• Streaming
• Batch
• Real / Near Time
• Processes
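The difference between batch and stream-like processing, and why the latter can reduce storage requirements, can be sketched as follows. This is an illustrative sketch in plain Python with hypothetical function names, not the API of any particular streaming framework.

```python
def batch_average(readings):
    """Batch style: store every reading first, then compute once."""
    stored = list(readings)          # the whole data set is kept
    return sum(stored) / len(stored)


def streaming_average(readings):
    """Stream style: keep only a count and a running total.

    Yields the up-to-date average after each reading arrives, so a
    result is available immediately and no reading needs to be stored.
    """
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    sensor_values = [10, 20, 30]     # hypothetical sensor readings
    print(batch_average(sensor_values))
    for current in streaming_average(sensor_values):
        print(current)
```

Both approaches end at the same average, but the streaming version holds two numbers in memory regardless of how many readings arrive.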
Value
It refers to turning data into value. By turning Value
accessed big data into values, businesses may [ Data into
generate revenue. Money ]
Value is the end game. After addressing volume,
velocity, variety, variability, veracity, and
visualization – which takes a lot of time, effort and
resources – you want to be sure your organization is
getting value from the data.
For example, data that can be used to analyze
• Statistical
consumer behavior is valuable for your company • Events
because you can use the research results to make • Correlations
individualized offers.
Visualization [ Data Readable ]
Big data visualization is the process of displaying data in charts, graphs,
maps, and other visual forms.
It is used to help people easily understand and interpret their data at a
glance, and to clearly show trends and patterns that arise from this data.
Raw data comes in different formats, so creating data visualizations is a
process of gathering, managing, and transforming data into a format that is
most usable and meaningful.
Big data visualization makes your data as accessible as possible to
everyone within your organization, whether they have technical data skills
or not.
• Readable
• Accessible
• Presentation
• Visual Forms
Virality [ Data Spread ]
Virality describes how quickly information spreads across people-to-people
(P2P) networks.
It measures how quickly data is spread and shared to each unique node.
Time is a determining factor, along with the rate of spread.
• P2P
• Shared
• Rate of Spread
Challenges of Conventional System
There are three main challenges of conventional systems, which are as
follows:
1. Volume of Data
2. Processing and Analyzing
3. Management of Data
Volume of Data
The volume of data is increasing day by day, especially data generated by
machines, telecommunication services, airline services, sensors, etc.
The rapid growth in data every year comes with new sources of data which
are emerging.
As per a survey, the growth in the volume of data is so rapid that IBM
expected around 35 zettabytes of data to be stored in the world by 2020.
Processing & Analyzing
Processing such a large volume of data is a major challenge and is very
difficult.
Organizations make use of such large volumes of data by analyzing them in
order to achieve their business goals.
Extracting insights from such a large amount of data is time consuming and
takes a lot of effort.
Processing and analyzing the data is also costly, since the data comes in
different formats and is complex.
Management of Data
As the data gathered comes in different formats (structured,
semi-structured and unstructured), it is very challenging to manage such a
variety of data.
Types of Big Data
1. Unstructured
2. Semi-structured
3. Structured
Unstructured
Any data with unknown form or structure is classified as unstructured
data.
In addition to its huge size, unstructured data poses multiple challenges
in terms of processing it to derive value.
A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc., such
as the results returned by a Google search.
Nowadays organizations have a wealth of data available to them, but
unfortunately they don't know how to derive value out of it since this data
is in its raw form or unstructured format.
[Figure: Machine Generated Data; Human Generated Data]
Unstructured - Example
The output returned by 'Google Search'
Structured
Any data that can be stored, accessed and processed in a fixed format is
termed "structured" data.
Over time, computer science has achieved great success in developing
techniques for working with this kind of data (where the format is well
known in advance) and for deriving value out of it.
Issues arise when the size of such data grows to a huge extent, with
typical sizes in the range of multiple zettabytes.
Data stored in a relational database management system is one example of
structured data.
Structured - Example
Employee_Table
Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs
1 | XYX | MALE | FINANCE | 850000
2 | ABC | MALE | ADMIN | 250000
3 | PQR | FEMALE | SALES | 350000
4 | MNR | FEMALE | FINANCE | 600000
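Because the format of such a table is known in advance, it can be declared once and then queried directly. The sketch below loads the rows of Employee_Table into an in-memory SQLite database using Python's standard `sqlite3` module; the column types and the helper names `build_employee_table` and `employees_in` are assumptions made for illustration.

```python
import sqlite3

# Rows copied from the Employee_Table example above.
ROWS = [
    (1, "XYX", "MALE", "FINANCE", 850000),
    (2, "ABC", "MALE", "ADMIN", 250000),
    (3, "PQR", "FEMALE", "SALES", 350000),
    (4, "MNR", "FEMALE", "FINANCE", 600000),
]


def build_employee_table():
    """Create the table in an in-memory database and load the rows."""
    conn = sqlite3.connect(":memory:")
    # The fixed schema is what makes this data "structured".
    # Column types are assumed; the slide only gives column names.
    conn.execute(
        """CREATE TABLE Employee_Table (
               Employee_ID    INTEGER PRIMARY KEY,
               Employee_Name  TEXT NOT NULL,
               Gender         TEXT NOT NULL,
               Department     TEXT NOT NULL,
               Salary_In_lacs INTEGER NOT NULL)"""
    )
    conn.executemany("INSERT INTO Employee_Table VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def employees_in(conn, department):
    """Return (name, salary) pairs for one department, highest paid first."""
    cursor = conn.execute(
        "SELECT Employee_Name, Salary_In_lacs FROM Employee_Table "
        "WHERE Department = ? ORDER BY Salary_In_lacs DESC",
        (department,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_employee_table()
    print(employees_in(connection, "FINANCE"))
```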
Semi-structured
Semi-structured data is the third type of big data.
Semi-structured data contains both of the forms mentioned above, that is,
structured and unstructured data.
To be precise, it refers to data that has not been classified under a
particular repository (database), yet contains vital information or tags
that segregate individual elements within the data.
Web application data, which is unstructured, consists of log files,
transaction history files, etc.
Online transaction processing systems are built to work with structured
data, wherein data is stored in relations (tables).
Semi-structured - Example
Semi-structured data appears structured in form, but it is not actually
defined by, e.g., a table definition in a relational DBMS.
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
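The tags are what make such records processable even without a table definition. As a minimal sketch, the records above can be read with Python's standard `xml.etree.ElementTree` module. The wrapping `<people>` root element and the function name `parse_people` are assumptions added for this example, since an XML document needs a single root.

```python
import xml.etree.ElementTree as ET

# The three records from the example above, as one text fragment.
RECORDS = """
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
"""


def parse_people(xml_fragment):
    """Turn a sequence of <rec> elements into a list of dictionaries."""
    # Wrap the fragment in a hypothetical <people> root so it parses
    # as a well-formed XML document.
    root = ET.fromstring(f"<people>{xml_fragment}</people>")
    people = []
    for rec in root.iter("rec"):
        people.append(
            {
                "name": rec.findtext("name"),
                "sex": rec.findtext("sex"),
                "age": int(rec.findtext("age")),
            }
        )
    return people


if __name__ == "__main__":
    for person in parse_people(RECORDS):
        print(person)
```

Note that nothing forces every record to carry the same tags; a record with an extra or missing element would still parse, which is the flexibility that separates semi-structured data from a fixed table.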
Difference
Factors | Structured data | Semi-structured data | Unstructured data
Flexibility | It is schema-dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is flexible in nature and there is an absence of a schema
Transaction Management | Matured transaction and various concurrency techniques | The transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Query performance | Structured queries allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible
Technology | It is based on the relational database table | It is based on RDF and XML | It is based on character and library data
Intelligent Data Analysis
Intelligent Data Analysis (IDA) is one of the major issues in the field of
artificial intelligence and information.
Intelligent data analysis reveals implicit, previously unknown and
potentially valuable information or knowledge from large amounts of data.
It also helps in making decisions.
It covers all areas of data visualization, data pre-processing (combining,
editing, transforming, filtering, sampling), data engineering, data mining
procedures, tools and applications, use of domain knowledge in data
analysis, big data applications, evolutionary algorithms, etc.
It includes three major steps:
1. Data Preparation
2. Rules finding or data mining
3. Result validation and explanation
Intelligent Data Analysis – Cont.
Data Preparation:
It includes extracting or collecting relevant data from the source and then
creating a data set.
Rules finding or Data mining:
It means working out the rules contained in the data set by means of
certain methods or algorithms.
Result Validation and Explanation:
Result validation means examining these rules.
Result explanation means giving an intuitive, reasonable, and
understandable description using logical reasoning.
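The three steps can be sketched end to end on a toy data set. This is only an illustration under stated assumptions: the transaction records, the `amount` and `fraud` field names, and the choice of a single-threshold rule are all hypothetical, picked because they keep each step short enough to read.

```python
def prepare(raw_records):
    """Step 1, data preparation: keep complete records only and
    convert them to (amount, is_fraud) pairs."""
    return [
        (float(r["amount"]), r["fraud"])
        for r in raw_records
        if r.get("amount") is not None and r.get("fraud") is not None
    ]


def find_rule(training_set):
    """Step 2, rule finding: choose the threshold t for which the rule
    'amount > t implies fraud' agrees with the most training records."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({amount for amount, _ in training_set}):
        correct = sum((amount > threshold) == label
                      for amount, label in training_set)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def validate_and_explain(threshold, held_out_set):
    """Step 3, validation and explanation: test the rule on data it has
    not seen and state it in a form a person can read."""
    correct = sum((amount > threshold) == label
                  for amount, label in held_out_set)
    accuracy = correct / len(held_out_set)
    explanation = (f"IF amount > {threshold} THEN fraud "
                   f"(held-out accuracy {accuracy:.0%})")
    return accuracy, explanation


if __name__ == "__main__":
    # Hypothetical raw data; one record is incomplete on purpose.
    raw = [
        {"amount": 20, "fraud": False},
        {"amount": 35, "fraud": False},
        {"amount": 50, "fraud": False},
        {"amount": None, "fraud": True},
        {"amount": 900, "fraud": True},
        {"amount": 1200, "fraud": True},
    ]
    held_out = [(30.0, False), (45.0, False), (700.0, True), (1500.0, True)]
    rule = find_rule(prepare(raw))
    print(validate_and_explain(rule, held_out)[1])
```

Real IDA work would use far richer methods for rule finding, but the division of labour between the three steps stays the same.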
Since IDA aims to extract useful knowledge, the process demands a
combination of extraction, analysis, conversion, classification,
organization, reasoning, and so on.
We can apply machine learning and deep learning concepts for IDA.
It helps in many areas:
Banking & Securities, Communications, Media & Entertainment
Traditional vs. Big Data Business Approach
Importance of Big Data
Complex or massive data sets which are impractical to manage using
traditional database systems and software tools are referred to as big
data.
Big data is utilized by organizations in one way or another. It is the
technology that makes it possible to realize big data's value.
It is a voluminous amount of both multi-structured and unstructured data.
Traditional vs. Big Data
Confidentiality & Data Accuracy
Data Relationship
Data Storage Size
Different types of data
Flexibility
Real-time Analytics
Distributed Architecture
Major Differences between Traditional Data & Big Data
TRADITIONAL DATA | BIG DATA
Traditional data is generated at enterprise level. | Big data is generated both outside and at enterprise level.
Its volume ranges from Gigabytes to Terabytes. | Its volume ranges from Petabytes to Zettabytes or Exabytes.
Traditional database system deals with structured data. | Big data system deals with structured, semi-structured and unstructured data.
Traditional data is generated per hour or per day or more. | Big data is generated more frequently, mainly per second.
Traditional data source is centralized and it is managed in centralized form. | Big data source is distributed and it is managed in distributed form.
Major Differences between Traditional Data & Big Data – Cont.
TRADITIONAL DATA | BIG DATA
Data integration is very easy. | Data integration is very difficult.
Normal system configuration is capable of processing traditional data. | High system configuration is required to process big data.
The size of the data is very small. | The size is more than the traditional data size.
Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
Normal functions can manipulate data. | Special kinds of functions can manipulate data.
Major Differences between Traditional Data & Big Data – Cont.
TRADITIONAL DATA | BIG DATA
Its data model is strict schema based and it is static. | Its data model is flat schema based and it is dynamic.
Traditional data is stable and its inter-relationships are known. | Big data is not stable and its relationships are unknown.
Traditional data is in manageable volume. | Big data is in huge volume which becomes unmanageable.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data etc. | Its data sources include social media, device data, sensor data, video, images, audio etc.
Case Study of Big Data Solution
Undoubtedly, Big Data has become a major game changer in most
cutting-edge industries over the last few years.
As Big Data keeps growing day by day, the number of organizations that are
adopting Big Data keeps expanding.
Let's discuss an example:
An e-commerce site XYZ (having 100 million users) wants to offer a gift
voucher of $100 to its top 10 customers who have spent the most in the
previous year.
Moreover, it wants to find the buying trends of these customers so that the
company can suggest more items related to them.
Issues: A huge amount of unstructured data needs to be stored, processed
and analyzed.
Solution:
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop
Distributed File System), which uses commodity hardware to form clusters
and stores data in a distributed fashion. It works on the write once, read
many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over
the network to find the required output.
Analysis: Pig and Hive can be used to analyze the data.
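The processing step can be illustrated with the map, shuffle and reduce phases written out in plain Python. This is a single-machine sketch of the paradigm, not Hadoop's API: the function names, the `(customer_id, amount)` order format and the sample data are all hypothetical.

```python
import heapq
from collections import defaultdict


def map_phase(orders):
    """Map: emit one (customer_id, amount) pair per order record."""
    for customer_id, amount in orders:
        yield customer_id, amount


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the amounts for each customer."""
    return {customer: sum(amounts) for customer, amounts in groups.items()}


def top_customers(orders, n=10):
    """Return the n customers with the highest total spend."""
    totals = reduce_phase(shuffle_phase(map_phase(orders)))
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical order log: (customer_id, amount spent).
    last_year = [("c1", 120.0), ("c2", 40.0), ("c1", 30.0), ("c3", 200.0)]
    print(top_customers(last_year, n=2))
```

On a cluster, the map and reduce functions would run in parallel on the nodes that hold the data blocks; only the grouping by customer requires moving data across the network.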
Where are businesses finding uses for Big Data?
Walmart
Biggest retailer in the world and the world's biggest organization by
revenue.
Approx. 2 million workers and 20000 stores in 28+ nations.
It started using Big Data concepts at an early stage.
It used data mining to find patterns that can be used to give product
suggestions to customers, depending on which products were bought
together.
Based on the data mining results, it has increased its customer conversion
rate.
Walmart's main target is to retain customers and enhance their
experience.
Hadoop and NoSQL technologies are used to provide customers with real-time
data gathered from various sources and to make effective use of it.
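Finding which products were bought together can be sketched as counting product pairs across shopping baskets. This is a simplified illustration of the idea, not Walmart's actual method: the basket contents and the function names `co_purchase_counts` and `suggest` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair a
        # single canonical order, e.g. ("bread", "butter").
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def suggest(product, counts, k=3):
    """Suggest up to k products most often bought with the given one."""
    scores = Counter()
    for (first, second), times in counts.items():
        if first == product:
            scores[second] += times
        elif second == product:
            scores[first] += times
    return [name for name, _ in scores.most_common(k)]


if __name__ == "__main__":
    # Hypothetical shopping baskets.
    baskets = [
        ["bread", "butter", "milk"],
        ["bread", "butter"],
        ["milk", "eggs"],
        ["bread", "milk"],
        ["butter", "bread"],
    ]
    counts = co_purchase_counts(baskets)
    print(suggest("bread", counts))
```

With these sample baskets, butter is paired with bread most often, so it would be the first suggestion shown to a customer who adds bread.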
Uber
It is a preferred option for people around the globe for moving people and
making deliveries.
It uses the personal data of its users to closely monitor which features of
the service are used, to analyze usage patterns, and to figure out where
the services should be more focused.
It focuses on the supply and demand of its services, because of which the
prices of the services provided change.
One use of this data is surge pricing, which influences the rate of
demand.
Netflix
It is a very popular entertainment company that works in online, on-demand,
web-based video streaming for its customers.
It has been determined to use Big Data to predict precisely what its
customers will enjoy watching.
Recently, Netflix began positioning itself as a content creator, not simply
a distribution medium, a strategy firmly based on data analytics.
Its recommendation engines are fed with data such as what customers watch,
how often playback is stopped, ratings and so on.
It incorporates Hadoop, Hive and Pig along with other traditional business
intelligence tools.
More Case Studies of Big Data
https://www.scnsoft.com/blog/big-data-use-cases-stats-and-examples
https://www.tableau.com/learn/articles/big-data-examples-use-cases
Thank You