BIG DATA ANALYTICS
By: Syed Nawaz Pasha @ SR University
Professional Elective-5
B.Tech IV-II SEM
Preferred Text Book
Authors: Seema Acharya and
Subhashini Chellappan
Book Title: Big Data and
Analytics
Publisher : Wiley
Types of digital data
Data is a precious and irreplaceable asset.
Data -> information
Information -> valuable insights
Classification of digital data
Digital data can be classified into three forms:
– Unstructured
– Semi-structured
– Structured
Unstructured data
This is data that does not conform to a
data model or is not in a form that can be
used easily by a computer program.
About 80-90% of an organization's data is in
this format.
Examples: PPTs, images, videos, letters, the body
of an email, etc.
Semi-structured data
This is data that does not conform to a
data model but has some structure.
However, it is not in a form that can be used
easily by a computer program.
Examples: emails, XML, markup languages
like HTML, etc.
Structured data
This is data that is in an organized form
(e.g., in rows and columns) and can be easily
used by a computer program. Relationships
exist between entities of data, such as
classes and their objects.
Data stored in databases is an example of
structured data.
Structured data (continued)
When do we say that data is structured?
When the data conforms to a predefined
schema/structure.
Ex: RDBMS
The relational data model, where data is stored as
rows and columns.
Sources of structured data
Ease of working with structured
data
Insert/update/delete
Indexing
Scalability
Transaction processing
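A minimal sketch of these operations, using Python's built-in sqlite3 module with an in-memory database. The table name books, its columns, and the sample rows are assumptions chosen for illustration; scalability is not shown because it depends on the database product.

```python
import sqlite3

# In-memory relational database; the "books" table and its columns are
# illustrative assumptions, not part of the course material.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, publisher TEXT)"
)

# Indexing: an index on publisher lets lookups on that column avoid a full scan.
conn.execute("CREATE INDEX idx_books_publisher ON books (publisher)")

# Insert / update / delete inside one transaction: the connection context
# manager commits on success and rolls back if an exception is raised.
with conn:
    conn.execute("INSERT INTO books VALUES (1, 'BDA', 'Wiley')")
    conn.execute("INSERT INTO books VALUES (2, 'Draft', 'Unknown')")
    conn.execute("UPDATE books SET publisher = 'Wiley India' WHERE id = 1")
    conn.execute("DELETE FROM books WHERE id = 2")

# Transaction processing: a failing statement undoes the whole batch.
try:
    with conn:
        conn.execute("INSERT INTO books VALUES (3, 'Temp', 'X')")
        # The next insert reuses id 1 and violates the PRIMARY KEY constraint.
        conn.execute("INSERT INTO books VALUES (1, 'Duplicate id', 'Y')")
except sqlite3.IntegrityError:
    pass  # the insert of id 3 is rolled back together with the failed one

rows = conn.execute("SELECT id, title, publisher FROM books ORDER BY id").fetchall()
print(rows)
```

Because the schema is fixed in advance, the database can reject a row that breaks it, which is a large part of why structured data is easy to work with.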
Semi-structured data
Also referred to as a self-describing structure.
Characteristics of semi-structured data:
Inconsistent structure
Self-describing (label/value pairs)
Often schema information is blended with data values
Data objects may have different attributes not known beforehand
Sources of semi-structured data
XML: web services developed using SOAP
JSON: used to transmit data between a server and a web
application using REST
HTML
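As a rough illustration of how a program reads self-describing XML, the sketch below uses Python's standard xml.etree.ElementTree module. The XML fragment and its tag names (books, book, title, publisher) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; tag names are assumptions for illustration.
xml_text = """
<books>
  <book id="9">
    <title>BDA</title>
    <publisher>Wiley India</publisher>
  </book>
  <book id="10">
    <title>Untitled draft</title>
  </book>
</books>
"""

root = ET.fromstring(xml_text)
records = []
for book in root.findall("book"):
    # Each element describes itself through its tags; a missing tag
    # yields None instead of violating a fixed schema.
    records.append({
        "id": book.get("id"),
        "title": book.findtext("title"),
        "publisher": book.findtext("publisher"),
    })
print(records)
```

The second book has no publisher element, which is the kind of inconsistent structure that semi-structured data allows.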
JSON
{
  "Id": 9,
  "Booktitle": "BDA",
  "Author": "Seema Acharya",
  "Publisher": "Wiley India",
  "YOP": "2011"
}
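A short sketch of reading such a record with Python's standard json module. The second record and its Edition attribute are hypothetical, added only to show that objects in the same collection may carry different attributes.

```python
import json

# The book record from the slide plus a second, hypothetical record that
# carries a different set of attributes.
raw = '''
[
  {"Id": 9, "Booktitle": "BDA", "Author": "Seema Acharya",
   "Publisher": "Wiley India", "YOP": "2011"},
  {"Id": 10, "Booktitle": "Sample", "Edition": 2}
]
'''
books = json.loads(raw)

# Attributes are discovered from the data itself, not from a fixed schema.
all_keys = sorted({key for book in books for key in book})

# Code that consumes the data must tolerate missing attributes.
publishers = [book.get("Publisher", "unknown") for book in books]
print(all_keys)
print(publishers)
```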
Unstructured data
Does not conform to any predefined data
model.
Issues with terminology of unstructured data
Structure can be implied despite not being formally
defined.
Data with some structure may still be labeled
unstructured if the structure doesn't help with
the processing task at hand.
Data may have some structure or may even be
highly structured in ways that are unanticipated or
unannounced.
Dealing with unstructured data
Data mining
Natural language processing (NLP)
Text analytics
Noisy text analytics
Techniques used to find patterns in
or interpret unstructured data
1. Data mining
Association rule mining
Regression analysis
Collaborative filtering
2. Text analytics or text mining: text categorization, text
clustering, sentiment analysis, etc.
3. Noisy text analytics: extracting structured or
semi-structured information from noisy unstructured
data such as chats, blogs, wikis, and emails.
4. Part-of-speech tagging (POS or POST)
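To make association rule mining concrete, here is a minimal sketch that computes the two standard measures of a rule, support and confidence, in plain Python. The market-basket transactions are made-up data, and the helper names support and confidence are chosen for this example only.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Rule under test: bread -> milk
rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here bread and milk appear together in 2 of 4 transactions (support 0.5), and milk appears in 2 of the 3 transactions that contain bread (confidence about 0.67).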
Remind me
Structured data
Semi-structured data
Unstructured data
Test me (segregate the items below as
structured, semi-structured, or
unstructured)
Email
MS Access
Images
Database
Chat conversation
Relations
Facebook
Videos
MS Excel
XML
Sources of unstructured data
Definition of big data (Gartner)
Big data is high-volume, high-velocity, and high-variety
information assets that demand cost-effective,
innovative forms of information
processing for enhanced insights and decision
making.
Big Data Definition
Big data refers to a huge amount of data that
is difficult to store and process using on-hand
database system tools or traditional data
processing applications.
Characteristics of big data (the 5 V’s of big
data)
Volume
Velocity
Variety
Value
Veracity
5 V’s of big data
Challenges with big data
Exponential growth of data: today's BIG may be
normal tomorrow.
Cloud computing and virtualization complicate
the decision to host big data solutions outside the
enterprise.
Period of retention of big data.
Shortage of skilled data science professionals.
Difficult to capture, store, prepare, search, analyze,
transfer, secure, and visualize big data.
Shortage of data visualization experts.
Big data analytics
The process of examining large datasets to uncover hidden
patterns, unknown correlations, the rationale behind
market trends, customer preferences, and other business
information.
Classification of analytics
Descriptive analytics
Diagnostic analytics
Predictive analytics
Prescriptive analytics
Analytics 1.0, 2.0, 3.0
Analytics 1.0 (mid-1950s to 2009): descriptive statistics (report on events, occurrences, etc.)
Analytics 2.0 (2005 to 2012): descriptive statistics + predictive statistics (use data from the past to make predictions for the future)
Analytics 3.0 (2012 to present): descriptive + predictive + prescriptive
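The difference between descriptive, predictive, and prescriptive analytics can be sketched on a tiny dataset. The monthly sales figures and the stock threshold below are invented for illustration, and a straight-line least-squares fit stands in for whatever predictive model would be used in practice.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data only).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive: summarize what has already happened.
average_sales = mean(sales)

# Predictive: fit a least-squares line and extrapolate to month 6.
mx, my = mean(months), mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / sum(
    (x - mx) ** 2 for x in months
)
intercept = my - slope * mx
forecast_month_6 = intercept + slope * 6

# Prescriptive: turn the forecast into a recommended action
# (the threshold of 19 units is an arbitrary assumption).
action = "increase stock" if forecast_month_6 > 19 else "keep stock level"
print(average_sales, forecast_month_6, action)
```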
Terminologies used in the big data
environment
In-memory analytics
In-database processing
Symmetric multiprocessor system
Massively parallel processing
Distributed vs. parallel systems
Shared-nothing architectures
Shared memory (SM)
Shared disk (SD)
Shared nothing (SN)
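The slides only name these architectures. As one possible illustration of shared nothing, the sketch below gives every node its own private store and routes each key to exactly one node by hashing. The Node and Cluster classes, the node names, and the choice of hash partitioning are all assumptions made for this example.

```python
import zlib

class Node:
    """One shared-nothing node: private storage, no shared memory or disk."""
    def __init__(self, name):
        self.name = name
        self.store = {}  # only this node reads or writes its own store

class Cluster:
    """Routes every key to exactly one owning node (hash partitioning)."""
    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]

    def owner(self, key):
        # A stable hash decides which node owns a key.
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self.owner(key).store[key] = value

    def get(self, key):
        return self.owner(key).store.get(key)

cluster = Cluster(["node-a", "node-b", "node-c"])
for user in ["alice", "bob", "carol", "dave"]:
    cluster.put(user, len(user))

for node in cluster.nodes:
    print(node.name, sorted(node.store))
```

Since no two nodes touch the same data, nodes can be added without contention over shared memory or a shared disk, at the cost of having to route or repartition keys.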
CAP Theorem
It is impossible for a distributed system to
simultaneously provide all three of the following
guarantees (pick any two):
1. Consistency: all nodes see the same
data at the same time, i.e., a read returns the latest
value written by any client.
2. Availability: every request receives a
response. The system allows operations all the
time, and operations return quickly.
3. Partition tolerance: the system continues to
operate despite arbitrary partitioning due to
network failures.
Examples of databases that follow each of the
three possible combinations: AP, CP, CA
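The trade-off can be illustrated with a toy two-replica store. This is a simplified sketch, not a model of any real database: the TwoNodeStore class, its mode labels, and the partitioned flag are hypothetical. During a network partition, a CP-style store refuses the write to stay consistent, while an AP-style store accepts it and lets one replica serve stale data.

```python
class Replica:
    def __init__(self):
        self.value = "v1"

class TwoNodeStore:
    """Toy two-replica store; mode is 'CP' or 'AP' (hypothetical labels)."""
    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when a and b cannot communicate

    def write(self, value):
        """Client writes through replica a."""
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let replicas diverge.
                raise RuntimeError("unavailable during partition")
            self.a.value = value  # Availability first: accept, b is now stale
        else:
            self.a.value = self.b.value = value  # replicate normally

    def read_from_b(self):
        return self.b.value

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
cp.partitioned = ap.partitioned = True  # simulate a network failure

try:
    cp.write("v2")
    cp_result = "accepted"
except RuntimeError:
    cp_result = "rejected"

ap.write("v2")
print(cp_result, cp.a.value, cp.read_from_b())  # CP: write refused, replicas agree
print(ap.a.value, ap.read_from_b())             # AP: write accepted, b is stale
```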