Unit - I DA

The document provides an overview of data analytics, including its definitions, classifications, and characteristics, with a focus on Big Data. It outlines the data analytics lifecycle, detailing phases such as discovery, data preparation, model planning, and operationalization, while emphasizing the need for various roles in successful analytics projects. Additionally, it contrasts traditional analytics with Big Data analytics and discusses the importance of modern tools and technologies for managing large datasets.

Uploaded by

ianurags2509
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
49 views107 pages

Unit - I DA

The document provides an overview of data analytics, including its definitions, classifications, and characteristics, with a focus on Big Data. It outlines the data analytics lifecycle, detailing phases such as discovery, data preparation, model planning, and operationalization, while emphasizing the need for various roles in successful analytics projects. Additionally, it contrasts traditional analytics with Big Data analytics and discusses the importance of modern tools and technologies for managing large datasets.

Uploaded by

ianurags2509
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 107

Unit - I

• Introduction to Data Analytics: Sources and nature of data, classification of data (structured, semi-structured, unstructured), characteristics of data, introduction to Big Data platform, need of data analytics, evolution of analytic scalability, analytic process and tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
• Data Analytics Lifecycle: Need, key roles for
successful analytic projects, various phases of data
analytics lifecycle – discovery, data preparation,
model planning, model building, communicating
results, operationalization.
Source of Data
What is Big data
CLASSIFICATION OF DATA
• Data classification is broadly defined as the process of organizing data by relevant categories so that it may be used and protected more efficiently. At a basic level, classification makes data easier to locate and retrieve. Data classification is of particular importance for risk management, compliance, and data security.
• Big Data involves huge volume, high velocity, and an extensible variety of data. It falls into three classes: structured data, semi-structured data, and unstructured data.
Structured data
• Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database: data that can be stored in a SQL database in tables with rows and columns. Such data have relational keys and can easily be mapped into pre-designed fields. Structured data is the easiest kind to process and manage. Example: relational data.
Unstructured data
• Unstructured data is data that is not organized in a pre-defined manner and does not follow a pre-defined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, text files, media logs.
Semi-Structured data
• Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but semi-structured formats are often kept as-is, since their embedded tags already provide usable structure. Example: XML data.
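To make the contrast concrete, here is a minimal sketch (using a hypothetical <orders> XML snippet, not data from any real system) of how the tags in semi-structured data act like field names, so values can be addressed directly even without a fixed relational schema:

```python
import xml.etree.ElementTree as ET

# A small made-up XML fragment: no table schema, but the tags
# give each value an addressable name.
xml_data = """
<orders>
    <order id="1"><customer>Asha</customer><total>250.0</total></order>
    <order id="2"><customer>Ravi</customer><total>120.5</total></order>
</orders>
"""

root = ET.fromstring(xml_data)
# The <total> tags let us pull out every order amount directly.
totals = [float(order.find("total").text) for order in root.iter("order")]
print(sum(totals))  # 370.5
```

Doing the same with truly unstructured data (say, free-text emails describing orders) would require text mining first, which is exactly why semi-structured data sits between the two extremes.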
CHARACTERISTICS OF DATA
Volume
Variety
Veracity
Value
Velocity
How Much Data
CERN's Large Hadron Collider
TYPES OF DATA
VARIETIES BIG DATA COLLECTED
What is Big data
• Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population. – Merv Adrian
• "Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. – McKinsey Global Institute
Summary
• Definition of Data Analytics
• Data Analytics vs Data Mining
• Definitions to Big data
• Classifications of Big data
• Characteristics of big data
• Applications of Big data
TRADITIONAL ANALYTICS VS BIG
DATA ANALYTICS
360-degree view
EVOLUTION OF ANALYTIC
SCALABILITY
• The amount of data organizations process continues to increase, and the old methods for handling data no longer work efficiently.
• Important technologies for handling big data include:
MPP (Massively Parallel Processing)
The cloud
Grid computing
MapReduce
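The MapReduce model listed above can be sketched in a few lines: map each input record to (key, value) pairs, shuffle the pairs by key, then reduce each group. This is a single-process illustration of the programming model only, not a distributed implementation, and the word-count task and input lines are made up:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: collapse one key's values into a single result.
    return (word, sum(counts))

lines = ["big data big analytics", "data data tools"]

mapped = [pair for line in lines for pair in map_phase(line)]
# Shuffle: bring all pairs with the same key together.
mapped.sort(key=itemgetter(0))
result = dict(reduce_phase(w, (c for _, c in group))
              for w, group in groupby(mapped, key=itemgetter(0)))
print(result)  # {'analytics': 1, 'big': 2, 'data': 3, 'tools': 1}
```

In a real cluster the map calls run in parallel on many nodes and the shuffle moves data across the network, but the map/shuffle/reduce contract is the same.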
MODERN DATABASE
ARCHITECTURE
Massively Parallel Processing
What is cloud computing ?
Grid Computing
Map Reduce
Working process
Good & Bad
Technologies can integrate and work
together
Evolution of Analytical Processes
Definition of Analytical framework
An internal Configuration
An External Configuration
A Hybrid Configuration
Benefits
Definition of ADS
The data that is pulled together in order to create an analysis or model:
• In the format required for the specific analysis at hand
• Generated by transforming, aggregating, and combining data
• Helps to bridge the gap between efficient storage and ease of use
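As a minimal sketch of those three points, the following builds a tiny analysis data set by aggregating raw transaction rows and combining them with a second source, producing one row per customer in the shape a model would consume. All field names and values here are illustrative, not from any real data set:

```python
from collections import defaultdict

# Raw inputs: transaction rows plus a separate demographic lookup.
transactions = [
    {"customer": "C1", "amount": 100.0},
    {"customer": "C1", "amount": 50.0},
    {"customer": "C2", "amount": 75.0},
]
demographics = {"C1": {"region": "North"}, "C2": {"region": "South"}}

# Aggregate: total spend and visit count per customer.
agg = defaultdict(lambda: {"total_spend": 0.0, "visits": 0})
for t in transactions:
    agg[t["customer"]]["total_spend"] += t["amount"]
    agg[t["customer"]]["visits"] += 1

# Combine: join the aggregates with the demographic attributes,
# yielding one analysis-ready row per customer.
ads = [{"customer": c, **metrics, **demographics[c]} for c, metrics in agg.items()]
print(ads[0])  # {'customer': 'C1', 'total_spend': 150.0, 'visits': 2, 'region': 'North'}
```

The raw tables stay in efficient storage; the ADS is the analysis-friendly view derived from them, which is the gap-bridging role described above.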
Two Primary kinds of Analytics Data sets
Traditional Analytics data sets
Enterprise Analytic Data Set
EDA Set - Structure
Summary Table or View?
Embedded Scoring
Model and Score Management
• Model and score management procedures will need to
be in place to scale the use of models by an
organization.
REPORTING Vs ANALYSIS
• Reporting: The process of organizing data into
informational summaries in order to monitor how
different areas of a business are performing.

– They select the reports they want to run
– Get the reports executed
– View the results
• Analysis: The process of exploring data and reports
in order to extract meaningful insights, which can be
used to better understand and improve business
performance.

– Track the problem
– Find the data required
– Analyze the data
– Interpret the results
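The contrast can be shown on one (made-up) sales data set: reporting produces a fixed summary to monitor performance, while analysis explores further to surface an insight the summary alone would hide. The regions, channels, and numbers below are purely illustrative:

```python
sales = [
    {"region": "North", "channel": "web",   "revenue": 120},
    {"region": "North", "channel": "store", "revenue": 80},
    {"region": "South", "channel": "web",   "revenue": 40},
    {"region": "South", "channel": "store", "revenue": 160},
]

# Reporting: a standard summary of revenue by region.
report = {}
for row in sales:
    report[row["region"]] = report.get(row["region"], 0) + row["revenue"]
print(report)  # {'North': 200, 'South': 200}

# Analysis: the regional totals match, but exploring the channel mix
# reveals a large difference the report hides.
web_share = {
    region: sum(r["revenue"] for r in sales
                if r["region"] == region and r["channel"] == "web") / report[region]
    for region in report
}
print(web_share)  # {'North': 0.6, 'South': 0.2}
```

The report answers "what happened"; the analysis step digs into "why", which is where actionable insight comes from.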
Difference
Making Inference
• To produce a great analysis, it is necessary to infer potential actions
– Make initial inferences based on the analysis
– Visualization plays a vital role in understanding
– An effective visualization can bring out many more inferences
– Today's visualization tools allow multiple tabs and can link graphs and charts
– A newer direction for visualization is 3-D
Applications
• Open source software has been around for some time
– In many cases, open source products are outside the mainstream
• Many individuals contribute to improving the functionality
– Bugs can be patched quickly
Data Analytics Lifecycle
• Big Data analysis differs from traditional data analysis primarily due to the volume, velocity, and variety characteristics of the data being processed.
• To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.
Key Roles for a Successful Analytics
Project
• Business User – understands the domain area
• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business
domain expertise based on deep understanding of the
data
• Database Administrator (DBA) – creates DB
environment
• Data Engineer – provides technical skills, assists data
management and extraction, supports analytic sandbox
• Data Scientist – provides analytic techniques and
modeling
Data Analytics Lifecycle (cont..)
• The data analytics lifecycle is designed for Big Data problems and data science projects
• The cycle is iterative to represent a real project
• Work can return to earlier phases as new information is uncovered
Data Analytics Lifecycle-Abstract
View
Discovery
• In this phase, the data science team must learn and investigate the problem, develop context and understanding, and learn about the data sources needed and available for the project.
• In addition, the team formulates initial hypotheses that can later be tested with data.
• The team should perform five main activities during this step
of the discovery.
• Identify data sources: Make a list of data sources the team
may need to test the initial hypotheses outlined in this phase.
Make an inventory of the datasets currently available and
those that can be purchased or otherwise acquired for the
tests the team wants to perform.
• Capture aggregate data sources: This is for previewing the
data and providing high-level understanding.
It enables the team to gain a quick overview of the data and
perform further exploration on specific areas.
• Review the raw data: Begin understanding the
interdependencies among the data attributes.
Become familiar with the content of the data, its quality,
and its limitations
• Evaluate the data structures and tools needed: The data
type and structure dictate which tools the team can use to
analyze the data.
• Scope the sort of data infrastructure needed for this type of
problem: In addition to the tools needed, the data influences
the kind of infrastructure that's required, such as disk storage
and network capacity.
• Unlike many traditional stage-gate processes, in which the
team can advance only when specific criteria are met, the Data
Analytics Lifecycle is intended to accommodate more
ambiguity
• For each phase of the process, it is recommended to pass
certain checkpoints as a way of gauging whether the team is
ready to move to the next phase of the Data Analytics
Lifecycle.
Data preparation
• This phase includes steps to explore, preprocess, and condition data prior to modeling and analysis.
• It requires the presence of an analytic sandbox (workspace), in which the team can work with data and perform analytics for the duration of the project.
✔ The team needs to execute Extract, Load, and Transform (ELT) or Extract, Transform, and Load (ETL) to get data into the sandbox.
✔ In ETL, users perform processes to extract data from a datastore, perform data transformations, and load the data back into the datastore.
✔ The combination of ELT and ETL is sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it.
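A minimal ETL sketch of getting data into a sandbox follows, using an in-memory SQLite database as a stand-in for the sandbox datastore. The source records, table name, and cleaning rules are all illustrative assumptions, not part of any specific methodology:

```python
import sqlite3

# Extract: raw records pulled from a source system (hard-coded here).
raw = [("2024-01-05", "  Alice ", "100"), ("2024-01-06", "bob", "250")]

# Transform: condition the data — trim and normalize names, cast amounts.
cleaned = [(date, name.strip().title(), float(amount)) for date, name, amount in raw]

# Load: place the conditioned data into the sandbox for analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sandbox_sales (sale_date TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sandbox_sales VALUES (?, ?, ?)", cleaned)

total = conn.execute("SELECT SUM(amount) FROM sandbox_sales").fetchone()[0]
print(total)  # 350.0
```

In an ELT variant the raw rows would be loaded first and the cleanup expressed as SQL inside the sandbox; either way, the team ends up with conditioned data it can analyze for the duration of the project.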
Data preparation (Cont.,)
Data preparation (Cont.,)
Common Tools for the Data
Preparation Phase
Model Planning
Common Tools for the Model
Planning Phase
Model Building
Communicate Results
Operationalize
Common Tools for the Model
Building Phase
Key outputs for each of the main
stakeholders
