Data Analytics
(Subject Code: 410243)
(Class: TE Computer Engineering)
2019 Pattern
Designed By: Prof Balaji Bodkhe
Objectives and outcomes
• Course Objectives
–To understand and apply different design methods and techniques
–To understand architectural design and modeling
–To understand and apply testing techniques
–To implement design and testing using current tools and techniques in
distributed, concurrent and parallel environments
• Course Outcomes
– To present a survey on design techniques for software system
– To present a design and model using UML for a given software
system
– To present a design of test cases and implement automated testing
for client server, distributed, mobile applications
UNIT-I: CONCEPTS
INTRODUCTION AND LIFE CYCLE
Syllabus
Introduction: Big data overview, state of the practice in Analytics- BI Vs Data
Science, Current Analytical Architecture, drivers of Big Data, Emerging Big
Data Ecosystem and new approach.
Data Analytic Life Cycle: Overview, phase 1- Discovery, Phase 2- Data
preparation, Phase 3- Model Planning, Phase 4- Model Building, Phase 5-
Communicate Results, Phase 6- Operationalize. Case Study: GINA
What’s Big Data?
No single definition; here is from Wikipedia:
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
The challenges include capture, curation, storage, search, sharing,
transfer, analysis, and visualization.
The trend to larger data sets is due to the additional information
derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of
data, allowing correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link legal
citations, combat crime, and determine real-time roadway traffic
conditions."
Example Of Big Data
• Credit card companies monitor every purchase their customers make and can
identify fraudulent purchases with a high degree of accuracy using rules derived by
processing billions of transactions.
• Mobile phone companies analyze subscribers' calling patterns to determine, for
example, whether a caller's frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to
defect, the mobile phone company can proactively offer the subscriber an incentive
to remain in her contract.
• For companies such as LinkedIn and Facebook, data itself is their primary product. The
valuations of these companies are heavily derived from the data they gather and host,
which contains more and more intrinsic value as the data grows.
Three attributes stand out as defining Big Data characteristics:
• Huge volume of data: Rather than thousands or millions of rows, Big Data can be
billions of rows and millions of columns.
• Complexity of data types and structures: Big Data reflects the variety of new data
sources, formats, and structures, including digital traces being left on the web and
other digital repositories for subsequent analysis.
• Speed of new data creation and growth: Big Data can describe high velocity data,
with rapid data ingestion and near real time analysis.
Data Structures
Big data can come in multiple forms, including structured and non-structured data
such as financial data, text files, multimedia files, and genetic mappings. Contrary to
much of the traditional data analysis performed by organizations, most of the Big
Data is unstructured or semi-structured in nature, which requires different
techniques and tools to process and analyze
Four types of data structures
Structured data: Data containing a defined data type, format, and structure (that is,
transaction data, online analytical processing [OLAP] data cubes, traditional DBMS,
CSV files, and even simple spreadsheets).
Semi-structured data: Textual data files with a discernible pattern that enables
parsing (such as Extensible Markup Language [XML] data files that are self-
describing and defined by an XML schema).
Quasi-structured data: Textual data with erratic data formats that can be formatted
with effort, tools, and time (for instance, web click stream data that may contain
inconsistencies in data values and formats).
Unstructured data: Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
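To make the four categories concrete, the short Python sketch below parses a small hypothetical sample of each type; the sample strings and field names are illustrative only.

```python
# A minimal sketch (hypothetical in-memory samples) of how each of the
# four data-structure types is typically handled in Python.
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed schema, parses directly into rows with named fields.
structured = "id,amount\n1,250.00\n2,99.50\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing XML, parsed against its own tags.
semi = "<orders><order id='1'><amount>250.00</amount></order></orders>"
amounts = [o.find("amount").text for o in ET.fromstring(semi)]

# Quasi-structured: a web clickstream line; a regex recovers structure with effort.
quasi = '10.0.0.1 - - [12/Mar/2019:10:15:32] "GET /products?id=42 HTTP/1.1" 200'
match = re.search(r'"GET (\S+) HTTP', quasi)
clicked_url = match.group(1) if match else None

# Unstructured: free text; no inherent schema, only token-level processing applies.
unstructured = "Customer praised the quick delivery but disliked the packaging."
word_count = len(unstructured.split())

print(rows, amounts, clicked_url, word_count)
```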
State of the Practice in Analytics
Current business problems provide many opportunities for organizations to become
more analytical and data driven
BI Vs Data Science
1. Perspective: BI systems are designed to look backwards based on real data from real events. Data
Science looks forward, interpreting the information to predict what might happen in the future.
2. Focus: BI delivers detailed reports, trends but it doesn’t tell you what this data may look like in the
future in the form of patterns and experimentation.
3. Process: Traditional BI systems tend to be static and comparative. They do not offer room for
exploration and experimentation in terms of how the data is collected and managed.
4. Data sources: Because of its static nature, BI data sources tend to be pre-planned and added slowly.
Data science offers a much more flexible approach as it means data sources can be added on the go
as needed.
5. Transform: How the data delivers value to the business is also key. BI helps you answer the
questions you know, whereas Data Science helps you to discover new questions because of the way
it encourages companies to apply insights to new data.
6. Storage: Like any business asset, data needs to be flexible. BI systems tend to be warehoused and
siloed, which makes them difficult to deploy across the business. Data Science systems can be
distributed and run in real time.
7. Data quality: Any data analysis is only as good as the quality of the data
captured. BI provides a single version of the truth, while Data Science offers
precision, confidence levels, and much wider probabilities with its findings.
8. IT owned vs. business owned
In the past, BI systems were often owned and operated by the IT department,
sending along intelligence to analysts who interpreted it. With Data Science, the
analysts are in charge. The new Big Data solutions are designed to be owned by
analysts, who spend little of their time on ‘IT housekeeping’ and most of their
time analyzing data and making predictions upon which to base business
decisions.
9. Analysis: A retrospective, prescriptive BI system is far less well placed to
perform forward-looking analysis than a predictive Data Science programme.
Current Analytical Architecture
1. For data sources to be loaded into the data warehouse, data needs to be well
understood, structured, and normalized with the appropriate data type definitions.
Although this kind of centralization enables security, backup, and failover of highly
critical data, it also means that data typically must go through significant
preprocessing and checkpoints before it can enter this sort of controlled environment,
which does not lend itself to data exploration and iterative analytics.
2. As a result of this level of control on the EDW, additional local systems may
emerge in the form of departmental warehouses and local data marts that business
users create to accommodate their need for flexible analysis. These local data marts
may not have the same constraints for security and structure as the main EDW and
allow users to do some level of more in-depth analysis. However, these one-off
systems reside in isolation, often are not synchronized or integrated with other data
stores, and may not be backed up.
3. Once in the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes. These are high-priority operational
processes getting critical data feeds from the data warehouses and repositories.
4. At the end of this workflow, analysts get data provisioned for their downstream
analytics. Because users generally are not allowed to run custom or intensive
analytics on production databases, analysts create data extracts from the EDW to
analyze data offline in R or other local analytical tools.
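A minimal sketch of step 4 above, assuming a hypothetical sales table; sqlite3 and pandas stand in here for the actual EDW connection and the analyst's local tool.

```python
# Pulling a read-only extract out of a warehouse for offline analysis.
# Table and column names are hypothetical; sqlite3 stands in for the EDW.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")                      # stand-in for the EDW
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 120.0), ("APAC", 95.5), ("EMEA", 87.0)])

# The analyst's extract: a bounded query, not a scan of production tables.
extract = pd.read_sql_query(
    "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region", conn)

# Downstream exploration then happens locally, outside the warehouse.
print(extract)
```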
Drivers (Sources) of Big Data
The data now comes from multiple sources, such as these:
Medical information, such as genomic sequencing and diagnostic imaging
Photos and video footage uploaded to the World Wide Web
Video surveillance, such as the thousands of video cameras spread across a city
Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on smart
phones
Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry
infrastructures
Nontraditional IT devices, including the use of radio-frequency identification
(RFID) readers, GPS navigation systems, and seismic processing
Emerging Big Data Ecosystem and a New Approach to Analytics
Four main groups of players
Data devices
Games, smartphones, computers, etc.
Data collectors
Phone and TV companies, Internet, Gov’t, etc.
Data aggregators – make sense of data
Websites, credit bureaus, media archives, etc.
Data users and buyers
Banks, law enforcement, marketers, employers, etc.
Key Roles for the New Big Data Ecosystem
1. Deep analytical talent
Advanced training in quantitative disciplines – e.g., math,
statistics, machine learning
2. Data savvy professionals
Savvy but less technical than group 1
3. Technology and data enablers
Support people – e.g., DB admins, programmers, etc.
Data Analytics Lifecycle
Data science projects differ from BI projects
More exploratory in nature
Critical to have a project process
Participants should be thorough and rigorous
Break large projects into smaller pieces
Spend time to plan and scope the work
Documenting adds rigor and credibility
Data Analytics Lifecycle Overview
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Case Study: GINA
The data analytic lifecycle is designed for Big Data problems and
data science projects
Although there are six distinct phases, project work can occur in several
phases simultaneously
The cycle is iterative to portray a real project
Work can return to earlier phases as new information is uncovered
Key Roles for a Successful Analytics Project
Business User – understands the domain area
Project Sponsor – provides requirements
Project Manager – ensures meeting objectives
Business Intelligence Analyst – provides business domain
expertise based on deep understanding of the data
Database Administrator (DBA) – creates DB environment
Data Engineer – provides technical skills, assists data management
and extraction, supports analytic sandbox
Data Scientist – provides analytic techniques and modeling
Background and Overview of Data Analytics Lifecycle
Data Analytics Lifecycle defines the analytics process and best
practices from discovery to project completion
The Lifecycle employs aspects of
Scientific method
Cross Industry Standard Process for Data Mining (CRISP-DM)
Process model for data mining
Davenport’s DELTA framework
Hubbard’s Applied Information Economics (AIE) approach
MAD Skills: New Analysis Practices for Big Data by Cohen et al.
Phase 1: Discovery
1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Phase 2: Data Preparation
Includes steps to explore, preprocess, and condition data
Create robust environment – analytics sandbox
Data preparation tends to be the most labor-intensive
step in the analytics lifecycle
Often at least 50% of the data science project’s time
The data preparation phase is generally the most iterative
and the one that teams tend to underestimate most often
Preparing the Analytic Sandbox
Create the analytic sandbox (also called workspace)
Allows team to explore data without interfering with live
production data
Sandbox collects all kinds of data (expansive approach)
The sandbox allows organizations to undertake ambitious
projects beyond traditional data analysis and BI to perform
advanced predictive analytics
Although the concept of an analytics sandbox is relatively new, it has
become accepted by data science teams and IT groups
Performing ETLT
(Extract, Transform, Load, Transform)
In ETL users perform extract, transform, load
In the sandbox the process is often ELT – early load
preserves the raw data which can be useful to examine
Example – in credit card fraud detection, outliers can
represent high-risk transactions that might be
inadvertently filtered out or transformed before being
loaded into the database
Hadoop (Chapter 10) is often used here
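A minimal sketch of the ETL-versus-ELT point above, using pandas on hypothetical credit card transactions; the outlier threshold and amounts are illustrative.

```python
# Contrasting ETL-style filtering with the ELT pattern described above.
import pandas as pd

raw = pd.DataFrame({"txn_id": [1, 2, 3, 4],
                    "amount": [25.0, 18.5, 9400.0, 32.0]})

# ETL: transforming before load can silently drop the extreme values.
etl_loaded = raw[raw["amount"] < 1000]            # outlier never reaches the sandbox

# ELT: load the raw data first, transform inside the sandbox, keep both views.
sandbox = raw.copy()                              # "load" everything as-is
sandbox["is_outlier"] = sandbox["amount"] > 1000  # transform after loading

# The high-risk transaction (txn_id 3) survives for fraud analysis.
print(etl_loaded)
print(sandbox)
```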
Learning about the Data
Becoming familiar with the data is critical
This activity accomplishes several goals:
Determines the data available to the team early in the
project
Highlights gaps – identifies data not currently
available
Identifies data outside the organization that might be
useful
Learning about the Data: Sample Dataset Inventory
Data Conditioning
Data conditioning includes cleaning data, normalizing
datasets, and performing transformations
Often viewed as a preprocessing step prior to data analysis, it
might be performed by data owner, IT department, DBA, etc.
Best to have data scientists involved
Data science teams prefer more data than too little
Additional questions and considerations
What are the data sources? Target fields?
How clean is the data?
How consistent are the contents and files? Missing or
inconsistent values?
Assess the consistency of the data types – numeric,
alphanumeric?
Review the contents to ensure the data makes sense
Look for evidence of systematic error
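The pandas sketch below illustrates how a team might run these conditioning checks; the dataset, column names, and codings are hypothetical.

```python
# Running basic conditioning checks on a hypothetical extract.
import pandas as pd

df = pd.DataFrame({"customer_id": ["C01", "C02", None, "C04"],
                   "age": [34, "unknown", 29, 41],
                   "state": ["MH", "MH", "Maharashtra", "KA"]})

print(df.isna().sum())             # missing values per column
print(df.dtypes)                   # numeric vs. alphanumeric consistency
print(df["state"].value_counts())  # inconsistent codings ("MH" vs "Maharashtra")

# Coerce a supposedly numeric field and flag rows that fail conversion,
# one simple signal of systematic error in the source.
df["age_num"] = pd.to_numeric(df["age"], errors="coerce")
print(df[df["age_num"].isna()])
```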
Survey and Visualize
Leverage data visualization tools to gain an overview of
the data
Shneiderman’s mantra:
“Overview first, zoom and filter, then details-on-demand”
This enables the user to find areas of interest, zoom and filter to
find more detailed information about a particular area, then find
the detailed data in that area
Survey and Visualize: Guidelines and Considerations
Review data to ensure calculations are consistent
Does the data distribution stay consistent?
Assess the granularity of the data, the range of values, and the level
of aggregation of the data
Does the data represent the population of interest?
Check time-related variables – daily, weekly, monthly? Is this good
enough?
Is the data standardized/normalized? Scales consistent?
For geospatial datasets, are state/country abbreviations consistent?
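A minimal sketch of applying Shneiderman's mantra with pandas and matplotlib; the order data, region names, and filter thresholds are hypothetical.

```python
# Overview first, zoom and filter, then details-on-demand.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"order_value": rng.gamma(2.0, 50.0, size=1000),
                   "region": rng.choice(["North", "South", "West"], size=1000)})

# Overview first: distribution and granularity of the whole dataset.
df["order_value"].hist(bins=40)
plt.title("Order value - overview")
plt.show()

# Zoom and filter: narrow to an area of interest (high-value orders, one region).
subset = df[(df["region"] == "North") & (df["order_value"] > 200)]

# Details on demand: inspect the underlying records for that slice.
print(subset.describe())
print(subset.head())
```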
Common Tools for Data Preparation
Hadoop can perform parallel ingest and analysis
Alpine Miner provides a graphical user interface for creating
analytic workflows
OpenRefine (formerly Google Refine) is a free, open source tool
for working with messy data
Similar to OpenRefine, Data Wrangler is an interactive tool for
data cleansing and transformation
Phase 3: Model Planning
Activities to consider
Assess the structure of the data – this dictates the tools and analytic
techniques for the next phase
Ensure the analytic techniques enable the team to meet the business
objectives and accept or reject the working hypotheses
Determine if the situation warrants a single model or a series of
techniques as part of a larger analytic workflow
Research and understand how other analysts have approached this kind
or similar kind of problem
Model Planning in Industry Verticals
Example of other analysts approaching a similar problem
Data Exploration and Variable Selection
Explore the data to understand the relationships among the variables to
inform selection of the variables and methods
A common way to do this is to use data visualization tools
Often, stakeholders and subject matter experts may have ideas
For example, the hypotheses that led to the project
Aim for capturing the most essential predictors and variables
This often requires iterations and testing to identify key variables
If the team plans to run regression analysis, identify the candidate
predictors and outcome variables of the model
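A minimal sketch of screening candidate predictors for a regression model by their correlation with the outcome; the variables, synthetic data, and cutoff are hypothetical.

```python
# Screening candidate predictors by correlation with the outcome variable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"ad_spend": rng.normal(100, 20, n),
                   "store_visits": rng.normal(50, 10, n),
                   "day_of_week": rng.integers(0, 7, n)})
df["sales"] = 3.0 * df["ad_spend"] + 0.5 * df["store_visits"] + rng.normal(0, 30, n)

# Correlation of each candidate predictor with the outcome.
correlations = df.corr()["sales"].drop("sales").sort_values(ascending=False)
print(correlations)

# Keep only predictors above a project-specific threshold for the first model.
candidates = correlations[correlations.abs() > 0.2].index.tolist()
print("Candidate predictors:", candidates)
```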
Model Selection
The main goal is to choose an analytical technique, or several candidates,
based on the end goal of the project
We observe events in the real world and attempt to construct models that
emulate this behavior with a set of rules and conditions
A model is simply an abstraction from reality
Determine whether to use techniques best suited for structured data,
unstructured data, or a hybrid approach
Teams often create initial models using statistical software packages such
as R, SAS, or Matlab
Which may have limitations when applied to very large datasets
The team moves to the model building phase once it has a good idea about
the type of model to try
Common Tools for the Model Planning Phase
R has a complete set of modeling capabilities
R contains about 5000 packages for data analysis and graphical presentation
SQL Analysis Services can perform in-database analytics of common data
mining functions, involved aggregations, and basic predictive models
SAS/ACCESS provides integration between SAS and the analytics
sandbox via multiple data connections
Phase 4: Model Building
Execute the models defined in Phase 3
Develop datasets for training, testing, and production
Develop analytic model on training data, test on test data
Questions to consider
Does the model appear valid and accurate on the test data?
Does the model output/behavior make sense to the domain experts?
Do the parameter values make sense in the context of the domain?
Is the model sufficiently accurate to meet the goal?
Does the model avoid intolerable mistakes? (see Chapters 3 and 7)
Are more data or inputs needed?
Will the kind of model chosen support the runtime environment?
Is a different form of the model required to address the business problem?
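A minimal sketch of the Phase 4 mechanics described above, using scikit-learn on synthetic data; the model choice and split ratio are illustrative, not prescriptive.

```python
# Split the data, fit on the training set, evaluate on the held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))                      # four candidate predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)  # develop on training data

# Test on held-out data and sanity-check against the accuracy goal.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {test_accuracy:.3f}")
```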
Common Tools for the Model Building Phase
Commercial Tools
SAS Enterprise Miner – built for enterprise-level computing and analytics
SPSS Modeler (IBM) – provides enterprise-level computing and analytics
Matlab – high-level language for data analytics, algorithms, data exploration
Alpine Miner – provides GUI frontend for backend analytics tools
STATISTICA and MATHEMATICA – popular data mining and analytics tools
Free or Open Source Tools
R and PL/R - PL/R is a procedural language for PostgreSQL with R
Octave – language for computational modeling
WEKA – data mining software package with analytic workbench
Python – language providing toolkits for machine learning and analysis
SQL – in-database implementations provide an alternative tool (see Chap 11)
Phase 5: Communicate Results
Determine if the team succeeded or failed in its objectives
Assess if the results are statistically significant and valid
If so, identify aspects of the results that present salient findings
Identify surprising results and those in line with the hypotheses
Communicate and document the key findings and major
insights derived from the analysis
This is the most visible portion of the process to the outside
stakeholders and sponsors
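A minimal sketch of checking whether a result is statistically significant before reporting it, using a two-sample t-test from SciPy on hypothetical pre-/post-model conversion data.

```python
# Check significance before presenting a lift as a key finding.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
baseline = rng.binomial(1, 0.10, size=2000)    # conversions without the model
with_model = rng.binomial(1, 0.12, size=2000)  # conversions with the model

t_stat, p_value = stats.ttest_ind(with_model, baseline)
print(f"lift = {with_model.mean() - baseline.mean():.3f}, p-value = {p_value:.4f}")

# Only findings that clear the agreed significance level become "key findings".
if p_value < 0.05:
    print("Report the lift as a salient, statistically significant result.")
else:
    print("Treat the lift as inconclusive; do not present it as a key finding.")
```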
Phase 6: Operationalize
In this last phase, the team communicates the benefits of the project
more broadly and sets up a pilot project to deploy the work in a
controlled way
Risk is managed effectively by undertaking a small-scope pilot deployment
before a wide-scale rollout
During the pilot project, the team may need to execute the algorithm
more efficiently in the database rather than with in-memory tools
like R, especially with larger datasets
To test the model in a live setting, consider running the model in a
production environment for a discrete set of products or a single line
of business
Monitor model accuracy and retrain the model if necessary
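A minimal sketch of the monitoring-and-retraining idea above; the accuracy floor, pilot batches, and stand-in model are hypothetical.

```python
# Score live pilot batches, track accuracy, flag the model for retraining
# when it drops below the agreed floor.
from statistics import mean

ACCURACY_FLOOR = 0.80      # agreed with the business during the pilot

def monitor(batches, model_predict):
    """Compare predictions with observed outcomes for each pilot batch."""
    history = []
    for features, outcomes in batches:
        preds = [model_predict(x) for x in features]
        acc = mean(1.0 if p == y else 0.0 for p, y in zip(preds, outcomes))
        history.append(acc)
        if acc < ACCURACY_FLOOR:
            print(f"Accuracy {acc:.2f} below floor - schedule retraining.")
    return history

# Example with a trivial stand-in model on two small pilot batches.
batches = [([1, 2, 3], [1, 0, 1]), ([4, 5], [0, 0])]
print(monitor(batches, model_predict=lambda x: x % 2))
```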
Key outputs from a successful analytics project
Business user – tries to determine business benefits and
implications
Project sponsor – wants business impact, risks, ROI
Project manager – needs to determine if project completed on time,
within budget, goals met
Business intelligence analyst – needs to know if reports and
dashboards will be impacted and need to change
Data engineer and DBA – must share code and document
Data scientist – must share code and explain model to peers,
managers, stakeholders
Four main deliverables
Although the seven roles represent many interests, the interests
overlap and can be met with four main deliverables
1. Presentation for project sponsors – high-level takeaways for executive level
stakeholders
2. Presentation for analysts – describes business process changes and
reporting changes, includes details and technical graphs
3. Code for technical people
4. Technical specifications for implementing the code
Case Study: Global Innovation Network and Analysis (GINA)
In 2012 EMC’s new director wanted to improve the
company’s engagement of employees across the global
centers of excellence (GCE) to drive innovation,
research, and university partnerships
This project was created to accomplish the following:
Store formal and informal data
Track research from global technologists
Mine the data for patterns and insights to improve the team’s
operations and strategy
Phase 1: Discovery
Team members and roles
Business user, project sponsor, project manager – Vice President
from Office of CTO
BI analyst – person from IT
Data engineer and DBA – people from IT
Data scientist – distinguished engineer
The data fell into two categories
Five years of idea submissions from internal innovation contests
Minutes and notes representing innovation and research activity from
around the world
Hypotheses grouped into two categories
Descriptive analytics of what is happening to spark further creativity,
collaboration, and asset generation
Predictive analytics to advise executive management of where it should
be investing in the future
Phase 2: Data Preparation
Set up an analytics sandbox
Discovered that certain data needed conditioning and
normalization and that missing datasets were critical
Team recognized that poor quality data could impact
subsequent steps
They discovered that many names were misspelled and that there were
problems with extra spaces
These seemingly small problems had to be addressed
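A minimal sketch of the kind of name conditioning the GINA team needed, using Python's difflib; the names, canonical list, and matching cutoff are hypothetical.

```python
# Trim extra spaces and reconcile misspelled names against a canonical list.
import difflib

canonical = ["Alice Murphy", "Rajesh Kumar", "Wei Zhang"]
raw_names = ["  Alice  Murphy ", "Rajesh Kumarr", "W. Zhang", "Wei Zhang"]

def clean(name: str) -> str:
    normalized = " ".join(name.split())            # collapse extra spaces
    match = difflib.get_close_matches(normalized, canonical, n=1, cutoff=0.8)
    return match[0] if match else normalized       # fall back to the raw value

for name in raw_names:
    print(repr(name), "->", clean(name))
```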
Phase 3: Model Planning
The study included the following considerations
Identify the right milestones to achieve the goals
Trace how people move ideas from each milestone
toward the goal
Track ideas that die and others that reach the goal
Compare times and outcomes using a few different
methods
Phase 4: Model Building
Several analytic methods were employed
NLP on textual descriptions
Social network analysis using R and RStudio
Developed social graphs and visualizations
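The GINA team used R and RStudio; the sketch below shows the same social network idea in Python with networkx as a stand-in, on hypothetical edges between idea submitters.

```python
# Build a social graph and rank people by centrality to surface influencers.
# An edge means one person commented on or co-developed another's idea.
import networkx as nx

edges = [("Alice", "Raj"), ("Alice", "Wei"), ("Raj", "Wei"),
         ("Wei", "Dana"), ("Dana", "Elena"), ("Wei", "Elena")]

G = nx.Graph()
G.add_edges_from(edges)

# Degree centrality is one simple way to surface "hidden innovators" -
# people who connect many idea submitters.
centrality = nx.degree_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```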
Social graph of data submitters and finalists
Social graph of top innovation influencers
Phase 5: Communicate Results
Study was successful in identifying hidden innovators
Found high density of innovators in Cork, Ireland
The CTO office launched longitudinal studies
Phase 6: Operationalize
Deployment was not really discussed
Key findings
Need more data in future
Some data were sensitive
A parallel initiative needs to be created to improve basic BI
activities
A mechanism is needed to continually reevaluate the model
after deployment