🧠
Data science Theory
Class Data science theory
Completed
Created Jun 12, 2020 10:19 PM
Materials
Source Udemy
Type Lecture
Analysis and analytics
Analysis -
Performing analysis on things that have already happened in the past.
Example: How sales decreased in the summer.
We do analysis to explain how or why something happened.
Analytics -
Exploring patterns to work out what we can do in the future.
There are two types of analytics:
Qualitative analytics = intuition + analysis
Quantitative analytics = formulas + algorithms
Introduction:
Some business activities are data driven, while others are subjective or
experience driven.
Business needs -
Business case studies - real-world experience of how companies succeed
and fail. We don't need a data set to understand case studies.
Qualitative analytics - it is all about intuition and knowledge of the market.
This includes working with tools to predict future behavior.
Preliminary data reporting
Reporting with visuals
Creating dashboards
Sales forecasting
👆 In the list above, the pink items are data driven.
👆 The yellow items are experience driven.
Some of these terms refer to activities that aim to explain past
behavior (this is called analysis), while others refer to activities used for
predicting future behavior (this is called analytics).
Here, business case studies are analysis, while qualitative analytics is all
about predicting the future (analytics).
NOTE: Business analytics = business analysis + business analytics.
Data science: Can be used to improve the accuracy of predictions based on
data extracted from various activities.
Business Intelligence (BI): The process of analysing and reporting historical
business data. Aims to explain past events using business data. It is a
preliminary step of predictive analytics.
Analyse past data and extract useful insights
Create appropriate models
Reporting with visuals and creating dashboards is all about BI
Machine Learning: The ability of machines to predict outcomes without being
explicitly programmed. It is all about creating and implementing algorithms that
let machines receive data and use this data to
Make predictions
Analyse patterns
Give recommendations
Artificial intelligence: Simulating human knowledge and decision making with
computers.
Data science Handbook :
Approaches and techniques for working with traditional data.
Raw data to processed data and to information
Raw facts or raw data
Cannot be analysed straight away
It is untouched data you have accumulated and stored on the server
Data collection
Examples: Surveys: data can be collected through surveys, e.g. how much
people like or dislike the product on a scale of 1 to 10.
Cookies: They provide companies with detailed information about
users' activities on a web site.
Processed data
Data pre-processing:
Before data processing we do data pre-processing. We do this
after data collection. It is a group of operations that
converts your raw data into a format that is more understandable.
Example: In an SQL database, a person's age is entered as 932,
or their name as United Kingdom.
Before any analysis, that data should be marked as invalid or
corrected.
Methods in pre-processing:
Class labeling -
This includes labeling each data point with the correct data
type, or arranging data by category.
This can be
Numerical - e.g. the number of units sold in a day
Categorical - cannot be manipulated mathematically.
Data cleansing = data cleaning = data scrubbing
It deals with inconsistent data.
Example: Correcting spelling mistakes and dealing with
missing values.
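A minimal sketch of these checks, assuming the records arrive as Python dictionaries. The field names, the age limit of 120 and the list of country names are illustrative assumptions, not part of the lecture.

```python
# Hypothetical raw records; the second and third mirror the invalid
# entries mentioned in the notes (age 932, a country typed as a name).
raw_records = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 932},             # impossible age
    {"name": "United Kingdom", "age": 28},   # a country in the name field
    {"name": "  carol ", "age": None},       # untidy spelling and a missing value
]

KNOWN_COUNTRIES = {"united kingdom", "france", "india"}  # assumed lookup list
MAX_AGE = 120  # assumed upper bound for a plausible age


def clean_record(record):
    """Return a tidied copy of the record with a 'valid' flag added."""
    name = record["name"].strip().title()
    age = record["age"]
    age_ok = age is not None and 0 <= age <= MAX_AGE
    name_ok = name.lower() not in KNOWN_COUNTRIES
    return {"name": name, "age": age, "valid": age_ok and name_ok}


cleaned = [clean_record(r) for r in raw_records]
for row in cleaned:
    print(row)
```

Rows flagged as invalid are kept rather than dropped, so they can later be corrected or excluded before analysis.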
Examples of data pre-processing:
Balancing: Imagine you have compiled a survey to gather
data on the shopping habits of men and women, to find out
who spends more money at the weekend. In the data you
collect, 80% of the respondents are women and 20% are men,
so the trends you notice will reflect women far more than
men. To counteract this, apply a balancing technique, such as
taking an equal number of respondents from each group so
that the ratio is 50/50.
Data shuffling: Shuffling the observations in the dataset
is just like shuffling a deck of cards. Shuffling is the process
of randomizing the order of the data. It prevents unwanted
patterns, improves predictive performance and helps avoid
misleading results.
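Balancing and shuffling can be sketched as follows. This is only one possible approach (undersampling the larger group), and the respondent records and the fixed random seed are assumptions made for illustration.

```python
import random

# Hypothetical survey: 80 women and 20 men, as in the example in the notes.
respondents = [{"id": i, "gender": "woman"} for i in range(80)]
respondents += [{"id": 80 + i, "gender": "man"} for i in range(20)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def balance(rows, key):
    """Undersample every group down to the size of the smallest group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    smallest = min(len(members) for members in groups.values())
    balanced_rows = []
    for members in groups.values():
        balanced_rows.extend(rng.sample(members, smallest))
    return balanced_rows


balanced = balance(respondents, "gender")
rng.shuffle(balanced)  # randomize the order, like shuffling a deck of cards

counts = {}
for row in balanced:
    counts[row["gender"]] = counts.get(row["gender"], 0) + 1
print(counts)
```

Undersampling throws away data from the larger group; collecting more responses from the smaller group is an alternative when that is possible.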
Information
Visualization represents databases containing traditional data
(visualization of a relational database management system).
Entity relationship (ER) diagram: shows how the tables in the database are
related.
Relational schema: each rectangle represents a distinct data table, and the
lines show which tables are related.
Techniques for working with big data
Here there is much more variety beyond categorical and numerical data. Examples
of big data include numbers, text, digital images, digital video data and
digital audio data.
A wider range of data types comes with a wider range of data cleansing
methods.
There are techniques that verify that a digital image observation is ready for
processing.
Text data mining: The process of deriving valuable, unstructured data from a
text.
Data masking: Analyse the information without compromising private details.
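One simple way to picture data masking is to replace each private value with a token before analysis. The sketch below uses salted hashing; this is an assumed technique chosen for illustration, and the salt, field names and records are hypothetical.

```python
import hashlib

SALT = "example-salt"  # assumed secret value; a real system would keep it private


def mask(value):
    """Replace a private value with a short token that is hard to reverse."""
    digest = hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()
    return digest[:10]


# Hypothetical purchase records containing a private e-mail address.
purchases = [
    {"email": "ann@example.com", "amount": 30},
    {"email": "ann@example.com", "amount": 45},
    {"email": "ben@example.com", "amount": 20},
]

masked = [{"customer": mask(p["email"]), "amount": p["amount"]} for p in purchases]

# The analysis still works: spending can be totalled per masked customer.
totals = {}
for row in masked:
    totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
print(totals)
```

Because the same e-mail always maps to the same token, records can still be grouped by customer without exposing the address itself.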
Business intelligence (BI) analysis:
Data skills + business knowledge and intuition to explain the past performance
of the company.
How we measure business performance.
We start by collecting observations.
For example, collecting variables such as sales volume or new
customers enrolled on your web site.
Each monthly revenue figure, or each customer, is considered a single
observation.
Then we must quantify that information. Quantification is the process of
representing observations as numbers.
Measure: a measure is the accumulation of observations to show some
information.
For example: If you total the revenue of all three months to obtain the
value of $350, that will be the measure of the revenue for the first quarter
of that year. Similarly, add together the number of new customers for the
same period, 50, and you have another measure.
Analyze the data
Metrics - a metric is a value derived from the measures you obtain, and
aims at gauging business performance or progress.
NOTE: Metric = measure + business meaning
☝ This is useful for comparison.
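The step from observations to measures to a metric can be sketched with a little arithmetic. Only the totals ($350 and 50 new customers) come from the notes; the monthly split and the choice of "revenue per new customer" as the metric are assumptions made for illustration.

```python
# Observations: one value per month. The monthly figures are hypothetical;
# they are chosen only so that they add up to the totals in the notes.
monthly_revenue = [100, 120, 130]
monthly_new_customers = [15, 15, 20]

# Measures: accumulate the observations.
quarterly_revenue = sum(monthly_revenue)
quarterly_new_customers = sum(monthly_new_customers)

# Metric: a measure with business meaning attached.
revenue_per_new_customer = quarterly_revenue / quarterly_new_customers

print(quarterly_revenue, quarterly_new_customers, revenue_per_new_customer)
```

A ratio like this is easier to compare across quarters than the raw totals, which is why metrics are useful for comparison.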
Can we keep track of all possible metrics we can extract from a data set? - YES
Does it make sense to do that? - NO
What you need to do is choose the metrics that are tightly aligned with your
business objectives. These metrics are called KPIs (Key Performance Indicators).
KPIs = metrics + business objectives
Key - related to your business goals
Performance - how successfully you have performed within a specified time
frame.
Indicators - values that indicate something about your business performance.
Metric vs. KPI:
Metric: The traffic of a page from your website that was visited by any type
of user.
KPI: The traffic generated only from users who have clicked on a link provided
in your ad campaign.
As the next step, every quantitative meaning you extracted must be
visualized.
Traditional methods
At this stage we start applying analytics.
Techniques for working with traditional data
Regression: A model used for quantifying causal relationships among the
different variables included in your analysis.
For example:
Linear regression models
The data used here is house price against house size in square feet. This is a
linear regression model.
Here the red line is the regression line,
because all the points are close to the red
line while they are not close to the green line. So
the green line is not the regression line.
So this red line can be written as
y = bx
Here, y = house price, b = coefficient and x = house size
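To make "the line closest to the points" concrete, here is a small sketch that estimates b for y = bx by least squares. The house sizes and prices are made-up values, not the lecture's data, and the slope of 150 standing in for the green line is also an assumption.

```python
# Hypothetical (size in square feet, price) pairs; not the lecture's data.
sizes = [500, 750, 1000, 1250, 1500]
prices = [105000, 148000, 200000, 252000, 295000]


def fit_slope(xs, ys):
    """Least-squares estimate of b for the no-intercept model y = b * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def sum_squared_error(slope, xs, ys):
    """Total squared vertical distance between the points and the line."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))


b = fit_slope(sizes, prices)
print(f"b = {b:.2f}")

# Any other slope, such as an assumed 'green line' with b = 150, fits worse.
print(sum_squared_error(b, sizes, prices) < sum_squared_error(150, sizes, prices))
```

The fitted slope is the one that minimizes the squared error, which is what it means for the points to lie closest to the regression line.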
Logistic regression
The values on the vertical axis will be 1s or 0s only.
Such models are used in decision-making processes.
Companies apply logistic regression algorithms to filter job candidates
during their screening process.
If the algorithm estimates that the probability that a prospective candidate
will perform well in the company is above 50%, it predicts 1, or
a successful application. Otherwise it predicts 0.
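The 50% rule can be sketched as below. The single input (years of experience) and the two coefficients are assumed values for illustration; a real model would estimate them from past hiring data.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def screen_candidate(years_experience, intercept=-3.0, coefficient=1.0):
    """Return (probability, label). The coefficients are assumed, not fitted."""
    probability = sigmoid(intercept + coefficient * years_experience)
    label = 1 if probability > 0.5 else 0  # 1 only when strictly above 50%
    return probability, label


for years in (1, 3, 5):
    probability, label = screen_candidate(years)
    print(years, round(probability, 3), label)
```

The model itself outputs a probability; the 1 or 0 only appears once the threshold is applied.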
Cluster analysis
For example, suppose the house price vs. house size graph looks like the one below.
Here the red line is the regression line. But here we can do more: cluster
analysis.
This is another technique, which takes into account that certain observations
exhibit similar house sizes and prices.
Here the clusters are:
City center: small houses that cost a lot
Far from the city: big houses that cost less
Nice neighborhoods in the city: big houses that cost a lot
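The notes do not name a clustering algorithm, so the sketch below uses k-means as one common choice. The nine houses and the hand-picked starting centroids are assumptions for illustration.

```python
# Hypothetical houses as (size in square feet, price in thousands of dollars).
houses = [
    (600, 520), (650, 560), (700, 540),      # small and expensive
    (2400, 300), (2500, 280), (2600, 310),   # big and cheap
    (2300, 820), (2450, 860), (2550, 900),   # big and expensive
]


def distance_sq(p, q):
    """Squared straight-line distance between two points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mean_point(points):
    """Average position of a non-empty list of points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then re-average, repeatedly."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance_sq(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Starting centroids are picked by hand here; they are usually chosen randomly.
clusters, centroids = k_means(houses, [houses[0], houses[3], houses[6]])
for cluster in clusters:
    print(cluster)
```

Size and price are on similar numeric scales here only because price is expressed in thousands; with very different scales, the features would need rescaling before distances are meaningful.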
For this example we only have the house size and house price.
but when it comes to this table:
Here is the mathematical expression for the regression model.
y = a + b1 x1 + b2 x2 + b3 x3 + ....... + bn xn
NOTE: x, the explanatory variable, is also known as a regressor, independent
variable or predictor variable.
For example, consider analyzing a survey that consists of 100 questions.
For this survey the regression model is:
y = a + b1 x1 + b2 x2 + b3 x3 + .......... + b100 x100
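Evaluating the multiple regression formula is just a weighted sum plus the intercept a. In this sketch the three regressors and all coefficient values are assumptions for illustration, not estimates from real data.

```python
def predict_price(intercept, coefficients, values):
    """Evaluate y = a + b1*x1 + b2*x2 + ... + bn*xn."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one coefficient per explanatory variable")
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Assumed coefficients for three hypothetical regressors:
# size (square feet), number of rooms, distance to the centre (km).
a = 50000
b = [150, 10000, -2000]
house = [1000, 3, 5]

print(predict_price(a, b, house))
```

With 100 survey questions, the two lists would simply have 100 entries each, which is why reducing the number of regressors becomes attractive.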
Here factor analysis comes into play.
In the example:
Question 1: I like animals ⭕⭕⭕⭕⭕
Question 2: I care about animals ⭕⭕⭕⭕⭕
Question 3: I am against animal cruelty ⭕⭕⭕⭕⭕
Whoever marks 5 on the first question is most likely to give 5 for the other two
questions. In other words, if you strongly agree with one of these questions
you will not disagree with the other two.
With factor analysis we can combine all three questions into one factor: general
attitude towards animals.
z1 (general attitude towards animals) combines:
x1 = 1. I like animals
x2 = 2. I care about animals
x3 = 3. I am against animal cruelty
In this way we can reduce the number of regressors from 100 to 10, which gives
a more accurate prediction.
y = n + n1 z1 + n2 z2 + n3 z3 + ......... + n10 z10
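The idea of collapsing x1, x2 and x3 into a single z1 can be illustrated with a plain average. This is a deliberate simplification and not real factor analysis, which estimates how strongly each question relates to the factor from the correlations in the data. The answers below are hypothetical.

```python
# Hypothetical answers (1 to 5) to the three animal questions,
# one row per respondent.
answers = [
    {"x1": 5, "x2": 5, "x3": 4},
    {"x1": 2, "x2": 1, "x3": 2},
    {"x1": 4, "x2": 4, "x3": 5},
]


def attitude_towards_animals(row):
    """Combine x1, x2 and x3 into one score z1 by averaging them."""
    return (row["x1"] + row["x2"] + row["x3"]) / 3


z1 = [attitude_towards_animals(row) for row in answers]
print([round(z, 2) for z in z1])
```

Each respondent now contributes one value for this topic instead of three, which is how the regression shrinks from 100 regressors towards 10.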
Time series
Plotting values against time. Time is always on the x-axis.
Example for traditional methods
Example : User experience
Imagine you are the head of the user experience (UX) department of a web site
selling goods on a global scale.
So as the head of UX, your goal is to maximize user satisfaction.
Assume you have already designed and implemented a survey that measures the
attitude of your customers towards the latest global products you have
launched.
When you plot the survey data, as in the graph on the left side, we should do a
cluster analysis.
Once we find out there are 4 separate groups, it makes sense to run four
separate tests.
Machine learning
Creating an algorithm, which a computer then uses to find a model that fits the
data as well as possible and makes very accurate predictions based on that.
Machine learning algorithm - A trial-and-error process. Each consecutive trial
is at least as good as the previous one.
There are 4 ingredients:
Data
Model
Objective function - To measure the inaccuracy
Optimization algorithm - To improve
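The four ingredients can be seen together in a tiny example. The sketch assumes a one-parameter line as the model, mean squared error as the objective function and gradient descent as the optimization algorithm; the data points and the learning rate are made up.

```python
# 1. Data: hypothetical (x, y) pairs that roughly follow y = 2x.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]


# 2. Model: a line through the origin with one adjustable parameter b.
def model(b, x):
    return b * x


# 3. Objective function: mean squared error, which measures the inaccuracy.
def objective(b):
    return sum((model(b, x) - y) ** 2 for x, y in data) / len(data)


# 4. Optimization algorithm: gradient descent, which nudges b to lower the error.
def gradient(b):
    return sum(2 * (model(b, x) - y) * x for x, y in data) / len(data)


b = 0.0
learning_rate = 0.01  # assumed step size
losses = []
for _ in range(200):
    losses.append(objective(b))  # record the error of the current trial
    b -= learning_rate * gradient(b)

print(round(b, 3), round(objective(b), 4))
```

Each pass through the loop is one trial: the error is measured, then the parameter is adjusted so the next trial is no worse, which matches the trial-and-error description.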
Types of machine learning :
Supervised learning - This uses prior results; here the data is labeled.
Unsupervised learning - Here the data is unlabeled.
Reinforcement learning -