Statistics for Data Science -
CIMDS 51103
Instructor
◼ Lemma Ebssa, Ph.D.
Class email: [email protected]
Personal email: [email protected]
Claims:
◼ Several pictures in this lecture are adapted from
the World Wide Web.
◼ Data are obtained from R software, Kaggle,
DataCamp, and the reference book (Bruce & Bruce).
◼ R software is used to analyze the data.
Class Structure - Modules
1. Introduction and concepts
◼ Statistics, data analysis, data science
2. Data Exploration
◼ Descriptive statistics: summary, tabular, graphical; central tendency, variability
◼ Inferential statistics: sampling, population, hypothesis testing
3. Correlation and Regression
◼ Correlation
◼ OLS regression
◼ Generalized linear models: logistic regression, Poisson regression
◼ Variable selection and Model building (if time permits)
4. Introduction to Probability
◼ Definition and concepts of probability
◼ Set theory
◼ Conditional probability, independent events, Bayes’ theorem
◼ Random variables, expectation of random variables
5. Probability Distribution
◼ Probability distribution to describe data – Binomial, Poisson, Logistic, Normal
◼ Probability distribution for statistical test – Z-, t-, F-, chi-square distribution
◼ Data normalization/standardization
◼ Central Limit theorem
6. Other selected Topics (if time permits):
◼ A/B Testing
◼ Profiling (classification): Discriminant analysis, Cluster analysis, PCA, LCA
Class Structure
12 weeks (M, T, W, F):
5 Modules
4 Take-home assignments
◼ Assigned in classes 2, 4, 6, 8
◼ Due in classes 3, 5, 7, and 9
1 Final exam
◼ Assigned in class 10 -- return for final comments.
◼ * For all assignments and exams in this class,
copying from other sources or from one another
in any form is plagiarism, an academic theft,
and results in academic penalties!
References
Practical Statistics for Data Scientists,
by Peter Bruce and Andrew Bruce,
The Institute for Statistics Education
Probability and Statistics, Fourth Edition,
by Morris H. DeGroot and Mark J. Schervish,
Carnegie Mellon University
Introduction to R
by W. N. Venables, D. M. Smith
Course Outline and References
Week   Module                                                            Reference pages
1-3    1. Introduction to statistics concepts:                           93-129, 125-129
       statistical concepts, population and sample, data and
       variable types
4-5    2. Descriptive statistics:                                        26-67
       summary statistics (five-number), charts (histogram, boxplot,
       stem-and-leaf), tables (frequency); measures of location (mean,
       median, mode), measures of dispersion (standard deviation,
       variance, IQR, MAD)
6-7    3. Correlation and regression:                                    68-76, 231-267,
       Correlation: correlation coefficient (r), correlation matrix,     272-292
       scatter plot, scatter plot matrix; coefficient of
       determination (R^2).
       Linear regression: simple, multiple; fitting models, model
       selection, model assumptions, model diagnostics [generalized
       models: logistic, Poisson…]
8-9    4. Introduction to probability:                                   DeGroot and
       definition, set theory for data science, mutually exclusive or    Schervish
       collectively exhaustive events, conditional probability, Bayes'
       rule, random variables, mean of a random variable, Central
       Limit Theorem
10-12  5. Probability distributions:                                     130-169, 180-200,
       Normal, Binomial, Poisson, Logistic, z-, t-, F-, and              220-224
       chi-square distributions
Module
Introduction to Statistics
Basics of Statistics
Statistics is the science of collecting, analyzing,
presenting, and reasonably interpreting data.
Statistics provides a rigorous scientific method for gaining
insight into data. Looking at the raw weight measurements
of 100 patients in a study does not give an informative
account. However, graphical presentation or numerical
summarization of the measurements by the methods
of statistics can give an instant overall picture without
viewing individual data points. Furthermore, inferential
statistics may help to predict the weight of a similar
patient who is not in the current study and to assess the
relationships between different variables.
Data, record (observation), Variable
Data (plural or singular) refers to collected observations or
measurements, often obtained through research.
Variables are the characteristics or attributes that you are
observing, measuring, and recording data for.
A record is an entity (subject) on which different types of data are
gathered. In a given dataset, records are usually assumed to be
independent and identically distributed (iid in statistics). A dataset
can contain repeated entries for the same record that differ in one or
more variables, so counting the number of unique records is often needed.
In a structured dataset, rows are called observations/records,
and columns are called variables. Missing values do not affect
the structure of the dataset.
◼ A freshman student from Jimma was 172 cm tall, weighed 60 kg, had brown
eyes, was married, and had 1 child. She gained weight during the 4 years of college
and weighed 72 kg at graduation, while her other attributes remained the same.
Another student from Asayita joined the university single, with a height of 156 cm
and a weight of 58 kg, but grew taller during his college stay, reaching a height of
162 cm and a weight of 67 kg at graduation. He remained single the whole time.
◼ Create a dataset and distinguish records, data, and variables.
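One possible way to lay out the two students above as a dataset in R is sketched below. The column names, the long layout (one row per student per time point), and the use of NA for attributes the description does not give are assumptions made for illustration, not the only valid answer.

```r
# Hypothetical layout: one row per student per time point (long format).
# Column names are illustrative; NA marks values the description does not give.
students <- data.frame(
  id        = c(1, 1, 2, 2),
  origin    = c("Jimma", "Jimma", "Asayita", "Asayita"),
  time      = c("entry", "graduation", "entry", "graduation"),
  height_cm = c(172, 172, 156, 162),
  weight_kg = c(60, 72, 58, 67),
  eye_color = c("brown", "brown", NA, NA),
  married   = c(TRUE, TRUE, FALSE, FALSE),
  children  = c(1, 1, NA, NA)
)

str(students)                              # variables (columns) and their types
n_rows    <- nrow(students)                # number of rows (observations)
n_records <- length(unique(students$id))   # number of unique records (students)
n_rows
n_records
```

Here each column is a variable, each cell value is a data point, and the two unique ids are the records, each measured at two time points.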
Dataset
Data Organization, Management,
and Storage
Read the following articles:
Data Organization in Spreadsheets
https://doi.org/10.1080/00031305.2017.1375989
Everything a Data Scientist Should
Know About Data Management
Come prepared with a summary for discussion.
DS Tools for Data Analysis,
Visualization, and Management
R
SAS
Python
SQL
AWS
Tableau
MS Access and Excel (for smaller data)
Statistics, Data Analysis, Data
Science
Statistics: a deductive method of science concerned
with collecting, analyzing, interpreting, and presenting data
to understand, make decisions, and predict.
Data Analysis: the process of systematically applying
statistical and/or logical techniques to describe and
illustrate, condense and recap, and evaluate data.
Data Science: “the profession that uses scientific
methods to liberate and create meaning from raw data”.
◼ a multidisciplinary approach to extracting actionable insights from
the large and ever-increasing volumes of data collected and created
by today’s organizations (IBM, 2020).
Data Science
Bad Statistics if No ‘Good Data’
Data everywhere - Data, Data, Data, Data
But avoid garbage in to avoid garbage out.
Having some data in front of you does not
guarantee a sound decision just because you
made some tables or graphs and showed a
fancy PowerPoint!
Data scientists are often said to spend up
to 80% of their time cleaning data and only
20% analyzing it!
Collecting Data
Basic principles of data collection:
◼ data should be simple, valid, reliable, credible, and ethically OK to collect.
◼ plan the entire process of data selection, collection,
analysis, and use from the very beginning.
Two types of collected data:
◼ Structured rectangular data (spreadsheet, table) → easy
to process
◼ Unstructured text, images, audio, videos → cleaning-
and coding-intensive
Data Cleaning / Data Transformation
◼ Categorization – collapse data into fewer groups
◼ Reduction – combine several variables into one
◼ Standardization – e.g., square root to reduce variation, z-score
◼ Create structured data from unstructured data
Data Transformation Examples
DOB is given; create Age
Changing Age from numerical to
categorical
Create BMI from weight and height
To reduce variation (a remedy for
non-normally distributed data), compute
a new variable (e.g., with the Box-Cox method)
Impute missing data
Standardize data (create z-scores)
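The transformations listed above can be sketched in base R as follows. The data frame, its column names, the reference date, the age cut points, and the choice of median imputation are all assumptions made for illustration; the Box-Cox step is omitted to keep the sketch dependency-free.

```r
# Hypothetical records; all names and values are made up for illustration.
df <- data.frame(
  dob       = as.Date(c("1990-05-17", "2001-11-02", "1975-03-30", "1962-08-09")),
  weight_kg = c(60, 72, 85, NA),
  height_cm = c(172, 180, 168, 175)
)
ref_date <- as.Date("2024-01-01")  # assumed reference date for computing age

# DOB -> age in completed years (approximate: uses 365.25 days per year)
df$age <- floor(as.numeric(difftime(ref_date, df$dob, units = "days")) / 365.25)

# Numerical age -> categorical age group (cut points are arbitrary here)
df$age_group <- cut(df$age, breaks = c(0, 30, 45, 60, Inf),
                    labels = c("<=30", "31-45", "46-60", "60+"))

# Impute the missing weight with the median of the observed weights
df$weight_kg[is.na(df$weight_kg)] <- median(df$weight_kg, na.rm = TRUE)

# BMI = weight (kg) / height (m)^2
df$bmi <- df$weight_kg / (df$height_cm / 100)^2

# Standardize: z = (x - mean) / sd
df$bmi_z <- as.numeric(scale(df$bmi))
df
```

Median imputation is only one simple option; whether it is appropriate depends on why the values are missing.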
Structured Data Types
Numerical: continuous (ratio), interval, discrete
Categorical: binary, ordinal, nominal
◼ data type determines the type of visual display, data
analysis methods, or statistical model
A typical data structure in data science is a
rectangular matrix in which rows are records and
columns are variables (features/inputs).
Match variables to data type:
◼ Gender, student rank, smoking, # of insects,
temperature, weight
Typical Data Structure
Some Definitions of Data Values
Variable - any characteristic of an individual or entity. A variable can
take different values for different records. Variables can be
categorical or quantitative.
• Nominal - Categorical variables with no inherent order or ranking,
such as names or classes (e.g., gender). Values may be written as numerals but
carry no numerical meaning (e.g., I, II, III). The only operation that can be applied
to nominal variables is counting (enumeration).
• Ordinal - Variables with an inherent rank or order, e.g., mild, moderate, severe.
Values can be compared for equality, or for greater or less, but not by how much
greater or less.
• Interval - Values of the variable are ordered as in ordinal variables and, additionally,
differences between values are meaningful; however, the scale has no absolute
zero. Calendar dates and temperatures on the Fahrenheit scale are examples.
Addition and subtraction are meaningful operations, but multiplication and
division are not.
• Ratio (continuous) - Variables with all properties of interval variables plus an absolute,
non-arbitrary zero point, e.g., age, weight, temperature (Kelvin). Addition,
subtraction, multiplication, and division are all meaningful operations.
Data type: Levels of
Measurements
Attributes of Data Type
Attribute                                    Nominal  Ordinal  Discrete     Interval     Ratio
Sequence of levels                           –        Yes      Yes          Yes          Yes
Possible value between successive values     –        –        –            Yes          Yes
Mode                                         Yes      Yes      Yes          Yes          Yes
Median                                       –        Yes      Yes          Yes          Yes
Mean                                         –        –        –            Yes          Yes
Difference between levels can be evaluated   –        –        Yes          Yes          Yes
Addition and subtraction of variables        –        –        Yes          Yes          Yes
Multiplication and division of variables     –        –        –            –            Yes
Example                                      color    rating   # of people  temperature  weight
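As a rough illustration of how these levels of measurement map onto R objects, the sketch below uses an unordered factor for a nominal variable, an ordered factor for an ordinal one, and plain numeric vectors for interval and ratio variables. All values and variable names are invented for the example.

```r
# Illustrative values only.
eye_color <- factor(c("brown", "blue", "brown", "green"))          # nominal
severity  <- factor(c("mild", "severe", "moderate", "mild"),
                    levels = c("mild", "moderate", "severe"),
                    ordered = TRUE)                                 # ordinal
temp_f    <- c(50, 68, 86)                                          # interval
weight_kg <- c(60, 72, 58)                                          # ratio

# Nominal: counting and the mode are meaningful
counts   <- table(eye_color)
mode_eye <- names(counts)[which.max(counts)]

# Ordinal: order comparisons are meaningful, "how much" is not
is_less <- severity[1] < severity[2]   # is "mild" less than "severe"?
worst   <- max(severity)

# Interval: differences are meaningful
temp_diff <- diff(temp_f)

# Ratio: ratios are meaningful as well
weight_ratio <- weight_kg[2] / weight_kg[1]

mode_eye; is_less; worst; temp_diff; weight_ratio
```

Declaring the right type up front lets R refuse operations that do not make sense, for example comparing the levels of an unordered factor.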
Era of Data: Big Data
Extremely large data sets that may
be analyzed computationally to reveal
patterns, trends, and associations,
especially relating to human behavior
and interactions.
An accumulation of data that is too
large and complex for processing by
traditional database management
tools (Webster Dictionary)
Size of Big data
Big data includes unstructured, semi-structured,
and structured data. Big data "size" is a
constantly moving target, ranging from a few
dozen terabytes to many zettabytes of data.
Automated data collection adds to the world's
data every day, while the cost of storing these
huge data sets and the time needed to process
them keep decreasing.
Data Scientists for Big data
Businesses are highly motivated to base their
daily decisions on data (data-driven
decisions).
Scientific knowledge for processing these
ever-increasing data volumes is also growing.
But human effort is still required to process
and make sense of such zettabyte-scale
information; this is where data scientists
are needed.
Data-Computer-Science
❖Explosion of data volume
❖Computer processing power
❖Knowledge in data science
❖Calling for changes in everyday
decision-making process
➔Need for Data Scientist
Categories of Statistics
Descriptive statistics describe the
characteristics of a set of data. E.g.,
employment rate of new college
graduates over the past five years, a
graph of students' birthday months in a
kindergarten class.
Inferential statistics provide a way to
draw conclusions and predictions about a
population based on data provided by a
sample of the population being studied.
Module
Descriptive Statistics
Descriptive Statistics
Aim to describe a mass of raw data using summary statistics,
graphs, and tables.
Allow us to understand a group of data much more quickly and
easily than by staring at rows of raw data.
Summary statistics: single-value summaries (e.g., the five-number summary)
◼ Measures of central tendency: these numbers describe
where the center of a dataset is located: mean, median,
mode
◼ Measures of dispersion: how spread out the values are in
the dataset: range, interquartile range, standard deviation,
and variance
Graphs: to quickly visualize and assess data.
Boxplots, histograms, stem-and-leaf plots, scatterplots
Tables: to understand how data are distributed, e.g., a frequency table.
Descriptive statistics (Point estimates)
Three ways of describing data:
◼ Summary statistics: e.g., five-numbers for continuous data
◼ Charts: Boxplot for continuous data per group
◼ Tables: frequency table for categorical data
Tools of Descriptive Statistics in a dataset:
◼ Mean: arithmetic or geometric average
◼ Median: the center number; Quartiles (Q1, Q3)
◼ Mode: most frequently occurring number
◼ Range: difference between the highest and the lowest value; IQR
◼ Standard deviation: typical distance of an observation from
the overall mean. Standard error, Variance
◼ Correlation coefficient: strength of the linear relationship between two variables
◼ Skewness; kurtosis: asymmetry and peakedness of the distribution
Skewness
Measures asymmetry of data
◼ Positive or right skewed: Longer right tail
◼ Negative or left skewed: Longer left tail
Let $x_1, x_2, \ldots, x_n$ be $n$ observations with mean $\bar{x}$. Then,
$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$$
Kurtosis
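Measures the peakedness of the distribution of
data. With the definition below (which subtracts 3),
the kurtosis of a normal distribution is 0.
Let $x_1, x_2, \ldots, x_n$ be $n$ observations with mean $\bar{x}$. Then,
$$\text{Kurtosis} = \frac{n\,\sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}} - 3$$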
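The skewness and kurtosis definitions on these slides can be tried out with two small helper functions. This is a minimal sketch that assumes the moment-based forms (a factor of sqrt(n) in the skewness numerator, and 3 subtracted in the kurtosis); the function names are hypothetical, and add-on packages may use slightly different sample corrections.

```r
# Moment-based skewness and (excess) kurtosis; helper names are illustrative.
skewness <- function(x) {
  d <- x - mean(x)
  sqrt(length(x)) * sum(d^3) / sum(d^2)^(3 / 2)
}
kurtosis <- function(x) {
  d <- x - mean(x)
  length(x) * sum(d^4) / sum(d^2)^2 - 3
}

sym   <- c(1, 2, 3, 4, 5)     # symmetric data
right <- c(20, 30, 40, 990)   # one extreme value on the right

skewness(sym)     # 0 for perfectly symmetric data
kurtosis(sym)     # negative: flatter than a normal distribution
skewness(right)   # positive: longer right tail
```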
Mean or Median
The median is less sensitive to outliers (extreme scores)
than the mean and thus a better measure than the mean
for highly skewed distributions, e.g., family income. For
example, mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 =270. The median of these four
observations is (30+40)/2 =35. Here 3 observations out of
4 lie between 20-40. So, the mean 270 really fails to give
a realistic picture of the major part of the data. It is
influenced by extreme value 990.
Notation: Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable
$x$. Then the mean of this variable is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
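The mean-versus-median example above can be checked directly in R. The last line, which drops the extreme value, is an addition for illustration.

```r
x <- c(20, 30, 40, 990)

mean_x   <- mean(x)     # (20 + 30 + 40 + 990) / 4
median_x <- median(x)   # (30 + 40) / 2
mean_x
median_x

# Without the extreme value, the mean and the median agree
mean(x[x < 990])
median(x[x < 990])
```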
Standard Deviation or Variance
Variance: The variance of a set of observations is (approximately) the
average of the squared deviations of the observations from their
mean; for a sample, the sum of squared deviations is divided by n − 1.
In symbols, the variance of the n observations x1, x2, …, xn is
$$S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance
is
$$\frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = 4$$
Standard Deviation: the square root of the variance. The
standard deviation of the above example is 2.
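The same worked example in R, computed once by hand from the definition and once with the built-in functions (which also divide by n − 1):

```r
x <- c(5, 7, 3)

dev       <- x - mean(x)                     # deviations from the mean: 0, 2, -2
s2_manual <- sum(dev^2) / (length(x) - 1)    # sample variance from the definition

s2_manual
var(x)   # built-in sample variance
sd(x)    # square root of the variance
```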
Quartiles, Deciles, or Percentiles
Quartiles: Data can be divided into four regions that cover the total
range of observed values. The cut points for these regions are known as
quartiles.
In notation, the q-th quartile is the value at position q(n+1)/4 in the
ordered data, where q is the desired quartile (1, 2, or 3) and n is the
number of observations.
The first quartile (Q1) is the cut point below which the first 25% of the
data lie. The second quartile (Q2) is the median, the cut point below
which 50% of the data lie. The third quartile (Q3) is the 75% cut point,
so 25% of the data lie between the median and Q3.
Q1 is the median of the first half of the ordered observations and Q3 is
the median of the second half of the ordered observations.
Deciles: If data are ordered and divided into 10 parts, the cut
points are called deciles.
Percentiles (Quantiles): If data are ordered and divided into
100 parts, the cut points are called percentiles. The 25th
percentile is Q1, the 50th percentile is the median (Q2), and
the 75th percentile is Q3.
In notation, the p-th percentile is the value at position p(n+1)/100
in the ordered data, where p is the desired percentile and n
is the number of observations.
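The (n + 1) position rule can be sketched as a small function. The slides do not say what to do when the position is not a whole number, so linear interpolation between the two neighbouring values is assumed here; the function name is hypothetical. R's quantile() offers several rules, and type = 6 follows the same (n + 1) convention, while the default (type 7) can give different answers on small samples.

```r
# Value at position p * (n + 1) / 100 in the ordered data, with linear
# interpolation between neighbours when the position is not a whole number.
percentile_pos <- function(x, p) {
  x   <- sort(x)
  n   <- length(x)
  pos <- min(max(p * (n + 1) / 100, 1), n)   # keep the position inside 1..n
  lo  <- floor(pos)
  hi  <- ceiling(pos)
  x[lo] + (pos - lo) * (x[hi] - x[lo])
}

x <- c(2, 5, 8, 11, 13)

q_manual  <- sapply(c(25, 50, 75), function(p) percentile_pos(x, p))
q_type6   <- quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)  # same (n + 1) rule
q_default <- quantile(x, probs = c(0.25, 0.50, 0.75))            # R's default (type 7)

q_manual
q_type6
q_default
```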
Mean & Median in Skewed Data
Skewness and Kurtosis
Describing a Given Dataset
❑ Using center of location
❑ Using variability of distribution
❑ Inclination to one side (large/small,
left/right, negative/positive; extreme
values/outlier?)
Description G1 G2 G3
Mean, Median < Mean Median Median =
median Mean Mean
SD, IQR Small IQR; medium Large SD
outlier affect SD, IQR but no
Small SD outlier
skewness negative/left left no
kurtosis outlier no no
Statistical Formulas, Attributes
Complete the table on your own for other parameters.
Parameters                                                  Indicates
Mean, median, mode                                          Location
SD, variance, skewness, kurtosis, quartiles, range, IQR,    Variability / dispersion
mean absolute deviation, percentiles, coefficient of
variation (CV)
Correlation coefficient                                     Relationship
Boxplots
A boxplot (also called a box-and-whisker
plot) is a plot that shows the five-number
summary (descriptive statistics) of a dataset.
The five-number summary includes:
• The minimum, The first quartile (Q1), The median,
The third quartile (Q3), The maximum
A boxplot allows us to easily visualize the
distribution of values in a dataset using one
simple plot.
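A minimal sketch of the five-number summary and its boxplot in base R, using made-up values. Note that fivenum() reports Tukey's hinges for Q1 and Q3, which can differ slightly from the quartiles returned by quantile().

```r
x <- c(12, 14, 18, 22, 22, 23, 25, 25, 28, 45, 47, 48)  # illustrative values

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
names(five) <- c("min", "Q1", "median", "Q3", "max")
five

# The box spans Q1 to Q3 and the line inside the box marks the median
boxplot(x, horizontal = TRUE, main = "Boxplot of x")
```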
Examples
Find: Mean, SD (variance), IQR,
MAD (median absolute deviation),
mode
Boxplot
Visualizing Data
Data science deals with complex data: hundreds of columns and
hundreds of thousands (or even millions or billions) of rows.
◼ Graphs quickly and effectively describe data and compare groups.
Thus, visualization is the way to go.
Business Intelligence (BI) analytic tools, e.g., Power BI and Tableau, slice
and dice data into smaller groupings.
Boxplots
Stem and Leaf Plots
Scatterplots
Relative Frequency Histogram
Density plots
Central and Dispersal of Data
Questions
When do you prefer the mean to the
median, or the IQR to the SD?
What is the difference in importance
between SD and variance?
IQR or range?
SD vs. standard error of the mean
R package and Class Data
Install R on your computer
Download the data and code from the book website:
https://github.com/gedeck/practical-statistics-for-data-scientists
Extract the zip file to your local drive
Set up your working directory, e.g.,
◼ > setwd("C:\\...\\practical-statistics-for-data-scientists-master\\data")
Import the data into R, e.g., state <- read.csv('state.csv')
Open the R code files in R and copy any part to run (or copy any
code from the GitHub site), e.g., mean(state[['Population']])
References for R Programming
RStudio101
Rstudio-IDE-cheatsheet
Quick R-Compiled
UC Business Analytics R
Programming Guide2
cheat-sheet dplyr
ggplot2-cheatsheet
A few R commands
# DATA is a data frame; x, VAR1, VAR2 are numeric columns; CAT1, CAT2, GROUP1 are categorical columns
> summary(DATA)
> STAT(DATA$x)   # STAT can be mean, sd, var, min, max, median, range, ...
> sapply(DATA, STAT, na.rm = TRUE)   # apply STAT to every column (columns must be numeric)
◼ > tapply(DATA$VAR1, DATA$GROUP1, STAT)   # STAT of VAR1 within each group
◼ > tapply(DATA$VAR1, DATA$GROUP1, quantile, probs = c(0.25, 0.50, 0.75))
> boxplot(DATA$VAR1, ylab = 'abc'); boxplot(DATA$VAR1 ~ DATA$GROUP1)
> hist(DATA$VAR1, breaks = n, xlim = c(min, max), xlab = 'abc')   # n, min, max: numbers you choose
> tab1 <- table(DATA$CAT1, DATA$CAT2)   # two-way frequency table
◼ > chisq.test(tab1)
◼ > gmodels::CrossTable(DATA$CAT1, DATA$CAT2)
> DATA_NUM <- dplyr::select(DATA, -c(CAT1, CAT2))   # keep only the numeric columns
> pairs(DATA_NUM)   # scatterplot matrix
> hist(DATA$VAR1, freq = FALSE)
> lines(density(DATA$VAR1), lwd = 3, col = "blue")   # overlay a density curve
> cor(DATA$VAR1, DATA$VAR2)
> cor(DATA_NUM)
> library(corrplot)
> corrplot(cor(DATA_NUM), method = "number")   # method has other options as well
> library(ggplot2)
> ggplot(data = DATA) +
    geom_point(mapping = aes(x = VAR1, y = VAR2, color = CAT1))
> car::vif(model)   # model is a fitted regression, e.g., from lm()
Visualization
Practice Question
What does the line in the center of the box
represent: the mean or the median?
For skewed data, which is the center of gravity
of the boxplot: the mean or the median?
In Excel, draw a boxplot of 2, 5, 8, 11, 13 and
calculate the mean, median, Q1, and Q3.
Change the last number (13) to 200, re-graph
and recalculate the parameters, and describe
the change.
Add one more observation (10), keep 200, and
re-graph the boxplot. Describe the change.
Stem and Leaf Plots
A stem-and-leaf plot displays data by
splitting up each value in a dataset into a
“stem” and a “leaf.” The “leaf” of each
value is the last digit.
Example 1: 12, 14, 18, 22, 22, 23, 25, 25, 28, 45, 47, 48
◼ 1 | 2 4 8
  2 | 2 2 3 5 5 8
  3 |
  4 | 5 7 8
Example 2: 134, 156, 158, 159, 160, 162, 164
◼ 13 | 4
  14 |
  15 | 6 8 9
  16 | 0 2 4
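Base R draws stem-and-leaf plots with stem(). The sketch below runs it on the two examples and also splits Example 1 into stems and leaves by hand with integer division. stem() chooses its own stem width, so its printed layout may not match a hand-drawn plot exactly.

```r
ex1 <- c(12, 14, 18, 22, 22, 23, 25, 25, 28, 45, 47, 48)
ex2 <- c(134, 156, 158, 159, 160, 162, 164)

stem(ex1)   # prints a text stem-and-leaf display
stem(ex2)

# By hand for Example 1: stem = all digits but the last, leaf = last digit
stems  <- ex1 %/% 10
leaves <- ex1 %% 10
by_stem <- split(leaves, stems)
by_stem
```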
Scatterplots
Scatterplots are used to display the
relationship between two variables (bivariate analysis).
Interpreting Scatterplots:
◼ relationship (positive, negative, none)
◼ strength (weak, strong)
Scatterplot matrix: a grid of scatter plots to
visualize/explore bivariate relationships in a
single chart.
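A scatterplot, a correlation coefficient, and a scatterplot matrix can be produced in base R as sketched below. The built-in mtcars data set is used only as a stand-in, since it ships with R; the choice of columns is arbitrary.

```r
cars <- mtcars[, c("mpg", "wt", "hp")]   # three numeric columns from a built-in data set

# Single scatterplot: heavier cars tend to have lower mpg (negative relationship)
plot(cars$wt, cars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
r <- cor(cars$wt, cars$mpg)   # strength and direction of the linear relationship
r

# Scatterplot matrix and the matching correlation matrix
pairs(cars)
cor_mat <- cor(cars)
round(cor_mat, 2)
```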
Interpret:
Correlation of body
parts of penguins Spp.
Frequency Table
Usually used to describe data of categorical (non-
numerical) variable, e.g., gender, marital status…
Frequencies simply tell us how many times a certain
event has occurred for each level of the category.
Can be expressed in percentage of all observations,
of columns, or of rows.
A contingency table shows distribution of one
variable in rows and of another in columns to study
the association between the variables.
A Chi-square or Fisher's exact test assesses the statistical
significance of the association.
Given Long Categorical Data
ID    Age   Gender
1     50    m
2     25    m
3     33    F
…
300   18    3
Create a categorical age group and create a summary frequency table of
the age category and gender.
Question:
- Overall difference of counts of gender, age group
- Percent of different age groups with a given gender, or difference
in percent of gender with a given age group
- Is there any association between age group and gender?
- In the population, what is the highest group of the society?
Categorical data analysis is a bivariate analysis
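Going from long data (one row per person) to an age-group-by-gender frequency table could look like the sketch below. The eight rows, the column names, and the age cut points are invented for illustration and are not the class data.

```r
# Hypothetical long data: one row per person (values invented for illustration).
long <- data.frame(
  id     = 1:8,
  age    = c(50, 25, 33, 18, 62, 41, 17, 29),
  gender = c("m", "m", "f", "f", "f", "m", "f", "m")
)

# Numerical age -> categorical age group (cut points are an assumption)
long$age_group <- cut(long$age, breaks = c(0, 18, 30, 45, 60, Inf),
                      labels = c("<=18", "19-30", "31-45", "46-60", "60+"))

# Two-way frequency table of age group by gender, then with margins added
freq <- table(long$age_group, long$gender)
freq
addmargins(freq)
```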
Frequency Counts and Percent (cell, marginal (column, row)) of Age Group by Gender

Counts
Age         male     female   Total  R.marginal
<= 18       50       60       110    110/295
19-30       40       40       80     80/295
35-45       20       30       50     50/295
46-60       10       20       30     30/295
60+         5        20       25     25/295
Total       125      170      295    295/295
C.marginal  125/295  170/295  100%

Row percent
Age         male     female   Total  R.marginal
<= 18       50/110   60/110   110    110/295
19-30       40/80    40/80    80     80/295
35-45                         50     50/295
46-60                         30     30/295
60+                           25     25/295
Total       125      170      295    295/295
C.marginal  125/295  170/295  100%

Cell percent
Age         male     female   Total  R.marginal
<= 18       50/295   60/295   110    110/295
19-30       40/295   40/295   80     80/295
35-45                         50     50/295
46-60                         30     30/295
60+                           25     25/295
Total       125      170      295    295/295
C.marginal  125/295  170/295  100%

Column percent
Age         male     female   Total  R.marginal
<= 18       50/125   60/170   110    110/295
19-30       40/125   40/170   80     80/295
35-45                         50     50/295
46-60                         30     30/295
60+                           25     25/295
Total       125      170      295    295/295
C.marginal  125/295  170/295  100%
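The cell, row, and column percentages, together with a chi-square test of association, can be computed from the count table with prop.table() and chisq.test(). The sketch below types in the counts from the table above; the object names are illustrative.

```r
# Counts of age group by gender, entered column by column (male, then female)
counts <- matrix(c(50, 40, 20, 10, 5,
                   60, 40, 30, 20, 20),
                 ncol = 2,
                 dimnames = list(age    = c("<=18", "19-30", "35-45", "46-60", "60+"),
                                 gender = c("male", "female")))

addmargins(counts)                           # counts with row and column totals

cell_pct <- prop.table(counts)               # each cell / grand total (295)
row_pct  <- prop.table(counts, margin = 1)   # each cell / its row total
col_pct  <- prop.table(counts, margin = 2)   # each cell / its column total
round(100 * row_pct, 1)

assoc <- chisq.test(counts)                  # test of association between age group and gender
assoc
```

Row percentages answer "within an age group, what share is male or female?", while column percentages answer "within a gender, what share falls in each age group?".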
Relative vs. Cumulative Frequency Distribution
Age                 male     female   Relative  Cumulative  Relative  Cumulative
                                      Count     Count       Percent   Percent
<= 18               50       60       110       110         110/295   110/295
19-30               40       40       80        190         80/295    190/295
35-45               20       30       50        240         50/295    240/295
46-60               10       20       30        270         30/295    270/295
60+                 5        20       25        295         25/295    295/295
Relative Count      125      170      295                   295/295
Cumulative Count    125      295
Relative Percent    125/295  170/295
Cumulative Percent  125/295  295/295
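Cumulative counts and percentages are running totals, which cumsum() provides. A minimal sketch using the age-group totals from the table above (the object and column names are illustrative):

```r
# Age-group totals (male + female) from the table above
total <- c("<=18" = 110, "19-30" = 80, "35-45" = 50, "46-60" = 30, "60+" = 25)

freq_tab <- data.frame(
  count     = total,
  cum_count = cumsum(total),                                # running total of counts
  pct       = round(100 * total / sum(total), 1),           # relative percent
  cum_pct   = round(100 * cumsum(total) / sum(total), 1)    # cumulative percent
)
freq_tab
```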
Density Plots (Kernel density plot)
Shows the distribution of a variable over a
continuous interval.
A variation of the histogram that is not
affected by the number of bins (so it better
shows the distribution shape)
Estimates a smoothed curve from the data
to smooth out noise
Normal distribution curves are an
example of density plots.
Where is the most frequent value, the mean,
or the median located? What is that value?
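A histogram on the density scale with a kernel density curve on top can be drawn in base R as sketched below. The built-in faithful data set is used only as a convenient stand-in, and the number of breaks is an arbitrary choice.

```r
x <- faithful$eruptions   # built-in data: eruption durations in minutes

# Histogram on the density scale, so that the bar areas sum to 1
hist(x, freq = FALSE, breaks = 20,
     main = "Histogram with kernel density", xlab = "Eruption time (min)")

d <- density(x)                    # kernel density estimate (default bandwidth)
lines(d, lwd = 2, col = "blue")    # overlay the smoothed curve

peak <- d$x[which.max(d$y)]        # location of the highest point of the curve
area <- sum(d$y) * diff(d$x[1:2])  # area under the curve, approximately 1
peak
area
```

The location of the peak is the most frequent region of the data; unlike the bars, the curve's shape depends on the smoothing bandwidth rather than on bin edges.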
Questions
What is the difference between
frequency tables and percentiles
(quartiles) in describing data and
when do you use them?
What are the similarities and differences
between histograms and density plots?
R
RStudio Screen vs. base R
Package, library, function
◼ Install R packages with install.packages() and load
them with library(). Each package provides a set of
functions, and each function performs a specific task.
Working directory
Importing data (external, internal)
Stat (five-number, mean, sd, var, etc.)
Graphics (boxplot, hist, density)
Frequency