Lecture1 2

The document outlines a course on Statistics for Data Science, taught by Dr. Lemma Ebssa, covering key statistical concepts, data exploration, correlation, regression, and probability. It includes a structured class format with modules, assignments, and a final exam, emphasizing the importance of data collection and cleaning in data science. References for further reading and a detailed course outline are also provided.

Uploaded by

Mohamed Romance

Statistics for Data Science - CIMDS 51103
 Instructor
◼ Lemma Ebssa, Ph.D.
 Class email: [email protected]
 Personal email: [email protected]
 Claims:
◼ Several pictures in this lecture are adopted from
the World Wide Web.
◼ Data are obtained from R software, Kaggle,
datacamp, Reference book (Bruce & Bruce)
◼ R software is used to analyze data
Class Structure - Modules
 1. Introduction and concepts
◼ Statistics, data analysis, data science
 2. Data Exploration
◼ Descriptive statistics: summary, tabular, graphical; central tendency, variability
◼ Inferential statistics: sampling, population, hypothesis testing
 3. Correlation and Regression
◼ Correlation
◼ OLS regression
◼ Generalized linear models: logistic regression, Poisson regression
◼ Variable selection and Model building (if time permits)
 4. Introduction to Probability
◼ Definition and concepts of probability
◼ Set theory
◼ Conditional probability, independent events, Bayes' theorem
◼ Random variables, expectation of random variables
 5. Probability Distribution
◼ Probability distribution to describe data – Binomial, Poisson, Logistic, Normal
◼ Probability distribution for statistical test – Z-, t-, F-, chi-square distribution
◼ Data normalization/standardization
◼ Central Limit theorem
 6. Other selected Topics (if time permits):
◼ A/B Testing
◼ Profiling (classification): Discriminant analysis, Cluster analysis, PCA, LCA
Class Structure
 12 weeks (M, T, W, F):
 5 Modules
 4 Take-home assignments
◼ Assigned in classes 2, 4, 6, 8
◼ Due in classes 3, 5, 7, 9
 1 Final exam
◼ Assigned in class 10 -- return for final comments.
◼ * For all assignments and exams in this class,
copying from other sources or from one another
in any form is plagiarism, a form of academic
theft, and results in academic penalties!
References
Practical Statistics for Data Scientists,
by Peter Bruce and Andrew Bruce,
The Institute for Statistics Education
Probability and Statistics, Fourth Edition,
by Morris H. DeGroot and Mark J. Schervish,
Carnegie Mellon University
An Introduction to R,
by W. N. Venables, D. M. Smith, and the R Core Team
Course Outline and References
Week | Module | Reference pages
1-3 | 1. Introduction to statistics concepts: statistical concepts, population and sample, data and variable types | 93-129, 125-129
4-5 | 2. Descriptive statistics: summary statistics (five-number), charts (histogram, boxplot, stem-and-leaf), tables (frequency); measures of location (mean, median, mode), measures of dispersion (standard deviation, variance, IQR, MAD) | 26-67
6-7 | 3. Correlation and regression. Correlation: correlation coefficient (r), correlation matrix, scatter plot, scatter plot matrix; coefficient of determination (R^2). Linear regression: simple, multiple; fitting models, model selection, model assumptions, model diagnostics [generalized models: logistic, Poisson…] | 68-76, 231-267, 272-292
8-9 | 4. Introduction to probability: definition, set theory for data science, mutually exclusive and collectively exhaustive events, conditional probability, Bayes' rule, random variables, mean of a random variable, Central Limit Theorem | DeGroot and Schervish
10-12 | 5. Probability distributions: Normal, Binomial, Poisson, Logistic, z-, t-, F-, chi-square distributions | 130-169, 180-200, 220-224
Module
Introduction to Statistics
Basics of Statistics
 Statistics is the science of collecting, analyzing,
presenting, and reasonably interpreting data.
 Statistics provides a rigorous scientific method for gaining
insight into data. Looking at the raw weight measurements of 100
patients in a study does not give an informative
account. However, a graphical presentation or numerical
summary of the measurements by the methods
of statistics can give an instant overall picture without
viewing individual data points. Furthermore, inferential
statistics can help predict the weight of a similar
patient who is not in the current study and assess the
relationship between different variables.
Data, record (observation), Variable
 Data (plural or singular) refers to collected observations or
measurements, often obtained through research.
 Variables are the characteristics or attributes that you
observe, measure, and record data for.
 A record is an entity (subject) on which different types of data are
gathered. In a given dataset, records are usually assumed to be independent
and identically distributed (iid in statistics). A dataset can
contain repeated records for the same entity that differ in one or more variables,
so counting the number of unique records is often needed.
 In a structured dataset, rows are called observations/records,
and columns are called variables. Missing values do not affect
the structure of the dataset.
◼ A freshman student from Jimma was 172 cm tall, weighed 60 kg, had brown
eyes, was married, and had 1 child. She gained weight during her 4 years of college and
weighed 72 kg at graduation, while her other characteristics remained the same. Another student,
from Asayita, joined the university single, with a height of 156 cm and a weight of
58 kg; he grew taller during his college stay, reaching a height of 162 cm and a weight of
67 kg at graduation. He remained single the whole time.
◼ Create a dataset and distinguish record, data, and variables.
Dataset
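One possible layout for the two students described in the exercise, as a minimal sketch in R. The column names, the one-row-per-student-per-time-point layout, and the use of NA for details the description does not give (the second student's eye color and number of children) are assumptions for illustration, not the only valid answer.

```r
# Each row is a record (one student at one time point); each column is a variable;
# each cell holds one piece of data.
students <- data.frame(
  id             = c(1, 1, 2, 2),
  origin         = c("Jimma", "Jimma", "Asayita", "Asayita"),
  sex            = c("F", "F", "M", "M"),
  time           = c("entry", "graduation", "entry", "graduation"),
  height_cm      = c(172, 172, 156, 162),
  weight_kg      = c(60, 72, 58, 67),
  eye_color      = c("brown", "brown", NA, NA),   # not stated for student 2
  marital_status = c("married", "married", "single", "single"),
  children       = c(1, 1, NA, NA),               # not stated for student 2
  stringsAsFactors = FALSE
)

n_records         <- nrow(students)               # rows in the dataset
n_variables       <- ncol(students)               # columns in the dataset
n_unique_students <- length(unique(students$id))  # unique entities

print(students)
```

The missing values stay in the table as NA, so the rectangular structure is unchanged, and counting unique ids separates the number of students from the number of rows.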

Data Organization, Management,
and Storage
Read the following articles.
 Data Organization in Spreadsheets
https://doi.org/10.1080/00031305.2017.1375989
 Everything a Data Scientist Should
Know About Data Management
(summary)
 Come for summary and discussion

DS Tools for Data Analysis,
Visualization, and Management
 R
 SAS
 Python
 SQL
 AWS
 Tableau
 MS Access and Excel (for smaller data)
Statistics, Data Analysis, Data
Science
 Statistics: a deductive method of science concerned
with collecting, analyzing, interpreting, and presenting data
in order to understand, make decisions, and predict.
 Data Analysis: the process of systematically applying
statistical and/or logical techniques to describe and
illustrate, condense and recap, and evaluate data.
 Data Science: “the profession that uses scientific
methods to liberate and create meaning from raw data”.
◼ a multidisciplinary approach to extracting actionable insights from
the large and ever-increasing volumes of data collected and created
by today’s organizations (IBM, 2020).
Data Science
Bad Statistics if No ‘Good Data’
 Data everywhere - Data, Data, Data, Data
 But avoid garbage in to avoid garbage out.
 Having some data in front of
you does not guarantee a sound
decision just because you made some tables or graphs
and showed a fancy PowerPoint!
 Data scientists are often said to spend up to 80% of their
time cleaning data and only 20% analyzing
it!
Collecting Data
 Basic principles of data collection:
◼ simple data, valid, reliable, credible, ethically OK to collect.
◼ planning the entire process of data selection, collection,
analysis and use from very beginning.
 Two types of collected data:
◼ Structured rectangular data (spreadsheet, table) → easy
to process
◼ Unstructured text, images, audio, videos →cleaning,
coding intensive
 Data cleaning / data transformation:
◼ Categorization – collapse data into fewer groups
◼ Reduction – combine several variables into one
◼ Standardization – square root to reduce variation, z-score
◼ Create structured data from unstructured data
Data Transformation Examples
 DOB is given; create Age
 Changing Age from numerical to
categorical
 Create BMI from weight and height
 To minimize variation (a remedy for
non-normally distributed data), compute
a new variable (use Box-Cox method)
 Impute for missing data
 Standardize data (create z-normal)
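Several of the transformations listed above can be sketched in base R as follows. The data values, the reference date, the age-group cut points, and all column names are hypothetical and chosen only for illustration.

```r
# Hypothetical records; dates and measurements are made up
people <- data.frame(
  dob       = as.Date(c("1990-05-14", "2001-11-02", "1975-01-30", "1988-07-21")),
  weight_kg = c(60, 72, NA, 85),
  height_cm = c(172, 156, 180, 168)
)
ref_date <- as.Date("2024-01-01")   # assumed date on which age is computed

# DOB -> age in completed years (365.25 days per year is an approximation)
age_days   <- as.numeric(difftime(ref_date, people$dob, units = "days"))
people$age <- floor(age_days / 365.25)

# Numerical age -> categorical age group
people$age_group <- cut(people$age,
                        breaks = c(0, 18, 30, 45, 60, Inf),
                        labels = c("<=18", "19-30", "31-45", "46-60", "60+"))

# Impute the missing weight with the median of the observed weights
people$weight_imp <- ifelse(is.na(people$weight_kg),
                            median(people$weight_kg, na.rm = TRUE),
                            people$weight_kg)

# BMI from weight (kg) and height (m)
people$bmi <- people$weight_imp / (people$height_cm / 100)^2

# Standardize height to z-scores (mean 0, SD 1)
people$height_z <- as.numeric(scale(people$height_cm))

print(people)
```

Median imputation is only one of many imputation strategies. The Box-Cox transformation is not shown here; MASS::boxcox is one commonly used implementation.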
Structured Data Types
 Numerical- continuous (ratio), interval, discrete
 Categorical- Binary, ordinal, nominal
◼ data type determines the type of visual display, data
analysis methods, or statistical model
 A typical data structure in data science is a
rectangular matrix in which rows are records and
columns are variables (features/inputs).
 Match variables to data type:
◼ Gender, student rank, smoking, # of insects,
temperature, weight
Typical Data Structure
Some Definitions of Data Values
Variable - any characteristic of an individual or entity. A variable can
take different values for different entries. Variables can be
categorical or quantitative.
• Nominal - Categorical variables with no inherent order or ranking sequence,
such as names or classes (e.g., gender). Values may be coded with numbers or numerals
(e.g., I, II, III), but these codes carry no numerical meaning. The only operations that can be
applied to nominal variables are counting and testing for equality.
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe.
Can be compared for equality, or greater or less, but not how much greater or
less.
• Interval - Values of the variable are ordered as in Ordinal, and additionally,
differences between values are meaningful, however, the scale is not absolutely
anchored. Calendar dates and temperatures on the Fahrenheit scale are examples.
Addition and subtraction, but not multiplication and division are meaningful
operations.
• Ratio (continuous)- Variables with all properties of Interval plus an absolute,
non-arbitrary zero point, e.g. age, weight, temperature (Kelvin). Addition,
subtraction, multiplication, and division are all meaningful operations.
Data type: Levels of
Measurements
Attributes of Data Type
Attributes | Nominal | Ordinal | Discrete | Interval | Ratio
Sequence of levels | – | Yes | Yes | Yes | Yes
Possible value between successive values | – | – | – | Yes | Yes
Mode | Yes | Yes | Yes | Yes | Yes
Median | – | Yes | Yes | Yes | Yes
Mean | – | – | – | Yes | Yes
Difference between levels can be evaluated | – | – | Yes | Yes | Yes
Addition and subtraction of variables | – | – | Yes | Yes | Yes
Multiplication and division of variables | – | – | – | – | Yes
Example | color | rating | # of people | temperature | weight
Era of Data: Big Data
 Extremely large data sets that may
be analyzed computationally to reveal
patterns, trends, and associations,
especially relating to human behavior
and interactions.
 An accumulation of data that is too
large and complex for processing by
traditional database management
tools (Webster Dictionary)
Size of Big data
 Big data includes unstructured, semi-structured,
and structured data. Big data
"size" is a constantly moving target, ranging from a
few dozen terabytes to many zettabytes
of data.
 Automated data collection floods the
world with data every day, while the cost of storing
these huge data sets keeps falling and the speed of processing
them keeps rising.
Data Scientists for Big data
 Businesses are highly motivated to base their
daily decisions on data (data-driven
decisions).
 Scientific knowledge for processing this
ever-increasing volume of data is growing
along with it. But human effort is still required
to process and make sense of
such zettabyte-scale information. This is where
data scientists are needed.
Data-Computer-Science
❖Explosion of data volume
❖Computer processing power
❖Knowledge in data science
❖Calling for changes in everyday
decision-making process
➔Need for Data Scientist
Categories of Statistics
 Descriptive statistics describe the
characteristics of a set of data. E.g.,
employment rate of new college
graduates over the past five years, a
graph of students' birthday months in a
kindergarten class.
 Inferential statistics provide a way to
draw conclusions and predictions about a
population based on data provided by a
sample of the population being studied.
Module
Descriptive Statistics
Descriptive Statistics
 Aim to describe a mass of raw data using summary statistics,
graphs, and tables.
 Allow us to understand a group of data much more quickly and
easily than staring at rows of raw data.
 Summary statistics: single-value summaries (e.g., the five-number summary)
◼ Measures of central tendency: describe
where the center of a dataset is located. Mean, median,
mode
◼ Measures of dispersion: describe how spread out the values in
the dataset are. Range, interquartile range, standard deviation,
and variance
 Graphs: to quickly visualize and assess data.
Boxplots, histograms, stem-and-leaf plots, scatterplots
 Tables: to understand how data are distributed. Frequency table.
Descriptive statistics (Point estimates)
 Three ways of describing data:
◼ Summary statistics: e.g., five-numbers for continuous data
◼ Charts: Boxplot for continuous data per group
◼ Tables: frequency table for categorical

 Tools of Descriptive Statistics in a dataset:

◼ Mean: arithmetic or geometric average
◼ Median: the middle value of the ordered data; quartiles (Q1, Q3)
◼ Mode: most frequently occurring value
◼ Range: difference between the highest and the lowest value; IQR
◼ Standard deviation: roughly the typical distance of an observation from
the overall mean; standard error, variance
◼ Correlation coefficient: strength of the linear relationship between two variables
◼ Skewness and kurtosis: shape of the distribution
Skewness
 Measures asymmetry of data
◼ Positive or right skewed: Longer right tail
◼ Negative or left skewed: Longer left tail
Let x_1, x_2, ..., x_n be n observations with mean \bar{x}. Then,
\[
\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}
\]
Kurtosis
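 Measures the peakedness (and tail heaviness) of the distribution of
data. With the −3 correction in the formula (excess kurtosis), the kurtosis of a normal distribution is 0.

Let x_1, x_2, ..., x_n be n observations with mean \bar{x}. Then,
\[
\text{Kurtosis} = \frac{n\,\sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}} - 3
\]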
Mean or Median
The median is less sensitive to outliers (extreme scores)
than the mean and is thus a better measure than the mean
for highly skewed distributions, e.g., family income. For
example, the mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 = 270. The median of these four
observations is (30+40)/2 = 35. Here 3 observations out of
4 lie between 20 and 40, so the mean of 270 fails to give
a realistic picture of the major part of the data. It is
influenced by the extreme value 990.
Notation: Let x_1, x_2, ..., x_n be n observations of a variable
x. Then the mean of this variable is
\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]
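The worked example with 20, 30, 40, and 990 can be checked in R with a short sketch. The variable names are illustrative.

```r
x <- c(20, 30, 40, 990)

x_mean   <- mean(x)     # (20 + 30 + 40 + 990) / 4
x_median <- median(x)   # average of the two middle values, 30 and 40

# Dropping the extreme value shows how strongly it pulls the mean
mean_without_outlier   <- mean(x[x != 990])
median_without_outlier <- median(x[x != 990])

cat("mean:", x_mean, " median:", x_median, "\n")
cat("without 990 -> mean:", mean_without_outlier,
    " median:", median_without_outlier, "\n")
```

Removing the single extreme value moves the mean from 270 to 30, while the median only moves from 35 to 30.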
Standard Deviation or Variance
Variance: The variance of a set of observations is the average
of the squared deviations of the observations from their
mean (for a sample, the sum is divided by n − 1 rather than n). In symbols, the variance of the n observations x_1,
x_2, ..., x_n is
\[
S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}
\]
Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance
is
\[
\frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = 4
\]
Standard Deviation: Square root of the variance. The
standard deviation of the above example is 2.
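The variance example for 5, 7, 3 can be reproduced in R, both from the formula and with the built-in functions, which also use the n − 1 divisor. The variable names are illustrative.

```r
x <- c(5, 7, 3)
n <- length(x)

# Variance computed step by step from the formula
deviations      <- x - mean(x)
variance_manual <- sum(deviations^2) / (n - 1)

# Built-in equivalents
variance_builtin <- var(x)
sd_builtin       <- sd(x)

cat("variance:", variance_manual, " sd:", sd_builtin, "\n")
```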
Quartiles, Deciles, or Percentiles
Quartiles: Data can be divided into four regions that cover the total
range of observed values. The cut points between these regions are known as
quartiles.
In notation, the q-th quartile of a dataset is the (q(n+1)/4)-th observation of the
ordered data, where q is the desired quartile and n is the number of
observations.
The first quartile (Q1) is the cut point below which the first 25% of the data lie. The second quartile
(Q2) is the median: the region between Q1 and Q2 holds the data between the 25th and 50th percentage points. The
third quartile (Q3) is the 75% cut point: the region between the median and Q3 holds another 25%
of the data.
Q1 is the median of the first half of the ordered observations and Q3 is
the median of the second half of the ordered observations.
Quartiles, Deciles, or Percentiles
Deciles: If data are ordered and divided into 10 parts, the cut
points are called deciles.
Percentiles (Quantiles): If data are ordered and divided into
100 parts, the cut points are called percentiles. The 25th
percentile is Q1, the 50th percentile is the median (Q2), and
the 75th percentile is Q3.
In notation, the p-th percentile of a dataset is the (p(n+1)/100)-th
observation of the ordered data, where p is the desired percentile and n
is the number of observations.
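The (n+1)-based position rule can be tried in R with a small sketch. The data vector is hypothetical. R's quantile() offers several definitions: type = 6 uses the (n+1) position rule, while the default (type = 7) uses a different rule and can give different values.

```r
x <- c(2, 5, 8, 11, 13, 17, 20)   # hypothetical data, already ordered
n <- length(x)

# Position of the q-th quartile under the (n + 1) rule
quartile_position <- function(q, n) q * (n + 1) / 4
positions <- quartile_position(1:3, n)   # positions of Q1, Q2, Q3

# With n = 7 the positions are whole numbers, so the quartiles are observed values
quartiles_by_position <- x[positions]

# Same rule via quantile(); type = 6 corresponds to the (n + 1) definition
quartiles_type6 <- unname(quantile(x, probs = c(0.25, 0.50, 0.75), type = 6))

# R's default definition (type = 7), shown for comparison
quartiles_default <- unname(quantile(x, probs = c(0.25, 0.50, 0.75)))

print(rbind(type6 = quartiles_type6, default = quartiles_default))
```

When a position is not a whole number, quantile() interpolates between the two neighboring observations.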
Mean & Median in Skewed Data
Skewness and Kurtosis
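The moment-based skewness and kurtosis formulas from the earlier slides can be written as two small R functions. This is a sketch that assumes the definitions with sqrt(n) in the skewness numerator and the −3 correction in the kurtosis. The function names are illustrative and not from any package.

```r
# Moment-based skewness: sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)
skewness <- function(x) {
  d <- x - mean(x)
  sqrt(length(x)) * sum(d^3) / sum(d^2)^(3 / 2)
}

# Excess kurtosis: n * sum(d^4) / (sum(d^2))^2 - 3
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  length(x) * sum(d^4) / sum(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)      # no skew expected
right_skewed <- c(20, 30, 40, 990)    # long right tail
left_skewed  <- -right_skewed         # mirror image, long left tail

skew_values <- c(symmetric    = skewness(symmetric),
                 right_skewed = skewness(right_skewed),
                 left_skewed  = skewness(left_skewed))
print(skew_values)
print(excess_kurtosis(symmetric))
```

Mirroring the data flips the sign of the skewness but leaves its magnitude unchanged.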
Describing a Given Dataset
❑ Using center of location
❑ Using variability of distribution
❑ Inclination to one side (large/small,
left/right, negative/positive; extreme
values/outlier?)

Description | G1 | G2 | G3
Mean, median | Mean < Median | Median  Mean | Median = Mean
SD, IQR | Small IQR; outlier affect; small SD | Medium SD, IQR | Large SD but no outlier
Skewness | negative/left |  left | no
Kurtosis | outlier | no | no
Statistical Formulas, Attributes
Complete the table on your own for other
parameters.

Parameters | Indicates
Mean, median, mode | Location
SD, variance, skewness, kurtosis, quartiles, range, IQR, mean absolute deviation, percentiles, coefficient of variation (CV) | Variability / dispersion
Correlation coefficient | Relationship
Boxplots
 A boxplot (also called a box-and-whisker
plot) is a plot that shows the five-number
summary (descriptive statistics) of a dataset.
 The five-number summary includes:
• The minimum, the first quartile (Q1), the median,
the third quartile (Q3), the maximum
 A boxplot allows us to easily visualize the
distribution of values in a dataset using one
simple plot.
Examples

Find: mean, SD (variance), IQR, MAD (median absolute deviation), mode
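A minimal R sketch of the statistics asked for above, on a hypothetical vector (the slide's own data are not reproduced here). Base R has no function for the statistical mode, so stat_mode() is a small helper written for illustration. Note that mad() multiplies by 1.4826 by default; constant = 1 gives the plain median absolute deviation.

```r
x <- c(2, 5, 5, 8, 11, 13)   # hypothetical data

# Helper: most frequent value(s); not a base R function
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

x_mean <- mean(x)
x_sd   <- sd(x)                  # sample SD (n - 1 divisor)
x_var  <- var(x)                 # sample variance
x_iqr  <- IQR(x)                 # Q3 - Q1 with R's default quantile type
x_mad  <- mad(x, constant = 1)   # median of |x - median(x)|
x_mode <- stat_mode(x)

print(c(mean = x_mean, sd = x_sd, var = x_var,
        IQR = x_iqr, MAD = x_mad, mode = x_mode))
print(fivenum(x))                # five-number summary
boxplot(x, ylab = "x")
```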
Boxplot
Visualizing Data
 Data science deals with complex data: hundreds of columns and
hundreds of thousands (or even millions or billions) of rows.
◼ Graphs quickly and effectively describe data and compare groups
 Thus, visualization is the way to go
 Business Intelligence (BI) analytic tools, e.g., Power BI and Tableau, slice
and dice data into smaller groups
 Boxplots
 Stem and Leaf Plots
 Scatterplots
 Relative Frequency Histogram
 Density plots
Central and Dispersal of Data
Questions
 When do you prefer the mean over the
median, or the IQR over the SD?
 What is the difference in importance between SD and
variance?
 IQR or range?
 SD vs. standard error of the mean
R package and Class Data
 Install R on your computer
 Download data and codes from the book website:
https://github.com/gedeck/practical-statistics-for-data-scientists
 Extract the zip file on your local drive
 Set up your working directory, e.g.,
◼ > setwd("C:\\...\\practical-statistics-for-data-scientists-master\\data")
 Import the data into R, e.g., state <- read.csv('state.csv')
 Open the R code files in R and copy any part to run (or copy any
code from the GitHub site), e.g., mean(state[['Population']])
References for R Programming
 RStudio101
 Rstudio-IDE-cheatsheet
 Quick R-Compiled
 UC Business Analytics R
Programming Guide2
 cheat-sheet dplyr
 ggplot2-cheatsheet
Few R codes
 > summary(DATA)   # DATA is an R data frame
 > STAT(DATA$x)   # STAT is a placeholder for mean, sd, var, min, max, median, range, ...
 > sapply(DATA, STAT, na.rm = TRUE)
◼ > tapply(DATA$VAR, DATA$GROUP, STAT)
◼ > tapply(DATA$VAR1, DATA$GROUP1, quantile, probs = c(0.25, 0.50, 0.75))
 > boxplot(DATA$VAR1, ylab = 'abc'); boxplot(DATA$VAR1 ~ DATA$GROUP1)
 > hist(DATA$VAR1, breaks = n, xlim = c(min, max), xlab = 'abc')
 > tab1 <- table(DATA$VAR1, DATA$VAR2)
◼ > chisq.test(tab1)
◼ > gmodels::CrossTable(DATA$VAR1, DATA$VAR2)
 > DATA_NUM <- dplyr::select(DATA_ALL, -c(CAT1, CAT2))   # drop the categorical columns
> pairs(DATA_NUM)
 > hist(DATA$VAR1, freq = FALSE)
> lines(density(DATA$VAR1), lwd = 3, col = "blue")
 > cor(x, y)
 > cor(dplyr::select(DATA, -c(CAT1, CAT2)))
 > library(corrplot)
> corrplot(cor(dplyr::select(DATA, -c(CAT1, CAT2))), method = "number")   # other options for method exist
 > library(ggplot2)
> ggplot(data = DATA) +
    geom_point(mapping = aes(x = VAR1, y = VAR2, color = CAT1))
 > car::vif(model)
Visualization
Practice Question
 What does the line in the center of the box
represent: the mean or the median?
 For skewed data, which is the center of gravity
of the boxplot: the mean or the median?
 In Excel, draw a boxplot and calculate the
mean, median, Q1, and Q3 of 2, 5, 8, 11, 13.
Change the last number (13) to 200,
re-graph and recalculate the parameters, and describe
the change.
 Add one more observation (10), keep 200, and
re-graph the boxplot. Describe the change.
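The same exercise can also be worked through in R, as a sketch for checking results. The helper summarize_x() is a name made up for this example. Quartiles use R's default definition (type 7), which may differ slightly from the values Excel reports depending on the Excel function used.

```r
# Helper (illustrative): mean, median, Q1, and Q3 of a numeric vector
summarize_x <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))
  c(mean = mean(x), median = median(x), Q1 = unname(q[1]), Q3 = unname(q[2]))
}

original     <- c(2, 5, 8, 11, 13)
with_outlier <- c(2, 5, 8, 11, 200)      # last value changed from 13 to 200
with_extra   <- c(with_outlier, 10)      # one more observation added

results <- rbind(original     = summarize_x(original),
                 with_outlier = summarize_x(with_outlier),
                 with_extra   = summarize_x(with_extra))
print(results)

# Side-by-side boxplots of the three versions of the data
boxplot(list(original = original, with_outlier = with_outlier,
             with_extra = with_extra))
```

In the printed table, only the mean changes when 13 becomes 200; the median and quartiles stay where they were.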
Stem and Leaf Plots
 A stem-and-leaf plot displays data by
splitting each value in a dataset into a
"stem" and a "leaf." The "leaf" of each
value is its last digit.
 Example 1: 12, 14, 18, 22, 22, 23, 25, 25, 28, 45, 47, 48
◼ 1 | 2 4 8
2 | 2 2 3 5 5 8
3 |
4 | 5 7 8
 Example 2: 134, 156, 158, 159, 160, 162, 164
◼ 13 | 4
14 |
15 | 6 8 9
16 | 0 2 4
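The split into stems and leaves can be sketched in R using integer division and remainder, shown here for Example 1. The variable names are illustrative. Base R also provides stem(), whose automatic choice of stem width may produce a different layout.

```r
x <- c(12, 14, 18, 22, 22, 23, 25, 25, 28, 45, 47, 48)

stems  <- x %/% 10   # everything except the last digit
leaves <- x %% 10    # the last digit

# Include stems with no leaves (here 3) so gaps in the data stay visible
all_stems <- seq(min(stems), max(stems))
leaf_text <- sapply(all_stems,
                    function(s) paste(leaves[stems == s], collapse = " "))
display   <- paste(all_stems, "|", leaf_text)

cat(display, sep = "\n")

# Built-in version for comparison
stem(x)
```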
Scatterplots
 Scatterplots are used to display the
relationship between two variables (bivariate analysis).
 Interpreting Scatterplots:
◼ relationship (positive, negative, none)
◼ strength (weak, strong)
 Scatterplot matrix: a grid of scatter plots to
visualize/explore bivariate relationships in a
single chart.
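A minimal R sketch of a scatterplot, a scatterplot matrix, and the matching correlation coefficients. It uses the built-in iris data as a stand-in, an assumption made so the example runs without extra packages.

```r
num <- iris[, 1:4]   # the four numeric measurements

# Single scatterplot and its correlation coefficient
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal length", ylab = "Petal width")
r_petal <- cor(iris$Petal.Length, iris$Petal.Width)

# Scatterplot matrix: every pair of variables in one chart, colored by species
pairs(num, col = as.integer(iris$Species))

# Correlation matrix for the same pairs
r_matrix <- cor(num)
print(round(r_matrix, 2))
```

A coefficient close to +1 or −1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.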
Interpret: correlation of body parts of penguin spp.
Frequency Table
 Usually used to describe data of categorical (non-
numerical) variable, e.g., gender, marital status…
 Frequencies simply tell us how many times a certain
event has occurred for each level of the category.
 Can be expressed in percentage of all observations,
of columns, or of rows.
 A contingency table shows distribution of one
variable in rows and of another in columns to study
the association between the variables.
 A Chi-square or Fisher exact test shows significance
of the association.
Given Long Categorical Data
Create a categorical age group and create a summary frequency table of
the age category and gender.

ID | Age | Gender
1 | 50 | m
2 | 25 | m
3 | 33 | F
… | … | …
300 | 18 | 3

Question:
- Overall difference of counts of gender, age group
- Percent of different age groups with a given gender, or difference
in percent of gender with a given age group
- Is there any association between age group and gender?
- In the population, what is the largest group of the society?
Categorical data analysis is a bivariate analysis
Frequency Counts and Percent (cell, marginal (column, row)) of Age Group by Gender

Counts
Age | male | female | Total | R.marginal
<= 18 | 50 | 60 | 110 | 110/295
19-30 | 40 | 40 | 80 | 80/295
35-45 | 20 | 30 | 50 | 50/295
46-60 | 10 | 20 | 30 | 30/295
60+ | 5 | 20 | 25 | 25/295
Total | 125 | 170 | 295 | 295/295
C.marginal | 125/295 | 170/295 | 100% |

Row percent
Age | male | female | Total | R.marginal
<= 18 | 50/110 | 60/110 | 110/110 | 110/295
19-30 | 40/80 | 40/80 | 80 | 80/295
35-45 | | | 50 | 50/295
46-60 | | | 30 | 30/295
60+ | | | 25 | 25/295
Total | 125 | 170 | 295 | 295/295
C.marginal | 125/295 | 170/295 | 100% |

Cell percent
Age | male | female | Total | R.marginal
<= 18 | 50/295 | 60/295 | 110 | 110/295
19-30 | 40/295 | 40/295 | 80 | 80/295
35-45 | | | 50 | 50/295
46-60 | | | 30 | 30/295
60+ | | | 25 | 25/295
Total | 125 | 170 | 295 | 295/295
C.marginal | 125/295 | 170/295 | 100% |

Column percent
Age | male | female | Total | R.marginal
<= 18 | 50/125 | 60/170 | 110 | 110/295
19-30 | 40/125 | 40/170 | 80 | 80/295
35-45 | | | 50 | 50/295
46-60 | | | 30 | 30/295
60+ | | | 25 | 25/295
Total | 125 | 170 | 295 | 295/295
C.marginal | 125/295 | 170/295 | 100% |
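The cell, row, and column percentages, and a test of association, can be computed in R from the counts in the table. This is a sketch. The object names are illustrative, and the age labels are copied from the slide.

```r
# Counts of age group by gender, entered column by column (male, then female)
counts <- matrix(c(50, 40, 20, 10, 5,
                   60, 40, 30, 20, 20),
                 ncol = 2,
                 dimnames = list(age    = c("<=18", "19-30", "35-45", "46-60", "60+"),
                                 gender = c("male", "female")))

with_margins <- addmargins(counts)               # adds row and column totals
cell_pct     <- prop.table(counts)               # each cell / grand total
row_pct      <- prop.table(counts, margin = 1)   # each cell / its row total
col_pct      <- prop.table(counts, margin = 2)   # each cell / its column total

print(with_margins)
print(round(100 * row_pct, 1))

# Chi-square test of association between age group and gender
assoc_test <- chisq.test(counts)
print(assoc_test)
```

Starting from raw long-format data, table(DATA$age_group, DATA$gender) would produce the count matrix first; DATA and its columns are placeholder names here.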
Relative vs. Cumulative Frequency Distribution
Age | male | female | Relative count | Cumulative count | Relative percent | Cumulative percent
<= 18 | 50 | 60 | 110 | 110 | 110/295 | 110/295
19-30 | 40 | 40 | 80 | 190 | 80/295 | 190/295
35-45 | 20 | 30 | 50 | 240 | 50/295 | 240/295
46-60 | 10 | 20 | 30 | 270 | 30/295 | 270/295
60+ | 5 | 20 | 25 | 295 | 25/295 | 295/295
Relative count | 125 | 170 | 295 | | 295/295 |
Cumulative count | 125 | 295 | | | |
Relative percent | 125/295 | 170/295 | | | |
Cumulative percent | 125/295 | 295/295 | | | |
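The relative and cumulative columns of the table can be reproduced in R with cumsum(). This is a sketch that uses the age-group totals from the table; the object names are illustrative.

```r
age_totals <- c("<=18" = 110, "19-30" = 80, "35-45" = 50, "46-60" = 30, "60+" = 25)

relative_count     <- age_totals
cumulative_count   <- cumsum(age_totals)
relative_percent   <- 100 * age_totals / sum(age_totals)
cumulative_percent <- cumsum(relative_percent)

freq_table <- data.frame(relative_count,
                         cumulative_count,
                         relative_percent   = round(relative_percent, 1),
                         cumulative_percent = round(cumulative_percent, 1))
print(freq_table)
```

The last cumulative count always equals the total number of observations, and the last cumulative percent equals 100.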
Density Plots (Kernel density plot)
 Shows the distribution of a variable over a
continuous interval.
 A variation of the histogram that does not
depend on the number of bins (it shows
the shape of the distribution better), although it does depend on the smoothing bandwidth
 Estimates a smoothed curve across
the data to smooth out noise
 Normal distribution curves are an
example of density plots.
Where is the most frequent value, the mean,
or the median located? What is that value?
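A minimal R sketch of a kernel density plot drawn over a histogram, using simulated data. The sample, its mean and SD, and the seed are arbitrary choices for illustration.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
x <- rnorm(500, mean = 50, sd = 10)    # simulated data

d <- density(x)                        # kernel density estimate (default bandwidth)

hist(x, freq = FALSE, breaks = 20, main = "Histogram with density curve")
lines(d, lwd = 3, col = "blue")
abline(v = mean(x), lty = 2)           # dashed line at the mean
abline(v = median(x), lty = 3)         # dotted line at the median

# Location of the highest point of the curve: the estimated most frequent value
density_peak <- d$x[which.max(d$y)]

# The area under a density curve is approximately 1
grid_step    <- d$x[2] - d$x[1]
density_area <- sum(d$y) * grid_step

cat("peak:", density_peak, " mean:", mean(x), " median:", median(x), "\n")
```

Changing the bw argument of density() makes the curve smoother or rougher, much as changing the number of bins changes a histogram.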
Questions
 What is the difference between
frequency tables and percentiles
(quartiles) in describing data, and
when do you use each?
 What are the similarities and differences
between histograms and density plots?
R
 RStudio Screen vs. base R
 Package, library, function
◼ A package is a bundle of functions (and often data). Install it
once with install.packages(), then load it in each session with
library() so that its functions can be called
 Working directory
 Importing data (external, internal)
 Stat (five-number, mean, sd, var, etc.)
 Graphics (boxplot, hist, density)
 Frequency
