DataAnalytics Descriptive Analytics Module 1
Statistics
RISHI SINHA
What does the word statistics
bring to mind?
To most people, it suggests numerical facts or data, such as unemployment figures, farm prices,
or the number of marriages and divorces.
1. [used with a plural verb] facts or data, either numerical or nonnumerical, organized and
summarized so as to provide useful and accessible information about a particular subject.
2. [used with a singular verb] the science of organizing and summarizing numerical or
nonnumerical information.
“Statistics is the branch of scientific method which deals with the data –
obtained by counting or measuring the properties of population of natural phenomena.”
Statistics
Definition:
“Statistics is the scientific methodology which deals with the collection, classification & tabulation of
numerical facts as a basis for explanation, description & comparison of social phenomena.”
Limitations of Statistics:
A well-known saying goes: “There are three kinds of lies: lies, damned lies, and statistics.”
Descriptive Statistics
2. Types of Data
1. Quantitative Data:
   • Numerical data that can be measured and expressed with numbers.
   • Examples include age, height, weight, and income.
2. Qualitative Data:
   • Descriptive data that cannot be quantified.
   • Examples include gender, color, marital status, and customer feedback.
Understanding Data
Data Representation
• Nominal Data:
• Categorical data with no inherent order or ranking.
• Examples include gender (male/female), colors, and types of fruit.
• Ordinal Data:
• Categorical data with a natural order or ranking.
• Examples include education level (high school, college, graduate), rating scales (low, medium, high), and satisfaction levels (poor, fair,
good, excellent).
• Interval Data:
• Numerical data with equal intervals between values, but no true zero point.
• Examples include temperature measured in Celsius or Fahrenheit.
• Ratio Data:
• Numerical data with equal intervals between values and a true zero point.
• Examples include height, weight, income, and number of items sold.
Data Display & Representation
•Tabular Methods (Class Interval, Frequency, Relative Frequency, Cumulative Frequency < , Cumulative
Frequency >, Midpoint): Explanation and Examples
•Graphical Methods: Bar Charts, Histograms, Scatter Plots, Pie Charts, Polygon, Ogive (less than &
greater than)
Objectives of this session
▪ To learn how to construct frequency distributions and to understand their importance
Investigation
▪ Data Collection
▪ Data Presentation: Tabulation, Diagrams, Graphs
▪ Descriptive Statistics: Measures of Location, Measures of Dispersion, Measures of Skewness & Kurtosis
▪ Inferential Statistics: Estimation (Point estimate, Interval estimate), Hypothesis Testing, Univariate analysis, Multivariate analysis
Frequency Distributions
▪Data distribution – pattern of variability.
▪ The center of a distribution
▪ The ranges
▪ The shapes
▪Simple frequency distributions
▪Grouped frequency distributions
Simple Frequency Distribution
•The frequency (f) of a score is the number of times that score occurs
•Make a table with the highest score at the top, decreasing through every possible whole number
•N (total number of scores) always equals the sum of the frequencies
• Σf = N
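Categorical or Qualitative Frequency Distributions
•What is a categorical frequency distribution? It lists each category (class) of a qualitative variable together with the number of observations that fall in it.
•Example: blood types of 25 individuals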
AB B A O B
O B O A O
B O B B B
A O AB AB O
A B AB O A
Categorical Frequency Distribution for
the Blood Types -- Example (Contd…)
Note: The classes for the distribution are the blood types.
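The blood type distribution described above can be tallied directly from the data grid. The following is a minimal sketch, assuming the 25 entries are read row by row exactly as they appear on the slide; the percent column is an addition for illustration.

```python
from collections import Counter

# Blood types of the 25 individuals, copied row by row from the data grid.
blood_types = (
    "AB B A O B "
    "O B O A O "
    "B O B B B "
    "A O AB AB O "
    "A B AB O A"
).split()

# Each distinct blood type is a class; its frequency is its number of occurrences.
frequency = Counter(blood_types)
n = sum(frequency.values())

print(f"{'Class':<6}{'f':>4}{'Percent':>9}")
for blood_type in ("A", "B", "O", "AB"):
    f = frequency[blood_type]
    print(f"{blood_type:<6}{f:>4}{100 * f / n:>8.0f}%")
print(f"{'Total':<6}{n:>4}")
```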
Quantitative Frequency Distributions -- Ungrouped
Note: The (ungrouped) classes are the observed values themselves.
Example of a simple frequency distribution (ungrouped)
Unique Observation   f
9                    3
8                    2
7                    2
6                    1
5                    4
4                    4
3                    3
2                    3
1                    3
Σf = 25 (No. of families)
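A simple frequency table like this one can be built by counting each observed value. The sketch below assumes the 25 raw observations are the digit string shown with the relative frequency example (one digit per family); the variable names are illustrative.

```python
from collections import Counter

# Assumed raw data: one digit per observation (25 observations in total).
scores = [int(ch) for ch in "5781593422349714568943521"]

freq = Counter(scores)
N = len(scores)

# Highest value at the top, one row for every possible whole number.
for value in range(max(scores), min(scores) - 1, -1):
    print(value, freq[value])

# The frequencies must add up to the number of observations.
print("sum of f =", sum(freq.values()), "| N =", N)
```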
Relative Frequency Distribution
Proportion of the total N
Divide the frequency of each score by N
Rel. f = f/N
Sum of relative frequencies should equal 1.0
Gives us a frame of reference
Relative Frequency Distribution
Data: 5, 7, 8, 1, 5, 9, 3, 4, 2, 2, 3, 4, 9, 7, 1, 4, 5, 6, 8, 9, 4, 3, 5, 2, 1
Unique Observation   f   Rel. f
9                    3   0.12
8                    2   0.08
7                    2   0.08
6                    1   0.04
5                    4   0.16
4                    4   0.16
3                    3   0.12
2                    3   0.12
1                    3   0.12
Σf = 25   Σ rel. f = 1.0
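The rule Rel. f = f/N can be checked with a few lines of Python. This is a small sketch using the 25 observations above; the dictionary name `rel_freq` is only illustrative.

```python
from collections import Counter

scores = [int(ch) for ch in "5781593422349714568943521"]
N = len(scores)
freq = Counter(scores)

# Relative frequency: each frequency divided by the total number of scores.
rel_freq = {value: count / N for value, count in freq.items()}

for value in sorted(rel_freq, reverse=True):
    print(f"{value}  f={freq[value]}  rel f={rel_freq[value]:.2f}")

# The relative frequencies should sum to 1.0 (up to floating-point rounding).
print("sum of rel f =", round(sum(rel_freq.values()), 10))
```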
Cumulative Frequency Distributions
cf = cumulative frequency: number of scores at or below a particular score
Shows a score’s standing relative to other scores
Count from the lowest score and add the simple frequencies of all scores at or below that score
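Example of a simple frequency distribution with cumulative frequencies
Data: 5, 7, 8, 1, 5, 9, 3, 4, 2, 2, 3, 4, 9, 7, 1, 4, 5, 6, 8, 9, 4, 3, 5, 2, 1
Score   f   rel f   cf (at or below)
9       3   .12     25
8       2   .08     22
7       2   .08     20
6       1   .04     18
5       4   .16     17
4       4   .16     13
3       3   .12      9
2       3   .12      6
1       3   .12      3
Σf = 25   Σ rel f = 1.0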
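Cumulative frequencies are running totals of the simple frequencies. The sketch below computes both directions for the same 25 observations: "at or below" (the definition of cf given earlier, accumulated from the lowest score) and "at or above" (accumulated from the highest score, the "greater than" type of cumulative frequency). The variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

scores = [int(ch) for ch in "5781593422349714568943521"]
freq = Counter(scores)

ascending = sorted(freq)            # 1, 2, ..., 9
descending = ascending[::-1]        # 9, 8, ..., 1

# Running totals starting from the lowest score: scores at or below each value.
cf_at_or_below = dict(zip(ascending, accumulate(freq[v] for v in ascending)))
# Running totals starting from the highest score: scores at or above each value.
cf_at_or_above = dict(zip(descending, accumulate(freq[v] for v in descending)))

print("score  f  cf(<=)  cf(>=)")
for value in descending:
    print(f"{value:>5}{freq[value]:>3}{cf_at_or_below[value]:>8}{cf_at_or_above[value]:>8}")
```

Whichever direction is used, the last running total equals N = 25.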
Quantitative Frequency Distributions --
Grouped
Steps for making a table
DUMMY TABLE (empty layout): Hb (g/dl) | Tally marks | No. of patients | Total
TALLY MARKS TABLE (filled in): Hb (g/dl) | Tally marks | No. of patients | Total - 30
Table Frequency distribution of 30 adult male
patients by Hb
Hb (g/dl) No. of patients
9.0 – 9.9 1
10.0 – 10.9 3
11.0 – 11.9 6
12.0 – 12.9 10
13.0 – 13.9 5
14.0 – 14.9 3
15.0 – 15.9 2
Total 30
Table Frequency distribution of adult patients by Hb and gender
Hb (g/dl)      Male   Female   Total
<9.0            0      2        2
9.0 – 9.9       1      3        4
10.0 – 10.9     3      5        8
11.0 – 11.9     6      8       14
12.0 – 12.9    10      6       16
13.0 – 13.9     5      4        9
14.0 – 14.9     3      2        5
15.0 – 15.9     2      0        2
Total          30     30       60
Example data
68 63 42 27 30 36 28 32
79 27 22 28 24 25 44 65
43 25 74 51 36 42 28 31
28 25 45 12 57 51 12 32
49 38 42 27 31 50 38 21
16 24 64 47 23 22 43 27
49 28 23 19 11 52 46 31
30 43 49 12
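One way to turn the 60 example values into a grouped frequency distribution is sketched below. The class width of 10 (classes 10-19, 20-29, ...) is an assumption made for illustration; the slides do not state which intervals were used. The row of asterisks is a rough text histogram of the class frequencies.

```python
# The 60 example values, copied row by row.
data = [
    68, 63, 42, 27, 30, 36, 28, 32,
    79, 27, 22, 28, 24, 25, 44, 65,
    43, 25, 74, 51, 36, 42, 28, 31,
    28, 25, 45, 12, 57, 51, 12, 32,
    49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 64, 47, 23, 22, 43, 27,
    49, 28, 23, 19, 11, 52, 46, 31,
    30, 43, 49, 12,
]

WIDTH = 10  # assumed class width, not taken from the slides

# Create one class per interval, from the class holding the minimum to the one holding the maximum.
first_lower = (min(data) // WIDTH) * WIDTH
counts = {}
for lower in range(first_lower, max(data) + 1, WIDTH):
    counts[(lower, lower + WIDTH - 1)] = 0

# Tally each value into its class.
for x in data:
    lower = (x // WIDTH) * WIDTH
    counts[(lower, lower + WIDTH - 1)] += 1

n = len(data)
cumulative = 0
print("class   mid    f  rel f  cf  histogram")
for (lower, upper), f in counts.items():
    cumulative += f
    midpoint = (lower + upper) / 2
    print(f"{lower}-{upper}  {midpoint:5.1f} {f:3d}  {f / n:.3f} {cumulative:3d}  {'*' * f}")
```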
Histogram
Continuous Data
Box and Whiskers Plots / Box Plot
Descriptive statistics report: Boxplot
- minimum score
- maximum score
- lower quartile
- upper quartile
- median
- mean
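The values a box plot reports can be computed with Python's standard `statistics` module. This is a sketch only: it reuses the 25 scores from the frequency distribution example, and it assumes the "inclusive" quartile method; other quartile conventions can give slightly different Q1 and Q3.

```python
import statistics

scores = sorted(int(ch) for ch in "5781593422349714568943521")

# quantiles(n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

summary = {
    "minimum": scores[0],
    "lower quartile": q1,
    "median": median,
    "upper quartile": q3,
    "maximum": scores[-1],
    "mean": statistics.mean(scores),
}

for name, value in summary.items():
    print(f"{name:<15}{value}")
```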
Pie Chart
•Circular diagram; the whole circle represents the total (100%)
Bar chart
• The height of each bar indicates frequency
• Frequency on the Y axis and categories of the variable on the X axis
• The bars should be of equal width and should not touch one another
[Figure: bar chart. Y axis: Number; X axis: Risk factor (Smo, Alc, Chol, DM, HTN, No Exer, F-H).]
[Figure: bar chart. Y axis: Enrollment (hundred); X axis: Year, 1986 to 1992; series: Men, Women.]
Stacked bar chart
[Figure: stacked bar chart, "HIV cases Enrollment in USA by gender". Y axis: Enrollment (Thousands); X axis: Year, 1986 to 1992; series: Men, Women.]
General rules for designing graphs
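Tabular and Graphical Procedures
Data
•Qualitative data, tabular methods: Frequency Distribution; Rel. Freq. Dist.; % Freq. Dist.; Cross tabulation
•Qualitative data, graphical methods: Bar Graph; Pie Chart
•Quantitative data, tabular methods: Frequency Distribution; Rel. Freq. Dist.; Cum. Freq. Dist.; Cum. Rel. Freq. Distribution
•Quantitative data, graphical methods: Histogram; Freq. curve; Box plot; Scatter Diagram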
Case Study
Weighted Mean
The weighted mean is an average computed by giving different weights to the individual values. When all the weights are equal, the weighted mean equals the arithmetic mean.
Formula:
To calculate the weighted mean for a given set of non-negative data x1,x2,x3,...,xn with non-negative weights w1,w2,w3,...,wn, we use the formula given below.
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$$
Measures of Central Tendency – Weighted Mean
(Contd…)
• Weighted means are useful in a wide variety of everyday scenarios. For example, a student uses a weighted mean to calculate their percentage grade in a course: the weighting of each assessment item (e.g., assignments, exams, projects) is multiplied by the grade obtained in that item, and the products are added up.
• It is used in descriptive statistical analysis, such as index number calculation. For example, stock market indices such as Nifty or BSE Sensex are computed using the weighted average method. It can also be applied in physics to find the center of mass and moments of inertia of an object.
• It is also useful for business owners evaluating the average price of goods purchased from different vendors, where the purchased quantity is taken as the weight. This gives a better understanding of their expenses.
• A customer's decision on whether to buy a product depends on the quality of the product, knowledge of the product, its cost, and the service offered by the franchise. The customer assigns a weight to each criterion and calculates the weighted average, which helps them make a better buying decision.
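The course grade scenario can be worked through in a few lines. This is a minimal sketch: the grades and weightings below are hypothetical values chosen for illustration, and `weighted_mean` is an illustrative helper, not a library function.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical course: assignments 20%, exams 50%, project 30%.
grades = [80, 70, 90]
weights = [0.2, 0.5, 0.3]
course_grade = weighted_mean(grades, weights)
print(f"weighted course grade: {course_grade:.1f}")

# With equal weights the weighted mean reduces to the arithmetic mean.
equal_weight_grade = weighted_mean(grades, [1, 1, 1])
print(f"equal-weight grade:    {equal_weight_grade:.1f}")
```

Here the exam counts most, so the weighted grade (16 + 35 + 27 = 78) falls below the plain average of 80.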
Measures of Central Tendency – Percentile
➢ A percentile indicates the relative standing of a data value when data are sorted into numerical order from
smallest to largest.
➢ p% of the data values are less than or equal to the pth percentile.
Quartiles are a type of percentile. A percentile is a value with a certain percentage of the data falling below it.
In general terms, k% of the data falls below the kth percentile.
➢ The first quartile (Q1, or the lowest quartile) is the 25th percentile, meaning that 25% of the data falls
below the first quartile.
➢ The second quartile (Q2, or the median) is the 50th percentile, meaning that 50% of the data falls
below the second quartile.
➢ The third quartile (Q3, or the upper quartile) is the 75th percentile, meaning that 75% of the data falls
below the third quartile.
By splitting the data at the 25th, 50th, and 75th percentiles, the quartiles divide the data into four equal parts.
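A percentile can be located by sorting the data and interpolating between neighbouring values. The sketch below uses one common convention (linear interpolation at position (n - 1) * p / 100); textbooks and software packages differ on this detail, so small differences from other tools are expected. The data set and the function name `percentile` are hypothetical.

```python
def percentile(data, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(data)                      # smallest to largest
    position = (len(ordered) - 1) * p / 100     # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


sample = [50, 15, 40, 20, 35]   # hypothetical data values

q1 = percentile(sample, 25)     # first quartile
q2 = percentile(sample, 50)     # second quartile (median)
q3 = percentile(sample, 75)     # third quartile
p90 = percentile(sample, 90)
print(q1, q2, q3, p90)
```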
Measures of Dispersion
❖Dataset:
❖ Set1: 1,2,3,4,5,6,6,7,8,9,10,11
❖ Set2: 4,5,5,5,6,6,6,6,7,7,7,8
❖Both datasets have the same number of observations (n = 12) and the same central tendency (mean = median = 6), but different variability or dispersion.
❖To describe this difference quantitatively, we use a descriptive measure that indicates the amount of variation, or spread, in a data set. Such descriptive measures are referred to as measures of variation, measures of spread, or measures of dispersion.
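The difference between Set1 and Set2 shows up as soon as a few measures are computed side by side. This sketch uses the standard `statistics` module and assumes sample (n - 1) formulas for variance and standard deviation; population formulas would give somewhat smaller values.

```python
import statistics

set1 = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
set2 = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]


def describe(data):
    """Collect central tendency and dispersion measures for one data set."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),   # sample variance (divides by n - 1)
        "stdev": statistics.stdev(data),         # square root of the sample variance
    }


summary1 = describe(set1)
summary2 = describe(set2)

for name in summary1:
    print(f"{name:<9}{summary1[name]:>10.3f}{summary2[name]:>10.3f}")
```

Both sets share n, mean, and median, while the range, variance, and standard deviation are clearly larger for Set1.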
Measures of Dispersion
Just as there are several different measures of central tendency, there are also several different
measures of variation:-
•Range
•Percentile & Quartile
•Variance
•Standard Deviation
•Skewness
•Kurtosis
Calculation:-
• Take the square root of the variance to obtain the standard deviation
Coefficient of Variation
The coefficient of variation is expressed as a percentage and is the most commonly used relative measure of dispersion.
To calculate it, first compute the standard deviation and the arithmetic mean, then apply the coefficient of variation formula: CV = (standard deviation / mean) × 100.
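The coefficient of variation is the standard deviation divided by the arithmetic mean, expressed as a percentage. The sketch below applies it to Set1 and Set2 from the dispersion example, assuming the sample standard deviation; `coefficient_of_variation` is an illustrative helper name.

```python
import statistics


def coefficient_of_variation(data):
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(data) / mean * 100


set1 = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
set2 = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]

cv1 = coefficient_of_variation(set1)
cv2 = coefficient_of_variation(set2)
print(f"CV of Set1: {cv1:.1f}%")
print(f"CV of Set2: {cv2:.1f}%")
```

Because it is a relative measure, the coefficient of variation can also compare data sets measured in different units or with very different means.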
Normal Distribution
•Definition and Characteristics of Normal Distribution
A normal distribution is a type of continuous probability distribution in which most data points
cluster toward the middle of the range, while the rest taper off symmetrically toward either
extreme. The middle of the range is also known as the mean of the distribution.
The normal distribution is also known as a Gaussian distribution or probability bell curve. It is
symmetric about the mean and indicates that values near the mean occur more frequently
than the values that are farther away from the mean.
Assumptions of Normal Distribution
Empirical rule
•In normally distributed data, a constant proportion of the data points lies under the curve between the mean and a specific number of standard deviations from the mean.
•About 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. Thus, for a normal distribution, almost all values lie within 3 standard deviations of the mean.
•These checkpoints of the normal distribution help you recall the appropriate percentages of the area under the curve.
•Remember that the empirical rule applies to all normal distributions, and only to normal distributions.
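The proportions behind the empirical rule can be checked with `statistics.NormalDist` from the Python standard library. This is a small sketch; the helper `share_within` and the second distribution (mean 100, standard deviation 15) are illustrative choices.

```python
from statistics import NormalDist


def share_within(k, dist=NormalDist()):
    """Proportion of a normal distribution lying within k standard deviations of its mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


standard = NormalDist()                  # mean 0, standard deviation 1
other = NormalDist(mu=100, sigma=15)     # an arbitrary normal distribution

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k, standard):.4f}  {share_within(k, other):.4f}")
```

The two columns agree, illustrating that the proportions depend only on the number of standard deviations, not on the particular mean or spread.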
Visualizing Normal Distribution
•Homogeneity of Variance
•Linearity