DataAnalytics Descriptive Analytics Module 1

Descriptive

Statistics
RISHI SINHA
What does the word statistics
bring to mind?
To most people, it suggests numerical facts or data, such as unemployment figures, farm prices,
or the number of marriages and divorces.

Two common definitions of the word statistics are as follows:

1. [used with a plural verb] facts or data, either numerical or nonnumerical, organized and
summarized so as to provide useful and accessible information about a particular subject.

2. [used with a singular verb] the science of organizing and summarizing numerical or
nonnumerical information.

“Statistics is the branch of scientific method which deals with the data –
obtained by counting or measuring the properties of population of natural phenomena.”
Statistics
Definition:

“Statistics is the scientific methodology which deals with the collection, classification & tabulation of
numerical facts as a basis for explanation, description & comparison of social phenomena.”

Limitations of Statistics:

1. Statistics deals with quantitative data only.


2. Statistical law holds good only for aggregate of items or average individuals. It may not be true for a
particular individual or item.
3. Inadequate knowledge of data interpretation may lead to invalid decisions.

As the saying goes: “There are three kinds of lies: lies, damned lies, and statistics.”
Statistics
Statistics
Descriptive Statistics

Source: Introductory Statistics by Neil A. Weiss


Descriptive Statistics
Descriptive statistics serve as the foundation for understanding and summarizing data, facilitating data-
driven decision-making across various domains.
Descriptive Statistics (Contd…)
How do we do it??
• Summarizing Data:
• Condense large datasets into manageable summaries.
• Present the main features of the data in a concise manner.
• Measuring Central Tendency:
• Calculate measures like mean, median, and mode to represent the typical or central value of the data.
• Provide insights into the average behavior or characteristics of the dataset.
• Analyzing Dispersion:
• Assess the spread or variability of the data around the central value.
• Measure dispersion using metrics such as variance, standard deviation, and range.
• Identifying Distribution Shape:
• Determine the shape of the data distribution, such as normal, skewed, or uniform.
• Visualize the distribution using histograms, box plots, or probability plots.
Descriptive Statistics (Contd…)
How do we do it??
• Detecting Outliers:
• Identify data points that significantly deviate from the rest of the dataset.
• Evaluate the impact of outliers on the overall analysis and decision-making.
• Exploring Relationships:
• Examine the relationship between different variables in the dataset.
• Assess correlations, associations, or patterns using scatter plots or correlation coefficients.
• Visualizing Data:
• Create graphical representations of the data to enhance understanding.
• Utilize charts, graphs, and plots to illustrate key findings and trends.
Descriptive Statistics (Contd…)
How do we do it??
• Interpreting Results:
• Interpret descriptive statistics in the context of the research question or problem.
• Draw conclusions and make recommendations based on the analyzed data.
• Communicating Insights:
• Present descriptive statistics findings in reports, presentations, or visualizations.
• Clearly communicate key insights and findings to stakeholders or decision-makers.
• Informing Decision-Making:
• Provide valuable information to support informed decision-making processes.
• Guide actions, strategies, or interventions based on the analyzed data.
Descriptive Statistics (Contd…)

Img Source: Medium.com


Understanding Data
1. What is Data?
1. Data refers to facts, observations, or measurements collected for analysis.
2. It can be qualitative or quantitative and is used to derive insights and make informed decisions.

2. Types of Data
1. Quantitative Data:
1. Numerical data that can be measured and expressed with numbers.
2. Examples include age, height, weight, and income.
2. Qualitative Data:
1. Descriptive data that cannot be quantified.
2. Examples include gender, color, marital status, and customer feedback.
Understanding Data
Data Representation

• Nominal Data:
• Categorical data with no inherent order or ranking.
• Examples include gender (male/female), colors, and types of fruit.

• Ordinal Data:
• Categorical data with a natural order or ranking.
• Examples include education level (high school, college, graduate), rating scales (low, medium, high), and satisfaction levels (poor, fair,
good, excellent).

• Interval Data:
• Numerical data with equal intervals between values, but no true zero point.
• Examples include temperature measured in Celsius or Fahrenheit.

• Ratio Data:
• Numerical data with equal intervals between values and a true zero point.
• Examples include height, weight, income, and number of items sold.
Data Display & Representation
•Tabular Methods (Class Interval, Frequency, Relative Frequency, Cumulative Frequency (less than), Cumulative Frequency (greater than), Midpoint): Explanation and Examples

•Graphical Methods: Bar Charts, Histograms, Scatter Plots, Pie Charts, Frequency Polygon, Ogive (less than & greater than)

•Importance of Choosing the Right Representation


Tabular & Graphical
Presentation of data

Objectives of this session
▪ To learn how to construct frequency distributions and why they are important

▪ To learn the terminology used in a frequency distribution table

▪ To learn different graphs/diagrams for graphical presentation of data.
Investigation
  Data Collection
    Data Presentation: Tabulation, Diagrams, Graphs
    Descriptive Statistics: Measures of Location, Measures of Dispersion, Measures of Skewness & Kurtosis
    Inferential Statistics
      Estimation: Point estimate, Interval estimate
      Hypothesis Testing: Univariate analysis, Multivariate analysis
Frequency Distributions

“A Picture is Worth a Thousand Words”

Frequency Distributions
▪Data distribution – pattern of variability.
▪ The center of a distribution
▪ The ranges
▪ The shapes
▪Simple frequency distributions
▪Grouped frequency distributions

Simple Frequency Distribution
•The frequency (f) of a score is the number of times that score occurs
•Make a table with the highest score at the top, decreasing through every possible whole number
•N (total number of scores) always equals the sum of the frequencies
• Σf = N
Categorical or Qualitative
Frequency Distributions
•What is a categorical frequency distribution?

A categorical frequency distribution represents data that can be placed in specific categories, such as gender, blood group, and hair color.
Categorical or Qualitative Frequency
Distributions -- Example
Example: The blood types of 25 blood donors are given below.
Summarize the data using a frequency distribution.

AB B A O B
O B O A O
B O B B B
A O AB AB O
A B AB O A
Categorical Frequency Distribution for
the Blood Types -- Example (Contd…)

Note: The classes for the distribution are the blood types.
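
The tally for this example can be reproduced in a few lines. The sketch below is only an illustration using Python's standard `collections.Counter`; the variable names are assumptions made for the example and do not come from the slides.

```python
from collections import Counter

# Blood types of the 25 donors, row by row as listed in the example.
blood_types = (
    "AB B A O B "
    "O B O A O "
    "B O B B B "
    "A O AB AB O "
    "A B AB O A"
).split()

# Each blood type is one class; count how many donors fall into it.
frequency = Counter(blood_types)

for blood_type in ("A", "B", "O", "AB"):
    print(f"{blood_type:<3}{frequency[blood_type]:>3}")
print("Total", sum(frequency.values()))
```

The frequencies add up to the 25 donors, which is a quick check that no observation was dropped.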
Quantitative Frequency Distributions
-- Ungrouped

•What is an ungrouped frequency distribution?

An ungrouped frequency distribution simply lists the data values with the corresponding frequency counts with which each value occurs.
Quantitative Frequency Distributions
– Ungrouped -- Example
Example: The at-rest pulse rates for 16 athletes at a meet were 57,
57, 56, 57, 58, 56, 54, 64, 53, 54, 54, 55, 57, 55, 60, and 58.
Summarize the information with an ungrouped frequency
distribution.
Quantitative Frequency Distributions –
Ungrouped -- Example (Contd…)

Note: The (ungrouped) classes are the observed values themselves.
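
A minimal sketch of the ungrouped distribution for the pulse-rate example, assuming the 16 values are typed in exactly as given above. The names `pulse_rates` and `pulse_frequency` are illustrative.

```python
from collections import Counter

pulse_rates = [57, 57, 56, 57, 58, 56, 54, 64, 53, 54, 54, 55, 57, 55, 60, 58]

# Every distinct observed value is its own class.
pulse_frequency = Counter(pulse_rates)

print("Pulse  f")
for value in sorted(pulse_frequency):
    print(f"{value:>5} {pulse_frequency[value]:>2}")
print("Total", sum(pulse_frequency.values()))
```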
Example of a simple frequency distribution (ungrouped)

5 7 8 1 5 9 3 4 2 2 3 4 9 7 1 4 5 6 8 9 4 3 5 2 1 (No. of children in 25 families)

No. of children   f
9                 3
8                 2
7                 2
6                 1
5                 4
4                 4
3                 3
2                 3
1                 3

Σf = 25 (No. of families)
Relative Frequency Distribution
Proportion of the total N
Divide the frequency of each score by N
Rel. f = f/N
Sum of relative frequencies should equal 1.0
Gives us a frame of reference

Relative Frequency Distribution

Note: The relative frequency for a class is obtained by computing f/n.
Example of a simple frequency distribution

5 7 8 1 5 9 3 4 2 2 3 4 9 7 1 4 5 6 8 9 4 3 5 2 1

Score   f    rel. f
9       3    0.12
8       2    0.08
7       2    0.08
6       1    0.04
5       4    0.16
4       4    0.16
3       3    0.12
2       3    0.12
1       3    0.12

Σf = 25    Σ rel. f = 1.0
Cumulative Frequency
Distributions
cf = cumulative frequency: number of scores at or below a
particular score
A score’s standing relative to other scores
Count from the lowest score and add the simple frequencies for
all scores at or below that score
Example of a simple frequency distribution
5 7 8 1 5 9 3 4 2 2 3 4 9 7 1 4 5 6 8 9 4 3 5 2 1

Score   f    rel. f   cf
9       3    .12      25
8       2    .08      22
7       2    .08      20
6       1    .04      18
5       4    .16      17
4       4    .16      13
3       3    .12       9
2       3    .12       6
1       3    .12       3

Σf = 25    Σ rel. f = 1.0
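
The relative and cumulative frequencies can be computed together. The sketch below follows the definitions stated earlier (rel. f = f/N, and cf = number of scores at or below a score, accumulated from the lowest score upward); the variable names are assumptions made for illustration.

```python
from collections import Counter

children = [5, 7, 8, 1, 5, 9, 3, 4, 2, 2, 3, 4, 9, 7, 1,
            4, 5, 6, 8, 9, 4, 3, 5, 2, 1]
n = len(children)
freq = Counter(children)

# Accumulate from the lowest score upward so that cf is the
# number of scores at or below each score.
rows = []
running_total = 0
for score in sorted(freq):
    running_total += freq[score]
    rows.append((score, freq[score], freq[score] / n, running_total))

# Print with the highest score at the top, as in the slides.
print("score  f  rel f  cf")
for score, f, rel_f, cf in reversed(rows):
    print(f"{score:>5} {f:>2} {rel_f:>6.2f} {cf:>3}")
```

The last cumulative frequency always equals N, and the relative frequencies sum to 1.0, which gives two quick consistency checks on any table built this way.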

Quantitative Frequency Distributions --
Grouped

•What is a grouped frequency distribution?

A grouped frequency distribution is obtained by constructing classes (or intervals) for the data, and then listing the corresponding number of values (frequency counts) in each interval.
Tabulate the hemoglobin values of 30 adult male patients listed below

Patient No   Hb (g/dl)   Patient No   Hb (g/dl)   Patient No   Hb (g/dl)
1            12.0        11           11.2        21           14.9
2            11.9        12           13.6        22           12.2
3            11.5        13           10.8        23           12.2
4            14.2        14           12.3        24           11.4
5            12.3        15           12.3        25           10.7
6            13.0        16           15.7        26           12.5
7            10.5        17           12.6        27           11.8
8            12.8        18            9.1        28           15.1
9            13.2        19           12.9        29           13.4
10           11.2        20           14.6        30           13.1
Steps for making a table

Step 1 Find the minimum (9.1) and maximum (15.7)

Step 2 Calculate the difference: 15.7 – 9.1 = 6.6

Step 3 Decide the number and width of the classes (7 class intervals): 9.0–9.9, 10.0–10.9, ...

Step 4 Prepare a dummy table with the columns Hb (g/dl), Tally marks, No. of patients
DUMMY TABLE

Hb (g/dl)      Tally marks   No. of patients
9.0 – 9.9
10.0 – 10.9
11.0 – 11.9
12.0 – 12.9
13.0 – 13.9
14.0 – 14.9
15.0 – 15.9
Total

TALLY MARKS TABLE

Hb (g/dl)      Tally marks   No. of patients
9.0 – 9.9      l             1
10.0 – 10.9    lll           3
11.0 – 11.9    llll l        6
12.0 – 12.9    llll llll     10
13.0 – 13.9    llll          5
14.0 – 14.9    lll           3
15.0 – 15.9    ll            2
Total          -             30
Table Frequency distribution of 30 adult male
patients by Hb
Hb (g/dl) No. of patients

9.0 – 9.9 1
10.0 – 10.9 3
11.0 – 11.9 6
12.0 – 12.9 10
13.0 – 13.9 5
14.0 – 14.9 3
15.0 – 15.9 2

Total 30
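
The four steps for building the grouped table can be sketched in code. This is an illustration rather than a prescribed method: it assumes classes of width 1 g/dl starting at whole numbers (9.0–9.9, 10.0–10.9, and so on), so each value's class is found by rounding down.

```python
import math
from collections import Counter

hb = [12.0, 11.9, 11.5, 14.2, 12.3, 13.0, 10.5, 12.8, 13.2, 11.2,
      11.2, 13.6, 10.8, 12.3, 12.3, 15.7, 12.6, 9.1, 12.9, 14.6,
      14.9, 12.2, 12.2, 11.4, 10.7, 12.5, 11.8, 15.1, 13.4, 13.1]

# Steps 1 and 2: minimum, maximum, and their difference.
lowest, highest = min(hb), max(hb)
print(f"min={lowest}, max={highest}, difference={highest - lowest:.1f}")

# Step 3: classes of width 1 g/dl; a class is identified by its
# lower limit (9 stands for 9.0-9.9).
class_counts = Counter(math.floor(value) for value in hb)

# Step 4: fill in the table, including any empty classes.
print("Hb (g/dl)    No. of patients")
for lower in range(math.floor(lowest), math.floor(highest) + 1):
    print(f"{lower}.0 - {lower}.9  {class_counts[lower]:>3}")
print("Total", sum(class_counts.values()))
```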

Table Frequency distribution of adult patients by
Hb and gender

Hb (g/dl)      Male   Female   Total
<9.0           0      2        2
9.0 – 9.9      1      3        4
10.0 – 10.9    3      5        8
11.0 – 11.9    6      8        14
12.0 – 12.9    10     6        16
13.0 – 13.9    5      4        9
14.0 – 14.9    3      2        5
15.0 – 15.9    2      0        2
Total          30     30       60
Example data

68 63 42 27 30 36 28 32
79 27 22 28 24 25 44 65
43 25 74 51 36 42 28 31
28 25 45 12 57 51 12 32
49 38 42 27 31 50 38 21
16 24 64 47 23 22 43 27
49 28 23 19 11 52 46 31
30 43 49 12

Histogram
Continuous Data

No segmentation of data into groups
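
A text-only sketch of a histogram for the 60 example values listed above. The class width of 10 is an assumption chosen for illustration; the slides do not state which classes were used.

```python
from collections import Counter

data = [68, 63, 42, 27, 30, 36, 28, 32,
        79, 27, 22, 28, 24, 25, 44, 65,
        43, 25, 74, 51, 36, 42, 28, 31,
        28, 25, 45, 12, 57, 51, 12, 32,
        49, 38, 42, 27, 31, 50, 38, 21,
        16, 24, 64, 47, 23, 22, 43, 27,
        49, 28, 23, 19, 11, 52, 46, 31,
        30, 43, 49, 12]

CLASS_WIDTH = 10  # assumed class width, not taken from the slides

# Map every value to the lower limit of its class (e.g. 27 -> 20).
bins = Counter((value // CLASS_WIDTH) * CLASS_WIDTH for value in data)

# One row of asterisks per class; adjacent classes share a boundary,
# which is why histogram bars touch.
for lower in range(min(bins), max(bins) + CLASS_WIDTH, CLASS_WIDTH):
    upper = lower + CLASS_WIDTH - 1
    print(f"{lower}-{upper} {'*' * bins[lower]} ({bins[lower]})")
```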


Graphical Representation: Multiple
variables
• Scatter Plots: Two quantitative variables
• Box Plots: One categorical with one quantitative variable
• Contingency Tables: Two categorical variables, with frequency of occurrence as the theme
Challenge

Box and Whiskers Plots / Box Plot
Descriptive statistics report: Boxplot
- minimum score
- maximum score
- lower quartile
- upper quartile
- median
- mean

- The skew of the distribution
positive skew: mean > median & high-score whisker is longer
negative skew: mean < median & low-score whisker is longer
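
The quantities a box plot reports can be computed directly. The sketch below reuses the pulse-rate data from the earlier example and assumes two conventions that the slides do not specify: quartiles from `statistics.quantiles` with `method="inclusive"`, and whiskers that run all the way to the minimum and maximum.

```python
import statistics

pulse_rates = [57, 57, 56, 57, 58, 56, 54, 64, 53, 54, 54, 55, 57, 55, 60, 58]

# Lower quartile, median, upper quartile (inclusive interpolation).
q1, median, q3 = statistics.quantiles(pulse_rates, n=4, method="inclusive")
mean = statistics.mean(pulse_rates)
minimum, maximum = min(pulse_rates), max(pulse_rates)

# Whisker lengths, assuming whiskers extend to the extremes.
low_whisker = q1 - minimum
high_whisker = maximum - q3

# Apply the rule of thumb from the slide.
if mean > median and high_whisker > low_whisker:
    skew = "positive"
elif mean < median and low_whisker > high_whisker:
    skew = "negative"
else:
    skew = "no clear skew"

print(f"min={minimum} Q1={q1} median={median} Q3={q3} max={maximum}")
print(f"mean={mean} skew={skew}")
```

Other quartile conventions give slightly different Q1 and Q3 values for small samples, so results may not match a particular textbook or plotting tool exactly.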

Box and Whisker Plots

Popular in Epidemiologic Studies


Useful for presenting comparative data graphically
Application of a box and Whisker diagram

Pie Chart
•Circular diagram; the total is 100%
•Divided into segments, each representing a category
•Decide which categories are adjacent
•The amount for each category is proportional to its slice of the pie

[Figure: pie chart, “The prevalence of different degrees of hypertension in the population”; categories: Mild, Moderate, Severe; slice labels: 10%, 20%, 70%]
Top 10 causes of death: pie chart
Each slice represents a piece of one whole. The size of a slice depends on what
percent of the whole this category represents.

Percent of people dying from the top 10 causes of death in the United States in 2001
Bar Graphs
• The height of each bar indicates frequency
• Frequency on the Y axis and categories of the variable on the X axis
• The bars should be of equal width and should not touch the other bars

[Figure: bar graph, “The distribution of risk factors among cases with cardiovascular diseases”; X axis: Risk factor (Smo, Alc, Chol, DM, HTN, No Exer, F-H); Y axis: Number]
HIV cases enrolment in USA by gender

[Figure: bar chart; X axis: Year (1986 to 1992); Y axis: Enrollment (hundred); series: Men, Women]
HIV cases Enrollment in USA by gender

[Figure: stacked bar chart; X axis: Year (1986 to 1992); Y axis: Enrollment (Thousands); series: Men, Women]
General rules for designing graphs

1. A graph should have a self-explanatory legend

2. A graph should help the reader understand the data

3. Axes should be labeled, with units of measurement indicated

4. Scales are important. Start at zero (otherwise show a // break)

5. Avoid graphs with a three-dimensional impression; they may be misleading (readers visualize them less easily)
Tabular and Graphical Procedures

Data
  Qualitative Data
    Tabular Methods: Frequency Distribution, Rel. Freq. Dist., % Freq. Dist., Cross tabulation
    Graphical Methods: Bar Graph, Pie Chart
  Quantitative Data
    Tabular Methods: Frequency Distribution, Rel. Freq. Dist., Cum. Freq. Dist., Cum. Rel. Freq. Distribution
    Graphical Methods: Histogram, Freq. curve, Box plot, Scatter Diagram
Case Study

Source: Introductory Statistics by Neil A. Weiss


Measures of Central Tendency
•Mean: Arithmetic Mean, Weighted Mean

•Median: Definition and Calculation

•Mode: Explanation and Interpretation
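
As a small illustration of the three measures, the sketch below applies Python's standard `statistics` module to the number-of-children sample used earlier in the deck. `multimode` is used because this sample has more than one most frequent value.

```python
import statistics

children = [5, 7, 8, 1, 5, 9, 3, 4, 2, 2, 3, 4, 9, 7, 1,
            4, 5, 6, 8, 9, 4, 3, 5, 2, 1]

mean = statistics.mean(children)        # sum of values / number of values
median = statistics.median(children)    # middle value of the sorted data
modes = statistics.multimode(children)  # every value with the highest frequency

print(f"mean={mean}, median={median}, modes={sorted(modes)}")
```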


Measures of Central Tendency
Measures of Central Tendency
Measures of Central Tendency – Weighted Mean

Weighted Mean

The weighted mean is an average computed by giving different weights to the individual values. When all the weights are equal, the weighted mean equals the arithmetic mean. Online weighted mean calculators can compute it for a given set of values.
Formula:
To calculate the weighted mean of a set of non-negative data x1, x2, x3, ..., xn with non-negative
weights w1, w2, w3, ..., wn, we use the formula given below.

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
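
In words, the weighted mean is the sum of each value multiplied by its weight, divided by the sum of the weights. A minimal sketch follows; the function name `weighted_mean` and the grade figures are hypothetical, chosen only to illustrate the calculation.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical course: assignments 20%, project 30%, exam 50%.
grades = [80, 90, 70]
weights = [20, 30, 50]
course_grade = weighted_mean(grades, weights)
print(course_grade)

# With equal weights the result is the ordinary arithmetic mean.
equal_weight_grade = weighted_mean(grades, [1, 1, 1])
print(equal_weight_grade)
```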
Measures of Central Tendency – Weighted Mean
(Contd…)

Uses of Weighted Means

• Weighted means are useful in a wide variety of everyday scenarios. For example, a student uses a weighted mean to calculate their percentage grade in a course: the weighting of each assessment item (e.g., assignments, exams, projects) is multiplied by the grade obtained in that item, and the products are added up.

• It is used in descriptive statistical analysis, such as index number calculation. For example, stock market indices such as Nifty or BSE Sensex are computed using the weighted average method. It can also be applied in physics to find the center of mass and moments of inertia of an object.

• It is also useful for business owners evaluating the average price of goods purchased from different vendors, where the purchased quantity is the weight. It gives a better understanding of their expenses.

• A customer's decision on whether to buy a product depends on the quality of the product, knowledge of the product, cost of the product, and service by the franchise. The customer allocates a weight to each criterion and calculates the weighted average, which helps them make a better buying decision.
Measures of Central Tendency – Percentile
➢ A percentile indicates the relative standing of a data value when data are sorted into numerical order from
smallest to largest.
➢ p percent of the data values are less than or equal to the pth percentile.

•Low percentiles always correspond to lower data values
•High percentiles always correspond to higher data values
Measures of Central Tendency – Quartiles

Quartiles are a type of percentile. A percentile is a value with a certain percentage of the data falling below it.
In general terms, k% of the data falls below the kth percentile.

➢ The first quartile (Q1, or the lowest quartile) is the 25th percentile, meaning that 25% of the data falls
below the first quartile.
➢ The second quartile (Q2, or the median) is the 50th percentile, meaning that 50% of the data falls
below the second quartile.
➢ The third quartile (Q3, or the upper quartile) is the 75th percentile, meaning that 75% of the data falls
below the third quartile.

By splitting the data at the 25th, 50th, and 75th percentiles, the quartiles divide the data into four equal parts.
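
Several conventions exist for computing percentiles from a sample. The sketch below uses the simple nearest-rank convention, which is an assumption made here for illustration (interpolating methods give slightly different values), and applies it to the hemoglobin data from the earlier example. The function name `percentile_nearest_rank` is hypothetical.

```python
import math

hb = [12.0, 11.9, 11.5, 14.2, 12.3, 13.0, 10.5, 12.8, 13.2, 11.2,
      11.2, 13.6, 10.8, 12.3, 12.3, 15.7, 12.6, 9.1, 12.9, 14.6,
      14.9, 12.2, 12.2, 11.4, 10.7, 12.5, 11.8, 15.1, 13.4, 13.1]


def percentile_nearest_rank(data, p):
    """Smallest data value with at least p percent of the data at or below it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(data)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted data
    return ordered[rank - 1]


quartiles = {p: percentile_nearest_rank(hb, p) for p in (25, 50, 75)}

for p, value in quartiles.items():
    share = sum(v <= value for v in hb) / len(hb)
    print(f"P{p} = {value} g/dl ({share:.0%} of values at or below)")
```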
Measures of Dispersion
•Definition of Skewness

•Positive and Negative Skewness: Characteristics and Interpretation

•Measures of Skewness: Calculation and Application


Measures of Dispersion
❖Dataset:
❖ Set1: 1,2,3,4,5,6,6,7,8,9,10,11
❖ Set2: 4,5,5,5,6,6,6,6,7,7,7,8

❖For Set1 and Set2: Mean = Median = Mode = 6

❖The two data sets also happen to have the same number of observations, i.e. n = 12
❖But the two data sets are different!

❖The two data sets have the same central tendency but different variability, or dispersion.

❖To describe this difference quantitatively, we use a descriptive measure that indicates the amount of
variation, or spread, in a data set. Such descriptive measures are referred to as measures of variation, measures
of spread, or measures of dispersion.
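
The comparison of Set1 and Set2 can be checked numerically. The sketch below uses the standard `statistics` module; the helper name `describe` is an assumption, and the standard deviation shown is the sample version (divisor n − 1).

```python
import statistics

set1 = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
set2 = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]


def describe(data):
    """Collect measures of center and spread for one data set."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "range": max(data) - min(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
    }


summary1 = describe(set1)
summary2 = describe(set2)

for name, summary in (("Set1", summary1), ("Set2", summary2)):
    print(name, summary)
```

The measures of center agree for both sets, while the range and standard deviation are clearly larger for Set1.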
Measures of Dispersion
Just as there are several different measures of central tendency, there are also several different
measures of variation:-
•Range
•Percentile & Quartile
•Variance
•Standard Deviation
•Skewness
•Kurtosis

Source: Introductory Statistics by Neil A. Weiss


Measures of Dispersion - Range
The range of a data set is the difference between the maximum (largest) and minimum (smallest)
observations.

Team I: Range = 78 − 72 = 6 inches,

Team II: Range = 84 − 67 = 17 inches.

Source: Introductory Statistics by Neil A. Weiss


Measures of Dispersion – Standard Deviation

• In contrast to the range, the standard deviation takes into account all the observations.

• It is the preferred measure of variation when the mean is used as the measure of center.

• The standard deviation measures variation by indicating how far, on average, the observations are from the mean.

• For a dataset with a small amount of variation, the observations will, on average, be close to the mean; so the standard deviation will be small.

Source: Introductory Statistics by Neil A. Weiss


Measures of Dispersion – Standard Deviation

Calculation:-

• The quantities $x_i - \bar{x}$ represent deviations from the mean. Adding them to get a total deviation from the mean is of no value because their sum, $\sum (x_i - \bar{x})$, always equals zero.

• To obtain quantities that do not sum to zero, we square the deviations from the mean.

• Sum up the squared deviations

• Divide by N (population) or n-1 (sample)

• Take the square root
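
The calculation steps can be written out one by one. The sketch below is an illustration under stated assumptions: the function name `standard_deviation` and its `sample` flag are hypothetical, and the data is Set2 from the dispersion example earlier in the deck.

```python
import math


def standard_deviation(data, sample=True):
    """Standard deviation computed step by step from the deviations."""
    n = len(data)
    if n < 2 and sample:
        raise ValueError("a sample standard deviation needs at least two values")
    mean = sum(data) / n
    deviations = [x - mean for x in data]   # these always sum to zero
    squared = [d ** 2 for d in deviations]  # squaring removes the signs
    divisor = n - 1 if sample else n        # n-1 for a sample, N for a population
    return math.sqrt(sum(squared) / divisor)


set2 = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
sample_sd = standard_deviation(set2)
population_sd = standard_deviation(set2, sample=False)
print(f"sample s = {sample_sd:.4f}, population sigma = {population_sd:.4f}")
```

The sample version is always slightly larger than the population version because it divides the same sum of squares by a smaller number.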

Source: Introductory Statistics by Neil A. Weiss


Measures of Dispersion – Variance

Source: Introductory Statistics by Neil A. Weiss


Coefficient of Variation
How to Calculate the Coefficient of Variation?

The coefficient of variation formula is as follows –

Coefficient of variation = (Standard Deviation / Arithmetic Mean) × 100

It is expressed as a percentage and is the most commonly used relative measure of dispersion.

To calculate the coefficient of variation, we first calculate the standard deviation and the
arithmetic mean, then apply the formula above.
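
A short sketch of the calculation, assuming the sample standard deviation is the one intended (the slides do not say which version to use). The function name `coefficient_of_variation` is hypothetical, and the data are Set1 and Set2 from the dispersion example.

```python
import statistics


def coefficient_of_variation(data):
    """Standard deviation as a percentage of the arithmetic mean."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(data) / mean * 100  # sample standard deviation


set1 = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
set2 = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]

cv1 = coefficient_of_variation(set1)
cv2 = coefficient_of_variation(set2)
print(f"CV Set1 = {cv1:.1f}%, CV Set2 = {cv2:.1f}%")
```

Because the result is unit-free, it allows the spread of data sets measured on different scales to be compared.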
Normal Distribution
•Definition and Characteristics of Normal Distribution

•Properties and Assumptions of Normal Distribution

•Importance in Statistical Analysis


What is normal distribution?

A normal distribution is a type of continuous probability distribution in which most data points
cluster toward the middle of the range, while the rest taper off symmetrically toward either
extreme. The middle of the range is also known as the mean of the distribution.

The normal distribution is also known as a Gaussian distribution or probability bell curve. It is
symmetric about the mean and indicates that values near the mean occur more frequently
than the values that are farther away from the mean.
Assumptions of Normal Distribution
Assumptions of Normal Distribution (Contd…)
Empirical rule
•In normally distributed data, a constant proportion of data points lies under the curve
between the mean and a specific number of standard deviations from the mean.
•For a normal distribution, about 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% (almost all) within 3.
•These checkpoints help you recall the appropriate percentages of the area under the curve.
•This empirical rule applies to all normal distributions, and only to normal distributions.
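
The proportions behind the empirical rule can be checked with the standard library's `statistics.NormalDist`. The sketch uses the standard normal distribution (mean 0, standard deviation 1); because the proportions depend only on the number of standard deviations, the same figures hold for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```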
Visualizing Normal Distribution

Source: Introductory Statistics by Neil A. Weiss


Assumptions of Normal
Distribution
•Independence of Observations

•Homogeneity of Variance

•Linearity

•Residuals Normally Distributed


Applications of Normal
Distribution
•Central Limit Theorem

•Z-Score and Standard Normal Distribution

•Hypothesis Testing and Confidence Intervals


Classification - Accuracy
Summary
•Recap of Key Concepts Covered

•Importance of Data Analysis in Various Fields

•Encouragement for Further Learning and Exploration
