Statistical Analysis With Software Applications BSA PDF
INSTRUCTIONAL
MATERIALS
FOR
STAT 20053
Statistical Analysis with
Software Applications
Compiled by:
THELMA D. OLAIVAR
Associate Professor 3
STAT 20053
Statistical Analysis with Software Applications
BSA 2
CONTENTS
Module 0 Overview
PRE-TEST
Using $n = \dfrac{N}{1 + Ne^2}$, fill in the following table:
N       e       n
1000    0.05    ____
1000    0.01    ____
2000    0.02    ____
2000    0.03    ____
500     0.10    ____
500     ____    250
5000    ____    1000
5000    ____    500
1.0 Types of Statistics
A population is the complete collection of elements under study. When the
population is very large, a sample, which is a representative subset of the
population, must be used instead to economize on time, money, and effort.
When gathering data, a census is attractive if the population is small because
it eliminates sampling error and provides data on all the members or elements of the
population.
Example 2: Find the sample size if the population size is 20,000 and the margin of error
is 1%.
$n = \dfrac{20000}{1 + 20000(0.01)^2} = \dfrac{20000}{1 + 2} \approx 6667$
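The computation in Example 2 can be sketched in Python. This is a minimal illustration, not part of the original module; the function name `slovin_sample_size` is hypothetical, and rounding the result up to a whole respondent is an assumption.

```python
import math

def slovin_sample_size(population_size: int, margin_of_error: float) -> int:
    """Sample size n = N / (1 + N e^2), rounded up to a whole respondent."""
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)  # assumption: always round up

# Example 2: N = 20,000 and e = 1%
print(slovin_sample_size(20000, 0.01))
```

The same function can be used to check the rows of the pre-test table where N and e are given.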
Sampling Techniques
Probability Sampling Techniques are techniques in which samples are chosen in
such a way that every member of the population has a known, though not necessarily equal,
chance of being selected. The samples used are unbiased samples.
1. Simple Random Sampling - all members of the population have an equal chance of
being selected. Randomization is done through the Lottery Technique or Fish-Bowl
Technique or by using a Table of Random Numbers.
[Figure: population N, randomization]
2. Stratified Random Sampling - this is used when the population is divided into groups
or strata. Samples are randomly selected from each stratum, and the sampling
ratio $n/N$ is used.
Use sampling ratio $= \dfrac{n}{N}$
3. Systematic Sampling with a Random Start - every kth element of the population is
selected, where $k = N/n$ is called the sampling interval. The starting
element is selected randomly.
Use $k = \dfrac{N}{n}$
[Figure: population N shown as the letters A to Z; sample n shown as D, H, L, P, T, X]
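A short sketch of systematic sampling with a random start follows. It is only an illustration: the function name `systematic_sample` is hypothetical, and flooring $k = N/n$ to a whole number when N is not a multiple of n is an assumption the module does not state.

```python
import random
import string

def systematic_sample(population, n, seed=None):
    """Pick every k-th element after a random start, with k = N // n."""
    k = len(population) // n          # sampling interval (floored: an assumption)
    rng = random.Random(seed)
    start = rng.randrange(k)          # random start within the first k elements
    return population[start::k][:n]   # keep exactly n elements

letters = list(string.ascii_uppercase)   # population A to Z, N = 26
print(systematic_sample(letters, 6, seed=1))
```

With N = 26 and n = 6 the interval is 4, so a start at D gives D, H, L, P, T, X.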
[Figure: cluster sampling, showing N in clusters and n in clusters]
[Figure: n taken from easily accessible elements]
2. Quota Sampling - this is used when there is stratification but the sampling ratio is not
used; instead, the one doing the sampling merely decides on the allocation or quota.
[Figure: n by quota]
[Figure: qualified sample]
1. The Direct or Interview Method - method in which there is direct contact with the
respondent, so a more accurate response is obtained since clarification from the
interviewee can be readily obtained.
2. The Indirect or Questionnaire Method - method in which a lot of money and time will
be saved because a questionnaire can be given to the respondents at the same time.
4. The Experimental Method - method used to find out cause-and-effect relationships.
POST TEST
Construct a frequency distribution.
Graph a frequency distribution.
Graphical Presentation of Data - may be in the form of bar graphs, line graphs or
pie charts, which help facilitate comparison and interpretation without going
through the numerical data.
Activity 2a
Consider the following table. Label its parts: table number, table title, column headers,
and source note.
Table 2a:
Distribution of BSEd Math Students
According to Year Level
On your Activity Notebook, cut and paste examples of the different types of Data Presentation.
The categorical frequency distribution is used for data that can be placed in specific
categories, such as nominal or ordinal level data.
Example 1 The following data give the blood types of 40 BSA students.
A O O A B O AB O B AB
O B O A A O O O O A
AB O A O B AB B A O O
A O O O O A O O A B
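As a rough sketch, the categorical frequency distribution for the blood types in Example 1 can be tallied with Python's `collections.Counter`. The variable names are illustrative only.

```python
from collections import Counter

# Blood types of the 40 BSA students, row by row as listed in Example 1
blood_types = (
    "A O O A B O AB O B AB "
    "O B O A A O O O O A "
    "AB O A O B AB B A O O "
    "A O O O O A O O A B"
).split()

counts = Counter(blood_types)
total = len(blood_types)

# Print class, frequency and percent for each blood type
for blood_type in ("A", "B", "O", "AB"):
    frequency = counts[blood_type]
    print(f"{blood_type:<3}{frequency:>4}{100 * frequency / total:>7.1f}%")
```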
When observations are sorted into classes of single values, the result is called a
frequency distribution for ungrouped data. When observations are sorted into classes of more
than one value, the result is called a frequency distribution for grouped data.
Lower class limit – the smallest data value that can be included in the class
Upper class limit – the largest value that can be included in the class
Class boundaries – are used to separate the classes so that there are no gaps in the
frequency distribution.
The class width of the preceding distribution is 200 (301− 101 = 200).
Step 1 Decide on the number of classes your frequency table will have. Usually,
it is between 5 and 20.
Step 2 Find the range. This is the difference between the highest and lowest
scores.
Step 3 Find the class width. Divide the range by the number of classes. The
class width should be an odd number. This ensures that the midpoint of
each class has the same place value as the data.
Step 4 Select a starting point, either the lowest score or the lower class limit. Add
the class width to the starting point to get the second lower class limit.
Then enter the upper class limit.
Step 5 Find the boundaries by subtracting 0.5 from each lower class limit and
adding 0.5 to the upper class limit.
Step 6 Represent each score by a tally.
Step 7 Count the total frequency for each class.
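The steps above can be sketched as a small Python function. This is one possible reading of the steps, not the module's own procedure: the function name `frequency_table` is hypothetical, the scores are made-up whole numbers, the starting point is assumed to be the lowest score, and the width is taken as the next whole number above range / classes, bumped to an odd number.

```python
def frequency_table(scores, num_classes):
    """Group whole-number scores into classes following Steps 2 to 7."""
    lowest, highest = min(scores), max(scores)
    data_range = highest - lowest                 # Step 2: range
    width = data_range // num_classes + 1         # Step 3: next whole number above range / classes
    if width % 2 == 0:                            # keep the class width odd
        width += 1
    rows = []
    for i in range(num_classes):
        lower = lowest + i * width                # Step 4: class limits
        upper = lower + width - 1
        frequency = sum(lower <= s <= upper for s in scores)   # Steps 6 and 7
        # Step 5: boundaries are 0.5 below and above the limits
        rows.append((lower, upper, lower - 0.5, upper + 0.5, frequency))
    return rows

scores = [25, 31, 33, 36, 38, 42, 44, 47, 49]     # hypothetical scores
for row in frequency_table(scores, 5):
    print(row)
```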
The measures of central tendency are measures that represent a set of scores. They
are called averages.
Mean – computational average
- every score participates in the computation
Notation: $\bar{x}$ sample mean
$\mu$ population mean (mu)
Formula (Ungrouped Data)
$\bar{x} = \dfrac{\Sigma x}{N}$
Illustrations:
1. Consider the grades in five quizzes in Accounting of two accounting students
Petra 75 80 85 90 95
Juana 100 75 80 70 100
Petra’s mean would be $\bar{x} = \dfrac{\Sigma x}{N} = \dfrac{425}{5} = 85$
Juana’s mean would be $\bar{x} = \dfrac{\Sigma x}{N} = \dfrac{425}{5} = 85$
2. The monthly salaries of five BSEd Math graduates a year after graduation are
as follows: P15000, P25000, P7000, P100000 and P8000
What is the mean salary?
$\bar{x} = \dfrac{\Sigma x}{N} = \dfrac{155000}{5} = 31{,}000$
Do you think P31000 is really a very good representative score?
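The two illustrations can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean

petra = [75, 80, 85, 90, 95]
juana = [100, 75, 80, 70, 100]
salaries = [15000, 25000, 7000, 100000, 8000]

print(mean(petra), mean(juana))   # the two students have the same mean
print(mean(salaries))             # pulled upward by the single P100000 salary
```

Four of the five salaries are below the mean, which is why the question of representativeness arises.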
3. Weighted Mean – used when every score is given weight
Consider the grades of Anne in four major subjects:
Subject Grade (𝑥) No. of units (𝑤) 𝑤𝑥
1.0 3 3.0
2.0 4 8.0
1.5 3 4.5
1.5 3 4.5
Σ𝑤 = 13 Σ𝑤𝑥 = 20
$\bar{x}_w = \dfrac{\Sigma wx}{\Sigma w} = \dfrac{20}{13} = 1.538$
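A short sketch of the same weighted mean in Python, using the grades and units from Anne's table (the list names are illustrative):

```python
grades = [1.0, 2.0, 1.5, 1.5]   # x
units = [3, 4, 3, 3]            # w

sum_wx = sum(w * x for w, x in zip(units, grades))
sum_w = sum(units)
weighted_mean = sum_wx / sum_w

print(sum_w, sum_wx, round(weighted_mean, 3))
```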
4. Weighted Mean – used when dealing with Likert Scale or Modified Likert Scale
Consider the following table showing the attitudes of 20 students towards
mathematics based on the level of agreement to three statements
Statement                  1    2    3    4    5    $\bar{x}_w$    Descriptive Interpretation
I like Math                0    3    5    10   2    3.55   Agree
I enjoy problem solving    1    3    7    8    1
I just want to compute     0    2    4    12   2
$\bar{x}_w = \dfrac{\Sigma fx}{\Sigma f} = \dfrac{0(1) + 3(2) + 5(3) + 10(4) + 2(5)}{20} = \dfrac{71}{20} = 3.55$
The descriptive interpretation is based on the Range Interval $= \dfrac{HS - LS}{\text{no. of scores}} = \dfrac{5 - 1}{5} = \dfrac{4}{5} = 0.80$
1.00− 1.80 Strongly Disagree
1.81− 2.60 Disagree
2.61− 3.40 Uncertain (Neutral)
3.41− 4.20 Agree
4.21− 5.00 Strongly Agree
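The weighted mean and its descriptive interpretation can be sketched as follows. The names `SCALE`, `likert_mean` and `interpret` are hypothetical; the intervals are the ones listed above, and only the first statement ("I like Math") is worked out here.

```python
# Upper limit of each interval and its descriptive interpretation
SCALE = [
    (1.80, "Strongly Disagree"),
    (2.60, "Disagree"),
    (3.40, "Uncertain (Neutral)"),
    (4.20, "Agree"),
    (5.00, "Strongly Agree"),
]

def likert_mean(frequencies):
    """Weighted mean of scores 1 to 5 given the frequency of each score."""
    total = sum(frequencies)
    return sum(f * score for score, f in enumerate(frequencies, start=1)) / total

def interpret(weighted_mean):
    """Return the label of the first interval whose upper limit covers the mean."""
    for upper_limit, label in SCALE:
        if weighted_mean <= upper_limit:
            return label
    raise ValueError("weighted mean is above 5.00")

i_like_math = [0, 3, 5, 10, 2]
m = likert_mean(i_like_math)
print(round(m, 2), interpret(m))
```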
Median – positional average
- middle score that divides the set of scores into two equal parts
Notation: $Md$, $\tilde{x}$
Formula:
$\tilde{x} = x_{\frac{n+1}{2}}$
Illustrations:
1. Consider the grades of Petra and Juana arranged in arrays
Petra 75 80 85 90 95 $\tilde{x} = x_{\frac{6}{2}} = x_3 = 85$
Juana 70 75 80 100 100 $\tilde{x} = 80$
2. The monthly salaries of five BSEd students arranged in an array
7000, 8000, 15000, 25000, 100000
$\tilde{x} = x_3 = 15{,}000$
Which is the better representative score when there are outliers
(in this case P100,000.00)?
3. What if there is an even number of scores?
85 90 65 100 90 75 95 85
Array: 65 75 85 85 90 90 95 100
$\tilde{x} = x_{\frac{8+1}{2}} = x_{4.5} = \dfrac{x_4 + x_5}{2} = \dfrac{85 + 90}{2} = 87.5$
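The three median illustrations can be verified with `statistics.median`, which sorts the scores and averages the two middle values when the count is even. A minimal sketch:

```python
from statistics import median

petra = [75, 80, 85, 90, 95]
juana = [100, 75, 80, 70, 100]
salaries = [15000, 25000, 7000, 100000, 8000]
even_scores = [85, 90, 65, 100, 90, 75, 95, 85]

print(median(petra), median(juana))
print(median(salaries))      # unaffected by the P100000 outlier
print(median(even_scores))   # average of the 4th and 5th scores in the array
```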
Mode – nominal average
- the most frequently occurring score
Notation: 𝑀𝑜 , 𝑥̂
No formula, just look at the scores with highest frequency
Illustrations:
1. For Petra’s grades, there is no mode.
For Juana’s grades, the mode is 100.
2. For the monthly salaries of five BSED graduates, there is no mode.
3. For the even number of scores, there are two modes 85 and 90 (bimodal)
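A small sketch of the mode by inspection is given below. The function name `modes` is hypothetical, and it follows the convention used in the illustrations that a set of scores where nothing repeats has no mode.

```python
from collections import Counter

def modes(scores):
    """Return the most frequent score(s); an empty list if no score repeats."""
    counts = Counter(scores)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every score occurs once: no mode
    return sorted(score for score, c in counts.items() if c == highest)

petra = [75, 80, 85, 90, 95]
juana = [100, 75, 80, 70, 100]
salaries = [15000, 25000, 7000, 100000, 8000]
even_scores = [85, 90, 65, 100, 90, 75, 95, 85]

for scores in (petra, juana, salaries, even_scores):
    print(modes(scores))
```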
Illustration:
Use the following table showing the scores of 40 BSEd freshmen in a long quiz
on Contemporary World
Classes 𝑓 𝑥 𝑓𝑥
25 – 29 1 27 27
30 – 34 2 32 64
35 – 39 8 37 296
40 – 44 20 42 840
45 - 49 9 47 423
Note that each class is represented by 𝑥 the class midmark or midpoint
$\bar{x} = \dfrac{\Sigma fx}{n} = \dfrac{1650}{40} = 41.25$
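The grouped mean can be reproduced with a few lines of Python. This sketch computes each class midpoint from the class limits and then applies $\Sigma fx / n$; the variable names are illustrative.

```python
classes = [(25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [1, 2, 8, 20, 9]

# Class midpoint x = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
grouped_mean = sum_fx / n

print(n, sum_fx, grouped_mean)
```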
Median – positional average
Properties:
1. The score or class in a distribution, below which 50% of the score fall and
above which another 50% lie.
2. Not affected by extreme or deviant values.
3. Appropriate to use when there are outliers or extreme or deviant values.
4. Use when data are ordinal.
5. It exists in both quantitative and qualitative data.
$\tilde{x} = L_B + \left(\dfrac{\frac{n}{2} - cf}{f_m}\right) i$
$= 39.5 + \left(\dfrac{20 - 11}{20}\right)(5)$
$= 39.5 + \left(\dfrac{9}{20}\right)(5)$
$= 39.5 + 2.25$
$= 41.75$
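The grouped median computation can be sketched as a function that finds the median class (the first class whose cumulative frequency reaches n/2) and then applies the formula. The function name `grouped_median` is hypothetical; the data are the 40 quiz scores used in the grouped mean illustration.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median = L_B + ((n/2 - cf) / f_m) * i for grouped data."""
    n = sum(frequencies)
    cumulative = 0                      # cf: cumulative frequency before the class
    for lower_boundary, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= n / 2:     # first class whose <cf reaches n/2
            return lower_boundary + (n / 2 - cumulative) / f * width
        cumulative += f

lower_boundaries = [24.5, 29.5, 34.5, 39.5, 44.5]
frequencies = [1, 2, 8, 20, 9]
print(grouped_median(lower_boundaries, frequencies, 5))
```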
Mode – nominal average
Properties:
1. It is used when we want to find the value which occurs most often.
3. It is an inspection average.
Illustrations:
Classes 𝑓 𝑥 Boundaries < 𝑐𝑓
25 – 29 1 27 24.5 – 29.5 1
30 – 34 2 32 29.5 – 34.5 3
35 – 39 8 37 34.5 – 39.5 11
Modal class 40 – 44 20 42 39.5 – 44.5 31
45 - 49 9 47 44.5 – 49.5 40
Identify the modal class, the class with the highest frequency.
Note that
$\Delta_1 = 20 - 8 = 12$
$\Delta_2 = 20 - 9 = 11$
$\hat{x} = 39.5 + \left(\dfrac{12}{12 + 11}\right)(5)$
$= 39.5 + 2.609$
$= 42.109$
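A sketch of the grouped mode computation follows. The function name `grouped_mode` is hypothetical, and treating a missing neighbouring class as having frequency 0 (when the modal class is the first or last class) is an assumption not covered by the module.

```python
def grouped_mode(lower_boundaries, frequencies, width):
    """Mode = L_B + (delta_1 / (delta_1 + delta_2)) * i for grouped data."""
    m = frequencies.index(max(frequencies))                        # modal class
    before = frequencies[m - 1] if m > 0 else 0                    # assumption: 0 if no class before
    after = frequencies[m + 1] if m < len(frequencies) - 1 else 0  # assumption: 0 if no class after
    delta_1 = frequencies[m] - before
    delta_2 = frequencies[m] - after
    return lower_boundaries[m] + delta_1 / (delta_1 + delta_2) * width

lower_boundaries = [24.5, 29.5, 34.5, 39.5, 44.5]
frequencies = [1, 2, 8, 20, 9]
print(round(grouped_mode(lower_boundaries, frequencies, 5), 3))
```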
3.2 Measures of Location
Measures of location are measures that divide the distribution into equal parts. They are
also called quantiles.
Median – measure of location that divides the distribution into two equal parts
[Figure: distribution divided at $\tilde{x}$ into two parts of 50% each]
Quartiles – measures of location that divide the distribution into four equal parts
[Figure: distribution divided into four parts of 25% each]
Formulas:
$Q_1 = L_{Q_1} + \dfrac{\left(\frac{N}{4} - cf_{Q_1}\right)(i)}{f_{Q_1}}$
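The $Q_1$ formula generalizes to any quantile by replacing $N/4$ with the matching fraction of $N$. The sketch below assumes that generalization; the function name `grouped_quantile` is hypothetical, and the data are the 40 quiz scores from the grouped median illustration. It is meant as a way to check hand computations, not as a replacement for writing out the formulas.

```python
def grouped_quantile(lower_boundaries, frequencies, width, fraction):
    """Locate a quantile, e.g. fraction = 1/4 for Q1 or 2/4 for Q2."""
    n = sum(frequencies)
    target = fraction * n
    cumulative = 0                      # cumulative frequency before the class
    for lower_boundary, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= target:    # class that contains the quantile
            return lower_boundary + (target - cumulative) * width / f
        cumulative += f

lower_boundaries = [24.5, 29.5, 34.5, 39.5, 44.5]
frequencies = [1, 2, 8, 20, 9]
print(grouped_quantile(lower_boundaries, frequencies, 5, 1 / 4))   # Q1
print(grouped_quantile(lower_boundaries, frequencies, 5, 2 / 4))   # Q2, the median
```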
[Figure: deciles marked from $D_1$ and $D_2$ up to $D_9$]
Percentiles – measures of location that divide the distribution in one hundred equal
parts
[Figure: percentiles marked from $P_1$ up to $P_{99}$]
Activity 3a
Use the following table.
Classes 𝑓 𝑥 Boundaries < 𝑐𝑓
25 – 29 1 27 24.5 – 29.5 1
30 – 34 2 32 29.5 – 34.5 3
35 – 39 8 37 34.5 – 39.5 11
40 – 44 20 42 39.5 – 44.5 31
45 - 49 9 47 44.5 – 49.5 40
Find the following. Write the formulas first before the computation.
1. 𝑄1 6. 𝐷9
2. 𝑄2 7. 𝑃25
3. 𝑄3 8. 𝑃75
4. 𝐷1 9. 𝑃90
5. 𝐷5 10. 𝑃99
Activity 3b
1. Solve for the mean, median and mode of the following frequency distribution
Classes 𝑓 𝑥 Boundaries < 𝑐𝑓
25 – 29 1
30 – 34 2
35 – 39 8
40 – 44 20
45 - 49 9
2. Solve for the mean, median and mode of the following frequency distribution
Classes 𝑓 𝑥 Boundaries < 𝑐𝑓
25 – 29 8
30 – 34 8
35 – 39 8
40 – 44 8
45 - 49 8
MODULE 4
Elementary Statistics
and Probability
Measures of Variability
Analyze and interpret the graphs of distribution of monthly wages of workers from three
companies 𝑋, 𝑌 and 𝑍.
$s = \sqrt{\dfrac{\Sigma (x - \bar{x})^2}{n}}$
Illustrations:
Given the data
3 5 6 6 7 10 12 15
5. Range
$\text{Range} = 15 - 3 = 12$
6. Variance and standard deviation
𝑥 𝑥 − 𝑥̅ (𝑥 − 𝑥̅ )2
3 −5 25
5 −3 9
6 −2 4
6 −2 4
7 −1 1
10 2 4
12 4 16
15 7 49
$\Sigma x = 64$    $\Sigma (x - \bar{x})^2 = 112$
$\bar{x} = 8$
$s^2 = \dfrac{\Sigma (x - \bar{x})^2}{8} = \dfrac{112}{8} = 14$
$s = \sqrt{14} = 3.741657 \approx 3.74$
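The worked table divides by n, which corresponds to the population variance and standard deviation in Python's `statistics` module (`pvariance` and `pstdev`). A minimal check:

```python
from statistics import pstdev, pvariance

data = [3, 5, 6, 6, 7, 10, 12, 15]

print(max(data) - min(data))     # range
print(pvariance(data))           # divides by n, as in the worked table
print(round(pstdev(data), 2))    # square root of the variance
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1 instead and would give different values.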
$QD = \dfrac{Q_3 - Q_1}{2} = \dfrac{11 - 5.5}{2} = \dfrac{5.5}{2} = 2.75$
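The values $Q_1 = 5.5$ and $Q_3 = 11$ are consistent with taking the medians of the lower and upper halves of the array. The sketch below assumes that method, which the module does not state explicitly; other quartile conventions give slightly different values.

```python
from statistics import median

data = sorted([3, 5, 6, 6, 7, 10, 12, 15])
half = len(data) // 2

q1 = median(data[:half])     # median of the lower half (assumed method)
q3 = median(data[-half:])    # median of the upper half (assumed method)
qd = (q3 - q1) / 2

print(q1, q3, qd)
```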
Range
Formula: (Grouped Data)
$\text{Range} = UB_H - LB_L$
where $UB_H$ is the upper boundary of the highest class and $LB_L$ is the lower boundary of the lowest class
$s = \sqrt{\dfrac{\Sigma f(x - \bar{x})^2}{n}}$
$s = \sqrt{\dfrac{\Sigma fx^2}{n} - (\bar{x})^2}$
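The two grouped formulas should give the same result. The sketch below checks this on the frequency distribution of 40 quiz scores from the grouped mean illustration; using that table here is an assumption made for the example.

```python
import math

midpoints = [27, 32, 37, 42, 47]      # class midpoints x
frequencies = [1, 2, 8, 20, 9]        # f
n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Definition formula: sum of f (x - mean)^2 over n
var_definition = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n
# Computational formula: sum of f x^2 over n, minus mean^2
var_shortcut = sum(f * x ** 2 for f, x in zip(frequencies, midpoints)) / n - mean ** 2

s = math.sqrt(var_definition)
print(var_definition, var_shortcut, round(s, 3))
```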
Activity 4a
Using the table, let us see if you can follow the formula
Solve for:
1. Range
2. Variance
3. Standard Deviation
4. 𝑄𝐷
4.2 Skewness
[Figure: negatively skewed curve, $sk < 0$, with $\bar{x}$ to the left of $\hat{x}$]
Positive Skewness means the tail is on the right and the mean is greater than mode.
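[Figure: positively skewed curve, $sk > 0$, with $\hat{x}$ to the left of $\bar{x}$]
[Figure: symmetric curve, $sk = 0$, with $\bar{x} = \hat{x}$]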
Formula 1:
Skewness (Ungrouped)
$sk = \dfrac{\Sigma (x - \bar{x})^3}{n s^3}$
Formula 2:
Skewness (Grouped)
$sk = \dfrac{\Sigma f(x - \bar{x})^3}{n s^3}$
4.3 Kurtosis
platykurtic: $k < 3$
mesokurtic: $k = 3$
leptokurtic: $k > 3$
Formula 1: (Ungrouped)
$k = \dfrac{\Sigma (x - \bar{x})^4}{n s^4}$
Formula 2: (Grouped)
$k = \dfrac{\Sigma f(x - \bar{x})^4}{n s^4}$
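The ungrouped skewness and kurtosis formulas differ only in the power used, so both can be sketched with one helper. The function name `moment_ratio` is hypothetical, s is computed by dividing by n as in the standard deviation formula of this module, and the data set is the one used in the variance illustration.

```python
def moment_ratio(data, power):
    """Sum of (x - mean)^power divided by n * s^power (power 3: sk, power 4: k)."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / n) ** 0.5   # s with divisor n
    return sum((x - mean) ** power for x in data) / (n * s ** power)

data = [3, 5, 6, 6, 7, 10, 12, 15]
sk = moment_ratio(data, 3)
k = moment_ratio(data, 4)

print(round(sk, 3), "positively skewed" if sk > 0 else "not positively skewed")
print(round(k, 3), "platykurtic" if k < 3 else "not platykurtic")
```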
Activity 4b
1.
2.
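In a study about IQ, the IQ of people approximates the normal distribution with $\mu = 100$ and
$\sigma = 15$.
[Figure: normal curve with the horizontal axis marked from −3 to 3; 68%, 95% and 99% of the area lie within 1, 2 and 3 standard deviations of the mean, respectively]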
Illustrations:
Find the following areas.
1. Area ($z = 0$ to $z = 1.56$)
Locate row 1.5 and column 6 (for the second decimal place, 0.06).
Area = 0.4406
2. Area ($z = -2$ to $z = 1$)
Area ($z = -2$ to $z = 0$) = 0.4772
Area ($z = 0$ to $z = 1$) = 0.3413
Area = 0.4772 + 0.3413 = 0.8185
3. Area ($z > 2$)
Area ($z = 0$ to $z = 2$) = 0.4772
Area = 0.5000 − 0.4772
= 0.0228
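These table look-ups can be cross-checked with `statistics.NormalDist` from Python's standard library, whose `cdf` method gives the area to the left of a z value. This is only a sketch; because the table values are rounded to four places, a sum of two table entries can differ from the software value in the fourth decimal.

```python
from statistics import NormalDist

z = NormalDist()                      # standard normal: mean 0, standard deviation 1

area_1 = z.cdf(1.56) - z.cdf(0)       # area from z = 0 to z = 1.56
area_2 = z.cdf(1) - z.cdf(-2)         # area from z = -2 to z = 1
area_3 = 1 - z.cdf(2)                 # area to the right of z = 2

print(round(area_1, 4), round(area_2, 4), round(area_3, 4))
```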
Find Pearson’s 𝑟
Find Spearman’s 𝜌
Interpret $r$ and $r^2$
________3. IQ and
________9. number of persons to do a work and time spent in completing the work
Consider the following data on the number of hours spent in studying (𝑥) and the
grades received (𝑦) by 10 students.
𝑥 𝑦 𝑅𝑥 𝑅𝑦 𝐷 𝐷2
3 72 6.5 6.0 0.5 .25
6 89 1.5 1.0 0.5 .25
2 57 9.0 10.0 -1.0 1.00
3 69 6.5 8.0 -1.5 2.25
2 63 9.0 9.0 0 0
4 75 5.0 4.0 1.0 1.00
5 73 3.5 5.0 -1.5 2.25
2 70 9.0 7.0 2.0 4.00
6 82 1.5 3.0 -1.5 2.25
5 84 3.5 2.0 1.5 2.25
Σ𝐷2 = 15.5
$\rho = 1 - \dfrac{6\,\Sigma D^2}{n(n^2 - 1)}$
$= 1 - \dfrac{6(15.5)}{10(99)}$
$= 1 - \dfrac{93}{990}$
$= 1 - 0.09394$
$= 0.90606$
$\approx 0.91$
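The ranking and the computation of $\rho$ can be sketched in Python. The function name `average_ranks` is hypothetical; it ranks from highest to lowest and gives tied values the mean of the ranks they occupy, which matches the $R_x$ and $R_y$ columns of the table.

```python
def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the mean rank."""
    ordered = sorted(values, reverse=True)
    return [
        ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
        for v in values
    ]

hours = [3, 6, 2, 3, 2, 4, 5, 2, 6, 5]             # x
grades = [72, 89, 57, 69, 63, 75, 73, 70, 82, 84]  # y

rx = average_ranks(hours)
ry = average_ranks(grades)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
n = len(hours)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(rho, 2))
```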
𝑥 𝑦 𝑥𝑦 𝑥2 𝑦2
3 72
6 89
2 57
3 69
2 63
4 75
5 73
2 70
6 82
5 84
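The columns of the table above feed the usual computational formula for Pearson's $r$, $r = \dfrac{n\Sigma xy - \Sigma x \Sigma y}{\sqrt{[n\Sigma x^2 - (\Sigma x)^2][n\Sigma y^2 - (\Sigma y)^2]}}$. That formula is the standard one and is assumed here, since the module text does not show it. A sketch for checking the hand-filled table:

```python
import math

hours = [3, 6, 2, 3, 2, 4, 5, 2, 6, 5]             # x
grades = [72, 89, 57, 69, 63, 75, 73, 70, 82, 84]  # y

n = len(hours)
sum_x, sum_y = sum(hours), sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3), round(r ** 2, 3))   # r and the coefficient of determination
```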
If two variables are correlated, that is, if $r$, the coefficient of correlation, is significant, then it
is possible to predict or estimate the value of the dependent variable from the independent
variable. This is sometimes called causal forecasting.
Another type of problem which uses regression analysis is when the values of a variable
are recorded by year and it is possible to predict the value of the variable
several years hence. This is sometimes called forecasting and is related to time-series
analysis.
For these types of problems concerning linear regression, the so-called Method of
Least Squares is used, where the “line of best fit”
$y = a + bx$ becomes the equation model.
Regression Equation: $y = a + bx$
1. $x = 7$ hours
$y = 53.31 + 5.29(7) = 90.34 \approx 90$
2. $x = 30$ minutes
$y = 53.31 + 5.29(0.5) = 53.31 + 2.645 = 55.96 \approx 56$
3. $x = 1$ hr
$y = 53.31 + 5.29(1) = 53.31 + 5.29 = 58.6 \approx 59$
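The coefficients 53.31 and 5.29 are consistent with fitting the least-squares line to the hours-and-grades data of the 10 students. The sketch below assumes that is where they come from; the function name `least_squares` is hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) of the line of best fit y = a + bx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

hours = [3, 6, 2, 3, 2, 4, 5, 2, 6, 5]
grades = [72, 89, 57, 69, 63, 75, 73, 70, 82, 84]

a, b = least_squares(hours, grades)
print(round(a, 2), round(b, 2))
for x in (7, 0.5, 1):                 # 7 hours, 30 minutes, 1 hour
    print(x, round(a + b * x))
```

Predictions made with the unrounded coefficients can differ slightly from those made with the rounded values 53.31 and 5.29.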
Problems on forecasting production, sales, income, profits, enrollment and many other
quantities that are collected at regular intervals of time can be handled by time-series analysis. The
independent variable $x$ represents the time period at regular intervals and $y$ is the dependent
variable to be forecasted.
Illustration
Consider the following data on the enrollment of a kindergarten school which initially
operated in 2015. Forecast the enrollment in 2020.
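$b = \dfrac{n\,\Sigma xy - \Sigma x\,\Sigma y}{n\,\Sigma x^2 - (\Sigma x)^2}$
$= \dfrac{5(1040) - (15)(290)}{5(55) - (15)^2}$
$= \dfrac{5200 - 4350}{275 - 225}$
$= \dfrac{850}{50}$
$= 17$
$a = \dfrac{\Sigma y}{n} - b\,\dfrac{\Sigma x}{n}$
$= 58 - (17)(3)$
$= 58 - 51$
$= 7$
$y = 7 + 17x$
For year 2020, $x = 6$
$y = 7 + (17)(6)$
$= 7 + 102$
$= 109$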
A short-cut method can be used with Σ𝑥 = 0
𝑥 𝑦 𝑥𝑦 𝑥2
-2 25 -50 4
-1 40 -40 1
0 60 0 0
1 70 70 1
2 95 190 4
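$\Sigma x = 0$    $\Sigma y = 290$    $\Sigma xy = 170$    $\Sigma x^2 = 10$
$b = \dfrac{n\,\Sigma xy - \Sigma x\,\Sigma y}{n\,\Sigma x^2 - (\Sigma x)^2} = \dfrac{\Sigma xy}{\Sigma x^2}$
$= \dfrac{170}{10}$
$= 17$
$a = \dfrac{\Sigma y}{n} = \dfrac{290}{5}$
$= 58$
$y = 58 + 17x$
For 2020, $x = 3$
$y = 58 + 51 = 109$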
The short-cut method yields the same result. What if 𝑛 is even? Use the same data
with 2020 enrollment equal to 96.
𝑥 𝑦 𝑥𝑦 𝑥2
-5 25 -125 25
-3 40 -120 9
-1 60 -60 1
1 70 70 1
3 95 285 9
5 96 480 25
$\Sigma x = 0$    $\Sigma y = 386$    $\Sigma xy = 530$    $\Sigma x^2 = 70$
$b = \dfrac{530}{70} = 7.57$
$a = \dfrac{386}{6} = 64.33$
$y = 64.33 + 7.57x$
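The short-cut method can be sketched for both odd and even n. The function names `coded_time` and `shortcut_trend` are hypothetical. The time codes are chosen so that $\Sigma x = 0$: consecutive integers for odd n, and odd integers two units apart for even n, as in the two tables above.

```python
def coded_time(n):
    """Time codes that sum to zero: step 1 for odd n, step 2 for even n."""
    step = 1 if n % 2 else 2
    start = -((n - 1) * step // 2)
    return [start + step * i for i in range(n)], step

def shortcut_trend(ys):
    """With sum(x) = 0: b = sum(xy) / sum(x^2) and a = sum(y) / n."""
    xs, step = coded_time(len(ys))
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    a = sum(ys) / len(ys)
    next_x = xs[-1] + step            # code of the next time period
    return a, b, next_x

enrollment_5 = [25, 40, 60, 70, 95]           # 2015 to 2019
enrollment_6 = [25, 40, 60, 70, 95, 96]       # 2015 to 2020

a5, b5, x_2020 = shortcut_trend(enrollment_5)
print(a5, b5, a5 + b5 * x_2020)               # forecast for 2020

a6, b6, _ = shortcut_trend(enrollment_6)
print(round(a6, 2), round(b6, 2))
```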
Activity 6d
The methods of inference used to support or reject claims based on sample data are
known as tests of significance. But it is not enough to test the significance of differences.
There is a need to write the hypothesis, thus the process of hypothesis testing.
What is a hypothesis?
An educated guess about the population parameter
An assumption about the population parameter
Hypothesis Testing : This is the process of making an inference or generalization on
population parameters based on the results of the study on
samples.
Statistical Hypotheses : It is a guess or prediction made by the researcher regarding the
possible outcome of the study.
Null Hypothesis ($H_0$): the hypothesis the researcher usually hopes to reject. It always contains the “ = ” sign.
Alternative Hypothesis ($H_a$): challenges $H_0$. It never contains the “ = ” sign;
it uses “ < ”, “ > ” or “ ≠ ”. It generally represents the idea which the researcher
wants to prove.
Hypothesis Testing : A procedure for deciding if the null hypothesis should be rejected
in favor of an alternative hypothesis, or will not be rejected.
Hypothesis Testing Approaches: Critical Value Approach and p-value Approach and the
5-step solution
The Critical Value Approach
One way of deciding whether or not to reject $H_0$ is by comparing the value of the test
statistic with the critical value. The critical value is the value that the test statistic ($Z$ or $T$)
must reach or exceed in order for the null hypothesis ($H_0$) to be rejected. We reject $H_0$ if the absolute
value of the computed $Z$ or $T$ ≥ the absolute value of the critical value.
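The decision rule can be sketched for a Z statistic. The function name `reject_null`, the default significance level of 0.05 and the computed z values are all hypothetical; the critical value is taken from the standard normal distribution with `NormalDist.inv_cdf`.

```python
from statistics import NormalDist

def reject_null(z_computed, alpha=0.05, two_tailed=True):
    """Reject H0 when |computed z| >= |critical z| at significance level alpha."""
    tail_area = alpha / 2 if two_tailed else alpha
    critical = NormalDist().inv_cdf(1 - tail_area)   # about 1.96 for alpha = 0.05, two-tailed
    return abs(z_computed) >= abs(critical)

print(reject_null(2.10))                      # hypothetical computed z
print(reject_null(1.50))
print(reject_null(1.70, two_tailed=False))    # one-tailed critical value is about 1.645
```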
The three known statistical hypothesis tests for means are the T-test, Z-test, and the F-test
or ANOVA. Please see illustrations below.
F-test (ANOVA): 2 or more means; F-distribution
Definition:
Degree of Freedom: It is the number of variables which are free to vary
Z-Test for testing the significance of the difference between the population or hypothesized mean
and the sample mean, that is, population mean vs sample mean.
Z-test: $\sigma$ is known; $n \geq 30$
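A minimal sketch of the one-sample Z statistic follows. The formula $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ is the standard one and is assumed here, since this part of the module does not show it. The function name `one_sample_z`, the sample mean of 103 and the sample size of 36 are hypothetical; $\mu = 100$ and $\sigma = 15$ are the IQ parameters mentioned earlier.

```python
import math

def one_sample_z(sample_mean, population_mean, sigma, n):
    """z = (sample mean - population mean) / (sigma / sqrt(n)), for known sigma."""
    return (sample_mean - population_mean) / (sigma / math.sqrt(n))

# Hypothetical sample: mean 103 from n = 36 people, tested against mu = 100, sigma = 15
z = one_sample_z(sample_mean=103, population_mean=100, sigma=15, n=36)
print(round(z, 2))
```

The computed z would then be compared with the critical value as described under the critical value approach.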
Quizzes
Attendance / Active Use of Module / Online Activity
Recitation
Group or Paired Output / Assignment / Seatwork