( Final all Units) Statistical Computing

Business Statistics And Mathematics (Madurai Kamaraj University)

Statistical Computing (SC)


Abstract:
Correlation - Definition of Correlation - Karl Pearson’s Coefficient of Linear Correlation -
Regression Analysis - Regression and Correlation - Regression Coefficients - Probability
Distribution - Random Variable - Expectation of Random Variable - Sampling and Sampling
Distributions - Chi-Square (χ²) and Snedecor’s F Distributions - Statistical Inference - Testing
of Hypothesis - Significance of a Mean Using t Distribution

Created By: -
Dr. S.Dinesh, M.Sc., MRM, M.Phil., Ph.D., Post.Doc
Lecturer (GL)
Department of Computer Science,
Government Art and Science College,
Veerapandi-625534,
Theni.

Unit -1
Correlation - Definition of Correlation - Scatter Diagram - Karl Pearson’s Coefficient of Linear
Correlation - Coefficient of Correlation and Probable Error of r - Coefficient of Determination - Merits and
Limitations of Coefficient of Correlation - Spearman’s Rank Correlation
(7.1-7.9.4).

Unit -2
Regression Analysis - Regression and Correlation (Intro) - Difference between Correlation and
Regression Analysis - Linear Regression Equations - Least Square Method - Regression Lines - Properties of
Regression Coefficients - Standard Error of Estimate (8.1-8.8).

Unit -3
Probability Distribution and Mathematical Expectation - Random Variable - Defined - Probability
Distribution of a Random Variable - Expectation of Random Variable - Properties of Expected Value and
Variance (12.2-12.4).

Unit -4
Sampling and Sampling Distributions - Data Collection - Sampling and Non-Sampling Errors -
Principles of Sampling - Merits and Limitations of Sampling - Methods of Sampling - Parameter and Statistic -
Sampling Distribution of a Statistic - Examples of Sampling Distributions - Standard Normal, Student’s t,
Chi-Square (χ²) and Snedecor’s F Distributions (14.1-14.16).

Unit -5
Statistical Inference - Estimation and Testing of Hypothesis - Statistical Inference - Estimation - Point
and Interval - Confidence Interval using Normal, t and χ² Distributions - Testing of Hypothesis - Significance of
a Mean Using t Distribution (15.1-15.10.2).

Text Book:
K.L. Sehgal, “Quantitative Techniques and Statistics”, First Edition, Himalaya Publishing House, 2011

Unit – 1
1.1 Correlation
1.2 Scatter Diagram
1.3 Karl Pearson’s Coefficient of Correlation
1.4 Spearman’s Rank Coefficient of Correlation

Unit – 2
2.1 Meaning
2.2 Distinction Between Correlation and Regression
2.3 Review of Correlation and Regression Analysis

Unit – 3
3.1 Probability Distribution and Mathematical Expectation
3.2 Random Variable - Defined - Probability Distribution of a Random Variable - Expectation of Random Variable
3.3 Properties of Expected Value and Variance

Unit – 4
4.1 Sampling and Sampling Distributions - Data Collection
4.2 Sampling and Non-Sampling Errors - Principles of Sampling
4.3 Merits and Limitations of Sampling - Methods of Sampling
4.4 Parameter and Statistic - Sampling Distribution of a Statistic
4.5 Examples of Sampling Distributions - Standard Normal
4.6 Student’s t, Chi-Square (χ²) and Snedecor’s F Distributions

Unit – 5
5.1 Statistical Inference - Estimation and Testing of Hypothesis
5.2 Statistical Inference - Estimation - Point and Interval
5.3 Confidence Interval Using Normal, t and χ² Distributions
5.4 Testing of Hypothesis - Significance of a Mean Using t Distribution

UNIT – 1 : CORRELATION
Introduction:
1.1 Meaning:
Correlation is a statistical technique to ascertain the association or relationship
between two or more variables. Correlation analysis studies the degree and
direction of that relationship.
A correlation coefficient is a statistical measure of the degree to which changes
in the value of one variable predict changes in the value of another. When the
fluctuation of one variable reliably predicts a similar fluctuation in another variable,
there is often a tendency to think that the change in one causes the change
in the other. Correlation by itself, however, does not establish causation.
Uses of correlation:
1. Correlation analysis helps in deriving precisely the degree and the direction of
the relationship between variables.
2. The effect of correlation is to reduce the range of uncertainty of our prediction.
A prediction based on correlation analysis will be more reliable and nearer to
reality.
3. Correlation analysis contributes to the understanding of economic behaviour,
aids in locating the critically important variables on which others depend, may
reveal to the economist the connections by which disturbances spread, and may
suggest the paths through which stabilizing forces may become effective.
4. Economic theory and business studies show relationships between variables
such as price and quantity demanded, or advertising expenditure and sales promotion
measures.
5. The coefficient of correlation is a relative measure of change.

Created by: Dr. S. Dinesh Ph.D. Department Computer Science, GASC



Types of Correlation:
Correlation is described or classified in several different ways. Three of the
most important are:
I. Positive and Negative
II. Simple, Partial and Multiple
III. Linear and non-linear
I. Positive, Negative and Zero Correlation:
Whether correlation is positive (direct) or negative (inverse) depends
upon the direction of change of the variables.
Positive Correlation: If both variables vary in the same direction, the correlation is
said to be positive. That is, if one variable increases, the other on average
also increases, and if one variable decreases, the other on average also
decreases. For example, the correlation between the heights and weights of a
group of persons is positive.
Height (cm) : X    158  160  163  166  168  171  174  176
Weight (kg) : Y     60   62   64   65   67   69   71   72
Negative Correlation: If the variables vary in opposite directions, the correlation
is said to be negative. That is, if one variable increases, the other on average
decreases, and if one variable decreases, the other on average increases.
For example, the correlation between the price of a product and its demand
is negative.
Price of Product (Rs. per Unit) : X     6    5    4    3    2    1
Demand (in Units) : Y                  75  120  175  250  215  400
Zero Correlation: Strictly speaking this is not a type of correlation, but it is still
called zero or no correlation. When we do not find any relationship between the
variables, the correlation is said to be zero. It means a change in the value of one
variable does not influence the value of the other variable. For example, the
correlation between the weight of a person and intelligence is zero.
II. Simple, Partial and Multiple Correlation:
The distinction between simple, partial and multiple correlation is based upon
the number of variables studied.
Simple Correlation: When only two variables are studied, it is a case of simple
correlation. For example, when one studies relationship between the marks secured
by student and the attendance of student in class, it is a problem of simple correlation.
Partial Correlation: In partial correlation one studies three or more variables
but considers the relationship between only two of them, with the effect of the other
influencing variables held constant. For example, in the above example of the
relationship between student marks and attendance, other influencing variables
such as the effectiveness of the teacher and the use of teaching aids like computers
and smart boards are assumed to be constant.
Multiple Correlation: When three or more variables are studied simultaneously, it is a
case of multiple correlation. For example, if the study in the above example covers the
relationship between student marks, attendance of students, effectiveness of the teacher,
use of teaching aids, etc., it is a case of multiple correlation.
III. Linear and Non-linear Correlation:
Depending upon the constancy of the ratio of change between the variables, the
correlation may be Linear or Non-linear Correlation.
Linear Correlation: If the amount of change in one variable bears a constant ratio to
the amount of change in the other variable, the correlation is said to be linear. If such
variables are plotted on graph paper, all the plotted points fall on a straight
line. For example, if producing 2 units of finished product requires 10 units of raw
material, then producing 4 units requires 20 units of raw material, and so on in
the same ratio.
Raw material : X        10   20   30   40   50   60
Finished Product : Y     2    4    6    8   10   12
Non-linear Correlation: If the amount of change in one variable does not bear a
constant ratio to the amount of change in the other variable, the correlation is said to
be non-linear. If such variables are plotted on a graph, the points fall on a curve
and not on a straight line. For example, if we double the amount of advertisement
expenditure, the sales volume would not necessarily double.
Advertisement Expenses : X   10   20   30   40   50   60
Sales Volume : Y              2    4    6    8   10   12

Illustration 01:
State in each case whether there is
(a) Positive Correlation
(b) Negative Correlation
(c) No Correlation
Sl No   Particulars                                      Solution
1       Price of commodity and its demand                Negative
2       Yield of crop and amount of rainfall             Positive
3       No. of fruits eaten and hunger of a person       Negative
4       No. of units produced and fixed cost per unit    Negative
5       No. of girls in the class and marks of boys      No Correlation
6       Ages of husbands and wives                       Positive
7       Temperature and sale of woollen garments         Negative
8       Number of cows and milk produced                 Positive
9       Weight of person and intelligence                No Correlation
10      Advertisement expenditure and sales volume       Positive

Methods of measurement of correlation:


Quantifying the relationship between variables is essential to benefit from the
study of correlation. The various methods of measuring correlation can be
represented as follows:

Methods of Measurement of Correlation

Graphic Method
1. Scatter Diagram
2. Graph Method

Algebraic Method
1. Karl Pearson’s Coefficient of Correlation
2. Spearman’s Rank Coefficient of Correlation
3. Concurrent Deviation Method
4. Method of Least Square

Among these methods we will discuss only the following methods:


1. Scatter Diagram
2. Karl Pearson’s Coefficient of Correlation
3. Spearman’s Rank Coefficient of Correlation

1.2 Scatter Diagram:


This is a graphic method of measuring correlation. It is a diagrammatic
representation of bivariate data used to ascertain the relationship between two variables.
Under this method the given data are plotted on graph paper in the form of dots, i.e.
for each pair of X and Y values we put a dot and thus obtain as many points as the
number of observations. Usually the independent variable is shown on the X-axis
and the dependent variable on the Y-axis. Once the values are plotted, the graph
reveals the type of correlation between variables X and Y, that is, whether the
movements in one series are associated with those in the other series.
• Perfect Positive Correlation: In this case, the points lie on a straight line
rising from the lower left hand corner to the upper right hand corner.
• Perfect Negative Correlation: In this case, the points lie on a straight line
falling from the upper left hand corner to the lower right hand corner.
• High Degree of Positive Correlation: In this case, the plotted points fall in a
narrow band, wherein points show a rising tendency from the lower left hand
corner to the upper right hand corner.

• High Degree of Negative Correlation: In this case, the plotted points fall in a
narrow band, wherein points show a declining tendency from the upper left hand
corner to the lower right hand corner.
• Low Degree of Positive Correlation: In this case, the points are widely scattered over
the diagram but still rise from the lower left hand corner to the upper right
hand corner.
• Low Degree of Negative Correlation: In this case, the points are widely scattered over
the diagram but still decline from the upper left hand corner to the
lower right hand corner.
• Zero (No) Correlation: When the plotted points are scattered over the graph
haphazardly, it indicates that there is no correlation (zero correlation)
between the two variables.

[Diagrams I to VII: scatter diagrams]

Illustration 02:
Given the following pairs of values:
Capital Employed (Rs. In Crore) 1 2 3 4 5 7 8 9 11 12
Profit (Rs. In Lakhs) 3 5 4 7 9 8 10 11 12 14
(a) Draw a scatter diagram
(b) Do you think that there is any correlation between profits and capital
employed? Is it positive or negative? Is it high or low?
Solution:
From the scatter diagram we can say that the variables are positively
correlated. The points trend upward from the lower left
hand corner to the upper right hand corner, hence the correlation is positive. The plotted
points lie in a narrow band, which indicates a high degree of positive
correlation.

[Figure: scatter diagram with Capital Employed (Rs. in Crore) on the X-axis and Profit (Rs. in Lakhs) on the Y-axis]

1.3 Karl Pearson’s Coefficient of Correlation:

Illustration 03:
From following information find the correlation coefficient between advertisement
expenses and sales volume using Karl Pearson’s coefficient of correlation method.
Firm 1 2 3 4 5 6 7 8 9 10
Advertisement Exp. (Rs. In Lakhs) 11 13 14 16 16 15 15 14 13 13
Sales Volume (Rs. In Lakhs) 50 50 55 60 65 65 65 60 60 50

Solution:
Let us assume that advertisement expenses are variable X and sales volume are
variable Y.
Calculation of Karl Pearson’s coefficient of correlation
Firm     X      Y     x = X − X̄    x²     y = Y − Ȳ    y²     xy
1        11     50       -3         9        -8        64     24
2        13     50       -1         1        -8        64      8
3        14     55        0         0        -3         9      0
4        16     60        2         4         2         4      4
5        16     65        2         4         7        49     14
6        15     65        1         1         7        49      7
7        15     65        1         1         7        49      7
8        14     60        0         0         2         4      0
9        13     60       -1         1         2         4     -2
10       13     50       -1         1        -8        64      8
Total   140    580                 22                 360     70
        (∑X)   (∑Y)               (∑x²)              (∑y²)   (∑xy)
\[ \bar{X} = \frac{\sum X}{n} = \frac{140}{10} = 14, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{580}{10} = 58 \]

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{70}{\sqrt{22 \times 360}} = \frac{70}{88.9944} = 0.7866 \]

Interpretation: The calculation shows a high degree of positive correlation
(r = 0.7866) between the two variables, i.e. higher advertisement expenses are
associated with higher sales volume.
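The hand calculation above can be cross-checked with a short script. This is a minimal sketch using only the Python standard library; the function name `pearson_r` and the variable names are illustrative and do not come from the notes.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the means: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # x = X - mean(X)
    dy = [y - mean_y for y in ys]          # y = Y - mean(Y)
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Data of Illustration 03
adv = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]
sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]

r = pearson_r(adv, sales)
print(round(r, 4))  # 0.7866
```

The function mirrors the columns of the calculation table (x, y, x², y², xy), so each intermediate sum can be printed and compared with the table if needed.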

Illustration 04:
Find the correlation coefficient between age and playing habits of the following
students using Karl Pearson’s coefficient of correlation method.
Age 15 16 17 18 19 20
Number of students 250 200 150 120 100 80
Regular Players 200 150 90 48 30 12

Solution:
To find the correlation between age and playing habits of the students, we first need to
compute the percentage of students in each age group who have the playing habit.

Percentage of playing habits = (No. of Regular Players / Total No. of Students) × 100

Now, let the ages of the students be variable X and the percentages of playing
habits be variable Y.

Calculation of Karl Pearson’s coefficient of correlation

Age    No. of     Regular   Percentage of
(X)    Students   Players   Playing Habits (Y)   X − X̄   (X − X̄)²   Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
15     250        200       80                   -2.5     6.25        30      900        -75
16     200        150       75                   -1.5     2.25        25      625        -37.5
17     150         90       60                   -0.5     0.25        10      100         -5
18     120         48       40                    0.5     0.25       -10      100         -5
19     100         30       30                    1.5     2.25       -20      400        -30
20      80         12       15                    2.5     6.25       -35     1225        -87.5
Total  105                  300                           17.5               3350       -240
       (∑X)                 (∑Y)                          (∑x²)              (∑y²)       (∑xy)

\[ \bar{X} = \frac{\sum X}{n} = \frac{105}{6} = 17.5, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{300}{6} = 50 \]

\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{-240}{\sqrt{17.5 \times 3350}} = \frac{-240}{242.126} = -0.9912 \]

Interpretation: The calculation shows a high degree of negative correlation
(r = -0.9912) between age and playing habits, i.e. the proportion of students
who play regularly decreases as age increases.
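The two-stage calculation (percentages first, then the correlation) can be sketched as follows. This is an illustrative standard-library sketch; the variable names are assumptions made for the example, and `statistics.fmean` is used only to compute the means.

```python
import math
from statistics import fmean

# Data of Illustration 04
ages = [15, 16, 17, 18, 19, 20]
students = [250, 200, 150, 120, 100, 80]
players = [200, 150, 90, 48, 30, 12]

# Y = percentage of students in each age group who play regularly
pct = [100 * p / s for p, s in zip(players, students)]

mean_x, mean_y = fmean(ages), fmean(pct)
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, pct))
sum_x2 = sum((x - mean_x) ** 2 for x in ages)
sum_y2 = sum((y - mean_y) ** 2 for y in pct)

r = sum_xy / math.sqrt(sum_x2 * sum_y2)
print(pct)          # [80.0, 75.0, 60.0, 40.0, 30.0, 15.0]
print(round(r, 4))  # -0.9912
```

Correlating age with the raw number of players would mix in the differing group sizes, which is why the percentages are computed first.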

Illustration 05:
Find Karl Pearson’s coefficient of correlation between capital employed and profit
obtained from the following data.
Capital Employed (Rs. In Crore) 10 20 30 40 50 60 70 80 90 100
Profit (Rs. In Crore) 2 4 8 5 10 15 14 20 22 50

Solution:
Let us assume that capital employed is variable X and profit is variable Y.

Calculation of Karl Pearson’s coefficient of correlation

X      Y      X²       Y²      XY
10     2      100      4       20
20     4      400      16      80
30     8      900      64      240
40     5      1600     25      200
50     10     2500     100     500
60     15     3600     225     900
70     14     4900     196     980
80     20     6400     400     1600
90     22     8100     484     1980
100    50     10000    2500    5000
550    150    38500    4014    11500
∑X     ∑Y     ∑X²      ∑Y²     ∑XY

\[ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} \]

\[ r = \frac{(10 \times 11500) - (550 \times 150)}{\sqrt{[(10 \times 38500) - 550^2][(10 \times 4014) - 150^2]}} \]

\[ r = \frac{115000 - 82500}{\sqrt{[385000 - 302500][40140 - 22500]}} = \frac{32500}{\sqrt{82500 \times 17640}} = \frac{32500}{\sqrt{1455300000}} \]

\[ r = \frac{32500}{38148.3945} = 0.8519 \]
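The direct (raw sums) formula used in this illustration avoids computing deviations from the means. A minimal sketch, assuming nothing beyond the standard library; the function name `pearson_r_raw` is illustrative.

```python
import math

def pearson_r_raw(xs, ys):
    """Pearson's r from raw sums: (n*SXY - SX*SY) / sqrt((n*SXX - SX^2) * (n*SYY - SY^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

# Data of Illustration 05
capital = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
profit = [2, 4, 8, 5, 10, 15, 14, 20, 22, 50]

r = pearson_r_raw(capital, profit)
print(round(r, 4))  # 0.8519
```

Because the inputs are integers, the numerator and the two bracketed terms are computed exactly; only the square root and the final division use floating point.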

Illustration 06:
A computer while calculating the correlation coefficient between the variable X and Y
obtained the following results:
N = 30; ∑X = 120; ∑X² = 600; ∑Y = 90; ∑Y² = 250; ∑XY = 335
It was, however, later discovered at the time of checking that it had copied down two
pairs of observations as: (X, Y) : (8, 10) (12, 7)
While the correct values were: (X, Y) : (8, 12) (10, 8)
Obtain the correct value of the correlation coefficient between X and Y.

Solution:
Correct ∑X = 120 – 8 – 12 + 8 + 10 = 118
Correct ∑X² = 600 – 8² – 12² + 8² + 10²
= 600 – 64 – 144 + 64 + 100 = 556
Correct ∑Y = 90 – 10 – 7 + 12 + 8 = 93
Correct ∑Y² = 250 – 10² – 7² + 12² + 8²
= 250 – 100 – 49 + 144 + 64 = 309
Correct ∑XY = 335 – (8*10) – (12*7) + (8*12) + (10*8)
= 335 – 80 – 84 + 96 + 80 = 347

\[ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} \]

\[ r = \frac{(30 \times 347) - (118 \times 93)}{\sqrt{[(30 \times 556) - 118^2][(30 \times 309) - 93^2]}} = \frac{10410 - 10974}{\sqrt{[16680 - 13924][9270 - 8649]}} \]

\[ r = \frac{-564}{\sqrt{2756 \times 621}} = \frac{-564}{\sqrt{1711476}} = \frac{-564}{1308.2339} = -0.4311 \]

Therefore, the correct value of the correlation coefficient between X and Y is -0.4311, a moderate
negative correlation.
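The correction procedure (remove the contribution of the wrongly copied pairs from each sum, add the contribution of the correct pairs, then recompute r) can be sketched as below. The dictionary keys and the helper name `contribution` are illustrative choices, not notation from the notes.

```python
import math

n = 30
# Sums as originally (incorrectly) obtained
sums = {"x": 120, "x2": 600, "y": 90, "y2": 250, "xy": 335}
wrong_pairs = [(8, 10), (12, 7)]   # pairs as copied down
right_pairs = [(8, 12), (10, 8)]   # correct pairs

def contribution(pairs):
    """Amount that a list of (X, Y) pairs contributes to each of the five sums."""
    return {
        "x": sum(x for x, _ in pairs),
        "x2": sum(x * x for x, _ in pairs),
        "y": sum(y for _, y in pairs),
        "y2": sum(y * y for _, y in pairs),
        "xy": sum(x * y for x, y in pairs),
    }

wrong, right = contribution(wrong_pairs), contribution(right_pairs)
corrected = {k: sums[k] - wrong[k] + right[k] for k in sums}

numerator = n * corrected["xy"] - corrected["x"] * corrected["y"]
denominator = math.sqrt(
    (n * corrected["x2"] - corrected["x"] ** 2)
    * (n * corrected["y2"] - corrected["y"] ** 2)
)
r = numerator / denominator
print(corrected)
print(round(r, 4))  # -0.4311
```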

Illustration 07:
The coefficient of correlation between X and Y is 0.3. Their covariance is 9. The variance of
X is 16. Find the standard deviation of the Y series.

Solution:
Given information:
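r = 0.3, Cov(X, Y) = 9, Var(X) = 16

\[ r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \times \mathrm{Var}(Y)}} \;\Rightarrow\; 0.3 = \frac{9}{\sqrt{16 \times \mathrm{Var}(Y)}} = \frac{9}{4\sqrt{\mathrm{Var}(Y)}} = \frac{9}{4 \times SD(Y)} \]

\[ 0.3 \times 4 = \frac{9}{SD(Y)} \;\Rightarrow\; 1.2 = \frac{9}{SD(Y)} \;\Rightarrow\; SD(Y) = \frac{9}{1.2} = 7.5 \]

Therefore, the standard deviation of the Y series is σ(Y) = 7.5.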
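The rearrangement SD(Y) = Cov(X, Y) / (r × SD(X)) can be checked numerically. A small sketch with illustrative variable names:

```python
import math

r = 0.3        # given coefficient of correlation
cov_xy = 9.0   # given covariance
var_x = 16.0   # given variance of X

sd_x = math.sqrt(var_x)        # SD(X) = 4
sd_y = cov_xy / (r * sd_x)     # rearranged from r = Cov / (SD(X) * SD(Y))
print(round(sd_y, 4))          # 7.5

# Substituting back should reproduce the given r
r_check = cov_xy / (sd_x * sd_y)
```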

Illustration 08:
Calculate correlation coefficient from the following two-way table, with X representing
the average salary of families selected at random in a given area and Y representing
the average expenditure on entertainment.
Expenditure on Average Salary (Rs. ‘000)
Entertainment (Rs. ‘000) 100-150 150-200 200-250 250-300 300-350
0 – 10 5 4 5 2 4
10 – 20 2 7 3 7 1
20 – 30 - 6 - 4 5
30 – 40 8 - 4 - 8
40 – 50 - 7 3 5 10

Solution:
Let us assume that Average Salary is variable X and Expenditure on
Entertainment is variable Y.
In the case of grouped data, we follow the assumed mean method to
calculate Karl Pearson’s Coefficient of Correlation. The steps are:
1. Identify the mid-point of the class intervals for variables X and Y.
2. Choose an assumed mean from the mid-points identified above for both X and Y.
3. To simplify further, divide each deviation from the assumed mean by a common
factor (here, the class width) to obtain dx and dy.
4. Add the values in the cells, row-wise and column-wise, to compute the frequencies (f).
The sum of either the row totals or the column totals gives the value of N.
5. Obtain the product of dx, dy and the corresponding frequency (f) in each
cell. Write the figure thus obtained in the right corner of each cell; it
represents the value of fdxdy.

Calculation of Karl Pearson’s coefficient of correlation

Each cell shows the frequency f, followed by the product f·dx·dy in parentheses.

Y class   Mid-    dy  | X: 100-150   150-200   200-250   250-300   300-350  |   f    fdy   fdy²   fdxdy
          point       |   (MP 125)  (MP 175)  (MP 225)  (MP 275)  (MP 325)  |
0 – 10      5     -2  |    5 (20)     4 (8)     5 (0)     2 (-4)    4 (-16) |  20    -40    80       8
10 – 20    15     -1  |    2 (4)      7 (7)     3 (0)     7 (-7)    1 (-2)  |  20    -20    20       2
20 – 30    25      0  |    -          6 (0)     -         4 (0)     5 (0)   |  15      0     0       0
30 – 40    35      1  |    8 (-16)    -         4 (0)     -         8 (16)  |  20     20    20       0
40 – 50    45      2  |    -          7 (-14)   3 (0)     5 (10)   10 (40)  |  25     50   100      36
f                     |   15         24        15        18        28       |  N = 100; ∑fdy = 10; ∑fdy² = 220; ∑fdxdy = 46
dx                    |   -2         -1         0         1         2       |
fdx                   |  -30        -24         0        18        56       |  ∑fdx = 20
fdx²                  |   60         24         0        18       112       |  ∑fdx² = 214
fdxdy                 |    8          1         0        -1        38       |  ∑fdxdy = 46

dx = (Mid Point of Series X – Assumed Mean of Series X) / 50 = (MP(X) – 225) / 50

dy = (Mid Point of Series Y – Assumed Mean of Series Y) / 10 = (MP(Y) – 25) / 10

\[ r = \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{\sqrt{[N\sum f d_x^2 - (\sum f d_x)^2][N\sum f d_y^2 - (\sum f d_y)^2]}} = \frac{(100 \times 46) - (20 \times 10)}{\sqrt{[(100 \times 214) - 20^2][(100 \times 220) - 10^2]}} \]

\[ r = \frac{4600 - 200}{\sqrt{[21400 - 400][22000 - 100]}} = \frac{4400}{\sqrt{21000 \times 21900}} = \frac{4400}{21445.2792} = 0.2052 \]

Interpretation: The calculation shows a low degree of positive correlation
(r = 0.2052) between salary and entertainment expenditure. It means that
average salary has only a slight influence on entertainment expenditure.
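The grouped-data steps above can be sketched in code. The sketch assumes the assumed means 225 and 25 and the common factors 50 and 10 used in this illustration; the variable names are illustrative, and empty cells of the table are entered as 0.

```python
import math

# Class mid-points of X (average salary) and Y (entertainment expenditure)
x_mid = [125, 175, 225, 275, 325]
y_mid = [5, 15, 25, 35, 45]

# freq[i][j] = number of families in Y class i and X class j ("-" entered as 0)
freq = [
    [5, 4, 5, 2, 4],
    [2, 7, 3, 7, 1],
    [0, 6, 0, 4, 5],
    [8, 0, 4, 0, 8],
    [0, 7, 3, 5, 10],
]

# Step deviations from the assumed means, divided by the class widths
dx = [(m - 225) // 50 for m in x_mid]   # [-2, -1, 0, 1, 2]
dy = [(m - 25) // 10 for m in y_mid]    # [-2, -1, 0, 1, 2]

n = sum(map(sum, freq))
col_tot = [sum(row[j] for row in freq) for j in range(len(x_mid))]
row_tot = [sum(row) for row in freq]

sum_fdx = sum(f * d for f, d in zip(col_tot, dx))
sum_fdx2 = sum(f * d * d for f, d in zip(col_tot, dx))
sum_fdy = sum(f * d for f, d in zip(row_tot, dy))
sum_fdy2 = sum(f * d * d for f, d in zip(row_tot, dy))
sum_fdxdy = sum(
    freq[i][j] * dx[j] * dy[i]
    for i in range(len(y_mid))
    for j in range(len(x_mid))
)

numerator = n * sum_fdxdy - sum_fdx * sum_fdy
denominator = math.sqrt(
    (n * sum_fdx2 - sum_fdx ** 2) * (n * sum_fdy2 - sum_fdy ** 2)
)
r = numerator / denominator
print(n, sum_fdx, sum_fdy, sum_fdx2, sum_fdy2, sum_fdxdy)  # 100 20 10 214 220 46
print(round(r, 4))  # 0.2052
```

Dividing by the class widths does not change r, since the coefficient is independent of change of origin and scale.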

1.4 Spearman’s Rank Coefficient of Correlation:


When quantification of variables is difficult, as with beauty, leadership
ability or knowledge of a person, the method of rank correlation is useful. It
was developed by the British psychologist Charles Edward Spearman in 1904. In this
method ranks are allotted to each element, either in ascending or descending order.
The correlation coefficient between the two allotted series of ranks is popularly
called “Spearman’s Rank Correlation” and is denoted by “R”.

To find the correlation under this method, the following formula is used:

\[ R = 1 - \frac{6\sum D^2}{N^3 - N} \]

where D = difference of the ranks between paired items in the two series, and
N = number of pairs of ranks.

In case of tie in ranks or equal ranks:


In some cases it becomes necessary to assign the same rank
to two or more elements, individuals or entries. In such a situation, it is customary to
give each individual or entry an average rank. For example, if two individuals are
ranked equal at 5th place, both are allotted the common rank (5+6)/2 =
5.5, and if three are ranked equal at 5th place, they are given the rank (5+6+7)/3 = 6.
In other words, where two or more individuals are to be ranked equal, the rank assigned for
the purpose of calculating the coefficient of correlation is the average of the ranks
these individuals, items or entries would have received had they differed slightly from each
other.
Where equal ranks are assigned to some entries, an adjustment factor is to be
added to the value of ∑D² in the above formula for calculating the rank coefficient of
correlation. This adjustment factor is added once for every group of repeated ranks.

\[ \text{Adjustment factor} = \frac{1}{12}(m^3 - m) \]

where m = number of items whose ranks are common.
For example, if a particular rank is repeated two times then m = 2, if it is repeated three
times then m = 3, and so on.
Hence the above formula can be re-written as follows, with one term m₁, m₂, m₃, … for each
group of repeated ranks:

\[ R = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \cdots\right]}{N^3 - N} \]

Illustration 09:
Find out spearman’s coefficient of correlation between the two kinds of assessment of
graduate students’ performance in a college.
Name of students A B C D E F G H I
Internal Exam 51 68 73 46 50 65 47 38 60
External Exam 49 72 74 44 58 66 50 30 35

Solution:
Calculation of Spearman’s Rank Coefficient of Correlation
Name   Internal Exam   Rank (R1)   External Exam   Rank (R2)   D = R1 – R2   D²
A      51              5           49              6           -1            1
B      68              2           72              2            0            0
C      73              1           74              1            0            0
D      46              8           44              7            1            1
E      50              6           58              4            2            4
F      65              3           66              3            0            0
G      47              7           50              5            2            4
H      38              9           30              9            0            0
I      60              4           35              8           -4            16
                                                               ∑D² = 26

\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 26}{9^3 - 9} = 1 - \frac{156}{729 - 9} = 1 - \frac{156}{720} = 1 - 0.2167 = 0.7833 \]

Interpretation: The calculation shows a high degree of positive correlation
(R = 0.7833) between the students’ performance in the internal exam and in the
external exam.
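The ranking and the rank-difference formula can be sketched as follows, taking the marks from the problem statement of Illustration 09. The helper names `ranks_desc` and `spearman` are illustrative, and the ranking helper assumes there are no tied values.

```python
def ranks_desc(values):
    """Rank 1 = largest value. Assumes all values are distinct (no ties)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's R = 1 - 6*sum(D^2) / (N^3 - N) for untied data."""
    n = len(xs)
    r1, r2 = ranks_desc(xs), ranks_desc(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

# Students A to I
internal = [51, 68, 73, 46, 50, 65, 47, 38, 60]
external = [49, 72, 74, 44, 58, 66, 50, 30, 35]

R = spearman(internal, external)
print(ranks_desc(internal))  # [5, 2, 1, 8, 6, 3, 7, 9, 4]
print(ranks_desc(external))  # [6, 2, 1, 7, 4, 3, 5, 9, 8]
print(round(R, 4))           # 0.7833
```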

Illustration 10:
The coefficient of rank correlation of the marks obtained by 10 students in statistics
and accountancy was found to be 0.8. It was later discovered that the difference in
ranks in the two subjects obtained by one of the students was wrongly taken as 7
instead of 9. Find the correct coefficient of rank correlation.

Solution:
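\[ R = 1 - \frac{6\sum D^2}{N^3 - N} \;\Rightarrow\; 0.8 = 1 - \frac{6\sum D^2}{10^3 - 10} \;\Rightarrow\; 0.8 = 1 - \frac{6\sum D^2}{990} \;\Rightarrow\; \frac{6\sum D^2}{990} = 1 - 0.8 = 0.2 \]

\[ 6\sum D^2 = 0.2 \times 990 = 198 \;\Rightarrow\; \sum D^2 = \frac{198}{6} = 33 \]

But this is not the correct ∑D², so we need to compute the correct value:

\[ \text{Correct } \sum D^2 = 33 - 7^2 + 9^2 = 65 \]

Hence, the correct value of the rank coefficient of correlation is:

\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 65}{990} = 1 - \frac{390}{990} = 1 - 0.394 = 0.606 \]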
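The same back-calculation can be sketched numerically: recover ∑D² from the reported coefficient, replace the wrong squared difference by the right one, and recompute R. Variable names are illustrative.

```python
n = 10
reported_r = 0.8
denom = n ** 3 - n                      # 990

# Invert R = 1 - 6*sum(D^2)/(N^3 - N) to recover the (incorrect) sum of D^2
sum_d2 = (1 - reported_r) * denom / 6

# Remove the wrong difference (7) and add the correct one (9)
corrected_sum_d2 = sum_d2 - 7 ** 2 + 9 ** 2

corrected_r = 1 - 6 * corrected_sum_d2 / denom
print(round(sum_d2, 4), round(corrected_sum_d2, 4), round(corrected_r, 3))  # 33.0 65.0 0.606
```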

Illustration 11:
Ten competitors in a beauty contest are ranked by three judges in the following order:
1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7

Use the rank correlation coefficient to determine which pair of judges has the nearest
approach to common tastes in beauty.

Solution:
In order to find out which pair of judges has the nearest approach to common tastes in
beauty, we compare rank correlation between the judgements of
1. 1st Judge and 2nd Judge
2. 2nd Judge and 3rd Judge
3. 1st Judge and 3rd Judge
Calculation of Spearman’s Rank Coefficient of Correlation
Rank by 1st   Rank by 2nd   Rank by 3rd
Judge (R1)    Judge (R2)    Judge (R3)    D² = (R1–R2)²   D² = (R2–R3)²   D² = (R1–R3)²
1             3             6             4               9               25
6             5             4             1               1               4
5             8             9             9               1               16
10            4             8             36              16              4
3             7             1             16              36              4
2             10            2             64              64              0
4             2             3             4               1               1
9             1             10            64              81              1
7             6             5             1               1               4
8             9             7             1               4               1
N = 10        N = 10        N = 10        ∑D² = 200       ∑D² = 214       ∑D² = 60

1. 1st Judge and 2nd Judge:
\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 200}{10^3 - 10} = 1 - \frac{1200}{990} = 1 - 1.2121 = -0.2121 \]

2. 2nd Judge and 3rd Judge:
\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 214}{10^3 - 10} = 1 - \frac{1284}{990} = 1 - 1.297 = -0.297 \]

3. 1st Judge and 3rd Judge:
\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 60}{10^3 - 10} = 1 - \frac{360}{990} = 1 - 0.3636 = 0.6364 \]

Interpretation: The coefficient of correlation is positive, and highest, only for the
first and third judges (R = 0.6364); the other two pairs have negative coefficients.
Therefore, it can be concluded that the first and third judges have the nearest
approach to common tastes in beauty.
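The pairwise comparison can be sketched by computing R for every pair of judges and picking the largest value. The dictionary layout and the function name `rank_correlation` are illustrative choices for this example.

```python
from itertools import combinations

# Ranks given to the ten competitors by each judge (Illustration 11)
judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

def rank_correlation(r1, r2):
    """Spearman's R from two lists of ranks (no ties)."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

results = {
    (a, b): rank_correlation(judges[a], judges[b])
    for a, b in combinations(judges, 2)
}
for pair, value in results.items():
    print(pair, f"{value:.4f}")

best_pair = max(results, key=results.get)
print(best_pair)  # ('1st', '3rd')
```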

Illustration 12:
From the following data, compute the rank correlation.
X 82 68 75 61 68 73 85 68
Y 81 71 71 68 62 69 80 70

Solution:
In the problem we find there are repetitions of ranks. Value of X = 68 repeated 3 times
and Value of Y = 71 repeated 2 times. Therefore we need to compute adjustment factor
to be added to the value of ∑D2.

Calculation of Spearman’s Rank Coefficient of Correlation

X      Y      R1     R2     D = R1 – R2    D²
82     81     2      1       1             1
68     71     6      3.5     2.5           6.25
75     71     3      3.5    -0.5           0.25
61     68     8      7       1             1
68     62     6      8      -2             4
73     69     4      6      -2             4
85     80     1      2      -1             1
68     70     6      5       1             1
                                   ∑D² =   18.5

With one adjustment term for each group of repeated ranks (m₁ for X, m₂ for Y):

\[ R = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2)\right]}{N^3 - N} \]

The X value 68 is repeated three times, so m₁ = 3:

\[ \text{Adjustment factor (1)} = \frac{1}{12}(3^3 - 3) = \frac{1}{12}(27 - 3) = \frac{24}{12} = 2 \]

The Y value 71 is repeated two times, so m₂ = 2:

\[ \text{Adjustment factor (2)} = \frac{1}{12}(2^3 - 2) = \frac{1}{12}(8 - 2) = \frac{6}{12} = 0.5 \]

\[ R = 1 - \frac{6 \times [18.5 + 2 + 0.5]}{8^3 - 8} = 1 - \frac{6 \times 21}{512 - 8} = 1 - \frac{126}{504} = 1 - 0.25 = 0.75 \]

Spearman’s Rank Coefficient of Correlation = 0.75, which indicates a high
degree of positive correlation.
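The tie handling used in this illustration (average ranks plus the (m³ − m)/12 adjustment for each group of repeated values) can be sketched as below. The helper names are illustrative; this follows the adjusted formula given in these notes rather than any particular library routine.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of the ranks they occupy."""
    order = sorted(values, reverse=True)
    rank_of = {}
    for v, count in Counter(values).items():
        first = order.index(v) + 1            # first rank occupied by this value
        rank_of[v] = first + (count - 1) / 2  # average of first .. first+count-1
    return [rank_of[v] for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over every group of m repeated values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    r1, r2 = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n ** 3 - n)

# Data of Illustration 12
x = [82, 68, 75, 61, 68, 73, 85, 68]
y = [81, 71, 71, 68, 62, 69, 80, 70]

R = spearman_with_ties(x, y)
print(average_ranks(x))  # [2.0, 6.0, 3.0, 8.0, 6.0, 4.0, 1.0, 6.0]
print(average_ranks(y))  # [1.0, 3.5, 3.5, 7.0, 8.0, 6.0, 2.0, 5.0]
print(R)                 # 0.75
```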

Properties of Coefficient of Correlation:


1. The coefficient of correlation always lies between –1 and +1; symbolically it can be
written as –1 ≤ r ≤ 1.
2. The coefficient of correlation is independent of change of origin and scale.
3. The coefficient of correlation is a pure number and is independent of the units of
measurement. It means that if X represents, say, height in inches and Y represents, say,
weight in kgs, then the correlation coefficient will be neither in inches nor in kgs
but only a pure number.
4. The coefficient of correlation is the geometric mean of the two regression coefficients;
symbolically r = ±√(bxy ∗ byx), where r takes the sign of the regression coefficients.
5. If X and Y are independent variables then the coefficient of correlation is zero
(the converse is not necessarily true).
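Property 2 can be illustrated numerically with the height and weight data given earlier in this unit. In the sketch below the shift and scale constants (160, 2, 60, 3) are arbitrary values chosen for the example, and the scale factors are assumed positive (a negative scale factor on one variable would flip the sign of r).

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights = [158, 160, 163, 166, 168, 171, 174, 176]   # cm
weights = [60, 62, 64, 65, 67, 69, 71, 72]           # kg

r_original = pearson_r(heights, weights)

# Change of origin and scale: u = (X - 160) / 2, v = (Y - 60) * 3
u = [(x - 160) / 2 for x in heights]
v = [(y - 60) * 3 for y in weights]
r_transformed = pearson_r(u, v)

print(math.isclose(r_original, r_transformed))  # True
```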

Unit 2 REGRESSION
2.1Meaning:
A study of measuring the relationship between associated variables, wherein
one variable is dependent on another independent variable, called as Regression. It is
developed by Sir Francis Galton in 1877 to measure the relationship of height between
parents and their children.
Regression analysis is a statistical tool to study the nature and extent of
functional relationship between two or more variables and to estimate (or predict) the
unknown values of dependent variable from the known values of independent
variable.
The variable that forms the basis for predicting another variable is known as
the independent variable, and the variable that is predicted is known as the dependent
variable. For example, if we know that two variables, price (X) and demand (Y), are
closely related, we can find out the most probable value of X for a given value of Y or
the most probable value of Y for a given value of X. Similarly, if we know that the
amount of tax and the rise in the price of a commodity are closely related, we can find
out the expected price for a certain amount of tax levy.

Uses of Regression Analysis:


1. It provides estimates of values of the dependent variables from values of
independent variables.
2. It is used to obtain a measure of the error involved in using the regression line as
a basis for estimation.
3. With the help of regression analysis, we can obtain a measure of degree of
association or correlation that exists between the two variables.
4. It is a highly valuable tool in economics and business research, since most of the
problems of economic analysis are based on cause and effect relationships.

2.2 Distinction between Correlation and Regression

Sl No | Correlation | Regression
1 | It measures the degree and direction of relationship between the variables. | It measures the nature and extent of average relationship between two or more variables in terms of the original units of the data.
2 | It is a relative measure showing association between the variables. | It is an absolute measure of relationship.
3 | Correlation Coefficient is independent of change of both origin and scale. | Regression Coefficient is independent of change of origin but not scale.
4 | Correlation Coefficient is independent of units of measurement. | Regression Coefficient is not independent of units of measurement.
5 | Expression of the relationship between the variables ranges from –1 to +1. | Expression of the relationship between the variables may be in any of the forms like Y = a + bX or Y = a + bX + cX².
6 | It is not a forecasting device. | It is a forecasting device which can be used to predict the value of the dependent variable from the given value of the independent variable.
7 | There may be zero correlation, such as weight of wife and income of husband. | There is nothing like zero regression.

Regression Lines and Regression Equation:


Regression lines and regression equations are used synonymously. Regression
equations are the algebraic expressions of the regression lines. Let us consider two
variables, X and Y. If Y depends on X, the relationship takes the form of a simple
regression. For two variables X and Y, we have two regression lines: the regression
line of X on Y and the regression line of Y on X. The regression line
of Y on X gives the most probable value of Y for a given value of X, and the regression line
of X on Y gives the most probable value of X for a given value of Y. However, when there
is either perfect positive or perfect negative correlation between the two variables, the
two regression lines coincide, i.e. we have only one line. If the variables are
independent, r is zero and the lines of regression
are at right angles, i.e. parallel to the X axis and the Y axis.
Therefore, with the help of the simple linear regression model we have the
following two regression lines:
1. Regression line of Y on X: This line gives the probable value of Y (dependent
variable) for any given value of X (independent variable).
Regression line of Y on X : Y – Ẏ = byx (X – Ẋ)
OR : Y = a + bX

2. Regression line of X on Y: This line gives the probable value of X (dependent
variable) for any given value of Y (independent variable).
Regression line of X on Y : X – Ẋ = bxy (Y – Ẏ)
OR : X = a + bY

In each of the above regression equations there are two
regression parameters, “a” and “b”. Here “a” is an unknown constant, and “b”,
which is also denoted as “byx” or “bxy”, is another unknown constant popularly
called the regression coefficient. These two unknown constants
(fixed numerical values) determine the position of the line completely. If the
value of either or both of them is changed, another line is determined. The parameter
“a” determines the level of the fitted line (i.e. the distance of the line directly above or
below the origin). The parameter “b” determines the slope of the line (i.e. the change
in Y for a unit change in X).
If the values of the constants “a” and “b” are obtained, the line is completely
determined. But the question is how to obtain these values. The answer is provided by
the method of least squares. With a little algebra and differential calculus, it can be
shown that the following two normal equations, if solved simultaneously, will yield
the values of the parameters “a” and “b”.
Two normal equations:
X on Y: ∑X = Na + b∑Y and ∑XY = a∑Y + b∑Y²
Y on X: ∑Y = Na + b∑X and ∑XY = a∑X + b∑X²
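As a minimal sketch of how the two normal equations yield “a” and “b”, the Python function below solves them in closed form. The name `fit_line` and the small data set are assumptions chosen only for illustration; the data lie exactly on Y = 1 + 2X so that the result is easy to check by hand.

```python
def fit_line(u, v):
    """Fit v = a + b*u by solving the two normal equations
         sum(v)   = N*a      + b*sum(u)
         sum(u*v) = a*sum(u) + b*sum(u^2)
    """
    n = len(u)
    su, sv = sum(u), sum(v)
    suu = sum(x * x for x in u)
    suv = sum(x * y for x, y in zip(u, v))
    det = n * suu - su * su            # zero only if all u values are equal
    b = (n * suv - su * sv) / det
    a = (sv - b * su) / n
    return a, b

X = [1, 2, 3, 4]
Y = [3, 5, 7, 9]          # made-up data lying exactly on Y = 1 + 2X
a, b = fit_line(X, Y)     # regression of Y on X
print(a, b)               # 1.0 2.0
```

Calling `fit_line(Y, X)` instead gives the regression of X on Y, since the roles of the two series are simply exchanged in the normal equations.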

The above method is popularly known as the direct method, which becomes quite
cumbersome when the values of X and Y are large. This work can be simplified if,
instead of dealing with the actual values of X and Y, we take the deviations of the X and Y
series from their respective means. In that case:
Regression equation Y on X:
Y = a + bX will change to (Y – Ẏ) = byx (X – Ẋ)
Regression equation X on Y:
X = a + bY will change to (X – Ẋ) = bxy (Y – Ẏ)
In this new form of the regression equation, we need to compute only one
parameter, i.e. “b”. This “b”, denoted either “byx” or “bxy”, is called the
regression coefficient.

Regression Coefficient:
The quantity “b” in the regression equation is called as the regression
coefficient or slope coefficient. Since there are two regression equations, therefore, we
have two regression coefficients.
1. Regression Coefficient X on Y, symbolically written as “bxy”
2. Regression Coefficient Y on X, symbolically written as “byx”
Different formulae used to compute regression coefficients:

Method 1: Using the correlation coefficient (r) and standard deviation (σ)
Regression Coefficient X on Y: bxy = r (σx / σy)
Regression Coefficient Y on X: byx = r (σy / σx)

Method 2: Direct method, using the sums of X and Y
Regression Coefficient X on Y: bxy = (N∑XY – ∑X∑Y) / (N∑Y² – (∑Y)²)
Regression Coefficient Y on X: byx = (N∑XY – ∑X∑Y) / (N∑X² – (∑X)²)

Method 3: When deviations are taken from the arithmetic mean
Regression Coefficient X on Y: bxy = ∑xy / ∑y²
Regression Coefficient Y on X: byx = ∑xy / ∑x²
where x = X - Ẋ and y = Y - Ẏ
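The three methods are algebraically equivalent. The sketch below, written for byx only, computes the coefficient all three ways on a small hypothetical data set (the numbers are not from the text) and should print the same value three times, up to floating-point rounding. It assumes population standard deviations, i.e. division by N.

```python
from math import sqrt

X = [2, 4, 6, 8]      # hypothetical data, chosen only for illustration
Y = [1, 3, 2, 6]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(X, Y))  # sum of xy
sxx = sum((a - mean_x) ** 2 for a in X)                       # sum of x^2
syy = sum((b - mean_y) ** 2 for b in Y)                       # sum of y^2

# Method 3: deviations taken from the arithmetic mean
byx_dev = sxy / sxx

# Method 2: direct method on the raw sums
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(a * b for a, b in zip(X, Y))
sum_x2 = sum(a * a for a in X)
byx_direct = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Method 1: r times the ratio of the (population) standard deviations
r = sxy / sqrt(sxx * syy)
sigma_x, sigma_y = sqrt(sxx / n), sqrt(syy / n)
byx_r = r * sigma_y / sigma_x

print(byx_dev, byx_direct, byx_r)
```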

Properties of Regression Coefficients:


1. The coefficient of correlation is the geometric mean of the two regression
coefficients. Symbolically, r = ±√(bxy × byx), where r takes the sign of the regression coefficients.
2. If one of the regression coefficients is greater than unity, the other must be less
than unity, since the value of the coefficient of correlation cannot exceed unity.
For example, if bxy = 1.2 and byx = 1.4, then r = √(1.2 × 1.4) = √1.68 = 1.296, which is
not possible.
3. Both regression coefficients have the same sign, i.e. they are either both
positive or both negative. In other words, it is not possible for one of the regression
coefficients to have a minus sign and the other a plus sign.
4. The coefficient of correlation has the same sign as the regression
coefficients, i.e. if the regression coefficients have a negative sign, “r” also has a
negative sign, and if the regression coefficients have a positive sign, “r” is also
positive. For example, if bxy = -0.2 and byx = -0.8, then r = –√(0.2 × 0.8) = –0.4
5. The average of the two regression coefficients is greater than or equal to the
value of the coefficient of correlation (for positive coefficients). In symbols, (bxy + byx) / 2 ≥ r. For example, if
bxy = 0.8 and byx = 0.4, then the average of the two values = (0.8 + 0.4) / 2 = 0.6 and
the value of r = √(0.8 × 0.4) = 0.566, which is less than 0.6
6. Regression coefficients are independent of change of origin but not scale.
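Properties 1 to 5 can be illustrated with a small Python sketch. The helper name `r_from_coefficients` is hypothetical; it recovers r as the signed geometric mean and rejects pairs of coefficients that could not occur together.

```python
from math import sqrt, copysign

def r_from_coefficients(bxy, byx):
    """Recover r as the signed geometric mean of the two regression coefficients."""
    product = bxy * byx
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    if product > 1:
        raise ValueError("bxy * byx > 1 would give |r| > 1, which is impossible")
    return copysign(sqrt(product), bxy)   # r carries the common sign

print(round(r_from_coefficients(-0.2, -0.8), 4))   # -0.4
print(round(r_from_coefficients(0.8, 0.4), 4))     # 0.5657

# Property 5: the arithmetic mean is at least the geometric mean
print((0.8 + 0.4) / 2 >= r_from_coefficients(0.8, 0.4))   # True
```

Passing bxy = 1.2 and byx = 1.4 raises an error, matching the impossible case described in property 2.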

Illustration 01:
Find the two regression equation of X on Y and Y on X from the following data:
X : 10 12 16 11 15 14 20 22
Y : 15 18 23 14 20 17 25 28

Solution:
Calculation of Regression Equation
X Y X2 Y2 XY
10 15 100 225 150
12 18 144 324 216
16 23 256 529 368
11 14 121 196 154
15 20 225 400 300
14 17 196 289 238
20 25 400 625 500
22 28 484 784 616
120 160 1,926 3,372 2,542
∑X ∑Y ∑X2 ∑Y2 ∑XY
Here N = Number of elements in either series X or series Y = 8
Now we will proceed to compute regression equations using normal equations.
Regression equation of X on Y: X = a + bY
The two normal equations are:
∑X = Na + b∑Y
∑XY = a∑Y + b∑Y2
Substituting the values in above normal equations, we get
120 = 8a + 160b ..... (i)
2542 = 160a + 3372b ..... (ii)
Let us solve equations (i) and (ii) by the simultaneous equation method.
Multiplying equation (i) by 20, we get 2400 = 160a + 3200b
Now rewriting these equations:
2400 = 160a + 3200b
2542 = 160a + 3372b
(-) (-) (-)
-142 = -172b
Therefore we have -142 = -172b, which can be rewritten as 172b = 142
Now, b = 142/172 = 0.8256 (rounded off)
Substituting the value of b in equation (i), we get
120 = 8a + (160 * 0.8256)
120 = 8a + 132.096
8a = 120 – 132.096
8a = -12.096
a = -12.096/8
a = -1.512
Thus we get the values a = -1.512 and b = 0.8256
Hence the required regression equation of X on Y:
X = a + bY => X = -1.512 + 0.8256Y

Regression equation of Y on X: Y = a + bX
The two normal equations are:
∑Y = Na + b∑X
∑XY = a∑X + b∑X2
Substituting the values in above normal equations, we get
160 = 8a + 120b .....(iii)
2542 = 120a + 1926b .....(iv)
Let us solve these equations (iii) and (iv) by simultaneous equation method
Multiply equation (iii) by 15 we get 2400 = 120a + 1800b
Now rewriting these equations:
2400 = 120a + 1800b
2542 = 120a + 1926b
(-) (-) (-)
-142 = -126b
Therefore we have -142 = -126b, which can be rewritten as 126b = 142
Now, b = 142/126 = 1.127 (rounded off)
Substituting the value of b in equation (iii), we get
160 = 8a + (120 * 1.127)
160 = 8a + 135.24

8a = 160 - 135.24
8a = 24.76
a = 24.76/8
a = 3.095
Thus we got the values of a = 3.095 and b = 1.127
Hence the required regression equation of Y on X:
Y = a + bX => Y = 3.095 + 1.127X
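The hand calculation in Illustration 01 can be cross-checked with a short Python sketch that applies the direct-method formulas to the same data. The variable names are assumptions for illustration. Because the script carries full precision, the intercept of X on Y comes out at about -1.51 and the intercept of Y on X at about 3.10.

```python
X = [10, 12, 16, 11, 15, 14, 20, 22]
Y = [15, 18, 23, 14, 20, 17, 25, 28]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Regression of X on Y: X = a_xy + b_xy * Y
b_xy = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)
a_xy = (sum_x - b_xy * sum_y) / n

# Regression of Y on X: Y = a_yx + b_yx * X
b_yx = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a_yx = (sum_y - b_yx * sum_x) / n

print(round(a_xy, 3), round(b_xy, 4))
print(round(a_yx, 3), round(b_yx, 4))
```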

Illustration 02:
After investigation it has been found that the demand for automobiles in a city depends
mainly, if not entirely, upon the number of families residing in that city. Given below are the
figures for the sales of automobiles in five cities for the year 2019 and the
number of families residing in those cities.
City No. of Families (in lakhs): X Sale of automobiles (in ‘000): Y
Belagavi 70 25.2
Bangalore 75 28.6
Hubli 80 30.2
Kalaburagi 60 22.3
Mangalore 90 35.4
Fit a linear regression equation of Y on X by the least square method and estimate the
sales for the year 2020 for the city Belagavi which is estimated to have 100 lakh
families assuming that the same relationship holds true.

Solution:
Calculation of Regression Equation
City X Y X2 XY
Belagavi 70 25.2 4900 1764
Bangalore 75 28.6 5625 2145
Hubli 80 30.2 6400 2416
Kalaburagi 60 22.3 3600 1338
Mangalore 90 35.4 8100 3186
375 141.7 28,625 10,849
∑X ∑Y ∑X2 ∑XY
Regression equation of Y on X: Y = a + bX
The two normal equations are:
∑Y = Na + b∑X
∑XY = a∑X + b∑X2
Substituting the values in above normal equations, we get
141.7 = 5a + 375b........................................ (i)
10849= 375a + 28625b ................................... (ii)
Let us solve these equations (i) and (ii) by simultaneous equation method
Multiply equation (i) by 75 we get 10627.5 = 375a + 28125b

Now rewriting these equations:
10627.5 = 375a + 28125b
10849 = 375a + 28625b
(-) (-) (-)
-221.5 = -500b
Therefore we have -221.5 = -500b, which can be rewritten as 500b = 221.5
Now, b = 221.5/500 = 0.443
Substituting the value of b in equation (i), we get
141.7 = 5a + (375 * 0.443)
141.7 = 5a + 166.125
5a = 141.7 - 166.125
5a = -24.425
a = -24.425/5
a = -4.885
Thus we got the values of a = -4.885 and b = 0.443
Hence, the required regression equation of Y on X:
Y = a + bX => Y = -4.885 + 0.443X
Estimated sales of automobiles (Y) in city Belagavi for the year 2020, where number of
families (X) are 100(in lakhs):
Y = -4.885 + 0.443X
Y = -4.885 + (0.443 * 100)
Y = -4.885 + 44.3
Y = 39.415 (‘000)
This means that the sales of automobiles would be 39,415 when the number of families is 100,00,000.

Illustration 03:
From the following data obtain the two regression lines:
Capital Employed (Rs. in lakh): 7 8 5 9 12 9 10 15
Sales Volume (Rs. in lakh): 4 5 2 6 9 5 7 12

Solution:
Calculation of Regression Equation
X Y X2 Y2 XY
7 4 49 16 28
8 5 64 25 40
5 2 25 4 10
9 6 81 36 54
12 9 144 81 108
9 5 81 25 45
10 7 100 49 70
15 12 225 144 180
75 50 769 380 535
∑X ∑Y ∑X2 ∑Y2 ∑XY

Ẋ = ∑X/n = 75/8 = 9.375
Ẏ = ∑Y/n = 50/8 = 6.25

Regression line/equation of X on Y: (X – Ẋ) = bxy (Y – Ẏ)
Regression coefficient of X on Y:
bxy = (n∑XY – ∑X∑Y) / (n∑Y² – (∑Y)²)
= ((8 × 535) – (75 × 50)) / ((8 × 380) – (50)²)
= (4280 – 3750) / (3040 – 2500)
= 530/540 = 0.9815
(X – Ẋ) = bxy (Y – Ẏ)
=> X – 9.375 = 0.9815 (Y – 6.25)
=> X – 9.375 = 0.9815Y – 6.1344
=> X = 9.375 – 6.1344 + 0.9815Y
=> X = 3.2406 + 0.9815Y

Regression line/equation of Y on X: (Y – Ẏ) = byx (X – Ẋ)
Regression coefficient of Y on X:
byx = (n∑XY – ∑X∑Y) / (n∑X² – (∑X)²)
= ((8 × 535) – (75 × 50)) / ((8 × 769) – (75)²)
= (4280 – 3750) / (6152 – 5625)
= 530/527 = 1.0057
(Y – Ẏ) = byx (X – Ẋ)
=> Y – 6.25 = 1.0057 (X – 9.375)
=> Y – 6.25 = 1.0057X – 9.4284
=> Y = 6.25 – 9.4284 + 1.0057X
=> Y = -3.1784 + 1.0057X
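Both regression lines pass through the point of means (Ẋ, Ẏ), so they intersect there. The sketch below checks this for the data of Illustration 03, using exact fractions to avoid rounding error. The variable names are assumptions for illustration.

```python
from fractions import Fraction

X = [7, 8, 5, 9, 12, 9, 10, 15]
Y = [4, 5, 2, 6, 9, 5, 7, 12]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

numerator = n * sum_xy - sum_x * sum_y
b_xy = Fraction(numerator, n * sum_y2 - sum_y ** 2)   # 530/540
b_yx = Fraction(numerator, n * sum_x2 - sum_x ** 2)   # 530/527

mean_x, mean_y = Fraction(sum_x, n), Fraction(sum_y, n)
a_xy = mean_x - b_xy * mean_y   # intercept of X on Y
a_yx = mean_y - b_yx * mean_x   # intercept of Y on X

# Solve X = a_xy + b_xy*Y and Y = a_yx + b_yx*X simultaneously
y_star = (a_yx + b_yx * a_xy) / (1 - b_yx * b_xy)
x_star = a_xy + b_xy * y_star

print(float(x_star), float(y_star))   # 9.375 6.25
```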

Illustration 04:
From the following information find regression equations and estimate the production
when the capacity utilisation is 70%.
Average (Mean) Standard Deviation
Production (in lakh units) 42 12.5
Capacity Utilisation (%) 88 8.5
Correlation Coefficient (r) 0.72
Solution:
Let production be variable X and capacity utilisation be variable Y. The regression
equation of production based on capacity utilisation is given by X on Y,
and the regression equation of capacity utilisation based on production is given by Y on X.
These can be computed as given below:
Given Information: Ẋ = 42 Ẏ = 88 σx = 12.5 σy = 8.5 r = 0.72

Regression coefficient of X on Y:
bxy = r (σx / σy) = 0.72 × 12.5/8.5 = 1.0588
Regression Equation of X on Y:
(X – Ẋ) = bxy (Y – Ẏ)
=> X – 42 = 1.0588 (Y – 88)
=> X = 42 – 93.1744 + 1.0588Y
=> X = -51.1744 + 1.0588Y

Regression coefficient of Y on X:
byx = r (σy / σx) = 0.72 × 8.5/12.5 = 0.4896
Regression Equation of Y on X:
(Y – Ẏ) = byx (X – Ẋ)
=> Y – 88 = 0.4896 (X – 42)
=> Y = 88 – 20.5632 + 0.4896X
=> Y = 67.4368 + 0.4896X
To estimate the production when the capacity utilisation is 70%, we use the regression
equation of X on Y with Y = 70:
X = -51.1744 + 1.0588Y
= -51.1744 + (1.0588 * 70)
= -51.1744 + 74.116
= 22.9416
Therefore, the estimated production would be 22,94,160 units when there is a
capacity utilisation of 70%.
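When only the means, standard deviations and r are known, the regression line follows directly from b = r × (σ of dependent / σ of independent). The sketch below applies this to Illustration 04; `line_from_summary` is a hypothetical helper name. Because it does not round the slope to four decimals, the estimate differs from the hand calculation only in the third decimal place.

```python
def line_from_summary(mean_dep, mean_ind, sd_dep, sd_ind, r):
    """Return (intercept, slope) of the regression of the dependent
    variable on the independent one, from means, SDs and r."""
    slope = r * sd_dep / sd_ind
    intercept = mean_dep - slope * mean_ind
    return intercept, slope

# Production (X, dependent) on capacity utilisation (Y, independent)
a, b = line_from_summary(42, 88, 12.5, 8.5, 0.72)
estimate = a + b * 70     # production in lakh units at 70% utilisation

print(round(b, 4), round(estimate, 2))
```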

Illustration 05:
The following data gives the age and blood pressure (BP) of 10 sports persons.
Name : A B C D E F G H I J
Age (X) : 42 36 55 58 35 65 60 50 48 51
BP (Y) : 98 93 110 85 105 108 82 102 118 99
i. Find regression equation of Y on X and X on Y (Use the method of deviation
from arithmetic mean)
ii. Find the correlation coefficient (r) using the regression coefficients.
iii. Estimate the blood pressure of a sports person whose age is 45.

Solution:
Calculation of Regression Equation
Name Age (X) BP (Y) x = X – Ẋ = X – 50 y = Y – Ẏ = Y – 100 x² y² xy
A 42 98 -8 -2 64 4 16
B 36 93 -14 -7 196 49 98
C 55 110 5 10 25 100 50
D 58 85 8 -15 64 225 -120
E 35 105 -15 5 225 25 -75
F 65 108 15 8 225 64 120
G 60 82 10 -18 100 324 -180
H 50 102 0 2 0 4 0
I 48 118 -2 18 4 324 -36
J 51 99 1 -1 1 1 -1
500 1,000 0 0 904 1,120 -128
∑X ∑Y ∑x ∑y ∑x2 ∑y2 ∑xy

Ẋ = ∑X/n = 500/10 = 50 Ẏ = ∑Y/n = 1000/10 = 100
Regression coefficients can be computed using the following formula:
bxy = ∑xy / ∑y² and byx = ∑xy / ∑x², where x = X - Ẋ and y = Y - Ẏ

Regression coefficient of X on Y:
bxy = ∑xy / ∑y² = -128/1120 = -0.1143
Regression equation of X on Y:
(X – Ẋ) = bxy (Y – Ẏ)
=> X – 50 = -0.1143 (Y – 100)
=> X – 50 = -0.1143Y + 11.43
=> X = 50 + 11.43 – 0.1143Y
=> X = 61.43 – 0.1143Y

Regression coefficient of Y on X:
byx = ∑xy / ∑x² = -128/904 = -0.1416
Regression equation of Y on X:
(Y – Ẏ) = byx (X – Ẋ)
=> Y – 100 = -0.1416 (X – 50)
=> Y – 100 = -0.1416X + 7.08
=> Y = 100 + 7.08 – 0.1416X
=> Y = 107.08 – 0.1416X

Computation of coefficient of correlation using regression coefficients:

r = ±√(bxy × byx) = –√(0.1143 × 0.1416) = –√0.01618488 = –0.1272
The negative root is taken because both regression coefficients are negative.
Therefore, we have a low degree of negative correlation between the age and blood
pressure of sports persons.

The blood pressure (Y) of a sports person whose age is X = 45 can
be estimated using the regression equation of Y on X:
(Y – Ẏ) = byx (X – Ẋ)
=> Y = 107.08 – 0.1416X = 107.08 – (0.1416 * 45) = 107.08 – 6.372 = 100.708
This means that the estimated blood pressure of a sports person
whose age is 45 is 101 (rounded off).

Illustration 06:
There are two series of index numbers, P for price index and S for stock of commodity.
The mean and standard deviation of P are 100 and 8 and S are 103 and 4 respectively.
The correlation coefficient between the two series is 0.4. With these data, work out a
linear equation to read off values of P for various values of S. Can the same equation be
used to read off values of S for various values of P?

Solution:
Let us assume that P = Price Index is variable X and S = Stock of Commodity is variable Y.
Linear equation to read off values of P for various values of S would be regression
equation of X on Y. Regression coefficient is to be computed using mean and standard
deviation.
From the problem we can list out the given information:
Ẋ = 100 Ẏ = 103 σx = 8 σy = 4 r = 0.4

Regression equation of X on Y:
(X – Ẋ) = bxy (Y – Ẏ)
=> (X – Ẋ) = r (σx / σy) (Y – Ẏ)
=> (X – 100) = (0.4 × 8/4) (Y – 103)
=> (X – 100) = 0.8 (Y – 103)
=> (X – 100) = 0.8Y – 82.4
=> X = 100 – 82.4 + 0.8Y
=> X = 17.6 + 0.8Y
Linear equation to read off values of P for various values of S is X = 17.6 + 0.8Y
To read off values of S for various values of P we need the regression equation of Y on X,
and therefore the above linear equation cannot be used. Hence, the following regression
equation of Y on X is computed:
(Y – Ẏ) = byx (X – Ẋ)
=> (Y – Ẏ) = r (σy / σx) (X – Ẋ)
=> (Y – 103) = (0.4 × 4/8) (X – 100)
=> (Y – 103) = 0.2 (X – 100)
=> Y – 103 = 0.2X – 20
=> Y = 103 – 20 + 0.2X
=> Y = 83 + 0.2X
Hence, the linear equation to read off values of S for various values of P is Y = 83 + 0.2X
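The reason one equation cannot serve both purposes can be seen numerically. Inverting the line of X on Y would imply a slope of 1/bxy for Y in terms of X, which differs from byx unless r = ±1. A minimal sketch with the figures of Illustration 06 (variable names are assumptions):

```python
r, sd_p, sd_s = 0.4, 8, 4

b_xy = r * sd_p / sd_s        # slope of P on S (X on Y)
b_yx = r * sd_s / sd_p        # slope of S on P (Y on X)
slope_if_inverted = 1 / b_xy  # what algebraically inverting X = 17.6 + 0.8Y would imply

print(b_xy, b_yx, slope_if_inverted)
# The product of the two slopes equals r squared, so it is 1 only when r = +1 or -1.
print(round(b_xy * b_yx, 4), round(r ** 2, 4))
```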

2.3 Review of Correlation and Regression Analysis:


In correlation analysis, we are keen to know whether the two variables under
study are associated or correlated and, if correlated, what the strength of the correlation is.
The best measure of correlation is provided by Karl Pearson’s Coefficient of Correlation.
However, one severe limitation of this method is that it is applicable only in the case of a
linear relationship between two variables. If two variables, say X and Y, are
independent or not correlated, then the correlation coefficient is zero.
The correlation coefficient, which measures a linear relationship between the two
variables, indicates the amount of variation in one variable accounted for by the other
variable. A better measure for this purpose is provided by the square of the
correlation coefficient, known as the “coefficient of determination”. This can be
interpreted as the ratio of the explained variance to the total variance:
r² = Explained variance / Total variance
Similarly, Coefficient of non-determination = (1 – r²).
Regression analysis is concerned with establishing a functional relationship
between two variables and using this relationship for making future projections.
Unlike correlation, it can be applied to any type of relationship, linear as well as
curvilinear. The two lines of regression coincide, i.e. become identical, when r = –1 or
+1, in other words when there is a perfect negative or positive correlation between the
two variables under discussion. If r = 0, the regression lines are perpendicular to
each other.

Unit 3 PROBABILITY & EXPECTATION

3.1 Probability Distribution and mathematical Expectation

Probability Distribution

A probability distribution describes how the probabilities are distributed over the values of the
random variable. There are two main types of probability distributions: discrete and continuous.

1. Discrete Probability Distribution: This applies to scenarios where the random variable can take
on a countable number of distinct values. Examples include rolling a die or flipping a coin.
2. Continuous Probability Distribution: This applies to scenarios where the random variable can
take on any value within an interval, i.e. an uncountably infinite number of possible values.
Examples include measuring the height of people or the time it takes to complete a task.

Example of a Discrete Probability Distribution

Consider the roll of a fair six-sided die. The possible outcomes are 1, 2, 3, 4, 5, and 6. The probability
distribution can be represented as:

P(X = x) = 1/6 for x = 1, 2, 3, 4, 5, 6

where X is the random variable representing the outcome of the die roll.

Mathematical Expectation (Expected Value)

The expected value (or mathematical expectation) of a random variable gives a measure of the center
of the distribution of the variable. For a discrete random variable X with possible values
x1, x2, …, xn and corresponding probabilities P(X = x1), P(X = x2), …, P(X = xn), the
expected value E(X) is given by:

E(X) = ∑_{i=1}^{n} xi P(X = xi)

For a continuous random variable X with probability density function f(x), the expected value E(X)
is given by:

E(X) = ∫_{−∞}^{∞} x f(x) dx

Example Problem

Let's find the expected value of the roll of a fair six-sided die.
The possible outcomes are 1, 2, 3, 4, 5, and 6, each with a probability of 1/6.
The expected value E(X) is calculated as follows:

E(X) = ∑_{i=1}^{6} xi P(X = xi) = 1 ⋅ (1/6) + 2 ⋅ (1/6) + 3 ⋅ (1/6) + 4 ⋅ (1/6) + 5 ⋅ (1/6) + 6 ⋅ (1/6)

E(X) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5
Thus, the expected value of the roll of a fair six-sided die is 3.5.
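The same calculation can be written as a short Python sketch. The function name `expected_value` is an assumption for illustration; exact fractions are used so that the probabilities sum to exactly 1.

```python
from fractions import Fraction

def expected_value(dist):
    """dist maps each value x to P(X = x)."""
    assert sum(dist.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in dist.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(die))          # 7/2
print(float(expected_value(die)))   # 3.5
```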

Diagram

[Figure: bar chart of the probability distribution of rolling a fair six-sided die. Each bar
represents one outcome (1 through 6) with probability 1/6. A red dashed line marks the
expected value, 3.5.]

Summary

1. Probability Distribution: Describes the likelihood of different outcomes of a random variable.
Discrete Probability Distribution: Applies to countable outcomes.
Continuous Probability Distribution: Applies to uncountable outcomes.
2. Mathematical Expectation (Expected Value): The weighted average of all possible values of a
random variable, where the weights are the probabilities of each outcome.
For discrete variables: E(X) = ∑_{i} xi P(X = xi)
For continuous variables: E(X) = ∫_{−∞}^{∞} x f(x) dx
3. Example: The expected value of rolling a fair six-sided die is 3.5.

The diagram visualizes how the expected value represents the center of the probability distribution.

3.2 Random Variable: Definition, Probability Distribution of a Random Variable, Expectation of a Random Variable

Random Variable

A random variable is a variable whose possible values are numerical outcomes of a random
phenomenon. There are two types of random variables:

1. Discrete Random Variable: Takes on a countable number of distinct values. Examples include
the number of heads in a series of coin flips or the roll of a die.
2. Continuous Random Variable: Takes on any value within an interval, i.e. an uncountably infinite
number of possible values. Examples include the height of individuals or the time required to
complete a task.

Probability Distribution of a Random Variable

A probability distribution describes how probabilities are assigned to the possible values of a random
variable.

Discrete Probability Distribution

For a discrete random variable X with possible values x1, x2, …, xn, the probability distribution is
given by P(X = xi), which satisfies:

1. 0 ≤ P(X = xi) ≤ 1 for every i
2. ∑_{i=1}^{n} P(X = xi) = 1

Continuous Probability Distribution

For a continuous random variable X with probability density function (PDF) f(x):
1. f(x) ≥ 0 for all x
2. ∫_{−∞}^{∞} f(x) dx = 1

Expectation of a Random Variable

The expectation (or expected value) of a random variable provides a measure of the center of the
distribution of the variable.

For a Discrete Random Variable

For a discrete random variable X with possible values x1, x2, …, xn and corresponding probabilities
P(X = x1), P(X = x2), …, P(X = xn):

E(X) = ∑_{i=1}^{n} xi P(X = xi)

For a Continuous Random Variable

For a continuous random variable X with PDF f(x):

E(X) = ∫_{−∞}^{∞} x f(x) dx

Example Problem

Consider a random variable X representing the outcome of a roll of a fair six-sided die. The possible
values of X are 1, 2, 3, 4, 5, and 6, each with probability 1/6.

Probability Distribution

The probability distribution for X is:

P(X = x) = 1/6 for x = 1, 2, 3, 4, 5, 6

Expected Value

The expected value E(X) is calculated as:

E(X) = ∑_{i=1}^{6} xi P(X = xi) = 1 ⋅ (1/6) + 2 ⋅ (1/6) + 3 ⋅ (1/6) + 4 ⋅ (1/6) + 5 ⋅ (1/6) + 6 ⋅ (1/6)

E(X) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5

Thus, the expected value of the roll of a fair six-sided die is 3.5.

Diagram

[Figure: bar chart of the probability distribution of rolling a fair six-sided die. Each bar
represents one outcome (1 through 6) with probability 1/6. A red dashed line marks the
expected value, 3.5.]

Summary

1. Random Variable: A variable whose possible values are numerical outcomes of a random
phenomenon.
Discrete Random Variable: Takes on a countable number of distinct values.
Continuous Random Variable: Takes on an uncountably infinite number of possible values.
2. Probability Distribution: Describes how probabilities are assigned to the possible values of a
random variable.
Discrete Probability Distribution: For discrete random variables.
Continuous Probability Distribution: For continuous random variables.
3. Mathematical Expectation (Expected Value): The weighted average of all possible values of a
random variable, where the weights are the probabilities of each outcome.
For discrete variables: E(X) = ∑_{i} xi P(X = xi)
For continuous variables: E(X) = ∫_{−∞}^{∞} x f(x) dx
4. Example: The expected value of rolling a fair six-sided die is 3.5.

The diagram visualizes how the expected value represents the center of the probability distribution.

3.3 Properties of Expected Value and Variance

Properties of Expected Value

The expected value, also known as the mean, has several important properties:

1. Linearity:
For any random variables X and Y , and constants a and b:

E(aX + bY ) = aE(X) + bE(Y )

This property shows that the expected value of a linear combination of random variables is
the same linear combination of their expected values.
2. Expectation of a Constant:
If c is a constant:

E(c) = c

3. Expectation of the Sum of Random Variables:


For any random variables X and Y :

E(X + Y ) = E(X) + E(Y )

Properties of Variance

Variance measures the spread of the random variable around the mean. The variance Var(X) of a
random variable X with mean μ = E(X) is given by:

Var(X) = E[(X − μ)²]

Some important properties of variance are:

1. Variance of a Constant:
If c is a constant:

Var(c) = 0

2. Variance of a Linear Transformation:


For any random variable X and constants a and b:

Var(aX + b) = a²Var(X)

3. Variance of the Sum of Independent Random Variables:


If X and Y are independent random variables:

Var(X + Y ) = Var(X) + Var(Y )

Example Problem

Consider two independent random variables X and Y:

X represents the outcome of rolling a fair six-sided die.
Y represents the outcome of flipping a fair coin twice and counting the number of heads (values
can be 0, 1, or 2).

Let's find the expected value and variance of the sum Z = X + Y.

Expected Value

The expected value of X (as calculated earlier) is:

E(X) = 3.5

For Y, the probabilities are:

P(Y = 0) = 1/4
P(Y = 1) = 1/2
P(Y = 2) = 1/4

The expected value of Y is:

E(Y) = 0 ⋅ (1/4) + 1 ⋅ (1/2) + 2 ⋅ (1/4) = 0 + 1/2 + 1/2 = 1

By the linearity of expectation (which holds whether or not X and Y are independent), the expected value of Z is:

E(Z) = E(X + Y) = E(X) + E(Y) = 3.5 + 1 = 4.5

Variance

The variance of X is calculated as:

Var(X) = E[(X − E(X))²]

For a fair six-sided die, the variance is:

Var(X) = (1/6)[(1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²]

Var(X) = (1/6)[(−2.5)² + (−1.5)² + (−0.5)² + (0.5)² + (1.5)² + (2.5)²]

Var(X) = (1/6)[6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25]

Var(X) = (1/6)[17.5] = 2.9167
For Y :

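$\mathrm{Var}(Y) = \sum_i (y_i - E(Y))^2 \, P(Y = y_i)$

$\mathrm{Var}(Y) = (0 - 1)^2 \cdot \frac{1}{4} + (1 - 1)^2 \cdot \frac{1}{2} + (2 - 1)^2 \cdot \frac{1}{4}$

$\mathrm{Var}(Y) = 1 \cdot \frac{1}{4} + 0 \cdot \frac{1}{2} + 1 \cdot \frac{1}{4} = \frac{1}{2} = 0.5$
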
Since X and Y are independent, the variance of Z is:

Var(Z) = Var(X + Y ) = Var(X) + Var(Y ) = 2.9167 + 0.5 = 3.4167
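
The hand calculation above can be cross-checked by enumerating the joint distribution of the die and the coin flips. The following is a minimal Python sketch; the names `die`, `heads`, `mean`, `variance` and `total` are illustrative choices, not part of the original notes. Exact fractions are used so that no rounding enters the check.

```python
from fractions import Fraction
from itertools import product

# Distribution of X: one roll of a fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}
# Distribution of Y: number of heads in two flips of a fair coin.
heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def mean(dist):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())


def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    mu = mean(dist)
    return sum((value - mu) ** 2 * prob for value, prob in dist.items())


# Independence lets us multiply the marginal probabilities
# to build the distribution of Z = X + Y.
total = {}
for (x, px), (y, py) in product(die.items(), heads.items()):
    total[x + y] = total.get(x + y, 0) + px * py

print(float(mean(total)))      # E(Z)
print(float(variance(total)))  # Var(Z)
```

The enumeration gives E(Z) = 9/2 and Var(Z) = 41/12, which agree with 4.5 and 3.4167 obtained from the properties of expectation and variance.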


Unit -4

4.1 Sampling and Sampling Distributions - Data Collection

Sampling and Sampling Distributions

Overview

Sampling is the process of selecting a subset of individuals from a population to estimate
characteristics of the whole population. Sampling is essential when it is impractical or impossible to
examine the entire population.

Sampling Distributions refer to the probability distributions of statistics computed from a sample. A
common example is the sampling distribution of the sample mean. This distribution shows how the
sample mean varies from sample to sample.

Key Concepts

1. Population and Sample:
Population: The entire group of individuals or instances about whom we hope to learn.
Sample: A subset of the population, selected for measurement, observation, or questioning, to
provide statistical information about the population.
2. Types of Sampling:
Simple Random Sampling: Every member of the population has an equal chance of being
selected.
Stratified Sampling: The population is divided into strata (groups), and a random sample
is taken from each stratum.
Cluster Sampling: The population is divided into clusters, some of which are
randomly selected. All members of chosen clusters are sampled.
Systematic Sampling: Every k-th member of the population is selected.
3. Sampling Distribution of the Sample Mean:
If we take all possible samples of size n from a population with mean μ and standard
deviation σ, the sampling distribution of the sample mean ($\bar{X}$) will have:
Mean: $\mu_{\bar{X}} = \mu$
Standard Error: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
4. Central Limit Theorem (CLT):
For a large enough sample size n, the sampling distribution of the sample mean will be
approximately normally distributed, regardless of the shape of the population distribution.

Problem Solving with Sampling Distributions

Example Problem:
Let's assume we have a population with a known mean (μ) of 50 and a standard deviation (σ) of 10.
We take a sample of size n = 25.
Find:

1. The mean and standard deviation of the sampling distribution of the sample mean.
2. The probability that the sample mean is greater than 52.

Solution:

1. Mean of the Sampling Distribution:

$\mu_{\bar{X}} = \mu = 50$
2. Standard Error of the Sampling Distribution:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = \frac{10}{5} = 2$
3. Probability Calculation:
To find the probability that the sample mean is greater than 52, we use the standard normal
distribution (Z-distribution).
$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{52 - 50}{2} = 1$
We need $P(\bar{X} > 52) = P(Z > 1)$.
Using standard normal distribution tables or a calculator:
$P(Z > 1) \approx 0.1587$

Therefore, the probability that the sample mean is greater than 52 is approximately 0.1587, or
15.87%.
This example illustrates how to use the properties of sampling distributions to make probabilistic
statements about sample means. The Central Limit Theorem plays a crucial role in enabling these
calculations, especially when the sample size is large.
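
In place of a printed Z-table, the same three steps can be reproduced with Python's standard library. This is a minimal sketch that assumes the sample mean is (approximately) normally distributed; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50, 10, 25           # population mean, population SD, sample size

standard_error = sigma / sqrt(n)    # sigma / sqrt(n)
sampling_dist = NormalDist(mu, standard_error)

z = (52 - mu) / standard_error      # z-score of the threshold 52
p_greater = 1 - sampling_dist.cdf(52)

print(standard_error)               # 2.0
print(z)                            # 1.0
print(round(p_greater, 4))          # 0.1587
```

Changing `mu`, `sigma`, `n` or the threshold lets the reader rework the problem for other populations.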

4.2 Sampling and Non-Sampling Errors – Principles of Sampling

Sampling and Non-Sampling Errors

Overview

When conducting a survey or study, errors can occur in the process of data collection. These errors
can be broadly classified into two categories: sampling errors and non-sampling errors.
Sampling Errors
Sampling Errors occur because the sample is only a subset of the population, and there is always
variability in the samples chosen. These errors can be quantified and controlled using statistical
techniques.
1. Causes of Sampling Errors:
Sample Size: Smaller samples tend to have larger sampling errors.
Sampling Method: The method used to select the sample can introduce bias. Simple
random sampling minimizes bias, while non-random sampling methods can increase it.
2. Controlling Sampling Errors:
Increasing Sample Size: Larger samples reduce sampling error.
Using Proper Sampling Techniques: Ensuring randomness in sample selection helps to
reduce bias.
3. Quantifying Sampling Errors:
Standard Error: The standard deviation of the sampling distribution, indicating the typical
amount by which the sample mean will differ from the population mean.
Confidence Intervals: A range around the sample mean within which we expect the
population mean to lie, given a certain level of confidence.

Non-Sampling Errors

Non-Sampling Errors are errors not related to the act of sampling itself. They can occur in any phase
of the survey or data collection process.
1. Types of Non-Sampling Errors:
Measurement Error: Inaccuracies in data collection instruments or respondents' answers.
Processing Error: Mistakes made during data entry, coding, or analysis.
Non-Response Error: When certain individuals do not respond or cannot be reached.
Coverage Error: When some members of the population are not included in the sample
frame.
2. Controlling Non-Sampling Errors:
Improving Data Collection Instruments: Ensure clarity and precision in survey questions.
Training Data Collectors: Proper training for those collecting and entering data.
Increasing Response Rates: Using follow-ups and incentives to reduce non-response.
Ensuring Complete Coverage: Using comprehensive sampling frames and multiple data
sources.

Problem Solving with Sampling Errors


Example Problem:
A researcher wants to estimate the average height of students in a university. They take a random
sample of 100 students and find an average height of 170 cm with a standard deviation of 10 cm. They
want to calculate a 95% confidence interval for the population mean.
Solution:

1. Calculate the Standard Error (SE):

$SE = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{100}} = \frac{10}{10} = 1$
2. Determine the Critical Value:
For a 95% confidence interval, the critical value (Z*) from the standard normal distribution is
approximately 1.96.
3. Calculate the Margin of Error (ME):
ME = Z∗ × SE = 1.96 × 1 = 1.96
4. Construct the Confidence Interval:
Confidence Interval = $\bar{X} \pm ME = 170 \pm 1.96$

Confidence Interval = [168.04, 171.96]

Therefore, the 95% confidence interval for the average height of students in the university is between
168.04 cm and 171.96 cm.
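
The four steps of this solution can be collected into one small function. The sketch below is an illustration only: `z_confidence_interval` is a hypothetical helper name, and it uses the normal critical value, which is reasonable here because the sample is large.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sd / sqrt(n)   # margin of error = Z* x SE
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(170, 10, 100)
print(round(low, 2), round(high, 2))  # 168.04 171.96
```

Passing `confidence=0.99` widens the interval, which shows the trade-off between confidence and precision.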

Problem Solving with Non-Sampling Errors

Example Problem:
A survey is conducted to measure the job satisfaction of employees in a company. Out of 500
employees, 400 responded. The survey finds that 80% of respondents are satisfied with their jobs.
However, there is concern about non-response bias and measurement error.
Solution:

1. Assess Non-Response Error:


Non-Response Rate:
$\text{Non-Response Rate} = \frac{500 - 400}{500} = \frac{100}{500} = 0.20 = 20\%$
The 20% non-response rate indicates that a significant portion of the population did not
respond, which might bias the results if non-respondents have different satisfaction levels.
2. Adjust for Measurement Error:
Pre-Test the Survey: Conduct a pilot study to identify and correct ambiguities in the survey
questions.
Train Survey Administrators: Ensure consistent and accurate data collection procedures.

By addressing both sampling and non-sampling errors, the researcher can improve the accuracy and
reliability of the survey results, providing a more accurate estimate of job satisfaction among the
employees.
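
One simple way to see how much non-response could matter is to bound the company-wide satisfaction rate under the two extreme assumptions about the 100 non-respondents. The sketch below is an illustrative addition, not a method prescribed in these notes; the variable names are assumptions.

```python
surveyed = 500
responded = 400
satisfied_share = 0.80   # share of respondents who report being satisfied

non_response_rate = (surveyed - responded) / surveyed
satisfied = round(satisfied_share * responded)   # number of satisfied respondents

# Extreme assumptions about the non-respondents bound the true rate.
lower_bound = satisfied / surveyed                              # none of them satisfied
upper_bound = (satisfied + surveyed - responded) / surveyed     # all of them satisfied

print(non_response_rate)          # 0.2
print(lower_bound, upper_bound)   # 0.64 0.84
```

Under these assumptions the true satisfaction rate could lie anywhere between 64% and 84%, which is why a 20% non-response rate deserves follow-up.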

4.3 Merits and Limitations of Sampling- Methods of Sampling


Merits and Limitations of Sampling

Merits of Sampling

1. Cost-Effective:
Sampling reduces the cost of data collection since fewer resources are needed compared to
a full population survey.
2. Time-Saving:
Collecting data from a sample takes less time than surveying the entire population.
3. Manageable Data:
Working with a smaller data set is more manageable, especially when it comes to data
analysis and processing.
4. Practicality:
In many cases, it is impractical or impossible to survey an entire population (e.g., the entire
population of a country).
5. Focus on Quality:
Resources can be allocated to ensuring high-quality data collection methods and accuracy
in a sample survey, which might be diluted in a full population survey.
6. Allows for In-Depth Study:
Researchers can study specific characteristics in more detail, as they have fewer cases to
handle.

Limitations of Sampling

1. Sampling Bias:
If the sample is not properly randomized or representative, it can lead to biased results that
do not accurately reflect the population.
2. Sampling Error:
The results from a sample will always differ to some extent from the actual population
values due to random chance.
3. Limited Scope:
Some studies might require data from the entire population to be conclusive (e.g., certain
medical studies).
4. Non-Sampling Errors:
Errors not related to the sampling process, such as data collection or processing errors, can
still affect the validity of the results.
5. Generalization Issues:
Findings from the sample might not be generalizable to the entire population, especially if
the sample size is small or not representative.

Methods of Sampling
1. Simple Random Sampling:
Every member of the population has an equal chance of being selected. This can be
achieved using random number generators or drawing lots.
2. Systematic Sampling:

Every k-th member of the population is selected after a random start. For example, if k = 5,
you might survey every 5th person on a list.
3. Stratified Sampling:
The population is divided into subgroups (strata) based on certain characteristics, and
random samples are taken from each stratum. This ensures representation from all
subgroups.
4. Cluster Sampling:
The population is divided into clusters, usually based on geographical areas or other
natural groupings. Entire clusters are randomly selected, and all members of selected
clusters are surveyed.
5. Multi-Stage Sampling:
A combination of sampling methods, typically starting with cluster sampling followed by
random sampling within selected clusters. This is useful for large, diverse populations.
6. Convenience Sampling:
Samples are taken from a group that is conveniently accessible. This method is not random
and can introduce significant bias.
7. Quota Sampling:
The population is segmented into mutually exclusive subgroups, and judgment is used to
select subjects or units from each segment based on a specified proportion.
8. Snowball Sampling:
Existing study subjects recruit future subjects from among their acquaintances. This
method is often used in hidden or hard-to-reach populations.
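
The four probability-based methods in the list above can be sketched with Python's `random` module. Everything here is hypothetical: the population is simply the unit IDs 1 to 100, the two strata are the two halves of that list, and the clusters are blocks of 20 consecutive IDs.

```python
import random

rng = random.Random(42)               # fixed seed so the sketch is repeatable
population = list(range(1, 101))      # hypothetical population of 100 unit IDs

# Simple random sampling: every unit has the same chance of selection.
simple = rng.sample(population, 10)

# Systematic sampling: random start, then every k-th unit.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: an independent random sample within each stratum.
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = {name: rng.sample(units, 5) for name, units in strata.items()}

# Cluster sampling: choose whole clusters at random and keep every member.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [unit for cluster in chosen_clusters for unit in cluster]

print(sorted(simple))
print(systematic)
print(stratified)
print(sorted(cluster_sample))
```

Non-probability methods such as convenience, quota and snowball sampling depend on judgment or access rather than a random mechanism, so they are not sketched here.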

Problem Solving with Sampling Methods

Example Problem:
A researcher wants to estimate the average income of households in a city. They decide to use
stratified sampling. The city has three main income groups: low, middle, and high income. The
proportions of these groups are known from previous surveys: 30% low income, 50% middle income,
and 20% high income. The researcher wants to take a sample of 300 households.
Solution:

1. Determine Sample Size for Each Stratum:


Low Income:
$n_{\text{low}} = 0.30 \times 300 = 90$

Middle Income:
$n_{\text{middle}} = 0.50 \times 300 = 150$

High Income:
$n_{\text{high}} = 0.20 \times 300 = 60$


2. Select Random Samples from Each Stratum:
Using random sampling techniques, select 90 households from the low-income group, 150
households from the middle-income group, and 60 households from the high-income
group.
3. Estimate the Population Mean:

After collecting income data from these samples, calculate the sample means for each
stratum.
Suppose the means are as follows:
$\bar{X}_{\text{low}} = \$25{,}000$, $\bar{X}_{\text{middle}} = \$50{,}000$, $\bar{X}_{\text{high}} = \$100{,}000$
4. Weighted Average of Sample Means:
Calculate the overall estimate of the population mean using the weighted average of the
stratum means:
$\mu_{\text{estimate}} = (0.30 \times 25{,}000) + (0.50 \times 50{,}000) + (0.20 \times 100{,}000)$

$\mu_{\text{estimate}} = 7{,}500 + 25{,}000 + 20{,}000 = 52{,}500$

Therefore, the estimated average income of households in the city is $52,500.


By using stratified sampling, the researcher ensures that all income groups are represented
proportionally, reducing the potential for sampling bias and providing a more accurate estimate of the
population mean.
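
The proportional allocation and the weighted estimate can be written in a few lines of Python. This is a minimal sketch using the figures of the example; the dictionary names are illustrative.

```python
proportions = {"low": 0.30, "middle": 0.50, "high": 0.20}       # population shares
stratum_means = {"low": 25_000, "middle": 50_000, "high": 100_000}
total_sample = 300

# Proportional allocation: each stratum gets its population share of the sample.
allocation = {group: round(share * total_sample) for group, share in proportions.items()}

# Stratified estimate: stratum means weighted by the population shares.
estimate = sum(proportions[group] * stratum_means[group] for group in proportions)

print(allocation)   # {'low': 90, 'middle': 150, 'high': 60}
print(estimate)
```

The weights are the population shares, not the sample sizes; with proportional allocation the two coincide, which is why the weighted average equals the plain average of all 300 sampled incomes.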

4.4 Parameter and Statistic- Sampling Distribution of a Statistic

Parameter and Statistic

Definitions

1. Parameter:
A parameter is a numerical characteristic or measure of a population. Examples include the
population mean (μ), population standard deviation (σ), and population proportion (p).
Parameters are usually unknown because it's often impractical to measure an entire
population.
2. Statistic:
A statistic is a numerical characteristic or measure of a sample. Examples include the
sample mean ($\bar{X}$), sample standard deviation (s), and sample proportion ($\hat{p}$). Statistics are
calculated from sample data and used to estimate population parameters.

Key Differences

Population vs. Sample:


Parameters describe populations.
Statistics describe samples.
Symbols:
Common symbols for parameters include μ, σ, and p.
Common symbols for statistics include $\bar{X}$, s, and $\hat{p}$.


Sampling Distribution of a Statistic

Definition

The sampling distribution of a statistic is the probability distribution of that statistic based on a large
number of samples drawn from the same population. It shows how the statistic varies from sample to
sample.

Key Concepts
1. Sampling Distribution of the Sample Mean ($\bar{X}$):
When taking all possible samples of a specific size n from a population with mean μ and
standard deviation σ, the distribution of the sample means ($\bar{X}$) will have:
Mean: $\mu_{\bar{X}} = \mu$
Standard Error: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
2. Central Limit Theorem (CLT):
The CLT states that for a sufficiently large sample size n, the sampling distribution of the
sample mean will be approximately normally distributed, regardless of the shape of the
population distribution. This approximation improves with larger sample sizes.
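
A short simulation makes these two key concepts concrete. The sketch below assumes, purely for illustration, a right-skewed population (exponential with mean 1 and standard deviation 1), draws many samples of size n = 36, and compares the mean and spread of the resulting sample means with μ and σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)        # fixed seed so the run is repeatable
n, repetitions = 36, 5_000    # sample size and number of simulated samples


def sample_mean():
    """Mean of one sample of size n from an exponential(1) population."""
    return mean(rng.expovariate(1.0) for _ in range(n))


sample_means = [sample_mean() for _ in range(repetitions)]

print(mean(sample_means))     # close to the population mean, 1
print(stdev(sample_means))    # close to sigma / sqrt(n)
print(1 / sqrt(n))            # theoretical standard error, about 0.1667
```

Even though individual observations are strongly skewed, a histogram of `sample_means` would look close to a bell curve, as the Central Limit Theorem predicts.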

Example Problem: Sampling Distribution

Let's assume we have a population with a mean (μ) of 60 and a standard deviation (σ) of 12. We draw
a sample of size n = 36.
Find:

1. The mean and standard error of the sampling distribution of the sample mean.
2. The probability that the sample mean is greater than 62.

Solution:

1. Mean of the Sampling Distribution:

$\mu_{\bar{X}} = \mu = 60$
2. Standard Error of the Sampling Distribution:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{36}} = \frac{12}{6} = 2$
3. Probability Calculation:
To find the probability that the sample mean is greater than 62, we use the standard normal
distribution (Z-distribution).
$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{62 - 60}{2} = 1$
We need $P(\bar{X} > 62) = P(Z > 1)$.
Using standard normal distribution tables or a calculator:
$P(Z > 1) \approx 0.1587$


Therefore, the probability that the sample mean is greater than 62 is approximately 0.1587, or
15.87%.
Problem Solving with Parameters and Statistics
Example Problem: Estimating a Population Parameter
A factory produces light bulbs, and the life expectancy of the light bulbs is known to be normally
distributed. A sample of 50 light bulbs is tested, and the sample mean life expectancy is found to be
1,200 hours with a sample standard deviation of 100 hours.
Find:

1. A 95% confidence interval for the population mean life expectancy.

Solution:

1. Calculate the Standard Error (SE):


$SE = \frac{s}{\sqrt{n}} = \frac{100}{\sqrt{50}} \approx \frac{100}{7.071} \approx 14.14$

2. Determine the Critical Value:


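For a 95% confidence interval, the critical value (Z*) from the standard normal distribution is
approximately 1.96. Because σ is estimated by s, the exact critical value comes from the t-distribution
with 49 degrees of freedom (about 2.01); with n = 50 the normal value is a close approximation.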
3. Calculate the Margin of Error (ME):
ME = Z∗ × SE = 1.96 × 14.14 ≈ 27.71
4. Construct the Confidence Interval:
Confidence Interval = $\bar{X} \pm ME = 1200 \pm 27.71$

Confidence Interval = [1172.29, 1227.71]


Therefore, the 95% confidence interval for the population mean life expectancy of the light bulbs is
between 1,172.29 hours and 1,227.71 hours.
By using the sample statistic ($\bar{X}$) and the sampling distribution, we estimate the population
parameter (μ) with a certain level of confidence. This process is fundamental in inferential statistics,
allowing researchers to make informed conclusions about populations based on sample data.

4.5 Examples of Sampling Distributions- Standard Normal

Examples of Sampling Distributions


Overview

Sampling distributions are fundamental in statistical inference, providing a framework to understand
how sample statistics (like the sample mean or sample proportion) behave. The most commonly
discussed sampling distribution is that of the sample mean, particularly when the sample size is large.

Standard Normal Distribution

The standard normal distribution is a special case of the normal distribution with a mean of 0 and a
standard deviation of 1. When sample sizes are large, the Central Limit Theorem tells us that the
sampling distribution of the sample mean will be approximately normally distributed, regardless of
the population's distribution.
Key Concepts
1. Central Limit Theorem (CLT):
The CLT states that for a sufficiently large sample size n, the sampling distribution of the
sample mean will be approximately normally distributed with mean μ and standard error
$\frac{\sigma}{\sqrt{n}}$.
2. Standardization:
To transform a normal distribution into a standard normal distribution, we use the z-score
formula:
$Z = \frac{X - \mu}{\sigma}$
For the sampling distribution of the sample mean, the z-score formula is:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Examples of Sampling Distributions

Example 1: Sampling Distribution of the Sample Mean

Problem:
A population has a mean (μ) of 80 and a standard deviation (σ) of 10. A random sample of size n = 25
is drawn.

1. Find the mean and standard error of the sampling distribution of the sample mean.
2. Calculate the probability that the sample mean is between 78 and 82.

Solution:

1. Mean of the Sampling Distribution:

$\mu_{\bar{X}} = \mu = 80$
2. Standard Error of the Sampling Distribution:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = \frac{10}{5} = 2$
3. Probability Calculation:
To find the probability that the sample mean is between 78 and 82, we calculate the z-scores for
78 and 82.
For $\bar{X} = 78$:
$Z = \frac{78 - 80}{2} = \frac{-2}{2} = -1$
For $\bar{X} = 82$:
$Z = \frac{82 - 80}{2} = \frac{2}{2} = 1$
Using standard normal distribution tables or a calculator:
$P(-1 < Z < 1) \approx 0.6826$

Therefore, the probability that the sample mean is between 78 and 82 is approximately 68.26%.

Example 2: Sampling Distribution of the Sample Proportion

Problem:
In a large city, 60% of the residents are in favor of building a new park. A random sample of 100
residents is selected.
1. Find the mean and standard error of the sampling distribution of the sample proportion.
2. Calculate the probability that the sample proportion is between 0.55 and 0.65.

Solution:

1. Mean of the Sampling Distribution:

$\mu_{\hat{p}} = p = 0.60$

2. Standard Error of the Sampling Distribution:

$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.60 \times 0.40}{100}} = \sqrt{\frac{0.24}{100}} = \sqrt{0.0024} \approx 0.049$
3. Probability Calculation:
To find the probability that the sample proportion is between 0.55 and 0.65, we calculate the
z-scores for 0.55 and 0.65.
For $\hat{p} = 0.55$:
$Z = \frac{0.55 - 0.60}{0.049} = \frac{-0.05}{0.049} \approx -1.02$

For $\hat{p} = 0.65$:
$Z = \frac{0.65 - 0.60}{0.049} = \frac{0.05}{0.049} \approx 1.02$

Using standard normal distribution tables or a calculator, the table value for Z = 1.02 is 0.8461, so:
$P(-1.02 < Z < 1.02) = 2(0.8461) - 1 \approx 0.6922$

Therefore, the probability that the sample proportion is between 0.55 and 0.65 is approximately
69.22%.
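
The calculation for the sample proportion can be checked with the standard library. This is a minimal sketch under the normal approximation to the sampling distribution of the sample proportion, without a continuity correction; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.60, 100                          # population proportion and sample size

standard_error = sqrt(p * (1 - p) / n)    # sqrt(p(1 - p) / n)
sampling_dist = NormalDist(p, standard_error)

# Probability that the sample proportion falls between 0.55 and 0.65.
prob = sampling_dist.cdf(0.65) - sampling_dist.cdf(0.55)

print(round(standard_error, 3))   # 0.049
print(round(prob, 2))             # 0.69
```

The result differs slightly in the third decimal from a table-based answer because the code does not round the standard error or the z-scores before looking up the probability.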

4.6 Student's t, Chi-Square (χ²) and Snedecor's F-Distributions

Student's t-Distribution, Chi-Square (χ²) Distribution, and Snedecor's F-Distribution
Overview

These three distributions are key in statistical inference, especially when dealing with small sample
sizes or when variances are unknown. Each distribution has unique properties and applications.
Student's t-Distribution

Definition

The Student's t-distribution is used when estimating the mean of a normally distributed population
when the sample size is small and the population standard deviation is unknown. It resembles the
normal distribution but has heavier tails, which provide a higher probability for extreme values.

Key Properties

1. Symmetry: The t-distribution is symmetric around zero.


2. Heavier Tails: It has heavier tails than the normal distribution, which means it gives more
probability to values far from the mean.
3. Degrees of Freedom (df): The shape of the t-distribution is determined by the degrees of
freedom, which is typically n − 1 for a sample of size n. As the sample size increases, the t-
distribution approaches the normal distribution.

Example Problem: t-Distribution

Problem:
A sample of 15 students' scores in a test is collected with a mean score of 75 and a sample standard
deviation of 10. Calculate the 95% confidence interval for the true mean score.
Solution:

1. Determine the Degrees of Freedom:


df = n − 1 = 15 − 1 = 14
2. Find the Critical Value:
Using a t-table, the critical value $t_{0.025,14} \approx 2.145$ for a 95% confidence level.
3. Calculate the Standard Error (SE):

$SE = \frac{s}{\sqrt{n}} = \frac{10}{\sqrt{15}} \approx 2.58$
4. Calculate the Margin of Error (ME):
$ME = t_{0.025,14} \times SE = 2.145 \times 2.58 \approx 5.53$

5. Construct the Confidence Interval:

Confidence Interval = $\bar{X} \pm ME = 75 \pm 5.53 = [69.47, 80.53]$
Therefore, the 95% confidence interval for the true mean score is approximately [69.47, 80.53].
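
The steps above can be sketched in Python. Because the standard library has no t-distribution quantile function, the critical value 2.145 is taken from the t-table exactly as in the solution; `t_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt


def t_confidence_interval(sample_mean, s, n, t_critical):
    """Confidence interval for a mean, given a tabulated t critical value."""
    standard_error = s / sqrt(n)          # s / sqrt(n)
    margin = t_critical * standard_error  # margin of error
    return sample_mean - margin, sample_mean + margin


# t_critical = 2.145 is the table value for 14 degrees of freedom at 95% confidence.
low, high = t_confidence_interval(75, 10, 15, t_critical=2.145)
print(round(low, 2), round(high, 2))  # 69.46 80.54
```

The endpoints differ by 0.01 from the hand calculation because the code keeps the unrounded standard error (about 2.582) rather than 2.58.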

Chi-Square (χ²) Distribution

Definition

The Chi-Square distribution is used primarily in hypothesis testing and constructing confidence
intervals for variances and standard deviations. It is also used in tests of independence and goodness-
of-fit tests.

Key Properties

1. Non-Negative: The χ² distribution is always non-negative because it is based on squared values.


2. Skewed Right: The distribution is skewed to the right, with the degree of skewness decreasing
as degrees of freedom increase.
3. Degrees of Freedom (df): The shape of the χ² distribution depends on the degrees of freedom.

Example Problem: Chi-Square Distribution

Problem:
A sample of 20 observations is drawn from a normal population, and the sample variance is calculated
to be 25. Test the hypothesis that the population variance is 20 at a 5% significance level.
Solution:

1. State the Hypotheses:


$H_0: \sigma^2 = 20$ and $H_1: \sigma^2 \neq 20$

2. Calculate the Test Statistic:


$\chi^2 = \frac{(n - 1)s^2}{\sigma^2} = \frac{(20 - 1) \times 25}{20} = \frac{475}{20} = 23.75$
3. Determine the Critical Values:
Using a χ² table with df = 19 and α = 0.05:
Critical values are $\chi^2_{0.025,19} \approx 32.85$ and $\chi^2_{0.975,19} \approx 8.91$.
4. Decision Rule:
If $\chi^2$ is between the critical values, fail to reject $H_0$.
If $\chi^2$ is outside the critical values, reject $H_0$.
5. Conclusion:
Since 23.75 is between 8.91 and 32.85, we fail to reject H0. There is not enough evidence to
conclude that the population variance is different from 20.
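
The test statistic and the decision rule translate directly into code. In this minimal sketch the two critical values are passed in as table look-ups (8.91 and 32.85 for 19 degrees of freedom, as in the solution); `chi_square_variance_test` is a hypothetical helper name.

```python
def chi_square_variance_test(n, sample_variance, hypothesized_variance,
                             lower_critical, upper_critical):
    """Two-sided chi-square test for a single population variance."""
    # Test statistic: (n - 1) * s^2 / sigma_0^2
    statistic = (n - 1) * sample_variance / hypothesized_variance
    # Reject H0 when the statistic falls outside the two critical values.
    reject = statistic < lower_critical or statistic > upper_critical
    return statistic, reject


statistic, reject = chi_square_variance_test(20, 25, 20, 8.91, 32.85)
print(statistic, reject)  # 23.75 False
```

A result of `False` corresponds to "fail to reject H0", matching the conclusion above.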

Snedecor's F-Distribution

Definition

The F-distribution is used to compare two variances and is commonly used in analysis of variance
(ANOVA). It is the distribution of the ratio of two independent chi-square random variables, each
divided by its degrees of freedom.

Key Properties

1. Non-Negative: The F-distribution is always non-negative.


2. Skewed Right: It is skewed to the right.
3. Degrees of Freedom (df): The F-distribution depends on two sets of degrees of freedom: df1 for
the numerator and df2 for the denominator.

Example Problem: F-Distribution

Problem:
Two independent samples are taken to compare their variances. Sample A has 10 observations with a
variance of 15, and Sample B has 12 observations with a variance of 10. Test whether the variances are
equal at a 5% significance level.
Solution:

1. State the Hypotheses:


$H_0: \sigma_A^2 = \sigma_B^2$ and $H_1: \sigma_A^2 \neq \sigma_B^2$

2. Calculate the Test Statistic:

$F = \frac{s_A^2}{s_B^2} = \frac{15}{10} = 1.5$

3. Determine the Critical Values:


Using an F-table with df1 = 9 (for Sample A) and df2 = 11 (for Sample B) and α = 0.05:
The upper critical value is $F_{0.025,9,11} \approx 3.59$. The lower critical value is
$F_{0.975,9,11} = 1 / F_{0.025,11,9} \approx 1 / 3.91 \approx 0.26$.
4. Decision Rule:
If F is between the critical values, fail to reject H0.
If F is outside the critical values, reject H0.
5. Conclusion:
Since 1.5 is between 0.26 and 3.59, we fail to reject H0. There is not enough evidence to
conclude that the variances are different.


Unit- 5
Statistical Inference and Testing

5.1 Statistical Inference- Estimation and Testing of Hypothesis

Statistical inference involves drawing conclusions about a population based on information from a
sample. It includes two key components: estimation and hypothesis testing.

Estimation

Estimation is the process of inferring the value of a population parameter based on a sample statistic.
There are two types of estimation: point estimation and interval estimation.

1. Point Estimation

A point estimate is a single value estimate of a population parameter. For example, the sample mean
($\bar{x}$) is a point estimate of the population mean (μ).

2. Interval Estimation

An interval estimate provides a range of values within which the population parameter is expected to
lie. This is usually expressed as a confidence interval.
For example, a 95% confidence interval for the population mean is given by:
$\bar{x} \pm Z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right)$
where $\bar{x}$ is the sample mean, $Z_{\alpha/2}$ is the critical value from the standard normal distribution for a 95%
confidence level, σ is the population standard deviation, and n is the sample size.

Hypothesis Testing

Hypothesis testing is a method of making decisions or inferences about population parameters based
on sample data. It involves the following steps:

1. Formulating Hypotheses
Null Hypothesis (H0): A statement of no effect or no difference, which we seek to test.
Alternative Hypothesis (H1): A statement that we want to test for, which contradicts the null
hypothesis.
2. Selecting a Significance Level (α)
Commonly used significance levels are 0.05, 0.01, and 0.10.
3. Choosing the Appropriate Test Statistic
Depending on the sample size and variance, different test statistics (z-test, t-test, chi-square
test, etc.) are used.

4. Formulating the Decision Rule


Based on the significance level and the distribution of the test statistic, determine the
critical value(s) and decision rule.
5. Calculating the Test Statistic from Sample Data
Compute the value of the test statistic using sample data.
6. Making a Decision
Compare the test statistic to the critical value(s) to reject or fail to reject the null hypothesis.

Example Problem with Solution

Problem: Suppose we want to test if a new drug has a different effect on blood pressure compared to
a placebo. The sample data are as follows:

Sample mean for the drug group: $\bar{x}_1 = 120$
Sample mean for the placebo group: $\bar{x}_2 = 130$
Standard deviation for the drug group: $s_1 = 15$
Standard deviation for the placebo group: $s_2 = 20$
Sample size for both groups: $n_1 = n_2 = 30$

We want to test this at a 5% significance level.

Solution:

1. Formulating Hypotheses
$H_0: \mu_1 = \mu_2$ (the drug has no effect compared to the placebo)
$H_1: \mu_1 \neq \mu_2$ (the drug has a different effect)
2. Selecting the Significance Level
α = 0.05
3. Choosing the Appropriate Test Statistic
Since we are comparing means from two independent samples, we use a two-sample t-test.
4. Formulating the Decision Rule
For a two-tailed test at α = 0.05, the critical value $t_{\alpha/2,df}$ from the t-distribution table for
$df = n_1 + n_2 - 2 = 58$ is approximately 2.001.
5. Calculating the Test Statistic
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

Substituting the given values:

$t = \frac{120 - 130}{\sqrt{\frac{15^2}{30} + \frac{20^2}{30}}} = \frac{-10}{\sqrt{\frac{225}{30} + \frac{400}{30}}} = \frac{-10}{\sqrt{7.5 + 13.33}} = \frac{-10}{\sqrt{20.83}} = \frac{-10}{4.56} \approx -2.19$

6. Making a Decision
The calculated t-value is −2.19. Since −2.19 is less than −2.001 (the critical value), we
reject the null hypothesis.
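
The test statistic and decision from this example can be sketched as follows. The function name `two_sample_t` is illustrative, and the critical value 2.001 is taken from the t-table as in step 4 rather than computed.

```python
from math import sqrt


def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """t statistic for the difference between two independent sample means."""
    standard_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


t_stat = two_sample_t(120, 15, 30, 130, 20, 30)
t_critical = 2.001                    # table value for df = 58, two-tailed, alpha = 0.05
reject = abs(t_stat) > t_critical     # two-tailed decision rule

print(round(t_stat, 2), reject)       # -2.19 True
```

Comparing the absolute value of t with the critical value is equivalent to checking both tails of the rejection region at once.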

Diagram: Confidence Interval


Let's visualize a 95% confidence interval for a population mean. Assume the sample mean is 100, the
population standard deviation is 15, and the sample size is 25.

1. Calculate the standard error:

$SE = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{25}} = \frac{15}{5} = 3$
2. Determine the critical value for 95% confidence:

$Z_{\alpha/2} = 1.96$

3. Compute the margin of error:

$ME = Z_{\alpha/2} \times SE = 1.96 \times 3 = 5.88$

4. Find the confidence interval:

$CI = \bar{x} \pm ME = 100 \pm 5.88 = (94.12, 105.88)$

Diagram:

|---------------------------|---------------------------|
94.12                      100                      105.88

This diagram represents a 95% confidence interval for the population mean.

By following these steps and understanding the underlying concepts, you can perform statistical
inference through estimation and hypothesis testing effectively.

5.2 Statistical Inference- Estimation- Point and interval

Statistical Inference: Estimation

Estimation is a fundamental aspect of statistical inference, used to make inferences about population
parameters based on sample data. It involves two main types: point estimation and interval
estimation.

Point Estimation

A point estimate is a single value used to estimate an unknown population parameter. The point
estimate is usually the sample statistic that best represents the population parameter. Common point
estimates include:

Sample mean ($\bar{x}$) for estimating the population mean (μ).
Sample proportion ($\hat{p}$) for estimating the population proportion (p).
Sample variance ($s^2$) for estimating the population variance ($\sigma^2$).

Example:
Suppose we have a sample of 50 students, and we measure their heights. The average height in the
sample is found to be 165 cm. Here, the sample mean (165 cm) is the point estimate of the population
mean height.

Interval Estimation

Interval estimation provides a range of values within which the population parameter is expected to
lie, along with a specified level of confidence. This range is called a confidence interval.

Confidence Interval for the Mean

A confidence interval for the population mean (μ) when the population standard deviation (σ) is
known is given by:

xˉ ± Zα/2 × (σ / √n)

where:

xˉ is the sample mean.
Zα/2 is the critical value from the standard normal distribution for the desired confidence level.
σ is the population standard deviation.
n is the sample size.

When σ is unknown, the sample standard deviation s is used, and the critical value is taken from the t-
distribution with n − 1 degrees of freedom:

xˉ ± tα/2,n−1 × (s / √n)

Confidence Interval for the Proportion

A confidence interval for the population proportion (p) is given by:


p^ ± Zα/2 × √( p^(1 − p^) / n )

where:

p^ is the sample proportion.
Zα/2 is the critical value from the standard normal distribution for the desired confidence level.
n is the sample size.
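
The proportion formula can be sketched as follows. The function name and the survey figures (120 "yes" answers out of 400) are hypothetical and chosen only for illustration. This normal-approximation interval is generally considered reasonable only when both n·p^ and n·(1 − p^) are not too small.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    me = z * se                              # margin of error
    return p_hat - me, p_hat + me


# Hypothetical survey: 120 of 400 respondents answer "yes".
lower, upper = proportion_confidence_interval(successes=120, n=400)
print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")
```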

Example Problem with Solution

Problem:
A sample of 40 light bulbs is taken from a large batch. The sample mean lifetime is found to be 1,200
hours with a sample standard deviation of 100 hours. Construct a 95% confidence interval for the
population mean lifetime of the light bulbs.

Solution:


1. Identify the sample statistics:


Sample mean (xˉ): 1,200 hours
Sample standard deviation (s): 100 hours
Sample size (n): 40
2. Determine the critical value:
For a 95% confidence level and 39 degrees of freedom (n − 1), the critical value (tα/2,39)
from the t-distribution table is approximately 2.023.
3. Calculate the standard error (SE):
SE = s / √n = 100 / √40 ≈ 15.81

4. Compute the margin of error (ME):

ME = tα/2,39 × SE = 2.023 × 15.81 ≈ 31.98

5. Construct the confidence interval:

CI = xˉ ± ME = 1200 ± 31.98 = (1168.02, 1231.98)

Diagram:

|-----------------------------------|-----------------------------------|
1168.02                            1200                           1231.98

This diagram represents a 95% confidence interval for the population mean lifetime of the light bulbs,
indicating that we are 95% confident that the true mean lifetime lies between 1168.02 hours and
1231.98 hours.
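
The light-bulb calculation can be checked with a short Python sketch. The function name `t_confidence_interval` is an assumption made for this example, and the critical value is passed in by hand from a t-table because the standard library has no t-distribution quantile function. The result should agree with the hand calculation up to rounding of intermediate values.

```python
import math


def t_confidence_interval(mean, s, n, t_critical):
    """CI for a mean when sigma is unknown; t_critical comes from a t-table."""
    se = s / math.sqrt(n)   # standard error from the sample standard deviation
    me = t_critical * se    # margin of error
    return mean - me, mean + me


# Light-bulb example: n = 40, so df = 39 and the tabulated t value is about 2.023.
lower, upper = t_confidence_interval(mean=1200, s=100, n=40, t_critical=2.023)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```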

Key Points to Remember

Point Estimate: A single value estimate of a population parameter (e.g., xˉ for μ).
Confidence Interval: A range of values, derived from the sample, that is likely to contain the
population parameter.
Critical Value: The value that corresponds to the desired confidence level from the appropriate
distribution (Z when σ is known or the sample is large, t when σ is unknown and estimated by s).
Margin of Error: The amount added and subtracted from the point estimate to create the
confidence interval.

Understanding these concepts allows for more informed decision-making and accurate interpretation
of statistical data.

5.3 Confidence interval using normal, t and χ² Distributions


Confidence Interval Using Normal, t, and Chi-Square (χ²) Distributions

Confidence intervals (CIs) provide a range of values within which a population parameter is expected
to lie with a certain level of confidence. Depending on the nature of the data and the sample size,
different distributions are used to construct these intervals: the normal distribution, the t-distribution,
and the chi-square (χ²) distribution.

1. Confidence Interval Using the Normal Distribution

When the sample size is large (typically n > 30) or the population standard deviation (σ) is known, we
use the normal distribution.
Formula for the Population Mean (μ):

xˉ ± Zα/2 × (σ / √n)

xˉ = Sample mean
Zα/2 = Critical value from the standard normal distribution for a given confidence level
σ = Population standard deviation
n = Sample size

Example:
A random sample of 100 battery lifetimes is taken from a large batch, with a sample mean of 500
hours and a known population standard deviation of 50 hours. Construct a 95% confidence interval for
the population mean lifetime.

1. Sample statistics:
xˉ = 500
σ = 50
n = 100
2. Critical value for 95% confidence:
Zα/2 = 1.96 (from the Z-table for 95% confidence)
3. Standard error (SE):

SE = σ / √n = 50 / √100 = 5

4. Margin of error (ME):

ME = Zα/2 × SE = 1.96 × 5 = 9.8

5. Confidence interval (CI):

CI = xˉ ± ME = 500 ± 9.8 = (490.2, 509.8)
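
To see what "95% confidence" means in practice, the following sketch simulates repeated sampling and counts how often the interval contains the true mean. It rests on a loud assumption: the population is taken to be normal with a true mean of 500 and σ = 50, which borrows the numbers of the battery example purely for illustration (in the example, 500 is the sample mean, not a known population mean).

```python
import math
import random
from statistics import NormalDist, mean

random.seed(42)  # fixed seed so the run is repeatable

MU, SIGMA, N = 500, 50, 100   # assumed "true" population values for the simulation
TRIALS = 2000
z = NormalDist().inv_cdf(0.975)   # about 1.96
me = z * SIGMA / math.sqrt(N)     # margin of error, the same for every sample

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    # Does this sample's interval contain the true mean?
    if x_bar - me <= MU <= x_bar + me:
        covered += 1

coverage = covered / TRIALS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should be close to 0.95: roughly 95 out of every 100 intervals built this way capture the true mean, while any single interval either does or does not.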

2. Confidence Interval Using the t-Distribution


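When the population standard deviation (σ) is unknown and is estimated by the sample standard
deviation (s), we use the t-distribution. This matters most when the sample size is small
(typically n ≤ 30); for larger samples the t and normal critical values are nearly equal.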
Formula for the Population Mean (μ):

xˉ ± tα/2,n−1 × (s / √n)

xˉ = Sample mean
tα/2,n−1 = Critical value from the t-distribution with n − 1 degrees of freedom for a given
confidence level
s = Sample standard deviation
n = Sample size

Example:
A sample of 15 students' test scores gives a mean score of 75 with a sample standard deviation of 10.
Construct a 95% confidence interval for the population mean score.

1. Sample statistics:
xˉ = 75
s = 10
n = 15
2. Critical value for 95% confidence:
tα/2,14 ≈ 2.145 (from the t-table for 95% confidence and 14 degrees of freedom)
3. Standard error (SE):

SE = s / √n = 10 / √15 ≈ 2.58

4. Margin of error (ME):

ME = tα/2,14 × SE = 2.145 × 2.58 ≈ 5.53

5. Confidence interval (CI):

CI = xˉ ± ME = 75 ± 5.53 = (69.47, 80.53)
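
The sketch below compares the t-based margin of error for the test-score example with the margin a normal critical value would give, and shows how the tabulated t values approach Z as the degrees of freedom grow. The dictionary `T_TABLE_95` is an assumption of this example: it holds a few two-sided 95% values copied from a standard t-table. Results may differ in the last digit from the hand calculation because the code does not round the standard error.

```python
import math
from statistics import NormalDist

# Two-sided 95% critical values copied from a standard t-table (df -> t).
T_TABLE_95 = {5: 2.571, 10: 2.228, 14: 2.145, 30: 2.042, 120: 1.980}

z = NormalDist().inv_cdf(0.975)  # about 1.96

# Test-score example: s = 10, n = 15, so df = 14.
s, n = 10, 15
se = s / math.sqrt(n)
me_t = T_TABLE_95[n - 1] * se   # margin of error using the t critical value
me_z = z * se                   # margin of error if Z were used instead

print(f"margin of error with t: {me_t:.2f}")
print(f"margin of error with z: {me_z:.2f}")

for df, t in sorted(T_TABLE_95.items()):
    print(f"df = {df:>3}: t = {t:.3f}, ratio t / z = {t / z:.3f}")
```

The t-based interval is wider, reflecting the extra uncertainty from estimating σ with s in a small sample.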

3. Confidence Interval Using the Chi-Square (χ²) Distribution

When estimating the population variance (σ²) or standard deviation (σ), we use the chi-square
distribution.
Formula for the Population Variance (σ²):

( (n − 1)s² / χ²α/2,n−1 , (n − 1)s² / χ²1−α/2,n−1 )

s² = Sample variance
χ²α/2,n−1 and χ²1−α/2,n−1 = Critical values from the chi-square distribution with n − 1 degrees of
freedom for a given confidence level

Example:
A sample of 20 measurements gives a sample variance of 4. Construct a 95% confidence interval for
the population variance.


1. Sample statistics:
s² = 4
n = 20
2. Critical values for 95% confidence:
χ²α/2,19 ≈ 32.852 (from the chi-square table for 0.025 in the upper tail and 19 degrees of
freedom)
χ²1−α/2,19 ≈ 8.907 (from the chi-square table for 0.975 in the upper tail and 19 degrees of
freedom)
3. Confidence interval (CI):

CI = ( (n − 1)s² / χ²α/2,19 , (n − 1)s² / χ²1−α/2,19 ) = ( 19 × 4 / 32.852 , 19 × 4 / 8.907 ) ≈ (2.31, 8.53)
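
The variance interval can be reproduced with the sketch below, which also takes square roots of the endpoints to give an interval for the standard deviation. The function name is hypothetical, the chi-square critical values are entered by hand from the table because the standard library does not provide them, and the method assumes the data come from a normal population.

```python
import math


def variance_confidence_interval(s_squared, n, chi2_upper, chi2_lower):
    """CI for a population variance.

    chi2_upper is the table value with alpha/2 in the upper tail,
    chi2_lower the value with 1 - alpha/2 in the upper tail.
    """
    numerator = (n - 1) * s_squared
    # Dividing by the larger critical value gives the lower bound.
    return numerator / chi2_upper, numerator / chi2_lower


var_lo, var_hi = variance_confidence_interval(
    s_squared=4, n=20, chi2_upper=32.852, chi2_lower=8.907
)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)

print(f"95% CI for variance: ({var_lo:.2f}, {var_hi:.2f})")
print(f"95% CI for standard deviation: ({sd_lo:.2f}, {sd_hi:.2f})")
```

Unlike the intervals for the mean, this interval is not symmetric about the point estimate s² = 4, because the chi-square distribution is skewed.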

Diagrams

[Figure: Confidence Interval Using the Normal Distribution]
[Figure: Confidence Interval Using the t-Distribution]
[Figure: Confidence Interval Using the Chi-Square Distribution]

These diagrams visually represent the confidence intervals calculated using the normal, t, and
chi-square distributions, respectively. Each diagram shows the lower and upper bounds of the
confidence interval. The normal and t intervals are centered on the point estimate, while the
chi-square interval for the variance is not symmetric about it.

5.4 Testing of Hypothesis Significance of a mean - Using t Distribution

Hypothesis Testing: Significance of a Mean Using the t-Distribution

Hypothesis testing is a statistical method used to make inferences or draw conclusions about a
population based on sample data. When the population standard deviation (σ) is unknown and the
sample size is small (typically n ≤ 30), the t-distribution is used to test hypotheses about the
population mean (μ).

Steps in Hypothesis Testing

1. Formulate the Hypotheses


Null Hypothesis (H0): The statement being tested, usually a statement of no effect or no
difference. It often specifies a particular value for the population mean.
Alternative Hypothesis (H1): The statement that we want to test for, representing an effect
or a difference.
2. Select the Significance Level (α)


The significance level is the probability of rejecting the null hypothesis when it is true.
Common choices for α are 0.05, 0.01, and 0.10.
3. Choose the Appropriate Test Statistic
For a mean with an unknown population standard deviation and small sample size, use the
t-statistic.
4. Formulate the Decision Rule
Determine the critical value(s) from the t-distribution table based on the degrees of
freedom (df = n − 1) and the significance level.
5. Calculate the Test Statistic
Compute the t-statistic using the sample data.
6. Make a Decision
Compare the calculated t-value with the critical value(s) to decide whether to reject or fail to
reject the null hypothesis.
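
For reference, steps 3 to 6 can be summarized in symbols. The notation μ₀ for the value of the mean specified by the null hypothesis is introduced here for clarity and does not appear in the steps above.

```latex
\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad df = n - 1
\]
\[
\text{Reject } H_0 \text{ if } |t| > t_{\alpha/2,\, n-1} \quad \text{(two-tailed test)}
\]
```

For a one-tailed test the whole of α is placed in one tail, so the comparison uses tα,n−1 and only one sign of t leads to rejection.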

Example Problem

Problem:
A researcher claims that the average weight of a particular species of fish in a lake is 3 kg. A sample of
10 fish is taken, and their weights (in kg) are: 2.8, 3.1, 3.0, 2.9, 3.2, 2.7, 3.3, 2.9, 3.0, 2.8. Test the claim
at the 5% significance level.

Solution:

1. Formulate the Hypotheses


H0 : μ = 3 kg (the average weight is 3 kg)
H1 : μ ≠ 3 kg (the average weight is not 3 kg)
2. Select the Significance Level
α = 0.05
3. Choose the Appropriate Test Statistic
Use the t-statistic since the sample size is small and σ is unknown.
4. Formulate the Decision Rule
Degrees of freedom: df = n − 1 = 10 − 1 = 9
For a two-tailed test with α = 0.05, the critical value from the t-table is approximately
±2.262.
5. Calculate the Test Statistic
Calculate the sample mean (xˉ) and sample standard deviation (s):

xˉ = ∑xi / n = (2.8 + 3.1 + 3.0 + 2.9 + 3.2 + 2.7 + 3.3 + 2.9 + 3.0 + 2.8) / 10 = 29.7 / 10 = 2.97

s = √( ∑(xi − xˉ)² / (n − 1) ) = √( [(2.8 − 2.97)² + (3.1 − 2.97)² + ⋯ + (2.8 − 2.97)²] / 9 ) = √(0.321 / 9) ≈ 0.189

Calculate the t-statistic:

t = (xˉ − μ) / (s / √n) = (2.97 − 3) / (0.189 / √10) = −0.03 / 0.0597 ≈ −0.50

6. Make a Decision
Compare the calculated t-value (−0.50) with the critical values (±2.262):
Since −0.50 is within the range −2.262 to 2.262, we fail to reject the null hypothesis. The sample
does not provide sufficient evidence at the 5% level that the mean weight differs from 3 kg.

Diagram

Here's a diagram to visualize the hypothesis test:

t-distribution curve:

-2.262              -0.50                       2.262
|--------------------|--------------------------|
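
The fish-weight test can be reproduced with a short Python sketch using the standard library. The variable names are illustrative, and the critical value 2.262 is entered by hand from the t-table (two-tailed, α = 0.05, df = 9). Small differences in the last digit from a hand calculation can arise from intermediate rounding; the decision is unaffected.

```python
import math
from statistics import mean, stdev

weights = [2.8, 3.1, 3.0, 2.9, 3.2, 2.7, 3.3, 2.9, 3.0, 2.8]
mu_0 = 3.0          # hypothesized mean under H0
t_critical = 2.262  # two-tailed t-table value for alpha = 0.05, df = 9

n = len(weights)
x_bar = mean(weights)
s = stdev(weights)  # sample standard deviation (n - 1 divisor)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-tailed decision rule: reject H0 when |t| exceeds the critical value.
reject = abs(t_stat) > t_critical
print(f"mean = {x_bar:.2f}, s = {s:.3f}, t = {t_stat:.2f}")
print("Reject H0" if reject else "Fail to reject H0")
```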

Key Points

Hypotheses: Clearly state the null and alternative hypotheses.


Significance Level: Choose an appropriate α level.
Test Statistic: Use the t-statistic when σ is unknown and the sample size is small.
Critical Values: Obtain from the t-distribution table based on df and α.
Decision Rule: Compare the calculated t-value with the critical value(s).
