Statistics Notes

The document discusses the principles of Design of Experiments (DOE) and its application in statistical analysis, emphasizing the importance of randomization, replication, and blocking. It explains the concept of random variables, their distributions, and the central limit theorem, which underpins the sampling distributions of test statistics. Additionally, it covers hypothesis testing, types of errors, and the methodology for selecting optimal sample sizes in experiments.

Statistics and experimental design in perspective

Design of experiments (DOE) is a statistical and mathematical tool for performing experiments in a systematic way and analysing the data efficiently. In DOE the levels of factors are changed simultaneously to find the effects of individual factors, as well as their interactions, on the response. The significance of the effect of a variable on the response is quantified by analysis of variance with reference to the experimental error, which is estimated by repeating experiments at the same levels of the factors. Experimental errors are inherent in experimentation. They arise from several causes, including atmospheric conditions, material inhomogeneity, operators' variability in conducting experiments, measurement errors, and non-standardization of test samples. If the mean square of the response due to a factor is sufficiently large compared with the mean square of the experimental error, that factor is called a significant factor. The mean square is found by dividing the sum of squares by its degrees of freedom, which will be discussed in the next section.

DOE was first successfully implemented in the agriculture sector by Ronald A. Fisher in the 1930s, who introduced the three main terms of randomization, replication, and blocking now widely used in DOE. He also gave the concept of analysis of variance (ANOVA) and of factorial and fractional factorial designs. The next development came in the 1950s, when Box and Wilson introduced response surface methodology (RSM), which was applied in process industries such as the chemical industry. In these industries data can be obtained quickly, and the experimenter can get sufficient information from a small number of runs, which can then be used in planning the next set of experiments for optimization. In the 1980s Taguchi introduced robust parameter design, which popularized the use of DOE in other industries such as automotive, aerospace, and electronics. His approach is to develop a product or process that is robust enough to remain insensitive to environmental or other factors that are difficult to control.

Random variable
In a statistical sense any experimental observation y is subject to error that includes several components. These components, which include errors associated with the process, measurement, and environment among others, are uncontrollable in nature. In view of the variations involved it is usual to consider y a random variable. The random variable y may be discrete or continuous. The distribution of all possible values of y, i.e. its population, is expressed by the probability function p(y) if it is discrete and by the probability density function f(y) if it is continuous. The population of y is statistically characterised mainly by its mean μ and variance σ², which quantify its central tendency and variability respectively. Mathematically they are expressed as follows.

For a continuous random variable:

μ = E(y) = ∫ y f(y) dy and σ² = V(y) = ∫ (y − μ)² f(y) dy (integrals from −∞ to ∞)   (1)

For a discrete random variable:

μ = E(y) = Σ y p(y) and σ² = V(y) = Σ (y − μ)² p(y) (sums over all possible y)   (2)

where E(y) represents the expected value of y and V is the variance operator. Using the above equations, the V operator can be written in terms of the E operator as V(y) = E[(y − μ)²]. Using these operators it can easily be shown that if y₁ and y₂ are two random variables with means μ₁ and μ₂ and variances σ₁² and σ₂² respectively, then

E(y₁ + y₂) = μ₁ + μ₂ and V(y₁ + y₂) = σ₁² + σ₂² + 2 Covariance(y₁, y₂)   (3)

E(y₁ − y₂) = μ₁ − μ₂ and V(y₁ − y₂) = σ₁² + σ₂² − 2 Covariance(y₁, y₂)   (4)

where Covariance(y₁, y₂) = E[(y₁ − μ₁)(y₂ − μ₂)]. If y₁ and y₂ are independent random variables then Covariance(y₁, y₂) = 0.
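Eqns (3) and (4) can be checked numerically. The following sketch (not part of the original notes; the two distributions and the sample count are arbitrary illustrative choices) simulates two independent normal variables and compares the sample moments of their sum and difference with the formulas:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent random variables: y1 ~ N(5, 2^2), y2 ~ N(3, 1^2)
y1 = rng.normal(5.0, 2.0, size=1_000_000)
y2 = rng.normal(3.0, 1.0, size=1_000_000)

s = y1 + y2  # sum
d = y1 - y2  # difference

# Eqn (3): E(s) = 5 + 3 = 8, V(s) = 4 + 1 = 5 (covariance term ~ 0 by independence)
print(s.mean(), s.var())
# Eqn (4): E(d) = 5 - 3 = 2, V(d) = 4 + 1 = 5
print(d.mean(), d.var())
```

Both empirical variances come out near 5 because the covariance term vanishes for independent variables; with correlated inputs the sum and difference would have different variances.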
Normal distribution

Many random variables encountered in practice follow the normal distribution N(μ, σ²). The shape of the normal distribution may be seen in Fig. 1, which shows the distribution of a statistic discussed later in this section. The notation N(μ, σ²) denotes a normal distribution with mean μ and variance σ². The location and shape of this distribution are determined by the parameters μ and σ² respectively. The normal distribution has the reproductive property under addition and subtraction. Thus if y₁ and y₂ are two independent random variables following the normal distributions N(μ₁, σ₁²) and N(μ₂, σ₂²) respectively, then y₁ + y₂ and y₁ − y₂ will also follow normal distributions, namely N(μ₁ + μ₂, σ₁² + σ₂²) and N(μ₁ − μ₂, σ₁² + σ₂²) respectively. This reproductive property also holds for n independent normally distributed random variables.

It is often convenient to express a random variable y following N(μ, σ²) as the standard normal random variable z:

z = (y − μ)/σ   (5)

The variable z follows N(0, 1). This standardization allows the cumulative standard normal distribution table to be used for statistical tests.
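The table-lookup step of Eqn (5) can also be done in software. A minimal sketch (the population parameters and observation below are assumed, purely for illustration):

```python
from scipy.stats import norm

mu, sigma = 50.0, 5.0   # hypothetical population mean and standard deviation
y = 58.0                # one observed value

z = (y - mu) / sigma    # Eqn (5): standardized value
p = norm.cdf(z)         # area under N(0,1) from -inf to z, i.e. P(y <= 58)

print(z)                # 1.6
print(p)                # ~0.945
```

norm.cdf plays the role of the cumulative standard normal table used in the tests below.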

Sample mean, variance, and degree of freedom


Estimating the mean and variance using Eqns (1) and (2) needs a very large number of observations and hence is impractical. So these are estimated by randomly taking a sample of an appropriate size n from the population. Random sampling means that each possible sample of a particular size has an equal probability of being chosen. A function of sample observations containing no unknown parameters is called a statistic. The population mean and variance are estimated by the statistics sample mean ȳ and sample variance S². These statistics are unbiased estimators of μ and σ². The equations to compute ȳ and S² are given below:
ȳ = (Σᵢ yᵢ)/n   (6)

S² = SS/(n − 1) = (Σᵢ (yᵢ − ȳ)²)/(n − 1)   (7)

where SS denotes the sum of squares and the sums run over i = 1 to n.

Applying the E and V operators to Eqn (6), we get

E(ȳ) = μ and V(ȳ) = σ²/n   (8)

It should also be noted that in Eqn (7) S² is estimated by dividing SS by n − 1 rather than n, even though SS is obtained by adding n terms. The reason is that of the n terms in SS only n − 1 are independent, since Σᵢ (yᵢ − ȳ) = 0. The number of independent terms, n − 1, is known as the degrees of freedom (df) of SS. For any statistical quantity, the df can be computed by subtracting the number of statistics used to compute that quantity from the number of observations.
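Eqns (6) and (7) can be sketched in a few lines. The sample values below are hypothetical, chosen only to illustrate the n − 1 divisor:

```python
import numpy as np

# A hypothetical sample of n = 6 wear-coefficient readings (illustrative values)
y = np.array([4.1, 3.8, 4.4, 4.0, 3.9, 4.2])
n = len(y)

ybar = y.sum() / n               # Eqn (6): sample mean
ss = ((y - ybar) ** 2).sum()     # sum of squares, with n - 1 degrees of freedom
s2 = ss / (n - 1)                # Eqn (7): unbiased sample variance

print(ybar)
print(s2)
# ddof=1 tells numpy to use the n - 1 divisor of Eqn (7):
print(np.isclose(s2, y.var(ddof=1)))
# The n deviations sum to zero, so only n - 1 of them are independent:
print(np.isclose((y - ybar).sum(), 0.0))
```

The last line is exactly the constraint Σᵢ (yᵢ − ȳ) = 0 that removes one degree of freedom from SS.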

Sampling distributions
The distribution of a statistic is known as its sampling distribution. The sampling distribution of the sample mean is an important one: it approximately follows the normal distribution even if the population from which the sample is drawn is not normal. This is due to the central limit theorem, which states that if we take a random sample of size n from a population, finite or infinite, with mean μ and variance σ², then the sample mean ȳ will approximately follow N(μ, σ²/n) if the sample size is large. The central limit theorem allows the use of the z₀₁ and z₀₂ statistics mentioned in Table 1 even if the samples are drawn from a non-normal population. Note that if the population is normal then ȳ will exactly follow N(μ, σ²/n), which can easily be proved using the reproductive property and Eqn (8). Sampling distributions of the important test statistics, along with their tribological applications, are given in Table 1. The first two columns of the table give information about the population/s and the sample/s. It should be noted that samples must be drawn randomly in order to estimate a particular test statistic. The symbols used in the table which have not been described earlier are defined at the end of the table.
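The central limit theorem described above can be seen in a short simulation (not from the notes; the exponential population and sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Strongly non-normal population: exponential with mean 2, variance 4
mu, var = 2.0, 4.0
n = 50                                     # size of each random sample

# 200,000 sample means, each from a sample of size n
means = rng.exponential(mu, size=(200_000, n)).mean(axis=1)

# Central limit theorem: ybar approximately follows N(mu, var / n)
print(means.mean())   # close to mu = 2.0
print(means.var())    # close to var / n = 0.08
```

A histogram of `means` would look close to the bell curve N(2, 0.08) even though the parent population is heavily skewed.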

Table 1. Sampling distributions of important test statistics with details of random sampling from population/s along with applications

Single population N(μ, σ²); one random sample of size n:

  z₀₁ = (ȳ − μ₀)/(σ/√n), which follows N(0, 1)
  t₀₁ = (ȳ − μ₀)/(S/√n), which follows t(n − 1)
    Tribological example: compare the new lubricant's load carrying capacity and wear coefficient with desired values.
  χ₀² = (n − 1)S²/σ₀², which follows χ²(n − 1)
    Tribological example: assess whether the old wear testing machine is giving the prescribed precision.

Two independent populations N(μ₁, σ₁²) and N(μ₂, σ₂²); two random samples, one of size n₁ from population 1 and another of size n₂ from population 2:

  z₀₂ = (ȳ₁ − ȳ₂)/√(σ₁²/n₁ + σ₂²/n₂), which follows N(0, 1)
  t₀₂ = (ȳ₁ − ȳ₂)/[Sp √(1/n₁ + 1/n₂)], which follows t(n₁ + n₂ − 2)
    Tribological example: compare two different lubricants on the basis of (i) wear coefficient, (ii) load carrying capacity, (iii) oxidation stability.
  F₀ = S₁²/S₂², which follows F(n₁ − 1, n₂ − 1)
    Tribological example: analysis of (i) a designed experiment of wear test, (ii) a regression wear model.

where,
  Sp = square root of the pooled sample variance
     = √[Total SS/df] = √[(SS₁ + SS₂)/(n₁ − 1 + n₂ − 1)] = √[((n₁ − 1)S₁² + (n₂ − 1)S₂²)/(n₁ + n₂ − 2)]
  t(n − 1) = t-distribution with n − 1 df
  t(n₁ + n₂ − 2) = t-distribution with n₁ + n₂ − 2 df
  χ²(n − 1) = chi-square distribution with n − 1 df
  F(n₁ − 1, n₂ − 1) = F-distribution with numerator df n₁ − 1 and denominator df n₂ − 1
In Table 1 the z₀₁ and t₀₁ statistics are used to compare a population mean μ with a fixed value μ₀, while z₀₂ and t₀₂ are used to compare the means μ₁ and μ₂ of two populations. The z statistic is used if the population variance is known; otherwise the t statistic should be used. The t₀₂ statistic assumes that σ₁² = σ₂², though the common value is not known; the estimate of variability needed for the calculation is the pooled value Sp obtained from both samples, as given under Table 1. If this assumption is not satisfied then another statistic given in the literature should be used. The χ₀² statistic is used to compare a population variance σ² with a fixed value σ₀². The F₀ statistic is used to compare the variances σ₁² and σ₂² of two populations. All these test statistics are based on standardized random variables. The detailed procedure for developing and using the test statistic z₀₂ is exemplified in the hypothesis testing discussion below.
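As one worked illustration of the table, the χ₀² statistic can be used to check the wear testing machine's precision. The numbers below (sample size, sample variance, prescribed variance, significance level) are assumed purely for illustration:

```python
from scipy.stats import chi2

# Hypothetical check: is the machine's variance still the prescribed sigma0^2?
n = 15                 # sample size
s2 = 0.07              # observed sample variance S^2
sigma0_sq = 0.04       # prescribed variance under H0
alpha = 0.05           # significance level

chi2_0 = (n - 1) * s2 / sigma0_sq          # chi2_0 = (n-1)S^2/sigma0^2, Table 1
crit = chi2.ppf(1 - alpha, df=n - 1)       # upper critical value of chi2(n-1)

print(chi2_0)                              # 24.5
print(chi2_0 > crit)                       # True -> reject H0: precision degraded
```

Here χ₀² = 24.5 exceeds the upper 5% point of χ²(14) (about 23.68), so the hypothesis that the machine still achieves the prescribed variance would be rejected.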

Hypothesis testing, errors in testing, and optimum sample size selection

The hypotheses of relevance here are based on statistical comparison of μ and σ² of one or more populations. A hypothesis test is a decision-making method for accepting or rejecting a statement. The statement to be tested is called the null hypothesis, denoted by H₀. Along with the null hypothesis an alternative hypothesis, denoted by H₁, is also specified; it gives the condition concluded when H₀ is rejected. Alternative hypotheses are of two types: (i) one-sided and (ii) two-sided. These are discussed later in a non-numeric example. Hypothesis testing involves the steps given below:

(i) Experimentation with a randomised sample.


(ii) Selection of an appropriate test statistic which follows certain sampling distribution.
(iii) Computation of the test statistic.
(iv) Selection of critical or rejection region in sampling distribution of interest.
(v) Decision on the null hypothesis.

In hypothesis testing two types of errors are encountered:

(i) Type-I error: H₀ is rejected when it is true.

(ii) Type-II error: H₀ is accepted when it is false.

The probabilities of these errors are denoted by α and β, expressed as

α = P(Type-I error)
β = P(Type-II error)

The above probabilities are discussed at the end of the following example.

Let us consider a non-numeric example in which an oil lubricant formulator wants to replace the existing engine oil, lubricant 1, with an eco-friendly lubricant, lubricant 2. He first wants to know whether the two lubricants differ in terms of engine wear. If they do differ, he would additionally like to know whether the eco-friendly lubricant is superior. To get these answers, tests on real engines at the same operating conditions can be performed and the wear coefficient estimated. Suppose n engine tests have been decided for each lubricant; in total, 2n tests should be conducted in a random order. It is assumed that the variance σ² is the same in both types of tests and that its value is known from experience.

For the above problem two separate hypothesis tests are done. In both tests the null hypothesis is H₀: μ₁ = μ₂, i.e. μ₁ − μ₂ = 0, which expresses that populations 1 and 2 are the same. To test this null hypothesis the distribution of ȳ₁ − ȳ₂ has to be considered, where ȳ₁ and ȳ₂ are the averages of the n wear coefficients found with lubricants 1 and 2 respectively. If populations 1 and 2 are independent normal distributions then, using Eqn (8), Eqn (3), and the reproductive property of the normal distribution, ȳ₁ − ȳ₂ will also be normally distributed, as N(μ₁ − μ₂, σ₁²/n + σ₂²/n). Under the null hypothesis H₀: μ₁ − μ₂ = 0 this distribution is N(0, σ₁²/n + σ₂²/n), shown as curve 1 in Fig. 1.
This curve extends from −∞ to +∞ and is centred at O₁. Now, using Eqn (5), the appropriate test statistic for H₀: μ₁ − μ₂ = 0 can be developed as a standardized random variable:

z₀₂ = (ȳ₁ − ȳ₂)/√(σ₁²/n + σ₂²/n)   (9)

The above statistic follows the standard normal distribution N(0, 1).

Since σ₁² = σ₂² = σ² is assumed for this problem, the test statistic can be rewritten as

z₀₂ = (ȳ₁ − ȳ₂)√n/(σ√2)   (10)

If σ₁² and σ₂² are not known but σ₁² = σ₂² = σ² can reasonably be assumed, then σ² is estimated by the pooled sample variance Sp² = Total SS/df = (SS₁ + SS₂)/(2n − 2). Using Eqn (7) we get Sp² = (S₁² + S₂²)/2, and the test statistic becomes

t₀₂ = (ȳ₁ − ȳ₂)√n/(Sp√2)

which follows the t-distribution with 2(n − 1) df.
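The pooled t₀₂ statistic above can be sketched directly. The wear-coefficient readings below are hypothetical, invented only to show the mechanics of the calculation:

```python
import numpy as np
from scipy.stats import t as t_dist

# Hypothetical wear coefficients for n = 8 engine tests with each lubricant
y1 = np.array([5.2, 4.9, 5.5, 5.1, 5.3, 5.0, 5.4, 5.2])   # lubricant 1
y2 = np.array([4.8, 4.6, 5.0, 4.7, 4.9, 4.5, 4.8, 4.7])   # lubricant 2
n = len(y1)

s1_sq = y1.var(ddof=1)
s2_sq = y2.var(ddof=1)
sp = np.sqrt((s1_sq + s2_sq) / 2)     # equal sample sizes: Sp^2 = (S1^2 + S2^2)/2

# t02 = (ybar1 - ybar2) * sqrt(n) / (Sp * sqrt(2)), with 2(n - 1) df
t02 = (y1.mean() - y2.mean()) * np.sqrt(n) / (sp * np.sqrt(2))
df = 2 * (n - 1)
p = 1 - t_dist.cdf(t02, df)           # one-sided p-value for H1: mu1 > mu2

print(t02, df, p)
```

With these invented numbers t₀₂ comes out near 5 with 14 df, a very small one-sided p-value, so H₀ would be rejected in favour of lubricant 2.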

[Figure: two probability density curves of the difference in average wear coefficients ȳ₁ − ȳ₂, each with variance 2σ²/n. Curve 1, centred at O₁ = 0, is N(0, 2σ²/n) under H₀: μ₁ = μ₂; curve 2, centred at O₂ = Δ, is N(Δ, 2σ²/n) under H₁: μ₁ > μ₂. Points B₁ and B₂ bound the two-sided critical region and point A marks the one-sided critical boundary on the x-axis, which runs from −∞ to ∞.]

Fig. 1. Probability density functions of the difference in averages ȳ₁ − ȳ₂ of wear coefficients with two different lubricants for two different hypotheses

For the first hypothesis test the alternative hypothesis is H₁: μ₁ ≠ μ₂. This is a two-sided alternative hypothesis, for which, at significance level α, the critical region lies in both tails of the distribution. In Fig. 1 it is shown under curve 1 from −∞ to B₁ and from B₂ to ∞; the area under curve 1 in each of these critical regions is α/2. The critical regions for this case are obtained using the cumulative standard normal distribution table, which gives the area under the N(0, 1) curve from −∞ to z. From this table find the z value for which the area from −∞ to z equals 1 − α/2, i.e. the upper-tail area equals α/2, and denote it zα/2. Now calculate z₀₂ using Eqn (10). If |z₀₂| > zα/2, i.e. z₀₂ falls in the critical region, the null hypothesis H₀ is rejected and we say that the two lubricants really differ in their wear response.

For the second hypothesis test the alternative hypothesis is H₁: μ₁ > μ₂. This is a one-sided alternative hypothesis, for which, at significance level α, the critical region lies in one tail of the distribution. In Fig. 1 it is shown under curve 1 from A to ∞; the area under curve 1 in this critical region is α. If z₀₂ is found to be greater than zα, the null hypothesis H₀ is rejected and we say that lubricant 1 gives a higher wear coefficient than lubricant 2. In other words, lubricant 2, the eco-friendly lubricant, has better antiwear properties than the existing engine oil.
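The two decision rules above can be sketched together; scipy's inverse cumulative function takes the place of reading zα/2 and zα from the table (the z₀₂ value below is a hypothetical computed result of Eqn (10)):

```python
from scipy.stats import norm

alpha = 0.05

# Two-sided H1: mu1 != mu2 -> reject H0 if |z02| > z_{alpha/2}
z_half = norm.ppf(1 - alpha / 2)    # ~1.960
# One-sided H1: mu1 > mu2  -> reject H0 if z02 > z_alpha
z_one = norm.ppf(1 - alpha)         # ~1.645

z02 = 2.3                           # hypothetical value computed from Eqn (10)
print(abs(z02) > z_half)            # lubricants differ?
print(z02 > z_one)                  # lubricant 1 wears more?
```

Note that the one-sided critical value is smaller than the two-sided one at the same α, reflecting that the whole rejection probability sits in a single tail.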

In the above conclusions, type-I and type-II errors are encountered as mentioned earlier. For a type-I error, H₀: μ₁ − μ₂ = 0 is in fact true, hence curve 1 in Fig. 1 should be referred to for this error; the probability of wrongly rejecting H₀ is the area of the critical region, which is α. For a type-II error, H₀ is false, so the distribution of ȳ₁ − ȳ₂ shown by curve 1 is not correct. Let the correct distribution be N(Δ, σ₁²/n + σ₂²/n), shown by curve 2 and centred at O₂; Δ is the difference between the means of the true and hypothesised distributions of ȳ₁ − ȳ₂. The probability of wrongly accepting H₀ is the area of the acceptance region under curve 2, which is denoted by β. For the first part of the problem β is the area under curve 2 from B₁ to B₂, while for the second part it is from −∞ to A. Referring to Fig. 1 we can easily draw the following conclusions:

(i) The value of β depends on the choice of α: as α increases, β decreases, and vice versa.
(ii) Increasing Δ, the difference between the means of the true and hypothesised distributions, decreases β.
(iii) Increasing the sample size decreases the variability of the distributions, since V(ȳ₁ − ȳ₂) = σ₁²/n + σ₂²/n. The resulting narrower distribution curves reduce both α and β.

From the above discussion it is clear that optimum sample size selection depends on the α and β values. Let us find the sample size for the second part of the problem. Referring to Fig. 1, if we convert the random variable ȳ₁ − ȳ₂ into a standard normal random variable with respect to H₀: μ₁ − μ₂ = 0, curves 1 and 2 will be distributed as N(0, 1) and N(Δ√n/(σ√2), 1) respectively. The x-axis then gives the z₀₂ value, and the x-coordinate of A is zα. Converting z₀₂ again into a standard normal random variable with respect to the distribution N(Δ√n/(σ√2), 1) gives the x-coordinate of A as zα − Δ√n/(σ√2), and curve 2 becomes N(0, 1). The area under this distribution from −∞ to −zβ is β, which is exactly the area from −∞ to A. Hence we can write the following relationship:

−zβ = zα − Δ√n/(σ√2)   (11)

Rearranging the above equation we get

n = (zα + zβ)² 2σ²/Δ²   (12)

The above equation can be used in selecting an appropriate sample size.
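Eqn (12) can be evaluated in a couple of lines. The σ and Δ values below are assumptions chosen only to illustrate the calculation:

```python
import math
from scipy.stats import norm

# Eqn (12): n = (z_alpha + z_beta)^2 * 2 * sigma^2 / Delta^2
alpha, beta = 0.05, 0.10
sigma = 0.2      # assumed known standard deviation of the wear coefficient
delta = 0.25     # smallest mean difference worth detecting (assumed)

z_a = norm.ppf(1 - alpha)    # one-sided alternative, so z_alpha (not z_alpha/2)
z_b = norm.ppf(1 - beta)

n = (z_a + z_b) ** 2 * 2 * sigma ** 2 / delta ** 2
print(math.ceil(n))          # round up to the next whole number of tests
```

With these assumed values the formula gives n ≈ 11 engine tests per lubricant; a larger Δ or a looser β would reduce the required sample size, as Eqn (12) shows directly.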

Another approach to finding the sample size is the use of operating characteristic (OC) curves. The appropriate set of OC curves is chosen from the different sets available in statistical handbooks or textbooks on the basis of the significance level α, the sampling distribution, and the type of alternative hypothesis. The set chosen for the second part of the problem contains many curves, one for each sample size. These curves show the variation of β as a function of d, a non-dimensional difference between the means of the true and hypothesised distributions, given by

d = Δ/√(σ₁² + σ₂²)   (13)

Knowing Δ, the variances, α, and the required β, the proper sample size may be selected.

It may be noted that software packages are available for statistical analysis as well as DOE. Mention may be made here of SAS and Design-Expert, two packages widely used in these areas.
