
SPSS Training: Statistical Data Analysis

1. The document provides an overview of the training program on statistical data analysis using SPSS. 2. It covers topics such as fundamentals of statistics, tests of hypotheses, ANOVA, cross tabulation, chi-square tests, correlation, linear regression, and logistic regression. 3. The fundamentals section defines key terms like statistics, analytics, data, and demonstrates how to summarize sample data by finding the central tendency, variation in the data, and shape of the distribution.

Uploaded by

Jeevitha jain

Indian Statistical Institute

Training Program on
Statistical Data Analysis
using

SPSS

CONTENTS

SL No. Topics
1 Fundamentals of Statistics
2 Test of Hypothesis
3 ANOVA
4 Cross Tabulation
5 Chi Square Test
6 Correlation
7 Linear Regression
8 Dummy Variable Regression
9 Logistic Regression


FUNDAMENTALS
of
STATISTICS


DESCRIPTIVE STATISTICS

Statistics
Science of collection, analysis, interpretation and presentation of data

Analytics
Process of extracting meaningful insights by discovering patterns and
relationships in the data


DESCRIPTIVE STATISTICS

Data set
Number of tasks completed per hour

Productivity
69 71
70 68
72 70
71 67
70 69
68 72
73 70
69 71


DESCRIPTIVE STATISTICS

Data
The values or figures assigned to a metric
Can be numeric or non-numeric

Example
Metric: Productivity
Data: 70, 72, 69, etc


DESCRIPTIVE STATISTICS

Use of Statistics
To know what happened in the past
To know approximately what will happen in the future if things remain more or less the same

Example
Approximately how many tasks will be completed in the next hour?
Is it the same as the last value, since the last value is the closest to the future?
The difference between the last value and the previous value is only 1 (71 – 70). So will adding 1 to the last value give the next value?

Approximately how many tasks will be completed two hours down the line?

Do the mean, median, etc. give a better estimate of the future value? Why?

DESCRIPTIVE STATISTICS

Use of Statistics
Generally, the values in a data set will not all be the same
The values not only change, they change without any particular trend or pattern

Example
If we collect another 16 hours of productivity data, the first value may not be 69, the second may not be 70, and so on.
Similarly, the difference between the 1st and 2nd values, or between the 2nd and 3rd values, may not be the same.
It is possible that none of the values will be repeated in the new data set.
How will you make projections about the future?
Explore what remains more or less consistent even though the values are changing.
Use it for future projections, generalizations, etc.

DESCRIPTIVE STATISTICS

Productivity Data Demonstration


69 71
70 68
72 70
71 67
70 69
68 72
73 70
69 71
[Figure: dot plot of the productivity values on an axis from 67 to 73]


DESCRIPTIVE STATISTICS

Interpretation
• The only thing that remains more or less consistent is the shape
• All projections or generalizations need to be made using the shape
• The shape represents the distribution
• The properties of the shape will also remain more or less constant
• The properties can be used to make projections

[Figure: distribution of Productivity, annotated with its Center and Variation]

FUNDAMENTALS OF STATISTICS
Summarization of sample data
The monthly credit card expenses of an individual, in thousands of rupees, are given below.
Summarize the data.

Month Credit Card Expenses Month Credit Card Expenses


1 55 11 63
2 65 12 55
3 59 13 61
4 59 14 61
5 57 15 57
6 61 16 59
7 53 17 61
8 63 18 57
9 59 19 59
10 57 20 63

FUNDAMENTALS OF STATISTICS
Summarization of sample data
[Figure: histogram of the credit card expenses; x-axis from 51 to 67, frequency axis from 0 to 6]

Summary:
1. Centre (central tendency) of the data
2. Spread or variation in the data
3. Shape of the curve


FUNDAMENTALS OF STATISTICS

Variable Data: Measure of Centre tendency

Sample Mean:

• Numerical value indicating the central value of data


• Sum of all data / Total number of data

Let $x_1, x_2, \dots, x_n$ be the data. Then

Sample Mean $= (x_1 + x_2 + \dots + x_n) / n = \sum_{i=1}^{n} x_i / n$

FUNDAMENTALS OF STATISTICS
Summarization of sample data: Credit Card Expenses

Sample Mean: Sum of all data / Total number of data

= (55 + 65 + 59 + 59 + 57 + 61 + 53 + 63 + 59 + 57 + 63 + 55 + 61 + 61 + 57
+ 59 + 61 + 57 + 59 + 63) / 20
= 1184 / 20 = 59.2

Interpretation
On average, the individual spends Rs. 59,200 per month through credit card

FUNDAMENTALS OF STATISTICS

Summarization of sample data: Measure of Centre tendency

Sample Median:

• Middle Value
• Value which divides data arranged in ascending or descending order
into two equal halves

Case 1: Total number of data is odd


Median: Middle Value

Case 2: Total number of data is even


Median: Average of two middle values

Credit Card Expenses


Median = ?

FUNDAMENTALS OF STATISTICS

Summarization of sample data: Measure of Centre tendency

Sample Median: Credit Card Expenses


Data arranged in ascending order:

Rank Credit Card Expenses Rank Credit Card Expenses
1 53 11 59
2 55 12 59
3 55 13 61
4 57 14 61
5 57 15 61
6 57 16 61
7 57 17 63
8 59 18 63
9 59 19 63
10 59 20 65

Median = average of the 10th and 11th values = (59 + 59) / 2 = 59

Interpretation
In at least 50% of the months, the credit card expenses are less than or equal to Rs. 59,000/-

FUNDAMENTALS OF STATISTICS

Summarization of sample data: Measure of Centre tendency

Sample Mode:

• The value which occurs maximum number of times in the data

Example: Credit Card Expenses


Mode = ?


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Centre tendency

Sample Mode:

• The value which occurs maximum number of times in the data

Example: Credit Card Expenses


Values No. of Occurrences
53 1
55 2
57 4
59 5
61 4
63 3
65 1
Total 20

Mode = 59

Interpretation
The most frequently occurring monthly credit card expense is Rs. 59,000/-
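
The three measures of central tendency for the credit card data can be cross-checked outside SPSS. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the training material.

```python
from statistics import mean, median, mode

# Monthly credit card expenses (in thousands of rupees), months 1 to 20
expenses = [55, 65, 59, 59, 57, 61, 53, 63, 59, 57,
            63, 55, 61, 61, 57, 59, 61, 57, 59, 63]

sample_mean = mean(expenses)      # sum of all data / total number of data
sample_median = median(expenses)  # n is even, so the average of the two middle values
sample_mode = mode(expenses)      # value that occurs the maximum number of times

print("Mean:", sample_mean)
print("Median:", sample_median)
print("Mode:", sample_mode)
```

The results should agree with the slide values of 59.2, 59 and 59.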


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Range: Definition

Range: Maximum value – Minimum Value

Example:

5 4 7 3 2
15 9 8 5 2

Maximum Value = 15
Minimum Value = 2
Range = 15 – 2 = 13


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Range: Issues


It depends only on extreme values
Hence affected by outliers


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Range: Issues

[Figure: the 10 sample values plotted against observation number (x-axis 1 to 10, y-axis 0 to 16), with the Range marked]

FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Standard Deviation: Example:

5 4 7 3 2
15 9 8 5 2

Step 1:
Calculate Mean
Mean = 6


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Standard Deviation: Example:

5 4 7 3 2
15 9 8 5 2

Step 2:
Take deviations from Mean

-1 -2 1 -3 -4
9 3 2 -1 -4


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Standard Deviation: Example:

Step 2:
Take deviations from Mean

[Figure: the 10 sample values plotted against observation number (x-axis 1 to 10, y-axis 0 to 16), illustrating the deviations from the mean]

FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Standard Deviation: Example:

Step 3:
Since some values are positive & rest are negative, while
taking sum they will cancel out.
So square the values & Sum

1 4 1 9 16
81 9 4 1 16

Sum = 142


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Standard Deviation: Example:

Step 4:
Standard Deviation $= \sqrt{\text{Sum of Squares} / (n - 1)}$
$= \sqrt{142 / (10 - 1)}$
$= \sqrt{15.78} = 3.972$
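
Steps 1 to 4 can be reproduced in a few lines. This is a sketch in plain Python (standard library only), with illustrative variable names; `statistics.stdev` is included only as a cross-check because it also uses the n - 1 divisor.

```python
import math
from statistics import mean, stdev

values = [5, 4, 7, 3, 2, 15, 9, 8, 5, 2]

m = mean(values)                                     # Step 1: mean
deviations = [x - m for x in values]                 # Step 2: deviations from the mean
sum_of_squares = sum(d * d for d in deviations)      # Step 3: square and sum
sd = math.sqrt(sum_of_squares / (len(values) - 1))   # Step 4: divide by n - 1, take the root

value_range = max(values) - min(values)              # range, for comparison

print("Mean:", m)
print("Sum of squares:", sum_of_squares)
print("Standard deviation:", round(sd, 3))
print("Cross-check with statistics.stdev:", round(stdev(values), 3))
print("Range:", value_range)
```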


FUNDAMENTALS OF STATISTICS

Summarization of sample data : Measure of Variation or Spread

Sample Standard Deviation: Definition

Square root of the average squared deviation from the mean

Indicates, on average, how far each value is from the mean

[Figure: the 10 sample values plotted against observation number (x-axis 1 to 10, y-axis 0 to 16)]

FUNDAMENTALS OF STATISTICS
Sample Standard Deviation: Credit Card usage data

Month Credit Card Expenses Month Credit Card Expenses


1 55 11 63
2 65 12 55
3 59 13 61
4 59 14 61
5 57 15 57
6 61 16 59
7 53 17 61
8 63 18 57
9 59 19 59
10 57 20 63


FUNDAMENTALS OF STATISTICS
Frequency Table

Count of how many times each distinct value of a variable occurs


Table of counts, percentages and cumulative percentages for all values of a
variable


FUNDAMENTALS OF STATISTICS
Frequency Table: Credit Card usage data

Month Credit Card Expenses Month Credit Card Expenses


1 55 11 63
2 65 12 55
3 59 13 61
4 59 14 61
5 57 15 57
6 61 16 59
7 53 17 61
8 63 18 57
9 59 19 59
10 57 20 63


FUNDAMENTALS OF STATISTICS
Frequency Table: Credit Card usage data

Values Count Percent Cumulative Percent


53 1 5 5
55 2 10 15
57 4 20 35
59 5 25 60
61 4 20 80
63 3 15 95
65 1 5 100
Total 20 100
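
The frequency table above can be rebuilt from the raw data as a check. A minimal sketch, assuming only the Python standard library; the names `rows` and `cumulative` are illustrative.

```python
from collections import Counter

# Monthly credit card expenses (in thousands of rupees), months 1 to 20
expenses = [55, 65, 59, 59, 57, 61, 53, 63, 59, 57,
            63, 55, 61, 61, 57, 59, 61, 57, 59, 63]

counts = Counter(expenses)   # value -> number of occurrences
n = len(expenses)

rows = []
cumulative = 0.0
for value in sorted(counts):
    percent = 100 * counts[value] / n
    cumulative += percent    # running total gives the cumulative percent
    rows.append((value, counts[value], percent, cumulative))

print("Value  Count  Percent  Cumulative Percent")
for value, count, percent, cum in rows:
    print(f"{value:5d}  {count:5d}  {percent:7.0f}  {cum:18.0f}")
```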


FUNDAMENTALS OF STATISTICS

Graphical representation of frequency table


FUNDAMENTALS OF STATISTICS
Exercise: The data of 30 customers on credit card usage in INR 1000, sex (1: male, 2: female) and whether they have done shopping or banking (1: yes, 2: no) with credit card are given in the table below.
1. Summarize and interpret the credit card usage.
2. How does the credit card usage differ by sex?
3. How does the credit card usage pattern vary between those who do shopping with a credit card and those who don't?
4. How does the credit card usage pattern vary between those who do banking with a credit card and those who don't?

TEST
of
HYPOTHESIS


TEST OF HYPOTHESIS

Introduction:
In many problems, it is required to accept or reject a statement about some
parameter

Example:
1. The average cycle time is less than 24 hours
2. The % rejection is only 1%

The statement is called the hypothesis


The procedure for decision making about the hypothesis is called hypothesis
testing


TEST OF HYPOTHESIS

Some of the commonly used hypothesis tests:

• Mean equal to a specified value ($\mu = \mu_0$)
• Two means are equal or not ($\mu_1 = \mu_2$)
• Two variances are equal or not ($\sigma_1^2 = \sigma_2^2$)
• Proportion equal to a specified value ($P = P_0$)
• Two proportions are equal or not ($P_1 = P_2$)

TEST OF HYPOTHESIS

Null Hypothesis:
A statement about the status quo
One of no difference or no effect
Denoted by H0

Alternative Hypothesis:
One in which some difference or effect is expected
Opposite of null hypothesis
Denoted by H1


TEST OF HYPOTHESIS

Types of errors in hypothesis testing


The decision procedure can lead to either of the two wrong conclusions

Type I Error
Rejecting the null hypothesis H0 when it is true
Type II Error
Failing to reject the null hypothesis H0 when it is false

Alpha (Significance level) = Probability of making type I error


Beta = Probability of making type II error
Power = 1 – Beta : Probability of correctly rejecting a false null hypothesis


TEST OF HYPOTHESIS

Hypothesis Testing: General Procedure


1. Formulate the null hypothesis H0 and the alternative hypothesis H1
2. Select an appropriate statistical technique and the corresponding test statistic
3. Choose level of significance alpha (generally taken as 0.05)
4. Collect data and calculate the value of test statistic
5. Determine the probability associated with the test statistic under the null
hypothesis using sampling distribution of the test statistic
6. Compare the probability associated with the test statistic with level of
significance specified


TEST OF HYPOTHESIS

Methodology: To Test Mean = Specified Value (mu = mu0)


Suppose we want to test whether mean is 5 based on the sample data

4 4 5 5 6
5 4.5 6.5 6 5.5

Calculate the Mean of the Sample, xbar = 5.15


Check whether xbar = specified value 5
or xbar - specified value = xbar - 5 = 0
If xbar - 5 is close to 0
then conclude mean = 5
else mean ≠ 5

TEST OF HYPOTHESIS

Methodology: To Test Mean = Specified Value (mu = mu0)


Consider another set of sample data. Check whether mean is 500

400 400 500 500 600


500 450 650 600 550

Mean of the Sample, xbar = 515


xbar - 500 = 515 - 500 = 15

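Can we conclude mean ≠ 500?

Conclusion:
It is difficult to judge whether mean = specified value by looking at xbar - specified value alone. The difference depends on the scale of the data: 15 here carries the same information as 0.15 in the previous sample.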

TEST OF HYPOTHESIS

Methodology demo: To Test Mean = Specified Value (mu = mu0)

Test statistic is calculated by dividing (xbar - specified value) by a function of the standard deviation

To test Mean = Specified value

Test Statistic $t_0 = (\bar{x} - \text{specified value}) / (SD / \sqrt{n})$

If the test statistic is close to 0, conclude that Mean = Specified value

To check whether the test statistic is close to 0, find the p value from the sampling distribution of the test statistic

TEST OF HYPOTHESIS

Methodology demo: To Test Mean = Specified Value (mu = mu0)

[Figure: sampling distribution of the test statistic, with the central region (p ≥ 0.05) lying between $-t_{\alpha/2}$ and $t_{\alpha/2}$]

If test statistic t0 is close to 0 then p will be high (p ≥ 0.05)
If test statistic t0 is not close to 0 then p will be small (p < 0.05)
If p is high, p ≥ 0.05 (with alpha = 0.05), conclude that t0 ≈ 0; then Mean = Specified Value and H0 is not rejected

TEST OF HYPOTHESIS

To Test Mean = Specified Value (mu = mu0)

Example: Suppose we want to test whether the mean of the process characteristic is 5 based on the following sample data
4 4 5 5 6
5 4.5 6.5 6 5.5

H0: Mean = 5
H1: Mean ≠ 5

Calculate xbar = 5.15
SD = 0.8515
n = 10
Test statistic $t_0 = (\bar{x} - 5) / (SD / \sqrt{n}) = (5.15 - 5) / (0.8515 / \sqrt{10}) = 0.5571$
Critical value = ± 2.262 (two-sided, alpha = 0.05, n - 1 = 9 degrees of freedom)

TEST OF HYPOTHESIS

Example: To Test Mean = Specified Value (mu = mu0)

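t0 = 0.5571

[Figure: t distribution with t0 falling between $-t_{\alpha/2}$ and $t_{\alpha/2}$, inside the region where p ≥ 0.05]

p = 0.59 ≥ 0.05, so the data are consistent with Mean = Specified value = 5.

H0: Mean = 5 is not rejected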
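
The test statistic in this example can be recomputed by hand. The sketch below uses only the Python standard library, so the p value itself is not computed; instead t0 is compared with the tabulated two-sided critical value for 9 degrees of freedom, which is hard-coded here as an assumption taken from a t table.

```python
import math
from statistics import mean, stdev

sample = [4, 4, 5, 5, 6, 5, 4.5, 6.5, 6, 5.5]
mu0 = 5                      # specified value under H0

n = len(sample)
xbar = mean(sample)
sd = stdev(sample)           # sample standard deviation (n - 1 divisor)

# t0 = (xbar - specified value) / (SD / sqrt(n))
t0 = (xbar - mu0) / (sd / math.sqrt(n))

# Two-sided critical value for alpha = 0.05 and n - 1 = 9 degrees of freedom,
# read from a t table (assumption: hard-coded rather than computed)
T_CRITICAL = 2.262
reject_h0 = abs(t0) > T_CRITICAL

print("xbar =", round(xbar, 2))
print("SD =", round(sd, 4))
print("t0 =", round(t0, 4))
print("Reject H0:", reject_h0)
```

Comparing |t0| with the critical value leads to the same decision as comparing the p value with alpha.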


TEST OF HYPOTHESIS

To Test Mean = Specified Value

Exercise 1: An insurance company claims that, on average, it takes only 40 days to process any insurance claim. Based on the data given below, can you validate the claim?

73 19 16 64 28 28 31 90 60 56 31 56 22

18 45 48 17 17 17 91 92 63 50 51 69 16 17


TEST OF HYPOTHESIS

To Test Mean = Specified Value:

Exercise 2: A computer manufacturing company claims that, on average, it will respond to any complaint logged by a customer from anywhere in the world within 24 hours. Based on the data, validate the claim.
Response Time
24 26
31 27
29 24
26 23
28 27
26 28
29 27
29 23
27 27
31 23
25 25
29 27
29 26
25 28
26 27

TEST OF HYPOTHESIS

To Test Two Means are Equal:

Null Hypothesis H0: Mean 1 = Mean 2 (mu1 = mu2)

Alternative Hypothesis H1: Mean 1 ≠ Mean 2 (mu1 ≠ mu2)
or
H1: Mean 1 > Mean 2 (mu1 > mu2)
or
H1: Mean 1 < Mean 2 (mu1 < mu2)

TEST OF HYPOTHESIS

To Test Two Means are Equal: Methodology

Calculate both sample means xbar1 & xbar2
Calculate SD1 & SD2
Check whether xbar1 = xbar2
or xbar1 - xbar2 = 0
Calculate Test Statistic t0 by dividing (xbar1 – xbar2) by a function of SD1 & SD2
$t_0 = (\bar{x}_1 - \bar{x}_2) / (S_p \sqrt{1/n_1 + 1/n_2})$
where $S_p = \sqrt{((n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2) / (n_1 + n_2 - 2)}$ is the pooled standard deviation
Calculate p value from t distribution
If p > 0.05 then H0: Mean1 = Mean2 is not rejected
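
A sketch of the two-sample statistic, assuming the pooled (equal variance) form of Sp, which is the usual companion of the $\sqrt{1/n_1 + 1/n_2}$ term. The function name `pooled_t_statistic` is illustrative; the data are the road octane numbers from the gasoline exercise in this section. The p value would still be read from a t distribution with n1 + n2 - 2 degrees of freedom (for example in SPSS).

```python
import math
from statistics import mean, variance

def pooled_t_statistic(sample1, sample2):
    """Return t0 = (xbar1 - xbar2) / (Sp * sqrt(1/n1 + 1/n2))."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(sample1) +
                  (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    sp = math.sqrt(pooled_var)
    return (mean(sample1) - mean(sample2)) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Road octane numbers for the old and new formulations
old = [89.5, 90, 91, 91.5, 92.5, 91, 89, 89.5, 91, 92]
new = [89.5, 91.5, 91, 89, 91.5, 92, 92, 90.5, 90, 91]

t0 = pooled_t_statistic(old, new)
df = len(old) + len(new) - 2

print("t0 =", round(t0, 4), "with", df, "degrees of freedom")
```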


TEST OF HYPOTHESIS

To Test Two Means are Equal: Exercise 1


A supermarket chain has introduced a promotional activity in selected outlets in the city to increase the sales volume. Based on the data given below, check whether the promotional activity resulted in increased sales. The outlets where the promotional activity was introduced are denoted by 1 and the others by 2.
Outlet Sales Outlet Sales
1 1217 2 1731
1 1416 2 1420
1 1381 2 1065
1 1413 2 1612
1 1800 2 1361
1 1724 2 1259
1 1310 2 1470
1 1616 2 622
1 1941 2 1711
1 1792 2 2315
1 1453 2 1180
1 1780 2 1515

TEST OF HYPOTHESIS

To Test Two Means are Equal: Exercise


A petroleum company has developed a new formulation for gasoline. Road octane number (RON) is an important characteristic of gasoline. 10 observations on road octane number from each formulation are given below. Check whether the mean road octane number is the same for both formulations.
Formulation RON Formulation RON
Old 89.5 New 89.5
Old 90 New 91.5
Old 91 New 91
Old 91.5 New 89
Old 92.5 New 91.5
Old 91 New 92
Old 89 New 92
Old 89.5 New 90.5
Old 91 New 90
Old 92 New 91

TEST OF HYPOTHESIS

Exercise 3: The data of 30 customers on credit card usage in INR 1000, sex (1: male, 2: female) and whether they have done shopping or banking (1: yes, 2: no) with credit card are given in the table below.
1. Check whether the average credit card usage is the same for both sexes.
2. Check whether the average credit card usage is the same for those who do shopping with a credit card and those who don't.
3. Check whether the average credit card usage is the same for those who do banking with a credit card and those who don't.

TEST OF HYPOTHESIS

To Test Two Variances are Equal: Methodology ($\sigma_1^2 = \sigma_2^2$)

Null Hypothesis
H0: $\sigma_1^2 = \sigma_2^2$

Alternative Hypothesis
H1: $\sigma_1^2 \neq \sigma_2^2$

Calculate standard deviations of both the samples, S1 & S2
Calculate Test Statistic $F = S_1^2 / S_2^2$
If F is close to 1, then conclude $\sigma_1^2 = \sigma_2^2$
Calculate p from F distribution.

If p > 0.05, then H0: $\sigma_1^2 = \sigma_2^2$ is not rejected
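
The F ratio itself is a one-line calculation. A minimal sketch with made-up numbers (the two small samples below are hypothetical and chosen only so the ratio is easy to verify by hand); the p value would still come from the F distribution with n1 - 1 and n2 - 1 degrees of freedom.

```python
from statistics import variance

def variance_ratio(sample1, sample2):
    """Return F = S1^2 / S2^2, the ratio of the two sample variances."""
    return variance(sample1) / variance(sample2)

# Hypothetical samples for illustration only
sample_a = [2, 4, 6, 8]   # sample variance 20/3
sample_b = [1, 2, 3, 4]   # sample variance 5/3

f0 = variance_ratio(sample_a, sample_b)
df1, df2 = len(sample_a) - 1, len(sample_b) - 1

print("F =", f0, "with", df1, "and", df2, "degrees of freedom")
```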

TEST OF HYPOTHESIS

To Test Two Variances are Equal: Exercise 1


A supermarket chain has introduced a promotional activity in selected outlets in the city to increase the sales volume. The outlets where the promotional activity was introduced are denoted by 1 and the others by 2. Check for equality of variance.

Outlet Sales Outlet Sales


1 1217 2 1731
1 1416 2 1420
1 1381 2 1065
1 1413 2 1612
1 1800 2 1361
1 1724 2 1259
1 1310 2 1470
1 1616 2 622
1 1941 2 1711
1 1792 2 2315
1 1453 2 1180
1 1780 2 1515

TEST OF HYPOTHESIS

To Test Two Variances are Equal: Exercise


A petroleum company has developed a new formulation for gasoline. Road octane number (RON) is an important characteristic of gasoline. 10 observations on road octane number from each formulation are given below. Check whether both formulations have the same consistency with respect to RON.
Formulation RON Formulation RON
Old 89.5 New 89.5
Old 90 New 91.5
Old 91 New 91
Old 91.5 New 89
Old 92.5 New 91.5
Old 91 New 92
Old 89 New 92
Old 89.5 New 90.5
Old 91 New 90
Old 92 New 91

ANALYSIS
of
VARIANCE


ANALYSIS OF VARIANCE

ANOVA

Analysis of Variance is a test of means for two or more populations

Example:
To study the effect of shelf location on sales revenue

ANALYSIS OF VARIANCE

One Way Anova : Example

A supermarket chain suspects that the location of the shelf where toys are kept influences their sales volume. The sales revenue (in lakhs) from toys kept at different locations inside the store is given in the Sales Revenue data. The location is denoted as 1: front, 2: middle & 3: rear. Verify the suspicion.

ANALYSIS OF VARIANCE

One Way Anova : Example

Factor: Location(A)
Levels : front, middle, rear
Response: Sales revenue

One Way Anova involves only one factor


ANALYSIS OF VARIANCE

One Way Anova : Example

Step 1: Calculate the Sum, Average and Number of Responses for each
level of the factor (location).

Level 1 Sum(A1):
Sum of all responses when location is at level 1 (front)
= 1.34 + 1.89 + 1.35 + 2.07 + 2.41 + 3.06
= 12.12
nA1: Number of responses when location is at level 1 (front)
= 6

ANALYSIS OF VARIANCE

One Way Anova : Example

Step 1: Calculate the Sum, Average and Number of Responses for each
level of the factor (location).

Level 1 Average:
Sum of all responses when location is at level 1 / Number of Responses with
location is at level 1
= A1 / nA1 = 12.12 / 6 = 2.02


ANALYSIS OF VARIANCE

One Way Anova : Example

Step 1: Calculate the Sum, Average and Number of Responses for each
level of the factor (location).

         Level 1 (front)   Level 2 (middle)   Level 3 (rear)
Sum      A1: 12.12         A2: 38.09          A3: 7.09
Number   nA1: 6            nA2: 8             nA3: 4
Average  2.02              4.7613             1.7725

ANALYSIS OF VARIANCE

One Way Anova : Example

Step 2: Calculate the Grand Total (T)


T = Sum of all the Responses
= 1.34 + 1.89 + ... + 1.40 + 1.48 = 57.3

Step 3: Calculate the Total Number of Responses (N)
N = 18

Step 4: Calculate the Correction Factor (CF)
CF = (Grand Total)$^2$ / Number of Responses
$= T^2 / N = (57.3)^2 / 18 = 182.405$

ANALYSIS OF VARIANCE

One Way Anova : Example

Step 5: Calculate the Total Sum of Squares ( TSS)


TSS = Sum of squares of all the responses - CF
$= 1.34^2 + 1.89^2 + \dots + 1.40^2 + 1.48^2 - 182.405$
$= 72.4918$

ANALYSIS OF VARIANCE

One Way Anova : Example

Step 6: Calculate the Between Sum of Square of Factor


$SS_A = A_1^2 / n_{A1} + A_2^2 / n_{A2} + A_3^2 / n_{A3} - CF$
$= 12.12^2 / 6 + 38.09^2 / 8 + 7.09^2 / 4 - 182.405$
$= 36.0004$

Step 7: Calculate the Within Sum of Square
$SS_e$ = Total Sum of Square - Sum of Square of Factor
$= TSS - SS_A = 72.4918 - 36.0004 = 36.4914$

ANALYSIS OF VARIANCE

One Way Anova : Example

Step 8: Calculate Degrees of Freedom (Df)


Total Df = Total Number of Responses - 1
= 18 - 1 = 17
Between Df = Number of levels of the Factor - 1
= 3 - 1 = 2
Within Df = Total Df – Between Df
= 17 - 2 = 15

ANALYSIS OF VARIANCE

One Way Anova : Example

Anova Table:

Source   Df  SS       MS       F       F Table  P value
Between  2   36.0004  18.0002  7.3991  3.68     0.0058
Within   15  36.4914  2.4328
Total    17  72.4918  4.2642

MS = SS / Df
F = MS Between / MS Within
F table = finv(probability, between df, within df), probability = 0.05
P value = fdist(F, between df, within df)
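
Steps 1 to 8 and the Anova table can be retraced with a short script. This is a sketch in plain Python using the sales revenue data shown in the Anova logic slides; variable names follow the slide symbols where possible, and the tabulated F value 3.68 is hard-coded from the table above rather than computed.

```python
# Sales revenue (in lakhs) by shelf location
groups = {
    "front":  [1.34, 1.89, 1.35, 2.07, 2.41, 3.06],
    "middle": [3.20, 2.81, 4.52, 4.40, 4.75, 5.19, 3.42, 9.80],
    "rear":   [2.30, 1.91, 1.40, 1.48],
}

all_values = [x for level in groups.values() for x in level]
N = len(all_values)                      # Step 3: total number of responses
T = sum(all_values)                      # Step 2: grand total
CF = T ** 2 / N                          # Step 4: correction factor

TSS = sum(x * x for x in all_values) - CF                              # Step 5
SSA = sum(sum(level) ** 2 / len(level) for level in groups.values()) - CF  # Step 6
SSe = TSS - SSA                                                        # Step 7

df_total = N - 1                         # Step 8: degrees of freedom
df_between = len(groups) - 1
df_within = df_total - df_between

MS_between = SSA / df_between
MS_within = SSe / df_within
F = MS_between / MS_within

F_TABLE = 3.68                           # tabulated value for (0.05, 2, 15), from the slide
significant = F > F_TABLE

print(f"SSA = {SSA:.4f}, SSe = {SSe:.4f}, TSS = {TSS:.4f}")
print(f"F = {F:.4f}, significant at 0.05: {significant}")
```

Comparing F with the tabulated value gives the same decision as comparing the p value with 0.05.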

ANALYSIS OF VARIANCE

One Way Anova : Decision Rule

If the p value of a Factor is < 0.05, then the Factor has a significant effect on the Process Output or Response.

Meaning:
When the Factor is changed from one level to another, there will be a significant change in the response.

ANALYSIS OF VARIANCE

One Way Anova : Example Result

For Factor Location, p = 0.006 < 0.05


Conclusion:
Location has significant effect on Sales Revenue

Meaning:
The sales revenue is not same for different locations like front, middle &
rear


ANALYSIS OF VARIANCE

One Way Anova : Example Result

The expected sales revenue for different location under study is equal to level
averages.

Location Expected Sales Revenue
Front 2.0200
Middle 4.7613
rear 1.7725

ANALYSIS OF VARIANCE

One Way Anova : Example Result


[Figure: Average Response Graph; average Sales Revenue (y-axis 1 to 6) by Location (Front, Middle, rear)]

ANALYSIS OF VARIANCE

Anova logic:

Two Types of Variations:


1. Variation within the level of a Factor
2. Variation between the levels of Factor


ANALYSIS OF VARIANCE

Anova logic :
Variation between the levels of a Factor:
The effect of the Factor.

Variation within the levels of a Factor:
The inherent variation in the process, or Process Error.

Sales Revenue by Location

Front Middle rear
1.34 3.20 2.30
1.89 2.81 1.91
1.35 4.52 1.40
2.07 4.40 1.48
2.41 4.75 -
3.06 5.19 -
- 3.42 -
- 9.80 -

ANALYSIS OF VARIANCE

Anova logic :

If the variation between the levels of a Factor is significantly higher than the inherent variation, then the Factor has a significant effect on the Response

To check whether a Factor is significant:
Compare the variation between levels with the variation within levels

ANALYSIS OF VARIANCE

Anova logic :

Measure of Variation between Levels: MS of the Factor


Measure of Variation within levels: MS Error

To check whether a Factor is significant:


Compare MS of Factor with MS Error
i.e. Calculate F = MS factor / MS Error
If F is very high, then the factor is significant.


ANALYSIS OF VARIANCE

Exercise 1: A bank wants to check whether the waiting time of customers at its single window operations is the same across 4 cities. The data is given in the bank_waiting_time file.

Exercise 2: An automobile manufacturing company wants to study the effect of four different tuning techniques on mileage. The data collected is given in the Mileage file. Test whether the tuning technique affects the mileage.

CROSS TABULATION

CROSS TABULATION

• An approach to summarize and identify the relation between two variables or parameters
• Describes two variables simultaneously
• Expressed as a two way table
• Parameters need to be categorical or grouped

Layout of the two way table (rows: Input or Process Variable; columns: Output Variable):

Input or Process Variable   Very Good   Good   Average   Below Average   Poor
0 – 3
3 – 6
6 – 12

CROSS TABULATION

Example: An apparel manufacturing company has collected data from 45 customers on usage, sex, awareness of the brand and preference for the brand. Usage has been coded as 1, 2 and 3, representing light, medium and heavy usage. Sex has been coded as 1 for female and 2 for male users. The attitude and preference are measured on a 7 point scale (1: unfavorable to 7: very favorable). The data is given in the shoe_data file.
1. Do males and females differ in their usage?
2. Do males and females differ in their awareness of the brand?
3. Do males and females differ in their preference?
4. Does higher awareness mean higher preference?

CROSS TABULATION

Sex vs. Usage

Sex      Usage
         Light   Medium   Heavy   Total
Female   14      5        5       24
Male     5       5        11      21
Total    19      10       16      45

CROSS TABULATION

Exercise 1: An ITeS company has collected the following information from its customers through a survey. The data has been collected on a 5 point scale (1: Very dissatisfied to 5: Very satisfied). The survey questions are given below and the data is given in the Csat_data file. Check whether questions 1 to 9 are related to overall satisfaction.
1. Team’s ability to meet service level agreements
2. Team’s ability to deliver seamlessly in the event of changes (volume fluctuations, resource movement etc.)
3. Team’s operational performance
4. Team's application of process knowledge
5. Team's communication with you
6. Team’s effectiveness in handling escalations
7. Team’s flexibility and responsiveness to special service requests
8. Team’s contribution to your business requirements
9. Effectiveness of the reviews around operations delivery
10. Overall satisfaction with team’s service

CHI SQUARE TEST


CHI SQUARE TEST

Objective:
To test whether two variables are related or not
To check whether one metric depends on another metric

Usage:
When both the variables (X & Y) are categorical (grouped)

H0: Relation between x & y = 0, or x and y are independent
H1: Relation between x & y ≠ 0, or x and y are not independent

If p value < 0.05, then H0 is rejected

CHI SQUARE TEST

Exercise:

A project was undertaken to improve the CSat score of transaction processing. Based on brainstorming, the project team concluded that lack of experience is a cause of low CSat score.
The following data was collected. Analyze the data and verify whether the CSat score depends on experience.

Experience   CSat Score
(Months)     VD    D     N     S     VS
0–3          50    40    30    10    10
3–6          5     30    50    35    7
6–9          6     7     30    40    50

Note: The table gives the count of each CSat score (1, 2, etc.) for each group of agents


CHI SQUARE TEST

Exercise:

Step 1: Calculate the Row and column Sum

Experience   CSat Score
(Months)     VD    D     N     S     VS    Row Sum
0–3          50    40    30    10    10    140
3–6          5     30    50    35    7     127
6–9          6     7     30    40    50    133
Col Sum      61    77    110   85    67    400

CHI SQUARE TEST

Exercise:

Step 2: Calculate Expected Count for each Cell

Expected Count of CSat Score 1 for the group with 0–3 months experience
= Expected count of Cell (1,1) = (Row 1 sum x Column 1 sum) / Total
= (140 x 61) / 400 = 21.35 ≈ 21.4

Table of expected count

Experience   CSat Score
(Months)     VD     D      N      S      VS     Row Sum
0–3          21.4   27     38.5   29.8   23.5   140
3–6          19.4   24.4   34.9   27     21.3   127
6–9          20.3   25.6   36.6   28.3   22.3   133
Col Sum      61     77     110    85     67     400

CHI SQUARE TEST

Exercise:

Step 3: Take difference between actual count and expected count

For Cell (1,1)
Actual Count = 50
Expected Count = 21.35
Difference = 50 - 21.35 = 28.65 ≈ 28.7

Table of Actual Count – Expected count

Experience   CSat Score
(Months)     VD      D      N      S      VS
0–3          28.7    13.1   -8.5   -20    -13
3–6          -14.4   5.55   15.1   8.01   -14
6–9          -14.3   -19    -6.6   11.7   27.7

CHI SQUARE TEST

Exercise:

Step 4: Calculate $(\text{Actual} - \text{Expected})^2 / \text{Expected}$ for each cell

Table of $(\text{Actual} - \text{Expected})^2 / \text{Expected}$

Experience   CSat Score
(Months)     VD      D       N      S       VS
0–3          38.45   6.32    1.88   13.11   7.71
3–6          10.66   1.26    6.51   2.38    9.58
6–9          10.06   13.52   1.18   4.87    34.50

CHI SQUARE TEST


Exercise:
Step 5: Calculate Chi Square = Sum of all $(\text{Actual} - \text{Expected})^2 / \text{Expected}$

Chi Square calculated = 38.45 + 6.32 + ... + 34.50
$\chi^2$ calculated = 161.98

Step 6: Calculate p value

df = (rows - 1) x (columns - 1) = (3 - 1) x (5 - 1) = 8
P value = chidist(chi Sq, df)
= chidist(161.98, 8)
= 0.00
Conclusion:
Since p value 0.00 < 0.05, CSat score depends on experience.
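
Steps 1 to 5 of the exercise can be verified with a few list comprehensions. A minimal sketch in plain Python, using the observed counts from the CSat table; the variable names are illustrative, and the p value is left to a chi square table or to SPSS.

```python
# Observed counts: rows are experience groups (0-3, 3-6, 6-9 months),
# columns are CSat scores VD, D, N, S, VS
observed = [
    [50, 40, 30, 10, 10],
    [5, 30, 50, 35, 7],
    [6, 7, 30, 40, 50],
]

row_sums = [sum(row) for row in observed]           # Step 1
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Step 2: expected count = (row sum x column sum) / total
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Steps 3 to 5: sum of (actual - expected)^2 / expected over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (rows - 1) x (columns - 1)

print("Row sums:", row_sums)
print("Column sums:", col_sums)
print("Chi square =", round(chi_square, 2), "with df =", df)
```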


CHI SQUARE TEST


Issues:

• The chi square test only shows whether two variables are independent or not
• It does not give the degree of association

Measures of Strength of relationship:

1. Phi coefficient: $\phi = \sqrt{\chi^2 / n}$
Only for 2 x 2 tables

2. Cramer V: $V = \sqrt{\phi^2 / \min(\mathrm{rows} - 1,\ \mathrm{cols} - 1)} = \sqrt{\chi^2 / (n \cdot \min(\mathrm{rows} - 1,\ \mathrm{cols} - 1))}$

Phi & V vary from 0 to 1; the higher the value, the stronger the relationship

CHI SQUARE TEST

Phi Coefficient = $\sqrt{161.98 / 400}$ = 0.64

Cramer V:
Rows – 1 = 2
Columns - 1 = 4
min(2, 4) = 2

Cramer V = $\sqrt{\phi^2 / 2} = \sqrt{(161.98 / 400) / 2}$ = 0.45, or about 45%
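As a quick check of the two strength measures, the following sketch recomputes them from the chi square value, sample size and table dimensions of the exercise. The variable names are illustrative.

```python
import math

chi_sq = 161.98      # chi square from the CSat exercise
n = 400              # total number of observations
rows, cols = 3, 5    # experience groups x CSat scores

# Phi coefficient: sqrt(chi square / n).
phi = math.sqrt(chi_sq / n)

# Cramer V: sqrt(chi square / (n x min(rows - 1, cols - 1))).
cramers_v = math.sqrt(chi_sq / (n * min(rows - 1, cols - 1)))

print(f"phi = {phi:.2f}, Cramer V = {cramers_v:.2f}")
```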


CHI SQUARE TEST

Tau b
• Varies from -1 to +1
• Close to -1 indicates strong negative correlation, close to +1 indicates strong positive
correlation & close to 0 indicates no relation
• Used for square (n x n) tables

Tau c
• Varies from -1 to +1
• Close to -1 indicates strong negative correlation, close to +1 indicates strong positive
correlation & close to 0 indicates no relation
• Used for rectangular r x c tables (r not equal to c)



CHI SQUARE TEST

Example: A branded shoe manufacturing company has collected data from
45 customers on usage, sex, awareness of the brand and preference for the brand.
Usage has been coded as 1, 2 and 3, representing light, medium and heavy
usage. Sex has been coded as 1 for female and 2 for male users. The
attitude and preference are measured on a 7 point scale (1: unfavorable to 7:
very favorable). The data is given in the shoe_data file.
1. Estimate the relation between sex and usage.
2. Estimate the relation between sex and awareness of the brand.
3. Estimate the relation between sex and preference.
4. Does higher awareness mean higher preference?


CHI SQUARE TEST

Exercise 1: An ITeS company has collected the following information from its customers
through a survey. The data was collected on a 5 point scale (1: Very dissatisfied to 5:
Very satisfied). The survey questions are given below and the data is given in the Csat_data
file. Check whether questions 1 to 9 are related to overall satisfaction (question 10).
1. Team’s ability to meet service level agreements
2. Team’s ability to deliver seamlessly in the event of changes (volume fluctuations,
resource movement etc)
3. Team’s operational performance
4. Team's application of process knowledge
5. Team's communication with you
6. Team’s effectiveness in handling escalations
7. Team’s flexibility and responsiveness to special service requests
8. Team’s contribution to your business requirements
9. Effectiveness of the reviews around operations delivery
10. Overall satisfaction with the team’s service


CORRELATION
&
REGRESSION

CORRELATION & REGRESSION

Correlation:

Correlation analysis is a technique to identify the relationship between two variables:
the type and degree of relationship between them.



CORRELATION & REGRESSION

Correlation: Usage

Explore the relationship between the output characteristic and input or process
variable.

Output variable : Y : Dependent variable


Input / Process variable : X : Independent variable

CORRELATION & REGRESSION

Positive Correlation: Y increases as X increases & vice versa

[Figure: scatter plot of Y against X]


CORRELATION & REGRESSION

Negative Correlation: Y decreases as X increases & vice versa

[Figure: scatter plot of Y against X]

CORRELATION & REGRESSION

No Correlation: Random Distribution of points

[Figure: scatter plot of Y against X]


CORRELATION & REGRESSION

Is there any correlation ?

[Figure: scatter plot of Y against X]
CORRELATION & REGRESSION

Measure of Correlation: Coefficient of Correlation

Symbol : r
Range : -1 to 1
Sign : Type of correlation
Value : Degree of correlation

Examples:
r = 0.6 , 60 % positive correlation
r = -0.82, 82% negative correlation
r = 0, No correlation



CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Collect data on x and y: When x is low, y is also low & vice versa

x y
2 5
3 7
1 3
5 11
6 12
7 15
CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Calculate Mean of x & y values

SL No. x y

1 2 5
2 3 7
3 1 3

4 5 11
5 6 12
6 7 15
Mean 4 8.83



CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Take x – Mean x and y – Mean y

SL No.   x – Mean x   y – Mean y
1        -2           -3.83
2        -1           -1.83
3        -3           -5.83
4         1            2.17
5         2            3.17
6         3            6.17

Conclusion: Low values will become negative & high values will become positive
CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Generally when x values are negative, y values are also negative & vice versa

SL No. x – Mean x y – Mean y

1 -2 -3.83
2 -1 -1.83
3 -3 -5.83

4 1 2.17
5 2 3.17
6 3 6.17


CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Then
Product of x & y values will be positive

SL No. x – Mean x y – Mean y Product

1 -2 -3.83 7.66
2 -1 -1.83 1.83
3 -3 -5.83 17.49

4 1 2.17 2.17
5 2 3.17 6.34
6 3 6.17 18.51
Sum = Sxy 54
CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Sum of Product of x & y values (Sxy) will be positive

SL No. x – Mean x y – Mean y Product

1 -2 -3.83 7.66
2 -1 -1.83 1.83
3 -3 -5.83 17.49

4 1 2.17 2.17
5 2 3.17 6.34
6 3 6.17 18.51
Sum = Sxy 54


CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Collect data on x and y: When x is low then y will be high & vice versa

x y
2 12
3 11
1 15
5 7
6 5
7 3
CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Calculate Mean of x & y values

SL No. x y

1 2 12
2 3 11
3 1 15

4 5 7
5 6 5
6 7 3
Mean 4 8.83



CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Take x – Mean x and y – Mean y

SL No.   x – Mean x   y – Mean y
1        -2            3.17
2        -1            2.17
3        -3            6.17
4         1           -1.83
5         2           -3.83
6         3           -5.83

Conclusion: Low values will become negative & high values will become positive
CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Generally when x values are negative, y values are positive & vice versa

SL No.   x – Mean x   y – Mean y
1        -2            3.17
2        -1            2.17
3        -3            6.17
4         1           -1.83
5         2           -3.83
6         3           -5.83


CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Then
Product of x & y values will be negative

SL No.   x – Mean x   y – Mean y   Product
1        -2            3.17         -6.34
2        -1            2.17         -2.17
3        -3            6.17        -18.51
4         1           -1.83         -1.83
5         2           -3.83         -7.66
6         3           -5.83        -17.49
Sum = Sxy                          -54
CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Sum of Product of x & y values Sxy will be negative

SL No.   x – Mean x   y – Mean y   Product
1        -2            3.17         -6.34
2        -1            2.17         -2.17
3        -3            6.17        -18.51
4         1           -1.83         -1.83
5         2           -3.83         -7.66
6         3           -5.83        -17.49
Sum = Sxy                          -54


CORRELATION & REGRESSION

Coefficient of Correlation:

In Short
If correlation is positive
Sxy will be positive
If correlation is negative
Sxy will be negative

CORRELATION & REGRESSION

Coefficient of Correlation:

Sxy is divided by $\sqrt{S_{xx} \cdot S_{yy}}$

$S_{xy} = \sum (x - \bar{x})(y - \bar{y})$
$S_{xx} = \sum (x - \bar{x})^2$
$S_{yy} = \sum (y - \bar{y})^2$
Correlation Coefficient $r = S_{xy} / \sqrt{S_{xx} \cdot S_{yy}}$



CORRELATION & REGRESSION

Coefficient of Correlation:

SL No.   x – Mean x   y – Mean y   Product    (x – Mean x)^2   (y – Mean y)^2
1        -2            3.17         -6.34     4                10.0489
2        -1            2.17         -2.17     1                 4.7089
3        -3            6.17        -18.51     9                38.0689
4         1           -1.83         -1.83     1                 3.3489
5         2           -3.83         -7.66     4                14.6689
6         3           -5.83        -17.49     9                33.9889
Sum                                Sxy: -54   Sxx: 28          Syy: 104.83

$r = S_{xy} / \sqrt{S_{xx} \cdot S_{yy}} = -54 / \sqrt{28 \times 104.83} = -0.9967$
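The calculation of Sxy, Sxx, Syy and r can be reproduced directly from the six (x, y) pairs of the negative correlation example. This is a minimal sketch with illustrative variable names.

```python
import math

# Data from the negative correlation example.
x = [2, 3, 1, 5, 6, 7]
y = [12, 11, 15, 7, 5, 3]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sums of products and squares of the deviations from the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Correlation coefficient r = Sxy / sqrt(Sxx x Syy).
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"Sxy = {s_xy:.2f}, Sxx = {s_xx:.2f}, Syy = {s_yy:.2f}, r = {r:.4f}")
```

Replacing `y` with the values of the positive correlation example (5, 7, 3, 11, 12, 15) flips the sign of Sxy and of r.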


CORRELATION & REGRESSION

Correlation Coefficients:

1. Spearman’s rho ($\rho$)

2. Kendall’s Tau ($\tau$)

Vary from -1 to +1
Close to -1 indicates negative correlation
Close to +1 indicates positive correlation
Close to 0 means no correlation
Used for ranked (ordinal) data rather than continuous variables
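Both rank coefficients can be sketched in a few lines. The functions below are illustrative helpers, not SPSS output: they assume there are no tied values, and the Kendall version is the simple form without tie correction (often called tau a). The data are the six pairs from the earlier negative correlation example.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (no ties)."""
    n = len(x)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    """Kendall's tau without tie correction: (concordant - discordant) / pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [2, 3, 1, 5, 6, 7]
y = [12, 11, 15, 7, 5, 3]
print(spearman_rho(x, y), kendall_tau(x, y))
```

Because y falls every time x rises in this data set, both coefficients reach their lower limit.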



CORRELATION & REGRESSION

Example: To study the effect of quality and price on supermarket store preference, 14
major supermarkets in a city are rated in terms of preference to shop,
quality of merchandise and pricing. All the ratings were obtained on an 11
point scale, with a higher number indicating a more positive rating. The data is
given in the Supermarket_Correlation file. Do quality and price influence
the preference for the store?

CORRELATION & REGRESSION

Regression

Correlation helps
To check whether two variables are related
If related
Identify the type & degree of relationship



CORRELATION & REGRESSION

Regression

Regression helps
• To identify the exact form of the relationship
• To model output in terms of input or process variables

Examples:
Expected (Yield) = 5 + 3 x Time - 2 x Temperature

CORRELATION & REGRESSION

Simple Linear Regression

Output variable is modeled in terms of only one variable

Regression Model: y = 1 + 3x

x    y
2    7
1    4
5    16
4    13
3    10
6    19



CORRELATION & REGRESSION

Simple Linear Regression

General Form:
Each observation $y = a + bx + \varepsilon$

Choose a & b such that the sum of squared errors
$\sum (y - a - bx)^2$ is minimum over all observed values of x & y

CORRELATION & REGRESSION

Simple Linear Regression: Parameter Estimation


Model: $y = a + bx + \varepsilon$

$\hat{a} = \bar{y} - \hat{b}\,\bar{x}$
$\hat{b} = S_{xy} / S_{xx}$

Test for significance of the relation between x & y (testing whether b = 0 or not)
H0: $b = 0$
H1: $b \neq 0$

Test statistic: $t_0 = (\hat{b} - 0) / se(\hat{b})$

If p value < 0.05, then H0 is rejected & y can be modeled with x
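The two estimation formulas can be tried on the six (x, y) pairs of the earlier simple linear regression slide, where the stated model is y = 1 + 3x. The function name `fit_simple_regression` is illustrative. The significance test is not computed here because this data lies exactly on a line, so the standard error of the slope is zero.

```python
def fit_simple_regression(x, y):
    """Return (a, b) for y = a + b*x using b = Sxy / Sxx and a = mean_y - b * mean_x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


# Data from the slide whose stated model is y = 1 + 3x.
x = [2, 1, 5, 4, 3, 6]
y = [7, 4, 16, 13, 10, 19]

a, b = fit_simple_regression(x, y)
print(f"y = {a:.2f} + {b:.2f} x")
```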



CORRELATION & REGRESSION

Regression: Example

x y
65 69
8 78
89 8
88 21
50 24
73 72

CORRELATION & REGRESSION

Regression Model $y = 76.32 - 0.42x + \varepsilon$

[Figure: scatter plot of y against x]


CORRELATION & REGRESSION

Regression: Issues

For any set of data,
a & b can be calculated
A regression model $y = a + bx + \varepsilon$ can be built

But not all such models are useful

CORRELATION & REGRESSION

Coefficient of Regression: Measure of degree of Relationship

Symbol: $R^2$
$R^2 = SSR / S_{yy} = \hat{b} \cdot S_{xy} / S_{yy}$

$SSR = \sum (y_{predicted} - \bar{y})^2$
$S_{yy} = \sum (y_{actual} - \bar{y})^2$

Range of $R^2$: 0 to 1
If $R^2$ > 0.6, the model is reasonably good



CORRELATION & REGRESSION

Coefficient of Regression: Testing the significance of Regression

Regression ANOVA
Model SS df MS F p value
Regression SSR
Residual Syy – SSR
Total Syy

If p value < 0.05, then the regression model is significant

CORRELATION & REGRESSION

Exercise : Thirteen specimens of 90/10 Cu-Ni alloys, each with a specific iron
content, were tested in a corrosion-wheel set up. The wheel was rotated
in salt sea water at 30 ft/s for 60 days. The corrosion was measured as
weight loss in milligrams per square decimeter per day (MDD). The data obtained
is given in the MDD_Simple_Reg file. Develop a prediction model for MDD
in terms of iron content.



CORRELATION & REGRESSION

Step 1: Scatter Plot

CORRELATION & REGRESSION

Step 2: Correlation Matrix

Attribute FE MDD

FE 1.000 -0.985

MDD -0.985 1.000

Remark:
Correlation between Y & X needs to be between 0.8 and 1.0 or between -0.8 and -1.0



CORRELATION & REGRESSION

Step 3: Regression Output

Statistic   Value    Criteria
R Square    0.9697   > 0.6

Regression ANOVA
Model        SS            df   MS     F       p value
Regression   3293.76669    1    3294   352.3   0.0000
Residual     102.850233    11   9.35
Total        3396.616923   12

Criteria: p value < 0.05
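The R Square, mean squares and F value in this output follow from the two sums of squares and the degrees of freedom. The sketch below recomputes them; the variable names are illustrative, and the p value is left out because it needs the F distribution, which the Python standard library does not provide.

```python
# Sums of squares from the regression ANOVA table of the MDD exercise.
ss_regression = 3293.76669
ss_residual = 102.850233

n = 13   # number of specimens
k = 1    # number of independent variables (FE)

ss_total = ss_regression + ss_residual          # Syy
r_squared = ss_regression / ss_total            # R Square = SSR / Syy

df_regression = k
df_residual = n - k - 1
ms_regression = ss_regression / df_regression   # mean square = SS / df
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual       # F = MS regression / MS residual

print(f"R Square = {r_squared:.4f}")
print(f"MS residual = {ms_residual:.2f}, F = {f_statistic:.1f}")
```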

CORRELATION & REGRESSION

Step 3: Regression Output

Attribute     Coefficient   Std. Error   Std. Coefficients   Tolerance   t Statistic   p value
FE            -24.01989     1.22368      -18.9759            1           -19.629       0.00
(Intercept)   129.7866      1.33321                                       97.3491      0.00

Interpretation
The p value for independent variable is < 0.05

Model: MDD = - 24.020 * FE + 129.787



CORRELATION & REGRESSION

Step 4: Model Diagnostics

SL No. MDD Fit FE


1 127.60 129.55 0.01
2 124.00 118.26 0.48
3 110.80 112.73 0.71
4 103.90 106.97 0.95
5 101.50 101.20 1.19
6 130.10 129.55 0.01
7 122.00 118.26 0.48
8 92.30 95.20 1.44
9 113.10 112.73 0.71
10 83.70 82.71 1.96
11 128.00 129.55 0.01
12 91.40 95.20 1.44
13 86.20 82.71 1.96
CORRELATION & REGRESSION

Step4: Model Diagnostics - Model Performance

[Figure: scatter plot of Actual vs Predicted MDD]


CORRELATION & REGRESSION

Step 5: Residual Analysis

Residual = Actual – Fitted

Standardized Residual = (Residual – Mean of Residuals) / Standard Deviation of Residuals
Observation Actual MDD Predicted MDD Residuals Standard Residuals
1 127.6 129.55 -1.95 -0.66
2 124 118.26 5.74 1.96
3 110.8 112.73 -1.93 -0.66
4 103.9 106.97 -3.07 -1.05
5 101.5 101.20 0.30 0.10
6 130.1 129.55 0.55 0.19
7 122 118.26 3.74 1.28
8 92.3 95.20 -2.90 -0.99
9 113.1 112.73 0.37 0.13
10 83.7 82.71 0.99 0.34
11 128 129.55 -1.55 -0.53
12 91.4 95.20 -3.80 -1.30
13 86.2 82.71 3.49 1.19
Mean 0.00 0.00
SD 2.93 1.00
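The residual columns of this table can be recomputed from the Actual and Predicted MDD columns. The sketch below uses the sample standard deviation (divisor n - 1), which is an assumption that reproduces the tabulated SD of 2.93; the variable names are illustrative.

```python
import statistics

# Actual and predicted MDD for the 13 observations in the table.
actual = [127.6, 124.0, 110.8, 103.9, 101.5, 130.1, 122.0,
          92.3, 113.1, 83.7, 128.0, 91.4, 86.2]
predicted = [129.55, 118.26, 112.73, 106.97, 101.20, 129.55, 118.26,
             95.20, 112.73, 82.71, 129.55, 95.20, 82.71]

# Residual = actual - fitted.
residuals = [a - p for a, p in zip(actual, predicted)]

mean_res = statistics.mean(residuals)
sd_res = statistics.stdev(residuals)   # sample standard deviation (n - 1)

# Standardized residual = (residual - mean) / standard deviation.
standardized = [(res - mean_res) / sd_res for res in residuals]

# Observations whose standardized residual lies beyond +/- 3.
outliers = [i + 1 for i, z in enumerate(standardized) if abs(z) > 3]

print(f"mean = {mean_res:.2f}, SD = {sd_res:.2f}, outliers = {outliers}")
```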
CORRELATION & REGRESSION

Step 5: Residual Analysis

Residuals should be normally distributed or bell shaped



CORRELATION & REGRESSION

Step 5: Residual Analysis: Detection of Outliers

Observation Actual MDD Predicted MDD Residuals Standard Residuals


1 127.6 129.55 -1.95 -0.66
2 124 118.26 5.74 1.96
3 110.8 112.73 -1.93 -0.66
4 103.9 106.97 -3.07 -1.05
5 101.5 101.20 0.30 0.10
6 130.1 129.55 0.55 0.19
7 122 118.26 3.74 1.28
8 92.3 95.20 -2.90 -0.99
9 113.1 112.73 0.37 0.13
10 83.7 82.71 0.99 0.34
11 128 129.55 -1.55 -0.53
12 91.4 95.20 -3.80 -1.30
13 86.2 82.71 3.49 1.19
Mean 0.00 0.00
SD 2.93 1.00

Any observation with a standardized residual beyond ± 3, that is, a residual beyond ± 3
standard deviations, is treated as an outlier. No observation in this table exceeds ± 3.
CORRELATION & REGRESSION

Multiple Regression

To model output variable y in terms of two or more variables.


General Form:
Y = a + b1X1 + b2X2 + - - - + bkXk

Two variable case:


Y = a + b1X1 + b2X2
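For the two variable case, the coefficients can be estimated by least squares through the normal equations. The following is a sketch under stated assumptions: the data are hypothetical, generated exactly from Y = 2 + 3 X1 - X2 so that the fit can be checked, and the helper names `solve_linear_system` and `fit_multiple_regression` are illustrative. Statistical packages use more robust numerical methods.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    beta = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (aug[row][n] - known) / aug[row][row]
    return beta


def fit_multiple_regression(xs, y):
    """Least squares fit of Y = a + b1*X1 + ... + bk*Xk via the normal equations."""
    design = [[1.0] + list(row) for row in xs]   # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from Y = 2 + 3*X1 - 1*X2.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [2 + 3 * x1 - x2 for x1, x2 in xs]

a, b1, b2 = fit_multiple_regression(xs, y)
print(f"Y = {a:.3f} + {b1:.3f} X1 + {b2:.3f} X2")
```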



CORRELATION & REGRESSION

Exercise : The sealer plate temperature and sealer plate clearance in a soap
wrapping machine affect the % of wrapped bars that pass inspection.
The data collected is given in the Mult-Reg_Sealed file. Develop a model
for % Sealed Property in terms of clearance and temperature.

Step 1: Correlation Analysis

Attribute Clearance Temperature % Sealed Property

Clearance 1.00 -0.01 0.90

Temperature -0.01 1.00 -0.05

% Sealed Property 0.90 -0.05 1.00

Correlation between Xs & Y should be high


Correlation between Xs should be low
CORRELATION & REGRESSION

Step 2: Regression Output

Statistic Value Criteria

Adjusted R Square 0.7766 > 0.6

Regression ANOVA
Model SS df MS F p value
Regression 6797.063 2 3398.531 27.07 0.0000
Residual 1632.08138 13 125.5447
Total 8429.14438 15

Criteria: P value < 0.05



CORRELATION & REGRESSION

Step 2: Regression Output – Identify the model

Attribute     Coefficient   Std. Error   Std. Coefficients   Tolerance   t Statistic   p value
Clearance     0.9061        0.1004       0.1296              0.9999      9.0266        0.0000
Temperature   -0.0642       0.0844       -0.0053             0.9999      -0.7604       0.4669
(Intercept)   -67.8844      10.1020                                      -6.7199       0.0000

Interpretation: Only Clearance is related to % Sealed Property, as its p value is < 0.05. Temperature is not significant (p value 0.4669).

CORRELATION & REGRESSION

Step 2: Regression Output – Identify the model

Attribute     Coefficient   Std. Error   Std. Coefficients   Tolerance   t Statistic   p value
Clearance     0.9065        0.05755      0.129664                        15.74989      0.00
(Intercept)   -81.6205      9.24078                                      -8.83264      0.00

Model: % Sealed Property = 0.906 * Clearance - 81.621



CORRELATION & REGRESSION

Step 3: Residual Analysis

SL No. Temperature % Sealed Property Predicted Clearance


1 190 35.0 36.22 130
2 176 81.7 76.10 174
3 205 42.5 39.84 134
4 210 98.3 91.51 191
5 230 52.7 67.94 165
6 192 82.0 94.23 194
7 220 34.5 48.00 143
8 235 95.4 86.98 186
9 240 56.7 44.38 139
10 230 84.4 88.79 188
11 200 94.3 77.01 175
12 218 44.3 59.79 156
13 220 83.3 90.61 190
14 210 91.4 79.73 178
15 208 43.5 38.03 132
16 225 51.7 52.53 148

CORRELATION & REGRESSION

Step 3: Residual Analysis: Outlier detection

SL No. Temperature % Sealed Property Predicted Clearance Residuals Std Residuals


1 190 35 36.22 130 -1.22 -0.116
2 176 81.7 76.1 174 5.60 0.534
3 205 42.5 39.84 134 2.66 0.253
4 210 98.3 91.51 191 6.79 0.647
5 230 52.7 67.94 165 -15.24 -1.453
6 192 82 94.23 194 -12.23 -1.166
7 220 34.5 48 143 -13.50 -1.287
8 235 95.4 86.98 186 8.42 0.802
9 240 56.7 44.38 139 12.32 1.174
10 230 84.4 88.79 188 -4.39 -0.418
11 200 94.3 77.01 175 17.29 1.648
12 218 44.3 59.79 156 -15.49 -1.476
13 220 83.3 90.61 190 -7.31 -0.697
14 210 91.4 79.73 178 11.67 1.112
15 208 43.5 38.03 132 5.47 0.521
16 225 51.7 52.53 148 -0.83 -0.079
Mean 0.000625
SD 10.4918333


LOGISTIC REGRESSION


LOGISTIC REGRESSION

Used to develop models when the output or response variable y is binary and the x’s
are continuous or numeric
The output variable is binary, coded as either success or failure
Models the probability of success p

$p = \dfrac{e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}$

p: probability of success
$x_i$’s: independent variables
$a, b_1, b_2, \ldots$: coefficients to be estimated

If the estimated p > 0.5, the observation is classified as success; otherwise it is classified as failure
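The probability formula and the 0.5 cut-off can be sketched as two small functions. The coefficient and input values used here are hypothetical and only illustrate the calculation; they are not estimates from any data set in this material.

```python
import math


def logistic_probability(a, coefficients, x_values):
    """p = e^z / (1 + e^z) with z = a + b1*x1 + ... + bk*xk."""
    z = a + sum(b * x for b, x in zip(coefficients, x_values))
    # 1 / (1 + e^-z) is algebraically the same as e^z / (1 + e^z).
    return 1 / (1 + math.exp(-z))


def classify(p, cutoff=0.5):
    """Classify as success when the estimated probability exceeds the cut-off."""
    return "success" if p > cutoff else "failure"


# Hypothetical coefficients (a, b1, b2) and one hypothetical observation (x1, x2).
a = -4.0
coefficients = [1.0, 0.5]
p = logistic_probability(a, coefficients, [3.0, 4.0])

print(f"p = {p:.3f} -> {classify(p)}")
```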


LOGISTIC REGRESSION

Example: An apparel brand wants to develop a model for brand loyalty. The data
was collected from 30 customers, 15 of whom are brand loyal (indicated by 1)
and 15 of whom are not (indicated by 0). The company also measured attitude
towards the brand (Brand), attitude towards the product category (Product) and
attitude toward shopping (Shopping), all on a 1 (unfavorable) to 7 (favorable)
scale. The data is given in brand file.


LOGISTIC REGRESSION

Step1 :
Calculate the mean value of each x for the two different values of y. The
means should differ between the two groups for at least some of the x’s

Loyalty Brand Product Shopping

1 5.67 3.8 3.07

0 3.53 4.2 4.00


LOGISTIC REGRESSION

Step2 : Test goodness of the model

Statistic   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
Value       23.471              0.453                  0.604
Criteria                        > 0.6                  > 0.6

Interpretation: $R^2$ values should be high (> 0.6), at least for Nagelkerke $R^2$
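The two R Square values can be recovered from the -2 log likelihood of the fitted model and that of an intercept-only model. The sketch below assumes the standard definitions of the Cox & Snell and Nagelkerke measures and uses the fact, stated in the example, that 15 of the 30 customers are brand loyal; the variable names are illustrative.

```python
import math

n = 30                  # customers in the brand loyalty example
n_loyal = 15            # customers coded 1
neg2ll_model = 23.471   # -2 log likelihood of the fitted model

# Intercept-only model: every customer gets probability n_loyal / n.
p0 = n_loyal / n
neg2ll_null = -2 * (n_loyal * math.log(p0) + (n - n_loyal) * math.log(1 - p0))

# Cox & Snell R Square = 1 - (L_null / L_model)^(2 / n).
cox_snell = 1 - math.exp(-(neg2ll_null - neg2ll_model) / n)

# Nagelkerke rescales Cox & Snell by its maximum attainable value.
max_cox_snell = 1 - math.exp(-neg2ll_null / n)
nagelkerke = cox_snell / max_cox_snell

print(f"Cox & Snell = {cox_snell:.3f}, Nagelkerke = {nagelkerke:.3f}")
```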


LOGISTIC REGRESSION

Step2 : Coefficients of the Model

Variable Coefficient Std Error Wald df p value


Brand 1.274 0.4789 7.0748 1 0.0078
Product 0.186 0.3218 0.3346 1 0.5629
Shopping 0.590 0.4912 1.4424 1 0.2298
Constant -8.642 3.3457 6.6717 1 0.0098

Interpretation: Only the Brand and Constant terms are significant in the model (p value < 0.05)

Wald = $(\mathrm{Coefficient} / \mathrm{Std\ Error})^2$
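The Wald column and its p values can be recomputed from the coefficients and standard errors. The sketch below treats the Wald statistic as chi-square with 1 degree of freedom, whose right-tail probability equals erfc(sqrt(W / 2)); the helper name `wald_test` is illustrative. Small differences from the table come from rounding of the printed coefficients.

```python
import math

# (coefficient, standard error) pairs from the coefficients table.
coefficients = {
    "Brand": (1.274, 0.4789),
    "Product": (0.186, 0.3218),
    "Shopping": (0.590, 0.4912),
    "Constant": (-8.642, 3.3457),
}


def wald_test(coefficient, std_error):
    """Return (Wald statistic, p value) for a single coefficient."""
    wald = (coefficient / std_error) ** 2
    # Right tail of chi-square with 1 df: P(Z^2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2))
    return wald, p_value


results = {name: wald_test(b, se) for name, (b, se) in coefficients.items()}
for name, (wald, p_value) in results.items():
    flag = "significant" if p_value < 0.05 else "not significant"
    print(f"{name}: Wald = {wald:.3f}, p value = {p_value:.4f} ({flag})")
```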


LOGISTIC REGRESSION

Step2 : Model Validation


SL No.   Actual Loyalty   Pred Prob   Pred Group      SL No.   Actual Loyalty   Pred Prob   Pred Group
1        1                0.490       0               16       0                0.054       0
2        1                0.891       1               17       0                0.223       0
3        1                0.613       1               18       0                0.018       0
4        1                0.985       1               19       0                0.613       1
5        1                0.872       1               20       0                0.169       0
6        1                0.245       0               21       0                0.130       0
7        1                0.833       1               22       0                0.245       0
8        1                0.414       0               23       0                0.126       0
9        1                0.973       1               24       0                0.165       0
10       1                0.977       1               25       0                0.957       1
11       1                0.815       1               26       0                0.126       0
12       1                0.769       1               27       0                0.141       0
13       1                0.931       1               28       0                0.062       0
14       1                0.568       1               29       0                0.605       1
15       1                0.985       1               30       0                0.004       0

LOGISTIC REGRESSION

Step2 : Model Validation

                 Predicted Loyalty
Actual Loyalty   0     1     % Correct
0                12    3     80
1                3     12    80
Total %                      80

Interpretation: A correct prediction rate of 80% or above is good
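The classification table can be rebuilt from the predicted probabilities listed in the model validation table, using the 0.5 cut-off. This is a minimal sketch with illustrative variable names.

```python
# Predicted probabilities from the model validation table.
prob_loyal = [0.490, 0.891, 0.613, 0.985, 0.872, 0.245, 0.833, 0.414,
              0.973, 0.977, 0.815, 0.769, 0.931, 0.568, 0.985]      # actual = 1
prob_not_loyal = [0.054, 0.223, 0.018, 0.613, 0.169, 0.130, 0.245, 0.126,
                  0.165, 0.957, 0.126, 0.141, 0.062, 0.605, 0.004]  # actual = 0

cutoff = 0.5

# table[(actual, predicted)] = number of customers in that cell.
table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
for actual, probabilities in ((1, prob_loyal), (0, prob_not_loyal)):
    for p in probabilities:
        predicted = 1 if p > cutoff else 0
        table[(actual, predicted)] += 1

correct = table[(0, 0)] + table[(1, 1)]
accuracy = correct / sum(table.values())

print(table)
print(f"Overall % correct = {100 * accuracy:.0f}")
```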
