SPSS Training: Statistical Data Analysis
Training Program on Statistical Data Analysis using SPSS
Indian Statistical Institute
FUNDAMENTALS OF STATISTICS
DESCRIPTIVE STATISTICS
Statistics
Science of collection, analysis, interpretation and presentation of data
Analytics
Process of extracting meaningful insights by discovering patterns and
relationships in the data
Data set: Productivity (number of tasks completed per hour), 16 hourly values
Hours 1 to 8:  69 70 72 71 70 68 73 69
Hours 9 to 16: 71 68 70 67 69 72 70 71
Data
The values or figures assigned to a metric
Can be numeric or non numeric
Example
Metric: Productivity
Data: 70, 72, 69, etc
Use of Statistics
To know what happened in the past
To know approximately what will happen in the future if things remain more or less the same
Example
Approximately how many tasks will be completed in the next hour?
Will it be the same as the last value, since the last value is the closest to the future?
The difference between the last value and the previous value is only 1 (71 – 70). So will adding 1 to the last value give the next value?
Approximately how many tasks will be completed two hours down the line?
Do the mean, median, etc. give a better estimate of the future value? Why?
Generally, the values in a data set will not all be the same
The values not only change, they change without any particular trend or pattern
Example
If we collect another 16 hours of productivity data, the first value may not be 69, the second value may not be 70, and so on.
Similarly, the difference between the 1st and 2nd values, or the 2nd and 3rd values, etc., may not be the same.
It is possible that none of the values will be repeated in the new data set
How will you make projections about the future?
Explore what remains more or less consistent even when the values are changing.
Use it for future projections, generalizations, etc.
Interpretation: Productivity
• The only thing that remains more or less consistent is the shape
• All projections or generalizations need to be made using the shape
• The shape represents the distribution
• The properties of the shape will also remain more or less constant
• The properties can be used to make projections
Properties of the shape: Center and Variation
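The center and variation of the productivity data can be computed directly. The sketch below assumes the 16 values are read column by column from the data set slide (so the series starts 69, 70 and ends 70, 71), and uses the mean and median for the center and the sample standard deviation for the variation.

```python
from statistics import mean, median, stdev

# Productivity data (tasks completed per hour).
# Assumption: values are read column by column from the data set slide.
productivity = [69, 70, 72, 71, 70, 68, 73, 69,
                71, 68, 70, 67, 69, 72, 70, 71]

center_mean = mean(productivity)      # arithmetic mean
center_median = median(productivity)  # middle value of the sorted data
variation_sd = stdev(productivity)    # sample standard deviation (n - 1 divisor)

print("Mean:", center_mean)
print("Median:", center_median)
print("Standard deviation:", round(variation_sd, 3))
```

Both measures of center come out at 70 tasks per hour, which is a more stable basis for a projection than the last observed value alone.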
FUNDAMENTALS OF STATISTICS
Summarization of sample data
The monthly credit card expenses of an individual, in thousands of rupees, are given below. Summarize the data.
55 65 59 59 57 61 53 63 59 57 63 55 61 61 57 59 61 57 59 63
[Figure: plot of the monthly credit card expenses, horizontal axis from 51 to 67]
Sample Mean:
Summarization of sample data: Credit Card Expenses
Sample Mean = (55 + 65 + 59 + 59 + 57 + 61 + 53 + 63 + 59 + 57 + 63 + 55 + 61 + 61 + 57
+ 59 + 61 + 57 + 59 + 63) / 20
= 1184 / 20 = 59.2
Interpretation
On average, the individual spends Rs. 59,200 per month through credit card
Sample Median:
• Middle Value
• Value which divides data arranged in ascending or descending order
into two equal halves
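As a check on the measures of center, here is a minimal sketch using the 20 credit card values from the sample mean calculation. The mode is taken in its usual sense of the most frequent value; that definition is an assumption here, since the Sample Mode slide content is not reproduced in the text.

```python
from statistics import mean, median, mode

# Monthly credit card expenses in thousands of rupees (from the sample mean slide)
expenses = [55, 65, 59, 59, 57, 61, 53, 63, 59, 57,
            63, 55, 61, 61, 57, 59, 61, 57, 59, 63]

sample_mean = mean(expenses)      # sum of values / number of values
sample_median = median(expenses)  # middle value of the sorted data
sample_mode = mode(expenses)      # most frequently occurring value

print("Mean:", sample_mean)
print("Median:", sample_median)
print("Mode:", sample_mode)
```

For this data the three measures are close to each other (59.2, 59 and 59).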
Sample Mode:
Example:
5 4 7 3 2
15 9 8 5 2
Maximum Value = 15
Minimum Value = 2
Range = 15 – 2 = 13
[Figure: plot of the 10 example values with the range marked]
5 4 7 3 2
15 9 8 5 2
Step 1:
Calculate Mean
Mean = 6
5 4 7 3 2
15 9 8 5 2
Step 2:
Take deviations from Mean
-1 -2 1 -3 -4
9 3 2 -1 -4
[Figure: plot of the 10 example values showing their deviations from the mean]
Step 3:
Since some deviations are positive and the rest are negative, they cancel out when summed.
So square the deviations and then sum them
1 4 1 9 16
81 9 4 1 16
Sum = 142
Step 4:
Standard Deviation $= \sqrt{\text{Sum of Squares} / (n - 1)}$
$= \sqrt{142 / (10 - 1)}$
$= \sqrt{15.78} = 3.972$
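Steps 1 to 4 can be reproduced in a few lines. This is a minimal sketch on the ten example values, with the range included for comparison; the variable names are illustrative only.

```python
import math

# The ten example values from the slides
values = [5, 4, 7, 3, 2, 15, 9, 8, 5, 2]
n = len(values)

# Range: maximum value minus minimum value
value_range = max(values) - min(values)

# Step 1: calculate the mean
mean_value = sum(values) / n

# Step 2: take deviations from the mean
deviations = [v - mean_value for v in values]

# Step 3: square the deviations and sum them
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by (n - 1) and take the square root
std_dev = math.sqrt(sum_of_squares / (n - 1))

print("Range:", value_range)
print("Mean:", mean_value)
print("Sum of squares:", sum_of_squares)
print("Standard deviation:", round(std_dev, 3))
```

The range uses only the two extreme values, while the standard deviation uses every observation.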
Sample Standard Deviation: Credit Card usage data
Frequency Table
Frequency Table: Credit Card usage data
Exercise: The data of 30 customers on credit card usage (in INR 1000), sex (1: male, 2: female) and whether they have done shopping or banking with a credit card (1: yes, 2: no) are given in the table below.
1. Summarize and interpret the credit card usage.
2. How does credit card usage differ by sex?
3. How does the credit card usage pattern differ between those who shop with a credit card and those who don't?
4. How does the credit card usage pattern differ between those who bank with a credit card and those who don't?
TEST OF HYPOTHESIS
Introduction:
In many problems, it is required to accept or reject a statement about some
parameter
Example:
1. The average cycle time is less than 24 hours
2. The % rejection is only 1%
Null Hypothesis:
A statement about the status quo
One of no difference or no effect
Denoted by H0
Alternative Hypothesis:
One in which some difference or effect is expected
Opposite of null hypothesis
Denoted by H1
Type I Error
Rejecting the null hypothesis H0 when it is true
Type II Error
Failing to reject the null hypothesis H0 when it is false
4 4 5 5 6
5 4.5 6.5 6 5.5
Conclusion:
It is difficult to say whether mean = specified value by looking at the difference (x̄ - specified value) alone
To check whether test statistic is close to 0, find out p value from the sampling
distribution of test statistic
[Figure: sampling distribution of the test statistic; p ≥ 0.05 when the statistic lies between $-t_{\alpha/2}$ and $t_{\alpha/2}$]
H0: Mean = 5
H1: Mean ≠ 5
t0 = 0.5571
[Figure: t0 lies between $-t_{\alpha/2}$ and $t_{\alpha/2}$, so p ≥ 0.05]
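The value t0 = 0.5571 can be reproduced with a one-sample t statistic. The sketch below assumes the ten values listed earlier in this section (4, 4, 5, 5, 6, 5, 4.5, 6.5, 6, 5.5) are the sample being tested against the specified mean of 5. The critical value 2.262 is the two-sided 5% point of the t distribution with 9 degrees of freedom from standard tables, not a value from the slides.

```python
import math
from statistics import mean, stdev

# Assumption: this is the sample behind the slide's t0 value
sample = [4, 4, 5, 5, 6, 5, 4.5, 6.5, 6, 5.5]
specified_mean = 5  # H0: Mean = 5

n = len(sample)
x_bar = mean(sample)  # sample mean
s = stdev(sample)     # sample standard deviation (n - 1 divisor)

# Test statistic: how many standard errors x_bar is away from the specified value
t0 = (x_bar - specified_mean) / (s / math.sqrt(n))

# Two-sided 5% critical value for n - 1 = 9 degrees of freedom (standard t table)
T_CRITICAL = 2.262
reject_h0 = abs(t0) > T_CRITICAL

print("x_bar:", x_bar)
print("t0:", round(t0, 4))
print("Reject H0:", reject_h0)
```

Since |t0| is well inside the critical value, H0 is not rejected, which agrees with p ≥ 0.05.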
73 19 16 64 28 28 31 90 60 56 31 56 22
18 45 48 17 17 17 91 92 63 50 51 69 16 17
Exercise 3: The data of 30 customers on credit card usage (in INR 1000), sex (1: male, 2: female) and whether they have done shopping or banking with a credit card (1: yes, 2: no) are given in the table below.
1. Check whether the average credit card usage is the same for both sexes.
2. Check whether the average credit card usage is the same for those who shop with a credit card and those who don't.
3. Check whether the average credit card usage is the same for those who bank with a credit card and those who don't.
Null Hypothesis
H0: $\sigma_1^2 = \sigma_2^2$
Alternative Hypothesis
H1: $\sigma_1^2 \neq \sigma_2^2$
ANALYSIS OF VARIANCE
ANOVA
Example:
To study the effect of shelf location on sales revenue
A supermarket chain suspects that the location of the shelf where toys are kept influences sales volume. The sales revenue (in lakhs) from the toys when they are kept at different locations inside the store is given in the Sales Revenue data. The location is coded as 1: front, 2: middle and 3: rear. Verify the suspicion.
Factor: Location(A)
Levels : front, middle, rear
Response: Sales revenue
Step 1: Calculate the Sum, Average and Number of Responses for each level of the factor (location).
Level 1 Sum (A1):
Sum of all responses when location is at level 1 (front)
= 1.34 + 1.89 + 1.35 + 2.07 + 2.41 + 3.06
= 12.12
nA1: Number of responses when location is at level 1 (front)
= 6
Level 1 Average:
Sum of all responses when location is at level 1 / Number of responses when location is at level 1
= A1 / nA1 = 12.12 / 6 = 2.02
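Step 1 for the front location can be sketched as follows. Only the six level 1 responses appear in the text, so the sketch covers that level alone; the helper name `level_summary` is illustrative and not part of the course material.

```python
def level_summary(responses):
    """Return (sum, number of responses, average) for one level of a factor."""
    total = sum(responses)
    count = len(responses)
    return total, count, total / count

# Sales revenue (in lakhs) when location is at level 1 (front)
front_sales = [1.34, 1.89, 1.35, 2.07, 2.41, 3.06]

a1_sum, a1_n, a1_average = level_summary(front_sales)

print("Sum (A1):", round(a1_sum, 2))
print("nA1:", a1_n)
print("Average:", round(a1_average, 2))
```

The same helper would be applied to the middle and rear responses to obtain their level averages.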
Anova Table:
MS = SS / Df
F = MS Between/ MS Within
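The two formulas above can be put together into a one-way ANOVA table. The sketch below uses the standard between and within sums of squares, with df Between = number of levels - 1 and df Within = total responses - number of levels. The three groups are toy numbers chosen for easy arithmetic; they are not the Sales Revenue data, which is not reproduced in the text.

```python
def one_way_anova(groups):
    """Build the one-way ANOVA quantities: SS, df, MS for each source and F."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    level_means = [sum(group) / len(group) for group in groups]

    # Between: variation of the level averages around the grand average
    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, level_means))
    # Within: variation of the responses around their own level average
    ss_within = sum((value - m) ** 2
                    for group, m in zip(groups, level_means)
                    for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    ms_between = ss_between / df_between  # MS = SS / df
    ms_within = ss_within / df_within
    return {
        "SS": (ss_between, ss_within),
        "df": (df_between, df_within),
        "MS": (ms_between, ms_within),
        "F": ms_between / ms_within,      # F = MS Between / MS Within
    }

# Toy data for illustration only (hypothetical, not from the course files)
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
table = one_way_anova(toy_groups)
print(table)
```

A large F means the variation between level averages is large compared with the variation within levels.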
Meaning:
When the factor is changed from one level to another, there is a significant change in the response.
The sales revenue is not the same for the different locations (front, middle and rear)
The expected sales revenue for each location under study is equal to its level average.
Front   2.0200
Middle  4.7613
Rear    1.7725
[Figure: average Sales Revenue by Location (Front, Middle, Rear)]
Anova logic:
Variation between the levels of a factor: the effect of the factor (here, Location).
Exercise 1: A bank wants to check whether the waiting time of customers at its single window operations is the same across 4 cities. The data is given in bank_waiting_time.
CROSS TABULATION
Cross tabulation of Sex by Usage (counts)
Sex      Light  Medium  Heavy  Total
Female   14     5       5      24
Male     5      5       11     21
Total    19     10      16     45
Exercise 1: An ITeS company has collected the following information from its customers through a survey. The data has been collected on a 5 point scale (1: Very dissatisfied to 5: Very satisfied). The survey questions are given below and the data is given in the Csat_data file. Check whether questions 1 to 9 are related to overall satisfaction.
1. Team’s ability to meet service level agreements
2. Team’s ability to deliver seamlessly in the event of changes (volume fluctuations, resource movement etc)
3. Team’s operational performance
4. Team's application of process knowledge
5. Team's communication with you
6. Team’s effectiveness in handling escalations
7. Team’s flexibility and responsiveness to special service requests
8. Team’s contribution to your business requirements
9. Effectiveness of the reviews around operations delivery
10. Overall satisfaction with team’s service
Objective:
To test whether two variables are related or not
To check whether one metric depends on another metric
Usage:
When both the variables (X & Y) are categorical (grouped)
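A minimal sketch of the chi-square test of independence, applied to the Sex by Usage counts from the cross tabulation above (Female: 14, 5, 5; Male: 5, 5, 11). Expected counts use the usual row total x column total / grand total rule. The critical value 5.991 is the 5% point of the chi-square distribution with 2 degrees of freedom from standard tables, not a value from the slides.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows: Female, Male. Columns: Light, Medium, Heavy usage.
sex_by_usage = [[14, 5, 5],
                [5, 5, 11]]

chi2, df = chi_square_statistic(sex_by_usage)

CHI2_CRITICAL_DF2 = 5.991  # 5% critical value for 2 degrees of freedom (standard table)
related = chi2 > CHI2_CRITICAL_DF2

print("Chi-square:", round(chi2, 3), "df:", df)
print("Sex and usage related at the 5% level:", related)
```

Under these assumptions the statistic exceeds the critical value, so sex and usage would be judged related at the 5% level.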
Exercise:
Note: The table gives the count of CSat scores of 1, 2, etc. for each group of agents
• The chi-square test only shows whether two variables are independent or not
• The degree of association will not be known
• Phi & V vary from 0 to 1; the higher the value, the stronger the relation
Cramer V:
Rows – 1 = 2
Columns - 1 = 4
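Cramer's V rescales the chi-square statistic by the sample size and the smaller of (rows - 1) and (columns - 1). The sketch below uses the standard formula V = sqrt(chi-square / (n x min(rows - 1, columns - 1))). It is illustrated on the Sex by Usage counts from the cross tabulation section, because the CSat table with rows - 1 = 2 and columns - 1 = 4 is not reproduced in the text.

```python
import math

def cramers_v(table):
    """Cramer's V for a table of counts (standard formula)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Chi-square statistic from observed and expected counts
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    smaller_dimension = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * smaller_dimension))

# Rows: Female, Male. Columns: Light, Medium, Heavy usage.
sex_by_usage = [[14, 5, 5],
                [5, 5, 11]]

v = cramers_v(sex_by_usage)
print("Cramer's V:", round(v, 3))
```

For the CSat table the divisor would use min(2, 4) = 2 in place of 1.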
Tau b
• Varies from -1 to +1
• Close to -1 indicates a strong negative correlation, close to +1 indicates a strong positive correlation and close to 0 indicates no relation
• Used for square (n x n) tables
Tau c
• Varies from -1 to +1
• Close to -1 indicates a strong negative correlation, close to +1 indicates a strong positive correlation and close to 0 indicates no relation
• Used for rectangular (r x c) tables (r not equal to c)
Example: A branded shoe manufacturing company has collected data from 45 customers on usage, sex, awareness of the brand and preference for the brand. Usage has been coded as 1, 2 and 3, representing light, medium and heavy usage. Sex has been coded as 1 for female and 2 for male users. The attitude and preference are measured on a 7 point scale (1: unfavorable to 7: very favorable). The data is given in the shoe_data file.
1. Estimate the relation between sex and usage.
2. Estimate the relation between sex and awareness of the brand.
3. Estimate the relation between sex and preference.
4. Does higher awareness mean higher preference?
CORRELATION & REGRESSION
Correlation: Usage
Explore the relationship between the output characteristic and an input or process variable.
[Figures: four example scatter plots of Y against X]
Symbol: r
Range: -1 to 1
Sign: type of correlation (positive or negative)
Value: degree of correlation
Examples:
r = 0.6: positive correlation of degree 0.6
r = -0.82: negative correlation of degree 0.82
r = 0: no correlation
Collect data on x and y: When x is low, y is also low & vice versa
x y
2 5
3 7
1 3
5 11
6 12
7 15
SL No. x y
1 2 5
2 3 7
3 1 3
4 5 11
5 6 12
6 7 15
Mean 4 8.83
Generally, when the deviation of x from its mean is negative, the deviation of y from its mean is also negative, and vice versa
SL No.  x - x̄  y - ȳ
1       -2      -3.83
2       -1      -1.83
3       -3      -5.83
4       1       2.17
5       2       3.17
6       3       6.17
Then
The product of the x and y deviations will be positive
SL No.  x - x̄  y - ȳ   (x - x̄)(y - ȳ)
1       -2      -3.83   7.66
2       -1      -1.83   1.83
3       -3      -5.83   17.49
4       1       2.17    2.17
5       2       3.17    6.34
6       3       6.17    18.51
Sum = Sxy = 54
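The sum of products Sxy can be computed directly from the six (x, y) pairs. A minimal sketch, with an illustrative helper name:

```python
def sum_of_products(x, y):
    """Sxy: sum of (x - mean of x) * (y - mean of y) over all observations."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Data where y is low when x is low
x = [2, 3, 1, 5, 6, 7]
y = [5, 7, 3, 11, 12, 15]

s_xy = sum_of_products(x, y)
print("Sxy:", round(s_xy, 2))
```

Deviations of the same sign multiply to positive products, so Sxy is positive when x and y move together.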
Collect data on x and y: When x is low then y will be high & vice versa
x y
2 12
3 11
1 15
5 7
6 5
7 3
SL No. x y
1 2 12
2 3 11
3 1 15
4 5 7
5 6 5
6 7 3
Mean 4 8.83
Generally, when the deviation of x from its mean is negative, the deviation of y from its mean is positive, and vice versa
SL No.  x - x̄  y - ȳ
1       -2      3.17
2       -1      2.17
3       -3      6.17
4       1       -1.83
5       2       -3.83
6       3       -5.83
Then
The product of the x and y deviations will be negative
SL No.  x - x̄  y - ȳ   (x - x̄)(y - ȳ)
1       -2      3.17    -6.34
2       -1      2.17    -2.17
3       -3      6.17    -18.51
4       1       -1.83   -1.83
5       2       -3.83   -7.66
6       3       -5.83   -17.49
Sum = Sxy = -54
Coefficient of Correlation:
In Short
If correlation is positive
Sxy will be positive
If correlation is negative
Sxy will be negative
Coefficient of Correlation: $r = S_{xy} / \sqrt{S_{xx} \, S_{yy}}$
SL No.  x - x̄  y - ȳ   (x - x̄)(y - ȳ)  (x - x̄)²  (y - ȳ)²
1       -2      3.17    -6.34            4          10.0489
2       -1      2.17    -2.17            1          4.7089
3       -3      6.17    -18.51           9          38.0689
4       1       -1.83   -1.83            1          3.3489
5       2       -3.83   -7.66            4          14.6689
6       3       -5.83   -17.49           9          33.9889
Sum             Sxy: -54         Sxx: 28    Syy: 104.83
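From the three sums the coefficient of correlation follows in one step. The sketch below assumes the standard Pearson formula r = Sxy / sqrt(Sxx x Syy), since the formula slide itself is not reproduced in the text, and uses the data set where y is high when x is low.

```python
import math

def correlation_sums(x, y):
    """Return (Sxy, Sxx, Syy) computed from deviations about the means."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy, s_xx, s_yy

# Data where y is high when x is low
x = [2, 3, 1, 5, 6, 7]
y = [12, 11, 15, 7, 5, 3]

s_xy, s_xx, s_yy = correlation_sums(x, y)

# Assumed standard formula for the coefficient of correlation
r = s_xy / math.sqrt(s_xx * s_yy)

print("Sxy:", round(s_xy, 2), "Sxx:", round(s_xx, 2), "Syy:", round(s_yy, 2))
print("r:", round(r, 4))
```

The sign of r comes from Sxy alone, because Sxx and Syy are sums of squares and cannot be negative.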
Correlation Coefficients:
Varies from -1 to +1
Close to -1 indicate negative correlation
Close to +1 indicate positive correlation
Close to 0 means no correlation
Not for continuous variables
Example: To study the effect of quality and price on supermarket store preference, 14 major supermarkets in a city are rated in terms of preference to shop, quality of merchandise and pricing. All the ratings were obtained on an 11 point scale, with a higher number indicating a more positive rating. The data is given in the Supermarket_Correlation file. Do quality and price influence the preference for the store?
Regression
Correlation helps
To check whether two variables are related
If related
Identify the type & degree of relationship
Regression helps
• To identify the exact form of the relationship
• To model output in terms of input or process variables
Examples:
Expected (Yield) = 5 + 3 x Time - 2 x Temperature
Regression Model: y = 1 + 3x
x  y
2  7
1  4
5  16
4  13
3  10
6  19
General Form:
Each observation $y = a + bx + \varepsilon$, where $\varepsilon$ is the random error
$\hat{a} = \bar{y} - \hat{b}\,\bar{x}$
$\hat{b} = S_{xy} / S_{xx}$
Test for significance of the relation between x & y (testing whether b = 0 or not)
H0: b = 0
H1: b ≠ 0
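The estimates of a and b can be checked on the six (x, y) pairs shown with the regression model y = 1 + 3x. A minimal sketch, assuming the slope is Sxy / Sxx and the intercept is the mean of y minus the slope times the mean of x:

```python
def fit_simple_regression(x, y):
    """Least squares estimates (a_hat, b_hat) for the model y = a + b x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = s_xy / s_xx            # slope
    a_hat = y_bar - b_hat * x_bar  # intercept
    return a_hat, b_hat

# The six observations listed with the regression model slide
x = [2, 1, 5, 4, 3, 6]
y = [7, 4, 16, 13, 10, 19]

a_hat, b_hat = fit_simple_regression(x, y)
print("a_hat:", round(a_hat, 4), "b_hat:", round(b_hat, 4))
```

Because these points lie exactly on a line, the estimates recover a = 1 and b = 3 with no residual error.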
Regression: Example
x y
65 69
8 78
89 8
88 21
50 24
73 72
[Figure: scatter plot of y against x for the example data]
Regression: Issues
Symbol: $R^2$
$R^2 = SSR / S_{yy} = \hat{b}\,S_{xy} / S_{yy}$
Range of $R^2$: 0 to 1
If $R^2 > 0.6$, the model is reasonably good
Regression ANOVA
Model SS df MS F p value
Regression SSR
Residual Syy – SSR
Total Syy
Exercise: Thirteen specimens of 90/10 Cu-Ni alloys, each with a specific iron content, were tested in a corrosion-wheel setup. The wheel was rotated in salt sea water at 30 ft/s for 60 days. The corrosion was measured as weight loss in milligrams/square decimeter/day (MDD). The data obtained is given in the MDD_Simple_Reg file. Develop a prediction model for MDD in terms of iron content.
Correlation
Attribute  FE     MDD
FE         1.000  -0.985
Remark:
The correlation between Y & X should be between 0.8 and 1.0 or between -0.8 and -1.0
Regression ANOVA
Model SS df MS F p value
Regression 3293.76669 1 3294 352.3 0.0000
Residual 102.850233 11 9.35
Total 3396.616923 12
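The entries of the table above can be recomputed from the sums of squares and degrees of freedom alone. A minimal sketch, using MS = SS / df, F = MS Regression / MS Residual and R2 = SSR / Syy:

```python
import math

# Sums of squares and degrees of freedom from the Regression ANOVA table (MDD vs FE)
ss_regression = 3293.76669
ss_residual = 102.850233
df_regression = 1
df_residual = 11

ss_total = ss_regression + ss_residual      # Syy
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

r_squared = ss_regression / ss_total        # R2 = SSR / Syy
abs_r = math.sqrt(r_squared)                # size of the correlation, sign not included

print("MS Regression:", round(ms_regression, 2))
print("MS Residual:", round(ms_residual, 2))
print("F:", round(f_ratio, 1))
print("R squared:", round(r_squared, 4))
print("|r|:", round(abs_r, 3))
```

The square root of R2 matches the size of the correlation between FE and MDD (0.985), and R2 is well above the 0.6 guideline.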
Interpretation
The p value for independent variable is < 0.05
Scatter Plot: Actual Vs Predicted
Multiple Regression
Exercise: The sealer plate temperature and sealer plate clearance in a soap wrapping machine affect the % of wrapped bars that pass inspection. The data collected is given in the Mult-Reg_Sealed file. Develop a model for % Sealed Property in terms of clearance and temperature.
Regression ANOVA
Model SS df MS F p value
Regression 6797.063 2 3398.531 27.07 0.0000
Residual 1632.08138 13 125.5447
Total 8429.14438 15
LOGISTIC REGRESSION
Used to develop models when the output or response variable y is binary and the x’s are continuous or numeric
The output variable will be binary, coded as either success or failure
Models the probability of success p
$$p = \frac{e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}$$
p: probability of success
xi’s: independent variables
a, b1, b2, ...: coefficients to be estimated
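The model equation can be evaluated directly once coefficients are available. In the sketch below the function name and the coefficient values are hypothetical, chosen only to show how p responds to x; they are not estimates from any course data set.

```python
import math

def probability_of_success(a, b, x):
    """p = e^(a + b1 x1 + ... + bk xk) / (1 + e^(a + b1 x1 + ... + bk xk))."""
    linear_part = a + sum(bi * xi for bi, xi in zip(b, x))
    return math.exp(linear_part) / (1 + math.exp(linear_part))

# Hypothetical coefficients for a single predictor (illustration only)
a = -1.0
b = [0.5]

p_low = probability_of_success(a, b, [0.0])   # linear part = -1.0
p_mid = probability_of_success(a, b, [2.0])   # linear part = 0, so p = 0.5
p_high = probability_of_success(a, b, [6.0])  # linear part = 2.0

print(round(p_low, 3), round(p_mid, 3), round(p_high, 3))
```

Whatever the values of the x's, p stays between 0 and 1, and with a positive coefficient it increases as x increases.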
Example: An apparel brand wants to develop a model for brand loyalty. The data
was collected from 30 customers, 15 of whom are brand loyal (indicated by 1)
and 15 of whom are not (indicated by 0). The company also measured attitude
towards the brand (Brand), attitude towards the product category (Product) and
attitude toward shopping (Shopping), all on a 1 (unfavorable) to 7 (favorable)
scale. The data is given in brand file.
Step1 :
Calculate the mean value of x’s for the two different values of y. At least the
means should be different for some of the x’s
Interpretation: Only Brand and Constant terms are significant in the model
Classification table: Loyalty
            Predicted 0  Predicted 1  % Correct
Actual 0    12           3            80
Actual 1    3            12           80
Total %                               80
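The percentages in the classification table follow from the four counts. A minimal sketch, with the counts keyed as (actual, predicted):

```python
# Counts from the classification table, keyed as (actual loyalty, predicted loyalty)
counts = {(0, 0): 12, (0, 1): 3,
          (1, 0): 3, (1, 1): 12}

def percent_correct_for(actual):
    """Percentage of customers with this actual loyalty value that were predicted correctly."""
    row_total = counts[(actual, 0)] + counts[(actual, 1)]
    return 100 * counts[(actual, actual)] / row_total

pct_correct_not_loyal = percent_correct_for(0)
pct_correct_loyal = percent_correct_for(1)

# Overall: correctly classified customers out of all 30
total = sum(counts.values())
pct_correct_overall = 100 * (counts[(0, 0)] + counts[(1, 1)]) / total

print(pct_correct_not_loyal, pct_correct_loyal, pct_correct_overall)
```

Here the model classifies 24 of the 30 customers correctly, with the same hit rate for loyal and non-loyal customers.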