Medical Statistics
SPSS and building models
Hans Burgerhof
Epidemiology
[email protected]
Programme
Some information on working with SPSS
- creating your own dataset
- importing data (from Excel)
- working with SPSS syntax files
Building a regression model
- prediction models
- estimating a specific relationship: correct for other variables or not?
SPSS tutorials on the Internet
SPSS, the empty data matrix
The variable view (empty)
Variable view
Typing in data (Data view)
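Instead of typing values into the Data view, a small dataset can also be created directly in syntax. A minimal sketch; the five age values are made up, chosen so that they reproduce the statistics on the missing-data slide below (one entry mistakenly coded 999):

  * Hypothetical mini-dataset entered via syntax (values are illustrative).
  DATA LIST LIST / id age.
  BEGIN DATA
  1 19
  2 23
  3 21
  4 19
  5 999
  END DATA.
  FREQUENCIES VARIABLES=age
    /STATISTICS=MEAN.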
Missing data
Statistics for age, with the code 999 treated as a valid value:
N Valid = 5, Missing = 0, Mean = 216.200

Statistics for age, with 999 declared as user-missing:
N Valid = 4, Missing = 1, Mean = 20.500
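The fix is to declare the offending code as user-missing. That the code was 999 follows from the two means: 5 × 216.2 − 4 × 20.5 = 1081 − 82 = 999. A sketch:

  * Declare 999 as a user-missing value for age.
  MISSING VALUES age (999).
  * Rerunning the same command now gives N Valid = 4, Missing = 1, Mean = 20.5.
  FREQUENCIES VARIABLES=age
    /STATISTICS=MEAN.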
Some parts of the menu (1)
Means: you need extra software to use this option
Some parts of the menu (2)
Some parts of the menu (3)
Why work with syntax files?
1) To keep track of all commands you gave, so that after three months you will still know exactly what you did.
2) In case there was an error in the dataset and you have to redo all analyses: simply rerun the syntax file! It will take you less than a minute!
3) Reproducibility: other researchers can check your analyses.
SPSS syntax
https://www.spss-tutorials.com/spss-output/
Creating a syntax file
The syntax file
The command has not been executed yet!
Running (part of) the syntax file
Using copy-paste
Save the syntax file – give it
a relevant name – and you
can open it another day to
check what you did and/or
to rerun your analyses
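For example, a short syntax file might look like this; the file path, sheet name and variable names are assumptions, not taken from the slides:

  * Import the data from Excel.
  GET DATA
    /TYPE=XLSX
    /FILE='C:\data\fev.xlsx'
    /SHEET=name 'Sheet1'
    /CELLRANGE=FULL
    /READNAMES=ON.
  * A first descriptive look at the data.
  DESCRIPTIVES VARIABLES=age height fev
    /STATISTICS=MEAN STDDEV MIN MAX.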
Building a regression model – prediction
models
If we have a continuous outcome variable Y and a set of p explanatory variables X1, X2, …, Xp, and we would like to predict (or explain) Y, we can fit a linear regression model like

Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε
Do we need all available explanatory variables to
predict Y?
Occam’s razor
William of Occam (or: Ockham) was a medieval philosopher known for the "principle of parsimony".
We will only use (statistically) significant variables and theoretically justifiable variables in the final model.
Steps to build the model
1. Perform simple regression analyses for all explanatory variables Xi, i = 1 … p. (In linear regression: check the linearity assumption for continuous explanatory variables.) Do not forget to use dummy variables in the case of categorical explanatory variables with more than two categories.
2. Select possibly significant explanatory variables for the multiple model by selecting on a large alpha (α = 0.15, 0.2 or 0.25, depending on the number of candidate explanatory variables), using the P-values from step 1, and on theory / literature.
3. Fit a multiple regression model with all explanatory variables selected in step 2.
4. Check the P-values of the regression coefficients in the multiple regression model.
If all P-values are smaller than 0.05, continue with step 6.
5. If not all P-values are smaller than 0.05: remove the non-significant explanatory variables one by one. Start with the explanatory variable with the highest P-value and rerun the analysis with the other explanatory variables. Continue this process until all remaining variables have P-values smaller than (or equal to) 0.05 (a syntax sketch follows after this list).
6. Optional: add, based on theory or clear patterns in your data, interaction terms to
the model and test if this will improve the model.
7. Check assumptions of the final model.
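A sketch of steps 3 to 5 in SPSS syntax, using a hypothetical outcome y and the variables x1, x2, x4, x5 that survived the selection of steps 1 and 2:

  * Step 3: fit the multiple model with all selected variables.
  REGRESSION
    /STATISTICS COEFF CI(95) R ANOVA
    /DEPENDENT y
    /METHOD=ENTER x1 x2 x4 x5.
  * Steps 4-5: suppose x4 has the largest non-significant P-value;
  * remove it and rerun, repeating until all P-values are <= 0.05.
  REGRESSION
    /STATISTICS COEFF CI(95) R ANOVA
    /DEPENDENT y
    /METHOD=ENTER x1 x2 x5.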
Building a multiple regression model
Outcome variable Y, (possible) explanatory variables X1, X2, X3, X4, X5, X6.

Steps 1 & 2: build simple (univariate) regression models; remove variables using α = 0.25.
  Example: X1, X2, X3, X4, X5, X6 → X1, X2, X4, X5
Step 3: build one multiple regression model using all remaining explanatory variables.
Step 4: are any of the P-values for the regression coefficients non-significant (using α = 0.05)?
  Yes → Step 5: remove the explanatory variable with the largest non-significant P-value, then repeat steps 3 & 4.
  No → Step 6 (optional): investigate the addition of interaction terms.
Final model…
Step 7: check the model assumptions.
Example (FEV data)
N = 624 children (Boston)
Ages between 3 and 15 years
Sex: 0 = girl, 1 = boy
Smoke: 0 = no, 1 = yes
Height in cm
What is the best model to predict FEV?
(part of the) Syntax file
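The screenshot itself is not reproduced here; a sketch of what the univariate part of such a syntax file could contain (the variable names fev, age, height, sex, smoke are assumed from the slides):

  * Scatterplots for the continuous predictors.
  GRAPH /SCATTERPLOT(BIVAR)=age WITH fev.
  GRAPH /SCATTERPLOT(BIVAR)=height WITH fev.
  * Simple (univariate) linear regressions, one per candidate predictor.
  REGRESSION /STATISTICS COEFF CI(95) R ANOVA /DEPENDENT fev /METHOD=ENTER age.
  REGRESSION /STATISTICS COEFF CI(95) R ANOVA /DEPENDENT fev /METHOD=ENTER height.
  REGRESSION /STATISTICS COEFF CI(95) R ANOVA /DEPENDENT fev /METHOD=ENTER sex.
  REGRESSION /STATISTICS COEFF CI(95) R ANOVA /DEPENDENT fev /METHOD=ENTER smoke.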
Graphical impressions: continuous predictors
Graphical impressions: categorical predictors
Results of univariate analyses (1)
Variable   R²      Coefficient   95% CI          P-value
Age        0.567   0.242         0.092 ; 0.423   < 0.0005
Height     0.748   0.050         0.048 ; 0.052   < 0.0005
Sex        0.033   0.302         0.174 ; 0.429   < 0.0005
Smoke      0.050   0.669         0.440 ; 0.898   < 0.0005

All four explanatory variables are significantly related to FEV (in simple linear regression models). (2)
The higher the R², the better the model.
A coefficient depends on the unit of the variable. Interpretation?
Results of multiple linear regression with all four explanatory variables (3)
Smoking no longer significant (P > 0.05). (4)
Do we have an explanation for that?
Multiple linear regression without Smoke (5)
The absolute values of the standardized coefficients can be used to compare the relative importance of the variables
FEV = -4.417 + 0.055·Age + 0.041·Height + 0.136·Sex
girl = 0
boy = 1
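A worked example with the equation above (the child's values are made up): for a hypothetical 10-year-old boy of 140 cm,
FEV = -4.417 + 0.055·10 + 0.041·140 + 0.136·1 = -4.417 + 0.550 + 5.740 + 0.136 ≈ 2.01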
Checking the assumptions concerning the
residuals (7)
Lines by subgroups (6)
Is the height effect on FEV
equal for boys and girls?
Interaction between height and sex
Significant interaction?
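To test the interaction in SPSS, the product term must first be computed and added to the model; a sketch (the name Intheightsex is taken from the fitted equation below):

  * Create the height-by-sex product term.
  COMPUTE Intheightsex = height * sex.
  EXECUTE.
  * Refit the model including the interaction term.
  REGRESSION
    /STATISTICS COEFF CI(95) R ANOVA
    /DEPENDENT fev
    /METHOD=ENTER age height sex Intheightsex.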
FEV = -3.224 + 0.063·Age + 0.033·Height – 1.593·Sex + 0.011·Intheightsex
For girls (coded as 0):
FEV = -3.224 + 0.063·Age + 0.033·Height – 1.593·0 + 0.011·Height·0
FEV = -3.224 + 0.063·Age + 0.033·Height
For boys (coded as 1):
FEV = -3.224 + 0.063·Age + 0.033·Height – 1.593·1 + 0.011·Height·1
FEV = -4.817 + 0.063·Age + 0.044·Height
A specific association
What if we do not want to predict FEV, but we
are interested in the effect of a specific variable
on FEV?
Do we have to correct for other variables or not?
Theory on causality can help.
Directed Acyclic Graph (DAG)
Smoke → FEV?
Age → Smoke, Age → FEV

DAGs can help you to analyze the data in a correct way.
Age is a confounder in the relation between Smoke and FEV in the Boston children data.
Uncorrected versus corrected analysis
T-test for independent
groups: P < 0.0005
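A sketch of both analyses in syntax (the uncorrected t-test versus an age-adjusted regression; variable names assumed from the slides):

  * Uncorrected: compare mean FEV between smokers and non-smokers.
  T-TEST GROUPS=smoke(0 1)
    /VARIABLES=fev.
  * Corrected: regression of FEV on Smoke, adjusting for the confounder Age.
  REGRESSION
    /STATISTICS COEFF CI(95)
    /DEPENDENT fev
    /METHOD=ENTER smoke age.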
Some closing remarks
- Correlation does not automatically imply direct causation
- Two variables can share a common cause
- Should we always correct for possible confounding?
  - In an RCT with large enough groups: probably no need for correction
  - More likely in observational studies
- Beware of overcorrecting
If you torture your data long enough, they will confess!