STATISTICAL MODELS FOR AI
2AI-EST | IPCA
Multivariate Data Analysis
Dependence Techniques are types of multivariate analysis
techniques that are used when one or more of the variables
can be identified as dependent variables and the remaining
variables can be identified as independent.
Interdependence Techniques are types of multivariate
analysis techniques that are used where no distinction is
made as to which variables are dependent or independent.
Multivariate analysis is a Statistical procedure for analysis of data involving more than one type of
measurement or observation. It may also mean solving problems where more than one dependent
variable is analyzed simultaneously with other variables.
STATISTICAL MODELS FOR AI | 2AI-EST
1. Factor Analysis
Factor Analysis is an exploratory data analysis technique that aims to find out whether the covariances or correlations present in a set of observable variables can be explained by a smaller number of unobservable variables, known as latent variables or common factors.
Latent Variable
• A variable that can neither be observed nor measured directly
• Defined from a set of other variables (which can be observed or measured)

"Explained" here means that the correlation between each pair of observable variables results from their mutual association with the common factors.
Factor Analysis can be used as:
• exploratory analysis with the aim of reducing the dimension of the data;
• confirmatory analysis to test an initial hypothesis that the data can be reduced to a certain dimension, and how the variables are distributed along this dimension.
The DASS is a 21-item self-report instrument designed to measure the three related negative emotional states of depression, anxiety and stress.
For example, depression is a latent variable defined by a set of specific variables:
If two variables are correlated (and the correlation is not spurious), this association results from sharing a common, not directly observable characteristic, i.e., a common latent factor.

(Diagram: observed variables x1 and x2, correlated with coefficient r_{x1,x2}, both linked to a common latent factor.)

Factor analysis is able to summarize the information present in many variables into a small number of factors that are not directly observable. These factors allow the identification of structural relationships between variables that would otherwise go unrecognized in the large set of original variables.

Factor Analysis uses the observed correlations between the original variables to estimate the common factors and the structural relationships that link the (latent) factors to the variables.

Factor Analysis aims to discover and analyze the structure of a set of correlated variables in order to construct a measurement scale for (intrinsic) factors that in some way (more or less explicitly) control the original variables.
Model Adequacy
Data Quality
Factor Analysis is useful when the correlations between variables are high.

When are the correlations high?
• We can use Bartlett's test of sphericity.
When the null hypothesis is rejected, the population correlation matrix is different from the identity matrix, that is, there is correlation between the variables.
The test is sensitive to sample size and is not very appropriate for large samples.

What to do in this case?
We can use the Kaiser-Meyer-Olkin (KMO) measure, which compares the simple correlations with the observed partial correlations and indicates the proportion of variability that is common to the variables.
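Outside SPSS, both adequacy checks can be sketched with numpy/scipy using their standard formulas (the data below are synthetic, generated from two latent factors purely for illustration):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 = the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, df)

def kmo(X):
    """Kaiser-Meyer-Olkin measure: simple vs. anti-image (partial) correlations."""
    R = np.corrcoef(X, rowvar=False)
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
    A = -Rinv / d                      # anti-image (partial) correlations
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(R, 0.0)           # keep only off-diagonal terms
    return (R ** 2).sum() / ((R ** 2).sum() + (A ** 2).sum())

# Synthetic data: two latent factors each driving three observed items
rng = np.random.default_rng(0)
f = rng.normal(size=(300, 2))
X = np.column_stack([f[:, 0] + 0.5 * rng.normal(size=300) for _ in range(3)] +
                    [f[:, 1] + 0.5 * rng.normal(size=300) for _ in range(3)])
chi2, p_value = bartlett_sphericity(X)
print(f"Bartlett chi2={chi2:.1f}, p={p_value:.4f}, KMO={kmo(X):.3f}")
```

A rejected Bartlett test together with a KMO above 0.5 indicates the data are adequate for factor analysis.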
Anti-image matrices for the correlations
These values estimate the correlations between the variables that are not due to the common factors. Low values of these partial correlations indicate that the variables share one or more common factors, while high values suggest that the variables are more or less independent. Thus, the values below the main diagonal should be close to zero.

For KMO values, the Factor Analysis is:
[0.9 - 1[ : Excellent
[0.8 - 0.9[ : Good
[0.7 - 0.8[ : Average
[0.6 - 0.7[ : Poor
[0.5 - 0.6[ : Poor but still acceptable
< 0.5 : Unacceptable

The values on the main diagonal are another measure of data adequacy for the FA, called the Measure of Sampling Adequacy (MSA), computed for each of the variables in the analysis. This measure is a particularization of the KMO for each variable. MSA values below 0.5 indicate that the variable does not fit the structure defined by the other variables and, in this case, its elimination from the FA should be considered.
Estimation of Common Factors and Specific Factors (Factor Extraction)
There are several iterative methods for obtaining a solution to the estimation problem. Three of the most popular methods are the principal components method, principal axis factorization and maximum likelihood.

The principal components method has the advantage of not requiring the normality assumption.
• It is necessary to estimate the communalities, which depend on the weights;
• To obtain the weights it is necessary to know the communalities, hence the iterative procedure.
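A minimal sketch of principal-components factor extraction from a correlation matrix (the toy matrix R is invented for illustration; loadings and communalities follow their standard definitions):

```python
import numpy as np

def pc_factor_extraction(R, n_factors):
    """Principal-components factor extraction from a correlation matrix R.
    Loadings = eigenvectors scaled by sqrt(eigenvalues); communalities are
    the row sums of the squared loadings over the retained factors."""
    eigval, eigvec = np.linalg.eigh(R)            # ascending eigenvalues
    order = np.argsort(eigval)[::-1]              # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    loadings = eigvec[:, :n_factors] * np.sqrt(eigval[:n_factors])
    communalities = (loadings ** 2).sum(axis=1)
    return loadings, communalities

# Toy correlation matrix for three variables
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
L, h2 = pc_factor_extraction(R, n_factors=1)
print(L.round(3), h2.round(3))
```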
The initial solution may not allow the interpretation of the common factors.
• It is then necessary to rotate the factors.

Factor Rotation
• There are several methods for factor rotation.
• When we apply a factorial method, the factors obtained are uncorrelated with each other.
• If each factor represents an axis in the plane, this means that taking the first two factors we obtain an orthogonal plane onto which the variables are projected as points.
• Each factor is associated with a certain eigenvalue, or percentage of the total variability of the data.
• Rotation techniques are applied to this initial solution of axes in the plane to obtain a new solution that is more easily interpreted.
Varimax method
• Used when a variable is suspected to be strongly associated with a particular factor and weakly associated with the others;
• It should not be used when a general factor is suspected;
• It aims to maximize the variance of the squared weights (loadings) of the initial variables on each factor.
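A sketch of the varimax rotation itself, following Kaiser's classic SVD-based algorithm (the loadings matrix below is invented for illustration; an orthogonal rotation leaves the communalities unchanged):

```python
import numpy as np

def varimax(Phi, max_iter=100, tol=1e-6):
    """Kaiser's varimax: orthogonal rotation maximizing the variance of the
    squared loadings within each factor (SVD-based algorithm)."""
    p, k = Phi.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        Lam = Phi @ R
        # gradient of the varimax criterion
        U, s, Vt = np.linalg.svd(Phi.T @ (Lam ** 3 - Lam * (Lam ** 2).sum(axis=0) / p))
        R = U @ Vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:   # criterion stopped improving
            break
        d_old = d
    return Phi @ R

L = np.array([[0.7, 0.3], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
print(varimax(L).round(3))
```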
Factor Analysis in SPSS
Extract: Criteria for choosing the number of factors (eigenvalues greater than one)
(The number of factors can also be fixed)
Display: scree plot of the eigenvalues against the number of factors (represented on the horizontal axis)
Analyzing the KMO value (0.952), we conclude that the adequacy of the data for factor analysis is excellent, so we can proceed with the factor analysis.

Bartlett's test allows us to reject (p < 0.001) the hypothesis that the correlation matrix is the identity, so we can perform the factor analysis (it means that the variables are correlated).
MSA>0.5 for all variables,
suggesting that all of
them can be used.
The following table indicates the communalities. The initial communalities were 1.
The communalities after extraction:
• indicate the part of the variance of each variable explained by the factorial model (taking into account the number of factors selected) and should all be greater than 50%;
• low values indicate variables poorly explained by the model (solution: eliminate these variables, after analyzing the anti-image matrix, or increase the number of factors).

For the extracted factors, the percentage of the variance of each variable explained by the extracted common factors was greater than 50% for all variables.
Number of Principal Components (PCs) or Factors to be retained

None of these criteria is "ideal" and they commonly lead to different numbers of principal components. One usually chooses the number of factors on which the highest number of criteria coincide.

Criteria for choosing PCs or Factors
(These indicate the "ideal" number of PCs or Factors to interpret, when the objective is to analyze the main dimensions responsible for the total variability of the data.)

• Pearson Criterion
PCs or factors are used until the cumulative percentage of variance is equal to or greater than 80%;
• Kaiser Criterion
The PCs or factors with eigenvalues greater than the mean (of all the eigenvalues obtained) are used. When the correlation matrix is used, the mean is 1 (in this case the principal components or factors with eigenvalues greater than or equal to 1 are used);
• Scree Plot Criterion
Eigenvalues are represented on the vertical axis and the number of PCs or Factors on the horizontal axis. Only the points up to the break in the slope of the line connecting successive points (followed by a flattening) are used.
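The Kaiser and Pearson criteria can be sketched directly on the eigenvalues (the eigenvalues below are invented; note the two criteria can disagree, as the text warns):

```python
import numpy as np

def n_factors(eigvals, cum_threshold=0.80):
    """Apply the Kaiser and Pearson criteria to the eigenvalues of a
    correlation matrix (for which the mean eigenvalue is 1)."""
    eigvals = np.sort(np.asarray(eigvals))[::-1]
    kaiser = int((eigvals >= 1.0).sum())                 # eigenvalues >= 1
    cum = np.cumsum(eigvals) / eigvals.sum()
    pearson = int(np.argmax(cum >= cum_threshold) + 1)   # cumulative variance >= 80%
    return kaiser, pearson

eig = [3.2, 1.4, 0.8, 0.3, 0.2, 0.1]
print(n_factors(eig))   # the two criteria suggest different counts here
```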
The following table presents the eigenvalues for each factor and the percentage of variance explained.
According to the rule of retaining factors with eigenvalues greater than 1, two factors were retained (which is confirmed by the scree plot); they explain about 67.357% of the total variability.
The table presents the weights (loadings) associated with each factor in the model. Each weight corresponds to the correlation between the variable and the factor.
The analysis should be done row by row: each variable is assigned to the factor on which it weights the most.
Component 1 = Stress
Component 2 = Depression
Calculation of item-total correlations
• They should be calculated because it is assumed that each item should contribute to the formation of the component that is intended to be measured;
• In statistical terms this means that there should be a strong and statistically significant correlation (0.4 to 0.7) between each item and the total.
The correlations between items should be relatively high, and the values should all be positive and significant.
Summary - Factor analysis for metric and ordinal variables:
1) Selection of variables
2) Validation of assumptions:
2.1) Normality (test for normality; normality can also be analysed through the symmetry of the distribution)
2.2) Testing the validity of the factor analysis
(a) Bartlett's test of sphericity. For factor analysis to be appropriate, we must reject H0.
H0: the correlation matrix is the identity. H0 is not rejected when Sig is greater than or equal to 0.05; in this case, factor analysis is not appropriate.
H1: the correlation matrix is not the identity; for this, Sig must be less than 0.05, and factor analysis is appropriate. By rejecting H0 we say that the model is valid, but we do not yet know whether it is powerful.
2.3) KMO Measure (Kaiser-Meyer-Olkin) - used to assess the strength of the factor analysis.
Scale for KMO:
• less than 0.5: inadmissible - factor analysis is possible but the model is not robust
• between 0.5 and 0.6: we can go ahead, but it is a bad factor analysis
• between 0.6 and 0.7: the factor analysis is considered reasonable
• between 0.7 and 0.8: the factor analysis is average
• between 0.8 and 0.9: good
• between 0.9 and 1: very good
3) Factor extraction method
Summary - Factor analysis for metric and ordinal variables (continued):
3) Factor extraction method
• Principal components method (the most rigorous)
• Maximum likelihood
• Principal axis factorization
4) Interpreting the results
(a) Choose factors with eigenvalues greater than 1
(b) Interpret the factors - factors with a cumulative variance greater than 50 per cent should be interpreted
5) Rotation of the factor solution
(a) Varimax is usually used
(b) Obtaining the scores
- the Regression method is usually used
Cronbach's Alpha
Internal Consistency
The internal consistency of factors is defined as the proportion of the variability in responses that results from differences in the respondents: responses differ not because the questionnaire is confusing but because it elicits diverse opinions.
Cronbach's Alpha is one of the most widely used measures to verify the internal consistency of a group of variables.
Excellent!
‘Corrected Item-Total Correlation’
The Pearson correlation coefficient of each variable with the total of the others. The highest correlation observed is between S4Q12 and the others.
To know the effect of each variable on the factor's internal consistency, the ‘Alpha if Item Deleted’ is observed and compared to the standardized Alpha. If the ‘Alpha if Item Deleted’ is higher than the Alpha, the item is removed.
!! Common sense is needed, because the item may be important for the analysis !!
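Cronbach's alpha and 'Alpha if Item Deleted' can be sketched as follows (synthetic item scores driven by a single common trait, for illustration; the standard alpha formula is assumed):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects x n_items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance of the total)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def alpha_if_deleted(items):
    """Alpha recomputed with each item removed in turn."""
    k = items.shape[1]
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(k)]

# Synthetic items: one common trait plus item-specific noise
rng = np.random.default_rng(1)
trait = rng.normal(size=200)
items = np.column_stack([trait + 0.6 * rng.normal(size=200) for _ in range(4)])
print(round(cronbach_alpha(items), 3),
      [round(a, 3) for a in alpha_if_deleted(items)])
```

An item whose 'alpha if deleted' exceeds the overall alpha is a candidate for removal, subject to the common-sense caveat above.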
2. Cluster Analysis
• Cluster Analysis is a group of multivariate techniques which allows grouping subjects or variables into homogeneous groups
according to one or more common characteristics;
• Each observation belonging to a given cluster is similar to all the others belonging to that cluster, and is different from
observations belonging to other clusters;
• The identification of natural groupings of subjects or variables allows evaluating the dimensionality of the data matrix, identifying possible multivariate outliers and hypothesizing about the structural relationships between variables;
• In cluster analysis, the grouping of subjects (cases or items) or variables is done based on measures of similarity or dissimilarity (distance), initially between two subjects and later between two clusters of observations, using hierarchical or non-hierarchical clustering techniques.
Purpose of Cluster Analysis
✓ Grouping objects based on similarity of characteristics they possess.
Homogeneity
Heterogeneity
✓ Geometrically, the objects within a cluster will be close together, while the distances between clusters will be larger.
Forms of clusters (or groups)
• Cohesive and well-separated groups
• Homogeneous group with no natural clusters
• Separate but not cohesive groups
• Separate groups but no internal cohesion
• Fully cohesive but not separate groups
• High-density areas surrounded by low-density regions
Cluster Analysis
Dissimilarity/similarity metric
Hierarchical Method
Non-hierarchical Method
Measuring similarity
The degree of correspondence among objects across all characteristics:
• Correlational measures - grouping cases based on respondent patterns;
• Distance measures - grouping cases based on distance.

How do we group the subjects into homogeneous clusters from the dissimilarity measures, so that within the same cluster these measures are as small as possible and between clusters as large as possible?

(some) Distance Measures
❖ (Squared) Euclidean distance
❖ Minkowski distance
❖ Mahalanobis distance
❖ Cosine measure of similarity
❖ …
❖ Similarity measures for variables
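Several of the distance measures listed above are available in scipy.spatial.distance; a small sketch with invented vectors:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))        # Euclidean distance
print(distance.sqeuclidean(x, y))      # squared Euclidean distance
print(distance.minkowski(x, y, p=1))   # Minkowski with p=1 (city-block)
print(distance.cosine(x, y))           # cosine dissimilarity = 1 - similarity

# Mahalanobis distance needs the inverse covariance matrix of the data
data = np.random.default_rng(2).normal(size=(50, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```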
Hierarchical Cluster Analysis
Agglomerative: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy;
Divisive: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
It is necessary to find a way to define the distance between a cluster with more than one individual (or variable) and the others:
• Farthest neighbor or complete linkage
the distance between two clusters corresponds to the largest of the distances between two elementary observations belonging to different clusters;
• Nearest neighbor or single linkage
the distance between two clusters corresponds to the smallest of the distances between two elementary observations belonging to different clusters;
• Average linkage or between groups
the average of the distances over all pairs of elementary observations that can be formed with one observation from each cluster.

Different methods = different results
(test alternatives and see how robust the results are)
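The linkage methods above map directly onto scipy.cluster.hierarchy (single = nearest neighbor, complete = farthest neighbor, average = between groups); a sketch on synthetic, well-separated data, comparing methods as the text recommends:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# two well-separated groups of points in the plane
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(5, 0.5, size=(10, 2))])

# single = nearest neighbor, complete = farthest neighbor, average = between groups
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                 # agglomeration schedule
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```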
Steps in Cluster Analysis
• Formulate the problem
• Select a distance measure
• Select a clustering procedure
• Decide the number of clusters
• Interpret and profile clusters
• Assess the validity of clustering

The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups!
Formulate the problem
✓ Select the variables on which clustering is based;
✓ The variables selected must be relevant to the research problem;
✓ In exploratory research, the researcher should exercise judgment and intuition.

Select a distance measure
✓ The objective of clustering is to group similar objects together. Some measure is needed to assess how similar or different the objects are.
Select a clustering procedure
✓ Hierarchical method
✓ Non-hierarchical method
✓ Combination method

Decide the number of clusters
✓ Theoretical, conceptual or practical considerations may suggest a certain number of clusters;
✓ In hierarchical clustering, the distance at which clusters are combined can be used;
✓ The relative number of clusters must be meaningful.
Interpret and profile clusters
✓ This involves examining the cluster centroids;
✓ Centroids represent the mean values of the objects contained in the cluster on each of the variables;
✓ Centroids enable us to describe each cluster by assigning it a name.

Assess the validity of clustering
✓ Perform cluster analysis on the same data using different distance measures and compare the solutions to determine their stability;
✓ Use different methods of clustering and compare the results.
Example
Identify groups of individuals with distinct characteristics in the second pandemic condition (the individuals answered a questionnaire to evaluate their condition related to stress, depression, anxiety, social support, among other variables).
In this example, let's consider age, stress and social support.
(Hierarchical) Cluster Analysis in SPSS
Agglomeration schedule: for a summary table of the steps taken to obtain the clusters
Proximity matrix: to obtain a proximity matrix (similarity or dissimilarity, according to the proximity measure to
choose)
Cluster membership: you can indicate the number of clusters if you previously have an idea of it, or alternatively, you
can make a first analysis without this option, and after knowing the number of clusters, redo the analysis
• Select the Nearest neighbor method as the agglomeration method (you can try other methods; for example, Ward's method produces a dendrogram that is easier to interpret);
• As the variables used are at least interval-scaled, we opt for the squared Euclidean distance measure.
Example Dissimilarity Matrix
Using Euclidean distance: a smaller Euclidean distance corresponds to a smaller dissimilarity (or greater similarity or proximity) between individuals.
The first clusters to be formed contain individuals 7 and 8, then 6 and 7, and 5 and 6 (which have the smallest distances in the Proximity Matrix).
The next cluster to be formed contains individuals 16 and 17 (which have the second smallest distance in the proximity matrix).
Dendrogram (Graphical representation of the clustering process)
The figure represents the dendrograms
obtained with the ‘Single Linkage’
method and with the ‘Ward Linkage’
method. Note the better separation of
the clusters obtained with the Ward
method.
The figure graphically represents the
agglomeration scheme presented in the
‘Agglomeration Schedule’. However, the
coefficients (distances) have been
rescaled on a scale from 0 to 25.
Let's repeat the analysis, but assuming we have 3 clusters (with the option: ‘Single Linkage Method’) …
Cluster 1: 1, 3, 4, 9, 11, 12, 19, 20, 29
Cluster 2: 2, 10, 13, 16, 17, 28
Cluster 3: 5, 6, 7, 8, 14, 15, 18, 21, 22, 23,
24, 25, 26, 27, 30, 31, 32, 33
How many clusters should be retained?
• Dendrogram analysis
• Heuristic methods for evaluating the cluster solution and the number of clusters
• Distance between clusters
✓ Agglomeration Schedule
✓ r-squared (R2) criterion

r-squared criterion
• It is a measure of the percentage of the total variability that is retained in each of the cluster solutions;
• If the number of clusters is 1, the between-cluster variability is 0 (r2 = 0);
• If the number of clusters is equal to the number of subjects/cases, the between-cluster variability equals the total variability (r2 = 1).
Objective:
To find the minimum number of clusters that retains a significant percentage of the total variability (higher than 70-80%).

$$ r^2 = \frac{\text{Sum of Squares Between Groups (clusters)}}{\text{Total Sum of Squares}} = \frac{SSC}{TSS} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{k} n_{ij}\left(\bar{X}_{ij}-\bar{X}_{i}\right)^2}{\sum_{i=1}^{p}\sum_{j=1}^{k}\sum_{l=1}^{n_{ij}}\left(X_{ijl}-\bar{X}_{i}\right)^2} $$

(p variables, k clusters, n_ij observations of variable i in cluster j)

• Calculations can be performed using one-way ANOVA;
• First it is necessary to perform the cluster analysis with Cluster Membership to create new variables that record each object's membership in the requested cluster solution.
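A minimal sketch of the r² criterion using the SSC/TSS definition (synthetic two-group data, for illustration; in practice the labels would come from the cluster-membership variables):

```python
import numpy as np

def r_squared(X, labels):
    """r^2 = SSC / TSS: the share of the total variability (summed over all
    variables) retained by a given cluster solution."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    tss = ((X - grand_mean) ** 2).sum()
    # between-cluster sum of squares, over all variables
    ssc = sum(len(Xg) * ((Xg.mean(axis=0) - grand_mean) ** 2).sum()
              for g in np.unique(labels)
              for Xg in [X[labels == g]])
    return ssc / tss

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(8, 1, (20, 3))])
labels = np.array([0] * 20 + [1] * 20)
print(round(r_squared(X, labels), 3))
```

With a single cluster the function returns 0, matching the criterion's boundary cases described above.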
One-Way ANOVA (performed for each cluster solution; let's start with the 3-cluster solution)

The Sum of Squares of the Clusters (SSC) for each variable is given by the Sum of Squares Between Groups; adding these over all dependent variables gives SSC = 5444.015.
In the same way we obtain the Total Sum of Squares (TSS) over all variables: TSS = 11566.848.

r2 = (4966.379 + 121.740 + 356.167) / (5049.879 + 733.636 + 5783.333) = 47.07%
Similarly for the other cluster solutions. An acceptable solution would be 5 clusters!

N.º Clusters | r-squared
1 | 0
2 | …
3 | 47.07%
4 | 51.30%
5 | 89.47%
6 | 92.31%
… | …

Five clusters were retained, explaining 89.47% of the total variance.
Non-Hierarchical Cluster Analysis
• Clustering of subjects (not variables);
• Number of clusters initially defined by the analyst;
• Ease of application in large matrices;
• No need to calculate and store a new dissimilarity matrix at each step of the algorithm;
• Inclusion of an individual in a cluster may not be definitive.
One of the most used methods for clustering subjects is k-means:
1. Initial partition of the subjects into k predefined clusters;
2. Calculate the centroids of each of the k clusters and the Euclidean distance from the centroids to each individual;
3. Assign the individuals to the clusters whose centroids are closest and return to step 2, until there is no significant variation in the minimum distance from each subject to each of the k centroids (or until the maximum number of iterations or the convergence criterion is reached).
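The three steps above can be sketched as a plain k-means loop (synthetic data and hand-picked initial centroids, for illustration; SPSS and production libraries use more careful initialization):

```python
import numpy as np

def kmeans(X, centroids, n_iter=100):
    """Plain k-means sketch: assign each point to the nearest centroid,
    recompute the centroids, and repeat until the assignments stabilize."""
    centroids = np.asarray(centroids, dtype=float)
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignments stopped changing
        labels = new_labels
        # recompute each centroid (keep the old one if a cluster empties)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])
labels, centers = kmeans(X, centroids=[[0.0, 0.0], [4.0, 4.0]])
print(labels)
```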
Let's do the cluster analysis of our example with the k-means method. The first question that arises is: what should the value of k be? We saw earlier that the hierarchical method gives a solution of 5 clusters.
k-means in SPSS does not offer the option to standardize the observations. If it is important that all variables contribute equally to the analysis, you should start by standardizing them:
(Non-hierarchical) Cluster Analysis in SPSS
Select the standardized variables
The ‘Iteration History’ table indicates the variation of the cluster centers at each iteration step.
The other tables indicate:
• the cluster to which each subject belongs (‘Cluster Membership’)
• the average of each variable in each of the 5 clusters (‘Final Cluster Centers’)
The usefulness of ANOVA is not to identify whether the clusters are
different or not, but rather which variable or variables allow the
separation of the clusters.
If a variable discriminates enough between clusters, then its
variability between clusters (given by the ‘Cluster Mean Square’)
will be higher. On the other hand, within clusters (given by the
‘Error Mean Square’) this variability will be smaller.
Thus, the variables that contribute most to the definition of the
clusters are those with the highest Cluster Mean Square (CMS)
and lowest Error Mean Square (EMS), that is, those with the
highest value of F=CMS/EMS.
Social support is the variable that allows the greatest discrimination between clusters.
If we repeat the analysis without
standardizing the variables we will get:
▪ The classification of individuals into each of the clusters is generally
more rigorous in non-hierarchical methods;
▪ It is recommended to start the cluster analysis with hierarchical
methods to explore and proceed with K-Means to refine and interpret
the cluster solution;
▪ Cluster analysis should be supported by other analyses (e.g., Discriminant Analysis) to obtain the error probabilities associated with the conclusions obtained.
Linear Correlation
A correlation analysis provides information on the strength and direction
of the linear relationship between two variables (x and y).
Pearson Correlation
The most used statistic is the linear correlation coefficient, r.
The sign of the coefficient reflects the slope of the linear relationship
between two variables.
• A positive value of r suggests that the variables are positively
linearly correlated, indicating that y tends to increase linearly as x
increases.
• A negative value of r suggests that the variables are negatively linearly correlated, indicating that y tends to decrease linearly as x increases.

If a curved line is needed to express the relationship, other and more complicated measures of the correlation must be used.
• The linear correlation coefficient, r, takes values between −1 and 1.
• If r is close to ±1, the two variables are highly correlated and if plotted on a scatter plot, the data points cluster about a line.
• If r is far from ±1, the data points are more widely scattered.
• If r is near 0, the data points are essentially scattered about a horizontal line indicating that there is almost no linear
relationship between the variables.
• A perfect linear relationship (r=-1 or r=1) means that one of the variables can be perfectly explained by a linear function of the
other.
Pearson’s correlation assumes the variables to be roughly normally distributed and it is
not robust in the presence of outliers.
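A sketch of computing r and its significance test with scipy (the data are synthetic and negatively related, and the variable roles are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = -0.5 * x + rng.normal(scale=0.8, size=100)   # negatively related by design

# Pearson's r and the p-value for H0: rho = 0
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```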
H0 : 𝜌𝑋,𝑌 = 0 vs H1 : 𝜌𝑋,𝑌 ≠ 0
p-value < 0.001 < α = 0.05
There is a significant, negative correlation between stress and optimism (r = -0.410).
Linear Regression
• Regression models are used to describe relationships between variables by fitting a line to the observed data.
• Regression allows estimating how a dependent (or outcome) variable (Y | DV) changes as the independent variable(s) (X | IV) change.

Simple Linear Regression vs Multiple Linear Regression
Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple regression uses two or more independent variables to predict the outcome.
Multiple Linear Regression Model (MLRM)
The MLRM is a statistical technique that allows the analysis of the relationship between a dependent variable (Y) and a set of independent variables (X's).
The functional relationship between a dependent variable (Y), continuous, and one or more independent variables (X’s), continuous
or categorical, 𝑋𝑖 , 𝑖 = 1, … , 𝑝
𝑌𝑗 = 𝛽0 + 𝛽1 𝑋1𝑗 + 𝛽2 𝑋2𝑗 + ⋯ + 𝛽𝑝 𝑋𝑝𝑗 + 𝜀𝑗 𝑗 = 1, … , 𝑛
𝛽𝑖 are called regression coefficients;
𝜀𝑗 represent the errors or residuals of the model, 𝜀𝑗 ~𝑁 0, 𝜎 2 ;
𝛽0 is the intercept (i.e., the value of 𝑌𝑗 when 𝑋𝑖𝑗 = 0, 𝑖 = 1, … , 𝑝 );
and 𝛽𝑖 (𝑖 = 1, … , 𝑝) represent the partial slopes (i.e., a measure of the influence of 𝑋𝑖 on Y, i.e., the change in Y per unit change in 𝑋𝑖 ).
Note that in linear regression we do not say that, given values of 𝑋𝑖𝑗 , the value of 𝑌𝑗 is 𝑌𝑗 = 𝛽0 + 𝛽1 𝑋1𝑗 + 𝛽2 𝑋2𝑗 + ⋯ + 𝛽𝑝 𝑋𝑝𝑗 + 𝜀𝑗 , since the errors are unknown random quantities that cannot be measured exactly.
We can say that the mean value, or expected value, of 𝑌𝑗 is:
ŷ𝑗 = E[𝑌𝑗 | 𝑋𝑖 ] = E(𝛽0 + 𝛽1 𝑋1𝑗 + 𝛽2 𝑋2𝑗 + ⋯ + 𝛽𝑝 𝑋𝑝𝑗 + 𝜀𝑗 )
ŷ𝑗 = 𝛽0 + 𝛽1 𝑥1𝑗 + 𝛽2 𝑥2𝑗 + ⋯ + 𝛽𝑝 𝑥𝑝𝑗
Example
Estimate a predictive model of Stress (Y) (DV) as a function of the set of IV, in pandemic condition:
• Gender (𝑋1 ) – sex of the individuals [categorical variable (0-Male; 1-Female)]
• Age (𝑋2 )– age of the individuals (continuous variable)
• SSocial (𝑋3 ) – Social Support (continuous variable)
• SleepAfter (𝑋4 ) - Sleep quality perception after (continuous variable)
• TierdenessAfter (𝑋5 ) - Perception of tiredness afterwards (continuous variable)
• Optimism (𝑋6 ) – Dispositional optimism (continuous variable)
Y= 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝛽4 𝑋4 + 𝛽5 𝑋5 + 𝛽6 𝑋6 + 𝜀
Estimation of Regression Coefficients: Least Squares Method
In a linear regression the first task is to estimate the coefficients of the regression model, from a sample that is assumed to be
representative of the population. This estimation will be done using appropriate estimators that will produce sample estimates
𝑏0 , 𝑏1 , … , 𝑏𝑝 of the true population parameters 𝛽0 , 𝛽1 , … , 𝛽𝑝 .
The method of least squares is used: the estimates of the regression coefficients are obtained so that the sum of the squared errors or residuals, 𝑒𝑗 = 𝑦𝑗 − ŷ𝑗 , is minimized.
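A least-squares sketch with numpy: the coefficients are estimated from a design matrix with an intercept column, and R² is computed from the residuals (the data and true coefficients below are synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 2))                      # two illustrative predictors
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])             # design matrix with intercept
b, *_ = np.linalg.lstsq(A, y, rcond=None)        # least-squares estimates b0, b1, b2

y_hat = A @ b                                    # fitted values
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(b.round(2), round(r2, 3))
```

The estimates should land close to the generating coefficients (1, 2, -3), and R² measures the proportion of the variation in y explained by the fit.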
Inference for Linear Regression Model
Have found the estimates 𝑏0 , 𝑏1 , … , 𝑏𝑝 we can now evaluate the quantitative influence of the independent variables on the
dependent variable in the sample. However, the question is, does this relationship hold in the population? Does at least one
independent variable influence Y ? If so, which one or ones? And, by what percentage does the fitted model explain the observed
variation in Y ?
Analysis of variance of the linear regression model
The aim is to evaluate, from the sample estimates, whether or not any of the independent variables in
the population can influence the dependent variable, that is, whether or not the fitted model is
significant.
H0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑝 = 0
H1 : ∃𝑖: 𝛽𝑖 ≠ 0 𝑖 = 1 … 𝑝
The SPSS produces the p-value associated with this test statistic and summarizes the calculations
in a table called ANOVA.
If the p-value ≤ α, we reject H0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑝 = 0 and can conclude that at least one of the independent variables has a significant effect on the variation of the dependent variable (this does not mean that this independent variable is the cause of the dependent variable).
In other words, the model fitted to the data is significant. It remains to be seen whether all, or only some, of the independent variables influence the variation of the dependent variable.
This test statistic has an associated p-value < 0.001
("Sig."), so we reject H0 (α = 5%).
This means that at least one of the IVs influences
Stress (the DV).
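The ANOVA F-test reported by SPSS can be reconstructed by hand; the sketch below computes the F statistic from the explained and residual sums of squares, on synthetic data invented for the example (only X0 actually influences y):

```python
import numpy as np
from scipy import stats

def anova_f_test(y, X):
    """Overall significance test H0: beta_1 = ... = beta_p = 0."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ b
    ss_reg = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    f = (ss_reg / p) / (ss_res / (n - p - 1))  # F statistic, df = (p, n-p-1)
    p_value = stats.f.sf(f, p, n - p - 1)      # upper-tail probability ("Sig.")
    return f, p_value

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = 0.5 + 1.2 * X[:, 0] + rng.normal(size=80)  # only X0 truly matters
f, p_value = anova_f_test(y, X)
```

Because one predictor has a strong true effect, the test rejects H0 — exactly the "the fitted model is significant" conclusion on the slide.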
Testing the regression model coefficients
To find out which 𝛽𝑖 is different from zero, it is necessary
to perform multiple tests.
However, will all independent variables contribute equally
(in magnitude) to the model? Do they all have a significant effect
on the prediction of Stress?
H0 : 𝛽𝑖 = 𝑘
H1 : 𝛽𝑖 ≠ 𝑘 (𝑖 = 1, … , 𝑝)
k can take any value, but in most statistical analysis
software k = 0.
Note that the test statistic is obtained for each 𝛽𝑖
as if only the independent variable corresponding to
𝛽𝑖 entered the model (keeping the others constant).
We find the values of the test statistics for each of the hypotheses on the
partial regression coefficients:
H0 : 𝛽𝑖 = 0 𝑣𝑠. H1 : 𝛽𝑖 ≠ 0, 𝑖 = 0, … , 6
For example, for 𝛽1 : H0 : 𝛽1 = 0 𝑣𝑠. H1 : 𝛽1 ≠ 0; p-value = 0.001 < α = 0.05, so we reject H0 .
For α = 5% we can conclude that the variables Gender, SleepAfter,
TierdenessAfter and Optimism significantly affect Stress, even adjusted
for the others.
Gender has a significant impact on Stress: women's
mean stress level is 1.964 units higher than men's.
Coefficient of Determination ( 𝑅2 ), 0 ≤ 𝑅2 ≤ 1
• Is a measure used in statistical analysis that assesses how well a model explains and predicts future outcomes (a measure
of the size of the effect of the IV(s) on the DV);
• It indicates the level of explained variability in the data set;
• The coefficient of determination, commonly known as "R-squared", is used as a guideline to measure the accuracy of the
model.
• When 𝑅2 =0 the model does not fit the data and when 𝑅2 =1 the fit is perfect;
• Knowing what value of 𝑅2 to consider to produce an adequate fit is somewhat subjective. In the exact sciences a value > 0.9 is
usually an indicator of a good fit, while in the social sciences values > 0.5 are already considered acceptable for a good fit of
the model to the data.
The 𝑅2 value should not be used to compare models that differ in the number of independent variables. Generally, incorporating
one more variable causes this value to increase, even if that variable has little influence on the dependent variable. Alternatively, the
adjusted coefficient of determination (𝑅𝑎2 ) is used. This only increases if the addition of a new variable leads to a better
fit of the model to the data.
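Both measures follow directly from the sums of squares; a minimal sketch (the example vectors are invented, and p is the number of independent variables):

```python
import numpy as np

def r_squared(y, y_hat, p):
    """R^2 and adjusted R^2 for a fitted model with p independent variables."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)               # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)            # total variation in y
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes extra predictors via the degrees of freedom
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj
```

For a perfect fit both values equal 1; for any imperfect fit, 𝑅𝑎2 is strictly below 𝑅2, which is what makes it suitable for comparing models of different sizes.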
Validation of the Linear Regression Models Assumptions
1. Linearity
There must be a linear relationship between the outcome variable
(DV) and all the independent variables . Scatterplots can show
whether there is a linear or curvilinear relationship.
2. Residuals Analysis
• 𝜀𝑗 ~𝑁 0, 𝜎 2 - the errors have normal distribution with zero mean
and constant variance;
This assumption may be checked by looking at a histogram or a Q-Q
Plot. Normality can also be checked with a goodness of fit test (e.g., the
Kolmogorov-Smirnov test), though this test must be conducted on the
residuals themselves.
• 𝐶𝑜𝑣 𝜀𝑖 , 𝜀𝑗 = 0, 𝑖 ≠ 𝑗; 𝑖, 𝑗 = 1, … , 𝑛 - independent errors.
The Durbin-Watson (d) statistic is a test for autocorrelation in the residuals from a statistical regression analysis:
d = Σₜ₌₂ⁿ (𝑒ₜ − 𝑒ₜ₋₁)² / Σₜ₌₁ⁿ 𝑒ₜ²
The Durbin-Watson statistic will always have a value between 0 and 4.
A value ≈ 2.0 means that there is no autocorrelation detected in the sample.
Values from 0 to less than 2 indicate positive autocorrelation and values from 2 to 4 indicate negative autocorrelation.
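The statistic is a one-liner to compute from the residual series; the sketch below checks it against the two reference behaviours (independent noise ⇒ d ≈ 2, strongly positively autocorrelated noise ⇒ d well below 2). The AR(1) series with coefficient 0.9 is an invented example:

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)

# Independent residuals: expect d close to 2 (no autocorrelation)
d_independent = durbin_watson(rng.normal(size=2000))

# AR(1) residuals with coefficient 0.9: strong positive autocorrelation,
# so d should fall well below 2 (theoretically near 2 * (1 - 0.9) = 0.2)
e = np.zeros(2000)
for t in range(1, 2000):
    e[t] = 0.9 * e[t - 1] + rng.normal()
d_positive = durbin_watson(e)
```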
3. Multicollinearity
When the independent variables are strongly correlated with
each other (a condition called multicollinearity), the analysis of
the fitted regression model can be confusing and meaningless.
Multicollinearity may be checked multiple ways:
• Correlation matrix
When computing a matrix of Pearson’s bivariate
correlations among all independent variables, the
magnitude of the correlation coefficients should be less
than 0.80.
Variance Inflation Factor (VIF) – The VIFs of a linear regression indicate the degree to which the variances of the regression
estimates are inflated due to multicollinearity.
VIF = 1Τ(1 − R2𝑖 ), for the regression coefficient 𝛽𝑖 associated with variable 𝑋𝑖 .
𝑋1 = 𝑓(𝑋2 , 𝑋3 , … , 𝑋𝐾 ) → 𝑅12
𝑋2 = 𝑓(𝑋1 , 𝑋3 , … , 𝑋𝐾 ) → 𝑅22
⋮
𝑋𝑘 = 𝑓(𝑋1 , 𝑋2 , … , 𝑋𝐾−1 ) → 𝑅𝑘2
R2𝑖 is the coefficient of determination of the regression model of
predictor i on all other predictors (excluding Y ). If predictor i is not
correlated with the other predictors, R2𝑖 = 0 and VIF = 1 (the ideal situation).
There is no limit to the magnitude of VIF. Some authors suggest that VIF values greater than 10 or even 5 indicate problems
with the estimation of 𝛽𝑖 due to the presence of multicollinearity in the independent variables.
Another measure also used in SPSS is the "tolerance" of the variable 𝑋𝑖 , defined as T = 1 − R2𝑖 . The tolerance ranges from zero to
one, and the closer it is to zero, the greater the multicollinearity.
Note that VIF=1⁄T and therefore, by
definition, variables with low
tolerance have high VIF values and
vice versa.
Best Model Search/Variable Selection Method
• In a linear regression problem, the researcher may know up front which independent variables
to include in his regression model;
• However, at an early stage of exploration, the researcher may not know which variables lead to
the best model, for example because of the presence of multicollinearity and its effects on the
magnitude and sign of the regression coefficients;
• There are several methods of searching for the "best model", although none leads to the
"optimal" model.
Forward Selection
In this method the initial model only includes the constant 𝛽0 .
In the first step, the first independent variable (e.g., 𝑋1 ) to be selected is the one with the highest correlation (in absolute
value) with the dependent variable, i.e., the independent variable whose addition to the model yields the highest value of the
ANOVA F-statistic of the linear regression (or similarly, which leads to the largest increase in 𝑅2 ). This variable will be
added to the model if the associated F-statistic is greater than a critical input value F_Entry (or if the associated p-value is
less than the significance level). The next independent variable to add to the model is the one with the highest correlation
with Y after adjusting for the effects of 𝑋1 on Y. The procedure is repeated until the best candidate variable does not
have an F greater than F_Entry (or its associated p-value is not less than the significance level), or until all
independent variables have entered the model.
Backward Selection
In this method, the model starts with all p independent variables. In the next step a partial F-statistic is calculated for each
variable as if it were the last to enter the model. The variable with the smallest F-value (or largest p-value) is compared with
a critical F_Removal (or a fixed significance level) and if the F is smaller than the F_Removal, that variable is removed from
the model. Next a new model is fitted with p-1 independent variables, the process is repeated and continued until there are
no variables in the model or until all variables present in the model have a partial F greater than F_Removal.
Stepwise Selection
In the first step this type of selection starts with only one independent variable (as in Forward) but the significance of each
addition of a new independent variable to the model is tested as in Backward. The advantage of this method is that it allows
the removal of a variable whose importance in the model is reduced by the addition of new variables. The process ends when
none of the independent variables still outside the model can enter it based on F_Entry and none of the variables present in
the model is removed based on F_Removal.
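The forward step described above can be sketched with a partial F-test as the entry criterion. This is a simplified greedy implementation on invented data (four candidate predictors, of which only X0 and X2 truly influence y), not a reproduction of SPSS's exact algorithm:

```python
import numpy as np
from scipy import stats

def _sse(y, cols):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ b) ** 2)

def forward_selection(y, X, alpha=0.05):
    """Greedy forward selection; the partial F-test p-value plays F_Entry."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        sse_reduced = _sse(y, [X[:, j] for j in selected])
        best_j, best_p = None, 1.0
        for j in remaining:
            cols = [X[:, i] for i in selected] + [X[:, j]]
            sse_full = _sse(y, cols)
            df = n - len(cols) - 1
            f = (sse_reduced - sse_full) / (sse_full / df)
            p = stats.f.sf(f, 1, df)
            if p < best_p:
                best_j, best_p = j, p
        if best_p < alpha:          # candidate passes the entry criterion
            selected.append(best_j)
            remaining.remove(best_j)
        else:
            break                   # no remaining variable qualifies
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 4))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=150)  # X1, X3 are pure noise
chosen = forward_selection(y, X)
```

The variable with the strongest marginal relationship (X0 here) enters first, matching the description of the first forward step.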
Example
1. Using ENTER Selection
Estimate a predictive model of Depression (Y) (DV) as a function of the set of IV, in pandemic condition:
• Gender (𝑋1 ) – sex of the individuals [categorical variable (0-Male; 1-Female)]
• Age (𝑋2 )– age of the individuals (continuous variable)
• SSocial (𝑋3 ) – Social Support (continuous variable)
• SleepAfter (𝑋4 ) - Sleep quality perception after (continuous variable)
• TierdenessAfter (𝑋5 ) - Perception of tiredness afterwards (continuous variable)
• Optimism (𝑋6 ) – Dispositional optimism (continuous variable)
Multiple Linear Regression Model in SPSS
H0 : 𝛽𝑖 = 0
H1 : 𝛽𝑖 ≠ 0 (𝑖 = 1, … , 𝑝)
p-value < 0.001 < α = 0.05
reject H0
At least one of the independent variables is predictive of depression.
“the question is which one or which ones?”
2. Using Stepwise Selection
Regression equation
𝑌 = 16.307 − 0.569𝑋6 + 0.647𝑋5 − 0.049𝑋2 − 0.405𝑋4
𝑌 = 16.307 − 0.569𝑋6 + 0.647𝑋5 − 0.049𝑋2 − 0.405𝑋4
• each one-unit increase in Optimism is associated with a 0.569-unit decrease in Depression;
• each one-unit increase in TierdnessAfter is associated with a 0.647-unit increase in Depression;
• each one-unit increase in Age is associated with a 0.049-unit decrease in Depression;
• each one-unit increase in SleepAfter is associated with a 0.405-unit decrease in Depression.
𝑅𝑎2 = 41.2%: 41.2% of the variability of Depression is explained by
Optimism, TierdnessAfter, Age and SleepAfter; the
remaining 58.8% is explained by other factors.
Assumption Validation
1. Linearity
• Construct and inspect scatter plots (Y vs each X)
2. Residuals Analysis
• 𝜀𝑗 ~𝑁 0, 𝜎 2 - the errors have normal distribution with zero mean
and constant variance; (Normal P-P plot: the points are
overlapped on the line )
• 𝐶𝑜𝑣 𝜀𝑖 , 𝜀𝑗 = 0, 𝑖 ≠ 𝑗; 𝑖, 𝑗 = 1, … , 𝑛 - independent errors
(Durbin-Watson=1.834 ≈2)
3. Multicollinearity
• VIF < 5
Logistic Regression
Categorical Regression: the dependent variable is qualitative and takes on discrete class values.
(It is analogous to linear regression, differing in the assumptions and in the method of obtaining the model estimates.)
In linear regression the dependent variable is quantitative, in categorical regression the dependent variable is qualitative, and
the independent variables or predictors (also called covariates) can be quantitative or qualitative.
When the dependent variable is nominal dichotomous, the categorical regression is called logistic regression;
If the dependent variable is nominal polytomous, the regression is called multinomial regression (it is an extension of logistic
regression). If the classes of the dependent variable can be ordered (ordinal variable) then ordinal regression should be used.
Example
• A biomedical sciences researcher can use logistic regression to estimate the effectiveness of a new
treatment on cancer survival;
• An educational psychologist can use multinomial regression to identify risk factors in adolescents
that may be implicated in school dropout;
• An economist can use ordinal regression to predict the risk classes in the attribution of a bank credit
to a firm.
Binomial (or binary) logistic regression is a form of regression which is used when the dependent is a dichotomy, and the
independents are of any type.
Logistic regression can be used to predict a dependent variable on the basis of continuous and/or categorical independents and
to determine the percent of variance in the dependent variable explained by the independents.
Logistic regression is popular in part because it enables the researcher to overcome many of the restrictive assumptions:
• Logistic regression does not assume a linear relationship between the dependents and the independents;
• The dependent variable need not be normally distributed (but does assume its distribution is within the range of the
exponential family of distributions, such as normal, Poisson, binomial, gamma);
• The dependent variable need not be homoscedastic for each level of the independents; that is, there is no homogeneity of
variance assumption: variances need not be the same within categories.
The main restriction is that the model should have little or no multicollinearity. That is that the independent variables should be
independent from each other.
This is a non-linear regression model, used when the response variable is qualitative with two possible
outcomes. Such a response variable can be represented by an indicator variable that takes only the values
zero and one.
It models, in probabilistic terms, the occurrence of one of the two possible outcomes of the response variable.
Applications
• Estimating the probability that the event of interest will occur;
• Identifying the explanatory variables that can influence the response, eliminating those that do not provide
information;
• Estimating the influence and relative importance of each explanatory variable on the event of interest;
• Detect interactions between explanatory variables that affect the response variable;
• Identify interfering variables;
• Estimate the odds ratio that measures the importance of a predictor on the response variable.
In linear regression, the model was concerned with estimating (or predicting) the mean value of y given a certain set of
explanatory variable values.
What if the dependent variable is dichotomous?
Disease present = 1
Disease absent = 0
Dead = 1
Alive = 0
1 = "success" from the statistical point of view, corresponds to the occurrence of the event
0 = "failure" from the statistical point of view, corresponds to no occurrence of the event
The mean of this dichotomous variable "y" will be denoted "p"
Where "p" is the proportion of times the variable takes on the value 1
p = P (Y = 1)
p = P ("success")
To estimate the probability "p" associated with a dichotomous response for various values of an explanatory variable, we use a
logistic regression !!
Example
Consider low birth weight neonates (defined as < 1750 grams) who meet the following criteria:
• Confined to a neonatal ICU;
• Required OTI and MV during the first 12 hours of life;
• Survived for at least 28 days.
Random sample of n = 223 neonates with these characteristics:
76 were identified as having bronchopulmonary dysplasia;
the remaining 147 were not.
P (Y = 1)
Y = dichotomous random variable where:
1 = presence of BPD (bronchopulmonary dysplasia)
0 = absence of BPD
The estimated probability that a neonate drawn from this population has BPD is the proportion of BPD in the random sample:
p = 76/223 = 0.341 (or 34.1%).
We may suspect that some factors - maternal and neonatal - should affect the likelihood that a particular
neonate will develop BPD.
Knowledge of the presence or absence of these factors can:
• increase the accuracy of our "p" estimate
• develop interventions to reduce that probability
Analogy with linear regression
• Equation aims to improve the estimate over the simple arithmetic mean
If the dependent variable y were continuous, we could begin the analysis by constructing a scatterplot of points of the
variables x versus y.
Could we do the same with a dichotomous variable y?
Figure: DIAGNOSIS OF BRONCHOPULMONARY DYSPLASIA VERSUS BIRTH WEIGHT FOR A SAMPLE OF 223 LOW BIRTH WEIGHT INFANTS
Birth Weight (in grams)   Sample Size   Number with BPD   p
0-950                     68            49                0.721
951-1350                  80            18                0.225
1351-1750                 75            9                 0.120
Total                     223           76                0.341
It appears that the probability of developing BPD increases as the weight of the neonate
decreases - and vice versa.
Since there seems to be a relationship between these two variables, we would like to use
birth weight to help estimate the likelihood that the neonate will develop BPD.
The first strategy might be to try to fit a model of the type:
𝑝 = 𝛼 + 𝛽𝑋
where 𝑋 represents birth weight.
At first impression, this model is not suitable: since 𝑝 is a probability, it can only take values between 0 and 1, while
𝛼 + 𝛽𝑋 is unbounded.
An alternative would be to fit a model of the type
𝑝 = 𝑒^(𝛼+𝛽𝑋)
where e is Euler's number. Since the exponential function only takes strictly positive values, this equation guarantees
that the estimate of 𝑝 will be positive.
However, this equation is also inadequate because it can produce a number greater than 1.
To satisfy this last requirement, we can fit a model of the type
𝑝 = 𝑒^(𝛼+𝛽𝑋) / (1 + 𝑒^(𝛼+𝛽𝑋))
This expression, known as the logistic function, admits neither negative values nor values greater than 1.
An odds (chance) is a ratio between two probabilities:
odds = probability of the event occurring / probability of the event not occurring = 𝑝 / (1 − 𝑝)
• Probability of heads on a coin toss = 0.5
• Odds of heads = 0.5 / 0.5 = 1
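The logistic function and the odds transformation above are easy to sketch; the fair-coin case (probability 0.5 ⇔ odds 1) is the worked example from the slide:

```python
import math

def logistic(x):
    """p = e^x / (1 + e^x); always strictly between 0 and 1."""
    return math.exp(x) / (1 + math.exp(x))

def odds(p):
    """odds = p / (1 - p): ratio of the two complementary probabilities."""
    return p / (1 - p)

# Fair coin: logistic(0) gives probability 0.5, whose odds equal 1
p_heads = logistic(0.0)
odds_heads = odds(p_heads)
```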
If 𝑝 = 𝑒^(𝛼+𝛽𝑋) / (1 + 𝑒^(𝛼+𝛽𝑋)), then
𝑝 / (1 − 𝑝) = [𝑒^(𝛼+𝛽𝑋) / (1 + 𝑒^(𝛼+𝛽𝑋))] / [1 / (1 + 𝑒^(𝛼+𝛽𝑋))] = 𝑒^(𝛼+𝛽𝑋)
⇔ ln(𝑝 / (1 − 𝑝)) = ln(𝑒^(𝛼+𝛽𝑋))
⇔ ln(𝑝 / (1 − 𝑝)) = 𝛼 + 𝛽𝑋
Modeling a probability 𝑝 with a logistic function is equivalent to fitting a linear regression model where the continuous
dependent variable Y has been replaced by the natural (Neperian) logarithm of the odds of occurrence of a dichotomous event.
Instead of assuming that the relationship between 𝑝 and X is linear, it is assumed that the relationship between
ln(𝑝 / (1 − 𝑝)) and X is linear. This technique is known as Logistic Regression.
Maximum Likelihood Method
The coefficients are estimated by maximum likelihood:
ln(p̂ / (1 − p̂)) = α̂ + β̂𝑋
For the sample of 223 low birth weight neonates, the estimated equation is:
ln(p̂ / (1 − p̂)) = 4.0343 − 0.0042𝑥
Interpretation
For each one-gram increase in weight, the log-odds that a neonate will develop BPD decrease, on average, by 0.0042.
What is the probability that a neonate (taken from this population), whose
weight is 750 grams, will develop BPD?
ln(p̂ / (1 − p̂)) = 4.0343 − 0.0042𝑥
ln(p̂ / (1 − p̂)) = 4.0343 − 0.0042(750) = 0.8843
Exponentiating both sides:
p̂ / (1 − p̂) = 𝑒^0.8843 = 2.4213 ⇔ p̂ = 2.4213 / (1 + 2.4213) = 0.708
If we calculate the estimated probability for three observed values of birth weight, we get:
Birth Weight (in grams)   Estimated probability of BPD
750                       0.708
1150                      0.311
1550                      0.078
Multiple Logistic Regression
To begin to explore the possibility that gestational age might also interfere with the likelihood of a neonate
developing BPD, the population of low-birth-weight neonates was subdivided into three categories.
The table shows that the estimated probability of BPD decreases as gestational age increases.
Gestational Age (weeks)   Sample Size   N.º BPD   Estimated Probability
<=28                      58            40        0.690
29-30                     73            26        0.356
>=31                      92            10        0.109
Total                     223           76        0.341
Estimated probability of BPD (cell n in parentheses), by birth weight and gestational age:
Birth Weight (in grams)   Gestational Age <=28   29-30        >=31
0-950                     0.805 (41)             0.714 (21)   0.167 (6)
951-1350                  0.412 (17)             0.194 (36)   0.148 (27)
1351-1750                 - (0)                  0.250 (16)   0.085 (59)
The following trends can be observed:
• for a given weight category, the estimated probability of BPD appears to decrease as
gestational age increases
• for a given gestational age category, the estimated probability of BPD seems to
decrease as birth weight increases (except when n is small)
ln(p̂ / (1 − p̂)) = 13.8273 − 0.0024𝑥1 − 0.3983𝑥2
(in this analysis both variables are treated as continuous)
And if the variable X is dichotomous?
ln(p̂ / (1 − p̂)) = α̂ + β̂3 𝑥3
where 𝑋3 is the indicator random variable for whether the mother had pre-eclampsia during
pregnancy:
pre-eclampsia 'yes' = 1
pre-eclampsia 'no' = 0
The fitted equation is:
ln(p̂ / (1 − p̂)) = −0.5718 − 0.7719𝑥3
The log-odds of developing BPD are lower for children whose mothers had
pre-eclampsia.
ODDS RATIO
The antilogarithm of 𝛽3 is a ratio of odds:
𝑂𝑅 = odds of BPD in low birth weight neonates whose mothers had pre-eclampsia /
odds of BPD in low birth weight neonates whose mothers had no pre-eclampsia
𝑂𝑅 = 𝑒^𝛽3 = 𝑒^(−0.7719) = 0.46
The odds of BPD are 54% lower in neonates whose mother
had pre-eclampsia.
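The 54% figure follows directly from the exponentiated coefficient:

```python
import math

beta3 = -0.7719                       # fitted coefficient for pre-eclampsia (X3)
odds_ratio = math.exp(beta3)          # OR, approximately 0.46
pct_change = 100 * (odds_ratio - 1)   # approximately -54%: odds 54% lower
```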
Logistic response function
The curve that best describes the relationship between the explanatory variables and the dichotomous
response variable is an 'S'-shaped curve, taking values in the range (0, 1), with zero and one as horizontal
asymptotes. A common choice is the increasing logistic function 1 / (1 + 𝑒^(−𝑥)); its mirror image
1 / (1 + 𝑒^𝑥) is decreasing.
There are several mathematical functions that can be used to model this type of curve.
The simple logistic regression model is then:
π̂ = 𝑒^(𝛼+𝛽𝑋) / (1 + 𝑒^(𝛼+𝛽𝑋))
𝐿𝑜𝑔𝑖𝑡(π̂𝑗) = 𝛽0 + 𝛽1 𝑋𝑗 , (𝑗 = 1, … , 𝑛), where 𝐿𝑜𝑔𝑖𝑡(π̂) = 𝐿𝑛(π̂ / (1 − π̂))
Note that in this model the dependent variable is not 𝑌, nor 𝑃(𝑌 = 1), but 𝐿𝑜𝑔𝑖𝑡(π̂𝑗).
For more than one variable 𝑋𝑖 , 𝑖 = 1, … , 𝑝:
𝐿𝑜𝑔𝑖𝑡(π̂𝑗) = 𝛽0 + 𝛽1 𝑋1𝑗 + 𝛽2 𝑋2𝑗 + ⋯ + 𝛽𝑝 𝑋𝑝𝑗
The ratio π̂ / (1 − π̂) is called the odds (chance): the ratio between the probability of success (Y=1)
and the probability of failure (Y=0).
Note
Beta coefficients are difficult to interpret directly, so it is usual to interpret the exponentials of these
coefficients: 𝐸𝑥𝑝(𝛽𝑖) estimates the ratio of the odds of success versus failure per unit of the
independent variable i. That is, when 𝑋𝑖 increases by one unit, the odds are multiplied by 𝐸𝑥𝑝(𝛽𝑖)
(equivalently, the log-odds change by 𝛽𝑖 units).
Assumptions
1. Linearity and additivity: the scale 𝐿𝑜𝑔𝑖𝑡 𝜋 is additive and linear (but 𝜋 is not);
2. Proportionality: the contribution of each 𝑋𝑖 is proportional to its value with a factor 𝛽𝑖 ;
3. Constancy of effect: the contribution of an independent variable is constant, and independent of the
contribution of the other independent variables;
4. The errors are independent and show a binomial distribution;
5. The predictors are not multicollinear (as in multiple linear regression).
Odds and Logit are equivalent ways of describing P(Y=1), whose value you want to estimate with
logistic regression.
Example
for a given variable you can select a condition to apply
indicates the reference class
to obtain a graph of the classification of the subjects in the two groups of the
dependent variable
to evaluate the quality of the fitted model
if using the stepwise method confidence interval of the odds
diagnosing outliers
correlations between variables
• summary of the cases used, the missing values, and the
cases not selected for the analysis
• coding of the dependent variable and the independent variables
• in this case success is having covid
• for the independent variables the reference class is zero
The tables obtained for block 0 are relative to the null model, that is, with only the constant:
The Omnibus Tests table presents the likelihood ratio test between the null model and
the model at each step, block and for the final model.
As we use the Enter method, all values are equal because there is only a single block.
Since p < 0.001, we can conclude that there is at least one independent variable in the
model with predictive power on our dependent variable.
Evaluates the quality of the fit (we'll see more later)
pseudo 𝑅2
Goodness-of-fit test
H0: the model fits the data
The table presents the observed and predicted classification of the
subjects by the fitted model.
Note that there are 8 individuals who didn't have covid but whom
the model predicts to have had it (false positives), and 39 who did
have covid but whom the model predicts not to have had it (false
negatives).
The sensitivity of the model is 29.1% (that is, the model
correctly classifies 29.1% of the subjects who had covid —
success) and the specificity is 94.2% (the model correctly
classifies 94.2% of the subjects who didn't have covid —
failure).
As the coefficients are not all significant, the model can be re-estimated using
another selection procedure. Repeat the analysis with the Forward LR method!
The odds ratio of having covid relative to not having it is 1.042 for each unit increase in age. In percentage terms,
%𝑜𝑑𝑑𝑠 = 100 × (exp(𝛽𝑗) − 1)
so the odds of getting covid increase by 4.2% for each additional year of age.
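The percent-change formula is a one-liner. The age coefficient below is not shown in the output excerpt; it is reconstructed from the reported odds ratio of 1.042, so treat it as an assumption for illustration:

```python
import math

def pct_odds(beta):
    """%odds = 100 * (exp(beta) - 1): percent change in odds per unit of X_j."""
    return 100 * (math.exp(beta) - 1)

# Hypothetical age coefficient implied by the reported odds ratio of 1.042
beta_age = math.log(1.042)
pct = pct_odds(beta_age)    # about a 4.2% increase in the odds per year of age
```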