Work Plan
Quantitative Methods
Multivariate Data Analysis
2016/2017
Pedro Campos / Paula Brito
DATA ARRAY
n "individuals" in rows
p variables (attributes) in columns
     nb. children   weight   gender   education
I1   2              52       1        2
I2   1              55       1        3
I3   0              50       1        2
I4   3              60       2        1
Data array - Example
The table below records, for some Portuguese towns, the percentage of workers in industry, the number of ATM machines, and the number of available sports facilities (old data...).

            Workers in industry (%)   ATM machines   Sports facilities
Aveiro      47.07                     36             81
Beja        10.35                     12             52
Braga       46.81                     50             125
Guimarães   79.36                     34             103
Portimão    6.07                      19             104
DATA ARRAY
X      Y1     Y2     ...   Yj     ...   Yp
I1     x_11   x_12   ...   x_1j   ...   x_1p
I2     x_21   x_22   ...   x_2j   ...   x_2p
...    ...    ...    ...   ...    ...   ...
Ii     x_i1   x_i2   ...   x_ij   ...   x_ip
...    ...    ...    ...   ...    ...   ...
In     x_n1   x_n2   ...   x_nj   ...   x_np
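As a minimal illustration (hypothetical code, not part of the slides), such an n x p data array can be held in a pandas DataFrame, here filled with the Portuguese-towns example above:

import pandas as pd

# The towns example above as an n x p data array: individuals (towns) in rows,
# variables in columns. Column names are our own choice.
towns = pd.DataFrame(
    {
        "workers_in_industry_pct": [47.07, 10.35, 46.81, 79.36, 6.07],
        "atm_machines": [36, 12, 50, 34, 19],
        "sports_facilities": [81, 52, 125, 103, 104],
    },
    index=["Aveiro", "Beja", "Braga", "Guimarães", "Portimão"],
)

print(towns.shape)       # (n, p) = (5, 3)
print(towns.describe())  # basic univariate summaries, one column per variable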
VARIABLES
• Numerical (Quantitative)
Their values are real numbers.
• Discrete: the value set is finite or countably infinite.
Ex.: number of children; number of times you use the cell phone each day
• Continuous: the value set is infinite and uncountable.
Ex.: height, weight, temperature
VARIABLES
• Numerical
• Interval scale: the scale has no absolute zero, so only differences between values are meaningful.
Ex.: temperature in °C or °F
• Ratio scale: the scale has an absolute zero, so exact ratios between values are meaningful.
Ex.: weight
VARIABLES
• Categorical
Their values - categories, or modalities - are not real numbers, although numerical codes may be used to represent them.
• Ordinal: the values are naturally ordered.
Ex.: education level
• Nominal: the values are not ordered.
Ex.: nationality, job
Multivariate Data Analysis
Multivariate data analysis comprises a set of statistical methods used to analyse several variables jointly, observed on each individual or object.
Multivariate Data Analysis
Steps of a multivariate data analysis:
1. Establish the objectives of the analysis;
2. Design the analysis (sample size, type of variables, statistical methods, ...);
3. Check the hypotheses/assumptions of the selected methods;
4. Perform the analysis (estimation of the multivariate model);
5. Interpret the obtained results (this often leads to a reformulation of the model);
6. Validate the results.
Multivariate Data Analysis
There are two main groups of multivariate methods:
• Dependence methods
• Interdependence methods
Main Techniques of Multivariate Analysis
Dependence methods assume a division of the variables into two groups, the dependent and the independent variables; the objective is to assess whether, and how, the independent variables influence the dependent ones.
Interdependence methods make no distinction between dependent and independent variables; the objective is to determine which variables are related, how they are related, and why.
Examples of Dependence Methods
Quantitative dependent variable:
• Multiple Linear Regression
• Multivariate Analysis of Variance (MANOVA)

Qualitative dependent variable:
• Discriminant Analysis
• Logistic Regression
Interdependence Methods
Quantitative data:
• Factor Analysis
• Principal Component Analysis
• Canonical Correlation
• Cluster Analysis

Qualitative data:
• Log-linear models
• Multiple Correspondence Analysis (HOMALS)
• CatPCA
One view of multivariate methods…
What type of relation?

Dependence (supervised methods):
  How many variables to predict?
  - Several dependent variables in one relation:
      What is the type of the dependent variables?
      - Quantitative: what is the type of the independent variables?
          - Quantitative → Multivariate Regression
          - Qualitative → Multivariate Analysis of Variance (MANOVA)
  - One dependent variable in one relation:
      What is the type of the dependent variable?
      - Quantitative → Multiple Regression
      - Qualitative → Discriminant Analysis, Logistic Regression

Interdependence (non-supervised methods):
  Is the relation between variables or between cases?
  - Variables → Factor Analysis, Principal Component Analysis, Canonical Correlation
  - Cases → Cluster Analysis
Multiple Regression
Goal: explain the behaviour of one or more variables as a function of other variables.

Dependent variables: quantitative
Independent variables: quantitative, or qualitative converted into binary (dummy) variables

Example:
Create a model to predict profitability as a function of equity, return on equity and solvency.

Linear Regression: the relation between the variables can be described by a linear function
(if there is only one independent variable → a straight line).
Multiple Regression
Model:
Y = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε

For each case:
Yi = β0 + β1 Xi1 + β2 Xi2 + ... + βp Xip + εi   (i = 1, ..., n)

βj - regression coefficients; εi - residuals

Assumptions:
Only Y is affected by measurement errors.
The residuals εi are random, independent, and Normally distributed with zero mean and constant variance: εi ~ N(0, σ²).
The residuals εi are uncorrelated with the independent variables X1, ..., Xp.
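As a minimal sketch (not taken from the slides), the model can be fitted by ordinary least squares with NumPy; the data below are randomly generated purely for illustration:

import numpy as np

# Ordinary least squares for Y = b0 + b1*X1 + ... + bp*Xp + error,
# on randomly generated data (illustration only).
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                   # independent variables
beta_true = np.array([1.0, 2.0, -0.5, 0.7])   # b0, b1, b2, b3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=n)

X_design = np.column_stack([np.ones(n), X])   # add the intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
residuals = y - X_design @ beta_hat

print("estimated coefficients:", beta_hat.round(3))
print("residual variance:", residuals.var(ddof=p + 1).round(4))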
Analysis of Variance – (M)ANOVA
Goal: verify whether the behaviour of one (or more) numerical variables depends on qualitative variables - factors.

Dependent variables: quantitative
Independent variables: qualitative

Example:
Verify whether the sales of specific products depend on the size and location of the store.

Assumptions: the numerical (dependent) variables follow a Normal distribution in each population, and the variances in the different populations are equal.
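A minimal one-way ANOVA sketch (with made-up data, not the slides' example) is shown below; for several dependent variables at once, a MANOVA routine such as the one in statsmodels would be used instead:

import numpy as np
from scipy import stats

# Does a numerical variable (sales) differ across the levels of a factor (store size)?
rng = np.random.default_rng(1)
sales_small = rng.normal(loc=100, scale=15, size=30)
sales_medium = rng.normal(loc=110, scale=15, size=30)
sales_large = rng.normal(loc=125, scale=15, size=30)

f_stat, p_value = stats.f_oneway(sales_small, sales_medium, sales_large)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")
# A small p-value suggests that mean sales differ across store sizes.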
Discriminant Analysis (Linear)
Dependent variable: qualitative; its categories are the groups to discriminate
Independent variables: quantitative

Examples:
A marketing department wants to find characteristics of customers that distinguish buyers from non-buyers of certain products, and to use this information to predict the behaviour of new customers.
A bank needs to find indicators that identify successful firms (and those that fail), and to use this information to make decisions about loans.
Discriminant Analysis (Linear)
Goals:
Identify the variables that best distinguish the groups;
Use these variables to build an index that summarizes the differences between the groups;
Use the identified variables and the index to create a rule for classifying future observations into one of the groups.

Assumptions:
The explanatory (independent) variables follow a multivariate Normal distribution in each group;
The variance-covariance matrices are equal in all groups.
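A minimal sketch of linear discriminant analysis with scikit-learn (on made-up data, not the slides' example):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two groups (e.g. buyers vs non-buyers) described by two quantitative variables.
rng = np.random.default_rng(2)
cov = [[1.0, 0.3], [0.3, 1.0]]
X_buyers = rng.multivariate_normal([2.0, 3.0], cov, size=50)
X_non_buyers = rng.multivariate_normal([0.0, 0.5], cov, size=50)
X = np.vstack([X_buyers, X_non_buyers])
y = np.array([1] * 50 + [0] * 50)                 # group labels

lda = LinearDiscriminantAnalysis().fit(X, y)
print("discriminant coefficients:", lda.coef_.round(3))     # the discriminant "index"
print("predicted group for a new customer:", lda.predict([[1.5, 2.0]]))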
Discriminant Analysis (Linear)
When the assumptions are not met:

The variance-covariance matrices are NOT equal in all the groups:
→ Quadratic Discriminant Analysis (requires large samples).

The explanatory (independent) variables deviate strongly from the Normal distribution:
→ Logistic Regression.
Factorial Analysis
Applies to quantitative (numerical) variables.

Objective: identify a small number of factors that explain the relations between the variables.

Example: the sales values of different products may be explained by common factors such as quality, utility, etc.

Factorial Analysis identifies underlying factors, which cannot be directly observed.
Factorial Analysis
The observed correlations between the variables are thus due to the fact that they "share" these factors.

Analysis of the correlation matrix:
the factorial model only makes sense if the variables are indeed correlated;
if the correlations are very low, it is unlikely that the variables share common factors.
Factorial Analysis
In general, the model is written as:

Yj = aj1 F1 + aj2 F2 + ... + ajk Fk + Uj

F1, F2, ..., Fk - common factors
Uj - specific factor
aj1, aj2, ..., ajk - loadings
Factorial Analysis
It is assumed that:
a) the observed variables Yj, the common factors and the specific factors have zero mean;
b) the specific factors are uncorrelated among themselves and uncorrelated with the common factors.

Orthogonal model:
c) the common factors are uncorrelated among themselves and have unit variance.
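A minimal sketch (illustrative only, not the slides' procedure) of fitting such an orthogonal factor model with scikit-learn, on randomly generated data:

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate data that follows the model Y_j = a_j1 F_1 + ... + a_jk F_k + U_j.
rng = np.random.default_rng(3)
n, p, k = 200, 6, 2
F = rng.normal(size=(n, k))                  # unobserved common factors
A = rng.normal(size=(p, k))                  # true loadings
Y = F @ A.T + 0.5 * rng.normal(size=(n, p))  # observed variables = common part + specific part

fa = FactorAnalysis(n_components=k).fit(Y)
print("estimated loadings (a_jk), one row per variable:")
print(fa.components_.T.round(2))
print("specific variances Var(U_j):", fa.noise_variance_.round(2))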
Factorial Analysis
Example (Sharma):
Consider students' marks in 6 subjects:
Mathematics, Physics, Chemistry, English, History and French.

Each mark may be written as a function of:
- the student's overall intelligence/ability - common factor
- the opposition between quantitative ability and verbal ability - common factor
- the aptitude for the subject - specific factor
Example - cont.
Correlation matrix between marks (given):

      M      P      C      E      H      F
M     1
P     0.62   1
C     0.54   0.51   1
E     0.32   0.38   0.36   1
H     0.284  0.351  0.336  0.686  1
F     0.37   0.43   0.405  0.73   0.735  1
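As a sketch, two factors can be extracted from this correlation matrix by the principal-component method (loading = eigenvector x sqrt(eigenvalue)); depending on the extraction method and sign conventions used, the values will not match the slide's loadings exactly:

import numpy as np

# Correlation matrix from the slide (lower triangle mirrored to a full matrix).
R = np.array([
    [1.000, 0.620, 0.540, 0.320, 0.284, 0.370],
    [0.620, 1.000, 0.510, 0.380, 0.351, 0.430],
    [0.540, 0.510, 1.000, 0.360, 0.336, 0.405],
    [0.320, 0.380, 0.360, 1.000, 0.686, 0.730],
    [0.284, 0.351, 0.336, 0.686, 1.000, 0.735],
    [0.370, 0.430, 0.405, 0.730, 0.735, 1.000],
])

eigvals, eigvecs = np.linalg.eigh(R)               # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                  # sort them in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs[:, :2] * np.sqrt(eigvals[:2])   # loadings on F1 and F2
for subject, (a1, a2) in zip(["M", "P", "C", "E", "H", "F"], loadings):
    print(f"{subject}: {a1:+.3f} F1 {a2:+.3f} F2")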
Factorial Analysis
Mathematics = 0.675 F1 + 0.557 F2 + Ap_M
Physics     = 0.717 F1 + 0.447 F2 + Ap_P
Chemistry   = 0.683 F1 + 0.418 F2 + Ap_C
English     = 0.793 F1 - 0.410 F2 + Ap_E
History     = 0.774 F1 - 0.461 F2 + Ap_H
French      = 0.837 F1 - 0.359 F2 + Ap_F

(Ap_j denotes the specific "aptitude" factor of each subject.)
Factorial Analysis
The correlations between the observed variables
and the common factors (standardized principal
components) are given by the pattern loadings.
→ Interpretation of the factors
Factorial Analysis
Methods for factor extraction:
Principal Components
Principal Axis Factoring
Unweighted Least Squares
Generalized Least Squares
Maximum Likelihood
Alpha Factoring
Image Factoring
Principal Component Analysis
Principal components:
New variables;
Linear combinations of the original variables, uncorrelated with one another, and with maximal variance;
They are obtained from the eigenvectors of the correlation matrix associated with the largest eigenvalues.
Principal Component Analysis
If an important part of the dispersion is explained by a small number of principal components, then we may use just those components for interpretation and further analysis, instead of the original p variables.

How many components should be kept?
What percentage of the dispersion are we willing to sacrifice?
How much of it is just "noise"?
Principal Component Analysis
1) Pearson's criterion:
Keep a number q of components that together explain at least 80% of the total dispersion.

2) Scree plot ("elbow rule"):
Plot the eigenvalues in decreasing order and keep the components up to the point where the differences between successive eigenvalues become small.

3) Kaiser's criterion:
Keep only the components with eigenvalue above 1, i.e., the principal components that are "more informative" than the original standardized variables, whose variance is 1.

Criteria 1) and 3) are applied numerically in the sketch below.
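A minimal sketch (illustrative data) of principal components on standardized variables, with the 80% rule and Kaiser's criterion applied to the eigenvalues:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated variables

Z = StandardScaler().fit_transform(X)        # standardize: work with the correlation structure
pca = PCA().fit(Z)

eigenvalues = pca.explained_variance_                 # variances of the components
cum_share = np.cumsum(pca.explained_variance_ratio_)  # cumulative explained dispersion

q_pearson = int(np.searchsorted(cum_share, 0.80) + 1) # smallest q reaching 80%
q_kaiser = int((eigenvalues > 1).sum())               # Kaiser: eigenvalues above 1

print("eigenvalues:", eigenvalues.round(2))
print("components kept by the 80% rule:", q_pearson)
print("components kept by Kaiser's criterion:", q_kaiser)
scores = pca.transform(Z)[:, :q_pearson]              # component scores for further analysis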
Factorial Analysis with Qualitative Variables
Specific methods for qualitative variables
• Multiple Correspondence Analysis
• CatPCA
CLUSTER ANALYSIS
Marketing:
Potential clients: socio-economic characteristics, preferences
→ Identification of market segments

Finance:
Companies: financial indicators
→ Typology of companies?
CLUSTER ANALYSIS
Applies to elements described by numerical or binary variables (not both simultaneously).

Objective:
Given n objects described by p variables, e.g.
  Potential clients - socio-economic characteristics, past expenses
  Companies - financial indicators
  Cities - social structure, facilities
  ...
determine a CLUSTERING, i.e., structure the objects into classes.
CLUSTER ANALYSIS
The objective is to group the objects into classes such that:
- elements of a given class are quite similar to each other - homogeneous classes;
- classes are "relatively distinct" from each other - well separated classes.
Clustering Models
Partition
Disjoint classes which together cover the whole set to
be clustered
Clustering Models
Hierarchical Models
Classes are organized in a nested structure
Comparing elements
A comparison measure between pairs of elements of the set to be clustered must be selected.

Examples of measures for numerical data (see the sketch below):
- Euclidean distance
- Manhattan, or City-Block, distance
- Mahalanobis distance
- ...
Consider standardizing the variables first.

Many measures exist for binary variables.
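A minimal sketch (using three towns from the earlier example) of pairwise distances on standardized data:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler

X = np.array([[47.07, 36, 81],     # Aveiro
              [10.35, 12, 52],     # Beja
              [46.81, 50, 125]])   # Braga

Z = StandardScaler().fit_transform(X)   # variables on very different scales: standardize first

d_euclid = squareform(pdist(Z, metric="euclidean"))
d_city = squareform(pdist(Z, metric="cityblock"))   # Manhattan / City-Block

print("Euclidean distances:\n", d_euclid.round(2))
print("City-Block distances:\n", d_city.round(2))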
Hierarchical Clustering
Hierarchical model:
Set of nested partitions
Dendrogram
Example
CITY             BUYING POWER   BASKET    RENT      HOTEL NIGHT   TAXI     AVERAGE INCOME
Amsterdam        78.00          1339.00   520.00    10.58         286.00   16486.00
Caracas          14.30          795.00    210.00    2.96          148.00   1910.00
Chicago          99.70          1474.00   900.00    5.00          218.00   25129.00
Helsinki         54.80          1597.00   570.00    7.56          194.00   13463.00
Houston          96.30          1314.00   430.00    6.00          149.00   21997.00
Jakarta          18.10          1035.00   980.00    1.42          245.00   3253.00
London           59.90          1354.00   810.00    7.16          375.00   13348.00
Luxembourg       114.00         1371.00   1080.00   8.92          227.00   24564.00
Rio de Janeiro   22.20          1067.00   450.00    2.58          194.00   3900.00
Zurich           100.00         1946.00   740.00    14.36         287.00   32420.00
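A minimal sketch of hierarchically clustering these ten cities (the distance, linkage and standardization choices here are ours, so the two classes obtained may or may not coincide exactly with the grouping on the next slide):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.preprocessing import StandardScaler

cities = ["Amsterdam", "Caracas", "Chicago", "Helsinki", "Houston",
          "Jakarta", "London", "Luxembourg", "Rio de Janeiro", "Zurich"]
X = np.array([
    [78.00, 1339.00, 520.00, 10.58, 286.00, 16486.00],
    [14.30, 795.00, 210.00, 2.96, 148.00, 1910.00],
    [99.70, 1474.00, 900.00, 5.00, 218.00, 25129.00],
    [54.80, 1597.00, 570.00, 7.56, 194.00, 13463.00],
    [96.30, 1314.00, 430.00, 6.00, 149.00, 21997.00],
    [18.10, 1035.00, 980.00, 1.42, 245.00, 3253.00],
    [59.90, 1354.00, 810.00, 7.16, 375.00, 13348.00],
    [114.00, 1371.00, 1080.00, 8.92, 227.00, 24564.00],
    [22.20, 1067.00, 450.00, 2.58, 194.00, 3900.00],
    [100.00, 1946.00, 740.00, 14.36, 287.00, 32420.00],
])

Z = StandardScaler().fit_transform(X)               # variables have very different scales
link = linkage(Z, method="ward")                    # nested partitions (the dendrogram)
labels = fcluster(link, t=2, criterion="maxclust")  # cut the tree into 2 classes

for city, label in zip(cities, labels):
    print(label, city)
# dendrogram(link, labels=cities) would draw the tree (requires matplotlib).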
Example
Class 1: Amsterdam, Chicago, Helsinki, Houston, Luxembourg, Zurich
Class 2: Caracas, Jakarta, London, Rio de Janeiro

Class means:
CLASS     BUYING POWER   BASKET    RENT     HOTEL NIGHT   TAXI     AVERAGE INCOME
class 1   90.47          1506.83   706.67   8.74          226.83   22343.17
class 2   28.63          1062.75   612.50   3.53          240.50   5602.75
Non-Hierarchical Clustering
Objective:
Determine (directly) a partition P = {C1, ..., Ck}, i.e., a family of k classes that do not intersect and that jointly cover the whole set Ω:
Ci ∩ Cj = ∅ for i ≠ j, and C1 ∪ ... ∪ Ck = Ω.
K-Means method
Fix the number of clusters, k.
Starting from a set of k initial centers (elements of the set Ω to be clustered), assign each element to the class with the nearest center.
After each assignment, the cluster center is re-computed.
After all elements have been assigned, the procedure may be iterated.
Also known as the moving-centers method.
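A minimal k-means sketch with scikit-learn, on artificial data (scikit-learn implements the standard Lloyd-type k-means, which is closely related to the moving-centers procedure described above):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=5.0, size=(50, 2))])   # two artificial groups

Z = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)   # k fixed in advance

print("first cluster labels:", km.labels_[:10])
print("cluster centers (standardized units):")
print(km.cluster_centers_.round(2))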
Hierarchical vs Non-Hierarchical Clustering

Hierarchical                               Non-hierarchical
Series of nested "solutions"               One single solution
No need to fix the number of clusters      The number of clusters must be fixed
Solution is not revised once built         Solution is iteratively optimized
Computationally very heavy                 Computationally "lighter": fewer calculations and comparisons
Not indicated for large datasets           Indicated for large datasets
Combining Factorial Analysis and Clustering
Determine the principal components;
Select the relevant ones;
Cluster the data using the values of the principal components (or factor scores) instead of the original data (see the sketch below).
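A minimal sketch of this pipeline (illustrative data): keep enough components for 80% of the dispersion, then cluster the component scores:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))   # correlated quantitative variables

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.80),                 # keep enough components for 80% of the dispersion
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)                # cluster the component scores, not the raw data
print("components kept:", pipe.named_steps["pca"].n_components_)
print("first cluster labels:", labels[:10])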
Clustering with qualitative data
Do not apply a clustering method directly to the raw categories!
Perform a Multiple Correspondence Analysis;
Select the relevant dimensions;
Cluster the data using the values of those dimensions (the factor scores) instead of the original data (see the sketch below).
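A minimal sketch of this idea, with MCA computed by hand as a correspondence analysis of the indicator (dummy) matrix; the data frame and its columns are hypothetical, and dedicated packages (e.g. prince) provide MCA directly:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def mca_row_coordinates(df, n_components=2):
    """Row (individual) coordinates of an MCA, via SVD of the standardized indicator matrix."""
    Z = pd.get_dummies(df).to_numpy(dtype=float)         # indicator (disjunctive) matrix
    P = Z / Z.sum()                                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    F = (U * s) / np.sqrt(r)[:, None]                    # principal row coordinates
    return F[:, :n_components]

# Hypothetical qualitative data: each column is a categorical variable.
df = pd.DataFrame({
    "education": ["low", "high", "medium", "high", "low", "medium"],
    "job":       ["A", "B", "A", "C", "C", "B"],
    "region":    ["north", "south", "north", "south", "north", "south"],
})

scores = mca_row_coordinates(df, n_components=2)              # factor scores
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print("cluster labels:", labels)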