Statistics Toolbox™
User’s Guide
How to Contact MathWorks
www.mathworks.com Web
comp.soft-sys.matlab Newsgroup
www.mathworks.com/contact_TS.html Technical Support
508-647-7000 (Phone)
508-647-7001 (Fax)
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
Revision History
September 1993 First printing Version 1.0
March 1996 Second printing Version 2.0
January 1997 Third printing Version 2.11
November 2000 Fourth printing Revised for Version 3.0 (Release 12)
May 2001 Fifth printing Minor revisions
July 2002 Sixth printing Revised for Version 4.0 (Release 13)
February 2003 Online only Revised for Version 4.1 (Release 13.0.1)
June 2004 Seventh printing Revised for Version 5.0 (Release 14)
October 2004 Online only Revised for Version 5.0.1 (Release 14SP1)
March 2005 Online only Revised for Version 5.0.2 (Release 14SP2)
September 2005 Online only Revised for Version 5.1 (Release 14SP3)
March 2006 Online only Revised for Version 5.2 (Release 2006a)
September 2006 Online only Revised for Version 5.3 (Release 2006b)
March 2007 Eighth printing Revised for Version 6.0 (Release 2007a)
September 2007 Ninth printing Revised for Version 6.1 (Release 2007b)
March 2008 Online only Revised for Version 6.2 (Release 2008a)
October 2008 Online only Revised for Version 7.0 (Release 2008b)
March 2009 Online only Revised for Version 7.1 (Release 2009a)
September 2009 Online only Revised for Version 7.2 (Release 2009b)
March 2010 Online only Revised for Version 7.3 (Release 2010a)
September 2010 Online only Revised for Version 7.4 (Release 2010b)
April 2011 Online only Revised for Version 7.5 (Release 2011a)
Contents

1  Getting Started
     Product Overview  1-2

2  Organizing Data
     Introduction  2-2

3  Descriptive Statistics
     Introduction  3-2
     Measures of Central Tendency  3-3

4  Statistical Visualization
     Introduction  4-2

5  Probability Distributions
     Using Probability Distributions  5-2
     Supported Distributions  5-3
       Parametric Distributions  5-4
       Nonparametric Distributions  5-8

6
     Common Generation Methods  6-5
       Direct Methods  6-5
       Inversion Methods  6-7
       Acceptance-Rejection Methods  6-9

7  Hypothesis Tests
     Introduction  7-2

8  Analysis of Variance
     Introduction  8-2
     ANOVA  8-3
       One-Way ANOVA  8-3
       Two-Way ANOVA  8-9
       N-Way ANOVA  8-12
       Other ANOVA Models  8-26
     Analysis of Covariance  8-27
     Nonparametric Methods  8-35
     MANOVA  8-39
       Introduction  8-39
       ANOVA with Multiple Responses  8-39

10  Multivariate Methods
     Introduction  10-2

11  Cluster Analysis
     Introduction  11-2
     Dendrograms  11-8
     Verifying the Cluster Tree  11-10
     Creating Clusters  11-16

12  Parametric Classification
     Introduction  12-2

13  Supervised Learning
     Supervised Learning (Machine Learning) Workflow and Algorithms  13-2
       Steps in Supervised Learning (Machine Learning)  13-2
       Characteristics of Algorithms  13-6
     Bibliography  13-130

14  Markov Models
     Introduction  14-2

15  Design of Experiments
     Introduction  15-2

16  Statistical Process Control
     Introduction  16-2

17  Parallel Statistics
     Quick Start Parallel Computing for Statistics Toolbox  17-2
       What Is Parallel Statistics Functionality?  17-2
       How To Compute in Parallel  17-3
       Example: Parallel Treebagger  17-5
     Subtleties in Parallel Statistical Computation Using Random Numbers  17-14

18  Function Reference
     File I/O  18-2
     Probability Distributions  18-15
       Distribution Objects  18-15
       Distribution Plots  18-16
       Probability Density  18-17
       Cumulative Distribution  18-19
       Inverse Cumulative Distribution  18-21
       Distribution Statistics  18-23
       Distribution Fitting  18-24
       Negative Log-Likelihood  18-26
       Random Number Generators  18-26
       Quasi-Random Numbers  18-28
       Piecewise Distributions  18-29
     Discriminant Analysis  18-40
     Naive Bayes Classification  18-40
     Distance Computation and Nearest Neighbor Search  18-41
     GUIs  18-59
     Utilities  18-60

19  Class Reference
     Data Organization  19-2
       Categorical Arrays  19-2
       Dataset Arrays  19-2
     Distribution Objects  19-3
     Quasi-Random Numbers  19-3
     Piecewise Distributions  19-4

A  Data Sets

B  Distribution Reference
     Bernoulli Distribution  B-3
       Definition of the Bernoulli Distribution  B-3
       See Also  B-3
     Beta Distribution  B-4
       Definition  B-4
       Background  B-4
       Parameters  B-5
       Example  B-6
       See Also  B-6
     Copulas  B-14
       Example  B-22
       See Also  B-24
     F Distribution  B-25
       Definition  B-25
       Background  B-25
       Example  B-26
       See Also  B-26
     Hypergeometric Distribution  B-43
       Definition  B-43
       Background  B-43
       Example  B-44
       See Also  B-44
     Multivariate Gaussian Distribution  B-57

C  Bibliography

Index
1 Getting Started
Product Overview
Statistics Toolbox™ software extends MATLAB® to support a wide range of
common statistical tasks. The toolbox contains two categories of tools:
building-block statistical functions, and interactive graphical tools built on
those functions.
Code for the building-block functions is open and extensible. Use the MATLAB
Editor to review, copy, and edit code for any function. Extend the toolbox by
copying code to new files or by writing files that call toolbox functions.
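For example, a minimal sketch of such a file (the function name and the
particular statistics it collects are illustrative, not part of the product):

function s = describesample(x)
%DESCRIBESAMPLE Summarize a data vector (illustrative example).
%   S = DESCRIBESAMPLE(X) returns basic location and spread statistics
%   for the vector X, ignoring NaN values, by calling toolbox functions.
s.mean   = nanmean(x);    % average, ignoring missing values
s.median = nanmedian(x);  % robust measure of location
s.iqr    = iqr(x);        % robust measure of spread
s.std    = nanstd(x);     % standard deviation, ignoring missing values
end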
2 Organizing Data
Introduction
MATLAB data is placed into “data containers” in the form of workspace
variables. All workspace variables organize data into some form of array. For
statistical purposes, arrays are viewed as tables of values.
Data types determine the kind of data variables contain. (See “Classes (Data
Types)” in the MATLAB documentation.)
These variables are not specifically designed for statistical data, however.
Statistical data generally involves observations of multiple variables, with
measurements of heterogeneous type and size. Data may be numerical (of
type single or double), categorical, or in the form of descriptive metadata.
Fitting statistical data into basic MATLAB variables, and accessing it
efficiently, can be cumbersome.
MATLAB Arrays
In this section...
“Numerical Data” on page 2-4
“Heterogeneous Data” on page 2-7
“Statistical Functions” on page 2-9
Numerical Data
MATLAB two-dimensional numerical arrays (matrices) containing statistical
data use rows to represent observations and columns to represent measured
variables. For example,

load fisheriris

loads the variables meas and species into the MATLAB workspace. The meas
variable is a 150-by-4 numerical matrix, representing 150 observations of 4
different measured variables (by column: sepal length, sepal width, petal
length, and petal width, respectively).
The observations for each species can be extracted from meas using the
species variable. For example, the following extracts the data for iris setosa:
setosa_indices = strcmp('setosa',species);
setosa = meas(setosa_indices,:);
To access and display the first five observations in the setosa data, use row,
column parenthesis indexing:
SetosaObs = setosa(1:5,:)
SetosaObs =
5.1000 3.5000 1.4000 0.2000
4.9000 3.0000 1.4000 0.2000
4.7000 3.2000 1.3000 0.2000
4.6000 3.1000 1.5000 0.2000
5.0000 3.6000 1.4000 0.2000
The data are organized into a table with implicit column headers “Sepal
Length,” “Sepal Width,” “Petal Length,” and “Petal Width.” Implicit row
headers are “Observation 1,” “Observation 2,” “Observation 3,” etc.
Similarly, 50 observations for iris versicolor and iris virginica can be extracted
from the meas container variable:
versicolor_indices = strcmp('versicolor',species);
versicolor = meas(versicolor_indices,:);
virginica_indices = strcmp('virginica',species);
virginica = meas(virginica_indices,:);
Because the data sets for the three species happen to be of the same size, they
can be reorganized into a single 50-by-4-by-3 multidimensional array:
iris = cat(3,setosa,versicolor,virginica);
The iris array is a three-layer table with the same implicit row and column
headers as the setosa, versicolor, and virginica arrays. The implicit layer
names, along the third dimension, are “Setosa,” “Versicolor,” and “Virginica.”
The utility of such a multidimensional organization depends on assigning
meaningful properties of the data to each dimension.
For example, the following extracts and displays the first five setosa sepal
lengths from the iris array:
SetosaSL = iris(1:5,1,1)
SetosaSL =
5.1000
4.9000
4.7000
4.6000
5.0000
Heterogeneous Data
MATLAB data types include two container variables—cell arrays and
structure arrays—that allow you to combine metadata with variables of
different types and sizes.
For example, the following stores the iris data in a cell array together with
row and column labels:

iris1 = cell(51,5,3); % Container variable
obsnames = strcat({'Obs'},num2str((1:50)','%-d'));
iris1(2:end,1,:) = repmat(obsnames,[1 1 3]);
varnames = {'SepalLength','SepalWidth',...
'PetalLength','PetalWidth'};
iris1(1,2:end,:) = repmat(varnames,[1 1 3]);
iris1(2:end,2:end,1) = num2cell(setosa);
iris1(2:end,2:end,2) = num2cell(versicolor);
iris1(2:end,2:end,3) = num2cell(virginica);
iris1{1,1,1} = 'Setosa';
iris1{1,1,2} = 'Versicolor';
iris1{1,1,3} = 'Virginica';
To access and display the cells, use parenthesis indexing. The following
displays the first five observations in the setosa sepal data:
SetosaSLSW = iris1(1:6,1:3,1)
SetosaSLSW =
'Setosa' 'SepalLength' 'SepalWidth'
'Obs1' [ 5.1000] [ 3.5000]
'Obs2' [ 4.9000] [ 3]
'Obs3' [ 4.7000] [ 3.2000]
'Obs4' [ 4.6000] [ 3.1000]
'Obs5' [ 5] [ 3.6000]
Here, the row and column headers have been explicitly labeled with metadata.
To extract the data subset, use row, column curly brace indexing:
subset = reshape([iris1{2:6,2:3,1}],5,2)
subset =
5.1000 3.5000
4.9000 3.0000
4.7000 3.2000
4.6000 3.1000
5.0000 3.6000
While cell arrays are useful for organizing heterogeneous data, they may
be cumbersome when it comes to manipulating and analyzing the data.
MATLAB and Statistics Toolbox statistical functions do not accept data in the
form of cell arrays. For processing, data must be extracted from the cell array
to a numerical container variable, as in the preceding example. The indexing
can become complicated for large, heterogeneous data sets. This limitation of
cell arrays is addressed by dataset arrays (see “Dataset Arrays” on page 2-23),
which are designed to store general statistical data and provide easy access.
The data in the preceding example can also be organized in a structure array,
as follows:
iris2.data = cat(3,setosa,versicolor,virginica);
iris2.varnames = {'SepalLength','SepalWidth',...
'PetalLength','PetalWidth'};
iris2.obsnames = strcat({'Obs'},num2str((1:50)','%-d'));
iris2.species = {'setosa','versicolor','virginica'};
The data subset is then returned using a combination of dot and parenthesis
indexing:
subset = iris2.data(1:5,1:2,1)
subset =
5.1000 3.5000
4.9000 3.0000
4.7000 3.2000
4.6000 3.1000
5.0000 3.6000
For statistical data, structure arrays have many of the same limitations as
cell arrays. Once again, dataset arrays (see “Dataset Arrays” on page 2-23),
designed specifically for general statistical data, address these limitations.
Statistical Functions
One of the advantages of working in the MATLAB language is that functions
operate on entire arrays of data, not just on single scalar values. The
functions are said to be vectorized. Vectorization allows for both efficient
problem formulation, using array-based data, and efficient computation,
using vectorized statistical functions.
std(setosa)
ans =
0.3525 0.3791 0.1737 0.1054
The four standard deviations are for measurements of sepal length, sepal
width, petal length, and petal width, respectively.
Compare this to
std(setosa(:))
ans =
1.8483
which gives the standard deviation across the entire array (all measurements).
sin(setosa)
This operation returns a 50-by-4 array the same size as setosa. The sin
function is vectorized in a different way than the std function, computing one
scalar value for each element in the array.
Statistical Arrays
In this section...
“Introduction” on page 2-11
“Categorical Arrays” on page 2-13
“Dataset Arrays” on page 2-23
Introduction
As discussed in “MATLAB Arrays” on page 2-4, MATLAB data types include
arrays for numerical, logical, and character data, as well as cell and structure
arrays for heterogeneous collections of data.

Statistics Toolbox software offers two additional types of arrays specifically
designed for statistical data: categorical arrays and dataset arrays.

Categorical arrays store data with values in a discrete set of levels. Each level
is meant to capture a single, defining characteristic of an observation. If no
ordering is encoded in the levels, the data and the array are nominal. If an
ordering is encoded, the data and the array are ordinal.
Categorical arrays also store labels for the levels. Nominal labels typically
suggest the type of an observation, while ordinal labels suggest the position
or rank.
Both categorical and dataset arrays have associated methods for assembling,
accessing, manipulating, and processing the collected data. Basic array
operations parallel those for numerical, cell, and structure arrays.
Categorical Arrays
• “Categorical Data” on page 2-13
• “Categorical Arrays” on page 2-14
• “Using Categorical Arrays” on page 2-16
Categorical Data
Categorical data take on values from only a finite, discrete set of categories
or levels. Levels may be determined before the data are collected, based on
the application, or they may be determined by the distinct values in the data
when converting them to categorical form. Predetermined levels, such as a
set of states or numerical intervals, are independent of the data they contain.
Any number of values in the data may attain a given level, or no data at all.
Categorical data show which measured values share common levels, and
which do not.
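For example, a sketch of predetermined numerical intervals used as levels (the
data values, labels, and interval edges below are illustrative assumptions):

ages = [3 25 41 67 12 58];  % hypothetical data
agegroup = ordinal(ages,{'child','adult','senior'},[],[0 18 65 120]);
getlabels(agegroup)

Here the three levels are fixed by the interval edges, independent of which
ages happen to appear in the data.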
Categorical Arrays
Categorical data can be represented using MATLAB integer arrays, but
this method has a number of drawbacks. First, it removes all of the useful
metadata that might be captured in labels for the levels. Labels must be
stored separately, in character arrays or cell arrays of strings. Second, this
method suggests that values stored in the integer array have their usual
numeric meaning, which, for categorical data, they may not. Finally, integer
types have a fixed set of levels (for example, -128:127 for all int8 arrays),
which cannot be changed.
load fisheriris
ndata = nominal(species,{'A','B','C'});
creates a nominal array with levels A, B, and C from the species data in
fisheriris.mat, while
odata = ordinal(ndata,{},{'C','A','B'});
encodes an ordering of the levels with C < A < B. See “Using Categorical
Arrays” on page 2-16, and the reference pages for nominal and ordinal, for
further examples.
Categorical arrays belong to one of two classes, the nominal class and the
ordinal class. Use the corresponding constructors,
nominal or ordinal, to create categorical arrays. Methods of the classes are
used to display, summarize, convert, concatenate, and access the collected
data. Many of these methods are invoked using operations analogous to those
for numerical arrays, and do not need to be called directly (for example, []
invokes horzcat). Other methods, such as reorderlevels, must be called
directly.
Using Categorical Arrays
1 Load the 150-by-4 numerical array meas and the 150-by-1 cell array of
strings species:

load fisheriris
The data are 150 observations of four measured variables (by column
number: sepal length, sepal width, petal length, and petal width,
respectively) over three species of iris (setosa, versicolor, and virginica).
2 Create a nominal array from species:

n1 = nominal(species);
3 Open species and n1 side by side in the Variable Editor (see “Viewing and
Editing Workspace Variables with the Variable Editor” in the MATLAB
documentation). Note that the string information in species has been
converted to categorical form, leaving only information on which data share
the same values, indicated by the labels for the levels.
By default, levels are labeled with the distinct values in the data (in this
case, the strings in species). Give alternate labels with additional input
arguments to the nominal constructor:
n2 = nominal(species,{'species1','species2','species3'});
4 Open n2 in the Variable Editor, and compare it with species and n1. The
levels have been relabeled.
5 Use ordinal to cast n2 as an ordinal array, encoding the level ordering
species1 < species3 < species2:
o1 = ordinal(n2,{},{'species1','species3','species2'});
The second input argument to ordinal is the same as for nominal—a list
of labels for the levels in the data. If it is unspecified, as above, the labels
are inherited from the data, in this case n2. The third input argument of
ordinal indicates the ordering of the levels, in ascending order.
6 When displayed side by side in the Variable Editor, o1 does not appear any
different than n2. This is because the data in o1 have not been sorted. It
is important to recognize the difference between the ordering of the levels
in an ordinal array and sorting the actual data according to that ordering.
Use sort to sort ordinal data in ascending order:
o2 = sort(o1);
When displayed in the Variable Editor, o2 shows the data sorted by diploid
chromosome count.
7 To find which elements moved up in the sort, use the < operator for ordinal
arrays:

moved_up = (o1 < o2);
8 Use getlabels to display the labels for the levels in ascending order:
labels2 = getlabels(o2)
labels2 =
'species1' 'species3' 'species2'
9 The sort function reorders the display of the data, but not the order of the
levels. To reorder the levels, use reorderlevels:
o3 = reorderlevels(o2,labels2([1 3 2]));
labels3 = getlabels(o3)
labels3 =
'species1' 'species2' 'species3'
o4 = sort(o3);
These operations return the levels in the data to their original ordering, by
species number, and then sort the data for display purposes.
low50 = o4(1:50);
Suppose you want to categorize the data in o4 with only two levels: low (the
data in low50) and high (the rest of the data). One way to do this is to use an
assignment with parenthesis indexing on the left-hand side:
o5 = o4; % Copy o4
o5(1:50) = 'low';
Warning: Categorical level 'low' being added.
o5(51:end) = 'high';
Warning: Categorical level 'high' being added.
Note the warnings: the assignments move data to new levels. The old levels,
though empty, remain:
getlabels(o5)
ans =
'species1' 'species2' 'species3' 'low' 'high'
o5 = droplevels(o5,{'species1','species2','species3'});
Alternatively, use mergelevels to merge the original levels into the new
levels directly:
o5 = mergelevels(o4,{'species1'},'low');
o5 = mergelevels(o5,{'species2','species3'},'high');
getlabels(o5)
ans =
'low' 'high'
The merged levels are removed and replaced with the new levels.
When you combine categorical arrays, the following rules apply:

• Only categorical arrays of the same type can be combined. You cannot
concatenate a nominal array with an ordinal array.
• Only ordinal arrays with the same levels, in the same order, can be
combined.
• Nominal arrays with different levels can be combined to produce a nominal
array whose levels are the union of the levels in the component arrays.
First use ordinal to create ordinal arrays from the variables for sepal length
and sepal width in meas. Categorize the data as short or long depending on
whether they are below or above the median of the variable, respectively:

sl = meas(:,1); % Sepal lengths
sw = meas(:,2); % Sepal widths
SL1 = ordinal(sl,{'short','long'},[],...
[min(sl),median(sl),max(sl)]);
SW1 = ordinal(sw,{'short','long'},[],...
[min(sw),median(sw),max(sw)]);
Because SL1 and SW1 are ordinal arrays with the same levels, in the same
order, they can be concatenated:
S1 = [SL1,SW1];
S1(1:10,:)
ans =
short long
short long
short long
short long
short long
short long
short long
short long
short short
short long
If, on the other hand, the measurements are cast as nominal, different levels
can be used for the different variables, and the two nominal arrays can still
be combined:
SL2 = nominal(sl,{'short','long'},[],...
[min(sl),median(sl),max(sl)]);
SW2 = nominal(sw,{'skinny','wide'},[],...
[min(sw),median(sw),max(sw)]);
S2 = [SL2,SW2];
getlabels(S2)
ans =
'short' 'long' 'skinny' 'wide'
S2(1:10,:)
ans =
short wide
short wide
short wide
short wide
short wide
short wide
short wide
short wide
short skinny
short wide
Categorical arrays are also used for indexing. For example, the following uses
ismember to find the observations for the setosa species:
SetosaObs = ismember(n1,'setosa');
Since the code above compares elements of n1 to a single value, the same
operation is carried out by the equality operator:

SetosaObs = (n1 == 'setosa');
The SetosaObs variable is used to index into meas to extract only the setosa
data:
SetosaData = meas(SetosaObs,:);
Categorical arrays are also used as grouping variables. The following plot
summarizes the sepal length data in meas by category:
boxplot(sl,n1)
Dataset Arrays
• “Statistical Data” on page 2-23
• “Dataset Arrays” on page 2-24
• “Using Dataset Arrays” on page 2-25
Statistical Data
MATLAB data containers (variables) are suitable for completely homogeneous
data (numeric, character, and logical arrays) and for completely heterogeneous
data (cell and structure arrays). Statistical data, however, are often a mixture
of homogeneous variables of heterogeneous types and sizes. Dataset arrays
are suitable containers for this kind of data.
Dataset Arrays
Dataset arrays are variables created with dataset. For example, the
following creates a dataset array from observations that are a combination of
categorical and numerical measurements:
load fisheriris
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
iris(1:5,:)
ans =
species SL SW PL PW
Obs1 setosa 5.1 3.5 1.4 0.2
Obs2 setosa 4.9 3 1.4 0.2
Obs3 setosa 4.7 3.2 1.3 0.2
Obs4 setosa 4.6 3.1 1.5 0.2
Obs5 setosa 5 3.6 1.4 0.2
When creating a dataset array, variable names and observation names can be
assigned together with the data. Other metadata associated with the array
can be assigned with set and accessed with get:
Using Dataset Arrays
Constructing Dataset Arrays. Load the 150-by-4 numerical array meas and
the 150-by-1 cell array of strings species:

load fisheriris
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively) over
three species of iris (setosa, versicolor, and virginica).
Use dataset to create a dataset array iris from the data, assigning variable
names species, SL, SW, PL, and PW and observation names Obs1, Obs2, Obs3,
etc.:
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
iris(1:5,:)
ans =
species SL SW PL PW
Obs1 setosa 5.1 3.5 1.4 0.2
Obs2 setosa 4.9 3 1.4 0.2
Obs3 setosa 4.7 3.2 1.3 0.2
Obs4 setosa 4.6 3.1 1.5 0.2
Obs5 setosa 5 3.6 1.4 0.2
Assign a description, units, and source information to the array metadata:

desc = 'Fisher''s iris data (1936)';
units = {'' 'cm' 'cm' 'cm' 'cm'};
info = 'http://en.wikipedia.org/wiki/R.A._Fisher';
iris = set(iris,'Description',desc,...
'Units',units,...
'UserData',info);
get(iris)
Description: 'Fisher's iris data (1936)'
Units: {'' 'cm' 'cm' 'cm' 'cm'}
DimNames: {'Observations' 'Variables'}
UserData: 'http://en.wikipedia.org/wiki/R.A._Fisher'
ObsNames: {150x1 cell}
VarNames: {'species' 'SL' 'SW' 'PL' 'PW'}
get(iris(1:5,:),'ObsNames')
ans =
'Obs1'
'Obs2'
'Obs3'
'Obs4'
'Obs5'
Use parenthesis indexing to access a subset of the observations and variables
in a dataset array:
iris1 = iris(1:5,2:3)
iris1 =
SL SW
Obs1 5.1 3.5
Obs2 4.9 3
Obs3 4.7 3.2
Obs4 4.6 3.1
Obs5 5 3.6
Similarly, use parenthesis indexing to assign new data to the first variable
in iris1:
iris1(:,1) = dataset([5.2;4.9;4.6;4.6;5])
iris1 =
SL SW
Obs1 5.2 3.5
Obs2 4.9 3
Obs3 4.6 3.2
Obs4 4.6 3.1
Obs5 5 3.6
SepalObs = iris1({'Obs1','Obs3','Obs5'},'SL')
SepalObs =
SL
Obs1 5.2
Obs3 4.6
Obs5 5
The following code extracts the sepal lengths in iris1 corresponding to sepal
widths greater than 3:

iris1.SL(iris1.SW > 3)
Dot indexing also allows entire variables to be deleted from a dataset array:
iris1.SL = []
iris1 =
SW
Obs 1 3.5
Obs 2 3
Obs 3 3.2
Obs 4 3.1
Obs 5 3.6
Dynamic variable naming works for dataset arrays just as it does for structure
arrays. For example, the units of the SW variable are changed in iris1 as
follows:
varname = 'SW';
iris1.(varname) = iris1.(varname)*10
iris1 =
SW
Obs1 35
Obs2 30
Obs3 32
Obs4 31
Obs5 36
iris1 = set(iris1,'Units',{'mm'});
Curly brace indexing is used to access individual data elements. The following
are equivalent:
iris1{1,1}
ans =
35
iris1{'Obs1','SW'}
ans =
35
Dataset arrays are combined using square brackets, like other MATLAB arrays.
For example, the following concatenates two subsets of the variables in iris:

SepalData = iris(:,{'SL','SW'});
PetalData = iris(:,{'PL','PW'});
newiris = [SepalData,PetalData];
size(newiris)
ans =
150 4
The following concatenates variables within a dataset array and then deletes
the component variables:
newiris.SepalData = [newiris.SL,newiris.SW];
newiris.PetalData = [newiris.PL,newiris.PW];
newiris(:,{'SL','SW','PL','PW'}) = [];
size(newiris)
ans =
150 2
size(newiris.SepalData)
ans =
150 2
The join method combines dataset arrays using key variables. For example,
the following creates a second dataset array, keyed by species, and joins it
with iris:

snames = nominal({'setosa';'versicolor';'virginica'});
CC = dataset({snames,'species'},{[38;108;70],'cc'})
CC =
species cc
setosa 38
versicolor 108
virginica 70
iris2 = join(iris,CC);
A variable can be deleted from a dataset array ds in any of the following
equivalent ways, where var is the variable name and j is its column number:

ds.var = [];
ds(:,j) = [];
ds = ds(:,[1:(j-1) (j+1):end]);

Similarly, observation i can be deleted in either of the following ways:

ds(i,:) = [];
ds = ds([1:(i-1) (i+1):end],:);

Use the summary method to compute summary statistics for the variables in a
dataset array:
summary(newiris)
Fisher's iris data (1936)
SepalData: [150x2 double]
min 4.3000 2
1st Q 5.1000 2.8000
median 5.8000 3
3rd Q 6.4000 3.3250
max 7.9000 4.4000
PetalData: [150x2 double]
min 1 0.1000
1st Q 1.6000 0.3000
median 4.4000 1.3000
3rd Q 5.1000 1.8000
max 6.9000 4.2000
SepalMeans = mean(newiris.SepalData)
SepalMeans =
5.8294 3.0503
means = datasetfun(@mean,newiris,'UniformOutput',false)
means =
[1x2 double] [1x2 double]
SepalMeans = means{1}
SepalMeans =
5.8294 3.0503
covs = datasetfun(@cov,newiris,'UniformOutput',false)
covs =
[2x2 double] [2x2 double]
SepalCovs = covs{1}
SepalCovs =
0.6835 -0.0373
-0.0373 0.2054
SepalCovs = cov(double(newiris(:,1)))
SepalCovs =
0.6835 -0.0373
-0.0373 0.2054
Grouped Data
In this section...
“Grouping Variables” on page 2-34
“Level Order Definition” on page 2-35
“Functions for Grouped Data” on page 2-35
“Using Grouping Variables” on page 2-37
Grouping Variables
Grouping variables are utility variables used to indicate which elements
in a data set are to be considered together when computing statistics and
creating visualizations. They may be numeric vectors, string arrays, cell
arrays of strings, or categorical arrays. Logical vectors can be used to indicate
membership (or not) in a single group.
Grouping variables have the same length as the variables (columns) in a data
set. Observations (rows) i and j are considered to be in the same group if the
values of the corresponding grouping variable are identical at those indices.
Grouping variables with multiple columns are used to specify different groups
within multiple variables.
For example, the following loads the 150-by-4 numerical array meas and the
150-by-1 cell array of strings species into the workspace:

load fisheriris
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively)
over three species of iris (setosa, versicolor, and virginica). To group the
observations by species, the following are all acceptable (and equivalent)
grouping variables:
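One illustrative set of such grouping variables (a sketch; these particular
constructions are not reproduced from the original listing) is:

group1 = species;           % cell array of strings
group2 = char(species);     % character array
group3 = nominal(species);  % categorical (nominal) array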
Level Order Definition
• For a categorical vector G, the set of group levels and their order match the
output of the getlabels(G) method.
• For a numeric vector or a logical vector G, the set of group levels is the
distinct values of G. The order is the sorted order of the unique values.
• For a cell vector of strings or a character matrix G, the set of group levels
is the distinct strings of G. The order for strings is the order of their first
appearance in G.
Some functions, such as grpstats, can take a cell array of several grouping
variables (such as {G1 G2 G3}) to group the observations in the data set by
each combination of the grouping variable levels. The order is decided first
by the order of the first grouping variable, then by the order of the second
grouping variable, and so on.
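A minimal sketch of grouping by two variables at once (the second grouping
variable below is constructed only for illustration):

load fisheriris
species_group = nominal(species);
long_sepal = meas(:,1) > median(meas(:,1));  % logical grouping variable
counts = grpstats(meas(:,3),{species_group,long_sepal},'numel')

The rows of counts are ordered first by the levels of species_group and then
by the levels of long_sepal.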
Functions for Grouped Data

Many Statistics Toolbox functions accept grouping variables, including
grpstats, gscatter, and boxplot. For a full description of the syntax of any
particular function, and examples of its use, consult its reference page. "Using
Grouping Variables" on page 2-37 also includes examples.
Using Grouping Variables
Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings
species:

load fisheriris
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively) over
three species of iris (setosa, versicolor, and virginica).
group = nominal(species);
Compute some basic statistics for the data (median and interquartile range),
by group, using the grpstats function:
[order,number,group_median,group_iqr] = ...
grpstats(meas,group,{'gname','numel',@median,@iqr})
order =
'setosa'
'versicolor'
'virginica'
number =
50 50 50 50
50 50 50 50
50 50 50 50
group_median =
5.0000 3.4000 1.5000 0.2000
5.9000 2.8000 4.3500 1.3000
6.5000 3.0000 5.5500 2.0000
group_iqr =
0.4000 0.5000 0.2000 0.1000
0.7000 0.5000 0.6000 0.3000
0.7000 0.4000 0.8000 0.5000
To improve the labeling of the data, create a dataset array (see “Dataset
Arrays” on page 2-23) from meas:
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({group,'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
When you call grpstats with a dataset array as an argument, you invoke the
grpstats method of the dataset class, rather than the grpstats function.
The method has a slightly different syntax than the function, but it returns
the same results, with better labeling:
stats = grpstats(iris,'species',{@median,@iqr})
stats =
species GroupCount
setosa setosa 50
versicolor versicolor 50
virginica virginica 50
median_SL iqr_SL
setosa 5 0.4
versicolor 5.9 0.7
virginica 6.5 0.7
median_SW iqr_SW
setosa 3.4 0.5
versicolor 2.8 0.5
virginica 3 0.4
median_PL iqr_PL
setosa 1.5 0.2
versicolor 4.35 0.6
virginica 5.55 0.8
median_PW iqr_PW
setosa 0.2 0.1
versicolor 1.3 0.3
virginica 2 0.5
Grouping variables are also used to create grouped visualizations. For example,
the following uses gscatter to plot sepal width against sepal length for two of
the species:

subset = ismember(group,{'setosa','versicolor'});
scattergroup = group(subset);
gscatter(iris.SL(subset),...
iris.SW(subset),...
scattergroup)
xlabel('Sepal Length')
ylabel('Sepal Width')
3 Descriptive Statistics
Introduction
You may need to summarize large, complex data sets—both numerically
and visually—to convey their essence to the data analyst and to allow for
further processing. This chapter focuses on numerical summaries; Chapter 4,
“Statistical Visualization” focuses on visual summaries.
Measures of Central Tendency
The following table lists the functions that calculate the measures of central
tendency.

Function Name   Description
geomean         Geometric mean
harmmean        Harmonic mean
mean            Arithmetic average
median          50th percentile of a sample
trimmean        Trimmed mean
The average is a simple and popular estimate of location. If the data sample
comes from a normal distribution, then the sample mean is also optimal
(MVUE of µ).
The median and trimmed mean are two measures that are resistant (robust)
to outliers. The median is the 50th percentile of the sample, which will only
change slightly if you add a large perturbation to any value. The idea behind
the trimmed mean is to ignore a small percentage of the highest and lowest
values of a sample when determining the center of the sample.
The geometric mean and the harmonic mean, like the average, are not robust
to outliers. They are useful when the sample is lognormally distributed or
heavily skewed.
The following example shows the behavior of the measures of location for a
sample with one outlier.
x = [ones(1,6) 100]
x =
1 1 1 1 1 1 100
locate = [geomean(x) harmmean(x) mean(x) median(x) trimmean(x,25)]
locate =
1.9307 1.1647 15.1429 1.0000 1.0000
You can see that the mean is far from any data value because of the influence
of the outlier. The median and trimmed mean ignore the outlying value and
describe the location of the rest of the data values.
Measures of Dispersion
The purpose of measures of dispersion is to find out how spread out the data
values are on the number line. Another term for these statistics is measures
of spread.
Function Name   Description
iqr Interquartile range
mad Mean absolute deviation
moment Central moment of all orders
range Range
std Standard deviation
var Variance
The range (the difference between the maximum and minimum values) is the
simplest measure of spread. But if there is an outlier in the data, it will be the
minimum or maximum value. Thus, the range is not robust to outliers.
The standard deviation and the variance are popular measures of spread that
are optimal for normally distributed samples. The sample variance is the
MVUE of the normal parameter σ2. The standard deviation is the square root
of the variance and has the desirable property of being in the same units as
the data. That is, if the data is in meters, the standard deviation is in meters
as well. The variance is in meters2, which is more difficult to interpret.
Neither the standard deviation nor the variance is robust to outliers. A data
value that is separate from the body of the data can increase the value of the
statistics by an arbitrarily large amount.
The mean absolute deviation (MAD) is also sensitive to outliers. But the
MAD does not move quite as much as the standard deviation or variance in
response to bad data.
The interquartile range (IQR) is the difference between the 75th and 25th
percentile of the data. Since only the middle 50% of the data affects this
measure, it is robust to outliers.
The following example shows the behavior of the measures of dispersion for a
sample with one outlier.
x = [ones(1,6) 100]
x =
1 1 1 1 1 1 100
stats = [iqr(x) mad(x) range(x) std(x)]
stats =
0 24.2449 99.0000 37.4185
Measures of Shape
Quantiles and percentiles provide information about the shape of data as
well as its location and spread.
The quantile and prctile functions compute quantiles and percentiles as
follows:

1 The sorted data points in a sample of size n are taken to be the 0.5/n, 1.5/n,
..., (n–0.5)/n quantiles.

2 Linear interpolation is used to compute quantiles for probabilities between
0.5/n and (n–0.5)/n.

3 The minimum or maximum values in the data are assigned to quantiles for
probabilities outside that range.

4 Missing values are treated as NaN and removed from the data.
The following example shows the result of looking at every quartile (quantiles
with orders that are multiples of 0.25) of a sample containing a mixture of
two distributions.
x = [normrnd(4,1,1,100) normrnd(6,0.5,1,200)];
p = 100*(0:0.25:1);
y = prctile(x,p);
z = [p;y]
z =
0 25.0000 50.0000 75.0000 100.0000
1.8293 4.6728 5.6459 6.0766 7.1546
boxplot(x)
The long lower tail and plus signs show the lack of symmetry in the sample
values. For more information on box plots, see “Box Plots” on page 4-6.
Resampling Statistics
In this section...
“The Bootstrap” on page 3-9
“The Jackknife” on page 3-12
“Parallel Computing Support for Resampling Methods” on page 3-13
The Bootstrap
The bootstrap procedure involves choosing random samples with replacement
from a data set and analyzing each sample the same way. Sampling with
replacement means that each observation is selected separately at random
from the original dataset. So a particular data point from the original data
set could appear multiple times in a given bootstrap sample. The number of
elements in each bootstrap sample equals the number of elements in the
original data set. The range of sample estimates you obtain enables you to
establish the uncertainty of the quantity you are estimating.
This example from Efron and Tibshirani [33] compares Law School Admission
Test (LSAT) scores and subsequent law school grade point average (GPA) for
a sample of 15 law schools.
load lawdata
plot(lsat,gpa,'+')
lsline
The least-squares fit line indicates that higher LSAT scores go with higher
law school GPAs. But how certain is this conclusion? The plot provides some
intuition, but nothing quantitative.
You can calculate the correlation coefficient of the variables using the corr
function.
rhohat = corr(lsat,gpa)
rhohat =
0.7764
Now you have a number describing the positive connection between LSAT
and GPA; though it may seem large, you still do not know if it is statistically
significant.
Using the bootstrp function you can resample the lsat and gpa vectors as
many times as you like and consider the variation in the resulting correlation
coefficients.
Here is an example.
rhos1000 = bootstrp(1000,'corr',lsat,gpa);
This command resamples the lsat and gpa vectors 1000 times and computes
the corr function on each sample. Here is a histogram of the result.
hist(rhos1000,30)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
To obtain a confidence interval for the correlation coefficient, use the bootci
function:
ci = bootci(5000,@corr,lsat,gpa)
ci =
0.3313
0.9427
Although the bootci function computes the Bias Corrected and accelerated
(BCa) interval as the default type, it is also able to compute various other
types of bootstrap confidence intervals, such as the studentized bootstrap
confidence interval.
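As a sketch, assuming the 'type' parameter name accepted by bootci, a
studentized interval for the same statistic could be requested as follows:

ci_stud = bootci(5000,{@corr,lsat,gpa},'type','student')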
The Jackknife
Similar to the bootstrap is the jackknife, which uses resampling to estimate
the bias of a sample statistic. Sometimes it is also used to estimate standard
error of the sample statistic. The jackknife is implemented by the Statistics
Toolbox function jackknife.
For example, estimate the bias of the sample correlation coefficient computed
from the law school data. First load the data and compute the sample
correlation:
load lawdata
rhohat = corr(lsat,gpa)
rhohat =
0.7764
Next compute the correlations for jackknife samples, and compute their mean:
jackrho = jackknife(@corr,lsat,gpa);
meanrho = mean(jackrho)
meanrho =
0.7759
Now compute an estimate of the bias:

n = length(lsat);
biasrho = (n-1) * (meanrho-rhohat)
biasrho =
-0.0065
Data with Missing Values

MATLAB represents missing data values with NaN (Not a Number). Many
MATLAB functions, such as sum, propagate NaN values in computations.
For example:
X = magic(3);
X([1 5]) = [NaN NaN]
X =
NaN 1 6
3 NaN 7
4 9 2
s1 = sum(X)
s1 =
NaN NaN 15
Removing the NaN values would destroy the matrix structure. Removing
the rows containing the NaN values would discard data. Statistics Toolbox
functions in the following table remove NaN values only for the purposes of
computation.
Function Description
nancov Covariance matrix, ignoring NaN values
nanmax Maximum, ignoring NaN values
nanmean Mean, ignoring NaN values
nanmedian Median, ignoring NaN values
nanmin Minimum, ignoring NaN values
nanstd Standard deviation, ignoring NaN values
nansum Sum, ignoring NaN values
nanvar Variance, ignoring NaN values
For example:
s2 = nansum(X)
s2 =
7 10 15
Other Statistics Toolbox functions also ignore NaN values. These include iqr,
kurtosis, mad, prctile, range, skewness, and trimmean.
4 Statistical Visualization
Introduction
Statistics Toolbox data visualization functions add to the extensive graphics
capabilities already in MATLAB.
• Scatter plots are a basic visualization tool for multivariate data. They
are used to identify relationships among variables. Grouped versions of
these plots use different plotting symbols to indicate group membership.
The gname function is used to label points on these plots with a text label
or an observation number.
• Box plots display a five-number summary of a set of data: the median,
the two ends of the interquartile range (the box), and two extreme values
(the whiskers) above and below the box. Because they show less detail
than histograms, box plots are most useful for side-by-side comparisons
of two distributions.
• Distribution plots help you identify an appropriate distribution family
for your data. They include normal and Weibull probability plots,
quantile-quantile plots, and empirical cumulative distribution plots.
Scatter Plots
A scatter plot is a simple plot of one variable against another. The MATLAB
functions plot and scatter produce scatter plots. The MATLAB function
plotmatrix can produce a matrix of such plots showing the relationship
between several pairs of variables.
Suppose you want to examine the weight and mileage of cars from three
different model years.
load carsmall
gscatter(Weight,MPG,Model_Year,'','xos')
This shows that not only is there a strong relationship between the weight of
a car and its mileage, but also that newer cars tend to be lighter and have
better gas mileage than older cars.
The default arguments for gscatter produce a scatter plot with the different
groups shown with the same symbol but different colors. The last two
arguments above request that all groups be shown in default colors and with
different symbols.
The carsmall data set contains other variables that describe different aspects
of cars. You can examine several of them in a single display by creating a
grouped plot matrix.
xvars = [Weight Displacement Horsepower];
yvars = [MPG Acceleration];
gplotmatrix(xvars,yvars,Model_Year,'','xos')
The upper right subplot displays MPG against Horsepower, and shows that
over the years the horsepower of the cars has decreased but the gas mileage
has improved.
The gplotmatrix function can also graph all pairs from a single list of
variables, along with histograms for each variable. See “MANOVA” on page
8-39.
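A sketch of that form of call using the Fisher iris data from Chapter 2 (the
'hist' display option and the empty placeholder arguments are assumptions
about the gplotmatrix interface):

load fisheriris
gplotmatrix(meas,[],species,[],[],[],'on','hist',{'SL','SW','PL','PW'})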
Box Plots
The graph below, created with the boxplot command, compares petal lengths
in samples from two species of iris.
load fisheriris
s1 = meas(51:100,3);
s2 = meas(101:150,3);
boxplot([s1 s2],'notch','on',...
'labels',{'versicolor','virginica'})
• The tops and bottoms of each “box” are the 25th and 75th percentiles of the
samples, respectively. The distances between the tops and bottoms are the
interquartile ranges.
• The line in the middle of each box is the sample median. If the median is
not centered in the box, it shows sample skewness.
• The whiskers are lines extending above and below each box. Whiskers are
drawn from the ends of the interquartile ranges to the furthest observations
within the whisker length (the adjacent values).
• Observations beyond the whisker length are marked as outliers. By
default, an outlier is a value that is more than 1.5 times the interquartile
range away from the top or bottom of the box, but this value can be adjusted
with additional input arguments. Outliers are displayed with a red + sign.
• Notches display the variability of the median between samples. The width
of a notch is computed so that box plots whose notches do not overlap (as
above) have different medians at the 5% significance level. The significance
level is based on a normal distribution assumption, but comparisons of
medians are reasonably robust for other distributions. Comparing box-plot
medians is like a visual hypothesis test, analogous to the t test used for
means.
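For comparison with this visual test, a sketch of a formal two-sample t test on
the same petal length samples (note that ttest2 compares means, while the
notches compare medians):

[h,p] = ttest2(s1,s2)  % h = 1 rejects equal means at the 5% level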
Distribution Plots
In this section...
“Normal Probability Plots” on page 4-8
“Quantile-Quantile Plots” on page 4-10
“Cumulative Distribution Plots” on page 4-12
“Other Probability Plots” on page 4-14
Normal Probability Plots

The following example shows a normal probability plot created with the
normplot function.
x = normrnd(10,1,25,1);
normplot(x)
The plus signs plot the empirical probability versus the data value for each
point in the data. A solid line connects the 25th and 75th percentiles in the
data, and a dashed line extends it to the ends of the data. The y-axis values
are probabilities from zero to one, but the scale is not linear. The distance
between tick marks on the y-axis matches the distance between the quantiles
of a normal distribution. The quantiles are close together near the median
(probability = 0.5) and stretch out symmetrically as you move away from
the median.
In a normal probability plot, if all the data points fall near the line, an
assumption of normality is reasonable. Otherwise, the points will curve away
from the line, and an assumption of normality is not justified.
For example:
x = exprnd(10,100,1);
normplot(x)
The plot is strong evidence that the underlying distribution is not normal.
Quantile-Quantile Plots
Quantile-quantile plots are used to determine whether two samples come from
the same distribution family. They are scatter plots of quantiles computed
from each sample, with a line drawn between the first and third quartiles. If
the data falls near the line, it is reasonable to assume that the two samples
come from the same distribution. The method is robust with respect to
changes in the location and scale of either distribution.
The following example creates a quantile-quantile plot of two samples from
Poisson distributions:
x = poissrnd(10,50,1);
y = poissrnd(5,100,1);
qqplot(x,y);
Even though the parameters and sample sizes are different, the approximate
linear relationship suggests that the two samples may come from the same
distribution family. As with normal probability plots, hypothesis tests,
as described in Chapter 7, “Hypothesis Tests”, can provide additional
justification for such an assumption. For statistical procedures that depend
on the two samples coming from the same distribution, however, a linear
quantile-quantile plot is often sufficient.
The following example shows what happens when the underlying distributions
are not the same.
x = normrnd(5,1,100,1);
y = wblrnd(2,0.5,100,1);
qqplot(x,y);
These samples clearly are not from the same distribution family.
Cumulative Distribution Plots
To create an empirical cdf plot, use the cdfplot function (or ecdf and stairs).
The following example compares the empirical cdf for a sample from an
extreme value distribution with a plot of the cdf for the sampling distribution.
In practice, the sampling distribution would be unknown, and would be
chosen to match the empirical cdf.
y = evrnd(0,3,100,1);
cdfplot(y)
hold on
x = -20:0.1:10;
f = evcdf(x,0,3);
plot(x,f,'m')
legend('Empirical','Theoretical','Location','NW')
Other Probability Plots

The probplot function creates probability plots for a variety of supported
distributions.
For example, the following plot assesses two samples, one from a Weibull
distribution and one from a Rayleigh distribution, to see if they may have
come from a Weibull population.
x1 = wblrnd(3,3,100,1);
x2 = raylrnd(3,100,1);
probplot('weibull',[x1 x2])
legend('Weibull Sample','Rayleigh Sample','Location','NW')
The plot gives justification for modeling the first sample with a Weibull
distribution; much less so for the second sample.
5 Probability Distributions
Using Probability Distributions

Statistics Toolbox™ software provides several ways of working with both
parametric and nonparametric probability distributions: through graphical
user interfaces (GUIs), through probability distribution objects, and through
command-line probability distribution functions.
Supported Distributions
In this section...
“Parametric Distributions” on page 5-4
“Nonparametric Distributions” on page 5-8
Parametric Distributions
Discrete Distributions
Multivariate Distributions
Nonparametric Distributions
For nonparametric distributions, the ksdensity function computes the pdf,
cdf, and inverse cdf, and fits are created with ksdensity or the dfittool GUI.
Working with Distributions Through GUIs
Exploring Distributions
To interactively see the influence of parameter changes on the shapes of the
pdfs and cdfs of supported Statistics Toolbox distributions, use the Probability
Distribution Function Tool.
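The tool opens from the command line (disttool is the command that opens
the Probability Distribution Function Tool):

disttool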
The tool displays a plot of the selected function, shows the function value, and
provides draggable reference lines for reading values off the plot.
To fit distributions to your data, use the Distribution Fitting Tool, which you
open by entering dfittool at the command line. The tool provides task buttons
for each step of the fitting process and lets you import data from the MATLAB
workspace.
Adjusting the Plot. Buttons at the top of the tool allow you to adjust the
plot displayed in this window:
Displaying the Data. The Display type field specifies the type of plot
displayed in the main window. Each type corresponds to a probability
function, for example, a probability density function. The following display
types are available:
Inputting and Fitting Data. The task buttons enable you to perform the
tasks necessary to fit distributions to data. Each button opens a new dialog
box in which you perform the task. The buttons include:
• Data — Import and manage data sets. See “Creating and Managing Data
Sets” on page 5-14.
• New Fit — Create new fits. See “Creating a New Fit” on page 5-19.
• Manage Fits — Manage existing fits. See “Managing Fits” on page 5-26.
• Evaluate — Evaluate fits at any points you choose. See “Evaluating Fits”
on page 5-28.
• Exclude — Create rules specifying which values to exclude when fitting a
distribution. See “Excluding Data” on page 5-32.
The display pane displays plots of the data sets and fits you create. Whenever
you make changes in one of the dialog boxes, the results in the display pane
update.
The Distribution Fitting Tool also enables you to:
• Save and load sessions. See “Saving and Loading Sessions” on page 5-38.
• Generate a file with which you can fit distributions to data and plot the
results independently of the Distribution Fitting Tool. See “Generating a
File to Fit and Plot Distributions” on page 5-46.
• Define and import custom distributions. See “Using Custom Distributions”
on page 5-47.
Creating and Managing Data Sets

To begin, click the Data button in the Distribution Fitting Tool to open the
Data dialog box shown in the following figure.
The Data dialog box includes the following fields:
• Data — The drop-down list in the Data field contains the names of all
matrices and vectors, other than 1-by-1 matrices (scalars) in the MATLAB
workspace. Select the array containing the data you want to fit. The actual
data you import must be a vector. If you select a matrix in the Data field,
the first column of the matrix is imported by default. To select a different
column or row of the matrix, click Select Column or Row. This displays
the matrix in the Variable Editor, where you can select a row or column
by highlighting it with the mouse.
Alternatively, you can enter any valid MATLAB expression in the Data
field.
When you select a vector in the Data field, a histogram of the data appears
in the Data preview pane.
• Censoring — If some of the points in the data set are censored, enter
a Boolean vector, of the same size as the data vector, specifying the
censored entries of the data. A 1 in the censoring vector specifies that the
corresponding entry of the data vector is censored, while a 0 specifies that
the entry is not censored. If you enter a matrix, you can select a column or
row by clicking Select Column or Row. If you do not want to censor any
data, leave the Censoring field blank.
• Frequency — Enter a vector of positive integers of the same size as the
data vector to specify the frequency of the corresponding entries of the data
vector. For example, a value of 7 in the 15th entry of frequency vector
specifies that there are 7 data points corresponding to the value in the 15th
entry of the data vector. If all entries of the data vector have frequency 1,
leave the Frequency field blank.
• Data set name — Enter a name for the data set you import from the
workspace, such as My data.
After you have entered the information in the preceding fields, click Create
Data Set to create the data set My data.
Managing Data Sets. The Manage data sets pane enables you to view
and manage the data sets you create. When you create a data set, its name
appears in the Data sets list. The following figure shows the Manage data
sets pane after creating the data set My data.
For each data set in the Data sets list, you can:
• Select the Plot check box to display a plot of the data in the main
Distribution Fitting Tool window. When you create a new data set, Plot is
selected by default. Clearing the Plot check box removes the data from the
plot in the main window. You can specify the type of plot displayed in the
Display type field in the main window.
• If Plot is selected, you can also select Bounds to display confidence
interval bounds for the plot in the main window. These bounds are
pointwise confidence bounds around the empirical estimates of these
functions. The bounds are only displayed when you set Display Type in
the main window to one of the following:
- Cumulative probability (CDF)
- Survivor function
- Cumulative hazard
When you select a data set from the list, additional buttons for working with that data set are enabled.
Setting Bin Rules. To set bin rules for the histogram of a data set, click Set
Bin Rules. This opens the Set Bin Width Rules dialog box.
• Bin width — Enter the width of each bin. If you select this option, you
can also select:
- Automatic bin placement — Place the edges of the bins at integer
multiples of the Bin width.
- Bin boundary at — Enter a scalar to specify the boundaries of the
bins. The boundary of each bin is equal to this scalar plus an integer
multiple of the Bin width.
The Set Bin Width Rules dialog box also provides the following options:
• Apply to all existing data sets — Apply the rule to all data sets.
Otherwise, the rule is only applied to the data set currently selected in
the Data dialog box.
• Save as default — Apply the current rule to any new data sets that you
create. You can also set default bin width rules by selecting Set Default
Bin Rules from the Tools menu in the main window.
Apply the New Fit. Click Apply to fit the distribution. For a parametric
fit, the Results pane displays the values of the estimated parameters. For a
nonparametric fit, the Results pane displays information about the fit.
When you click Apply, the Distribution Fitting Tool displays a plot of the
distribution, along with the corresponding data.
Note When you click Apply, the title of the dialog box changes to Edit Fit.
You can now make changes to the fit you just created and click Apply again
to save them. After closing the Edit Fit dialog box, you can reopen it from the
Fit Manager dialog box at any time to edit the fit.
After applying the fit, you can save the information to the workspace using
probability distribution objects by clicking Save to workspace. See “Using
Probability Distribution Objects” on page 5-84 for more information.
Most, but not all, of the distributions available in the Distribution Fitting
Tool are supported elsewhere in Statistics Toolbox software (see “Supported
Distributions” on page 5-3), and have dedicated distribution fitting functions.
These functions compute the majority of the fits in the Distribution Fitting
Tool, and are referenced in the list below.
Other fits are computed using functions internal to the Distribution Fitting
Tool. Distributions that do not have corresponding Statistics Toolbox
fitting functions are described in “Additional Distributions Available in the
Distribution Fitting Tool” on page 5-49.
Not all of the distributions listed below are available for all data sets. The
Distribution Fitting Tool determines the extent of the data (nonnegative, unit
interval, etc.) and displays appropriate distributions in the Distribution
drop-down list. Distribution data ranges are given parenthetically in the
list below.
• Beta (unit interval values) distribution, fit using the function betafit.
• Binomial (nonnegative integer values) distribution, fit using the function binofit.
• Birnbaum-Saunders (positive values) distribution.
• Exponential (nonnegative values) distribution, fit using the function
expfit.
• Extreme value (all values) distribution, fit using the function evfit.
• Gamma (positive values) distribution, fit using the function gamfit.
• Generalized extreme value (all values) distribution, fit using the function
gevfit.
• Generalized Pareto (all values) distribution, fit using the function gpfit.
• Inverse Gaussian (positive values) distribution.
Displaying Results
This section explains the different ways to display results in the Distribution
Fitting Tool window. This window displays plots of:
• The data sets for which you select Plot in the Data dialog box
• The fits for which you select Plot in the Fit Manager dialog box
• Confidence bounds for:
- Data sets for which you select Bounds in the Data dialog box
- Fits for which you select Bounds in the Fit Manager dialog box
Display Type. The Display Type field in the main window specifies the type
of plot displayed. Each type corresponds to a probability function, for example,
a probability density function. The following display types are available:
• Density (PDF) — Display a probability density function (PDF) plot for the
fitted distribution. The main window displays data sets using a probability
histogram, in which the height of each rectangle is the fraction of data
points that lie in the bin divided by the width of the bin. This makes the
sum of the areas of the rectangles equal to 1.
• Cumulative probability (CDF) — Display a cumulative probability
plot of the data. The main window displays data sets using a cumulative
probability step function. The height of each step is the cumulative sum of
the heights of the rectangles in the probability histogram.
• Quantile (inverse CDF) — Display a quantile (inverse CDF) plot.
• Probability plot — Display a probability plot of the data. You can
specify the type of distribution used to construct the probability plot in the
Distribution field, which is only available when you select Probability
plot. The choices for the distribution are:
- Exponential
- Extreme value
- Logistic
- Log-Logistic
- Lognormal
- Normal
- Rayleigh
- Weibull
In addition to these choices, you can create a probability plot against a
parametric fit that you create in the New Fit pane. These fits are added at
the bottom of the Distribution drop-down list when you create them.
• Survivor function — Display a survivor function plot of the data.
• Cumulative hazard — Display a cumulative hazard plot of the data.
Confidence Bounds. You can display confidence bounds for data sets and fits
when you set Display Type to Cumulative probability (CDF), Survivor
function, Cumulative hazard, or, for fits only, Quantile (inverse CDF).
• To display bounds for a data set, select Bounds next to the data set in the
Data sets pane of the Data dialog box.
• To display bounds for a fit, select Bounds next to the fit in the Fit Manager
dialog box. Confidence bounds are not available for all fit types.
To set the confidence level for the bounds, select Confidence Level from the
View menu in the main window and choose from the options.
Managing Fits
This section describes how to manage fits that you have created. To begin,
click the Manage Fits button in the Distribution Fitting Tool. This opens the
Fit Manager dialog box as shown in the following figure.
The Table of fits displays a list of the fits you create, with the following
options:
• Plot — Select Plot to display a plot of the fit in the main window of the
Distribution Fitting Tool. When you create a new fit, Plot is selected by
default. Clearing the Plot check box removes the fit from the plot in the
main window.
• Bounds — If Plot is selected, you can also select Bounds to display
confidence bounds in the plot. The bounds are displayed when you set
Display Type in the main window to one of the following:
- Cumulative probability (CDF)
- Quantile (inverse CDF)
- Survivor function
- Cumulative hazard
Note You can only edit the currently selected fit in the Edit Fit dialog
box. To edit a different fit, select it in the Table of fits and click Edit to
open another Edit Fit dialog box.
Evaluating Fits
The Evaluate dialog box enables you to evaluate any fit at whatever points you
choose. To open the dialog box, click the Evaluate button in the Distribution
Fitting Tool. The following figure shows the Evaluate dialog box.
• Fit pane — Display the names of existing fits. Select one or more fits that
you want to evaluate. Using your platform-specific functionality, you can
select multiple fits.
• Function — Select the type of probability function you want to evaluate
for the fit. The available functions are
- Density (PDF) — Computes a probability density function.
Note The settings for Compute confidence bounds, Level, and Plot
function do not affect the plots that are displayed in the main window of
the Distribution Fitting Tool. The settings only apply to plots you create by
clicking Plot function in the Evaluate window.
Click Apply to apply these settings to the selected fit. The following figure
shows the results of evaluating the cumulative distribution function (CDF) for the fit My
fit, created in “Example: Fitting a Distribution” on page 5-39, at the points
in the vector -3:0.5:3.
The window displays the following values in the columns of the table to the
right of the Fit pane:
• LB — The lower bounds for the confidence interval, if you select Compute
confidence bounds
• UB — The upper bounds for the confidence interval, if you select Compute
confidence bounds
Excluding Data
To exclude values from a fit, click the Exclude button in the main window of
the Distribution Fitting Tool. This opens the Exclude window, in which you
can create rules for excluding specified values. You can use these rules to
exclude data when you create a new fit in the New Fit window. The following
figure shows the Exclude window.
To set a lower limit for the boundary of the excluded region, click Add
Lower Limit. This displays a vertical line on the left side of the plot
window. Move the line with the mouse to the point where you want
the lower limit, as shown in the following figure.
Moving the vertical line changes the value displayed in the Lower limit:
exclude data field in the Exclude window, as shown in the following figure.
Similarly, you can set the upper limit for the boundary of the excluded
region by clicking Add Upper Limit and moving the vertical line that
appears at the right side of the plot window. After setting the lower and
upper limits, click Close and return to the Exclude window.
3 Create Exclusion Rule—Once you have set the lower and upper limits
for the boundary of the excluded data, click Create Exclusion Rule
to create the new rule. The name of the new rule now appears in the
Existing exclusion rules pane.
When you select an exclusion rule in the Existing exclusion rules pane,
the following buttons are enabled:
• Copy — Creates a copy of the rule, which you can then modify. To save
the modified rule under a different name, click Create Exclusion Rule.
• View — Opens a new window in which you can see which data points
are excluded by the rule. The following figure shows a typical example.
The shaded areas in the plot graphically display which data points are
excluded. The table to the right lists all data points. The shaded rows
indicate excluded points.
• Rename — Renames the rule
• Delete — Deletes the rule
Once you define an exclusion rule, you can use it when you fit a distribution
to your data. The rule does not exclude points from the display of the data
set.
Saving a Session. To save the current session, select Save Session from
the File menu in the main window. This opens a dialog box that prompts you
to enter a filename, such as my_session.dfit, for the session. Clicking Save
saves the following items created in the current session:
• Data sets
• Fits
• Exclusion rules
• Plot settings
• Bin width rules
Example: Fitting a Distribution
Step 1: Generate Random Data. To try the example, first generate some
random data to which you will fit a distribution. The following command
generates a vector data, of length 100, whose entries are random numbers
from a normal distribution with mean .36 and standard deviation 1.4.
data = normrnd(.36,1.4,100,1);
Step 2: Import Data. To open the Distribution Fitting Tool, type
dfittool
To import the vector data into the Distribution Fitting Tool, click the Data
button in the main window. This opens the window shown in the following figure.
The Data field displays all numeric arrays in the MATLAB workspace. Select
data from the drop-down list, as shown in the following figure.
In the Data set name field, type a name for the data set, such as My data,
and click Create Data Set to create the data set. The main window of the
Distribution Fitting Tool now displays a larger version of the histogram in the
Data preview pane, as shown in the following figure.
Note Because the example uses random data, you might see a slightly
different histogram if you try this example for yourself.
Step 3: Create a New Fit. To fit a distribution to the data, click New Fit
in the main window of the Distribution Fitting Tool. This opens the window
shown in the following figure.
1 Enter a name for the fit, such as My fit, in the Fit name field.
3 Click Apply.
The Results pane displays the mean and standard deviation of the normal
distribution that best fits My data, as shown in the following figure.
The main window of the Distribution Fitting Tool displays a plot of the
normal distribution with this mean and standard deviation, as shown in the
following figure.
The generated file is a function that:
• Fits the distributions used in the current session to any data vector in the
MATLAB workspace.
• Plots the data and the fits.
After you end the current session, you can use the file to create plots in a
standard MATLAB figure window, without having to reopen the Distribution
Fitting Tool.
2 Choose File > Save as in the MATLAB Editor window. Save the file as
normal_fit.m in a folder on the MATLAB path.
You can then apply the function normal_fit to any vector of data in the
MATLAB workspace. For example, the following commands
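A sketch of such commands follows; the data vector x and its parameters are illustrative assumptions, and only the displayed output below is original:
x = normrnd(3,12,100,1);   % any vector of data in the workspace
newfit = normal_fit(x)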
newfit =
normal distribution
mu = 3.19148
sigma = 12.5631
Note By default, the file labels the data in the legend using the same name as
the data set in the Distribution Fitting Tool. You can change the label using
the legend command, as illustrated by the preceding example.
The template includes example code that computes the Laplace distribution,
beginning at the lines
% --------------------------------------------------
% Remove the following return statement to define the
% Laplace distribution
% --------------------------------------------------
return
To use this example, simply delete the command return and save the
file. If you save the template in a folder on the MATLAB path, under its
default name dfittooldists.m, the Distribution Fitting Tool reads it in
automatically when you start the tool. You can also save the template under a
different name, such as laplace.m, and then import the custom distribution
as described in the following section.
For a complete list of the distributions available for use with the Distribution
Fitting Tool, see “Supported Distributions” on page 5-3. Distributions listing
dfittool in the fit column of the tables in that section can be used with
the Distribution Fitting Tool.
[Figure: the Random Number Generation Tool window, with callouts for the histogram display, the parameter values, parameter bounds, and parameter controls, additional parameters, the Sample again from the same distribution button, and the Export to workspace button.]
• Use the controls at the bottom of the window to set parameter values for
the distribution and to change their upper and lower bounds.
• Draw another sample from the same distribution, with the same size and
parameters.
• Export the current sample to your workspace. A dialog box enables you
to provide a name for the sample.
Statistics Toolbox™ Distribution Functions
$$f(k) = \binom{n}{k} p^{k} (1-p)^{n-k}$$
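For instance, binopdf evaluates this density directly; the parameter values below are illustrative:
% probability of k = 3 successes in n = 10 trials with success probability p = 0.5
f = binopdf(3,10,0.5)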
$$f(t) = \lambda e^{-\lambda t}$$
is used to model the probability that a process with constant failure rate λ will
have a failure within time t. Each time t > 0 is assigned a positive probability
density. Densities are computed with the exppdf function:
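The original call is not shown here; a minimal sketch, assuming a mean of 2 (exppdf is parameterized by the mean μ = 1/λ):
mu = 2;                  % assumed mean, equal to 1/lambda
t = 0:0.1:10;
f = exppdf(t,mu);        % density values at the points in t
plot(t,f)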
Probabilities for continuous pdfs can be computed with the quad function.
In the example above, the probability of failure in the time interval [0,1] is
computed as follows:
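A sketch of that computation, using the same assumed mean as above:
p = quad(@(t) exppdf(t,mu),0,1)   % probability of failure within time 1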
cars = load('carsmall','MPG','Origin');
MPG = cars.MPG;
hist(MPG)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
[f,x] = ksdensity(MPG);
plot(x,f);
title('Density estimate for MPG')
The first call to ksdensity returns the default bandwidth, u, of the kernel
smoothing function. Subsequent calls modify this bandwidth.
[f,x,u] = ksdensity(MPG);
plot(x,f)
title('Density estimate for MPG')
hold on
[f,x] = ksdensity(MPG,'width',u/3);
plot(x,f,'r');
[f,x] = ksdensity(MPG,'width',u*3);
plot(x,f,'g');
The green curve shows a density with the kernel bandwidth set too high.
This curve smooths out the data so much that the end result looks just like
the kernel function. The red curve has a smaller bandwidth and is rougher
looking than the blue curve. It may be too rough, but it does provide an
indication that there might be two major peaks rather than the single peak
of the blue curve. A reasonable choice of width might lead to a curve that is
intermediate between the red and blue curves.
Using default bandwidths, you can now plot the same mileage data, using
each of the available kernel functions.
The density estimates are roughly comparable, but the box kernel produces a
density that is rougher than the others.
Origin = cellstr(cars.Origin);
I = strcmp('USA',Origin);
J = strcmp('Japan',Origin);
K = ~(I|J);
MPG_USA = MPG(I);
MPG_Japan = MPG(J);
MPG_Europe = MPG(K);
[fI,xI] = ksdensity(MPG_USA);
plot(xI,fI,'b')
hold on
[fJ,xJ] = ksdensity(MPG_Japan);
plot(xJ,fJ,'r')
[fK,xK] = ksdensity(MPG_Europe);
plot(xK,fK,'g')
legend('USA','Japan','Europe')
hold off
For a discrete distribution, the cdf is related to the pdf by
$$F(x) = \sum_{y \le x} f(y)$$
and for a continuous distribution by
$$F(x) = \int_{-\infty}^{x} f(y)\,dy$$
The cdf gives event probabilities directly:
• P(y ≤ x) = F(x)
• P(y > x) = 1 − F(x)
• P(x1 < y ≤ x2) = F(x2) − F(x1)
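For example, for a standard normal random variable y, the third relation gives:
% P(-1 < y <= 2) for a standard normal distribution
p = normcdf(2) - normcdf(-1)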
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
mu = 1; % Population mean
sigma = 2; % Population standard deviation
n = 100; % Sample size
x = normrnd(mu,sigma,n,1); % Random sample from population
xbar = mean(x); % Sample mean
s = std(x); % Sample standard deviation
t = (xbar-mu)/(s/sqrt(n)) % t-statistic
t =
0.2489
p = 1-tcdf(t,n-1) % Probability of larger t-statistic
p =
0.4020
This probability is the same as the p value returned by a t-test of the null
hypothesis that the sample comes from a normal population with mean μ:
[h,ptest] = ttest(x,mu,0.05,'right')
h =
0
ptest =
0.4020
The ecdf function returns the values of a function F such that F(x) represents the proportion of
observations in a sample less than or equal to x.
The idea behind the empirical cdf is simple. It is a function that assigns
probability 1/n to each of n observations in a sample. Its graph has a
stair-step appearance. If a sample comes from a distribution in a parametric
family (such as a normal distribution), its empirical cdf is likely to resemble
the parametric distribution. If not, its empirical distribution still gives an
estimate of the cdf for the distribution that generated the data.
x = normrnd(10,2,20,1);
[f,xf] = ecdf(x);
stairs(xf,f)
hold on
xx=linspace(5,15,100);
yy = normcdf(xx,10,2);
plot(xx,yy,'r:')
hold off
legend('Empirical cdf','Normal cdf',2)
For piecewise probability density estimation, using the empirical cdf in the
center of the distribution and Pareto distributions in the tails, see “Fitting
Piecewise Distributions” on page 5-72.
For continuous distributions, the inverse cdf returns the unique outcome
whose cdf value is the input cumulative probability.
x = 0.5:0.2:1.5 % Outcomes
x =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
p = expcdf(x,1) % Cumulative probabilities
p =
0.3935 0.5034 0.5934 0.6671 0.7275 0.7769
expinv(p,1) % Return original outcomes
ans =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
For discrete distributions, there may be no outcome whose cdf value is the
input cumulative probability. In these cases, the inverse cdf returns the first
outcome whose cdf value equals or exceeds the input cumulative probability.
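As an illustration (the distribution chosen here, a Poisson with mean 2, is an assumption, not the original example):
p = [0.1 0.5 0.9];    % cumulative probabilities that fall between cdf values
q = poissinv(p,2)     % returns 0, 2, and 4: the first outcomes whose cdf
                      % values equal or exceed 0.1, 0.5, and 0.9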
For example, the wblstat function can be used to visualize the mean of the
Weibull distribution as a function of its two distribution parameters:
a = 0.5:0.1:3;
b = 0.5:0.1:3;
[A,B] = meshgrid(a,b);
M = wblstat(A,B);
surfc(A,B,M)
The Statistics Toolbox function mle is a convenient front end to the individual
distribution fitting functions, and more. The function computes MLEs for
distributions beyond those for which Statistics Toolbox software provides
specific pdf functions.
For some pdfs, MLEs can be given in closed form and computed directly.
For other pdfs, a search for the maximum likelihood must be employed. The
search can be controlled with an options input argument, created using
the statset function. For efficient searches, it is important to choose a
reasonable distribution model and set appropriate convergence tolerances.
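A minimal sketch of passing search options to mle (the data, distribution, and tolerance values are illustrative):
x = gamrnd(2,1,100,1);                        % some positive data
opts = statset('MaxIter',500,'TolFun',1e-8);  % convergence controls
phat = mle(x,'distribution','gamma','options',opts)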
MLEs can be heavily biased, especially for small samples. As sample size
increases, however, MLEs become unbiased minimum variance estimators
with approximate normal distributions. This is used to compute confidence
bounds for the estimates.
mu = 1; % Population parameter
n = 1e3; % Sample size
ns = 1e4; % Number of samples
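The sampling step itself is not shown above; a sketch consistent with the normfit results below (sample means clustered near 1 with standard deviation near 1/sqrt(n)) draws exponential samples with mean mu:
samples = exprnd(mu,n,ns);   % n-by-ns matrix: one sample per column
means = mean(samples);       % ns sample means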
The Central Limit Theorem says that the means will be approximately
normally distributed, regardless of the distribution of the data in the samples.
The normfit function can be used to find the normal distribution that best
fits the means:
[muhat,sigmahat,muci,sigmaci] = normfit(means)
muhat =
1.0003
sigmahat =
0.0319
muci =
0.9997
1.0010
sigmaci =
0.0314
0.0323
The function returns MLEs for the mean and standard deviation and their
95% confidence intervals.
To visualize the distribution of sample means together with the fitted normal
distribution, you must scale the fitted pdf, with area = 1, to the area of the
histogram being used to display the means:
numbins = 50;
hist(means,numbins)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
[bincounts,binpositions] = hist(means,numbins);
binwidth = binpositions(2) - binpositions(1);
histarea = binwidth*sum(bincounts);
x = binpositions(1):0.001:binpositions(end);
y = normpdf(x,muhat,sigmahat);
plot(x,histarea*y,'r','LineWidth',2)
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];
Neither a normal distribution nor a t distribution fits the tails very well:
probplot(data);
p = fitdist(data,'tlocationscale');
h = probplot(gca,p);
set(h,'color','r','linestyle','-')
title('{\bf Probability Plot}')
legend('Normal','Data','t','Location','NW')
On the other hand, the empirical distribution provides a perfect fit, but the
outliers make the tails very discrete:
ecdf(data)
The paretotails function provides a single, well-fit model for the entire
sample. The following uses generalized Pareto distributions (GPDs) for the
lower and upper 10% of the data:
pfit = paretotails(data,0.1,0.9)
pfit =
Piecewise distribution with 3 segments
-Inf < x < -1.30726 (0 < p < 0.1)
lower tail, GPD(-1.10167,1.12395)
x = -4:0.01:10;
plot(x,cdf(pfit,x))
Access information about the fit using the methods of the paretotails class.
Options allow for nonparametric estimation of the center of the cdf.
$$L(a) = \prod_{x \in X} f(a \mid x)$$
For example, use gamrnd to generate a random sample from a specific gamma
distribution:
a = [1,2];
X = gamrnd(a(1),a(2),1e3,1);
Given X, the gamlike function can be used to visualize the likelihood surface
in the neighborhood of a:
mesh = 50;
delta = 0.5;
a1 = linspace(a(1)-delta,a(1)+delta,mesh);
a2 = linspace(a(2)-delta,a(2)+delta,mesh);
logL = zeros(mesh); % Preallocate memory
for i = 1:mesh
for j = 1:mesh
logL(i,j) = gamlike([a1(i),a2(j)],X);
end
end
[A1,A2] = meshgrid(a1,a2);
surfc(A1,A2,logL)
These can be compared to the MLEs returned by the gamfit function, which
uses a combination search and solve algorithm:
ahat = gamfit(X)
ahat =
1.0231 1.9728
The MLEs can be added to the surface plot (rotated to show the minimum):
hold on
plot3(ahat(1),ahat(2),gamlike(ahat,X),...
'ro','MarkerSize',5,...
'MarkerFaceColor','r')
Statistics Toolbox functions that generate random values, including the following, depend on the state of the default random number stream:
• cvpartition
• hmmgenerate
• lhsdesign
• lhsnorm
• mhsample
• random
• randsample
• slicesample
By controlling the default random number stream and its state, you can
control how the RNGs in Statistics Toolbox software generate random values.
For example, to reproduce the same sequence of values from an RNG, you can
save and restore the default stream’s state, or reset the default stream. For
details on managing the default random number stream, see “Managing the
Global Stream”.
MATLAB initializes the default random number stream to the same state
each time it starts up. Thus, RNGs in Statistics Toolbox software will
generate the same sequence of values for each MATLAB session unless you
modify that state at startup. One simple way to do that is to add commands
to startup.m such as
stream = RandStream('mt19937ar','seed',sum(100*clock));
RandStream.setDefaultStream(stream);
Using Probability Distribution Objects
Probability distribution objects allow you to easily fit, access, and store
distribution information for a given data set. Operations such as fitting
distributions by group and saving fits for later use, shown in the following
sections, are easier to perform using distribution objects.
If you are a novice statistician who would like to explore how various
distributions look without having to manipulate data, see “Working with
Distributions Through GUIs” on page 5-9.
If you have no data to fit, but want to calculate a pdf, cdf, etc., for various
parameters, see “Statistics Toolbox Distribution Functions” on page 5-52.
The left side of this diagram shows the inheritance line from all probability
distributions down to univariate parametric probability distributions. The
right side shows the lineage down to univariate kernel distributions. Here is
how to interpret univariate parametric distribution lineage:
pd = ProbDistUnivParam('normal',[100 10])
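Once created, the object’s methods can be used directly; for example (a sketch):
r = random(pd,1000,1);   % draw 1000 values from the distribution
p = cdf(pd,120)          % P(X <= 120) for this normal(100,10) object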
load carsmall
pd = ProbDistUnivKernel(MPG)
Object-Supported Distributions
Object-oriented programming in the Statistics Toolbox supports the following
distributions.
Parametric Distributions
Use the following distributions to create ProbDistUnivParam objects using
fitdist. For more information on the cumulative distribution function (cdf)
and probability density function (pdf) methods, as well as other available
methods, see the ProbDistUnivParam class reference page.
Nonparametric Distributions
Use the following distributions to create ProbDistUnivKernel objects.
For more information on the cumulative distribution function (cdf) and
probability density function (pdf) methods, as well as other available
methods, see the ProbDistUnivKernel class reference page.
load carsmall
NormDist = fitdist(MPG,'normal')
NormDist =
normal distribution
mu = 23.7181
sigma = 8.03573
load carsmall
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin)
Warning: Error while fitting group 'Italy':
Not enough data in X to fit this distribution.
> In fitdist at 171
WeiByOrig =
Columns 1 through 4
[1x1 ProbDistUnivParam]  [1x1 ProbDistUnivParam]  [1x1 ProbDistUnivParam]  [1x1 ProbDistUnivParam]
Columns 5 through 6
[1x1 ProbDistUnivParam]  []
Country =
'USA'
'France'
'Japan'
'Germany'
'Sweden'
'Italy'
A warning appears informing you that, since the data only represents one
Italian car, fitdist cannot fit a Weibull distribution to that group. Each
one of the five other groups now has a distribution object associated with it,
represented in the cell array WeiByOrig. Each object contains properties that hold
information about the data, the distribution, and the parameters. For more
information on what properties exist and what information they contain, see
ProbDistUnivParam or ProbDistUnivKernel.
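The objects compared below can be extracted from WeiByOrig using the group order shown in Country (USA is the first group, Japan the third); this extraction step is a sketch:
distusa = WeiByOrig{1};     % fit for 'USA'
distjapan = WeiByOrig{3};   % fit for 'Japan'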
Now you can easily compare PDFs using the pdf method of the
ProbDistUnivParam class:
time = linspace(0,45);
pdfjapan = pdf(distjapan,time);
pdfusa = pdf(distusa,time);
hold on
plot(time,[pdfjapan;pdfusa])
l = legend('Japan','USA')
set(l,'Location','Best')
xlabel('MPG')
ylabel('Probability Density')
You could then further group the data and compare, for example, MPG by
year for American cars:
load carsmall
[WeiByYearOrig, Names] = fitdist(MPG,'weibull','by',...
{Origin Model_Year});
USA70 = WeiByYearOrig{1};
USA76 = WeiByYearOrig{2};
USA82 = WeiByYearOrig{3};
time = linspace(0,45);
pdf70 = pdf(USA70,time);
pdf76 = pdf(USA76,time);
pdf82 = pdf(USA82,time);
line(time,[pdf70;pdf76;pdf82])
l = legend('1970','1976','1982')
set(l,'Location','Best')
title('USA Car MPG by Year')
xlabel('MPG')
ylabel('Probability Density')
load carsmall;
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin);
[NormByOrig, Country] = fitdist(MPG,'normal','by',Origin);
[LogByOrig, Country] = fitdist(MPG,'logistic','by',Origin);
[KerByOrig, Country] = fitdist(MPG,'kernel','by',Origin);
Extract the fits for American cars and compare the fits visually against a
histogram of the original data:
WeiUSA = WeiByOrig{1};
NormUSA = NormByOrig{1};
LogUSA = LogByOrig{1};
KerUSA = KerByOrig{1};
You can see that only the nonparametric kernel distribution, KerUSA, comes
close to revealing the two modes in the data.
load carsmall;
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin);
[NormByOrig, Country] = fitdist(MPG,'normal','by',Origin);
[LogByOrig, Country] = fitdist(MPG,'logistic','by',Origin);
[KerByOrig, Country] = fitdist(MPG,'kernel','by',Origin);
Combine all four fits and the country labels into a single cell array, including
“headers” to indicate which distributions correspond to which objects. Then,
save the array to a .mat file:
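A sketch of one way to do this (the exact layout of the cell array is an assumption; only the variable name AllFits and the file name CarSmallFits are taken from the surrounding text):
AllFits = {'Weibull' 'Normal' 'Logistic' 'Kernel' 'Country';
           WeiByOrig NormByOrig LogByOrig KerByOrig Country};
save('CarSmallFits','AllFits')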
To show that the data is both safely saved and easily restored, clear your
workspace of relevant variables. This command clears only those variables
associated with this example:
clear('Weight','Acceleration','AllFits','Country',...
'Cylinders','Displacement','Horsepower','KerByOrig',...
'LogByOrig','MPG','Model','Model_Year','NormByOrig',...
'Origin','WeiByOrig')
load CarSmallFits
AllFits
You can now access the distributions objects as in the previous examples.
Probability Distributions Used for Multivariate Modeling
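The mixture parameters for this example are the same ones used later in this section; they are repeated here so that the call below is self-contained:
MU = [1 2;-3 -5];                      % component means
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);   % component covariance matrices
p = ones(1,2)/2;                       % equal mixing proportions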
obj = gmdistribution(MU,SIGMA,p);
properties = fieldnames(obj)
properties =
'NDimensions'
'DistName'
'NComponents'
'PComponents'
'mu'
'Sigma'
'NlogL'
'AIC'
'BIC'
'Converged'
'Iters'
'SharedCov'
'CovType'
'RegV'
dimension = obj.NDimensions
dimension =
2
name = obj.DistName
name =
gaussian mixture distribution
Use the methods pdf and cdf to compute values and visualize the object:
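For example (a sketch; the plotting ranges are illustrative):
ezsurf(@(x,y) pdf(obj,[x y]),[-8 6 -8 6])    % surface of the mixture pdf
figure
ezsurf(@(x,y) cdf(obj,[x y]),[-8 6 -8 6])    % surface of the mixture cdf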
Fitting a Model to Data. You can also create Gaussian mixture models
by fitting a parametric model with a specified number of components to
data. The fit method of the gmdistribution class uses the syntax obj =
gmdistribution.fit(X,k), where X is a data matrix and k is the specified
number of components. Choosing a suitable number of components k is
essential for creating a useful model of the data—too few components fails to
model the data accurately; too many components leads to an over-fit model
with singular covariance matrices.
First, create some data from a mixture of two bivariate Gaussian distributions
using the mvnrnd function:
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
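The rest of the data-generation step is a sketch consistent with the fitted results reported below (second component mean near [-3 -5] and covariance near the identity; 1000 points per component is an assumed sample size):
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000); mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')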
options = statset('Display','final');
obj = gmdistribution.fit(X,2,'Options',options);
hold on
h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
hold off
ComponentMeans = obj.mu
ComponentMeans =
0.9391 2.0322
-2.9823 -4.9737
ComponentCovariances = obj.Sigma
ComponentCovariances(:,:,1) =
1.7786 -0.0528
-0.0528 0.5312
ComponentCovariances(:,:,2) =
1.0491 -0.0150
-0.0150 0.9816
MixtureProportions = obj.PComponents
MixtureProportions =
0.5000 0.5000
AIC = zeros(1,4);
obj = cell(1,4);
for k = 1:4
obj{k} = gmdistribution.fit(X,k);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC);
numComponents
numComponents =
2
model = obj{2}
model =
Gaussian mixture distribution
with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.500000
Mean: 0.9391 2.0322
Component 2:
Mixing proportion: 0.500000
Mean: -2.9823 -4.9737
Both the Akaike information criterion (AIC) and the Bayes information criterion (BIC) are
negative log-likelihoods for the data with penalty terms for the number of
estimated parameters. You can use them to determine an appropriate number
of components for a model when the number of components is unspecified.
MU = [1 2;-3 -5];
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(MU,SIGMA,p);
Y = random(obj,1000);
scatter(Y(:,1),Y(:,2),10,'.')
Copulas
• “Determining Dependence Between Simulation Inputs” on page 5-108
• “Constructing Dependent Bivariate Distributions” on page 5-112
• “Using Rank Correlation Coefficients” on page 5-116
• “Using Bivariate Copulas” on page 5-119
• “Higher Dimension Copulas” on page 5-126
• “Archimedean Copulas” on page 5-128
• “Simulating Dependent Multivariate Data Using Copulas” on page 5-130
• “Example: Fitting Copulas to Data” on page 5-135
It can be difficult to generate random inputs with dependence when they have
distributions that are not from a standard multivariate distribution. Further,
some of the standard multivariate distributions can model only limited types
of dependence. It is always possible to make the inputs independent, and
while that is a simple choice, it is not always sensible and can lead to the
wrong conclusions.
n = 1000;
sigma = .5;
SigmaInd = sigma.^2 .* [1 0; 0 1]
SigmaInd =
0.25 0
0 0.25
ZInd = mvnrnd([0 0],SigmaInd,n);
XInd = exp(ZInd);
plot(XInd(:,1),XInd(:,2),'.')
axis([0 5 0 5])
axis equal
xlabel('X1')
ylabel('X2')
rho = .7;
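The dependent counterpart of the earlier construction is a sketch that mirrors the independent case, with rho on the off-diagonal of the covariance:
SigmaDep = sigma.^2 .* [1 rho; rho 1];
ZDep = mvnrnd([0 0],SigmaDep,n);
XDep = exp(ZDep);      % bivariate lognormal values with dependence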
A second scatter plot demonstrates the difference between these two bivariate
distributions:
plot(XDep(:,1),XDep(:,2),'.')
axis([0 5 0 5])
axis equal
xlabel('X1')
ylabel('X2')
It is clear that there is a tendency in the second data set for large values of
X1 to be associated with large values of X2, and similarly for small values.
The correlation parameter, ρ, of the underlying bivariate normal determines
this dependence. The conclusions drawn from the simulation could well
depend on whether you generate X1 and X2 with dependence. The bivariate
lognormal distribution is a simple solution in this case; it easily generalizes
to higher dimensions in cases where the marginal distributions are different
lognormals.
n = 1000;
z = normrnd(0,1,n,1);
hist(z,-3.75:.5:3.75)
xlim([-4 4])
title('1000 Simulated N(0,1) Random Values')
xlabel('Z')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
u = normcdf(z);
hist(u,.05:.1:.95)
title('1000 Simulated N(0,1) Values Transformed to Unif(0,1)')
xlabel('U')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
x = gaminv(u,2,1);
hist(x,.25:.5:9.75)
title('1000 Simulated N(0,1) Values Transformed to Gamma(2,1)')
xlabel('X')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
$$Z = [Z_1, Z_2] \sim N\!\left([0,\,0],\; \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)$$
$$U = \left[\Phi(Z_1),\, \Phi(Z_2)\right]$$
$$X = \left[G_1(U_1),\, G_2(U_2)\right]$$
where G1 and G2 are inverse cdfs of two possibly different distributions. For
example, the following generates random vectors from a bivariate distribution
with t5 and Gamma(2,1) marginals:
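The generation code itself is a sketch of the construction described above (the assignment of the Gamma marginal to the first column and the t marginal to the second is an assumption):
n = 1000;
rho = .7;
Z = mvnrnd([0 0],[1 rho; rho 1],n);
U = normcdf(Z);                              % transform to Unif(0,1)
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];     % Gamma(2,1) and t5 marginals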
scatterhist(X(:,1),X(:,2),'Direction','out')
This plot has histograms alongside a scatter plot to show both the marginal
distributions, and the dependence.
For the bivariate lognormal construction above, the linear correlation between X1 and X2 is
$$\operatorname{cor}(X_1, X_2) = \frac{e^{\rho\sigma^2} - 1}{e^{\sigma^2} - 1}$$
which is strictly less than ρ, unless ρ is exactly 1. In more general cases such
as the Gamma/t construction, the linear correlation between X1 and X2 is
difficult or impossible to express in terms of ρ, but simulations show that the
same effect happens.
For a bivariate normal, Kendall’s τ and Spearman’s $\rho_s$ are related to the linear correlation ρ by
$$\tau = \frac{2}{\pi}\arcsin(\rho) \quad\text{or}\quad \rho = \sin\!\left(\frac{\tau\,\pi}{2}\right)$$
$$\rho_s = \frac{6}{\pi}\arcsin\!\left(\frac{\rho}{2}\right) \quad\text{or}\quad \rho = 2\sin\!\left(\frac{\rho_s\,\pi}{6}\right)$$
rho = -1:.01:1;
tau = 2.*asin(rho)./pi;
rho_s = 6.*asin(rho./2)./pi;
plot(rho,tau,'b-','LineWidth',2)
hold on
plot(rho,rho_s,'g-','LineWidth',2)
plot([-1 1],[-1 1],'k:','LineWidth',2)
axis([-1 1 -1 1])
xlabel('rho')
ylabel('Rank correlation coefficient')
legend('Kendall''s {\it\tau}', ...
'Spearman''s {\it\rho_s}', ...
'location','NW')
Thus, it is easy to create the desired rank correlation between X1 and X2,
regardless of their marginal distributions, by choosing the correct ρ parameter
value for the linear correlation between Z1 and Z2.
For example, use the copularnd function to create scatter plots of random
values from a bivariate Gaussian copula for various levels of ρ, to illustrate the
range of different dependence structures. The family of bivariate Gaussian
copulas is parameterized by the linear correlation matrix:
$$\mathrm{P} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$$
n = 500;
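A sketch for one level of ρ (the value 0.8 is illustrative; the original figure showed several levels):
rho = .8;
U = copularnd('Gaussian',[1 rho; rho 1],n);
plot(U(:,1),U(:,2),'.')
xlabel('U1')
ylabel('U2')
title(['{\it\rho} = ',num2str(rho)])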
u1 = linspace(1e-3,1-1e-3,50);
u2 = linspace(1e-3,1-1e-3,50);
[U1,U2] = meshgrid(u1,u2);
Rho = [1 .8; .8 1];
f = copulapdf('t',[U1(:) U2(:)],Rho,5);
f = reshape(f,size(U1));
surf(u1,u2,log(f),'FaceColor','interp','EdgeColor','none')
view([-15,20])
xlabel('U1')
ylabel('U2')
zlabel('Probability Density')
u1 = linspace(1e-3,1-1e-3,50);
u2 = linspace(1e-3,1-1e-3,50);
[U1,U2] = meshgrid(u1,u2);
F = copulacdf('t',[U1(:) U2(:)],Rho,5);
F = reshape(F,size(U1));
surf(u1,u2,F,'FaceColor','interp','EdgeColor','none')
view([-15,20])
xlabel('U1')
ylabel('U2')
zlabel('Cumulative Probability')
For example, use the copularnd function to create scatter plots of random
values from a bivariate t1 copula for various levels of ρ, to illustrate the range
of different dependence structures:
n = 500;
nu = 1;
n = 1000;
rho = .7;
nu = 1;
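The values plotted by the scatterhist call below come from a t copula; the generation step is a sketch, and the choice of the same Gamma(2,1) and t5 marginals used earlier is an assumption:
U = copularnd('t',[1 rho; rho 1],nu,n);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];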
scatterhist(X(:,1),X(:,2),'Direction','out')
n = 1000;
Rho = [1 .4 .2; .4 1 -.8; .2 -.8 1];
U = copularnd('Gaussian',Rho,n);
X = [gaminv(U(:,1),2,1) betainv(U(:,2),2,2) tinv(U(:,3),5)];
subplot(1,1,1)
plot3(X(:,1),X(:,2),X(:,3),'.')
grid on
view([-55, 15])
xlabel('X1')
ylabel('X2')
zlabel('X3')
Notice that the relationship between the linear correlation parameter ρ and,
for example, Kendall’s τ, holds for each entry in the correlation matrix P
used here. You can verify that the sample rank correlations of the data are
approximately equal to the theoretical values:
tauTheoretical = 2.*asin(Rho)./pi
tauTheoretical =
1 0.26198 0.12819
0.26198 1 -0.59033
0.12819 -0.59033 1
tauSample = corr(X,'type','Kendall')
tauSample =
1 0.27254 0.12701
0.27254 1 -0.58182
0.12701 -0.58182 1
Archimedean Copulas
Statistics Toolbox functions are available for three bivariate Archimedean
copula families:
• Clayton copulas
• Frank copulas
• Gumbel copulas
These are one-parameter families that are defined directly in terms of their
cdfs, rather than being defined constructively using a standard multivariate
distribution.
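The rank correlation tau used below is not defined in the preceding text; a value consistent with the Clayton parameter shown (alpha = 2.882) is Kendall’s τ for ρ = 0.8, so the following is a sketch:
rho = .8;                 % assumed linear correlation carried over from the copula examples
tau = 2*asin(rho)/pi      % Kendall's tau, approximately 0.59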
alpha = copulaparam('Clayton',tau,'type','kendall')
alpha =
2.882
Finally, plot a random sample from the Clayton copula with copularnd.
Repeat the same procedure for the Frank and Gumbel copulas:
n = 500;
U = copularnd('Clayton',alpha,n);
subplot(3,1,1)
plot(U(:,1),U(:,2),'.');
title(['Clayton Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
alpha = copulaparam('Frank',tau,'type','kendall');
U = copularnd('Frank',alpha,n);
subplot(3,1,2)
plot(U(:,1),U(:,2),'.')
title(['Frank Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
alpha = copulaparam('Gumbel',tau,'type','kendall');
U = copularnd('Gumbel',alpha,n);
subplot(3,1,3)
plot(U(:,1),U(:,2),'.')
title(['Gumbel Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
Suppose you have return data for two stocks and want to run a Monte Carlo
simulation with inputs that follow the same distributions as the data:
load stockreturns
nobs = size(stocks,1);
subplot(2,1,1)
hist(stocks(:,1),10)
xlim([-3.5 3.5])
xlabel('X1')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
subplot(2,1,2)
hist(stocks(:,2),10)
xlim([-3.5 3.5])
xlabel('X2')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
You could fit a parametric model separately to each dataset, and use those
estimates as the marginal distributions. However, a parametric model may
not be sufficiently flexible. Instead, you can use a nonparametric model
to transform to the marginal distributions. All that is needed is a way to
compute the inverse cdf for the nonparametric model.
[Fi,xi] = ecdf(stocks(:,1));
stairs(xi,Fi,'b','LineWidth',2)
hold on
Fi_sm = ksdensity(stocks(:,1),xi,'function','cdf','width',.15);
plot(xi,Fi_sm,'r-','LineWidth',1.5)
xlabel('X1')
ylabel('Cumulative Probability')
legend('Empirical','Smoothed','Location','NW')
grid on
For the correlation parameter, you can compute the rank
correlation of the data, and then find the corresponding linear correlation
parameter for the t copula using copulaparam:
nu = 5;
tau = corr(stocks(:,1),stocks(:,2),'type','kendall')
tau =
0.51798
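The conversion step described above might look like this sketch (the exact call is an assumption, following the copulaparam pattern used elsewhere in this chapter):
rho = copulaparam('t',tau,nu,'type','kendall')   % linear correlation for the t copula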
Next, use copularnd to generate random values from the t copula and
transform using the nonparametric inverse cdfs. The ksdensity function
allows you to make a kernel estimate of distribution and evaluate the inverse
cdf at the copula points all in one step:
n = 1000;
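A sketch of that one-step approach (the bandwidth 0.15 matches the value used earlier for these data):
U = copularnd('t',[1 rho; rho 1],nu,n);
X1 = ksdensity(stocks(:,1),U(:,1),'function','icdf','width',.15);
X2 = ksdensity(stocks(:,2),U(:,2),'function','icdf','width',.15);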
Alternatively, when you have a large amount of data or need to simulate more
than one set of values, it may be more efficient to compute the inverse cdf
over a grid of values in the interval (0,1) and use interpolation to evaluate it
at the copula points:
p = linspace(0.00001,0.99999,1000);
G1 = ksdensity(stocks(:,1),p,'function','icdf','width',0.15);
X1 = interp1(p,G1,U(:,1),'spline');
G2 = ksdensity(stocks(:,2),p,'function','icdf','width',0.15);
X2 = interp1(p,G2,U(:,2),'spline');
scatterhist(X1,X2,'Direction','out')
The marginal histograms of the simulated data are a smoothed version of the
histograms for the original data. The amount of smoothing is controlled by
the bandwidth input to ksdensity.
load stockreturns
x = stocks(:,1);
y = stocks(:,2);
scatterhist(x,y,'Direction','out')
Transform the data to the copula scale (unit square) using a kernel estimator
of the cumulative distribution function:
u = ksdensity(x,x,'function','cdf');
v = ksdensity(y,y,'function','cdf');
scatterhist(u,v,'Direction','out')
xlabel('u')
ylabel('v')
Fit a t copula:
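The fitting call is not shown; a sketch using copulafit, which returns the Rho and nu used below:
[Rho,nu] = copulafit('t',[u v],'Method','ApproximateML')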
r = copularnd('t',Rho,nu,1000);
u1 = r(:,1);
v1 = r(:,2);
scatterhist(u1,v1,'Direction','out')
xlabel('u')
ylabel('v')
set(get(gca,'children'),'marker','.')
Transform the random sample back to the original scale of the data:
x1 = ksdensity(x,u1,'function','icdf');
y1 = ksdensity(y,v1,'function','icdf');
scatterhist(x1,y1,'Direction','out')
set(get(gca,'children'),'marker','.')
6 Random Number Generation
Random number generators (RNGs) like those in MATLAB are algorithms for
generating pseudorandom numbers with a specified distribution.
For more information on the GUI for generating random numbers from
supported distributions, see “Visually Exploring Random Number Generation”
on page 5-49.
Common Generation Methods
Direct Methods
Direct methods directly use the definition of the distribution.
function X = directbinornd(N,p,m,n)
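% The body of directbinornd is a sketch based on the definition of a
% binomial random variable (the count of successes in N independent
% Bernoulli(p) trials); it is not necessarily the original listing.
X = zeros(m,n);                % preallocate the output array
for i = 1:m*n
    u = rand(N,1);             % N uniform draws for one binomial variate
    X(i) = sum(u < p);         % count the successes
end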
For example:
X = directbinornd(100,0.3,1e4,1);
hist(X,101)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The Statistics Toolbox function binornd uses a modified direct method, based
on the definition of a binomial random variable as the sum of Bernoulli
random variables.
You can easily convert the previous method to a random number generator
for the Poisson distribution with parameter λ. The Poisson distribution is
the limiting case of the binomial distribution as N approaches infinity, p
approaches zero, and Np is held fixed at λ. To generate Poisson random
numbers, create a version of the previous generator that inputs λ rather than
N and p, and internally sets N to some large number and p to λ/N.
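A sketch of that modification (the internal value of N is an arbitrary large number):
function X = directpoissrnd(lambda,m,n)
N = 1e4;                           % a large number of trials
X = directbinornd(N,lambda/N,m,n);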
The Statistics Toolbox function poissrnd actually uses two direct methods:
Inversion Methods
Inversion methods are based on the observation that continuous cumulative
distribution functions (cdfs) range uniformly over the interval (0,1). If u is a
uniform random number on (0,1), then X = F⁻¹(u) generates a random
number X from a continuous distribution with specified cdf F.
For example, the following code generates random numbers from a specific
exponential distribution using the inverse cdf and the MATLAB uniform
random number generator rand:
mu = 1;
X = expinv(rand(1e4,1),mu);
Compare the distribution of the generated random numbers to the pdf of the
specified exponential by scaling the pdf to the area of the histogram used
to display the distribution:
numbins = 50;
hist(X,numbins)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
[bincounts,binpositions] = hist(X,numbins);
binwidth = binpositions(2) - binpositions(1);
histarea = binwidth*sum(bincounts);
x = binpositions(1):0.001:binpositions(end);
y = exppdf(x,mu);
plot(x,histarea*y,'r','LineWidth',2)
function X = discreteinvrnd(p,m,n)
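% Body sketch (an assumption, not necessarily the original listing): invert
% the discrete cdf by comparing a uniform draw with the cumulative sums of p.
X = zeros(m,n);                       % preallocate
for i = 1:m*n
    X(i) = find(rand < cumsum(p),1);  % first outcome whose cdf exceeds the draw
end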
Use the function to generate random numbers from any discrete distribution:
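For example (an illustrative distribution over the outcomes 1 through 4):
p = [0.1 0.2 0.3 0.4];          % probabilities of outcomes 1,2,3,4
X = discreteinvrnd(p,1e4,1);
hist(X,1:4)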
Acceptance-Rejection Methods
The functional form of some distributions makes it difficult or time-consuming
to generate random numbers using direct or inversion methods.
Acceptance-rejection methods provide an alternative in these cases.
1 Chooses a density g that is easy to sample from, together with a constant c
such that c·g(x) ≥ f(x) for all x, where f is the target pdf.
2 Generates a candidate value v from g and a uniform random number u on (0,1).
3 Accepts v as a draw from f if c·u·g(v) ≤ f(v); otherwise rejects v and repeats step 2.
function X = accrejrnd(f,g,grnd,c,m,n)
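% Body sketch (an assumption, not necessarily the original listing): keep
% proposing v from g until c*u*g(v) <= f(v), i.e., u <= f(v)/(c*g(v)).
X = zeros(m,n);                  % preallocate
for i = 1:m*n
    accept = false;
    while ~accept
        u = rand;
        v = grnd();
        if c*u*g(v) <= f(v)
            X(i) = v;
            accept = true;
        end
    end
end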
For example, the function f(x) = xe^(−x²/2) satisfies the conditions for a pdf on [0,∞)
(nonnegative and integrates to 1). The exponential pdf with mean 1, g(x) = e^(−x),
scaled by a constant c, dominates f for c greater than about 2.2. Thus, you can use rand and exprnd
to generate random numbers from f:
f = @(x)x.*exp(-(x.^2)/2);
g = @(x)exp(-x);
grnd = @()exprnd(1);
X = accrejrnd(f,g,grnd,2.2,1e4,1);
Y = raylrnd(1,1e4,1);
hist([X Y])
h = get(gca,'Children');
set(h(1),'FaceColor',[.8 .8 1])
legend('A-R RNG','Rayleigh RNG')
Representing Sampling Distributions Using Markov Chain Samplers
1 Choose a starting value x(0) and a proposal distribution q(y | x).
2 At step t, generate a candidate point y(t) from the proposal distribution q(y | x(t)).
3 Accept y(t) as the next sample x(t + 1) with probability r(x(t),y(t)), and keep
x(t) as the next sample x(t + 1) with probability 1 – r(x(t),y(t)), where:
$$r(x, y) = \min\left\{\frac{f(y)\,q(x \mid y)}{f(x)\,q(y \mid x)},\; 1\right\}$$
4 Increment t → t+1, and repeat steps 2 and 3 until you get the desired
number of samples.
1 Choose a starting value x(t) for which f(x(t)) > 0.
2 Draw a real value y uniformly from (0, f(x(t))), thereby defining a horizontal
“slice” as S = {x: y < f(x)}.
3 Find an interval I = (L, R) around x(t) that contains all, or much of the
“slice” S.
4 Draw the new sample x(t + 1) uniformly from the part of the interval I that
lies within the slice S.
5 Increment t → t+1 and repeat steps 2 through 4 until you get the desired
number of samples.
Generate random numbers using the slice sampling method with the
slicesample function.
Generating Quasi-Random Numbers
Quasi-Random Sequences
Quasi-random number generators (QRNGs) produce highly uniform samples
of the unit hypercube. QRNGs minimize the discrepancy between the
distribution of generated points and a distribution with equal proportions of
points in each sub-cube of a uniform partition of the hypercube. As a result,
QRNGs systematically fill the “holes” in any initial segment of the generated
quasi-random sequence.
• Skip — A Skip value specifies the number of initial points to ignore. In this
example, set the Skip value to 2. The sequence is now 5,7,9,2,4,6,8,10
and the first three points are [5,7,9]:
• Leap — A Leap value specifies the number of points to ignore for each one
you take. Continuing the example with the Skip set to 2, if you set the Leap
to 1, the sequence uses every other point. In this example, the sequence is
now 5,9,4,8 and the first three points are [5,9,4]:
Quasi-random sequences are functions from the positive integers to the unit
hypercube. To be useful in application, an initial point set of a sequence must
be generated. Point sets are matrices of size n-by-d, where n is the number of
points and d is the dimension of the hypercube being sampled. The functions
haltonset and sobolset construct point sets with properties of a specified
quasi-random sequence. Initial segments of the point sets are generated by
the net method of the qrandset class (parent class of the haltonset class
and sobolset class), but points can be generated and accessed more generally
using parenthesis indexing.
p = haltonset(2,'Skip',1e3,'Leap',1e2)
p =
Halton point set in 2 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
p = scramble(p,'RR2')
p =
Halton point set in 2 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : RR2
X0 = net(p,500);
X0 = p(1:500,:);
Values of the point set X0 are not generated and stored in memory until you
access p using net or parenthesis indexing.
scatter(X0(:,1),X0(:,2),5,'r')
axis square
title('{\bf Quasi-Random Scatter}')
X = rand(500,2);
scatter(X(:,1),X(:,2),5,'b')
axis square
title('{\bf Uniform Random Scatter}')
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = rand(sampSize,1);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
hist(PVALS,100)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('{\it p}-values')
ylabel('Number of Tests')
The results are quite different when the test is performed repeatedly on
uniform quasi-random samples:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = p(test:test+(sampSize-1),:);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
hist(PVALS,100)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('{\it p}-values')
ylabel('Number of Tests')
Small p-values call into question the null hypothesis that the data are
uniformly distributed. If the hypothesis is true, about 5% of the p-values are
expected to fall below 0.05. The results are remarkably consistent in their
failure to challenge the hypothesis.
Quasi-Random Streams
Quasi-random streams, produced by the qrandstream function, are used
to generate sequential quasi-random outputs, rather than point sets of a
specific size. Streams are used like pseudoRNGs, such as rand, when client
applications require a source of quasi-random numbers of indefinite size that
can be accessed intermittently. Properties of a quasi-random stream, such
as its type (Halton or Sobol), dimension, skip, leap, and scramble, are set
when the stream is constructed.
For example, the Kolmogorov-Smirnov loop from the previous section, written using a point set:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = p(test:test+(sampSize-1),:);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
Alternatively, a quasi-random stream and its qrand method provide the sample points sequentially:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
q = qrandstream(p)
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
X = qrand(q,sampSize);
[h,pval] = kstest(X,[X,X]);
PVALS(test) = pval;
end
Generating Data Using Flexible Families of Distributions
Data Input
Four parameters define each member of the Pearson and Johnson
systems: the mean, standard deviation, skewness, and kurtosis.
These statistics can also be computed with the moment function. The Johnson
system, while based on these four parameters, is more naturally described
using quantiles, estimated by the quantile function.
load carbig
MPG = MPG(~isnan(MPG));
[n,x] = hist(MPG,15);
bar(x,n)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The following two sections model the distribution with members of the
Pearson and Johnson systems, respectively.
Given sample values for each of these moments from data, it is easy to find the distribution
in the Pearson system that matches these four moments and to generate a
random sample.
For a given set of moments, there are distributions that are not in the system
that also have those same first four moments, and the distribution in the
Pearson system may not be a good match to your data, particularly if the
data are multimodal. But the system does cover a wide range of distribution
shapes, including both symmetric and skewed distributions.
moments = {mean(MPG),std(MPG),skewness(MPG),kurtosis(MPG)};
[r,type] = pearsrnd(moments{:},10000,1);
The optional second output from pearsrnd indicates which type of distribution
within the Pearson system matches the combination of moments.
type
type =
1
In this case, pearsrnd has determined that the data are best described with a
Type I Pearson distribution, which is a shifted, scaled beta distribution.
Verify that the sample resembles the original data by overlaying the empirical
cumulative distribution functions.
ecdf(MPG);
[Fi,xi] = ecdf(r);
hold on, stairs(xi,Fi,'r'); hold off
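As an additional check (a sketch using the variables above), compare the first four sample moments of the generated sample r with those of the original data:

momentNames = {'mean','std','skewness','kurtosis'};
mpgMoments = [mean(MPG) std(MPG) skewness(MPG) kurtosis(MPG)];
rMoments   = [mean(r)   std(r)   skewness(r)   kurtosis(r)];
for i = 1:4
    fprintf('%10s   MPG: %8.4f   r: %8.4f\n', momentNames{i}, ...
        mpgMoments(i), rMoments(i));
end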
X = γ + δ·Γ((Z − ξ)/λ)
To generate a sample from the Johnson distribution that matches the MPG
data, first define the four quantiles to which the four evenly spaced standard
normal quantiles of -1.5, -0.5, 0.5, and 1.5 should be transformed. That is, you
compute the sample quantiles of the data for the cumulative probabilities of
0.067, 0.309, 0.691, and 0.933.
probs = normcdf([-1.5 -0.5 0.5 1.5]);
quantiles = quantile(MPG,probs)
quantiles =
13.0000 18.0000 27.2000 36.0000
[r1,type] = johnsrnd(quantiles,10000,1);
The optional second output from johnsrnd indicates which type of distribution
within the Johnson system matches the quantiles.
type
type =
SB
You can verify that the sample resembles the original data by overlaying the
empirical cumulative distribution functions.
ecdf(MPG);
[Fi,xi] = ecdf(r1);
hold on, stairs(xi,Fi,'r'); hold off
However, while the new sample matches the original data better in the right
tail, it matches much worse in the left tail.
[Fj,xj] = ecdf(r2);
hold on, stairs(xj,Fj,'g'); hold off
7
Hypothesis Tests
Introduction
Hypothesis testing is a common method of drawing inferences about a
population based on statistical evidence from a sample.
Sample averages differ from one another due to chance variability in the
selection process. Suppose your sample average comes out to be $1.18. Is the
$0.03 difference an artifact of random sampling or significant evidence that
the average price of a gallon of gas was in fact greater than $1.15? Hypothesis
testing is a statistical method for making such decisions.
Hypothesis Test Terminology
Hypothesis Test Assumptions
For example, the z-test (ztest) and the t-test (ttest) both assume that
the data are independently sampled from a normal distribution. Statistics
Toolbox functions are available for testing this assumption, such as chi2gof,
jbtest, lillietest, and normplot.
Both the z-test and the t-test are relatively robust with respect to departures
from this assumption, so long as the sample size n is large enough. Both
tests compute a sample mean x̄, which, by the Central Limit Theorem,
has an approximately normal sampling distribution with mean equal to the
population mean μ, regardless of the population distribution being sampled.
The difference between the z-test and the t-test is in the assumption of the
standard deviation σ of the underlying normal distribution. A z-test assumes
that σ is known; a t-test does not. As a result, a t-test must compute an
estimate s of the standard deviation from the sample.
Test statistics for the z-test and the t-test are, respectively,
z = (x̄ − μ) / (σ/√n)

t = (x̄ − μ) / (s/√n)
Under the null hypothesis that the population is distributed with mean μ, the
z-statistic has a standard normal distribution, N(0,1). Under the same null
hypothesis, the t-statistic has Student’s t distribution with n – 1 degrees of
freedom. For small sample sizes, Student’s t distribution is flatter and wider
than N(0,1), compensating for the decreased confidence in the estimate s.
As sample size increases, however, Student’s t distribution approaches the
standard normal distribution, and the two tests become essentially equivalent.
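The following sketch, using a simulated sample and illustrative values of mu0 and sigma (not data from this guide), computes both statistics directly and compares them with the values returned by ztest and ttest:

mu0 = 1.15; sigma = 0.04; n = 20;
x = mu0 + sigma*randn(n,1);                   % simulated sample
zByHand = (mean(x) - mu0)/(sigma/sqrt(n));    % z-statistic, sigma known
tByHand = (mean(x) - mu0)/(std(x)/sqrt(n));   % t-statistic, sigma estimated by s
[~,~,~,zval]   = ztest(x,mu0,sigma);          % zval equals zByHand
[~,~,~,tstats] = ttest(x,mu0);                % tstats.tstat equals tByHand
[zByHand zval; tByHand tstats.tstat]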
Knowing the distribution of the test statistic under the null hypothesis allows
for accurate calculation of p-values. Interpreting p-values in the context of
the test assumptions allows for critical analysis of test results.
Example: Hypothesis Testing
load gas
prices = [price1 price2];
As a first step, you might want to test the assumption that the samples come
from normal distributions.
normplot(prices)
Both scatters approximately follow straight lines through the first and third
quartiles of the samples, indicating approximate normal distributions.
The February sample (the right-hand line) shows a slight departure from
normality in the lower tail. A shift in the mean from January to February is
evident.
A hypothesis test is used to quantify the test of normality. Since each sample
is relatively small, a Lilliefors test is recommended.
lillietest(price1)
ans =
0
lillietest(price2)
ans =
0

In each case the returned value of 0 indicates that the test cannot reject the null hypothesis that the sample is normally distributed, at the default 5% significance level. Next, compute the sample means:
sample_means = mean(prices)
sample_means =
115.1500 118.5000
You might want to test the null hypothesis that the mean price across the
state on the day of the January sample was $1.15. If you know that the
standard deviation in prices across the state has historically, and consistently,
been $0.04, then a z-test is appropriate.
[h,pvalue,ci] = ztest(price1/100,1.15,0.04)
h =
0
pvalue =
0.8668
ci =
1.1340
1.1690
Does the later sample offer stronger evidence for rejecting a null hypothesis
of a state-wide average price of $1.15 in February? The shift shown in the
probability plot and the difference in the computed sample means suggest
this. The shift might indicate a significant fluctuation in the market, raising
questions about the validity of using the historical standard deviation. If a
known standard deviation cannot be assumed, a t-test is more appropriate.
[h,pvalue,ci] = ttest(price2/100,1.15)
h =
1
pvalue =
4.9517e-004
ci =
1.1675
1.2025
You might want to investigate the shift in prices a little more closely.
The function ttest2 tests if two independent samples come from normal
distributions with equal but unknown standard deviations and the same
mean, against the alternative that the means are unequal.
[h,sig,ci] = ttest2(price1,price2)
h =
1
sig =
0.0083
ci =
-5.7845
-0.9155
The confidence interval on the difference of the means does not contain zero, confirming that the two means are significantly different. A box plot helps to visualize the difference:
boxplot(prices,1)
set(gca,'XTick',[1 2])
set(gca,'XtickLabel',{'January','February'})
xlabel('Month')
ylabel('Prices ($0.01)')
The plot displays the distribution of the samples around their medians. The
heights of the notches in each box are computed so that the side-by-side
boxes have nonoverlapping notches when their medians are different at a
default 5% significance level. The computation is based on an assumption
of normality in the data, but the comparison is reasonably robust for other
distributions. The side-by-side plots provide a kind of visual hypothesis test,
comparing medians rather than means. The plot above appears to barely
reject the null hypothesis of equal medians.
You can test the hypothesis of equal medians formally with the Wilcoxon rank sum test:
[p,h] = ranksum(price1,price2)
p =
0.0095
h =
1
The test rejects the null hypothesis of equal medians at the default 5%
significance level.
Available Hypothesis Tests
Function Description
ranksum Wilcoxon rank sum test. Tests if two independent
samples come from identical continuous distributions
with equal medians, against the alternative that they
do not have equal medians.
runstest Runs test. Tests if a sequence of values comes in
random order, against the alternative that the ordering
is not random.
signrank One-sample or paired-sample Wilcoxon signed rank test.
Tests if a sample comes from a continuous distribution
symmetric about a specified median, against the
alternative that it does not have that median.
signtest One-sample or paired-sample sign test. Tests if a
sample comes from an arbitrary continuous distribution
with a specified median, against the alternative that it
does not have that median.
ttest One-sample or paired-sample t-test. Tests if a sample
comes from a normal distribution with unknown
variance and a specified mean, against the alternative
that it does not have that mean.
ttest2 Two-sample t-test. Tests if two independent samples
come from normal distributions with unknown but
equal (or, optionally, unequal) variances and the same
mean, against the alternative that the means are
unequal.
vartest One-sample chi-square variance test. Tests if a sample
comes from a normal distribution with specified
variance, against the alternative that it comes from a
normal distribution with a different variance.
vartest2 Two-sample F-test for equal variances. Tests if two
independent samples come from normal distributions
with the same variance, against the alternative that
they come from normal distributions with different
variances.
vartestn Bartlett multiple-sample test for equal variances. Tests
if multiple samples come from normal distributions
with the same variance, against the alternative that
they come from normal distributions with different
variances.
ztest One-sample z-test. Tests if a sample comes from a
normal distribution with known variance and specified
mean, against the alternative that it does not have that
mean.
8
Analysis of Variance
Introduction
Analysis of variance (ANOVA) is a procedure for assigning sample variance to
different sources and deciding whether the variation arises within or among
different population groups. Samples are described in terms of variation
around group means and variation of group means around an overall mean. If
variations within groups are small relative to variations between groups, a
difference in group means may be inferred. The methods of Chapter 7, “Hypothesis Tests”, are used to quantify such decisions.
This chapter treats ANOVA among groups, that is, among categorical
predictors. ANOVA for regression, with continuous predictors, is discussed in
“Tabulating Diagnostic Statistics” on page 9-13.
ANOVA
In this section...
“One-Way ANOVA” on page 8-3
“Two-Way ANOVA” on page 8-9
“N-Way ANOVA” on page 8-12
“Other ANOVA Models” on page 8-26
“Analysis of Covariance” on page 8-27
“Nonparametric Methods” on page 8-35
One-Way ANOVA
• “Introduction” on page 8-3
• “Example: One-Way ANOVA” on page 8-4
• “Multiple Comparisons” on page 8-6
• “Example: Multiple Comparisons” on page 8-7
Introduction
The purpose of one-way ANOVA is to find out whether data from several
groups have a common mean. That is, to determine whether the groups are
actually different in the measured characteristic.
One-way ANOVA is a simple special case of the linear model. The one-way
ANOVA form of the model is
yij = α.j + εij
where:
• α.j is a matrix whose columns are the group means. (The “dot j” notation
means that α applies to all rows of column j. That is, the value αij is the
same for all i.)
• εij is a matrix of random disturbances.
The model assumes that the columns of y are a constant plus a random
disturbance. You want to know if the constants are all the same.
load hogg
hogg
hogg =
24 14 11 7 19
15 7 9 7 24
21 12 7 4 19
27 17 13 7 15
33 14 12 12 10
23 16 18 18 20
[p,tbl,stats] = anova1(hogg);
p
p =
1.1971e-04
The standard ANOVA table has columns for the sums of squares, degrees of
freedom, mean squares (SS/df), F statistic, and p value.
You can use the F statistic to do a hypothesis test to find out if the bacteria
counts are the same. anova1 returns the p value from this hypothesis test.
In this case the p value is about 0.0001, a very small value. This is a strong
indication that the bacteria counts from the different shipments are not the
same. An F statistic as extreme as the observed F would occur by chance only
once in 10,000 times if the counts were truly equal.
You can get some graphical assurance that the means are different by
looking at the box plots in the second figure window displayed by anova1.
Note, however, that the notches are used for a comparison of medians, not a
comparison of means. For more information on this display, see “Box Plots”
on page 4-6.
Multiple Comparisons
Sometimes you need to determine not just whether there are any differences
among the means, but specifically which pairs of means are significantly
different. It is tempting to perform a series of t tests, one for each pair of
means, but this procedure has a pitfall.
In this example there are five means, so there are 10 pairs of means to
compare. It stands to reason that if all the means are the same, and if there is
a 5% chance of incorrectly concluding that there is a difference in one pair,
then the probability of making at least one incorrect conclusion among all 10
pairs is much larger than 5%.
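A rough calculation illustrates the size of the problem; the pairwise tests are not actually independent, so this is only a sketch of the effect:

alpha = 0.05;
nPairs = nchoosek(5,2);                    % 10 pairs among 5 group means
pAtLeastOneError = 1 - (1 - alpha)^nPairs  % about 0.40 if the tests were independent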
load hogg
[p,tbl,stats] = anova1(hogg);
[c,m] = multcompare(stats)
c =
1.0000 2.0000 2.4953 10.5000 18.5047
1.0000 3.0000 4.1619 12.1667 20.1714
1.0000 4.0000 6.6619 14.6667 22.6714
1.0000 5.0000 -2.0047 6.0000 14.0047
2.0000 3.0000 -6.3381 1.6667 9.6714
2.0000 4.0000 -3.8381 4.1667 12.1714
2.0000 5.0000 -12.5047 -4.5000 3.5047
3.0000 4.0000 -5.5047 2.5000 10.5047
3.0000 5.0000 -14.1714 -6.1667 1.8381
4.0000 5.0000 -16.6714 -8.6667 -0.6619
m =
23.8333 1.9273
13.3333 1.9273
11.6667 1.9273
9.1667 1.9273
17.8333 1.9273
The first output from multcompare has one row for each pair of groups, with
an estimate of the difference in group means and a confidence interval for that
group. For example, the second row has the values

1.0000 3.0000 4.1619 12.1667 20.1714

indicating that the mean of group 1 minus the mean of group 3 is estimated to be 12.1667, and that a 95% confidence interval for this difference is [4.1619, 20.1714]. This interval does not contain 0, so you can conclude that the means of groups 1 and 3 are different.
The second output contains the mean and its standard error for each group.
There are five groups. The graph instructs you to Click on the group you want to test. Three groups have means significantly different from group one.
The graph shows that group 1 is significantly different from groups 2, 3, and
4. By using the mouse to select group 4, you can determine that it is also
significantly different from group 5. Other pairs are not significantly different.
Two-Way ANOVA
• “Introduction” on page 8-9
• “Example: Two-Way ANOVA” on page 8-10
Introduction
The purpose of two-way ANOVA is to find out whether data from several
groups have a common mean. One-way ANOVA and two-way ANOVA differ
in that the groups in two-way ANOVA have two categories of defining
characteristics instead of one.
Suppose an automobile company has two factories, and each factory makes
the same three models of car. It is reasonable to ask if the gas mileage in the
cars varies from factory to factory as well as from model to model. There are
two predictors, factory and model, to explain differences in mileage.
Finally, a factory might make high mileage cars in one model (perhaps
because of a superior production line), but not be different from the other
factory for other models. This effect is called an interaction. It is impossible
to detect an interaction unless there are duplicate observations for some
combination of factory and car model.
Two-way ANOVA is a special case of the linear model. The two-way ANOVA
form of the model is
yijk = μ + α.j + βi. + γij + εijk

where:
• yijk is a matrix of gas mileage observations (with row index i, column index
j, and repetition index k).
• μ is a constant matrix of the overall mean gas mileage.
• α.j is a matrix whose columns are the deviations of each car’s gas mileage
(from the mean gas mileage μ) that are attributable to the car’s model. All
values in a given column of α.j are identical, and the values in each row of
α.j sum to 0.
• βi. is a matrix whose rows are the deviations of each car’s gas mileage
(from the mean gas mileage μ) that are attributable to the car’s factory. All
values in a given row of βi. are identical, and the values in each column
of βi. sum to 0.
• γij is a matrix of interactions. The values in each row of γij sum to 0, and the
values in each column of γij sum to 0.
• εijk is a matrix of random disturbances.
load mileage
mileage
mileage =
    33.3000   34.5000   37.4000
    33.4000   34.8000   36.8000
    32.9000   33.8000   37.6000
    32.6000   33.4000   36.6000
    32.5000   33.7000   37.0000
    33.0000   33.9000   36.7000
cars = 3;
[p,tbl,stats] = anova2(mileage,cars);
p
p =
0.0000 0.0039 0.8411
There are three models of cars (columns) and two factories (rows). The reason
there are six rows in mileage instead of two is that each factory provides
three cars of each model for the study. The data from the first factory is in the
first three rows, and the data from the second factory is in the last three rows.
The standard ANOVA table has columns for the sums of squares,
degrees-of-freedom, mean squares (SS/df), F statistics, and p-values.
You can use the F statistics to do hypotheses tests to find out if the mileage is
the same across models, factories, and model-factory pairs (after adjusting for
the additive effects). anova2 returns the p value from these tests.
The p value for the model effect is zero to four decimal places. This is a strong
indication that the mileage varies from one model to another. An F statistic
as extreme as the observed F would occur by chance less than once in 10,000
times if the gas mileage were truly equal from model to model. If you used the
multcompare function to perform a multiple comparison test, you would find
that each pair of the three models is significantly different.
The p value for the factory effect is 0.0039, which is also highly significant.
This indicates that one factory is out-performing the other in the gas mileage
of the cars it produces. The observed p value indicates that an F statistic as
extreme as the observed F would occur by chance about four out of 1000 times
if the gas mileage were truly equal from factory to factory.
There does not appear to be any interaction between factories and models.
The p value, 0.8411, means that the observed result is quite likely (84 out of 100
times) given that there is no interaction.
In addition, anova2 requires that data be balanced, which in this case means
there must be the same number of cars for each combination of model and
factory. The next section discusses a function that supports unbalanced data
with any number of predictors.
N-Way ANOVA
• “Introduction” on page 8-12
• “N-Way ANOVA with a Small Data Set” on page 8-13
• “N-Way ANOVA with a Large Data Set” on page 8-15
• “ANOVA with Random Effects” on page 8-19
Introduction
You can use N-way ANOVA to determine if the means in a set of data differ
when grouped by multiple factors. If they do differ, you can determine which
factors or combinations of factors are associated with the difference.
For example, the model for a three-factor ANOVA with all interaction terms is

yijkl = μ + α.j. + βi.. + γ..k + (αβ)ij. + (αγ)i.k + (βγ).jk + (αβγ)ijk + εijkl
The anovan function performs N-way ANOVA. Unlike the anova1 and anova2
functions, anovan does not expect data in a tabular form. Instead, it expects
a vector of response measurements and a separate vector (or text array)
containing the values corresponding to each factor. This input data format is
more convenient than matrices when there are more than two factors or when
the number of measurements per factor combination is not constant.
m = [23 15 20;27 17 63;43 3 55;41 9 90];
anova2(m,2)
ans =
0.0197 0.2234 0.2663
The factor information is implied by the shape of the matrix m and the number
of measurements at each factor combination (2). Although anova2 does not
actually require arrays of factor values, for illustrative purposes you could
create them as follows.
cfactor = repmat(1:3,4,1)
cfactor =
1 2 3
1 2 3
1 2 3
1 2 3
rfactor = [ones(2,3); 2*ones(2,3)]
rfactor =
1 1 1
1 1 1
2 2 2
2 2 2
The cfactor matrix shows that each column of m represents a different level
of the column factor. The rfactor matrix shows that the top two rows of m
represent one level of the row factor, and bottom two rows of m represent a
second level of the row factor. In other words, each value m(i,j) represents
an observation at column factor level cfactor(i,j) and row factor level
rfactor(i,j).
To solve the above problem with anovan, you need to reshape the matrices m,
cfactor, and rfactor to be vectors.
m = m(:);
cfactor = cfactor(:);
rfactor = rfactor(:);
[m cfactor rfactor]
ans =
23 1 1
27 1 1
43 1 2
41 1 2
15 2 1
17 2 1
3 2 2
9 2 2
20 3 1
63 3 1
55 3 2
90 3 2
anovan(m,{cfactor rfactor},2)
ans =
0.0197
0.2234
0.2663
load carbig
whos
The example focuses on four variables. MPG is the number of miles per gallon
for each of 406 cars (though some have missing values coded as NaN). The
other three variables are factors: cyl4 (four-cylinder car or not), org (car
originated in Europe, Japan, or the USA), and when (car was built early in the
period, in the middle of the period, or late in the period).
First, fit the full model, requesting up to three-way interactions and Type 3
sums-of-squares.
varnames = {'Origin';'4Cyl';'MfgDate'};
anovan(MPG,{org cyl4 when},3,3,varnames)
ans =
0.0000
NaN
0
0.7032
0.0001
0.2072
0.6990
Note that many terms are marked by a # symbol as not having full rank,
and one of them has zero degrees of freedom and is missing a p value. This
can happen when there are missing factor combinations and the model has
higher-order terms. In this case, the cross-tabulation below shows that there
are no cars made in Europe during the early part of the period with other than
four cylinders, as indicated by the 0 in table(2,1,1).
[table,chi2,p,factorvals] = crosstab(org,when,cyl4)
table(:,:,1) =
82 75 25
0 4 3
3 3 4
table(:,:,2) =
12 22 38
23 26 17
12 25 32
chi2 =
207.7689
p =
factorvals =
Using even the limited information available in the ANOVA table, you can see
that the three-way interaction has a p value of 0.699, so it is not significant.
So this time you examine only two-way interactions.
[p,tbl,stats,terms] = anovan(MPG,{org cyl4 when},2,3,varnames);
terms
terms =
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
Now all terms are estimable. The p-values for interaction term 4
(Origin*4Cyl) and interaction term 6 (4Cyl*MfgDate) are much larger than
a typical cutoff value of 0.05, indicating these terms are not significant. You
could choose to omit these terms and pool their effects into the error term.
The output terms variable returns a matrix of codes, each of which is a bit
pattern representing a term. You can omit terms from the model by deleting
their entries from terms and running anovan again, this time supplying the
resulting vector as the model argument.
terms([4 6],:) = []
terms =
1 0 0
0 1 0
0 0 1
1 0 1
anovan(MPG,{org cyl4 when},terms,3,varnames)
ans =
1.0e-003 *
0.0000
0
0
0.1140
Now you have a more parsimonious model indicating that the mileage of
these cars seems to be related to all three factors, and that the effect of the
manufacturing date depends on where the car was made.
Setting Up the Model. To set up the example, first load the data, which is
stored in a 6-by-3 matrix, mileage.
load mileage
The anova2 function works only with balanced data, and it infers the values
of the grouping variables from the row and column numbers of the input
matrix. The anovan function, on the other hand, requires you to explicitly
create vectors of grouping variable values. To create these vectors, do the
following steps:
1 Create an array indicating the factory for each value in mileage. This
array is 1 for the first column, 2 for the second, and 3 for the third.
factory = repmat(1:3,6,1);
2 Create an array indicating the car model for each mileage value. This array
is 1 for the first three rows of mileage, and 2 for the remaining three rows.
carmod = [ones(3,3); 2*ones(3,3)];
3 Turn these matrices into vectors and display them:
mileage = mileage(:);
factory = factory(:);
carmod = carmod(:);
[mileage factory carmod]
ans =
In the fixed effects version of this fit, which you get by omitting the inputs
'random',1 in the preceding code, the effect of car model is significant, with a
p value of 0.0039. But in this example, which takes into account the random
variation of the effect of the variable 'Car Model' from one factory to another,
the effect is still significant, but with a higher p value of 0.0136.
In the example described in “Setting Up the Model” on page 8-20, the effect of the variable 'Car Model' could vary from one factory to another. In this case, the interaction mean square takes the place of the error mean square in the F statistic. The F statistic for car model is:
F = 1.445 / 0.02
F =
72.2500
The degrees of freedom for the statistic are the degrees of freedom for the
numerator (1) and denominator (2) mean squares. Therefore the p value
for the statistic is:
pval = 1 - fcdf(F,1,2)
pval =
0.0136
With random effects, the expected value of each mean square depends not only
on the variance of the error term, but also on the variances contributed by
the random effects. You can see these dependencies by writing the expected
values as linear combinations of contributions from the various model terms.
To find the coefficients of these linear combinations, enter stats.ems, which
returns the ems field of the stats structure:
stats.ems
ans =
stats.txtems
ans =
'6*V(Factory)+3*V(Factory*Car Model)+V(Error)'
'9*Q(Car Model)+3*V(Factory*Car Model)+V(Error)'
'3*V(Factory*Car Model)+V(Error)'
'V(Error)'
The expected value for the mean square due to car model (second term)
includes contributions from a quadratic function of the car model effects, plus
three times the variance of the interaction term’s effect, plus the variance
of the error term. Notice that if the car model effects were all zero, the
expression would reduce to the expected mean square for the third term (the
interaction term). That is why the F statistic for the car model effect uses the
interaction mean square in the denominator.
In some cases there is no single term whose expected value matches the one
required for the denominator of the F statistic. In that case, the denominator is
a linear combination of mean squares. The stats structure contains fields
giving the definitions of the denominators for each F statistic. The txtdenom
field, stats.txtdenom, gives a text representation, and the denom field gives
a matrix that defines a linear combination of the variances of terms in the
model. For balanced models like this one, the denom matrix, stats.denom,
contains zeros and ones, because the denominator is just a single term’s mean
square:
stats.txtdenom
ans =
'MS(Factory*Car Model)'
'MS(Factory*Car Model)'
'MS(Error)'
stats.denom
ans =
stats.rtnames
ans =
'Factory'
'Factory*Car Model'
'Error'
You do not know those variances, but you can estimate them from the data.
Recall that the ems field of the stats structure expresses the expected value
of each term’s mean square as a linear combination of unknown variances for
random terms, and unknown quadratic forms for fixed terms. If you take
the expected mean square expressions for the random terms, and equate
those expected values to the computed mean squares, you get a system of
equations that you can solve for the unknown variances. These solutions
are the variance component estimates. The varest field contains a variance
component estimate for each term. The rtnames field contains the names
of the random terms.
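The following sketch carries out that computation using the expected mean square coefficients shown above; the mean squares are illustrative values consistent with this example, with the factory mean square back-computed here for illustration rather than taken from the anovan output:

% Expected mean squares (from stats.txtems above):
%   MS(Factory)           = 6*V(Factory) + 3*V(Factory*Car Model) + V(Error)
%   MS(Factory*Car Model) = 3*V(Factory*Car Model) + V(Error)
%   MS(Error)             = V(Error)
A  = [6 3 1
      0 3 1
      0 0 1];               % coefficients of the unknown variances
ms = [26.68; 0.02; 0.114];  % illustrative observed mean squares
varestSketch = A\ms         % estimates of V(Factory), V(Factory*Car Model), V(Error)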
stats.varest
ans =
4.4426
-0.0313
0.1139
Notice that the estimate for the 'Factory*Car Model' interaction is negative. A variance cannot be negative, so when an estimate comes out negative it is common to set the estimate to zero, which you might do, for example, to
create a bar graph of the components.
bar(max(0,stats.varest))
set(gca,'xtick',1:3,'xticklabel',stats.rtnames)
You can also compute confidence bounds for the variance estimate. The
anovan function does this by computing confidence bounds for the variance
expected mean squares, and finding lower and upper limits on each variance
component containing all of these bounds. This procedure leads to a set
of bounds that is conservative for balanced data. (That is, 95% confidence
bounds will have a probability of at least 95% of containing the true variances
if the number of observations for each combination of grouping variables
is the same.) For unbalanced data, these are approximations that are not
guaranteed to be conservative.
For example, the mileage data from the previous section assumed that the
two car models produced in each factory were the same. Suppose instead,
each factory produced two distinct car models for a total of six car models, and
we numbered them 1 and 2 for each factory for convenience. Then, the car
model is nested in factory. A more accurate and less ambiguous numbering of
car model would be as follows:
Analysis of Covariance
• “Introduction” on page 8-27
• “Analysis of Covariance Tool” on page 8-27
• “Confidence Bounds” on page 8-32
• “Multiple Comparisons” on page 8-34
Introduction
Analysis of covariance is a technique for analyzing grouped data having a
response (y, the variable to be predicted) and a predictor (x, the variable
used to do the prediction). Using analysis of covariance, you can model y as
a linear function of x, with the coefficients of the line possibly varying from
group to group.
Same mean y = α + ε
Separate means y = (α + αi) + ε
Same line y = α + βx + ε
Parallel lines y = (α + αi) + βx + ε
Separate lines y = (α + αi) + (β + βi)x + ε
For example, in the parallel lines model the intercept varies from one group
to the next, but the slope is the same for each group. In the same mean
model, there is a common intercept and no slope. In order to make the group
coefficients well determined, the tool imposes the constraints
Σi αi = Σi βi = 0
The following steps describe the use of aoctool.
1 Load the data. The Statistics Toolbox data set carsmall.mat contains
information on cars from the years 1970, 1976, and 1982. This example
studies the relationship between the weight of a car and its mileage,
and whether this relationship has changed over the years. To start the
demonstration, load the data set.
load carsmall
2 Start the tool. The following command calls aoctool to fit a separate line
to the column vectors Weight and MPG for each of the three model groups
defined in Model_Year. The initial fit models the y variable, MPG, as a linear
function of the x variable, Weight.
[h,atab,ctab,stats] = aoctool(Weight,MPG,Model_Year);
See the aoctool function reference page for detailed information about
calling aoctool.
The coefficients of the three lines appear in the figure titled ANOCOVA
Coefficients. You can see that the slopes are roughly –0.0078, with a small
deviation for each group:
• Model year 1970: y = (45.9798 – 8.5805) + (–0.0078 + 0.002)x + ε
• Model year 1976: y = (45.9798 – 3.8902) + (–0.0078 + 0.0011)x + ε
• Model year 1982: y = (45.9798 + 12.4707) + (–0.0078 – 0.0031)x + ε
Because the three fitted lines have slopes that are roughly similar, you may
wonder if they really are the same. The Model_Year*Weight interaction
expresses the difference in slopes, and the ANOVA table shows a test for
the significance of this term. With an F statistic of 5.23 and a p value of
0.0072, the slopes are significantly different.
4 Constrain the slopes to be the same. To examine the fits when the
slopes are constrained to be the same, return to the ANOCOVA Prediction
Plot window and use the Model pop-up menu to select a Parallel Lines
model. The window updates to show the following graph.
Though this fit looks reasonable, it is significantly worse than the Separate
Lines model. Use the Model pop-up menu again to return to the original
model.
Confidence Bounds
The example in “Analysis of Covariance Tool” on page 8-27 provides estimates
of the relationship between MPG and Weight for each Model_Year, but how
accurate are these estimates? To find out, you can superimpose confidence
bounds on the fits by examining them one group at a time.
1 In the Model_Year menu at the lower right of the figure, change the
setting from All Groups to 82. The data and fits for the other groups are
dimmed, and confidence bounds appear around the 82 fit.
The dashed lines form an envelope around the fitted line for model year 82.
Under the assumption that the true relationship is linear, these bounds
provide a 95% confidence region for the true line. Note that the fits for the
other model years are well outside these confidence bounds for Weight
values between 2000 and 3000.
Like the polytool function, the aoctool function has cross hairs that you
can use to manipulate the Weight and watch the estimate and confidence
bounds along the y-axis update. These values appear only when a single
group is selected, not when All Groups is selected.
Multiple Comparisons
You can perform a multiple comparison test by using the stats output
structure from aoctool as input to the multcompare function. The
multcompare function can test either slopes, intercepts, or population
marginal means (the predicted MPG of the mean weight for each group). The
example in “Analysis of Covariance Tool” on page 8-27 shows that the slopes
are not all the same, but could it be that two are the same and only the other
one is different? You can test that hypothesis.
multcompare(stats,0.05,'on','','s')
ans =
1.0000 2.0000 -0.0012 0.0008 0.0029
1.0000 3.0000 0.0013 0.0051 0.0088
2.0000 3.0000 0.0005 0.0042 0.0079
This matrix shows that the estimated difference between the slopes of groups 1 and 2 (1970 and 1976) is 0.0008, and a confidence interval for the difference is [–0.0012, 0.0029]. There is no significant difference between the two. There are significant differences, however, between the slope for 1982 and each of the other two. The graph shows the same information.
Note that the stats structure was created in the initial call to the aoctool
function, so it is based on the initial model fit (typically a separate-lines
model). If you change the model interactively and want to base your multiple
comparisons on the new model, you need to run aoctool again to get another
stats structure, this time specifying your new model as the initial model.
Nonparametric Methods
• “Introduction” on page 8-36
Introduction
Statistics Toolbox functions include nonparametric versions of one-way and
two-way analysis of variance. Unlike classical tests, nonparametric tests
make only mild assumptions about the data, and are appropriate when the
distribution of the data is non-normal. On the other hand, they are less
powerful than classical methods for normally distributed data.
Kruskal-Wallis Test
The example “Example: One-Way ANOVA” on page 8-4 uses one-way
analysis of variance to determine if the bacteria counts of milk varied from
shipment to shipment. The one-way analysis rests on the assumption that
the measurements are independent, and that each has a normal distribution
with a common variance and with a mean that was constant in each column.
You can conclude that the column means were not all the same. The following
example repeats that analysis using a nonparametric procedure.
load hogg
p = kruskalwallis(hogg)
p =
0.0020
The low p value means the Kruskal-Wallis test results agree with the one-way
analysis of variance results.
Friedman’s Test
“Example: Two-Way ANOVA” on page 8-10 uses two-way analysis of variance
to study the effect of car model and factory on car mileage. The example
tests whether either of these factors has a significant effect on mileage, and
whether there is an interaction between these factors. The conclusion of
the example is there is no interaction, but that each individual factor has
a significant effect. The next example examines whether a nonparametric
analysis leads to the same conclusion.
Friedman’s test is a nonparametric test for data having a two-way layout (data
grouped by two categorical factors). Unlike two-way analysis of variance,
Friedman’s test does not treat the two factors symmetrically and it does not
test for an interaction between them. Instead, it is a test for whether the
columns are different after adjusting for possible row differences. The test is
based on an analysis of variance using the ranks of the data across categories
of the row factor. Output includes a table similar to an ANOVA table.
load mileage
p = friedman(mileage,3)
p =
7.4659e-004
Recall the classical analysis of variance gave a p value to test column effects,
row effects, and interaction effects. This p value is for column effects. Using
either this p value or the p value from ANOVA (p < 0.0001), you conclude that
there are significant column effects.
In order to test for row effects, you need to rearrange the data to swap the
roles of the rows in columns. For a data matrix x with no replications, you
could simply transpose the data and type
p = friedman(x')
With replicated data such as the mileage matrix, however, you must first reshape the data into a three-dimensional array with one dimension representing the replicates, swap the other two dimensions, and restore the two-dimensional shape:
x = reshape(mileage, [3 2 3]);
x = permute(x,[1 3 2]);
x = reshape(x,[9 2])
x =
33.3000 32.6000
33.4000 32.5000
32.9000 33.0000
34.5000 33.4000
34.8000 33.7000
33.8000 33.9000
37.4000 36.6000
36.8000 37.0000
37.6000 36.7000
friedman(x,3)
ans =
0.0082
You cannot use Friedman’s test to test for interactions between the row and
column factors.
MANOVA
In this section...
“Introduction” on page 8-39
“ANOVA with Multiple Responses” on page 8-39
Introduction
The analysis of variance technique in “Example: One-Way ANOVA” on
page 8-4 takes a set of grouped data and determines whether the mean of a
variable differs significantly among groups. Often there are multiple response
variables, and you are interested in determining whether the entire set of
means is different from one group to the next. There is a multivariate version
of analysis of variance that can address the problem.
load carsmall
whos
Name Size Bytes Class
Acceleration 100x1 800 double array
Cylinders 100x1 800 double array
Displacement 100x1 800 double array
Horsepower 100x1 800 double array
MPG 100x1 800 double array
Model 100x36 7200 char array
Model_Year 100x1 800 double array
Origin 100x7 1400 char array
Weight 100x1 800 double array
Model_Year indicates the year in which the car was made. You can create a
grouped plot matrix of these variables using the gplotmatrix function.
It appears the cars do differ from year to year. The upper right plot, for
example, is a graph of MPG versus Weight. The 1982 cars appear to have
higher mileage than the older cars, and they appear to weigh less on average.
But as a group, are the three years significantly different from one another?
The manova1 function can answer that question.
x = [MPG Horsepower Displacement Weight];
[d,p,stats] = manova1(x,Model_Year)
d =
2
p =
1.0e-006 *
0
0.1141
stats =
W: [4x4 double]
B: [4x4 double]
T: [4x4 double]
dfW: 90
dfB: 2
dfT: 92
lambda: [2x1 double]
chisq: [2x1 double]
chisqdf: [2x1 double]
eigenval: [4x1 double]
eigenvec: [4x4 double]
canon: [100x4 double]
mdist: [100x1 double]
gmdist: [3x3 double]
The next three fields are used to do a canonical analysis. Recall that in
principal components analysis (“Principal Component Analysis (PCA)” on
page 10-31) you look for the combination of the original variables that has the
largest possible variation. In multivariate analysis of variance, you instead
look for the linear combination of the original variables that has the largest
separation between groups. It is the single variable that would give the most
significant result in a univariate one-way analysis of variance. Having found
that combination, you next look for the combination with the second highest
separation, and so on.
The eigenvec field is a matrix that defines the coefficients of the linear
combinations of the original variables. The eigenval field is a vector
measuring the ratio of the between-group variance to the within-group
variance for the corresponding linear combination. The canon field is a matrix
of the canonical variable values. Each column is a linear combination of the
mean-centered original variables, using coefficients from the eigenvec matrix.
A grouped scatter plot of the first two canonical variables shows more
separation between groups than a grouped scatter plot of any pair of original
variables. In this example it shows three clouds of points, overlapping but
with distinct centers. One point in the bottom right sits apart from the others.
By using the gname function, you can see that this is the 20th point.
c1 = stats.canon(:,1);
c2 = stats.canon(:,2);
gscatter(c2,c1,Model_Year,[],'oxs')
gname
Roughly speaking, the first canonical variable, c1, separates the 1982 cars
(which have high values of c1) from the older cars. The second canonical
variable, c2, reveals some separation between the 1970 and 1976 cars.
The final two fields of the stats structure are Mahalanobis distances. The
mdist field measures the distance from each point to its group mean. Points
with large values may be outliers. In this data set, the largest outlier is the
one in the scatter plot, the Buick Estate station wagon. (Note that you could
have supplied the model name to the gname function above if you wanted to
label the point with its model name rather than its row number.)
max(stats.mdist)
ans =
31.5273
find(stats.mdist == ans)
ans =
20
Model(20,:)
ans =
buick_estate_wagon_(sw)
The gmdist field measures the distances between each pair of group means.
The following commands examine the group means and their distances:
grpstats(x, Model_Year)
ans =
1.0e+003 *
0.0177 0.1489 0.2869 3.4413
0.0216 0.1011 0.1978 3.0787
0.0317 0.0815 0.1289 2.4535
stats.gmdist
ans =
0 3.8277 11.1106
3.8277 0 6.1374
11.1106 6.1374 0
9
Parametric Regression Analysis
Introduction
Regression is the process of fitting models to data. The process depends on the
model. If a model is parametric, regression estimates the parameters from the
data. If a model is linear in the parameters, estimation is based on methods
from linear algebra that minimize the norm of a residual vector. If a model
is nonlinear in the parameters, estimation is based on search methods from
optimization that minimize the norm of a residual vector. Nonparametric
models, like “Classification Trees and Regression Trees” on page 13-25, use
methods all their own.
This chapter considers data and models with continuous predictors and
responses. Categorical predictors are the subject of Chapter 8, “Analysis of
Variance”. Categorical responses are the subject of Chapter 12, “Parametric
Classification” and Chapter 13, “Supervised Learning”.
Linear Regression
In this section...
“Linear Regression Models” on page 9-3
“Multiple Linear Regression” on page 9-8
“Robust Regression” on page 9-14
“Stepwise Regression” on page 9-19
“Ridge Regression” on page 9-29
“Partial Least Squares” on page 9-32
“Polynomial Models” on page 9-37
“Response Surface Models” on page 9-45
“Generalized Linear Models” on page 9-52
“Multivariate Regression” on page 9-57
y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2² + ε

y = β1f1(x) + ... + βpfp(x) + ε
Given n independent observations (x1, y1), …, (xn, yn) of the predictor x and the
response y, the linear regression model becomes an n-by-p system of equations:
y = Xβ + ε, where y = (y1, …, yn)T is the vector of responses, X is the n-by-p matrix with entries Xij = fj(xi), β = (β1, …, βp)T is the vector of coefficients, and ε = (ε1, …, εn)T is the vector of errors.
X is the design matrix of the system. The columns of X are the terms of the
model evaluated at the predictors. To fit the model to the data, the system
must be solved for the p coefficient values in β = (β1, …, βp)T.
betahat = X\y
The least-squares estimate β̂ satisfies the normal equations

XT(y − Xβ̂) = 0

or

XTXβ̂ = XTy
If X is n-by-p, the normal equations are a p-by-p square system with solution
betahat = inv(X'*X)*X'*y, where inv is the MATLAB inverse operator.
The matrix inv(X'*X)*X' is the pseudoinverse of X, computed by the
MATLAB function pinv.
The normal equations are often badly conditioned relative to the original
system y = Xβ (the coefficient estimates are much more sensitive to the model
error ε), so the MATLAB backslash operator avoids solving them directly.
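The following sketch, on a small simulated problem, computes the estimate in three algebraically equivalent ways; for a well-conditioned problem such as this one the results agree:

x = (1:10)';
X = [ones(10,1) x x.^2];               % design matrix for a quadratic model
y = 2 + 0.5*x - 0.1*x.^2 + 0.05*randn(10,1);
bBackslash = X\y;                      % orthogonal (QR) factorization
bNormalEq  = inv(X'*X)*X'*y;           % normal equations
bPinv      = pinv(X)*y;                % pseudoinverse
[bBackslash bNormalEq bPinv]           % nearly identical for this problem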
To avoid forming the normal equations, the backslash operator uses the QR decomposition of the design matrix, X = QR, where Q is orthogonal and R is upper triangular. The decomposition reduces the normal equations to the triangular system:
XTXβ̂ = XTy
(QR)T(QR)β̂ = (QR)Ty
RTQTQRβ̂ = RTQTy
RTRβ̂ = RTQTy
Rβ̂ = QTy
Statistics Toolbox functions like regress and regstats call the MATLAB
backslash operator to perform linear regression. The QR decomposition is also
used for efficient computation of confidence intervals.
Once betahat is computed, the model can be evaluated at the predictor data:
yhat = X*betahat
or
yhat = X*inv(X'*X)*X'*y
Introduction
The system of linear equations y = Xβ + ε described in “Linear Regression Models” on page 9-3 expresses a multiple linear regression model when the terms fj(x) involve more than one predictor variable.
The Statistics Toolbox functions regress and regstats are used for multiple
linear regression analysis.
load moore
X1 = [ones(size(moore,1),1) moore(:,1:5)];
y = moore(:,6);
betahat = regress(y,X1)
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
betahat = X1\y
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
The advantage of working with regress is that it allows for additional inputs
and outputs relevant to statistical analysis of the regression. For example:
alpha = 0.05;
[betahat,Ibeta,res,Ires,stats] = regress(y,X1,alpha);
Visualize the residuals, in case (row number) order, with the rcoplot
function:
rcoplot(res,Ires)
The interval around the first residual, shown in red when plotted, does not
contain zero. This indicates that the residual is larger than expected in 95%
of new observations, and suggests the data point is an outlier.
2 If there is a systematic error in the model (that is, if the model is not
appropriate for generating the data under model assumptions), the mean
of the residuals is not zero.
3 If the errors in the model are not normally distributed, the distributions
of the residuals may be skewed or leptokurtic (with heavy tails and more
outliers).
X2 = moore(:,1:5);
stats = regstats(y,X2);
regstats(y,X2)
Select the check boxes corresponding to the statistics you want to compute and
click OK. Selected statistics are returned to the MATLAB workspace. Names
of container variables for the statistics appear on the right-hand side of the
interface, where they can be changed to any valid MATLAB variable name.
t = stats.tstat;
CoeffTable = dataset({t.beta,'Coef'},{t.se,'StdErr'}, ...
{t.t,'tStat'},{t.pval,'pVal'})
CoeffTable =
Coef StdErr tStat pVal
-2.1561 0.91349 -2.3603 0.0333
-9.0116e-006 0.00051835 -0.017385 0.98637
0.0013159 0.0012635 1.0415 0.31531
0.0001278 7.6902e-005 1.6618 0.11876
0.0078989 0.014 0.56421 0.58154
0.00014165 7.3749e-005 1.9208 0.075365
The MATLAB function fprintf gives you control over tabular formatting.
For example, the fstat field of the stats output structure of regstats is a
structure with statistics related to the analysis of variance (ANOVA) of the
regression. The following commands produce a standard regression ANOVA
table:
f = stats.fstat;
fprintf('\n')
fprintf('Regression ANOVA');
fprintf('\n\n')
fprintf('%6s','Source');
fprintf('%10s','df','SS','MS','F','P');
fprintf('\n')
fprintf('%6s','Regr');
fprintf('%10.4f',f.dfr,f.ssr,f.ssr/f.dfr,f.f,f.pval);
fprintf('\n')
fprintf('%6s','Resid');
fprintf('%10.4f',f.dfe,f.sse,f.sse/f.dfe);
fprintf('\n')
fprintf('%6s','Total');
fprintf('%10.4f',f.dfe+f.dfr,f.sse+f.ssr);
fprintf('\n')
Regression ANOVA
Source df SS MS F P
Regr 5.0000 4.1084 0.8217 11.9886 0.0001
Resid 14.0000 0.9595 0.0685
Total 19.0000 5.0679
Robust Regression
• “Introduction” on page 9-14
• “Programmatic Robust Regression” on page 9-15
• “Interactive Robust Regression” on page 9-16
Introduction
The models described in “Linear Regression Models” on page 9-3 are based on
certain assumptions, such as a normal distribution of errors in the observed
responses. If the distribution of errors is asymmetric or prone to outliers,
model assumptions are invalidated, and parameter estimates, confidence
intervals, and other computed statistics become unreliable. The Statistics
Toolbox function robustfit is useful in these cases. The function implements
a robust fitting method that is less sensitive than ordinary least squares to
large changes in small parts of the data.
Robust fitting assigns a weight to each data point, computed automatically and iteratively in a process called iteratively reweighted least squares. In the first iteration, each point is assigned equal
weight and model coefficients are estimated using ordinary least squares. At
subsequent iterations, weights are recomputed so that points farther from
model predictions in the previous iteration are given lower weight. Model
coefficients are then recomputed using weighted least squares. The process
continues until the values of the coefficient estimates converge within a
specified tolerance.
load moore
X1 = [ones(size(moore,1),1) moore(:,1:5)];
y = moore(:,6);
betahat = regress(y,X1)
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
X2 = moore(:,1:5);
robustbeta = robustfit(X2,y)
robustbeta =
-1.7516
0.0000
0.0009
0.0002
0.0060
0.0001
[robustbeta,stats] = robustfit(X2,y);
stats.w'
ans =
Columns 1 through 5
0.0246 0.9986 0.9763 0.9323 0.9704
Columns 6 through 10
0.8597 0.9180 0.9992 0.9590 0.9649
Columns 11 through 15
0.9769 0.9868 0.9999 0.9976 0.8122
Columns 16 through 20
0.9733 0.9892 0.9988 0.8974 0.6774
The first data point has a very low weight compared to the other data points,
and so is effectively ignored in the robust regression.
1 Start the demo. To begin using robustdemo with the built-in data, simply
enter the function name at the command line:
robustdemo
The resulting figure shows a scatter plot with two fitted lines. The red line
is the fit using ordinary least-squares regression. The green line is the
fit using robust regression. At the bottom of the figure are the equations
for the fitted lines, together with the estimated root mean squared errors
for each fit.
In the built-in data, the rightmost point has a relatively high leverage of
0.35. The point exerts a large influence on the least-squares fit, but its
small robust weight shows that it is effectively excluded from the robust fit.
3 See how changes in the data affect the fits. With the left mouse
button, click and hold on any data point and drag it to a new location.
When you release the mouse button, the displays update.
Bringing the rightmost data point closer to the least-squares line makes
the two fitted lines nearly identical. The adjusted rightmost data point has
significant weight in the robust fit.
Stepwise Regression
• “Introduction” on page 9-20
• “Programmatic Stepwise Regression” on page 9-21
• “Interactive Stepwise Regression” on page 9-27
Introduction
Multiple linear regression models, as described in “Multiple Linear
Regression” on page 9-8, are built from a potentially large number of
predictive terms. The number of interaction terms, for example, increases
exponentially with the number of predictor variables. If there is no theoretical
basis for choosing the form of a model, and no assessment of correlations
among terms, it is possible to include redundant terms in a model that confuse
the identification of significant effects.
Stepwise regression is a systematic method for adding terms to and removing terms from a model based on their statistical significance. The method proceeds as follows:

1 Fit the initial model.

2 If any terms not in the model have p-values less than an entrance tolerance
(that is, if it is unlikely that they would have zero coefficient if added to
the model), add the one with the smallest p value and repeat this step;
otherwise, go to step 3.
3 If any terms in the model have p-values greater than an exit tolerance (that
is, if it is unlikely that the hypothesis of a zero coefficient can be rejected),
remove the one with the largest p value and go to step 2; otherwise, end.
Depending on the terms included in the initial model and the order in which
terms are moved in and out, the method may build different models from the
same set of potential terms. The method terminates when no single step
improves the model. There is no guarantee, however, that a different initial
model or a different sequence of steps will not lead to a better fit. In this
sense, stepwise models are locally optimal, but may not be globally optimal.
load hald
whos
Name Size Bytes Class Attributes
The response (heat) depends on the quantities of the four predictors (the
columns of ingredients).
stepwisefit(ingredients,heat,...
'penter',0.05,'premove',0.10);
Initial columns included: none
Step 1, added column 4, p=0.000576232
Step 2, added column 1, p=1.10528e-006
Final columns included: 1 4
'Coeff' 'Std.Err.' 'Status' 'P'
[ 1.4400] [ 0.1384] 'In' [1.1053e-006]
[ 0.4161] [ 0.1856] 'Out' [ 0.0517]
[-0.4100] [ 0.1992] 'Out' [ 0.0697]
[-0.6140] [ 0.0486] 'In' [1.8149e-007]
For terms not in the model, the coefficient estimates and their standard errors are those that would result if the term were added. You can force a term into the initial model with the 'inmodel' parameter:
initialModel = ...
[false true false false]; % Force in 2nd term
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10);
Initial columns included: 2
Step 1, added column 1, p=2.69221e-007
Final columns included: 1 2
'Coeff' 'Std.Err.' 'Status' 'P'
[ 1.4683] [ 0.1213] 'In' [2.6922e-007]
[ 0.6623] [ 0.0459] 'In' [5.0290e-008]
[ 0.2500] [ 0.1847] 'Out' [ 0.2089]
[-0.2365] [ 0.1733] 'Out' [ 0.2054]
The preceding two models, built from different initial models, use different
subsets of the predictive terms. Terms 2 and 4, swapped in the two models,
are highly correlated:
term2 = ingredients(:,2);
term4 = ingredients(:,4);
R = corrcoef(term2,term4)
R =
1.0000 -0.9730
-0.9730 1.0000
[betahat1,se1,pval1,inmodel1,stats1] = ...
stepwisefit(ingredients,heat,...
'penter',.05,'premove',0.10,...
'display','off');
[betahat2,se2,pval2,inmodel2,stats2] = ...
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10,...
'display','off');
RMSE1 = stats1.rmse
RMSE1 =
2.7343
RMSE2 = stats2.rmse
RMSE2 =
2.4063
The second model has a lower Root Mean Square Error (RMSE).
An added variable plot is used to determine the unique effect of adding a new
term to a model. The plot shows the relationship between the part of the
response unexplained by terms already in the model and the part of the new
term unexplained by terms already in the model. The “unexplained” parts
are measured by the residuals of the respective regressions. A scatter of the
residuals from the two regressions forms the added variable plot.
For example, suppose you want to add term2 to a model that already contains
the single term term1. First, consider the ability of term2 alone to explain
the response:
load hald
term2 = ingredients(:,2);
scatter(term2,heat)
xlabel('Term 2')
ylabel('Heat')
hold on
b2 = regress(heat,[ones(size(term2)) term2]);
x2 = 20:80;
y2 = b2(1) + b2(2)*x2;
plot(x2,y2,'r')
title('{\bf Response Explained by Term 2: Ignoring Term 1}')
Next, consider the following regressions involving the model term term1:
term1 = ingredients(:,1);
[b1,Ib1,res1] = regress(heat,[ones(size(term1)) term1]);
[b21,Ib21,res21] = regress(term2,[ones(size(term1)) term1]);
bres = regress(res1,[ones(size(res21)) res21]);
A scatter of the residuals res1 vs. the residuals res21 forms the added
variable plot:
figure
scatter(res21,res1)
xlabel('Residuals: Term 2 on Term 1')
ylabel('Residuals: Heat on Term 1')
hold on
xres = -30:30;
yres = bres(1) + bres(2)*xres;
plot(xres,yres,'r')
title('{\bf Response Explained by Term 2: Adjusted for Term 1}')
Since the plot adjusted for term1 shows a stronger relationship (less variation
along the fitted line) than the plot ignoring term1, the two terms act jointly to
explain extra variation. In this case, adding term2 to a model consisting of
term1 would reduce the RMSE.
figure
addedvarplot(ingredients,heat,2,[true false false false]) % term 2, with term 1 in the model
In addition to the scatter of residuals, the plot shows 95% confidence intervals
on predictions from the fitted line. The fitted line has intercept zero because,
under the assumptions outlined in “Linear Regression Models” on page 9-3,
both of the plotted variables have mean zero. The slope of the fitted line is the
coefficient that term2 would have if it were added to the model with term1.
The addedvarplot function is useful for considering the unique effect of adding
a new term to an existing model with any number of terms.
load hald
stepwise(ingredients,heat)
The upper left of the interface displays estimates of the coefficients for all
potential terms, with horizontal bars indicating 90% (colored) and 95% (grey)
confidence intervals. The red color indicates that, initially, the terms are not
in the model. Values displayed in the table are those that would result if
the terms were added to the model.
The middle portion of the interface displays summary statistics for the entire
model. These statistics are updated with each step.
The lower portion of the interface, Model History, displays the RMSE for
the model. The plot tracks the RMSE from step to step, so you can compare
the optimality of different models. Hover over the blue dots in the history to
see which terms were in the model at a particular step. Click on a blue dot
in the history to open a copy of the interface initialized with the terms in
the model at that step.
To center and scale the input data (compute z-scores) to improve conditioning
of the underlying least-squares problem, select Scale Inputs from the
Stepwise menu.
1 Click Next Step to select the recommended next step. The recommended
next step either adds the most significant term or removes the least
significant term. When the regression reaches a local minimum of RMSE,
the recommended next step is “Move no terms.” You can perform all of the
recommended steps at once by clicking All Steps.
2 Click a line in the plot or in the table to toggle the state of the corresponding
term. Clicking a red line, corresponding to a term not currently in the
model, adds the term to the model and changes the line to blue. Clicking
a blue line, corresponding to a term currently in the model, removes the
term from the model and changes the line to red.
To call addedvarplot and produce an added variable plot from the stepwise
interface, select Added Variable Plot from the Stepwise menu. A list of
terms is displayed. Select the term you want to add, and then click OK.
Click Export to display a dialog box that allows you to select information
from the interface to save to the MATLAB workspace. Check the information
you want to export and, optionally, change the names of the workspace
variables to be created. Click OK to export the information.
Ridge Regression
• “Introduction” on page 9-29
• “Example: Ridge Regression” on page 9-30
Introduction
Coefficient estimates for the models described in “Multiple Linear Regression”
on page 9-8 rely on the independence of the model terms. When terms are
correlated and the columns of the design matrix X have an approximate
linear dependence, the matrix (XTX)–1 becomes close to singular. As a result,
the least-squares estimate
ˆ = ( X T X )−1 X T y
ˆ = ( X T X + kI )−1 X T y
where k is the ridge parameter and I is the identity matrix. Small positive
values of k improve the conditioning of the problem and reduce the variance
of the estimates. While biased, the reduced variance of ridge estimates
often result in a smaller mean square error when compared to least-squares
estimates.
Example: Ridge Regression
To illustrate, load the acetylene data, which contains a response y and three correlated predictors x1, x2, and x3. Plot the predictors pairwise to see the correlations:
load acetylene
subplot(1,3,1)
plot(x1,x2,'.')
xlabel('x1'); ylabel('x2'); grid on; axis square
subplot(1,3,2)
plot(x1,x3,'.')
xlabel('x1'); ylabel('x3'); grid on; axis square
subplot(1,3,3)
plot(x2,x3,'.')
xlabel('x2'); ylabel('x3'); grid on; axis square
Note the correlation between x1 and the other two predictor variables.
Use ridge and x2fx to compute coefficient estimates for a multilinear model
with interaction terms, for a range of ridge parameters:
X = [x1 x2 x3];
D = x2fx(X,'interaction');
D(:,1) = []; % remove the constant column; ridge handles the intercept separately
k = 0:1e-5:5e-3; % range of ridge parameters (an illustrative choice)
betahat = ridge(y,D,k);
figure
plot(k,betahat,'LineWidth',2)
ylim([-100 100])
grid on
xlabel('Ridge Parameter')
ylabel('Standardized Coefficient')
title('{\bf Ridge Trace}')
legend('x1','x2','x3','x1x2','x1x3','x2x3')
The estimates stabilize to the right of the plot. Note that the coefficient of
the x2x3 interaction term changes sign at a ridge parameter value of roughly
5 × 10^-4.
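Once a ridge parameter in the stable region has been chosen, you can refit at that single value and recover coefficients on the scale of the original data by passing 0 as the scaling flag; the particular value of k below is only an assumption for illustration:
kstar = 3e-3;                   % assumed choice taken from the ridge trace
b = ridge(y,D,kstar,0);         % scaled = 0: intercept first, coefficients in original units
yhat = [ones(size(D,1),1) D]*b; % fitted responses from the refitted model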
Partial Least Squares
Introduction
Partial least-squares (PLS) regression is a technique used with data that
contain correlated predictor variables. This technique constructs new
predictor variables, known as components, as linear combinations of the
original predictor variables. PLS constructs these components while
considering the observed response values, leading to a parsimonious model
with reliable predictive power.
• Multiple linear regression finds a combination of the predictors that best fits
a response.
• Principal component analysis finds combinations of the predictors with
large variance, reducing correlations. The technique makes no use of
response values.
• PLS finds combinations of the predictors that have a large covariance with
the response values.
PLS therefore combines information about the variances of both the predictors
and the responses, while also considering the correlations among them.
The following example uses the moore data, augmenting the original predictors with noisy copies to create a correlated predictor set:
load moore
y = moore(:,6); % Response
X0 = moore(:,1:5); % Original predictors
X1 = X0+10*randn(size(X0)); % Correlated predictors
X = [X0,X1];
[XL,yl,XS,YS,beta,PCTVAR] = plsregress(X,y,10);
plot(1:10,cumsum(100*PCTVAR(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in y');
Choosing the number of components in a PLS model is a critical step. The plot
gives a rough indication, showing nearly 80% of the variance in y explained
by the first six components. Refit with six components and evaluate the fit:
[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,6);
yfit = [ones(size(X,1),1) X]*beta;
plot(y,yfit,'o')
TSS = sum((y-mean(y)).^2);
RSS = sum((y-yfit).^2);
Rsquared = 1 - RSS/TSS
Rsquared =
0.8421
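A more systematic way to choose the number of components is cross-validation. The following sketch uses the 'CV' option of plsregress (the 10-fold choice here is an assumption) and plots the estimated mean-squared prediction error for the response:
[~,~,~,~,~,~,MSEcv] = plsregress(X,y,10,'CV',10); % 10-fold cross-validation (assumed)
plot(0:10,MSEcv(2,:),'-o')
xlabel('Number of PLS components')
ylabel('Estimated MSE of prediction (response)')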
A plot of the weights of the ten predictors in each of the six components shows
that two of the components (the last two computed) explain the majority of
the variance in X:
plot(1:10,stats.W,'o-');
legend({'c1','c2','c3','c4','c5','c6'},'Location','NW')
xlabel('Predictor');
ylabel('Weight');
To compare how well the predictors and the response are fit as components are added, plot the mean-squared errors for both against the number of components:
[axes,h1,h2] = plotyy(0:6,MSE(1,:),0:6,MSE(2,:));
set(h1,'Marker','o')
set(h2,'Marker','o')
legend('MSE Predictors','MSE Response')
xlabel('Number of Components')
Polynomial Models
• “Introduction” on page 9-37
• “Programmatic Polynomial Regression” on page 9-38
• “Interactive Polynomial Regression” on page 9-43
Introduction
Polynomial models are a special case of the linear models discussed in “Linear
Regression Models” on page 9-3. Polynomial models have the advantages of
being simple, familiar in their properties, and reasonably flexible for following
data trends. They are also robust with respect to changes in the location and
scale of the data (see “Conditioning Polynomial Fits” on page 9-41). However,
polynomial models may be poor predictors of new values. They oscillate
between data points, especially as the degree is increased to improve the fit.
Asymptotically, they follow power functions, which can lead to inaccuracies
when the model is used to extrapolate long-term trends.
trade-off between a simple description of overall data trends and the accuracy
of predictions made from the model.
Programmatic Polynomial Regression
The polyfit function computes least-squares coefficient estimates for a polynomial of a specified degree. For example, fit a cubic polynomial to a small data set:
x = 0:5; % x data
y = [2 1 4 4 3 2]; % y data
p = polyfit(x,y,3) % Degree 3 fit
p =
-0.1296 0.6865 -0.1759 1.6746
The roots function computes the roots of the fitted polynomial:
r = roots(p)
r =
5.4786
-0.0913 + 1.5328i
-0.0913 - 1.5328i
The MATLAB function poly solves the inverse problem, finding a polynomial
with specified roots. poly is the inverse of roots up to ordering, scaling, and
round-off error.
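As a quick check of this relationship, you can rebuild the cubic from its roots; because poly returns a monic polynomial, rescale by the leading coefficient before comparing (a minimal sketch using the fit above):
p2 = p(1)*poly(r);  % rebuild from the roots, restoring the leading coefficient
max(abs(p - p2))    % the difference is at round-off level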
The polyconf function computes the fit together with 95% prediction intervals for new observations at each x:
[p,S] = polyfit(x,y,3);
[yhat,delta] = polyconf(p,x,S);
PI = [yhat-delta;yhat+delta]'
PI =
-5.3022 8.6514
-4.2068 8.3179
-2.9899 9.0534
-2.1963 9.8471
-2.6036 9.9211
-5.2229 8.7308
x = -5:5;
y = x.^2 - 5*x - 3 + 5*randn(size(x));
p = polydemo(x,y,2,0.05)
p =
0.8107 -4.5054 -1.1862
Conditioning Polynomial Fits. The census data illustrate how poorly conditioned fits can arise when the predictor values are large:
load census
x = cdate;
y = pop;
p = polyfit(x,y,3);
Warning: Polynomial is badly conditioned.
Add points with distinct X values,
reduce the degree of the polynomial,
or try centering and scaling as
described in HELP POLYFIT.
As suggested, refit with centering and scaling, and then evaluate the conditioned fit:
[p,S,mu] = polyfit(x,y,3); % centered and scaled fit (no warning)
xfit = linspace(x(1),x(end),100);
yfit = polyval(p,xfit,S,mu);
plot(xfit,yfit,'b-') % Plot conditioned fit vs. x data
grid on
Interactive Polynomial Regression
The Basic Fitting Tool. The Basic Fitting Tool is a MATLAB interface for fitting polynomial models to data interactively; it is discussed in “Interactive Fitting” in the MATLAB documentation.
The Polynomial Fitting Tool. The polytool function opens an interactive polynomial fitting interface. For example:
x = -5:5;
y = x.^2 - 5*x - 3 + 5*randn(size(x));
polytool(x,y,2,0.05)
• Interactively change the degree of the fit. Change the value in the Degree
text box at the top of the figure.
• Evaluate the fit and the bounds using a movable crosshair. Click, hold, and
drag the crosshair to change its position.
• Export estimated coefficients, predicted values, prediction intervals, and
residuals to the MATLAB workspace. Click Export to a open a dialog box
with choices for exporting the data.
Options for the displayed bounds and the fitting method are available through
menu options at the top of the figure:
• The Bounds menu lets you choose between bounds on new observations
(the default) and bounds on estimated values. It also lets you choose
between nonsimultaneous (the default) and simultaneous bounds. See
polyconf for a description of these options.
• The Method menu lets you choose between ordinary least-squares
regression and robust regression, as described in “Robust Regression” on
page 9-14.
Response Surface Models
Introduction
Polynomial models are generalized to any number of predictor variables xi (i
= 1, ..., N) as follows:
y(x) = a0 + Σ (i=1..N) ai·xi + Σ (i<j) aij·xi·xj + Σ (i=1..N) aii·xi² + ...
The model includes, from left to right, an intercept, linear terms, quadratic
interaction terms, and squared terms. Higher order terms would follow, as
necessary.
load reaction
The reaction data contain a response vector rate, a three-column predictor matrix reactants, and variable names xn and yn. The x2fx function converts predictor data to design matrices for quadratic models.
The regstats function calls x2fx when instructed to do so.
For example, the following fits a quadratic response surface model to the
data in reaction.mat:
stats = regstats(rate,reactants,'quadratic','beta');
b = stats.beta; % Model coefficients
The 10-by-1 vector b contains, in order, a constant term and then the
coefficients for the model terms x1, x2, x3, x1x2, x1x3, x2x3, x1^2, x2^2, and x3^2, where
x1, x2, and x3 are the three columns of reactants. The order of coefficients for
quadratic models is described in the reference page for x2fx.
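If you want to confirm the column order that x2fx (and therefore regstats) uses, a small sketch with made-up predictor values makes it visible:
Xsmall = [1 2 3; 4 5 6];          % two observations, three predictors (made up)
Dsmall = x2fx(Xsmall,'quadratic') % columns: 1, x1, x2, x3, x1x2, x1x3, x2x3, x1^2, x2^2, x3^2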
Since the model involves only three predictors, it is possible to visualize the
entire response surface using a color dimension for the reaction rate:
x1 = reactants(:,1);
x2 = reactants(:,2);
x3 = reactants(:,3);
xx1 = linspace(min(x1),max(x1),25);
xx2 = linspace(min(x2),max(x2),25);
xx3 = linspace(min(x3),max(x3),25);
[X1,X2,X3] = meshgrid(xx1,xx2,xx3);
RATE = b(1) + b(2)*X1 + b(3)*X2 + b(4)*X3 + ...
       b(5)*X1.*X2 + b(6)*X1.*X3 + b(7)*X2.*X3 + ...
       b(8)*X1.^2 + b(9)*X2.^2 + b(10)*X3.^2; % model evaluated on the grid
hmodel = scatter3(X1(:),X2(:),X3(:),5,RATE(:),'filled');
hold on
hdata = scatter3(x1,x2,x3,'ko','filled');
axis tight
xlabel(xn(1,:))
ylabel(xn(2,:))
zlabel(xn(3,:))
hbar = colorbar;
ylabel(hbar,yn);
title('{\bf Quadratic Response Surface Model}')
legend(hdata,'Data','Location','NE')
The plot shows a general increase in model response, within the space of
the observed data, as the concentration of n-pentane increases and the
concentrations of hydrogen and isopentane decrease.
To investigate the shape of the fitted surface, assemble the Hessian of the quadratic part of the model from the estimated coefficients and examine its eigenvalues:
H = [b(8),b(5)/2,b(6)/2; ...
b(5)/2,b(9),b(7)/2; ...
b(6)/2,b(7)/2,b(10)];
lambda = eig(H)
lambda =
1.0e-003 *
-0.1303
0.0412
0.4292
The eigenvalues have mixed signs, so the fitted surface is a saddle and has no interior maximum. To look inside the response surface, remove the model scatter and plot a slice at a fixed n-pentane concentration:
delete(hmodel)
X2slice = 200; % Fix n-Pentane concentration
slice(X1,X2,X3,RATE,[],X2slice,[])
The rstool function opens an interactive response surface modeling interface:
load reaction
alpha = 0.01; % Significance level
rstool(reactants,rate,'quadratic',alpha,xn,yn)
The interface displays a plot of the modeled response against each predictor, with the other predictors held fixed, and shows confidence bounds in the plots. Predictor values are changed by editing the text boxes or by dragging the dashed blue lines. When you change the value of a predictor, all plots update to show the new point in predictor space.
Generalized Linear Models
Introduction
Linear regression models describe a linear relationship between a response
and one or more predictive terms. Many times, however, a nonlinear
relationship exists. “Nonlinear Regression” on page 9-58 describes general
nonlinear models. A special class of nonlinear models, known as generalized
linear models, makes use of linear methods.
A linear regression model assumes that:
• At each set of values for the predictors, the response has a normal
distribution with mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• The model is μ = Xb.
A generalized linear model, by contrast, assumes that:
• At each set of values for the predictors, the response has a distribution
that may be normal, binomial, Poisson, gamma, or inverse Gaussian, with
parameters including a mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• A link function f relates the mean of the response to the linear combination: f(μ) = Xb.
For example, consider a vector w of car weights and, at each weight, a count poor of poor-mileage cars out of total cars tested. Plot the observed proportions:
plot(w,poor./total,'x','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Proportion of Poor-Mileage Cars')
The logistic model is useful for proportion data. It defines the relationship
between the proportion p and the weight w by:

log( p / (1 − p) ) = β1 + β2·w

Some of the proportions in the data are 0 and 1, making the left-hand side of
this equation undefined. To keep the proportions within range, add relatively
small perturbations to the poor and total values. A semi-log plot then shows
a nearly linear relationship, as predicted by the model:
p_adjusted = (poor+.5)./(total+1);
semilogy(w,p_adjusted./(1-p_adjusted),'x','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Adjusted p / (1 - p)')
The glmfit function fits the logistic model:
b = glmfit(w,[poor total],'binomial','link','logit')
b =
-13.3801
0.0042
Use glmval to compute fitted proportions over a range of weights, and plot them with the data:
x = 2100:100:4500;
y = glmval(b,x,'logit');
plot(w,poor./total,'x','LineWidth',2)
hold on
plot(x,y,'r-','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Proportion of Poor-Mileage Cars')
Multivariate Regression
Whether or not the predictor x is a vector of predictor variables, multivariate
regression refers to the case where the response y = (y1, ..., yM) is a vector of
M response variables.
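Statistics Toolbox provides the mvregress function for this case. The following minimal sketch, with entirely made-up data and sizes, expands a common design matrix so that each response dimension gets its own coefficients; every name here is an assumption for illustration:
n = 50; p = 2; M = 2;                      % observations, predictors, responses (made up)
Xmat = [ones(n,1) randn(n,p)];             % common design with an intercept column
Y = Xmat*randn(p+1,M) + 0.1*randn(n,M);    % made-up M-variate responses
Xcell = cell(n,1);                         % one design matrix per observation
for i = 1:n
    Xcell{i} = kron(eye(M),Xmat(i,:));     % M-by-(M*(p+1)) block design for observation i
end
[beta,Sigma] = mvregress(Xcell,Y);         % stacked coefficients and error covariance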
Nonlinear Regression
In this section...
“Nonlinear Regression Models” on page 9-58
“Parametric Models” on page 9-59
“Mixed-Effects Models” on page 9-64
Parametric Models
• “A Parametric Nonlinear Model” on page 9-59
• “Confidence Intervals for Parameter Estimates” on page 9-61
• “Confidence Intervals for Predicted Responses” on page 9-61
• “Interactive Nonlinear Parametric Regression” on page 9-62
A Parametric Nonlinear Model. The Hougen-Watson model for reaction kinetics is one example of a parametric nonlinear model:

rate = (β1·x2 − x3/β5) / (1 + β2·x1 + β3·x2 + β4·x3)

where rate is the reaction rate, x1, x2, and x3 are concentrations of hydrogen,
n-pentane, and isopentane, respectively, and β1, β2, ... , β5 are the unknown
parameters.
load reaction
The function for the model is hougen, which looks like this:
type hougen
function yhat = hougen(beta,x)
%HOUGEN Hougen-Watson model for reaction kinetics.
b1 = beta(1);
b2 = beta(2);
b3 = beta(3);
b4 = beta(4);
b5 = beta(5);
x1 = x(:,1);
x2 = x(:,2);
x3 = x(:,3);
yhat = (b1*x2 - x3/b5) ./ (1 + b2*x1 + b3*x2 + b4*x3);
nlinfit requires the predictor data, the responses, and an initial guess of the
unknown parameters. It also requires a function handle to a function that
takes the predictor data and parameter estimates and returns the responses
predicted by the model.
To fit the reaction data, call nlinfit using the following syntax:
load reaction
betahat = nlinfit(reactants,rate,@hougen,beta)
betahat =
1.2526
0.0628
0.0400
0.1124
1.1914
The function nlinfit has robust options, similar to those for robustfit, for
fitting nonlinear models to data with outliers.
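As a sketch of how those options might be used (the weight-function choice here is an assumption), pass a statset options structure with robust fitting turned on:
opts = statset('Robust','on','WgtFun','bisquare'); % robust fitting options (assumed settings)
betarob = nlinfit(reactants,rate,@hougen,beta,opts);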
Confidence Intervals for Parameter Estimates. Use nlparci to compute confidence intervals for the parameter estimates, based on the residuals and the Jacobian returned by nlinfit:
[betahat,resid,J] = nlinfit(reactants,rate,@hougen,beta);
betaci = nlparci(betahat,resid,J)
betaci =
-0.7467 3.2519
-0.0377 0.1632
-0.0312 0.1113
-0.0609 0.2857
-0.7381 3.1208
The columns of the output betaci contain the lower and upper bounds,
respectively, of the (default) 95% confidence intervals for each parameter.
Confidence Intervals for Predicted Responses. Use nlpredci to compute confidence intervals for the predicted responses:
[yhat,delta] = nlpredci(@hougen,reactants,betahat,resid,J);
opd = [rate yhat delta]
opd =
8.5500 8.4179 0.2805
3.7900 3.9542 0.2474
4.8200 4.9109 0.1766
0.0200 -0.0110 0.1875
2.7500 2.6358 0.1578
14.3900 14.3402 0.4236
2.5400 2.5662 0.2425
The output opd contains the observed rates in the first column and the
predicted rates in the second column. The (default) 95% confidence intervals
on the predictions are the values in the second column ± the values in the
third column. These are not intervals for new observations at the predictors,
even though most of the confidence intervals do contain the original
observations.
Interactive Nonlinear Parametric Regression. Open nlintool with the reaction data and the hougen model by typing
load reaction
nlintool(reactants,rate,@hougen,beta,0.01,xn,yn)
You see three plots. The response variable for all plots is the reaction rate,
plotted in green. The red lines show confidence intervals on predicted
responses. The first plot shows hydrogen as the predictor, the second shows
n-pentane, and the third shows isopentane.
Each plot displays the fitted relationship of the reaction rate to one predictor
at a fixed value of the other two predictors. The fixed values are in the text
boxes below each predictor axis. Change the fixed values by typing in a new
value or by dragging the vertical lines in the plots to new positions. When
you change the value of a predictor, all plots update to display the model
at the new point in predictor space.
While this example uses only three predictors, nlintool can accommodate
any number of predictors.
Mixed-Effects Models
• “Introduction” on page 9-64
• “Mixed-Effects Model Hierarchy” on page 9-65
• “Specifying Mixed-Effects Models” on page 9-67
• “Specifying Covariate Models” on page 9-70
• “Choosing nlmefit or nlmefitsa” on page 9-71
• “Using Output Functions with Mixed-Effects Models” on page 9-74
• “Example: Mixed-Effects Models Using nlmefit and nlmefitsa” on page 9-79
• “Example: Examining Residuals for Model Verification” on page 9-93
Introduction
In statistics, an effect is anything that influences the value of a response
variable at a particular setting of the predictor variables. Effects are
translated into model parameters. In linear models, effects become
coefficients, representing the proportional contributions of model terms. In
nonlinear models, effects often have specific physical interpretations, and
appear in more general nonlinear combinations.
For example, consider a simple model in which drug concentration decays exponentially at a rate ri that varies from individual to individual. Writing the individual rate as the population-average rate r̄ plus an individual deviation makes the fixed effect (β = r̄) and the random effect (bi = ri − r̄) explicit:

C0·e^(−[r̄ + (ri − r̄)]t) = C0·e^(−(β + bi)t)
Random effects are useful when data falls into natural groups. In the drug
elimination model, the groups are simply the individuals under study. More
sophisticated models might group data by an individual’s age, weight, diet,
etc. Although the groups are not the focus of the study, adding random effects
to a model extends the reliability of inferences beyond the specific sample of
individuals.
Mixed-effects models account for both fixed and random effects. As with
all regression models, their purpose is to describe a response variable as a
function of the predictor variables. Mixed-effects models, however, recognize
correlations within sample subgroups. In this way, they provide a compromise
between ignoring data groups entirely and fitting each group with a separate
model.
Mixed-Effects Model Hierarchy
In the simplest case, a single parameter vector φ applies to every observation j in every group i:

yij = f(φ, xij) + εij

Allowing the parameters to vary by group gives

yij = f(φi, xij) + εij

where the group-specific parameters split into fixed effects β and random effects bi:

φi = β + bi

Design matrices A and B generalize this to

φi = Aβ + Bbi

or, if the design matrices differ among groups,

φi = Aiβ + Bibi

If the design matrices also differ among observations, the model becomes

φij = Aijβ + Bijbi
yij = f(φij, xij) + εij
Some of the group-specific predictors in xij may not change with observation j.
Calling those vi, the model becomes

yij = f(φij, xij, vi) + εij

Stacking the observations for group i, the general nonlinear mixed-effects model is

φi = Aiβ + Bibi
yi = f(φi, Xi) + εi
bi ~ N(0, Ψ)
εi ~ N(0, σ²I)
Specifying Mixed-Effects Models
Suppose concentration measurements yij for individual i at times tij follow a biexponential elimination model:

yij = Cpi·e^(−rpi·tij) + Cqi·e^(−rqi·tij) + εij
where yij is the observed concentration in individual i at time tij. The model
allows for different sampling times and different numbers of observations for
different individuals.
The elimination rates rpi and rqi must be positive to be physically meaningful.
Enforce this by introducing the log rates Rpi = log(rpi) and Rqi = log(rqi) and
reparametrizing the model:

yij = Cpi·e^(−exp(Rpi)·tij) + Cqi·e^(−exp(Rqi)·tij) + εij

To introduce fixed effects β and random effects bi for all model parameters,
write each parameter as a fixed population value plus an individual deviation, for example

Cpi = β1 + b1i,   Rpi = β2 + b2i,   Cqi = β3 + b3i,   Rqi = β4 + b4i

(the ordering of the parameters is one convenient choice).
Fitting the model and estimating the covariance matrix Ψ often leads to
further refinements. A relatively small estimate for the variance of a random
effect suggests that it can be removed from the model. Likewise, relatively
small estimates for covariances among certain random effects suggests that a
full covariance matrix is unnecessary. Since random effects are unobserved,
Ψ must be estimated indirectly. Specifying a diagonal or block-diagonal
covariance pattern for Ψ can improve convergence and efficiency of the fitting
algorithm.
Statistics Toolbox functions nlmefit and nlmefitsa fit the general nonlinear
mixed-effects model to data, estimating the fixed and random effects. The
functions also estimate the covariance matrix Ψ for the random effects.
Additional diagnostic outputs allow you to assess tradeoffs between the
number of model parameters and the goodness of fit.
Specifying Covariate Models
Suppose the first model parameter depends on a group-level covariate wi (for example, an individual's weight). Incorporate the covariate through the fixed-effects design matrix:
    [φ1]   [1 0 0 wi] [β1]   [1 0 0] [b1]
    [φ2] = [0 1 0 0 ] [β2] + [0 1 0] [b2]
    [φ3]   [0 0 1 0 ] [β3]   [0 0 1] [b3]
                      [β4]

Thus, the parameter φi for any individual in the ith group is:

    [φ1i]   [β1 + β4·wi]   [b1i]
    [φ2i] = [β2        ] + [b2i]
    [φ3i]   [β3        ]   [b3i]
To specify a covariate model, use the 'FEGroupDesign' option.
% Number of parameters in the model (phi)
num_params = 3;
% Number of covariates
num_cov = 1;
% Assuming number of groups in the data set is 7
num_groups = 7;
% Array of covariate values
covariates = [75; 52; 66; 55; 70; 58; 62 ];
A = repmat(eye(num_params, num_params+num_cov),...
[1,1,num_groups]);
A(1,num_params+1,1:num_groups) = covariates(:,1)
options.FEGroupDesign = A;
The 'ApproximationType' option of nlmefit chooses the approximation to the likelihood that is maximized:
• 'LME' — Use the likelihood for the linear mixed-effects model at the
current conditional estimates of beta and B. This is the default.
• 'RELME' — Use the restricted likelihood for the linear mixed-effects model
at the current conditional estimates of beta and B.
• 'FO' — First-order Laplacian approximation without random effects.
• 'FOCE' — First-order Laplacian approximation at the conditional estimates
of B.
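For example, a call of the following form (all data and model names here are placeholders) selects the FOCE approximation:
% Hypothetical call: X, y, group, @mymodel, and phi0 are placeholders
[phi,PSI,stats] = nlmefit(X,y,group,[],@mymodel,phi0, ...
                          'ApproximationType','FOCE');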
Additional options for nlmefitsa include:
• Cov0 — Initial value for the covariance matrix PSI. Must be an r-by-r
positive definite matrix. If empty, the default value depends on the values
of BETA0.
• ComputeStdErrors — true to compute standard errors for the coefficient
estimates and store them in the output STATS structure, or false (default)
to omit this computation.
• LogLikMethod — Specifies the method for approximating the log likelihood.
Using Output Functions with Mixed-Effects Models
The Outputfcn field of the options structure specifies one or more functions that the fitting function calls after each iteration. To use an output function:
1 Write an output function with the signature described below.
2 Use statset to set the value of Outputfcn to be a function handle, that is,
the name of the function preceded by the @ sign. For example, if the output
function is outfun.m, the command

options = statset('OutputFcn',@outfun);

specifies the output function. The output function itself must have the form

stop = outfun(beta,status,state)

where beta is the current vector of parameter estimates, status is a structure
described below, and state is a string indicating the current state of the
algorithm. The solver passes the values of the input arguments to outfun at
each iteration.
Fields in status. The status structure contains the following fields:
• procedure — Either 'ALT' (alternating algorithm for the optimization of the linear mixed-effects or restricted linear mixed-effects approximation) or 'LAP' (optimization of the Laplacian approximation for first-order or first-order conditional estimation).
• iteration — An integer starting from 0.
• inner — A structure describing the status of the inner iterations within the ALT and LAP procedures.
States of the Algorithm. The possible values of state are:
• 'init' — The algorithm is in the initial state before the first iteration.
• 'iter' — The algorithm is at the end of an iteration.
• 'done' — The algorithm is in the final state after the last iteration.
The following code illustrates how the output function might use the value of
state to decide which tasks to perform at the current iteration:
switch state
case 'iter'
% Make updates to plot or guis as needed
case 'init'
% Setup for plots or guis
case 'done'
% Cleanup of plots, guis, or final plot
otherwise
end
Stop Flag. The output argument stop is a flag that is true or false.
The flag tells the solver whether it should quit or continue. The following
examples show typical ways to use the stop flag.
The output function can stop the estimation at any iteration based on the
values of arguments passed into it. For example, the following code sets stop
to true based on the value of the log likelihood stored in the 'fval' field of
the status structure:
function stop = outfun(beta,status,state)
stop = false;
% Check if loglikelihood is more than 132.
if status.fval > -132
stop = true;
end
If you design a GUI to perform nlmefit iterations, you can make the output
function stop when a user clicks a Stop button on the GUI. For example, the
following code implements a dialog to cancel calculations:
function stopper(varargin)
% Set flag to stop when button is pressed
stop = true;
disp('Calculation stopped.')
end
end
To prevent nlmefitsa from using this function, specify an empty value ([]) for the 'OutputFcn' option.
Example: Mixed-Effects Models Using nlmefit and nlmefitsa
Load the indomethacin pharmacokinetic data and plot concentration against time for each subject:
load indomethacin
gscatter(time,concentration,subject)
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
title('{\bf Indomethacin Elimination}')
hold on
Specify a biexponential elimination model with log-rate parameters (the parameterization assumed here), and use the nlinfit function to fit it to all of the data, ignoring subject-specific effects:
model = @(phi,t)(phi(1)*exp(-exp(phi(2))*t) + phi(3)*exp(-exp(phi(4))*t));
phi0 = [1 1 1 1];
[phi,res] = nlinfit(time,concentration,model,phi0);
numObs = length(time);
numParams = 4;
df = numObs-numParams;
mse = (res'*res)/df
mse =
0.0304
tplot = 0:0.01:8;
plot(tplot,model(phi,tplot),'k','LineWidth',2)
hold off
A boxplot of residuals by subject shows that the boxes are mostly above or
below zero, indicating that the model has failed to account for subject-specific
effects:
colors = 'rygcbm';
h = boxplot(res,subject,'colors',colors,'symbol','o');
set(h(~isnan(h)),'LineWidth',2)
hold on
boxplot(res,subject,'colors','k','symbol','ko')
grid on
xlabel('Subject')
ylabel('Residual')
hold off
To account for subject-specific effects, fit the model separately to the data
for each subject:
phi0 = [1 1 1 1];
PHI = zeros(4,6);
RES = zeros(11,6);
for I = 1:6
tI = time(subject == I);
cI = concentration(subject == I);
[PHI(:,I),RES(:,I)] = nlinfit(tI,cI,model,phi0);
end
PHI
PHI =
0.1915 0.4989 1.6757 0.2545 3.5661 0.9685
-1.7878 -1.6354 -0.4122 -1.6026 1.0408 -0.8731
2.0293 2.8277 5.4683 2.1981 0.2915 3.0023
0.5794 0.8013 1.7498 0.2423 -1.5068 1.0882
numParams = 24;
df = numObs-numParams;
mse = (RES(:)'*RES(:))/df
mse =
0.0057
gscatter(time,concentration,subject)
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
title('{\bf Indomethacin Elimination}')
hold on
for I = 1:6
plot(tplot,model(PHI(:,I),tplot),'Color',colors(I))
end
axis([0 8 0 3.5])
hold off
PHI gives estimates of the four model parameters for each of the six subjects.
The estimates vary considerably, but taken as a 24-parameter model of the
data, the mean-squared error of 0.0057 is a significant reduction from 0.0304
in the original four-parameter model.
A boxplot of residuals by subject shows that the larger model accounts for
most of the subject-specific effects:
h = boxplot(RES,'colors',colors,'symbol','o');
set(h(~isnan(h)),'LineWidth',2)
hold on
boxplot(RES,'colors','k','symbol','ko')
grid on
xlabel('Subject')
ylabel('Residual')
hold off
The spread of the residuals (the vertical scale of the boxplot) is much smaller
than in the previous boxplot, and the boxes are now mostly centered on zero.
A mixed-effects model offers a compromise: fixed effects describe the population and random effects capture subject-specific deviations. Define a version of the model whose parameters are indexed as columns of PHI (a form that works with nlmefit), and fit it:
nlme_model = @(PHI,t)(PHI(:,1).*exp(-exp(PHI(:,2)).*t) + ...
                      PHI(:,3).*exp(-exp(PHI(:,4)).*t));
phi0 = [1 1 1 1];
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0)
phi =
0.4606
-1.3459
2.8277
0.7729
PSI =
0.0124 0 0 0
0 0.0000 0 0
0 0 0.3264 0
0 0 0 0.0250
stats =
logl: 54.5884
mse: 0.0066
aic: -91.1767
bic: -71.4698
sebeta: NaN
dfe: 57
The estimated covariance matrix PSI shows that the variance of the second
random effect is essentially zero, suggesting that you can remove it to simplify
the model. To do this, use the REParamsSelect parameter to specify the
indices of the parameters to be modeled with random effects in nlmefit:
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0,'REParamsSelect',[1 3 4])
phi =
0.4606
-1.3460
2.8277
0.7729
PSI =
0.0124 0 0
0 0.3270 0
0 0 0.0250
stats =
logl: 54.5876
mse: 0.0066
aic: -93.1752
bic: -75.6580
sebeta: NaN
dfe: 58
The log-likelihood logl is almost identical to what it was with random effects
for all of the parameters, the Akaike information criterion aic is reduced
from -91.1767 to -93.1752, and the Bayesian information criterion bic is
reduced from -71.4698 to -75.6580. These measures support the decision to
drop the second random effect.
Refitting the simplified model with a full covariance matrix allows for
identification of correlations among the random effects. To do this, use the
CovPattern parameter to specify the pattern of nonzero elements in the
covariance matrix:
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0,'REParamsSelect',[1 3 4], ...
    'CovPattern',ones(3))
The estimated covariance matrix PSI shows that the random effects on the
last two parameters have a relatively strong correlation, and both have a
relatively weak correlation with the first random effect. This structure in
the covariance matrix is more apparent if you convert PSI to a correlation
matrix using corrcov:
RHO = corrcov(PSI)
RHO =
1.0000 0.4707 0.1179
0.4707 1.0000 0.9316
0.1179 0.9316 1.0000
clf; imagesc(RHO)
set(gca,'XTick',[1 2 3],'YTick',[1 2 3])
title('{\bf Random Effect Correlation}')
h = colorbar;
set(get(h,'YLabel'),'String','Correlation');
Incorporate this structure into the model by changing the specification of the
covariance pattern to block-diagonal:
P = [1 0 0; 0 1 1; 0 1 1]; % Covariance pattern: group the last two random effects
[phi,PSI,stats,b] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0,'REParamsSelect',[1 3 4], ...
    'CovPattern',P)
-1.1087
2.8056
0.8476
PSI =
0.0331 0 0
0 0.4793 0.1069
0 0.1069 0.0294
stats =
logl: 57.4996
mse: 0.0061
aic: -96.9992
bic: -77.2923
sebeta: NaN
dfe: 57
b =
-0.2438 0.0723 0.2014 0.0592 -0.2181 0.1289
-0.8500 -0.1237 0.9538 -0.7267 0.5895 0.1571
-0.1591 0.0033 0.1568 -0.2144 0.1834 0.0300
The output b gives predictions of the three random effects for each of the six
subjects. These are combined with the estimates of the fixed effects in phi
to produce the mixed-effects model.
Use the following commands to plot the mixed-effects model for each of the six
subjects. For comparison, the model without random effects is also shown.
RES = zeros(11,6);
for I = 1:6
    % Combine the fixed effects phi with subject I's random effects b
    % (random effects on parameters 1, 3, and 4, as in the fit above)
    fitted_model = @(t)(phi(1)+b(1,I))*exp(-exp(phi(2))*t) + ...
                       (phi(3)+b(2,I))*exp(-exp(phi(4)+b(3,I))*t);
    tI = time(subject == I);
    cI = concentration(subject == I);
RES(:,I) = cI - fitted_model(tI);
subplot(2,3,I)
scatter(tI,cI,20,colors(I),'filled')
hold on
plot(tplot,fitted_model(tplot),'Color',colors(I))
plot(tplot,model(phi,tplot),'k')
axis([0 8 0 3.5])
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
legend(num2str(I),'Subject','Fixed')
end
If obvious outliers in the data (visible in previous box plots) are ignored, a
normal probability plot of the residuals shows reasonable agreement with
model assumptions on the errors:
clf; normplot(RES(:))
• Good when testing against the same type of model as generates the data
• Poor when tested against incorrect data models
Example: Examining Residuals for Model Verification
This example simulates grouped data from a known mixed-effects model, fits several candidate models, and uses residual plots to judge which fits match the data. Given a number of groups nGroups, a regression function nlmefun, a matrix of individual parameters, a proportional error parameter, and a vector of sampling times, simulate the response for each group:
groups = repmat(1:nGroups,numel(time),1);
groups = vertcat(groups(:));
y = zeros(numel(time)*nGroups,1);
x = zeros(numel(time)*nGroups,1);
for i = 1:nGroups
idx = groups == i;
f = nlmefun(individualParameters(i,:), time);
% Make a proportional error model for y:
y(idx) = f + errorParam*f.*randn(numel(f),1);
x(idx) = time;
end
P = [ 1 0 ; 0 1 ];
3 Fit the data using the same regression function and error model as the
model generator:
function plotResiduals(stats)
pwres = stats.pwres;
iwres = stats.iwres;
cwres = stats.cwres;
figure
subplot(2,3,1);
normplot(pwres); title('PWRES')
subplot(2,3,4);
createhistplot(pwres);
subplot(2,3,2);
normplot(cwres); title('CWRES')
subplot(2,3,5);
createhistplot(cwres);
subplot(2,3,3);
normplot(iwres); title('IWRES')
subplot(2,3,6);
createhistplot(iwres); title('IWRES')
function createhistplot(pwres)
[x, n] = hist(pwres);
d = n(2)- n(1);
x = x/sum(x*d);
bar(n,x);
ylim([0 max(x)*1.05]);
hold on;
x2 = -4:0.1:4;
f2 = normpdf(x2,0,1);
plot(x2,f2,'r');
end
end
plotResiduals(stats);
The upper probability plots look straight, meaning the residuals are
normally distributed. The bottom histogram plots match the superimposed
normal density plot. So you can conclude that the error model matches
the data.
6 For comparison, fit the data using a constant error model, instead of the
proportional model that created the data:
The upper probability plots are not straight, indicating the residuals are
not normally distributed. The bottom histogram plots are fairly close to the
superimposed normal density plots.
7 For another comparison, fit the data to a different structural model than
created the data:
Not only are the upper probability plots not straight, but the histogram
plot is quite skewed compared to the superimposed normal density. These
residuals are not normally distributed, and do not match the model.
10
Multivariate Methods
Introduction
Large, high-dimensional data sets are common in the modern era
of computer-based instrumentation and electronic data storage.
High-dimensional data present many challenges for statistical visualization,
analysis, and modeling.
Multidimensional Scaling
In this section...
“Introduction” on page 10-3
“Classical Multidimensional Scaling” on page 10-3
“Nonclassical Multidimensional Scaling” on page 10-8
“Nonmetric Multidimensional Scaling” on page 10-10
Introduction
One of the most important goals in visualizing data is to get a sense of how
near or far points are from each other. Often, you can do this with a scatter
plot. However, for some analyses, the data that you have might not be in
the form of points at all, but rather in the form of pairwise similarities or
dissimilarities between cases, observations, or subjects. There are no points
to plot.
Even if your data are in the form of points rather than pairwise distances,
a scatter plot of those data might not be useful. For some kinds of data,
the relevant way to measure how near two points are might not be their
Euclidean distance. While scatter plots of the raw data make it easy to
compare Euclidean distances, they are not always useful when comparing
other kinds of inter-point distances, city block distance for example, or even
more general dissimilarities. Also, with a large number of variables, it is very
difficult to visualize distances unless the data can be represented in a small
number of dimensions. Some sort of dimension reduction is usually necessary.
Classical Multidimensional Scaling
Introduction
The function cmdscale performs classical (metric) multidimensional scaling,
also known as principal coordinates analysis. cmdscale takes as an input a
matrix of inter-point distances and creates a configuration of points. Ideally,
those points are in two or three dimensions, and the Euclidean distances
between them reproduce the original distance matrix. Thus, a scatter plot
of the points created by cmdscale provides a visual representation of the
original distances.
As a very simple example, you can reconstruct a set of points from only their
inter-point distances. First, create some four dimensional points with a small
component in their fourth coordinate, and reduce them to distances.
X = [ normrnd(0,1,10,3), normrnd(0,.1,10,1) ];
D = pdist(X,'euclidean');
[Y,eigvals] = cmdscale(D);
cmdscale produces two outputs. The first output, Y, is a matrix containing the
reconstructed points. The second output, eigvals, is a vector containing the
sorted eigenvalues of what is often referred to as the “scalar product matrix,”
which, in the simplest case, is equal to Y*Y'. The relative magnitudes of those
eigenvalues indicate the relative contribution of the corresponding columns of
Y in reproducing the original distance matrix D with the reconstructed points.
format short g
[eigvals eigvals/max(abs(eigvals))]
ans =
12.623 1
4.3699 0.34618
1.9307 0.15295
0.025884 0.0020505
1.7192e-015 1.3619e-016
6.8727e-016 5.4445e-017
4.4367e-017 3.5147e-018
-9.2731e-016 -7.3461e-017
-1.327e-015 -1.0513e-016
-1.9232e-015 -1.5236e-016
If eigvals contains only positive and zero (within round-off error) eigenvalues,
the columns of Y corresponding to the positive eigenvalues provide an exact
reconstruction of D, in the sense that their inter-point Euclidean distances,
computed using pdist, for example, are identical (within round-off) to the
values in D.
If two or three of the eigenvalues in eigvals are much larger than the rest,
then the distance matrix based on the corresponding columns of Y nearly
reproduces the original distance matrix D. In this sense, those columns
form a lower-dimensional representation that adequately describes the
data. However it is not always possible to find a good low-dimensional
reconstruction.
% good reconstruction in 3D
maxerr3 = max(abs(D - pdist(Y(:,1:3))))
maxerr3 =
0.029728
% poor reconstruction in 2D
maxerr2 = max(abs(D - pdist(Y(:,1:2))))
maxerr2 =
0.91641
max(max(D))
ans =
3.4686
As a more realistic example, take the distances in miles between 10 U.S. cities:
cities = ...
{'Atl','Chi','Den','Hou','LA','Mia','NYC','SF','Sea','WDC'};
D = [ 0 587 1212 701 1936 604 748 2139 2182 543;
587 0 920 940 1745 1188 713 1858 1737 597;
1212 920 0 879 831 1726 1631 949 1021 1494;
701 940 879 0 1374 968 1420 1645 1891 1220;
1936 1745 831 1374 0 2339 2451 347 959 2300;
604 1188 1726 968 2339 0 1092 2594 2734 923;
748 713 1631 1420 2451 1092 0 2571 2408 205;
2139 1858 949 1645 347 2594 2571 0 678 2442;
2182 1737 1021 1891 959 2734 2408 678 0 2329;
543 597 1494 1220 2300 923 205 2442 2329 0];
[Y,eigvals] = cmdscale(D);
format short g
[eigvals eigvals/max(abs(eigvals))]
ans =
9.5821e+006 1
1.6868e+006 0.17604
8157.3 0.0008513
1432.9 0.00014954
508.67 5.3085e-005
25.143 2.624e-006
5.3394e-010 5.5722e-017
-897.7 -9.3685e-005
-5467.6 -0.0005706
-35479 -0.0037026
Some of the eigenvalues are negative, indicating that the inter-city distances are not exactly Euclidean and cannot be reproduced perfectly in any configuration of points. However, in this case, the two largest positive eigenvalues are much larger
in magnitude than the remaining eigenvalues. So, despite the negative
eigenvalues, the first two coordinates of Y are sufficient for a reasonable
reproduction of D.
Dtriu = D(find(tril(ones(10),-1)))';
maxrelerr = max(abs(Dtriu-pdist(Y(:,1:2))))./max(Dtriu)
maxrelerr =
0.0075371
plot(Y(:,1),Y(:,2),'.')
text(Y(:,1)+25,Y(:,2),cities)
xlabel('Miles')
ylabel('Miles')
Nonclassical Multidimensional Scaling
The function mdscale performs nonclassical multidimensional scaling. This example uses data on breakfast cereals:
load cereal.mat
X = [Calories Protein Fat Sodium Fiber ...
Carbo Sugars Shelf Potass Vitamins];
dissimilarities = pdist(zscore(X),'cityblock');
size(dissimilarities)
ans =
1 231
This example code first standardizes the cereal data, and then uses city block
distance as a dissimilarity. The choice of transformation to dissimilarities is
application-dependent, and the choice here is only for simplicity. In some
applications, the original data are already in the form of dissimilarities.
Next, use mdscale to perform metric MDS. Unlike cmdscale, you must
specify the desired number of dimensions, and the method to use to construct
the output configuration. For this example, use two dimensions. The metric
STRESS criterion is a common method for computing the output; for other
choices, see the mdscale reference page in the online documentation. The
second output from mdscale is the value of that criterion evaluated for the
output configuration. It measures how well the inter-point distances of
the output configuration approximate the original input dissimilarities:
[Y,stress] =...
mdscale(dissimilarities,2,'criterion','metricstress');
stress
stress =
0.1856
plot(Y(:,1),Y(:,2),'o','LineWidth',2);
gname(Name(mfg1)) % mfg1: row indices of a subset of cereals to label (defined separately)
Nonmetric Multidimensional Scaling
You use mdscale to perform nonmetric MDS in much the same way as for
metric scaling. The nonmetric STRESS criterion is a common method for
computing the output; for more choices, see the mdscale reference page in
the online documentation. As with metric scaling, the second output from
mdscale is the value of that criterion evaluated for the output configuration.
For nonmetric scaling, however, it measures how well the inter-point
distances of the output configuration approximate the disparities. The
disparities are returned in the third output. They are the transformed values
of the original dissimilarities:
[Y,stress,disparities] = ...
mdscale(dissimilarities,2,'criterion','stress');
stress
stress =
0.1562
distances = pdist(Y);
[dum,ord] = sortrows([disparities(:) dissimilarities(:)]);
plot(dissimilarities,distances,'bo', ...
dissimilarities(ord),disparities(ord),'r.-', ...
[0 25],[0 25],'k-')
xlabel('Dissimilarities')
ylabel('Distances/Disparities')
legend({'Distances' 'Disparities' '1:1 Line'},...
'Location','NorthWest');
This plot shows that mdscale has found a configuration of points in two
dimensions whose inter-point distances approximate the disparities, which
in turn are a nonlinear transformation of the original dissimilarities. The
concave shape of the disparities as a function of the dissimilarities indicates
that the fit tends to contract small distances relative to the corresponding
dissimilarities. This may be perfectly acceptable in practice.
Nonmetric STRESS can have local minima. To check the solution, repeat the scaling from several random starting configurations:
opts = statset('Display','final');
[Y,stress] =...
mdscale(dissimilarities,2,'criterion','stress',...
'start','random','replicates',5,'Options',opts);
Notice that mdscale finds several different local solutions, some of which
do not have as low a stress value as the solution found with the cmdscale
starting point.
Procrustes Analysis
In this section...
“Comparing Landmark Data” on page 10-14
“Data Input” on page 10-14
“Preprocessing Data for Accurate Results” on page 10-15
“Example: Comparing Handwritten Shapes” on page 10-16
Data Input
The procrustes function takes two matrices as input: a target shape matrix X and a comparison shape matrix Y, each with one row per landmark point and one column per dimension. It determines a linear transformation of Y,

Z = bYT + c    (10-1)

where b is a scale component, T is an orthogonal rotation and reflection component, and c is a translation component. The transformation is chosen to minimize the sum of squared differences

Σ (i=1..n) Σ (j=1..p) (Xij − Zij)²
Example: Comparing Handwritten Shapes
This example compares landmark data for two handwritten number threes, stored as 10-by-2 matrices A and B of (x,y) coordinates.
Create X and Y from A and B, moving B to the side to make each shape more
visible:
X = A;
Y = B + repmat([25 0], 10,1);
Plot the shapes, using letters to designate the landmark points. Lines in the
figure join the points to indicate the drawing path of each shape.
Use procrustes to find the transformation of Y that best matches X, and display the dissimilarity measure d:
[d,Z,tr] = procrustes(X,Y);
d
d =
0.1502
The small value of d in this case shows that the two shapes are similar.
procrustes computes d as the sum of squared differences between X and Z, standardized by the total sum of squares of X about its column means. You can verify this directly:
numerator = sum(sum((X-Z).^2))
numerator =
166.5321
denominator = sum(sum(bsxfun(@minus,X,mean(X)).^2))
denominator =
1.1085e+003
ratio = numerator/denominator
ratio =
0.1502
The third output, tr, is a structure containing the transformation components. Examine the scale component:
tr.b
ans =
0.9291
The sizes of the target and comparison shapes appear similar. This visual
impression is reinforced by the value of b = 0.93, which implies that the best
transformation results in shrinking the comparison shape by a factor .93
(only 7%).
Requiring the transformation to preserve the size of the comparison shape ('Scaling',false) increases the dissimilarity only slightly:
ds = procrustes(X,Y,'Scaling',false)
ds =
0.1552
The determinant of the rotation component shows that the best-fitting transformation involves no reflection:
det(tr.T)
ans =
1.0000
Forcing a reflection ('Reflection',true) produces a much poorer match:
[dr,Zr,trr] = procrustes(X,Y,'Reflection',true);
dr
dr =
0.8130
• The landmark data points are now further away from their target
counterparts.
• The transformed three is now an undesirable mirror image of the target
three.
It appears that the shapes might be better matched if you flipped the
transformed shape upside down. Flipping the shapes would make the
transformation even worse, however, because the landmark data points
would be further away from their target counterparts. From this example,
it is clear that manually adjusting the scaling and reflection parameters is
generally not optimal.
Feature Selection
In this section...
“Introduction” on page 10-23
“Sequential Feature Selection” on page 10-23
Introduction
Feature selection reduces the dimensionality of data by selecting only a subset
of measured features (predictor variables) to create a model. Selection criteria
usually involve the minimization of a specific measure of predictive error for
models fit to different subsets. Algorithms search for a subset of predictors
that optimally model measured responses, subject to constraints such as
required or excluded features and the size of the subset.
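Constraints of this kind can be expressed directly in sequentialfs. The following sketch, with a hypothetical criterion function and data, forces one feature into every candidate subset and excludes another:
% Hypothetical: critfun, X, and Y stand for a criterion function and data
keepin  = [true false false false false false false false false false];  % always include feature 1
keepout = [false false true false false false false false false false];  % never include feature 3
inmodel = sequentialfs(@critfun,X,Y,'cv','none', ...
    'keepin',keepin,'keepout',keepout);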
Sequential Feature Selection
Introduction
A common method of feature selection is sequential feature selection. This
method has two components:
• An objective function, called the criterion, which the method seeks to minimize over all feasible feature subsets. Common criteria are mean squared error (for regression models) and misclassification rate (for classification models).
• A sequential search algorithm, which adds or removes features from a candidate subset while evaluating the criterion.
n = 100;
m = 10;
X = rand(n,m);
b = [1 0 0 2 .5 0 0 0.1 0 1];
Xb = X*b';
p = 1./(1+exp(-Xb));
N = 50;
y = binornd(N,p);
Y = [y N*ones(size(y))];
[b0,dev0,stats0] = glmfit(X,Y,'binomial');
This is the full model, using all of the features (and an initial constant term).
Sequential feature selection searches for a subset of the features in the full
model with comparative predictive power.
First, you must specify a criterion for selecting the features. The following
function, which calls glmfit and returns the deviance of the fit (a
generalization of the residual sum of squares) is a useful criterion in this case:
function dev = critfun(X,Y)
[b,dev] = glmfit(X,Y,'binomial');
maxdev = chi2inv(.95,1);
opt = statset('display','iter',...
'TolFun',maxdev,...
'TolTypeFun','abs');
inmodel = sequentialfs(@critfun,X,Y,...
'cv','none',...
'nullmodel',true,...
'options',opt,...
'direction','forward');
The iterative display shows a decrease in the criterion value as each new
feature is added to the model. The final result is a reduced model with only
four of the original ten features: columns 1, 4, 5, and 10 of X. These features
are indicated in the logical vector inmodel returned by sequentialfs.
The deviance of the reduced model is higher than for the full model, but
the addition of any other single feature would not decrease the criterion
by more than the absolute tolerance, maxdev, set in the options structure.
Adding a feature with no effect reduces the deviance by an amount that has
a chi-square distribution with one degree of freedom. Adding a significant
feature results in a larger change. By setting maxdev to chi2inv(.95,1), you
instruct sequentialfs to continue adding features so long as the change in
deviance is more than would be expected by random chance.
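If you also want to see which feature entered at each step and how the criterion changed, request the second output of sequentialfs (a sketch reusing the options above):
[inmodel,history] = sequentialfs(@critfun,X,Y,...
    'cv','none','nullmodel',true,'options',opt,'direction','forward');
history.In    % logical matrix: row k marks the features in the model after step k
history.Crit  % criterion value after each step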
To examine the reduced model, fit it directly using only the selected features:
[b,dev,stats] = glmfit(X(:,inmodel),Y,'binomial');
Feature Transformation
In this section...
“Introduction” on page 10-28
“Nonnegative Matrix Factorization” on page 10-28
“Principal Component Analysis (PCA)” on page 10-31
“Factor Analysis” on page 10-45
Introduction
Feature transformation is a group of methods that create new features
(predictor variables). The methods are useful for dimension reduction when
the transformed features have a descriptive power that is more easily ordered
than the original features. In this case, less descriptive features can be
dropped from consideration when building models.
Introduction
Nonnegative matrix factorization (NMF) is a dimension-reduction technique
based on a low-rank approximation of the feature space. Besides providing
a reduction in the number of features, NMF guarantees that the features
are nonnegative, producing additive models that respect, for example, the
nonnegativity of physical quantities.
For example, compute a rank-two approximation of the five predictors in the moore data:
load moore
X = moore(:,1:5);
opt = statset('MaxIter',10,'Display','final');
[W0,H0] = nnmf(X,2,'replicates',5,...
'options',opt,...
'algorithm','mult');
rep iteration rms resid |delta x|
1 10 358.296 0.00190554
2 10 78.3556 0.000351747
3 10 230.962 0.0172839
4 10 326.347 0.00739552
5 10 361.547 0.00705539
Final root mean square residual = 78.3556
Use the factors from the best multiplicative run as a starting point for the alternating least-squares algorithm:
opt = statset('Maxiter',1000,'Display','final');
[W,H] = nnmf(X,2,'w0',W0,'h0',H0,...
'options',opt,...
'algorithm','als');
rep iteration rms resid |delta x|
1 3 77.5315 3.52673e-005
Final root mean square residual = 77.5315
The two columns of W are the transformed predictors. The two rows of H give
the relative contributions of each of the five predictors in X to the predictors
in W:
H
H =
0.0835 0.0190 0.1782 0.0072 0.9802
0.0558 0.0250 0.9969 0.0085 0.0497
The fifth predictor in X (weight 0.9802) strongly influences the first predictor
in W. The third predictor in X (weight 0.9969) strongly influences the second
predictor in W.
A biplot visualizes the relative contributions:
biplot(H','scores',W,'varlabels',{'','','v3','','v5'});
axis([0 1.1 0 1.1])
xlabel('Column 1')
ylabel('Column 2')
Principal Component Analysis (PCA)
Introduction
One of the difficulties inherent in multivariate statistics is the problem of
visualizing data that has many variables. The MATLAB function plot
displays a graph of the relationship between two variables. The plot3
and surf commands display different three-dimensional views. But when
there are more than three variables, it is more difficult to visualize their
relationships.
Principal component analysis addresses this problem by replacing the original variables with a smaller number of uncorrelated linear combinations of them, the principal components, ordered by the amount of variance they capture.
The first principal component is a single axis in space. When you project
each observation on that axis, the resulting values form a new variable. And
the variance of this variable is the maximum among all possible choices of
the first axis.
The full set of principal components is as large as the original set of variables.
But it is commonplace for the sum of the variances of the first few principal
components to exceed 80% of the total variance of the original data. By
examining plots of these few new variables, researchers often develop a
deeper understanding of the driving forces that generated the original data.
You can use the function princomp to find the principal components. To use
princomp, you need to have the actual measured data you want to analyze.
However, if you lack the actual data, but have the sample covariance or
correlation matrix for the data, you can still use the function pcacov to
perform a principal components analysis. See the reference page for pcacov
for a description of its inputs and outputs.
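As a minimal sketch of that alternative (the covariance matrix here is simply computed from made-up data), pcacov returns the coefficients, the component variances, and the percentage of variance explained:
V = cov(randn(100,4));                % stand-in covariance matrix (made up)
[coeff,latent,explained] = pcacov(V); % loadings, variances, percent explained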
The following example analyzes quality-of-life ratings for 329 U.S. metropolitan areas in nine categories, from the cities data:
load cities
whos
Name Size Bytes Class
categories 9x14 252 char array
names 329x43 28294 char array
ratings 329x9 23688 double array
The whos command generates a table of information about all the variables
in the workspace.
categories
categories =
climate
housing
health
crime
transportation
education
arts
recreation
economics
first5 = names(1:5,:)
first5 =
Abilene, TX
Akron, OH
Albany, GA
Albany-Troy, NY
Albuquerque, NM
boxplot(ratings,'orientation','horizontal','labels',categories)
This command generates the plot below. Note that there is substantially
more variability in the ratings of the arts and housing than in the ratings
of crime and climate.
Ordinarily you might also graph pairs of the original variables, but there are
36 two-variable plots. Perhaps principal components analysis can reduce the
number of variables you need to consider.
Sometimes it makes sense to compute principal components for raw data. This
is appropriate when all the variables are in the same units. Standardizing the
data is often preferable when the variables are in different units or when the
variance of the different columns is substantial (as in this case).
You can standardize the data by dividing each column by its standard
deviation.
stdr = std(ratings);
sr = ratings./repmat(stdr,329,1);
Now perform the principal components analysis on the standardized data and examine the coefficients of the first three components:
[coefs,scores,variances,t2] = princomp(sr);
c3 = coefs(:,1:3)
c3 =
0.2064 0.2178 -0.6900
0.3565 0.2506 -0.2082
0.4602 -0.2995 -0.0073
0.2813 0.3553 0.1851
0.3512 -0.1796 0.1464
0.2753 -0.4834 0.2297
0.4631 -0.1948 -0.0265
0.3279 0.3845 -0.0509
0.1354 0.4713 0.6073
The largest coefficients in the first column (first principal component) are
the third and seventh elements, corresponding to the variables health and
arts. All the coefficients of the first principal component have the same sign,
making it a weighted average of all the original variables.
The principal component coefficient vectors are orthonormal, as you can verify:
I = c3'*c3
I =
1.0000 -0.0000 -0.0000
-0.0000 1.0000 -0.0000
-0.0000 -0.0000 1.0000
A plot of the first two columns of scores shows the ratings data projected
onto the first two principal components. princomp computes the scores to
have mean zero.
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
The function gname is useful for graphically identifying a few points in a plot
like this. You can call gname with a string matrix containing as many case
labels as points in the plot. The string matrix names works for labeling points
with the city names.
gname(names)
Move your cursor over the plot and click once near each point in the right
half. As you click each point, it is labeled with the proper row from the names
string matrix. Here is the plot after a few clicks:
When you are finished labeling points, press the Return key.
The labeled cities are some of the biggest population centers in the United
States. They are definitely different from the remainder of the data, so
perhaps they should be considered separately. To remove the labeled cities
from the data, first identify their corresponding row numbers as follows:
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component');
ylabel('2nd Principal Component');
Call gname with no arguments, and click near the points you labeled in the
preceding figure. This labels the points by their row numbers, as shown in the following figure.
Then you can create an index variable containing the row numbers of all
the metropolitan areas you choose.
To remove these rows from the ratings matrix, enter the following.
rsubset = ratings;
nsubset = names;
nsubset(metro,:) = [];
rsubset(metro,:) = [];
size(rsubset)
ans =
322 9
The variances output is a vector containing the variance of each principal component:
variances
variances =
3.4083
1.2140
1.1415
0.9209
0.7533
0.6306
0.4930
0.3180
0.1204
You can easily calculate the percent of the total variability explained by each
principal component.
percent_explained = 100*variances/sum(variances)
percent_explained =
37.8699
13.4886
12.6831
10.2324
8.3698
7.0062
5.4783
3.5338
1.3378
Use the pareto function to make a scree plot of the percent variability
explained by each principal component.
pareto(percent_explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')
The preceding figure shows that the only clear break in the amount of
variance accounted for by each component is between the first and second
components. However, that component by itself explains less than 40% of the
variance, so more components are probably needed. You can see that the first
three principal components explain roughly two-thirds of the total variability
in the standardized ratings, so that might be a reasonable way to reduce the
dimensions in order to visualize the data.
The fourth output of princomp, t2, is Hotelling's T², a measure of each observation's multivariate distance from the center of the data set. The largest value belongs to New York; it is not surprising that the ratings for New York are the furthest from the
average U.S. town.
Visualizing the Results. Use the biplot function to help visualize both
the principal component coefficients for each variable and the principal
component scores for each observation in a single plot. For example, the
following command plots the results from the principal components analysis
on the cities and labels each of the variables.
biplot(coefs(:,1:2), 'scores',scores(:,1:2),...
'varlabels',categories);
axis([-.26 1 -.51 .51]);
Each of the nine variables is represented in this plot by a vector, and the
direction and length of the vector indicates how each variable contributes to
the two principal components in the plot. For example, you have seen that the
first principal component, represented in this biplot by the horizontal axis,
has positive coefficients for all nine variables. That corresponds to the nine
vectors directed into the right half of the plot. You have also seen that the
second principal component, represented by the vertical axis, has positive
coefficients for the variables education, health, arts, and transportation, and
negative coefficients for the remaining five variables. That corresponds to
vectors directed into the top and bottom halves of the plot, respectively. This
indicates that this component distinguishes between cities that have high
values for the first set of variables and low for the second, and cities that
have the opposite.
The variable labels in this figure are somewhat crowded. You could either
leave out the VarLabels parameter when making the plot, or simply select
and drag some of the labels to better positions using the Edit Plot tool from
the figure window toolbar.
You can use the Data Cursor, in the Tools menu in the figure window, to
identify the items in this plot. By clicking on a variable (vector), you can read
off that variable’s coefficients for each principal component. By clicking on
an observation (point), you can read off that observation’s scores for each
principal component.
You can also make a biplot in three dimensions. This can be useful if the first
two principal coordinates do not explain enough of the variance in your data.
Selecting Rotate 3D in the Tools menu enables you to rotate the figure to
see it from different angles.
biplot(coefs(:,1:3), 'scores',scores(:,1:3),...
'obslabels',names);
axis([-.26 1 -.51 .51 -.61 .81]);
view([30 40]);
Factor Analysis
• “Introduction” on page 10-45
• “Example: Factor Analysis” on page 10-46
Introduction
Multivariate data often includes a large number of measured variables, and
sometimes those variables overlap, in the sense that groups of them might be
dependent. For example, in a decathlon, each athlete competes in 10 events,
but several of them can be thought of as speed events, while others can be
thought of as strength events, etc. Thus, you can think of a competitor’s 10
event scores as largely dependent on a smaller set of three or four types of
athletic ability.
Factor analysis is a way to fit a model to multivariate data to estimate just this
sort of interdependence. In a factor analysis model, the measured variables
depend on a smaller number of unobserved (latent) factors. Because each
factor might affect several variables in common, they are known as common
factors. Each variable is assumed to be dependent on a linear combination
of the common factors, and the coefficients are known as loadings. Each
measured variable also includes a component due to independent random
variability, known as specific variance because it is specific to one variable.
Specifically, factor analysis assumes that the covariance matrix of your data
is of the form
$\Sigma_x = \Lambda \Lambda^{T} + \Psi$
where Λ is the matrix of loadings, and the elements of the diagonal matrix
Ψ are the specific variances. The function factoran fits the Factor Analysis
model using maximum likelihood.
Factor Loadings. Over the course of 100 weeks, the percent change in stock
prices for ten companies has been recorded. Of the ten companies, the first
four can be classified as primarily technology, the next three as financial, and
the last three as retail. It seems reasonable that the stock prices for companies
that are in the same sector might vary together as economic conditions
change. Factor Analysis can provide quantitative evidence that companies
within each sector do experience similar week-to-week changes in stock price.
In this example, you first load the data, and then call factoran, specifying a
model fit with three common factors. By default, factoran computes rotated
estimates of the loadings to try to make their interpretation simpler. But in
this example, you specify an unrotated solution.
load stockreturns
[Loadings,specificVar,T,stats] = ...
factoran(stocks,3,'rotate','none');
The first two factoran return arguments are the estimated loadings and the
estimated specific variances. Each row of the loadings matrix represents one
of the ten stocks, and each column corresponds to a common factor. With
unrotated estimates, interpretation of the factors in this fit is difficult because
most of the stocks contain fairly large coefficients for two or more factors.
Loadings
Loadings =
0.8885 0.2367 -0.2354
0.7126 0.3862 0.0034
0.3351 0.2784 -0.0211
0.3088 0.1113 -0.1905
0.6277 -0.6643 0.1478
0.4726 -0.6383 0.0133
0.1133 -0.5416 0.0322
0.6403 0.1669 0.4960
0.2363 0.5293 0.5770
0.1105 0.1680 0.5524
Note “Factor Rotation” on page 10-48 helps to simplify the structure in the
Loadings matrix, to make it easier to assign meaningful interpretations to
the factors.
From the estimated specific variances, you can see that the model indicates
that a particular stock price varies quite a lot beyond the variation due to
the common factors.
specificVar
specificVar =
0.0991
0.3431
0.8097
0.8559
0.1429
0.3691
0.6928
0.3162
0.3311
0.6544
The p value returned in the stats structure is large, so the test fails to reject the null hypothesis of three common factors, suggesting that this model provides a satisfactory explanation of the covariation in these data.
stats.p
ans =
0.8144
To determine whether fewer than three factors can provide an acceptable fit,
you can try a model with two common factors. The p value for this second fit
is highly significant, and rejects the hypothesis of two factors, indicating that
the simpler model is not sufficient to explain the pattern in these data.
[Loadings2,specificVar2,T2,stats2] = ...
factoran(stocks, 2,'rotate','none');
stats2.p
ans =
3.5610e-006
Factor Rotation
Factor rotation re-expresses the estimated loadings in a rotated coordinate system, which can make them easier to interpret. There are various ways to do this.
Some methods leave the axes orthogonal, while others are oblique methods
that change the angles between them. For this example, you can rotate the
estimated loadings by using the promax criterion, a common oblique method.
[LoadingsPM,specVarPM] = factoran(stocks,3,'rotate','promax');
LoadingsPM
LoadingsPM =
0.9452 0.1214 -0.0617
0.7064 -0.0178 0.2058
0.3885 -0.0994 0.0975
0.4162 -0.0148 -0.1298
0.1021 0.9019 0.0768
0.0873 0.7709 -0.0821
-0.1616 0.5320 -0.0888
0.2169 0.2844 0.6635
0.0016 -0.1881 0.7849
-0.2289 0.0636 0.6475
biplot(LoadingsPM,'varlabels',num2str((1:10)'));
axis square
view(155,27);
This plot shows that promax has rotated the factor loadings to a simpler
structure. Each stock depends primarily on only one factor, and it is possible
to describe each factor in terms of the stocks that it affects. Based on which
companies are near which axes, you could reasonably conclude that the first
factor axis represents the financial sector, the second retail, and the third
technology. The original conjecture, that stocks vary primarily within sector,
is apparently supported by the data.
Because the data in this example are the raw stock price changes, and not
just their correlation matrix, you can have factoran return estimates of the
value of each of the three rotated common factors for each week. You can
then plot the estimated scores to see how the different stock sectors were
affected during each week.
[LoadingsPM,specVarPM,TPM,stats,F] = ...
factoran(stocks, 3,'rotate','promax');
plot3(F(:,1),F(:,2),F(:,3),'b.')
line([-4 4 NaN 0 0 NaN 0 0], [0 0 NaN -4 4 NaN 0 0],...
[0 0 NaN 0 0 NaN -4 4], 'Color','black')
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
grid on
axis square
view(-22.5, 8)
Oblique rotation often creates factors that are correlated. This plot shows
some evidence of correlation between the first and third factors, and you can
investigate further by computing the estimated factor correlation matrix.
inv(TPM'*TPM)
ans =
1.0000 0.1559 0.4082
0.1559 1.0000 -0.0559
0.4082 -0.0559 1.0000
Visualizing the Results. You can use the biplot function to help visualize
both the factor loadings for each variable and the factor scores for each
observation in a single plot. For example, the following command plots the
results from the factor analysis on the stock data and labels each of the 10
stocks.
biplot(LoadingsPM,'scores',F,'varlabels',num2str((1:10)'))
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
axis square
view(155,27)
In this case, the factor analysis includes three factors, and so the biplot is
three-dimensional. Each of the 10 stocks is represented in this plot by a vector,
and the direction and length of the vector indicates how each stock depends
on the underlying factors. For example, you have seen that after promax
rotation, the first four stocks have positive loadings on the first factor, and
unimportant loadings on the other two factors. That first factor, interpreted
as a financial sector effect, is represented in this biplot as one of the horizontal
axes. The dependence of those four stocks on that factor corresponds to the
four vectors directed approximately along that axis. Similarly, the dependence
of stocks 5, 6, and 7 primarily on the second factor, interpreted as a retail
sector effect, is represented by vectors directed approximately along that axis.
Each of the 100 observations is represented in this plot by a point, and their
locations indicate the score of each observation for the three factors. For
example, points near the top of this plot have the highest scores for the
technology sector factor. The points are scaled to fit within the unit square, so
only their relative locations can be determined from the plot.
You can use the Data Cursor tool from the Tools menu in the figure window
to identify the items in this plot. By clicking a stock (vector), you can read off
that stock’s loadings for each factor. By clicking an observation (point), you
can read off that observation’s scores for each factor.
Cluster Analysis
Introduction
Cluster analysis, also called segmentation analysis or taxonomy analysis,
creates groups, or clusters, of data. Clusters are formed in such a way that
objects in the same cluster are very similar and objects in different clusters
are very distinct. Measures of similarity depend on the application.
Hierarchical Clustering
In this section...
“Introduction” on page 11-3
“Algorithm Description” on page 11-3
“Similarity Measures” on page 11-4
“Linkages” on page 11-6
“Dendrograms” on page 11-8
“Verifying the Cluster Tree” on page 11-10
“Creating Clusters” on page 11-16
Introduction
Hierarchical clustering groups data over a variety of scales by creating a
cluster tree or dendrogram. The tree is not a single set of clusters, but rather
a multilevel hierarchy, where clusters at one level are joined as clusters at
the next level. This allows you to decide the level or scale of clustering that
is most appropriate for your application. The Statistics Toolbox function
clusterdata supports agglomerative clustering and performs all of the
necessary steps for you. It incorporates the pdist, linkage, and cluster
functions, which you can use separately for more detailed analysis. The
dendrogram function plots the cluster tree.
Algorithm Description
To perform agglomerative hierarchical cluster analysis on a data set using
Statistics Toolbox functions, follow this procedure:
11-3
11 Cluster Analysis
The following sections provide more information about each of these steps.
Similarity Measures
You use the pdist function to calculate the distance between every pair of
objects in a data set. For a data set made up of m objects, there are m*(m –
1)/2 pairs in the data set. The result of this computation is commonly known
as a distance or dissimilarity matrix.
There are many ways to calculate this distance information. By default, the
pdist function calculates the Euclidean distance between objects; however,
you can specify one of several other options. See pdist for more information.
Note You can optionally normalize the values in the data set before
calculating the distance information. In a real world data set, variables can
be measured against different scales. For example, one variable can measure
Intelligence Quotient (IQ) test scores and another variable can measure head
circumference. These discrepancies can distort the proximity calculations.
Using the zscore function, you can convert all the values in the data set to
use the same proportional scale. See zscore for more information.
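As a minimal sketch of that normalization, using a small hypothetical data matrix whose two columns are measured on very different scales:

% Standardize each column to zero mean and unit standard deviation
% before computing pairwise distances.
data = [100 0.40; 115 0.62; 98 0.55];   % hypothetical IQ scores and head sizes
datan = zscore(data);                   % columns now on a comparable scale
D0 = pdist(datan);                      % Euclidean distances on standardized data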
For example, consider a data set, X, made up of five objects where each object
is a set of x,y coordinates.
• Object 1: 1, 2
• Object 2: 2.5, 4.5
• Object 3: 2, 2
• Object 4: 4, 1.5
• Object 5: 4, 2.5
and pass it to pdist. The pdist function calculates the distance between
object 1 and object 2, object 1 and object 3, and so on until the distances
between all the pairs have been calculated. The following figure plots these
objects in a graph. The Euclidean distance between object 2 and object 3 is
shown to illustrate one interpretation of distance.
Distance Information
The pdist function returns this distance information in a vector, Y, where
each element contains the distance between a pair of objects.
Y = pdist(X)
Y =
Columns 1 through 5
2.9155 1.0000 3.0414 3.0414 2.5495
Columns 6 through 10
3.3541 2.5000 2.0616 2.0616 1.0000
To see the distance information in a more readable form, reformat the vector into a square, symmetric matrix using the squareform function:

squareform(Y)
ans =
0 2.9155 1.0000 3.0414 3.0414
2.9155 0 2.5495 3.3541 2.5000
1.0000 2.5495 0 2.0616 2.0616
3.0414 3.3541 2.0616 0 1.0000
3.0414 2.5000 2.0616 1.0000 0
Linkages
Once the proximity between objects in the data set has been computed, you
can determine how objects in the data set should be grouped into clusters,
using the linkage function. The linkage function takes the distance
information generated by pdist and links pairs of objects that are close
together into binary clusters (clusters made up of two objects). The linkage
function then links these newly formed clusters to each other and to other
objects to create bigger clusters until all the objects in the original data set
are linked together in a hierarchical tree.
For example, given the distance vector Y generated by pdist from the sample
data set of x- and y-coordinates, the linkage function generates a hierarchical
cluster tree, returning the linkage information in a matrix, Z.
Z = linkage(Y)
Z =
4.0000 5.0000 1.0000
1.0000 3.0000 1.0000
6.0000 7.0000 2.0616
2.0000 8.0000 2.5000
In this output, each row identifies a link between objects or clusters. The first
two columns identify the objects that have been linked. The third column
contains the distance between these objects. For the sample data set of x-
and y-coordinates, the linkage function begins by grouping objects 4 and 5,
which have the closest proximity (distance value = 1.0000). The linkage
function continues by grouping objects 1 and 3, which also have a distance
value of 1.0000.
The third row indicates that the linkage function grouped objects 6 and 7. If
the original sample data set contained only five objects, what are objects 6
and 7? Object 6 is the newly formed binary cluster created by the grouping
of objects 4 and 5. When the linkage function groups two objects into a
new cluster, it must assign the cluster a unique index value, starting with
the value m+1, where m is the number of objects in the original data set.
(Values 1 through m are already used by the original data set.) Similarly,
object 7 is the cluster formed by grouping objects 1 and 3.
As the final cluster, the linkage function grouped object 8, the newly formed
cluster made up of objects 6 and 7, with object 2 from the original data set.
The following figure graphically illustrates the way linkage groups the
objects into a hierarchy of clusters.
Dendrograms
The hierarchical, binary cluster tree created by the linkage function is most
easily understood when viewed graphically. The Statistics Toolbox function
dendrogram plots the tree, as follows:
dendrogram(Z)
(The dendrogram plots the objects 4, 5, 1, 3, and 2 along the horizontal axis, with link height on the vertical axis.)
In the figure, the numbers along the horizontal axis represent the indices of
the objects in the original data set. The links between objects are represented
as upside-down U-shaped lines. The height of the U indicates the distance
between the objects. For example, the link representing the cluster containing
objects 1 and 3 has a height of 1. The link representing the cluster that groups
object 2 together with objects 1, 3, 4, and 5, (which are already clustered as
object 8) has a height of 2.5. The height represents the distance linkage
computes between objects 2 and 8. For more information about creating a
dendrogram diagram, see the dendrogram reference page.
Verifying Dissimilarity
In a hierarchical cluster tree, any two objects in the original data set are
eventually linked together at some level. The height of the link represents
the distance between the two clusters that contain those two objects. This
height is known as the cophenetic distance between the two objects. One
way to measure how well the cluster tree generated by the linkage function
reflects your data is to compare the cophenetic distances with the original
distance data generated by the pdist function. If the clustering is valid, the
linking of objects in the cluster tree should have a strong correlation with
the distances between objects in the distance vector. The cophenet function
compares these two sets of values and computes their correlation, returning a
value called the cophenetic correlation coefficient. The closer the value of the
cophenetic correlation coefficient is to 1, the more accurately the clustering
solution reflects your data.
You can use the cophenetic correlation coefficient to compare the results of
clustering the same data set using different distance calculation methods or
clustering algorithms. For example, you can use the cophenet function to
evaluate the clusters created for the sample data set
c = cophenet(Z,Y)
c =
0.8615
where Z is the matrix output by the linkage function and Y is the distance
vector output by the pdist function.
Execute pdist again on the same data set, this time specifying the city block
metric. After running the linkage function on this new pdist output using
the average linkage method, call cophenet to evaluate the clustering solution.
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
c = cophenet(Z,Y)
c =
0.9047
Verifying Consistency
One way to determine the natural cluster divisions in a data set is to compare
the height of each link in a cluster tree with the heights of neighboring links
below it in the tree.
A link that is approximately the same height as the links below it indicates
that there are no distinct divisions between the objects joined at this level of
the hierarchy. These links are said to exhibit a high level of consistency,
because the distance between the objects being joined is approximately the
same as the distances between the objects they contain.
On the other hand, a link whose height differs noticeably from the height of
the links below it indicates that the objects joined at this level in the cluster
tree are much farther apart from each other than their components were when
they were joined. This link is said to be inconsistent with the links below it.
The following dendrogram illustrates inconsistent links. Note how the objects
in the dendrogram fall into two groups that are connected by links at a much
higher level in the tree. These links are inconsistent when compared with the
links below them in the hierarchy.
The inconsistent function quantifies this comparison by computing an inconsistency coefficient for each link. By default, the inconsistent function compares each link in the cluster hierarchy with adjacent links that
are less than two levels below it in the cluster hierarchy. This is called the
depth of the comparison. You can also specify other depths. The objects at
the bottom of the cluster tree, called leaf nodes, that have no further objects
below them, have an inconsistency coefficient of zero. Clusters that join two
leaves also have a zero inconsistency coefficient.
For example, you can use the inconsistent function to calculate the
inconsistency values for the links created by the linkage function in
“Linkages” on page 11-6.
I = inconsistent(Z)
I =
1.0000 0 1.0000 0
1.0000 0 1.0000 0
1.3539 0.6129 3.0000 1.1547
2.2808 0.3100 2.0000 0.7071
Column Description
1 Mean of the heights of all the links included in the calculation
2 Standard deviation of all the links included in the calculation
3 Number of links included in the calculation
4 Inconsistency coefficient
In the sample output, the first row represents the link between objects 4
and 5. This cluster is assigned the index 6 by the linkage function. Because
both 4 and 5 are leaf nodes, the inconsistency coefficient for the cluster is zero.
The second row represents the link between objects 1 and 3, both of which are
also leaf nodes. This cluster is assigned the index 7 by the linkage function.
The third row evaluates the link that connects these two clusters, objects 6
and 7. (This new cluster is assigned index 8 in the linkage output). Column 3
indicates that three links are considered in the calculation: the link itself and
the two links directly below it in the hierarchy. Column 1 represents the mean
of the heights of these links. The inconsistent function uses the height information output by the linkage function to compute these statistics.
The following figure illustrates the links and heights included in this
calculation.
Note In the preceding figure, the lower limit on the y-axis is set to 0 to show
the heights of the links. To set the lower limit to 0, select Axes Properties
from the Edit menu, click the Y Axis tab, and enter 0 in the field immediately
to the right of Y Limits.
Row 4 in the output matrix describes the link between object 8 and object 2.
Column 3 indicates that two links are included in this calculation: the link
itself and the link directly below it in the hierarchy. The inconsistency
coefficient for this link is 0.7071.
The following figure illustrates the links and heights included in this
calculation.
Creating Clusters
After you create the hierarchical tree of binary clusters, you can prune the
tree to partition your data into clusters using the cluster function. The
cluster function lets you create clusters in two ways, as discussed in the
following sections: by finding the natural divisions in the data, or by specifying an arbitrary number of clusters.
For example, if you use the cluster function to group the sample data set
into clusters, specifying an inconsistency coefficient threshold of 1.2 as the
value of the cutoff argument, the cluster function groups all the objects
in the sample data set into one cluster. In this case, none of the links in the
cluster hierarchy had an inconsistency coefficient greater than 1.2.
T = cluster(Z,'cutoff',1.2)
T =
1
1
1
1
1
The cluster function outputs a vector, T, that is the same size as the original
data set. Each element in this vector contains the number of the cluster into
which the corresponding object from the original data set was placed.
T = cluster(Z,'cutoff',0.8)
T =
3
2
3
1
1
This output indicates that objects 4 and 5 were placed in cluster 1, object 2 was placed in cluster 2, and objects 1 and 3 were placed in cluster 3.
When clusters are formed in this way, the cutoff value is applied to the
inconsistency coefficient. These clusters may, but do not necessarily,
correspond to a horizontal slice across the dendrogram at a certain height.
If you want clusters corresponding to a horizontal slice of the dendrogram,
you can either use the criterion option to specify that the cutoff should be
based on distance rather than inconsistency, or you can specify the number of
clusters directly as described in the following section.
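For instance, a minimal sketch of a distance-based cutoff, using the Z computed earlier (the cutoff value 1.5 is an arbitrary choice for illustration):

% Cut the tree at a height of 1.5, so the clusters correspond to a
% horizontal slice of the dendrogram at that distance.
T = cluster(Z,'cutoff',1.5,'criterion','distance')

Because only the two links of height 1.0000 fall below this cutoff, the result groups objects 4 and 5 together, objects 1 and 3 together, and leaves object 2 in a cluster by itself.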
For example, you can specify that you want the cluster function to partition
the sample data set into two clusters. In this case, the cluster function
creates one cluster containing objects 1, 3, 4, and 5 and another cluster
containing object 2.
T = cluster(Z,'maxclust',2)
T =
2
1
2
2
2
To help you visualize how the cluster function determines these clusters, the
following figure shows the dendrogram of the hierarchical cluster tree. The
horizontal dashed line intersects two lines of the dendrogram, corresponding
to setting 'maxclust' to 2. These two lines partition the objects into two
clusters: the objects below the left-hand line, namely 1, 3, 4, and 5, belong to
one cluster, while the object below the right-hand line, namely 2, belongs to
the other cluster.
(Dendrogram showing the horizontal cut that produces two clusters, 'maxclust' = 2.)
On the other hand, if you set 'maxclust' to 3, the cluster function groups
objects 4 and 5 in one cluster, objects 1 and 3 in a second cluster, and object 2
in a third cluster. The following command illustrates this.
T = cluster(Z,'maxclust',3)
T =
1
3
1
2
2
This time, the cluster function cuts off the hierarchy at a lower point,
corresponding to the horizontal line that intersects three lines of the
dendrogram in the following figure.
(Dendrogram showing the horizontal cut that produces three clusters, 'maxclust' = 3.)
K-Means Clustering
In this section...
“Introduction” on page 11-21
“Creating Clusters and Determining Separation” on page 11-22
“Determining the Correct Number of Clusters” on page 11-23
“Avoiding Local Minima” on page 11-26
Introduction
K-means clustering is a partitioning method. The function kmeans partitions
data into k mutually exclusive clusters, and returns the index of the cluster
to which it has assigned each observation. Unlike hierarchical clustering,
k-means clustering operates on actual observations (rather than the larger
set of dissimilarity measures), and creates a single level of clusters. The
distinctions mean that k-means clustering is often more suitable than
hierarchical clustering for large amounts of data.
Each cluster in the partition is defined by its member objects and by its
centroid, or center. The centroid for each cluster is the point to which the sum
of distances from all objects in that cluster is minimized. kmeans computes
cluster centroids differently for each distance measure, to minimize the sum
with respect to the measure that you specify.
kmeans uses an iterative algorithm that minimizes the sum of distances from
each object to its cluster centroid, over all clusters. This algorithm moves
objects between clusters until the sum cannot be decreased further. The
result is a set of clusters that are as compact and well-separated as possible.
You can control the details of the minimization using several optional input
parameters to kmeans, including ones for the initial values of the cluster
centroids, and for the maximum number of iterations.
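As a minimal sketch of those options, using hypothetical data (the starting centroids and iteration limit are arbitrary illustrations):

% Supply explicit starting centroids and raise the iteration limit.
X0 = rand(100,4);                          % hypothetical data, four variables
C0 = X0(1:3,:);                            % one starting centroid per cluster (k = 3)
opts = statset('MaxIter',200);
idx = kmeans(X0,3,'start',C0,'options',opts);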
Creating Clusters and Determining Separation
load kmeansdata;
size(X)
ans =
560 4
Even though these data are four-dimensional, and cannot be easily visualized,
kmeans enables you to investigate whether a group structure exists in them.
Call kmeans with k, the desired number of clusters, equal to 3. For this
example, specify the city block distance measure, and use the default starting
method of initializing centroids from randomly selected data points:
idx3 = kmeans(X,3,'distance','city');
To get an idea of how well-separated the resulting clusters are, you can make
a silhouette plot using the cluster indices output from kmeans. The silhouette
plot displays a measure of how close each point in one cluster is to points in
the neighboring clusters. This measure ranges from +1, indicating points that
are very distant from neighboring clusters, through 0, indicating points that
are not distinctly in one cluster or another, to -1, indicating points that are
probably assigned to the wrong cluster. silhouette returns these values in
its first output:
[silh3,h] = silhouette(X,idx3,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
From the silhouette plot, you can see that most points in the third cluster
have a large silhouette value, greater than 0.6, indicating that the cluster is
somewhat separated from neighboring clusters. However, the first cluster
contains many points with low silhouette values, and the second contains a
few points with negative values, indicating that those two clusters are not
well separated.
Determining the Correct Number of Clusters
Increase the number of clusters to see whether kmeans can find a better grouping of the data. This time, use the optional 'display' parameter to print information about each iteration:

idx4 = kmeans(X,4,'dist','city','display','iter');
iter phase num sum
2 1 53 2736.67
3 1 50 2476.78
4 1 102 1779.68
5 1 5 1771.1
6 2 0 1771.1
6 iterations, total sum of distances = 1771.1
Notice that the total sum of distances decreases at each iteration as kmeans
reassigns points between clusters and recomputes cluster centroids. In this
case, the second phase of the algorithm did not make any reassignments,
indicating that the first phase reached a minimum after five iterations. In
some problems, the first phase might not reach a minimum, but the second
phase always will.
A silhouette plot for this solution indicates that these four clusters are better
separated than the three in the previous solution:
[silh4,h] = silhouette(X,idx4,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
A more quantitative way to compare the two solutions is to look at the average
silhouette values for the two cases:
mean(silh3)
ans =
0.52594
mean(silh4)
ans =
0.63997
Try five clusters:

idx5 = kmeans(X,5,'dist','city','replicates',5);
[silh5,h] = silhouette(X,idx5,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
mean(silh5)
ans =
0.52657
This silhouette plot indicates that this is probably not the right number of
clusters, since two of the clusters contain points with mostly low silhouette
values. Without some knowledge of how many clusters are really in the data,
it is a good idea to experiment with a range of values for k.
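One way to do that, sketched below for this data set (the range 2:6 is an arbitrary choice):

% Compare the average silhouette value over a range of cluster counts.
kvals = 2:6;
avgsil = zeros(size(kvals));
for i = 1:numel(kvals)
    idx = kmeans(X,kvals(i),'distance','city','replicates',5);
    s = silhouette(X,idx,'city');
    avgsil(i) = mean(s);
end
[~,best] = max(avgsil);
fprintf('Best k by average silhouette: %d\n',kvals(best))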
Avoiding Local Minima
Like many other types of numerical minimizations, the solution that kmeans reaches often depends on the starting points. It is possible for kmeans to reach a local minimum, where reassigning any one point to a new cluster would increase the total sum of point-to-centroid distances, but where a better solution does exist. However, you can use the optional 'replicates'
parameter to overcome that problem.
For four clusters, specify five replicates, and use the 'display' parameter to
print out the final sum of distances for each of the solutions.
[idx4,cent4,sumdist] = kmeans(X,4,'dist','city',...
'display','final','replicates',5);
17 iterations, total sum of distances = 2303.36
5 iterations, total sum of distances = 1771.1
6 iterations, total sum of distances = 1771.1
5 iterations, total sum of distances = 1771.1
8 iterations, total sum of distances = 2303.36
The output shows that, even for this relatively simple problem, non-global
minima do exist. Each of these five replicates began from a different randomly
selected set of initial centroids, and kmeans found two different local minima.
However, the final solution that kmeans returns is the one with the lowest
total sum of distances, over all replicates.
sum(sumdist)
ans =
1771.1
Gaussian Mixture Models
Introduction
Gaussian mixture models are formed by combining multivariate normal
density components. For information on individual multivariate normal
densities, see “Multivariate Normal Distribution” on page B-58 and related
distribution functions listed under “Multivariate Distributions” on page 5-8.
Gaussian mixture models are often used for data clustering. Clusters are
assigned by selecting the component that maximizes the posterior probability.
Like k-means clustering, Gaussian mixture modeling uses an iterative
algorithm that converges to a local optimum. Gaussian mixture modeling may
be more appropriate than k-means clustering when clusters have different
sizes and correlation within them. Clustering using Gaussian mixture models
is sometimes considered a soft clustering method: the posterior probabilities indicate the degree to which each data point belongs to each cluster.
1 Generate data from a mixture of two bivariate Gaussian distributions using the mvnrnd function:
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X = [mvnrnd(mu1,sigma1,200);mvnrnd(mu2,sigma2,100)];
scatter(X(:,1),X(:,2),10,'ko')
2 Fit a two-component Gaussian mixture model:
options = statset('Display','final');
gm = gmdistribution.fit(X,2,'Options',options);
This displays the number of iterations and the final log-likelihood of the fit.
3 Plot the estimated probability density contours for the two-component mixture distribution:
hold on
ezcontour(@(x,y)pdf(gm,[x y]),[-8 6],[-8 6]);
hold off
4 Partition the data into clusters using the cluster method for the fitted
mixture distribution. The cluster method assigns each point to one of the
two components in the mixture distribution.
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
For example, plot the posterior probability of the first component for each
point:
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80); colormap(clrmap(9:72,:))
You can also plot the membership scores of every point, ordered by their score for the first component:
[~,order] = sort(P(:,1));
plot(1:size(X,1),P(order,1),'r-',1:size(X,1),P(order,2),'b-');
legend({'Cluster 1 Score' 'Cluster 2 Score'},'location','NW');
ylabel('Cluster Membership Score');
xlabel('Point Ranking');
Although a clear separation of the data is hard to see in a scatter plot of the
data, plotting the membership scores indicates that the fitted distribution
does a good job of separating the data into groups. Very few points have
scores close to 0.5.
You can also fit a mixture model in which the two components share a single diagonal covariance matrix:
gm2 = gmdistribution.fit(X,2,'CovType','Diagonal',...
'SharedCov',true);
You can compute the soft cluster membership scores without computing hard cluster assignments by using the posterior method, or you can obtain them as an additional output from cluster.
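For instance, a minimal sketch using the posterior method with the shared-covariance fit gm2 from above (the same call works with the original fit gm):

% Soft membership scores: one row per observation, one column per
% component; each row sums to 1.
P2 = posterior(gm2,X);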
1 Given a data set X, first fit a Gaussian mixture distribution. The previous
code has already done that.
gm
gm =
Gaussian mixture distribution with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.312592
Mean: -0.9082 -2.1109
Component 2:
Mixing proportion: 0.687408
Mean: 0.9532 1.8940
2 You can then use cluster to assign each point in a new data set, Y, to one
of the clusters defined for the original data:
Y = [mvnrnd(mu1,sigma1,50);mvnrnd(mu2,sigma2,25)];
idx = cluster(gm,Y);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(Y(cluster1,1),Y(cluster1,2),10,'r+');
hold on
scatter(Y(cluster2,1),Y(cluster2,2),10,'bo');
hold off
legend('Class 1','Class 2','Location','NW')
As with the previous example, the posterior probabilities for each point can
be treated as membership scores rather than determining "hard" cluster
assignments.
For cluster to provide meaningful results with new data, Y should come
from the same population as X, the original data used to create the mixture
distribution. In particular, the estimated mixing probabilities for the
Gaussian mixture distribution fitted to X are used when computing the
posterior probabilities for Y.
Parametric Classification
Introduction
Models of data with a categorical response are called classifiers. A classifier is
built from training data, for which classifications are known. The classifier
assigns new test data to one of the categorical levels of the response.
Discriminant Analysis
In this section...
“Introduction” on page 12-3
“Example: Discriminant Analysis” on page 12-3
Introduction
Discriminant analysis uses training data to estimate the parameters of
discriminant functions of the predictor variables. Discriminant functions
determine boundaries in predictor space between various classes. The
resulting classifier discriminates among the classes (the categorical levels of
the response) based on the predictor data.
load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
h1 = gscatter(SL,SW,group,'rb','v^',[],'off');
set(h1,'LineWidth',2)
legend('Fisher versicolor','Fisher virginica',...
'Location','NW')
[X,Y] = meshgrid(linspace(4.5,8),linspace(2,4));
X = X(:); Y = Y(:);
[C,err,P,logp,coeff] = classify([X Y],[SL SW],...
group,'quadratic');
hold on;
gscatter(X,Y,C,'rb','.',1,'off');
K = coeff(1,2).const;
L = coeff(1,2).linear;
Q = coeff(1,2).quadratic;
% Plot the curve K + [x,y]*L + [x,y]*Q*[x,y]' = 0:
f = @(x,y) K + L(1)*x + L(2)*y + Q(1,1)*x.^2 + ...
(Q(1,2)+Q(2,1))*x.*y + Q(2,2)*y.^2
h2 = ezplot(f,[4.5 8 2 4]);
set(h2,'Color','m','LineWidth',2)
axis([4.5 8 2 4])
xlabel('Sepal Length')
ylabel('Sepal Width')
title('{\bf Classification with Fisher Training Data}')
Naive Bayes Classification
The Naive Bayes classifier classifies data in two steps:
1 Training step: Using the training samples, the method estimates the parameters of a probability distribution, assuming that features are conditionally independent given the class.
2 Prediction step: For any unseen test sample, the method computes the
posterior probability of that sample belonging to each class. The method
then classifies the test sample according to the largest posterior probability.
Supported Distributions
Naive Bayes classification is based on estimating P(X|Y), the probability or
probability density of features X given class Y. The Naive Bayes classification
object NaiveBayes provides support for normal (Gaussian), kernel,
multinomial, and multivariate multinomial distributions. It is possible to use
different distributions for different features.
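As a minimal sketch of mixing distributions across features, using hypothetical data and the 'Distribution' parameter of NaiveBayes.fit:

% Model the first feature with a normal distribution and the second
% with a kernel density estimate, one fit per class.
X0 = [randn(20,1) rand(20,1); randn(20,1)+2 rand(20,1)+1];  % hypothetical features
Y0 = [repmat({'a'},20,1); repmat({'b'},20,1)];              % class labels
nb = NaiveBayes.fit(X0,Y0,'Distribution',{'normal','kernel'});
pred = predict(nb,X0(1:5,:));                               % classify a few rows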
Normal (Gaussian) Distribution
The 'normal' distribution is appropriate for features that have normal distributions in each class. For each feature you model with a normal distribution, the Naive Bayes classifier estimates a separate normal
distribution for each class by computing the mean and standard deviation of
the training data in that class. For more information on normal distributions,
see “Normal Distribution” on page B-83.
Kernel Distribution
The 'kernel' distribution is appropriate for features that have a continuous
distribution. It does not require a strong assumption such as a normal
distribution and you can use it in cases where the distribution of a feature may
be skewed or have multiple peaks or modes. It requires more computing time
and more memory than the normal distribution. For each feature you model
with a kernel distribution, the Naive Bayes classifier computes a separate
kernel density estimate for each class based on the training data for that class.
By default the kernel is the normal kernel, and the classifier selects a width
automatically for each class and feature. It is possible to specify different
kernels for each feature, and different widths for each feature or class.
Multinomial Distribution
The multinomial distribution (specify with the 'mn' keyword) is appropriate
when all features represent counts of a set of words or tokens. This is
sometimes called the "bag of words" model. For example, an e-mail spam
classifier might be based on features that count the number of occurrences
of various tokens in an e-mail. One feature might count the number of
exclamation points, another might count the number of times the word
"money" appears, and another might count the number of times the recipient’s
name appears. This is a Naive Bayes model under the further assumption
that the total number of tokens (or the total document length) is independent
of response class.
For the multinomial option, each feature represents the count of one token.
The classifier counts the set of relative token probabilities separately for
each class. The classifier defines the multinomial distribution for each row
by the vector of probabilities for the corresponding class, and by N, the total
token count for that row.
Multivariate Multinomial Distribution
The multivariate multinomial distribution (specify with the 'mvmn' keyword) is appropriate for categorical features.
For each feature you model with a multivariate multinomial distribution, the
Naive Bayes classifier computes a separate set of probabilities for the set of
feature levels for each class.
Performance Curves
In this section...
“Introduction” on page 12-9
“What are ROC Curves?” on page 12-9
“Evaluating Classifier Performance Using perfcurve” on page 12-9
Introduction
After a classification algorithm such as NaiveBayes or TreeBagger has
trained on data, you may want to examine the performance of the algorithm
on a specific test dataset. One common way of doing this would be to compute a gross measure of performance, such as quadratic loss or accuracy, averaged over the entire test dataset.
You can use perfcurve with any classifier or, more broadly, with any method
that returns a numeric score for an instance of input data. By convention
adopted here,
• A high score returned by a classifier for any given instance signifies that
the instance is likely from the positive class.
• A low score signifies that the instance is likely from the negative classes.
For some classifiers, you can interpret the score as the posterior probability
of observing an instance of the positive class at point X. An example of such
a score is the fraction of positive observations in a leaf of a decision tree. In
this case, scores fall into the range from 0 to 1 and scores from positive and
negative classes add up to unity. Other methods can return scores ranging
between minus and plus infinity, without any obvious mapping from the
score to the posterior class probability.
perfcurve does not impose any requirements on the input score range.
Because of this lack of normalization, you can use perfcurve to process scores
returned by any classification, regression, or fit method. perfcurve does
not make any assumptions about the nature of input scores or relationships
between the scores for different classes. As an example, consider a problem
with three classes, A, B, and C, and assume that the scores returned by some
classifier for two instances are as follows:
A B C
instance 1 0.4 0.5 0.1
instance 2 0.4 0.1 0.5
perfcurve is intended for use with classifiers that return scores, not those
that return only predicted classes. As a counter-example, consider a decision
tree that returns only hard classification labels, 0 or 1, for data with two
classes. In this case, the performance curve reduces to a single point because
classified instances can be split into positive and negative categories in one
way only.
For input, perfcurve takes true class labels for some data and scores assigned
by a classifier to these data. By default, this utility computes a Receiver
Operating Characteristic (ROC) curve and returns values of 1–specificity,
or false positive rate, for X and sensitivity, or true positive rate, for Y. You
can choose other criteria for X and Y by selecting one out of several provided
criteria or specifying an arbitrary criterion through an anonymous function.
You can display the computed performance curve using plot(X,Y).
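A minimal sketch of that workflow, using hypothetical labels and scores (any classifier that returns a numeric score would do):

% True labels and classifier scores for a two-class problem.
labels = [ones(10,1); zeros(10,1)];              % 1 marks the positive class
scores = [0.9 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 ...
          0.6 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1]';
[Xroc,Yroc,T,AUC] = perfcurve(labels,scores,1);  % ROC curve for class 1
plot(Xroc,Yroc)
xlabel('False positive rate')
ylabel('True positive rate')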
perfcurve can compute values for various criteria to plot either on the x- or
the y-axis. All such criteria are described by a 2-by-2 confusion matrix, a
2-by-2 cost matrix, and a 2-by-1 vector of scales applied to class counts.
$C = \begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix}$

where TP is the count of true positives, FN the count of false negatives, FP the count of false positives, and TN the count of true negatives.
For example, the first row of the confusion matrix defines how the classifier
identifies instances of the positive class: C(1,1) is the count of correctly
identified positive instances and C(1,2) is the count of positive instances
misidentified as negative.
The cost matrix defines the cost of misclassification for each category:
$\begin{pmatrix} \mathrm{Cost}(P|P) & \mathrm{Cost}(N|P) \\ \mathrm{Cost}(P|N) & \mathrm{Cost}(N|N) \end{pmatrix}$
where Cost(I|J) is the cost of assigning an instance of class J to class I.
Usually Cost(I|J)=0 for I=J. For flexibility, perfcurve allows you to specify
nonzero costs for correct classification as well.
For example, the positive predictive value (PPV) is defined in terms of these counts and scales as

$\mathrm{PPV} = \dfrac{\mathrm{scale}(P)\cdot TP}{\mathrm{scale}(P)\cdot TP + \mathrm{scale}(N)\cdot FP}$
If all scores in the data are above a certain threshold, perfcurve classifies all
instances as 'positive'. This means that TP is the total number of instances
in the positive class and FP is the total number of instances in the negative
class. In this case, PPV is simply given by the prior:
$\mathrm{PPV} = \dfrac{\mathrm{prior}(P)}{\mathrm{prior}(P) + \mathrm{prior}(N)}$
The perfcurve function returns two vectors, X and Y, of performance
measures. Each measure is some function of confusion, cost, and scale
values. You can request specific measures by name or provide a function
handle to compute a custom measure. The function you provide should take
confusion, cost, and scale as its three inputs and return a vector of output
values.
By default, perfcurve computes values of the X and Y criteria for all possible
score thresholds. Alternatively, it can compute a reduced number of specific X
values supplied as an input argument. In either case, for M requested values,
perfcurve computes M+1 values for X and Y. The first value out of these M+1
values is special. perfcurve computes it by setting the TP instance count
to zero and setting TN to the total count in the negative class. This value
corresponds to the 'reject all' threshold. On a standard ROC curve, this
translates into an extra point placed at (0,0).
If there are NaN values among input scores, perfcurve can process them
in either of two ways:
• It can discard rows with NaN scores.
• It can add them to the false classifications in the class of their true labels.
That is, for any threshold, instances with NaN scores from the positive class
are counted as false negative (FN), and instances with NaN scores from the
negative class are counted as false positive (FP). In this case, the first value
of X or Y is computed by setting TP to zero and setting TN to the total count
minus the NaN count in the negative class. For illustration, consider an
example with two rows in the positive and two rows in the negative class,
each pair having a NaN score:
Class Score
Negative 0.2
Negative NaN
Positive 0.7
Positive NaN
If you discard rows with NaN scores, then as the score cutoff varies, perfcurve
computes performance measures as in the following table. For example, a
cutoff of 0.5 corresponds to the middle row where rows 1 and 3 are classified
correctly, and rows 2 and 4 are omitted.
TP FN FP TN
0 1 0 1
1 0 0 1
1 0 1 0
If you add rows with NaN scores to the false category in their respective
classes, perfcurve computes performance measures as in the following table.
For example, a cutoff of 0.5 corresponds to the middle row where now rows
2 and 4 are counted as incorrectly classified. Notice that only the FN and FP
columns differ between these two tables.
TP FN FP TN
0 2 1 1
1 1 1 1
1 1 2 0
For data with three or more classes, perfcurve takes one positive class and a
list of negative classes for input. The function computes the X and Y values
using counts in the positive class to estimate TP and FN, and using counts in
all negative classes to estimate TN and FP. perfcurve can optionally compute
Y values for each negative class separately and, in addition to Y, return a
matrix of size M-by-C, where M is the number of elements in X or Y and C is
the number of negative classes. You can use this functionality to monitor
components of the negative class contribution. For example, you can plot TP
counts on the X-axis and FP counts on the Y-axis. In this case, the returned
matrix shows how the FP component is split across negative classes.
Supervised Learning
Supervised Learning (Machine Learning) Workflow and Algorithms
(Workflow diagram: known data and known responses are used to build a model; the model is then applied to new data to produce predicted responses.)
For example, suppose you want to predict if someone will have a heart attack
within a year. You have a set of data on previous people, including their
ages, weight, height, blood pressure, etc. You know if the previous people had
heart attacks within a year of their data measurements. So the problem is
combining all the existing data into a model that can predict whether a new
person will have a heart attack within a year.
Supervised learning splits into two broad categories:
• Classification for responses that can have just a few known values, such
as 'true' or 'false'. Classification algorithms apply to nominal, not
ordinal response values.
• Regression for responses that are a real number, such as miles per gallon
for a particular car.
You can have trouble deciding whether you have a classification problem or a
regression problem. In that case, create a regression model first—regression
models are often more computationally efficient.
While there are many Statistics Toolbox algorithms for supervised learning,
most use the same basic workflow for obtaining a predictor model:
Prepare Data
All supervised learning methods start with an input data matrix, usually
called X in this documentation. Each row of X represents one observation.
Each column of X represents one variable, or predictor. Represent missing
entries with NaN values in X. Statistics Toolbox supervised learning algorithms
can handle NaN values, either by ignoring them or by ignoring any row with
a NaN value.
You can use various data types for response data Y. Each element in Y
represents the response to the corresponding row of X. Observations with
missing Y data are ignored.
Choose an Algorithm
There are tradeoffs between several characteristics of algorithms, such as:
• Speed of training
• Memory utilization
• Predictive accuracy on new data
• Transparency or interpretability, meaning how easily you can understand
the reasons an algorithm makes its predictions
Fit a Model
The fitting function you use depends on the algorithm you choose.
When you are satisfied with the model, you can trim it using the appropriate
compact method (compact for classification trees, compact for classification
ensembles, compact for regression trees, compact for regression ensembles).
compact removes training data and pruning information, so the model uses
less memory.
To predict responses for new data with a trained model, use the predict method:

Ypredicted = predict(obj,Xnew)
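For instance, a minimal sketch of the trim-then-predict pattern, using the Fisher iris classification tree that also appears later in this chapter:

% Fit a classification tree, discard its training data, and predict.
load fisheriris
ctree = ClassificationTree.fit(meas,species);
cctree = compact(ctree);                 % smaller object, same predictions
Ypredicted = predict(cctree,meas(1:5,:));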
Characteristics of Algorithms
This table shows typical characteristics of the various supervised learning
algorithms. The characteristics in any particular case can vary from the listed
ones. Use the table as a guide for your initial choice of algorithms, but be
aware that the table can be inaccurate for some problems. SVM is available if
you have a Bioinformatics Toolbox™ license.
* — SVM prediction speed and memory usage are good if there are few
support vectors, but can be poor if there are many support vectors. When you
use a kernel function, it can be difficult to interpret how SVM classifies data,
though the default linear scheme is easy to interpret.
** — Naive Bayes speed and memory usage are good for simple distributions,
but can be poor for kernel distributions and large data sets.
*** — Nearest Neighbor usually has good predictions in low dimensions, but
can have poor predictions in high dimensions. For linear search, Nearest
Neighbor does not perform any fitting. For kd-trees, Nearest Neighbor does
perform fitting. Nearest Neighbor can have either continuous or categorical
predictors, but not both.
Classification Using Nearest Neighbors
Pairwise Distance
Categorizing query points based on their distance to points in a training
dataset can be a simple yet effective way of classifying new points. You can
use various metrics to determine the distance, described next. Use pdist2 to
find the distance between a set of data points and a set of query points.
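For instance, a minimal sketch with a hypothetical data set and two query points:

% D(i,j) is the distance between the ith data point and the jth query
% point; here D is 4-by-2.
Xtrain = [1 2; 2 4; 3 1; 5 5];   % hypothetical data set
Q = [2 2; 4 4];                  % hypothetical query points
D = pdist2(Xtrain,Q,'euclidean')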
Distance Metrics
Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors
x1, x2, ..., xmx, and my-by-n data matrix Y, which is treated as my (1-by-n)
row vectors y1, y2, ...,ymy, the various distances between the vector xs and yt
are defined as follows:
• Euclidean distance

$d_{st}^2 = (x_s - y_t)(x_s - y_t)'$

• Standardized Euclidean distance

$d_{st}^2 = (x_s - y_t)V^{-1}(x_s - y_t)'$

where V is the n-by-n diagonal matrix whose jth diagonal element is $S(j)^2$, where S is the vector containing the inverse weights.
• Mahalanobis distance

$d_{st}^2 = (x_s - y_t)C^{-1}(x_s - y_t)'$

where C is the covariance matrix.
• City block metric

$d_{st} = \sum_{j=1}^{n} \lvert x_{sj} - y_{tj} \rvert$

The city block distance is a special case of the Minkowski metric, where p = 1.
• Minkowski metric
$d_{st} = \sqrt[p]{\sum_{j=1}^{n} \lvert x_{sj} - y_{tj} \rvert^{p}}$
For the special case of p = 1, the Minkowski metric gives the city block
metric, for the special case of p = 2, the Minkowski metric gives the
Euclidean distance, and for the special case of p = ∞, the Minkowski metric
gives the Chebychev distance.
• Chebychev distance
$d_{st} = \max_{j}\bigl\{\lvert x_{sj} - y_{tj} \rvert\bigr\}$
The Chebychev distance is a special case of the Minkowski metric, where p
= ∞.
• Cosine distance
$d_{st} = 1 - \dfrac{x_s y_t'}{\sqrt{(x_s x_s')(y_t y_t')}}$
• Correlation distance
$d_{st} = 1 - \dfrac{(x_s - \bar{x}_s)(y_t - \bar{y}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'}\,\sqrt{(y_t - \bar{y}_t)(y_t - \bar{y}_t)'}}$
where $\bar{x}_s = \frac{1}{n}\sum_{j} x_{sj}$ and $\bar{y}_t = \frac{1}{n}\sum_{j} y_{tj}$.
• Hamming distance

$d_{st} = \bigl(\#(x_{sj} \ne y_{tj})\bigr)/n$
• Jaccard distance
$d_{st} = \dfrac{\#\bigl[(x_{sj} \ne y_{tj}) \cap \bigl((x_{sj} \ne 0) \cup (y_{tj} \ne 0)\bigr)\bigr]}{\#\bigl[(x_{sj} \ne 0) \cup (y_{tj} \ne 0)\bigr]}$
• Spearman distance
$d_{st} = 1 - \dfrac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'}\,\sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}}$
where
- rsj is the rank of xsj taken over x1j, x2j, ...xmx,j, as computed by tiedrank.
- rtj is the rank of ytj taken over y1j, y2j, ...ymy,j, as computed by tiedrank.
- rs and rt are the coordinate-wise rank vectors of xs and yt, i.e., rs = (rs1,
rs2, ... rsn) and rt = (rt1, rt2, ... rtn).
- $\bar{r}_s = \frac{1}{n}\sum_{j} r_{sj} = \frac{n+1}{2}$
- $\bar{r}_t = \frac{1}{n}\sum_{j} r_{tj} = \frac{n+1}{2}$
knnsearch also uses the exhaustive search method if your search object is
an ExhaustiveSearcher object. The exhaustive search method finds the
distance from each query point to every point in X, ranks them in ascending
order, and returns the k points with the smallest distances. For example, this
diagram shows the k = 3 nearest neighbors.
knnsearch uses a kd-tree to find nearest neighbors when the distance metric is one of the following:
- 'euclidean'
- 'cityblock'
- 'minkowski'
- 'chebychev'
kd-trees divide your data into nodes with at most BucketSize (default is
50) points per node, based on coordinates (as opposed to categories). The
following diagrams illustrate this concept using patch objects to color code
the different “buckets.”
When you want to find the k-nearest neighbors to a given query point,
knnsearch does the following:
1 Determines the node to which the query point belongs. In the following
example, the query point (32,90) belongs to Node 4.
2 Finds the closest k points within that node and its distance to the query
point. In the following example, the points in red circles are equidistant
from the query point, and are the closest points to the query point within
Node 4.
3 Chooses all other nodes having any area that is within the same distance,
in any direction, from the query point to the kth closest point. In this
example, only Node 3 overlaps the solid black circle centered at the query
point with radius equal to the distance to the closest points within Node 4.
4 Searches nodes within that range for any points closer to the query point.
In the following example, the point in a red square is slightly closer to the
query point than those within Node 4.
Using a kd-tree for large datasets with fewer than 10 dimensions (columns)
can be much more efficient than using the exhaustive search method, as
knnsearch needs to calculate only a subset of the distances. To maximize the
efficiency of kd-trees, use a KDTreeSearcher object.
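A minimal sketch of that workflow, using hypothetical data (the object is built once and can then be queried repeatedly):

% Build the kd-tree once, then query it for nearest neighbors.
Xtrain = rand(1000,3);                 % hypothetical three-column data set
kdt = KDTreeSearcher(Xtrain);          % default Euclidean distance
Q = rand(5,3);                         % hypothetical query points
[idx,dist] = knnsearch(kdt,Q,'K',3);   % 3 nearest neighbors of each query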
All search objects have a knnsearch method specific to that class. This lets
you efficiently perform a k-nearest neighbors search on your object for that
specific object type. In addition, there is a generic knnsearch function that
searches without creating or using an object.
To determine which type of object and search method is best for your data,
consider the following:
• Does your data have many columns, say more than 10? The
ExhaustiveSearcher object may perform better.
• Is your data sparse? Use the ExhaustiveSearcher object.
• Do you want to use one of these distance measures to find the nearest
neighbors? Use the ExhaustiveSearcher object.
- 'seuclidean'
- 'mahalanobis'
- 'cosine'
- 'correlation'
- 'spearman'
- 'hamming'
- 'jaccard'
- A custom distance function
• Is your dataset huge (but with fewer than 10 columns)? Use the
KDTreeSearcher object.
• Are you searching for the nearest neighbors for a large number of query
points? Use the KDTreeSearcher object.
1 Classify a new point based on the last two columns of the Fisher iris data.
Using only the last two columns makes it easier to plot:
load fisheriris
x = meas(:,3:4);
gscatter(x(:,1),x(:,2),species)
set(legend,'location','best')
newpoint = [5 1.45];
line(newpoint(1),newpoint(2),'marker','x','color','k',...
'markersize',10,'linewidth',2)
2 Find the 10 sample points closest to the new point, and plot them:
[n,d] = knnsearch(x,newpoint,'k',10)
line(x(n,1),x(n,2),'color',[.5 .5 .5],'marker','o',...
'linestyle','none','markersize',10)
4 It appears that knnsearch has found only the nearest eight neighbors. In
fact, this particular dataset contains duplicate values:
x(n,:)
ans =
5.0000 1.5000
4.9000 1.5000
4.9000 1.5000
5.1000 1.5000
5.1000 1.6000
4.8000 1.4000
5.0000 1.7000
4.7000 1.4000
4.7000 1.4000
4.7000 1.5000
5 To make duplicate values visible on the plot, use the following code:
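One possibility, sketched here with an arbitrary offset of 0.05 (not necessarily the listing this step originally used), is to add a small random offset purely for display:

% Jitter a copy of the data so coincident points become visible.
xj = x + 0.05*(rand(size(x)) - 0.5);
gscatter(xj(:,1),xj(:,2),species)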
The jittered points do not affect any analysis of the data, only the
visualization. This example does not jitter the points.
6 Make the axes equal so the calculated distances correspond to the apparent distances on the plot (axis equal), and zoom in to see the neighbors better.
7 Find the species of the 10 nearest neighbors:
tabulate(species(n))
Using a rule based on the majority vote of the 10 nearest neighbors, you
can classify this new point as a versicolor.
9 Using the same dataset, find the 10 nearest neighbors to three new points:
figure
newpoint2 = [5 1.45;6 2;2.75 .75];
gscatter(x(:,1),x(:,2),species)
legend('location','best')
[n2,d2] = knnsearch(x,newpoint2,'k',10);
line(x(n2,1),x(n2,2),'color',[.5 .5 .5],'marker','o',...
'linestyle','none','markersize',10)
line(newpoint2(:,1),newpoint2(:,2),'marker','x','color','k',...
'markersize',10,'linewidth',2,'linestyle','none')
10 Find the species of the 10 nearest neighbors for each new point:
tabulate(species(n2(1,:)))
Value Count Percent
virginica 2 20.00%
versicolor 8 80.00%
tabulate(species(n2(2,:)))
Value Count Percent
virginica 10 100.00%
tabulate(species(n2(3,:)))
Value Count Percent
versicolor 7 70.00%
setosa 3 30.00%
For further examples using knnsearch methods and function, see the
individual reference pages.
Classification Trees and Regression Trees
If, however, x1 exceeds 0.5, then follow the right branch to the lower-right
triangle node. Here the tree asks if x2 is smaller than 0.5. If so, then follow
the left branch to see that the tree classifies the data as type 0. If not, then
follow the right branch to see that the tree classifies the data as
type 1.
tree = ClassificationTree.fit(X,Y);
tree = RegressionTree.fit(X,Y);
ctree =
ClassificationTree:
PredictorNames: {1x34 cell}
CategoricalPredictors: []
ResponseName: 'Y'
ClassNames: {'b' 'g'}
ScoreTransform: 'none'
NObservations: 351
rtree =
RegressionTree:
PredictorNames: {'x1' 'x2'}
CategoricalPredictors: []
ResponseName: 'Y'
ResponseTransform: 'none'
NObservations: 94
Viewing a Tree
There are two ways to view a tree:
load fisheriris
ctree = ClassificationTree.fit(meas,species);
view(ctree)
view(ctree,'mode','graph')
view(rtree,'mode','graph')
1 Start with all input data, and examine all possible binary splits on every
predictor.
• If the split leads to a child node having too few observations (less
than the MinLeaf parameter), select a split with the best optimization
criterion subject to the MinLeaf constraint.
Optimization criterion:
For a continuous predictor, a tree can split halfway between any two adjacent
unique values found for this predictor. For a categorical predictor with L
levels, a classification tree needs to consider 2^(L–1) – 1 splits. To obtain
this formula, observe that you can assign L distinct values to the left and
right nodes in 2^L ways. Two out of these 2^L configurations leave either the
left or the right node empty, and therefore should be discarded. Now divide by
2 because left and right can be swapped. A classification tree can thus
process only categorical predictors with a moderate number of levels. A
regression tree employs a computational shortcut: it sorts the levels by the
observed mean response, and considers only the L–1 splits between the sorted
levels.
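As a quick check of this formula (a worked example, not from the original
text), a categorical predictor with L = 4 levels gives 2^(4–1) – 1 = 7
candidate splits for a classification tree, but only L – 1 = 3 for a
regression tree:

L = 4;
nClassificationSplits = 2^(L-1) - 1   % 7
nRegressionSplits = L - 1             % 3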
Ynew = predict(tree,Xnew);
For each row of data in Xnew, predict runs through the decisions in tree and
gives the resulting prediction in the corresponding element of Ynew. For more
information for classification, see the classification predict reference page;
for regression, see the regression predict reference page.
Ynew =
'g'
To find the predicted MPG of a point at the mean of the carsmall data:
Ynew = predict(rtree,mean(X))
Ynew =
28.7931
load fisheriris
ctree = ClassificationTree.fit(meas,species);
resuberror = resubLoss(ctree)
resuberror =
0.0200
The tree classifies nearly all the Fisher iris data correctly.
Cross Validation
To get a better sense of the predictive accuracy of your tree for new data,
cross validate the tree. By default, cross validation splits the training data
into 10 parts at random. It trains 10 new trees, each one on nine parts of
the data. It then examines the predictive accuracy of each new tree on the
data not included in training that tree. This method gives a good estimate
of the predictive accuracy of the resulting tree, since it tests the new trees
on new data.
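For example, a sketch of cross validating the iris classification tree ctree
from above and estimating its error on new data:

cvctree = crossval(ctree);     % 10-fold cross validation by default
cvloss = kfoldLoss(cvctree)    % estimated misclassification rate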
load carsmall
X = [Acceleration Displacement Horsepower Weight];
rtree = RegressionTree.fit(X,MPG);
resuberror = resubLoss(rtree)
resuberror =
4.7188
The resubstitution loss for a regression tree is the mean-squared error. The
resulting value indicates that a typical predictive error for the tree is about
the square root of 4.7, or a bit over 2.
cvrtree = crossval(rtree);
cvloss = kfoldLoss(cvrtree)
cvloss =
23.4808
The cross-validated loss is almost 25, meaning a typical predictive error for
the tree on new data is about 5. This demonstrates that cross-validated loss
is usually higher than simple resubstitution loss.
If you do not have enough data for training and test, estimate tree accuracy
by cross validation.
load ionosphere
leafs = logspace(1,2,10);
3 Create cross validated classification trees for the ionosphere data with
minimum leaf occupancies from leafs:
N = numel(leafs);
err = zeros(N,1);
for n=1:N
t = ClassificationTree.fit(X,Y,'crossval','on',...
'minleaf',leafs(n));
err(n) = kfoldLoss(t);
end
plot(leafs,err);
xlabel('Min Leaf Size');
ylabel('cross-validated error');
The best leaf size is between about 20 and 50 observations per leaf.
DefaultTree = ClassificationTree.fit(X,Y);
view(DefaultTree,'mode','graph')
OptimalTree = ClassificationTree.fit(X,Y,'minleaf',40);
view(OptimalTree,'mode','graph')
resubOpt = resubLoss(OptimalTree);
lossOpt = kfoldLoss(crossval(OptimalTree));
resubDefault = resubLoss(DefaultTree);
lossDefault = kfoldLoss(crossval(DefaultTree));
resubOpt,resubDefault,lossOpt,lossDefault
resubOpt =
0.0883
resubDefault =
0.0114
lossOpt =
0.1054
lossDefault =
0.1026
Pruning
Pruning optimizes tree depth (leafiness) by merging leaves on the same tree
branch. “Control Depth or Leafiness” on page 13-34 describes one method
for selecting the optimal depth for a tree. Unlike in that section, you do not
need to grow a new tree for every node size. Instead, grow a deep tree, and
prune it to the level you choose.
Prune a tree at the command line using the prune method (classification) or
prune method (regression). Alternatively, prune a tree interactively with
the tree viewer:
view(tree,'mode','graph')
To prune a tree, the tree must contain a pruning sequence. By default, both
ClassificationTree.fit and RegressionTree.fit calculate a pruning
sequence for a tree during construction. If you construct a tree with the
'Prune' name-value pair set to 'off', or if you prune a tree to a smaller level,
the tree does not contain the full pruning sequence. Generate the full pruning
sequence with the prune method (classification) or prune method (regression).
load ionosphere
tree = ClassificationTree.fit(X,Y);
view(tree,'mode','graph')
[~,~,~,bestlevel] = cvLoss(tree,...
'subtrees','all','treesize','min')
bestlevel =
6
6 Set 'treesize' to 'se' (default) to find the maximal pruning level for
which the tree error does not exceed the error from the best level plus one
standard deviation:
[~,~,~,bestlevel] = cvLoss(tree,'subtrees','all')
bestlevel =
6
In this case the level is the same for either setting of 'treesize'.
tree = prune(tree,'Level',6);
view(tree,'mode','graph')
Alternative: classregtree
The ClassificationTree and RegressionTree classes are new in MATLAB
R2011a. Previously, you represented both classification trees and regression
trees with a classregtree object. The new classes provide all the
functionality of the classregtree class, and are more convenient when used
in conjunction with “Ensemble Methods” on page 13-50.
1 Load the data and use the classregtree constructor of the classregtree
class to create the classification tree:
load fisheriris
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
2 Use the type method of the classregtree class to show the type of the tree:
treetype = type(t)
treetype =
classification
3 To view the tree, use the view method of the classregtree class:
view(t)
The tree predicts the response values at the circular leaf nodes based on a
series of questions about the iris at the triangular branching nodes. A true
answer to any question follows the branch to the left. A false follows the
branch to the right.
4 The tree does not use sepal measurements for predicting species. These
can go unmeasured in new data, and you can enter them as NaN values for
predictions. For example, to use the tree to predict the species of an iris
with petal length 4.8 and petal width 1.6, type:
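The prediction call is not reproduced in this excerpt; a plausible sketch,
passing NaN for the unmeasured sepal values, is:

predicted = t([NaN NaN 4.8 1.6])   % sepal length, sepal width, petal length, petal width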
The object allows functional evaluation of the form t(X). This is a
shorthand way of calling the eval method of the classregtree class.
The predicted species is the left leaf node at the bottom of the tree in the
previous view.
5 You can use a variety of methods of the classregtree class, such as cutvar
and cuttype to get more information about the split at node 6 that makes
the final distinction between versicolor and virginica:
6 Classification trees fit the original (training) data well, but can do a poor
job of classifying new values. Lower branches, especially, can be strongly
affected by outliers. A simpler tree often avoids overfitting. You can use
the prune method of the classregtree class to find the next largest tree
from an optimal pruning sequence:
pruned = prune(t,'level',1)
pruned =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 class = versicolor
7 class = virginica
view(pruned)
1 Load the data and use the classregtree constructor of the classregtree
class to create the regression tree:
load carsmall
t = classregtree([Weight, Cylinders],MPG,...
'cat',2,'splitmin',20,...
'names',{'W','C'})
t =
2 Use the type method of the classregtree class to show the type of the tree:
treetype = type(t)
treetype =
regression
3 To view the tree, use the view method of the classregtree class:
view(t)
The tree predicts the response values at the circular leaf nodes based on a
series of questions about the car at the triangular branching nodes. A true
answer to any question follows the branch to the left; a false follows the
branch to the right.
4 Use the tree to predict the mileage for a 2000-pound car with either 4,
6, or 8 cylinders:
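The call itself is not shown in this excerpt; a plausible sketch (the variable
name mileage2K is an assumption) is:

mileage2K = t([2000 4; 2000 6; 2000 8])   % weight 2000 lb with 4, 6, and 8 cylinders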
The object allows functional evaluation of the form t(X). This is a
shorthand way of calling the eval method of the classregtree class.
5 The predicted responses computed above are all the same. This is because
they follow a series of splits in the tree that depend only on weight,
terminating at the left-most leaf node in the view above. A 4000-pound
car, following the right branch from the top of the tree, leads to different
predicted responses:
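The call is again a sketch; the variable name mileage4K matches the output
shown below:

mileage4K = t([4000 4; 4000 6; 4000 8])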
mileage4K =
19.2778
19.2778
14.3889
6 You can use a variety of other methods of the classregtree class, such as
cutvar, cuttype, and cutcategories, to get more information about the
split at node 3 that distinguishes the 8-cylinder car:
Regression trees fit the original (training) data well, but may do a poor
job of predicting new values. Lower branches, especially, may be strongly
affected by outliers. A simpler tree often avoids over-fitting. To find the
best regression tree, employing the techniques of resubstitution and cross
validation, use the test method of the classregtree class.
Ensemble Methods
In this section...
“Framework for Ensemble Learning” on page 13-50
“Basic Ensemble Examples” on page 13-57
“Test Ensemble Quality” on page 13-59
“Classification: Imbalanced Data or Unequal Misclassification Costs” on
page 13-64
“Example: Classification with Many Categorical Levels” on page 13-71
“Example: Surrogate Splits” on page 13-76
“Ensemble Regularization” on page 13-81
“Example: Tuning RobustBoost” on page 13-92
“TreeBagger Examples” on page 13-96
“Ensemble Algorithms” on page 13-118
ens = fitensemble(X,Y,model,numberens,learners)
• X is the matrix of data. Each row contains one observation, and each
column contains one predictor variable.
• Y is the responses, with the same number of observations as rows in X.
• model is a string naming the type of ensemble.
• numberens is the number of weak learners in ens from each element of
learners. So the number of elements in ens is numberens times the
number of elements in learners.
(Figure: the data matrix X, the responses Y, and the weak learner(s) are the
inputs that fitensemble combines into a trained ensemble.)
Currently, you can use only decision trees as learners for ensembles. Decision
trees can handle NaN values in X. Such values are called “missing.” If you have
some missing values in a row of X, a decision tree finds optimal splits using
nonmissing values only. If an entire row consists of NaN, fitensemble ignores
that row. If you have data with a large fraction of missing values in X, use
surrogate decision splits. For examples of surrogate splits, see “Example:
Unequal Classification Costs” on page 13-66 and “Example: Surrogate Splits”
on page 13-76.
For example, suppose your response data consists of three observations in the
following order: true, false, true. You could express Y as:
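The original list of representations is not reproduced in this excerpt;
plausible equivalents include, for example:

Y = logical([1;0;1]);            % logical vector
Y = [1;0;1];                     % numeric vector
Y = {'true';'false';'true'};     % cell array of strings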
Use whichever data type is most convenient. Since you cannot represent
missing values with logical entries, do not use logical entries when you have
missing values in Y.
Since 'Bag' applies to all methods, indicate whether you want a classifier
or regressor with the type name-value pair set to 'classification' or
'regression'.
For descriptions of the various algorithms, and aid in choosing which applies
to your data, see “Ensemble Algorithms” on page 13-118. The following table
gives characteristics of the various algorithms. In the table titles:
• Regress. — Regression
• Classif. — Classification
• Preds. — Predictors
• Estim. — Estimate
• Gen. — Generalization
• Pred. — Prediction
• Mem. — Memory usage
While you can give fitensemble a cell array of learner templates, the most
common usage is to give just one weak learner template.
• The depth of the weak learner tree makes a difference for training time,
memory usage, and predictive accuracy. You control the depth with two
parameters:
- MinLeaf — Each leaf has at least MinLeaf observations. Set small
values of MinLeaf to get a deep tree.
- MinParent — Each branch node in the tree has at least MinParent
observations. Set small values of MinParent to get a deep tree.
If you supply both MinParent and MinLeaf, the learner uses the setting
that gives larger leaves:
MinParent = max(MinParent,2*MinLeaf)
Note Surrogate splits cause training to be slower and use more memory.
Call fitensemble
The syntax of fitensemble is
ens = fitensemble(X,Y,model,numberens,learners)
• X is the matrix of data. Each row contains one observation, and each
column contains one predictor variable.
• Y is the responses, with the same number of observations as rows in X.
• model is a string naming the type of ensemble.
• numberens is the number of weak learners in ens from each element of
learners. So the number of elements in ens is numberens times the
number of elements in learners.
• learners is a string naming a weak learner, a weak learner template, or a
cell array of such strings and templates.
For example, to have an ensemble of boosted classification trees with each tree
deeper than the default, set the ClassificationTree.template name-value
pairs (MinLeaf and MinParent) to smaller values than the defaults. This
causes the trees to be leafier (deeper).
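For instance, a sketch of such a template and ensemble (the particular
parameter values and boosting method are illustrative):

deepTree = ClassificationTree.template('MinLeaf',1,'MinParent',2);
ens = fitensemble(X,Y,'AdaBoostM1',200,deepTree);   % X, Y: your training data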
To name the predictors in the ensemble (part of the structure of the ensemble),
use the PredictorNames name-value pair in fitensemble.
load fisheriris
ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree')
ens =
classreg.learning.classif.ClassificationEnsemble:
PredictorNames: {'x1' 'x2' 'x3' 'x4'}
CategoricalPredictors: []
ResponseName: 'Y'
ClassNames: {'setosa' 'versicolor' 'virginica'}
ScoreTransform: 'none'
NObservations: 150
NTrained: 100
Method: 'AdaBoostM2'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [100x1 double]
FitInfoDescription: [2x83 char]
flower = predict(ens,mean(meas))
flower =
'versicolor'
load carsmall
X = [Horsepower Weight];
ens = fitensemble(X,MPG,'LSBoost',100,'Tree')
ens =
classreg.learning.regr.RegressionEnsemble:
PredictorNames: {'x1' 'x2'}
CategoricalPredictors: []
ResponseName: 'Y'
ResponseTransform: 'none'
NObservations: 94
NTrained: 100
Method: 'LSBoost'
LearnerNames: {'Tree'}
8 Predict the mileage of a car with 150 horsepower weighing 2750 lbs:
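The prediction call is not reproduced here; a sketch consistent with the
output below is:

mileage = predict(ens,[150 2750])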
mileage =
22.6735
To obtain a better idea of the quality of an ensemble, use one of these methods:
• Evaluate the ensemble on an independent test set (useful when you have a
lot of training data).
• Evaluate the ensemble by cross validation (useful when you don’t have a
lot of training data).
• Evaluate the ensemble on out-of-bag data (useful when you create a bagged
ensemble with fitensemble).
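The code that generates the artificial data for this example is not shown in
this excerpt. A minimal sketch consistent with the later steps (2000
observations of 20 uniform predictors; the labeling rule is borrowed from the
RobustBoost example later in this section) might be:

rng(10,'twister');             % hypothetical seed, for reproducibility
X = rand(2000,20);
Y = sum(X(:,1:5),2) > 2.5;     % assumed labeling rule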
idx = randsample(2000,200);
Y(idx) = ~Y(idx);
Create independent training and test sets of data. Use 70% of the data for
a training set by calling cvpartition with the holdout option:
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
3 Create a bagged classification ensemble of 200 trees from the training data:
bag = fitensemble(Xtrain,Ytrain,'Bag',200,'Tree',...
'type','classification')
bag =
classreg.learning.classif.ClassificationBaggedEnsemble:
PredictorNames: {1x20 cell}
CategoricalPredictors: []
ResponseName: 'Y'
ClassNames: [0 1]
ScoreTransform: 'none'
NObservations: 1400
NTrained: 200
Method: 'Bag'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: []
FitInfoDescription: 'None'
FResample: 1
Replace: 1
UseObsForLearner: [1400x200 logical]
4 Plot the loss (misclassification) of the test data as a function of the number
of trained trees in the ensemble:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Test classification error');
5 Cross validation
cv = fitensemble(X,Y,'Bag',200,'Tree',...
'type','classification','kfold',5)
cv =
classreg.learning.partition.ClassificationPartitionedEnsemble:
CrossValidatedModel: 'Bag'
PredictorNames: {1x20 cell}
CategoricalPredictors: []
ResponseName: 'Y'
NObservations: 2000
KFold: 5
Partition: [1x1 cvpartition]
NTrainedPerFold: [200 200 200 200 200]
ClassNames: [0 1]
ScoreTransform: 'none'
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold
plot(kfoldLoss(cv,'mode','cumulative'),'r.');
hold off;
xlabel('Number of trees');
ylabel('Classification error');
legend('Test','Cross-validation','Location','NE');
7 Out-of-Bag Estimates
Generate the loss curve for out-of-bag estimates, and plot it along with
the other curves:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold
plot(kfoldLoss(cv,'mode','cumulative'),'r.');
plot(oobLoss(bag,'mode','cumulative'),'k--');
hold off;
xlabel('Number of trees');
ylabel('Classification error');
legend('Test','Cross-validation','Out of bag','Location','NE');
By using prior, you set prior class probabilities (that is, class probabilities
used for training). Use this option if some classes are under- or
overrepresented in your training set. For example, you might obtain your
training data by simulation. Because simulating class A is more expensive
than class B, you opt to generate fewer observations of class A and more
observations of class B. You expect, however, that class A and class B are mixed
in a different proportion in the real world. In this case, set prior probabilities
for class A and B approximately to the values you expect to observe in the real
world. fitensemble normalizes prior probabilities to make them add up to 1;
multiplying all prior probabilities by the same positive factor does not affect
the result of classification.
If classes are adequately represented in the training data but you want to
treat them asymmetrically, use the cost parameter. Suppose you want to
classify benign and malignant tumors in cancer patients. Failure to identify
a malignant tumor (false negative) has far more severe consequences than
misidentifying benign as malignant (false positive). You should assign high
cost to misidentifying malignant as benign and low cost to misidentifying
benign as malignant.
    C = [ 0  c
          1  0 ]
If you have only two classes, fitensemble adjusts their prior probabilities
using P̃i = Cij Pi for class i = 1,2 and j ≠ i. Pi are prior probabilities
either passed into fitensemble or computed from class frequencies in the
training data, and P̃i are the adjusted prior probabilities. Then fitensemble
uses the default cost matrix
    [ 0  1
      1  0 ]
and these adjusted probabilities for training its weak learners. Manipulating
the cost matrix is thus equivalent to manipulating the prior probabilities.
If you have three or more classes, fitensemble also converts input costs
into adjusted prior probabilities. This conversion is more complex. First,
fitensemble attempts to solve a matrix equation described in Zhou and
Liu [15]. If it fails to find a solution, fitensemble applies the “average
cost” adjustment described in Breiman et al. [5]. For more information, see
Zadrozny, Langford, and Abe [14].
s = urlread(['http://archive.ics.uci.edu/ml/' ...
'machine-learning-databases/hepatitis/hepatitis.data']);
fid = fopen('hepatitis.txt','w');
fwrite(fid,s);
fclose(fid);
size(ds)
ans =
155 20
3 Convert the data in the dataset to the format for ensembles: a numeric
matrix of predictors, and a cell array with outcome names: 'Die' or
'Live'. The first field in the dataset has the outcomes.
X = double(ds(:,2:end));
ClassNames = {'Die' 'Live'};
Y = ClassNames(ds.die_or_live);
figure;
bar(sum(isnan(X),1)/size(X,1));
xlabel('Predictor');
ylabel('Fraction of missing values');
Most predictors have missing values, and one has nearly 45% of missing
values. Therefore, use decision trees with surrogate splits for better
accuracy. Because the dataset is small, training time with surrogate splits
should be tolerable.
6 Examine the data or the description of the data to see which predictors
are categorical:
X(1:5,:)
ncat = [2:13,19];
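The tree template t passed to fitensemble below is not shown in this excerpt;
given the recommendation above to use surrogate splits, a plausible sketch is:

t = ClassificationTree.template('surrogate','on');   % assumed: surrogate splits on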
a = fitensemble(X,Y,'GentleBoost',200,t,...
'PredictorNames',VarNames(2:end),'LearnRate',0.1,...
'CategoricalPredictors',ncat,'kfold',5);
figure;
plot(kfoldLoss(a,'mode','cumulative','lossfun','exponential'));
xlabel('Number of trees');
ylabel('Cross-validated exponential loss');
9 Inspect the confusion matrix to see which people the ensemble predicts
correctly:
[Yfit,Sfit] = kfoldPredict(a);
confusionmat(Y,Yfit,'order',ClassNames)
ans =
16 16
10 113
Of the 123 people who live, the ensemble predicts correctly that 113 will
live. But for the 32 people who die of hepatitis, the ensemble only predicts
correctly that half will die of hepatitis.
Suppose you believe that the first error is five times worse than the second.
Make a new classification cost matrix that reflects this belief:
cost.ClassNames = ClassNames;
cost.ClassificationCosts = [0 5; 1 0];
aC = fitensemble(X,Y,'GentleBoost',200,t,...
'PredictorNames',VarNames(2:end),'LearnRate',0.1,...
'CategoricalPredictors',ncat,'kfold',5,...
'cost',cost);
[YfitC,SfitC] = kfoldPredict(aC);
confusionmat(Y,YfitC,'order',ClassNames)
ans =
19 13
9 114
As expected, the new ensemble does a better job classifying the people
who die. Somewhat surprisingly, the new ensemble also does a better
job classifying the people who live, though the result is not statistically
significantly better. The results of the cross validation are random, so this
result is simply a statistical fluctuation. The result seems to indicate that
the classification of people who live is not very sensitive to the cost.
This example uses demographic data from the U.S. Census, available at
http://archive.ics.uci.edu/ml/machine-learning-databases/adult/.
The objective of the researchers who posted the data is predicting whether an
individual makes more than $50,000/year, based on a set of characteristics.
You can see details of the data, including predictor names, in the adult.names
file at the site.
1 Load the 'adult.data' file from the UCI Machine Learning Repository:
s = urlread(['http://archive.ics.uci.edu/ml/' ...
'machine-learning-databases/adult/adult.data']);
s = strrep(s,'?','');
fid = fopen('adult.txt','w');
fwrite(fid,s);
fclose(fid);
clear s;
VarNames = {'age' 'workclass' 'fnlwgt' 'education' 'education_num' ...
'marital_status' 'occupation' 'relationship' 'race' ...
'sex' 'capital_gain' 'capital_loss' ...
'hours_per_week' 'native_country' 'income'};
ds = dataset('file','adult.txt','VarNames',VarNames,...
'Delimiter',',','ReadVarNames',false,'Format',...
'%u%s%u%s%u%s%s%s%s%s%u%u%u%s%s');
cat = ~datasetfun(@isnumeric,ds(:,1:end-1)); % Logical indices
% of categorical variables
catcol = find(cat); % indices of categorical variables
4 Many predictors in the data are categorical. Convert those fields in the
dataset array to nominal:
ds.workclass = nominal(ds.workclass);
ds.education = nominal(ds.education);
ds.marital_status = nominal(ds.marital_status);
ds.occupation = nominal(ds.occupation);
ds.relationship = nominal(ds.relationship);
ds.race = nominal(ds.race);
ds.sex = nominal(ds.sex);
ds.native_country = nominal(ds.native_country);
ds.income = nominal(ds.income);
X = double(ds(:,1:end-1));
Y = ds.income;
6 Some variables have many levels. Plot the number of levels of each
predictor:
ncat = zeros(1,numel(catcol));
for c=1:numel(catcol)
[~,gn] = grp2idx(X(:,catcol(c)));
ncat(c) = numel(gn);
end
figure;
bar(catcol,ncat);
xlabel('Predictor');
ylabel('Number of categories');
lb = fitensemble(X,Y,'LogitBoost',300,'Tree','CategoricalPredictors',cat,...
'PredictorNames',VarNames(1:end-1),'ResponseName','income');
gb = fitensemble(X,Y,'GentleBoost',300,'Tree','CategoricalPredictors',cat,...
'PredictorNames',VarNames(1:end-1),'ResponseName','income');
figure;
plot(resubLoss(lb,'mode','cumulative'));
hold on
plot(resubLoss(gb,'mode','cumulative'),'r--');
hold off
xlabel('Number of trees');
ylabel('Resubstitution error');
legend('LogitBoost','GentleBoost','Location','NE');
9 Estimate the generalization error for the two algorithms by cross validation.
lbcv = crossval(lb,'kfold',5);
gbcv = crossval(gb,'kfold',5);
figure;
plot(kfoldLoss(lbcv,'mode','cumulative'));
hold on
plot(kfoldLoss(gbcv,'mode','cumulative'),'r--');
hold off
xlabel('Number of trees');
ylabel('Cross-validated error');
legend('LogitBoost','GentleBoost','Location','NE');
This example shows the effects of surrogate splits for predictions for data
containing missing entries in both training and test sets. There is a redundant
predictor in the data, which the surrogate split uses to infer missing values.
While the example is artificial, it shows the value of surrogate splits with
missing data.
figure
plot(X1(:,1),X1(:,2),'k.','MarkerSize',2)
hold on
plot(X2(:,1),X2(:,2),'rx','MarkerSize',3);
hold off
axis square
There is a good deal of overlap between the data points. You cannot expect
perfect classification of this data.
X = [X X(:,1)];
Xtrain = X(training(cv),:);
Ytrain = Y(training(cv));
Xtest = X(test(cv),:);
Ytest = Y(test(cv));
5 Create two Bag ensembles: one with surrogate splits, one without. First
create the template for surrogate splits, then train both ensembles:
templS = ClassificationTree.template('surrogate','on');
bag = fitensemble(Xtrain,Ytrain,'Bag',50,'Tree',...
'type','class','nprint',10);
Training Bag...
Grown weak learners: 10
Grown weak learners: 20
Grown weak learners: 30
Grown weak learners: 40
Grown weak learners: 50
bagS = fitensemble(Xtrain,Ytrain,'Bag',50,templS,...
'type','class','nprint',10);
Training Bag...
Grown weak learners: 10
Grown weak learners: 20
Grown weak learners: 30
Grown weak learners: 40
Grown weak learners: 50
6 Examine the accuracy of the two ensembles for predicting the test data:
figure
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold on
plot(loss(bagS,Xtest,Ytest,'mode','cumulative'),'r--');
hold off;
legend('Without surrogate splits','With surrogate splits');
xlabel('Number of trees');
ylabel('Test classification error');
The ensemble with surrogate splits is obviously more accurate than the
ensemble without surrogate splits.
Yfit = predict(bag,Xtest);
YfitS = predict(bagS,Xtest);
N10 = sum(Yfit==Ytest & YfitS~=Ytest);
N01 = sum(Yfit~=Ytest & YfitS==Ytest);
mcnemar = (abs(N10-N01) - 1)^2/(N10+N01);
pval = 1 - chi2cdf(mcnemar,1)
pval =
0
The extremely low p-value indicates that the ensemble with surrogate
splits is better in a statistically significant manner.
Ensemble Regularization
Regularization is a process of choosing fewer weak learners for an ensemble
in a way that does not diminish predictive performance. Currently you can
regularize regression ensembles.
Regularization chooses weak-learner weights αt that minimize

    ∑_{n=1}^{N} w_n g( ∑_{t=1}^{T} α_t h_t(x_n), y_n ) + λ ∑_{t=1}^{T} |α_t|

Here

• λ ≥ 0 is the lasso parameter.
• h_t is a weak learner in the ensemble trained on N observations with
  predictors x_n, responses y_n, and weights w_n.
• g is the loss; for regression ensembles it is the squared error.

The ensemble is regularized on the same (x_n, y_n, w_n) data used for
training, so

    ∑_{n=1}^{N} w_n g( ∑_{t=1}^{T} α_t h_t(x_n), y_n )

is the ensemble resubstitution error.
load imports-85;
Description
Description =
1985 Auto Imports Database from the UCI repository
http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
Variables have been reordered to place variables with numeric values (referred
to as "continuous" on the UCI site) to the left and categorical values to the
right. Specifically, variables 1:16 are: symboling, normalized-losses,
wheel-base, length, width, height, curb-weight, engine-size, bore, stroke,
compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, and price.
Variables 17:26 are: make, fuel-type, aspiration, num-of-doors, body-style,
drive-wheels, engine-location, engine-type, num-of-cylinders, and fuel-system.
The objective of this process is to predict the “symboling,” the first variable
in the data, from the other predictors. “symboling” is an integer from
-3 (good insurance risk) to 3 (poor insurance risk). You could use a
classification ensemble to predict this risk instead of a regression ensemble.
As stated in “Steps in Supervised Learning (Machine Learning)” on page
13-2, when you have a choice between regression and classification,
you should try regression first. Furthermore, this example is to show
regularization, which currently works only for regression.
Y = X(:,1);
X(:,1) = [];
VarNames = {'normalized-losses' 'wheel-base' 'length' 'width' 'height' ...
'curb-weight' 'engine-size' 'bore' 'stroke' 'compression-ratio' ...
'horsepower' 'peak-rpm' 'city-mpg' 'highway-mpg' 'price' 'make' ...
'fuel-type' 'aspiration' 'num-of-doors' 'body-style' 'drive-wheels' ...
'engine-location' 'engine-type' 'num-of-cylinders' 'fuel-system'};
catidx = 16:25; % indices of categorical predictors
4 Create a regression ensemble from the data using 300 default trees:
ls = fitensemble(X,Y,'LSBoost',300,'Tree','LearnRate',0.1,...
'PredictorNames',VarNames,'ResponseName','symboling',...
'CategoricalPredictors',catidx)
ls =
classreg.learning.regr.RegressionEnsemble:
PredictorNames: {1x25 cell}
CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25]
ResponseName: 'symboling'
ResponseTransform: 'none'
NObservations: 205
NTrained: 300
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [300x1 double]
FitInfoDescription: [2x83 char]
Regularization: []
cv = crossval(ls,'kfold',5);
figure;
plot(kfoldLoss(cv,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Cross-validated MSE');
6 Call the regularize method to try to find trees that you can remove from
the ensemble. By default, regularize examines 10 values of the lasso
(Lambda) parameter spaced exponentially.
ls = regularize(ls)
ls =
classreg.learning.regr.RegressionEnsemble:
PredictorNames: {1x25 cell}
CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25]
ResponseName: 'symboling'
ResponseTransform: 'none'
NObservations: 205
NTrained: 300
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [300x1 double]
FitInfoDescription: [2x83 char]
Regularization: [1x1 struct]
figure;
semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE);
line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ...
ls.Regularization.ResubstitutionMSE(1)],...
'marker','x','markersize',12,'color','b');
r0 = resubLoss(ls);
line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],...
[r0 r0],'color','r','LineStyle','--');
xlabel('Lambda');
ylabel('Resubstitution MSE');
annotation('textbox',[0.5 0.22 0.5 0.05],'String','unregularized ensemble',...
'color','r','FontSize',14,'LineStyle','none');
figure;
loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1));
line([1e-3 1e-3],...
[sum(ls.Regularization.TrainedWeights(:,1)>0) ...
sum(ls.Regularization.TrainedWeights(:,1)>0)],...
'marker','x','markersize',12,'color','b');
line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],...
[ls.NTrained ls.NTrained],...
'color','r','LineStyle','--');
xlabel('Lambda');
ylabel('Number of learners');
annotation('textbox',[0.3 0.8 0.5 0.05],'String','unregularized ensemble',...
'color','r','FontSize',14,'LineStyle','none');
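The cross-validation step that produces the mse and nlearn values used in the
next plots is not reproduced in this excerpt. A plausible sketch using the
cvshrink method (the exact options are assumptions) is:

[mse,nlearn] = cvshrink(ls,'lambda',ls.Regularization.Lambda,'kfold',5);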
figure;
semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE);
hold;
semilogx(ls.Regularization.Lambda,mse,'r--');
hold off;
xlabel('Lambda');
ylabel('Mean squared error');
legend('resubstitution','cross-validation','Location','NW');
figure;
loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1));
hold;
loglog(ls.Regularization.Lambda,nlearn,'r--');
hold off;
xlabel('Lambda');
ylabel('Number of learners');
legend('resubstitution','cross-validation','Location','NE');
line([1e-3 1e-3],...
[sum(ls.Regularization.TrainedWeights(:,1)>0) ...
sum(ls.Regularization.TrainedWeights(:,1)>0)],...
'marker','x','markersize',12,'color','b');
line([1e-3 1e-3],[nlearn(1) nlearn(1)],'marker','o',...
'markersize',12,'color','r','LineStyle','--');
jj = 1:length(ls.Regularization.Lambda);
[jj;ls.Regularization.Lambda]
10 Reduce the ensemble size using the shrink method. shrink returns a
compact ensemble with no training data. The generalization error for the
new compact ensemble was already estimated by cross validation in mse(5).
cmp = shrink(ls,'weightcolumn',5)
cmp =
classreg.learning.regr.CompactRegressionEnsemble:
PredictorNames: {1x25 cell}
CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25]
ResponseName: 'symboling'
ResponseTransform: 'none'
NTrained: 18
There are only 18 trees in the new ensemble, notably reduced from the
300 in ls.
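The numbers below presumably come from comparing the in-memory sizes of cmp
and ls; a sketch of one way to obtain them (the exact command is an
assumption) is:

sz = whos('cmp','ls');
[sz.bytes]          % bytes used by cmp and by ls, respectively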
ans =
162270 2791024
12 Compare the MSE of the reduced ensemble to that of the original ensemble:
figure;
plot(kfoldLoss(cv,'mode','cumulative'));
hold on
plot(cmp.NTrained,mse(5),'ro','MarkerSize',12);
xlabel('Number of trees');
ylabel('Cross-validated MSE');
legend('unregularized ensemble','regularized ensemble',...
'Location','NE');
hold off
The reduced ensemble gives low loss while using many fewer trees.
1 Generate data with label noise. This example has twenty uniform random
numbers per observation, and classifies the observation as 1 if the sum
of the first five numbers exceeds 2.5 (so is larger than average), and 0
otherwise:
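The generating code is not reproduced here; a sketch that matches this
description is:

rng(0,'twister');                     % hypothetical seed
Xtrain = rand(2000,20);
Ytrain = sum(Xtrain(:,1:5),2) > 2.5;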
idx = randsample(2000,200);
Ytrain(idx) = ~Ytrain(idx);
ada = fitensemble(Xtrain,Ytrain,'AdaBoostM1',...
300,'Tree','LearnRate',0.1);
4 Create an ensemble with RobustBoost. Since the data has 10% incorrect
classification, perhaps an error goal of 15% is reasonable.
rb1 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
'Tree','RobustErrorGoal',0.15,'RobustMaxMargin',1);
5 Try setting a high value of the error goal, 0.6. You get an error:
rb2 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,'Tree','RobustErrorGoal',0.6)
rb2 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
'Tree','RobustErrorGoal',0.4);
rb3 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
'Tree','RobustErrorGoal',0.01);
figure
plot(resubLoss(rb1,'mode','cumulative'));
hold on
plot(resubLoss(rb2,'mode','cumulative'),'r--');
plot(resubLoss(rb3,'mode','cumulative'),'k-.');
plot(resubLoss(ada,'mode','cumulative'),'g.');
hold off;
xlabel('Number of trees');
ylabel('Resubstitution error');
legend('ErrorGoal=0.15','ErrorGoal=0.4','ErrorGoal=0.01',...
'AdaBoostM1','Location','NE');
All the RobustBoost curves show lower resubstitution error than the
AdaBoostM1 curve. The error goal of 0.15 curve shows the lowest
resubstitution error over most of the range. However, its error is rising in
the latter half of the plot, while the other curves are still descending.
9 Generate test data to see the predictive power of the ensembles. Test the
four ensembles:
Xtest = rand(2000,20);
Ytest = sum(Xtest(:,1:5),2) > 2.5;
idx = randsample(2000,200);
Ytest(idx) = ~Ytest(idx);
figure;
plot(loss(rb1,Xtest,Ytest,'mode','cumulative'));
hold on
plot(loss(rb2,Xtest,Ytest,'mode','cumulative'),'r--');
plot(loss(rb3,Xtest,Ytest,'mode','cumulative'),'k-.');
plot(loss(ada,Xtest,Ytest,'mode','cumulative'),'g.');
hold off;
xlabel('Number of trees');
ylabel('Test error');
legend('ErrorGoal=0.15','ErrorGoal=0.4','ErrorGoal=0.01',...
'AdaBoostM1','Location','NE');
The error curve for error goal 0.15 is lowest (best) in the plotted range. The
curve for error goal 0.4 seems to be converging to a similar value for a large
number of trees, but more slowly. AdaBoostM1 has higher error than the
curve for error goal 0.15. The curve for the too-optimistic error goal 0.01
remains substantially higher (worse) than the other algorithms for most
of the plotted range.
TreeBagger Examples
TreeBagger ensembles have more functionality than those constructed with
fitensemble; see “TreeBagger Features Not in fitensemble” on page 13-120.
Also, some property and method names differ from their fitensemble analogs.
1 Load the dataset and split it into predictor and response arrays:
load imports-85;
Y = X(:,1);
X = X(:,2:end);
rng(1945,'twister')
Finding the Optimal Leaf Size. For regression, the general rule is to
set leaf size to 5 and select one third of input features for decision splits at
random. In the following step, verify the optimal leaf size by comparing
mean-squared errors obtained by regression for various leaf sizes. oobError
computes MSE versus the number of grown trees. You must set oobpred to
'on' to obtain out-of-bag predictions later.
leaf = [1 5 10 20 50 100];
col = 'rgbcmy';
figure(1);
for i=1:length(leaf)
b = TreeBagger(50,X,Y,'method','r','oobpred','on',...
'cat',16:25,'minleaf',leaf(i));
plot(oobError(b),col(i));
hold on;
end
xlabel('Number of Grown Trees');
ylabel('Mean Squared Error');
The red (leaf size 1) curve gives the lowest MSE values.
b = TreeBagger(100,X,Y,'method','r','oobvarimp','on',...
'cat',16:25,'minleaf',1);
2 Inspect the error curve again to make sure nothing went wrong during
training:
figure(2);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Squared Error');
You can permute the values of one feature across all of the observations in
the data set and measure how much worse the mean-squared error (MSE) becomes
after the permutation. Repeat this for each feature to estimate its
importance.
figure(3);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Number');
ylabel('Out-Of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.65)
idxvar =
1 2 4 16 19
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
Growing Trees on a Reduced Set of Features. Using just the five most
powerful features selected in “Estimating Feature Importance” on page 13-98,
determine if it is possible to obtain a similar predictive power. To begin, grow
100 trees on these features only. The first three of the five selected features
are numeric and the last two are categorical.
b5v = TreeBagger(100,X(:,idxvar),Y,'method','r',...
'oobvarimp','on','cat',4:5,'minleaf',1);
figure(5);
plot(oobError(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Squared Error');
figure(6);
bar(b5v.OOBPermutedVarDeltaError);
xlabel('Feature Index');
ylabel('Out-of-Bag Feature Importance');
These five most powerful features give the same MSE as the full set, and
the ensemble trained on the reduced set ranks these features similarly to
each other. Features 1 and 2 from the reduced set perhaps could be removed
without a significant loss in the predictive power.
b5v = fillProximities(b5v);
The method normalizes this measure by subtracting the mean outlier measure
for the entire sample, taking the magnitude of this difference and dividing the
result by the median absolute deviation for the entire sample:
figure(7);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');
figure(8);
[~,e] = mdsProx(b5v,'colors','k');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');
Assess the relative importance of the scaled axes by plotting the first 20
eigenvalues:
figure(9);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');
Saving the Ensemble Configuration for Future Use. To use the trained
ensemble for predicting the response on unseen data, store the ensemble
to disk and retrieve it later. If you do not want to compute predictions for
out-of-bag data or reuse training data in any other way, there is no need to
store the ensemble object itself. Saving the compact version of the ensemble
would be enough in this case. Extract the compact object from the ensemble:
c = compact(b5v)
c =
The goal is to predict good or bad returns using a set of 34 measurements. The
workflow resembles that for “Workflow Example: Regression of Insurance
Risk Rating for Car Imports with TreeBagger” on page 13-97.
1 Fix the initial random seed, grow 50 trees, inspect how the ensemble error
changes with accumulation of trees, and estimate feature importance. For
classification, it is best to set the minimal leaf size to 1 and select the square
root of the total number of features for each decision split at random. These
are the default settings for a TreeBagger used for classification.
load ionosphere;
rng(1945,'twister')
b = TreeBagger(50,X,Y,'oobvarimp','on');
figure(10);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');
2 While the ensemble contains only a few trees, some observations are in
bag for all trees. For such observations, it is impossible to compute the true
out-of-bag prediction, and TreeBagger returns the most probable class
for classification and the sample mean for regression. You can change
the default value returned for in-bag observations using the DefaultYfit
property. If you set the default value to an empty string for classification,
the method excludes in-bag observations from computation of the out-of-bag
error. In this case, the curve is more variable when the number of trees
is small, either because some observations are never out of bag (and are
therefore excluded) or because their predictions are based on few trees.
b.DefaultYfit = '';
figure(11);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Error Excluding in-Bag Observations');
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure(12);
plot(finbag);
xlabel('Number of Grown Trees');
ylabel('Fraction of in-Bag Observations');
figure(13);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Index');
ylabel('Out-of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.8)
idxvar =
3 4 5 7 8
5 Having selected the five most important features, grow a larger ensemble
on the reduced feature set. Save time by not permuting out-of-bag
observations to obtain new estimates of feature importance for the reduced
feature set (set oobvarimp to 'off'). You would still be interested in
obtaining out-of-bag estimates of classification error (set oobpred to 'on').
b5v = TreeBagger(100,X(:,idxvar),Y,'oobpred','on');
figure(14);
plot(oobError(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');
figure(15);
plot(oobMeanMargin(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Classification Margin');
b5v = fillProximities(b5v);
figure(16);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');
8 All extreme outliers for this dataset come from the 'good' class:
b5v.Y(b5v.OutlierMeasure>40)
ans =
'g'
'g'
'g'
'g'
'g'
9 As for regression, you can plot scaled coordinates, displaying the two classes
in different colors using the colors argument of mdsProx. This argument
takes a string in which every character represents a color. To find the order
of classes used by the ensemble, look at the ClassNames property:
b5v.ClassNames
ans =
'g'
'b'
The 'good' class is first and the 'bad' class is second. Display scaled
coordinates using red for 'good' and blue for 'bad' observations:
figure(17);
[s,e] = mdsProx(b5v,'colors','rb');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');
figure(18);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');
[Yfit,Sfit] = oobPredict(b5v);
[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,1),'g');
figure(19);
plot(fpr,tpr);
Instead of the standard ROC curve, you might want to plot, for example,
ensemble accuracy versus threshold on the score for the 'good' class. The
ycrit input argument of perfcurve lets you specify the criterion for the
y-axis, and the third output argument of perfcurve returns an array of
thresholds for the positive class score. Accuracy is the fraction of correctly
classified observations, or equivalently, 1 minus the classification error.
[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,1),'g','ycrit','accu');
figure(20);
plot(thre,accu);
xlabel('Threshold for ''good'' Returns');
ylabel('Classification Accuracy');
The curve shows a flat region indicating that any threshold from 0.2 to 0.6
is a reasonable choice. By default, the function assigns classification labels
using 0.5 as the boundary between the two classes. You can find exactly
what accuracy this corresponds to:
i50 = find(accu>=0.50,1,'first')
accu(abs(thre-0.5)<eps)
returns
i50 =
2
ans =
0.9430
[maxaccu,iaccu] = max(accu)
returns
maxaccu =
0.9459
iaccu =
91
thre(iaccu)
ans =
0.5056
Ensemble Algorithms
• “Bagging” on page 13-118
• “AdaBoostM1” on page 13-122
• “AdaBoostM2” on page 13-124
• “LogitBoost” on page 13-125
• “GentleBoost” on page 13-126
• “RobustBoost” on page 13-127
• “LSBoost” on page 13-128
Bagging
Bagging, which stands for “bootstrap aggregation”, is a type of ensemble
learning. To bag a weak learner such as a decision tree on a dataset, generate
many bootstrap replicas of this dataset and grow decision trees on these
replicas. Obtain each bootstrap replica by randomly selecting N observations
out of N with replacement, where N is the dataset size. To find the predicted
response of a trained ensemble, take an average over predictions from
individual trees.
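As a brief sketch of this idea with the tools used elsewhere in this section
(the data and parameter choices are illustrative):

load carsmall
X = [Horsepower Weight];
bagged = fitensemble(X,MPG,'Bag',100,'Tree','type','regression');
yhat = predict(bagged,[150 2750]);    % average of the individual tree predictions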
By default, the minimal leaf sizes for bagged trees are set to 1 for classification
and 5 for regression. Trees grown with the default leaf size are usually
very deep. These settings are close to optimal for the predictive power of
an ensemble. Often you can grow trees with larger leaves without losing
predictive power. Doing so reduces training and prediction time, as well as
memory usage for the trained ensemble.
For references related to bagging, see Breiman [2], [3], and [4].
fitensemble then passes the adjusted prior probabilities and the default
cost matrix to the trees. The default cost matrix is ones(K)-eye(K) for K
classes.
• Unlike the loss and edge methods in the new framework, the TreeBagger
error and meanMargin methods do not normalize input observation
weights of the prior probabilities in the respective class.
AdaBoostM1
AdaBoostM1 is a very popular boosting algorithm for binary classification.
The algorithm trains learners sequentially. For every learner with index t,
AdaBoostM1 computes the weighted classification error
    ε_t = ∑_{n=1}^{N} d_n^{(t)} I( y_n ≠ h_t(x_n) )

where d_n^{(t)} is the weight of observation n at step t, y_n is the true
class label, h_t is the prediction of the learner with index t, and I is the
indicator function.
After training finishes, AdaBoostM1 computes prediction for new data using

    f(x) = ∑_{t=1}^{T} α_t h_t(x),

where

    α_t = (1/2) log( (1 – ε_t) / ε_t )

are the weights of the weak hypotheses in the ensemble. Training AdaBoostM1
can be viewed as stagewise minimization of the exponential loss

    ∑_{n=1}^{N} w_n exp( –y_n f(x_n) )

where y_n ∈ {–1,+1} is the true class label and f(x_n) is the predicted
classification score.
The observation weights wn are the original observation weights you passed
to fitensemble.
By default, the learning rate for boosting algorithms is 1. If you set the
learning rate to a lower number, the ensemble learns at a slower rate, but can
converge to a better solution. 0.1 is a popular choice for the learning rate.
Learning at a rate less than 1 is often called “shrinkage”.
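For example, a sketch of requesting shrinkage when training a boosted ensemble
(the method and counts are illustrative):

ens = fitensemble(X,Y,'LogitBoost',500,'Tree','LearnRate',0.1);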
For references related to AdaBoostM1, see Freund and Schapire [8], Schapire
et al. [13], Friedman, Hastie, and Tibshirani [10], and Friedman [9].
AdaBoostM2
AdaBoostM2 is an extension of AdaBoostM1 for multiple classes. Instead of
weighted classification error, AdaBoostM2 uses weighted pseudo-loss for N
observations and K classes:
    ε_t = (1/2) ∑_{n=1}^{N} ∑_{k≠y_n} d_{n,k}^{(t)} ( 1 – h_t(x_n, y_n) + h_t(x_n, k) )

where
• d_{n,k}^{(t)} are observation weights at step t for class k.
• yn is the true class label taking one of the K values.
• The second sum is over all classes other than the true class yn.
Interpreting the pseudo-loss is harder than classification error, but the idea is
the same. Pseudo-loss can be used as a measure of the classification accuracy
from any learner in an ensemble. Pseudo-loss typically exhibits the same
behavior as a weighted classification error for AdaBoostM1: the first few
learners in a boosted ensemble give low pseudo-loss values. After the first
few training steps, the ensemble begins to learn at a slower pace, and the
pseudo-loss value approaches 0.5 from below.
LogitBoost
LogitBoost is another popular algorithm for binary classification.
LogitBoost works similarly to AdaBoostM1, except it minimizes binomial
deviance
    ∑_{n=1}^{N} w_n log( 1 + exp( –2 y_n f(x_n) ) )

where y_n ∈ {–1,+1} is the true class label, w_n are observation weights
normalized to add up to 1, and f(x_n) is the predicted classification score.
At every step t, LogitBoost fits a regression model to response values

    ỹ_n = ( y_n* – p_t(x_n) ) / ( p_t(x_n) (1 – p_t(x_n)) )

where y_n* ∈ {0,+1} are the relabeled classes (0 instead of –1) and p_t(x_n)
is the current ensemble estimate of the probability that observation x_n
belongs to class 1. The regression learner at each step minimizes the weighted
mean-squared error

    ∑_{n=1}^{N} d_n^{(t)} ( ỹ_n – h_t(x_n) )^2

where
• d_n^{(t)} are observation weights at step t (the weights add up to 1).
• h_t(x_n) are predictions of the regression model h_t fitted to response
  values ỹ_n.
Values ỹ_n can range from –∞ to +∞, so the mean-squared error does not have
well-defined bounds.
GentleBoost
GentleBoost (also known as Gentle AdaBoost) combines features of
AdaBoostM1 and LogitBoost. Like AdaBoostM1, GentleBoost minimizes the
exponential loss. But its numeric optimization is set up differently. Like
LogitBoost, every weak learner fits a regression model to response values
y_n ∈ {–1,+1}. This makes GentleBoost another good candidate for binary
classification of data with multilevel categorical predictors. At every step,
the weak regression learner minimizes the weighted mean-squared error

    ∑_{n=1}^{N} d_n^{(t)} ( y_n – h_t(x_n) )^2

where
• d_n^{(t)} are observation weights at step t (the weights add up to 1).
• h_t(x_n) are predictions of the regression model h_t fitted to response
  values y_n.
RobustBoost
Boosting algorithms such as AdaBoostM1 and LogitBoost increase weights for
misclassified observations at every boosting step. These weights can become
very large. If this happens, the boosting algorithm sometimes concentrates on
a few misclassified observations and neglects the majority of training data.
Consequently the average classification accuracy suffers.
In this situation, you can try using RobustBoost. This algorithm does not
assign almost the entire data weight to badly misclassified observations. It
can produce better average classification accuracy.
RobustBoost runs an internal time parameter from 0 toward 1 and stops training
when, among other termination conditions, time t reaches 1.
To get better classification accuracy from RobustBoost, you can adjust three
parameters in fitensemble: RobustErrorGoal, RobustMaxMargin, and
RobustMarginSigma. Start by varying values for RobustErrorGoal from 0 to
1. The maximal allowed value for RobustErrorGoal depends on the two other
parameters. If you pass a value that is too high, fitensemble produces an
error message showing the allowed range for RobustErrorGoal.
LSBoost
You can use least squares boosting (LSBoost) to fit regression ensembles.
At every step, the ensemble fits a new learner to the difference between
the observed response and the aggregated prediction of all learners grown
previously. The ensemble fits to minimize mean-squared error.
For references related to LSBoost, see Hastie, Tibshirani, and Friedman [11],
Chapters 7 (Model Assessment and Selection) and 15 (Random Forests).
Bibliography
[1] Bottou, L., and Chih-Jen Lin. Support Vector Machine Solvers. Available at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.4209
&rep=rep1&type=pdf.
[2] Breiman, L. Bagging Predictors. Machine Learning 26, pp. 123–140, 1996.
[3] Breiman, L. Random Forests. Machine Learning 45, pp. 5–32, 2001.
[4] Breiman, L.
http://www.stat.berkeley.edu/~breiman/RandomForests/
[5] Breiman, L., et al. Classification and Regression Trees. Chapman & Hall,
Boca Raton, 1993.
[13] Schapire, R. E. et al. Boosting the margin: A new explanation for the
effectiveness of voting methods. Annals of Statistics, Vol. 26, No. 5, pp.
1651–1686, 1998.
14
Markov Models
Introduction
Markov processes are examples of stochastic processes—processes that
generate random sequences of outcomes or states according to certain
probabilities. Markov processes are distinguished by being memoryless—their
next state depends only on their current state, not on the history that led them
there. Models of Markov processes are used in a wide variety of applications,
from daily stock prices to the positions of genes in a chromosome.
Markov Chains
A Markov model is given visual representation with a state diagram, such
as the one below.
The rectangles in the diagram represent the possible states of the process you
are trying to model, and the arrows represent transitions between states.
The label on each arrow represents the probability of that transition. At
each step of the process, the model may generate an output, or emission,
depending on which state it is in, and then make a transition to another
state. An important characteristic of Markov models is that the next state
depends only on the current state, and not on the history of transitions that
lead to the current state.
For example, for a sequence of coin tosses the two states are heads and tails.
The most recent coin toss determines the current state of the model and each
subsequent toss determines the transition to the next state. If the coin is fair,
the transition probabilities are all 1/2. The emission might simply be the
current state. In more complicated models, random processes at each state
will generate emissions. You could, for example, roll a die to determine the
emission at any step.
Markov chains begin in an initial state i_0 at step 0. The chain then
transitions to state i_1 with probability T_{i_0 i_1}, and emits an output
s_{k_1} with probability E_{i_1 k_1}. Consequently, the probability of
observing the sequence of states i_1 i_2 ... i_r and the sequence of emissions
s_{k_1} s_{k_2} ... s_{k_r} in the first r steps is

T_{i_0 i_1} E_{i_1 k_1} T_{i_1 i_2} E_{i_2 k_2} \cdots T_{i_{r-1} i_r} E_{i_r k_r}
Hidden Markov Models (HMM)
Introduction
A hidden Markov model (HMM) is one in which you observe a sequence of
emissions, but do not know the sequence of states the model went through to
generate the emissions. Analyses of hidden Markov models seek to recover
the sequence of states from the observed data.
As an example, consider a Markov model with two states and six possible
emissions. The model uses:
• A red die, having six sides, labeled 1 through 6.
• A green die, having twelve sides, five of which are labeled 2 through 6,
while the remaining seven sides are labeled 1.
• A weighted red coin, for which the probability of heads is 0.9 and the
probability of tails is 0.1.
• A weighted green coin, for which the probability of heads is 0.95 and the
probability of tails is 0.05.
The model creates a sequence of numbers from the set {1, 2, 3, 4, 5, 6} with the
following rules:
• Begin by rolling the red die and writing down the number that comes up,
which is the emission.
• Toss the red coin and do one of the following:
- If the result is heads, roll the red die and write down the result.
- If the result is tails, roll the green die and write down the result.
• At each subsequent step, you flip the coin that has the same color as the die
you rolled in the previous step. If the coin comes up heads, roll the same die
as in the previous step. If the coin comes up tails, switch to the other die.
The state diagram for this model has two states, red and green, as shown in
the following figure.
You determine the emission from a state by rolling the die with the same color
as the state. You determine the transition to the next state by flipping the
coin with the same color as the state.
The transition matrix and the emission matrix for this model are

T = [ 0.90  0.10
      0.05  0.95 ]

E = [ 1/6   1/6   1/6   1/6   1/6   1/6
      7/12  1/12  1/12  1/12  1/12  1/12 ]
The model is not hidden because you know the sequence of states from the
colors of the coins and dice. Suppose, however, that someone else is generating
the emissions without showing you the dice or the coins. All you see is the
sequence of emissions. If you start seeing more 1s than other numbers, you
might suspect that the model is in the green state, but you cannot be sure
because you cannot see the color of the die being rolled.
Statistics Toolbox provides five functions for hidden Markov models:
hmmgenerate, hmmestimate, hmmtrain, hmmviterbi, and hmmdecode. This section
shows how to use these functions to analyze hidden Markov models.
To generate a random sequence of states and emissions from the model, use
hmmgenerate, with TRANS and EMIS set to the transition and emission matrices
given above:
[seq,states] = hmmgenerate(1000,TRANS,EMIS);
The output seq is the sequence of emissions and the output states is the
sequence of states.
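The comparison below uses a variable likelystates, the state path that is most likely given only the emissions. A minimal sketch of computing it, assuming the same TRANS and EMIS, uses hmmviterbi:

likelystates = hmmviterbi(seq,TRANS,EMIS);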
sum(states==likelystates)/1000
ans =
0.8200
In this case, the most likely sequence of states agrees with the random
sequence 82% of the time.
The following takes the emission and state sequences and returns estimates
of the transition and emission matrices:
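The call itself does not appear above; presumably it is hmmestimate, which takes the emissions and the known states (a sketch):

[TRANS_EST,EMIS_EST] = hmmestimate(seq,states);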
TRANS_EST =
0.8989 0.1011
0.0585 0.9415
EMIS_EST =
0.1721 0.1721 0.1749 0.1612 0.1803 0.1393
0.5836 0.0741 0.0804 0.0789 0.0726 0.1104
You can compare the outputs with the original transition and emission
matrices, TRANS and EMIS:
TRANS
TRANS =
0.9000 0.1000
0.0500 0.9500
EMIS
EMIS =
0.1667 0.1667 0.1667 0.1667 0.1667 0.1667
0.5833 0.0833 0.0833 0.0833 0.0833 0.0833
Using hmmtrain. If you do not know the sequence of states states, but you
have initial guesses for TRANS and EMIS, you can still estimate TRANS and
EMIS using hmmtrain.
Suppose you have initial guesses for TRANS and EMIS stored in matrices
TRANS_GUESS and EMIS_GUESS. Estimate the matrices from the emissions alone.
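A sketch of the call (the output names TRANS_EST2 and EMIS_EST2 match the results shown below):

[TRANS_EST2,EMIS_EST2] = hmmtrain(seq,TRANS_GUESS,EMIS_GUESS)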
TRANS_EST2 =
0.2286 0.7714
0.0032 0.9968
EMIS_EST2 =
0.1436 0.2348 0.1837 0.1963 0.2350 0.0066
0.4355 0.1089 0.1144 0.1082 0.1109 0.1220
If the algorithm fails to reach the desired tolerance, increase the default value
of the maximum number of iterations with the command:
hmmtrain(seq,TRANS_GUESS,EMIS_GUESS,'maxiterations',maxiter)
You can also change the tolerance that hmmtrain uses to test for convergence:

hmmtrain(seq,TRANS_GUESS,EMIS_GUESS,'tolerance',tol)

where tol is the desired value of the tolerance. Increasing the value of tol
makes the algorithm halt sooner, but the results are less accurate.
Two factors can make the estimates returned by hmmtrain unreliable:
• The algorithm converges to a local maximum that does not represent the
true transition and emission matrices. If you suspect this, use different
initial guesses for the matrices TRANS_EST and EMIS_EST.
• The sequence seq may be too short to properly train the matrices. If you
suspect this, use a longer sequence for seq.
To compute the posterior probabilities of the states of the model given the
sequence seq, use hmmdecode:

PSTATES = hmmdecode(seq,TRANS,EMIS)

hmmdecode begins with the model in state 1 at step 0, prior to the first
emission. PSTATES(i,1) is the probability that the model is in state i at
step 1, following the first emission. To change the initial state, see
“Changing the Initial State Distribution” on page 14-12.
To return the logarithm of the probability of the sequence seq, use the second
output argument of hmmdecode:
[PSTATES,logpseq] = hmmdecode(seq,TRANS,EMIS)
The probability of a sequence tends to 0 as its length increases, and the
probability of a sufficiently long sequence becomes less than the smallest
positive number your computer can represent. hmmdecode returns the logarithm
of the probability to avoid this problem.
Changing the Initial State Distribution
By default, the hidden Markov model functions begin in state 1. To assign a
different distribution of probabilities, p = [p_1, p_2, ..., p_M], to the M
initial states, create an augmented transition matrix of the form

T̂ = [ 0   p
      0   T ]

where T is the true transition matrix. The first column of T̂ contains M+1
zeros. p must sum to 1.
Similarly, augment the emission matrix with a leading row of zeros:

Ê = [ 0
      E ]
If the transition and emission matrices are TRANS and EMIS, respectively, you
create the augmented matrices with the following commands:
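The commands themselves are not shown above. A minimal sketch, assuming p is the 1-by-M row vector of initial-state probabilities and using hypothetical names TRANS_HAT and EMIS_HAT:

TRANS_HAT = [0 p; zeros(size(TRANS,1),1) TRANS];
EMIS_HAT  = [zeros(1,size(EMIS,2)); EMIS];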
15
Design of Experiments
Introduction
Passive data collection leads to a number of problems in statistical modeling.
Observed changes in a response variable may be correlated with, but
not caused by, observed changes in individual factors (process variables).
Simultaneous changes in multiple factors may produce interactions that are
difficult to separate into individual effects. Observations may be dependent,
while a model of the data considers them to be independent.
For example, consider a linear model for a response y that depends on two
factors x1 and x2, with a two-way interaction:

y = β0 + β1x1 + β2x2 + β3x1x2 + ε
Here ε includes both experimental error and the effects of any uncontrolled
factors in the experiment. The terms β1x1 and β2x2 are main effects and the
term β3x1x2 is a two-way interaction effect. A designed experiment would
systematically manipulate x1 and x2 while measuring y, with the objective of
accurately estimating β0, β1, β2, and β3.
Full Factorial Designs
Multilevel Designs
To systematically vary experimental factors, assign each factor a discrete
set of levels. Full factorial designs measure response variables using every
treatment (combination of the factor levels). A full factorial design for n
factors with N1, ..., Nn levels requires N1 × ... × Nn experimental runs—one for
each treatment. While advantageous for separating individual effects, full
factorial designs can make large demands on data collection.
For example, suppose a machine shop has three machines and four operators.
The Statistics Toolbox function fullfact lists all 12 machine/operator
treatments:

dFF = fullfact([3,4])
dFF =
1 1
2 1
3 1
1 2
2 2
3 2
1 3
2 3
3 3
1 4
2 4
3 4
Two-Level Designs
Many experiments can be conducted with two-level factors, using two-level
designs. For example, suppose the machine shop in the previous example
always keeps the same operator on the same machine, but wants to measure
production effects that depend on the composition of the day and night
shifts. The Statistics Toolbox function ff2n generates a full factorial list of
treatments:
dFF2 = ff2n(4)
dFF2 =
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
Each of the 2^4 = 16 rows of dFF2 represents one schedule of operators for the
day (0) and night (1) shifts.
Fractional Factorial Designs
Introduction
Two-level designs are sufficient for evaluating many production processes.
Factor levels of ±1 can indicate categorical factors, normalized factor extremes,
or simply “up” and “down” from current factor settings. Experimenters
evaluating process changes are interested primarily in the factor directions
that lead to process improvement.
For experiments with many factors, two-level full factorial designs can lead to
large amounts of data. For example, a two-level full factorial design with 10
factors requires 2^10 = 1024 runs. Often, however, individual factors or their
interactions have no distinguishable effects on a response. This is especially
true of higher order interactions. As a result, a well-designed experiment can
use fewer runs for estimating model parameters.
Plackett-Burman Designs
Plackett-Burman designs are used when only main effects are considered
significant. Two-level Plackett-Burman designs require a number of
experimental runs that is a multiple of 4 rather than a power of 2. The
MATLAB function hadamard generates these designs:
dPB = hadamard(8)
dPB =
1 1 1 1 1 1 1 1
1 -1 1 -1 1 -1 1 -1
1 1 -1 -1 1 1 -1 -1
1 -1 -1 1 1 -1 -1 1
1 1 1 1 -1 -1 -1 -1
1 -1 1 -1 -1 1 -1 1
1 1 -1 -1 -1 -1 1 1
1 -1 -1 1 -1 1 1 -1
Binary factor levels are indicated by ±1. The design is for eight runs (the rows
of dPB) manipulating seven two-level factors (the last seven columns of dPB).
The number of runs is a fraction 8/2^7 = 0.0625 of the runs required by a full
factorial design for seven two-level factors. Economy is achieved at the
expense of confounding main effects with two-way interactions.
Specify general fractional factorial designs using a full factorial design for
a selected subset of basic factors and generators for the remaining factors.
Generators are products of the basic factors, giving the levels for the
remaining factors. Use the Statistics Toolbox function fracfact to generate
these designs:
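The call that generates the design listed (in part) below does not appear here. A plausible form, assuming the four basic factors a through d and the generators bcd and acd named later in this section:

dfF = fracfact('a b c d bcd acd')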
-1 1 -1 1 -1 1
-1 1 1 -1 -1 1
-1 1 1 1 1 -1
1 -1 -1 -1 -1 1
1 -1 -1 1 1 -1
1 -1 1 -1 1 -1
1 -1 1 1 -1 1
1 1 -1 -1 1 1
1 1 -1 1 -1 -1
1 1 1 -1 -1 -1
1 1 1 1 1 1
This is a six-factor design in which four two-level basic factors (a, b, c, and
d in the first four columns of dfF) are measured in every combination of
levels, while the two remaining factors (in the last two columns of dfF) are
measured only at levels defined by the generators bcd and acd, respectively.
Levels in the generated columns are products of corresponding levels in the
columns that make up the generator.
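The generators referred to next can be produced with fracfactgen. The exact call is not shown in this section, so the arguments in this sketch (six factors, 2^4 runs, resolution IV) are assumptions based on the surrounding text:

generators = fracfactgen('a b c d e f',4,4)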
These are generators for a six-factor design with factors a through f, using 2^4
= 16 runs to achieve resolution IV. The fracfactgen function uses an efficient
search algorithm to find generators that meet the requirements.
[dfF,confounding] = fracfact(generators);
confounding
confounding =
'Term' 'Generator' 'Confounding'
'X1' 'a' 'X1'
'X2' 'b' 'X2'
'X3' 'c' 'X3'
'X4' 'd' 'X4'
'X5' 'bcd' 'X5'
'X6' 'acd' 'X6'
'X1*X2' 'ab' 'X1*X2 + X5*X6'
'X1*X3' 'ac' 'X1*X3 + X4*X6'
'X1*X4' 'ad' 'X1*X4 + X3*X6'
'X1*X5' 'abcd' 'X1*X5 + X2*X6'
'X1*X6' 'cd' 'X1*X6 + X2*X5 + X3*X4'
'X2*X3' 'bc' 'X2*X3 + X4*X5'
'X2*X4' 'bd' 'X2*X4 + X3*X5'
'X2*X5' 'cd' 'X1*X6 + X2*X5 + X3*X4'
'X2*X6' 'abcd' 'X1*X5 + X2*X6'
'X3*X4' 'cd' 'X1*X6 + X2*X5 + X3*X4'
'X3*X5' 'bd' 'X2*X4 + X3*X5'
'X3*X6' 'ad' 'X1*X4 + X3*X6'
'X4*X5' 'bc' 'X2*X3 + X4*X5'
'X4*X6' 'ac' 'X1*X3 + X4*X6'
'X5*X6' 'ab' 'X1*X2 + X5*X6'
The confounding pattern shows that main effects are effectively separated
by the design, but two-way interactions are confounded with various other
two-way interactions.
Response Surface Designs
Introduction
As discussed in “Response Surface Models” on page 9-45, quadratic response
surfaces are simple models that provide a maximum or minimum without
making additional assumptions about the form of the response. Quadratic
models can be calibrated using full factorial designs with three or more levels
for each factor, but these designs generally require more runs than necessary
to accurately estimate model parameters. This section discusses designs for
calibrating quadratic models that are much more efficient, using three or five
levels for each factor, but not using all combinations of levels.
Central Composite Designs
Each design consists of a factorial design (the corners of a cube) together with
center and star points that allow for estimation of second-order effects. CCDs
have enough design points to estimate the (n+2)(n+1)/2 coefficients in a full
quadratic model with n factors. For example, with n = 3 factors the full
quadratic model has (3+2)(3+1)/2 = 10 coefficients.
The type of CCD used (the position of the factorial and star points) is
determined by the number of factors and by the desired properties of the
design. The following table summarizes some important properties. A design
is rotatable if the prediction variance depends only on the distance of the
design point from the center of the design.
dCC = ccdesign(3,'type','circumscribed')
dCC =
-1.0000 -1.0000 -1.0000
-1.0000 -1.0000 1.0000
-1.0000 1.0000 -1.0000
-1.0000 1.0000 1.0000
1.0000 -1.0000 -1.0000
1.0000 -1.0000 1.0000
1.0000 1.0000 -1.0000
1.0000 1.0000 1.0000
-1.6818 0 0
1.6818 0 0
0 -1.6818 0
0 1.6818 0
0 0 -1.6818
0 0 1.6818
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
The repeated center point runs allow for a more uniform estimate of the
prediction variance over the entire design space.
Box-Behnken Designs
Like the designs described in “Central Composite Designs” on page
15-9, Box-Behnken designs are used to calibrate full quadratic models.
Box-Behnken designs are rotatable and, for a small number of factors (four or
less), require fewer runs than CCDs. By avoiding the corners of the design
space, they allow experimenters to work around extreme factor combinations.
Like an inscribed CCD, however, extremes are then poorly estimated.
Design points are at the midpoints of edges of the design space and at the
center, and do not contain an embedded factorial design.
dBB = bbdesign(3)
dBB =
-1 -1 0
-1 1 0
1 -1 0
1 1 0
-1 0 -1
-1 0 1
1 0 -1
1 0 1
0 -1 -1
0 -1 1
0 1 -1
0 1 1
0 0 0
0 0 0
0 0 0
Again, the repeated center point runs allow for a more uniform estimate of
the prediction variance over the entire design space.
D-Optimal Designs
In this section...
“Introduction” on page 15-15
“Generating D-Optimal Designs” on page 15-16
“Augmenting D-Optimal Designs” on page 15-19
“Specifying Fixed Covariate Factors” on page 15-20
“Specifying Categorical Factors” on page 15-21
“Specifying Candidate Sets” on page 15-21
Introduction
Traditional experimental designs (“Full Factorial Designs” on page 15-3,
“Fractional Factorial Designs” on page 15-5, and “Response Surface Designs”
on page 15-9) are appropriate for calibrating linear models in experimental
settings where factors are relatively unconstrained in the region of interest.
In some cases, however, models are necessarily nonlinear. In other cases,
certain treatments (combinations of factor levels) may be expensive or
infeasible to measure. D-optimal designs are model-specific designs that
address these limitations of traditional designs.
Function Description
candexch Uses a row-exchange algorithm to generate a D-optimal design
with a specified number of runs for a specified model and a
specified candidate set. This is the second component of the
algorithm used by rowexch.
candgen Generates a candidate set for a specified model. This is the
first component of the algorithm used by rowexch.
cordexch Uses a coordinate-exchange algorithm to generate a D-optimal
design with a specified number of runs for a specified model.
daugment Uses a coordinate-exchange algorithm to augment an existing
D-optimal design with additional runs to estimate additional
model terms.
dcovary Uses a coordinate-exchange algorithm to generate a D-optimal
design with fixed covariate factors.
rowexch Uses a row-exchange algorithm to generate a D-optimal design
with a specified number of runs for a specified model. The
algorithm calls candgen and then candexch. (Call candexch
separately to specify a candidate set.)
Note The Statistics Toolbox function rsmdemo generates simulated data for
experimental settings specified by either the user or by a D-optimal design
generated by cordexch. It uses the rstool interface to visualize response
surface models fit to the data, and it uses the nlintool interface to visualize
a nonlinear model fit to the data.
Both cordexch and rowexch use iterative search algorithms. They operate by
incrementally changing an initial design matrix X to increase D = |XᵀX| at
each step. In both algorithms, there is randomness built into the selection of
the initial design and into the choice of the incremental changes. As a result,
both algorithms may return locally, but not globally, D-optimal designs. Run
each algorithm multiple times and select the best result for your final design.
Both functions have a 'tries' parameter that automates this repetition
and comparison.
For example, suppose you want a design to estimate the parameters in the
following three-factor, seven-term interaction model:
y = β0 + β1x1 + β2x2 + β3x3 + β12x1x2 + β13x1x3 + β23x2x3 + ε
nfactors = 3;
nruns = 7;
[dCE,X] = cordexch(nfactors,nruns,'interaction','tries',10)
dCE =
-1 1 1
-1 -1 -1
1 1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 -1 1
X =
1 -1 1 1 -1 -1 1
1 -1 -1 -1 1 1 1
1 1 1 1 1 1 1
1 -1 1 -1 -1 1 -1
1 1 -1 1 -1 1 -1
1 1 -1 -1 -1 -1 1
1 -1 -1 1 1 -1 -1
Columns of the design matrix X are the model terms evaluated at each row of
the design dCE. The terms appear in order from left to right: the constant
term, the linear terms (x1, x2, x3), and the interaction terms (x1x2, x1x3,
x2x3). A similar design results from the row-exchange algorithm:
[dRE,X] = rowexch(nfactors,nruns,'interaction','tries',10)
dRE =
-1 -1 1
1 -1 1
1 -1 -1
1 1 1
-1 -1 -1
-1 1 -1
-1 1 1
X =
1 -1 -1 1 1 -1 -1
1 1 -1 1 -1 1 -1
1 1 -1 -1 -1 -1 1
1 1 1 1 1 1 1
1 -1 -1 -1 1 1 1
1 -1 1 -1 -1 1 -1
1 -1 1 1 -1 -1 1
For example, the following eight-run design is adequate for estimating main
effects in a four-factor model:
dCEmain = cordexch(4,8)
dCEmain =
1 -1 -1 1
-1 -1 1 1
-1 1 -1 1
1 1 1 -1
1 1 1 1
-1 1 -1 -1
1 -1 -1 -1
-1 -1 1 -1
To estimate the six interaction terms in the model, augment the design with
eight additional runs:
dCEinteraction = daugment(dCEmain,8,'interaction')
dCEinteraction =
1 -1 -1 1
-1 -1 1 1
-1 1 -1 1
1 1 1 -1
1 1 1 1
-1 1 -1 -1
1 -1 -1 -1
-1 -1 1 -1
-1 1 1 1
-1 -1 -1 -1
1 -1 1 -1
1 1 -1 1
-1 1 1 -1
1 1 -1 -1
1 -1 1 1
1 1 1 -1
The augmented design is full factorial, with the original eight runs in the
first eight rows.
Specifying Fixed Covariate Factors
For example, suppose you want an eight-run design for three two-level factors,
with a linear time trend included as a fixed covariate. Use dcovary:

time = linspace(-1,1,8)';
[dCV,X] = dcovary(3,time,'linear')
dCV =
-1.0000 1.0000 1.0000 -1.0000
1.0000 -1.0000 -1.0000 -0.7143
-1.0000 -1.0000 -1.0000 -0.4286
1.0000 -1.0000 1.0000 -0.1429
1.0000 1.0000 -1.0000 0.1429
-1.0000 1.0000 -1.0000 0.4286
1.0000 1.0000 1.0000 0.7143
-1.0000 -1.0000 1.0000 1.0000
X =
1.0000 -1.0000 1.0000 1.0000 -1.0000
1.0000 1.0000 -1.0000 -1.0000 -0.7143
1.0000 -1.0000 -1.0000 -1.0000 -0.4286
1.0000 1.0000 -1.0000 1.0000 -0.1429
The column vector time is a fixed factor, normalized to values between ±1.
The number of rows in the fixed factor specifies the number of runs in the
design. The resulting design dCV gives factor settings for the three controlled
model factors at each time.
For example, the following eight-run design is for a linear additive model with
five factors in which the final factor is categorical with three levels:
dCEcat = cordexch(5,8,'linear','categorical',5,'levels',3)
dCEcat =
-1 -1 1 1 2
-1 -1 -1 -1 3
1 1 1 1 3
1 1 -1 -1 2
1 -1 -1 1 3
-1 1 -1 1 1
-1 1 1 -1 3
1 -1 1 -1 1
For example, the following uses rowexch to generate a five-run design for
a two-factor pure quadratic model using a candidate set that is produced
internally:
dRE1 = rowexch(2,5,'purequadratic','tries',10)
dRE1 =
-1 1
0 0
1 -1
1 0
1 1
The same thing can be done using candgen and candexch in sequence:
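The calls themselves are not reproduced here. A sketch, in which the 'tries' value is an arbitrary choice:

[dC,C] = candgen(2,'purequadratic');    % candidate set and its design matrix
treatments = candexch(C,5,'tries',10);  % rows of a D-optimal 5-run subset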
dRE2 = dC(treatments,:) % Display design
dRE2 =
0 -1
-1 -1
-1 1
1 -1
-1 0
You can replace C in this example with a design matrix evaluated at your own
candidate set. For example, suppose your experiment is constrained so that
the two factors cannot have extreme settings simultaneously. The following
produces a restricted candidate set:
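One possible construction, assuming the candidate set dC from the candgen call sketched above; the constraint is an assumption chosen to exclude points where both factors are at extreme settings:

my_dC = dC(sum(abs(dC),2) < 2,:)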
Use the x2fx function to convert the candidate set to a design matrix:
my_C = x2fx(my_dC,'purequadratic')
my_C =
1 0 -1 0 1
1 -1 0 1 0
1 0 0 0 0
1 1 0 1 0
1 0 1 0 1
my_dRE = my_dC(my_treatments,:) % Display design
my_dRE =
-1 0
1 0
0 1
0 -1
0 0
16
Statistical Process Control
Introduction
Statistical process control (SPC) refers to a number of different methods for
monitoring and assessing the quality of manufactured goods. Combined
with methods from the Chapter 15, “Design of Experiments”, SPC is used in
programs that define, measure, analyze, improve, and control development
and production processes. These programs are often implemented using
“Design for Six Sigma” methodologies.
Control Charts
A control chart displays measurements of process samples over time. The
measurements are plotted together with user-defined specification limits and
process-defined control limits. The process can then be compared with its
specifications—to see if it is in control or out of control.
The chart is just a monitoring tool. Control activity might occur if the chart
indicates an undesirable, systematic change in the process. The control
chart is used to discover the variation, so that the process can be adjusted
to reduce it.
Control charts are created with the controlchart function. Any of the
following chart types may be specified:
• Xbar or mean
• Standard deviation
• Range
• Exponentially weighted moving average
• Individual observation
• Moving range of individual observations
• Moving average of individual observations
• Proportion defective
• Number of defectives
• Defects per unit
• Count of defects
For example, the following commands create an xbar chart, using the
“Western Electric 2” rule (2 of 3 points at least 2 standard errors above the
center line) to mark out of control measurements:
load parts;
st = controlchart(runout,'rules','we2');
x = st.mean;
cl = st.mu;
se = st.sigma./sqrt(st.n);
hold on
plot(cl+2*se,'m')
R = controlrules('we2',x,cl,se);
I = find(R)
I =
21
23
24
25
26
27
Capability Studies
Before going into production, many manufacturers run a capability study to
determine if their process will run within specifications enough of the time.
Capability indices produced by such a study are used to estimate expected
percentages of defective parts.
Capability studies are conducted with the capability function. The following
capability indices are produced:
• mu — Sample mean
• sigma — Sample standard deviation
• P — Estimated probability of being within the lower (L) and upper (U)
specification limits
• Pl — Estimated probability of being below L
• Pu — Estimated probability of being above U
• Cp — (U-L)/(6*sigma)
• Cpl — (mu-L)./(3.*sigma)
• Cpu — (U-mu)./(3.*sigma)
• Cpk — min(Cpl,Cpu)
For example, simulate a sample of 100 measurements from a process with mean 3
and standard deviation 0.005, and compute its capability indices relative to
specification limits of 2.99 and 3.01:

data = normrnd(3,0.005,100,1);
S = capability(data,[2.99 3.01])
S =
mu: 3.0006
sigma: 0.0047
P: 0.9669
Pl: 0.0116
Pu: 0.0215
Cp: 0.7156
Cpl: 0.7567
Cpu: 0.6744
Cpk: 0.6744
capaplot(data,[2.99 3.01]);
grid on
17
Parallel Statistics
Note To use parallel computing as described in this chapter, you must have
a Parallel Computing Toolbox™ license.
Quick Start Parallel Computing for Statistics Toolbox™
In this section...
“What Is Parallel Statistics Functionality?” on page 17-2
“How To Compute in Parallel” on page 17-3
“Example: Parallel Treebagger” on page 17-5
The following Statistics Toolbox functions have enhanced parallel computing
functionality:
• bootci
• bootstrp
• candexch
• cordexch
• crossval
• daugment
• dcovary
• jackknife
• nnmf
• plsregress
• rowexch
• sequentialfs
• TreeBagger
• TreeBagger.growTrees
This chapter gives the simplest way to use these enhanced functions in
parallel. For more advanced topics, including the issues of reproducibility and
nested parfor loops, see the other sections in this chapter.
To see a list of all the parallel-enhanced functions, enter:

help parallelstats

Open matlabpool
To run a statistical computation in parallel, first set up a parallel
environment by opening a pool of n workers:

matlabpool open n
Many parallel statistical functions call a function that can be one you define
in a file. For example, jackknife calls a function (jackfun) that can be a
built-in MATLAB function such as corr, but can also be a function you define.
Built-in functions are available to all workers. However, you must take extra
steps to enable workers to access a function file that you define.
To place a function file on the path of all workers, and check that it is
accessible:
1 At the command line, enter

addpath network_file_path

or, to run the command on all workers at once,

pctRunOnAll('addpath network_file_path')

2 Check whether the file is accessible to all workers:

pctRunOnAll('which filename')
Create an options structure with the UseParallel field set to 'always':

paroptions = statset('UseParallel','always');
After you have finished computing in parallel, close the parallel environment:
matlabpool close
Tip To save time, keep the pool open if you expect to compute in parallel
again soon.
1 Start a matlabpool with two workers:

matlabpool open 2

connected to 2 labs.

2 Create an options structure that requests parallel processing:

paroptions = statset('UseParallel','always');
3 Load the problem data and separate it into input and response:
load imports-85;
Y = X(:,1);
X = X(:,2:end);
4 Estimate feature importance using leaf size 1 and 1000 trees in parallel.
Time the function for comparison purposes:
tic
b = TreeBagger(1000,X,Y,'Method','r','OOBVarImp','on',...
'cat',16:25,'MinLeaf',1,'Options',paroptions);
toc
5 Rerun the computation in serial for comparison:

tic
b = TreeBagger(1000,X,Y,'Method','r','OOBVarImp','on',...
'cat',16:25,'MinLeaf',1); % No options gives serial
toc
Computing in parallel took less than 60% of the time of computing serially.
Concepts of Parallel Computing in Statistics Toolbox™
• Nested parallel evaluations (see “No Nested parfor Loops” on page 17-11).
Only the outermost parfor loop runs in parallel, the others run serially.
• Reproducible results when using random numbers (see “Reproducibility
in Parallel Statistical Computations” on page 17-13). How can you get
exactly the same results when repeatedly running a parallel computation
that uses random numbers?
When to Run Statistical Functions in Parallel
Working with parfor
(Figure: in a parfor loop, lines of code on the client execute top to bottom;
the body of parfor i = 1:n is distributed to workers 1 through n, and the
results are returned to the client.)
Characteristics of parfor
More caveats related to parfor appear in “Limitations” in the Parallel
Computing Toolbox documentation.
Suppose, for example, you want to apply jackknife to your function userfcn,
which calls parfor, and you want to call jackknife in a loop. The following
figure shows three cases:
Reproducibility in Parallel Statistical Computations
This section addresses the case when your function uses random numbers,
and you want reproducible results in parallel. This section also addresses the
case when you want the same results in parallel as in serial.
Running Reproducible Parallel Computations
To run a parallel computation reproducibly:

1 Create a random stream that supports substreams, and set the parallel
options to use it:

s = RandStream('mlfg6331_64');
options = statset('UseParallel','always', ...
    'Streams',s,'UseSubstreams','always');

2 Run your parallel computation. For instructions, see “Quick Start Parallel
Computing for Statistics Toolbox” on page 17-2.

3 To reproduce the computation, reset the stream, then call the function
again:

reset(s);
A substream is a segment of a random stream that begins M values after the
start of the previous substream. RandStream can go directly to the kMth
pseudorandom number in the stream and, from that point, generate the
subsequent entries. Currently, RandStream has M = 2^72, about 5e21, or more.

(Figure: substreams begin at offsets M, 2M, 3M, ... from the beginning of the
stream.)
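A sketch of jumping to a particular substream by setting the Substream property of a stream type that supports substreams (the substream index 3 is arbitrary):

s = RandStream('mlfg6331_64');
s.Substream = 3;      % jump to the start of the third substream
r = rand(s,1,5);      % generate from that point in the stream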
To obtain reproducible results, simply reset the stream, and all the
substreams generate identical random numbers when called again. This
method succeeds when all the workers use the same stream, and the stream
supports substreams. This concludes the discussion of how the procedure
in “Running Reproducible Parallel Computations” on page 17-13 gives
reproducible parallel results.
The following functions also use random numbers on the client:
• crossval
• plsregress
• sequentialfs
To obtain identical results, reset the random stream on the client, or the
random stream you pass to the client. For example:
s = RandStream.getDefaultStream;
reset(s)
% run the statistical function
reset(s)
% run the statistical function again, obtain identical results
While this method enables you to run reproducibly in parallel, the results can
differ from a serial computation. The reason for the difference is parfor loops
run in reverse order from for loops. Therefore, a serial computation can
generate random numbers in a different order than a parallel computation.
For unequivocal reproducibility, use the technique in “Running Reproducible
Parallel Computations” on page 17-13.
For example, to give each of four workers its own independent stream, create
the four streams at once:

s = RandStream.create('mrg32k3a','NumStreams',4,...
'CellOutput',true);
Pass these streams to a statistical function using the Streams option. For
example:
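The example code at this point is not reproduced here. A sketch, assuming myfun and x as defined in “Example: Parallel Bootstrap” on page 17-20:

opt = statset('UseParallel','always','Streams',s);
B = bootstrp(200,myfun,x,'Options',opt);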
See “Example: Parallel Bootstrap” on page 17-20 for a plot of the results
of this computation.
This method of distributing streams gives each worker a different stream for
the computation. However, it does not allow for a reproducible computation,
because the workers perform the 200 bootstraps in an unpredictable order. If
you want to perform a reproducible computation, use substreams as described
in “Running Reproducible Parallel Computations” on page 17-13.
If you set the UseSubstreams option to 'always', then set the Streams
option to a single random stream of the type that supports substreams
('mlfg6331_64' or 'mrg32k3a'). This setting gives reproducible
computations.
Examples of Parallel Statistical Functions
Example: Parallel Jackknife
This example runs jackknife in parallel on a variance estimate:

matlabpool open
opts = statset('UseParallel','always');
sigma = 5;
y = normrnd(0,sigma,100,1);
m = jackknife(@var, y,1,'Options',opts);
n = length(y);
bias = -sigma^2 / n % known bias formula
jbias = (n - 1)*(mean(m)-var(y,1)) % jackknife bias estimate
bias =
-0.2500
jbias =
-0.2698
jackknife does not use random numbers, so gives the same results every
time, whether run in parallel or serial.
Example: Parallel Cross Validation
This example runs crossval in parallel:

matlabpool open
opts = statset('UseParallel','always');
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf=@(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
cvMse = crossval('mse',X,y,'Predfun',regf,'Options',opts)
cvMse =
0.0999
To run the same cross validation reproducibly, use a random stream that
supports substreams:

matlabpool open
s = RandStream('mlfg6331_64');
opts = statset('UseParallel','always',...
'Streams',s,'UseSubstreams','always');
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf=@(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
cvMse = crossval('mse',X,y,'Predfun',regf,'Options',opts)
cvMse =
0.1020
reset(s)
cvMse = crossval('mse',X,y,'Predfun',regf,'Options',opts)
cvMse =
0.1020
Example: Parallel Bootstrap
This example bootstraps a kernel density estimate of a data sample x. First
compute the estimate on a lattice of points:

latt = -4:0.01:12;
myfun = @(X) ksdensity(X,latt);
pdfestimate = myfun(x);
3 Bootstrap the estimate to get a sense of its sampling variability. Run the
bootstrap in serial for timing comparison.
tic;B = bootstrp(200,myfun,x);toc
Open a matlabpool and run the bootstrap in parallel:

matlabpool open
Starting matlabpool using the 'local' configuration ...
connected to 2 labs.
opt = statset('UseParallel','always');
tic;B = bootstrp(200,myfun,x,'Options',opt);toc
Overlay the ksdensity density estimate with the 200 bootstrapped estimates
obtained in the parallel bootstrap. You can get a sense of how to assess the
accuracy of the density estimate from this plot.
hold on
for i=1:size(B,1),
plot(latt,B(i,:),'c:')
end
plot(latt,pdfestimate);
xlabel('x');ylabel('Density estimate')
18
Function Reference
File I/O
caseread Read case names from file
casewrite Write case names to file
tblread Read tabular data from file
tblwrite Write tabular data to file
tdfread Read tab-delimited file
xptread Create dataset array from data
stored in SAS XPORT format file
Data Organization
Categorical Arrays (p. 18-3)
Dataset Arrays (p. 18-6)
Grouped Data (p. 18-7)
Categorical Arrays
addlevels (categorical) Add levels to categorical array
cat (categorical) Concatenate categorical arrays
categorical Create categorical array
cellstr (categorical) Convert categorical array to cell
array of strings
char (categorical) Convert categorical array to
character array
circshift (categorical) Shift categorical array circularly
ctranspose (categorical) Transpose categorical matrix
double (categorical) Convert categorical array to double
array
droplevels (categorical) Drop levels
end (categorical) Last index in indexing expression for
categorical array
flipdim (categorical) Flip categorical array along specified
dimension
fliplr (categorical) Flip categorical matrix in left/right
direction
flipud (categorical) Flip categorical matrix in up/down
direction
getlabels (categorical) Access categorical array labels
getlevels (categorical) Get categorical array levels
Dataset Arrays
cat (dataset) Concatenate dataset arrays
cellstr (dataset) Create cell array of strings from
dataset array
dataset Construct dataset array
datasetfun (dataset) Apply function to dataset array
variables
double (dataset) Convert dataset variables to double
array
end (dataset) Last index in indexing expression for
dataset array
export (dataset) Write dataset array to file
get (dataset) Access dataset array properties
grpstats (dataset) Summary statistics by group for
dataset arrays
horzcat (dataset) Horizontal concatenation for dataset
arrays
isempty (dataset) True for empty dataset array
join (dataset) Merge observations
length (dataset) Length of dataset array
Grouped Data
gplotmatrix Matrix of scatter plots by group
grp2idx Create index vector from grouping
variable
grpstats Summary statistics by group
gscatter Scatter plot by group
Descriptive Statistics
Summaries (p. 18-8)
Measures of Central Tendency
(p. 18-8)
Measures of Dispersion (p. 18-8)
Measures of Shape (p. 18-9)
Statistics Resampling (p. 18-9)
Data with Missing Values (p. 18-9)
Data Correlation (p. 18-10)
Summaries
crosstab Cross-tabulation
grpstats Summary statistics by group
summary (categorical) Summary statistics for categorical
array
tabulate Frequency table
Measures of Dispersion
iqr Interquartile range
mad Mean or median absolute deviation
Measures of Shape
kurtosis Kurtosis
moment Central moments
prctile Calculate percentile values
quantile Quantiles
skewness Skewness
zscore Standardized z-scores
Statistics Resampling
bootci Bootstrap confidence interval
bootstrp Bootstrap sampling
jackknife Jackknife sampling
Data Correlation
canoncorr Canonical correlation
cholcov Cholesky-like covariance
decomposition
cophenet Cophenetic correlation coefficient
corr Linear or rank correlation
corrcov Convert covariance matrix to
correlation matrix
partialcorr Linear or rank partial correlation
coefficients
tiedrank Rank adjusted for ties
Statistical Visualization
Distribution Plots (p. 18-11)
Scatter Plots (p. 18-12)
ANOVA Plots (p. 18-12)
Regression Plots (p. 18-13)
Multivariate Plots (p. 18-13)
Cluster Plots (p. 18-13)
Classification Plots (p. 18-14)
DOE Plots (p. 18-14)
SPC Plots (p. 18-14)
Distribution Plots
boxplot Box plot
cdfplot Empirical cumulative distribution
function plot
dfittool Interactive distribution fitting
disttool Interactive density and distribution
plots
ecdfhist Empirical cumulative distribution
function histogram
fsurfht Interactive contour plot
hist3 Bivariate histogram
histfit Histogram with normal fit
normplot Normal probability plot
normspec Normal density plot between
specifications
pareto Pareto chart
probplot Probability plots
Scatter Plots
gline Interactively add line to plot
gname Add case names to plot
gplotmatrix Matrix of scatter plots by group
gscatter Scatter plot by group
lsline Add least-squares line to scatter plot
refcurve Add reference curve to plot
refline Add reference line to plot
scatterhist Scatter plot with marginal
histograms
ANOVA Plots
anova1 One-way analysis of variance
aoctool Interactive analysis of covariance
manovacluster Dendrogram of group mean clusters
following MANOVA
multcompare Multiple comparison test
Regression Plots
addedvarplot Added-variable plot
gline Interactively add line to plot
lsline Add least-squares line to scatter plot
polytool Interactive polynomial fitting
rcoplot Residual case order plot
refcurve Add reference curve to plot
refline Add reference line to plot
robustdemo Interactive robust regression
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
view (classregtree) Plot tree
Multivariate Plots
andrewsplot Andrews plot
biplot Biplot
glyphplot Glyph plot
parallelcoords Parallel coordinates plot
Cluster Plots
dendrogram Dendrogram plot
manovacluster Dendrogram of group mean clusters
following MANOVA
silhouette Silhouette plot
Classification Plots
perfcurve Compute Receiver Operating
Characteristic (ROC) curve or other
performance curve for classifier
output
view (classregtree) Plot tree
DOE Plots
interactionplot Interaction plot for grouped data
maineffectsplot Main effects plot for grouped data
multivarichart Multivari chart for grouped data
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
SPC Plots
capaplot Process capability plot
controlchart Shewhart control charts
histfit Histogram with normal fit
normspec Normal density plot between
specifications
Probability Distributions
Distribution Objects (p. 18-15)
Distribution Plots (p. 18-16)
Probability Density (p. 18-17)
Cumulative Distribution (p. 18-19)
Inverse Cumulative Distribution
(p. 18-21)
Distribution Statistics (p. 18-23)
Distribution Fitting (p. 18-24)
Negative Log-Likelihood (p. 18-26)
Random Number Generators
(p. 18-26)
Quasi-Random Numbers (p. 18-28)
Piecewise Distributions (p. 18-29)
Distribution Objects
cdf (ProbDist) Return cumulative distribution
function (CDF) for ProbDist object
fitdist Fit probability distribution to data
icdf (ProbDistUnivKernel) Return inverse cumulative
distribution function (ICDF) for
ProbDistUnivKernel object
icdf (ProbDistUnivParam) Return inverse cumulative
distribution function (ICDF) for
ProbDistUnivParam object
iqr (ProbDistUnivKernel) Return interquartile range (IQR) for
ProbDistUnivKernel object
iqr (ProbDistUnivParam) Return interquartile range (IQR) for
ProbDistUnivParam object
Distribution Plots
boxplot Box plot
cdfplot Empirical cumulative distribution
function plot
dfittool Interactive distribution fitting
disttool Interactive density and distribution
plots
ecdfhist Empirical cumulative distribution
function histogram
Probability Density
betapdf Beta probability density function
binopdf Binomial probability density
function
chi2pdf Chi-square probability density
function
copulapdf Copula probability density function
disttool Interactive density and distribution
plots
evpdf Extreme value probability density
function
exppdf Exponential probability density
function
Cumulative Distribution
betacdf Beta cumulative distribution
function
binocdf Binomial cumulative distribution
function
cdf Cumulative distribution functions
cdf (gmdistribution) Cumulative distribution function for
Gaussian mixture distribution
cdf (piecewisedistribution) Cumulative distribution function for
piecewise distribution
cdfplot Empirical cumulative distribution
function plot
chi2cdf Chi-square cumulative distribution
function
copulacdf Copula cumulative distribution
function
Distribution Statistics
betastat Beta mean and variance
binostat Binomial mean and variance
chi2stat Chi-square mean and variance
copulastat Copula rank correlation
evstat Extreme value mean and variance
expstat Exponential mean and variance
fstat F mean and variance
gamstat Gamma mean and variance
geostat Geometric mean and variance
gevstat Generalized extreme value mean
and variance
gpstat Generalized Pareto mean and
variance
hygestat Hypergeometric mean and variance
lognstat Lognormal mean and variance
nbinstat Negative binomial mean and
variance
ncfstat Noncentral F mean and variance
nctstat Noncentral t mean and variance
ncx2stat Noncentral chi-square mean and
variance
Distribution Fitting
Supported Distributions (p. 18-24)
Piecewise Distributions (p. 18-25)
Supported Distributions
Piecewise Distributions
Negative Log-Likelihood
betalike Beta negative log-likelihood
evlike Extreme value negative
log-likelihood
explike Exponential negative log-likelihood
gamlike Gamma negative log-likelihood
gevlike Generalized extreme value negative
log-likelihood
gplike Generalized Pareto negative
log-likelihood
lognlike Lognormal negative log-likelihood
mvregresslike Negative log-likelihood for
multivariate regression
normlike Normal negative log-likelihood
wbllike Weibull negative log-likelihood
Quasi-Random Numbers
addlistener (qrandstream) Add listener for event
delete (qrandstream) Delete handle object
end (qrandset) Last index in indexing expression for
point set
eq (qrandstream) Test handle equality
findobj (qrandstream) Find objects matching specified
conditions
findprop (qrandstream) Find property of MATLAB handle
object
ge (qrandstream) Greater than or equal relation for
handles
gt (qrandstream) Greater than relation for handles
haltonset Construct Halton quasi-random
point set
isvalid (qrandstream) Test handle validity
le (qrandstream) Less than or equal relation for
handles
Piecewise Distributions
boundary (piecewisedistribution) Piecewise distribution boundaries
cdf (piecewisedistribution) Cumulative distribution function for
piecewise distribution
icdf (piecewisedistribution) Inverse cumulative distribution
function for piecewise distribution
lowerparams (paretotails) Lower Pareto tails parameters
nsegments (piecewisedistribution) Number of segments
paretotails Construct Pareto tails object
Hypothesis Tests
Hypothesis Tests
ansaribradley Ansari-Bradley test
barttest Bartlett’s test
canoncorr Canonical correlation
chi2gof Chi-square goodness-of-fit test
dwtest Durbin-Watson test
friedman Friedman’s test
jbtest Jarque-Bera test
kruskalwallis Kruskal-Wallis test
kstest One-sample Kolmogorov-Smirnov
test
kstest2 Two-sample Kolmogorov-Smirnov
test
lillietest Lilliefors test
linhyptest Linear hypothesis test
ranksum Wilcoxon rank sum test
runstest Run test for randomness
sampsizepwr Sample size and power of test
signrank Wilcoxon signed rank test
signtest Sign test
ttest One-sample and paired-sample t-test
ttest2 Two-sample t-test
vartest Chi-square variance test
vartest2 Two-sample F-test for equal
variances
vartestn Bartlett multiple-sample test for
equal variances
Analysis of Variance
ANOVA Plots (p. 18-32)
ANOVA Operations (p. 18-32)
ANOVA Plots
anova1 One-way analysis of variance
aoctool Interactive analysis of covariance
manovacluster Dendrogram of group mean clusters
following MANOVA
multcompare Multiple comparison test
ANOVA Operations
anova1 One-way analysis of variance
anova2 Two-way analysis of variance
anovan N-way analysis of variance
aoctool Interactive analysis of covariance
dummyvar Create dummy variables
friedman Friedman’s test
kruskalwallis Kruskal-Wallis test
manova1 One-way multivariate analysis of
variance
Parametric Regression Analysis
Regression Plots
addedvarplot Added-variable plot
gline Interactively add line to plot
lsline Add least-squares line to scatter plot
polytool Interactive polynomial fitting
rcoplot Residual case order plot
refcurve Add reference curve to plot
refline Add reference line to plot
robustdemo Interactive robust regression
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
view (classregtree) Plot tree
Linear Regression
coxphfit Cox proportional hazards regression
dummyvar Create dummy variables
glmfit Generalized linear model regression
glmval Generalized linear model values
invpred Inverse prediction
leverage Leverage
mnrfit Multinomial logistic regression
mnrval Multinomial logistic regression
values
mvregress Multivariate linear regression
mvregresslike Negative log-likelihood for
multivariate regression
plsregress Partial least-squares regression
polyconf Polynomial confidence intervals
polytool Interactive polynomial fitting
regress Multiple linear regression
regstats Regression diagnostics
ridge Ridge regression
robustdemo Interactive robust regression
robustfit Robust regression
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
stepwise Interactive stepwise regression
stepwisefit Stepwise regression
x2fx Convert predictor matrix to design
matrix
Nonlinear Regression
dummyvar Create dummy variables
hougen Hougen-Watson model
nlinfit Nonlinear regression
nlintool Interactive nonlinear regression
nlmefit Nonlinear mixed-effects estimation
nlmefitsa Fit nonlinear mixed effects model
with stochastic EM algorithm
nlparci Nonlinear regression parameter
confidence intervals
nlpredci Nonlinear regression prediction
confidence intervals
Multivariate Methods
Multivariate Plots (p. 18-36)
Multidimensional Scaling (p. 18-36)
Procrustes Analysis (p. 18-36)
Feature Selection (p. 18-37)
Feature Transformation (p. 18-37)
Multivariate Plots
andrewsplot Andrews plot
biplot Biplot
glyphplot Glyph plot
parallelcoords Parallel coordinates plot
Multidimensional Scaling
cmdscale Classical multidimensional scaling
mahal Mahalanobis distance
mdscale Nonclassical multidimensional
scaling
pdist Pairwise distance between pairs of
objects
squareform Format distance matrix
Procrustes Analysis
procrustes Procrustes analysis
Feature Selection
sequentialfs Sequential feature selection
Feature Transformation
Nonnegative Matrix Factorization
(p. 18-37)
Principal Component Analysis
(p. 18-37)
Factor Analysis (p. 18-37)
Factor Analysis
Cluster Analysis
Cluster Plots (p. 18-38)
Hierarchical Clustering (p. 18-38)
K-Means Clustering (p. 18-39)
Gaussian Mixture Models (p. 18-39)
Cluster Plots
dendrogram Dendrogram plot
manovacluster Dendrogram of group mean clusters
following MANOVA
silhouette Silhouette plot
Hierarchical Clustering
cluster Construct agglomerative clusters
from linkages
clusterdata Agglomerative clusters from data
cophenet Cophenetic correlation coefficient
inconsistent Inconsistency coefficient
linkage Agglomerative hierarchical cluster
tree
pdist Pairwise distance between pairs of
objects
squareform Format distance matrix
K-Means Clustering
kmeans K-means clustering
mahal Mahalanobis distance
Model Assessment
confusionmat Confusion matrix
crossval Loss estimate using cross-validation
cvpartition Create cross-validation partition for
data
repartition (cvpartition) Repartition data for cross-validation
Parametric Classification
Classification Plots (p. 18-40)
Discriminant Analysis (p. 18-40)
Naive Bayes Classification (p. 18-40)
Distance Computation and Nearest
Neighbor Search (p. 18-41)
Classification Plots
perfcurve Compute Receiver Operating
Characteristic (ROC) curve or other
performance curve for classifier
output
view (classregtree) Plot tree
Discriminant Analysis
classify Discriminant analysis
mahal Mahalanobis distance
Supervised Learning
Classification Trees
catsplit (classregtree) Categorical splits used for branches
in decision tree
children (classregtree) Child nodes
classcount (classregtree) Class counts
ClassificationPartitionedModel Cross-validated classification model
ClassificationTree Binary decision tree for classification
classname (classregtree) Class names for classification
decision tree
classprob (classregtree) Class probabilities
classregtree Construct classification and
regression trees
classregtree Classification and regression trees
compact (ClassificationTree) Compact tree
CompactClassificationTree Compact classification tree
crossval (ClassificationTree) Cross-validated decision tree
cutcategories (classregtree) Cut categories
cutpoint (classregtree) Decision tree cut point values
cuttype (classregtree) Cut types
cutvar (classregtree) Cut variable names
cvloss (ClassificationTree) Classification error by cross
validation
edge (CompactClassificationTree) Classification edge
eval (classregtree) Predicted responses
fit (ClassificationTree) Fit classification tree
isbranch (classregtree) Test node for branch
Regression Trees
catsplit (classregtree) Categorical splits used for branches
in decision tree
children (classregtree) Child nodes
classregtree Construct classification and
regression trees
classregtree Classification and regression trees
compact (RegressionTree) Compact regression tree
CompactRegressionTree Compact regression tree
crossval (RegressionTree) Cross-validated decision tree
cutcategories (classregtree) Cut categories
cutpoint (classregtree) Decision tree cut point values
cuttype (classregtree) Cut types
cutvar (classregtree) Cut variable names
cvloss (RegressionTree) Regression error by cross validation
eval (classregtree) Predicted responses
fit (RegressionTree) Binary decision tree for regression
isbranch (classregtree) Test node for branch
kfoldfun (RegressionPartitionedModel) Cross validate function
kfoldLoss (RegressionPartitionedModel) Cross-validation loss of partitioned
regression model
kfoldPredict (RegressionPartitionedModel) Predict response for observations
not used for training
loss (CompactRegressionTree) Regression error
meansurrvarassoc (classregtree) Mean predictive measure of
association for surrogate splits in
decision tree
Hidden Markov Models
Design of Experiments
DOE Plots (p. 18-54)
Full Factorial Designs (p. 18-54)
Fractional Factorial Designs
(p. 18-55)
Response Surface Designs (p. 18-55)
D-Optimal Designs (p. 18-55)
Latin Hypercube Designs (p. 18-55)
Quasi-Random Designs (p. 18-56)
DOE Plots
interactionplot Interaction plot for grouped data
maineffectsplot Main effects plot for grouped data
multivarichart Multivari chart for grouped data
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
D-Optimal Designs
candexch Candidate set row exchange
candgen Candidate set generation
cordexch Coordinate exchange
daugment D-optimal augmentation
dcovary D-optimal design with fixed
covariates
rowexch Row exchange
rsmdemo Interactive response surface
demonstration
Quasi-Random Designs
addlistener (qrandstream) Add listener for event
delete (qrandstream) Delete handle object
end (qrandset) Last index in indexing expression for
point set
eq (qrandstream) Test handle equality
findobj (qrandstream) Find objects matching specified
conditions
findprop (qrandstream) Find property of MATLAB handle
object
ge (qrandstream) Greater than or equal relation for
handles
gt (qrandstream) Greater than relation for handles
haltonset Construct Halton quasi-random
point set
isvalid (qrandstream) Test handle validity
le (qrandstream) Less than or equal relation for
handles
length (qrandset) Length of point set
lt (qrandstream) Less than relation for handles
ndims (qrandset) Number of dimensions in matrix
ne (qrandstream) Not equal relation for handles
net (qrandset) Generate quasi-random point set
notify (qrandstream) Notify listeners of event
qrand (qrandstream) Generate quasi-random points from
stream
qrandset Abstract quasi-random point set
class
qrandstream Construct quasi-random number
stream
Statistical Process Control
SPC Plots
capaplot Process capability plot
controlchart Shewhart control charts
histfit Histogram with normal fit
normspec Normal density plot between
specifications
SPC Functions
capability Process capability indices
controlrules Western Electric and Nelson control
rules
gagerr Gage repeatability and
reproducibility study
GUIs
aoctool Interactive analysis of covariance
dfittool Interactive distribution fitting
disttool Interactive density and distribution
plots
fsurfht Interactive contour plot
polytool Interactive polynomial fitting
randtool Interactive random number
generation
regstats Regression diagnostics
robustdemo Interactive robust regression
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
surfht Interactive contour plot
Utilities
combnk Enumeration of combinations
perms Enumeration of permutations
statget Access values in statistics options
structure
statset Create statistics options structure
zscore Standardized z-scores
19
Class Reference
Data Organization
In this section...
“Categorical Arrays” on page 19-2
“Dataset Arrays” on page 19-2
Categorical Arrays
categorical Arrays for categorical data
nominal Arrays for nominal categorical data
ordinal Arrays for ordinal categorical data
Dataset Arrays
dataset Arrays for statistical data
Probability Distributions
In this section...
“Distribution Objects” on page 19-3
“Quasi-Random Numbers” on page 19-3
“Piecewise Distributions” on page 19-4
Distribution Objects
ProbDist Object representing probability
distribution
ProbDistKernel Object representing nonparametric
probability distribution defined by
kernel smoothing
ProbDistParametric Object representing parametric
probability distribution
ProbDistUnivKernel Object representing univariate
kernel probability distribution
ProbDistUnivParam Object representing univariate
parametric probability distribution
Quasi-Random Numbers
haltonset Halton quasi-random point sets
qrandset Quasi-random point sets
qrandstream Quasi-random number streams
sobolset Sobol quasi-random point sets
Piecewise Distributions
paretotails Empirical distributions with Pareto
tails
piecewisedistribution Piecewise-defined distributions
Model Assessment
cvpartition Data partitions for cross-validation
Parametric Classification
In this section...
“Naive Bayes Classification” on page 19-5
“Distance Classifiers” on page 19-5
Distance Classifiers
ExhaustiveSearcher Nearest neighbors search using
exhaustive search
KDTreeSearcher Nearest neighbors search using
kd-tree
NeighborSearcher Nearest neighbor search object
Supervised Learning
In this section...
“Classification Trees” on page 19-6
“Classification Ensemble Classes” on page 19-6
“Regression Trees” on page 19-6
“Regression Ensemble Classes” on page 19-7
Classification Trees
ClassificationPartitionedModel Cross-validated classification model
ClassificationTree Binary decision tree for classification
classregtree Classification and regression trees
CompactClassificationTree Compact classification tree
Regression Trees
classregtree Classification and regression trees
CompactRegressionTree Compact regression tree
RegressionPartitionedModel Cross-validated regression model
RegressionTree Regression tree
20 Functions — Alphabetical List
addedvarplot
Syntax addedvarplot(X,y,num,inmodel)
addedvarplot(X,y,num,inmodel,stats)
Examples Load the data in hald.mat, which contains observations of the heat of
reaction of various cement mixtures:
load hald
whos
Name Size Bytes Class Attributes
The wide scatter and the low slope of the fitted line are evidence against
the statistical significance of adding the third column to the model.
categorical.addlevels
Syntax B = addlevels(A,newlevels)
Examples Example 1
Add levels for additional species in Fisher’s iris data:
load fisheriris
species = nominal(species,...
{'Species1','Species2','Species3'},...
{'setosa','versicolor','virginica'});
species = addlevels(species,{'Species4','Species5'});
getlabels(species)
ans =
'Species1' 'Species2' 'Species3' 'Species4' 'Species5'
Example 2
1 Load patient data from the CSV file hospital.dat and store the
information in a dataset array with observation names given by the
first column in the data (patient identification):
patients = dataset('file','hospital.dat',...
'delimiter',',',...
'ReadObsNames',true);
2 Make the {0,1}-valued variable smoke nominal, and change the labels
to 'No' and 'Yes':
patients.smoke = nominal(patients.smoke,{'No','Yes'});
3 Add levels to smoke for more detailed categories of smoking history:
patients.smoke = addlevels(patients.smoke,...
{'0-5 Years','5-10 Years','LongTerm'});
4 Assuming the nonsmokers have never smoked, relabel the 'No' level:
patients.smoke = setlabels(patients.smoke,'Never','No');
5 Drop the 'Yes' level:
patients.smoke = droplevels(patients.smoke,'Yes');
qrandstream.addlistener
Syntax el = addlistener(hsource,'eventname',callback)
el = addlistener(hsource,property,'eventname',callback)
gmdistribution.AIC property
andrewsplot
Syntax andrewsplot(X)
andrewsplot(X,...,'Standardize',standopt)
andrewsplot(X,...,'Quantile',alpha)
andrewsplot(X,...,'Group',group)
andrewsplot(X,...,'PropName',PropVal,...)
h = andrewsplot(X,...)
load fisheriris
andrewsplot(meas,'group',species)
andrewsplot(meas,'group',species,'quantile',.25)
anova1
Syntax p = anova1(X)
p = anova1(X,group)
p = anova1(X,group,displayopt)
[p,table] = anova1(...)
[p,table,stats] = anova1(...)
4 The mean squares (MS) for each source, which is the ratio SS/df.
The box plot of the columns of X suggests the size of the F-statistic and
the p value. Large differences in the center lines of the boxes correspond
to large values of F and correspondingly small values of p.
anova1 treats NaN values as missing, and disregards them.
p = anova1(X,group) performs ANOVA by group. For more
information on grouping variables, see “Grouped Data” on page 2-34.
If X is a matrix, anova1 treats each column as a separate group, and
evaluates whether the population means of the columns are equal. This
form of anova1 is appropriate when each group has the same number of
elements (balanced ANOVA). group can be a character array or a cell
array of strings, with one row per column of X, containing group names.
Enter an empty array ([]) or omit this argument if you do not want to
specify group names.
If X is a vector, group must be a categorical variable, vector, string
array, or cell array of strings with one name for each element of X. X
values corresponding to the same value of group are placed in the same
group. This form of anova1 is appropriate when groups have different
numbers of elements (unbalanced ANOVA).
If group contains empty or NaN-valued cells or strings, the corresponding
observations in X are disregarded.
p = anova1(X,group,displayopt) enables the ANOVA table and box
plot displays when displayopt is 'on' (default) and suppresses the
displays when displayopt is 'off'. Notches in the boxplot provide a
test of group medians (see boxplot) different from the F test for means
in the ANOVA table.
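For example, a minimal sketch that computes the p value without either display (using the matrix form of X and no group names):

p = anova1(X,[],'off');   % no ANOVA table figure and no box plot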
[p,table] = anova1(...) returns the ANOVA table (including
column and row labels) in the cell array table. Copy a text version of
the ANOVA table to the clipboard using the Copy Text item on the
Edit menu.
[p,table,stats] = anova1(...) returns a structure stats used
to perform a follow-up multiple comparison test. anova1 evaluates
the hypothesis that the samples all have the same mean against the
alternative that the means are not all the same. Sometimes it is
preferable to perform a test to determine which pairs of means are
significantly different, and which are not. Use the multcompare function
to perform such tests by supplying the stats structure as input.
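As a sketch of that follow-up workflow (reusing a matrix X as in the examples below):

[p,table,stats] = anova1(X);
c = multcompare(stats);   % each row of c compares one pair of group means
                          % and gives a confidence interval for their difference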
Assumptions
The ANOVA test makes the following assumptions about the data in X:
• All sample populations are normally distributed.
• All sample populations have equal variance.
• All observations are mutually independent.
Examples Example 1
Create X with columns that are constants plus random normal
disturbances with mean zero and standard deviation one:
X = meshgrid(1:5)
X =
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
X = X + normrnd(0,1,5,5)
X =
1.3550 2.0662 2.4688 5.9447 5.4897
p = anova1(X)
p =
7.9370e-006
Example 2
The following example is from a study of the strength of structural
beams in Hogg. The vector strength measures deflections of beams in
thousandths of an inch under 3,000 pounds of force. The vector alloy
identifies each beam as steel ('st'), alloy 1 ('al1'), or alloy 2 ('al2').
(Although alloy is sorted in this example, grouping variables do not
need to be sorted.) The null hypothesis is that steel beams are equal in
strength to beams made of the two more expensive alloys.
78 75 76 77 79 79 77 78 82 79];
alloy = {'st','st','st','st','st','st','st','st',...
'al1','al1','al1','al1','al1','al1',...
'al2','al2','al2','al2','al2','al2'};
p = anova1(strength,alloy)
p =
1.5264e-004
The p value suggests rejection of the null hypothesis. The box plot
shows that steel beams deflect more than beams made of the more
expensive alloys.
References [1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
anova2
Syntax p = anova2(X,reps)
p = anova2(X,reps,displayopt)
[p,table] = anova2(...)
[p,table,stats] = anova2(...)
$$
\begin{bmatrix}
x_{111} & x_{121} \\
x_{112} & x_{122} \\
x_{211} & x_{221} \\
x_{212} & x_{222} \\
x_{311} & x_{321} \\
x_{312} & x_{322}
\end{bmatrix}
$$

Here the two columns correspond to the levels A = 1 and A = 2 of factor A, and each consecutive pair of rows corresponds to one level of factor B (B = 1, B = 2, and B = 3), so there are two replicates per combination of factor levels.
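For instance, a minimal sketch of this layout with made-up values (reps = 2, because each combination of factor levels contributes two consecutive rows):

X = [64 82
     66 80
     71 79
     73 77
     68 75
     70 74];           % made-up data arranged as shown above
p = anova2(X,2);       % two replicates per cell

The returned vector p then contains: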
1 The p value for the null hypothesis, H0A, that all samples from factor
A (i.e., all column-samples in X) are drawn from the same population
2 The p value for the null hypothesis, H0B, that all samples from factor
B (i.e., all row-samples in X) are drawn from the same population
3 The p value for the null hypothesis, H0AB, that the effects due to
factors A and B are additive (i.e., that there is no interaction between
factors A and B)
If any p value is near zero, this casts doubt on the associated null
hypothesis. A sufficiently small p value for H0A suggests that at least
one column-sample mean is significantly different from the other
column-sample means; i.e., there is a main effect due to factor A. A
sufficiently small p value for H0B suggests that at least one row-sample
mean is significantly different from the other row-sample means; i.e.,
there is a main effect due to factor B. A sufficiently small p value for
H0AB suggests that there is an interaction between factors A and B.
The choice of a limit for the p value to determine whether a result
is “statistically significant” is left to the researcher. It is common to
declare a result significant if the p value is less than 0.05 or 0.01.
anova2 also displays a figure showing the standard ANOVA table,
which divides the variability of the data in X into three or four parts
depending on the value of reps:
• The fourth shows the Mean Squares (MS), which is the ratio SS/df.
• The fifth shows the F statistics, which is the ratio of the mean
squares.
Examples The data below come from a study of popcorn brands and popper type
(Hogg 1987). The columns of the matrix popcorn are brands (Gourmet,
National, and Generic). The rows are popper type (Oil and Air). The
study popped a batch of each brand three times with each popper. The
values are the yield in cups of popped popcorn.
load popcorn
popcorn
popcorn =
5.5000 4.5000 3.5000
5.5000 4.5000 4.0000
6.0000 4.0000 3.0000
6.5000 5.0000 4.0000
7.0000 5.5000 5.0000
p = anova2(popcorn,3)
p =
0.0000 0.0001 0.7462
The vector p shows the p-values for the three brands of popcorn, 0.0000,
the two popper types, 0.0001, and the interaction between brand and
popper type, 0.7462. These values indicate that both popcorn brand and
popper type affect the yield of popcorn, but there is no evidence of a
synergistic (interaction) effect of the two.
The conclusion is that you can get the greatest yield using the Gourmet
brand and an Air popper (the three values popcorn(4:6,1)).
References [1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
anovan
Syntax p = anovan(y,group)
p = anovan(y,group,param,val)
[p,table] = anovan(y,group,param,val)
[p,table,stats] = anovan(y,group,param,val)
[p,table,stats,terms] = anovan(y,group,param,val)
Parameter Value
'alpha' A number between 0 and 1 requesting 100(1 –
alpha)% confidence bounds (default 0.05 for 95%
confidence)
'continuous' A vector of indices indicating which grouping
variables should be treated as continuous predictors
rather than as categorical predictors.
'display' 'on' displays an ANOVA table (the default)
'off' omits the display
'model' The type of model used. See “Model Type” on page
20-26 for a description of this parameter.
'nested' A matrix M of 0’s and 1’s specifying the nesting
relationships among the grouping variables. M(i,j) is
1 if variable i is nested in variable j.
'random' A vector of indices indicating which grouping
variables are random effects (all are fixed by default).
See “ANOVA with Random Effects” on page 8-19 for
an example of how to use 'random'.
'sstype' 1, 2, 3 (default), or h specifies the type of sum of
squares. See “Sum of Squares” on page 20-27 for a
description of this parameter.
'varnames' A character matrix or a cell array of strings specifying
names of grouping variables, one per grouping
variable. When you do not specify 'varnames', the
default labels 'X1', 'X2', 'X3', ..., 'XN' are used.
See “ANOVA with Random Effects” on page 8-19 for
an example of how to use 'varnames'.
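The sketch below (with hypothetical grouping variables g1 and g2 and made-up factor names) shows how several of these parameters combine in a single call:

% Hypothetical design: operators are nested within plants,
% and the operator effect is treated as random
M = [0 0
     1 0];                          % M(2,1) = 1: variable 2 (Operator) is nested in variable 1 (Plant)
p = anovan(y,{g1 g2},'nested',M,...
    'random',2,...
    'varnames',{'Plant','Operator'});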
Model Type
This section explains how to use the argument 'model' with the syntax:
[...] = anovan(y,group,'model',modeltype)
The argument modeltype, which specifies the type of model the function
uses, can be any one of the following:
• 'linear' — The default 'linear' model computes p-values for the null
hypotheses on the N main effects only.
• 'interaction' — The 'interaction' model computes p-values for null
hypotheses on the N main effects and the $\binom{N}{2}$ two-factor
interactions.
• 'full' — The 'full' model computes the p-values for null
hypotheses on the N main effects and interactions at all levels.
• An integer — For an integer value of modeltype, k (k ≤ N),
anovan computes all interaction levels through the kth level. For
example, the value 3 means main effects plus two- and three-factor
interactions. The values k = 1 and k = 2 are equivalent to the
'linear' and 'interaction' specifications, respectively, while the
value k = N is equivalent to the 'full' specification.
• A matrix of term definitions having the same form as the input to the
x2fx function. All entries must be 0 or 1 (no higher powers).
For more precise control over the main and interaction terms that
anovan computes, modeltype can specify a matrix containing one row
for each main or interaction term to include in the ANOVA model. Each
row defines one term using a vector of N zeros and ones. The table
below illustrates the coding for a 3-factor ANOVA.
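For example (a sketch with hypothetical grouping variables g1, g2, and g3), the following term matrix requests the three main effects plus only the interaction between the first two factors:

terms = [1 0 0             % main effect of the first grouping variable
         0 1 0             % main effect of the second grouping variable
         0 0 1             % main effect of the third grouping variable
         1 1 0];           % two-factor interaction between the first two
p = anovan(y,{g1 g2 g3},'model',terms);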
Sum of Squares
This section explains how to use the argument 'sstype' with the
syntax:
[...] = anovan(y,group,'sstype',type)
This syntax computes the ANOVA using the type of sum of squares
specified by type, which can be 1, 2, 3, or h. While the numbers 1 – 3
designate Type 1, Type 2, or Type 3 sum of squares, respectively, h
represents a hierarchical model similar to type 2, but with continuous
as well as categorical factors used to determine the hierarchy of
terms. The default value is 3. For a model containing main effects
but no interactions, the value of type only influences computations
on unbalanced data.
The sum of squares for any term is determined by comparing two
models. The Type 1 sum of squares for a term is the reduction in
residual sum of squares obtained by adding that term to a fit that
already includes the terms listed before it. The Type 2 sum of squares is
the reduction in residual sum of squares obtained by adding that term to
a model consisting of all other terms that do not contain the term in
question. The Type 3 sum of squares is the reduction in residual sum of
squares obtained by adding that term to a model containing all other
terms, but with their effects constrained to obey the sigma restrictions
that make models estimable.
The models for Type 3 sum of squares have sigma restrictions imposed.
This means, for example, that in fitting R(B, AB), the array of AB
effects is constrained to sum to 0 over A for each value of B, and over B
for each value of A.
This defines a three-way ANOVA with two levels of each factor. Every
observation in y is identified by a combination of factor levels. If the
factors are A, B, and C, then observation y(1) is associated with
• Level 1 of factor A
• Level 'hi' of factor B
• Level 'may' of factor C
Similarly, another observation in y is associated with
• Level 2 of factor A
• Level 'hi' of factor B
• Level 'june' of factor C
p = anovan(y,{g1 g2 g3})
p =
0.4174
0.0028
0.9140
Output vector p contains p-values for the null hypotheses on the N main
effects. Element p(1) contains the p value for the null hypothesis,
H0A, that samples at all levels of factor A are drawn from the same
population; element p(2) contains the p value for the null hypothesis,
H0B, that samples at all levels of factor B are drawn from the same
population; and so on.
If any p value is near zero, this casts doubt on the associated null
hypothesis. For example, a sufficiently small p value for H0A suggests
that at least one A-sample mean is significantly different from the other
A-sample means; that is, there is a main effect due to factor A. You
need to choose a bound for the p value to determine whether a result is
statistically significant. It is common to declare a result significant if
the p value is less than 0.05 or 0.01.
Two-Factor Interactions
By default, anovan computes p-values just for the three main effects.
To also compute p-values for the two-factor interactions X1*X2, X1*X3,
and X2*X3, specify the 'interaction' model:
p = anovan(y,{g1 g2 g3},'model','interaction')
p =
0.0347
0.0048
0.2578
0.0158
0.1444
0.5000
The first three entries of p are the p-values for the main effects. The
last three entries are the p-values for the two-factor interactions. You
can determine the order in which the two-factor interactions occur from
the ANOVAN table shown in the following figure.
The stats structure contains the following fields:
Field Description
coeffs Estimated coefficients
coeffnames Name of term for each coefficient
vars Matrix of grouping variable values for each term
resid Residuals from the fitted model
The stats structure also contains the following fields if there are
random effects:
Field Description
ems Expected mean squares
denom Denominator definition
rtnames Names of random terms
varest Variance component estimates (one per random term)
varci Confidence intervals for variance components
Examples “Example: Two-Way ANOVA” on page 8-10 shows how to use anova2 to
analyze the effects of two factors on a response in a balanced design.
For a design that is not balanced, use anovan instead.
The data in carbig.mat gives measurements on 406 cars. Use anovan
to study how the mileage depends on where and when the cars were
made:
load carbig
p = anovan(MPG,{org when},'model',2,'sstype',3,...
'varnames',{'Origin';'Mfg date'})
p =
0
0
0.3059
The p value for the interaction term is not small, indicating little
evidence that the effect of the year of manufacture (when) depends on
where the car was made (org). The linear effects of those two factors,
however, are significant.
References [1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
ansaribradley
Syntax h = ansaribradley(x,y)
h = ansaribradley(x,y,alpha)
h = ansaribradley(x,y,alpha,tail)
[h,p] = ansaribradley(...)
[h,p,stats] = ansaribradley(...)
[...] = ansaribradley(x,y,alpha,tail,exact)
[...] = ansaribradley(x,y,alpha,tail,exact,dim)
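As a minimal sketch of the two-sample form (with made-up samples), the test compares the dispersions of x and y:

x = randn(100,1);              % made-up sample
y = 1.5*randn(100,1);          % made-up sample with larger spread
[h,p] = ansaribradley(x,y);    % h = 1 rejects equal dispersions at the default 5% level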