
Financial Econometrics Notes

Kevin Sheppard
University of Oxford

November 14, 2012


This version: 11:51, November 14, 2012

©2005-2012 Kevin Sheppard


Contents

1 Probability, Random Variables and Expectations
  1.1 Axiomatic Probability
  1.2 Univariate Random Variables
  1.3 Multivariate Random Variables
  1.4 Expectations and Moments

2 Estimation, Inference and Hypothesis Testing
  2.1 Estimation
  2.2 Convergence and Limits for Random Variables
  2.3 Properties of Estimators
  2.4 Distribution Theory
  2.5 Hypothesis Testing
  2.6 The Bootstrap and Monte Carlo
  2.7 Inference on Financial Data

3 Analysis of Cross-Sectional Data
  3.1 Model Description
  3.2 Functional Form
  3.3 Estimation
  3.4 Assessing Fit
  3.5 Assumptions
  3.6 Small Sample Properties of OLS estimators
  3.7 Maximum Likelihood
  3.8 Small Sample Hypothesis Testing
  3.9 Large Sample Assumption
  3.10 Large Sample Properties
  3.11 Large Sample Hypothesis Testing
  3.12 Violations of the Large Sample Assumptions
  3.13 Model Selection and Specification Checking
  3.14 Projection
  3.A Selected Proofs

4 Analysis of a Single Time Series
  4.1 Stochastic Processes
  4.2 Stationarity, Ergodicity and the Information Set
  4.3 ARMA Models
  4.4 Difference Equations
  4.5 Data and Initial Estimates
  4.6 Autocorrelations and Partial Autocorrelations
  4.7 Estimation
  4.8 Inference
  4.9 Forecasting
  4.10 Nonstationary Time Series
  4.11 Nonlinear Models for Time-Series Analysis
  4.12 Filters
  4.A Computing Autocovariance and Autocorrelations

5 Analysis of Multiple Time Series
  5.1 Vector Autoregressions
  5.2 Companion Form
  5.3 Empirical Examples
  5.4 VAR forecasting
  5.5 Estimation and Identification
  5.6 Granger causality
  5.7 Impulse Response Function
  5.8 Cointegration
  5.9 Cross-sectional Regression with Time-series Data
  5.A Cointegration in a trivariate VAR

6 Generalized Method Of Moments (GMM)
  6.1 Classical Method of Moments
  6.2 Examples
  6.3 General Specification
  6.4 Estimation
  6.5 Asymptotic Properties
  6.6 Covariance Estimation
  6.7 Special Cases of GMM
  6.8 Diagnostics
  6.9 Parameter Inference
  6.10 Two-Stage Estimation
  6.11 Weak Identification
  6.12 Considerations for using GMM

7 Univariate Volatility Modeling
  7.1 Why does volatility change?
  7.2 ARCH Models
  7.3 Forecasting Volatility
  7.4 Realized Variance
  7.5 Implied Volatility and VIX
  7.A Kurtosis of an ARCH(1)
  7.B Kurtosis of a GARCH(1,1)

8 Value-at-Risk, Expected Shortfall and Density Forecasting
  8.1 Defining Risk
  8.2 Value-at-Risk (VaR)
  8.3 Expected Shortfall
  8.4 Density Forecasting
  8.5 Coherent Risk Measures

9 Multivariate Volatility, Dependence and Copulas
  9.1 Introduction
  9.2 Preliminaries
  9.3 Simple Models of Multivariate Volatility
  9.4 Multivariate ARCH Models
  9.5 Realized Covariance
  9.6 Measuring Dependence
  9.7 Copulas
  9.A Bootstrap Standard Errors
List of Figures

1.1 Set Operations
1.2 Bernoulli Random Variables
1.3 Normal PDF and CDF
1.4 Poisson and χ² distributions
1.5 Bernoulli Random Variables
1.6 Joint and Conditional Distributions
1.7 Joint distribution of the FTSE 100 and S&P 500
1.8 Simulation and Numerical Integration
1.9 Modes
2.1 Convergence in Distribution
2.2 Consistency and Central Limits
2.3 Central Limit Approximations
2.4 Data Generating Process and Asymptotic Covariance of Estimators
2.5 Power
2.6 Standard Normal CDF and Empirical CDF
2.7 CRSP Value Weighted Market (VWM) Excess Returns
3.1 Rejection regions of a t10
3.2 Bivariate F distributions
3.3 Rejection region of a F5,30 distribution
3.4 Location of the three test statistics
3.5 Effect of correlation on the variance of β̂
3.6 Gains of using GLS
3.7 Neglected Nonlinearity and Residual Plots
3.8 Rolling Parameter Estimates in the 4-Factor Model
3.9 Recursive Parameter Estimates in the 4-Factor Model
3.10 Influential Observations
3.11 Correct and Incorrect use of “Robust” Estimators
3.12 Weights of an S&P 500 Tracking Portfolio
4.1 Dynamics of linear difference equations
4.2 Stationarity of an AR(2)
4.3 VWM and Default Spread
4.4 ACF and PACF for ARMA Processes
4.5 ACF and PACF for ARMA Processes
4.6 Autocorrelations and Partial Autocorrelations for the VWM and the Default Spread
4.7 M1, M1 growth, and the ACF and PACF of M1 growth
4.8 Time Trend Models of GDP
4.9 Unit Root Analysis of ln CPI and the Default Spread
4.10 Ideal Filters
4.11 Actual Filters
4.12 Cyclical Component of U.S. Real GDP
4.13 Markov Switching Processes
4.14 Self Exciting Threshold Autoregression Processes
4.15 Exercise 4.9
5.1 Comparing forecasts from a VAR(1) and an AR(1)
5.2 ACF and CCF
5.3 Impulse Response Functions
5.4 Cointegration
5.5 Detrended CAY Residuals
5.6 Impulse Response of Level-Slope-Curvature
6.1 2-Step GMM Objective Function Surface
7.1 Returns of the S&P 500 and IBM
7.2 Squared returns of the S&P 500 and IBM
7.3 Absolute returns of the S&P 500 and IBM
7.4 News impact curves
7.5 Various estimated densities for the S&P 500
7.6 Effect of distribution on volatility estimates
7.7 ACF and PACF of S&P 500 squared returns
7.8 ACF and PACF of IBM squared returns
7.9 Realized Variance and sampling frequency
7.10 RV AC1 and sampling frequency
7.11 Volatility Signature Plot for SPDR RV
7.12 Black-Scholes Implied Volatility
7.13 VIX and alternative measures of volatility
8.1 Graphical representation of Value-at-Risk
8.2 Estimated % VaR for the S&P 500
8.3 S&P 500 Returns and a Parametric Density
8.4 Empirical and Smoothed empirical CDF
8.5 Naïve and Correct Density Forecasts
8.6 Fan plot
8.7 QQ plot
8.8 Kolmogorov-Smirnov plot
8.9 Returns, Historical Simulation VaR and Normal GARCH VaR
9.1 Lag weights in RiskMetrics methodologies
9.2 Rolling Window Correlation Measures
9.3 Observable and Principal Component Correlation Measures
9.4 Volatility from Multivariate Models
9.5 Small Cap - Large Cap Correlation
9.6 Small Cap - Long Government Bond Correlation
9.7 Large Cap - Bond Correlation
9.8 Symmetric and Asymmetric Dependence
9.9 Rolling Dependence Measures
9.10 Exceedance Correlation
9.11 Copula Distributions and Densities
9.12 Copula Densities with Standard Normal Margins
9.13 S&P 500 - FTSE 100 Diagnostics
9.14 S&P 500 and FTSE 100 Exceedance Correlations
List of Tables

1.1 Monte Carlo and Numerical Integration
2.1 Parameter Values of Mixed Normals
2.2 Outcome matrix for a hypothesis test
2.3 Inference on the Market Premium
2.4 Inference on the Market Premium
2.5 Comparing the Variance of the NASDAQ and S&P 100
2.6 Comparing the Variance of the NASDAQ and S&P 100
2.7 Wald, LR and LM Tests
3.1 Fama-French Data Description
3.2 Descriptive Statistics of the Fama-French Data Set
3.3 Regression Coefficient on the Fama-French Data Set
3.4 Centered and Uncentered R² and R̄²
3.5 Centered and Uncentered R² and R̄² with Regressor Changes
3.6 t-stats for the Big-High Portfolio
3.7 Likelihood Ratio Tests on the Big-High Portfolio
3.8 Comparison of Small- and Large-Sample t-Statistics
3.9 Comparison of Small- and Large-Sample Wald, LR and LM Statistics
3.10 OLS and GLS Parameter Estimates and t-stats
4.1 Estimates from Time-Series Models
4.2 ACF and PACF for ARMA processes
4.3 Seasonal Model Estimates
4.4 Unit Root Analysis of ln CPI
5.1 Parameter estimates from Campbell’s VAR
5.2 AIC and SBIC in Campbell’s VAR
5.3 Granger Causality
5.4 Johansen Methodology
5.5 Unit Root Tests
6.1 Parameter Estimates from a Consumption-Based Asset Pricing Model
6.2 Stochastic Volatility Model Parameter Estimates
6.3 Effect of Covariance Estimator on GMM Estimates
6.4 Stochastic Volatility Model Monte Carlo
6.5 Tests of a Linear Factor Model
6.6 Fama-MacBeth Inference
7.1 Summary statistics for the S&P 500 and IBM
7.2 Parameter estimates from ARCH-family models
7.3 Bollerslev-Wooldridge Covariance estimates
7.4 GARCH-in-mean estimates
7.5 Model selection for the S&P 500
7.6 Model selection for IBM
8.1 Estimated model parameters and quantiles
8.2 Unconditional VaR of the S&P 500
9.1 Principal Component Analysis of the S&P 500
9.2 Correlation Measures for the S&P 500
9.3 CCC GARCH Correlation
9.4 Multivariate GARCH Model Estimates
9.5 Refresh-time sampling
9.6 Dependence Measures for Weekly FTSE and S&P 500 Returns
9.7 Copula Tail Dependence
9.8 Unconditional Copula Estimates
9.9 Conditional Copula Estimates
Chapter 1

Probability, Random Variables and Expectations

Note: The primary reference for these notes is Mittelhammer (1999). Other treatments of
probability theory include Gallant (1997), Casella & Berger (2001) and Grimmett & Stirzaker
(2001).

This chapter provides an overview of probability theory as it applies to both discrete and continuous random variables. The material covered in this chapter serves as a foundation for the econometric sequence and is useful throughout financial economics. The chapter begins with a discussion of the axiomatic foundations of probability theory, and then proceeds to describe properties of univariate random variables. Attention then turns to multivariate random variables and the important differences from standard univariate random variables. Finally, the chapter discusses the expectations operator and moments.

1.1 Axiomatic Probability

Probability theory is derived from a small set of axioms – a minimal set of essential assump-
tions. A deep understanding of axiomatic probability theory is not essential to financial
econometrics or to the use of probability and statistics in general, although understanding
these core concepts does provide additional insight.
The first concept in probability theory is the sample space, which is an abstract concept
containing primitive probability events.

Definition 1.1 (Sample Space). The sample space is a set, Ω, that contains all possible out-
comes.

Example 1.2. Suppose interest is in a standard 6-sided die. The sample space is 1-dot,
2-dots, . . ., 6-dots.

Example 1.3. Suppose interest is in a standard 52-card deck. The sample space is then A♣,
2♣, 3♣, . . . , J ♣, Q ♣, K ♣, A♦, . . . , K ♦, A♥, . . . , K ♥, A♠, . . . , K ♠.

Example 1.4. Suppose interest is in the logarithmic stock return, defined as rt = ln Pt − ln Pt−1. The sample space is then R, the real line.

The next item of interest is an event.

Definition 1.5 (Event). An event, ω, is a subset of the sample space Ω.

Events are typically any subsets of the sample space Ω (including the entire sample space), and the set of all events is known as the event space.

Definition 1.6 (Event Space). The set of all events in the sample space Ω is called the event
space, and is denoted F.

Event spaces are a somewhat more difficult concept. For finite event spaces, the event space is usually the power set of the outcomes – that is, the set of all possible unique sets that can be constructed from the elements. When variables can take infinitely many outcomes, a more nuanced definition is needed, although one natural choice is the set of all (small) intervals (so that each interval has infinitely many points in it).

Example 1.7. Suppose interest lies in the outcome of a coin flip. Then the sample space is
{H , T } and the event space is {∅, {H } , {T } , {H , T }} where ∅ is the empty set.

The first two axioms of probability state that all probabilities are non-negative and pro-
vide a normalization.

Axiom 1.8. For any event ω ∈ F,


Pr (ω) ≥ 0. (1.1)

Axiom 1.9. The probability of all events in the sample space Ω is unity, i.e.

Pr (Ω) = 1. (1.2)

The second axiom is a normalization that states that the probability of the entire sample space is 1 and ensures that the sample space must contain all events that may occur. Pr (·) is a set function – that is, Pr (ω) returns the probability, a number between 0 and 1, of observing an event ω.
Before proceeding, it is useful to refresh three concepts from set theory.

Definition 1.10 (Set Union). Let A and B be two sets, then the union is defined A ∪ B =
{x : x ∈ A or x ∈ B }.

A union of two sets contains all elements that are in either set.

Definition 1.11 (Set Intersection). Let A and B be two sets, then the intersection is defined
A ∩ B = {x : x ∈ A and x ∈ B }.

Figure 1.1: The four set definitions shown in R². The upper left panel shows a set and its complement. The upper right shows two disjoint sets. The lower left shows the intersection of two sets (darkened region) and the lower right shows the union of two sets (darkened region). In all diagrams, the outer box represents the entire space.

The intersection contains only the elements that are in both sets.

Definition 1.12 (Set Complement). Let A be a set, then the complement set, denoted A^c, is defined as A^c = {x : x ∉ A}.

The complement of a set contains all elements which are not contained in the set.

Definition 1.13 (Disjoint Sets). Let A and B be sets, then A and B are disjoint if and only if
A ∩ B = ∅.

Figure 1.1 provides a graphical representation of the four set operations in a 2-dimensional
space.
The third and final axiom states that probability is additive when sets have no overlap.

Axiom 1.14. Let {A_i }, i = 1, 2, . . . be a countably infinite¹ (or finite) set of disjoint events. Then

Pr ( ⋃_{i=1}^∞ A_i ) = ∑_{i=1}^∞ Pr (A_i ) . (1.3)

¹ Definition 1.15. A set S is countably infinite if there exists a bijective (one-to-one) function from the elements of S to the natural numbers N = {1, 2, . . .}.

Assembling a sample space, event space and a probability measure into a set produces
what is known as a probability space. Throughout the course, and in virtually all statistics,
a complete probability space is assumed (typically without explicitly stating this assump-
tion).2

Definition 1.16 (Probability Space). A probability space is denoted using the tuple (Ω, F, Pr)
where Ω is the sample space, F is the event space and Pr is the probability set function
which has domain ω ∈ F.

The three axioms of modern probability are very powerful, and a large number of theorems can be proven using only these axioms. A few simple examples are provided, and selected proofs appear in the Appendix.

Theorem 1.17. Let A be an event in the sample space Ω, and let A^c be the complement of A so that Ω = A ∪ A^c. Then Pr (A) = 1 − Pr (A^c).

Since A and A^c are disjoint, and by definition A^c contains everything not in A, the probabilities of the two must sum to unity.

Theorem 1.18. Let A and B be events in the sample space Ω. Then Pr (A ∪ B ) = Pr (A) + Pr (B ) − Pr (A ∩ B ).

This theorem shows that for any two sets, the probability of the union of the two sets is equal to the sum of the probabilities of the two sets minus the probability of the intersection of the sets.

1.1.1 Conditional Probability


Conditional probability extends the basic concepts of probability to the case where interest
lies in the probability of one event conditional on the occurrence of another event.

Definition 1.19 (Conditional Probability). Let A and B be two events in the sample space
Ω. If Pr (B ) 6= 0, then the conditional probability of the event A, given event B , is given by

Pr (A|B ) = Pr (A ∩ B ) / Pr (B ) . (1.4)

The definition of conditional probability is intuitive. The probability of observing an event in set A, given that an event in set B has occurred, is the probability of observing an event in the intersection of the two sets, normalized by the probability of observing an event in set B.
² A probability space is complete if and only if, whenever B ∈ F with Pr (B ) = 0 and A ⊂ B, then A ∈ F. This essentially ensures that probability can be assigned to any event.

Example 1.20. In the example of rolling a die, suppose A = {1, 3, 5} is the event that the
outcome is odd and B = {1, 2, 3} is the event that the outcome of the roll is less than 4.
Then the conditional probability of A given B is
Pr ({1, 3}) / Pr ({1, 2, 3}) = (2/6) / (3/6) = 2/3

since the intersection of A and B is {1, 3}.

The axioms can be restated in terms of conditional probability, where the sample space
is now the events in the set B .

1.1.2 Independence
Independence is an important concept that is frequently encountered. In essence, independence means that knowing whether an event occurred in one set provides no information about whether an event occurs in another set.

Definition 1.21. Let A and B be two events in the sample space Ω. Then A and B are independent if and only if

Pr (A ∩ B ) = Pr (A) Pr (B ) . (1.5)

The notation A ⊥⊥ B is commonly used to indicate that A and B are independent.

One immediate implication of the definition of independence is that when A and B are independent, the conditional probability of one given the other is the same as the unconditional probability – Pr (A|B ) = Pr (A).


1.1.3 Bayes Rule


Bayes rule is frequently encountered in both statistics (known as Bayesian statistics) and in financial models where agents learn about their environment. Bayes rule follows as a corollary to a theorem which states that the total probability of a set A equals the sum, over a collection of disjoint sets B_i which span the sample space, of the conditional probabilities of A given B_i weighted by Pr (B_i ).

Theorem 1.22. Let B_i, i = 1, 2, . . . be a countably infinite (or finite) partition of the sample space Ω so that B_j ∩ B_k = ∅ for j ≠ k and ⋃_{i=1}^∞ B_i = Ω. Let Pr (B_i ) > 0 for all i; then for any set A,

Pr (A) = ∑_{i=1}^∞ Pr (A|B_i ) Pr (B_i ) . (1.6)

Bayes rule is a restatement of the previous theorem, and states that the probability of observing an event in B_j, given that an event in A is observed, can be related to the conditional probability of A given B_j.

Corollary 1.23 (Bayes Rule). Let B_i, i = 1, 2, . . . be a countably infinite (or finite) partition of the sample space Ω so that B_j ∩ B_k = ∅ for j ≠ k and ⋃_{i=1}^∞ B_i = Ω. Let Pr (B_i ) > 0 for all i; then for any set A where Pr (A) > 0,

Pr (B_j |A) = Pr (A|B_j ) Pr (B_j ) / ∑_{i=1}^∞ Pr (A|B_i ) Pr (B_i )
           = Pr (A|B_j ) Pr (B_j ) / Pr (A) .

An immediate consequence of the definition of conditional probability is that Pr (A ∩ B ) = Pr (A|B ) Pr (B ), which is referred to as the multiplication rule. Also notice that the “order” is arbitrary, so that the rule can also be given as Pr (A ∩ B ) = Pr (B |A) Pr (A). Combining these two (as long as Pr (A) > 0),

Pr (A|B ) Pr (B ) = Pr (B |A) Pr (A)
⇒ Pr (B |A) = Pr (A|B ) Pr (B ) / Pr (A) . (1.7)

Example 1.24. Suppose a family has 2 children and one is a boy, and suppose that the probability of having a child of either sex is equal and independent across children. What is the probability that they have 2 boys?

Before learning that one child is a boy, there are 4 equally probable possibilities, {B, B}, {B, G}, {G, B} and {G, G}. Using Bayes rule,

Pr ({B, B}|B ≥ 1) = Pr (B ≥ 1|{B, B}) × Pr ({B, B}) / ∑_{S ∈ {{B,B},{B,G},{G,B},{G,G}}} Pr (B ≥ 1|S ) Pr (S )
                  = (1 × 1/4) / (1 × 1/4 + 1 × 1/4 + 1 × 1/4 + 0 × 1/4)
                  = 1/3

so that knowing one child is a boy increases the probability of 2 boys from 1/4 to 1/3. Note that

∑_{S ∈ {{B,B},{B,G},{G,B},{G,G}}} Pr (B ≥ 1|S ) Pr (S ) = Pr (B ≥ 1) .
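This calculation can be confirmed by direct enumeration of the four equally likely families. The following is a minimal Python sketch using only the standard library; it is an illustration, not a required part of the argument.

from itertools import product

# The four equally likely sex configurations for two children
families = list(product("BG", repeat=2))              # ('B','B'), ('B','G'), ('G','B'), ('G','G')

at_least_one_boy = [f for f in families if "B" in f]  # condition on at least one boy
two_boys = [f for f in at_least_one_boy if f == ("B", "B")]

# Pr(2 boys | at least 1 boy) = 1/3
print(len(two_boys) / len(at_least_one_boy))          # 0.3333...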

Example 1.25. The famous Monty Hall Let’s Make a Deal television program provides an example of Bayes rule. In the game show, there were three prizes, a large one (e.g. a car) and two uninteresting ones (duds). The prizes were hidden behind doors numbered 1, 2 and 3. Ex ante, the contestant has no information about which door has the large prize, and so the initial probabilities are all 1/3. During the negotiations with the host, it is revealed that one of the non-selected doors does not contain the large prize. The host then gives the contestant the chance to switch to the door they didn’t choose. For example, suppose the contestant chose door 1 initially, and that the host revealed that the large prize is not behind door 3. The contestant then has the chance to choose door 2 or to stay with door 1. In this example, B is the event where the contestant chooses the door which hides the large prize, and A is the event that the large prize is not behind door 3.

Initially there are three equally likely outcomes (from the contestant’s point of view), where D indicates dud and L indicates large, and the order corresponds to the door number:

{D, D, L}, {D, L, D}, {L, D, D}
The contestant has a 1/3 chance of having the large prize behind door 1. The host will never remove the large prize, and so applying Bayes rule we have

Pr (L = 2|H = 3, S = 1) = Pr (H = 3|S = 1, L = 2) × Pr (L = 2|S = 1) / ∑_{i=1}^{3} Pr (H = 3|S = 1, L = i ) × Pr (L = i |S = 1)
                        = (1 × 1/3) / (1/2 × 1/3 + 1 × 1/3 + 0 × 1/3)
                        = (1/3) / (1/2)
                        = 2/3,

where H is the door the host reveals, S is the initial door selected, and L is the door containing the large prize. This shows that the probability that the large prize is behind door 2, given that the player initially selected door 1 and the host revealed door 3, can be computed using Bayes rule.
Pr (H = 3|S = 1, L = 2) is the probability that the host shows door 3 given the contestant selected door 1 and the large prize is behind door 2, which always happens. Pr (L = 2|S = 1) is the probability that the large prize is behind door 2 given the contestant selected door 1, which is 1/3. Pr (H = 3|S = 1, L = 1) is the probability that the host reveals door 3 given that door 1 was selected and contained the large prize, which is 1/2, and Pr (H = 3|S = 1, L = 3) is the probability that the host reveals door 3 given door 3 contains the prize, which never happens.
Bayes rule shows that it is optimal to switch doors since when the host opens a door, it reveals information about the location of the large prize. Essentially, the two doors not selected have probability 2/3 before the doors are opened, and opening one assigns all of this probability to the door not opened.
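The 2/3 answer can also be checked by simulation. The short Python sketch below, using only the standard library, plays the game repeatedly and records how often switching and staying win; it is an illustration under the game rules described above.

import random

def play(switch):
    doors = [1, 2, 3]
    prize = random.choice(doors)
    selected = random.choice(doors)
    # The host opens a door that is neither the selected door nor the prize door
    opened = random.choice([d for d in doors if d != selected and d != prize])
    if switch:
        # Switch to the remaining unopened, unselected door
        selected = next(d for d in doors if d != selected and d != opened)
    return selected == prize

n = 100_000
print(sum(play(True) for _ in range(n)) / n)   # approximately 2/3 when switching
print(sum(play(False) for _ in range(n)) / n)  # approximately 1/3 when staying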

1.2 Univariate Random Variables

Studying the behavior of random variables, and more importantly functions of random
variables (i.e. statistics) is essential for both the theory and practice of financial economet-

rics. This section covers univariate random variables, and the discussion of multivariate
random variables is reserved for a later section.
The previous discussion of probability is set based and so includes objects which cannot
be described as random variables, which are a limited (but highly useful) sub-class of all
objects which can be described using probability theory. The primary characteristic of a
random variable is that it takes values on the real line.

Definition 1.26 (Random Variable). Let (Ω, F, P ) be a probability space. If X : Ω → R is a real-valued function having as its domain the elements of Ω, then X is called a random variable.

A random variable is essentially a function which takes ω ∈ Ω as an input and produces
a value x ∈ R, where R is the symbol for the real line. Random variables come in one of
three forms: discrete, continuous and mixed. Random variables which mix discrete and
continuous distributions are generally less important in financial economics and so here
the focus is on discrete and continuous random variables.

Definition 1.27 (Discrete Random Variable). A random variable is called discrete if its range
consists of a countable (possibly infinite) number of elements.
While discrete random variables are less useful than continuous random variables, they
are still very common.

Example 1.28. A random variable which takes on values in {0, 1} is known as a Bernoulli
random variable, and is the simplest non-degenerate random variable (see Section 1.2.3.1).3
Bernoulli random variables are often used to model “success” or “failure”, where success is
loosely defined – a large negative return, the existence of a bull market or a corporate de-
fault.

The distinguishing characteristic of a discrete random variable is not that it takes only
finitely many values, but that the values it takes are distinct in that it is possible to fit a small
interval around each point.

Example 1.29. Poisson random variables take values in {0, 1, 2, 3, . . .} (an infinite range),
and are commonly used to model hazard rates (i.e. the number of occurrences of an event
in an interval). They are especially useful in modeling trading activity (see Section 1.2.3.2).

1.2.1 Mass, Density and Distribution Functions


Discrete random variables are characterized by a probability mass function (pmf) which
gives the probability of observing a particular value of the random variable.

Definition 1.30 (Probability Mass Function). The probability mass function, f , for a discrete random variable X is defined as f (x ) = Pr (X = x ) for all x ∈ R (X ), and f (x ) = 0 for all x ∉ R (X ), where R (X ) is the range of X (i.e. the values for which X is defined).
3
A degenerate random variable always takes the same value, and so is not meaningfully random.

Figure 1.2: These four charts show examples of Bernoulli random variables using returns on the FTSE 100 and S&P 500. In the top two, a success was defined as a positive return. In the bottom two, a success was a return above -1% (weekly) or -4% (monthly).

Example 1.31. The probability mass function of a Bernoulli random variable takes the form

f (x ; p ) = p^x (1 − p )^{1−x}

where p ∈ [0, 1] is the probability of success.

Figure 1.2 contains a few examples of Bernoulli pmfs using data from the FTSE 100 and S&P 500 over the period 1984–2012. Both weekly returns (using Friday-to-Friday prices) and monthly returns were constructed. Log returns, rt = ln (Pt /Pt−1 ), were used in both examples. Two of the pmfs defined success as the return being positive. The other two define the probability of success as a return larger than -1% (weekly) or larger than -4% (monthly). These show that the probability of a positive return is much larger for monthly horizons than for weekly.

Example 1.32. The probability mass function of a Poisson random variable is

f (x ; λ) = (λ^x / x !) exp (−λ)

where λ ∈ [0, ∞) determines the intensity of arrival (the average value of the random variable).
The pmf of the Poisson distribution can be evaluated for every integer x ≥ 0, which is the support of a Poisson random variable. Figure 1.4 shows the empirical distribution, tabulated using a histogram, of the time elapsed for .1% of the daily volume to trade in the S&P 500 tracking ETF SPY on May 31, 2012. This data series is a good candidate for modeling using a Poisson distribution.
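As a simple illustration of how these mass functions are evaluated, the Python sketch below computes the Bernoulli and Poisson pmfs directly from the formulas above; the parameter values p = 0.55 and λ = 2.5 are arbitrary choices for illustration.

from math import exp, factorial

def bernoulli_pmf(x, p):
    # f(x; p) = p^x (1 - p)^(1 - x) for x in {0, 1}
    return p ** x * (1 - p) ** (1 - x)

def poisson_pmf(x, lam):
    # f(x; lambda) = (lambda^x / x!) exp(-lambda) for x in {0, 1, 2, ...}
    return lam ** x / factorial(x) * exp(-lam)

p, lam = 0.55, 2.5                                       # illustrative parameter values
print([bernoulli_pmf(x, p) for x in (0, 1)])             # [0.45, 0.55]
print([round(poisson_pmf(x, lam), 4) for x in range(6)])
print(sum(poisson_pmf(x, lam) for x in range(100)))      # approximately 1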
Continuous random variables, on the other hand, take a continuum of values – techni-
cally an uncountable infinity of values.
Definition 1.33 (Continuous Random Variable). A random variable is called continuous if its range is uncountably infinite and there exists a non-negative-valued function f (x ) defined for all x ∈ (−∞, ∞) such that for any event B ⊂ R (X ), Pr (B ) = ∫_{x ∈B} f (x ) dx and f (x ) = 0 for all x ∉ R (X ), where R (X ) is the range of X (i.e. the values for which X is defined).
The pmf of a discrete random variable is replaced with the probability density function
(pdf) for continuous random variables. This change in naming reflects that the probability
of a single point of a continuous random variable is 0, although the probability of observing
a value inside an arbitrarily small interval in R (X ) is not.
Definition 1.34 (Probability Density Function). For a continuous random variable, the func-
tion f is called the probability density function (pdf).
Before providing some examples of pdfs, it is useful to characterize the properties that
any pdf should have.
Definition 1.35 (Continuous Density Function Characterization). A function f : R → R is a member of the class of continuous density functions if and only if f (x ) ≥ 0 for all x ∈ (−∞, ∞) and ∫_{−∞}^{∞} f (x ) dx = 1.
There are two essential properties. First, the function is non-negative, which follows from the axiomatic definition of probability, and second, the function integrates to 1, so that the total probability under the function is 1. This may seem like a limitation, but it is only a normalization since any integrable function can always be normalized so that it integrates to 1.
Example 1.36. A simple continuous random variable can be defined on [0, 1] using the probability density function

f (x ) = 12 (x − 1/2)^2

and figure 1.3 contains a plot of the pdf.


This simple pdf has peaks near 0 and 1 and a trough at 1/2. More realistic pdfs allow for
values in (−∞, ∞), such as in the density of a normal random variable.
Example 1.37. The pdf of a normal random variable with parameters µ and σ² is given by

f (x ) = (1/√(2πσ²)) exp ( −(x − µ)² / (2σ²) ) , (1.8)

and N (µ, σ²) is used as a shorthand notation. When µ = 0 and σ² = 1, the distribution is known as a standard normal. Figure 1.3 contains a plot of the standard normal pdf along with two other parameterizations.
For large values of x (in the absolute sense), the pdf takes very small values, and it peaks at x = 0 with a value of 0.3989. The shape of the normal distribution is that of a bell (and it is occasionally referred to as a bell curve).
A closely related function to the pdf is the cumulative distribution function, which re-
turns the total probability of observing a value of the random variable less than its input.
Definition 1.38 (Cumulative Distribution Function). The cumulative distribution function
(cdf) for a random variable X is defined as F (c ) = Pr (x ≤ c ) for all c ∈ (−∞, ∞).
Cumulative distribution functions are available for both discrete and continuous ran-
dom variables, and particularly simple for discrete random variables.
Definition 1.39 (Discrete CDF). When X is a discrete random variable, the cdf is

F (x ) = ∑_{s ≤ x} f (s ) (1.9)

for x ∈ (−∞, ∞).


Example 1.40. The cdf of a Bernoulli is

F (x ; p ) = 0 if x < 0
F (x ; p ) = 1 − p if 0 ≤ x < 1
F (x ; p ) = 1 if x ≥ 1 .

The Bernoulli cdf is simple since the underlying random variable only takes 2 values. The cdf of a Poisson random variable is also just the sum of the probability mass function over all values less than or equal to the function’s argument.
Example 1.41. The cdf of a Poisson(λ) random variable is given by

F (x ; λ) = exp (−λ) ∑_{i=0}^{⌊x⌋} λ^i / i ! , x ≥ 0,

where ⌊·⌋ returns the largest integer smaller than or equal to the input (the floor operator).

Continuous cdfs operate much like discrete cdfs, only the summation is replaced by an integral since there is a continuum of values possible for X.

Definition 1.42 (Continuous CDF). When X is a continuous random variable, the cdf is

F (x ) = ∫_{−∞}^{x} f (s ) ds (1.10)

for x ∈ (−∞, ∞).

The integral simply computes the total area under the pdf starting from −∞ up to x.
Example 1.43. The cdf of the random variable with pdf given by 12 (x − 1/2)^2 is

F (x ) = 4x^3 − 6x^2 + 3x ,

and figure 1.3 contains a plot of this cdf.

This cdf is simply the integral of the pdf, and checking shows that F (0) = 0, F (1/2) = 1/2 (since it is symmetric around 1/2) and F (1) = 1, which must be 1 since the random variable is only defined on [0, 1].
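These values are easy to verify numerically. The sketch below (Python, assuming scipy is available for numerical integration) integrates the pdf from 0 to a few values of x and compares the results with the closed-form cdf.

from scipy.integrate import quad

pdf = lambda x: 12.0 * (x - 0.5) ** 2
cdf = lambda x: 4.0 * x**3 - 6.0 * x**2 + 3.0 * x

for x in (0.0, 0.25, 0.5, 1.0):
    numeric, _ = quad(pdf, 0.0, x)   # integrate the pdf from 0 to x
    print(x, numeric, cdf(x))        # the two columns agree; F(0)=0, F(1/2)=0.5, F(1)=1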

Example 1.44. The cdf of a normally distributed random variable with parameters µ and σ² is given by

F (x ) = ∫_{−∞}^{x} (1/√(2πσ²)) exp ( −(s − µ)² / (2σ²) ) ds . (1.11)

Figure 1.3 contains a plot of the standard normal cdf along with two other parameterizations.

In the case of a standard normal random variable, the cdf is not available in closed form,
and so when computed using a computer (i.e. in Excel or MATLAB), fast, accurate numeric
approximations based on polynomial expansions are used (Abramowitz & Stegun 1964).
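In practice these approximations are wrapped in standard functions. The minimal Python sketch below, assuming scipy is available, evaluates the standard normal cdf through its error-function representation and compares it with scipy.stats.norm.cdf; the evaluation points are arbitrary.

from math import erf, sqrt

from scipy.stats import norm

def normal_cdf(x, mu=0.0, sigma2=1.0):
    # F(x; mu, sigma^2) = 1/2 + 1/2 * erf((x - mu) / (sqrt(2) * sigma))
    return 0.5 + 0.5 * erf((x - mu) / sqrt(2.0 * sigma2))

for x in (-1.96, 0.0, 1.645):
    print(x, normal_cdf(x), norm.cdf(x))   # the two columns agree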
The pdf can be similarly derived from the cdf as long as the cdf is continuously differentiable. At points where the cdf is not continuously differentiable, the pdf is defined to take the value 0.⁴

Theorem 1.45 (Relationship between CDF and pdf). Let f (x ) and F (x ) represent the pdf and cdf of a continuous random variable X , respectively. The density function for X can be defined as f (x ) = ∂ F (x ) /∂ x whenever f (x ) is continuous and f (x ) = 0 elsewhere.
4
Formally a pdf does not have to exist for a random variable, although a cdf always does. In practice, this is
a technical point and distributions which have this property are rarely encountered in financial economics.

Figure 1.3: The top panels show the pdf for the density f (x ) = 12 (x − 1/2)^2 and its associated cdf. The bottom left panel shows the probability density function for normal distributions with different combinations of µ and σ². The bottom right panel shows the cdf for the same parameterizations.

Example 1.46. Taking the derivative of the cdf in the running example,

∂ F (x ) /∂ x = 12x² − 12x + 3
             = 12 (x² − x + 1/4)
             = 12 (x − 1/2)² .

1.2.2 Quantile Functions


The quantile function is closely related to the cdf – and in many important cases, the quan-
tile function is the inverse (function) of the cdf. Before defining quantile functions, it is
necessary to define a quantile.

Definition 1.47 (Quantile). Any number q satisfying Pr (x ≤ q ) = α and Pr (x ≥ q ) = 1 − α


is known as the α-quantile of X and is denoted qα .

A quantile is just the point on the cdf where the total probability that a random variable
is smaller is α and the probability that the random variable takes a larger value is 1 − α.
The definition of the quantile does not necessarily require the quantile to be unique – non-
unique quantiles are encountered when pdfs have regions of 0 probability (or equivalently
cdfs are discontinuous). Quantiles are unique for random variables which have continu-
ously differentiable cdfs. One common modification of the quantile definition is to select
the smallest number which satisfies the two conditions – this ensures that quantiles are
unique.
The function which returns the quantile is known as the quantile function.

Definition 1.48 (Quantile Function). Let X be a continuous random variable with cdf F (x ).
The quantile function for X is defined as G (α) = q where Pr (x ≤ q ) = α and Pr (x > q ) =
1 − α. When F (x ) is one-to-one (and hence X is strictly continuous) then G (α) = F −1 (α).

Quantile functions are generally set-valued when quantiles are not unique, although in
the common case where the pdf does not contain any regions of 0 probability, the quantile
function is simply the inverse of the cdf.

Example 1.49. The cdf of an exponential random variable is

F (x ; λ) = 1 − exp (−x /λ)

for x ≥ 0 and λ > 0. Since f (x ; λ) > 0 for x > 0, the quantile function is

F^{-1} (α; λ) = −λ ln (1 − α) .

The quantile function plays an important role in the simulation of random variables. In particular, if u ∼ U (0, 1)⁵, then x = F^{-1} (u ) is distributed F . For example, when u is a standard uniform (U (0, 1)), and F^{-1} (α) is the quantile function of an exponential random variable, then x = F^{-1} (u ; λ) follows an exponential (λ) distribution.

Theorem 1.50 (Probability Integral Transform). Let U be a standard uniform random variable and F_X (x ) be a continuous, increasing cdf. Then Pr (F^{-1} (U ) < x ) = F_X (x ) and so F^{-1} (U ) is distributed F .
5
The mathematical notation ∼ is read “distributed as”. For example, x ∼ U (0, 1) indicates that x is dis-
tributed as a standard uniform random variable.

Proof. Let U be a standard uniform random variable, and for an x ∈ R (X ),

Pr (U ≤ F (x )) = F (x ) ,

which follows from the definition of a standard uniform. Then

Pr (U ≤ F (x )) = Pr (F^{-1} (U ) ≤ F^{-1} (F (x )))
               = Pr (F^{-1} (U ) ≤ x )
               = Pr (X ≤ x ) .

The key identity is that Pr (F^{-1} (U ) ≤ x ) = Pr (X ≤ x ), which shows that the distribution of F^{-1} (U ) is F by the definition of the cdf. The right panel of figure 1.8 shows the relationship between the cdf of a standard normal and the associated quantile function. Applying F (X ) produces a uniform U through the cdf and applying F^{-1} (U ) produces X through the quantile function.
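The probability integral transform is exactly how the exponential example above is simulated in practice. The sketch below (Python with numpy assumed; λ = 2 and the sample size are arbitrary choices) draws standard uniforms, pushes them through the exponential quantile function, and checks that the sample mean and median are close to their theoretical values.

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # exponential parameter (the mean)

u = rng.uniform(size=100_000)              # u ~ U(0, 1)
x = -lam * np.log(1.0 - u)                 # x = F^{-1}(u; lambda)

print(x.mean())                            # close to lam = 2.0
print(np.mean(x <= lam * np.log(2)))       # close to 0.5, since the median is lam*ln(2)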

1.2.3 Common Univariate Distributions

Discrete

1.2.3.1 Bernoulli

A Bernoulli random variable is a discrete random variable which takes one of two values,
0 or 1. It is often used to model success or failure, where success is loosely defined. For
example, a success may be the event that a trade was profitable net of costs, or the event
that stock market volatility as measured by VIX was greater than 40%. The Bernoulli distri-
bution depends on a single parameter p which determines the probability of success.

Parameters

p ∈ [0, 1]

Support

x ∈ {0, 1}

Probability Mass Function

f (x ; p ) = p^x (1 − p )^{1−x} , p ≥ 0

Figure 1.4: The left panel shows a histogram of the elapsed time in seconds required for .1% of the daily volume being traded to occur for SPY on May 31, 2012. The right panel shows both the fitted scaled χ² distribution and the raw data (mirrored below) for 5-minute “realized variance” estimates for SPY on May 31, 2012.

Moments

Mean p
Variance p (1 − p )

1.2.3.2 Poisson

A Poisson random variable is a discrete random variable taking values in {0, 1, . . .}. The
Poisson depends on a single parameter λ (known as the intensity). Poisson random vari-
ables are often used to model counts of events during some interval, for example the num-
ber of trades executed over a 5-minute window.

Parameters

λ≥0

Support

x ∈ {0, 1, . . .}

Probability Mass Function

f (x ; λ) = (λ^x / x !) exp (−λ)

Moments

Mean λ
Variance λ

Continuous

1.2.3.3 Normal (Gaussian)

The normal is the most important univariate distribution in financial economics. It is the familiar “bell-shaped” distribution, and is used heavily in hypothesis testing and in modeling (net) asset returns (e.g. rt = ln Pt − ln Pt−1 or rt = (Pt − Pt−1 )/Pt−1 ).

Parameters

µ ∈ (−∞, ∞) , σ2 ≥ 0

Support

x ∈ (−∞, ∞)

Probability Density Function

f (x ; µ, σ²) = (1/√(2πσ²)) exp ( −(x − µ)² / (2σ²) )

Cumulative Distribution Function

F (x ; µ, σ²) = 1/2 + 1/2 erf ( (x − µ) / (√2 σ) ) where erf is the error function.⁶

⁶ The error function does not have a closed form and so is a definite integral of the form

erf (x ) = (2/√π) ∫_{0}^{x} exp (−s²) ds .

Figure 1.5: Weekly and monthly densities for the FTSE 100 and S&P 500. All panels plot the pdf of a normal and a standardized Student’s t using parameter estimates obtained by maximum likelihood estimation (see Chapter 2). The points below 0 on the y-axis show the actual returns observed during this period.

Moments

Mean µ
Variance σ2
Median µ
Skewness 0
Kurtosis 3

Notes

The normal with mean µ and variance σ² is written N (µ, σ²). A normally distributed random variable with µ = 0 and σ² = 1 is known as a standard normal. Figure 1.5 shows the fitted normal distribution for the FTSE 100 and S&P 500 using both weekly and monthly returns for the period 1984–2012. Below each figure is a plot of the raw data.

1.2.3.4 Log-Normal

Log-normal random variables are closely related to normals. If X is log-normal, then Y =


ln (X ) is normal. Like the normal, the log-normal family depends on two parameters, µ and
σ2 , although unlike the normal these parameters do not correspond to the mean and vari-
ance. Log-normal random variables are commonly used to model gross returns, Pt +1 /Pt
(although it is often simpler to model rt = ln Pt − ln Pt −1 which is normally distributed).

Parameters

µ ∈ (−∞, ∞) , σ2 ≥ 0

Support

x ∈ [0, ∞)

Probability Density Function

f (x ; µ, σ²) = (1/(x √(2πσ²))) exp ( −(ln x − µ)² / (2σ²) )

Cumulative Distribution Function

Since Y = ln (X ) ∼ N (µ, σ²), the cdf is the same as the normal, only using ln x in place of x .

Moments

Mean exp (µ + σ²/2)
Median exp (µ)
Variance [exp (σ²) − 1] exp (2µ + σ²)

1.2.3.5 χ² (Chi-square)

χ²_ν random variables depend on a single parameter ν known as the degree-of-freedom. They are commonly encountered when testing hypotheses, although they are also used to model continuous variables which are non-negative, such as variances. χ²_ν random variables are closely related to standard normal random variables and are defined as the sum of ν independent standard normal random variables which have been squared. Suppose Z_1, . . . , Z_ν are standard normally distributed and independent; then x = ∑_{i=1}^{ν} z_i² follows a χ²_ν.⁷
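This construction is straightforward to verify by simulation. The sketch below (Python with numpy assumed; ν = 5 and the number of replications are arbitrary) squares and sums ν independent standard normals many times and compares the sample mean and variance with the theoretical values ν and 2ν listed under Moments below.

import numpy as np

rng = np.random.default_rng(0)
nu, reps = 5, 200_000

z = rng.standard_normal(size=(reps, nu))   # independent standard normals
x = (z ** 2).sum(axis=1)                   # each row: sum of nu squared normals

print(x.mean(), nu)                        # sample mean is close to nu
print(x.var(), 2 * nu)                     # sample variance is close to 2*nu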

Parameters

ν ∈ [0, ∞)

Support

x ∈ [0, ∞)

Probability Density Function

f (x ; ν) = (1/(2^{ν/2} Γ (ν/2))) x^{(ν−2)/2} exp (−x /2) , ν ∈ {1, 2, . . .}, where Γ (a ) is the Gamma function.

Cumulative Distribution Function

F (x ; ν) = (1/Γ (ν/2)) γ (ν/2, x /2) where γ (a , b ) is the lower incomplete gamma function.

Moments

Mean ν
Variance 2ν
⁷ In general, if Z_1, . . . , Z_n are i.i.d. standard normal and y = ∑_{i=1}^{n} w_i z_i², then y ∼ χ²_ν where ν = ∑_{i=1}^{n} w_i. This extends the previous definition to allow for non-integer values of ν.

Notes

Figure 1.4 shows a χ² pdf which was used to fit some simple estimators of the 5-minute variance of the S&P 500 from May 31, 2012. These were computed by summing squared 1-minute returns within each 5-minute interval (all using log prices). 5-minute variance estimators are important in high-frequency trading and other (slower) algorithmic trading.

1.2.3.6 Student’s t and standardized Student’s t

Student’s t random variables are also commonly encountered in hypothesis testing and, like χ²_ν random variables, are closely related to standard normals. Student’s t random variables depend on a single parameter, ν, and can be constructed from two other independent random variables. If Z is a standard normal, W a χ²_ν and Z ⊥⊥ W , then x = z /√(w /ν) follows a Student’s t distribution. Student’s t random variables are similar to normals except that they are heavier tailed, although as ν → ∞ a Student’s t converges to a standard normal.

Support

x ∈ (−∞, ∞)

Probability Density Function

f(x; ν) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) (1 + x²/ν)^{−(ν+1)/2} where Γ(a) is the Gamma function.

Moments

Mean      0, ν > 1
Median    0
Variance  ν/(ν − 2), ν > 2
Skewness  0, ν > 3
Kurtosis  3(ν − 2)/(ν − 4), ν > 4

Notes

When ν = 1, a Student’s t is known as a Cauchy random variable.


The standardized Student’s t extends the usual Student’s t in two directions. First, it
removes the variance’s dependence on ν so that the scale of the random variable can be
established separately from the degree of freedom parameter. Second, it explicitly adds
location and scale parameters so that if Y is a Student’s t random variable with degree of
freedom ν, then √
ν−2
x =µ+σ √ y
ν

follows a standardized Student’s t distribution (ν > 2 is required). The standardized Stu-


dent’s t is commonly used to model heavy tailed return distributions such as stock market
indices.
Figure 1.5 shows the fit (using maximum likelihood) standardized t distribution to the
FTSE 100 and S&P 500 using both weekly and monthly returns from the period 1984–2012.
The typical degree of freedom parameter was around 4, indicating that (unconditional)
distributions are heavy tailed with a large kurtosis.
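As an illustration of this kind of fit, the following sketch (hypothetical data, not the series used in the text) estimates a location-scale Student's t by maximum likelihood using SciPy; for ν > 2 the scale of the corresponding standardized Student's t can be recovered from the fitted scale:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
returns = stats.t.rvs(df=4, loc=0.001, scale=0.02, size=2000, random_state=rng)  # stand-in data

nu, loc, scale = stats.t.fit(returns)       # MLE of (degree of freedom, location, scale)
sigma = scale * np.sqrt(nu / (nu - 2))      # standard deviation implied by the fit (nu > 2)
print(nu, loc, sigma)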

1.2.3.7 Uniform

The continuous uniform is commonly encountered in certain test statistics, especially those
involving testing whether assumed densities are appropriate for a particular series. Uni-
form random variables, when combined with quantile functions, are also useful for simu-
lating random variables.

Parameters

a , b the end points of the interval, where a < b

Support

x ∈ [a , b ]

Probability Density Function

f(x) = 1/(b − a)

Cumulative Distribution Function

F(x) = (x − a)/(b − a) for a ≤ x ≤ b, F(x) = 0 for x < a and F(x) = 1 for x > b

Moments

Mean      (a + b)/2
Median    (a + b)/2
Variance  (b − a)²/12
Skewness  0
Kurtosis  9/5

Notes

A standard uniform has a = 0 and b = 1. When X ∼ F, then F(X) ∼ U(0, 1).
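This property is the basis of the simulation use mentioned above: uniform draws passed through a quantile function produce draws from the corresponding distribution. A minimal sketch using NumPy and SciPy (the standard normal target is chosen purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, 100_000)   # standard uniform draws
x = stats.norm.ppf(u)                # quantile (inverse cdf) maps them to N(0,1)
print(x.mean(), x.std())             # approximately 0 and 1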



1.3 Multivariate Random Variables


While univariate random variables are very important in financial economics, most applications require the use of multivariate random variables. Multivariate random variables allow the relationships between two or more random quantities to be modeled and studied. For example, the joint distribution of equity and bond returns is important for many investors.
Throughout this section, the multivariate random variable is assumed to have n components,

X = [X1 X2 . . . Xn]′
which are arranged into a column vector. The definition of a multivariate random variable
is virtually identical to that of a univariate random variable, only mapping ω ∈ Ω to the
n -dimensional space Rn .

Definition 1.51 (Multivariate Random Variable). Let (Ω, F, P ) be a probability space. If X :


Ω → Rn is a real-valued vector function having its domain the elements of Ω, then X : Ω →
Rn is called a (multivariate) n -dimensional random variable.

Multivariate random variables, like univariate random variables, are technically func-
tions of events in the underlying probability space X (ω), although the function argument
ω (the event) is usually suppressed.
Multivariate random variables can be either discrete or continuous. Discrete multivariate random variables are fairly uncommon in financial economics and so the remainder of the chapter focuses exclusively on the continuous case. The characterization of what makes a multivariate random variable continuous is also virtually identical to that in the univariate case.

Definition 1.52 (Continuous Multivariate Random Variable). A multivariate random variable is said to be continuous if its range is uncountably infinite and if there exists a non-negative valued function f(x1, . . . , xn) defined for all (x1, . . . , xn) ∈ Rⁿ such that for any event B ⊂ R(X),

Pr(B) = ∫ · · · ∫_{(x1,...,xn)∈B} f(x1, . . . , xn) dx1 . . . dxn    (1.12)

and f(x1, . . . , xn) = 0 for all (x1, . . . , xn) ∉ R(X).

Multivariate random variables, at least when continuous, are often described by their probability density function.

Definition 1.53 (Continuous Density Function Characterization). A function f : Rⁿ → R is a member of the class of multivariate continuous density functions if and only if f(x1, . . . , xn) ≥ 0 for all x ∈ Rⁿ and

∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(x1, . . . , xn) dx1 . . . dxn = 1.    (1.13)

Definition 1.54 (Multivariate Probability Density Function). The function f(x1, . . . , xn) is called a multivariate probability density function (pdf).
A multivariate density, like a univariate density, is a function which is everywhere non-negative and integrates to unity. Figure 1.7 shows the fitted joint probability density function for weekly returns on the FTSE 100 and S&P 500 (assuming that returns are normally distributed). Two views are presented – one shows the 3-dimensional plot of the pdf and the other shows the iso-probability contours of the pdf. The figure also contains a scatter plot of the raw weekly data for comparison. All parameters were estimated using maximum likelihood.
Example 1.55. Suppose X is a bivariate random variable, then the function f(x1, x2) = 3/2 (x1² + x2²), defined on [0, 1] × [0, 1], is a valid probability density function.

Example 1.56. Suppose X is a bivariate standard normal random variable. Then the probability density function of X is

f(x1, x2) = 1/(2π) exp(−(x1² + x2²)/2).

The multivariate cumulative distribution function is virtually identical to that in the univariate case, and measures the total probability between −∞ (for each element of X) and some point.
Definition 1.57 (Multivariate Cumulative Distribution Function). The joint cumulative distribution function of an n-dimensional random variable X is defined by

F(x1, . . . , xn) = Pr(Xi ≤ xi, i = 1, . . . , n)

for all (x1, . . . , xn) ∈ Rⁿ, and is given by

F(x1, . . . , xn) = ∫_{−∞}^{xn} · · · ∫_{−∞}^{x1} f(s1, . . . , sn) ds1 . . . dsn.    (1.14)

Example 1.58. Suppose X is a bivariate random variable with probability density function f(x1, x2) = 3/2 (x1² + x2²) defined on [0, 1] × [0, 1]. The associated cdf is

F(x1, x2) = (x1³ x2 + x1 x2³)/2.
Figure 1.6 shows the joint cdf of the density in the previous example. As was the case for
univariate random variables, the probability density function can be determined by differ-
entiating the cumulative distribution function – only in the multivariate case, a derivative
is needed for each component.

Theorem 1.59 (Relationship between CDF and PDF). Let f(x1, . . . , xn) and F(x1, . . . , xn) represent the pdf and cdf of an n-dimensional continuous random variable X, respectively. The density function for X can be defined as f(x1, . . . , xn) = ∂ⁿ F(x)/(∂x1 ∂x2 . . . ∂xn) whenever f(x1, . . . , xn) is continuous and f(x1, . . . , xn) = 0 elsewhere.

Example 1.60. Suppose X is a bivariate random variable with cumulative distribution function F(x1, x2) = (x1³ x2 + x1 x2³)/2. The probability density function can be determined using

f(x1, x2) = ∂² F(x1, x2)/(∂x1 ∂x2)
          = (1/2) ∂(3x1² x2 + x2³)/∂x2
          = 3/2 (x1² + x2²).
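The differentiation in this example can be verified symbolically. A minimal sketch using SymPy (the choice of software is an assumption; the text does not use any package):

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
F = (x1**3 * x2 + x1 * x2**3) / 2
f = sp.diff(F, x1, x2)           # differentiate once in x1, then once in x2
print(sp.simplify(f))            # 3*x1**2/2 + 3*x2**2/2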

1.3.1 Marginal Densities and Distributions

The marginal distribution is the first concept unique to multivariate random variables.
Marginal densities and distribution functions summarize the information in a subset, usu-
ally a single component of X by averaging over all possible values of the components of X
which are not being marginalized. This involves integrating out the variables which are not
of interest. First, consider the bivariate case.

Definition 1.61 (Bivariate Marginal Probability Density Function). Let X be a bivariate ran-
dom variable comprised of X 1 and X 2 . The marginal distribution of X 1 is given by
f1(x1) = ∫_{−∞}^{∞} f(x1, x2) dx2.    (1.15)

The marginal density of X 1 is a density function where X 2 has been integrated out. This
integration is simply a form of averaging – varying x2 according to the probability associ-
ated with each value of x2 . The marginal is only a function of x1 . Both probability density
functions and cumulative distribution functions have marginal versions.

Example 1.62. Suppose X is a bivariate random variable with probability density function f(x1, x2) = 3/2 (x1² + x2²) defined on [0, 1] × [0, 1]. The marginal probability density function for X1 is

f1(x1) = 3/2 (x1² + 1/3),

and by symmetry the marginal probability density function of X2 is

f2(x2) = 3/2 (x2² + 1/3).
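The integration used to obtain this marginal can also be verified symbolically; a minimal SymPy sketch (again, the choice of software is an assumption):

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = sp.Rational(3, 2) * (x1**2 + x2**2)   # joint density on [0,1] x [0,1]
f1 = sp.integrate(f, (x2, 0, 1))          # integrate x2 out to get the marginal of X1
print(sp.simplify(f1))                    # 3*x1**2/2 + 1/2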

Example 1.63. Suppose X is a bivariate random variable with probability density function f(x1, x2) = 6 x1 x2² defined on [0, 1] × [0, 1]. The marginal probability density functions for X1 and X2 are

f1(x1) = 2x1 and f2(x2) = 3x2².
Example 1.64. Suppose X is bivariate normal with parameters µ = [µ1 µ2]′ and

Σ = [ σ1²  σ12 ]
    [ σ12  σ2² ],

then the marginal pdf of X1 is N(µ1, σ1²), and the marginal pdf of X2 is N(µ2, σ2²).

Figure 1.7 shows the fitted marginal distributions for weekly returns on the FTSE 100 and S&P 500 assuming that returns are normally distributed. Marginal pdfs can be transformed into marginal cdfs through integration.
Definition 1.65 (Bivariate Marginal Cumulative Distribution Function). The cumulative
marginal distribution function of X 1 in bivariate random variable X is defined by

F1 (x1 ) = Pr (X 1 ≤ x1 )

for all x1 ∈ R, and is given by

F1(x1) = ∫_{−∞}^{x1} f1(s1) ds1.
The general j -dimensional marginal distribution partitions the n -dimensional random
variable X into two blocks, and constructs the marginal distribution for the first j by inte-
grating out (averaging over) the remaining n − j components of X . In the definition, both
X 1 and X 2 are vectors.
Definition 1.66 (Marginal Probability Density Function). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1′ X2′]′. The marginal probability density function for X1 is given by

f_{1,...,j}(x1, . . . , xj) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(x1, . . . , xn) dx_{j+1} . . . dxn.    (1.16)

The marginal cumulative distribution function is related to the marginal probability


density function in the same manner as the joint probability density function is related to
the cumulative distribution function. It also has the same interpretation.
Definition 1.67 (Marginal Cumulative Distribution Function). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1′ X2′]′. The marginal cumulative distribution function for X1 is given by

F_{1,...,j}(x1, . . . , xj) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xj} f_{1,...,j}(s1, . . . , sj) ds1 . . . dsj.    (1.17)

1.3.2 Conditional Distributions

Marginal distributions provide the tools needed to model the distribution of a subset of the components of a random variable while averaging over the other components. Conditional densities and distributions, on the other hand, consider a subset of the components of a random variable conditional on observing a specific value of the remaining components. In practice, the vast majority of modeling makes use of conditional information where the interest is in understanding the distribution of a random variable conditional on the observed values of some other random variables. For example, consider the problem of modeling the expected return of an individual stock. Usually other information such as the book value of assets, earnings and return on equity are all available, and can be conditioned on to model the conditional distribution of the stock's return.
First, consider the bivariate case.

Definition 1.68 (Bivariate Conditional Probability Density Function). Let X be a bivariate random variable comprised of X1 and X2. The conditional probability density function for X1 given that X2 ∈ B where B is an event where Pr(X2 ∈ B) > 0 is

f(x1|X2 ∈ B) = ∫_B f(x1, x2) dx2 / ∫_B f2(x2) dx2.    (1.18)

When B is an elementary event (e.g. a single point), so that Pr(X2 = x2) = 0 and f2(x2) > 0, then

f(x1|X2 = x2) = f(x1, x2)/f2(x2).    (1.19)

Conditional density functions differ slightly depending on whether the conditioning variable is restricted to a set or a point. When the conditioning variable is specified to be a set where Pr(X2 ∈ B) > 0, then the conditional density is simply the joint probability of X1 and X2 ∈ B divided by the marginal probability of X2 ∈ B. When the conditioning variable is restricted to a point, the conditional density is simply the ratio of the joint pdf to the marginal pdf of X2.

Example 1.69. Suppose X is a bivariate random variable with probability density function f(x1, x2) = 3/2 (x1² + x2²) defined on [0, 1] × [0, 1]. The conditional probability density function of X1 given X2 ∈ [1/2, 1] is

f(x1|X2 ∈ [1/2, 1]) = 1/11 (12x1² + 7),

the conditional probability density function of X1 given X2 ∈ [0, 1/2] is

f(x1|X2 ∈ [0, 1/2]) = 1/5 (12x1² + 1),

and the conditional probability density function of X1 given X2 = x2 is

f(x1|X2 = x2) = (x1² + x2²)/(x2² + 1/3).

Figure 1.6 shows the joint pdf along with both types of conditional densities. The upper right panel shows the conditional density for X2 ∈ [0.25, 0.5]. The highlighted region contains the components of the joint pdf which are averaged to produce the conditional density. The lower left panel also shows the pdf along with three (non-normalized) conditional densities of the form f(x1|x2). The lower right panel shows these three densities correctly normalized.
The previous example shows that, in general, the conditional probability density func-
tion differs as the region used changes.
Example 1.70. Suppose X is bivariate normal with mean µ = [µ1 µ2]′ and covariance

Σ = [ σ1²  σ12 ]
    [ σ12  σ2² ],

then the conditional distribution of X1 given X2 = x2 is N(µ1 + σ12/σ2² (x2 − µ2), σ1² − σ12²/σ2²).
Marginal distributions and conditional distributions are related in a number of ways. One obvious way is that f(x1|X2 ∈ R(X2)) = f1(x1) – that is, the conditional probability of X1 given that X2 is in its range is simply the marginal pdf of X1. This holds since integrating over all values of x2 is essentially not conditioning on anything (and so a marginal density could, in principle, be called the unconditional density since it averages across all values of the other variable).
The general definition allows for an n -dimensional random vector where the condition-
ing variable has dimension between 1 and j < n .
Definition 1.71 (Conditional Probability Density Function). Let f(x1, . . . , xn) be the joint density function for an n-dimensional random variable X = [X1 . . . Xn]′ and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1′ X2′]′. The conditional probability density function for X1 given that X2 ∈ B is given by

f(x1, . . . , xj|X2 ∈ B) = ∫_{(x_{j+1},...,xn)∈B} f(x1, . . . , xn) dxn . . . dx_{j+1} / ∫_{(x_{j+1},...,xn)∈B} f_{j+1,...,n}(x_{j+1}, . . . , xn) dxn . . . dx_{j+1},    (1.20)

and when B is an elementary event (denoted x2) and if f_{j+1,...,n}(x2) > 0,

f(x1, . . . , xj|X2 = x2) = f(x1, . . . , xj, x2)/f_{j+1,...,n}(x2).    (1.21)

In general the simplified notation f(x1, . . . , xj|x2) will be used to represent f(x1, . . . , xj|X2 = x2).


Figure 1.6: These four panels show four views of a distribution defined on [0, 1] × [0, 1]. The upper left panel shows the joint cdf. The upper right shows the pdf along with the portion of the pdf used to construct a conditional distribution f(x1|x2 ∈ [0.25, 0.5]). The line shows the actual correctly scaled conditional distribution, which is only a function of x1 and has been plotted at E[X2|X2 ∈ [0.25, 0.5]]. The lower left panel also shows the pdf along with three non-normalized conditional densities. The bottom right panel shows the correctly normalized conditional densities.

1.3.3 Independence
When variables are independent, there is a special relationship between the joint proba-
bility density function and the marginal density functions – the joint must be the product
of each marginal.

Theorem 1.72 (Independence of Random Variables). The random variables X1, . . . , Xn with joint density function f(x1, . . . , xn) are independent if and only if

f(x1, . . . , xn) = ∏_{i=1}^{n} fi(xi)    (1.22)

where fi(xi) is the marginal distribution of Xi.

The intuition behind this result follows from the fact that when the components of a
random variable are independent, any change in one component has no information for
the others. In other words, both marginals and conditionals must be the same.

Example 1.73. Let X be a bivariate random variable with probability density function f(x1, x2) = 4 x1 x2 on [0, 1] × [0, 1], then X1 and X2 are independent. This can be verified since

f1(x1) = 2x1 and f2(x2) = 2x2

so that the joint is the product of the two marginal densities.

Independence is a very strong concept, and it carries over from random variables to
functions of random variables as long as each function involves one random variable.8

Theorem 1.74 (Independence of Functions of Independent Random Variables). Let X 1 and


X 2 be independent random variables and define y1 = Y1 (x1 ) and y2 = Y2 (x2 ), then the ran-
dom variables Y1 and Y2 are independent.

Independence is often combined with an assumption that the marginal distribution is


the same to simplify the analysis of collections of random data.

Definition 1.75 (Independent, Identically Distributed). Let {Xi} be a sequence of random variables. If the marginal distribution for Xi is the same for all i and Xi ⊥⊥ Xj for all i ≠ j, then {Xi} is said to be an independent, identically distributed (i.i.d.) sequence.

1.3.4 Bayes Rule


Bayes rule is used both in financial economics and econometrics. In financial economics, it is often used to model agents' learning, and in econometrics it is used to make inference about unknown parameters given observed data (a branch known as Bayesian econometrics). Bayes rule follows directly from the definition of a conditional density so that the joint can be factored into a conditional and a marginal. Suppose X is a bivariate random variable, then

f(x1, x2) = f(x1|x2) f2(x2)
          = f(x2|x1) f1(x1).

The joint can be factored two ways, and equating the two factorizations produces Bayes rule.

8 This can be generalized to the full multivariate case where X is an n-dimensional random variable where the first j components are independent from the last n − j components, defining y1 = Y1(x1, . . . , xj) and y2 = Y2(x_{j+1}, . . . , xn).

Definition 1.76 (Bivariate Bayes Rule). Let X be a bivariate random variable with components X1 and X2, then

f(x1|x2) = f(x2|x1) f1(x1) / f2(x2).    (1.23)

Bayes rule states that the probability of observing X1 given a value of X2 is equal to the joint probability of the two random variables divided by the marginal probability of observing X2. Bayes rule is normally applied where there is a belief about X1 (f1(x1), called a prior), and the conditional distribution of X2 given X1 is a known density (f(x2|x1), called the likelihood), which combine to form an updated belief about X1 (f(x1|x2), called the posterior). The marginal density of X2 is not important when using Bayes rule since the numerator is proportional to the conditional density of X1 given X2 – f2(x2) is just a constant that does not depend on x1 – and so it is common to express the posterior as

f(x1|x2) ∝ f(x2|x1) f1(x1),

where ∝ is read "is proportional to".

Example 1.77. Suppose interest lies in the probability a firm goes bankrupt, which can be modeled as a Bernoulli distribution. The parameter p is unknown but, given a value of p, the likelihood that a firm goes bankrupt is

f(x|p) = p^x (1 − p)^{1−x}.

While p is unknown, a prior for the bankruptcy rate can be specified. Suppose the prior for p follows a Beta(α, β) distribution which has pdf

f(p) = p^{α−1} (1 − p)^{β−1} / B(α, β)

where B(a, b) is the Beta function that acts as a normalizing constant.9 The Beta distribution has support on [0, 1] and nests the standard uniform as a special case. The expected value of a random variable with a Beta(α, β) distribution is α/(α + β) and the variance is αβ/((α + β)²(α + β + 1)) where α > 0 and β > 0.
Using Bayes rule,

f(p|x) ∝ p^x (1 − p)^{1−x} × p^{α−1} (1 − p)^{β−1} / B(α, β)
       = p^{α−1+x} (1 − p)^{β−x} / B(α, β).

Note that this isn't a density since it has the wrong normalizing constant. However, the component of the density which contains p, p^{α−1+x} (1 − p)^{β−x} (known as the kernel), is the same as in the Beta distribution, only with different parameters. Thus the posterior, f(p|x), is Beta(α + x, β + 1 − x). Since the posterior is in the same family as the prior, it could be combined with another observation (and the Bernoulli likelihood) to produce an updated posterior. When a Bayesian problem has this property, the prior density is called conjugate to the likelihood.
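The conjugate update in this example is easy to compute directly. The sketch below (plain Python, with hypothetical prior parameters and data) applies the Beta(α + x, β + 1 − x) update sequentially:

alpha, beta = 2.0, 8.0                 # hypothetical prior parameters
data = [0, 0, 1, 0, 0]                 # hypothetical bankruptcy indicators

for x in data:
    alpha, beta = alpha + x, beta + 1 - x   # posterior after each observation

posterior_mean = alpha / (alpha + beta)     # mean of a Beta(alpha, beta)
print(alpha, beta, posterior_mean)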

Example 1.78. Suppose M is a random variable representing the score on the midterm, and interest lies in the final course grade, C. The prior for C is normal with mean µ and variance σ², and the distribution of M given C is conditionally normal with mean C and variance τ². Bayes rule can be used to make inference on the final course grade given the midterm grade.

f(c|m) ∝ f(m|c) fC(c)
       ∝ 1/√(2πτ²) exp(−(m − c)²/(2τ²)) × 1/√(2πσ²) exp(−(c − µ)²/(2σ²))
       = K exp(−1/2 { (m − c)²/τ² + (c − µ)²/σ² })
       = K exp(−1/2 { c²/τ² + c²/σ² − 2cm/τ² − 2cµ/σ² + m²/τ² + µ²/σ² })
       = K exp(−1/2 { c² (1/τ² + 1/σ²) − 2c (m/τ² + µ/σ²) + (m²/τ² + µ²/σ²) })

9 The beta function can only be given as a definite integral,

B(a, b) = ∫_0^1 s^{a−1} (1 − s)^{b−1} ds.

This (non-normalized) density can be shown to have the kernel of a normal by completing the square,10

f(c|m) ∝ exp( −(1/2) (c − (m/τ² + µ/σ²)/(1/τ² + 1/σ²))² / (1/τ² + 1/σ²)^{−1} ).

This is the kernel of a normal density with mean

(m/τ² + µ/σ²) / (1/τ² + 1/σ²),

and variance

(1/τ² + 1/σ²)^{−1}.
The mean is a weighted average of the prior mean, µ, and the midterm score, m, where the weights are determined by the inverse variances of the prior and conditional distributions. Since the weights are proportional to the inverse of the variance, a small variance leads to a relatively large weight. If τ² = σ², then the posterior mean is simply the average of the prior mean and the midterm score. The variance of the posterior depends on both the variance of the prior and the conditional variance of the data. The posterior variance is always smaller than the smaller of σ² and τ². Like the Bernoulli-Beta combination in the previous example, the normal distribution is a conjugate prior when the conditional density is normal.
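The posterior mean and variance in this example can be computed directly from the formulas above. A minimal sketch with hypothetical values for µ, σ², m and τ²:

mu, sigma2 = 70.0, 100.0      # prior mean and variance of the course grade (hypothetical)
m, tau2 = 80.0, 25.0          # midterm score and conditional variance (hypothetical)

post_var = 1.0 / (1.0 / tau2 + 1.0 / sigma2)
post_mean = post_var * (m / tau2 + mu / sigma2)
print(post_mean, post_var)    # mean lies between mu and m, closer to the lower-variance source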

1.3.5 Common Multivariate Distributions

1.3.5.1 Multivariate Normal

Like the univariate normal, the multivariate normal depends on two parameters: µ, an n by 1 vector of means, and Σ, an n by n positive semi-definite matrix of covariances. The multivariate normal is closed under both marginalization and conditioning – in other words, if X is multivariate normal, then all marginal distributions of X are normal, and so are all conditional distributions of X1 given X2 for any partitioning.

Parameters

µ ∈ Rn , Σ a positive semi-definite matrix

10 Suppose a quadratic in x has the form ax² + bx + c. Then

ax² + bx + c = a(x − d)² + e

where d = −b/(2a) and e = c − b²/(4a).




Figure 1.7: These four figures show different views of the weekly returns of the FTSE 100 and the S&P 500. The top left contains a scatter plot of the raw data. The top right shows the marginal distributions from a fitted bivariate normal distribution (estimated using maximum likelihood). The bottom two panels show two representations of the joint probability density function.

Support

x ∈ Rn

Probability Density Function


f(x; µ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−1/2 (x − µ)′ Σ^{−1} (x − µ))

Cumulative Distribution Function

Can be expressed as a series of n univariate normal cdfs using repeated conditioning.

Moments

Mean µ
Median µ
Variance Σ
Skewness 0
Kurtosis 3

Marginal Distribution

The marginal distribution for the first j components is

f_{X1,...,Xj}(x1, . . . , xj) = (2π)^{−j/2} |Σ11|^{−1/2} exp(−1/2 (x1 − µ1)′ Σ11^{−1} (x1 − µ1)),

where it is assumed that the marginal distribution is that of the first j random variables11, µ = [µ1′ µ2′]′ where µ1 corresponds to the first j entries, and

Σ = [ Σ11   Σ12 ]
    [ Σ12′  Σ22 ].

In other words, the distribution of X1, . . . , Xj is N(µ1, Σ11). Moreover, the marginal distribution of a single element of X is N(µi, σi²) where µi is the ith element of µ and σi² is the ith diagonal element of Σ.

Conditional Distribution

The conditional probability of X1 given X2 = x2 is

N(µ1 + β′(x2 − µ2), Σ11 − β′ Σ22 β)

where β = Σ22^{−1} Σ12′.

11 Any two variables can be reordered in a multivariate normal by swapping their means and reordering the covariance matrix by swapping the corresponding rows and columns.

When X is a bivariate normal random variable,

(x1, x2)′ ∼ N( (µ1, µ2)′, [ σ1²  σ12 ]
                          [ σ12  σ2² ] ),

the conditional distribution is

X1|X2 = x2 ∼ N( µ1 + σ12/σ2² (x2 − µ2), σ1² − σ12²/σ2² ),

where the variance can be seen to always be non-negative since σ1² σ2² ≥ σ12² by the Cauchy-Schwarz inequality.

Notes

A standard multivariate normal has µ = 0 and Σ = In . When the covariance between


elements i and j equals zero (so that σi j = 0), they are independent. For the normal,
a covariance (or correlation) of 0 implies independence. This is not true of most other
multivariate random variables.
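The bivariate conditional distribution formulas above are simple to evaluate numerically. A minimal NumPy sketch with hypothetical values for µ and Σ:

import numpy as np

mu = np.array([0.001, 0.001])
Sigma = np.array([[0.0004, 0.0003],
                  [0.0003, 0.0006]])   # [[s1^2, s12], [s12, s2^2]]
x2 = 0.02                              # observed value of X2

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(cond_mean, cond_var)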

1.4 Expectations and Moments


Expectations and moments are (non-random) functions of random variables that are use-
ful in both understanding properties of random variables – e.g. when comparing the dis-
persion between two distributions – and when estimating parameters using a technique
known as the method of moments (see Chapter 2).

1.4.1 Expectations

The expectation is the value, on average, of a random variable (or function of a random
variable). Unlike common English language usage, where one’s expectation is not well de-
fined (e.g. could be the mean or the mode, another measure of the tendency of a random
variable), the expectation in a probabilistic sense always averages over the possible values
weighting by the probability of observing each value. The form of an expectation in the
discrete case is particularly simple.

Definition 1.79 (Expectation of a Discrete Random Variable). The expectation of a discrete random variable, defined E[X] = Σ_{x∈R(X)} x f(x), exists if and only if Σ_{x∈R(X)} |x| f(x) < ∞.

When the range of X is finite then the expectation always exists. When the range is infi-
nite, such as when a random variable takes on values in the range 0, 1, 2, . . ., the probability

mass function must be sufficiently small for large values of the random variable in order
for the expectation to exist.12 Expectations of continuous random variables are virtually
identical, only replacing the sum with an integral.

Definition 1.80 (Expectation of a Continuous Random Variable). The expectation of a continuous random variable, defined E[X] = ∫_{−∞}^{∞} x f(x) dx, exists if and only if ∫_{−∞}^{∞} |x| f(x) dx < ∞.

Existence of an expectation is a somewhat difficult concept. For continuous random variables, expectations may not exist if the probability of observing an arbitrarily large value (in the absolute sense) is very high. For example, in a Student's t distribution when the degree of freedom parameter ν is 1 (also known as a Cauchy distribution), the probability of observing a value larger than x in absolute terms is proportional to x^{−1} for large x (in other words, f(x) ∝ c x^{−2}), so that when we compute x f(x), we have c x^{−1} for large x. The range is unbounded, and the integral of x^{−1} over an unbounded range does not converge, and so the expectation does not exist. On the other hand, when a random variable is bounded, its expectation always exists.

Theorem 1.81 (Expectation Existence for Bounded Random Variables). If |x | < c for all
x ∈ R (X ), then E [X ] exists.

The expectation operator, E [·] is generally defined for arbitrary functions of a random
variable, g (x ). In practice, g (x ) takes many forms – x , x 2 , x p for some p , exp (x ) or some-
thing more complicated. Discrete and continuous expectations are closely related. Figure
1.8 shows a standard normal along with a discrete approximation where each bin has a
width of 0.20 and the height is based on the pdf value at the mid-point of the bin. Treating
the normal as a discrete distribution based on this approximation would provide reason-
able approximations to the correct (integral) expectations.
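A minimal sketch of this discrete approximation (using NumPy and SciPy, with the same 0.20 bin width) approximates E[X²] for a standard normal by weighting bin mid-points by the pdf value times the bin width:

import numpy as np
from scipy import stats

width = 0.20
midpoints = np.arange(-5 + width / 2, 5, width)     # bin mid-points on [-5, 5]
weights = stats.norm.pdf(midpoints) * width         # approximate bin probabilities

print((midpoints**2 * weights).sum())               # close to 1, the exact value of E[X^2]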

Definition 1.82 (Expectation of a Function of a Random Variable). The expectation of a random variable defined as a function of X, Y = g(X), is E[Y] = E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx, which exists if and only if ∫_{−∞}^{∞} |g(x)| f(x) dx < ∞.

When g (x ) is either concave or convex, Jensen’s inequality provides a relationship be-


tween the expected value of the function and the function of the expected value of the
underlying random variable.

Theorem 1.83 (Jensen’s Inequality). If g (·) is a continuous convex function on an open in-
terval containing the range of X , then E [g (X )] ≥ g (E [X ]). Similarly, if g (·) is a continuous
concave function on an open interval containing the range of X , then E [g (X )] ≤ g (E [X ]).
12 Non-existence of an expectation simply means that the sum or integral diverges to ±∞ or oscillates. The use of the |x| in the definition of existence is to rule out both the −∞ and the oscillating cases.


Figure 1.8: The left panel shows a standard normal and a discrete approximation. Discrete
approximations are useful for approximating integrals in expectations. The right panel
shows the relationship between the quantile function and the cdf.

The inequalities become strict if the functions are strictly convex (or concave) as long
as X is not degenerate.13 Jensen’s inequality is common in economic applications. For ex-
ample, standard utility functions (U (·)) are assumed to be concave which reflects the idea
that marginal utility (U 0 (·)) is decreasing in consumption (or wealth). Applying Jensen’s in-
equality shows that if consumption is random, then E [U (c )] < U (E [c ]) – in other words,
the economic agent is worse off when facing uncertain consumption. Convex functions are
also commonly encountered, for example in option pricing or in (production) cost func-
tions. The expectations operator has a number of simple and useful properties:

13
A degenerate random variable has probability 1 on a single point, and so is not meaningfully random.

• If c is a constant, then E[c] = c. This property follows since the expectation is an integral against a probability density which integrates to unity.

• If c is a constant, then E[cX] = cE[X]. This property follows directly from passing the constant out of the integral in the definition of the expectation operator.

• The expectation of the sum is the sum of the expectations,

  E[ Σ_{i=1}^{k} gi(X) ] = Σ_{i=1}^{k} E[gi(X)].

  This property follows directly from the distributive property of multiplication.

• If a is a constant, then E[a + X] = a + E[X]. This property also follows from the distributive property of multiplication.

• E[f(X)] = f(E[X]) when f(x) is affine (i.e. f(x) = a + bx where a and b are constants). For general non-linear functions, it is usually the case that E[f(X)] ≠ f(E[X]) when X is non-degenerate.

• E[X^p] ≠ E[X]^p except when p = 1, when X is non-degenerate.

These rules are used throughout financial economics when studying random variables
and functions of random variables.
The expectation of a function of a multivariate random variable is similarly defined,
only integrating across all dimensions.

Definition 1.84 (Expectation of a Multivariate Random Variable). Let (X1, X2, . . . , Xn) be a continuously distributed n-dimensional multivariate random variable with joint density function f(x1, x2, . . . , xn). The expectation of Y = g(X1, X2, . . . , Xn) is defined as

∫_{−∞}^{∞} ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(x1, x2, . . . , xn) f(x1, x2, . . . , xn) dx1 dx2 . . . dxn.    (1.24)

It is straightforward to see that the rule that the expectation of the sum is the sum of the expectations carries over to multivariate random variables, and so

E[ Σ_{i=1}^{n} gi(X1, . . . , Xk) ] = Σ_{i=1}^{n} E[gi(X1, . . . , Xk)].

Additionally, taking gi(X1, . . . , Xk) = Xi, we have E[ Σ_{i=1}^{k} Xi ] = Σ_{i=1}^{k} E[Xi].

1.4.2 Moments

Moments are simply expectations of particular functions of a random variable, typically


g (x ) = x s for s = 1, 2, . . ., and are often used to compare distributions or to estimate
parameters.

Definition 1.85 (Noncentral Moment). The rth noncentral moment of a continuous random variable X is defined

µ′_r ≡ E[X^r] = ∫_{−∞}^{∞} x^r f(x) dx    (1.25)

for r = 1, 2, . . ..

The first non-central moment is simply the average, or mean, of the random variable.

Definition 1.86 (Mean). The first non-central moment of a random variable X is called the
mean of X and is denoted µ.

Central moments are similarly defined, only centered around the mean.

Definition 1.87 (Central Moment). The rth central moment of a random variable X is defined

µ_r ≡ E[(X − µ)^r] = ∫_{−∞}^{∞} (x − µ)^r f(x) dx    (1.26)

for r = 2, 3, . . ..

Aside from the first moment, most generic use of “moment” will be referring to central
moments. Moments may not exist if a distribution is sufficiently heavy tailed. However, if
the r th moment exists, then any moment of lower order must also exist.

Theorem 1.88 (Lesser Moment Existence). If µ′_r exists for some r, then µ′_s exists for s ≤ r. Moreover, for any r, µ′_r exists if and only if µ_r exists.

Central moments are used to describe a distribution since they are invariant to changes in the mean. The second central moment is known as the variance.

Definition 1.89 (Variance). The second central moment of a random variable X, E[(X − µ)²], is called the variance and is denoted σ² or equivalently V[X].

The variance operator (V [·]) also has a number of useful properties.



• If c is a constant, then V[c] = 0.

• If c is a constant, then V[cX] = c² V[X].

• If a is a constant, then V[a + X] = V[X].

• The variance of the sum is the sum of the variances plus twice all of the covariancesa,

  V[ Σ_{i=1}^{n} Xi ] = Σ_{i=1}^{n} V[Xi] + 2 Σ_{j=1}^{n} Σ_{k=j+1}^{n} Cov[Xj, Xk]

a See Section 1.4.7 for more on covariances.
The variance is a measure of dispersion, although the square root of the variance, known as the standard deviation, is typically more useful.14

Definition 1.90 (Standard Deviation). The square root of the variance is known as the standard deviation and is denoted σ or equivalently std(X).
The standard deviation is a more meaningful measure than the variance since its units are the same as those of the mean (and random variable). For example, suppose X is the return on the stock market next year, and that the mean of X is 8% and the standard deviation is 20% (the variance is .04). The mean and standard deviation are both measured as percentage changes in the investment, and so can be directly compared, such as in the Sharpe ratio (Sharpe 1994). Applying the properties of the expectation operator and variance operator, it is possible to define a studentized (or standardized) random variable.
Definition 1.91 (Studentization). Let X be a random variable with mean µ and variance σ², then

Z = (x − µ)/σ    (1.27)

is a studentized version of X (also known as standardized). Z has mean 0 and variance 1.
Standard deviation also provides a bound on the probability which can lie in the tail of
a distribution, as shown in Chebyshev’s inequality.
Theorem 1.92 (Chebyshev’s Inequality). Pr |x − µ| ≥ k σ ≤ 1/k 2 for k > 0.
 

Chebyshev’s inequality is useful in a number of contexts. One of the most useful is in


establishing that an estimator which has a variance that tends to 0 as the sample size in-
creases which ensures that virtually all of the probability must be concentrated around a
particular point.
14
The standard deviation is occasionally confused for the standard error. While both are square roots of
variances, the standard deviation refers to deviation in a random variable while standard error is reserved for
parameter estimators.

The third central moment does not have a specific name, although it is called the skewness when standardized by the variance raised to the power 3/2.

Definition 1.93 (Skewness). The third central moment, standardized by the second central moment raised to the power 3/2,

µ3/(σ²)^{3/2} = E[(X − E[X])³] / E[(X − E[X])²]^{3/2} = E[Z³]    (1.28)

is defined as the skewness where Z is a studentized version of X.

The skewness is a general measure of asymmetry, and is 0 for symmetric distributions (assuming the third moment exists). The normalized fourth central moment is known as the kurtosis.

Definition 1.94 (Kurtosis). The fourth central moment, standardized by the second central moment squared,

µ4/(σ²)² = E[(X − E[X])⁴] / E[(X − E[X])²]² = E[Z⁴]    (1.29)

is defined as the kurtosis and is denoted κ where Z is a studentized version of X.

The kurtosis is a measure of the chance of observing a large (in absolute terms) value, and is often given as excess kurtosis.

Definition 1.95 (Excess Kurtosis). The kurtosis of a random variable minus the kurtosis of a normal random variable, κ − 3, is known as excess kurtosis.

Random variables with a positive excess kurtosis are often referred to as heavy tailed.
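Sample skewness and kurtosis are available in SciPy; the sketch below (simulated heavy-tailed data, purely illustrative) reports kurtosis rather than excess kurtosis by setting fisher=False:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = stats.t.rvs(df=5, size=100_000, random_state=rng)    # heavy-tailed draws

print(stats.skew(x))                    # near 0 for this symmetric distribution
print(stats.kurtosis(x, fisher=False))  # above 3, reflecting the heavy tails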

1.4.3 Related Measures


While moments are useful in describing the properties of a random variable, other measures are also commonly encountered. The median is an alternative measure of central tendency.

Definition 1.96 (Median). Any number m satisfying Pr (X ≤ m ) = 0.5 and Pr (X ≥ m) = 0.5


is known as the median of X .

The median measures the point where 50% of the distribution lies on either side (it may
not be unique), and is just a particular quantile. The median has a few advantages over the
mean, and in particular it is less affected by outliers (e.g. the difference between mean and
median income) and it always exists (the mean doesn’t exist for very heavy tailed distribu-
tions).
1.4 Expectations and Moments 43

The interquartile range uses quartiles15 to provide an alternative measure of dispersion to the standard deviation.

Definition 1.97 (Interquartile Range). The value q.75 − q.25 is known as the interquartile
range.

The mode complements the mean and median as a measure of central tendency. A mode is simply a maximum of a density.

Definition 1.98 (Mode). Let X be a random variable with density function f(x). A point c where f(x) attains a maximum is known as a mode.

Distributions can be unimodal or multimodal.

Definition 1.99 (Unimodal Distribution). Any random variable which has a single, unique
mode is called unimodal.

Note that modes in a multimodal distribution do not necessarily have to have equal
probability.

Definition 1.100 (Multimodal Distribution). Any random variable which has more than one mode is called multimodal.

Figure 1.9 shows a number of distributions. The distributions depicted in the top panels are all unimodal. The distributions in the bottom panels are mixtures of normals, meaning that with probability p random variables come from one normal, and with probability 1 − p they are drawn from the other. Both mixtures of normals are multimodal.

1.4.4 Multivariate Moments


Other moment definitions are only meaningful when studying 2 or more random variables (or an n-dimensional random variable). When applied to a vector or matrix, the expectations operator applies element-by-element. For example, if X is an n-dimensional random variable,

E[X] = E[ (X1, X2, . . . , Xn)′ ] = (E[X1], E[X2], . . . , E[Xn])′.    (1.30)

Covariance is a measure which captures the tendency of two variables to move together
in a linear sense.
15 m-tiles include terciles (3), quartiles (4), quintiles (5), deciles (10) and percentiles (100). In all cases the bin ends are [(i − 1)/m, i/m] where m is the number of bins and i = 1, 2, . . . , m.


Figure 1.9: These four figures show two unimodal (upper panels) and two multimodal
(lower panels) distributions. The upper left is a standard normal density. The upper right
shows three χ 2 densities for ν = 1, 3 and 5. The lower panels contain mixture distributions
of 2 normals – the left is a 50-50 mixture of N (−1, 1) and N (1, 1) and the right is a 30-70
mixture of N (−2, 1) and N (1, 1).

Definition 1.101 (Covariance). The covariance between two random variables X and Y is
defined

Cov [X , Y ] = σX Y = E [(X − E [X ]) (Y − E [Y ])] . (1.31)

Covariance can be alternatively defined using the joint product moment and the prod-
uct of the means.
Theorem 1.102 (Alternative Covariance). The covariance between two random variables X
and Y can be equivalently defined

σX Y = E [X Y ] − E [X ] E [Y ] . (1.32)

Rearranging the covariance expression shows that a lack of covariance is enough to ensure that the expectation of a product is the product of the expectations.
Theorem 1.103 (Zero Covariance and Expectation of Product). If X and Y have σX Y = 0,
then E [X Y ] = E [X ] E [Y ].
The previous result follows directly from the definition of covariance since σXY = E[XY] − E[X]E[Y]. In financial economics, this result is often applied to products of random variables so that the mean of the product can be directly determined by knowledge of the mean of each variable and the covariance between the two. For example, when studying consumption based asset pricing, it is common to encounter terms involving the expected value of consumption growth times the pricing kernel (or stochastic discount factor) – in many cases the full joint distribution of the two is intractable although the mean and covariance of the two random variables can be determined.
The Cauchy-Schwarz inequality is a version of the triangle inequality and states that the square of the expectation of the product is less than the product of the expectations of the squares.

Theorem 1.104 (Cauchy-Schwarz Inequality). E[XY]² ≤ E[X²] E[Y²].

Example 1.105. When X is an n -dimensional random variable, it is useful to assemble the


variances and covariances into a covariance matrix.
Definition 1.106 (Covariance Matrix). The covariance matrix of an n-dimensional random variable X is defined

Cov[X] = Σ = E[(X − E[X])(X − E[X])′] = [ σ1²  σ12  · · ·  σ1n ]
                                        [ σ12  σ2²  · · ·  σ2n ]
                                        [  ⋮    ⋮    ⋱     ⋮  ]
                                        [ σ1n  σ2n  · · ·  σn² ]

where the ith diagonal element contains the variance of Xi (σi²) and the element in position (i, j) contains the covariance between Xi and Xj (σij).


When X is composed of two sub-vectors, a block form of the covariance matrix is often
convenient.

Definition 1.107 (Block Covariance Matrix). Suppose X1 is an n1-dimensional random variable and X2 is an n2-dimensional random variable. The block covariance matrix of X = [X1′ X2′]′ is

Σ = [ Σ11   Σ12 ]
    [ Σ12′  Σ22 ]    (1.33)

where Σ11 is the n1 by n1 covariance of X1, Σ22 is the n2 by n2 covariance of X2 and Σ12 is the n1 by n2 covariance matrix between X1 and X2 with element (i, j) equal to Cov[X1,i, X2,j].

A standardized version of covariance is often used to produce a scale free measure.

Definition 1.108 (Correlation). The correlation between two random variables X and Y is
defined
σX Y
Corr [X , Y ] = ρX Y = . (1.34)
σX σY

Additionally, the correlation is always in the interval [−1, 1], which follows from the
Cauchy-Schwartz inequality.

Theorem 1.109. If X and Y are independent random variables, then ρX Y = 0 as long as σ2X
and σ2Y exist.

It is important to note that the converse of this statement is not true – that is, a lack of
correlation does not imply that two variables are independent. In general, a correlation of
0 only implies independence when the variables are multivariate normal.

Example 1.110. Suppose X and Y have ρXY = 0, then X and Y are not necessarily independent. Suppose X is a discrete uniform random variable taking values in {−1, 0, 1} and Y = X², then σ²X = 2/3, σ²Y = 2/9 and σXY = 0. While X and Y are uncorrelated, they are clearly not independent, since when the random variable Y takes the value 1, X must take either the value −1 or 1.

The corresponding correlation matrix can be assembled. Note that a correlation matrix
has 1s on the diagonal and values bounded by [−1, 1] on the off diagonal positions.

Definition 1.111 (Correlation Matrix). The correlation matrix of an n-dimensional random variable X is defined

(Σ ⊙ In)^{−1/2} Σ (Σ ⊙ In)^{−1/2}    (1.35)

where ⊙ is the element-by-element (Hadamard) product, so that the (i, j)th element has the form σ_{Xi Xj}/(σ_{Xi} σ_{Xj}) when i ≠ j and 1 when i = j.
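Sample versions of the covariance and correlation matrices can be computed with NumPy. A minimal sketch with simulated data (the parameters are arbitrary); note that np.cov and np.corrcoef expect variables in rows, so the returns matrix is transposed:

import numpy as np

rng = np.random.default_rng(0)
returns = rng.multivariate_normal([0.0, 0.0],
                                  [[1.0, 0.5], [0.5, 2.0]], size=1000)  # two return series

print(np.cov(returns.T))        # 2 x 2 covariance matrix
print(np.corrcoef(returns.T))   # 1s on the diagonal, correlations off the diagonal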

1.4.5 Conditional Expectations

One of the most useful forms of expectation is the conditional expectation, which are ex-
pectations using conditional densities in place of joint or marginal densities. Conditional
expectations essentially treat one of the variables (in a bivariate random variable) as con-
stant.

Definition 1.112 (Bivariate Conditional Expectation). Let X be a continuous bivariate random variable comprised of X1 and X2. The conditional expectation of X1 given X2 is

E[g(X1)|X2 = x2] = ∫_{−∞}^{∞} g(x1) f(x1|x2) dx1    (1.36)

where f(x1|x2) is the conditional probability density function of X1 given X2.16

In many cases, it is useful to avoid specifying a specific value for X2, in which case E[X1|X2] will be used. Note that E[X1|X2] will typically be a function of the random variable X2.

Example 1.113. Suppose X is a bivariate normal distribution with components X1 and X2, µ = [µ1 µ2]′ and

Σ = [ σ1²  σ12 ]
    [ σ12  σ2² ],

then E[X1|X2 = x2] = µ1 + σ12/σ2² (x2 − µ2). This follows from the conditional density of a bivariate random variable.

The law of iterated expectations uses conditional expectations to show that the condi-
tioning does not affect the final result of taking expectations – in other words, the order of
taking expectations, does not matter.

Theorem 1.114 (Bivariate Law of Iterated Expectations). Let X be a continuous bivariate random variable comprised of X1 and X2. Then E[E[g(X1)|X2]] = E[g(X1)].

The law of iterated expectations follows from basic properties of an integral since the
order of integration does not matter as long as all integrals are taken.

Example 1.115. Suppose X is a bivariate normal distribution with components X1 and X2, µ = [µ1 µ2]′ and

Σ = [ σ1²  σ12 ]
    [ σ12  σ2² ],

then E[X1] = µ1 and

E[E[X1|X2]] = E[µ1 + σ12/σ2² (X2 − µ2)]
            = µ1 + σ12/σ2² (E[X2] − µ2)
            = µ1 + σ12/σ2² (µ2 − µ2)
            = µ1.

16 A conditional expectation can also be defined in a natural way for functions of X1 given X2 ∈ B where Pr(X2 ∈ B) > 0.

When using conditional expectations, any random variable conditioned on is essentially non-random (in the conditional expectation), and so E[E[X1 X2|X2]] = E[X2 E[X1|X2]]. This is a very useful tool when combined with the law of iterated expectations when E[X1|X2] is a known function of X2.

Example 1.116. Suppose X is a bivariate normal distribution with components X1 and X2, µ = 0 and

Σ = [ σ1²  σ12 ]
    [ σ12  σ2² ],

then

E[X1 X2] = E[E[X1 X2|X2]]
         = E[X2 E[X1|X2]]
         = E[X2 (σ12/σ2²) X2]
         = σ12/σ2² E[X2²]
         = σ12/σ2² σ2²
         = σ12.

One particularly useful application of conditional expectations occurs when the conditional expectation is known and constant, so that E[X1|X2] = c.

Example 1.117. Suppose X is a bivariate random variable composed of X1 and X2 and that E[X1|X2] = c. Then E[X1] = c since

E[X1] = E[E[X1|X2]]
      = E[c]
      = c.

Conditional expectations can be taken for general n-dimensional random variables, and the law of iterated expectations holds as well.


Definition 1.118 (Conditional Expectation). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1′ X2′]′. The conditional expectation of X1 given X2 = x2 is

E[g(X1)|X2 = x2] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(x1, . . . , xj) f(x1, . . . , xj|x2) dxj . . . dx1    (1.37)

where f(x1, . . . , xj|x2) is the conditional probability density function of X1 given X2 = x2.




The law of iterated expectations also holds for arbitrary partitions.

Theorem 1.119 (Law of Iterated Expectations). Let X be an n-dimensional random variable and partition the first 1 ≤ j < n elements of X into X1, and the remainder into X2 so that X = [X1′ X2′]′. Then E[E[g(X1)|X2]] = E[g(X1)]. The law of iterated expectations is also known as the law of total expectations.


Full multivariate conditional expectations are extremely common in time series. For
example, when using daily data, there are over 30,000 observations of the Dow Jones In-
dustrial Average available to model. Attempting to model the full joint distribution would
be a formidable task. On the other hand, modeling the conditional expectation (or condi-
tional mean) of the final observation, conditioning on those observations in the past, is far
simpler.
Example 1.120. Suppose {Xt} is a sequence of random variables where Xt comes after Xt−j for j ≥ 1. The conditional expectation of Xt given its past is

E[Xt|Xt−1, Xt−2, . . .].

Example 1.121. Let {εt} be a sequence of independent, identically distributed random variables with mean 0 and variance σ² < ∞. Define X0 = 0 and Xt = Xt−1 + εt, then Xt is a random walk, and E[Xt|Xt−1] = Xt−1, which follows since Xt−1 is a function of εt−1, εt−2, . . . and so is independent of εt.
This leads naturally to the definition of a martingale, which is an important concept in financial economics related to efficient markets.

Definition 1.122 (Martingale). If E[Xt+j|Xt−1, Xt−2, . . .] = Xt−1 for all j ≥ 0 and E[|Xt|] < ∞, both holding for all t, then {Xt} is a martingale. Similarly, if E[Xt+j − Xt−1|Xt−1, Xt−2, . . .] = 0 for all j ≥ 0 and E[|Xt|] < ∞, both holding for all t, then {Xt} is a martingale.
1.4.6 Conditional Moments


All moments can be defined in a conditional form simply by replacing the integral using the probability density function with an integral using the conditional probability density function. For example, the (unconditional) mean becomes the conditional mean, and the variance becomes a conditional variance.

Definition 1.123 (Conditional Variance). The variance of a random variable X conditional on another random variable Y is

V[X|Y] = E[(X − E[X|Y])²|Y]    (1.38)
       = E[X²|Y] − E[X|Y]².

The two definitions of conditional variance are identical to those of the (unconditional) variance where the expectation operator has been replaced with conditional expectations. Conditioning can be used to compute higher-order moments as well.

Definition 1.124 (Conditional Moment). The rth central moment of a random variable X conditional on another random variable Y is defined

µr ≡ E[(X − E[X|Y])^r|Y]    (1.39)

for r = 2, 3, . . ..

Combining the conditional expectation and the conditional variance, leads to the law
of total variance.

Theorem 1.125. The variance of a random variable X can be decomposed into the variance of the conditional expectation plus the expectation of the conditional variance,

V[X] = V[E[X|Y]] + E[V[X|Y]].    (1.40)

The law of total variance shows that the total variance of a variable can be decomposed
into the variability of the conditional mean plus the average of the conditional variance.
This is a useful decomposition for time-series.
Independence can also be defined conditionally.

Definition 1.126 (Conditional Independence). Two random variables X1 and X2 are conditionally independent, conditional on Y, if

f(x1, x2|y) = f1(x1|y) f2(x2|y).

Random variables that are conditionally independent are not necessarily unconditionally independent. However, knowledge of the conditioning variable is sufficient to make the portions of the underlying random variables which cannot be explained by the conditioning variable independent.

Example 1.127. Suppose X is a trivariate normal random variable with mean 0 and covariance

Σ = [ σ1²  0    0   ]
    [ 0    σ2²  0   ]
    [ 0    0    σ3² ]

and define Y1 = X1 + X3 and Y2 = X2 + X3. Then Y1 and Y2 are correlated bivariate normal with mean 0 and covariance

ΣY = [ σ1² + σ3²   σ3²       ]
     [ σ3²         σ2² + σ3² ],

but the joint distribution of Y1 and Y2 given X3 = x3 is bivariate normal with mean [x3 x3]′ and covariance

ΣY|X3 = [ σ1²  0   ]
        [ 0    σ2² ]

and so Y1 and Y2 are independent conditional on X3.


Other properties of unconditionally independent random variables continue to hold
for conditionally independent random variables. For example, when X 1 and X 2 are inde-
pendent conditional on X 3 , then the conditional covariance between X 1 and X 2 is 0 (as
is the conditional correlation), and E E X 1 X 2 |X 3 = E E X 1 |X 3 E X 2 |X 3 – that is, the
       

conditional expectation of the product is the product of the conditional expectations.

1.4.7 Vector and Matrix Forms


Finally, some useful results for linear combinations of random variables. These are partic-
ularly useful in finance since portfolios are often of interest where the underlying random
variables are the individual assets and the combination vector is simply the vector of port-
folio weights.
Theorem 1.128. Let Y = Σ_{i=1}^{n} ci Xi where ci, i = 1, . . . , n are constants. Then E[Y] = Σ_{i=1}^{n} ci E[Xi]. In matrix notation, Y = c′x where c is an n by 1 vector and E[Y] = c′E[X].

The variance of the sum is the weighted sum of the variances plus twice all of the covariances.

Theorem 1.129. Let Y = Σ_{i=1}^{n} ci Xi where ci are constants. Then

V[Y] = Σ_{i=1}^{n} ci² V[Xi] + 2 Σ_{j=1}^{n} Σ_{k=j+1}^{n} cj ck Cov[Xj, Xk]    (1.41)

or equivalently

σ²Y = Σ_{i=1}^{n} ci² σ²_{Xi} + 2 Σ_{j=1}^{n} Σ_{k=j+1}^{n} cj ck σ_{Xj Xk}.

This result can be equivalently expressed in vector-matrix notation.

Theorem 1.130. Let c be an n by 1 vector and let X be an n-dimensional random variable with covariance Σ. Define Y = c′x. The variance of Y is σ²_Y = c′Cov[X]c = c′Σc.

Note that the result holds when c is replaced by a matrix C.

Theorem 1.131. Let C be an n by m matrix and let X be an n-dimensional random variable with mean µ_X and covariance Σ_X. Define Y = C′x. The expected value of Y is E[Y] = µ_Y = C′E[X] = C′µ_X and the covariance of Y is Σ_Y = C′Cov[X]C = C′Σ_X C.

Definition 1.132 (Multivariate Studentization). Let X be an n-dimensional random variable with mean µ and covariance Σ, then
\[
Z = \Sigma^{-\frac{1}{2}}(x - \mu) \tag{1.42}
\]
is a studentized version of X where Σ^{1/2} is a matrix square root such as the Cholesky factor or one based on the spectral decomposition of Σ. Z has mean 0 and covariance equal to the identity matrix I_n.

The final result for vectors relates quadratic forms of normals (inner products) to χ² distributed random variables.

Theorem 1.133 (Quadratic Forms of Normals). Let X be an n-dimensional normal random variable with mean 0 and identity covariance I_n. Then x′x = \sum_{i=1}^n x_i^2 ∼ χ²_n.

Combining this result with studentization, when X is a general n-dimensional normal random variable with mean µ and covariance Σ,
\[
(x - \mu)' \left(\Sigma^{-\frac{1}{2}}\right)' \Sigma^{-\frac{1}{2}} (x - \mu) = (x - \mu)' \Sigma^{-1} (x - \mu) \sim \chi^2_n .
\]
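Both results are straightforward to check by simulation. A minimal Python sketch (assuming NumPy; the 3-dimensional mean and covariance are arbitrary choices used only for illustration) studentizes draws using the Cholesky factor and examines the quadratic form:

```python
# A minimal check of multivariate studentization and the quadratic form
# result, using a Cholesky-based matrix square root.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])

x = rng.multivariate_normal(mu, Sigma, size=500_000)

# Studentize: z = Sigma^{-1/2} (x - mu) using the Cholesky factor
L = np.linalg.cholesky(Sigma)
z = np.linalg.solve(L, (x - mu).T).T
print(np.round(np.cov(z.T), 2))        # approximately the identity matrix

# Quadratic form: (x - mu)' Sigma^{-1} (x - mu) ~ chi^2 with 3 dof
q = np.einsum('ij,jk,ik->i', x - mu, np.linalg.inv(Sigma), x - mu)
print(q.mean(), q.var())               # approximately 3 and 6, the chi^2_3 moments
```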

1.4.8 Monte Carlo and Numerical Integration


Expectations of functions of continuous random variables are integrals against the under-
lying pdf. In some cases these integrals are analytically tractable, although in many situa-
tions integrals cannot be analytically computed and so numerical techniques are needed
to compute expected values and moments.
Monte Carlo is one method to approximate an integral. Monte Carlo utilizes simulated
draws from the underlying distribution and averaging to approximate integrals.
Definition 1.134 (Monte Carlo Integration). Suppose X ∼ F(θ) and that it is possible to simulate a series {x_i} from F(θ). The Monte Carlo expectation of a function g(x) is defined as
\[
n^{-1}\sum_{i=1}^n g(x_i).
\]
Moreover, as long as E[|g(x)|] < ∞, \lim_{n\to\infty} n^{-1}\sum_{i=1}^n g(x_i) = E[g(x)].

The intuition behind this result follows from the properties of {x_i}. Since these are i.i.d. draws from F(θ), they will, on average, tend to appear in any interval B ∈ R(X) in proportion to the probability Pr(X ∈ B). In essence, the simulated values coarsely reproduce the discrete approximation shown in Figure 1.8.
While Monte Carlo integration is a general technique, there are some important limitations. First, if the function g(x) takes large values in regions where Pr(X ∈ B) is small, it may require a very large number of draws to accurately approximate E[g(x)] since, by construction, there are unlikely to be many points in B. In practice the behavior of h(x) = g(x)f(x) plays an important role in determining the appropriate sample size.17 Second, while Monte Carlo integration is technically valid for random variables with any number of dimensions, in practice it is usually only reliable when the dimension is small (typically 3 or fewer), especially when the range is unbounded (R(X) ∈ R^n). When the dimension of X is large, many simulated draws are needed to visit the corners of the (joint) pdf, and if 1,000 draws are sufficient for a unidimensional problem, 1,000^n may be needed to achieve the same accuracy when X has n dimensions.
Alternatively, the function to be integrated can be approximated using a polygon with an easy-to-compute area, such as the rectangles approximating the normal pdf shown in Figure 1.8. The quality of the approximation will depend on the resolution of the grid used. Suppose u and l are the upper and lower bounds of the integral, respectively, and that the region can be split into m intervals l = b_0 < b_1 < \ldots < b_{m-1} < b_m = u. Then the integral of a function h(·) is
\[
\int_l^u h(x)\,dx = \sum_{i=1}^m \int_{b_{i-1}}^{b_i} h(x)\,dx .
\]
In practice, l and u may be infinite, in which case some cut-off point is required. In general, the cut-offs should be chosen so that the vast majority of the probability lies between l and u (\int_l^u f(x)\,dx \approx 1).
This decomposition is combined with a rule for approximating the area under h between b_{i-1} and b_i. The simplest is the rectangle method, which uses a rectangle with a height equal to the value of the function at the mid-point.

Definition 1.135 (Rectangle Method). The rectangle rule approximates the area under the curve with a rectangle and is given by
\[
\int_l^u h(x)\,dx \approx h\!\left(\frac{u+l}{2}\right)(u - l) .
\]

The rectangle rule would be exact if the function were piece-wise flat. The trapezoid rule improves the approximation by replacing the function at the midpoint with the average value of the function, and would be exact for any piece-wise linear function (including piece-wise flat functions).

17
Monte Carlo integrals can also be seen as estimators, and in many cases standard inference can be used to determine the accuracy of the integral. See Chapter 2 for more details on inference and constructing confidence intervals.

Definition 1.136 (Trapezoid Method). The trapezoid rule approximates the area under the curve with a trapezoid and is given by
\[
\int_l^u h(x)\,dx \approx \frac{h(u) + h(l)}{2}\,(u - l) .
\]
The final method is known as Simpson's rule, which is based on using a quadratic approximation to the underlying function. It is exact when the underlying function is piece-wise linear or quadratic.

Definition 1.137 (Simpson's Rule). Simpson's Rule uses an approximation that would be exact if the underlying function were quadratic, and is given by
\[
\int_l^u h(x)\,dx \approx \frac{u - l}{6}\left[h(u) + 4h\!\left(\frac{u+l}{2}\right) + h(l)\right].
\]
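The three rules are simple to implement on a grid of bins by applying each rule interval-by-interval. The following Python sketch (an illustration assuming NumPy; the test integrand is an arbitrary choice) gives minimal composite versions:

```python
# Minimal composite versions of the rectangle, trapezoid and Simpson rules.
# Each rule is applied on every interval [b_{i-1}, b_i] of a uniform grid.
import numpy as np

def rectangle(h, l, u, m):
    b = np.linspace(l, u, m + 1)
    mid = (b[:-1] + b[1:]) / 2
    return np.sum(h(mid) * np.diff(b))

def trapezoid(h, l, u, m):
    b = np.linspace(l, u, m + 1)
    return np.sum((h(b[:-1]) + h(b[1:])) / 2 * np.diff(b))

def simpson(h, l, u, m):
    b = np.linspace(l, u, m + 1)
    mid = (b[:-1] + b[1:]) / 2
    return np.sum((h(b[:-1]) + 4 * h(mid) + h(b[1:])) / 6 * np.diff(b))

# Example: integrate x^2 on [0, 1]; the true value is 1/3
h = lambda x: x ** 2
print(rectangle(h, 0, 1, 10), trapezoid(h, 0, 1, 10), simpson(h, 0, 1, 10))
```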

Example 1.138. Consider the problem of computing the expected payoff of an option. The payoff of a call option is given by
\[
c = \max(s_1 - k, 0)
\]
where k is the strike price, s_1 is the stock price at expiration and s_0 is the current stock price. Suppose returns are normally distributed with mean µ = .08 and standard deviation σ = .20. In this problem, g(r) = (s_0 \exp(r) - k)\, I_{[s_0 \exp(r) > k]} where I_{[\cdot]} is a binary indicator function which takes the value 1 when the argument is true, and
\[
f(r) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(r - \mu)^2}{2\sigma^2}\right).
\]
Combined, the function to be integrated is
\[
\int_{-\infty}^{\infty} h(r)\,dr = \int_{-\infty}^{\infty} g(r) f(r)\,dr = \int_{-\infty}^{\infty} (s_0 \exp(r) - k)\, I_{[s_0 \exp(r) > k]}\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(r-\mu)^2}{2\sigma^2}\right) dr .
\]
s_0 = k = 50 was used in all results.


All four methods were applied to the problem. The number of bins and the range of integration were varied for the area-based approximations. The number of bins ranged across {10, 20, 50, 1000} and the integration range spanned {±3σ, ±4σ, ±6σ, ±10σ}. In all cases the bins were uniformly spaced along the integration range. Monte Carlo integration was applied with m ∈ {100, 1000}.

All things equal, increasing the number of bins increases the accuracy of the approximation. In this example, 50 bins appear to be sufficient. However, having a range which is too small produces values which differ from the correct value of 7.33. The sophistication of the method also improves the accuracy, especially when the number of nodes is small. The Monte Carlo results are also close, on average. However, the standard deviation is large, about 5%, even when 1,000 draws are used, and so large errors would be commonly encountered unless many more points are used.
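A minimal Python sketch of the calculation (assuming NumPy; the composite Simpson rule with 50 bins on ±10σ and 1,000 Monte Carlo draws is just one of the configurations reported in Table 1.1) produces values close to 7.33:

```python
# Expected call payoff E[max(s0*exp(r) - k, 0)] with r ~ N(.08, .20^2),
# approximated by composite Simpson's rule and by Monte Carlo.
import numpy as np

s0, k, mu, sigma = 50.0, 50.0, 0.08, 0.20

def g(r):
    return np.maximum(s0 * np.exp(r) - k, 0.0)

def f(r):
    return np.exp(-(r - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def h(r):
    return g(r) * f(r)

# Composite Simpson's rule on [mu - 10*sigma, mu + 10*sigma] with 50 bins
l, u, m = mu - 10 * sigma, mu + 10 * sigma, 50
b = np.linspace(l, u, m + 1)
mid = (b[:-1] + b[1:]) / 2
simpson = np.sum((h(b[:-1]) + 4 * h(mid) + h(b[1:])) / 6 * np.diff(b))

# Monte Carlo with 1,000 draws
rng = np.random.default_rng(0)
mc = g(rng.normal(mu, sigma, size=1000)).mean()

print(simpson, mc)   # both approximately 7.33 (the Monte Carlo value is noisy)
```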

Rectangle Method
Bins ±3σ ±4σ ±6σ ±10σ
10 7.19 7.43 7.58 8.50
20 7.13 7.35 7.39 7.50
50 7.12 7.33 7.34 7.36
1000 7.11 7.32 7.33 7.33

Trapezoid Method
Bins ±3σ ±4σ ±6σ ±10σ
10 6.96 7.11 6.86 5.53
20 7.08 7.27 7.22 7.01
50 7.11 7.31 7.31 7.28
1000 7.11 7.32 7.33 7.33

Simpson’s Rule
Bins ±3σ ±4σ ±6σ ±10σ
10 7.11 7.32 7.34 7.51
20 7.11 7.32 7.33 7.34
50 7.11 7.32 7.33 7.33
1000 7.11 7.32 7.33 7.33

Monte Carlo
Draws (m ) 100 1000
Mean 7.34 7.33
Std. Dev. 0.88 0.28

Table 1.1: Computed values for the expected payout of an option, where the correct value is 7.33. The top three panels use approximations to the function which have simple-to-compute areas. The bottom panel shows the average and standard deviation from Monte Carlo integration where the number of points varies and 10,000 simulations were used.

Exercises
Exercise 1.1. Prove that E [a + b X ] = a + b E [X ] when X is a continuous random variable.

Exercise 1.2. Prove that V [a + b X ] = b 2 V [X ] when X is a continuous random variable.

Exercise 1.3. Prove that Cov[a + bX, c + dY] = bd Cov[X, Y] when X and Y are continuous random variables.

Exercise 1.4. Prove that V[a + bX + cY] = b²V[X] + c²V[Y] + 2bc Cov[X, Y] when X and Y are continuous random variables.

Exercise 1.5. Suppose {X_i} is an i.i.d. sequence of random variables. Show that V[X̄] = V[n^{-1}\sum_{i=1}^n X_i] = n^{-1}σ² where σ² is V[X_1].

Exercise 1.6. Prove that Corr[a + bX, c + dY] = Corr[X, Y].

Exercise 1.7. Suppose {X_i} is a sequence of random variables where, for all i, V[X_i] = σ², Cov[X_i, X_{i-1}] = θ and Cov[X_i, X_{i-j}] = 0 for j > 1. What is V[X̄]?

Exercise 1.8. Prove that E[a + bX | Y] = a + bE[X | Y] when X and Y are continuous random variables.

Exercise 1.9. Suppose that E[X | Y] = Y² where Y is normally distributed with mean µ and variance σ². What is E[a + bX]?

Exercise 1.10. Suppose E[X | Y = y] = a + by and V[X | Y = y] = c + dy² where Y is normally distributed with mean µ and variance σ². What is V[X]?

Exercise 1.11. Show that the law of total variance holds for V[X_1] when X is a bivariate normal with mean µ = [µ_1 µ_2]′ and covariance
\[
\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}.
\]

Exercise 1.12. Sixty percent (60%) of all traders hired by a large financial firm are rated as
performing satisfactorily or better in their first year review. Of these, 90% earned a first in
financial econometrics. Of the traders who were rated as unsatisfactory, only 20% earned
a first in financial econometrics.

i. What is the probability that a trader is rated as satisfactory or better given they re-
ceived a first in financial econometrics?

ii. What is the probability that a trader is rated as unsatisfactory given they received a
first in financial econometrics?

iii. Is financial econometrics a useful indicator of trader performance? Why or why not?

Exercise 1.13. Large financial firms use automated screening to detect rogue trades – those
that exceed risk limits. One of your former colleagues has introduced a new statistical test
using the trading data that, given that a trader has exceeded her risk limit, detects this with
probability 98%. It also only indicates false positives – that is non-rogue trades that are
flagged as rogue – 1% of the time.

i. Assuming 99% of trades are legitimate, what is the probability that a detected trade is
rogue? Explain the intuition behind this result.

ii. Is this a useful test? Why or why not?

iii. How low would the false positive rate have to be to have a 98% chance that a detected
trade was actually rogue?

Exercise 1.14. Your corporate finance professor uses a few jokes to add levity to his lectures.
Each week he tells 3 different jokes. However, he is also very busy, and so forgets week to
week which jokes were used.

i. Assuming he has 12 jokes, what is the probability of 1 repeat across 2 consecutive


weeks?

ii. What is the probability of hearing 2 of the same jokes in consecutive weeks?

iii. What is the probability that all 3 jokes are the same?

iv. Assuming the term is 8 weeks long, and that your professor has 96 jokes, what is the
probability that there is no repetition across the term? Note: he remembers the jokes
he gives in a particular lecture, only forgets across lectures.

v. How many jokes would your professor need to know to have a 99% chance of not
repeating any in the term?

Exercise 1.15. A hedge fund company manages three distinct funds. In any given month,
the probability that the return is positive is shown in the following table:
Pr(r_{1,t} > 0) = .55        Pr(r_{1,t} > 0 ∪ r_{2,t} > 0) = .82
Pr(r_{2,t} > 0) = .60        Pr(r_{1,t} > 0 ∪ r_{3,t} > 0) = .7525
Pr(r_{3,t} > 0) = .45        Pr(r_{2,t} > 0 ∪ r_{3,t} > 0) = .78
Pr(r_{2,t} > 0 ∩ r_{3,t} > 0 | r_{1,t} > 0) = .20
i. Are the events of “positive returns” pairwise independent?

ii. Are the events of “positive returns” independent?

iii. What is the probability that funds 1 and 2 have positive returns, given that fund 3 has
a positive return?
iv. What is the probability that at least one fund will have a positive return in any given month?

Exercise 1.16. Suppose the probabilities of three events, A, B and C are as depicted in the
following diagram:

[Venn diagram of events A, B and C with region probabilities .15, .10, .15, .10, .05, .05 and .175.]

i. Are the three events pairwise independent?

ii. Are the three events independent?

iii. What is Pr(A ∩ B)?

iv. What is Pr(A ∩ B | C)?

v. What is Pr(C | A ∩ B)?

vi. What is Pr(C | A ∪ B)?

Exercise 1.17. At a small high-frequency hedge fund, two competing algorithms produce
trades. Algorithm α produces 80 trades per second and 5% lose money. Algorithm β pro-
duces 20 trades per second but only 1% lose money. Given the last trade lost money, what
is the probability it was produced by algorithm β ?

Exercise 1.18. Suppose f (x , y ) = 2 − x − y where x ∈ [0, 1] and y ∈ [0, 1].

i. What is Pr (X > .75 ∩ Y > .75)?

ii. What is Pr (X + Y > 1.5)?

iii. Show formally whether X and Y are independent.

iv. What is Pr(Y < .5 | X = x)?

Exercise 1.19. Suppose f (x , y ) = x y for x ∈ [0, 1] and y ∈ [0, 2].

i. What is the joint cdf?



ii. What is Pr (X < 0.5 ∩ Y < 1)?

iii. What is the marginal cdf of X ? What is Pr (X < 0.5)?

iv. What is the marginal density of X ?

v. Are X and Y independent?

Exercise 1.20. Suppose F(x) = 1 − p^{x+1} for x ∈ {0, 1, 2, . . .} and p ∈ (0, 1).

i. Find the pmf.

ii. Verify that the pmf is valid.

iii. What is Pr (X ≤ 8) if p = .75?

iv. What is Pr (X ≤ 1) given X ≤ 8?

Exercise 1.21. A firm producing widgets has a production function q (L ) = L 0.5 where L is
the amount of labor. Sales prices fluctuate randomly and can be $10 (20%), $20 (50%) or
$30 (30%). Labor prices also vary and can be $1 (40%), 2 (30%) or 3 (30%). The firm always
maximizes profits after seeing both sales prices and labor prices.

i. Describe the distribution of possible profits.

ii. What is the probability that the firm makes at least $100?

iii. Given the firm makes a profit of $100, what is the probability that the profit is over
$200?

Exercise 1.22. A fund manager tells you that her fund has non-linear returns as a function of the market and that its return is r_{i,t} = 0.02 + 2r_{m,t} − 0.5r²_{m,t} where r_{i,t} is the return on the fund and r_{m,t} is the return on the market.

i. She tells you her expectation of the market return this year is 10%, and that her fund will have an expected return of 22%. Can this be?

ii. At what variance of the market return would the expected return on the fund be negative?

Exercise 1.23. For the following densities, find the mean (if it exists), variance (if it exists), median and mode, and indicate whether the density is symmetric.

i. f(x) = 3x² for x ∈ [0, 1]

ii. f(x) = 2x^{-3} for x ∈ [1, ∞)

iii. f(x) = [π(1 + x²)]^{-1} for x ∈ (−∞, ∞)

iv. f(x) = \binom{4}{x} .2^x .8^{4-x} for x ∈ {0, 1, 2, 3, 4}

Exercise 1.24. The daily price of a stock has an average value of £2. Then Pr(X > 10) < .2 where X denotes the price of the stock. True or false?

Exercise 1.25. An investor can invest in stocks or bonds which have expected returns and covariances as
\[
\mu = \begin{bmatrix} .10 \\ .03 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} .04 & -.003 \\ -.003 & .0009 \end{bmatrix}
\]
where stocks are the first component.

i. Suppose the investor has £1,000 to invest, and splits the investment evenly. What is the expected return, standard deviation, variance and Sharpe Ratio (µ/σ) for the investment?

ii. Now suppose the investor seeks to maximize her expected utility where her utility is defined in terms of her portfolio return, U(r) = E[r] − .01V[r]. How much should she invest in each asset?

Exercise 1.26. Suppose f(x) = (1 − p)^x p for x ∈ {0, 1, . . .} and p ∈ (0, 1]. Show that a random variable from the distribution is “memoryless” in the sense that Pr(X ≥ s + r | X ≥ r) = Pr(X ≥ s). In other words, the probability of surviving s or more periods is the same whether starting at 0 or after having survived r periods.

Exercise 1.27. Your Economics professor offers to play a game with you. You pay £1,000 to play and your Economics professor will flip a fair coin and pay you 2^x where x is the number of tries required for the coin to show heads.

i. What is the pmf of X ?

ii. What is the expected payout from this game?

Exercise 1.28. Consider the roll of a fair pair of dice where a roll of a 7 or 11 pays 2x and
anything else pays −x where x is the amount bet. Is this game fair?

Exercise 1.29. Suppose the joint density function of X and Y is given by f (x , y ) = 1/2 x exp (−x y )
where x ∈ [3, 5] and y ∈ (0, ∞).

i. Give the form of E[Y | X = x].

ii. Graph the conditional expectation curve.

Exercise 1.30. Suppose a fund manager has $10,000 of yours under management and tells
you that the expected value of your portfolio in two years time is $30,000 and that with
probability 75% your investment will be worth at least $40,000 in two years time.

i. Do you believe her?



ii. Next, suppose she tells you that the standard deviation of your portfolio value is 2,000.
Assuming this is true (as is the expected value), what is the most you can say about
the probability your portfolio value falls between $20,000 and $40,000 in two years
time?

Exercise 1.31. Suppose the joint probability density function of two random variables is given by f(x, y) = (2/5)(3x + 2y) where x ∈ [0, 1] and y ∈ [0, 1].

i. What is the marginal probability density function of X?

ii. What is E[X | Y = y]? Are X and Y independent? (Hint: What must the form of E[X | Y] be when they are independent?)

Exercise 1.32. Let Y be distributed χ²_{15}.

i. What is Pr(y > 27.488)?

ii. What is Pr(6.262 ≤ y ≤ 27.488)?

iii. Find C where Pr(y ≥ C) = α for α ∈ {0.01, 0.05, 0.10}.

Next, suppose Z is distributed χ²_5 and is independent of Y.

iv. Find C where Pr(y + z ≥ C) = α for α ∈ {0.01, 0.05, 0.10}.

Exercise 1.33. Suppose X is a bivariate random variable with parameters
\[
\mu = \begin{bmatrix} 5 \\ 8 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 2 & -1 \\ -1 & 3 \end{bmatrix}.
\]

i. What is E[X_1 | X_2]?

ii. What is V[X_1 | X_2]?

iii. Show (numerically) that the law of total variance holds for X_2.

Exercise 1.34. Suppose y ∼ N (5, 36) and x ∼ N (4, 25) where X and Y are independent.

i. What is Pr (y > 10)?

ii. What is Pr (−10 < y < 10)?

iii. What is Pr (x − y > 0)?

iv. Find C where Pr (x − y > C ) = α for α ∈ {0.10, 0.05, 0.01}?


Chapter 2

Estimation, Inference and Hypothesis Testing

Note: The primary reference for these notes is Ch. 7 and 8 of Casella & Berger (2001). This
text may be challenging if new to this topic and Ch. 7 – 10 of Wackerly, Mendenhall &
Scheaffer (2001) may be useful as an introduction.
This chapter provides an overview of estimation, distribution theory, inference
and hypothesis testing. Testing an economic or financial theory is a multi-step
process. First, any unknown parameters must be estimated. Next, the distribu-
tion of the estimator must be determined. Finally, formal hypothesis tests must
be conducted to examine whether the data are consistent with the theory. This
chapter is intentionally “generic” by design and focuses on the case where the data
are independent and identically distributed. Properties of specific models will be
studied in detail in the chapters on linear regression, time series and univariate
volatility modeling.

Three steps must be completed to test the implications of an economic theory:


• Estimate unknown parameters

• Determine the distribution of the estimator

• Conduct hypothesis tests to examine whether the data are compatible with a theo-
retical model
This chapter covers each of these steps with a focus on the case where the data is indepen-
dent and identically distributed (i.i.d.). The heterogeneous but independent case will be
covered in the chapter on linear regression and the dependent case will be covered in the
chapters on time series.

2.1 Estimation
Once a model has been specified and hypotheses postulated, the first step is to estimate
the parameters of the model. Many methods are available to accomplish this task. These

include parametric, semi-parametric, semi-nonparametric and nonparametric estimators


and a variety of estimation methods often classified as M-, R- and L-estimators.1
Parametric models are tightly parameterized and have desirable statistical properties
when their specification is correct, such as providing consistent estimates with small vari-
ances. Nonparametric estimators are more flexible and avoid making strong assumptions
about the relationship between variables. This allows nonparametric estimators to capture
a wide range of relationships but comes at the cost of precision. In many cases, nonpara-
metric estimators are said to have a slower rate of convergence than similar parametric
estimators. The practical consequence of the rate is that nonparametric estimators are de-
sirable when there is a proliferation of data and the relationships between variables may
be difficult to postulate a priori. In situations where less data is available, or when an eco-
nomic model proffers a relationship among variables, parametric estimators are generally
preferable.
Semi-parametric and semi-nonparametric estimators bridge the gap between fully para-
metric estimators and nonparametric estimators. Their difference lies in “how paramet-
ric” the model and estimator are. Estimators which postulate parametric relationships be-
tween variables but estimate the underlying distribution of errors flexibly are semi-parametric.
Estimators which take a stand on the distribution of the errors but allow for flexible rela-
tionships between variables are semi-nonparametric. This chapter focuses exclusively on
parametric models and estimators. This choice is more reflective of common practice than
a critique of parametric and nonparametric methods.
The other important characterization of estimators is whether they are members of the
M-, L- or R-estimator classes.2 M-estimators (also known as extremum estimators) always
involve maximizing or minimizing some objective function. M-estimators are the common
in financial econometrics and include maximum likelihood, regression, classical minimum
distance and both the classical and the generalized method of moments. L-estimators, also
known as linear estimators, are a class where the estimator can be expressed as a linear
function of ordered data. Members of this family can always be written as
n
X
wi yi
i =1

for some set of weights {wi } where the data, yi , are ordered such that y j −1 ≤ y j for j =
2, 3, . . . , n . This class of estimators obviously includes the sample mean by setting wi = n1
for all i , and it also includes the median by setting wi = 0 for all i except w j = 1 where
j = (n + 1)/2 (n is odd) or w j = w j +1 = 1/2 where j = n /2 (n is even). R-estimators exploit

1
There is another important dimension in the categorization of estimators: Bayesian or frequentist.
Bayesian estimators make use of Bayes rule to perform inference about unknown quantities – parameters
– conditioning on the observed data. Frequentist estimators rely on randomness averaging out across obser-
vations. Frequentist methods are dominant in financial econometrics although the use of Bayesian methods
has been recently increasing.
2
Many estimators are members of more than one class. For example, the median is a member of all three.

the rank of the data. Common examples of R-estimators include the minimum, maximum
and Spearman’s rank correlation, which is the usual correlation estimator on the ranks of
the data rather than on the data themselves. Rank statistics are often robust to outliers and
non-linearities.

2.1.1 M-Estimators

The use of M-estimators is pervasive in financial econometrics. Three common types of


M-estimators include the method of moments, both classical and generalized, maximum
likelihood and classical minimum distance.

2.1.2 Maximum Likelihood

Maximum likelihood uses the distribution of the data to estimate any unknown parame-
ters by finding the values which make the data as likely as possible to have been observed
– in other words, by maximizing the likelihood. Maximum likelihood estimation begins by
specifying the joint distribution, f (y; θ ), of the observable data, y = {y1 , y2 , . . . , yn }, as a
function of a k by 1 vector θ which contains all parameters. Note that this is the joint den-
sity, and so it includes both the information in the marginal distributions of yi and infor-
mation relating the marginals to one another.3 Maximum likelihood estimation “reverses”
the likelihood to express the probability of θ in terms of the observed y, L (θ ; y) = f (y; θ ).
The maximum likelihood estimator, θ̂, is defined as the solution to
\[
\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\ L(\theta; \mathbf{y}) \tag{2.1}
\]

where argmax is used in place of max to indicate that the maximum may not be unique – it
could be set valued – and to indicate that the global maximum is required.4 Since L (θ ; y) is
strictly positive, the log of the likelihood can be used to estimate θ .5 The log-likelihood is
defined as l (θ ; y) = ln L (θ ; y). In most situations the maximum likelihood estimator (MLE)
can be found by solving the k by 1 score vector,
3
Formally the relationship between the marginals is known as the copula. Copulas and their use in financial econometrics will be explored in the second term.
4
Many likelihoods have more than one maximum (i.e. local maxima). The maximum likelihood estimator
is always defined as the global maximum.
5
Note that the log transformation is strictly increasing and globally concave. If z* is the maximum of g(z), so that
\[
\left.\frac{\partial g(z)}{\partial z}\right|_{z = z^\star} = 0,
\]
then z* must also be the maximum of ln(g(z)) since
\[
\left.\frac{\partial \ln(g(z))}{\partial z}\right|_{z = z^\star} = \left.\frac{g'(z)}{g(z)}\right|_{z = z^\star} = \frac{0}{g(z^\star)} = 0,
\]
which follows since g(z) > 0 for any value of z.



\[
\frac{\partial l(\theta; \mathbf{y})}{\partial \theta} = 0,
\]
although a score-based solution does not work when θ is constrained and θ̂ lies on the
boundary of the parameter space or when the permissible range of values for yi depends on
θ . The first problem is common enough that it is worth keeping in mind. It is particularly
common when working with variances which must be (weakly) positive by construction.
The second issue is fairly rare in financial econometrics.

2.1.2.1 Maximum Likelihood Estimation of a Poisson Model

Realizations from a Poisson process are non-negative and discrete. The Poisson is com-
mon in ultra high-frequency econometrics where the usual assumption that prices lie in
a continuous space is implausible. For example, trade prices of US equities evolve on a
grid of prices typically separated by $0.01. Suppose the y_i are i.i.d. Poisson(λ). The pdf of a single observation is
\[
f(y_i; \lambda) = \frac{\exp(-\lambda)\lambda^{y_i}}{y_i!} \tag{2.2}
\]
and since the data are independent and identically distributed (i.i.d.), the joint likelihood is simply the product of the n individual likelihoods,
\[
f(\mathbf{y}; \lambda) = L(\lambda; \mathbf{y}) = \prod_{i=1}^n \frac{\exp(-\lambda)\lambda^{y_i}}{y_i!}.
\]
The log-likelihood is
\[
l(\lambda; \mathbf{y}) = \sum_{i=1}^n -\lambda + y_i \ln(\lambda) - \ln(y_i!) \tag{2.3}
\]
which can be further simplified to
\[
l(\lambda; \mathbf{y}) = -n\lambda + \ln(\lambda)\sum_{i=1}^n y_i - \sum_{j=1}^n \ln(y_j!).
\]
The first derivative is
\[
\frac{\partial l(\lambda; \mathbf{y})}{\partial \lambda} = -n + \lambda^{-1}\sum_{i=1}^n y_i . \tag{2.4}
\]
The MLE is found by setting the derivative to 0 and solving,
\[
\begin{aligned}
-n + \hat{\lambda}^{-1}\sum_{i=1}^n y_i &= 0 \\
\hat{\lambda}^{-1}\sum_{i=1}^n y_i &= n \\
\sum_{i=1}^n y_i &= n\hat{\lambda} \\
\hat{\lambda} &= n^{-1}\sum_{i=1}^n y_i .
\end{aligned}
\]

Thus the maximum likelihood estimator in a Poisson is the sample mean.
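This can also be confirmed numerically. The sketch below (a Python illustration assuming NumPy and SciPy; the simulated data and the true λ are arbitrary choices) maximizes the Poisson log-likelihood with a generic optimizer and compares the maximizer to the sample mean:

```python
# Check that the Poisson MLE equals the sample mean by maximizing the
# log-likelihood numerically (the gammaln term does not affect the argmax).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.5, size=1000)

def neg_loglik(lam):
    return -np.sum(-lam + y * np.log(lam) - gammaln(y + 1))

res = minimize_scalar(neg_loglik, bounds=(0.01, 20.0), method="bounded")
print(res.x, y.mean())   # the two values agree up to optimizer tolerance
```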

2.1.2.2 Maximum Likelihood Estimation of a Normal (Gaussian) Model

Suppose y_i is assumed to be i.i.d. normally distributed with mean µ and variance σ². The pdf of a normal is
\[
f(y_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \tag{2.5}
\]
where θ = (µ, σ²)′. The joint likelihood is the product of the n individual likelihoods,
\[
f(\mathbf{y}; \theta) = L(\theta; \mathbf{y}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right).
\]
Taking logs,
\[
l(\theta; \mathbf{y}) = \sum_{i=1}^n -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2) - \frac{(y_i - \mu)^2}{2\sigma^2} \tag{2.6}
\]
which can be simplified to
\[
l(\theta; \mathbf{y}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}.
\]
Taking the derivative with respect to the parameters θ = (µ, σ²)′,
\[
\frac{\partial l(\theta; \mathbf{y})}{\partial \mu} = \sum_{i=1}^n \frac{(y_i - \mu)}{\sigma^2} \tag{2.7}
\]
\[
\frac{\partial l(\theta; \mathbf{y})}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^4}. \tag{2.8}
\]
Setting these equal to zero, the first condition can be directly solved by multiplying both sides by σ̂², assumed positive, and the estimator for µ is the sample average.

\[
\begin{aligned}
\sum_{i=1}^n \frac{(y_i - \hat{\mu})}{\hat{\sigma}^2} &= 0 \\
\hat{\sigma}^2 \sum_{i=1}^n \frac{(y_i - \hat{\mu})}{\hat{\sigma}^2} &= \hat{\sigma}^2 \cdot 0 \\
\sum_{i=1}^n y_i - n\hat{\mu} &= 0 \\
\hat{\mu} &= n^{-1}\sum_{i=1}^n y_i
\end{aligned}
\]
Plugging this value into the second score and setting it equal to 0, the ML estimator of σ² is
\[
\begin{aligned}
-\frac{n}{2\hat{\sigma}^2} + \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \hat{\mu})^2}{\hat{\sigma}^4} &= 0 \\
2\hat{\sigma}^4\left(-\frac{n}{2\hat{\sigma}^2} + \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \hat{\mu})^2}{\hat{\sigma}^4}\right) &= 2\hat{\sigma}^4 \cdot 0 \\
-n\hat{\sigma}^2 + \sum_{i=1}^n (y_i - \hat{\mu})^2 &= 0 \\
\hat{\sigma}^2 &= n^{-1}\sum_{i=1}^n (y_i - \hat{\mu})^2 .
\end{aligned}
\]

2.1.3 Conditional Maximum Likelihood

Interest often lies in the distribution of a random variable conditional on one or more ob-
served values, where the distribution of the observed values is not of interest. When this
occurs, it is natural to use conditional maximum likelihood. Suppose interest lies in model-
ing a random variable Y conditional on one or more variables X. The likelihood for a single observation is f(y_i | x_i), and when the Y_i are conditionally i.i.d., then
\[
L(\theta; \mathbf{y}|\mathbf{X}) = \prod_{i=1}^n f(y_i | x_i),
\]
and the log-likelihood is
\[
l(\theta; \mathbf{y}|\mathbf{X}) = \sum_{i=1}^n \ln f(y_i | x_i).
\]

The conditional likelihood is not usually sufficient to estimate parameters since the relationship between Y and X has not been specified. Conditional maximum likelihood specifies the model parameters conditionally on x_i. For example, in a conditional normal, y_i|x_i ∼ N(µ_i, σ²) where µ_i = g(β, x_i) is some function which links parameters and conditioning variables. In many applications a linear relationship is assumed so that
\[
\begin{aligned}
y_i &= \beta' x_i + \varepsilon_i \\
    &= \sum_{j=1}^k \beta_j x_{i,j} + \varepsilon_i \\
    &= \mu_i + \varepsilon_i .
\end{aligned}
\]
Other relationships are possible, including functions g(β′x_i) which limit the range of β′x_i, such as exp(β′x_i) (positive numbers), the normal cdf (Φ(β′x_i)) or the logistic function, exp(β′x_i)/(1 + exp(β′x_i)) (both of which limit the range to (0, 1)).

2.1.3.1 Example: Conditional Bernoulli

Suppose Y_i and X_i are Bernoulli random variables where the conditional distribution of Y_i given X_i is
\[
y_i | x_i \sim \text{Bernoulli}(\theta_0 + \theta_1 x_i)
\]
so that the conditional probability of observing a success (y_i = 1) is p_i = θ_0 + θ_1 x_i. The conditional likelihood is
\[
L(\theta; \mathbf{y}|\mathbf{x}) = \prod_{i=1}^n (\theta_0 + \theta_1 x_i)^{y_i}\left(1 - (\theta_0 + \theta_1 x_i)\right)^{1 - y_i},
\]
the conditional log-likelihood is
\[
l(\theta; \mathbf{y}|\mathbf{x}) = \sum_{i=1}^n y_i \ln(\theta_0 + \theta_1 x_i) + (1 - y_i)\ln\left(1 - (\theta_0 + \theta_1 x_i)\right),
\]
and the maximum likelihood estimator can be found by differentiation,
\[
\frac{\partial l(\hat{\theta}; \mathbf{y}|\mathbf{x})}{\partial \theta_0} = \sum_{i=1}^n \frac{y_i}{\hat{\theta}_0 + \hat{\theta}_1 x_i} - \frac{1 - y_i}{1 - \hat{\theta}_0 - \hat{\theta}_1 x_i} = 0
\]
\[
\frac{\partial l(\hat{\theta}; \mathbf{y}|\mathbf{x})}{\partial \theta_1} = \sum_{i=1}^n \frac{x_i y_i}{\hat{\theta}_0 + \hat{\theta}_1 x_i} - \frac{x_i(1 - y_i)}{1 - \hat{\theta}_0 - \hat{\theta}_1 x_i} = 0.
\]

Using the fact that x_i is also Bernoulli, and defining n_x = \sum_{i=1}^n x_i, n_y = \sum_{i=1}^n y_i and n_{xy} = \sum_{i=1}^n x_i y_i, the second score can be solved,
\[
\begin{aligned}
0 &= \sum_{i=1}^n x_i\left(\frac{y_i}{\hat{\theta}_0 + \hat{\theta}_1} - \frac{1 - y_i}{1 - \hat{\theta}_0 - \hat{\theta}_1}\right) = \frac{n_{xy}}{\hat{\theta}_0 + \hat{\theta}_1} - \frac{n_x - n_{xy}}{1 - \hat{\theta}_0 - \hat{\theta}_1} \\
  &= n_{xy}\left(1 - \left(\hat{\theta}_0 + \hat{\theta}_1\right)\right) - \left(n_x - n_{xy}\right)\left(\hat{\theta}_0 + \hat{\theta}_1\right) \\
  &= n_{xy} - n_{xy}\left(\hat{\theta}_0 + \hat{\theta}_1\right) - n_x\left(\hat{\theta}_0 + \hat{\theta}_1\right) + n_{xy}\left(\hat{\theta}_0 + \hat{\theta}_1\right) \\
\hat{\theta}_0 + \hat{\theta}_1 &= \frac{n_{xy}}{n_x}.
\end{aligned}
\]
The first score can then also be rewritten as
\[
\begin{aligned}
0 &= \sum_{i=1}^n \frac{y_i}{\hat{\theta}_0 + \hat{\theta}_1 x_i} - \frac{1 - y_i}{1 - \hat{\theta}_0 - \hat{\theta}_1 x_i} = \sum_{i=1}^n \frac{y_i(1 - x_i)}{\hat{\theta}_0} + \frac{y_i x_i}{\hat{\theta}_0 + \hat{\theta}_1} - \frac{(1 - y_i)(1 - x_i)}{1 - \hat{\theta}_0} - \frac{(1 - y_i)x_i}{1 - \hat{\theta}_0 - \hat{\theta}_1} \\
  &= \sum_{i=1}^n \frac{y_i(1 - x_i)}{\hat{\theta}_0} - \frac{(1 - y_i)(1 - x_i)}{1 - \hat{\theta}_0} + \left\{\frac{x_i y_i}{\hat{\theta}_0 + \hat{\theta}_1} - \frac{x_i(1 - y_i)}{1 - \hat{\theta}_0 - \hat{\theta}_1}\right\} \\
  &= \frac{n_y - n_{xy}}{\hat{\theta}_0} - \frac{n - n_y - n_x + n_{xy}}{1 - \hat{\theta}_0} + \{0\} \\
  &= n_y - n_{xy} - \hat{\theta}_0 n_y + \hat{\theta}_0 n_{xy} - \hat{\theta}_0 n + \hat{\theta}_0 n_y + \hat{\theta}_0 n_x - \hat{\theta}_0 n_{xy} \\
\hat{\theta}_0 &= \frac{n_y - n_{xy}}{n - n_x}
\end{aligned}
\]
so that \hat{\theta}_1 = \frac{n_{xy}}{n_x} - \frac{n_y - n_{xy}}{n - n_x}. The “0” in the previous derivation follows from noting that the quantity in braces is equivalent to the second score and so is 0 at the MLE. If X_i were not a Bernoulli random variable, then it would not be possible to analytically solve this problem. In these cases, numerical methods are needed.6

2.1.3.2 Example: Conditional Normal

Suppose µ_i = βx_i where Y_i given X_i is conditionally normal. Assuming that the Y_i are conditionally i.i.d., the likelihood and log-likelihood are
\[
L(\theta; \mathbf{y}|\mathbf{x}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta x_i)^2}{2\sigma^2}\right)
\]
\[
l(\theta; \mathbf{y}|\mathbf{x}) = \sum_{i=1}^n -\frac{1}{2}\left(\ln(2\pi) + \ln(\sigma^2) + \frac{(y_i - \beta x_i)^2}{\sigma^2}\right).
\]

6
When X_i is not Bernoulli, it is also usually necessary to use a function to ensure p_i, the conditional probability, is in [0, 1]. Two common choices are the normal cdf and the logistic function.

The scores of the likelihood are
\[
\frac{\partial l(\theta; \mathbf{y}|\mathbf{x})}{\partial \beta} = \sum_{i=1}^n \frac{x_i\left(y_i - \hat{\beta}x_i\right)}{\hat{\sigma}^2} = 0
\]
\[
\frac{\partial l(\theta; \mathbf{y}|\mathbf{x})}{\partial \sigma^2} = -\frac{1}{2}\sum_{i=1}^n \frac{1}{\hat{\sigma}^2} - \frac{\left(y_i - \hat{\beta}x_i\right)^2}{\left(\hat{\sigma}^2\right)^2} = 0 .
\]
After multiplying both sides of the first score by σ̂², and both sides of the second score by −2σ̂⁴, solving the scores is straightforward, and so
\[
\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{j=1}^n x_j^2}
\]
\[
\hat{\sigma}^2 = n^{-1}\sum_{i=1}^n \left(y_i - \hat{\beta} x_i\right)^2 .
\]
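These closed forms coincide with a least squares fit without an intercept. A short numerical check (a Python sketch assuming NumPy; the simulated β and σ² are arbitrary choices made only for the illustration):

```python
# Verify that the conditional-normal ML estimators match the closed forms above:
# beta_hat = sum(x*y)/sum(x^2) and sigma2_hat = mean of squared residuals.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(1.0, 1.0, size=n)
y = 0.5 * x + rng.normal(0.0, 2.0, size=n)   # true beta = 0.5, sigma^2 = 4

beta_hat = np.sum(x * y) / np.sum(x ** 2)
sigma2_hat = np.mean((y - beta_hat * x) ** 2)
print(beta_hat, sigma2_hat)                  # close to 0.5 and 4
```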

2.1.3.3 Example: Conditional Poisson

Suppose Y_i, conditional on X_i, is i.i.d. distributed Poisson(λ_i) where λ_i = exp(θx_i). The likelihood and log-likelihood are
\[
L(\theta; \mathbf{y}|\mathbf{x}) = \prod_{i=1}^n \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!}
\]
\[
l(\theta; \mathbf{y}|\mathbf{x}) = \sum_{i=1}^n -\exp(\theta x_i) + y_i(\theta x_i) - \ln(y_i!).
\]
The score of the likelihood is
\[
\frac{\partial l(\theta; \mathbf{y}|\mathbf{x})}{\partial \theta} = \sum_{i=1}^n -x_i \exp\left(\hat{\theta} x_i\right) + x_i y_i = 0.
\]
This score cannot be analytically solved and so a numerical optimizer must be used to find the solution. It is possible, however, to show that the score has conditional expectation 0 since E[Y_i|X_i] = λ_i,
\[
\begin{aligned}
E\left[\frac{\partial l(\theta; \mathbf{y}|\mathbf{x})}{\partial \theta} \,\middle|\, \mathbf{X}\right] &= E\left[\sum_{i=1}^n -x_i\exp(\theta x_i) + x_i y_i \,\middle|\, \mathbf{X}\right] \\
  &= \sum_{i=1}^n E\left[-x_i\exp(\theta x_i)\mid\mathbf{X}\right] + E\left[x_i y_i\mid\mathbf{X}\right] \\
  &= \sum_{i=1}^n -x_i\lambda_i + x_i E\left[y_i\mid\mathbf{X}\right] \\
  &= \sum_{i=1}^n -x_i\lambda_i + x_i\lambda_i = 0.
\end{aligned}
\]
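In practice the score is solved with a numerical optimizer. A minimal Python sketch (assuming NumPy and SciPy; the data generating process and the true θ are arbitrary choices used only for illustration):

```python
# Numerically maximize the conditional Poisson log-likelihood with
# lambda_i = exp(theta * x_i).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(3)
n, theta0 = 5000, 0.7
x = rng.normal(size=n)
y = rng.poisson(np.exp(theta0 * x))

def neg_loglik(theta):
    lam = np.exp(theta * x)
    return -np.sum(-lam + y * theta * x - gammaln(y + 1))

res = minimize_scalar(neg_loglik, bounds=(-5.0, 5.0), method="bounded")
print(res.x)   # close to the true value 0.7
```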

2.1.4 Method of Moments


Method of moments, often referred to as the classical method of moments to differenti-
ate it from the generalized method of moments (GMM, chapter 6) uses the data to match
noncentral moments.

Definition 2.1 (Noncentral Moment). The rth noncentral moment is defined as
\[
\mu_r' \equiv E\left[X^r\right] \tag{2.9}
\]
for r = 1, 2, \ldots.

Central moments are similarly defined, only centered around the mean.

Definition 2.2 (Central Moment). The rth central moment is defined as
\[
\mu_r \equiv E\left[\left(X - \mu_1'\right)^r\right] \tag{2.10}
\]
for r = 2, 3, \ldots, where the 1st central moment is defined to be equal to the 1st noncentral moment.

Since E[x_i^r] is not known, any estimator based on it is infeasible. The obvious solution is to use the sample analogue to estimate its value, and the feasible method of moments estimator is
\[
\hat{\mu}_r' = n^{-1}\sum_{i=1}^n x_i^r , \tag{2.11}
\]
the sample average of the data raised to the rth power. While the classical method of moments was originally specified using noncentral moments, the central moments are usually the quantities of interest. The central moments can be directly estimated,
\[
\hat{\mu}_r = n^{-1}\sum_{i=1}^n (x_i - \hat{\mu}_1)^r , \tag{2.12}
\]
which is simple to implement by first estimating the mean (µ̂_1) and then estimating the remaining central moments. An alternative is to expand the noncentral moment in terms of

central moments. For example, the second noncentral moment can be expanded in terms of the first two central moments,
\[
\mu_2' = \mu_2 + \mu_1^2 ,
\]
which is the usual identity that states that the expectation of a random variable squared, E[x_i^2], is equal to the variance, µ_2 = σ², plus the mean squared, µ_1². Likewise, it is easy to show that
\[
\mu_3' = \mu_3 + 3\mu_2\mu_1 + \mu_1^3
\]
directly by expanding E[(X − µ_1)^3] and solving for µ_3'. To understand that the method of moments is in the class of M-estimators, note that the expression in eq. (2.12) is the first order condition of a simple quadratic form,
\[
\underset{\mu, \mu_2, \ldots, \mu_k}{\operatorname{argmin}}\ \left(n^{-1}\sum_{i=1}^n x_i - \mu_1\right)^2 + \sum_{j=2}^k \left(n^{-1}\sum_{i=1}^n (x_i - \mu)^j - \mu_j\right)^2 , \tag{2.13}
\]
and since the number of unknown parameters is identical to the number of equations, the solution is exact.7

7
Note that µ_1, the mean, is generally denoted with the subscript suppressed as µ.

2.1.4.1 Method of Moments Estimation of the Mean and Variance

The classical method of moments estimator for the mean and variance for a set of i.i.d. data {y_i}_{i=1}^n where E[Y_i] = µ and E[(Y_i − µ)²] = σ² is given by estimating the first two noncentral moments and then solving for σ²,
\[
\hat{\mu} = n^{-1}\sum_{i=1}^n y_i
\]
\[
\hat{\sigma}^2 + \hat{\mu}^2 = n^{-1}\sum_{i=1}^n y_i^2 ,
\]
and thus the variance estimator is \hat{\sigma}^2 = n^{-1}\sum_{i=1}^n y_i^2 - \hat{\mu}^2. Following some algebra, it is simple to show that the central moment estimator could be used equivalently, and so \hat{\sigma}^2 = n^{-1}\sum_{i=1}^n (y_i - \hat{\mu})^2.
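A minimal implementation of these two estimators (a Python sketch assuming NumPy; the Student's t data used to generate the sample is an arbitrary choice):

```python
# Classical method of moments estimators for the mean and variance.
import numpy as np

rng = np.random.default_rng(4)
y = rng.standard_t(df=8, size=100_000) * 2 + 1    # true mean 1, variance 4*(8/6) = 16/3

mu_hat = y.mean()
sigma2_hat = np.mean(y ** 2) - mu_hat ** 2        # noncentral-moment version
sigma2_hat_central = np.mean((y - mu_hat) ** 2)   # central-moment version

print(mu_hat, sigma2_hat, sigma2_hat_central)     # the two variance estimators coincide
```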

2.1.4.2 Method of Moments Estimation of the Range of a Uniform

Consider a set of realizations of a random variable with a uniform density over [0, θ], so the y_i are i.i.d. U(0, θ). The expectation of y_i is E[Y_i] = θ/2, and so the method of moments estimator for the upper bound is
\[
\hat{\theta} = 2 n^{-1}\sum_{i=1}^n y_i .
\]

2.1.5 Classical Minimum Distance


A third – and less frequently encountered – type of M-estimator is classical minimum dis-
tance (CMD) which is also known as minimum χ 2 in some circumstances. CMD differs
from MLE and the method of moments in that it is an estimator that operates using initial
parameter estimates produced by another estimator rather than on the data directly. CMD
is most common when a simple MLE or moment-based estimator is available that can es-
timate a model without some economically motivated constraints on the parameters. This
initial estimator, ψ̂ is then used to estimate the parameters of the model, θ , by minimizing
a quadratic function of the form
\[
\hat{\theta} = \underset{\theta}{\operatorname{argmin}}\ \big(\hat{\psi} - g(\theta)\big)' \mathbf{W} \big(\hat{\psi} - g(\theta)\big) \tag{2.14}
\]

where W is a positive definite weighting matrix. When W is chosen as the covariance of ψ̂,
the CMD estimator becomes the minimum-χ 2 estimator since outer products of standard-
ized normals are χ 2 random variables.

2.2 Convergence and Limits for Random Variables


Before turning to properties of estimators, it is useful to discuss some common measures
of convergence for sequences. Before turning to the alternative definitions which are ap-
propriate for random variables, recall the definition of a limit of a non-stochastic sequence.

Definition 2.3 (Limit). Let {x_n} be a non-stochastic sequence. If, for every ε > 0, there exists an N such that |x_n − x| < ε for every n > N, then x is called the limit of x_n. When this occurs, x_n → x or lim_{n→∞} x_n = x.

A limit is a point where a sequence will approach, and eventually, always remain near.
It isn’t necessary that the limit is ever attained, only that for any choice of ε > 0, xn will
eventually always be less than ε away from its limit.
Limits of random variables come in many forms. The first type of convergence is both the weakest and most abstract.

Definition 2.4 (Convergence in Distribution). Let {Y_n} be a sequence of random variables and let {F_n} be the associated sequence of cdfs. If there exists a cdf F where F_n(y) → F(y) for all y where F is continuous, then F is the limiting cdf of {Y_n}. Let Y be a random variable with cdf F; then Y_n converges in distribution to Y ∼ F, written Y_n →d Y ∼ F, or simply Y_n →d F.

[Figure: cdfs F_4, F_5, F_10 and F_100 plotted against the standard normal cdf.]

Figure 2.1: This figure shows a sequence of cdfs {F_i} that converge to the cdf of a standard normal.

Convergence in distribution means that the limiting cdf of a sequence of random variables is the same as the convergent random variable. This is a very weak form of convergence since all it requires is that the distributions are the same. For example, suppose {X_n} is an i.i.d. sequence of standard normal random variables, and Y is a standard normal random variable. X_n trivially converges in distribution to Y (X_n →d Y) even though Y is completely independent of {X_n} – the limiting cdf of X_n is merely the same as the cdf of Y. Despite the weakness of convergence in distribution, it is an essential notion of convergence that is used to perform inference on estimated parameters.

Figure 2.1 shows an example of a sequence of random variables which converge in distribution. The sequence is
\[
X_n = \sqrt{n}\,\frac{n^{-1}\sum_{i=1}^n Y_i - 1}{\sqrt{2}}
\]
where the Y_i are i.i.d. χ²_1 random variables. This is a studentized average since the variance of the average is 2/n and the mean is 1. By the time n = 100, F_100 is nearly indistinguishable from the standard normal cdf.
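The convergence displayed in Figure 2.1 is simple to reproduce by simulation. The sketch below (a Python illustration assuming NumPy and SciPy; the evaluation points and number of replications are arbitrary choices) compares the empirical cdf of X_n with the standard normal cdf:

```python
# Simulate X_n = sqrt(n) * (mean of chi^2_1 draws - 1) / sqrt(2) and compare
# its empirical cdf with the standard normal cdf at a few points.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
reps = 100_000
points = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])

for n in (5, 10, 100):
    y = rng.chisquare(df=1, size=(reps, n))
    xn = np.sqrt(n) * (y.mean(axis=1) - 1) / np.sqrt(2)
    ecdf = (xn[:, None] <= points).mean(axis=0)
    print(n, np.round(ecdf, 3), np.round(norm.cdf(points), 3))
```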
Convergence in distribution is preserved through functions.

Theorem 2.5 (Continuous Mapping Theorem). Let X_n →d X and let the random variable g(X) be defined by a function g(x) that is continuous everywhere except possibly on a set with zero probability. Then g(X_n) →d g(X).

The continuous mapping theorem is useful since it allows functions of sequences of random variables to be studied. For example, in hypothesis testing it is common to use
quadratic forms of normals, and when appropriately standardized, quadratic forms of nor-
mally distributed random variables follow a χ 2 distribution.
The next form of convergence is stronger than convergence in distribution since the
limit is to a specific target, not just a cdf.

Definition 2.6 (Convergence in Probability). The sequence of random variables {X_n} converges in probability to X if and only if
\[
\lim_{n\to\infty} \Pr\left(|X_{i,n} - X_i| < \varepsilon\right) = 1 \quad \forall \varepsilon > 0,\ \forall i .
\]
When this holds, X_n →p X or equivalently plim X_n = X (or plim (X_n − X) = 0) where plim is the probability limit.

Note that X can be either a random variable or a constant (degenerate random variable). For example, if X_n = n^{-1} + Z where Z is a normally distributed random variable, then X_n →p Z. Convergence in probability requires virtually all of the probability mass of X_n to lie near X. This is a very weak form of convergence since it is possible that a small amount of probability can be arbitrarily far away from X. Suppose a scalar random sequence {X_n} takes the value 0 with probability 1 − 1/n and n with probability 1/n. Then {X_n} →p 0 although E[X_n] = 1 for all n.
Convergence in probability, however, is strong enough that it is useful when studying random variables and functions of random variables.

Theorem 2.7. Let X_n →p X and let the random variable g(X) be defined by a function g(x) that is continuous everywhere except possibly on a set with zero probability. Then g(X_n) →p g(X) (or equivalently plim g(X_n) = g(X)).
This theorem has some simple, useful forms. Suppose the k-dimensional vector X_n →p X, the conformable vector Y_n →p Y and C is a conformable constant matrix, then

• plim CX_n = CX

• plim \sum_{i=1}^k X_{i,n} = \sum_{i=1}^k plim X_{i,n} – the plim of the sum is the sum of the plims

• plim \prod_{i=1}^k X_{i,n} = \prod_{i=1}^k plim X_{i,n} – the plim of the product is the product of the plims

• plim Y_n X_n = YX

• When Y_n is a square matrix and Y is nonsingular, then Y_n^{-1} →p Y^{-1} – the inverse function is continuous and so the plim of the inverse is the inverse of the plim

• When Y_n is a square matrix and Y is nonsingular, then Y_n^{-1} X_n →p Y^{-1} X.

These properties are very different from the expectations operator. In particular, the plim operator passes through functions which allows for broad application. For example,
\[
E\left[\frac{1}{X}\right] \ne \frac{1}{E[X]}
\]
whenever X is a non-degenerate random variable. However, if X_n →p X, then
\[
\operatorname{plim} \frac{1}{X_n} = \frac{1}{\operatorname{plim} X_n} = \frac{1}{X} .
\]
Alternative definitions of convergence strengthen convergence in probability. In particular, convergence in mean square requires that the expected squared deviation from the limit converges to zero.

Definition 2.8 (Convergence in Mean Square). The sequence of random variables {X_n} converges in mean square to X if and only if
\[
\lim_{n\to\infty} E\left[(X_{i,n} - X_i)^2\right] = 0 \quad \forall i .
\]
When this holds, X_n →m.s. X.

Mean square convergence is strong enough to ensure that, when the limit is a random variable X, then lim E[X_n] = E[X] and lim V[X_n] = V[X] – these relationships do not necessarily hold when only X_n →p X.

Theorem 2.9 (Convergence in mean square implies consistency). If X_n →m.s. X then X_n →p X.

This result follows directly from Chebyshev’s inequality. A final, and very strong, measure
of convergence for random variables is known as almost sure convergence.

Definition 2.10 (Almost sure convergence). The sequence of random variables {X_n} converges almost surely to X if and only if
\[
\Pr\left(\lim_{n\to\infty} X_{i,n} = X_i\right) = 1, \quad \forall i .
\]
When this holds, X_n →a.s. X.

Almost sure convergence requires all probability to be on the limit point. This is a
stronger condition than either convergence in probability or convergence in mean square,
both of which allow for some probability to be (relatively) far from the limit point.
Theorem 2.11 (Almost sure convergence implications). If X_n →a.s. X then X_n →p X.

Random variables which converge almost surely to a limit are asymptotically degener-
ate on that limit.
The Slutsky theorem combines variables which converge in distribution with variables
which converge in probability to show that the joint limit of functions behaves as expected.
Theorem 2.12 (Slutsky Theorem). Let X_n →d X and let Y_n →p C, a constant, then for conformable X and C,

1. X_n + Y_n →d X + C

2. Y_n X_n →d CX

3. Y_n^{-1} X_n →d C^{-1}X as long as C is non-singular.

This theorem is at the core of hypothesis testing where estimated parameters are often
asymptotically normal and an estimated parameter covariance, which converges in prob-
ability to the true covariance, is used to studentize the parameters.

2.3 Properties of Estimators


The first step in assessing the performance of an economic model is the estimation of the
parameters. There are a number of desirable properties estimators may possess.

2.3.1 Bias and Consistency

A natural question to ask about an estimator is whether, on average, it will be equal to the
population value of the parameter estimated. Any discrepancy between the expected value
of an estimator and the population parameter is known as bias.

Definition 2.13 (Bias). The bias of an estimator, θ̂, is defined as
\[
B\big[\hat{\theta}\big] = E\big[\hat{\theta}\big] - \theta_0 \tag{2.15}
\]
where θ_0 is used to denote the population (or “true”) value of the parameter.

When an estimator has a bias of 0 it is said to be unbiased. Unfortunately many esti-


mators are not unbiased. Consistency is a closely related concept that measures whether
a parameter will be far from the population value in large samples.

Definition 2.14 (Consistency). An estimator θ̂_n is said to be consistent if plim θ̂_n = θ_0. The explicit dependence of the estimator on the sample size is used to clarify that these form a sequence, \{\hat{\theta}_n\}_{n=1}^{\infty}.

Consistency requires an estimator to exhibit two features as the sample size becomes large.
First, any bias must be shrinking. Second, the distribution of θ̂ around θ 0 must be shrink-
ing in such a way that virtually all of the probability mass is arbitrarily close to θ 0 . Behind
consistency are a set of theorems known as laws of large numbers. Laws of large numbers provide conditions under which an average will converge to its expectation. The simplest is the Kolmogorov Strong Law of Large Numbers and is applicable to i.i.d. data.8

Theorem 2.15 (Kolmogorov Strong Law of Large Numbers). Let {y_i} be a sequence of i.i.d. random variables with µ ≡ E[y_i] and define \bar{y}_n = n^{-1}\sum_{i=1}^n y_i. Then
\[
\bar{y}_n \overset{a.s.}{\to} \mu \tag{2.16}
\]
if and only if E[|y_i|] < ∞.

8
A law of large numbers is strong if the convergence is almost sure. It is weak if the convergence is in probability.

In the case of i.i.d. data the only requirement for consistency is that the expectation exists,
and so a law of large numbers will apply to an average of i.i.d. data whenever its expectation
exists. For example, Monte Carlo integration uses i.i.d. draws and so the Kolmogorov LLN
is sufficient to ensure that Monte Carlo integrals converge to their expected values.
The variance of an estimator is the same as any other variance, V[θ̂] = E[(θ̂ − E[θ̂])²], although it is worth noting that the variance is defined as the variation around its expectation, E[θ̂], not the population value of the parameter, θ_0. Mean square error measures this alternative form of variation around the population value of the parameter.

Definition 2.16 (Mean Square Error). The mean square error of an estimator θ̂, denoted MSE(θ̂), is defined as
\[
\operatorname{MSE}\big[\hat{\theta}\big] = E\Big[\big(\hat{\theta} - \theta_0\big)^2\Big]. \tag{2.17}
\]
It can be equivalently expressed as the bias squared plus the variance, MSE[θ̂] = B[θ̂]² + V[θ̂].

When the bias and variance of an estimator both converge to zero, then θ̂_n →m.s. θ_0.

2.3.1.1 Bias and Consistency of the Method of Moment Estimators

The method of moments estimators of the mean and variance are defined as
\[
\hat{\mu} = n^{-1}\sum_{i=1}^n y_i
\]
\[
\hat{\sigma}^2 = n^{-1}\sum_{i=1}^n (y_i - \hat{\mu})^2 .
\]

When the data are i.i.d. with finite mean µ and variance σ², the mean estimator is unbiased while the variance is biased by an amount that becomes small as the sample size increases. The mean is unbiased since
\[
\begin{aligned}
E[\hat{\mu}] &= E\left[n^{-1}\sum_{i=1}^n y_i\right] \\
 &= n^{-1}\sum_{i=1}^n E[y_i] \\
 &= n^{-1}\sum_{i=1}^n \mu \\
 &= n^{-1} n \mu \\
 &= \mu .
\end{aligned}
\]
The variance estimator is biased since
\[
\begin{aligned}
E\left[\hat{\sigma}^2\right] &= E\left[n^{-1}\sum_{i=1}^n (y_i - \hat{\mu})^2\right] \\
 &= E\left[n^{-1}\left(\sum_{i=1}^n y_i^2 - n\hat{\mu}^2\right)\right] \\
 &= n^{-1}\left(\sum_{i=1}^n E\left[y_i^2\right] - nE\left[\hat{\mu}^2\right]\right) \\
 &= n^{-1}\left(\sum_{i=1}^n \left(\mu^2 + \sigma^2\right) - n\left(\mu^2 + \frac{\sigma^2}{n}\right)\right) \\
 &= n^{-1}\left(n\mu^2 + n\sigma^2 - n\mu^2 - \sigma^2\right) \\
 &= n^{-1}\left(n\sigma^2 - n\frac{\sigma^2}{n}\right) \\
 &= \frac{n-1}{n}\sigma^2 ,
\end{aligned}
\]

where the sample mean is equal to the population mean plus an error that is decreasing in n. Writing µ̂ = µ + n^{-1}\sum_{i=1}^n ε_i with ε_i ≡ y_i − µ,
\[
\begin{aligned}
\hat{\mu}^2 &= \left(\mu + n^{-1}\sum_{i=1}^n \varepsilon_i\right)^2 \\
 &= \mu^2 + 2\mu n^{-1}\sum_{i=1}^n \varepsilon_i + \left(n^{-1}\sum_{i=1}^n \varepsilon_i\right)^2
\end{aligned}
\]
and so its square has expectation
\[
\begin{aligned}
E\left[\hat{\mu}^2\right] &= E\left[\mu^2 + 2\mu n^{-1}\sum_{i=1}^n \varepsilon_i + \left(n^{-1}\sum_{i=1}^n \varepsilon_i\right)^2\right] \\
 &= \mu^2 + 2\mu n^{-1} E\left[\sum_{i=1}^n \varepsilon_i\right] + n^{-2} E\left[\left(\sum_{i=1}^n \varepsilon_i\right)^2\right] \\
 &= \mu^2 + \frac{\sigma^2}{n}.
\end{aligned}
\]

2.3.2 Asymptotic Normality


While unbiasedness and consistency are highly desirable properties of any estimator, alone
these do not provide a method to perform inference. The primary tool in econometrics for
inference is the central limit theorem (CLT). CLTs exist for a wide range of possible data
characteristics that include i.i.d., heterogeneous and dependent cases. The Lindberg-Lévy
CLT, which is applicable to i.i.d. data, is the simplest.

Theorem 2.17 (Lindberg-Lévy). Let {y_i} be a sequence of i.i.d. random scalars with µ ≡ E[Y_i] and σ² ≡ V[Y_i] < ∞. If σ² > 0, then
\[
\frac{\bar{y}_n - \mu}{\bar{\sigma}_n} = \sqrt{n}\,\frac{\bar{y}_n - \mu}{\sigma} \overset{d}{\to} N(0, 1) \tag{2.18}
\]
where \bar{y}_n = n^{-1}\sum_{i=1}^n y_i and \bar{\sigma}_n = \sqrt{\sigma^2/n}.

Lindberg-Lévy states that as long as i.i.d. data have 2 moments – a mean and variance –
the sample mean will be asymptotically normal. It can further be seen to show that other
moments of i.i.d. random variables, such as the variance, will be asymptotically normal
as long as two times the power of the moment exists. In other words, an estimator of the
rth moment will be asymptotically normal as long as the 2rth moment exists – at least in
i.i.d. data. Figure 2.2 contains density plots of the sample average of n independent χ²_1

[Figure 2.2 (Consistency and Central Limits): kernel densities of the unscaled estimator distribution (top panel) and the √n-scaled estimator distribution (bottom panel) for n = 5, 10, 50 and 100.]

Figure 2.2: These two panels illustrate the difference between consistency and the correctly scaled estimators. The sample mean was computed 1,000 times using 5, 10, 50 and 100 i.i.d. χ²_1 data points. The top panel contains a kernel density plot of the estimates of the mean. The density when n = 100 is much tighter than when n = 5 or n = 10 since the estimates are not scaled. The bottom panel plots √n(µ̂ − 1)/√2, the standardized version for which a CLT applies. All scaled densities have similar dispersion although it is clear that the asymptotic approximation of the CLT is not particularly accurate when n = 5 or n = 10 (due to the right skew in the χ²_1 data).

random variables for n = 5, 10, 50 and 100.9 The top panel contains the density of the unscaled estimates. The bottom panel contains the density plot of the correctly scaled terms, √n(µ̂ − 1)/√2 where µ̂ is the sample average. In the top panel the densities are collapsing. This is evidence of consistency since the asymptotic distribution of µ̂ is collapsing on 1. The bottom panel demonstrates the operation of a CLT since the appropriately standardized means all have similar dispersion and are increasingly normal.

9
The mean and variance of a χ²_ν random variable are ν and 2ν, respectively.
Central limit theorems exist for a wide variety of other data generating processes including processes which are independent but not identically distributed (i.n.i.d.) or processes
which are dependent, such as time-series data. As the data become more heterogeneous,
whether through dependence or by having different variance or distributions, more re-
strictions are needed on certain characteristics of the data to ensure that averages will be
asymptotically normal. The Lindberg-Feller CLT allows for heteroskedasticity (different
variances) and/or different marginal distributions.
Theorem 2.18 (Lindberg-Feller). Let {y_i} be a sequence of independent random scalars with µ_i ≡ E[y_i] and 0 < σ²_i ≡ V[y_i] < ∞ where y_i ∼ F_i, i = 1, 2, \ldots. Then
\[
\sqrt{n}\,\frac{\bar{y}_n - \bar{\mu}_n}{\bar{\sigma}_n} \overset{d}{\to} N(0, 1) \tag{2.19}
\]
and
\[
\lim_{n\to\infty} \max_{1\le i\le n} n^{-1}\frac{\sigma_i^2}{\bar{\sigma}_n^2} = 0 \tag{2.20}
\]
if and only if, for every ε > 0,
\[
\lim_{n\to\infty} \bar{\sigma}_n^{-2}\, n^{-1} \sum_{i=1}^n \int_{(z - \mu_i)^2 > \varepsilon n \bar{\sigma}_n^2} (z - \mu_i)^2 \, dF_i(z) = 0 \tag{2.21}
\]
where \bar{\mu}_n = n^{-1}\sum_{i=1}^n \mu_i and \bar{\sigma}_n^2 = n^{-1}\sum_{i=1}^n \sigma_i^2.
The Lindberg-Feller CLT relaxes the requirement that the marginal distributions are identical in the Lindberg-Lévy CLT at the cost of a technical condition. The final condition, known as a Lindberg condition, essentially requires that no random variable is so heavy-tailed that it dominates the others when averaged. In practice this can be a concern when the variables have a wide range of variances (σ²_i). Many macroeconomic data series exhibit a large decrease in the variance of their shocks after 1984, a phenomenon referred to as the Great Moderation. The statistical consequence of this decrease is that averages that use data both before and after 1984 may not be well approximated by a CLT and caution is warranted when using asymptotic approximations. This phenomenon is also present in equity returns where some periods – for example the technology “bubble” from 1997-2002 – have substantially higher volatility than periods before or after. These large persistent changes in the characteristics of the data have negative consequences on the quality of CLT approximations and large data samples are often needed.

2.3.2.1 What good is a CLT?

Central limit theorems are the basis of most inference in econometrics, although their for-
mal justification is only asymptotic and hence only guaranteed to be valid for an arbitrarily
large data set. Reconciling these two statements is an important step in the evolution of an
econometrician.
Central limit theorems should be seen as approximations, and as an approximation they
can be accurate or arbitrarily poor. For example, when a series of random variables are

[Figure 2.3 (Central Limit Approximations): kernel density of √T(λ̂ − λ) with the CLT density (top panel, accurate approximation) and of √T(ρ̂ − ρ) with the CLT density (bottom panel, inaccurate approximation).]

Figure 2.3: These two plots illustrate how a CLT can provide a good approximation, even in small samples (top panel), or a bad approximation even for moderately large samples (bottom panel). The top panel contains a kernel density plot of the standardized sample mean of n = 10 Poisson random variables (λ = 5) over 10,000 Monte Carlo simulations. Here the finite sample distribution and the asymptotic distribution overlay one another. The bottom panel contains the conditional ML estimates of ρ from the AR(1) y_i = ρy_{i-1} + ε_i where ε_i is i.i.d. standard normal using 100 data points and 10,000 replications. While ρ̂ is asymptotically normal, the quality of the approximation when n = 100 is poor.

i.i.d. , thin-tailed and not skewed, the distribution of the sample mean computed using
as few as 10 observations may be very well approximated using a central limit theorem.
On the other hand, the approximation of a central limit theorem for the estimate of the
autoregressive parameter, ρ, in

yi = ρ yi −1 + εi (2.22)
may be poor even for hundreds of data points when ρ is close to one (but smaller). Figure
2.3 contains kernel density plots of the sample means computed from a set of 10 i.i.d. draws
from a Poisson distribution with λ = 5 in the top panel and the estimated autoregressive
parameter from the autoregression in eq. (2.22) with ρ = .995 in the bottom. Each figure
also contains the pdf of an appropriately scaled normal. The CLT for the sample means of
the Poisson random variables is virtually indistinguishable from the actual distribution. On the other hand, the CLT approximation for ρ̂ is very poor despite being based on 100 data points – 10× more than in the i.i.d. Poisson example. The difference arises because the data in the AR(1) example are not independent. With ρ = 0.995 the data are highly dependent, and more data are required for averages to be well behaved so that the CLT approximation is accurate.
Unfortunately there are no hard and fast rules as to when a CLT will be a good approx-
imation. In general, the more dependent and the more heterogeneous a series, the worse
the approximation for a fixed number of observations. Simulations (Monte Carlo) are a
useful tool to investigate the validity of a CLT since they allow the finite sample distribu-
tion to be tabulated and compared to the asymptotic distribution.
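To make this concrete, the following short Python sketch (assuming numpy is available; the sample sizes and parameter values follow the two cases in Figure 2.3, while the implementation details are only illustrative) tabulates the finite sample distribution of two standardized estimators by Monte Carlo so that it can be compared with the standard normal limit.

    import numpy as np

    rng = np.random.default_rng(0)
    n_rep = 10000

    # Case 1: sample mean of n = 10 Poisson(lambda = 5) draws
    n_p, lam = 10, 5.0
    lam_hat = rng.poisson(lam, size=(n_rep, n_p)).mean(axis=1)
    z_poisson = np.sqrt(n_p) * (lam_hat - lam) / np.sqrt(lam)

    # Case 2: conditional ML (least squares) estimate of rho in an AR(1), rho = 0.995, n = 100
    n_a, rho = 100, 0.995
    z_ar = np.empty(n_rep)
    for r in range(n_rep):
        e = rng.standard_normal(n_a)
        y = np.zeros(n_a)
        for t in range(1, n_a):
            y[t] = rho * y[t - 1] + e[t]
        rho_hat = y[1:] @ y[:-1] / (y[:-1] @ y[:-1])
        # the standard CLT approximation for a stationary AR(1) uses avar = 1 - rho**2
        z_ar[r] = np.sqrt(n_a) * (rho_hat - rho) / np.sqrt(1 - rho ** 2)

    # If the CLT approximation is accurate, each z should behave like a standard normal
    for name, z in (("Poisson mean", z_poisson), ("AR(1) rho", z_ar)):
        print(name, "mean =", round(float(z.mean()), 2), "std =", round(float(z.std()), 2),
              "P(|z| > 1.96) =", round(float((np.abs(z) > 1.96).mean()), 3))

The Poisson case should deliver tail frequencies close to 5%, while the highly dependent AR(1) case should not, mirroring the contrast between the two panels of the figure.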

2.3.3 Efficiency

A final concept, efficiency, is useful for ranking consistent asymptotically normal (CAN) estimators that have the same rate of convergence.¹⁰

Definition 2.19 (Relative Efficiency). Let θ̂_n and θ̃_n be two √n-consistent asymptotically normal estimators for θ_0. If the asymptotic variance of θ̂_n, written avar(θ̂_n), is less than the asymptotic variance of θ̃_n, and so

    avar(θ̂_n) < avar(θ̃_n)                                                (2.23)

then θ̂_n is said to be relatively efficient to θ̃_n.¹¹

Note that when θ is a vector, avar(θ̂_n) will be a covariance matrix. Inequality for matrices A and B is interpreted to mean that if A < B then B − A is positive semi-definite, and so all of the variances of the inefficient estimator must be (weakly) larger than those of the efficient estimator.

Definition 2.20 (Asymptotically Efficient Estimator). Let θ̂_n and θ̃_n be two √n-consistent asymptotically normal estimators for θ_0. If

    avar(θ̂_n) < avar(θ̃_n)                                                (2.24)

for any choice of θ̃_n then θ̂_n is said to be the efficient estimator of θ.

¹⁰ In any consistent estimator the asymptotic distribution of θ̂ − θ_0 is degenerate. In order to make inference, the difference between the estimate and the population parameters must be scaled by a function of the number of data points. For most estimators this rate is √n, and so √n(θ̂ − θ_0) will have an asymptotically normal distribution. In the general case, the scaled difference can be written as n^δ (θ̂ − θ_0) where n^δ is known as the rate.

¹¹ The asymptotic variance of a √n-consistent estimator, written avar(θ̂_n), is defined as lim_{n→∞} V[√n(θ̂_n − θ_0)].

One of the important features of efficiency comparisons is that they are only meaningful if both estimators are asymptotically normal, and hence consistent, at the same rate – √n in the usual case. It is trivial to produce an estimator that has a smaller variance but is inconsistent. For example, if an estimator for a scalar unknown is θ̂ = 7 then it has no variance: it will always be 7. However, unless θ_0 = 7 it will also be biased. Mean square error is a more appropriate method to compare estimators where one or more may be biased, since it accounts for the total variation, not just the variance.¹²

2.4 Distribution Theory

Most distributional theory follows from a central limit theorem applied to the moment con-
ditions or to the score of the log-likelihood. While the moment condition or score are not
the object of interest – θ is – a simple expansion can be used to establish the asymptotic
distribution of the estimated parameters.

2.4.1 Method of Moments

Distribution theory for classical method of moments estimators is the most straightforward. Further, Maximum Likelihood can be considered a special case and so the method of moments is a natural starting point.¹³ The method of moments estimator is defined as

    µ̂   = n⁻¹ Σ_{i=1}^n x_i
    µ̂_2 = n⁻¹ Σ_{i=1}^n (x_i − µ̂)²
     ⋮
    µ̂_k = n⁻¹ Σ_{i=1}^n (x_i − µ̂)^k

¹² Some consistent asymptotically normal estimators have an asymptotic bias, so that √n(θ̃_n − θ_0) →d N(B, Σ). Asymptotic MSE, defined as E[n(θ̂_n − θ_0)(θ̂_n − θ_0)'] = BB' + Σ, provides a method to compare estimators using their asymptotic properties.

¹³ While the class of method of moments estimators and maximum likelihood estimators contains a substantial overlap, there are method of moments estimators that cannot be replicated as a score condition of any likelihood since the likelihood is required to integrate to 1.

To understand the distribution theory for the method of moments estimator, begin by reformulating the estimator as the solution of a set of k equations evaluated using the population values of µ, µ_2, ..., µ_k:

    n⁻¹ Σ_{i=1}^n x_i − µ = 0
    n⁻¹ Σ_{i=1}^n (x_i − µ)² − µ_2 = 0
     ⋮
    n⁻¹ Σ_{i=1}^n (x_i − µ)^k − µ_k = 0

Define g_{1i} = x_i − µ and g_{ji} = (x_i − µ)^j − µ_j, j = 2, ..., k, and define the vector g_i as

    g_i = [g_{1i}, g_{2i}, ..., g_{ki}]'.                                  (2.25)

Using this definition, the method of moments estimator can be seen as the solution to

    n⁻¹ Σ_{i=1}^n g_i = 0.

Consistency of the method of moments estimator relies on a law of large numbers holding for n⁻¹ Σ_{i=1}^n x_i and n⁻¹ Σ_{i=1}^n (x_i − µ)^j for j = 2, ..., k. If x_i is an i.i.d. sequence and as long as E[|x_i − µ|^j] exists, then n⁻¹ Σ_{i=1}^n (x_i − µ)^j →p µ_j.¹⁴ An alternative, and more restrictive, approach is to assume that E[(x_i − µ)^{2j}] = µ_{2j} exists, and so

    E[ n⁻¹ Σ_{i=1}^n (x_i − µ)^j ] = µ_j                                   (2.26)

    V[ n⁻¹ Σ_{i=1}^n (x_i − µ)^j ] = n⁻¹ ( E[(x_i − µ)^{2j}] − E[(x_i − µ)^j]² )        (2.27)
                                   = n⁻¹ ( µ_{2j} − µ_j² ),

and so n⁻¹ Σ_{i=1}^n (x_i − µ)^j →m.s. µ_j, which implies consistency.

¹⁴ Technically, n⁻¹ Σ_{i=1}^n (x_i − µ)^j →a.s. µ_j by the Kolmogorov law of large numbers, but since a.s. convergence implies convergence in probability, the original statement is also true.

The asymptotic normality of parameters estimated using the method of moments fol-
lows from the asymptotic normality of

    √n ( n⁻¹ Σ_{i=1}^n g_i ) = n^{−1/2} Σ_{i=1}^n g_i,                     (2.28)

an assumption. This requires the elements of g_i to be sufficiently well behaved so that averages are asymptotically normally distributed. For example, when x_i is i.i.d., the Lindberg-
Lévy CLT would require xi to have 2k moments when estimating k parameters. When esti-
mating the mean, 2 moments are required (i.e. the variance is finite). To estimate the mean
and the variance using i.i.d. data, 4 moments are required for the estimators to follow a CLT.
As long as the moment conditions are differentiable in the actual parameters of interest θ –
for example the mean and the variance – a mean value expansion can be used to establish
the asymptotic normality of these parameters.15

    n⁻¹ Σ_{i=1}^n g_i(θ̂) = n⁻¹ Σ_{i=1}^n g_i(θ_0) + n⁻¹ Σ_{i=1}^n [∂g_i(θ)/∂θ']|_{θ=θ̄} (θ̂ − θ_0)         (2.30)
                         = n⁻¹ Σ_{i=1}^n g_i(θ_0) + G_n(θ̄)(θ̂ − θ_0)

where θ̄ is a vector that lies between θ̂ and θ_0, element-by-element. Note that n⁻¹ Σ_{i=1}^n g_i(θ̂) = 0 by construction, and so

    n⁻¹ Σ_{i=1}^n g_i(θ_0) + G_n(θ̄)(θ̂ − θ_0) = 0
    G_n(θ̄)(θ̂ − θ_0) = −n⁻¹ Σ_{i=1}^n g_i(θ_0)
    (θ̂ − θ_0) = −G_n(θ̄)⁻¹ n⁻¹ Σ_{i=1}^n g_i(θ_0)

¹⁵ The mean value expansion is defined in the following theorem.

Theorem 2.21 (Mean Value Theorem). Let s : R^k → R be defined on a convex set Θ ⊂ R^k. Further, let s be continuously differentiable on Θ with k by 1 gradient

    ∇s(θ̂) ≡ ∂s(θ)/∂θ |_{θ=θ̂}.                                             (2.29)

Then for any points θ and θ_0 there exists θ̄ lying on the segment between θ and θ_0 such that s(θ) = s(θ_0) + ∇s(θ̄)'(θ − θ_0).

    √n(θ̂ − θ_0) = −G_n(θ̄)⁻¹ √n n⁻¹ Σ_{i=1}^n g_i(θ_0)
    √n(θ̂ − θ_0) = −G_n(θ̄)⁻¹ √n g_n(θ_0)

where g_n(θ_0) = n⁻¹ Σ_{i=1}^n g_i(θ_0) is the average of the moment conditions. Thus the normalized difference between the estimated and the population values of the parameters, √n(θ̂ − θ_0), is equal to a scaled random variable −G_n(θ̄)⁻¹ √n g_n(θ_0) that has an asymptotic normal distribution. By assumption √n g_n(θ_0) →d N(0, Σ) and so

    √n(θ̂ − θ_0) →d N(0, G⁻¹ Σ (G⁻¹)')                                     (2.31)

where G_n(θ̄) has been replaced with its limit as n → ∞, G,

    G = plim_{n→∞} ∂g_n(θ)/∂θ' |_{θ=θ_0}                                   (2.32)
      = plim_{n→∞} n⁻¹ Σ_{i=1}^n ∂g_i(θ)/∂θ' |_{θ=θ_0}.

Since θ̂ is a consistent estimator, θ̂ →p θ_0 and so θ̄ →p θ_0 since it is between θ̂ and θ_0. This form of asymptotic covariance is known as a "sandwich" covariance estimator.

2.4.1.1 Inference on the Mean and Variance

To estimate the mean and variance by the method of moments, two moment conditions are needed,

    n⁻¹ Σ_{i=1}^n x_i = µ̂
    n⁻¹ Σ_{i=1}^n (x_i − µ̂)² = σ̂².

To derive the asymptotic distribution, begin by forming g_i,

    g_i = [ x_i − µ,  (x_i − µ)² − σ² ]'.

Note that g_i is mean 0 and a function of a single x_i, so that g_i is also i.i.d. The covariance of g_i is given by
    Σ = E[g_i g_i']                                                        (2.33)

      = E [ (x_i − µ)²                       (x_i − µ)((x_i − µ)² − σ²)  ]
          [ (x_i − µ)((x_i − µ)² − σ²)       ((x_i − µ)² − σ²)²          ]

      = E [ (x_i − µ)²                       (x_i − µ)³ − σ²(x_i − µ)              ]
          [ (x_i − µ)³ − σ²(x_i − µ)         (x_i − µ)⁴ − 2σ²(x_i − µ)² + σ⁴       ]

      = [ σ²     µ_3       ]
        [ µ_3    µ_4 − σ⁴  ]

and the Jacobian is

    G = plim_{n→∞} n⁻¹ Σ_{i=1}^n ∂g_i(θ)/∂θ' |_{θ=θ_0}
      = plim_{n→∞} n⁻¹ Σ_{i=1}^n [ −1              0   ]
                                  [ −2(x_i − µ)    −1   ].

Since plim_{n→∞} n⁻¹ Σ_{i=1}^n (x_i − µ) = plim_{n→∞} x̄_n − µ = 0,

    G = [ −1    0  ]
        [  0   −1  ].

Thus, the asymptotic distribution of the method of moments estimator of θ = (µ, σ²)' is

    √n ( [ µ̂  ]  −  [ µ  ] )  →d  N ( [ 0 ] ,  [ σ²     µ_3       ] )
         [ σ̂² ]     [ σ² ]            [ 0 ]    [ µ_3    µ_4 − σ⁴  ]

since G = −I_2 and so G⁻¹ Σ (G⁻¹)' = −I_2 Σ (−I_2) = Σ.
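As a minimal illustration, the following Python sketch (assuming numpy is available; the simulated heavy-tailed data are purely for demonstration) plugs sample moments into Σ to produce standard errors for µ̂ and σ̂².

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.standard_t(df=8, size=2000)      # heavy-tailed data for illustration

    n = x.shape[0]
    mu_hat = x.mean()
    e = x - mu_hat
    s2_hat = np.mean(e ** 2)
    mu3_hat = np.mean(e ** 3)
    mu4_hat = np.mean(e ** 4)

    Sigma_hat = np.array([[s2_hat, mu3_hat],
                          [mu3_hat, mu4_hat - s2_hat ** 2]])
    # G = -I_2, so avar(theta_hat) = Sigma; standard errors scale with 1/sqrt(n)
    se = np.sqrt(np.diag(Sigma_hat) / n)
    print(mu_hat, s2_hat, se)

With heavy-tailed data the estimated standard error of σ̂² should be noticeably larger than the value implied by normality, √(2σ⁴/n).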

2.4.2 Maximum Likelihood

The steps to deriving the asymptotic distribution of ML estimators are similar to those for
method of moments estimators where the score of the likelihood takes the place of the
moment conditions. The maximum likelihood estimator is defined as the maximum of the
log-likelihood of the data with respect to the parameters,

    θ̂ = argmax_θ l(θ; y).                                                 (2.34)

When the data are i.i.d., the log-likelihood can be factored into n log-likelihoods, one for each observation¹⁶,

    l(θ; y) = Σ_{i=1}^n l_i(θ; y_i).                                       (2.35)

It is useful to work with the average log-likelihood directly, and so define

    l̄_n(θ; y) = n⁻¹ Σ_{i=1}^n l_i(θ; y_i).                                 (2.36)

¹⁶ Even when the data are not i.i.d., the log-likelihood can be factored into n log-likelihoods using conditional distributions for y_2, ..., y_n and the marginal distribution of y_1,

    l(θ; y) = Σ_{i=2}^n l_i(θ; y_i | y_{i−1}, ..., y_1) + l_1(θ; y_1).

The intuition behind the asymptotic distribution follows from the use of the average. Under some regularity conditions, l̄_n(θ; y) converges uniformly in θ to E[l(θ; y_i)]. However, since the average log-likelihood is becoming a good approximation for the expectation of the log-likelihood, the value of θ that maximizes the log-likelihood of the data and its expectation will be very close for n sufficiently large. As a result, whenever the log-likelihood is differentiable and the range of y_i does not depend on any of the parameters in θ,

    E[ ∂l̄_n(θ; y_i)/∂θ |_{θ=θ_0} ] = 0                                     (2.37)

where θ 0 are the parameters of the data generating process. This follows since


∂ f (y;θ 0 )
∂ l¯n (θ 0 ; y) ∂θ
Z Z
θ =θ 0
f (y; θ 0 ) dy = f (y; θ 0 ) dy (2.38)
∂θ f (y; θ 0 )

Sy Sy
θ =θ 0
∂ f (y; θ 0 )
Z
= dy
Sy ∂θ
θ =θ 0


Z
= f (y; θ ) dy

∂ θ Sy
θ =θ 0

= 1
∂θ
=0

where Sy denotes the support of y. The scores of the average log-likelihood are

    ∂l̄_n(θ; y_i)/∂θ = n⁻¹ Σ_{i=1}^n ∂l_i(θ; y_i)/∂θ                        (2.39)

and when yi is i.i.d. the scores will be i.i.d., and so the average scores will follow a law of
large numbers for θ close to θ 0 . Thus
    n⁻¹ Σ_{i=1}^n ∂l_i(θ; y_i)/∂θ →a.s. E[∂l(θ; Y_i)/∂θ]                    (2.40)

As a result, the population value of θ , θ 0 , will also asymptotically solve the first order con-
dition. The average scores are also the basis of the asymptotic normality of maximum like-
lihood estimators. Under some further regularity conditions, the average scores will follow
a central limit theorem, and so

    √n ∇_θ l̄(θ_0) ≡ √n ( n⁻¹ Σ_{i=1}^n ∂l(θ; y_i)/∂θ |_{θ=θ_0} ) →d N(0, J).         (2.41)

Taking a mean value expansion around θ_0,

    √n ∇_θ l̄(θ̂) = √n ∇_θ l̄(θ_0) + √n ∇_{θθ'} l̄(θ̄)(θ̂ − θ_0)
    0 = √n ∇_θ l̄(θ_0) + √n ∇_{θθ'} l̄(θ̄)(θ̂ − θ_0)
    −√n ∇_{θθ'} l̄(θ̄)(θ̂ − θ_0) = √n ∇_θ l̄(θ_0)
    √n(θ̂ − θ_0) = [ −∇_{θθ'} l̄(θ̄) ]⁻¹ √n ∇_θ l̄(θ_0)

where

    ∇_{θθ'} l̄(θ̄) ≡ n⁻¹ Σ_{i=1}^n ∂²l(θ; y_i)/∂θ∂θ' |_{θ=θ̄}                 (2.42)

and where θ̄ is a vector whose elements lie between θ̂ and θ_0. Since θ̂ is a consistent estimator of θ_0, θ̄ →p θ_0 and so functions of θ̄ will converge to their value at θ_0, and the asymptotic distribution of the maximum likelihood estimator is

    √n(θ̂ − θ_0) →d N(0, I⁻¹ J I⁻¹)                                         (2.43)

where

    I = −E[ ∂²l(θ; y_i)/∂θ∂θ' |_{θ=θ_0} ]                                   (2.44)

and

    J = E[ ∂l(θ; y_i)/∂θ ∂l(θ; y_i)/∂θ' |_{θ=θ_0} ].                        (2.45)

The asymptotic covariance matrix can be further simplified using the information matrix equality, which states that I − J →p 0, and so

    √n(θ̂ − θ_0) →d N(0, I⁻¹)                                               (2.46)

or equivalently

    √n(θ̂ − θ_0) →d N(0, J⁻¹).                                              (2.47)

The information matrix equality follows from taking the derivative of the expected score,

    ∂²l(θ_0; y)/∂θ∂θ' = (1/f(y; θ)) ∂²f(y; θ_0)/∂θ∂θ' − (1/f(y; θ)²) ∂f(y; θ_0)/∂θ ∂f(y; θ_0)/∂θ'           (2.48)

    ∂²l(θ_0; y)/∂θ∂θ' + ∂l(θ_0; y)/∂θ ∂l(θ_0; y)/∂θ' = (1/f(y; θ)) ∂²f(y; θ_0)/∂θ∂θ'

and so, when the model is correctly specified,

    E[ ∂²l(θ_0; y)/∂θ∂θ' + ∂l(θ_0; y)/∂θ ∂l(θ_0; y)/∂θ' ] = ∫_{S_y} (1/f(y; θ)) ∂²f(y; θ_0)/∂θ∂θ' f(y; θ) dy
                                                          = ∫_{S_y} ∂²f(y; θ_0)/∂θ∂θ' dy
                                                          = ∂²/∂θ∂θ' ∫_{S_y} f(y; θ_0) dy
                                                          = ∂²/∂θ∂θ' 1
                                                          = 0,

and

    E[ ∂²l(θ_0; y)/∂θ∂θ' ] = −E[ ∂l(θ_0; y)/∂θ ∂l(θ_0; y)/∂θ' ].
A related concept, and one which applies to ML estimators when the information matrix equality holds – at least asymptotically – is the Cramér-Rao lower bound.

Theorem 2.22 (Cramér-Rao Inequality). Let f(y; θ) be the joint density of y where θ is a k dimensional parameter vector. Let θ̂ be a consistent estimator of θ with finite covariance. Under some regularity conditions on f(·),

    avar(θ̂) ≥ I⁻¹(θ)                                                       (2.49)

where

    I(θ) = −E[ ∂² ln f(Y_i; θ)/∂θ∂θ' |_{θ=θ_0} ].                           (2.50)

The important implication of the Cramér-Rao theorem is that maximum likelihood estimators, which are generally consistent, are asymptotically efficient.¹⁷ This guarantee makes a strong case for using maximum likelihood estimators when they are available.

2.4.2.1 Inference in a Poisson MLE

Recall that the log-likelihood in a Poisson MLE is

    l(λ; y) = −nλ + ln(λ) Σ_{i=1}^n y_i − Σ_{i=1}^n ln(y_i!)

and that the first order condition is

    ∂l(λ; y)/∂λ = −n + λ⁻¹ Σ_{i=1}^n y_i.

The MLE was previously shown to be λ̂ = n⁻¹ Σ_{i=1}^n y_i. To compute the variance, take the expectation of the negative of the second derivative,

    ∂²l(λ; y_i)/∂λ² = −λ⁻² y_i

and so

    I = −E[ ∂²l(λ; y_i)/∂λ² ] = −E[ −λ⁻² y_i ]
      = λ⁻² E[y_i]
      = λ⁻² λ
      = λ⁻¹

¹⁷ The Cramér-Rao bound also applies in finite samples when θ̂ is unbiased. While most maximum likelihood estimators are biased in finite samples, there are important cases where estimators are unbiased for any sample size and so the Cramér-Rao theorem will apply in finite samples. Linear regression is an important case where the Cramér-Rao theorem applies in finite samples (under some strong assumptions).

and so √n(λ̂ − λ_0) →d N(0, λ) since I⁻¹ = λ.

Alternatively, the covariance of the scores could be used to compute the parameter covariance,

    J = V[ −1 + y_i/λ ]
      = (1/λ²) V[y_i]
      = λ⁻¹.

I = J and so the IME holds when the data are Poisson distributed. If the data were not Poisson distributed, then it would not normally be the case that E[y_i] = V[y_i] = λ, and so I and J would not (generally) be equal.
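The calculation above translates directly into a few lines of code. The following sketch (assuming numpy is available; the simulated sample is illustrative) estimates λ and compares the standard errors implied by the Hessian-based I and the score-based J.

    import numpy as np

    rng = np.random.default_rng(1)
    lam0, n = 5.0, 1000
    y = rng.poisson(lam0, size=n)

    lam_hat = y.mean()                       # MLE of lambda

    # Hessian-based estimate: -E[d^2 l_i / d lambda^2] = lambda^{-2} E[y_i]
    I_hat = y.mean() / lam_hat ** 2          # equals 1 / lam_hat
    # Score-based estimate: mean of s_i^2 with s_i = -1 + y_i / lambda
    scores = -1.0 + y / lam_hat
    J_hat = np.mean(scores ** 2)

    se_I = np.sqrt(1.0 / (n * I_hat))        # std. err. using I^{-1}
    se_J = np.sqrt(1.0 / (n * J_hat))        # std. err. using J^{-1}
    print(lam_hat, se_I, se_J)               # the two standard errors should be similar

Because the data really are Poisson here, the two standard errors agree closely, which is the IME at work.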

2.4.2.2 Inference in the Normal (Gaussian) MLE

Recall that the MLE estimators of the mean and variance are

    µ̂  = n⁻¹ Σ_{i=1}^n y_i
    σ̂² = n⁻¹ Σ_{i=1}^n (y_i − µ̂)²

and that the log-likelihood is

    l(θ; y) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/2) Σ_{i=1}^n (y_i − µ)²/σ².

Taking the derivative with respect to the parameter vector, θ = (µ, σ²)',

    ∂l(θ; y)/∂µ  = Σ_{i=1}^n (y_i − µ)/σ²
    ∂l(θ; y)/∂σ² = −n/(2σ²) + (1/2) Σ_{i=1}^n (y_i − µ)²/σ⁴.

The second derivatives are

    ∂²l(θ; y)/∂µ∂µ   = −Σ_{i=1}^n 1/σ²
    ∂²l(θ; y)/∂µ∂σ²  = −Σ_{i=1}^n (y_i − µ)/σ⁴
    ∂²l(θ; y)/∂σ²∂σ² = n/(2σ⁴) − Σ_{i=1}^n (y_i − µ)²/σ⁶.

The first does not depend on data and so no expectation is needed. The other two have expectations,

    E[ ∂²l(θ; y_i)/∂µ∂σ² ] = E[ −(y_i − µ)/σ⁴ ]
                           = −(E[y_i] − µ)/σ⁴
                           = −(µ − µ)/σ⁴
                           = 0

and

    E[ ∂²l(θ; y_i)/∂σ²∂σ² ] = E[ 1/(2σ⁴) − (y_i − µ)²/σ⁶ ]
                            = 1/(2σ⁴) − E[(y_i − µ)²]/σ⁶
                            = 1/(2σ⁴) − σ²/σ⁶
                            = 1/(2σ⁴) − 1/σ⁴
                            = −1/(2σ⁴).

Putting these together, the expected Hessian can be formed,

    E[ ∂²l(θ; y_i)/∂θ∂θ' ] = [ −1/σ²     0         ]
                             [  0        −1/(2σ⁴)  ]

and so the asymptotic covariance is

    I⁻¹ = ( −E[ ∂²l(θ; y_i)/∂θ∂θ' ] )⁻¹ = [ 1/σ²    0        ]⁻¹
                                          [ 0       1/(2σ⁴)  ]
        = [ σ²    0    ]
          [ 0     2σ⁴  ].

The asymptotic distribution is then

    √n ( [ µ̂  ]  −  [ µ  ] )  →d  N ( [ 0 ] ,  [ σ²    0    ] ).
         [ σ̂² ]     [ σ² ]            [ 0 ]    [ 0     2σ⁴  ]
Note that this is different from the asymptotic variance for the method of moments estima-
tor of the mean and the variance. This is because the data have been assumed to come from
a normal distribution and so the MLE is correctly specified. As a result µ3 = 0 (the normal
is symmetric) and the IME holds. In general the IME does not hold and so the asymptotic
covariance may take a different form which depends on the moments of the data as in eq.
(2.33).

2.4.3 Quasi Maximum Likelihood

While maximum likelihood is an appealing estimation approach, it has one important drawback: it requires knowledge of the density f(y; θ). In practice, the density assumed in maximum likelihood estimation, f(y; θ), may be misspecified for the actual density of y, g(y). This case has been widely studied, and estimators where the distribution is misspecified are known as quasi-maximum likelihood (QML) estimators. Unfortunately QML estimators generally lose all of the features that make maximum likelihood estimators so appealing: they are generally inconsistent for the parameters of interest, the information matrix equality does not hold and they do not achieve the Cramér-Rao lower bound.
First, consider the expected score from a QML estimator,

    E_g[ ∂l(θ_0; y)/∂θ ] = ∫_{S_y} [∂l(θ_0; y)/∂θ] g(y) dy                              (2.51)
                         = ∫_{S_y} [∂l(θ_0; y)/∂θ] [f(y; θ_0)/f(y; θ_0)] g(y) dy
                         = ∫_{S_y} [∂l(θ_0; y)/∂θ] [g(y)/f(y; θ_0)] f(y; θ_0) dy
                         = ∫_{S_y} [∂l(θ_0; y)/∂θ] h(y) f(y; θ_0) dy

where h(y) ≡ g(y)/f(y; θ_0),

which shows that the QML estimator can be seen as a weighted average with respect to the
density assumed. However these weights depend on the data, and so it will no longer be the
case that the expectation of the score at θ 0 will necessarily be 0. Instead QML estimators
generally converge to another value of θ , θ ∗ , that depends on both f (·) and g (·) and is
known as the pseudo-true value of θ .

The other important consideration when using QML to estimate parameters is that the
Information Matrix Equality (IME) no longer holds, and so “sandwich” covariance estima-
tors must be used and likelihood ratio statistics will not have standard χ 2 distributions.
An alternative interpretation of QML estimators is that of method of moments estimators
where the scores of l (θ ; y) are used to choose the moments. With this interpretation, the
distribution theory of the method of moments estimator will apply as long as the scores,
evaluated at the pseudo-true parameters, follow a CLT.

2.4.3.1 The Effect of the Data Distribution on Estimated Parameters

Figure 2.4 contains three distributions (left column) and the asymptotic covariance of the mean and the variance estimators, illustrated through joint confidence ellipses containing the true value with 80, 95 and 99% probability (right column).¹⁸ The ellipses were all derived from the asymptotic covariance of µ̂ and σ̂² where the data are i.i.d. and distributed according to a mixture of normals distribution where

    y_i = µ_1 + σ_1 z_i  with probability p
          µ_2 + σ_2 z_i  with probability 1 − p

where z is a standard normal. A mixture of normals is constructed from mixing draws from a finite set of normals with possibly different means and/or variances, and can take a wide variety of shapes. All of the variables were constructed so that E[y_i] = 0 and V[y_i] = 1. This requires

    p µ_1 + (1 − p) µ_2 = 0

and

    p (µ_1² + σ_1²) + (1 − p)(µ_2² + σ_2²) = 1.


 

The values used to produce the figures are listed in table 2.1. The first set is simply a stan-
dard normal since p = 1. The second is known as a contaminated normal and is com-
posed of a frequently occurring (95% of the time) mean-zero normal with variance slightly
smaller than 1 (.8), contaminated by a rare but high variance (4.8) mean-zero normal. This
produces heavy tails but does not result in a skewed distribution. The final example uses
different means and variance to produce a right (positively) skewed distribution.
The confidence ellipses illustrated in figure 2.4 are all derived from estimators produced
assuming that the data are normal, but using the “sandwich” version of the covariance,
I −1 J I −1 . The top panel illustrates the correctly specified maximum likelihood estimator.
Here the confidence ellipse is symmetric about its center. This illustrates that the param-
¹⁸ The ellipses are centered at (0,0) since the population value of the parameters has been subtracted. Also note that even though the confidence ellipse for σ̂² extended into the negative space, these must be divided by √n and re-centered at the estimated value when used.
[Figure 2.4 appears here: "Data Generating Process and Asymptotic Covariance of Estimators". Left column: densities of the Standard Normal, Contaminated Normal and Mixture of Normals data generating processes. Right column: the corresponding joint confidence ellipses for (µ, σ²).]

Figure 2.4: The six subplots illustrate how the data generating process, not the assumed
model, determine the asymptotic covariance of parameter estimates. In each panel the
data generating process was a mixture of normals, yi = µ1 + σ1 z i with probability p and
yi = µ2 + σ2 z i with probability 1 − p where the parameters were chosen so that E [yi ] = 0
and V [yi ] = 1. By varying p , µ1 , σ1 , µ2 and σ2 , a wide variety of distributions can be created
including standard normal (top panels), a heavy tailed distribution known as a contami-
nated normal (middle panels) and a skewed distribution (bottom panels).
                             p      µ_1    σ_1²    µ_2    σ_2²
    Standard Normal          1      0      1       0      1
    Contaminated Normal      .95    0      .8      0      4.8
    Right Skewed Mixture     .05    2      .5      −.1    .8

Table 2.1: Parameter values used in the mixtures of normals illustrated in figure 2.4.

eters are uncorrelated – and hence independent, since they are asymptotically normal –
and that they have different variances. The middle panel has a similar shape but is elon-
gated on the variance axis (x). This illustrates that the asymptotic variance of σ̂2 is affected
by the heavy tails of the data (large 4th moment) of the contaminated normal. The final
confidence ellipse is rotated which reflects that the mean and variance estimators are no
longer asymptotically independent. These final two cases are examples of QML; the esti-
mator is derived assuming a normal distribution but the data are not. In these examples,
the estimators are still consistent but have different covariances.19
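A short simulation makes the point concrete. The sketch below (assuming numpy; the parameter values are those of the right-skewed mixture in Table 2.1, while the code itself is only illustrative) computes both the naive normal-MLE covariance I⁻¹ and the robust sandwich covariance I⁻¹ J I⁻¹ for (µ̂, σ̂²).

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100000
    p, mu1, s1, mu2, s2 = 0.05, 2.0, np.sqrt(0.5), -0.1, np.sqrt(0.8)
    mix = rng.random(n) < p
    y = np.where(mix, mu1 + s1 * rng.standard_normal(n), mu2 + s2 * rng.standard_normal(n))

    mu_hat = y.mean()
    s2_hat = ((y - mu_hat) ** 2).mean()
    e = y - mu_hat

    # Scores of the (assumed) normal log-likelihood for theta = (mu, sigma^2)
    scores = np.column_stack([e / s2_hat, -0.5 / s2_hat + 0.5 * e ** 2 / s2_hat ** 2])
    J_hat = scores.T @ scores / n
    # Expected Hessian (negated) under the normal likelihood, evaluated at theta_hat
    I_hat = np.array([[1 / s2_hat, 0.0], [0.0, 0.5 / s2_hat ** 2]])

    avar_mle = np.linalg.inv(I_hat)                                  # valid only if data are normal
    avar_sandwich = np.linalg.inv(I_hat) @ J_hat @ np.linalg.inv(I_hat)
    print(np.sqrt(np.diag(avar_mle) / n))        # naive standard errors
    print(np.sqrt(np.diag(avar_sandwich) / n))   # robust (sandwich) standard errors

For the skewed mixture the two sets of standard errors differ, and the off-diagonal element of the sandwich covariance is non-zero, matching the rotated ellipse in the bottom panel of figure 2.4.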

2.4.4 The Delta Method

Some theories make predictions about functions of parameters rather than about the parameters directly. One common example in finance is the Sharpe ratio, S, defined as

    S = E[r − r_f] / √(V[r − r_f])                                          (2.52)

where r is the return on a risky asset and r_f is the risk-free rate – and so r − r_f is the excess return on the risky asset. While the quantities in both the numerator and the denominator are standard statistics, the mean and the standard deviation, the ratio is not.

The delta method can be used to compute the covariance of functions of asymptotically normal parameter estimates.

Definition 2.23 (Delta Method). Let √n(θ̂ − θ_0) →d N(0, G⁻¹ Σ (G')⁻¹) where Σ is a positive definite covariance matrix. Further, suppose that d(θ) is an m by 1 continuously differentiable vector function of θ from R^k → R^m. Then,

    √n ( d(θ̂) − d(θ_0) ) →d N( 0, D(θ_0) [G⁻¹ Σ (G')⁻¹] D(θ_0)' )

where

    D(θ_0) = ∂d(θ)/∂θ' |_{θ=θ_0}.                                           (2.53)
¹⁹ While these examples are consistent, it is not generally the case that the parameters estimated using a misspecified likelihood (QML) are consistent for the quantities of interest.

2.4.4.1 Variance of the Sharpe Ratio

The Sharpe ratio is estimated by "plugging in" the usual estimators of the mean and the variance,

    Ŝ = µ̂ / √σ̂².

In this case d(θ_0) is a scalar function of two parameters, and so

    d(θ_0) = µ / √σ²

and

    D(θ_0) = [ 1/σ    −µ/(2σ³) ].

Recall that the asymptotic distribution of the estimated mean and variance is

    √n ( [ µ̂  ]  −  [ µ  ] )  →d  N ( [ 0 ] ,  [ σ²     µ_3       ] ).
         [ σ̂² ]     [ σ² ]            [ 0 ]    [ µ_3    µ_4 − σ⁴  ]

The asymptotic distribution of the Sharpe ratio can be constructed by combining the asymptotic distribution of θ̂ = (µ̂, σ̂²)' with D(θ_0), and so

    √n (Ŝ − S) →d N ( 0,  [ 1/σ   −µ/(2σ³) ] [ σ²     µ_3       ] [ 1/σ   −µ/(2σ³) ]' )
                                             [ µ_3    µ_4 − σ⁴  ]

which can be simplified to

    √n (Ŝ − S) →d N ( 0,  1 − µ µ_3/σ⁴ + µ²(µ_4 − σ⁴)/(4σ⁶) ).

The asymptotic variance can be rearranged to provide some insight into the sources of uncertainty,

    √n (Ŝ − S) →d N ( 0,  1 − S × sk + S²(κ − 1)/4 ),

where sk is the skewness and κ is the kurtosis. This shows that the variance of the Sharpe ratio will be higher when the data is negatively skewed or when the data has a large kurtosis (heavy tails), both empirical regularities of asset pricing data. If asset returns were normally distributed, and so sk = 0 and κ = 3, the expression for the asymptotic variance simplifies to

    V[ √n(Ŝ − S) ] = 1 + S²/2,                                              (2.54)

an expression commonly given for the variance of the Sharpe ratio. As this example illustrates, the expression in eq. (2.54) is only correct if the skewness is 0 and returns have a kurtosis of 3 – something that would only be expected if returns are normal.
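The delta-method variance is straightforward to compute from sample moments. The following sketch (assuming numpy; the simulated excess returns are hypothetical) reports both the general standard error and the normal-returns shortcut in eq. (2.54).

    import numpy as np

    rng = np.random.default_rng(3)
    n = 2500
    r = 0.03 + 0.1 * rng.standard_normal(n)    # hypothetical excess returns

    mu, sigma2 = r.mean(), r.var()
    sigma = np.sqrt(sigma2)
    S = mu / sigma

    e = r - mu
    sk = np.mean(e ** 3) / sigma ** 3          # skewness
    kappa = np.mean(e ** 4) / sigma ** 4       # kurtosis

    avar_general = 1 - S * sk + 0.25 * S ** 2 * (kappa - 1)
    avar_normal = 1 + 0.5 * S ** 2
    print(S, np.sqrt(avar_general / n), np.sqrt(avar_normal / n))

With skewed or heavy-tailed returns the two standard errors can differ noticeably, which is exactly the warning in the paragraph above.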

2.4.5 Estimating Covariances

The presentation of the asymptotic theory in this chapter does not provide a method to implement hypothesis tests since all of the distributions depend on the covariance of the scores and the expected second derivative or Jacobian in the method of moments. Feasible testing requires estimates of these. The usual method to estimate the covariance uses "plug-in" estimators. Recall that, in the notation of the method of moments,

    Σ ≡ avar( n^{−1/2} Σ_{i=1}^n g_i(θ_0) )                                 (2.55)

or, in the notation of maximum likelihood,

    J ≡ E[ ∂l(θ; Y_i)/∂θ ∂l(θ; Y_i)/∂θ' |_{θ=θ_0} ].                        (2.56)

When the data are i.i.d., the scores or moment conditions should be i.i.d., and so the variance of the average is the average of the variance. The "plug-in" estimator for Σ uses the moment conditions evaluated at θ̂, and so the covariance estimator for method of moments applications with i.i.d. data is

    Σ̂ = n⁻¹ Σ_{i=1}^n g_i(θ̂) g_i(θ̂)'                                       (2.57)

which is simply the average outer-product of the moment conditions. The estimator of Σ in maximum likelihood is identical, replacing g_i(θ̂) with ∂l(θ; y_i)/∂θ evaluated at θ̂,

    Ĵ = n⁻¹ Σ_{i=1}^n ∂l(θ; y_i)/∂θ ∂l(θ; y_i)/∂θ' |_{θ=θ̂}.                 (2.58)

The "plug-in" estimator for the second derivative of the log-likelihood or the Jacobian of the moment conditions is similarly defined,

    Ĝ = n⁻¹ Σ_{i=1}^n ∂g_i(θ)/∂θ' |_{θ=θ̂}                                   (2.59)

or, for maximum likelihood estimators,

    Î = −n⁻¹ Σ_{i=1}^n ∂²l(θ; y_i)/∂θ∂θ' |_{θ=θ̂}.                           (2.60)

2.4.6 Estimating Covariances with Dependent Data

The estimators in eq. (2.57) and eq. (2.58) are only appropriate when the moment conditions or scores are not correlated across i.²⁰ If the moment conditions or scores are correlated across observations, the covariance estimator (but not the Jacobian estimator) must be changed to account for the dependence. Since Σ is defined as the variance of a sum, it is necessary to account for both the sum of the variances plus all of the covariances,

    Σ ≡ avar( n^{−1/2} Σ_{i=1}^n g_i(θ_0) )                                                                   (2.61)
      = lim_{n→∞} n⁻¹ [ Σ_{i=1}^n E[ g_i(θ_0) g_i(θ_0)' ] + Σ_{i=1}^{n−1} Σ_{j=i+1}^n E[ g_j(θ_0) g_{j−i}(θ_0)' + g_{j−i}(θ_0) g_j(θ_0)' ] ].

This expression depends on both the usual covariance of the moment conditions and on the covariance between the scores. When using i.i.d. data the second term vanishes since the moment conditions must be uncorrelated, and so cross-products must have expectation 0.

If the moment conditions are correlated across i then the covariance estimator must be adjusted to account for this. The obvious solution is to estimate the expectations of the cross terms in eq. (2.57) with their sample analogues, which would result in the covariance estimator

    Σ̂_DEP = n⁻¹ [ Σ_{i=1}^n g_i(θ̂) g_i(θ̂)' + Σ_{i=1}^{n−1} Σ_{j=i+1}^n ( g_j(θ̂) g_{j−i}(θ̂)' + g_{j−i}(θ̂) g_j(θ̂)' ) ].          (2.62)

This estimator is always zero since Σ̂_DEP = n⁻¹ (Σ_{i=1}^n g_i)(Σ_{i=1}^n g_i)' and Σ_{i=1}^n g_i = 0, and so Σ̂_DEP cannot be used in practice.²¹ One solution is to truncate the maximum lag to be something less than n − 1 (usually much less than n − 1), although the truncated estimator is not guaranteed to be positive definite. A better solution is to combine truncation with a weighting function (known as a kernel) to construct an estimator which will consistently estimate the covariance and is guaranteed to be positive definite. The most common covariance estimator of this type is the Newey & West (1987) covariance estimator. Covariance estimators for dependent data will be examined in more detail in the chapters on time-series data.

²⁰ Since i.i.d. implies no correlation, the i.i.d. case is trivially covered.

²¹ The scalar version of Σ̂_DEP may be easier to understand. If g_i is a scalar, then

    σ̂²_DEP = n⁻¹ [ Σ_{i=1}^n g_i²(θ̂) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n g_j(θ̂) g_{j−i}(θ̂) ].

The first term is the usual variance estimator and the second term is the sum of the (n − 1) covariance estimators. The more complicated expression in eq. (2.62) arises since order matters when multiplying vectors.
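As a simple illustration of the kernel-weighted approach, the sketch below (assuming numpy) implements a Bartlett-kernel, Newey-West style estimator for a matrix of moment conditions; the lag length L is a user choice and the AR(1) example data are purely illustrative.

    import numpy as np

    def newey_west(g, L):
        """g: n by k array of moment conditions/scores; L: maximum lag."""
        n, k = g.shape
        g = g - g.mean(axis=0)          # center (a no-op when sample moments are exactly 0)
        cov = g.T @ g / n               # Gamma_0
        for l in range(1, L + 1):
            w = 1.0 - l / (L + 1.0)     # Bartlett weight, keeps the estimate PSD
            gamma_l = g[l:].T @ g[:-l] / n
            cov += w * (gamma_l + gamma_l.T)
        return cov

    # Example: long-run variance of the sample mean of an AR(1) series
    rng = np.random.default_rng(4)
    n, rho = 1000, 0.5
    e = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = rho * y[t - 1] + e[t]
    g = (y - y.mean())[:, None]         # moment condition for the mean
    lrv = newey_west(g, L=10)           # L = 10 is an illustrative lag choice
    print(np.sqrt(lrv[0, 0] / n))       # kernel-based standard error of the sample mean

The kernel-based standard error is larger than the i.i.d. standard error here because the positive autocorrelation of the AR(1) data inflates the variance of the sample mean.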

2.5 Hypothesis Testing


Econometric models are estimated in order to test hypotheses, for example, whether a
financial theory is supported by data or to determine if a model with estimated parameters
can outperform a naïve forecast. Formal hypothesis testing begins by specifying the null
hypothesis.

Definition 2.24 (Null Hypothesis). The null hypothesis, denoted H0 , is a statement about
the population values of some parameters to be tested. The null hypothesis is also known
as the maintained hypothesis.

The null defines the condition on the population parameters that is to be tested. A null
can be either simple, for example H0 : µ = 0, or complex, which allows for testing of multi-
ple hypotheses. For example, it is common to test whether data exhibit any predictability
using a regression model

yi = θ1 + θ2 x2,i + θ3 x3,i + εi , (2.63)


and a composite null, H0 : θ2 = 0 ∩ θ3 = 0, often abbreviated H0 : θ2 = θ3 = 0.22
Null hypotheses cannot be accepted; the data can either lead to rejection of the null or
a failure to reject the null. Neither option is “accepting the null”. The inability to accept the
null arises since there are important cases where the data are not consistent with either the
null or its testing complement, the alternative hypothesis.

Definition 2.25 (Alternative Hypothesis). The alternative hypothesis, denoted H1 , is a com-


plementary hypothesis to the null and determines the range of values of the population
parameter that should lead to rejection of the null.

The alternative hypothesis specifies the population values of parameters for which the null
should be rejected. In most situations the alternative is the natural complement to the null
in the sense that the null and alternative are exclusive of each other but inclusive of the
range of the population parameter. For example, when testing whether a random variable
has mean 0, the null is H0 : µ = 0 and the usual alternative is H1 : µ ≠ 0.
In certain circumstances, usually motivated by theoretical considerations, one-sided
alternatives are desirable. One-sided alternatives only reject for population parameter values on one side of zero, and so tests using one-sided alternatives may not reject even if both the null and the alternative are false. Noting that a risk premium must be positive (if it exists),
the null hypothesis of H0 : µ = 0 should be tested against the alternative H1 : µ > 0. This
alternative indicates the null should only be rejected if there is compelling evidence that
22
∩, the intersection operator, is used since the null requires both statements to be true.

the mean is positive. These hypotheses further specify that data consistent with large neg-
ative values of µ should not lead to rejection. Focusing the alternative often leads to an increased probability of rejecting a false null. This occurs since the alternative is directed
(positive values for µ), and less evidence is required to be convinced that the null is not
valid.
Like null hypotheses, alternatives can be composite. The usual alternative to the null
H0 : θ2 = 0 ∩ θ3 = 0 is H1 : θ2 ≠ 0 ∪ θ3 ≠ 0 and so the null should be rejected whenever any of the statements in the null are false – in other words if either or both θ2 ≠ 0 or θ3 ≠ 0. Alternatives can also be formulated as lists of exclusive outcomes.23 When ex-
amining the relative precision of forecasting models, it is common to test the null that the
forecast performance is equal against a composite alternative that the forecasting perfor-
mance is superior for model A or that the forecasting performance is superior for model B .
If δ is defined as the average forecast performance difference, then the null is H0 : δ = 0
and the composite alternatives are H1A : δ > 0 and H1B : δ < 0, which indicate superior
performance of models A and B, respectively.
Once the null and the alternative have been formulated, a hypothesis test is used to
determine whether the data support the alternative.

Definition 2.26 (Hypothesis Test). A hypothesis test is a rule that specifies which values of a test statistic lead to rejection of H0 in favor of H1.

Hypothesis testing requires a test statistic, for example an appropriately standardized mean, and a critical value. The null is rejected when the test statistic is larger than the critical value.

Definition 2.27 (Critical Value). The critical value for an α-sized test, denoted Cα, is the value of the test statistic, T, beyond which the null hypothesis is rejected; it is chosen so that rejection occurs with probability α when the null is true.

The region where the test statistic is outside of the critical value is known as the rejection
region.

Definition 2.28 (Rejection Region). The rejection region is the region where T > Cα .

An important event occurs when the null is correct but the hypothesis is rejected. This
is known as a Type I error.

Definition 2.29 (Type I Error). A Type I error is the event that the null is rejected when the
null is true.

A closely related concept is the size of the test. The size controls how often Type I errors
should occur.
23
The ∪ symbol indicates the union of the two alternatives.

                                    Decision
                     Do not reject H0        Reject H0
    Truth   H0       Correct                 Type I Error (Size)
            H1       Type II Error           Correct (Power)

Table 2.2: Outcome matrix for a hypothesis test. The diagonal elements are both correct decisions. The off-diagonal elements represent Type I error, when the null is rejected but is valid, and Type II error, when the null is not rejected and the alternative is true.

Definition 2.30 (Size). The size or level of a test, denoted α, is the probability of rejecting
the null when the null is true. The size is also the probability of a Type I error.
Typical sizes include 1%, 5% and 10%, although ideally the selected size should reflect the decision maker's preferences over incorrectly rejecting the null. When the opposite occurs,
the null is not rejected when the alternative is true, a Type II error is made.
Definition 2.31 (Type II Error). A Type II error is the event that the null is not rejected when
the alternative is true.
Type II errors are closely related to the power of a test.
Definition 2.32 (Power). The power of the test is the probability of rejecting the null when
the alternative is true. The power is equivalently defined as 1 minus the probability of a
Type II error.
The two error types, size and power are summarized in table 2.2.
A perfect test would have unit power against any alternative. In other words, whenever
the alternative is true it would reject immediately. Practically the power of a test is a func-
tion of both the sample size and the distance between the population value of a parameter
and its value under the null. A test is said to be consistent if the power of the test goes to
1 as n → ∞ whenever the population value is in the alternative. Consistency is an impor-
tant characteristic of a test, but it is usually considered more important to have correct size
rather than to have high power. Because power can always be increased by distorting the
size, and it is useful to consider a related measure known as the size-adjusted power. The
size-adjusted power examines the power of a test in excess of size. Since a test should reject
at size even when the null is true, it is useful to examine the percentage of times it will reject
in excess of the percentage it should reject.
One useful tool for presenting results of test statistics is the p-value, or simply the p-val.
Definition 2.33 (P-value). The p-value is the largest size (α) where the null hypothesis cannot be rejected. The p-value can be equivalently defined as the smallest size where the null hypothesis can be rejected.

The primary advantage of a p-value is that it immediately demonstrates which test sizes
would lead to rejection: anything above the p-value. It also improves on the common prac-
tice of reporting the test statistic alone since p-values can be interpreted without knowl-
edge of the distribution of the test statistic. A related representation is the confidence in-
terval for a parameter.

Definition 2.34 (Confidence Interval). A confidence interval for a scalar parameter is the range of values, θ_0 ∈ (C_α, C̄_α), where the null H0 : θ = θ_0 cannot be rejected for a size of α.

The formal definition of a confidence interval is not usually sufficient to uniquely identify the confidence interval. Suppose that √n(θ̂ − θ_0) →d N(0, σ²). The common 95% confidence interval is (θ̂ − 1.96 σ/√n, θ̂ + 1.96 σ/√n). This set is known as the symmetric confidence interval and is formally defined as the points (C_α, C̄_α) where Pr(θ_0 ∈ (C_α, C̄_α)) = 1 − α and θ̂ − C_α = C̄_α − θ̂. An alternative, but still valid, confidence interval can be defined as (−∞, θ̂ + 1.645 σ/√n). This would also contain the true value with probability 95%. In general, symmetric confidence intervals should be used, especially for asymptotically normal parameter estimates. In rare cases where symmetric confidence intervals are not appropriate, other options for defining a confidence interval include the shortest interval, so that the confidence interval is defined as the values (C_α, C̄_α) where Pr(θ_0 ∈ (C_α, C̄_α)) = 1 − α subject to C̄_α − C_α chosen to be as small as possible, or symmetric in probability, so that the confidence interval satisfies Pr(θ_0 ∈ (C_α, θ̂)) = Pr(θ_0 ∈ (θ̂, C̄_α)) = 1/2 − α/2. When constructing confidence intervals for parameters that are asymptotically normal, these three definitions coincide.

2.5.0.1 Size and Power of a Test of the Mean with Normal Data

Suppose n i.i.d. normal random variables have unknown mean µ but known variance σ², so that the sample mean, ȳ = n⁻¹ Σ_{i=1}^n y_i, is distributed N(µ, σ²/n). When testing a null that H0 : µ = µ_0 against an alternative H1 : µ ≠ µ_0, the size of the test is the probability that the null is rejected when it is true. Since the distribution under the null is N(µ_0, σ²/n), the size can be set to α by selecting points where Pr(µ̂ ∈ (C_α, C̄_α) | µ = µ_0) = 1 − α. Since the distribution is normal, one natural choice is to select the points symmetrically so that C_α = µ_0 + (σ/√n) Φ⁻¹(α/2) and C̄_α = µ_0 + (σ/√n) Φ⁻¹(1 − α/2), where Φ(·) is the cdf of a standard normal.

The power of the test is defined as the probability the null is rejected when the alternative is true. This probability will depend on the population mean, µ_1, the sample size, the test size and the mean specified by the null hypothesis. When testing using an α-sized test, rejection will occur when µ̂ < µ_0 + (σ/√n) Φ⁻¹(α/2) or µ̂ > µ_0 + (σ/√n) Φ⁻¹(1 − α/2). Since under the alternative µ̂ is N(µ_1, σ²/n), these probabilities will be

    Φ( (C_α − µ_1) / (σ/√n) ) = Φ( (µ_0 + (σ/√n) Φ⁻¹(α/2) − µ_1) / (σ/√n) )

and

    1 − Φ( (C̄_α − µ_1) / (σ/√n) ) = 1 − Φ( (µ_0 + (σ/√n) Φ⁻¹(1 − α/2) − µ_1) / (σ/√n) ).

The total probability that the null is rejected is known as the power function,

    Power(µ_0, µ_1, σ, α, n) = Φ( (C_α − µ_1) / (σ/√n) ) + 1 − Φ( (C̄_α − µ_1) / (σ/√n) ).

A graphical illustration of the power is presented in figure 2.5. The null hypothesis is
H0 : µ = 0 and the alternative distribution was drawn at µ1 = .25. The variance σ2 = 1,
n = 5 and the size was set to 5%. The highlighted regions indicate the power: the area
under the alternative distribution, and hence the probability, which is outside of the critical
values. The bottom panel illustrates the power curve for the same parameters allowing n
to range from 5 to 1,000. When n is small, the power is low even for alternatives far from
the null. As n grows the power increases and when n = 1, 000, the power of the test is close
to unity for alternatives greater than 0.1.
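The power function derived above is easy to evaluate numerically. The sketch below (assuming numpy and scipy are available) reproduces the qualitative pattern of the bottom panel of Figure 2.5.

    import numpy as np
    from scipy.stats import norm

    def power(mu0, mu1, sigma, alpha, n):
        se = sigma / np.sqrt(n)
        c_lo = mu0 + se * norm.ppf(alpha / 2)
        c_hi = mu0 + se * norm.ppf(1 - alpha / 2)
        return norm.cdf((c_lo - mu1) / se) + 1 - norm.cdf((c_hi - mu1) / se)

    # Power against the alternative mu1 = 0.25 for the sample sizes in the figure
    for n in (5, 10, 100, 1000):
        print(n, round(power(0.0, 0.25, 1.0, 0.05, n), 3))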

2.5.1 Statistical and Economic Significance

While testing can reject hypotheses and provide meaningful p-values, statistical signifi-
cance is different from economic significance. Economic significance requires a more de-
tailed look at the data than a simple hypothesis test. Establishing the statistical significance
of a parameter is the first, and easy, step. The more difficult step is to determine whether
the effect is economically important. Consider a simple regression model

yi = θ1 + θ2 x2,i + θ3 x3,i + εi (2.64)

and suppose that the estimates of both θ2 and θ3 are statistically different from zero. This can happen for a variety of reasons, including having an economically small impact accompanied by a very large sample. To assess the relative contributions, other statistics should be examined, such as the percentage of the variation that can be explained by either variable alone and/or the range and variability of the xs.
The other important aspect of economic significance is that rejection of a hypothesis, while formally a "yes" or "no" question, should be treated in a more continuous manner. The p-value is a useful tool in this regard that can provide a deeper insight into the strength of the rejection. A p-value of .00001 is not the same as a p-value of .09999 even though a 10% test would reject for either.

2.5.2 Specifying Hypotheses

Formalized in terms of θ , a null hypothesis is


[Figure 2.5 appears here. Top panel: "Rejection Region and Power", showing the null distribution, the alternative distribution, the critical values and the power. Bottom panel: "Power Curve" as a function of µ_1 for n = 5, 10, 100 and 1,000.]

Figure 2.5: The top panel illustrates the power. The distribution of the mean under the
null and alternative hypotheses were derived under that assumption that the data are
i.i.d. normal with means µ0 = 0 and µ1 = .25, variance σ2 = 1, n = 5 and α = .05. The
bottom panel illustrates the power function, in terms of the alternative mean, for the same
parameters when n = 5, 10, 100 and 1,000.

H0 : R(θ ) = 0 (2.65)

where R(·) is a function from Rk to Rm , m ≤ k , where m represents the number of hy-


potheses in a composite null. While this specification of hypotheses is very flexible, testing
non-linear hypotheses raises some subtle but important technicalities and further discus-
sion will be reserved for later. Initially, a subset of all hypotheses, those in the linear equality
restriction (LER) class, which can be specified as

H0 : Rθ − r = 0 (2.66)

will be examined where R is a m by k matrix and r is a m by 1 vector. All hypotheses in the


LER class can be written as weighted sums of model parameters,
    R_11 θ_1 + R_12 θ_2 + ... + R_1k θ_k = r_1
    R_21 θ_1 + R_22 θ_2 + ... + R_2k θ_k = r_2                              (2.67)
     ⋮
    R_m1 θ_1 + R_m2 θ_2 + ... + R_mk θ_k = r_m.

Each linear hypothesis is represented as a row in the above set of equations. Linear equality constraints can be used to test parameter restrictions on θ = (θ_1, θ_2, θ_3, θ_4)' such as

    θ_1 = 0                                                                 (2.68)
    3θ_2 + θ_3 = 1
    Σ_{j=1}^4 θ_j = 0
    θ_1 = θ_2 = θ_3 = 0.

For example, the hypotheses in eq. (2.68) can be described in terms of R and r as

    H0                        R                          r
    θ_1 = 0                   [1 0 0 0]                  0
    3θ_2 + θ_3 = 1            [0 3 1 0]                  1
    Σ_{j=1}^4 θ_j = 0         [1 1 1 1]                  0
    θ_1 = θ_2 = θ_3 = 0       [1 0 0 0                   [0 0 0]'
                               0 1 0 0
                               0 0 1 0]

When using linear equality constraints, alternatives are generally formulated as H1 : Rθ − r ≠ 0. Once both the null and the alternative hypotheses have been postulated, it is necessary to determine whether the data are consistent with the null hypothesis using one of the many tests.

2.5.3 The Classical Tests


Three classes of statistics will be described to test hypotheses: Wald, Lagrange Multiplier
and Likelihood Ratio. Wald tests are perhaps the most intuitive: they directly test whether
Rθ̂ − r, the value under the null, is close to zero by exploiting the asymptotic normality

of the estimated parameters. Lagrange Multiplier tests incorporate the constraint into the estimation problem using a Lagrangian. If the constraint has a small effect on the value of the objective function, the Lagrange multipliers, often described as the shadow price of the constraint in economic applications, should be close to zero. The magnitude of the scores forms the basis of the LM test statistic. Finally, likelihood ratios test whether the data are less likely under the null than they are under the alternative. If these restrictions are not statistically meaningful, this ratio should be close to one since the difference in the log-likelihoods should be small.

2.5.4 Wald Tests

Wald test statistics are possibly the most natural method to test a hypothesis, and are often the simplest to compute since only the unrestricted model must be estimated. Wald tests directly exploit the asymptotic normality of the estimated parameters to form test statistics with asymptotic χ²_m distributions. Recall that a χ²_ν random variable is defined to be the sum of ν independent standard normals squared, Σ_{i=1}^ν z_i², where z_i ∼ i.i.d. N(0, 1). Also recall that if z is an m-dimensional normal vector with mean µ and covariance Σ,

    z ∼ N(µ, Σ)                                                             (2.69)

then the standardized version of z can be constructed as

    Σ^{−1/2}(z − µ) ∼ N(0, I).                                              (2.70)

Defining w = Σ^{−1/2}(z − µ) ∼ N(0, I), it is easy to see that w'w = Σ_{i=1}^m w_i² ∼ χ²_m. In the usual case, the method of moments estimator, which nests ML and QML estimators as special cases, is asymptotically normal,

    √n(θ̂ − θ_0) →d N(0, G⁻¹ Σ (G⁻¹)').                                      (2.71)

If the null hypothesis, H0 : Rθ = r, is true, it follows directly that

    √n(Rθ̂ − r) →d N(0, R G⁻¹ Σ (G⁻¹)' R').                                  (2.72)

This allows a test statistic to be formed,

    W = n (Rθ̂ − r)' [ R G⁻¹ Σ (G⁻¹)' R' ]⁻¹ (Rθ̂ − r),                       (2.73)

which is the sum of the squares of m random variables, each asymptotically uncorrelated standard normal, and so W is asymptotically χ²_m distributed. A hypothesis test with size α can be conducted by comparing W against C_α = F⁻¹(1 − α) where F(·) is the cdf of a χ²_m. If W ≥ C_α then the null is rejected.

There is one problem with the definition of W in eq. (2.73): it is infeasible since it depends on G and Σ which are unknown. The usual practice is to replace the unknown elements of the covariance matrix with consistent estimates to compute a feasible Wald statistic,

    W = n (Rθ̂ − r)' [ R Ĝ⁻¹ Σ̂ (Ĝ⁻¹)' R' ]⁻¹ (Rθ̂ − r),                       (2.74)

which has the same asymptotic distribution as the infeasible Wald.
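As an illustration, the sketch below (assuming numpy and scipy; the simulated data and the joint null H0 : µ = 0 and σ² = 1 are chosen purely for demonstration) computes a feasible Wald statistic using the method of moments covariance derived earlier for the mean and variance.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    n = 500
    y = rng.standard_normal(n)             # data generated under the null here

    mu_hat = y.mean()
    s2_hat = ((y - mu_hat) ** 2).mean()
    e = y - mu_hat
    mu3, mu4 = np.mean(e ** 3), np.mean(e ** 4)

    theta_hat = np.array([mu_hat, s2_hat])
    Sigma_hat = np.array([[s2_hat, mu3], [mu3, mu4 - s2_hat ** 2]])
    avar_hat = Sigma_hat                   # G = -I_2, so the sandwich collapses to Sigma

    R = np.eye(2)
    r = np.array([0.0, 1.0])
    diff = R @ theta_hat - r
    W = n * diff @ np.linalg.inv(R @ avar_hat @ R.T) @ diff
    print(W, 1 - chi2.cdf(W, df=2))        # statistic and asymptotic p-value

Because the data were simulated under the null, W should be an unremarkable draw from a χ²_2 in most replications.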

2.5.4.1 t-tests

A t-test is a special case of a Wald test and is applicable to tests involving a single hypothesis. Suppose the null is

    H0 : Rθ − r = 0

where R is 1 by k, and so

    √n(Rθ̂ − r) →d N(0, R G⁻¹ Σ (G⁻¹)' R').

The studentized version can be formed by subtracting the mean and dividing by the standard deviation,

    t = √n(Rθ̂ − r) / √( R G⁻¹ Σ (G⁻¹)' R' ) →d N(0, 1),                     (2.75)

and the test statistic can be compared to the critical values from a standard normal to conduct a hypothesis test. t-tests have an important advantage over the broader class of Wald tests – they can be used to test one-sided null hypotheses. A one-sided hypothesis takes the form H0 : Rθ ≥ r or H0 : Rθ ≤ r, which are contrasted with one-sided alternatives of H1 : Rθ < r or H1 : Rθ > r, respectively. When using a one-sided test, rejection occurs when Rθ̂ − r is statistically different from zero and Rθ̂ < r or Rθ̂ > r as specified by the alternative.

t-tests are also used in a commonly encountered test statistic, the t-stat, a test of the null that a parameter is 0 against an alternative that it is not. The t-stat is popular because most models are written in such a way that if a parameter θ = 0 then it will have no impact.

Definition 2.35 (t-stat). The t-stat of a parameter θ_j is the t-test value of the null H0 : θ_j = 0 against a two-sided alternative H1 : θ_j ≠ 0,

    t-stat ≡ θ̂_j / σ_θ̂                                                     (2.76)

where

    σ_θ̂ = √( e_j G⁻¹ Σ (G⁻¹)' e_j' / n )                                    (2.77)

and where e_j is a vector of 0s with 1 in the jth position.
Note that the t -stat is identical to the expression in eq. (2.75) when R = e j and r = 0.
R = e j corresponds to a hypothesis test involving only element j of θ and r = 0 indicates
that the null is θ j = 0.
A closely related measure is the standard error of a parameter. Standard errors are es-
sentially standard deviations – square-roots of variance – except that the expression “stan-
dard error” is applied when describing the estimation error of a parameter while “standard
deviation” is used when describing the variation in the data or population.

Definition 2.36 (Standard Error). The standard error of a parameter θ is the square root of the parameter's variance,

    s.e.(θ̂) = √(σ²_θ̂)                                                      (2.78)

where

    σ²_θ̂ = e_j G⁻¹ Σ (G⁻¹)' e_j' / n                                        (2.79)

and where e_j is a vector of 0s with 1 in the jth position.

2.5.5 Likelihood Ratio Tests


Likelihood ratio tests examine how "likely" the data are under the null and the alternative. If the hypothesis is valid then the data should be (approximately) equally likely under each. The LR test statistic is defined as

    LR = −2 ( l(θ̃; y) − l(θ̂; y) )                                           (2.80)

where θ̃ is defined as

    θ̃ = argmax_θ l(θ; y)   subject to Rθ − r = 0                            (2.81)

and θ̂ is the unconstrained estimator,

    θ̂ = argmax_θ l(θ; y).                                                   (2.82)

Under the null H0 : Rθ − r = 0, LR →d χ²_m. The intuition behind the asymptotic distribution of the LR can be seen in a second order Taylor expansion around the parameters estimated under the null, θ̃,

    l(y; θ̃) = l(y; θ̂) + (θ̃ − θ̂)' ∂l(y; θ̂)/∂θ + (1/2) √n(θ̃ − θ̂)' [ n⁻¹ ∂²l(y; θ̂)/∂θ∂θ' ] √n(θ̃ − θ̂) + R³        (2.83)

where R³ is a remainder term that vanishes as n → ∞. Since θ̂ is an unconstrained estimator of θ_0,

    ∂l(y; θ̂)/∂θ = 0

and

    −2 ( l(y; θ̃) − l(y; θ̂) ) ≈ √n(θ̃ − θ̂)' [ −n⁻¹ ∂²l(y; θ̂)/∂θ∂θ' ] √n(θ̃ − θ̂).                                  (2.84)

Under some mild regularity conditions, when the MLE is correctly specified,

    −n⁻¹ ∂²l(y; θ̂)/∂θ∂θ' →p −E[ ∂²l(y; θ_0)/∂θ∂θ' ] = I,

and

    √n(θ̃ − θ̂) →d N(0, I⁻¹).

Thus,

    √n(θ̃ − θ̂)' [ −n⁻¹ ∂²l(y; θ̂)/∂θ∂θ' ] √n(θ̃ − θ̂) →d χ²_m                                                      (2.85)

and so 2 ( l(y; θ̂) − l(y; θ̃) ) →d χ²_m. The only difficulty remaining is that the distribution of this quadratic form is a χ²_m and not a χ²_k, where k is the dimension of the parameter vector. While formally establishing this is tedious, the intuition follows from the number of restrictions. If θ̃ were unrestricted then it must be the case that θ̃ = θ̂ since θ̂ is defined as the unrestricted estimator. Applying a single restriction leaves k − 1 free parameters in θ̃ and thus it should be close to θ̂ except for this one restriction.
tives (e.g. H0 : θ = θ 0 against H1 : θ = θ 1 ). Another important advantage of the LR is
that the covariance of the parameters does not need to be estimated. In many problems
accurate parameter covariances may be difficult to estimate, and imprecise covariance es-
timators have negative consequences for test statistics, such as size distortions where a 5% test will reject substantially more than 5% of the time when the null is true.

It is also important to note that the likelihood ratio does not have an asymptotic χ²_m distribution when the assumed likelihood f(y; θ) is misspecified. When this occurs the information matrix equality fails to hold and the asymptotic distribution of the LR is known as a mixture of χ² distributions. In practice, the assumed error distribution is often misspecified and so it is important that the distributional assumptions used to estimate θ are verified prior to using likelihood ratio tests.
Likelihood ratio tests are not available for method of moments estimators since no distribution function is assumed.²⁴

²⁴ It is possible to construct a likelihood ratio-type statistic for method of moments estimators. Define

    g_n(θ) = n⁻¹ Σ_{i=1}^n g_i(θ)

to be the average moment conditions evaluated at a parameter θ. The likelihood ratio-type statistic for method of moments estimators is defined as

    LM = n g_n(θ̃)' Σ̂⁻¹ g_n(θ̃) − n g_n(θ̂)' Σ̂⁻¹ g_n(θ̂)
       = n g_n(θ̃)' Σ̂⁻¹ g_n(θ̃)

where the simplification is possible since g_n(θ̂) = 0 and where

    Σ̂ = n⁻¹ Σ_{i=1}^n g_i(θ̂) g_i(θ̂)'

is the sample covariance of the moment conditions evaluated at the unrestricted parameter estimates. This test statistic only differs from the LM test statistic in eq. (2.90) via the choice of the covariance estimator, and it should be similar in performance to the adjusted LM test statistic in eq. (2.92).

2.5.6 Lagrange Multiplier, Score and Rao Tests

Lagrange Multiplier (LM), Score and Rao tests are all the same statistic. While Lagrange Multiplier test may be the most appropriate description, describing the tests as score tests illustrates the simplicity of the test's construction. Score tests exploit the first order condition to test whether a null hypothesis is compatible with the data. Using the unconstrained estimator of θ, θ̂, the scores must be zero,

    ∂l(θ; y)/∂θ |_{θ=θ̂} = 0.                                                (2.86)

The score test examines whether the scores are "close" to zero – in a statistically meaningful way – when evaluated using the parameters estimated subject to the null restriction, θ̃. Define

    s_i(θ̃) = ∂l_i(θ; y_i)/∂θ |_{θ=θ̃}                                        (2.87)

as the ith score, evaluated at the restricted estimator. If the null hypothesis is true, then
    √n ( n⁻¹ Σ_{i=1}^n s_i(θ̃) ) →d N(0, Σ).                                 (2.88)

This forms the basis of the score test, which is computed as


 0  
L M = n s̄ θ̃ Σ−1 s̄ θ̃ (2.89)
  Pn  
where s̄ θ̃ = n −1
i =1 si θ̃ . While this version is not feasible since it depends on Σ, the
standard practice is to replace Σ with a consistent estimator and to compute the feasible
score test,
 0 −1  
L M = n s̄ θ̃ Σ̂ s̄ θ̃ (2.90)

where the estimator of Σ depends on the assumptions made about the scores. In the case
where the scores are i.i.d. (usually because the data are i.i.d.),
n
X    0
Σ̂ = n −1
si θ̃ si θ̃ (2.91)
i =1
h  i
is a consistent estimator since E si θ̃ = 0 if the null is true. In practice a more pow-
erful version of the LM test can be formed by subtracting the mean from the covariance
estimator and using
n   
X       0
Σ̃ = n −1
si θ̃ − s̄ θ̃ si θ̃ − s̄ θ̃ (2.92)
i =1

which must be smaller (in the matrix sense) than Σ̂, although asymptotically, if the null is
true, these two estimators will converge to the same limit. Like the Wald and the LR, the LM
follows an asymptotic χm 2
distribution, and an LM test statistic will be rejected if L M > Cα
where Cα is the 1 − α quantile of a χm2
distribution.
Scores test can be used with method of moments estimators by simply replacing the
score of the likelihood with the moment conditions evaluated at the restricted parameter,
   
si θ̃ = gi θ̃ ,

and then evaluating eq. (2.90) or (2.92).

2.5.7 Comparing and Choosing the Tests

All three of the classic tests, the Wald, likelihood ratio and Lagrange multiplier have the
same limiting asymptotic distribution. In addition to all being asymptotically distributed
as a χm
2
, they are all asymptotically equivalent in the sense they all have an identical asymp-
2.6 The Bootstrap and Monte Carlo 117

totic distribution – in other words, the χm 2


that each limits to is the same. As a result, there
is no asymptotic argument that one should be favored over the other.
The simplest justifications for choosing one over the others are practical considerations.
Wald requires estimation under the alternative – the unrestricted model – and require an
estimate of the asymptotic covariance of the parameters. LM tests require estimation un-
der the null – the restricted model – and require an estimate of the asymptotic covariance
of the scores evaluated at the restricted parameters. LR tests require both forms to be esti-
mated but do not require any covariance estimates. On the other hand, Wald and LM tests
can easily be made robust to many forms of misspecification by using the “sandwich” co-
0
variance estimator, G−1 Σ G−1 for moment based estimators or I −1 J I −1 for QML estima-
tors. LR tests cannot be easily corrected and instead will have a non-standard distribution.
Models which are substantially easier to estimate under the null or alternative lead to
a natural choice. If a model is easy to estimate in its restricted form, but not unrestricted
LM tests are good choices. If estimation under the alternative is simpler then Wald tests
are reasonable. If they are equally simple to estimate, and the distributional assumptions
used in ML estimation are plausible, LR tests are likely the best choice. Empirically a re-
lationship exists where W ≈ L R ≥ L M . LM is often smaller, and hence less likely to
reject the null, since it estimates the covariance of the scores under the null. When the null
may be restrictive, the scores will generally have higher variances when evaluated using
the restricted parameters. The larger variances will lower the value of L M since the score
covariance is inverted in the statistic. A simple method to correct this is to use the adjusted
LM based off the modified covariance estimator in eq. (2.92).

2.6 The Bootstrap and Monte Carlo

The bootstrap is an alternative technique for estimating parameter covariances and con-
ducting inference. The name bootstrap is derived from the expression “to pick yourself up
by your bootstraps” – a seemingly impossible task. The bootstrap, when initially proposed,
was treated as an equally impossible feat, although it is now widely accepted as a valid, and
in some cases, preferred method to plug-in type covariance estimation. The bootstrap is a
simulation technique and is similar to Monte Carlo. However, unlike Monte Carlo, which
requires a complete data-generating process, the bootstrap makes use of the observed data
to simulate the data – hence the similarity to the original turn-of-phrase.
Monte Carlo is an integration technique that uses simulation to approximate the un-
i.i.d.
derlying distribution of the data. Suppose Yi ∼ F (θ ) where F is some distribution, and
that interest is in the E [g (Y )]. Further suppose it is possible to simulate from F (θ ) so that
a sample {yi } can be constructed. Then
n
X p
n −1
g (yi ) → E [g (Y )]
i =1
118 Estimation, Inference and Hypothesis Testing

as long as this expectation exists since the simulated data are i.i.d. by construction.
The observed data can be used to compute the empirical cdf.

Definition 2.37 (Empirical CDF). The empirical cdf is defined


n
X
Fˆ (c ) = n −1
I[yi <c ] .
i =1

As long as Fˆ is close to F , then the empirical cdf can be used to simulate random vari-
ables which should be approximately distributed F , and simulated data from the empirical
cdf should have similar statistical properties (mean, variance, etc.) as data simulated from
the true population cdf. The empirical cdf is a coarse step function and so only values
which have been observed can be simulated, and so simulating from the empirical cdf of
the data is identical to re-sampling the original data. In other words, the observed data can
be directly used to simulate the from the underlying (unknown) cdf.
Figure 2.6 shows the population cdf for a standard normal and two empirical cdfs, one
estimated using n = 20 observations and the other using n = 1, 000. The coarse empirical
cdf highlights the stair-like features of the empirical cdf estimate which restrict random
numbers generated using the empirical cdf to coincide with the data used to compute the
empirical cdf.
The bootstrap can be used for a variety of purposes. The most common is to estimate
the covariance matrix of estimated parameters. This is an alternative to the usual plug-in
type estimator, and is simple to implement when the estimator is available in closed form.

Algorithm 2.38 (i.i.d. Nonparametric Bootstrap Covariance).

1. Generate a set of n uniform integers { ji }ni=1 on [1, 2, . . . , n ].



2. Construct a simulated sample y ji .

3. Estimate the parameters of interest using y ji , and denote the estimate θ̃ b .




4. Repeat steps 1 through 3 a total of B times.

5. Estimate the variance of θ̂ using

h i B 
X  0
V θ̂ = B
b −1
θ̃ j − θ̂ θ̃ j − θ̂ .
b =1

or alternatively
h i B 
X  0
b θ̂ = B−1
V θ̃ j − θ̃ θ̃ j − θ̃ .
b =1
2.6 The Bootstrap and Monte Carlo 119

Standard Normal CDF and Empirical CDFs for n = 20 and 1, 000


1 Normal CDF
ECDF, n = 20
0.9 ECDF, n = 1000

0.8

0.7

0.6
(X)
Fd

0.5

0.4

0.3

0.2

0.1

0
−3 −2 −1 0 1 2 3
X

Figure 2.6: These three lines represent the population cdf of a standard normal, and two
empirical cdfs constructed form simulated data. The very coarse empirical cdf is based on
20 observations and clearly highlights the step-nature of empirical cdfs. The other empir-
ical cdf, which is based on 1,000 observations, appear smoother but is still a step function.

The variance estimator that comes from this algorithm cannot be directly compared
to the asymptotic covariance estimator since the bootstrap covariance is converging to 0.

Normalizing the bootstrap covariance estimate by n will allow comparisons and direct
application of the test statistics based on the asymptotic covariance. Note that when using
a conditional model, the vector [yi x0i ]0 should be jointly bootstrapped. Aside from this small
modification to step 2, the remainder of the procedure remains valid.
The nonparametric bootstrap is closely related to the residual bootstrap, at least when
it is possible to appropriately define a residual. For example, when Yi |Xi ∼ N β 0 xi , σ2 ,

0
the residual can be defined ε̂i = yi − β̂ xi . Alternatively if Yi |Xi ∼ Scaled − χν2 exp β 0 xi ,

q
0
then ε̂i = yi / β̂ x . The residual bootstrap can be used whenever it is possible to express
yi = g (θ , εi , xi ) for some known function g .
Algorithm 2.39 (i.i.d. Residual Bootstrap Covariance).
1. Generate a set of n uniform integers { ji }ni=1 on [1, 2, . . . , n ].
 
2. Construct a simulated sample ε̂ ji , x ji and define ỹi = g θ̂ , ε̃i , x̃i where ε̃i = ε̂ ji

120 Estimation, Inference and Hypothesis Testing

and x̃i = x ji .25

3. Estimate the parameters of interest using { ỹi , x̃i }, and denote the estimate θ̃ b .

4. Repeat steps 1 through 3 a total of B times.

5. Estimate the variance of θ̂ using

h i B 
X  0
V θ̂ = B
b −1
θ̃ b − θ̂ θ̃ b − θ̂ .
b =1

or alternatively
h i B 
X  0
b θ̂ = B−1
V θ̃ b − θ̃ θ̃ b − θ̃ .
b =1

It is important to emphasize that the bootstrap is not, generally, a better estimator of


parameter covariance than standard plug-in estimators.26 Asymptotically both are con-
sistent and can be used equivalently. Additionally, i.i.d. bootstraps can only be applied to
(conditionally) i.i.d. data and using an inappropriate bootstrap will produce an inconsis-
tent estimator. When data have dependence it is necessary to use an alternative bootstrap
scheme.
When the interest lies in confidence intervals, an alternative procedure that directly
uses the empirical quantiles of the bootstrap parameter estimates can be constructed (known
as the percentile method).

Algorithm 2.40 (i.i.d. Nonparametric Bootstrap Confidence Interval).

1. Generate a set of n uniform integers { ji }ni=1 on [1, 2, . . . , n ].



2. Construct a simulated sample y ji .

3. Estimate the parameters of interest using y ji , and denote the estimate θ̃ b .




4. Repeat steps 1 through 3 a total of B times.

5. Estimate the 1 − α confidence interval of θ̂k using


h n o n oi
qα/2 θ̃k , q1−α/2 θ̃k

25
In some models, it is possible to use independent indices on ε̂ and x, such as in a linear regression when
the data are conditionally homoskedastic (See chapter 3). In general it is not possible to explicitly break the
link between εi and xi , and so these should usually be resampled using the same indices.
26
There are some problem-dependent bootstraps that are more accurate than plug-in estimators in an
asymptotic sense. These are rarely encountered in financial economic applications.
2.6 The Bootstrap and Monte Carlo 121

n o
where qα θ̃k is the empirical α quantile of the bootstrap estimates. 1-sided lower
confidence intervals can be constructed as
h n oi
R (θk ), q1−α θ̃k

and 1-sided upper confidence intervals can be constructed as


h n o i
qα θ̃k , R (θk )

where R (θk ) and R (θk ) are the lower and upper extremes of the range of θk (possibly
±∞).

The percentile method can also be used directly to compute P-values of test statistics.
This requires enforcing the null hypothesis on the data and so is somewhat more involved.
For example, suppose the null hypothesis is E [yi ] = 0. This can be enforced by replacing
the original data with ỹi = yi − ȳ in step 2 of the algorithm.

Algorithm 2.41 (i.i.d. Nonparametric Bootstrap P-value).

1. Generate a set of n uniform integers { ji }ni=1 on [1, 2, . . . , n ].



2. Construct a simulated sample using data where the null hypothesis is true, ỹ ji .
 
3. Compute the test statistic of interest using ỹ ji , and denote the statistic T θ̃ b .


4. Repeat steps 1 through 3 a total of B times.

5. Compute the bootstrap P-value using


B
X
P−
d v a l = B −1 I[T (θ̂ )≤T (θ̃ )]
b =1

for 1-sided tests where the rejection region is for large values (e.g. a Wald test). When
using 2-sided tests, compute the bootstrap P-value using
B
X
P−
d v a l = B −1 I[|T (θ̂ )|≤|T (θ̃ )|]
b =1

The test statistic may depend on a covariance matrix. When this is the case, the co-
variance matrix is usually estimated from the bootstrapped data using a plug-in method.
Alternatively, it is possible to use any other consistent estimator (when the null is true) of
the asymptotic covariance, such as one based on an initial (separate) bootstrap.
122 Estimation, Inference and Hypothesis Testing

When models are maximum likelihood based, so that a complete model for the data
is specified, it is possible to apple a parametric form of the bootstrap to estimate covari-
ance matrices. This procedure is virtually identical to standard Monte Carlo except that
the initial estimate θ̂ is used in the simulation.

Algorithm 2.42 (i.i.d. Parametric Bootstrap Covariance (Monte Carlo)).


 
1. Simulate a set of n i.i.d. draws { ỹi } from F θ̂ .

2. Estimate the parameters of interest using { ỹi }, and denote the estimates θ̃ b .

3. Repeat steps 1 through 4 a total of B times.

4. Estimate the variance of θ̂ using

h i B 
X  0
V θ̂ = B −1
θ̃ b − θ̂ θ̃ b − θ̂ .
b =1

or alternatively
h i B 
X  0
V θ̂ = B −1
θ̃ b − θ̃ θ̃ b − θ̃ .
b =1

When models use conditional maximum likelihood, it is possible to use parametric


bootstrap as part of a two-step procedure. First, apply a nonparametric bootstrap to the
conditioning
  data{xi }, and then, using the bootstrapped conditioning data, simulate Yi ∼
F θ̂ |X̃i . This is closely related to the residual bootstrap, only the assumed parametric
distribution F is used in place of the data-derived residuals.

2.7 Inference on Financial Data


Inference will be covered in greater detail in conjunction with specific estimators and mod-
els, such as linear regression or ARCH models. These examples examine relatively simple
hypotheses to illustrate the steps required in testing hypotheses.

2.7.1 Testing the Market Premium


Testing the market premium is a cottage industry. While current research is more interested
in predicting the market premium, testing whether the market premium is significantly
different from zero is a natural application of the tools introduced in this chapter. Let λ
denote the market premium and let σ2 be the variance of the return. Since the market is a
traded asset it must be the case that the premium for holding market risk is the same as the
mean of the market return. Monthly data for the Value Weighted Market (V W M ) and the
2.7 Inference on Financial Data 123

risk-free rate (R f ) was available between January 1927 and June 2008. Data for the V W M
was drawn from CRSP and data for the risk-free rate was available from Ken French’s data
library. Excess returns on the market are defined as the return to holding the market minus
the risk free rate, V W M ie = V W M i − R fi . The excess returns along with a kernel density
plot are presented in figure 2.7. Excess returns are both negatively skewed and heavy tailed
– October 1987 is 5 standard deviations from the mean.
The mean and variance can be computed using the method of moments as detailed in
section 2.1.4, and the covariance of the mean and the variance can be computed using the
estimators described in section 2.4.1. The estimates were calculated according to
" #  −1 Pn e

λ̂ n i =1 V W M i
=  −1 Pn  2 
σ̂ 2
n i =1 V W M i
e
− λ̂

and, defining ε̂i = V W M ie − λ̂, the covariance of the moment conditions was estimated
by
" Pn Pn  #
i =1 ε̂i ε̂ ε̂ σ̂
2 2 2
i −
Σ̂ = n −1 Pn Pi =1
n
i
2 .
i =1 ε̂i ε̂i − σ̂ ε̂ σ̂2
2 2 2

i =1 i −

Since the plim of the Jacobian is −I2 , the parameter covariance is also Σ̂. Combining
these two results with a Central Limit Theorem (assumed to hold), the asymptotic distri-
bution is

√ h i
d
n θ − θ̂ → N (0, Σ)
0
where θ = λ, σ2 . These produce the results in the first two rows of table 2.3.

These estimates can also be used to make inference on the standard deviation, σ = σ2
and the Sharpe ratio, S = λ/σ. The derivation of the asymptotic distribution of the Sharpe
ratio was presented in 2.4.4.1 and the asymptotic distribution
√ of the standard deviation can
be determined in a similar manner where d (θ ) = σ2 and so

∂ d (θ )
 
1
D (θ ) = = 0 √ .
∂ θ0 2 σ2
Combining this expression with the asymptotic distribution for the estimated mean and
variance, the asymptotic distribution of the standard deviation estimate is

√ µ4 − σ 4
 
d
n (σ̂ − σ) → N 0, .
4σ2

which was computed by dividing the [2,2] element of the parameter covariance by 4σ̂2 .
124 Estimation, Inference and Hypothesis Testing

2.7.1.1 Bootstrap Implementation

The bootstrap can be used to estimate parameter covariance, construct confidence inter-
vals – either used the estimated covariance or the percentile method, and to tabulate the
P-value of a test statistic. Estimating the parameter covariance is simple – the data is re-
sampled to create a simulated sample with n observations and the mean and variance are
estimated. This is repeated 10,000 times and the parameter covariance is estimated using

B
" # "" ##! " # "" ##!0
X µ̃b µ̂b µ̃b µ̂b
Σ̂ = B −1 − −
σ̃2b µ̂2b σ̃2b µ̂2b
b =1
B 
X  0
= B −1
θ̃ b − θ̂ θ̃ b − θ̂ .
b =1

The percentile method can be used to construct confidence intervals for the parame-
ters as estimated and for functions of parameters such as the Sharpe ratio. Constructing
the confidence intervals for a function of the parameters requires constructing the function
of the estimated parameters using each simulated sample and then computing the confi-
dence interval using the empirical quantile of these estimates. Finally, the test P-value for
the statistic for the null H0 : λ = 0 can be computed directly by transforming the returns
so that they have mean 0 using r̃i = ri − r̄i . The P-value can be tabulated using
B
X
− val = B −1
Pd I[r̄ ≤r̃ b ]
b =1

where r̃ b is the average from bootstrap replication b . Table 2.4 contains the bootstrap
standard errors, confidence intervals based on the percentile method and the bootstrap
P-value for testing whether the mean return is 0. The standard errors are virtually identical
to those estimated using the plug-in method, and the confidence intervals are similar to
θ̂k ± 1.96s.e. (θk ). The null that the average return is 0 is also strongly rejected.

2.7.2 Is the NASDAQ Riskier than the S&P 100?


A second application examines the riskiness of the NASDAQ and the S&P 100. Both of these
indices are value-weighted and contain 100 companies. The NASDAQ 100 contains only
companies that trade on the NASDAQ while the S&P 100 contains large companies that
trade on either the NYSE or the NASDAQ.
The null hypothesis is that the variances are the same, H0 : σS2 P = σN2 D , and the alterna-
tive is that the variance of the NASDAQ is larger, H1 : σN2 D > σS2 P .27 The null and alternative
can be reformulated as a test that δ = σN2 D − σS2 P is equal to zero against an alternative that
27
It may also be interesting to test against a two-sided alternative that the variances are unequal, H1 :
σN
2
D 6= σS2 P .
2.7 Inference on Financial Data 125

Parameter Estimate Standard Error t -stat


λ 0.627 0.173 3.613
σ2 29.41 2.957 9.946
σ 5.423 0.545 9.946
λ
σ
0.116 0.032 3.600

Table 2.3: Parameter estimates and standard errors for the market premium (λ), the vari-
ance of the excess return (σ2 ), the standard deviation of the excess return (σ) and the
Sharpe ratio ( σλ ). Estimates and variances were computed using the method of moments.
The standard errors for σ and σλ were computed using the delta method.

Bootstrap Confidence Interval


Parameter Estimate Standard Error Lower Upper
λ 0.627 0.174 0.284 0.961
σ2 29.41 2.964 24.04 35.70
σ 5.423 0.547 4.903 5.975
λ
σ
0.116 0.032 0.052 0.179

H0 : λ = 0
P-value 3.00 ×10−4

Table 2.4: Parameter estimates, bootstrap standard errors and confidence intervals (based
on the percentile method) for the market premium (λ), the variance of the excess return
(σ2 ), the standard deviation of the excess return (σ) and the Sharpe ratio ( σλ ). Estimates
were computed using the method of moments. The standard errors for σ and σλ were com-
puted using the delta method using the bootstrap covariance estimator.

it is greater than zero. The estimation of the parameters can be formulated as a method of
moments problem,

   
µ̂S P rS P,i
n  2
σ̂S2 P (rS P,i − µ̂S P )
  X 
= n −1
   
µ̂N D rN D ,i
   
i =1
   
2
σ̂N2 D (rN D ,i − µ̂N D )

Inference can be performed by forming the moment vector using the estimated parame-
ters, gi ,
126 Estimation, Inference and Hypothesis Testing

CRSP Value Weighted Market (VWM) Excess Returns


CRSP VWM Excess Returns
40

20

−20

1930 1940 1950 1960 1970 1980 1990 2000

CRSP VWM Excess Return Density

0.08

0.06

0.04

0.02

0
−20 −10 0 10 20 30

Figure 2.7: These two plots contain the returns on the VWM (top panel) in excess of the
risk free rate and a kernel estimate of the density (bottom panel). While the mode of the
density (highest peak) appears to be clearly positive, excess returns exhibit strong negative
skew and are heavy tailed.

 
rS P,i − µS P
(rS P,i − µS P )2 − σS2 P
 
gi = 
 
rN D ,i − µN D

 
(rN D ,i − µN D )2 − σN2 D

and recalling that the asymptotic distribution is given by

√  
d
 −1 
n θ̂ − θ → N 0, G−1 Σ G0 .

Using the set of moment conditions,


2.7 Inference on Financial Data 127

Daily Data
Parameter Estimate Std. Error/Correlation
µS P 9.06 3.462 -0.274 0.767 -0.093
σS P 17.32 -0.274 0.709 -0.135 0.528
µN D 9.73 0.767 -0.135 4.246 -0.074
σN S 21.24 -0.093 0.528 -0.074 0.443

Test Statistics
δ 0.60 σ̂δ 0.09 t -stat 6.98

Monthly Data
Parameter Estimate Std. Error/Correlation
µS P 8.61 3.022 -0.387 0.825 -0.410
σS P 15.11 -0.387 1.029 -0.387 0.773
µN D 9.06 0.825 -0.387 4.608 -0.418
σN S 23.04 -0.410 0.773 -0.418 1.527

Test Statistics
δ 25.22 σ̂δ 4.20 t -stat 6.01

Table 2.5: Estimates, standard errors and correlation matrices for the S&P 100 and NAS-
DAQ 100. The top panel uses daily return data between January 3, 1983 and December 31,
2007 (6,307 days) to estimate the parameter values in the left most column. The rightmost
4 columns contain the parameter standard errors (diagonal elements) and the parameter
correlations (off-diagonal elements). The bottom panel contains estimates, standard er-
rors and correlations from monthly data between January 1983 and December 2007 (300
months). Parameter and covariance estimates have been annualized. The test statistics
(and related quantities) were performed and reported on the original (non-annualized)
values.

 
−1 0 0 0
n 
−2 (rS P,i − µS P ) −1 0 0
X 
G = plimn →∞ n −1  
0 0 −1 0
 
i =1
 
0 0 −2 (rN D ,i − µN D ) −1
= −I4 .

Σ can
 be estimated using the moment conditions evaluated at the estimated parameters,
gi θ̂ ,
128 Estimation, Inference and Hypothesis Testing

n
X    
Σ̂ = n −1
gi θ̂ g0i θ̂ .
i =1

Noting that the (2,2) element of Σ is the variance of σ̂S2 P , the (4,4) element of Σ is the vari-
ance of σ̂N2 D and the (2,4) element is the covariance of the two, the variance of δ̂ = σ̂N2 D −
σ̂S2 P can be computed as the sum of the variances minus two times the covariance, Σ[2,2] +
Σ[4,4] − 2Σ[2,4] . Finally a one-sided t -test can be performed to test the null.
Data was taken from Yahoo! finance between January 1983 and December 2008 at both
the daily and monthly frequencies. Parameter estimates are presented in table 2.5. The
table also contains the parameter standard errors p – the square-root of the asymptotic co-
variance divided by the number of observations ( Σ[i ,i ] /n ) – along the diagonal and the
parameter correlations – Σ[i , j ] / Σ[i ,i ] Σ[ j , j ] – in the off-diagonal positions. The top panel
p

contains results for daily data while the bottom contains results for monthly data. In both
panels 100× returns were used.
All parameter estimates are reported in annualized form, which requires multiplying
daily (monthly)
√  mean estimates by 252 (12), and daily (monthly) volatility estimated by

252 12 . Additionally, the delta method was used to adjust the standard errors on the
volatility estimates since the actual parameter estimates were the means and variances.
Thus, the reported parameter variance covariance matrix has the form
   
252 √
0 0 0 252 √
0 0 0
252 252
     0 0 0   0 0 0 
D θ̂ Σ̂D θ̂ =  2σS P
 Σ̂  2σS P
.
   
 0 0 252 √
0   0 0 252 √
0 
252 252
0 0 0 2σN D
0 0 0 2σN D

In both cases δ is positive with a t -stat greater than 6, indicating a strong rejection of the
null in favor of the alternative. Since this was a one-sided test, the 95% critical value would
be 1.645 (Φ (.95)).
This test could also have been implemented using an LM test, which requires estimating
the two mean parameters but restricting the variances to be equal. One θ̃ is estimated, the
LM test statistic is computed as
  −1  
L M = n gn θ̃ Σ̂ g0n θ̃

where

  n
X  
gn θ̃ = n −1
gi θ̃
i =1

and where µ̃S P = µ̂S P , µ̃N D = µ̂N D (unchanged) and σ̃S2 P = σ̃N2 D = σ̂S2 P + σ̂N2 D /2.

2.7 Inference on Financial Data 129

Daily Data
Parameter Estimate BootStrap Std. Error/Correlation
µS P 9.06 3.471 -0.276 0.767 -0.097
σS P 17.32 -0.276 0.705 -0.139 0.528
µN D 9.73 0.767 -0.139 4.244 -0.079
σN S 21.24 -0.097 0.528 -0.079 0.441

Monthly Data
Parameter Estimate Bootstrap Std. Error/Correlation
µS P 8.61 3.040 -0.386 0.833 -0.417
σS P 15.11 -0.386 1.024 -0.389 0.769
µN D 9.06 0.833 -0.389 4.604 -0.431
σN S 23.04 -0.417 0.769 -0.431 1.513

Table 2.6: Estimates and bootstrap standard errors and correlation matrices for the S&P
100 and NASDAQ 100. The top panel uses daily return data between January 3, 1983 and
December 31, 2007 (6,307 days) to estimate the parameter values in the left most column.
The rightmost 4 columns contain the bootstrap standard errors (diagonal elements) and
the correlations (off-diagonal elements). The bottom panel contains estimates, bootstrap
standard errors and correlations from monthly data between January 1983 and December
2007 (300 months). All parameter and covariance estimates have been annualized.

2.7.2.1 Bootstrap Covariance Estimation

The bootstrap is an alternative to the plug-in covariance estimators. The bootstrap was
implemented using 10,000 resamples where the data were assumed to be i.i.d.. In each
bootstrap resample, the full 4 by 1 vector of parameters was computed. These were com-
bined to estimate the parameter covariance using
B 
X  0
Σ̂ = B −1
θ̃ b − θ̂ θ̃ b − θ̂ .
i =1

Table 2.6 contains the bootstrap standard errors and correlations. Like the results in 2.5, the
parameter estimates and covariance have been annualized, and volatility rather than vari-
ance is reported. The covariance estimates are virtually indistinguishable to those com-
puted using the plug-in estimator. This highlights that the bootstrap is not (generally) a
better estimator, but is merely an alternative.28

28
In this particular application, as the bootstrap and the plug-in estimators are identical as B → ∞ for fixed
n. This is not generally the case.
130 Estimation, Inference and Hypothesis Testing

2.7.3 Testing Factor Exposure

Suppose excess returns were conditionally normal with mean µi = β 0 xi and constant vari-
ance σ2 . This type of model is commonly used to explain cross-sectional variation in re-
turns, and when the conditioning variables include only the market variable, the model is
known as the Capital Asset Pricing Model (CAP-M, Lintner (1965), Sharpe (1964)). Multi-
factor models allow for additional conditioning variables such as the size and value factors
(Fama & French 1992, 1993, Ross 1976). The size factor is the return on a portfolio which is
long small cap stocks and short large cap stocks. The value factor is the return on a port-
folio that is long high book-to-market stocks (value) and short low book-to-market stocks
(growth).

This example estimates a 3 factor model where the conditional mean of excess returns
on individual assets is modeled as a linear function of the excess return to the market, the
size factor and the value factor. This leads to a model of the form
 
f f
ri − ri = β0 + β1 rm ,i − ri + β2 rs ,i + β3 rv,i + εi
rie = β 0 xi + εi
f
where ri is the risk-free rate (short term government rate), rm ,i is the return to the market
portfolio, rs ,i is the return to the size portfolio and rv,i is the return to the value portfolio. εi
is a residual which is assumed to have a N 0, σ2 distribution.


Factor models can be formulated as a conditional maximum likelihood problem,


0 2
n
( )
1X r i − β xi
l r|X; θ = − ln (2π) + ln σ2 +
 
2 σ2
i =1

0
where θ = β 0 σ2 . The MLE can be found using the first order conditions, which are


n
∂ l (r ; θ ) 1 X  0

= xi ri − β̂ xi = 0
∂β σ̂2
i =1
n
!−1 n
X X
⇒ β̂ = xi x0i x i ri
i =1 j =1
 0
2
∂ l (r ; θ ) 1
n
X 1 ri − β̂ xi
= − − =0
∂σ 2 2 σ̂2 σ̂4
i =1
n  2
X 0
⇒ σ̂ 2
= n −1
ri − β̂ xi
i =1
2.7 Inference on Financial Data 131

The vector of scores is


 " # " #" # " #
∂ l ri |xi ; θ σ
1
2 xi ε i
1
0 xi εi xi ε i
= ε 2 = σ2 =S
∂θ − 2σ1 2 + 2σi 4 0 1
2σ4
σ − ε2i
2
σ − ε2i
2

where εi = ri − β 0 xi . The second form will be used to simplify estimating the parameters
covariance. The Hessian is
 " #
∂ 2 l ri |xi ; θ − σ12 xi x0i − σ14 xi εi
= ε2 ,
∂ θ∂ θ0 − σ14 xi εi 2σ1 4 − σi6

and the information matrix is


" #
− σ12 xi x0i − σ14 xi εi
I = −E ε2
− σ14 xi εi 2σ1 4 − σi6
"  #
1 1
ε
 0  
E x x
i i − E x E |X
= σ2  σ4 i  i
− σ14 E xi E εi |X E 2σ1 4
 
" #
1
 0
E x i x 0
= σ 2 i
1 .
0 2σ4

The covariance of the scores is


" " # #
ε2i xi x0i σ2 xi εi − xi ε3i
J = E S 2 S
σ2 x0i εi − x0i ε3i σ2 − ε2i
E ε2i xi x0i E σ ε ε
"    2 3
 #
x i i − xi
= S h 2 ii S
E σ2 x0i εi − x0i ε3i E σ2 − ε2i
 

E E ε2i |X xi x0i E σ2 x0i E h εi |X − x0i Ei ε3i |X


"          #
= S 2 S
E E σ2 x0i εi − x0i ε3i |X E σ2 − ε2i
  
" # " #
σ2 E xi x0i 1
   0
0 E xi x 0
= S S= σ 2 i
1
0 2σ4 0 2σ4

The estimators of the covariance matrices are


n
" #" # " #
X 1
0 x ε̂
i i
h i 1
0
Jˆ = n −1 σ̂2 x0i ε̂i σ̂2 − ε̂2i σ̂2
0 2σ̂1 4 σ̂2 − ε̂2i 0 2σ̂1 4
i =1
n
" #" #" #
1
0 ε̂ 2
x x
i i i
0
σ̂ 2
x ε̂
i i − x ε̂ 3 1
0
i i
X
= n −1 σ̂2
2 2
σ̂2

i =1
0 2σ̂1 4 σ̂ 2 0
x ε̂
i i − x ε̂
0 3
i i σ̂ 2
− ε̂ i 0 2σ̂1 4
132 Estimation, Inference and Hypothesis Testing

and
n
" #
X − σ̂12 xi x0i − σ̂14 xi εi
Î = −1 × n −1 ε2i
i =1
− σ̂14 xi εi 1
2σ̂4
− σ̂6
n
" #
X − σ̂12 xi x0i 0
= −1 × n −1 1 σ̂2
i =1
0 2σ̂4
− σ̂6
n
" # n
" #" #
X − σ̂12 xi x0i 0 X 1
0 xi x0i 0
= −1 × n −1 = n −1 σ̂2
0 − 2σ̂1 4 0 1
2σ̂4
0 1
i =1 i =1

Note that the off-diagonal term in J , σ̂2 x0i ε̂i − x0i ε̂3i , is not necessarily 0 when the data may
be conditionally skewed. Combined, the QMLE parameter covariance estimator is then

n
" #!−1 " n
" ## n
" #
X xi x0i 0 X ε̂2i xi x0i σ̂2 xi ε̂i − xi ε̂3i X xi x0i 0
Î −1 J I −1 = n −1 n −1 2 n −1
i =1
0 1
i =1
σ̂2 x0i ε̂i − x0i ε̂3i σ̂2 − ε̂2i i =1
0 1

where the identical scaling terms have been canceled. Additionally, when returns are con-
ditionally normal,
n
" #" #" #
1
0 ε̂ 2
x x
i i i
0
σ̂ 2
x ε̂
i i − x ε̂ 3 1
0
i i
X
plim Jˆ = plim n −1 σ̂2
2 2
σ̂2

i =1
0 2σ̂1 4 σ̂ xi ε̂i − xi ε̂i
2 0 0 3
σ̂ − ε̂i
2 0 2σ̂1 4
" #" #" #
1
0 σ xi xi
2 0
0 1
0
= σ2
1
σ2
0 2σ4 0 2σ 4
0 2σ1 4
" #
1 0
xx
σ2 i i
0
= 1
0 2σ4

and
n
" #
1
X x x0
σ̂2 i i
0
plim Î = plim n −1
1
0 2σ̂4
i =1
" #
1
x x0
σ2 i i
0
= 1 ,
0 2σ4

and so the IME, plim J −I = 0, will hold when returns are conditionally normal. Moreover,
when returns are not normal, all of the terms in J will typically differ from the limits above
and so the IME will not generally hold.

2.7.3.1 Data and Implementation

Three assets are used to illustrate hypothesis testing: ExxonMobil (XOM), Google (GOOG)
and the SPDR Gold Trust ETF (GLD). The data used to construct the individual equity re-
2.7 Inference on Financial Data 133

turns were downloaded from Yahoo! Finance and span the period September 2, 2002 until
September 1, 2012.29 The market portfolio is the CRSP value-weighted market, which is
a composite based on all listed US equities. The size and value factors were constructed
using portfolio sorts and are made available by Ken French. All returns were scaled by 100.

2.7.3.2 Wald tests

Wald tests make use of the parameters and estimated covariance to assess the evidence
against the null. When testing whether the size and value factor are relevant for an asset,
the null is H0 : β2 = β3 = 0. This problem can be set up as a Wald test using
" # " #
0 0 1 0 0 0
R= ,r=
0 0 0 1 0 0

and  0 h i−1  
W = n Rθ̂ − r RÎ −1 J Î −1 R0 Rθ̂ − r .

The Wald test has an asymptotic χ22 distribution since the null imposes 2 restrictions.
t -stats can similarly be computed for individual parameters

√ β̂ j
tj = n  
s.e. β̂ j
 
where s.e. β̂ j is the square of the jth diagonal element of the parameter covariance matrix.
Table 2.7 contains the parameter estimates from the models, t -stats for the coefficients and
the Wald test statistics for the null H0 : β2 = β3 = 0. The t -stats and the Wald tests where
implemented using both the sandwich covariance estimator (QMLE) and the maximum
likelihood covariance estimator. The two sets of test statistics differ in magnitude since the
assumption of normality is violated in the data, and so only the QMLE-based test statistics
should be considered reliable.

2.7.3.3 Likelihood Ratio tests

Likelihood ratio tests are simple to implement when parameters are estimated using MLE.
The likelihood ratio test statistic is
    
L R = −2 l r|X; θ̃ − l r|X; θ̂

where θ̃ is the null-restricted estimator of the parameters. The likelihood ratio has an
asymptotic χ22 distribution since there are two restrictions. Table 2.7 contains the likeli-
29
Google and the SPDR Gold Trust ETF both started trading after the initial sample date. In both cases, all
available data was used.
134 Estimation, Inference and Hypothesis Testing

hood ratio test statistics for the null H0 : β2 = β3 = 0. Caution is needed when interpret-
ing likelihood ratio test statistics since the asymptotic distribution is only valid when the
model is correctly specified – in this case, when returns are conditionally normal, which is
not plausible.

2.7.3.4 Lagrange Multiplier tests

Lagrange Multiplier tests are somewhat more involved in this problem. The key to com-
puting the LM test statistic is to estimate the score using the restricted parameters,
" #
1
σ2 i i
x ε̃
s̃i = ε̃2 ,
− 2σ̃1 2 + 2σ̃i 4

0
h 0 i0
where ε̃i = ri − β̃ xi and θ̃ = β̃ σ̃2 is the vector of parameters estimated when the null
is imposed. The LM test statistic is then

L M = n s̃S̃−1 s̃

where
n
X n
X
s̃ = n −1
s̃i , and S̃ = n −1
s̃i s̃0i .
i =1 i =1

The improved version of the LM can be computed by replacing S̃ with a covariance estima-
tor based on the scores from the unrestricted estimates,
n
X
Ŝ = n −1 ŝi ŝ0i .
i =1

Table 2.7 contains the LM test statistics for the null H0 : β2 = β3 = 0 using the two co-
variance estimators. LM test statistics are naturally robust to violations of the assumed
normality since Ŝ and S̃ are directly estimated from the scores and not based on properties
of the assumed normal distribution.

2.7.3.5 Discussion of Test Statistics

Table 2.7 contains all test statistics for the three series. The test statistics based on the MLE
and QMLE parameter covariances differ substantially in all three series, and importantly,
the conclusions also differ for the SPDR Gold Trust ETF. The difference between the two sets
of results arises since the assumption that returns are conditionally normal with constant
variance is not supported in the data. The MLE-based Wald test and the LR test (which is
implicitly MLE-based) have very similar magnitudes for all three series. The QMLE-based
Wald test statistics are also always larger than the LM-based test statistics which reflects
2.7 Inference on Financial Data 135

the difference of estimating the covariance under the null or under the alternative.
136 Estimation, Inference and Hypothesis Testing

ExxonMobil
Parameter Estimate t (MLE) t (QMLE)
β0 0.016 0.774 0.774 Wald (MLE) 251.21
(0.439) (0.439) (<0.001)
β1 0.991 60.36 33.07 Wald (QMLE) 88.00
(<0.001) (<0.001) (<0.001)
β2 -0.536 −15.13 −9.24 LR 239.82
(<0.001) (<0.001) (<0.001)
β3 -0.231 −6.09 −3.90 LM (S̃) 53.49
(<0.001) (<0.001) (<0.001)
LM (Ŝ) 54.63
(<0.001)

Google
Parameter Estimate t (MLE) t (QMLE)
β0 0.063 1.59 1.60 Wald (MLE) 18.80
(0.112) (0.111) (<0.001)
β1 0.960 30.06 23.74 Wald (QMLE) 10.34
(<0.001) (<0.001) (0.006)
β2 -0.034 −0.489 −0.433 LR 18.75
(0.625) (0.665) (<0.001)
β3 -0.312 −4.34 −3.21 LM (S̃) 10.27
(<0.001) (0.001) (0.006)
LM (Ŝ) 10.32
(0.006)

SPDR Gold Trust ETF


Parameter Estimate t (MLE) t (QMLE)
β0 0.057 1.93 1.93 Wald (MLE) 12.76
(0.054) (0.054) (0.002)
β1 0.130 5.46 2.84 Wald (QMLE) 5.16
(<0.001) (0.004) (0.076)
β2 -0.037 −0.733 −0.407 LR 12.74
(0.464) (0.684) (0.002)
β3 -0.191 −3.56 −2.26 LM (S̃) 5.07
(<0.001) (0.024) (0.079)
LM (Ŝ) 5.08
(0.079)

Table 2.7: Parameter estimates, t-statistics (both MLE and QMLE-based), and tests of the
exclusion restriction that the size and value factors have no effect (H0 : β2 = β3 = 0) on the
returns of the ExxonMobil, Google and SPDR Gold Trust ETF.
2.7 Inference on Financial Data 137

Exercises
Exercise 2.1. The distribution of a discrete random variable X depends on a discretely val-
ued parameter θ ∈ {1, 2, 3} according to
x f (x |θ = 1) f (x |θ = 2) f (x |θ = 3)
1 1
1 2 3
0
1 1
2 3 4
0
1 1 1
3 6 3 6
1 1
4 0 12 12
3
5 0 0 4

Find the MLE of θ if one value from X has been observed. Note: The MLE is a function that
returns an estimate of θ given the data that has been observed. In the case where both the
observed data and the parameter are discrete, a “function” will take the form of a table.

Exercise 2.2. Let X 1 , . . . , X n be an i.i.d. sample from a gamma(α,β ) distribution. The den-
sity of a gamma(α,β ) is

1
f (x ; α, β ) = x α−1 exp(−x /β )
Γ (α) β α

where Γ (z ) is the gamma-function evaluated at z . Find the MLE of β assuming α is known.

Exercise 2.3. Let X 1 , . . . , X n be an i.i.d. sample from the pdf

θ
f (x |θ ) = , 1 ≤ x < ∞, θ > 1
x θ +1
i. What is the MLE of θ ?

ii. What is E[X j ]?

iii. How can the previous answer be used to compute a method of moments estimator of
θ?

Exercise 2.4. Let X 1 , . . . , X n be an i.i.d. sample from the pdf

1
f (x |θ ) = , 0 ≤ x ≤ θ,θ > 0
θ
i. What is the MLE of θ ? [This is tricky]

ii. What is the method of moments Estimator of θ ?

iii. Compute the bias and variance of each estimator.


138 Estimation, Inference and Hypothesis Testing

Exercise 2.5. Let X 1 , . . . , X n be an i.i.d. random sample from the pdf

f (x |θ ) = θ x θ −1 , 0 ≤ x ≤ 1, 0 < θ < ∞

i. What is the MLE of θ ?

ii. What is the variance of the MLE.

iii. Show that the MLE is consistent.

Exercise 2.6. Let X 1 , . . . , X i be an i.i.d. sample from a Bernoulli(p ).

i. Show that X̄ achieves the Cramér-Rao lower bound.

ii. What do you conclude about using X̄ to estimate p ?

Exercise 2.7. Suppose you witness a coin being flipped 100 times with 56 heads and 44
tails. Is there evidence that this coin is unfair?

Exercise 2.8. Let X 1 , . . . , X i be an i.i.d. sample with mean µ and variance σ2 .

i. Show X̃ = N
PN
i =1 w i = 1.
P
i =1 w i X i is unbiased if and only if

ii. Show that the variance of X̃ is minimized if wi = 1


n
for i = 1, 2, . . . , n .

Exercise 2.9. Suppose {X i } in i.i.d. sequence of normal variables with unknown mean µ
and known variance σ2 .

i. Derive the power function of a 2-sided t -test of the null H0 : µ = 0 against an alter-
native H1 : µ 6= 0? The power function should have two arguments, the mean under
the alternative, µ1 and the number of observations n.

ii. Sketch the power function for n = 1, 4, 16, 64, 100.

iii. What does this tell you about the power as n → ∞ for µ 6= 0?

Exercise 2.10. Let X 1 and X 2 are independent and drawn from a Uniform(θ , θ + 1) distri-
bution with θ unknown. Consider two test statistics,

T1 : Reject if X 1 > .95

and
T2 : Reject if X 1 + X 2 > C

i. What is the size of T1 ?

ii. What value must C take so that the size of T2 is equal to T1


2.7 Inference on Financial Data 139

iii. Sketch the power curves of the two tests as a function of θ . Which is more powerful?

Exercise 2.11. Suppose {yi } are a set of transaction counts (trade counts) over 5-minute
intervals which are believed to be i.i.d. distributed from a Poisson with parameter λ. Recall
the probability density function of a Poisson is

λ yi e −λ
f (yi ; λ) =
yi !

i. What is the log-likelihood for this problem?

ii. What is the MLE of λ?

iii. What is the variance of the MLE?

iv. Suppose that λ̂ = 202.4 and that the sample size was 200. Construct a 95% confidence
interval for λ.

v. Use a t -test to test the null H0 : λ = 200 against H1 : λ 6= 200 with a size of 5%

vi. Use a likelihood ratio to test the same null with a size of 5%.

vii. What happens if the assumption of i.i.d. data is correct but that the data does not
follow a Poisson distribution?

Upper tail probabilities


for a standard normal z
Cut-off c Pr(z > c )
1.282 10%
1.645 5%
1.96 2.5%
2.32 1%

5% Upper tail cut-off for χq2


Degree of Freedom q Cut-Off
1 3.84
2 5.99
199 232.9
200 234.0
140 Estimation, Inference and Hypothesis Testing
Chapter 3

Analysis of Cross-Sectional Data

Note: The primary reference text for these notes is Hayashi (2000). Other comprehensive
treatments are available in Greene (2007) and Davidson & MacKinnon (2003).

Linear regression is the foundation of modern econometrics. While the impor-


tance of linear regression in financial econometrics has diminished in recent
years, it is still widely employed. More importantly, the theory behind least
squares estimators is useful in broader contexts and many results of this chap-
ter are special cases of more general estimators presented in subsequent chap-
ters. This chapter covers model specification, estimation, inference, under both
the classical assumptions and using asymptotic analysis, and model selection.

Linear regression is the most basic tool of any econometrician and is widely used through-
out finance and economics. Linear regression’s success is owed to two key features: the
availability of simple, closed form estimators and the ease and directness of interpretation.
However, despite superficial simplicity, the concepts discussed in this chapter will reappear
in the chapters on time series, panel data, Generalized Method of Moments (GMM), event
studies and volatility modeling.

3.1 Model Description

Linear regression expresses a dependent variable as a linear function of independent vari-


ables, possibly random, and an error.

yi = β1 x1,i + β2 x2,i + . . . + βk xk ,i + εi , (3.1)

where yi is known as the regressand, dependent variable or simply the left-hand-side vari-
able. The k variables, x1,i , . . . , xk ,i are known as the regressors, independent variables or
right-hand-side variables. β1 , β2 , . . ., βk are the regression coefficients, εi is known as the
142 Analysis of Cross-Sectional Data

innovation, shock or error and i = 1, 2, . . . , n index the observation. While this representa-
tion clarifies the relationship between yi and the x s, matrix notation will generally be used
to compactly describe models:

β1 ε1
      
y1 x11 x12 . . . x1k
 y2   x21 x22 . . . x2k  β2   ε2 
= + (3.2)
      
 .. .. .. .. ..  .. .. 
 .   . . . .  .   . 
yn xn 1 xn 2 . . . xn k βk εn

y = Xβ + ε (3.3)
where X is an n by k matrix, β is a k by 1 vector, and both y and ε are n by 1 vectors.
Two vector notations will occasionally be used: row,

= x1 β +ε1
 
y1
 y2 = x2 β +ε2 
(3.4)
 
 .. .. .. 
 . . . 
yn = xn β +εn
and column,

y = β1 x1 + β2 x2 + . . . + βk xk + ε. (3.5)
Linear regression allows coefficients to be interpreted all things being equal. Specifi-
cally, the effect of a change in one variable can be examined without changing the others.
Regression analysis also allows for models which contain all of the information relevant
for determining yi whether it is directly of interest or not. This feature provides the mecha-
nism to interpret the coefficient on a regressor as the unique effect of that regressor (under
certain conditions), a feature that makes linear regression very attractive.

3.1.1 What is a model?

What constitutes a model is a difficult question to answer. One view of a model is that of
the data generating process (DGP). For instance, if a model postulates

yi = β1 xi + εi
one interpretation is that the regressand, yi , is exactly determined by xi and some random
shock. An alternative view, one that I espouse, holds that xi is the only relevant variable
available to the econometrician that explains variation in yi . Everything else that deter-
mines yi cannot be measured and, in the usual case, cannot be placed into a framework
which would allow the researcher to formulate a model.
Consider monthly returns on the S&P 500, a value weighted index of 500 large firms in
3.1 Model Description 143

the United States. Equity holdings and returns are generated by individuals based on their
beliefs and preferences. If one were to take a (literal) data generating process view of the
return on this index, data on the preferences and beliefs of individual investors would need
to be collected and formulated into a model for returns. This would be a daunting task to
undertake, depending on the generality of the belief and preference structures.
On the other hand, a model can be built to explain the variation in the market based on
observable quantities (such as oil price changes, macroeconomic news announcements,
etc.) without explicitly collecting information on beliefs and preferences. In a model of
this type, explanatory variables can be viewed as inputs individuals consider when form-
ing their beliefs and, subject to their preferences, taking actions which ultimately affect the
price of the S&P 500. The model allows the relationships between the regressand and re-
gressors to be explored and is meaningful even though the model is not plausibly the data
generating process.
In the context of time-series data, models often postulate that the past values of a series
are useful in predicting future values. Again, suppose that the data were monthly returns
on the S&P 500 and, rather than using contemporaneous explanatory variables, past re-
turns are used to explain present and future returns. Treated as a DGP, this model implies
that average returns in the near future would be influenced by returns in the immediate
past. Alternatively, taken an approximation, one interpretation postulates that changes in
beliefs or other variables that influence holdings of assets change slowly (possibly in an
unobservable manner). These slowly changing “factors” produce returns which are pre-
dictable. Of course, there are other interpretations but these should come from finance
theory rather than data. The model as a proxy interpretation is additionally useful as it al-
lows models to be specified which are only loosely coupled with theory but that capture
interesting features of a theoretical model.
Careful consideration of what defines a model is an important step in the development
of an econometrician, and one should always consider which assumptions and beliefs are
needed to justify a specification.

3.1.2 Example: Cross-section regression of returns on factors

The concepts of linear regression will be explored in the context of a cross-section regres-
sion of returns on a set of factors thought to capture systematic risk. Cross sectional regres-
sions in financial econometrics date back at least to the Capital Asset Pricing Model (CAPM,
Markowitz (1959), Sharpe (1964) and Lintner (1965)), a model formulated as a regression of
individual asset’s excess returns on the excess return of the market. More general specifica-
tions with multiple regressors are motivated by the Intertemporal CAPM (ICAPM, Merton
(1973)) and Arbitrage Pricing Theory (APT, Ross (1976)).
The basic model postulates that excess returns are linearly related to a set of systematic
risk factors. The factors can be returns on other assets, such as the market portfolio, or any
other variable related to intertemporal hedging demands, such as interest rates, shocks to
144 Analysis of Cross-Sectional Data

Variable Description
VWM Returns on a value-weighted portfolio of all NYSE, AMEX and NASDAQ
stocks
SM B Returns on the Small minus Big factor, a zero investment portfolio that
is long small market capitalization firms and short big caps.
HML Returns on the High minus Low factor, a zero investment portfolio that
is long high BE/ME firms and short low BE/ME firms.
UMD Returns on the Up minus Down factor (also known as the Momentum factor),
a zero investment portfolio that is long firms with returns in the top
30% over the past 12 months and short firms with returns in the bottom 30%.
SL Returns on a portfolio of small cap and low BE/ME firms.
SM Returns on a portfolio of small cap and medium BE/ME firms.
SH Returns on a portfolio of small cap and high BE/ME firms.
BL Returns on a portfolio of big cap and low BE/ME firms.
BM Returns on a portfolio of big cap and medium BE/ME firms.
BH Returns on a portfolio of big cap and high BE/ME firms.
RF Risk free rate (Rate on a 3 month T-bill).
D AT E Date in format YYYYMM.

Table 3.1: Variable description for the data available in the Fama-French data-set used
throughout this chapter.

inflation or consumption growth.

f
ri − ri = f i β + ε i

or more compactly,

rie = fi β + εi
f
where rie = ri − ri is the excess return on the asset and fi = [ f1,i , . . . , fk ,i ] are returns on
factors that explain systematic variation.
Linear factors models have been used in countless studies, the most well known by
Fama and French (Fama & French (1992) and Fama & French (1993)) who use returns on
specially constructed portfolios as factors to capture specific types of risk. The Fama-French
data set is available in Excel (ff.xls) or MATLAB (ff.mat) formats and contains the vari-
ables listed in table 3.1.
All data, except the interest rates, are from the CRSP database and were available monthly
from January 1927 until June 2008. Returns are calculated as 100 times the logarithmic
price difference (100(ln(pi ) − ln(pn −1 ))). Portfolios were constructed by sorting the firms
into categories based on market capitalization, Book Equity to Market Equity (BE/ME), or
past returns over the previous year. For further details on the construction of portfolios,
3.2 Functional Form 145

Portfolio Mean Std. Dev Skewness Kurtosis


V WMe 0.63 5.43 0.22 10.89
SM B 0.23 3.35 2.21 25.26
HML 0.41 3.58 1.91 19.02
UMD 0.78 4.67 -2.98 31.17
BH e 0.91 7.26 1.72 21.95
BM e 0.67 5.81 1.41 20.85
B Le 0.59 5.40 -0.08 8.30
SH e 1.19 8.34 2.30 25.68
SM e 0.99 7.12 1.39 18.29
SLe 0.70 7.84 1.05 13.56

Table 3.2: Descriptive statistics of the six portfolios that will be used throughout this chap-
ter. The data consist of monthly observations from January 1927 until June 2008 (n = 978).

see Fama & French (1993) or Ken French’s website:

http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.

A general model for the B H portfolio can be specified

B Hi − R Fi = β1 + β2 (V W M i − R Fi ) + β3S M Bi + β4 H M L i + β5U M Di + εi

or, in terms of the excess returns,

B Hie = β1 + β2 V W M ie + β3S M Bi + β4 H M L i + β5U M Di + εi .


The coefficients in the model can be interpreted as the effect of a change in one variable
holding the other variables constant. For example, β3 captures the effect of a change in the
S M Bi risk factor holding V W M ie , H M L i and U M Di constant. Table 3.2 contains some
descriptive statistics of the factors and the six portfolios included in this data set.

3.2 Functional Form


A linear relationship is fairly specific and, in some cases, restrictive. It is important to dis-
tinguish specifications which can be examined in the framework of a linear regression from
those which cannot. Linear regressions require two key features of any model: each term
on the right hand side must have only one coefficient that enters multiplicatively and the
error must enter additively.1 Most specifications satisfying these two requirements can be
1
A third but obvious requirement is that neither yi nor any of the x j ,i may be latent (unobservable), j =
1, 2, . . . , k , i = 1, 2, . . . , n.
146 Analysis of Cross-Sectional Data

treated using the tools of linear regression.2 Other forms of “nonlinearities” are permissi-
ble. Any regressor or the regressand can be nonlinear transformations of the original ob-
served data.
Double log (also known as log-log) specifications, where both the regressor and the re-
gressands are log transformations of the original (positive) data, are common.

ln yi = β1 + β2 ln xi + εi .
In the parlance of a linear regression, the model is specified

ỹi = β1 + β2 x̃i + εi
where ỹi = ln(yi ) and x̃i = ln(xi ). The usefulness of the double log specification can be
illustrated by a Cobb-Douglas production function subject to a multiplicative shock

β β
Yi = β1 K i 2 L i 3 εi .
Using the production function directly, it is not obvious that, given values for output (Yi ),
capital (K i ) and labor (L i ) of firm i , the model is consistent with a linear regression. How-
ever, taking logs,

ln Yi = ln β1 + β2 ln K i + β3 ln L i + ln εi
the model can be reformulated as a linear regression on the transformed data. Other forms,
such as semi-log (either log-lin, where the regressand is logged but the regressors are un-
changed, or lin-log, the opposite) are often useful to describe certain relationships.
Linear regression does, however, rule out specifications which may be of interest. Linear
β
regression is not an appropriate framework to examine a model of the form yi = β1 x1,i2 +
β
β3 x2,i4 +εi . Fortunately, more general frameworks, such as generalized method of moments
(GMM) or maximum likelihood estimation (MLE), topics of subsequent chapters, can be
applied.
Two other transformations of the original data, dummy variables and interactions, can
be used to generate nonlinear (in regressors) specifications. A dummy variable is a special
class of regressor that takes the value 0 or 1. In finance, dummy variables (or dummies) are
used to model calendar effects, leverage (where the magnitude of a coefficient depends
on the sign of the regressor), or group-specific effects. Variable interactions parameterize
nonlinearities into a model through products of regressors. Common interactions include
2 3
powers of regressors (x1,i , x1,i , . . .), cross-products of regressors (x1,i x2,i ) and interactions
between regressors and dummy variables. Considering the range of nonlinear transforma-
tion, linear regression is surprisingly general despite the restriction of parameter linearity.
The use of nonlinear transformations also change the interpretation of the regression
2
There are further requirements on the data, both the regressors and the regressand, to ensure that esti-
mators of the unknown parameters are reasonable, but these are treated in subsequent sections.
3.2 Functional Form 147

coefficients. If only unmodified regressors are included,

yi = xi β + εi
∂ yi
and ∂ xk ,i
= βk . Suppose a specification includes both xk and xk2 as regressors,

yi = β1 xi + β2 xi2 + εi

In this specification, ∂∂ xyii = β1 + β2 xi and the level of the variable enters its partial effect.
Similarly, in a simple double log model

ln yi = β1 ln xi + εi ,

and

∂y
∂ ln yi y %∆y
β1 = = =
∂ ln xi ∂x
x
%∆x

Thus, β1 corresponds to the elasticity of yi with respect to xi . In general, the coefficient on


a variable in levels corresponds to the effect of a one unit changes in that variable while
the coefficient on a variable in logs corresponds to the effect of a one percent change. For
example, in a semi-log model where the regressor is in logs but the regressand is in levels,

yi = β1 ln xi + εi ,

β1 will correspond to a unit change in yi for a % change in xi . Finally, in the case of dis-
crete regressors, where there is no differential interpretation of coefficients, β represents
the effect of a whole unit change, such as a dummy going from 0 to 1.

3.2.1 Example: Dummy variables and interactions in cross section regression

Two calendar effects, the January and the December effects, have been widely studied in
finance. Simply put, the December effect hypothesizes that returns in December are un-
usually low due to tax-induced portfolio rebalancing, mostly to realize losses, while the
January effect stipulates returns are abnormally high as investors return to the market.
To model excess returns on a portfolio (B Hie ) as a function of the excess market return
(V W M ie ), a constant, and the January and December effects, a model can be specified

B Hie = β1 + β2 V W M ie + β3 I1i + β4 I12i + εi

where I1i takes the value 1 if the return was generated in January and I12i does the same for
December. The model can be reparameterized into three cases:

B Hie = (β1 + β3 ) + β2 V W M ie + εi January


B Hie = (β1 + β4 ) + β2 V W M ie + εi December
B Hie = β1 + β2 V W M ie + εi Otherwise

Similarly dummy interactions can be used to produce models with both different intercepts
and different slopes in January and December,

B Hie = β1 + β2 V W M ie + β3 I1i + β4 I12i + β5 I1i V W M ie + β6 I12i V W M ie + εi .

If excess returns on a portfolio were nonlinearly related to returns on the market, a simple
model can be specified

B Hie = β1 + β2 V W M ie + β3 (V W M ie )2 + β4 (V W M ie )3 + εi .
Dittmar (2002) proposed a similar model to explain the cross-sectional dispersion of ex-
pected returns.
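A minimal sketch of how the January/December dummies and dummy-slope interactions in these specifications might be assembled, using simulated monthly data in place of the actual portfolio returns; all variable names and parameter values are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
n = 600                                         # 50 years of hypothetical monthly data
month = np.tile(np.arange(1, 13), n // 12)
vwm_e = rng.normal(0.5, 4.0, size=n)            # hypothetical market excess returns
bh_e = 0.1 + 1.1 * vwm_e + rng.normal(size=n)   # hypothetical portfolio excess returns

I1 = (month == 1).astype(float)     # January dummy
I12 = (month == 12).astype(float)   # December dummy

# Different intercepts and slopes in January and December
X = np.column_stack([np.ones(n), vwm_e, I1, I12, I1 * vwm_e, I12 * vwm_e])
beta_hat, *_ = np.linalg.lstsq(X, bh_e, rcond=None)
print(beta_hat)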

3.3 Estimation
Linear regression is also known as ordinary least squares (OLS) or simply least squares, a
moniker derived from the method of estimating the unknown coefficients. Least squares
minimizes the squared distance between the fit line (or plane if there are multiple regres-
sors) and the regressand. The parameters are estimated as the solution to
min_β (y − Xβ)′(y − Xβ) = min_β Σ_{i=1}^{n} (yi − xi β)² . (3.6)

First order conditions of this optimization problem are


−2X′(y − Xβ) = −2(X′y − X′Xβ) = −2 Σ_{i=1}^{n} x′i (yi − xi β) = 0 (3.7)

and rearranging, the least squares estimator for β , can be defined.

Definition 3.1 (OLS Estimator). The ordinary least squares estimator, denoted β̂ , is defined

β̂ = (X0 X)−1 X0 y. (3.8)

Clearly this estimator is only reasonable if X0 X is invertible which is equivalent to the


condition that rank(X) = k . This requirement states that no column of X can be exactly

expressed as a combination of the k − 1 remaining columns and that the number of ob-
servations is at least as large as the number of regressors (n ≥ k ). This is a weak condition
and is trivial to verify in most econometric software packages: using a less than full rank
matrix will generate a warning or error.
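Eq. (3.8) maps directly into a few lines of code. The sketch below, on simulated data, computes β̂ from the normal equations, compares it to a least squares solver, and checks the rank condition; it is an illustration rather than a prescribed implementation.

import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.25])
y = X @ beta + rng.normal(size=n)

assert np.linalg.matrix_rank(X) == k   # rank(X) = k, so X'X is invertible

# beta_hat = (X'X)^{-1} X'y from the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent, and numerically preferable, via a least squares solver
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_hat_lstsq)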
Dummy variables create one further issue worthy of special attention. Suppose dummy
variables corresponding to the 4 quarters of the year, I1i , . . . , I4i , are constructed from a
quarterly data set of portfolio returns. Consider a simple model with a constant and all 4
dummies

ri = β1 + β2 I1i + β3 I2i + β4 I3i + β5 I4i + εi .

It is not possible to estimate this model with all 4 dummy variables and the constant
because the constant is a perfect linear combination of the dummy variables and so the
regressor matrix would be rank deficient. The solution is to exclude either the constant
or one of the dummy variables. It makes no difference in estimation which is excluded,
although the interpretation of the coefficients changes. In the case where the constant is
excluded, the coefficients on the dummy variables are directly interpretable as quarterly
average returns. If one of the dummy variables is excluded, for example the first quarter
dummy variable, the interpretation changes. In this parameterization,

ri = β1 + β2 I2i + β3 I3i + β4 I4i + εi ,

β1 is the average return in Q1, while β1 + β j is the average return in Q j .


It is also important that any regressor, other than the constant, be nonconstant. Suppose a regression included the number of years since public floatation but the data set only contained assets that have been trading for exactly 10 years. Including this regressor and a constant results in perfect collinearity, but, more importantly, without variability in a regressor it is impossible to determine whether changes in the regressor (years since float) result in a change in the regressand or whether the effect is simply constant across all assets. The role that the variability of regressors plays will be revisited when studying the statistical properties of β̂ .
The second derivative matrix of the minimization,

2X0 X,

ensures that the solution must be a minimum as long as X0 X is positive definite. Again,
positive definiteness of this matrix is equivalent to rank(X) = k .
Once the regression coefficients have been estimated, it is useful to define the fit values,
ŷ = Xβ̂ and sample residuals ε̂ = y − ŷ = y − Xβ̂ . Rewriting the first order condition in
terms of the explanatory variables and the residuals provides insight into the numerical
properties of the residuals. An equivalent first order condition to eq. (3.7) is

X′ε̂ = 0. (3.9)

This set of linear equations is commonly referred to as the normal equations or orthogonality conditions. This set of conditions requires that ε̂ is orthogonal to the span of the columns of X. Moreover, considering the columns of X separately, X′j ε̂ = 0 for all j = 1, 2, . . . , k . When a column contains a constant (an intercept in the model specification), ι′ε̂ = 0 (Σ_{i=1}^{n} ε̂i = 0), and the mean of the residuals will be exactly 0.³

The OLS estimator of the residual variance, σ̂², can be defined.⁴
Definition 3.2 (OLS Variance Estimator). The OLS residual variance estimator, denoted σ̂2 ,
is defined
σ̂² = ε̂′ε̂ / (n − k) (3.10)

Definition 3.3 (Standard Error of the Regression). The standard error of the regression is defined as

σ̂ = √σ̂² (3.11)
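Continuing in the same spirit, a short sketch (again on simulated data) of the fitted values, residuals, the normal equations X′ε̂ = 0, and the quantities in Definitions 3.2 and 3.3.

import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat             # fitted values
eps_hat = y - y_hat              # residuals

print(X.T @ eps_hat)             # ~0: the normal equations X'eps_hat = 0
sigma2_hat = eps_hat @ eps_hat / (n - k)    # eq. (3.10)
ser = np.sqrt(sigma2_hat)                   # standard error of the regression, eq. (3.11)
print(sigma2_hat, ser)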
The least squares estimator has two final noteworthy properties. First, nonsingular
transformations of the x ’s and non-zero scalar transformations of the y ’s have determin-
istic effects on the estimated regression coefficients. Suppose A is a k by k nonsingular
matrix and c is a non-zero scalar. The coefficients of a regression of c yi on xi A are

β̃ = [(XA)′(XA)]−1 (XA)′(c y) (3.12)
  = c (A′X′XA)−1 A′X′y
  = c A−1 (X′X)−1 A′−1 A′X′y
  = c A−1 (X′X)−1 X′y
  = c A−1 β̂ .

Second, as long as the model contains a constant, the regression coefficients on all
terms except the intercept are unaffected by adding an arbitrary constant to either the re-
gressor or the regressands. Consider transforming the standard specification,

yi = β1 + β2 x2,i + . . . + βk xk ,i + εi
to

ỹi = β1 + β2 x̃2,i + . . . + βk x̃k ,i + εi


³ι is an n by 1 vector of 1s.
⁴The choice of n − k in the denominator will be made clear once the properties of this estimator have been examined.

Portfolio    Constant    V W M e    S M B     H M L     U M D       σ
B H e         -0.06        1.08      0.02      0.80     -0.04      1.29
B M e         -0.02        0.99     -0.12      0.32     -0.03      1.25
B L e          0.09        1.02     -0.10     -0.24     -0.02      0.78
S H e          0.05        1.02      0.93      0.77     -0.04      0.76
S M e          0.06        0.98      0.82      0.31      0.01      1.07
S L e         -0.10        1.08      1.05     -0.19     -0.06      1.24

Table 3.3: Estimated regression coefficients from the model ri^p = β1 + β2 V W M_i^e + β3 S M B_i + β4 H M L_i + β5 U M D_i + εi , where ri^p is the excess return on one of the six size and BE/ME sorted portfolios. The final column contains the standard error of the regression.

where ỹi = yi + c y and x̃ j ,i = x j ,i + c x j . This model is identical to

yi = β̃1 + β2 x2,i + . . . + βk xk ,i + εi
where β̃1 = β1 + c y − β2 c x2 − . . . − βk c xk .

3.3.1 Estimation of Cross-Section regressions of returns on factors

Table 3.3 contains the estimated regression coefficients as well as the standard error of the
regression for the 6 portfolios in the Fama-French data set in a specification including all
four factors and a constant. There has been a substantial decrease in the magnitude of the
standard error of the regression relative to the standard deviation of the original data. The
next section will formalize how this reduction is interpreted.

3.4 Assessing Fit

Once the parameters have been estimated, the next step is to determine whether or not the
model fits the data. The minimized sum of squared errors, the objective of the optimiza-
tion, is an obvious choice to assess fit. However, there is an important drawback
to using the sum of squared errors: changes in the scale of yi alter the minimized sum of
squared errors without changing the fit. In order to devise a scale free metric, it is neces-
sary to distinguish between the portions of y which can be explained by X from those which
cannot.
Two matrices, the projection matrix, PX and the annihilator matrix, MX , are useful when
decomposing the regressand into the explained component and the residual.

Definition 3.4 (Projection Matrix). The projection matrix, a symmetric idempotent ma-
trix that produces the projection of a variable onto the space spanned by X, denoted PX , is

defined

PX = X(X0 X)−1 X0 (3.13)

Definition 3.5 (Annihilator Matrix). The annihilator matrix, a symmetric idempotent ma-
trix that produces the projection of a variable onto the null space of X0 , denoted MX , is
defined

MX = In − X(X0 X)−1 X0 . (3.14)

These two matrices have some desirable properties. Both the fit value of y (ŷ) and the
estimated errors, ε̂, can be simply expressed in terms of these matrices as ŷ = PX y and
ε̂ = MX y respectively. These matrices are also idempotent: PX PX = PX and MX MX = MX
and orthogonal: PX MX = 0. The projection matrix returns the portion of y that lies in the
linear space spanned by X, while the annihilator matrix returns the portion of y which lies
in the null space of X0 . In essence, MX annihilates any portion of y which is explainable by
X leaving only the residuals.
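The algebraic properties of PX and MX are easy to verify numerically. The sketch below, with simulated data, checks idempotence, orthogonality, and that PX y and MX y reproduce the fitted values and residuals; forming the full n by n matrices is done here purely for illustration and would be wasteful in large samples.

import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

PX = X @ np.linalg.inv(X.T @ X) @ X.T    # projection matrix, eq. (3.13)
MX = np.eye(n) - PX                      # annihilator matrix, eq. (3.14)

assert np.allclose(PX @ PX, PX)    # idempotent
assert np.allclose(MX @ MX, MX)    # idempotent
assert np.allclose(PX @ MX, 0)     # orthogonal

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(PX @ y, X @ beta_hat)        # fitted values
assert np.allclose(MX @ y, y - X @ beta_hat)    # residuals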
Decomposing y using the projection and annihilator matrices,

y = PX y + MX y
which follows since PX + MX = In . The squared observations can be decomposed

y′y = (PX y + MX y)′(PX y + MX y)
    = y′PX PX y + y′PX MX y + y′MX PX y + y′MX MX y
    = y′PX y + 0 + 0 + y′MX y
    = y′PX y + y′MX y

noting that PX and MX are idempotent and PX MX = 0n . These three quantities are often
referred to as5

y′y = Σ_{i=1}^{n} yi²     Uncentered Total Sum of Squares (TSSU ) (3.15)

⁵There is no consensus about the names of these quantities. In some texts, the component capturing the fit portion is known as the Regression Sum of Squares (RSS) while in others it is known as the Explained Sum of Squares (ESS), while the portion attributable to the errors is known as the Sum of Squared Errors (SSE), the Sum of Squared Residuals (SSR), the Residual Sum of Squares (RSS) or the Error Sum of Squares (ESS). The choice to use SSE and RSS in this text was to assure the reader that SSE must be the component of the squared observations relating to the error variation.

y′PX y = Σ_{i=1}^{n} (xi β̂)²     Uncentered Regression Sum of Squares (RSSU ) (3.16)
y′MX y = Σ_{i=1}^{n} (yi − xi β̂)²     Uncentered Sum of Squared Errors (SSEU ). (3.17)

Dividing through by y′y,

y′PX y / y′y + y′MX y / y′y = 1

or

RSSU / TSSU + SSEU / TSSU = 1.
This identity expresses the scale-free total variation in y that is captured by X (y0 PX y)
and that which is not (y0 MX y). The portion of the total variation explained by X is known as
the uncentered R2 (R2U ),
Definition 3.6 (Uncentered R 2 (R2U )). The uncentered R2 , which is used in models that do
not include an intercept, is defined

R2U = RSSU / TSSU = 1 − SSEU / TSSU (3.18)
While this measure is scale free it suffers from one shortcoming. Suppose a constant
is added to y, so that the TSSU changes to (y + c )0 (y + c ). The identity still holds and so
(y + c )0 (y + c ) must increase (for a sufficiently large c ). In turn, one of the right-hand side
variables must also grow larger. In the usual case where the model contains a constant, the
increase will occur in the RSSU (y0 PX y), and as c becomes arbitrarily large, uncentered R2
will asymptote to one. To overcome this limitation, a centered measure can be constructed
which depends on deviations from the mean rather than on levels.
Let ỹ = y − ȳ = Mι y where Mι = In − ι(ι′ι)−1 ι′ is the matrix which subtracts the mean from
a vector of data. Then

y′Mι PX Mι y + y′Mι MX Mι y = y′Mι y
y′Mι PX Mι y / y′Mι y + y′Mι MX Mι y / y′Mι y = 1

or more compactly

ỹ′PX ỹ / ỹ′ỹ + ỹ′MX ỹ / ỹ′ỹ = 1.
Centered R2 (R2C ) is defined analogously to the uncentered version, replacing the uncentered sums of squares with their centered counterparts.

Definition 3.7 (Centered R2 (R2C )). The centered R2 , used in models that include an intercept, is defined

R2C = RSSC / TSSC = 1 − SSEC / TSSC (3.19)

where

y′Mι y = Σ_{i=1}^{n} (yi − ȳ)²     Centered Total Sum of Squares (TSSC ) (3.20)
y′Mι PX Mι y = Σ_{i=1}^{n} (xi β̂ − x̄β̂)²     Centered Regression Sum of Squares (RSSC ) (3.21)
y′Mι MX Mι y = Σ_{i=1}^{n} (yi − xi β̂)²     Centered Sum of Squared Errors (SSEC ) (3.22)

and where x̄ = n−1 Σ_{i=1}^{n} xi .
The expressions R2 , SSE, RSS and TSS should be assumed to correspond to the cen-
tered version unless further qualified. With two versions of R2 available that generally dif-
fer, which should be used? Centered should be used if the model is centered (contains a
constant) and uncentered should be used when it does not. Failing to chose the correct R2
can lead to incorrect conclusions about the fit of the model and mixing the definitions can
lead to a nonsensical R2 that falls outside of [0, 1]. For instance, computing R2 using the
centered version when the model does not contain a constant often results in a negative
value when

R2 = 1 − SSEC / TSSC .
Most software will return centered R2 and caution is warranted if a model is fit without a
constant.
R2 does have some caveats. First, adding an additional regressor will always (weakly)
increase the R2 since the sum of squared errors cannot increase by the inclusion of an ad-
ditional regressor. This renders R2 useless in discriminating between two models where
one is nested within the other. One solution to this problem is to use the degree of freedom
adjusted R2 .
 
Definition 3.8 (Adjusted R2 (R̄2 )). The adjusted R2 , which adjusts for the number of estimated parameters, is defined

R̄2 = 1 − [SSE/(n − k)] / [TSS/(n − 1)] = 1 − (SSE/TSS) × (n − 1)/(n − k). (3.23)

R̄2 will increase if the reduction in the SSE is large enough to compensate for a loss of 1 degree of freedom, captured by the n − k term. However, if the SSE does not change, R̄2

Regressors                              R2U      R2C      R̄2U      R̄2C
V W M e                                 0.8141   0.8161   0.8141   0.8161
V W M e , S M B                         0.8144   0.8163   0.8143   0.8161
V W M e , H M L                         0.9684   0.9641   0.9683   0.9641
V W M e , S M B , H M L                 0.9685   0.9641   0.9684   0.9640
V W M e , S M B , H M L , U M D         0.9691   0.9665   0.9690   0.9664
1, V W M e                              0.8146   0.8116   0.8144   0.8114
1, V W M e , S M B                      0.9686   0.9681   0.9685   0.9680
1, V W M e , S M B , H M L              0.9687   0.9682   0.9686   0.9681
1, V W M e , S M B , H M L , U M D      0.9692   0.9687   0.9690   0.9685

Table 3.4: Centered and uncentered R2 and R̄2 from a variety of factor models. Bold indicates the correct version (centered or uncentered) for that model. R2 is monotonically increasing in larger models, while R̄2 is not.

will decrease. R̄2 is preferable to R2 for comparing models, although the topic of model selection will be more formally considered at the end of this chapter. R̄2 , like R2 , should be constructed from the appropriate versions of the RSS, SSE and TSS (either centered or uncentered).
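The uncentered, centered and adjusted measures can all be computed from the residuals of a single fit. The following sketch uses simulated data and follows the definitions above; it is illustrative only.

import numpy as np

rng = np.random.default_rng(5)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.2, 1.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
eps_hat = y - X @ beta_hat

sse = eps_hat @ eps_hat
tss_u = y @ y                             # uncentered total sum of squares
tss_c = (y - y.mean()) @ (y - y.mean())   # centered total sum of squares

r2_u = 1 - sse / tss_u                    # use when there is no intercept
r2_c = 1 - sse / tss_c                    # use when the model contains a constant
r2_adj = 1 - (sse / (n - k)) / (tss_c / (n - 1))   # eq. (3.23), centered version
print(r2_u, r2_c, r2_adj)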
Second, R2 is not invariant to changes in the regressand. A frequent mistake is to use R2
to compare the fit from two models with different regressands, for instance yi and ln(yi ).
These numbers are incomparable and this type of comparison must be avoided. Moreover,
R2 is even sensitive to more benign transformations. Suppose a simple model is postulated,

yi = β1 + β2 xi + εi ,
and a model logically consistent with the original model,

yi − xi = β1 + (β2 − 1)xi + εi ,
is estimated. The R2 s from these models will generally differ. For example, suppose the
original coefficient on xi was 1. Subtracting xi will reduce the explanatory power of xi to
0, rendering it useless and resulting in an R2 of 0 irrespective of the R2 in the original model.

2
3.4.1 Example: R2 and R̄ in Cross-Sectional Factor models
To illustrate the use of R2 , and the problems with its use, consider a model for B H e which
can depend on one or more risk factor.
The R2 values in Table 3.4 show two things. First, the excess return on the market port-
folio alone can explain 80% of the variation in excess returns on the big-high portfolio.
Second, the H M L factor appears to have additional explanatory power on top of the mar-
ket evidenced by increases in R2 from 0.80 to 0.96. The centered and uncentered R2 are

Regressand          Regressors      R2U      R2C      R̄2U      R̄2C
B H e               V W M e         0.7472   0.7598   0.7472   0.7598
B H e               1, V W M e      0.7554   0.7504   0.7552   0.7502
10 + B H e          V W M e         0.3179   0.8875   0.3179   0.8875
10 + B H e          1, V W M e      0.9109   0.7504   0.9108   0.7502
100 + B H e         V W M e         0.0168   2.4829   0.0168   2.4829
100 + B H e         1, V W M e      0.9983   0.7504   0.9983   0.7502
B H e               1, V W M e      0.7554   0.7504   0.7552   0.7502
B H e − V W M e     1, V W M e      0.1652   0.1625   0.1643   0.1616

Table 3.5: Centered and uncentered R2 and R̄2 from models with regressor or regressand changes. Using the wrong R2 can lead to nonsensical values (outside of 0 and 1) or a false sense of fit (R2 near one). Some values are larger than 1 because they were computed using RSSC /TSSC . Had 1 − SSEC /TSSC been used, the values would be negative because RSSC /TSSC + SSEC /TSSC = 1. The bottom two lines examine the effect of subtracting a regressor before fitting the model: the R2 decreases sharply. This should not be viewed as problematic since models with different regressands cannot be compared using R2 .

very similar because the intercept in the model is near zero. Instead, suppose that the de-
pendent variable is changed to 10 + B H e or 100 + B H e and attention is restricted to the
CAPM. Using the incorrect definition for R2 can lead to nonsensical (negative) and mislead-
ing (artificially near 1) values. Finally, Table 3.5 also illustrates the problem of changing the regressand by replacing B Hie with B Hie − V W M ie . The R2 decreases from a respectable 0.80 to only 0.10, despite the interpretation of the model remaining unchanged.

3.5 Assumptions

Thus far, all of the derivations and identities presented are purely numerical. They do not
indicate whether β̂ is a reasonable way to estimate β . It is necessary to make some as-
sumptions about the innovations and the regressors to provide a statistical interpretation
of β̂ . Two broad classes of assumptions can be used to analyze the behavior of β̂ : the
classical framework (also known as small sample or finite sample) and asymptotic analysis
(also known as large sample).
Neither method is ideal. The small sample framework is precise in that the exact distributions of estimators and test statistics are known. This precision comes at the cost of many
restrictive assumptions – assumptions not usually plausible in financial applications. On
the other hand, asymptotic analysis requires few restrictive assumptions and is broadly ap-
plicable to financial data, although the results are only exact if the number of observations
is infinite. Asymptotic analysis is still useful for examining the behavior in finite samples

when the sample size is large enough for the asymptotic distribution to approximate the
finite-sample distribution reasonably well.
This leads to the most important question of asymptotic analysis: How large does n
need to be before the approximation is reasonable? Unfortunately, the answer to this ques-
tion is “It depends”. In simple cases, where residuals are independent and identically dis-
tributed, as few as 30 observations may be sufficient for the asymptotic distribution to be
a good approximation to the finite-sample distribution. In more complex cases, anywhere
from 100 to 1,000 may be needed, while in the extreme cases, where the data is heterogeneous and highly dependent, an asymptotic approximation may be poor with more than
1,000,000 observations.
The properties of β̂ will be examined under both sets of assumptions. While the small
sample results are not generally applicable, it is important to understand these results as
the lingua franca of econometrics, as well as the limitations of tests based on the classi-
cal assumptions, and to be able to detect when a test statistic may not have the intended
asymptotic distribution. Six assumptions are required to examine the finite-sample distri-
bution of β̂ and establish the optimality of the OLS procedure (although many properties only require a subset).

Assumption 3.9 (Linearity). yi = xi β + εi

This assumption states the obvious condition necessary for least squares to be a reason-
able method to estimate the β . It further imposes a less obvious condition, that xi must
be observed and measured without error. Many applications in financial econometrics in-
clude latent variables. Linear regression is not applicable in these cases and a more sophis-
ticated estimator is required. In other applications, the true value of xk ,i is not observed and
a noisy proxy must be used, so that x̃k ,i = xk ,i + νk ,i where νk ,i is an error uncorrelated with
xk ,i . When this occurs, ordinary least squares estimators are misleading and a modified
procedure (two-stage least squares (2SLS) or instrumental variable regression (IV)) must
be used.

Assumption 3.10 (Conditional Mean). E[εi |X] = 0, i = 1, 2, . . . , n

This assumption states that the mean of each εi is zero given any xk ,i , any function of
any xk ,i or combinations of these. It is stronger than the assumption used in the asymp-
totic analysis and is not valid in many applications (e.g. time-series data). When the re-
gressand and regressor consist of time-series data, this assumption may be violated and
E[εi |xi+j ] ≠ 0 for some j . This assumption also implies that the correct form of xk,i enters the regression, that E[εi ] = 0 (through a simple application of the law of iterated expectations), and that the innovations are uncorrelated with the regressors, so that E[εi′ xj,i ] = 0, i′ = 1, 2, . . . , n , i = 1, 2, . . . , n , j = 1, 2, . . . , k .

Assumption 3.11 (Rank). The rank of X is k with probability 1.



This assumption is needed to ensure that β̂ is identified and can be estimated. In prac-
tice, it requires that no regressor is perfectly collinear with the others, that the number
of observations is at least as large as the number of regressors (n ≥ k ) and that variables
other than a constant have non-zero variance.

Assumption 3.12 (Conditional Homoskedasticity). V[εi |X] = σ2

Homoskedasticity is rooted in homo (same) and skedannumi (scattering) and in mod-


ern English means that the residuals have identical variances. This assumption is required
to establish the optimality of the OLS estimator and it specifically rules out the case where
the variance of an innovation is a function of a regressor.

Assumption 3.13 (Conditional Correlation). E[εi ε j |X] = 0, i = 1, 2, . . . , n, j = i + 1, . . . , n

Assuming the residuals are conditionally uncorrelated is convenient when coupled with
the homoskedasticity assumption: the covariance of the residuals will be σ2 In . Like ho-
moskedasticity, this assumption is needed for establishing the optimality of the least squares
estimator.

Assumption 3.14 (Conditional Normality). ε|X ∼ N (0, Σ)

Assuming a specific distribution is very restrictive – results based on this assumption will only be correct if the errors are actually normal – but this assumption allows for precise statements about the finite-sample distribution of β̂ and test statistics. This assumption, when combined with assumptions 3.12 and 3.13, provides a simple distribution for the innovations: εi |X ∼ N (0, σ2 ).

3.6 Small Sample Properties of OLS estimators


Using these assumptions, many useful properties of β̂ can be derived. Recall that β̂ =
(X0 X)−1 X0 y.

Theorem 3.15 (Bias of β̂ ). Under assumptions 3.9 - 3.11

E[β̂ |X] = β . (3.24)

While unbiasedness is a desirable property, it is not particularly meaningful without


further qualification. For instance, an estimator which is unbiased, but does not increase
in precision as the sample size increases is generally not desirable. Fortunately, β̂ is not
only unbiased, it has a variance that goes to zero.

Theorem 3.16 (Variance of β̂ ). Under assumptions 3.9 - 3.13

V[β̂ |X] = σ2 (X0 X)−1 . (3.25)



Under the conditions necessary for unbiasedness for β̂ , plus assumptions about ho-
moskedasticity and the conditional correlation of the residuals, the form of the variance is
simple. Consistency follows since
(X′X)−1 = n−1 ( Σ_{i=1}^{n} x′i xi / n )−1 (3.26)
         ≈ n−1 E[x′i xi ]−1

will be declining as the sample size increases.
However, β̂ has an even stronger property under the same assumptions. It is BLUE:
Best Linear Unbiased Estimator. Best, in this context, means that it has the lowest variance
among all other linear unbiased estimators. While this is a strong result, a few words of
caution are needed to properly interpret this result. The class of Linear Unbiased Estima-
tors (LUEs) is small in the universe of all unbiased estimators. Saying OLS is the “best” is
akin to a one-armed boxer claiming to be the best one-arm boxer. While possibly true, she
probably would not stand a chance against a two-armed opponent.

Theorem 3.17 (Gauss-Markov Theorem). Under assumptions 3.9-3.13, β̂ is the minimum


variance estimator among all linear unbiased estimators. That is, V[β̃ |X] − V[β̂ |X] is positive semi-definite where β̃ = Cy, E[β̃ ] = β and C ≠ (X′X)−1 X′.

Letting β̃ be any other linear, unbiased estimator of β , it must have a larger covariance.
However, many estimators, including most maximum likelihood estimators, are nonlinear
and so are not necessarily less efficient. Finally, making use of the normality assumption,
it is possible to determine the conditional distribution of β̂ .

Theorem 3.18 (Distribution of β̂ ). Under assumptions 3.9 – 3.14,

β̂ |X ∼ N (β , σ2 (X0 X)−1 ) (3.27)


Theorem 3.18 should not be surprising. β̂ is a linear combination of (jointly) normally
distributed random variables and thus is also normally distributed. Normality is also use-
ful for establishing the relationship between the estimated residuals ε̂ and the estimated
parameters β̂ .

Theorem 3.19 (Conditional Independence of ε̂ and β̂ ). Under assumptions 3.9 - 3.14, ε̂ is


independent of β̂ , conditional on X.

One implication of this theorem is that Cov(ε̂i , β̂ j |X) = 0 i = 1, 2, . . . , n, j = 1, 2, . . . , k .


As a result, functions of ε̂ will be independent of functions of β̂ , a property useful in deriv-
ing distributions of test statistics that depend on both. Finally, in the small sample setup,
the exact distribution of the sample error variance estimator, σ̂2 = ε̂0 ε̂/(n − k ), can be
derived.

Theorem 3.20 (Distribution of σ̂2 ).

(n − k) σ̂2 / σ2 ∼ χ2_{n−k}

where σ̂2 = y′MX y / (n − k) = ε̂′ε̂ / (n − k).

Since ε̂i is a normal random variable, once it is standardized and squared, it should be a χ2_1 . The change in the divisor from n to n − k reflects the loss in degrees of freedom due to the k estimated parameters.
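Theorem 3.20 can be illustrated by simulation: with normal errors, (n − k)σ̂2 /σ2 should behave like a χ2_{n−k} random variable. The sketch below, with arbitrary parameter choices, compares the simulated mean and variance of this quantity to the χ2 benchmarks n − k and 2(n − k).

import numpy as np

rng = np.random.default_rng(6)
n, k, sigma2, reps = 50, 3, 2.0, 10000
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.25])

stats = np.empty(reps)
for r in range(reps):
    y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)
    eps_hat = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_hat = eps_hat @ eps_hat / (n - k)
    stats[r] = (n - k) * sigma2_hat / sigma2

# Should be close to the chi-square(n-k) mean and variance: n-k and 2(n-k)
print(stats.mean(), stats.var(), n - k, 2 * (n - k))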

3.7 Maximum Likelihood

Once the assumption that the innovations are conditionally normal has been made, con-
ditional maximum likelihood is an obvious method to estimate the unknown parameters
(β , σ2 ). Conditioning on X, and assuming the innovations are normal, homoskedastic, and
conditionally uncorrelated, the likelihood is given by

f (y|X; β , σ2 ) = (2πσ2 )−n/2 exp( −(y − Xβ)′(y − Xβ) / (2σ2 ) ) (3.28)

and, taking logs, the log likelihood

l (β , σ2 ; y|X) = −(n/2) log(2π) − (n/2) log(σ2 ) − (y − Xβ)′(y − Xβ) / (2σ2 ). (3.29)
Recall that the logarithm is a monotonic, strictly increasing transformation, and the ex-
tremum points of the log-likelihood and the likelihood will occur at the same parameters.
Maximizing the likelihood with respect to the unknown parameters, there are k + 1 first
order conditions

∂ l (β , σ2 ; y|X) / ∂ β = X′(y − Xβ̂) / σ̂2 = 0 (3.30)
∂ l (β , σ2 ; y|X) / ∂ σ̂2 = −n / (2σ̂2 ) + (y − Xβ̂)′(y − Xβ̂) / (2σ̂4 ) = 0. (3.31)

The first set of conditions is identical to the first order conditions of the least squares estimator ignoring the scaling by σ2 , assumed to be greater than 0. The solution is

β̂ MLE = (X′X)−1 X′y (3.32)
σ̂2 MLE = n−1 (y − Xβ̂)′(y − Xβ̂) = n−1 ε̂′ε̂. (3.33)

The regression coefficients are identical under maximum likelihood and OLS, although the divisors in σ̂2 and σ̂2 MLE differ.
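A brief sketch of the two variance estimators and of the log likelihood in eq. (3.29) evaluated at the estimates, using simulated data; the setup is hypothetical.

import numpy as np

rng = np.random.default_rng(7)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # identical under OLS and MLE
eps_hat = y - X @ beta_hat
sigma2_ols = eps_hat @ eps_hat / (n - k)   # degree-of-freedom adjusted divisor
sigma2_mle = eps_hat @ eps_hat / n         # MLE divisor is n

# Log likelihood, eq. (3.29), evaluated at the maximum likelihood estimates
loglik = (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2_mle)
          - eps_hat @ eps_hat / (2 * sigma2_mle))
print(sigma2_ols, sigma2_mle, loglik)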
It is important to note that the derivation of the OLS estimator does not require an as-
sumption of normality. Moreover, the unbiasedness, variance, and BLUE properties do not
rely on conditional normality of residuals. However, if the innovations are homoskedastic,
uncorrelated and normal, the results of the Gauss-Markov theorem can be strengthened
using the Cramer-Rao lower bound.

Theorem 3.21 (Cramer-Rao Inequality). Let f (z; θ ) be the joint density of z where θ is a k-dimensional parameter vector. Let θ̂ be an unbiased estimator of θ 0 with finite covariance. Under some regularity conditions on f (·),

V[θ̂ ] ≥ I −1 (θ 0 )

where

I = −E[ ∂2 ln f (z; θ ) / ∂ θ ∂ θ ′ ]|θ =θ 0 (3.34)

and

J = E[ (∂ ln f (z; θ ) / ∂ θ ) (∂ ln f (z; θ ) / ∂ θ ′ ) ]|θ =θ 0 (3.35)

and, under some additional regularity conditions,

I(θ 0 ) = J (θ 0 ).
The last part of this theorem is the information matrix equality (IME): when a model is correctly specified in its entirety, the expected covariance of the scores is equal to the negative of the expected Hessian.⁶ The IME will be revisited in later chapters. The second order
conditions,

∂2 l (β , σ2 ; y|X) / ∂ β ∂ β ′ = −X′X / σ2 (3.36)
∂2 l (β , σ2 ; y|X) / ∂ β ∂ σ2 = −X′(y − Xβ) / σ4 (3.37)
∂2 l (β , σ2 ; y|X) / ∂ (σ2 )2 = n / (2σ4 ) − (y − Xβ)′(y − Xβ) / σ6 (3.38)

are needed to find the lower bound for the covariance of the estimators of β and σ2 . Taking

⁶There are quite a few regularity conditions for the IME to hold, but discussion of these is beyond the scope of this course. Interested readers should see White (1996) for a thorough discussion.

expectations of the second derivatives,

E[ ∂2 l (β , σ2 ; y|X) / ∂ β ∂ β ′ ] = −X′X / σ2 (3.39)
E[ ∂2 l (β , σ2 ; y|X) / ∂ β ∂ σ2 ] = 0 (3.40)
E[ ∂2 l (β , σ2 ; y|X) / ∂ (σ2 )2 ] = −n / (2σ4 ) (3.41)

and so the lower bound for the variance of β̂ MLE = β̂ is σ2 (X′X)−1 . Theorem 3.16 shows that σ2 (X′X)−1 is also the variance of the OLS estimator β̂ and so the Gauss-Markov theorem can be strengthened in the case of conditionally homoskedastic, uncorrelated normal residuals.
Theorem 3.22 (Best Unbiased Estimator). Under assumptions 3.9 - 3.14, β̂ MLE = β̂ is the best unbiased estimator of β .

The difference between this theorem and the Gauss-Markov theorem is subtle but im-
portant. The class of estimators is no longer restricted to include only linear estimators and
so this result is both broad and powerful: MLE (or OLS) is an ideal estimator under these
assumptions (in the sense that no other unbiased estimator, linear or not, has a lower vari-
ance). This result does not extend to the variance estimator since E[σ̂2 MLE ] = ((n − k)/n) σ2 ≠ σ2 , and so the optimality of σ̂2 MLE cannot be established using the Cramer-Rao theorem.

3.8 Small Sample Hypothesis Testing

Most regressions are estimated to test implications of economic or finance theory. Hypoth-
esis testing is the mechanism used to determine whether data and theory are congruent.
Formalized in terms of β , the null hypothesis (also known as the maintained hypothesis)
is formulated as

H0 : R(β ) − r = 0 (3.42)

where R(·) is a function from Rk to Rm , m ≤ k and r is an m by 1 vector. Initially, a subset


of all hypotheses, those in the linear equality hypotheses class, formulated

H0 : Rβ − r = 0 (3.43)

will be examined where R is a m by k matrix. In subsequent chapters, more general test


specifications including nonlinear restrictions on the parameters will be considered. All
hypotheses in this class can be written as weighted sums of the regression coefficients,

R11 β1 + R12 β2 + . . . + R1k βk = r1
R21 β1 + R22 β2 + . . . + R2k βk = r2
⋮
Rm1 β1 + Rm2 β2 + . . . + Rmk βk = rm
Each constraint is represented as a row in the above set of equations. Linear equality con-
straints can be used to test parameter restrictions such as

β1 = 0 (3.44)
3β2 + β3 = 1
Σ_{j=1}^{k} βj = 0

β1 = β2 = β3 = 0.

For instance, if the unrestricted model was

yi = β1 + β2 x2,i + β3 x3,i + β4 x4,i + β5 x5,i + εi


the hypotheses in eq. (3.44) can be described in terms of R and r as

H0                          R                                          r
β1 = 0                      [1 0 0 0 0]                                0
3β2 + β3 = 1                [0 3 1 0 0]                                1
Σ_{j=1}^{k} βj = 0          [0 1 1 1 1]                                0
β1 = β2 = β3 = 0            [1 0 0 0 0; 0 1 0 0 0; 0 0 1 0 0]          [0; 0; 0]

When using linear equality constraints, alternatives are specified as H1 : Rβ − r ≠ 0.


Once both the null and the alternative hypotheses have been postulated, it is necessary to
discern whether the data are consistent with the null hypothesis. Three classes of statistics
will be described to test these hypotheses: Wald, Lagrange Multiplier and Likelihood Ratio.
Wald tests are perhaps the most intuitive: they directly test whether Rβ − r is close to zero.
Lagrange Multiplier tests incorporate the constraint into the least squares problem using
a lagrangian. If the constraint has a small effect on the minimized sum of squares, the
lagrange multipliers, often described as the shadow price of the constraint in economic

applications, should be close to zero. The magnitudes of these multipliers form the basis of the LM test
statistic. Finally, likelihood ratios test whether the data are less likely under the null than
they are under the alternative. If the null hypothesis is not restrictive this ratio should be
close to one and the difference in the log-likelihoods should be small.

3.8.1 t -tests
T-tests can be used to test a single hypothesis involving one or more coefficients,

H0 : Rβ = r

where R is a 1 by k vector and r is a scalar. Recall from theorem 3.18, β̂ −β ∼ N (0, σ2 (X0 X)−1 ).
Under the null, R(β̂ − β ) = Rβ̂ − Rβ = Rβ̂ − r and applying the properties of normal ran-
dom variables,
Rβ̂ − r ∼ N (0, σ2 R(X0 X)−1 R0 ).

A simple test can be constructed

z = (Rβ̂ − r) / √(σ2 R(X′X)−1 R′) , (3.45)
where z ∼ N (0, 1). To perform a test with size α, the value of z can be compared to the
critical values of the standard normal and rejected if |z | > Cα where Cα is the 1 − α quantile
of a standard normal. However, z is an infeasible statistic since it depends on an unknown
quantity, σ2 . The natural solution is to replace the unknown parameter with an estimate. Dividing z by √(s2 /σ2 ) and simplifying,

t = z / √(s2 /σ2 ) (3.46)
  = [ (Rβ̂ − r) / √(σ2 R(X′X)−1 R′) ] / √(s2 /σ2 )
  = (Rβ̂ − r) / √(s2 R(X′X)−1 R′) .

Note that the term in the denominator satisfies (n − k) s2 /σ2 ∼ χ2_{n−k} , and so t is the ratio of a standard normal to the square root of a χ2_ν divided by its degrees of freedom. As long as the standard normal in the numerator and the χ2_ν are independent, this ratio will have a Student's t distribution.
in the numerator and the χv2 are independent, this ratio will have a Student’s t distribution.

Definition 3.23 (Student’s t distribution). Let z ∼ N (0, 1) (standard normal) and let w ∼
3.8 Small Sample Hypothesis Testing 165

χν2 where z and w are independent. Then


z
p w ∼ tν. (3.47)
ν

The independence of β̂ and s 2 – which is only a function of ε̂ – follows from 3.19, and
so t has a Student’s t distribution.

Theorem 3.24 (t -test). Under assumptions 3.9 - 3.14,

(Rβ̂ − r) / √(s2 R(X′X)−1 R′) ∼ tn−k . (3.48)

As ν → ∞, the Student’s t distribution converges to a standard normal. As a practical


matter, when ν > 30, the t distribution is close to a normal. While any single linear restriction can be tested with a t -test, the expression t -stat has become synonymous with a
specific null hypothesis.

Definition 3.25 (t -stat). The t -stat of a coefficient, βk , is the t -test value of a test of the
null H0 : βk = 0 against the alternative H1 : βk ≠ 0, and is computed

β̂k / √(s2 [(X′X)−1 ]kk ) (3.49)

where [(X′X)−1 ]kk is the kth diagonal element of (X′X)−1 .

The previous examples were all two-sided; the null would be rejected if the parameters
differed in either direction from the null hypothesis. t -tests are also unique among these test statistics in that they can easily be applied against both one-sided alternatives and two-sided alternatives.⁷
However, there is often a good argument to test a one-sided alternative. For instance, in
tests of the market premium, theory indicates that it must be positive to induce investment.
Thus, when testing the null hypothesis that a risk premium is zero, a two-sided alternative
could reject in cases which are not theoretically interesting. More importantly, a one-sided
alternative, when appropriate, will have more power than a two-sided alternative since the
direction information in the null hypothesis can be used to tighten confidence intervals.
The two types of tests involving a one-sided hypothesis are upper tail tests which test nulls
of the form H0 : Rβ ≤ r against alternatives of the form H1 : Rβ > r , and lower tail tests
which test H0 : Rβ ≥ r against H1 : Rβ < r .
Figure 3.1 contains the rejection regions of a t10 distribution. The dark gray region corresponds to the rejection region of a two-sided alternative to the null that H0 : β = β 0 for a 10% test. The light gray region, combined with the upper dark gray region, corresponds to the rejection region of a one-sided upper tail test, and so a test statistic between 1.372 and 1.812 would be rejected using a one-sided alternative but not with a two-sided one.

⁷Wald, LM and LR tests can be implemented against one-sided alternatives with considerably more effort.


Figure 3.1: Rejection region for a t -test of the nulls H0 : β = β 0 (two-sided) and H0 : β ≤
β 0 . The two-sided rejection region is indicated by dark gray while the one-sided (upper)
rejection region includes both the light and dark gray areas in the right tail.

Algorithm 3.26 (t -test). 1. Estimate β̂ using least squares.


2. Compute s2 = (n − k)−1 Σ_{i=1}^{n} ε̂2i and s2 (X′X)−1 .

3. Construct the restriction matrix, R, and the value of the restriction, r, from the null hypothesis.

4. Compute t = (Rβ̂ − r) / √(s2 R(X′X)−1 R′).

5. Compare t to the critical value, Cα , of the tn−k distribution for a test with size α. In the case of a two-tailed test, reject the null hypothesis if |t | > F−1tν (1 − α/2) where Ftν (·) is the CDF of a tν -distributed random variable. In the case of a one-sided upper-tail test, reject if t > F−1tν (1 − α) or, in the case of a one-sided lower-tail test, reject if t < F−1tν (α).
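Algorithm 3.26 translates directly into code. The sketch below tests the hypothetical null H0 : β2 = 0 on simulated data, using scipy's Student's t CDF for the two-sided p-value; the data and hypothesis are purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.1, 0.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
eps_hat = y - X @ beta_hat
s2 = eps_hat @ eps_hat / (n - k)
XtX_inv = np.linalg.inv(X.T @ X)

R = np.array([0.0, 1.0, 0.0])   # null: the coefficient on the first slope regressor is zero (true here)
r = 0.0
t_stat = (R @ beta_hat - r) / np.sqrt(s2 * (R @ XtX_inv @ R))
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - k))   # two-sided p-value
print(t_stat, p_value)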

3.8.2 Wald Tests


The Wald test directly examines the distance between Rβ̂ and r. Intuitively, if the null hypothesis is true, then Rβ̂ − r ≈ 0. In the small sample framework, the distribution of Rβ̂ − r follows directly from the properties of normal random variables. Specifically,

Rβ̂ − r ∼ N (0, σ2 R(X′X)−1 R′)


Thus, to test the null H0 : Rβ − r = 0 against the alternative H1 : Rβ − r ≠ 0, a test statistic can be based on

WInfeasible = (Rβ̂ − r)′[R(X′X)−1 R′]−1 (Rβ̂ − r) / σ2 (3.50)

which has a χ2_m distribution.⁸ However, this statistic depends on an unknown quantity, σ2 ,
and to operationalize W , σ2 must be replaced with an estimate, s 2 .

W = [(Rβ̂ − r)′[R(X′X)−1 R′]−1 (Rβ̂ − r)/m] / s2 = (σ2 /s2 ) × [(Rβ̂ − r)′[R(X′X)−1 R′]−1 (Rβ̂ − r)/m] / σ2 (3.51)

The replacement of σ2 with s2 has an effect on the distribution of the statistic, which follows from the definition of an F distribution.
follows from the definition of an F distribution.

Definition 3.27 (F distribution). Let z1 ∼ χ2_ν1 and let z2 ∼ χ2_ν2 where z1 and z2 are independent. Then

(z1 /ν1 ) / (z2 /ν2 ) ∼ Fν1 ,ν2 (3.52)

The conclusion that W has an Fm,n −k distribution follows from the independence of β̂
and ε̂, which in turn implies the independence of β̂ and s 2 .

Theorem 3.28 (Wald test). Under assumptions 3.9 - 3.14,


(Rβ̂ − r)′[R(X′X)−1 R′]−1 (Rβ̂ − r)/m / s2 ∼ Fm,n−k (3.53)

Analogous to the tν distribution, an Fν1 ,ν2 distribution converges to a scaled χ2 in large samples (χ2_ν1 /ν1 as ν2 → ∞). Figure 3.2 contains failure to reject (FTR) regions for some hypothetical Wald tests. The shape of the region depends crucially on the correlation between the hypotheses being tested. For instance, panel (a) corresponds to testing a joint hypothesis where the tests are independent and have the same variance. In this case, the

⁸The distribution can be derived noting that [R(X′X)−1 R′]−1/2 (Rβ̂ − r) ∼ N (0, [Im 0; 0 0]), where the matrix square root makes use of a generalized inverse. A more complete discussion of reduced rank normals and generalized inverses is beyond the scope of this course.


Figure 3.2: Bivariate plot of an F distribution. The four panels contain the failure-to-reject
regions corresponding to 20, 10 and 1% tests. Panel (a) contains the region for uncorrelated
tests. Panel (b) contains the region for tests with the same variance but a correlation of 0.5.
Panel (c) contains the region for tests with a correlation of -.8 and panel (d) contains the
region for tests with a correlation of 0.5 but with variances of 2 and 0.5 (The test with a
variance of 2 is along the x-axis).

FTR region is a circle. Panel (d) shows the FTR region for highly correlated tests where one
restriction has a larger variance.

Once W has been computed, the test statistic should be compared to the critical value
of an Fm,n −k and rejected if the test statistic is larger. Figure 3.3 contains the pdf of an F5,30
distribution. Any W > 2.049 would lead to rejection of the null hypothesis using a 10%
test.

The Wald test has a more common expression in terms of the SSE from both the re-
stricted and unrestricted models. Specifically,

W = [(SSER − SSEU )/m] / [SSEU /(n − k)] = [(SSER − SSEU )/m] / s2 . (3.54)

where SSE R is the sum of squared errors of the restricted model.9 The restricted model is
the original model with the null hypothesis imposed. For example, to test the null H0 : β2 =
β3 = 0 against an alternative that H1 : β2 ≠ 0 or β3 ≠ 0 in a bivariate regression,

yi = β1 + β2 x1,i + β3 x2,i + εi (3.55)


the restricted model imposes the null,

yi = β1 + 0x1,i + 0x2,i + εi
= β1 + εi .

The restricted SSE, SSE R is computed using the residuals from this model while the un-
restricted SSE, SSEU , is computed from the general model that includes both x variables
(eq. (3.55)). While Wald tests usually only require the unrestricted model to be estimated,
the difference of the SSEs is useful because it can be computed from the output of any
standard regression package. Moreover, any linear regression subject to linear restrictions
can be estimated using OLS on a modified specification where the constraint is directly
imposed. Consider the set of restrictions, R, in an augmented matrix with r

[R r]
By transforming this matrix into row-echelon form,
 
Im R̃ r̃
a set of m restrictions can be derived. This also provides a direct method to check whether
a set of constraints is logically consistent and feasible or if it contains any redundant re-
strictions.

Theorem 3.29 (Restriction Consistency and Redundancy). If [Im R̃ r̃] is [R r] in reduced echelon form, then a set of restrictions is logically consistent if rank(R̃) = rank([Im R̃ r̃]). Additionally, if rank(R̃) = rank([Im R̃ r̃]) = m, then there are no redundant restrictions.
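One way to operationalize the idea behind Theorem 3.29 is to compare the rank of R with the rank of the augmented matrix [R r] (and with m). The sketch below applies this check to three hypothetical restriction sets; it is an illustration of the logic rather than the text's own procedure.

import numpy as np

def restriction_ranks(R, r):
    """Ranks used in the consistency/redundancy check for R @ beta = r."""
    R = np.atleast_2d(np.asarray(R, dtype=float))
    r = np.asarray(r, dtype=float).reshape(-1, 1)
    aug = np.hstack([R, r])
    return int(np.linalg.matrix_rank(R)), int(np.linalg.matrix_rank(aug)), R.shape[0]

# Contradictory: beta_1 = 0 and beta_1 = 1  ->  rank(R) = 1 < rank([R r]) = 2
print(restriction_ranks([[1, 0, 0], [1, 0, 0]], [0, 1]))
# Redundant: beta_1 = 0 stated twice        ->  both ranks are 1, less than m = 2
print(restriction_ranks([[1, 0, 0], [1, 0, 0]], [0, 0]))
# Consistent, non-redundant: beta_1 = 0 and beta_2 = 0  ->  both ranks equal m = 2
print(restriction_ranks([[1, 0, 0], [0, 1, 0]], [0, 0]))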


 

The Wald test is implemented using the following steps:

1. Estimate the unrestricted model yi = xi β + εi , and the restricted model, ỹi = x̃i β + εi .

2. Compute SSER = Σ_{i=1}^{n} ε̃2i where ε̃i = ỹi − x̃i β̃ are the residuals from the restricted regression, and SSEU = Σ_{i=1}^{n} ε̂2i where ε̂i = yi − xi β̂ are the residuals from the unrestricted regression.
⁹The SSE should be the result of minimizing the squared errors. The centered versions should be used if a constant is included and the uncentered versions if no constant is included.

3. Compute W = [(SSER − SSEU )/m] / [SSEU /(n − k)].

4. Compare W to the critical value, Cα , of the Fm ,n−k distribution at size α. Reject the null
hypothesis if W > Cα .
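A sketch of the SSE-based Wald statistic in eq. (3.54), testing the hypothetical null H0 : β2 = β3 = 0 on simulated data where the null is true; all names and parameter values are assumptions for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 0.0 * x1 + 0.0 * x2 + rng.normal(size=n)   # the null is true in this simulated data

X_u = np.column_stack([np.ones(n), x1, x2])   # unrestricted: constant, x1, x2
X_r = np.ones((n, 1))                         # restricted: constant only
k, m = X_u.shape[1], 2                        # m restrictions

def sse(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

sse_u, sse_r = sse(X_u, y), sse(X_r, y)
W = ((sse_r - sse_u) / m) / (sse_u / (n - k))   # eq. (3.54)
p_value = 1 - stats.f.cdf(W, m, n - k)
print(W, p_value)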

Finally, in the same sense that the t -stat is a test of the null H0 : βk = 0 against the
alternative H1 : βk ≠ 0, the F -stat of a regression tests whether all coefficients are zero
(except the intercept) against an alternative that at least one is non-zero.

Definition 3.30 (F -stat of a Regression). The F -stat of a regression is the value of a Wald
test that all coefficients are zero except the coefficient on the constant (if one is included).
Specifically, if the unrestricted model is

yi = β1 + β2 x2,i + . . . βk xk ,i + εi ,

the F -stat is the value of a Wald test of the null H0 : β2 = β3 = . . . = βk = 0 against the
alternative H1 : βj ≠ 0, for j = 2, . . . , k and corresponds to a test based on the restricted
regression
yi = β1 + εi .

3.8.3 Example: T and Wald Tests in Cross-Sectional Factor models


Returning to the factor regression example, the t -stats in the 4-factor model can be com-
puted

tj = β̂j / √(s2 [(X′X)−1 ]jj ) .

For example, consider a regression of B H e on the set of four factors and a constant,

B Hie = β1 + β2 V W M ie + β3S M Bi + β4 H M L i + β5U M Di + εi


The fit coefficients, t -stats and p-values are contained in table 3.6.

Definition 3.31 (P-value). The p-value is the smallest size (α) at which the null hypothesis may be rejected. The p-value can be equivalently defined as the largest size at which the null hypothesis cannot be rejected.

P-values have the advantage that they are independent of the distribution of the test
statistic. For example, when using a 2-sided t -test, the p-value of a test statistic t is 2(1 −
Ftν (|t |)) where Ftν (·) is the CDF of a t -distribution with ν degrees of freedom. In a Wald test, the p-value is 1 − FFν1 ,ν2 (W ) where FFν1 ,ν2 (·) is the CDF of an Fν1 ,ν2 distribution.
The critical value, Cα , for a 2-sided 10% t test with 973 degrees of freedom (n − 5) is
1.645, and so if |t | > Cα the null hypothesis should be rejected, and the results indicate
3.8 Small Sample Hypothesis Testing 171


Figure 3.3: Rejection region for a F5,30 distribution when using a test with a size of 10%. If
the null hypothesis is true, the test statistic should be relatively small (would be 0 if exactly
true). Large test statistics lead to rejection of the null hypothesis. In this example, a test
statistic with a value greater than 2.049 would lead to a rejection of the null at the 10%
level.

that the null hypothesis that the coefficients on the constant and S M B are zero cannot
be rejected at the 10% level. The p-values indicate the null that the constant was 0 could be
rejected at an α of 14% but not one of 13%.

Table 3.6 also contains the Wald test statistics and p-values for a variety of hypothe-
ses, some economically interesting, such as the set of restrictions that the 4 factor model
reduces to the CAPM, β j = 0, j = 1, 3, . . . , 5. Only one regression, the completely unre-
stricted regression, was needed to compute all of the test statistics using Wald tests,

−1
(Rβ − r)0 R(X0 X)−1 R0 (Rβ − r)

W =
s2

where R and r depend on the null being tested. For example, to test whether a strict CAPM
was consistent with the observed data,

t Tests
Parameter     β̂        √(σ̂2 [(X′X)−1 ]jj )     t -stat     p-val
Constant     -0.064          0.043              -1.484      0.138
V W M e       1.077          0.008             127.216      0.000
S M B         0.019          0.013               1.440      0.150
H M L         0.803          0.013              63.736      0.000
U M D        -0.040          0.010              -3.948      0.000

Wald Tests
Null                        Alternative                 W        M     p-val
βj = 0, j = 1, . . . , 5    βj ≠ 0, j = 1, . . . , 5    6116     5     0.000
βj = 0, j = 1, 3, 4, 5      βj ≠ 0, j = 1, 3, 4, 5      1223.1   4     0.000
βj = 0, j = 1, 5            βj ≠ 0, j = 1, 5            11.14    2     0.000
βj = 0, j = 1, 3            βj ≠ 0, j = 1, 3            2.049    2     0.129
β5 = 0                      β5 ≠ 0                      15.59    1     0.000

Table 3.6: The upper panel contains t -stats and p-values for the regression of Big-High excess returns on the 4 factors and a constant. The lower panel contains test statistics and p-values for Wald tests of the reported null hypothesis. Both sets of tests were computed using the small sample assumptions and may be misleading since the residuals are both non-normal and heteroskedastic.

R = [1 0 0 0 0; 0 0 1 0 0; 0 0 0 1 0; 0 0 0 0 1] and r = [0; 0; 0; 0].
All of the null hypotheses save one are strongly rejected with p-values of 0 to three dec-
imal places. The sole exception is H0 : β1 = β3 = 0, which produced a Wald test statistic
of 2.05. The 5% critical value of an F2,973 is 3.005, and so the null hypothesis would be not
rejected at the 5% level. The p-value indicates that the test would be rejected at the 13%
level but not at the 12% level. One further peculiarity appears in the table. The Wald test
statistic for the null H0 : β5 = 0 is exactly the square of the t -test statistic for the same
null. This should not be surprising since W = t 2 when testing a single linear hypothesis.
Moreover, if z ∼ t ν , then z 2 ∼ F1,ν . This can be seen by inspecting the square of a t ν and
applying the definition of an F1,ν -distribution.

3.8.4 Likelihood Ratio Tests

Likelihood Ratio (LR) tests are based on the relative probability of observing the data if the
null is valid to the probability of observing the data under the alternative. The test statistic
3.8 Small Sample Hypothesis Testing 173

is defined
!
maxβ ,σ2 f (y|X; β , σ2 ) subject to Rβ = r
L R = −2 ln (3.56)
maxβ ,σ2 f (y|X; β , σ2 )

Letting β̂ R denote the constrained estimate of β , this test statistic can be reformulated

LR = −2 ln[ f (y|X; β̂R , σ̂2R ) / f (y|X; β̂ , σ̂2 ) ] (3.57)
   = −2[l (β̂R , σ̂2R ; y|X) − l (β̂ , σ̂2 ; y|X)]
   = 2[l (β̂ , σ̂2 ; y|X) − l (β̂R , σ̂2R ; y|X)]

In the case of the normal log likelihood, LR can be further simplified to¹⁰

LR = −2 ln[ f (y|X; β̂R , σ̂2R ) / f (y|X; β̂ , σ̂2 ) ]
   = −2 ln[ (2πσ̂2R )−n/2 exp(−(y − Xβ̂R )′(y − Xβ̂R )/(2σ̂2R )) / ((2πσ̂2 )−n/2 exp(−(y − Xβ̂)′(y − Xβ̂)/(2σ̂2 ))) ]
   = −2 ln[ (σ̂2R )−n/2 / (σ̂2 )−n/2 ]
   = −2 ln[ (σ̂2R /σ̂2 )−n/2 ]
   = n[ ln(σ̂2R ) − ln(σ̂2 ) ]
   = n[ ln(SSER ) − ln(SSEU ) ]

since both exponential terms equal exp(−n/2) when evaluated at the respective maximum likelihood estimates, and the ln(n) terms from σ̂2R = SSER /n and σ̂2 = SSEU /n cancel in the final step.
Finally, the distribution of the L R statistic can be determined by noting that

LR = n ln( σ̂2R / σ̂2U ) = n ln( SSER / SSEU ) (3.58)

and that

[(n − k)/m] [ exp(LR/n) − 1 ] = W. (3.59)
The transformation between W and LR is monotonic so the transformed statistic has the same distribution as W , an Fm,n−k .

¹⁰Note that σ̂2R and σ̂2 use n rather than a degree-of-freedom adjustment since they are MLE estimators.

Algorithm 3.32 (Small Sample Wald Test). 1. Estimate the unrestricted model yi = xi β +
εi , and the restricted model, ỹi = x̃i β + εi .
2. Compute SSER = Σ_{i=1}^{n} ε̃2i where ε̃i = ỹi − x̃i β̃ are the residuals from the restricted regression, and SSEU = Σ_{i=1}^{n} ε̂2i where ε̂i = yi − xi β̂ are the residuals from the unrestricted regression.

3. Compute LR = n ln( SSER / SSEU ).

4. Compute W = [(n − k)/m] [ exp(LR/n) − 1 ].
−1 .

5. Compare W to the critical value, Cα , of the Fm ,n−k distribution at size α. Reject the null
hypothesis if W > Cα .
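Following the steps of Algorithm 3.32 on simulated data, the LR statistic and its transformation into W can be computed from the same two regressions; the null H0 : β2 = β3 = 0 and the data are hypothetical (and, as simulated below, the null is false).

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 0.2 * x1 + 0.0 * x2 + rng.normal(size=n)   # coefficient on x1 is 0.2, so the null is false

X_u = np.column_stack([np.ones(n), x1, x2])
X_r = np.ones((n, 1))
k, m = X_u.shape[1], 2

def sse(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

sse_u, sse_r = sse(X_u, y), sse(X_r, y)
LR = n * np.log(sse_r / sse_u)
W = (n - k) / m * (np.exp(LR / n) - 1)   # monotone transformation of LR into W
p_value = 1 - stats.f.cdf(W, m, n - k)
print(LR, W, p_value)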

3.8.5 Example: LR Tests in Cross-Sectional Factor models

LR tests require estimating the model under both the null and the alternative. In all ex-
amples here, the alternative is the unrestricted model with four factors while the restricted
models (where the null is imposed) vary. The simplest restricted model corresponds to the
most restrictive null, H0 : β j = 0, j = 1, . . . , 5, and is specified

yi = εi .
To compute the likelihood ratio, the conditional mean and variance must be estimated.
In this simple specification, the conditional mean is ŷR = 0 (since there are no parameters)
and the conditional variance is estimated using the MLE, σ̂2R = y′y/n (the sum of squared regressands divided by n). The mean under the alternative is ŷU,i = xi β̂ and the variance is estimated using σ̂2U = (y − Xβ̂)′(y − Xβ̂)/n . Once these quantities have been computed,
the L R test statistic is calculated

LR = n ln( σ̂2R / σ̂2U ) (3.60)

where the identity σ̂2R /σ̂2U = SSER /SSEU has been applied. Finally, LR is transformed by [(n − k)/m][exp(LR/n) − 1] to produce the test statistic, which is numerically identical to W . This can be seen by com-
paring the values in table 3.7 to those in table 3.6.

3.8.6 Lagrange Multiplier Tests

Consider minimizing the sum of squared errors subject to a linear hypothesis.



LR Tests
Null                        Alternative                 LR       M     p-val
βj = 0, j = 1, . . . , 5    βj ≠ 0, j = 1, . . . , 5    6116     5     0.000
βj = 0, j = 1, 3, 4, 5      βj ≠ 0, j = 1, 3, 4, 5      1223.1   4     0.000
βj = 0, j = 1, 5            βj ≠ 0, j = 1, 5            11.14    2     0.000
βj = 0, j = 1, 3            βj ≠ 0, j = 1, 3            2.049    2     0.129
β5 = 0                      β5 ≠ 0                      15.59    1     0.000

LM Tests
Null                        Alternative                 LM       M     p-val
βj = 0, j = 1, . . . , 5    βj ≠ 0, j = 1, . . . , 5    189.6    5     0.000
βj = 0, j = 1, 3, 4, 5      βj ≠ 0, j = 1, 3, 4, 5      203.7    4     0.000
βj = 0, j = 1, 5            βj ≠ 0, j = 1, 5            10.91    2     0.000
βj = 0, j = 1, 3            βj ≠ 0, j = 1, 3            2.045    2     0.130
β5 = 0                      β5 ≠ 0                      15.36    1     0.000

Table 3.7: The upper panel contains test statistics and p-values from LR tests for a regression of excess returns on the big-high portfolio on the 4 factors and a constant. In all cases the null was tested against the alternative listed. The lower panel contains test statistics and p-values for LM tests of the same nulls. Note that the LM test statistics are uniformly smaller than the LR test statistics, which reflects that the variance in an LM test is computed from the model estimated under the null, a value that must be larger than the estimate of the variance under the alternative, which is used in both the Wald and LR tests. Both sets of tests were computed using the small sample assumptions and may be misleading since the residuals are non-normal and heteroskedastic.

min_β (y − Xβ)′(y − Xβ) subject to Rβ − r = 0.

This problem can be formulated in terms of a Lagrangian,

L(β , λ) = (y − Xβ)′(y − Xβ) + (Rβ − r)′λ

and the problem is

max_λ min_β L(β , λ)

The first order conditions correspond to a saddle point,

∂L/∂β = −2X′(y − Xβ) + R′λ = 0
∂L/∂λ = Rβ − r = 0

pre-multiplying the top FOC by R(X′X)−1 (which does not change the value, since it is 0),

2R(X′X)−1 (X′X)β − 2R(X′X)−1 X′y + R(X′X)−1 R′λ = 0
⇒ 2Rβ − 2Rβ̂ + R(X′X)−1 R′λ = 0

where β̂ is the usual OLS estimator. Solving,

λ̃ = 2[R(X′X)−1 R′]−1 (Rβ̂ − r) (3.61)
β̃ = β̂ − (X′X)−1 R′[R(X′X)−1 R′]−1 (Rβ̂ − r) (3.62)

These two solutions provide some insight into the statistical properties of the estima-
tors. β̃ , the constrained regression estimator, is a function of the OLS estimator, β̂ , and a
step in the direction of the constraint. The size of the change is influenced by the distance
between the unconstrained estimates and the constraint (Rβ̂ − r). If the unconstrained
estimator happened to exactly satisfy the constraint, there would be no step.11
The Lagrange multipliers, λ̃, are weighted functions of the unconstrained estimates, β̂ ,
and will be near zero if the constraint is nearly satisfied (Rβ̂ − r ≈ 0). In microeconomics,
Lagrange multipliers are known as shadow prices since they measure the magnitude of the
change in the objective function would if the constraint was relaxed a small amount. Note
that β̂ is the only source of randomness in λ̃ (like β̃ ), and so λ̃ is a linear combination of
normal random variables and will also follow a normal distribution. These two properties
combine to provide a mechanism for testing whether the restrictions imposed by the null
are consistent with the data. The distribution of λ̂ can be directly computed and a test
statistic can be formed.
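The closed forms in eqs. (3.61) and (3.62) translate directly into a few lines of code. The sketch below is only illustrative – the function and variable names are not from the text – and assumes y, X, R and r are NumPy arrays conformable with the notation above.

import numpy as np

def constrained_ols(y, X, R, r):
    """Restricted least squares and Lagrange multipliers, eqs. (3.61)-(3.62)."""
    XpXi = np.linalg.inv(X.T @ X)                    # (X'X)^{-1}
    beta_hat = XpXi @ (X.T @ y)                      # unconstrained OLS estimator
    A = R @ XpXi @ R.T                               # R(X'X)^{-1}R'
    gap = R @ beta_hat - r                           # distance from the constraint
    lam = 2 * np.linalg.solve(A, gap)                # Lagrange multipliers, eq. (3.61)
    beta_tilde = beta_hat - XpXi @ R.T @ np.linalg.solve(A, gap)   # eq. (3.62)
    return beta_hat, beta_tilde, lam

By construction R @ beta_tilde equals r (up to rounding error), which provides a quick check that the step in eq. (3.62) moves the estimate exactly onto the constraint.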
There is another method to derive the LM test statistic that is motivated by the alterna-
tive name of LM tests: Score tests. Returning to the first order conditions and plugging in
the parameters,

R′λ = 2X′(y − Xβ̃)
R′λ = 2X′ε̃

where β̃ is the constrained estimate of β and ε̃ are the corresponding estimated errors
(ε̃ = y − Xβ̃ ). Thus R0 λ has the same distribution as 2X0 ε̃. However, under the small sam-
ple assumptions, ε̃ are linear combinations of normal random variables and so are also
11 Even if the constraint is valid, the constraint will never be exactly satisfied.

normal,

2X′ε̃ ∼ N(0, 4σ²X′X)

and

X′ε̃ ∼ N(0, σ²X′X).                                                        (3.63)


A test statistic for the null that these are simultaneously zero can be constructed in the same manner
as a Wald test:

LM_Infeasible = ε̃′X(X′X)⁻¹X′ε̃ / σ².                                       (3.64)
However, like a Wald test this statistic is not feasible because σ2 is unknown. Using the
same substitution, the LM test statistic is given by

LM = ε̃′X(X′X)⁻¹X′ε̃ / s̃²                                                   (3.65)

and has an Fm,n−k+m distribution where s̃² is the estimated error variance from the con-
strained regression. This is a different estimator than was used in constructing a Wald test
statistic, where the variance was computed from the unconstrained model. Both estimates
are consistent under the null. However, since SSE R ≥ SSEU , s̃ 2 is likely to be larger than s 2 .12
LM tests are usually implemented using a more convenient – but equivalent – form,
LM = [(SSE_R − SSE_U)/m] / [SSE_R/(n − k + m)].                            (3.66)

To use the Lagrange Multiplier principle to conduct a test:

Algorithm 3.33 (Small Sample LM Test). 1. Estimate the unrestricted model yi = xi β + εi ,
and the restricted model, ỹi = x̃i β + εi .

2. Compute SSE_R = Σ_{i=1}^{n} ε̃²i where ε̃i = ỹi − x̃i β̃ are the residuals from the restricted
   regression, and SSE_U = Σ_{i=1}^{n} ε̂²i where ε̂i = yi − xi β̂ are the residuals from the unrestricted
   regression.

3. Compute LM = [(SSE_R − SSE_U)/m] / [SSE_R/(n − k + m)].

4. Compare LM to the critical value, Cα , of the Fm,n−k+m distribution at size α. Reject the
   null hypothesis if LM > Cα .

Alternatively,
12 Note that since the degree-of-freedom adjustment in the two estimators is different, the magnitude of the estimated variance is not directly proportional to SSE_R and SSE_U.

Algorithm 3.34 (Small Sample LM Test). 1. Estimate the restricted model, ỹi = x̃i β + εi .

2. Compute LM = [ε̃′X(X′X)⁻¹X′ε̃ / m] / s² where X is the n by k matrix of regressors from the
   unconstrained model and s² = Σ_{i=1}^{n} ε̃²i / (n − k + m).

3. Compare LM to the critical value, Cα , of the Fm,n−k+m distribution at size α. Reject the
   null hypothesis if LM > Cα .
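Either algorithm takes only a few lines to implement. The sketch below follows the SSE form of Algorithm 3.33 and is purely illustrative: the names are hypothetical and it assumes the null is imposed by dropping columns of X, as in the examples of the next subsection.

import numpy as np
from scipy import stats

def small_sample_lm(y, X, X_restricted):
    """Small sample LM test using eq. (3.66)."""
    n, k = X.shape
    m = k - X_restricted.shape[1]                       # number of restrictions
    e_u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]  # unrestricted residuals
    e_r = y - X_restricted @ np.linalg.lstsq(X_restricted, y, rcond=None)[0]
    sse_u, sse_r = e_u @ e_u, e_r @ e_r
    lm = ((sse_r - sse_u) / m) / (sse_r / (n - k + m))  # eq. (3.66)
    pval = 1 - stats.f.cdf(lm, m, n - k + m)            # F(m, n-k+m) under the null
    return lm, pval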

3.8.7 Example: LM Tests in Cross-Sectional Factor models

Table 3.7 also contains values from LM tests. LM tests have a slightly different distribution
than the Wald and LR and do not produce numerically identical results. While the Wald and
LR tests require estimation of the unrestricted model (estimation under the alternative),
LM tests only require estimation of the restricted model (estimation under the null). For
example, in testing the null H0 : β1 = β5 = 0 (that the U M D factor has no explanatory
power and that the intercept is 0), the restricted model is estimated from

BHei = γ1 VWMei + γ2 SMBi + γ3 HMLi + εi .

The two conditions, that β1 = 0 and that β5 = 0 are imposed by excluding these regressors.
Once the restricted regression is fit, the residuals estimated under the null, ε̃i = yi − xi β̃
are computed and the LM test is calculated from

LM = ε̃′X(X′X)⁻¹X′ε̃ / s²
where X is the set of explanatory variables from the unrestricted regression (in this case,
xi = [1 VWMei SMBi HMLi UMDi ]). Examining table 3.7, the LM test statistics are con-
siderably smaller than the Wald test statistics. This difference arises since the variance used
in computing the LM test statistic, σ̃², is estimated under the null. For instance, in the most
restricted case (H0 : β j = 0, j = 1, . . . , k ), the variance is estimated by y′y/n (since k = 0
in this model) which is very different from the variance estimated under the alternative
(which is used by both the Wald and LR). Despite the differences in the test statistics, the
p-values in the table would result in the same inference. For the one hypothesis that is not
completely rejected, the p-value of the LM test is slightly larger than that of the LR (or W).
However, .130 and .129 should never make a qualitative difference (nor should .101 and
.099, even when using a 10% test). These results highlight a general feature of LM tests:
test statistics based on the LM-principle are smaller than Likelihood Ratios and Wald tests,
and so less likely to reject.

[Figure 3.4 appears here: the location of the three test statistics. The curve is the sum of squared errors (y − Xβ)′(y − Xβ); marked on it are the constraint Rβ − r = 0, the score 2X′(y − Xβ), the restricted estimator β̃ and the unrestricted estimator β̂, which locate the Lagrange Multiplier, Likelihood Ratio and Wald tests.]

Figure 3.4: Graphical representation of the three major classes of tests. The Wald test mea-
sures the magnitude of the constraint, Rβ − r , at the OLS parameter estimate, β̂ . The
LM test measures the magnitude of the score at the restricted estimator (β̃ ) while the LR
test measures the difference between the SSE at the restricted estimator and the SSE at
the unrestricted estimator. Note: Only the locations of the test statistics, not their relative
magnitudes, can be determined from this illustration.

3.8.8 Comparing the Wald, LR and LM Tests

With three tests available to test the same hypothesis, which is the correct one? In the small
sample framework, the Wald is the obvious choice because W ≈ L R and W is larger than
L M . However, the LM has a slightly different distribution, so it is impossible to make an
absolute statement. The choice among these three tests reduces to user preference and
ease of computation. Since computing SSEU and SSE R is simple, the Wald test is likely the
simplest to implement.
These results are no longer true when nonlinear restrictions and/or nonlinear models
are estimated. Further discussion of the factors affecting the choice between the Wald, LR
and LM tests is deferred until nonlinear models are introduced. Figure 3.4 contains a graphical representation of
the three test statistics in the context of a simple regression, yi = β xi + εi .13 The Wald
test measures the magnitude of the constraint Rβ − r at the unconstrained estimator β̂ .
The LR test measures how much the sum of squared errors has changed between β̂ and β̃ .
Finally, the LM test measures the magnitude of the gradient, X′(y − Xβ), at the constrained
estimator β̃ .

13 The magnitudes of the lines are not to scale, so the magnitudes of the test statistics cannot be determined from the picture.

3.9 Large Sample Assumption


While the small sample assumptions allow the exact distribution of the OLS estimator and
test statistics to be derived, these assumptions are not realistic in applications using finan-
cial data. Asset returns are non-normal (both skewed and leptokurtic), heteroskedastic,
and correlated. The large sample framework allows for inference on β without making
strong assumptions about the distribution or error covariance structure. However, the gen-
erality of the large sample framework comes at the loss of the ability to say anything exact
about the estimates in finite samples.
Four new assumptions are needed to analyze the asymptotic behavior of the OLS esti-
mators.
Assumption 3.35 (Stationary Ergodicity). {(xi , εi )} is a strictly stationary and ergodic se-
quence.
This is a technical assumption needed for consistency and asymptotic normality. It im-
plies two properties about the joint density of {(xi , εi )}: the joint distribution of {(xi , εi )}
and {(xi + j , εi + j )} depends on the time between observations ( j ) and not the observation
index (i ) and that averages will converge to their expected value (as long as they exist).
There are a number of alternative assumptions that could be used in place of this assump-
tion, although this assumption is broad enough to allow for i.i.d., i.n.i.d. (independent, not
identically distributed, including heteroskedasticity), and some n.i.n.i.d. data, although it
does rule out some important cases. Specifically, the regressors cannot be trending or oth-
erwise depend on the observation index, an important property of some economic time
series such as the level of a market index or aggregate consumption. Stationarity will be
considered more carefully in the time-series chapters.
Assumption 3.36 (Rank). E[x0i xi ] = ΣXX is nonsingular and finite.
This assumption, like assumption 3.11, is needed to ensure identification.
Assumption 3.37 (Martingale Difference). {x′i εi , Fi } is a martingale difference sequence,

E[(x j ,i εi )²] < ∞, j = 1, 2, . . . , k , i = 1, 2 . . .

and S = V[n^{−1/2} X′ε] is finite and nonsingular.
A martingale difference sequence has the property that its mean is unpredictable using the
information contained in the information set (Fi ).

Definition 3.38 (Martingale Difference Sequence). Let {zi } be a vector stochastic process
and Fi be the information set corresponding to observation i containing all information
available when observation i was collected except zi . {zi , Fi } is a martingale difference
sequence if
E[zi |Fi ] = 0

In the context of the linear regression model, it states that the current score is not pre-
dictable by any of the previous scores, that the mean of the scores is zero (E[x′i εi ] = 0),
and there is no other variable in Fi which can predict the scores. This assumption is suf-
ficient to ensure that n −1/2 X0 ε will follow a Central Limit Theorem, and it plays a role in
consistency of the estimator. A m.d.s. is a fairly general construct and does not exclude
using time-series regressors as long as they are predetermined, meaning that they do not
depend on the process generating εi . For instance, in the CAPM, the return on the market
portfolio can be thought of as being determined independently of the idiosyncratic shock
affecting individual assets.

Assumption 3.39 (Moment Existence). E[x⁴j ,i ] < ∞, i = 1, 2, . . ., j = 1, 2, . . . , k and
E[ε²i ] = σ² < ∞, i = 1, 2, . . ..

This final assumption requires that the fourth moment of any regressor exists and the
variance of the errors is finite. This assumption is needed to derive a consistent estimator
of the parameter covariance.

3.10 Large Sample Properties


These assumptions lead to two key theorems about the asymptotic behavior of β̂ : it is con-
sistent and asymptotically normally distributed. First, some new notation is needed. Let

β̂ n = (X′X/n)⁻¹ (X′y/n)                                                    (3.67)
be the regression coefficient using n realizations from the stochastic process {xi , εi }.

Theorem 3.40 (Consistency of β̂ ). Under assumptions 3.9 and 3.35 - 3.37,

β̂ n →p β

Consistency is a weak property of the OLS estimator, but it is important. This result
relies crucially on the implication of assumption 3.37 that n⁻¹X′ε →p 0, and under the same
assumptions, the OLS estimator is also asymptotically normal.

Theorem 3.41 (Asymptotic Normality of β̂ ). Under assumptions 3.9 and 3.35 - 3.37,

√n (β̂ n − β) →d N(0, ΣXX⁻¹SΣXX⁻¹)                                          (3.68)

where ΣXX = E[x′i xi ] and S = V[n^{−1/2} X′ε]

Asymptotic normality provides the basis for hypothesis tests on β . However, using only
theorem 3.41, tests are not feasible since ΣXX and S are unknown, and so must be estimated.

Theorem 3.42 (Consistency of OLS Parameter Covariance Estimator). Under assumptions
3.9 and 3.35 - 3.39,

Σ̂XX = n⁻¹X′X →p ΣXX
Ŝ = n⁻¹ Σ_{i=1}^{n} ε̂²i x′i xi →p S
  = n⁻¹X′ÊX

and

Σ̂XX⁻¹ŜΣ̂XX⁻¹ →p ΣXX⁻¹SΣXX⁻¹

where Ê = diag(ε̂²1 , . . . , ε̂²n ) is a matrix with the estimated residuals squared along its diago-
nal.

Combining these theorems, the OLS estimator is consistent, asymptotically normal and
the asymptotic variance can be consistently estimated. These three properties provide the
tools necessary to conduct hypothesis tests in the asymptotic framework. The usual esti-
mator for the variance of the residuals is also consistent for the variance of the innovations
under the same conditions.

Theorem 3.43 (Consistency of OLS Variance Estimator). Under assumptions 3.9 and 3.35 -
3.39,

σ̂²n = n⁻¹ε̂′ε̂ →p σ²

Further, if homoskedasticity is assumed, then the parameter covariance estimator can


be simplified.

Theorem 3.44 (Homoskedastic Errors). Under assumptions 3.9, 3.12, 3.13 and 3.35 - 3.39,

√n (β̂ n − β) →d N(0, σ²ΣXX⁻¹)

Combining the result of this theorem with that of theorems 3.42 and 3.43, a consistent
estimator of σ²ΣXX⁻¹ is given by σ̂²n Σ̂XX⁻¹.

3.11 Large Sample Hypothesis Testing


All three tests, Wald, LR, and LM have large sample equivalents that exploit the asymptotic
normality of the estimated parameters. While these tests are only asymptotically exact,

the use of the asymptotic distribution is justified as an approximation to the finite-sample


distribution, although the quality of the CLT approximation depends on how well behaved
the data are.

3.11.1 Wald Tests

Recall from Theorem 3.41,

√n (β̂ n − β) →d N(0, ΣXX⁻¹SΣXX⁻¹).                                         (3.69)

Applying the properties of a normal random variable, if z ∼ N(µ, Σ) then c′z ∼ N(c′µ, c′Σc),
and if w ∼ N(µ, σ²) then (w − µ)²/σ² ∼ χ²1 . Using these two properties, a test of the null

H0 : Rβ − r = 0

against the alternative

H1 : Rβ − r ≠ 0

can be constructed.
Following from Theorem 3.41, if H0 : Rβ − r = 0 is true, then

√n (Rβ̂ n − r) →d N(0, RΣXX⁻¹SΣXX⁻¹R′)                                       (3.70)

and

Γ^{−1/2}√n (Rβ̂ n − r) →d N(0, Im )                                          (3.71)

where Γ = RΣXX⁻¹SΣXX⁻¹R′. Under the null that H0 : Rβ − r = 0,

n (Rβ̂ n − r)′[RΣXX⁻¹SΣXX⁻¹R′]⁻¹(Rβ̂ n − r) →d χ²m                            (3.72)

where m is the rank(R). This estimator is not feasible since Γ is not known and must be
estimated. Fortunately, Γ can be consistently estimated by applying the results of Theorem
3.42,

Σ̂XX = n⁻¹X′X ,    Ŝ = n⁻¹ Σ_{i=1}^{n} ε̂²i x′i xi

and so

Γ̂ = Σ̂XX⁻¹ŜΣ̂XX⁻¹.

The feasible Wald statistic is defined

W = n (Rβ̂ n − r)′[RΣ̂XX⁻¹ŜΣ̂XX⁻¹R′]⁻¹(Rβ̂ n − r) →d χ²m .                      (3.73)

Test statistic values can be compared to the critical value Cα from a χ²m at the α-significance
level and the null is rejected if W is greater than Cα . The asymptotic t -test (which has a
normal distribution) is defined analogously,

t = √n (Rβ̂ n − r)/√(RΓ̂ R′) →d N(0, 1),                                      (3.74)

where R is a 1 by k vector. Typically R is a vector with 1 in its jth element, producing the statistic

t = √n β̂ j ,n /√([Γ̂ ] j j ) →d N(0, 1)

where [Γ̂ ] j j is the jth diagonal element of Γ̂ .



The n term in the Wald statistic (or √n in the t -test) may appear strange at first, al-
though these terms are present in the classical tests as well. Recall that the t -stat (null
H0 : β j = 0) from the classical framework with homoskedastic data is given by

t1 = β̂ j /√(σ̂²[(X′X)⁻¹] j j ).

The t -stat in the asymptotic framework is

t2 = √n β̂ j ,n /√(σ̂²[Σ̂XX⁻¹] j j ).

If t1 is multiplied and divided by √n , then

t1 = √n β̂ j /(√n √(σ̂²[(X′X)⁻¹] j j )) = √n β̂ j /√(σ̂²[(X′X/n)⁻¹] j j ) = √n β̂ j /√(σ̂²[Σ̂XX⁻¹] j j ) = t2 ,

and these two statistics have the same value since X′X differs from Σ̂XX by a factor of n .

Algorithm 3.45 (Large Sample Wald Test). 1. Estimate the unrestricted model yi = xi β + εi .

2. Estimate the parameter covariance using Σ̂XX⁻¹ŜΣ̂XX⁻¹ where

   Σ̂XX = n⁻¹ Σ_{i=1}^{n} x′i xi ,    Ŝ = n⁻¹ Σ_{i=1}^{n} ε̂²i x′i xi

3. Construct the restriction matrix, R, and the value of the restriction, r, from the null
   hypothesis.

4. Compute W = n (Rβ̂ n − r)′[RΣ̂XX⁻¹ŜΣ̂XX⁻¹R′]⁻¹(Rβ̂ n − r).

5. Reject the null if W > Cα where Cα is the critical value from a χ²m using a size of α.
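A minimal sketch of Algorithm 3.45 (hypothetical names, not from the text) using the heteroskedasticity robust covariance estimator of Theorem 3.42:

import numpy as np
from scipy import stats

def large_sample_wald(y, X, R, r):
    """Robust Wald test of H0: R beta = r, Algorithm 3.45."""
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    eps = y - X @ beta
    sigma_xx = X.T @ X / n                              # Sigma_XX hat
    s_hat = (X * eps[:, None] ** 2).T @ X / n           # S hat = n^{-1} sum eps_i^2 x_i'x_i
    sigma_xx_inv = np.linalg.inv(sigma_xx)
    avar = sigma_xx_inv @ s_hat @ sigma_xx_inv          # robust asymptotic covariance
    gap = R @ beta - r
    w = n * gap @ np.linalg.solve(R @ avar @ R.T, gap)
    pval = 1 - stats.chi2.cdf(w, R.shape[0])            # chi-squared(m) under the null
    return w, pval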

3.11.2 Lagrange Multiplier Tests

Recall that the first order conditions of the constrained estimation problem require

R′λ̂ = 2X′ε̃

where ε̃ are the residuals estimated under the null H0 : Rβ − r = 0. The LM test exam-
ines whether λ is close to zero. In the large sample framework, λ̂, like β̂ , is asymptotically
normal and R′λ̂ will only be close to 0 if λ̂ ≈ 0. The asymptotic version of the LM test
can be compactly expressed if s̃ is defined as the average score of the restricted estimator,
s̃ = n⁻¹X′ε̃. In this notation,

LM = n s̃′S⁻¹s̃ →d χ²m .                                                     (3.75)
If the model is correctly specified, n⁻¹X′ε̃, which is a k by 1 vector with jth element
n⁻¹ Σ_{i=1}^{n} x j ,i ε̃i , should be a mean-zero vector with asymptotic variance S by assumption
3.35. Thus, √n (n⁻¹X′ε̃) →d N(0, S) implies

√n S^{−1/2} s̃ →d N(0, [ Im  0 ; 0  0 ])                                     (3.76)

and so n s̃′S⁻¹s̃ →d χ²m . This version is infeasible and the feasible version of the LM test must
be used,

LM = n s̃′S̃⁻¹s̃ →d χ²m .                                                    (3.77)

where S̃ = n⁻¹ Σ_{i=1}^{n} ε̃²i x′i xi is the estimator of the asymptotic variance computed under the
null. This means that S̃ is computed using the residuals from the restricted regression, ε̃,
and that it will differ from the usual estimator Ŝ which is computed using residuals from
the unrestricted regression, ε̂. Under the null, both S̃ and Ŝ are consistent estimators for S
and using one or the other has no asymptotic effect.
If the residuals are homoskedastic, the LM test can also be expressed in terms of the
R² of the unrestricted model when testing a null that the coefficients on all explanatory
variables except the intercept are zero. Suppose the regression fit was

yi = β0 + β1 x1,i + β2 x2,i + . . . + βk xk ,i .

To test H0 : β1 = β2 = . . . = βk = 0 (where the excluded β0 corresponds to the constant),

LM = n R² →d χ²k                                                           (3.78)
is equivalent to the test statistic in eq. (3.77). This expression is useful as a simple tool to
jointly test whether the explanatory variables in a regression appear to explain any varia-
tion in the dependent variable. If the residuals are heteroskedastic, the nR² form of the LM
test does not have a standard distribution and should not be used.

Algorithm 3.46 (Large Sample LM Test). 1. Form the unrestricted model, yi = xi β + εi .

2. Impose the null on the unrestricted model and estimate the restricted model, ỹi = x̃i β + εi .

3. Compute the residuals from the restricted regression, ε̃i = ỹi − x̃i β̃ .

4. Construct the score using the residuals from the restricted regression, s̃i = xi ε̃i , where xi
   are the regressors from the unrestricted model.

5. Estimate the average score and the covariance of the score,

   s̃ = n⁻¹ Σ_{i=1}^{n} s̃i ,    S̃ = n⁻¹ Σ_{i=1}^{n} s̃′i s̃i                  (3.79)

6. Compute the LM test statistic as LM = n s̃S̃⁻¹s̃′.

7. Reject the null if LM > Cα where Cα is the critical value from a χ²m using a size of α.
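A sketch of Algorithm 3.46 for nulls imposed by dropping regressors (illustrative names only); note that both the score and its covariance S̃ use the restricted residuals.

import numpy as np
from scipy import stats

def large_sample_lm(y, X, X_restricted, m):
    """Robust LM test, Algorithm 3.46; m is the number of restrictions."""
    n = X.shape[0]
    beta_r = np.linalg.lstsq(X_restricted, y, rcond=None)[0]
    eps_r = y - X_restricted @ beta_r                   # residuals under the null
    scores = X * eps_r[:, None]                         # s_i = x_i * eps_i (rows)
    s_bar = scores.mean(axis=0)                         # average score
    S_tilde = scores.T @ scores / n                     # covariance of the score
    lm = n * s_bar @ np.linalg.solve(S_tilde, s_bar)
    pval = 1 - stats.chi2.cdf(lm, m)
    return lm, pval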

3.11.3 Likelihood Ratio Tests


One major difference between small sample hypothesis testing and large sample hypoth-
esis testing is the omission of assumption 3.14. Without this assumption, the distribution
of the errors is left unspecified. Based on the ease of implementing the Wald and LM tests
in their asymptotic framework, it may be tempting to think the likelihood ratio is asymptoti-
cally valid. This is not the case. The technical details are complicated but the proof relies
crucially on the Information Matrix Equality holding. When the data are heteroskedastic
or non-normal, the IME will generally not hold and the distribution of LR tests will be
nonstandard.14
There is, however, a feasible likelihood-ratio like test available. The motivation for this
test will be clarified in the GMM chapter. For now, the functional form will be given with
only minimal explanation,
14 In this case, the LR will converge to a weighted mixture of m independent χ²1 random variables where the weights are not 1. The resulting distribution is not a χ²m .

LR = n s̃′S⁻¹s̃ →d χ²m ,                                                     (3.80)
where s̃ = n −1 X0 ε̃ is the average score vector when the estimator is computed under the
null. This statistic is similar to the LM test statistic, although there are two differences. First,
one term has been left out of this expression, and the formal definition of the asymptotic
LR is

LR = n (s̃′S⁻¹s̃ − ŝ′S⁻¹ŝ) →d χ²m                                            (3.81)
where ŝ = n −1 X0 ε̂ are the average scores from the unrestricted estimator. Recall from the
first-order conditions of OLS (eq. (3.7)) that ŝ = 0 and the second term in the general
expression of the L R will always be zero. The second difference between L R and L M exists
only in the feasible versions. The feasible version of the LR is given by

LR = n s̃′Ŝ⁻¹s̃ →d χ²m .                                                     (3.82)

where Ŝ is estimated using the scores of the unrestricted model (under the alternative),

Ŝ = n⁻¹ Σ_{i=1}^{n} ε̂²i x′i xi .                                            (3.83)

The feasible LM, n s̃′S̃⁻¹s̃, uses a covariance estimator (S̃) based on the scores from the re-
stricted model, s̃.
In models with heteroskedasticity it is impossible to determine a priori whether the
LM or the LR test statistic will be larger, although folk wisdom states that LR test statistics
are larger than LM test statistics (and hence the LR will be more powerful). If the data
are homoskedastic, and homoskedastic estimators of Ŝ and S̃ are used (σ̂²(X′X/n) and
σ̃²(X′X/n), respectively), then it must be the case that LM < LR . This follows since OLS
minimizes the sum of squared errors, and so σ̂2 must be smaller than σ̃2 , and so the LR
can be guaranteed to have more power in this case since the LR and LM have the same
asymptotic distribution.

Algorithm 3.47 (Large Sample LR Test). 1. Estimate the unrestricted model yi = xi β + εi .

2. Impose the null on the unrestricted model and estimate the restricted model, ỹi = x̃i β + εi .

3. Compute the residuals from the restricted regression, ε̃i = ỹi − x̃i β̃ , and from the un-
   restricted regression, ε̂i = yi − xi β̂ .

4. Construct the score from both models, s̃i = xi ε̃i and ŝi = xi ε̂i , where in both cases xi
   are the regressors from the unrestricted model.

5. Estimate the average score and the covariance of the score,

   s̃ = n⁻¹ Σ_{i=1}^{n} s̃i ,    Ŝ = n⁻¹ Σ_{i=1}^{n} ŝ′i ŝi                   (3.84)

6. Compute the LR test statistic as LR = n s̃Ŝ⁻¹s̃′.

7. Reject the null if LR > Cα where Cα is the critical value from a χ²m using a size of α.
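The LR version differs from the LM sketch above only in the covariance estimator, which is built from the unrestricted residuals (again, illustrative names only).

import numpy as np
from scipy import stats

def large_sample_lr(y, X, X_restricted, m):
    """Asymptotic LR test, Algorithm 3.47; m is the number of restrictions."""
    n = X.shape[0]
    beta_u = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_r = np.linalg.lstsq(X_restricted, y, rcond=None)[0]
    eps_u = y - X @ beta_u                              # unrestricted residuals
    eps_r = y - X_restricted @ beta_r                   # restricted residuals
    s_bar = (X * eps_r[:, None]).mean(axis=0)           # average restricted score
    scores_u = X * eps_u[:, None]
    S_hat = scores_u.T @ scores_u / n                   # S hat from unrestricted scores
    lr = n * s_bar @ np.linalg.solve(S_hat, s_bar)
    pval = 1 - stats.chi2.cdf(lr, m)
    return lr, pval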

3.11.4 Revisiting the Wald, LM and LR tests

The previous test results, all based on the restrictive small sample assumptions, can now be
revisited using the test statistics that allow for heteroskedasticity. Tables 3.8 and 3.9 contain
the values for t -tests, Wald tests, LM tests and LR tests using the large sample versions
of these test statistics as well as the values previously computed using the homoskedastic
small sample framework.
There is a clear direction in the changes of the test statistics: most are smaller, some
substantially. Examining table 3.8, 4 out of 5 of the t -stats have decreased. Since the esti-
mator of β̂ is the same in both the small sample and the large sample frameworks, all of
the difference is attributable to changes in the standard errors, which typically increased
by 50%. When t -stats differ dramatically under the two covariance estimators, the likely
cause is heteroskedasticity.
Table 3.9 shows that the Wald, LR and LM test statistics also changed by large amounts.15
Using the heteroskedasticity robust covariance estimator, the Wald statistics decreased by
up to a factor of 2 and the robust LM test statistics decreased by up to 5 times. The LR
test statistic values were generally larger than those of the corresponding Wald or LM test
statistics. The relationship between the robust versions of the Wald and LR statistics is not
clear, and for models that are grossly misspecified, the Wald and LR test statistics are sub-
stantially larger than their LM counterparts. However, when the value of the test statistics
is smaller, the three are virtually identical, and inference made using of any of these three
tests is the same. All nulls except H0 : β1 = β3 = 0 would be rejected using standard sizes
(5-10%).
These changes should serve as a warning against conducting inference using covariance
estimates based on homoskedasticity. In most applications to financial time-series, het-
eroskedasticity robust covariance estimators (and often HAC (Heteroskedasticity and Au-
tocorrelation Consistent), which will be defined in the time-series chapter) are automati-
cally applied without testing for heteroskedasticity.
15 The statistics based on the small-sample assumptions have Fm,t−k or Fm,t−k+m distributions while the statistics based on the large-sample assumptions have χ²m distributions, and so the values of the small sample statistics must be multiplied by m to be compared to the large sample statistics.

t Tests
Homoskedasticity Heteroskedasticity
Parameter β̂ S.E. t -stat p-val S.E. t -stat p-val
Constant -0.064 0.043 -1.484 0.138 0.043 -1.518 0.129
V WMe 1.077 0.008 127.216 0.000 0.008 97.013 0.000
SM B 0.019 0.013 1.440 0.150 0.013 0.771 0.440
HML 0.803 0.013 63.736 0.000 0.013 43.853 0.000
UMD -0.040 0.010 -3.948 0.000 0.010 -2.824 0.005

Table 3.8: Comparing small and large sample t -stats. The small sample statistics, in the left
panel of the table, overstate the precision of the estimates. The heteroskedasticity robust
standard errors are larger for 4 out of 5 parameters, and one variable that was significant
at the 15% level using the homoskedastic standard errors is insignificant using the robust standard errors.

3.12 Violations of the Large Sample Assumptions

The large sample assumptions are just that: assumptions. While this set of assumptions is
far more general than the finite-sample setup, they may be violated in a number of ways.
This section examines the consequences of certain violations of the large sample assump-
tions.

3.12.1 Omitted and Extraneous Variables

Suppose that the model is linear but misspecified and a subset of the relevant regressors
are excluded. The model can be specified

yi = β 1 x1,i + β 2 x2,i + εi (3.85)

where x1,i is 1 by k1 vector of included regressors and x2,i is a 1 by k2 vector of excluded but
relevant regressors. Omitting x2,i from the fit model, the least squares estimator is

β̂ 1n = (X′1X1 /n)⁻¹(X′1y/n).                                                (3.86)
Using the asymptotic framework, the estimator can be shown to have a general form of
bias.

Theorem 3.48 (Misspecified Regression). Under assumptions 3.9 and 3.35 - 3.37, if
X can be partitioned [X1 X2 ] where X1 corresponds to included variables while X2 corresponds

Wald Tests
Small Sample Large Sample
Null M W p-val W p-val
β j = 0, j = 1, . . . , 5 5 6116 0.000 16303 0.000
β j = 0, j = 1, 3, 4, 5 4 1223.1 0.000 2345.3 0.000
β j = 0, j = 1, 5 2 11.14 0.000 13.59 0.001
β j = 0, j = 1, 3 2 2.049 0.129 2.852 0.240
β5 = 0 1 15.59 0.000 7.97 0.005

LR Tests
Small Sample Large Sample
Null M L R p-val L R p-val
β j = 0, j = 1, . . . , 5 5 6116 0.000 16303 0.000
β j = 0, j = 1, 3, 4, 5 4 1223.1 0.000 10211.4 0.000
β j = 0, j = 1, 5 2 11.14 0.000 13.99 0.001
β j = 0, j = 1, 3 2 2.049 0.129 3.088 0.213
β5 = 0 1 15.59 0.000 8.28 0.004

LM Tests
Small Sample Large Sample
Null M L M p-val L M p-val
β j = 0, j = 1, . . . , 5 5 190 0.000 106 0.000
β j = 0, j = 1, 3, 4, 5 4 231.2 0.000 170.5 0.000
β j = 0, j = 1, 5 2 10.91 0.000 14.18 0.001
β j = 0, j = 1, 3 2 2.045 0.130 3.012 0.222
β5 = 0 1 15.36 0.000 8.38 0.004

Table 3.9: Comparing large- and small-sample Wald, LM and LR test statistics. The large-
sample test statistics are smaller than their small-sample counterparts, a result of the
heteroskedasticity present in the data. While the main conclusions are unaffected by the
choice of covariance estimator, this will not always be the case.

to excluded variables with non-zero coefficients, then


β̂ 1n →p β 1 + ΣX1X1⁻¹ΣX1X2 β 2                                              (3.87)
β̂ 1 →p β 1 + δβ 2

where

ΣXX = [ ΣX1X1   ΣX1X2
        Σ′X1X2  ΣX2X2 ]
The bias term, δβ 2 , is composed of two elements. δ is a matrix of regression coefficients
where the jth column is the probability limit of the least squares estimator in the regression

X2 j = X1 δ j + ν,

where X2 j is the jth column of X2 . The second component of the bias term is the original
regression coefficients. As should be expected, larger coefficients on omitted variables lead
to larger bias.
β̂ 1n →p β 1 under one of three conditions:

1. δ̂n →p 0

2. β 2 = 0

3. The product δ̂n β 2 →p 0.
β 2 has been assumed to be non-zero (if β 2 = 0 the model is correctly specified). δ̂n →p 0
only if the regression coefficients of X2 on X1 are zero, which requires the omitted and
included regressors to be uncorrelated (X2 lies in the null space of X1 ).
should be considered implausible in most applications and β̂ 1n will be biased, although
certain classes of regressors, such as dummy variables, are mutually orthogonal and can be
safely omitted.16 Finally, if both δ and β 2 are non-zero, the product could be zero, although
without a very peculiar specification and a careful selection of regressors, this possibility
should be considered unlikely.
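A short simulation makes the probability limit in eq. (3.87) concrete. Everything below (sample size, coefficients, the correlation ρ) is hypothetical and chosen only for illustration; with one included and one omitted regressor, δ is the slope from regressing x2 on x1.

import numpy as np

rng = np.random.default_rng(0)
n, rho, beta1, beta2 = 100_000, 0.5, 1.0, 2.0      # hypothetical values

x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)   # delta = rho
y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)

b1_short = (x1 @ y) / (x1 @ x1)                    # OLS omitting x2
print(b1_short, beta1 + rho * beta2)               # both approximately 2.0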
Alternatively, consider the case where some irrelevant variables are included. The cor-
rect model specification is

yi = x1,i β 1 + εi

and the model estimated is

yi = x1,i β 1 + x2,i β 2 + εi

As long as the assumptions of the asymptotic framework are satisfied, the least squares
estimator is consistent under theorem 3.40 and

β̂ n →p [ β 1 ; β 2 ] = [ β 1 ; 0 ]

If the errors are homoskedastic, the variance of √n (β̂ n − β) is σ²ΣXX⁻¹ where X = [X1 X2 ].
The variance of β̂ 1n is the upper left k1 by k1 block of σ²ΣXX⁻¹. Using the partitioned inverse,

16 Safely in terms of consistency of estimated parameters. Omitting variables will cause the estimated variance to be inconsistent.

ΣXX⁻¹ = [ ΣX1X1⁻¹ + ΣX1X1⁻¹ΣX1X2 M1 Σ′X1X2 ΣX1X1⁻¹     −ΣX1X1⁻¹ΣX1X2 M1
          −M1 Σ′X1X2 ΣX1X1⁻¹                           ΣX2X2⁻¹ + ΣX2X2⁻¹Σ′X1X2 M2 ΣX1X2 ΣX2X2⁻¹ ]

where

M1 = lim_{n→∞} X′2MX1 X2 /n
M2 = lim_{n→∞} X′1MX2 X1 /n

and so the upper left block of the variance, ΣX1X1⁻¹ + ΣX1X1⁻¹ΣX1X2 M1 Σ′X1X2 ΣX1X1⁻¹, must be larger
than ΣX1X1⁻¹ because the second term is a quadratic form and M1 is positive semi-definite.17
Noting that σ̂² is consistent under both the correct specification and the expanded speci-
fication, the cost of including extraneous regressors is an increase in the asymptotic vari-
ance.

17 Both M1 and M2 are covariance matrices of the residuals of regressions of x2 on x1 and x1 on x2 , respectively.
In finite samples, there is a bias-variance tradeoff. Fewer regressors included in a model
leads to more precise estimates, while including more variables leads to less bias, and
when relevant variables are omitted σ̂2 will be larger and the estimated parameter vari-
ance, σ̂2 (X0 X)−1 must be larger.
Asymptotically only bias remains as it is of higher order than variance (scaling β̂ n − β
by √n , the bias is exploding while the variance is constant), and so when the sample size
is large and estimates are precise, a larger model should be preferred to a smaller model.
In cases where the sample size is small, there is a justification for omitting a variable to
enhance the precision of those remaining, particularly when the effect of the omitted vari-
able is not of interest or when the excluded variable is highly correlated with one or more
included variables.

3.12.2 Errors Correlated with Regressors

Bias can arise from sources other than omitted variables. Consider the case where X is
measured with noise and define x̃i = xi + ηi where x̃i is a noisy proxy for xi , the “true”
(unobserved) regressor, and ηi is an i.i.d. mean 0 noise process which is independent of X
and ε with finite second moments Σηη . The OLS estimator,

β̂ n = (X̃′X̃/n)⁻¹(X̃′y/n)                                                     (3.88)
    = [(X + η)′(X + η)/n]⁻¹[(X + η)′y/n]                                    (3.89)
    = [X′X/n + X′η/n + η′X/n + η′η/n]⁻¹[(X + η)′y/n]                        (3.90)
    = [X′X/n + X′η/n + η′X/n + η′η/n]⁻¹[X′y/n + η′y/n]                      (3.91)

will be biased downward. To understand the source of the bias, consider the behavior,
under the asymptotic assumptions, of

X′X/n →p ΣXX
X′η/n →p 0
η′η/n →p Σηη
X′y/n →p ΣXX β
η′y/n →p 0

so

[X′X/n + X′η/n + η′X/n + η′η/n]⁻¹ →p (ΣXX + Σηη)⁻¹

and

β̂ n →p (ΣXX + Σηη)⁻¹ΣXX β .

If Σηη ≠ 0, then β̂ n does not converge in probability to β and the estimator is inconsistent.
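In the scalar case the probability limit reduces to β σ²x/(σ²x + σ²η), the familiar attenuation factor. The following simulation (hypothetical names and values, for illustration only) makes this concrete.

import numpy as np

rng = np.random.default_rng(1)
n, beta = 200_000, 1.0
sigma2_x, sigma2_eta = 1.0, 0.5                    # hypothetical variances

x = rng.normal(scale=np.sqrt(sigma2_x), size=n)    # true regressor
x_tilde = x + rng.normal(scale=np.sqrt(sigma2_eta), size=n)   # noisy proxy
y = beta * x + rng.standard_normal(n)

b_hat = (x_tilde @ y) / (x_tilde @ x_tilde)        # OLS on the noisy proxy
print(b_hat, beta * sigma2_x / (sigma2_x + sigma2_eta))       # both near 0.667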
The OLS estimator is also biased in the case where n⁻¹X′ε does not converge in probability to 0k , which arises in sit-
uations with endogeneity. In these cases, xi and εi are simultaneously determined and
correlated. This correlation results in a biased estimator since β̂ n →p β + ΣXX⁻¹ΣXε where
ΣXε is the limit of n⁻¹X′ε. The classic example of endogeneity is simultaneous equation

models although many situations exist where the innovation may be correlated with one
or more regressors; omitted variables can be considered a special case of endogeneity by
reformulating the model.
The solution to this problem is to find an instrument, zi , which is correlated with the
endogenous variable, xi , but uncorrelated with εi . Intuitively, the endogenous portions
of xi can be annihilated by regressing xi on zi and using the fitted values. This procedure is
known as instrumental variable (IV) regression in the case where the number of zi variables
is the same as the number of xi variables and two-stage least squares (2SLS) when the size
of zi is larger than k .
Define zi as a vector of exogenous variables where zi may contain any of the variables
in xi which are exogenous. However, all endogenous variables – those correlated with the

error – must be excluded.


First, a few assumptions must be reformulated.

Assumption 3.49 (IV Stationary Ergodicity). {(zi , xi , εi )} is a strictly stationary and ergodic
sequence.

Assumption 3.50 (IV Rank). E[z0i xi ] = ΣZX is nonsingular and finite.

Assumption 3.51 (IV Martingale Difference). {z′i εi , Fi } is a martingale difference sequence,

E[(z j ,i εi )²] < ∞, j = 1, 2, . . . , k , i = 1, 2 . . .

and S = V[n^{−1/2} Z′ε] is finite and nonsingular.

Assumption 3.52 (IV Moment Existence). E[x⁴j ,i ] < ∞ and E[z⁴j ,i ] < ∞, j = 1, 2, . . . , k ,
i = 1, 2, . . . and E[ε²i ] = σ² < ∞, i = 1, 2, . . ..

These four assumptions are nearly identical to the four used to establish the asymptotic
normality of the OLS estimator. The IV estimator is defined

β̂ IVn = (Z′X/n)⁻¹(Z′y/n)                                                    (3.92)
where the n term is present to describe the number of observations used in the IV estima-
tor. The asymptotic properties are easy to establish and are essentially identical to those of
the OLS estimator.
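A sketch of the IV estimator in eq. (3.92) with its heteroskedasticity robust covariance (illustrative names; it assumes zi has the same dimension as xi , the just-identified case):

import numpy as np

def iv_estimator(y, X, Z):
    """IV estimator, eq. (3.92), with robust standard errors."""
    n = X.shape[0]
    beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)     # (Z'X)^{-1} Z'y
    eps = y - X @ beta_iv                           # IV residuals
    zs = Z * eps[:, None]
    S = zs.T @ zs / n                               # estimate of V[n^{-1/2} Z'eps]
    szx_inv = np.linalg.inv(Z.T @ X / n)            # inverse of Sigma_ZX hat
    avar = szx_inv @ S @ szx_inv.T                  # asymptotic covariance of sqrt(n)(b - beta)
    std_err = np.sqrt(np.diag(avar) / n)
    return beta_iv, std_err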

Theorem 3.53 (Consistency of the IV Estimator). Under assumptions 3.9 and 3.49-3.51, the
IV estimator is consistent,
β̂ IVn →p β

and asymptotically normal

√n (β̂ IVn − β) →d N(0, ΣZX⁻¹ S̈ (Σ′ZX)⁻¹)                                     (3.93)

where ΣZX = E[x′i zi ] and S̈ = V[n^{−1/2} Z′ε].

Additionally, consistent estimators are available for the components of the asymptotic
variance.

Theorem 3.54 (Asymptotic Normality of the IV Estimator). Under assumptions 3.9 and 3.49
- 3.52,

Σ̂ZX = n⁻¹Z′X →p ΣZX                                                        (3.94)

S̈ˆ = n⁻¹ Σ_{i=1}^{n} ε̂²i z′i zi →p S̈                                        (3.95)

and

Σ̂ZX⁻¹ S̈ˆ (Σ̂′ZX)⁻¹ →p ΣZX⁻¹ S̈ (Σ′ZX)⁻¹                                       (3.96)

The asymptotic variance can be easily computed from

Σ̂ZX⁻¹ S̈ˆ (Σ̂′ZX)⁻¹ = n (Z′X)⁻¹ ( Σ_{i=1}^{n} ε̂²i z′i zi ) (X′Z)⁻¹             (3.97)
                  = n (Z′X)⁻¹ Z′ÊZ (X′Z)⁻¹