Linear and Nonlinear Optimization, Second Edition

Linear and Nonlinear Optimization Linear and Nonlinear Optimization SECOND EDITION Igor Griva Stephen G. Nash Ariela

2,585 281 3MB

Pages 765 Page size 400 x 586.5 pts

Report DMCA / Copyright

DOWNLOAD FILE

Recommend Papers

Linear and Nonlinear Optimization, Second Edition

Linear and Nonlinear Optimization Linear and Nonlinear Optimization SECOND EDITION Igor Griva Stephen G. Nash Ariela

422 68 3MB Read more

Linear and Nonlinear Optimization

SECOND EDITION Igor Griva Stephen G. Nash Ariela Sofer George Mason University Fairfax, Virginia Society for Indust

638 225 3MB Read more

Linear and nonlinear optimization

SECOND EDITION Igor Griva Stephen G. Nash Ariela Sofer George Mason University Fairfax, Virginia Society for Indust

1,331 811 3MB Read more

Linear and Nonlinear Optimization,

Linear and Nonlinear Optimization Linear and Nonlinear Optimization SECOND EDITION Igor Griva Stephen G. Nash Ariela

460 51 7MB Read more

Linear and nonlinear optimization

SECOND EDITION Igor Griva Stephen G. Nash Ariela Sofer George Mason University Fairfax, Virginia Society for Indust

865 82 3MB Read more

Linear Control Theory: Structure, Robustness, and Optimization

Linear Control Theory Structure, Robustness, and Optimization AUTOMATION AND CONTROL ENGINEERING A Series of Reference

860 130 7MB Read more

Linear control theory: Structure, robustness, and optimization

Linear Control Theory Structure, Robustness, and Optimization AUTOMATION AND CONTROL ENGINEERING A Series of Reference

942 133 7MB Read more

Introduction to Nonlinear and Global Optimization (Springer Optimization and Its Applications)

INTRODUCTION TO NONLINEAR AND GLOBAL OPTIMIZATION Springer Optimization and Its Applications VOLUME 37 Managing Editor

283 49 6MB Read more

Optimization Concepts and Applications in Engineering, Second Edition

OPTIM IZATION CONCEPTS AND APPLICATIONS IN ENGINEERING Second Edition It is vitally important to meet or exceed previous

800 108 2MB Read more

Introduction to Applied Optimization, Second Edition (Springer Optimization and Its Applications)

INTRODUCTION TO APPLIED OPTIMIZATION Springer Optimization and Its Applications VOLUME 22 Managing Editor Panos M. Par

257 19 4MB Read more

File loading please wait...

Citation preview

Linear and Nonlinear Optimization

Linear and Nonlinear Optimization SECOND EDITION

Igor Griva Stephen G. Nash Ariela Sofer George Mason University Fairfax, Virginia

Society for Industrial and Applied Mathematics • Philadelphia

Copyright © 2009 by the Society for Industrial and Applied Mathematics. 10 9 8 7 6 5 4 3 2 1 All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 Market Street, 6th Floor, Philadelphia, PA 19104-2688 USA. Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended. Excel is a trademark of Microsoft Corporation in the United States and/or other countries. Mathematica is a registered trademark of Wolfram Research, Inc. MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA, 508-647-7000, Fax: 508-647-7001, [email protected], www.mathworks.com. SAS is a registered trademark of SAS Institute Inc. Cover image of the Golden Gate Bridge used with permission of istockphoto.com. Cover designed by Galina Spivak. Library of Congress Cataloging-in-Publication Data Griva, Igor. Linear and nonlinear optimization / Igor Griva, Stephen G. Nash, Ariela Sofer. -- 2nd ed. p. cm. Includes bibliographical references and index. ISBN 978-0-898716-61-0 1. Linear programming. 2. Nonlinear programming. I. Nash, Stephen (Stephen G.) II. Sofer, Ariela. III. Title. T57.74.G75 2008 519.7'2--dc22 2008032477

is a registered trademark.

i

i

i

book 2008/10/23 page v i

Contents Preface

I 1

2

xiii

Basics

1

Optimization Models 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Optimization: An Informal Introduction . . . . . . . . . . . . . . 1.3 Linear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Linear Optimization . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 Least-Squares Data Fitting . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Nonlinear Optimization . . . . . . . . . . . . . . . . . . . . . . . 1.7 Optimization Applications . . . . . . . . . . . . . . . . . . . . . . 1.7.1 Crew Scheduling and Fleet Scheduling . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.2 Support Vector Machines . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.3 Portfolio Optimization . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.4 Intensity Modulated Radiation Treatment Planning . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.5 Positron Emission Tomography Image Reconstruction Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.6 Shape Optimization . . . . . . . . . . . . . . . . . . 1.8 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

3 3 4 7 10 12 12 14 14 18 18 22 22 24 25 27 28 31 32 34 35 40

Fundamentals of Optimization 2.1 Introduction . . . . . . . . . . . . . . . 2.2 Feasibility and Optimality . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . 2.3 Convexity . . . . . . . . . . . . . . . . 2.3.1 Derivatives and Convexity .

. . . . .

. . . . .

43 43 43 47 48 50

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

v

i

i i

i

i

i

i

vi

3

II 4

5

book 2008/10/23 page vi i

Contents Exercises . . . . . . . . . . . . . . . . . . . . . . . 2.4 The General Optimization Algorithm . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . 2.5 Rates of Convergence . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . 2.6 Taylor Series . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . 2.7 Newton’s Method for Nonlinear Equations . 2.7.1 Systems of Nonlinear Equations Exercises . . . . . . . . . . . . . . . . . . . . . . . 2.8 Notes . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

52 54 58 58 61 62 65 67 72 74 76

Representation of Linear Constraints 3.1 Basic Concepts . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . 3.2 Null and Range Spaces . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . 3.3 Generating Null-Space Matrices . . . . . . 3.3.1 Variable Reduction Method . 3.3.2 Orthogonal Projection Matrix 3.3.3 Other Projections . . . . . . 3.3.4 The QR Factorization . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . 3.4 Notes . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

77 77 82 82 84 86 86 89 90 90 91 93

. . . . . . . . . . .

Linear Programming Geometry of Linear Programming 4.1 Introduction . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . 4.2 Standard Form . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . 4.3 Basic Solutions and Extreme Points . . . Exercises . . . . . . . . . . . . . . . . . . . . . 4.4 Representation of Solutions; Optimality Exercises . . . . . . . . . . . . . . . . . . . . . 4.5 Notes . . . . . . . . . . . . . . . . . . .

95

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

97 97 98 100 105 106 114 117 123 124

The Simplex Method 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 5.2 The Simplex Method . . . . . . . . . . . . . . . . . . . 5.2.1 General Formulas . . . . . . . . . . . . . . 5.2.2 Unbounded Problems . . . . . . . . . . . . 5.2.3 Notation for the Simplex Method (Tableaus) 5.2.4 Deficiencies of the Tableau . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

125 125 126 129 134 135 139

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

i

i i

i

i

i

i

Contents

6

7

book 2008/10/23 page vii i

vii

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 The Simplex Method (Details) . . . . . . . . . . . . . 5.3.1 Multiple Solutions . . . . . . . . . . . . . 5.3.2 Feasible Directions and Edge Directions . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Getting Started—Artificial Variables . . . . . . . . . . 5.4.1 The Two-Phase Method . . . . . . . . . . 5.4.2 The Big-M Method . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5 Degeneracy and Termination . . . . . . . . . . . . . . 5.5.1 Resolving Degeneracy Using Perturbation Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

141 144 144 145 148 149 150 156 159 162 167 170 171

Duality and Sensitivity 6.1 The Dual Problem . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . 6.2 Duality Theory . . . . . . . . . . . . . 6.2.1 Complementary Slackness 6.2.2 Interpretation of the Dual Exercises . . . . . . . . . . . . . . . . . . . . 6.3 The Dual Simplex Method . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . 6.4 Sensitivity . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . 6.5 Parametric Linear Programming . . . . Exercises . . . . . . . . . . . . . . . . . . . . 6.6 Notes . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

173 173 177 179 182 184 185 189 194 195 201 204 210 211

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

213 213 214 221 222 227 227 238 240 240 248 256 259 260 264 265 266

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

Enhancements of the Simplex Method 7.1 Introduction . . . . . . . . . . . . . . . . . . . . 7.2 Problems with Upper Bounds . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3 Column Generation . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 The Decomposition Principle . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5 Representation of the Basis . . . . . . . . . . . . 7.5.1 The Product Form of the Inverse . . 7.5.2 Representation of the Basis—The LU Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6 Numerical Stability and Computational Efficiency 7.6.1 Pricing . . . . . . . . . . . . . . . . 7.6.2 The Initial Basis . . . . . . . . . . . 7.6.3 Tolerances; Degeneracy . . . . . . . 7.6.4 Scaling . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

i

i i

i

i

i

i

viii

Contents 7.6.5 Preprocessing 7.6.6 Model Formats Exercises . . . . . . . . . . . . . . 7.7 Notes . . . . . . . . . . . .

8

9

10

book 2008/10/23 page viii i

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

267 268 269 270

Network Problems 8.1 Introduction . . . . . . . . . . 8.2 Basic Concepts and Examples . Exercises . . . . . . . . . . . . . . . . 8.3 Representation of the Basis . . Exercises . . . . . . . . . . . . . . . . 8.4 The Network Simplex Method Exercises . . . . . . . . . . . . . . . . 8.5 Resolving Degeneracy . . . . . Exercises . . . . . . . . . . . . . . . . 8.6 Notes . . . . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

271 271 271 280 280 287 287 294 295 299 299

Computational Complexity of Linear Programming 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . 9.2 Computational Complexity . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3 Worst-Case Behavior of the Simplex Method . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4 The Ellipsoid Method . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5 The Average-Case Behavior of the Simplex Method 9.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

301 301 302 304 305 308 308 313 314 316

Interior-Point Methods for Linear Programming 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 The Primal-Dual Interior-Point Method . . . . . . . . . . . . . 10.2.1 Computational Aspects of Interior-Point Methods 10.2.2 The Predictor-Corrector Algorithm . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3 Feasibility and Self-Dual Formulations . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4 Some Concepts from Nonlinear Optimization . . . . . . . . . 10.5 Affine-Scaling Methods . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6 Path-Following Methods . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

319 319 321 328 329 330 331 334 334 336 343 344 352 353

i

i i

i

i

i

i

Contents

ix

III Unconstrained Optimization 11

12

13

Basics of Unconstrained Optimization 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . 11.2 Optimality Conditions . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Newton’s Method for Minimization . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.4 Guaranteeing Descent . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.5 Guaranteeing Convergence: Line Search Methods . 11.5.1 Other Line Searches . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.6 Guaranteeing Convergence: Trust-Region Methods Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . .

book 2008/10/23 page ix i

355

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

357 357 357 361 364 369 371 374 375 381 385 391 398 399

Methods for Unconstrained Optimization 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 12.2 Steepest-Descent Method . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4 Automating Derivative Calculations . . . . . . . . . . . . 12.4.1 Finite-Difference Derivative Estimates . . . 12.4.2 Automatic Differentiation . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.5 Methods That Do Not Require Derivatives . . . . . . . . 12.5.1 Simulation-Based Optimization . . . . . . . 12.5.2 Compass Search: A Derivative-Free Method 12.5.3 Convergence of Compass Search . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.6 Termination Rules . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.7 Historical Background . . . . . . . . . . . . . . . . . . . 12.8 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

401 401 402 408 411 420 422 422 426 429 431 432 434 437 440 441 445 446 448

Low-Storage Methods for Unconstrained Problems 13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.2 The Conjugate-Gradient Method for Solving Linear Equations Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.3 Truncated-Newton Methods . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.4 Nonlinear Conjugate-Gradient Methods . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.5 Limited-Memory Quasi-Newton Methods . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

451 451 452 459 460 465 466 469 470

. . . . . . . . . . . . .

. . . . . . . . . . . . .

i

i i

i

i

i

i

x

Contents Exercises . . . . . . . . 13.6 Preconditioning Exercises . . . . . . . . 13.7 Notes . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

IV Nonlinear Optimization 14

15

book 2008/10/23 page x i

473 474 477 478

481

Optimality Conditions for Constrained Problems 14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.2 Optimality Conditions for Linear Equality Constraints . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.3 The Lagrange Multipliers and the Lagrangian Function . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.4 Optimality Conditions for Linear Inequality Constraints . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.5 Optimality Conditions for Nonlinear Constraints . . . . . . . . 14.5.1 Statement of Optimality Conditions . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.6 Preview of Methods . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.7 Derivation of Optimality Conditions for Nonlinear Constraints Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.8 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.8.1 Games and Min-Max Duality . . . . . . . . . . . 14.8.2 Lagrangian Duality . . . . . . . . . . . . . . . . 14.8.3 Wolfe Duality . . . . . . . . . . . . . . . . . . . 14.8.4 More on the Dual Function . . . . . . . . . . . . 14.8.5 Duality in Support Vector Machines . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.9 Historical Background . . . . . . . . . . . . . . . . . . . . . . 14.10 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

483 483 484 489 491 493 494 501 502 503 508 510 514 515 520 522 523 526 532 534 538 542 543 546

Feasible-Point Methods 15.1 Introduction . . . . . . . . . . . . . 15.2 Linear Equality Constraints . . . . . Exercises . . . . . . . . . . . . . . . . . . . 15.3 Computing the Lagrange Multipliers Exercises . . . . . . . . . . . . . . . . . . . 15.4 Linear Inequality Constraints . . . . 15.4.1 Linear Programming . . Exercises . . . . . . . . . . . . . . . . . . . 15.5 Sequential Quadratic Programming . Exercises . . . . . . . . . . . . . . . . . . . 15.6 Reduced-Gradient Methods . . . . . Exercises . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

549 549 549 555 556 561 563 570 572 573 580 581 588

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

i

i i

i

i

i

i

Contents

book 2008/10/23 page xi i

xi

15.7 Filter Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597 15.8 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598 16

V A

Penalty and Barrier Methods 16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.2 Classical Penalty and Barrier Methods . . . . . . . . . . . . . . . . 16.2.1 Barrier Methods . . . . . . . . . . . . . . . . . . . . . 16.2.2 Penalty Methods . . . . . . . . . . . . . . . . . . . . . 16.2.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.3 Ill-Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.4 Stabilized Penalty and Barrier Methods . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.5 Exact Penalty Methods . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.6 Multiplier-Based Methods . . . . . . . . . . . . . . . . . . . . . . . 16.6.1 Dual Interpretation . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.7 Nonlinear Primal-Dual Methods . . . . . . . . . . . . . . . . . . . . 16.7.1 Primal-Dual Interior-Point Methods . . . . . . . . . . . 16.7.2 Convergence of the Primal-Dual Interior-Point Method Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.8 Semidefinite Programming . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.9 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

Appendices Topics from Linear Algebra A.1 Introduction . . . . . . . . . . . . . . . . . . . . . A.2 Eigenvalues . . . . . . . . . . . . . . . . . . . . . A.3 Vector and Matrix Norms . . . . . . . . . . . . . . A.4 Systems of Linear Equations . . . . . . . . . . . . A.5 Solving Systems of Linear Equations by Elimination A.6 Gaussian Elimination as a Matrix Factorization . . . A.6.1 Sparse Matrix Storage . . . . . . . . . A.7 Other Matrix Factorizations . . . . . . . . . . . . . A.7.1 Positive-Definite Matrices . . . . . . . A.7.2 The LDLT and Cholesky Factorizations A.7.3 An Orthogonal Matrix Factorization . . A.8 Sensitivity (Conditioning) . . . . . . . . . . . . . . A.9 The Sherman–Morrison Formula . . . . . . . . . . A.10 Notes . . . . . . . . . . . . . . . . . . . . . . . . .

601 601 602 603 610 613 617 618 619 623 623 626 626 635 638 640 641 645 647 649 654 656

659

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

661 661 661 662 664 666 669 675 676 677 678 681 683 686 688

i

i i

i

i

i

i

xii B

C

book 2008/10/23 page xii i

Contents Other Fundamentals B.1 Introduction . . . . . . . . . . . . . . . . . . . B.2 Computer Arithmetic . . . . . . . . . . . . . . B.3 Big-O Notation, O(·) . . . . . . . . . . . . . . B.4 The Gradient, Hessian, and Jacobian . . . . . . B.5 Gradient and Hessian of a Quadratic Function . B.6 Derivatives of a Product . . . . . . . . . . . . . B.7 The Chain Rule . . . . . . . . . . . . . . . . . B.8 Continuous Functions; Closed and Bounded Sets B.9 The Implicit Function Theorem . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

691 691 691 693 694 696 697 698 699 700

Software 703 C.1 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703

Bibliography

707

Index

727

i

i i

i

i

i

i

book 2008/10/23 page xiii i

Preface This book provides an introduction to the applications, theory, and algorithms of linear and nonlinear optimization. The emphasis is on practical aspects—modern algorithms, as well as the influence of theory on the interpretation of solutions or on the design of software. Two important goals of this book are to present linear and nonlinear optimization in an integrated setting, and to incorporate up-to-date interior-point methods in linear and nonlinear optimization. As an illustration of this unified approach, almost every algorithm in this book is presented in the form of a General Optimization Algorithm. This algorithm has two major steps: an optimality test, and a step that improves the estimate of the solution. This framework is general enough to encompass the simplex method and various interior-point methods for linear programming, as well as Newton’s method and active-set methods for nonlinear optimization. The optimality test in this algorithm motivates the discussion of optimality conditions for a variety of problems. The step procedure motivates the discussion of feasible directions (for constrained problems) and Newton’s method and its variants (for nonlinear problems). In general, there is an attempt to develop the material from a small number of basic concepts, emphasizing the interrelationships among the many topics. Our hope is that, by emphasizing a few fundamental principles, it will be easier to understand and assimilate the vast panorama of linear and nonlinear optimization. We have attempted to make accessible a number of topics that are not often found in textbooks. Within linear programming, we have emphasized the importance of sparse matrices on the design of algorithms, described computational techniques used in sophisticated software packages, and derived the primal-dual interior-point method together with the predictor-corrector technique. Within nonlinear optimization, we have included discussions of truncated-Newton methods for large problems, convergence theory for trust-region methods, filter methods, and techniques for alleviating the ill-conditioning in barrier methods. We hope that the book serves as a useful introduction to research papers in these areas. The book was designed for use in courses and course sequences that discuss both linear and nonlinear optimization. We have used consistent approaches when discussing the two topics, often using the same terminology and notation in order to emphasize the similarities between the two topics. However, it can also be used in traditional (and separate) courses in Linear Programming and Nonlinear Optimization—in fact, that is the way we use it in the courses that we teach. At the end of this preface are chapter descriptions and course outlines indicating these possibilities. xiii

i

i i

i

i

i

i

xiv

book 2008/10/23 page xiv i

Preface

We have also used the book for more advanced courses. The later chapters (and the later sections within chapters) contain a great deal of material that would be difficult to cover in an introductory course. The Notes at the ends of many sections contain pointers to research papers and other references, and it would be straightforward to use such materials to supplement the book. The book is divided into four parts plus appendices. Part I (Basics) contains material that might be used in a number of different topics. It is not intended that all of this material be presented in the classroom. Some of it might be irrelevant (as the sample course outlines illustrate). In other cases, material might be familiar to the students from other courses, or simple enough to be assigned as a reading exercise. The material in Part I could also be taught in stages, as it is needed. In a course on Nonlinear Optimization, for example, Chapter 4 (Representation of Linear Constraints) could be delayed until after Part III (Unconstrained Optimization). Our intention in designing Part I was to make the book as flexible as possible, and instructors should feel free to exploit this flexibility. Part II (Linear Programming) and Part III (Unconstrained Optimization) are independent of each other. Either one could be taught or read before the other. In addition, it is not necessary to cover Part II before going on to Part IV (Nonlinear Optimization), although the material in Part IV will benefit from an understanding of Linear Programming. The material in the appendices may already be familiar. If not, it could either be presented in class or left for students to read independently. Many sections in the book can be omitted without interrupting the flow of the discussions (detailed information on this is given below). Proofs of theorems and lemmas can similarly be omitted. Roughly speaking, it is possible to skip later sections within a chapter and later chapters within a part and move on to later chapters in the book. The book was organized in this way so that it would be accessible to a wider audience, as well as to increase its flexibility. Many of the exercises are computational. In some cases, pencil-and-paper techniques would suffice, but the use of a computer is recommended. We have not specified how the computer might be used, and we leave this up to the instructor. In courses with an emphasis on modeling, a specialized linear or nonlinear optimization package might be appropriate. In other courses, the students might be asked to program algorithms themselves. We leave these decisions up to the instructor. Some information about software packages can be found in Appendix C. In addition, some exercises depend on auxiliary data sets that can be found on the web site for the book: http://www.siam.org/books/ot108 In our own classes, we use the MATLAB® software package for class demonstrations and homework assignments. It allows us to demonstrate a great many techniques easily, and it allows students to program individual algorithms without much difficulty. It also includes (in its toolboxes) prepared algorithms for many of the optimization problems that we discuss. We have gone to considerable effort to ensure the accuracy of the material in this book. Even so, we expect that some errors remain. For this reason, we have set up an online page for errata. It can be obtained at the book Web site.

i

i i

i

i

i

i

Preface

book 2008/10/23 page xv i

xv

Using This Book This book is designed to be flexible. It can be read and taught in many different ways. The material in the appendices can be taught as needed, or left to the students to read independently. Also, all formally identified proofs can be omitted. Part II (Linear Programming) and Part III (Unconstrained Optimization) are independent of each other. Part II does not assume any knowledge of Calculus. Part IV (Nonlinear Optimization) does not assume that Part II has been read (with the exception of Section 14.4.1). The only “essential” chapters in Part II are Chapters 4 (Geometry of Linear Programming), 5 (The Simplex Method), and 6 (Duality). The only “essential” chapter in Part III is Chapter 11 (Basics of Unconstrained Optimization). The other chapters can be skipped. We now describe the chapters individually, pointing out various ways they can be used. The sample course outlines that follow indicate how chapters might be selected to construct individual courses (based on a 15-week semester).1 Part I: Basics • Chapter 1: Optimization Models. This chapter is self-contained and describes a variety of optimization models. Sections 1.3–1.5 are independent of one another. Section 1.6 includes more realistic models and assumes that the reader is familiar with the basic models described in the earlier sections. The subsections of Section 1.6 are independent of one another. • Chapter 2: Fundamentals of Optimization. For Part II, only Sections 2.1–2.4 are needed (and Section 2.3.1 can be omitted). For Parts III and IV the whole chapter is relevant. • Chapter 3: Representation of Linear Constraints. Sections 3.3.2–3.3.4 can be omitted (although Section 3.3.2 is needed for Part IV). This chapter is only relevant to Parts II and IV; it is not needed for Part III. Part II: Linear Programming • Chapter 4: Geometry of Linear Programming. All sections of this chapter are needed in Part II. • Chapter 5: The Simplex Method. Sections 5.1 and 5.2 are the most important. How the rest of the chapter is used depends on the goals of the instructor, in particular with regard to tableaus. In a number of examples, we use the full simplex tableau to display data for linear programs. Thus, it is necessary to be able to read these tableaus to extract information. This is the only use we make of the tableaus elsewhere in the book. It is not necessary to be able to manipulate these tableaus. 1 Throughout the book, the number of a section or subsection begins with the chapter number. That is, Section 10.3 refers to the third section in Chapter 10, and Section 16.7.2 refers to the second subsection in the seventh section of Chapter 16. Also, a reference to Appendix A.9 refers to the ninth section of Appendix A. A similar system is used for tables, examples, theorems, etc.; Figure 8.10 refers to the tenth figure in Chapter 8, for example. For exercises, however, the chapter number is omitted, e.g., Exercise 4.7 is the seventh exercise in Section 4 of the current chapter (unless another chapter is specified).

i

i i

i

i

i

i

xvi

book 2008/10/23 page xvi i

Preface • Chapter 6: Duality and Sensitivity. Sections 6.1 and 6.2 are the most important. The remaining sections can be skipped, if desired. If taught, we recommend that Sections 6.3–6.5 be taught in order, although Section 6.3 is only used in a minor way in the remaining two sections. It would be possible to stop after any section. Note: The remaining chapters in Part II are independent of each other. • Chapter 7: Enhancements of the Simplex Method. The sections in this chapter are independent of each other. The instructor is free to pick and choose material, with one partial exception: the discussion of the decomposition principle is easier to understand if column generation has already been read. • Chapter 8: Network Problems. In this chapter, the sections must be taught in order. It would be possible to stop after any section. • Chapter 9: Computational Complexity of Linear Programming. The first two sections contain basic material used in Sections 9.3–9.5. Ideally, the remaining sections should be taught in order, although Sections 9.4 and 9.5 are independent of each other. Even if some topics are not of interest, at least the introductory paragraphs of each section should be read. (Section 9.5 requires some knowledge of statistics.) • Chapter 10: Interior-Point Methods for Linear Programming. Sections 10.1 and 10.2 are the most important. The later sections could be skipped but, if taught, Sections 10.4–10.6 should be taught in order. Section 10.4 reviews some fundamental concepts from nonlinear optimization needed in Sections 10.5–10.6.

Part III: Unconstrained Optimization • Chapter 11: Basics of Unconstrained Optimization. We recommend reading all of this chapter (with the exception of the proofs). If desired, either Section 11.5 or Section 11.6 could be omitted, but not both. Chapters 12 and 13 could be omitted. Chapter 13 makes more sense if taught after Chapter 12, but in fact, only Section 13.5 makes explicit use of the material in Chapter 12. • Chapter 12: Methods for Unconstrained Optimization. Sections 12.1–12.3 are the most important. All the remaining sections and subsections can be taught independently of each other. • Chapter 13: Low-Storage Methods for Unconstrained Problems. Once Sections 13.1 and 13.2 have been taught, the remaining sections are independent of each other. Part IV: Nonlinear Optimization • Chapter 14: Optimality Conditions for Constrained Problems. We recommend reading Sections 14.1–14.6. The rest of the chapter may be omitted. Within Section 14.8, Sections 14.8.3 and 14.8.5 can be taught without teaching the remaining subsections, although Section 14.8.5 depends on Section 14.8.3. (The discussion of nonlinear duality in Section 14.8 is only needed in Sections 16.6–16.8 of Chapter 16.) • Chapter 15: Feasible-Point Methods. We recommend reading Sections 15.1–15.4 (although Section 15.4.1 could be omitted). These sections explain how to solve problems with linear constraints. Sections 15.5–15.7 discuss methods for problems

i

i i

i

i

i

i

Preface

book 2008/10/23 page xvii i

xvii

with nonlinear constraints. Sections 15.5 and 15.6 are independent of each other, but Section 15.7 depends on Section 15.5. • Chapter 16: Penalty and Barrier Methods. We recommend reading Sections 16.1 and 16.2 (although Section 16.2.3 could be omitted). If more of the chapter is covered, then Section 16.3 should be read. Sections 16.4–16.8 are independent of each other. Sections 16.6–16.8 use Section 14.8.3 of Chapter 14.

Changes in the Second Edition The overall structure of the book has not changed in the new addition, and the major topic areas are the same. However, we have updated certain topics to reflect developments since the first edition appeared. We list the major changes here. Chapter 1 has been expanded to include examples of more realistic optimization models (Section 1.6). The description of interior-point methods for linear programming has been thoroughly revised and restructured (Chapter 10). The discussion of derivative-free methods has been extensively revised to reflect advances in theory and algorithms (Section 12.5). In Part IV we have added material on filter methods (Section 15.7), nonlinear primaldual methods (Section 16.7), and semidefinite programming (Section 16.8). In addition, numerous smaller changes have been made throughout the book. Some material from the first edition has been omitted here. The most notable examples are the chapter on nonlinear least-squares data fitting, and the sections on interior-point methods for convex programming. These topics from the first edition are available at the book Web site (see above for the URL). Sample Course Outlines We provide below some sample outlines for courses that might use this book. If a section is listed without mention of subsections, then it is assumed that all the subsections will be taught. If a subsection is specified, then the unmentioned subsections may be omitted. Proposed Course Outline: Linear Programming I: Foundations Chapter 1. Optimization Models 1. Introduction 3. Linear Equations 4. Linear Optimization 7. Optimization Applications 1. Crew Scheduling and Fleet Scheduling Chapter 2. Fundamentals of Optimization 1. Introduction 2. Feasibility and Optimality 3. Convexity 4. The General Optimization Algorithm

i

i i

i

i

i

i

xviii

book 2008/10/23 page xviii i

Preface

Chapter 3. Representation of Linear Constraints 1. Basic Concepts 2. Null and Range Spaces 3. Generating Null-Space Matrices 1. Variable Reduction Method II: Linear Programming Chapter 4. Geometry of Linear Programming 1. Introduction 2. Standard Form 3. Basic Solutions and Extreme Points 4. Representation of Solutions; Optimality Chapter 5. The Simplex Method 1. Introduction 2. The Simplex Method 3. The Simplex Method (Details) 4. Getting Started—Artificial Variables 1. The Two-Phase Method 5. Degeneracy and Termination Chapter 6. Duality and Sensitivity 1. The Dual Problem 2. Duality Theory 3. The Dual Simplex Method 4. Sensitivity Chapter 7. Enhancements of the Simplex Method 1. Introduction 2. Problems with Upper Bounds 3. Column Generation 5. Representation of the Basis Chapter 9. Computational Complexity of Linear Programming 1. Introduction 2. Computational Complexity 3. Worst-Case Behavior of the Simplex Method 4. The Ellipsoid Method 5. The Average-Case Behavior of the Simplex Method Chapter 10. Interior-Point Methods for Linear Programming 1. Introduction 2. The Primal-Dual Interior-Point Method Proposed Course Outline: Nonlinear Optimization I: Foundations Chapter 1. Optimization Models 1. Introduction

i

i i

i

i

i

i

Preface

book 2008/10/23 page xix i

xix

3. Linear Equations 5. Least-Squares Data Fitting 6. Nonlinear Optimization 7. Optimization Applications2 2. Support Vector Machines 3. Portfolio Optimization 4. Intensity Modulated Radiation Treatment Planning 5. Positron Emission Tomography Image Reconstruction 6. Shape Optimization Chapter 2. Fundamentals of Optimization 1. Introduction 2. Feasibility and Optimality 3. Convexity 4. The General Optimization Algorithm 5. Rates of Convergence 6. Taylor Series 7. Newton’s Method for Nonlinear Equations Chapter 3. Representation of Linear Constraints3 1. Basic Concepts 2. Null and Range Spaces 3. Generating Null-Space Matrices 1. Variable Reduction Method III: Unconstrained Optimization Chapter 11. Basics of Unconstrained Optimization 1. Introduction 2. Optimality Conditions 3. Newton’s Method for Minimization 4. Guaranteeing Descent 5. Guaranteeing Convergence: Line Search Methods 6. Guaranteeing Convergence: Trust-Region Methods Chapter 12. Methods for Unconstrained Optimization 1. Introduction 2. Steepest-Descent Method 3. Quasi-Newton Methods Chapter 13. Low-Storage Methods for Unconstrained Problems 1. Introduction 2. The Conjugate-Gradient Method for Solving Linear Equations 3. Truncated-Newton Methods 4. Nonlinear Conjugate-Gradient Methods 5. Limited-Memory Quasi-Newton Methods 2 Not 3 The

all the applications need be taught. material in Chapter 3 is not needed until Part IV.

i

i i

i

i

i

i

xx

book 2008/10/23 page xx i

Preface

IV: Nonlinear Optimization Chapter 14. Optimality Conditions for Constrained Problems 1. Introduction 2. Optimality Conditions for Linear Equality Constraints 3. The Lagrange Multipliers and the Lagrangian Function 4. Optimality Conditions for Linear Inequality Constraints 5. Optimality Conditions for Nonlinear Constraints 6. Preview of Methods 8. Duality 3. Wolfe Duality 5. Duality in Support Vector Machines Chapter 15. Feasible-Point Methods 1. Introduction 2. Linear Equality Constraints 3. Computing the Lagrange Multipliers 4. Linear Inequality Constraints 5. Sequential Quadratic Programming Chapter 16. Penalty and Barrier Methods 1. Introduction 2. Classical Penalty and Barrier Methods Proposed Course Outline: Introduction to Optimization I: Foundations Chapter 1. Optimization Models 1. Introduction 3. Linear Equations 4. Linear Optimization 5. Least-Squares Data Fitting 6. Nonlinear Optimization 7. Optimization Applications4 Chapter 2. Fundamentals of Optimization 1. Introduction 2. Feasibility and Optimality 3. Convexity 4. The General Optimization Algorithm 5. Rates of Convergence 6. Taylor Series 7. Newton’s Method for Nonlinear Equations Chapter 3. Representation of Linear Constraints 1. Basic Concepts 2. Null and Range Spaces 4 Not

all the applications need be taught.

i

i i

i

i

i

i

Preface

book 2008/10/23 page xxi i

xxi

3. Generating Null-Space Matrices 1. Variable Reduction Method II: Linear Programming Chapter 4. Geometry of Linear Programming 1. Introduction 2. Standard Form 3. Basic Solutions and Extreme Points 4. Representation of Solutions; Optimality Chapter 5. The Simplex Method 1. Introduction 2. The Simplex Method 3. The Simplex Method (Details) 4. Getting Started—Artificial Variables 1. The Two-Phase Method 5. Degeneracy and Termination Chapter 6. Duality and Sensitivity 1. The Dual Problem 2. Duality Theory 4. Sensitivity Chapter 8. Network Problems 1. Introduction 2. Basic Concepts and Examples III: Unconstrained Optimization Chapter 11. Basics of Unconstrained Optimization 1. Introduction 2. Optimality Conditions 3. Newton’s Method for Minimization 4. Guaranteeing Descent 5. Guaranteeing Convergence: Line Search Methods IV: Nonlinear Optimization Chapter 14. Optimality Conditions for Constrained Problems 1. Introduction 2. Optimality Conditions for Linear Equality Constraints 3. The Lagrange Multipliers and the Lagrangian Function 4. Optimality Conditions for Linear Inequality Constraints 5. Optimality Conditions for Nonlinear Constraints 6. Preview of Methods

i

i i

i

i

i

i

xxii

book 2008/10/23 page xxii i

Preface

Acknowledgments We owe a great deal of thanks to the people who have assisted us in preparing this second edition of this book. In particular, we would like to thank the following individuals for reviewing various portions of the manuscript and providing helpful advice and guidance: ErlingAndersen, Bob Bixby, Sanjay Mehrotra, Hans Mittelmann, Michael Overton, Virginia Torczon, and Bob Vanderbei. We are especially grateful to Sara Murphy at SIAM for guiding us through the preparation of the manuscript. Special thanks also to Galina Spivak, whose design for the front cover skillfully conveys, in our minds, the spirit of the book. We continue to be grateful to those individuals who contributed to the preparation of the first edition. These include: Kurt Anstreicher, John Anzalone, Todd Beltracchi, Dimitri Bertsekas, Bob Bixby, Paul Boggs, Dennis Bricker, Tony Chan, Jessie Cohen, Andrew Conn, Blaine Crowthers, John Dennis, Peter Foellbach, John Forrest, Bob Fourer, Christoph Luitpold Frommel, Saul Gass, David Gay, James Ho, Sharon Holland, Jeffrey Horn, Soonam Kahng, Przemyslaw Kowalik, Michael Lewis, Lorin Lund, Irvin Lustig, Maureen Mackin, Eric Munson, Arkadii Nemirovsky, Florian Potra, Michael Rothkopf, Michael Saunders, David Shanno, Eric Smith, Martin Smith, Pete Stewart, André Tits, Michael Todd, Virginia Torczon, Luis Vicente, Don Wagner, Bing Wang, and Tjalling Ypma. While preparing the first edition, we received valuable support from the National Science Foundation. We also benefited from the facilities of the National Institute of Standards and Technology and Rice University. Igor Griva Stephen G. Nash Ariela Sofer

i

i i

i

i

i

i

book 2008/10/23 page 1 i

Part I

Basics

i

i i

i

i

book 2008/10/23 page 2 i

i

i

i

i

i

i

i

i

i

book 2008/10/23 page 3 i

Chapter 1

Optimization Models

1.1

Introduction

Optimization models attempt to express, in mathematical terms, the goal of solving a problem in the “best” way. That might mean running a business to maximize profit, minimize loss, maximize efficiency, or minimize risk. It might mean designing a bridge to minimize weight or maximize strength. It might mean selecting a flight plan for an aircraft to minimize time or fuel use. The desire to solve a problem in an optimal way is so common that optimization models arise in almost every area of application. They have even been used to explain the laws of nature, as in Fermat’s derivation of the law of refraction for light. Optimization models have been used for centuries, since their purpose is so appealing. In recent times they have come to be essential, as businesses become larger and more complicated, and as engineering designs become more ambitious. In many circumstances it is no longer possible, or economically feasible, for decisions to be made without the aid of such models. In a large, multinational corporation, for example, a minor percentage improvement in operations might lead to a multimillion dollar increase in profit, but achieving this improvement might require analyzing all divisions of the corporation, a gargantuan task. Likewise, it would be virtually impossible to design a new computer chip involving millions of transistors without the aid of such models. Such large models, with all the complexity and subtlety that they can represent, would be of little value if they could not be solved. The last few decades have witnessed astonishing improvements in computer hardware and software, and these advances have made optimization models a practical tool in business, science, and engineering. It is now possible to solve problems with thousands or even millions of variables. The theory and algorithms that make this possible form a large portion of this book. In the first part of this chapter we give some simple examples of optimization models. They are grouped in categories, where the divisions reflect the properties of the models as well as the differences in the techniques used to solve them. We include also a discussion of systems of linear equations, which are not normally considered to be optimization models. However, linear equations are often included as constraints in optimization models, and their solution is an important step in the solution of many optimization problems. 3

i

i i

i

i

i

i

4

book 2008/10/23 page 4 i

Chapter 1. Optimization Models

x2 3−

2−

1−

x

| 1

| 2

| 3

x1

Figure 1.1. Nonlinear optimization problem. The feasible set is the dark line. In the last section of this chapter we give some examples of applications of optimization. These examples reflect families of problems that are either in wide use, or—at the time of writing of this edition of the book—are subject of intense research. The examples reflect the tastes of the authors; by no means do they constitute a broad or representative sample of the myriad applications where optimization is in use today.

1.2

Optimization: An Informal Introduction

Consider the problem of finding the point on the line x1 + x2 = 2 that is closest to the point (2, 2)T (see Figure 1.1) . The problem can be written as minimize subject to

f (x) = (x1 − 2)2 + (x2 − 2)2 x1 + x2 = 2.

It is easy, of course, to see that the problem has an optimum at x = (1, 1)T. This problem is an example of an optimization problem. Optimization problems typically minimize or maximize a function f (called the objective function) in a set of points S (called the feasible set). Commonly, the feasible set is defined by some constraints on the variables. In this example our objective function is the nonlinear function f (x) = (x1 −2)2 +(x2 −2)2 , and the feasible set S is defined by a single linear constraint x1 +x2 = 2. The feasible set could also be defined by multiple constraints. An example is the problem minimize subject to

f (x) = x1 x12 ≤ x2 x12 + x22 ≤ 2.

The feasible set S for this problem is shown in Figure 1.2; it is easy to see that the optimal point is x = (−1, 1)T. It is possible to have an unconstrained optimization problem where

i

i i

i

i

i

i

1.2. Optimization: An Informal Introduction

book 2008/10/23 page 5 i

5

x2 −

x

| −2

1−

| −1

0

| 1

x1

−1 −

−2 −

Figure 1.2. Nonlinear optimization problem with inequality constraints. there are no constraints, as in the example minimize f (x) = (ex1 − 1)2 + (x2 − 1)2 . The feasible set S here is the entire two-dimensional space. The minimizer is x = (0, 1)T, since the function value is zero at this point and positive elsewhere. We see from these examples that the feasible set can be defined by equality constraints or inequality constraints or no constraints at all. The functions defining the objective function and the constraints may be linear or nonlinear. The examples above are nonlinear optimization problems since at least some of the functions involved are nonlinear. If the objective function and the constraints are all linear, the problem is a linear optimization problem or linear program. An example is the problem maximize subject to

f (x) = 2x1 + x2 x1 + x2 ≤ 1 x1 ≥ 0, x2 ≥ 0.

Figure 1.3 shows the feasible set. The optimal solution is clearly x = (1, 0)T. Consider now the nonlinear optimization problem maximize subject to

f (x) = (x1 + x2 )2 x1 x2 ≥ 0 −2 ≤ x1 ≤ 1 −2 ≤ x2 ≤ 1.

The feasible set is shown in Figure 1.4. The point xc = (1, 1)T has an objective value of f (xc ) = 4, which is a higher objective value than any of its “nearby” feasible points. It is therefore called a local optimizer. In contrast the point x = (−2, −2)T has an objective value f (x ) = 16 which is the best among all feasible points. It is called a global optimizer.

i

i i

i

i

i

i

6

book 2008/10/23 page 6 i

Chapter 1. Optimization Models

x2 2−

1

x 0

| 2

1

x1

−1 −

Figure 1.3. Linear optimization problem. The feasible region is shaded.

x2 2−

xc 1

−2

| −1

0

1

| 2

x1

−1 −

x

−2

Figure 1.4. Local and global solutions. The feasible region is shaded. The methods we consider in this book focus on finding local optima. We will usually assume that the problem functions and their first and second derivatives are continuous. We can then use derivative information at a given point to anticipate the behavior of the problem functions at “nearby” points and use this to determine whether the point is a local solution and if not, to find a better point. The derivative information cannot usually anticipate the behavior of the functions at points “farther away,” and hence cannot determine whether a local solution is also the global solution. One exception is when the problem solved is a convex optimization problem, in which any local optimizer is also a global optimizer (see Section 2.3). Luckily, linear programs are convex so that for this important family of problems, local solutions are also global.

i

i i

i

i

i

i

1.3. Linear Equations

book 2008/10/23 page 7 i

7

It may seem odd to give so much attention to finding local optimizers when they are not always guaranteed to be global optimizers. However, most global optimization algorithms seek the global optimum by finding local solutions to a sequence of subproblems generated by some methodical approximation to the original problem; the techniques described in the book are suitable for these subproblems. In addition, for some applications a local solution may be sufficient, or the user might be satisfied with an improvement on the objective value. Of course, some applications require finding a global solution. The drawback is that for a problem that is not convex (or not known to be convex), finding a global solution can require substantially more computational effort than finding a local solution. Our book will also assume that the variables of the problems are continuous, that is, they can take a continuous range of real values. For this reason the problems we consider are also referred to as continuous optimization problems. Many variables such as length, volume, weight, and time are by nature continuous, and even though we cannot compute or measure them to infinite precision, it is plausible in the optimization to assume that they are continuous. On the other hand, variables such as the number of people to be hired, the number of flights to dispatch per day, or the number of new plants to be opened can assume only integer values. Problems where the variables can only take on integer values are called discrete optimization problems or, in the case where all problem functions are linear, integer programming problems. In a few applications it is sufficient to solve the problem ignoring the integrality restriction, and once a solution is obtained, to round off the variables to their nearest integer. Unfortunately rounding off of a solution does not guarantee that it is optimal, or even that it is feasible, so this approach is often inadequate. While a discussion of discrete optimization is beyond the scope of this book, we will mention that such problems are much harder than their continuous counterparts for much the same reason global optimization is harder than local optimization. Since at a given point we only have information of the behavior of the function at “nearby points,” there are no straightforward conditions that can determine whether a given feasible solution is optimal. Hence the solution process must rule out either explicitly or implicitly every other feasible solution. Thus the search for an integer solution requires the solution of a potentially large sequence of continuous optimization subproblems. Typically the first of these subproblems is a relaxed problem, in which the integrality requirement on each variable is relaxed (omitted) and replaced by a (continuous) constraint on the range of the variable. If, for example, a variable xj is restricted to be either 0, 1, or 2, the relaxed constraint would be 0 ≤ xj ≤ 2. Subsequent subproblems would typically include additional continuous constraints. The subproblems would be solved by continuous optimization methods such as those described in the book. Continuous optimization is the basis for the solution of many applied problems, both discrete and continuous, convex or nonconvex. The examples in this chapter reflect just a small fraction of such applications.

1.3

Linear Equations

Systems of linear equations are central to almost all optimization algorithms and form a part of a great many optimization models. They are used in this section to represent a datafitting example. A slight generalization of this example will lead to the important problem of least-squares data fitting. Linear equations are also used to represent constraints in a model.

i

i i

i

i

i

i

8

book 2008/10/23 page 8 i

Chapter 1. Optimization Models

Finally, solving systems of linear equations is an important step in the simplex method for linear programming and Newton’s method for nonlinear optimization, and is a technique used to determine dual variables (Lagrange multipliers) in both settings. In this chapter we only give examples of linear equations. Techniques for their solution are discussed in Appendix A. Our example is based on Figure 1.5. The points marked by • are assumed to lie on the graph of a quadratic function. These points, denoted by (ti , bi )T, have the coordinates (2, 1)T, (3, 6)T, and (5, 4)T. The quadratic function can be written as b(t) = x1 + x2 t + x3 t 2 , where x1 , x2 , and x3 are three unknown parameters that determine the quadratic. The three data points define three equations of the form b(ti ) = bi : x1 + x2 (2) + x3 (2)2 = 1 x1 + x2 (3) + x3 (3)2 = 6 x1 + x2 (5) + x3 (5)2 = 4 or x1 + 2x2 + 4x3 = 1 x1 + 3x2 + 9x3 = 6 x1 + 5x2 + 25x3 = 4. The solution is (x1 , x2 , x3 )T = (−21, 15, −2)T, or b(t) = −21 + 15t − 2t 2 , and is graphed in Figure 1.5. This approach to data fitting has many applications. It is not unique to fitting data by a quadratic function. If the data were thought to have some sort of periodic component (perhaps a daily fluctuation), then a more appropriate model might be b(t) = x1 + x2 t + x3 sin t, and the system of equations would have the form x1 + x2 (2) + x3 (sin 2) = 1 x1 + x2 (3) + x3 (sin 3) = 6 x1 + x2 (5) + x3 (sin 5) = 4. Also, there is nothing special about having three data points and three terms in the model. If we wish to associate the data-fitting problem with a system of linear equations, then the number of data points and the number of model terms must be the same. However, through the use of least-squares models (see Section 1.5), it would be possible to have more data points than model terms. In fact, this is often the case. Least-squares techniques are also appropriate if there are measurement errors in the data (also a common occurrence).

i

i i

i

i

i

i

1.3. Linear Equations

book 2008/10/23 page 9 i

9

10

8

6

4

b

2

0

–2

–4

–6

–8

–10

0

1

2

3

4

5

6

7

t

Figure 1.5. Fitting a quadratic function to data. Let us return to the example of the quadratic model. We can write the system of equations in matrix form as

or more generally,

⎛

1 1 1

2 3 5

4 9 25

1

t1

t12

t2 t3

t22 t32

⎝1 1

⎞ ⎠

x1 x2 x3

x1 x2 x3

1 = 6 , 4

=

b1 b2 b3

.

If there were n data points and the model were of the form b(t) = x1 + x2 t + · · · + xn t n−1 , then the system would have the form ⎛1 ⎜1 ⎜ ⎝ 1

t1 t2 tn

· · · t1n−1 ⎞ ⎛ x1 ⎞ ⎛ b1 ⎞ · · · t2n−1 ⎟ ⎜ x2 ⎟ ⎜ b2 ⎟ ⎟⎜ . ⎟ = ⎜ . ⎟. .. ⎠ ⎝ .. ⎠ ⎝ .. ⎠ . · · · tnn−1

xn

bn

We will often denote such a system of linear equations as Ax = b. For these examples the number of data points is equal to the number of variables. Equivalently the matrix A has the same number of rows and columns. We refer to this as a “square” system because of the shape of the matrix A. It is also possible to consider problems with unequal numbers of data points and variables. Such examples, called “rectangular,” are discussed in Section 1.5.

i

i i

i

i

i

i

10

book 2008/10/23 page 10 i

Chapter 1. Optimization Models

Table 1.1. Cabinet data. Cabinet Bookshelf With Doors With Drawers Custom

1.4

Wood 10 12 25 20

Labor 2 4 8 12

Revenue 100 150 200 400

Linear Optimization

A linear optimization model (also knows as a “linear program”) involves the optimization of a linear function subject to linear constraints on the variables. Although linear functions are simple functions, they arise frequently in economics, production planning, networks, scheduling, and other applications. We will consider several examples. Further examples are included in Section 1.7 and in Chapters 5–8. In particular, examples of network models are discussed in Section 8.2. Suppose that a manufacturer of kitchen cabinets is trying to maximize the weekly revenue of a factory. Various orders have come in that the company could accept. They include bookcases with open shelves, cabinets with doors, cabinets with drawers, and customdesigned cabinets. Table 1.1 indicates the quantities of materials and labor required to assemble the four types of cabinets, as well as the revenue earned. Suppose that 5000 units of wood and 1500 units of labor are available. Let x1 , . . . , x4 represent the number of cabinets of each type made (x1 for bookshelves, x2 for cabinets with doors, etc.). Then the corresponding linear programming model might be maximize

z = 100x1 + 150x2 + 200x3 + 400x4

subject to

10x1 + 12x2 + 25x3 + 20x4 ≤ 5000 2x1 + 4x2 + 8x3 + 12x4 ≤ 1500 x1 , x2 , x3 , x4 ≥ 0.

This problem can easily be expanded from four products (bookshelves, cabinets with doors, cabinets without doors, etc.) to any number of products n, and from two resources (wood and labor) to any number of resources m. Denoting the unit profit from product j by cj , the amount available of resource i by bi , and the amount of resource i used by a unit of product j by aij , the problem can be written in the form maximize subject to

z= n

j =1

n

cj xj

j =1

aij ≤ bi ,

xj ≥ 0,

i = 1, . . . , m j = 1, . . . , n.

The problem can be written in a more compact manner by introducing matrix-vector notation. Letting x = (x1 , . . . , xn )T, c = (c1 , . . . , cn )T, b = (b1 , . . . , bm )T, and denoting the matrix

i

i i

i

i

i

i

1.4. Linear Optimization

book 2008/10/23 page 11 i

11

Table 1.2. Work times (in minutes). Worker

Information

Policy

Claim

1 2 3 4 5

10 15 13 19 17

28 22 18 25 23

31 42 35 29 33

of coefficients aij by A, the problem becomes maximize subject to

z = cTx Ax ≤ b x ≥ 0.

This is a typical example of a linear program. Here a linear objective function is to be maximized subject to linear inequality constraints and nonnegativity constraints on the variables. In the general case, the objective of a linear program may be either maximized or minimized, the constraints may involve a combination of inequalities and equalities, and the variables may be either restricted in sign or unrestricted. Although these may appear as different forms, it is easy to convert from one form to another. As another example, consider the assignment of jobs to workers. Suppose that an insurance office handles three types of work: requests for information, new policies, and claims. There are five workers. Based on a study of office operations, the average work times (in minutes) for the workers are known; see Table 1.2. The company would like to minimize the overall elapsed time for handling a (long) sequence of tasks, by appropriately assigning a fraction of each type of task to each worker. Let pi be the fraction of information calls assigned to worker i, qi the fraction of new policy calls, and ri the fraction of claims; t will represent the elapsed time. Then a linear programming model for this situation would be minimize subject to

z=t p1 + p2 + p3 + p4 + p5 = 1 q1 + q2 + q3 + q4 + q5 = 1 r1 + r2 + r3 + r4 + r5 = 1 10p1 + 28q1 + 31r1 ≤ t 15p2 + 22q2 + 42r2 ≤ t 13p3 + 18q3 + 35r3 ≤ t 19p4 + 25q4 + 29r4 ≤ t 17p5 + 23q5 + 33r5 ≤ t pi , qi , ri ≥ 0, i = 1, . . . , 5.

The constraints in this model assure that t is no less than the overall elapsed time. Since the objective is to minimize t, at the optimal solution t will be equal to the elapsed time.

i

i i

i

i

i

i

12

book 2008/10/23 page 12 i

Chapter 1. Optimization Models

The problems we have introduced so far are small, involving only a handful of variables and constraints. Many real-life applications involve much larger problems, with possibly hundreds of thousands of variables and constraints. Section 1.7 discusses some of these applications.

Exercise5 4.1. Consider the production scheduling problem of the perfume Polly named after a famous celebrity. The manufacturer of the perfume must plan production for the first four months of the year and anticipates a demand of 4000, 5000, 6000, and 4500 gallons in January, February, March, and April, respectively. At the beginning of the year the company has an inventory of 2000 gallons. The company is planning on issuing a new and improved perfume called Pollygone in May, so that all Polly produced must be sold by the end of April. Assume that the production cost for January and February is $5 per gallon and this will rise to $5.5 per gallon in March and April. The company can hold any amount produced in a certain month over to the next month at an inventory cost of $1 per unit. Formulate a linear optimization model that will minimize the costs incurred in meeting the demand for Polly in the period January through April. Assume for simplicity that any amount produced in a given month may be used to fulfill demand for that month.

1.5

Least-Squares Data Fitting

Let us re-examine the quadratic model from Section 1.3: b(t) = x1 + x2 t + x3 t 2 . For the data points (2, 1), (3, 6), and (5, 4) we obtained the linear system 1 2 4 x1 1 1 3 9 x2 = 6 1 5 25 4 x3 with solution x = (−21, 15, −2)T so that b(t) = −21 + 15t − 2t 2 . It is easy to check that the three data points satisfy this equation. Suppose that the data points had been obtained from an experiment, with an observation made at times t1 = 2, t2 = 3, and t3 = 5. If another observation were made at t4 = 7, then (assuming that the quadratic model is correct) it should satisfy b(7) = −21 + 15 × 7 − 2 × 72 = −14. If the observed value at t4 = 7 were not equal to −14, then the observation would not be consistent with the model. 5 See

Footnote 1 in the Preface for an explanation of the Exercise numbering within chapters.

i

i i

i

i

i

i

1.5. Least-Squares Data Fitting

book 2008/10/23 page 13 i

13

It is common when collecting data to gather more data points than there are variables in the model. This is true in political polls where hundreds or thousands of people will be asked which candidate they plan to vote for (so that there is only one variable). It is also true in scientific experiments where repeated measurements will be made of a desired quantity. It is expected that each of the measurements will be in error, and that the observations will be used collectively in the hope of obtaining a better result than any individual measurement provides. (The collective result may only be better in the sense that the bound on its error will be smaller. Since the true value is often unknown, the actual errors cannot be measured.) Since each of the measurements is considered to be in error, it is no longer sensible to ask that the model equation (in our case b(t) = x1 + x2 t + x3 t 2 ) be solved exactly. Instead we will try to make components of the “residual vector” ⎛ ⎞ b1 − (x1 + x2 t1 + x3 t12 ) ⎜ b − (x + x t + x t 2 ) ⎟ 1 2 2 3 2 ⎟ ⎜ 2 r = b − Ax = ⎜ ⎟ .. ⎝ ⎠ . bm − (x1 + x2 tm + x3 tm2 ) small in some sense. The most commonly used approach is called “least squares” data fitting, where we try to minimize the sum of the squares of the components of r: minimize r12 + · · · + rm2 = x

m

[bi − (x1 + x2 ti + x3 ti2 )]2 .

i=1

Under appropriate assumptions about the errors in the observations, it can be shown that this is an optimal way of selecting the coefficients x. If the fourth data point was (7, −14)T, then the least-squares approach would give x = (−21, 15, −2)T, since this choice of x would make r = 0. In this case the graph of the model would pass through all four data points. However, if the fourth data point was (7, −15)T, then the least-squares solution would be x=

−21.9422 15.6193 −2.0892

.

The corresponding residual vector would be ⎛

⎞ 0.0603 ⎜ −0.1131 ⎟ r = b − Ax = ⎝ ⎠. 0.0754 −0.0226 None of the residuals is zero, and so the graph of the model does not pass through any of the data points. This is typical in least-squares models. If the residuals can be written as r = b − Ax, then the model is “linear.” This name is used because each of the coefficients xj occurs linearly in the model. It does not mean that the model terms are linear in t. In fact, the model above has a quadratic term x3 t 2 . Other

i

i i

i

i

i

i

14

book 2008/10/23 page 14 i

Chapter 1. Optimization Models

examples of linear models would be b(t) = x1 + x2 sin t + x3 sin 2t + · · · + xk+1 sin kt x2 b(t) = x1 + . 1 + t2 “Nonlinear” models are also possible. Some examples are b(t) = x1 + x2 ex3 t + x4 ex5 t x2 b(t) = x1 + . 1 + x3 t 2 In these models there are nonlinear relationships among the coefficients xj . A nonlinear least-squares model can be written in the form minimize f (x) =

m

ri (x)2 ,

i=1

where ri (x) represents the residual at ti . For example, ri (x) ≡ bi − (x1 + x2 ex3 ti + x4 ex5 ti ) for the first nonlinear model above. We can also write this as f (x) = r(x)Tr(x). If the model is linear, then r(x) = b − Ax and f (x) can be shown to be a quadratic function. See the Exercises. Nonlinear least squares models are examples of unconstrained minimization problems, that is, they correspond to the minimization of a nonlinear function without constraints on the variables. In fact, they are one of the most commonly encountered unconstrained minimization problems.

Exercises 5.1. Prove that for the linear least-squares problem with r(x) = b − Ax, the objective f (x) = r(x)Tr(x) is a quadratic function.

1.6

Nonlinear Optimization

A nonlinear optimization model (also referred to as a “nonlinear program”) consists of the optimization of a function subject to constraints, where any of the functions may be nonlinear. This is the most general type of model that we will consider in this book. It includes all the other types of models as special cases. Nonlinear optimization models arise often in science and engineering. For example, the volume of a sphere is a nonlinear function of its radius, the energy dissipated in an electric

i

i i

i

i

i

i

1.6. Nonlinear Optimization

book 2008/10/23 page 15 i

15

6 ( x2 , y2 ) 4 w2

( x1 , y1 ) w1

(x , y ) 0

2 w3 ( x3 , y3 )

0

w4 ( x4 , y4 ) 5

10

-2

-4

Figure 1.6. Electrical connections. circuit is a nonlinear function of the resistances, the size of an animal population is a nonlinear function of the birth and death rates, etc. We will develop two specific examples here. Suppose that four buildings are to be connected by electrical wires. The positions of the buildings are illustrated in Figure 1.6. The first two buildings are circular: one at (1, 4)T with radius 2, the second at (9, 5)T with radius 1. The third building is square with sides of length 2 centered at (3, −2)T. The fourth building is rectangular with height 4 and width 2 centered at (7, 0)T. The electrical wires will be joined at some central point (x0 , y0 )T and will connect to building i at position (xi , yi )T. The objective is to minimize the amount of wire used. Let wi be the length of the wire connecting building i to (x0 , y0 )T. A model for this problem is minimize subject to

z = w1 + w2 + w3 + w4 wi = (xi − x0 )2 + (yi − y0 )2 , i = 1, 2, 3, 4, (x1 − 1)2 + (y1 − 4)2 ≤ 4 (x2 − 9)2 + (y2 − 5)2 ≤ 1 2 ≤ x3 ≤ 4 −3 ≤ y3 ≤ −1 6 ≤ x4 ≤ 8 −2 ≤ y4 ≤ 2.

We assume here for simplicity that the wires can be routed through the buildings (if necessary) at no additional cost. The constraints in nonlinear optimization problems are often written so that the righthand sides are equal to zero. For the above model this would correspond to using constraints of the form wi − (xi − x0 )2 + (yi − y0 )2 = 0, i = 1, 2, 3, 4, and so forth. This is just a cosmetic change to the model.

i

i i

i

i

i

i

16

book 2008/10/23 page 16 i

Chapter 1. Optimization Models

h

r

Figure 1.7. Archimedes’ problem.

2 1

4

3 Figure 1.8. Traffic network. As a second example we consider a problem posed by Archimedes. Figure 1.7 illustrates a portion of a sphere with radius r, where the height of the spherical segment is h. The problem is to choose r and h so as to maximize the volume of the segment, but where the surface area A of the segment is fixed. The model is maximize subject to

v(r, h) = π h2 (r − h3 ) 2π rh = A.

Archimedes was able to prove that the solution was a hemisphere (i.e., h = r). As another illustration of how nonlinear models can arise, consider the network in Figure 1.8. This represents a set of road intersections, and the arrows indicate the direction of traffic. If few cars are on the roads, the travel times between intersections can be considered as constants, but if the traffic is heavy, the travel times can increase dramatically. Let us focus on the travel time between a pair of intersections i and j . Let ti,j be the (constant) travel time when the traffic is light, let xi,j be the number of cars entering the road per hour, let ci,j be the capacity of the road, that is, the maximum number of cars entering per hour, and let αi,j be a constant reflecting the rate at which travel time increases as the traffic get heavier. (The constant αi,j might be selected using data collected about the road system.)

i

i i

i

i

i

i

1.6. Nonlinear Optimization

book 2008/10/23 page 17 i

17

Then the travel time between intersections i and j could be modeled by Ti,j (xi,j ) = ti,j + αi,j

xi,j . 1 − xi,j /ci,j

If there is no traffic on the road (xi,j = 0), then the travel time is ti,j . If xi,j approaches the capacity of the road ci,j , then the travel time tends to +∞. Ti,j is a nonlinear function of xi,j . Suppose we wished to minimize the total travel time through the network for a volume of X cars per hour. Then our model would be minimize f (x) =

xi,j Ti,j (xi,j )

subject to the constraints x1,2 + x1,3 = X x2,3 + x2,4 − x1,2 = 0 x3,4 − x1,3 − x2,3 = 0 x2,4 + x3,4 = X 0 ≤ xi,j ≤ cij . The equations ensure that all cars entering an intersection also leave an intersection. The objective sums up the travel times for all the cars. A potential snag with this formulation is that if the traffic volume reaches capacity on any arc (xi,j = ci,j ), the objective function becomes undefined, which will cause optimization software to fail. A number of measures could be invoked to prevent this situation. One alternative is to slightly lower the upper bounds on the variables, so that xi,j ≤ ci,j − , where is a small positive number. Alternatively we could increase each denominator in the objective by a small positive amount , thus forcing the denominator to have a value of at least and thereby avoiding division by zero. Our last example is the problem of finding the minimum distance from a point r to the set {x : a Tx = b}. In two dimensions the points in the set S define a line, and in three dimensions they define a plane; in the more general case, the set is called a hyperplane. The least-distance problem can be written as minimize subject to

f (x) = 12 (x − r)T(x − r) a Tx = b.

(The coefficient of one half in the objective is included for convenience; it allows for simpler formulas when analyzing the problem.) Unlike most nonlinear problems this one has a closed-form solution. It is given by x=r+

b − a Tr a. a Ta

(See the Exercises for Section 14.2.)

i

i i

i

i

i

i

18

book 2008/10/23 page 18 i

Chapter 1. Optimization Models

The minimum distance problem is an example of a quadratic program. In general, a quadratic program involves the minimization of a quadratic function subject to linear constraints. An example is the problem minimize subject to

f (x) = 12 x TQx Ax ≥ b.

Quadratic programs for which the matrix Q is positive definite are relatively easy to solve, compared to other nonlinear problems.

1.7

Optimization Applications

In this section we present a number of applications that are of current interest to practitioners or researchers. The models we present are but a few of the numerous applications where optimization is making a significant impact. We start by presenting two problems arising in the optimization of airline operations— the crew scheduling and fleet scheduling problems. Both problems are large linear programs with the added restriction that the variables must take on integer values. Next we discuss an approach for pattern classification known as support vector machines. Given a set of points that all belong to one of two classes, the idea is to estimate a function that will automatically classify to which of the two classes a new point belongs. In particular we discuss the case where the classifying function is linear. The resulting problem is a quadratic program. This topic is developed further in Chapter 14. Also in this section we discuss a portfolio optimization problem that attempts to balance between the competing goals of maximizing expected returns and minimizing risk in investment planning. This too is a quadratic program. Next we will discuss two optimization problems arising from medical applications. One problem arises from planning for treatment of cancer by radiation, where the conflicting goals of providing sufficient radiation to the tumor and limiting the dosage to nearby vital organs give rise to a plethora of models which cover the spectrum from linear through quadratic to nonlinear. The other problem arises from positron emission tomography (PET) image reconstruction, where a model of the image that best fits the scan data gives rise to a linearly constrained nonlinear problem. In both applications the optimization problems can be very large and challenging to solve. Finally we use optimization to find the shape of a hanging cable with minimum potential energy. We present several models of the problem and emphasize the importance of certain modeling issues.

1.7.1

Crew Scheduling and Fleet Scheduling

Consider an airline that operates 2000 flights per day serving 100 cities worldwide, with 400 aircraft of 10 different types, each requiring a flight crew. The airline must design a flight schedule that meets the passenger demand, the maintenance requirements on aircraft, and all other safety regulations and labor contract rules, while trying to be cost effective in order to maximize profit.

i

i i

i

i

i

i

1.7. Optimization Applications

book 2008/10/23 page 19 i

19

This planning problem is extremely complex. For this reason many airlines used a phased planning cycle that breaks the problem into smaller steps. While more manageable, the individual steps themselves can also be complex. Arguably the most challenging of these is the crew scheduling problem, that assigns crews (pilots and flight attendants) to flights. Economically it is a significant problem, since the cost of crews is second only to the cost of fuel in an airline’s operating expenses. Saving even 1% of this cost can save the airline hundreds of millions of dollars annually. Computationally it is a difficult problem since it involves a linear model, which is not only very large, but also involves integer variables, which necessitates multiple solutions of linear programs. In planning the crew activities, the flight schedule is subdivided into “legs,” representing a nonstop flight from one city to another. If a plane flew from, say, New York via Chicago to Los Angeles, this would be considered as two legs. A large airline would typically have hundreds of flight legs per day. The planning period might be a day, a week, or a month. The crews themselves are certified for particular aircraft, and this restricts how personnel can be assigned to legs. In addition, there are union rules and federal laws that constrain the crew assignments. To set up the model, the airline first specifies a set of possible crew assignments. One of these assignments might correspond to sending a crew from New York (their home city) to a sequence of cities and then back to New York. Each such round trip is called a “pairing.” The number of pairings grows exponentially with the number of legs, and for a large airline, the number of pairings may easily run into the billions, even for the shorter planning period of one week. The variables in the model are xj , where xj is 1 if a particular pairing is selected as part of the total schedule, and 0 otherwise. Let the total number of pairings be N . The majority of the constraints correspond to the requirement that each leg in the planning period be covered by exactly one pairing. For the ith leg, the constraint has the form N

ai,j xj = 1,

j =1

where the constant ai,j = 1 if a particular pairing includes leg i, and zero otherwise. There is one such constraint for every leg in the schedule. The columns of the matrix A correspond to the pairings, and each pairing must represent a round trip that is technically and legally feasible. For example, if a crew flies from New York to Chicago, it cannot then immediately fly out of Denver. The pairing makes sense if it makes sense chronologically, includes minimum rests between flights, satisfies regulations on maximum flying time, and so forth. This places many restrictions on how the pairings are generated, and hence on the coefficients ai,j . The resulting columns of A are typically very sparse, with many zeros, and just a few ones, corresponding to the legs of the roundtrip. The cost cj of a pairing is a function of the duration of the pairing, the number of flight hours, and “penalties” that may be associated with the pairing. For example, extra wages and expenses must be paid if the crew spends a night away from its home city, or it may be necessary to transport a crew from one city to another for them to complete the pairing.

i

i i

i

i

i

i

20

book 2008/10/23 page 20 i

Chapter 1. Optimization Models The basic model has the form minimize subject to

z = cTx N

ai,j xj = 1 j =1

xj = 0 or 1. The problem is a linear program with the additional requirement that the variables take on integer values (here—zero and one), hence it is an integer programming problem. As mentioned in Section 1.2, such problems are most commonly solved by solving a sequence of linear programs, where the integrality restrictions are relaxed and replaced by a (continuous) constraint on the range of the variable. The range should ideally be as tight as possible, yet should not exclude the optimal solution. For a zero-one problem the relaxed constraints for the first subproblem would typically be 0 ≤ xj ≤ 1 for all j . Subsequent problems are variants of the relaxed problem, usually with additional constraints or an adjusted objective function. Crew scheduling problems can be very large. A major effort is required just to generate the possible pairings. Commonly, only a partial model is generated, corresponding to a subset of the possible pairings. Even so, problems with millions of variables are typical. Linear programs of this size (even ignoring the integrality restriction) are difficult to solve. They demand all the resources of the most sophisticated software. The special structure of the matrix A (and in particular its sparsity—the large number of zero entries) and the latest algorithmic techniques must be used. Many of these techniques are discussed in Part II. The crew scheduling problem is typically the last step in an airline’s schedule planning. The first step begins about several months prior to the actual service when the airline selects the optimal set of flight legs to be included in its schedule. The flight schedule lists the schedule of flight legs by departure time, destination, and arrival time. The next step is fleet assignment, which determines which type of aircraft will fly each leg of the schedule. Airline fleets are made up of many different types of aircraft, which differ in capacity and in operational characteristics such as speed, fuel burn rates, landing weights, range, and maintenance costs. Allocating an aircraft that is too small will result in loss of revenue from passengers turned away, while allocating an aircraft that is too big will result in too many empty seats to cover the high expenses. The airline’s problem is to determine the best aircraft to use for each flight leg such that capacity is matched to demand while minimizing the operating cost. This problem is frequently represented as a time-line network. The network includes a line called a “time-line” for each airport, with nodes positioned along the line in chronological order at each arrival and departure time. Each flight is represented by an arc in the network. Thus for example a flight leaving Washington Dulles (IAD) at 6:00 am (Eastern Standard Time) and arriving at Denver (DEN) at 10:00 am (Eastern Standard Time) would be represented by an arc connecting the 6:00 am node on the IAD time-line to the 10:00 am node on the DEN time-line. (In practice, the arrival time is adjusted to account for the time it takes to prepare the aircraft for the next flight, but we will ignore that here.) In addition to the flight arcs we create an arc from each node on a time line to the consecutive node on the

i

i i

i

i

i 2008/1

page 21

i

i

1.7. Optimization Applications

21

IAD

DEN

SFO

6:00

8:00

10:00

12:00

2:00

4:00

6:00

Figure 1.9. Time-line network. time line, and (assuming the schedule is repeated daily) an arc from the last node returning to the first node. The flow on these arcs represents aircraft on the ground that are waiting for their flight. Figure 1.9 illustrates a time-line network for an airline that has two flights a day each from IAD to DEN, DEN to IAD, DEN to SFO (San Francisco), and SFO to DEN. Define now xij to be the number of aircraft of type i on arc j . Any feasible fleet assignment solution must satisfy the following constraints: (i) Covering constraints: each flight leg must be covered by exactly one aircraft; (ii) Flow-balance constraints: for each node of the network the total number of aircraft of type i entering the node must equal the total number of aircraft of type i exiting the node; (iii) Fleet size constraints: the number of aircraft used of each type must not exceed the number of aircraft available. The objective is to minimize the total cost of the assignment. The problem is by nature integer, but it is generally solved by a series of linear programs where the integrality restrictions are relaxed. Once the fleet is assigned, the individual aircraft of the fleet must be assigned to their flights. This is known as the aircraft routing problem. The planning must take into account the required maintenance for each aircraft. To meet safety regulations, an airline might typically maintain aircraft every 40–45 hours of flying with the maximum time between checks restricted to three to four calendar days. The problem is to determine the most cost effective assignment of aircraft of a single fleet to the scheduled flights, so that all flight legs are covered and aircraft maintenance requirements are satisfied. The last step of the planning cycle is the task of crew scheduling. Breaking down the full planning cycle into steps helps make the planning more manageable, but ultimately it leads to suboptimal schedules (see Exercise 7.2). Researchers are therefore investigating methods that combine two or more of the planning phases together for more profitable schedules.

i

i i

i

i

i

i

22

book 2008/10/23 page 22 i

Chapter 1. Optimization Models

Exercises 7.1. Formulate the fleet scheduling problem corresponding to Figure 1.9. 7.2. Consider an airline that has scheduled the flight legs for the next month. It has done so by breaking down the planning cycle into a sequence of steps: first determine the optimal fleet for this schedule; next route the aircraft within the fleet to the flight legs; and finally assign crews for each of the flight legs. Discuss why this makes the planning more manageable but likely leads to suboptimal schedules.

1.7.2

Support Vector Machines

Suppose that you have a set of data points that you have classified in one of two ways: either they have a certain stated property or they do not. These data points might represent the subject titles of email messages, which are classified as either being legitimate email or spam; or they may represent medical data such as age, sex, weight, blood pressure, cholesterol levels, and genetic traits of patients that have been classified either as high risk or as low risk for a heart attack; or they may represent some features of handwritten digits such as ratio of height to width, curvature, that have been classified either as (say) zero or not zero. Suppose now that you obtain a new data point. Your goal is to determine whether this new point does or does not have the stated property. The set of techniques for doing this is broadly referred to as pattern classification. The main idea is to identify some rule based on the existing data (referred to as the training data) that characterizes the set of points that have the property, which can then be used to determine whether a new point has the property. In its simplest form classification uses linear functions to provide the characterization. Suppose we have a set of m training data xi ∈ n with classification yi , where either yi = 1 or yi = −1. A two-dimensional example is shown in the left-hand side of Figure 1.10, where the two classes of points are designated by circles of different shades. Suppose it is possible to find some hyperplane w Tx + b = 0 which separates the positive points from the negative. Ideally we would like to have a sharp separation of the positive points from the negative. Thus we will require wTxi + b ≥ +1 wTxi + b ≤ −1

for yi = +1, for yi = −1.

There is nothing special about the separation coefficients ± on the right-hand side of the above inequalities. The coefficients w and b of the hyperplane can always be scaled so that the separation will be ± 1. To obtain the best results we would like the hyperplanes separating the positive points from the negative to be as far apart as possible. From basic geometric principles it can be shown that the distance between the two hyperplanes (that is, the separation margin) is 2/ w. Thus among all separating hyperplanes we should seek the one that maximizes this margin. This is equivalent to minimizing w Tw. The resulting problem is to determine the coefficients w and b that solve minimize subject to

f (w, b) = 21 w Tw yi (w Txi + b) ≥ 1,

i = 1, . . . , m.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 23 i

23

wT x + b = 1

wT x + b = −1

margin

Figure 1.10. Linear separating hyperplane for the separable case.

The coefficient 12 in the objective is included for convenience; it results in simpler formulas when analyzing the problem. The right-hand side of Figure 1.10 shows the solution of our two-dimensional example. The training points that lie on the boundary of either of the hyperplanes are called the support vectors; they are highlighted by larger circles. Removal of these points from our training set would change the coefficients of the hyperplanes. Removal of the other training points would leave the coefficients unchanged. The method is called a “support vector machine” because support vectors are used for classifying data as part of a machine (computerized) learning process. Once the coefficients w and b of the separating hyperplane are found from the training data, we can use the value of the function f (x) = wTx + b (our “learning machine”) to predict whether a new point x¯ has the property of interest or not, depending on the sign of f (x). ¯ So far we have assumed that the data set was separable, that is, a hyperplane separating the positive points from the negative points exists. For the case where the data set is not separable, we can refine the approach to the separable case. We will now allow the points to violate the equations of the separating hyperplane, but we will impose a penalty for the violation. Letting the nonnegative variable ξi denote the amount by which the point xi violates the constraint at the margin, we now require w Txi + b ≥ +1 − ξi wTxi + b ≤ −1 + ξi

for yi = +1 for yi = −1.

A common way to impose the penalty is to add to the objective a term proportional to the sum of the violations. The added penalty term takes the form C m i=1 ξ to the objective, where the larger the value of the parameter C, the larger the penalty for violating

i

i i

i

i

i

i

24

book 2008/10/23 page 24 i

Chapter 1. Optimization Models

wT x + b = 1

wT x + b = −1

Figure 1.11. Linear separating hyperplane for the nonseparable case. the separation. Our problem is now to find w, b, and ξ that solve minimize f (w, b, ξ ) = 12 w Tw + C m i=1 ξi subject to

yi (w Txi + b) ≥ 1 − ξi , ξi ≥ 0.

i = 1, . . . , m,

Figure 1.11 shows an example of the nonseparable case and the resulting separating hyperplane. We see in this example that two of the points (indicated in the figure by the extra squares) are misclassified, since they lie on the incorrect side of the hyperplane wTx +b = 0. In later chapters of this book we will see that many problems have a companion problem called the dual problem, that there are important relations between a problem and its dual, and that these relations sometimes lead to insights for solving the problem. In Section 14.8 we will discuss the dual of the problem of finding the hyperplanes with the largest separation margin. We will show that the dual problem directly identifies the support vectors, and that the dual formulation can give rise to a rich family of nonlinear classifications that are often more useful and more accurate than the linear hyperplane classification we presented here.

Exercises 7.1. Consider two classes of data, where the points (13.3), (0.31.5), (2, 4.2), (2.2, 2.9), (1.7, 3.6), (3, 4), (1, 4) possess a certain property and the points (1.8, 1.5), (3.4, 3.6), (0.2, 2.5), (1, 1.3), (1, 2.5), (3, 1.1), (2, 0.1)

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 25 i

25

do not possess this property. Use optimization software to compute the maximum margin hyperplane that separates the two classes of points. Are the classes indeed separable? What are the support vectors? Repeat the problem when the first class includes also the point (0.2, 2.5) and the second class includes the point (1.7, 3.6). 7.2. In this project we create a support vector machine for breast cancer diagnosis. We use the Wisconsin Diagnosis Breast Cancer Database (WDBC) made publicly available by Wolberg, Street, and Mangasarian of the University of Wisconsin. A link to the data base is made available on the Web page for this book, http://www.siam.org/books/ ot108. There are two files: wdbc.data and wdbc.names. The file wdbc.names gives more details about the data, and you should read it to understand the context. The file wdbc.data gives N = 569 data vectors. Each data vector (in row form) has n = 32 components. The first component is the patient number, and the second is either “M” or “B” depending on whether the data is malignant or benign. You may manually change the entries “M” to “+1” and “B” to “−1”. These entries are the indicators yi . Elements 3 through 32 of each row i form a 30-dimensional vector xiT of observations. (i) Use the first 500 data vectors as your training set. Use a modeling language to formulate the problem for the nonseparable case, using C = 1000. Solve the problem and display the separating hyperplane. Determine whether the data are indeed separable. (ii) Use the output of the run to predict whether the remaining 69 patients have cancer. Compare your prediction to the actual patients’ medical status. Evaluate the accuracy (proportion of correct predictions), the sensitivity (proportion of positive diagnoses for patients with the disease), and the specificity (the proportion of negative diagnoses for patients without the disease).

1.7.3

Portfolio Optimization

Suppose that an investor wishes to select a set of assets to achieve a good return on the investment while controlling risks of losses. The use of nonlinear models to manage investments began in the 1950s with the pioneering work of Nobel Prize laureate Harry Markowitz, who demonstrated how to reduce the risk of investment by selecting a portfolio of stocks rather than picking individual attractive stocks, and established the trade-off between reward and risk in investment portfolios. An investment portfolio is defined by the vector x = (x1 , . . . , xn ), where xj denotes the proportion of the investment to be invested in asset j . Letting μj denote the expected rate of return of asset j , the expected rate of return of the portfolio is μTx. Let be the matrix of variances and covariances of the assets’ returns. The entry j,j is the variance of investment j . A high variance indicates high volatility or high risk; a low variance indicates stability or low risk. The entry i,j is the covariance of investments i and j . A positive value of i,j indicates assets whose values usually move in the same direction, as often occurs with stocks of companies in the same industry. A negative value indicates assets whose values generally move in opposite directions—a desirable feature

i

i i

i

i

i

i

26

book 2008/10/23 page 26 i

Chapter 1. Optimization Models

in a diversified portfolio. Markowitz defined the risk of the portfolio to be its expected variance x Tx. Our optimization problem has two conflicting objectives: to maximize the return μTx, and to minimize the risk x Tx. The relative importance of these objectives will vary depending on the investor’s tolerance for risk. We introduce a nonnegative parameter α that reflects the investor’s trade-off between risk and return. The objective function in the model will be some combination of the two objectives, parameterized by α, leading to the model maximize f (x) = μTx − αx Tx subject to the constraints

xi = 1

and

x ≥ 0.

i

The value of α reflects the investor’s aversion to risk. A large value indicates a reluctance to take on risk, with an emphasis on the stability of the investment. A low value indicates a high tolerance for risk with an emphasis on the expected return of the investment. It can be difficult to choose a sensible value for α. For this reason it is common to solve this model for a range of values of this parameter. This can reveal how sensitive the solution is to considerations of risk. The solution of the problem for any value of α is called efficient indicating that there is no other portfolio that has a larger expected return and a smaller variance. There are of course some limitations to our model. First, we do not generally know the theoretical (joint) distribution of the assets’ return and will need to estimate the mean and variance from historical data. Denoting the estimate of μ by r and the estimate of by V , the actual problem we solve is maximize subject to

r Tx − αx TV x i xi = 1 xi ≥ 0.

Second, investors should be aware that past performance is no indicator of future returns. Finally, we note that the matrix V is dense; that is, it has many nonzero elements. As a result, when the number of assets is large, computations involving V can be expensive thus making the optimization problem computationally difficult. To illustrate portfolio optimization, consider an investor who is planning a portfolio based on four stocks. Data on the rates of return of the stocks in the last six periods are given in Table 1.3. Using this information we estimate the mean of the rate of return as r = ( 0.0667

0.0900

0.0717

0.0733 ) ,

and the variance as ⎛

0.00019 ⎜ 0.00065 V =⎝ 0.00004 0.00308

0.00065 0.00883 0.00218 0.00327

0.00004 0.00218 0.00125 0.00063

⎞ 0.00038 0.00327 ⎟ ⎠. 0.00063 0.00162

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 27 i

27

Table 1.3. Past rates of return of stocks. Period 1 2 3 4 5 6

Stock 1 0.08 0.06 0.07 0.04 0.08 0.07

Stock 2 0.05 0.17 0.05 −0.07 0.12 0.22

Stock 3 0.01 0.09 0.10 0.04 0.08 0.11

Stock 4 0.08 0.12 0.07 −0.01 0.09 0.09

Table 1.4. Optimal portfolio for selected values of α. α 1 2 5 10 100

Stock 1 0 0.12 0.57 0.71 0.87

Stock 2 1 0.65 0.19 0.04 0

Stock 3 0 0.23 0.24 0.25 0.13

Stock 4 0 0 0 0 0

Mean 0.090 0.083 0.072 0.069 0.067

Variance 8.8 ×10−3 4.5 ×10−3 8.0 ×10−4 2.6 ×10−4 1.7 ×10−4

The solution of the optimization problem for a selection of values of the parameter α is given in Table 1.4. Figure 1.12 plots the rate of return against the variance of the optimized portfolios for a continuous range of values of α. The curved line is called the efficient frontier since it depicts the collection of all efficient points. The figure also shows the rate of return and variance obtained when allocating the entire portfolio to one stock only. In this example, a person who has a high tolerance for risk may choose to invest entirely in Stock 2, whereas a person who is extremely cautious may choose to invest entirely in Stock 1. Investing only in Stock 3, or only in Stock 4, or half in Stock 1 and half in Stock 2 are not recommended strategies for anyone, since they are dominated by strategies that have both higher return and lower risk.

Exercises 7.1. How would the formulation to the problem change if a risk-free asset (such as government treasury bills at a fixed rate of return) is also being considered? 7.2. An investor wants to put together a portfolio consisting of the 30 stocks used to determine the Dow Jones industrial average. Use 25 weekly returns ending on the last Friday of last month to find the optimal portfolio. Experiment with different values of the parameter α and plot the corresponding points on the efficient frontier. You will need access to a nonlinear optimization solver. You may need to use a modeling language to formulate the problem for input to the solver.

i

i i

i

i

i

i

28

book 2008/10/23 page 28 i

Chapter 1. Optimization Models

0.095

0.090 Stock 2

Rate of Return

0.085

0.080

0.075 Stock 4 Stock 3

0.070 Stock 1

0.065

1

2

3

4 5 6 Variance in 10e−3

7

8

9

Figure 1.12. Efficient frontier.

1.7.4

Intensity Modulated Radiation Treatment Planning

Radiotherapy is the treatment of cancerous tissues with external beams of radiation. As a beam of radiation passes through the body, energy is deposited at points along its path, and as this happens the beam intensity gradually decreases (this is called attenuation). The radiation dosage is the amount of energy deposited locally per unit mass. High doses of radiation can kill cancerous cells, but will also damage nearby healthy cells. If vital organs receive too much radiation, serious complications may arise. Some limited damage to healthy cells may be tolerable however, since normal cells repair themselves more effectively than cancerous cells. If the radiation dosage is limited, the surrounding organs can continue to function and may eventually recover. The goal of the radiation treatment planning is to design a treatment that will kill the cancer in its entirety but limit the damage to surrounding healthy tissue. To keep the radiation levels of normal healthy tissue low, the treatment typically uses several beams of radiation delivered from different angles. Intensity modulated radiation therapy (IMRT) is an important recent advance that allows each beam to be broken into hundreds (or possibly thousands) of beamlets of varying intensity. This is achieved using a set of metallic leaves (called collimators) that can sequentially move from open to closed position, thus filtering the radiation in a way that not only allows for the modulation of the intensity of the beam, but also enables control of its shape. This enables more accurate radiation treatment. This is particularly important in cases where the tumor has an unusual shape as is the case when it is wrapped around the spinal cord, or when it is close to a vital structure such as the optic nerve. A simplified example of the desired goals for treatment of a hypothetical prostate cancer patient is given in Table 1.5. Radiation dosage is measured in a unit call Gray (Gy). One Gy is equal to one Joule of energy deposited in one kilogram. The planning target volume (PTV) describes a region large enough to incorporate the diseased organ, the

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 29 i

29

Table 1.5. Sample treatment specifications. Volume

Requirement

PTV excluding rectum overlap

Prescription dose Maximum dose Minimum dose 95% of volume ≥

PTV/rectum overlap

Prescription dose 74 Gy Maximum dose 77 Gy Minimum dose 74 Gy

Rectum

Maximum dose 76 Gy 70% of volume ≤ 32 Gy

Bladder

Maximum dose 78 Gy 70% of volume ≤ 32 Gy

80 Gy 82 Gy 78 Gy 79 Gy

cancerous cells, as well as a margin to account for patient movement during the treatment. Organs at risk are the rectum and the bladder. Since the PTV may overlap with the rectum, different treatment specifications are given for the primary region where the PTV is distinct from the rectum, and for the region where they overlap. The specifications for the primary region, for example, include a desired “prescription” dose of 80 Gy at every cell, a minimum dose value of 78 Gy, a maximum dose of 82 Gy, and finally, a “dose-volume” requirement that specifies that 95% of the cells in this region must receive at least 79 Gy. The treatment specification for the bladder includes an upper limit of 72 Gy for the entire organ and a dose-volume requirement that 70% of the organ must receive 32 Gy or less. To determine the treatment plan we will need to define a volume of interest that includes the PTV and any nearby tissue that may be adversely affected by the treatment. We will divide this volume into a three-dimensional grid of small boxes called voxels. We will denote the dose deposited in voxel i by di . A key decision in the treatment planning is the fluence map—the radiation intensity of the beamlets in each beam. Let xj denote the intensity of beamlet j . Then the total radiation dosage deposited in the volume of interest is given approximately by the equation d = Ax. The matrix A is called the fluence matrix and is assumed to be known. Its components ai,j represent the amount of dose absorbed by voxel i per unit intensity emission from beamlet j . The problem is therefore to find a fluence map x that yields a radiation dose d that meets the requirements specified by the physician, as in Table 1.5. As such, this seems to be a feasibility problem, namely one of finding a feasible solution, rather than an optimization problem. Unfortunately the treatment requirements are usually conflicting, and it is impossible to satisfy all the requirements simultaneously. To resolve this, the requirements are usually broken up into “hard” constraints for which any violation is prohibited, and “soft” constraints for which violations are allowed. Typically, hard constraints are included in

i

i i

i

i

i

i

30

book 2008/10/23 page 30 i

Chapter 1. Optimization Models

the formulation as explicit constraints, whereas soft constraints are incorporated into the objective function via some penalty that is imposed for their violation. For example, the requirement that region S in the primary treatment volume will receive a minimum dose l and a maximum dose u could be treated as a hard constraint by explicitly requiring that l ≤ di ≤ u for all i ∈ S. Alternatively the requirement could be treated as a soft constraint, where a violation is allowed, but with penalty. One approach is to include in the objective function the nonlinear term

max(0, l − di )2 + wu max(0, di − u)2 , wl i∈S

i∈S

which sums up the squared deviation from the desired bound for those voxels where the bounds are violated. The parameters wl and wu are weights representing the relative importance of the bounds on the doses and may differ by region. For instance, underdosing the tumor can be more harmful than overdosing it, so the weights for this region satisfy wl ≥ wu . For an alternative way to impose a penalty for violating the bounds on the doses, see the Exercises. The “dose-volume constraints” that specify that a fraction β of some volume must receive a dose of u or less (or a dose of l or more) are more difficult to incorporate. As an example, suppose that the bladder volume in our example has 10,000 voxels. Then at least 7,000 of the voxels must receive 32 Gy or less. To count the number of voxels that exceed 32 Gy we must define an indicator for each voxel that determines whether its dose meets 32 Gy or exceeds it. This can be done by defining for each voxel a variable yi that is either zero or one, depending on whether the dose meets the desired upper limit or not. Then adding the constraints

di ≤ 32(1 − yi ) + 78yi , yi ≤ 3,000, yi ∈ {0, 1} i∈S

enforces the dose-volume constraints. The first constraint implies that if di exceeds 32 Gy, then yi must be one; the second implies that the number of voxels where the dose exceeds 32 Gy is at most 3,000. This formulation expresses the dose-volume requirements as hard constraints. However models with integer variables can be difficult to solve and may require a specialized implementation. For this reason, some researchers prefer other formulations. One way to use a soft constraint for the dose-volume requirement is to add to the objective function a penalty term of the form

w max(0, di − 32)2 , i∈S(d)

where S(d) is the set of 7,000 voxels (out of the 10,000) with the lowest dose, and w is the weight of the penalty. Unfortunately, we have traded one difficulty for another. In this alternative formulation, the penalty term does not have continuous derivatives (see the Exercises), which can create challenges for many optimization algorithms. One may wonder why there are so many different models and formulations. There are several reasons. First, because the requirements are conflicting, there is no consensus

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 31 i

31

among physicians as to what should be a hard constraint and what should be a soft constraint. Second, physicians have other desired objectives in the treatment that are extremely important yet cannot be adequately modeled. For example, they are concerned about the tumor control probability—the probability that the dose delivered will indeed kill the tumor. However models that incorporate these probabilities directly are computationally impractical. As another example, physicians obtain important information from the shape of the dose-volume histogram, a graph displaying for each dose level the percentage of the volume that receives at least that dose amount. Ideally one would like to include constraints that enforce the dose-volume histogram to have a “good” shape, but this would amount to including numerous dose-volume constraints, which again is computationally impractical. A third factor is the trade-off between solution time and solution quality. Most commercial systems use the weighted sum of penalties since these can typically be solved efficiently. However, because all the constraints are “soft,” the solutions are not always adequate. The solutions can sometimes include undesirable features, such as regions of low dosage (“cold spots”) within the tumor, or regions of high dosage (“hot spots”) in healthy tissue. The problem of optimizing the fluence map can be immense. The number of voxels may range from tens of thousands to hundreds of thousands. Typically a treatment may use 5–10 beams, and the number of beamlets per beam can run into the thousands. Even if the direction of the beams is prescribed, the problem can be challenging. The problem becomes even harder if one attempts to optimize the number of beams and their directions, in addition to their fluence. There is an additional challenge. Recall that the beamlets are formed by the movement of the leaf collimators; the longer a leaf is open, the more dose it allows to pass through. It is also necessary to determine the sequence of leaf positions and length of their open times that creates the desired fluence map—or an approximation to it—in a total sequencing time that does not unduly prolong the patient’s total treatment time.

Exercises 7.1. One possible way to allow some violation of the constraint l ≤ di in a region S is to introduce for each voxel i in S two new nonnegative variables si and si satisfying di − si + si = li si , si ≥ 0, and to include a penalty term of the form wl i∈S si in the objective. Explain why this approach would work, and derive an equivalent approach for the constraint di ≤ u. 7.2. The purpose of this exercise is to show that when the dose-volume requirements are included as soft constraints in the objective, the resulting penalty term may have discontinuous derivatives. Consider a region with only two voxels, and suppose that it is required that not more than half the voxels exceed a dose of u. Show that the approach described in this section for incorporating this requirement as a soft constraint adds a penalty term of the form w max(0, (mini=1,2 {di } − u))2 to

i

i i

i

i

i

i

32

book 2008/10/23 page 32 i

Chapter 1. Optimization Models the objective. Evaluate the gradient of this penalty term at points where it exists. Determine whether the first derivatives are continuous on d ≥ 0.

1.7.5

Positron Emission Tomography Image Reconstruction6

Positron emission tomography (PET) is a medical imaging technique that helps diagnose disease and assess the effect of treatment. Unlike other imaging techniques such as Xrays or CT-scans that directly study the anatomical structure of an organ, PET studies the physiology (blood flow or level of metabolism) of the organ. Metabolic activity is an important tool in diagnosis: cancerous cells have high metabolism or high activity, while tumor cells damaged by irradiation have low metabolism or low activity. Alzheimer’s disease is indicated by regions of reduced activity in the brain, and coronary tissue damage is indicated by regions of reduced activity in the heart. In a PET scan the patient is injected with a radioactively labeled compound (most commonly glucose, but sometimes water or ammonia) that is selected for its tendency to be absorbed in the organ of interest. Once the compound settles, it starts emitting radioactive emissions that are counted by the PET scanner. The level of emissions is proportional to the amount of drug absorbed, or in turn, to the level of cell activity. The emissions are counted using a PET scanner that surrounds the body. Based on the emissions counts obtained in the scanner, the goal is to determine the level of emissions from within the organ, and hence the level of metabolic activity. The output of the reconstruction is typically presented in a color image that reflects the different activity levels in the organ. We describe the physics of PET in further detail. As the radioisotope decays, it emits positrons. Each positron annihilates with an electron, and produces two photons which move in nearly opposite directions, each hitting a tiny photodetector within the scanner at almost the same time. Any near-simultaneous detection of an event by two such detectors defines a coincidence event along a coincidence line. The number of coincidence events yj detected along each of the possible coincidence lines j is the input to the image reconstruction. Consider the situation depicted in Figure 1.13, where a grid of boxes or voxels has been imposed over the emitting object (for simplicity, the figure is depicted in two dimensions; the concept is readily extended to three dimensions). Given a set of measurements yj along the coincidence lines j = 1, . . . , N, we seek to estimate xi , i = 1, . . . n, the expected number of counts emitted from voxel i, where n is the number of voxels in the grid. Most reconstruction methods are based on a technique known as filtered back projection. Although this technique yields fast reconstructions, the quality of the image can be poor in situations where the amount of radioactive substance used must be small. Under such situations it is necessary to use a statistical model of the emission process to determine the most likely image that fits the data. The approach is via the maximum likelihood estimation technique. The radioactive emissions from voxels i = 1, . . . , n are assumed to be statistically independent random variables that follow a Poisson distribution with mean xi . Denote by Ci,j the probability that an emission emanating from voxel i will hit detector pair (coincidence line) j . The n × N matrix C = Ci,j depends on the geometry of the scanner and on the tissue being scanned, and is assumed to be known. 6 This

section requires some basic concepts from probability theory.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 33 i

33

Figure 1.13. PET. Using these assumptions one can show that the emissions emanating from voxel i and hitting detector pair j are also independent Poisson variables with mean rate Ci,j xi , and the total emissions received by the detector pairs j = 1, . . . , N are independent Poisson dis tributed variables with mean rate i Ci,j xi . Let q = CeN where eN is a vector of 1’s. The vector q denotes the sum of the columns of C (which need not be 1). It is computationally easier if we write the optimization model using the logarithm of the likelihood function. If we ignore a constant term, the resulting logarithm is fML = −q T x +

yj log C Tx j .

j

(See the Exercises.) Since the emission level is nonnegative, the final reconstruction problem becomes maximize fML = −q T x + j yj log C Tx j subject to x ≥ 0. The size of the problem can be enormous. If one wishes to reconstruct, say, a volume of, say, 5 cubic cm at a resolution of half a millimeter, then the size of the grid would be 100 by 100 by 100, corresponding to n = 100,000 variables. Problems of this size and even larger are not uncommon. The size of the data is also huge. The scanner may have thousands of photodetectors and since any pair of these can define a coincidence line, the number of coincidence lines N can be on the order of millions. Since every function evaluation requires the computation of a matrix product C Tx, and the matrix C is large, the function evaluations are time consuming. The efficient solution of such large problems often requires understanding of their structure. By structure we mean special characteristics of the function, its gradient, and Hessian. Often structure is associated with the sparsity pattern of the Hessian, that is the number of zeros, and possibly their location. The special structure of fML and its derivatives

i

i i

i

i

i

i

34

book 2008/10/23 page 34 i

Chapter 1. Optimization Models

can be used in designing effective methods for solving the problem. Here we will just give the formulas for the derivatives. Defining yˆ = C T x, we can write the gradient and Hessian of the objective function, respectively, as ∇fML (x) = −q + C Yˆ −1 y, ∇ 2 fML (x) = −C Y Yˆ −2 C T , where Y = diag(y) and Yˆ = diag(y). ˆ The matrix C itself is sparse, and only a small fraction of its entries are nonzero. The diagonal matrices Y and Yˆ are of course also sparse. Even so, the Hessian ∇ 2 fML (x) is dense; almost all of its entries are likely to be nonzero. A key challenge in the design of effective algorithms is to exploit the sparsity of C.

Exercises 7.1. The goal of this exercise is to derive the maximum likelihood model for PET image reconstruction. Parts (a) and (b) require some basic background in stochastic methods. (i) Let Zij be the number of events emitted from voxel i and detected at coincidence line j , and let Yj be the total emissions received by detector pair j , for j = 1, . . . , N. Use the assumptions given in the section to prove that Zij are independent Poisson variables with mean Ci,j xi , and that Yj are independent Poisson distributed variables with mean rates yˆj = i Ci,j xi . (ii) Prove that the likelihood may be written as e− i Ci,j xi ( Ci,j xi )yj e−yˆj yˆ yj j i = . P {y|x} = yj ! yj ! j

j

(iii) Prove the final expression for the maximum likelihood estimation objective function fML . Hint: Take the logarithm of the likelihood and omit the constant term that does not depend on x. 7.2. Derive the formulas for the gradient and Hessian matrix of fML . 7.3. The purpose of this exercise is to show that the Hessian of fML may be dense, even when its matrix factors are sparse. Suppose that C = ( I I en ) and y = yˆ = e2n+1 where I is the identity matrix, and ek is a vector of ones of size k. Show that every element of the Hessian is nonzero. 7.4. The purpose of this problem is to write a program in the modeling language of your choice to solve a PET image reconstruction problem. Your model should not only be correct, but also efficient and clear. Try and make your model as general as possible. (i) Develop the model and test it on a problem with n = 9 variables corresponding to a 3 × 3 grid, and with N = 33 detector pairs. The data are C= B B B ,

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 35 i

35 where B is a sparse n × (n + 2) matrix with the following nonzero entries: Bi,i = a,

Bi,i+1 = b,

Bi,i+2 = a,

i = 1, . . . , n,

where a = 0.18,

b = 0.017,

and yT = ( 0 0 1

0 0 0

1 1 1

19 7 3

27 20 17

30 38 38

40 56 40

50 55 20

35 38 7

15 20 1

1 ... 7 ... 0 ).

(ii) Test your software on a problem with n = 1080 variables corresponding to a 36 × 30 grid, and with N = 1444 detector pairs. The data are C = B 2B , where B is defined as in part (a) with the parameter values a = 0.15 and b = 0.05. The vector y can be downloaded in text format from the Web page for this book (http://www.siam.org/books/ot108). Display the values of the first row of the reconstructed image. (iii) Identify the image you obtained in (ii). You will need software for displaying intensity images.

1.7.6

Shape Optimization

In this section we show how nonlinear optimization can address a problem of finding the shape of a hanging cable, which in equilibrium minimizes the potential energy of the cable. This problem often is called the catenary problem (from the Latin word “catena” meaning a chain). The solution to the simplest case of the hanging cable problem, when the mass of the cable is uniformly distributed along the cable, was found at the end of the 18th century independently by John Bernoulli, Christian Huygens, and Gottfried Leibniz. More recently, the catenary has played an important role in civil engineering. The solution to the catenary problem helps understand the effects on suspended cables of external applied forces arising from the live loads on a suspension bridge. Here we demonstrate how a general hanging cable problem can be modeled as an optimization problem. We present several optimization models to illustrate that sometimes a physical problem can have multiple equivalent mathematical formulations, some of which are numerically tractable while others are not. First, for simplicity we assume that the mass of the cable is distributed uniformly. The objective will be to minimize the potential energy of the cable xb minimize y(x) 1 + y (x)2 dx. y(x)

xa

i

i i

i

i

i

i

36

book 2008/10/23 page 36 i

Chapter 1. Optimization Models 0

- 0. 5

-1

- 1. 5

-2

-2. 5

-3

- 3. 5

-4

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Figure 1.14. Hanging cable with uniformly distributed mass. Here y(x) is the height of the cable measured from some zero level, and 1 + y (x)2 dx is the arc length, which is proportional to mass since the mass is distributed uniformly. The model also has constraints: the cable has a specified length L xb 1 + y (x)2 dx = L, xa

and the ends of the cable are fixed y(xa ) = ya ,

y(xb ) = yb .

It can be shown that the solution to this problem is a hyperbolic cosine 1 y(x) = C0 cosh x+C + C2 , C0 where cosh(x) = (ex + e−x )/2 and the values of C0 , C1 , and C2 are determined by the constraints. Figure 1.14 shows the graphical representation of y(x). In contrast to our previous optimization models where we had a finite number of variables, here we are seeking an optimal function, that is, an infinite continuum of values. In order to solve such a problem using nonlinear optimization algorithms, we discretize the function by approximating it at a finite number of points, as shown in Figure 1.15. Here we describe the simplest method for discretizing such problems. If xa = x0 < x1 < · · · < xN−1 < xN = xb is a uniform discretization of segment [xa , xb ] such that x = x1 − x0 = x2 − x1 = · · · = xN − xN −1 , a simple approximation to an integral of a function f (x) is b N −1

f (x)dx ≈ f (xi ) x. a

i=0

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 37 i

37 A

B

0

- 0. 5

-1

-1. 5

-2

- 2. 5

Δl

-3

Δy

Δx - 3. 5

-4

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Figure 1.15. Discretized hanging cable with uniformly distributed mass. The function values used to approximate the integral for the shape optimization problem are f (xi ) = y(xi ) 1 + y (xi )2 . We will approximate the values of the derivative y (x) at the discretization points xi by yi =

yi+1 −yi , x

i = 0, 1, . . . , N − 1,

where yi = y(xi ). The discretized problem consists of finding variables yi , i = 1, . . . , N −1, and yi , i = 0, . . . , N − 1, that solve the problem

N −1

yi 1 + (yi )2 x

minimize

E(y, y ) =

subject to

yi+1 = yi + yi x,

i=0

N −1

i = 0, . . . , N − 1

1 + (yi )2 x = L

i=0

y0 = ya ,

yN = yb .

We refer to this as optimization model 1. The greater the number of discretizations points N, the better the solution to optimization model 1 approximates the solution of the original problem.However for very large N, 2 the optimization model 1 is difficult to solve. The constraint N i=1 1 + (yi ) x = L is nonlinear and can be a source of numerical difficulties for optimization algorithms. In the two-dimensional case this constraint defines the perimeter shown in Figure 1.16 (left). The point x0 is on the perimeter and hence is feasible, but almost any perturbation of x0 will move off the perimeter and hence out of the feasible region. Fortunately, there is another formulation of the catenary problem that leads to a more tractable model. Rather than representing the cable as a function y(x) of the variable x, we parameterize it as a function of its length with respect to its left end point. The points on the cable will

i

i i

i

i

i

i

38

book 2008/10/23 page 38 i

Chapter 1. Optimization Models 5

4

3

2

1

0

-1

-2

x0

-3

-4

-5 -5

-4

-3

-2

-1

0

1

2

3

4

5

Figure 1.16. Feasible regions. now have the form (x(l), y(l)), l ∈ [0, L]. This representation leads to a model that is simpler to analyze both mathematically and numerically. Now we look for (x(l), y(l)), l ∈ [0, L], which minimizes the potential energy L min m(l)y(l)dl 0

subject to a constraint based on the Pythagorean theorem that defines the relations between dx, dy, and dl (see Figure 1.15), dx 2 + dy 2 = dl 2 , and the ends of the cable are fixed x(0) = xa ,

x(L) = xb , y(L) = yb . L Here m(l) is a mass distribution function such that 0 m(l)dl = M is the total mass of the cable. The discretization of this problem with the uniform distribution of mass and the total mass of the cable M consists of finding variables xl , l = 1, . . . , N − 1, and yi , i = 1, . . . , N − 1, using the following optimization model 2: minimize

y(0) = ya ,

E(y) =

M N

N

yl

l=0

subject to

(xl − xl−1 )2 + (yl − yl−1 )2 = x0 = xa , xN = xb y0 = ya , yN = yb ,

L 2 N

,

l = 1, . . . , N

where the mass distribution function is m = const = M/N. This optimization model also has N nonlinear constraints: 2 (xl − xl−1 )2 + (yl − yl−1 )2 = NL , l = 1, . . . , N, which again can be a potential source of difficulties for optimization algorithms if N is large (the two-dimensional case is shown in Figure 1.16 (center)). However, the optimization model 2 can be simplified substantially by relaxing these constraints into inequalities: 2 (xl − xl−1 )2 + (yl − yl−1 )2 ≤ NL , l = 1, . . . , N.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 39 i

39 10

9.5

9

8.5

8

7.5

7

6.5 correct model relaxed model 6

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Figure 1.17. Constraints cannot always be relaxed.

Of course, we changed the formulation, which is legitimate only if we can prove that the new formulation has the same solution as the original one. In other words, we have to prove that both optimization model 2 and its relaxation have the same solution. We can prove this by contradiction. Suppose that the optimal solution of the relaxed model satisfies at least one constraint as a strict inequality. Then we can lower the discretized components of the solution corresponding to this constraint and still remain feasible. But lowering part of the cable decreases the potential energy, i.e., it decreases the objective function, so our solution could not have been optimal. This contradicts our original assumption. Thus optimization model 2 and its relaxation have the same optimal solutions. But the two models are not equivalent computationally, since the feasible region for the relaxation has properties that make it easier for optimization algorithms to handle. In the two-dimensional case, the feasible region of the relaxed optimization model 2 is shown in Figure 1.16 (right). It is the entire circle, not just its perimeter. If x0 is a feasible point in the interior of the feasible region, any small perturbation of x0 is also in the interior. This feasible region has a convex shape; i.e., if we connect any two points from the feasible set, all the points between them are also feasible. This property of the interior of feasible set helps some optimization algorithms, later described in the book, efficiently find the solution. It is apparently not possible to relax the constraints of optimization model 1 without changing the optimal solution, but it is easy to do so with optimization model 2. Relaxation of the nonlinear equality in optimization model 1 to an inequality N

1 + (yi )2 x ≤ L

i=1

gives a model that is not equivalent and can result in an incorrect solution as shown in Figure 1.17. In this example, the length of an optimal cable for the relaxed model is less than L.

i

i i

i

i

i

i

40

book 2008/10/23 page 40 i

Chapter 1. Optimization Models 0

−0.5

−1

−1.5

−2

−2.5

−3

−3.5

−4

−4.5

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Figure 1.18. Hanging cable with a nonuniform mass distribution.

Optimization model 2 has another attractive property. Mathematicians in the 18th century assumed that the string is flexible and uniform, which implies that every segment of equal length has equal mass. This assumption is too restrictive for modern engineering. In many practical problems the total weight of the cable is not uniformly distributed along the cable. If the mass distribution function is not uniform along the cable but instead is a general known function m(l), then it is still easy to obtain a solution of a hanging cable problem using N +1 optimization model 2. We just have to replace the objective function M i=0 yi with a more N N +1 general linear objective function i=0 mi yi with appropriately selected coefficients mi corresponding to a certain distribution of mass along the cable. For example, if the mass of most nodes is much smaller than that of three special nodes—the center node and the two nodes one quarter of the length away from both end points—then it is still easy to find the shape of such a cable (see Figure 1.18). We would not be able to easily model such a case using optimization model 1, for which the assumption of uniformly distributed mass is essential. We conclude the section by emphasizing the importance of proper modeling of a problem. It is the responsibility of a modeler not to make the formulation more difficult than it need be. A problem that is computationally challenging in one formulation may become much easier to solve in a different formulation. It is up to the modeler to carefully consider the merits of a formulation prior to solving the problem.

1.8

Notes

Further information on integer programming can be found in the book by Wolsey (1998). References on global optimization are listed in the Notes for Chapter 2. Overviews of the crew scheduling, fleet assignment problem, and other airline scheduling problems are given in the articles by Barnhart et al. (1999) and Gopalan and Talluri (1998);

i

i i

i

i

i

i

1.8. Notes

book 2008/10/23 page 41 i

41

methods for solving the related linear program are described in the paper by Bixby et al. (1992). The portfolio problem is described in the book by Markowitz and Todd (2000). An innovative approach to calculating the entire efficient frontier by solving just one linear programming problem using a specialized parametric method was developed by Ruszczynski and Vanderbei (2003). The concept of support vector machines was initially developed by Vapnik (1998) in the late 1970s. A comprehensive overview on the subject is found in the tutorial by Burges (1988). More recent research is discussed in the books by Cristianini and Shawe-Taylor (2000), and by Schökopf et al. (1999). Overviews of IMRT planning can be found in the articles by Shepard et al. (1999) and by Lee and Deasy (2006). The book by Herman (1980) and the papers by Shepp and Vardi (1982) and Lange and Carson (1984) are among the pioneering works pertaining to PET. Figure 1.13 is due to Calvin Johnson, and was taken from the paper by Johnson and Sofer (2001). Further applications of optimization can be found in the books by Vanderbei (2007), and by Fourer, Gay, and Kernighan (2003). The hanging cable or catenary problem was first posed in the Acta Eruditorium in 1690 by Jacob Bernoulli. Simple catenary problems can be solved analytically. More complicated cases, those with nonuniformly distributed mass, may have to be solved numerically. More details about how to find shapes of a hanging cable analytically and numerically can be found in the paper of Griva and Vanderbei (2005) and many books on variational calculus; see, e.g., Gelfand and Fomin (1963, reprinted 2000).

i

i i

i

i

book 2008/10/23 page 42 i

i

i

i

i

i

i

i

i

i

book 2008/10/23 page 43 i

Chapter 2

Fundamentals of Optimization

2.1

Introduction

This chapter discusses basic optimization topics that are relevant to both linear and nonlinear problems. Sections 2.2–2.4 discuss local and global optima, convexity, and the general form of an optimization algorithm. These topics have traditionally been considered as fundamental topics in all areas of optimization. The later sections of the chapter, discussing rates of convergence, series approximations to nonlinear functions, and Newton’s method for nonlinear equations, are most relevant to nonlinear optimization. In fact, Part II on linear programming can be understood without these later sections. The later topics are basic to discussions of nonlinear optimization, since they allow us to derive optimality conditions and develop and analyze algorithms for optimization problems involving nonlinear functions. Although not essential, these topics give a fuller understanding of linear programming as well. For example, “interior-point” methods apply nonlinear optimization techniques to linear programming. They might use Newton’s method to find a solution to the optimality conditions for a linear program, or use a nonlinear optimization algorithm on a linear programming problem. The tools from this chapter underlie the interior-point methods derived in Chapter 10.

2.2

Feasibility and Optimality

There are a variety of terms that are used to describe feasible and optimal points. We first discuss the terms associated with feasibility. We consider a set of constraints of the form gi (x) = 0,

i ∈ E,

gi (x) ≥ 0,

i ∈ I.

Here { gi } are given functions that define the constraints in the model, E is an index set for the equality constraints, and I is an index set for the inequality constraints. Any set 43

i

i i

i

i

i

i

44

book 2008/10/23 page 44 i

Chapter 2. Fundamentals of Optimization

of equations and inequalities can be rearranged in this form. For example, the equation 3x12 + 2x2 = 3x3 − 9 could be written as g1 (x) = 3x12 + 2x2 − 3x3 + 9 = 0, and the inequality sin x1 ≤ cos x2 is equivalent to g2 (x) = − sin x1 + cos x2 ≥ 0. Such transformations are merely cosmetic, but they simplify the notation for describing the constraints. A point that satisfies all the constraints is said to be feasible. The set of all feasible points is termed the feasible region or feasible set. We shall denote it by S. At a feasible point x, ¯ an inequality constraint gi (x) ≥ 0 is said to be binding or active ¯ = 0, and nonbinding or inactive if gi (x) ¯ > 0. The point x¯ is said to be on the if gi (x) boundary of the constraint in the former case, and in the interior of the constraint in the latter. All equality constraints are regarded as active at any feasible point. The active set at a feasible point is defined as the set of all constraints that are active at that point. The set of feasible points for which at least one inequality is binding is called the boundary of the feasible region. All other feasible points are interior points. (Interior points are only “interior” to the inequality constraints. If equality constraints are present, any feasible point will satisfy them. Since it is not possible to be interior to an equality constraint, some authors use the term relative interior points.) Figure 2.1 illustrates the feasible region defined by the constraints g1 (x) = x1 + 2x2 + 3x3 − 6 = 0 g2 (x) = x1 ≥ 0 g3 (x) = x2 ≥ 0 g4 (x) = x3 ≥ 0. At the feasible point xa = (0, 0, 2)T, the first two inequality constraints x1 ≥ 0 and x2 ≥ 0 are active, while the third is inactive. At the point xb = (3, 0, 1)T only the second inequality is active, while at the interior point xc = (1, 1, 1)T none of the inequalities are active. The boundary of the feasible region is indicated by bold lines. Let us now look at terms associated with optimality. It may seem surprising that there is any question about what is meant by a “solution” to an optimization problem. The confusion arises because there are a variety of conditions associated with an optimal point and each of these conditions gives rise to a slightly different notion of a “solution.” Let us consider the n-dimensional problem minimize f (x). x∈S

There is no fundamental difference between minimization and maximization problems. We can maximize f by solving minimize (−f (x)), x∈S

and then multiplying the optimal objective value by −1. For this reason, it is sufficient to discuss minimization problems only.

i

i i

i

i

i

i

2.2. Feasibility and Optimality

book 2008/10/23 page 45 i

45 x3

xa xc

xb

x2

x1

Figure 2.1. Example of feasible region.

strict global minimizer

no global minimizer

nonstrict global minimizer

Figure 2.2. Examples of global minimizers. The set S of feasible points is usually defined by a set of constraints, as above. For problems without constraints, the set S would be n , the set of vectors of length n whose components are real numbers. The most basic definition of a solution is that x∗ minimizes f if f (x∗ ) ≤ f (x)

for all x ∈ S.

The point x∗ is referred to as a global minimizer of f in S. If in addition x∗ satisfies f (x∗ ) < f (x)

for all x ∈ S such that x = x∗ ,

then x∗ is a strict global minimizer. Not all functions have a finite global minimizer, and even if a function has a global minimizer there is no guarantee that it will have a strict global minimizer; see Figure 2.2. It would be satisfying theoretically, and important practically, to be able to find global minimizers. However, many of the methods that we will study are based on the Taylor

i

i i

i

i

i

i

46

book 2008/10/23 page 46 i

Chapter 2. Fundamentals of Optimization

strict local minimizer (global)

strict local minimizer

nonstrict local minimizers

strict local minimizer

Figure 2.3. Examples of local minimizers. series; that is, they are based on information about the function at a single point, and this information is normally only be valid within a small neighborhood of that point (see Section 2.6). Without additional information or assumptions about the problem it will not be possible to guarantee that a global solution has been found. An important exception is in the case where the function f and the set S are convex (see Section 2.3), which is true for linear programming problems. If we cannot find the global solution, then at the least we would like to find a point that is better than its surrounding points. More precisely, we would like to find a local minimizer of f in S, a point satisfying f (x∗ ) ≤ f (x)

for all x ∈ S such that x − x∗ < .

Here is some small positive number that may depend on x∗ . The point x∗ is a strict local minimizer if f (x∗ ) < f (x)

for all x ∈ S such that x = x∗ and x − x∗ < .

Various one-dimensional examples are illustrated in Figure 2.3. In many important cases, strict local minimizers can be identified using first and second derivative values at x = x∗ , and hence they can be identified by algorithms that compute first and second derivatives of the problem functions. (A local minimizer that is not a strict local minimizer is a degenerate case and is often considered to be a special situation.) Many algorithms, in particular those that only compute first derivative values, are only guaranteed to find a stationary point for the problem. (For unconstrained problems a stationary point is a point where the first derivatives of f are equal to zero. For constrained problems the definition is more complicated; see Chapter 14.) A local minimizer of f is also a stationary point of f but the reverse need not be true. Having all these various definitions of what is meant by a solution may seem perverse, but it merely reflects the fact that if we only have limited information, then we can draw only limited conclusions. The definitions are not without merit, though. In the case where all these various types of solutions are defined and where the function has several continuous

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 47 i

47

derivatives, a global solution will also be both a local solution and a stationary point. In important special cases such as linear programming the reverse will also be true. In our experience, it is unusual for an algorithm to converge to a point that is a stationary point but not a local minimum. However, it is common for an algorithm to converge to a local minimum that is not a global minimum. It may seem troubling that a local but not global solution is often found, but in many practical situations this can be acceptable if the local minimizer produces a satisfactory reduction in the value of the objective function. For example, if the objective function represented the costs of running a business, a 10% reduction in these costs would be a valuable saving, even if it did not correspond to the global solution to the optimization problem. Local optimization techniques are a valuable tool even if global solutions are desired, since techniques for global optimization typically solve a sequence of local optimization problems.

Exercises 2.1. Consider the feasible region defined by the constraints √ 1 − x12 − x22 ≥ 0, 2 − x1 − x2 ≥ 0, and

x2 ≥ 0.

For each of the following points, determine whether the point is feasible or infeasible, and (if it is feasible) whether it is interior to or on the boundary of each of the constraints: xa = ( 12 , 12 )T, xb = (1, 0)T, xc = (−1, 0)T, xd = (− 12 , 0)T, and xe = √ √ (1/ 2, 1/ 2)T. 2.2. Consider the one-variable function f (x) = (x + 1)x(x − 2)(x − 5) = x 4 − 6x 3 + 3x 2 + 10x. Graph this function and locate (approximately) the stationary points, local minima, and global minima. 2.3. Consider the problem minimize f (x) = x1 subject to x12 + x22 ≤ 4 x12 ≥ 1. Graph the feasible set. Use the graph to find all local minimizers for the problem, and determine which of those are also global minimizers. 2.4. Consider the problem minimize subject to

f (x) = x1 (x1 − 1)2 + x22 = 1 (x1 + 1)2 + x22 = 1.

Graph the feasible set. Are there local minimizers? Are there global minimizers? 2.5. Give an example of a function that has no global minimizer and no global maximizer.

i

i i

i

i

i

i

48

book 2008/10/23 page 48 i

Chapter 2. Fundamentals of Optimization

2.6. Provide definitions for a global maximizer, a strict global maximizer, a local maximizer, and a strict local maximizer. 2.7. Consider minimizing f (x) for x ∈ S where S is the set of integers. Prove that every point in S is a local minimizer of f . 2.8. Let S = { x : gi (x) ≥ 0, i = 1, . . . , m } and assume the functions that

{ gi } are continuous. Prove that if gi (x) ˆ > 0 for all i, then x : x − xˆ < ⊂ S for some > 0. 2.9. Let S be the feasible region in Figure 2.1. Show that S can be represented by equality and inequality constraints in such a way that it has no interior points. Thus the interior of a set may depend on the way it is represented. 2.10. Let S = { x : gi (x) ≥ 0, i = 1, . . . , m } and assume that the functions { gi } are continuous. Assume that there exists a point xˆ such that gi (x) ˆ > 0 for all i. Prove that S has a nonempty interior regardless of how S is represented.

2.3

Convexity

There is one important case where global solutions can be found, the case where the objective function is a convex function and the feasible region is a convex set. Let us first talk about the feasible region. A set S is convex if, for any elements x and y of S, αx + (1 − α)y ∈ S

for all 0 ≤ α ≤ 1.

In other words, if x and y are in S, then the line segment connecting x and y is also in S. Examples of convex and nonconvex sets are given in Figure 2.4. More generally, every set defined by a system of linear constraints is a convex set; see the Exercises. A function f is convex on a convex set S if it satisfies f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y) for all 0 ≤ α ≤ 1 and for all x, y ∈ S. This definition says that the line segment connecting the points (x, f (x)) and (y, f (y)) lies on or above the graph of the function; see Figure 2.5. Intuitively, the graph of the function is bowl shaped. Analogously, a function is concave on S if it satisfies f (αx + (1 − α)y) ≥ αf (x) + (1 − α)f (y)

convex

nonconvex

Figure 2.4. Convex and nonconvex sets.

i

i i

i

i

i

i

2.3. Convexity

book 2008/10/23 page 49 i

49

f f ( y) α f ( x) + (1−α) f ( y )

f ( x) f ( α x+(1−α) y )

α x+(1−α) y

x

y

Figure 2.5. Convex function. for all 0 ≤ α ≤ 1 and for all x, y ∈ S. Concave functions are explored in the Exercises below. Linear functions are both convex and concave. We say that a function is strictly convex if f (αx + (1 − α)y) < αf (x) + (1 − α)f (y) for all x = y and 0 < α < 1 where x, y ∈ S. Let us now return to the discussion of local and global solutions. We define a convex optimization problem to be a problem of the form minimize f (x), x∈S

where S is a convex set and f is a convex function on S. A problem minimize subject to

f (x) gi (x) ≥ 0, i = 1, . . . , m,

is a convex optimization problem if f is convex and the functions { gi } are concave; see the Exercises. The following theorem shows that any local solution of such a problem is also a global solution. This result is important to linear programming, since every linear program is a convex optimization problem. Theorem 2.1 (Global Solutions of Convex Optimization Problems). Let x∗ be a local minimizer of a convex optimization problem. Then x∗ is also a global minimizer. If the objective function is strictly convex, then x∗ is the unique global minimizer. Proof. The proof is by contradiction. Let x∗ be a local minimizer and suppose, by contradiction, that it is not a global minimizer. Then there exists some point y ∈ S satisfying

i

i i

i

i

i

i

50

book 2008/10/23 page 50 i

Chapter 2. Fundamentals of Optimization

f (y) < f (x∗ ). If 0 < α < 1, then f (αx∗ + (1 − α)y) ≤ αf (x∗ ) + (1 − α)f (y) < αf (x∗ ) + (1 − α)f (x∗ ) = f (x∗ ). This shows that there are points arbitrarily close to x∗ (i.e., when α is arbitrarily close to 1) whose function values are strictly less than f (x∗ ). These points are in S because S is convex. This contradicts the definition of a local minimizer. Hence a point such as y cannot exist, and x∗ must be a global minimizer. If the objective function is strictly convex, then a similar argument can be used to show that x∗ is the unique global minimizer; see the Exercises. For general problems it may be as difficult to determine if the function f and the region S are convex as it is to find a global solution, so this result is not always useful. However, there are important practical problems, such as linear programs, where convexity can be guaranteed. We conclude this section by defining a convex combination (weighted average) of a finite set of points. A convex combination is a linear combination whose coefficients are nonnegative and sum to one. Algebraically, the point y is a convex combination of the points { xi }ki=1 if k

αi x i , y= i=1

where

k

αi = 1

and

αi ≥ 0,

i = 1, . . . , k.

i=1

There will normally be many ways in which y can be expressed as a convex combination of { xi }. As an example, consider the points x1 = (0, 0)T, x2 = (1, 0)T, x3 = (0, 1)T, and x4 = (1, 1)T. If y = ( 12 , 12 )T, then y can be expressed as a convex combination of { xi } in the following ways: y = 0x1 + 12 x2 + 12 x3 + 0x4 = 12 x1 + 0x2 + 0x3 + 12 x4 = 14 x1 + 14 x2 + 14 x3 + 14 x4 , and so forth.

2.3.1

Derivatives and Convexity

If a one-dimensional function f has two continuous derivatives, then an alternative definition of convexity can be given that is often easier to check. Such a function is convex if and only if f (x) ≥ 0 for all x ∈ S;

i

i i

i

i

i

i

2.3. Convexity

book 2008/10/23 page 51 i

51

see the Exercises in Section 2.6. For example, the function f (x) = x 4 is convex on the entire real line because f (x) = 12x 2 ≥ 0 for all x. The function f (x) = sin x is neither convex nor concave on the real line because f (x) = − sin x can be both positive and negative. In the multidimensional case the Hessian matrix of second derivatives must be positive semidefinite; that is, at every point x ∈ S y T∇ 2 f (x)y ≥ 0

for all y;

see the Exercises in Section 2.6. (The Hessian matrix is defined in Appendix B.4.) Notice that the vector y is not restricted to lie in the set S. The quadratic function f (x1 , x2 ) = 4x12 + 12x1 x2 + 9x22 is convex over any subset of 2 since y ∇ f (x)y = ( y1 T

2

y2 )

8 12

12 18

y1 y2

= 8y12 + 24y1 y2 + 18y22 = 2(2y1 + 3y2 )2 ≥ 0. Alternatively, it would have been possible to show that the eigenvalues of the Hessian matrix were all greater than or equal to zero. In the one-dimensional case, if a function satisfies f (x) > 0

for all x ∈ S,

then it is strictly convex on S. In the multidimensional case, if the Hessian matrix ∇ 2 f (x) is positive definite for all x ∈ S, then the function is strictly convex on S. This is not an “if and only if ” condition, since the Hessian of a strictly convex function need not be positive definite everywhere (see the Exercises). Now we consider another characterization of convexity that can be applied to functions that have one continuous derivative. In this case a function f is convex over a convex set S if and only if it satisfies f (y) ≥ f (x) + ∇f (x)T(y − x) for all x, y ∈ S. This property states that the function is on or above any of its tangents. (See Figure 2.6.) To prove this property, note that if f is convex, then for any x and y in S and for any 0 < α ≤ 1, f (αy + (1 − α)x) ≤ αf (y) + (1 − α)f (x), so that

f (x + α(y − x)) − f (x) ≤ f (y) − f (x). α

If we let α approach 0 from above, we can conclude that f (y) ≥ f (x) + ∇f (x)T(y − x).

i

i i

i

i

i

i

52

book 2008/10/23 page 52 i

Chapter 2. Fundamentals of Optimization

f (x)

x

Figure 2.6. Convex function with continuous first derivative. Conversely, suppose that the function f satisfies f (y) ≥ f (x) + ∇f (x)T(y − x) for all x and y in S. Let t = αx + (1 − α)y. Then t is also in the set S, so f (x) ≥ f (t) + ∇f (t)T(x − t) and f (y) ≥ f (t) + ∇f (t)T(y − t). Multiplying the two inequalities by α and 1 − α, respectively, and then adding yields the desired result. See the Exercises for details.

Exercises 3.1. Prove that the intersection of a finite number of convex sets is also a convex set. 3.2. Let S1 = { x : x1 + x2 ≤ 1, x1 ≥ 0 } and S2 = { x : x1 − x2 ≥ 0, x1 ≤ 1 }, and let S = S1 ∪ S2 . Prove that S1 and S2 are both convex sets but S is not a convex set. This shows that the union of convex sets is not necessarily convex. 3.3. Consider a feasible region S defined by a set of linear constraints S = { x : Ax ≤ b } . Prove that S is convex. 3.4. Prove that a function f is concave if and only if −f is convex. 3.5. Let f (x) be a function on n . Prove that f is both convex and concave if and only if f (x) = cTx for some constant vector c. 3.6. Prove that a convex combination of convex functions all defined on the same convex set S is also a convex function on S.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 53 i

53

3.7. Let f be a convex function on a convex set S ∈ n . Let k be a nonzero scalar, and define g(x) = kf (x). Prove that if k > 0, then g is a convex function on S, and if k < 0, then g is a concave function on S. 3.8. (Jensen’s Inequality.) Let f be a function on a convex set S ∈ n . Prove that f is convex if and only if k k

αi x i ≤ αi f (xi ) f i=1

i=1

for all x1 , . . . , xm ∈ S and 0 ≤ αi ≤ 1 where ki=1 αi = 1. 3.9. Prove the well-known inequality between the arithmetic mean and the geometric mean of a set of positive numbers: (x1 + · · · + xk )/k ≥ (x1 · · · xk )1/k . Hint: Apply the previous problem to the function f (x) = − log(x). p q 3.10. Consider the function f (x1 , x2 ) = αx1 x2 , defined on S = {x : x > 0}. For what values of α, p, and q is the function convex? Strictly convex? For what values is it concave? Strictly concave? 3.11. Consider the problem maximize f (x), x∈S

where S is a convex set and f is a concave function. Prove that any local maximizer is also a global maximizer. 3.12. Let g1 , . . . , gm be concave functions on n . Prove that the set S = { x : gi (x) ≥ 0, i = 1, . . . , m } is convex. 3.13. Let f be a convex function on the convex set S. Prove that the level set T = { x ∈ S : f (x) ≤ k } is convex for all real number k. 3.14. A function f is said to be quasi convex on the convex set S if every level set of f in S is convex, that is, if { x ∈ S : f (x) ≤ k } is convex for all k.

√ (i) Prove that f (x) = x is a quasi-convex function on S = x ∈ 1 , x ≥ 0 but it is not convex on S. (ii) Prove that f is quasi convex on a convex set S if and only if for every x and y in S and every 0 ≤ α ≤ 1, f (αx + (1 − α)x) ≤ max{f (x), f (y)}. (iii) Prove that any local minimizer of a quasi-convex function on a convex set is also a global minimizer.

i

i i

i

i

i

i

54

book 2008/10/23 page 54 i

Chapter 2. Fundamentals of Optimization

3.15. Let g1 , . . . , gm be concave functions on n . Prove that the set S = { x : gi (x) ≥ 0, i = 1, . . . , m }

3.16.

3.17. 3.18. 3.19.

is convex. Let f : n → 1 be a convex function, and let g : 1 → 1 be a convex nondecreasing function. (The notation f : n → 1 means that f is a real-valued function of n variables; g is a real-valued function of one variable.) Prove that the composite function h : n → 1 defined by h(x) = g(f (x)) is convex. Complete the proof of Theorem 2.1 for the case when the objective function is strictly convex. Express (2, 2)T as a convex combination of (0, 0)T, (1, 4)T, and (3, 1)T. For each of the following functions, determine if it is convex, concave, both, or neither on the real line. If the function is convex or concave, indicate if it is strictly convex or strictly concave. (i) (ii) (iii) (iv) (v) (vi) (vii)

f (x) = 3x 2 + 4x − 5 f (x) = exp(x 2 ) f (x) = 7x − 15 √ f (x) = 1 + x 2 f (x) = 4 − 5x + 3x 2 f (x) = 2x 4 + 3x 3 + 4x 2 f (x) = x/(1 + x 4 ).

3.20. Determine if f (x1 , x2 ) = 2x12 − 3x1 x2 + 5x22 − 2x1 + 6x2 is convex, concave, both, or neither for x ∈ 2 . 3.21. Give an example of a one-dimensional function f that is strictly convex on the real line even though f (x) ˆ = 0 at some point x. ˆ n 3.22. Let g1 , . . . , gm be concave functions on , let f be a convex function on n , and let μ be a positive constant. Prove that the function β(x) = f (x) − μ

m

log gi (x)

i=1

is convex on the set S = { x : gi (x) > 0, i = 1, . . . , m }.

2.4 The General Optimization Algorithm More algorithms for solving optimization problems have been proposed than could possibly be discussed in a single book. This has happened in part because optimization problems can come in so many forms, but even for particular problems such as one-variable unconstrained minimization problems, there are many different algorithms that one could use.

i

i i

i

i

i

i

2.4. The General Optimization Algorithm

book 2008/10/23 page 55 i

55

Despite this diversity of both algorithms and problems, all of the algorithms that we will discuss in any detail in this book will have the same general form. Algorithm 2.1. General Optimization Algorithm I 1. Specify some initial guess of the solution x0 . 2. For k = 0, 1, . . . (i) If xk is optimal, stop. (ii) Determine xk+1 , a new estimate of the solution. This algorithm is so simple that it almost conveys no information at all. However, as we discuss ever more complex algorithms for ever more elaborate problems, it is often helpful to keep in mind that we are still working within this simple and general framework. The algorithm suggests that testing for optimality and determining a new point xk+1 are separate ideas, but this is usually not true. Often the information obtained from the optimality test is the basis for the computation of the new point. For example, if we are trying to solve the one-dimensional problem without constraints minimize f (x), then the optimality test will often be based on the condition f (x) = 0. If f (xk ) = 0, then xk is not optimal, and the sign and value of f (xk ) indicate whether f is increasing or decreasing at the point xk , as well as how rapidly f is changing. Such information is valuable in selecting xk+1 . Many of our algorithms will have a more specific form. Algorithm 2.2. General Optimization Algorithm II 1. Specify some initial guess of the solution x0 . 2. For k = 0, 1, . . . (i) If xk is optimal, stop. (ii) Determine a search direction pk . (iii) Determine a step length αk that leads to an improved estimate of the solution: xk+1 = xk + αk pk . In this algorithm, pk is a search direction that we hope points in the general direction of the solution, or that “improves” our solution in some sense. The scalar αk is a step length that determines the point xk+1 ; once the search direction pk has been computed, the step length αk is found by solving some auxiliary one-dimensional problem; see Figure 2.7.

i

i i

i

i

i

i

56

book 2008/10/23 page 56 i

Chapter 2. Fundamentals of Optimization

xk +1 = xk + α k pk

xk pk

Figure 2.7. General optimization algorithm. Why do we not just solve for the solution directly? Except for the simplest optimization problems, formulas for the solution do not exist. For example, consider the problem minimize f (x) = ex + x 2 . The optimality condition f (x) = 0 has the form ex + 2x = 0, but there is no simple formula for the solution to this equation. Hence for many problems some form of iterative method must be employed to determine a solution. (Any finite sequence of calculations is a formula of some sort, and so the solution of a general optimization problem can only be found as the limit of an infinite sequence. When we refer to computing a “solution” we most always mean an approximate solution, an element of this sequence that has sufficient accuracy. Determining the exact solution, or the limit of such a sequence, would be an “infinite” calculation.) Why do we split the computation of xk+1 into two calculations? Ideally we would like to have xk+1 = xk + pk where pk solves minimize f (xk + p), p

but this is equivalent to our original problem minimize f (x). x

Instead a compromise is employed. For an unconstrained problem of the form here, we will typically require that the search direction pk be a descent direction for the function f at the point xk . This means that for “small” steps taken along pk the function value is guaranteed to decrease: f (xk + αpk ) < f (xk ) for 0 < α ≤

i

i i

i

i

i

i

2.4. The General Optimization Algorithm

book 2008/10/23 page 57 i

57

f ( xk )

f ( xk + α k p k )

xk + α k p k xk pk

Figure 2.8. Line search. for some . For a linear function f (x) = cTx, pk is a descent direction if cT(xk + pk ) = cTxk + cTpk < cTxk , or in other words if cTpk < 0. Techniques for computing descent directions for nonlinear functions are discussed in Chapter 11. With pk available, we would ideally like to determine the step length αk so as to minimize the function in that direction: minimize f (xk + αpk ). α≥0

This is a problem only involving one variable, the parameter α. The restriction α ≥ 0 is imposed because pk is a descent direction. Even for this one-dimensional problem there may not be a simple formula for the solution, so it too cannot normally be solved exactly. Instead, an αk is computed that either “sufficiently decreases” the value of f or yields an “approximate minimizer” of the function f in the direction pk . Both these terms have precise theoretical meanings that will be specified in later chapters, and computational techniques are available that allow αk to be determined at reasonable cost. The calculation of αk is called a line search because it corresponds to a search along the line xk + αpk defined by α. The line search is illustrated in Figure 2.8. Algorithm II with its three major steps (the optimality test, computation of pk , and computation of αk ) has been the basis for a great many of the most successful optimization algorithms ever developed. It has been used to develop many software packages for nonlinear optimization, and it is also present implicitly as part of the simplex method for linear programming. It is not the only approach possible (see Section 11.6), but it is the approach that we will emphasize in this book.

i

i i

i

i

i

i

58

book 2008/10/23 page 58 i

Chapter 2. Fundamentals of Optimization

Using the concept of descent directions, we can establish an important condition for optimality for the constrained problem minimize f (x). x∈S

We define p to be a feasible descent direction at a point xk ∈ S if, for some > 0, xk + αp ∈ S

and

f (xk + αp) < f (xk )

for all 0 < α ≤ . If a feasible descent direction exists at a point xk , then it is possible to move a short distance along this direction to a feasible point with a better objective value. Then xk cannot be a local minimizer for this problem. Hence, if x∗ is a local minimizer, there cannot exist any feasible descent directions at x∗ . This result will be used to derive optimality conditions for a variety of optimization problems.

Exercises 4.1. Let xk = (2, 1)T and pk = (−1, 3)T. Plot the set { x : x = xk + αpk , α ≥ 0 }. 4.2. Find all descent directions for the linear function f (x) = x1 − 2x2 + 3x3 . Does your answer depend on the value of x? 4.3. Consider the problem minimize subject to

f (x) = −x1 − x2 x1 + x2 ≤ 2 x1 , x2 ≥ 0.

(i) Determine the feasible directions at x = (0, 0)T, (0, 1)T, (1, 1)T, and (0, 2)T. (ii) Determine whether there exist feasible descent directions at these points, and hence determine which (if any) of the points can be local minimizers.

2.5

Rates of Convergence

Many of the algorithms discussed in this book do not find a solution in a finite number of steps. Instead these algorithms compute a sequence of approximate solutions that we hope get closer and closer to a solution. When discussing such an algorithm, the following two questions are often asked: • Does it converge? • How fast does it converge? It is the second question that is the topic of this section. If an algorithm converges in a finite number of steps, the cost of that algorithm is often measured by counting the number of steps required, or by counting the number of arithmetic operations required. For example, if Gaussian elimination is applied to a system

i

i i

i

i

i

i

2.5. Rates of Convergence

book 2008/10/23 page 59 i

59

of n linear equations, then it will require about n3 operations. This cost is referred to as the computational complexity of the algorithm. This concept is discussed in more detail in Chapter 9 in the context of linear programming. For many optimization methods, the number of operations or steps required to find an exact solution will be infinite, so some other measure of efficiency must be used. The rate of convergence is one such measure. It describes how quickly the estimates of the solution approach the exact solution. Let us assume that we have a sequence of points xk converging to a solution x∗ . We define the sequence of errors to be ek = xk − x∗ . Note that lim ek = 0.

k→∞

We say that the sequence { xk } converges to x∗ with rate r and rate constant C if lim

k→∞

ek+1 =C ek r

and C < ∞. To understand this idea better, let us look at some examples. Initially let us assume that we have ideal convergence behavior ek+1 = C ek r

for all k,

so that we can avoid having to deal with limits. When r = 1 this is referred to as linear convergence: ek+1 = C ek . If 0 < C < 1, then the norm of the error is reduced by a constant factor at every iteration. If C > 1, then the sequence diverges. (What can happen when C = 1?) If we choose C = 0.1 = 10−1 and e0 = 1, then the norms of the errors are 1, 10−1 , 10−2 , 10−3 , 10−4 , 10−5 , 10−6 , 10−7 , and seven-digit accuracy is obtained in seven iterations, a good result. On the other hand, if C = 0.99, then the norms of the errors take on the values 1, 0.99, 0.9801, 0.9703, 0.9606, 0.9510, 0.9415, 0.9321, . . . , and it would take about 1600 iterations to reduce the error to 10−7 , a less impressive result. If r = 1 and C = 0, the convergence is called superlinear. Superlinear convergence includes all cases where r > 1 since if ek+1 = C < ∞, k→∞ ek r lim

then lim

k→∞

ek+1 ek+1 ek r−1 = C × lim ek r−1 = 0. = lim k→∞ ek r k→∞ ek

i

i i

i

i

i

i

60

book 2008/10/23 page 60 i

Chapter 2. Fundamentals of Optimization

When r = 2, the convergence is called quadratic. As an example, let r = 2, C = 1, and e0 = 10−1 . Then the sequence of error norms is 10−1 , 10−2 , 10−4 , 10−8 , and so three iterations are sufficient to achieve seven-digit accuracy. In this form of quadratic convergence the error is squared at each iteration. Another way of saying this is that the number of correct digits in xk doubles at every iteration. Of course, if the constant C = 1, then this is not an accurate statement, but it gives an intuitive sense of the attractions of a quadratic convergence rate. For optimization algorithms there is one other important case, and that is when 1 < r < 2. This is another special case of superlinear convergence. This case is important because (a) it is qualitatively similar to quadratic convergence for the precision of common computer calculations, and (b) it can be achieved by algorithms that only compute first derivatives, whereas to achieve quadratic convergence it is often necessary to compute second derivatives as well. To get a sense of what this form of superlinear convergence looks like, let r = 1.5, C = 1, and e0 = 10−1 . Then the sequence of error norms is 1 × 10−1 , 3 × 10−2 , 6 × 10−3 , 4 × 10−4 , 9 × 10−6 , 3 × 10−8 , and five iterations are required to achieve single-precision accuracy. Example 2.2 (Rate of Convergence of a Sequence). Consider the sequence 2, 1.1, 1.01, 1.001, 1.0001, 1.00001, . . . with general term xk = 1 + 10−k . This sequence converges to x∗ = 1 and ek = xk − x∗ = 10−k . Hence ek+1 10−(k+1) 1 = , lim = lim −k k→∞ ek k→∞ 10 10 so that the sequence converges linearly with rate constant Now consider the sequence

1 . 10

4, 2.5, 2.05, 2.00060975, . . . defined by the formula xk+1 =

xk 1 4 2 = xk + + 2 xk 2 xk

with x0 = 4. It can be shown that xk → 2. Also ek+1 = xk+1 − x∗ xk 2 = + −2 2 xk 1 (x 2 + 4 − 4xk ) = 2xk k 1 2 1 (xk − 2)2 = e . = 2xk 2xk k

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 61 i

61

From this it follows that lim

k→∞

ek+1 1 1 = = . 2 2|x∗ | 4 ek

Hence this sequence converges quadratically with rate constant 14 . In practical situations ideal convergence behavior is not always observed. The rate of convergence is only observed in the limit, so at the initial iterations there is no guarantee that the norm of the error will be reduced at all, let alone at any predictable rate. In fact, it is not uncommon for an algorithm to expend almost all of its effort far from the solution, with this asymptotic convergence rate only becoming apparent at the last few iterations. In addition, the algorithm will be terminated after a finite number of iterations when the error in the solution is below some tolerance, and so the limiting behavior described here may be only imperfectly observed. There is ambiguity in the definition of the rate of convergence. For instance, any sequence that converges quadratically also converges linearly, but with rate constant equal to zero. It is common when discussing algorithms to refer to the fastest rate at which the algorithm typically converges. For example, in Section 2.7 we show that a certain sequence { xk } satisfies f (x∗ ) xk+1 − x∗ ≈ (xk − x∗ )2 , 2f (x∗ ) where x∗ = lim xk and f is a function used to define the sequence. Based on this formula, the sequence { xk } is said to converge quadratically. However, if f (x∗ ) = 0 the right-hand side is not defined. On the other hand, if f (x∗ ) = 0 but f (x∗ ) = 0, then the sequence can converge faster than quadratically. “Typically” these things do not happen. In many situations people use a sort of shorthand and only refer to the convergence rate without mention of the rate constant. For quadratic rates of convergence this is not too misleading, since the ideal behavior and the observed behavior are similar unless the rate constant is exceptionally large or small. However, in the linear case the rate constant plays an important role. It is not uncommon to see rate constants that are close to one, and more unusual to see rate constants near zero. As a result, linear convergence rates are often considered to be inferior. However, if the rate constant is small, then there is little practical difference between linear and higher rates of convergence at the level of precision common on many computers. In summary, even though it is generally true that higher rates of convergence often represent improvements in performance, this is not guaranteed, and an algorithm with a linear rate of convergence can sometimes be effective in a practical setting.

Exercises 5.1. For each of the following sequences, prove that the sequence converges, find its limit, and determine the rate of convergence and the rate constant.

i

i i

i

i

i

i

62

book 2008/10/23 page 62 i

Chapter 2. Fundamentals of Optimization (i) The sequence 1 1 1 1 , , , , 1 ,... 2 4 8 16 32

with general term xk = 2−k , for k = 1, 2, . . . . (ii) The sequence 1.05, 1.0005, 1.000005, . . . with general term xk = 1 + 5 × 10−2k , for k = 1, 2, . . . . k (iii) The sequence with general term xk = 2−2 . 2 (iv) The sequence with general term xk = 3−k . k (v) The sequence with general term xk = 1 − 2−2 for k odd, and xk = 1 + 2−k for k even. 5.2. Consider the sequence defined by x0 = a > 0 and a 1 xk+1 = 2 xk + . xk √ Prove that this sequence converges to x∗ = a and that the convergence rate is quadratic, and determine the rate constant. 5.3. Consider a convergent sequence { xk } and define a second sequence { yk } with yk = cxk where c is some nonzero constant. What is the relationship between the convergence rates and rate constants of the two sequences? 5.4. Let { xk } and { ck } be convergent sequences, and assume that lim ck = c = 0.

k→∞

Consider the sequence { yk } with yk = ck xk . Is this sequence guaranteed to converge? If so, can its convergence rate and rate constant be determined from the rates and rate constants for the sequences { xk } and { ck }?

2.6 Taylor Series The Taylor series is a tool for approximating a function f near a specified point x0 . The approximation obtained is a polynomial, i.e., a function that is easy to manipulate. The Taylor series is a general tool—it can be applied whenever the function has derivatives— and it has many uses: • It allows you to estimate the value of the function near the given point (when the function is difficult to evaluate directly). • The derivatives and integral of the approximation can be used to estimate the derivatives and integral of the original function. • It is used to derive many algorithms for finding zeroes of functions (see below), for minimizing functions, etc.

i

i i

i

i

i

i

2.6. Taylor Series

book 2008/10/23 page 63 i

63

Since many problems are difficult to solve exactly, and an approximate solution is often adequate (the data for the problem may be inaccurate), the Taylor series is widely used, both theoretically and practically. Even if the data are exact, an approximate solution may be adequate, and in any case it is all we can hope for under most circumstances. How does it work? We first consider the case of a one-dimensional function f with n continuous derivatives. Let x0 be a specified point (say x0 = 17.5 or x0 = 0). Then the nth order Taylor series approximation is 1 p n (n) f (x0 + p) ≈ f (x0 ) + pf (x0 ) + p 2 f (x0 ) + · · · + f (x0 ). 2 n! Here f (n) (x0 ) is the nth derivative of f at the point x0 , and n! = n(n − 1)(n − 2) · · · 3 · 2 · 1. Notice that 12 p 2 f (x0 ) = (p2 /2!)f (2) (x0 ). In this formula, p is a variable; we will decide later what values p will take. The approximation will normally only be accurate for small values of p. √ Example 2.3 (Taylor Series). Let f (x) = x and let x0 = 1. Then √ √ f (x0 ) = x0 = 1 = 1 −

f (x0 ) = 12 x0

1 2 −

f (x0 ) = − 14 x0 −

f (x0 ) = 38 x0 .. .

5 2

1

= 12 1− 2 = 3 2

1 2

3

= − 14 1− 2 = − 14 5

= 38 1− 2 =

3 8

Hence, substituting into the formula for the Taylor series, 1 + p = f (x0 + p) ≈ f (x0 ) + pf (x0 ) + 12 p 2 f (x0 ) + 16 p 3 f (x0 ) = 1 + p( 12 ) + 12 p 2 (− 14 ) + 16 p 3 ( 38 ) = 1 + 12 p − 18 p 2 +

1 3 p . 16

How do we use this? Suppose we want to approximate f (1.6). Then x0 + p = 1 + p = 1.6, and so p = 0.6: √ √ 1.6 = 1 + 0.6 1 (0.6)3 ≈ 1.2685. ≈ 1 + 12 (0.6) − 18 (0.6)2 + 16 The true value is 1.264911 . . . ; the approximation is accurate to three digits. The first two terms of the Taylor series give us the formula for the tangent line for the function f at the point x0 . We commonly define the tangent line in terms of a general point x, and not in terms of p. Since x0 + p = x, we can rearrange to get p = x − x0 . Substitute this into the first two terms of the series to get the tangent line: y = f (x0 ) + (x − x0 )f (x0 ).

i

i i

i

i

i

i

64

book 2008/10/23 page 64 i

Chapter 2. Fundamentals of Optimization 2.5

2

f (x )=√x

1.5

1

quadratic approximation 1⫹(x⫺1)/2 ⫺(x⫺1)2/8 0.5

0

0

1

2

3

4

5

x

6

Figure 2.9. Taylor series approximation. For the example above we get y = 1 + (x − 1) 12

or

y = 12 (x + 1).

The first three terms of the Taylor series give a quadratic approximation to the function f at the point x0 . This is illustrated in Figure 2.9. So far we have only considered a Taylor series for a function of one variable. The Taylor series can also be derived for real-valued functions of many variables. If we use matrix and vector notation, then there is an obvious analogy between the two cases: 1-variable: f (x0 + p) = f (x0 ) + pf (x0 ) + 12 p 2 f (x0 ) + · · · n-variables: f (x0 + p) = f (x0 ) + p T∇f (x0 ) + 12 p T∇ 2 f (x0 )p + · · · . In the second line above x0 and p are both vectors. The notation ∇f (x0 ) refers to the gradient of the function f at the point x = x0 . The notation ∇ 2 f (x0 ) represents the Hessian of f at the point x = x0 . (See Appendix B.4.) The higher-order terms of the Taylor series can also be written down, but the notation is more complex and they will not be required in this book. Example 2.4 (Multidimensional Taylor Series). Consider the function f (x1 , x2 ) = x13 + 5x12 x2 + 7x1 x22 + 2x23 at the point x0 = (−2, 3)T. The gradient of this function is ∇f (x) = and the Hessian matrix is ∇ 2 f (x) =

3x12 + 10x1 x2 + 7x22

5x12 + 14x1 x2 + 6x22

6x1 + 10x2 10x1 + 14x2

10x1 + 14x2 14x1 + 12x2

.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 65 i

65

At the point x0 = (−2, 3)T these become 15 and ∇f (x0 ) = −10

∇ f (x0 ) = 2

18 22

22 8

.

If p = (p1 , p2 )T = (0.1, 0.2)T, then f (−1.9, 3.2) = f (−2 + 0.1, 3 + 0.2) = f (x0 + p) 1 ≈ f (x0 ) + p T∇f (x0 ) + p T∇ 2 f (x0 )p 2 1 15 + ( 0.1 = −20 + ( 0.1 0.2 ) −10 2 = −20 − 0.5 + 0.69 = −19.81.

0.2 )

18 22

22 8

0.1 0.2

The true value is f (−1.9, 3.2) = −19.755, so the approximation is accurate to three digits. The Taylor series for multidimensional problems can also be derived using summations rather than matrix-vector notation: n n n

∂f (x) ∂ 2 f (x) 1

pi + pi pj + ···. f (x0 + p) = f (x0 ) + ∂xi x=x0 2 i=1 j =1 ∂xi ∂xj x=x0 i=1 The formula is the same as before; only the notation has changed. There is an alternate form of the Taylor series that is often used, called the remainder form. If three terms are used it looks like 1-variable: f (x0 + p) = f (x0 ) + pf (x0 ) + 12 p 2 f (ξ ) n-variables: f (x0 + p) = f (x0 ) + p T∇f (x0 ) + 12 p T∇ 2 f (ξ )p. The point ξ is an unknown point lying between x0 and x0 + p. In this form the series is exact, but it involves an unknown point, so it cannot be evaluated. This form of the series is often used for theoretical purposes, or to derive bounds on the accuracy of the series. The accuracy of the series can be analyzed by establishing bounds on the final “remainder” term. If the remainder form of the series is used, but with only two terms, then we obtain 1-variable: f (x0 + p) = f (x0 ) + pf (ξ ) n-variables: f (x0 + p) = f (x0 ) + p T∇f (ξ ). This result is known as the mean-value theorem.

Exercises 6.1. Find the first four terms of the Taylor series for f (x) = log(1 + x)

i

i i

i

i

i

i

66

book 2008/10/23 page 66 i

Chapter 2. Fundamentals of Optimization

about the point x0 = 0. Evaluate the series for p = 0.1 and p = 0.01 and compare with the value of f (x0 +p). Derive the remainder form of the Taylor series using five terms (the four terms you already derived plus a remainder term). Derive a bound on the accuracy of the four-term series. Compare the bound you derived with the actual errors for p = 0.1 and p = 0.01. 6.2. Find the first three terms of the Taylor series for the following functions. (i) f (x) = sin x about the point x0 = π . (ii) f (x) = 2/(3x + 5) about the point x0 = −1. (iii) f (x) = ex about the point x0 = 0. 6.3. Determine the general term in the Taylor series for the function e−1/x if x > 0, f (x) = 0 if x ≤ 0, about the point x0 = 0. Compare this with the Taylor series for the function f (x) = 0 about the same point. What can you conclude about the limitations of the Taylor series as a tool for approximating functions? 6.4. Find the first three terms of the Taylor series for f (x1 , x2 ) = 3x14 − 2x13 x2 − 4x12 x22 + 5x1 x23 + 2x24 at the point x0 = (1, −1)T. Evaluate the series for p = (0.1, 0.01)T and compare with the value of f (x0 + p). 6.5. Find the first three terms of the Taylor series for f (x1 , x2 ) = x12 + x22 about the point x0 = (3, 4)T. 6.6. Prove that if pT∇f (xk ) < 0, then f (xk + p) < f (xk ) for > 0 sufficiently small. Hint: Expand f (xk + p) in a Taylor series about the point xk and look at f (xk + p) − f (xk ). 6.7. (The results of this and the next problem show that a function f is convex on a convex set S if the Hessian matrix ∇ 2 f (x) is positive semidefinite for all x ∈ S.) Let f be a real-valued function of n variables x with continuous first derivatives. Prove that f is convex on the convex set S if and only if f (y) ≥ f (x) + ∇f (x)T(y − x) for all x, y ∈ S. 6.8. Let f be a real-valued function of n variables x with continuous second derivatives. Use the result of the previous problem to prove that f is convex on the convex set S if ∇ 2 f (x) is positive semidefinite for all x ∈ S.

i

i i

i

i

i

i

2.7. Newton’s Method for Nonlinear Equations

2.7

book 2008/10/23 page 67 i

67

Newton’s Method for Nonlinear Equations

Let us now consider methods for solving f (x) = 0. We first consider the one-dimensional case where x is a scalar and f is a real-valued function. Later we will look at the n-dimensional case where x = (x1 , . . . , xn )T and f (x) = (f1 (x), . . . , fn (x))T. Note that both x and f (x) are vectors of the same length n. Throughout this section we assume that the function f has two continuous derivatives. If f (x) is a linear function, it is possible to find a solution if the system is nonsingular. The cost of finding the solution is predictable—it is the cost of applying Gaussian elimination. Except for a few isolated special cases, such as quadratic equations in one variable, in the nonlinear case it is not possible to guarantee that a solution can be found, nor is it possible to predict the cost of finding a solution. However, the situation is not totally bleak. There are effective algorithms that work much of the time, and that are efficient on a wide variety of problems. They are based on solving a sequence of linear equations. As a result, if the function f is linear, they can be as efficient as the techniques for linear systems. Also, we can apply our knowledge about linear systems in the nonlinear case. The methods that we will discuss are based on Newton’s method. Given an estimate of the solution xk , the function f is approximated by the linear function consisting of the first two terms of the Taylor series for the function f at the point xk . The resulting linear system is then solved to obtain a new estimate of the solution xk+1 . To derive the formulas for Newton’s method, we first write out the Taylor series for the function f at the point xk : f (xk + p) ≈ f (xk ) + pf (xk ). If f (xk ) = 0, then we can solve the equation f (x∗ ) ≈ f (xk ) + pf (xk ) = 0 for p to obtain

p = −f (xk )/f (xk ).

The new estimate of the solution is then xk+1 = xk + p or xk+1 = xk − f (xk )/f (xk ). This is the formula for Newton’s method. Example 2.5 (Newton’s Method). As an example, consider the one-dimensional problem f (x) = 7x 4 + 3x 3 + 2x 2 + 9x + 4 = 0. Then

f (x) = 28x 3 + 9x 2 + 4x + 9

and the formula for Newton’s method is xk+1 = xk −

7xk4 + 3xk3 + 2xk2 + 9xk + 4 . 28xk3 + 9xk2 + 4xk + 9

i

i i

i

i

i

i

68

book 2008/10/23 page 68 i

Chapter 2. Fundamentals of Optimization

Table 2.1. Newton’s method for a one-dimensional problem. xk

f (xk )

|xk − x∗ |

0 −0.4444444444444444 −0.5063255748934088 −0.5110092428604380 −0.5110417864454134 −0.5110417880368663

4 × 10 4 × 10−1 3 × 10−2 2 × 10−4 9 × 10−9 0

5 × 10−1 7 × 10−2 5 × 10−3 3 × 10−5 2 × 10−9 0

k 0 1 2 3 4 5

0

If we start with the initial guess x0 = 0, then x1 = x0 −

7x04 + 3x03 + 2x02 + 9x0 + 4 28x03 + 9x02 + 4x0 + 9

7 × 0 4 + 3 × 03 + 2 × 02 + 9 × 0 + 4 28 × 03 + 9 × 02 + 4 × 0 + 9 4 = 0 − = −4/9 = −0.4444 . . . . 9 =0−

At the next iteration we substitute x1 = −4/9 into the formula for Newton’s method and obtain x2 ≈ −0.5063. The complete iteration is given in Table 2.1. Newton’s method corresponds to approximating the function f by its tangent line at the point xk . The point where the tangent line crosses the x-axis (i.e., a zero of the tangent line) is taken as the new estimate of the solution. This geometric interpretation is illustrated in Figure 2.10. The performance of Newton’s method in Example 2.5 is considered to be typical for this method. It converges rapidly and, once xk is close to the solution x∗ , the error is approximately squared at every iteration. It has a quadratic rate of convergence as we now show. It is not difficult to analyze the convergence of Newton’s method using the Taylor series. Define the error in xk by ek = xk − x∗ . Using the remainder form of the Taylor series: 0 = f (x∗ ) = f (xk − ek ) = f (xk ) − ek f (xk ) + 12 ek2 f (ξ ). Dividing by f (xk ) and rearranging gives ek −

f (xk ) 1 2 f (ξ ) = e . f (xk ) 2 k f (xk )

Since ek = xk − x∗ we obtain xk −

f (ξ ) f (xk ) 1 − x∗ = (xk − x∗ )2 , f (xk ) 2 f (xk )

i

i i

i

i

i

i

2.7. Newton’s Method for Nonlinear Equations

book 2008/10/23 page 69 i

69

f ( x) = 0 x2

x1

x0

Figure 2.10. Newton’s method—geometric interpretation. which is the same as xk+1 − x∗ =

f (ξ ) 1 (xk − x∗ )2 . 2 f (xk )

If the sequence { xk } converges, then ξ → x∗ , and hence when xk is sufficiently close to x∗ , 1 f (x∗ ) xk+1 − x∗ ≈ (xk − x∗ )2 2 f (x∗ ) indicating that the error in xk is approximately squared at every iteration, assuming that the rate constant 12 f (x∗ )/f (x∗ ) is not ridiculously large or small. These results are summarized in the following theorem. Theorem 2.6 (Convergence of Newton’s Method). Assume that the function f (x) has two continuous derivatives. Let x∗ be a zero of f with f (x∗ ) = 0. If |x0 − x∗ | is sufficiently small, then the sequence defined by xk+1 = xk − f (xk )/f (xk ) converges quadratically to x∗ with rate constant C = |f (x∗ )/2f (x∗ )|. Proof. See the Exercises. Example 2.5 also shows that the function values f (xk ) converge quadratically to zero. This also follows from the Taylor series: 0 = f (x∗ ) = f (xk + ek ) = f (xk ) + ek f (ξ ).

i

i i

i

i

i

i

70

book 2008/10/23 page 70 i

Chapter 2. Fundamentals of Optimization

This can be rearranged to obtain f (xk ) = −ek f (ξ ) = −f (ξ )(x∗ − xk ) rate if f (x∗ ) = 0. so that f (xk ) is proportional to (x∗ −xk ). Hence they converge

at the same In the argument above we have assumed that f (xk ) and f (x∗ ) are all nonzero. If f (xk ) = 0 for some k, then Newton’s method fails (there is a division by zero in the formula). Geometrically this means that the tangent line is horizontal, parallel to the x-axis, and so it does not have a zero. If on the other hand f (xk ) = 0 for all k, f (x∗ ) = 0, but f (x∗ ) = 0, then the coefficient in the convergence analysis f (ξ ) 2f (xk ) tends to infinity, and the algorithm does not have a quadratic rate of convergence. If f is a polynomial, this corresponds to f having a multiple zero at the point x∗ ; this case is illustrated in Example 2.7. Example 2.7 (Newton’s Method; f (x∗ ) = 0). We now apply Newton’s method to the example f (x) = x 4 − 7x 3 + 17x 2 − 17x + 6 = (x − 1)2 (x − 2)(x − 3) = 0. This function has a multiple zero at x∗ = 1 and at this point f (x∗ ) = f (x∗ ) = 0. The derivative of f is f (x) = 4x 3 − 21x 2 + 34x − 17 and the formula for Newton’s method is xk+1 = xk −

x 4 − 7x 3 + 17x 2 − 17x + 6 . 4x 3 − 21x 2 + 34x − 17

If we start with the initial guess x0 = 1.1, then the method converges to x∗ = 1 at a linear rate, whereas if we start with x0 = 2.1, then the method converges to x∗ = 2 at a quadratic rate. The results for these iterations are given in Tables 2.2 and 2.3. (In the final lines of both tables the function value f (xk ) is zero; this is the value calculated by the computer and is a side effect of using finite-precision arithmetic.) In Example 2.7 the slow convergence only occurs when the method converges to a solution where f (x∗ ) = 0. Quadratic convergence is obtained at the other roots, where f (x∗ ) = 0. It should also be noticed that the accuracy of the solution was worse at a multiple root. This too can be explained by the Taylor series, although this time we expand about the point x∗ : f (xk ) = f (x∗ + ek ) = f (x∗ ) + ek f (x∗ ) + 12 ek2 f (ξ ). At the solution, f (x∗ ) = 0, and since this is assumed to be a multiple zero, f (x∗ ) = 0 as well. Hence f (xk ) = 12 ek2 f (ξ ) = ( 12 f (ξ ))(xk − x∗ )2 .

i

i i

i

i

i

i

2.7. Newton’s Method for Nonlinear Equations

book 2008/10/23 page 71 i

71

Table 2.2. Newton’s method: f (x∗ ) = 0 (x0 = 1.1). k

xk

f (xk )

0 1 2 3 4 5 6 7 8 9 10

1.100000000000000 1.045541401273894 1.021932395992710 1.010779316995807 1.005345328998912 1.002661858321646 1.001328260855184 1.000663467429195 1.000331568468827 1.000165742989413 1.000082861192927 .. .

24 25

1.000000075780004 1.000000040618541

−2

|xk − x∗ |

2 × 10 4 × 10−3 9 × 10−4 2 × 10−4 6 × 10−5 1 × 10−5 4 × 10−6 9 × 10−7 2 × 10−7 6 × 10−8 1 × 10−8

1 × 10−1 5 × 10−2 2 × 10−2 1 × 10−2 5 × 10−3 3 × 10−3 1 × 10−3 7 × 10−4 3 × 10−4 2 × 10−4 8 × 10−5

1 × 10−14 0

8 × 10−8 4 × 10−8

Table 2.3. Newton’s method: f (x∗ ) = 0 (x0 = 2.1). k 0 1 2 3 4

xk

|xk − x∗ |

f (xk )

2.100000000000000 2.006603773584894 2.000042472785593 2.000000001803635 2.000000000000001

−1

−1 × 10 −7 × 10−3 −4 × 10−5 −2 × 10−9 0

1 × 10−1 7 × 10−3 4 × 10−5 2 × 10−9 9 × 10−16

The function value f (xk ) is now proportional to the square of the error (xk − x∗ ). So, for example, if f (xk ) = 10−16 (about the level of machine precision in typical double precision arithmetic), and 12 f (ξ ) = 1, then xk − x∗ = 10−8 . In this case the point xk is only accurate to half precision. The proof of convergence for Newton’s method requires that the initial point x0 be sufficiently close to a zero. If not, the method can fail to converge, even when there is no division by zero in the formula for the method. This is illustrated in the example below. In Chapter 11 we discuss safeguards that can be added to Newton’s method that prevent this from happening. Example 2.8 (Failure of Newton’s Method). Consider the problem f (x) =

ex − e−x = 0. ex + e−x

i

i i

i

i

i

i

72

book 2008/10/23 page 72 i

Chapter 2. Fundamentals of Optimization

x3

x1

0 x0

x2

Figure 2.11. Failure of Newton’s method. If Newton’s method is used with the initial guess x0 = 1, then the sequence of approximate solutions is x0 = 1, x1 = −0.8134, x2 = 0.4094 x3 = −0.0473, x4 = 7.06 × 10−5 , x5 = −2.35 × 10−13 and at the final point f (x5 ) = −2.35 × 10−13 , so the method converges to a solution. However if x0 = 1.1, then x0 = 1.1, x3 = −1.6952,

x1 = −1.1286, x2 = 1.2341 x4 = 5.7154, x5 = −2.30 × 104

and at the next iteration an overflow results. At the final point f (x5 ) = 1, so the sequence is not converging to a solution. A graph of the function is given in Figure 2.11. This function is also called the hyperbolic tangent function, f (x) = tanh x.

2.7.1

Systems of Nonlinear Equations

Much of the discussion in the one-dimensional case can be transferred with only minor changes to the n-dimensional case. Suppose now that we are solving f (x) = 0, where this represents f1 (x1 , . . . , xn ) = 0, f2 (x1 , . . . , xn ) = 0, .. . fn (x1 , . . . , xn ) = 0.

i

i i

i

i

i

i

2.7. Newton’s Method for Nonlinear Equations

book 2008/10/23 page 73 i

73

Define the matrix ∇f (x) with columns ∇f1 (x), . . . , ∇fn (x). This is the transpose of the Jacobian of f at the point x. (The Jacobian is discussed in Appendix B.4.) As before, we write out the Taylor series approximation for the function f at the point xk : f (xk + p) ≈ f (xk ) + ∇f (xk )Tp, where p is now a vector. Now we solve the equation f (x∗ ) ≈ f (xk ) + ∇f (xk )Tp = 0 for p to obtain

p = −∇f (xk )−T f (xk ).

The new estimate of the solution is then xk+1 = xk + p = xk − ∇f (xk )−T f (xk ). This is the formula for Newton’s method in the n-dimensional case. Example 2.9 (Newton’s Method in n Dimensions). As an example, consider the twodimensional problem f1 (x1 , x2 ) = 3x1 x2 + 7x1 + 2x2 − 3 = 0, f2 (x1 , x2 ) = 5x1 x2 − 9x1 − 4x2 + 6 = 0.

Then ∇f (x1 , x2 ) =

3x2 + 7 3x1 + 2

5x2 − 9 5x1 − 4

,

and the formula for Newton’s method is −T 3x1 x2 + 7x1 + 2x2 − 3 3x2 + 7 5x2 − 9 . xk+1 = xk − 3x1 + 2 5x1 − 4 5x1 x2 − 9x1 − 4x2 + 6 If we start with the initial guess x0 = (1, 2)T, then −T 3x1 x2 + 7x1 + 2x2 − 3 3x2 + 7 5x2 − 9 x1 = x0 − 3x1 + 2 5x1 − 4 5x1 x2 − 9x1 − 4x2 + 6 −T 13 1 1 14 − = 5 1 2 −1 −1.375 2.375 1 . = − = 5.375 −3.375 2

The complete iteration is given in Table 2.4. In the n-dimensional case, Newton’s method corresponds to approximating the function f by a linear function at the point xk . The zero of this linear approximation is the new estimate xk+1 . As in the one-dimensional case, the method typically converges with a

i

i i

i

i

i

i

74

book 2008/10/23 page 74 i

Chapter 2. Fundamentals of Optimization

Table 2.4. Newton’s method for an n-dimensional problem. k

x1k

x2k

f 2

x − x∗ 2

0 1 2 3 4 5 6 7 8

1.0000000 × 100 −1.3749996 × 100 −5.4903371 × 10−1 −1.6824928 × 10−1 −2.7482068 × 10−2 −1.0090199 × 10−3 −1.4637396 × 10−6 −3.0852447 × 10−12 −2.0216738 × 10−18

2.0000000 5.3749991 3.0472771 1.9741571 1.5774495 1.5028436 1.5000041 1.5000000 1.5000000

1 × 101 5 × 101 1 × 101 2 × 100 3 × 10−1 1 × 10−2 2 × 10−5 4 × 10−11 0

1 × 100 4 × 100 2 × 100 5 × 10−1 8 × 10−2 3 × 10−3 4 × 10−6 9 × 10−12 2 × 10−18

quadratic rate of convergence, as the theorem below indicates. A proof of quadratic convergence for Newton’s method can be found in the book by Ortega and Rheinboldt (1970, reprinted 2000). Theorem 2.10 (Convergence of Newton’s Method in n Dimensions). Assume that the function f (x) has two continuous derivatives. Assume that x∗ satisfies f (x∗ ) = 0 with ∇f (x∗ ) nonsingular. If x0 − x∗ is sufficiently small, then the sequence defined by xk+1 = xk − (∇f (xk ))−1 f (xk ) converges quadratically to x∗ . Our discussion has implicitly assumed that every Jacobian matrix ∇f (xk )T is nonsingular, that is, the system of linear equations that defines the new point xk+1 has a unique solution. If this assumption is not satisfied, Newton’s method fails. If the Jacobian matrix at the solution ∇f (x∗ )T is singular, then there is no guarantee of quadratic convergence. The proof of convergence assumes that xk is “sufficiently close” to x∗ , as in the one-dimensional case. If it is not, the method can diverge.

Exercises 7.1. Apply Newton’s method to find all three solutions of f (x) = x 3 − 5x 2 − 12x + 19 = 0. You will have to use several different initial guesses. 7.2. Let a be some positive constant. It is possible to use Newton’s method to calculate 1/a to any desired accuracy without doing division. Determine a function f such that f (1/a) = 0, and for which the formula for Newton’s method only uses the

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 75 i

75

arithmetic operations of addition, subtraction, and multiplication. For what initial values does Newton’s method converge for this function? 7.3. Apply Newton’s method to f (x) = (x − 2)4 + (x − 2)5 with initial guess x0 = 3. You should observe that the sequence converges linearly with rate constant 34 . Now apply the iterative method xk+1 = xk − 4f (xk )/f (xk ). This method should converge more rapidly for this problem. Prove that the new method converges quadratically, and determine the rate constant. 7.4. A function f has a root of multiplicity m > 1 at the point x∗ if f (x∗ ) = f (x∗ ) = · · · = f (m−1) (x∗ ) = 0. Assume that Newton’s method with initial guess x0 converges to such a root. Prove that Newton’s method converges linearly but not quadratically. Assume that the iteration xk+1 = xk − mf (xk )/f (xk ) 7.5.

7.6. 7.7.

7.8.

7.9. 7.10.

converges to x∗ . If f (m) (x∗ ) = 0, prove that this sequence converges quadratically. Apply Newton’s method to solve f (x) = x 2 − a = 0, where a > 0. This is a good √ way to compute ± a. How does the iteration behave if a ≤ 0? What happens if you choose x0 as a complex number? Prove that your iteration from the previous √ problem converges to a root if√x0 = 0. When does the iteration converge to + a and when does it converge to − a? For the iteration in the previous problem, can you efficiently determine a good initial guess x0 using the value of a and the elementary operations of addition, subtraction, multiplication, and division? Can you determine an upper bound on how many elementary operations are required to determine a root to within a specified accuracy? Newton’s method was derived by approximating the general function f by the first two terms of its Taylor series at the current point xk . Derive another method for finding zeroes by approximating f with the first three terms of its Taylor series at the current point, and finding a zero of this approximation. Determine the rate of convergence for this new method (you may assume that the method converges). Apply the method to the functions in Examples 2.5 and 2.7. Prove Theorem 2.10. Apply Newton’s method to the system of nonlinear equations f1 (x1 , x2 ) = x12 + x22 − 1 = 0 f2 (x1 , x2 ) = 5x12 − x2 − 2 = 0. There are four solutions to this system of equations. Can you find all four of them by using different initial guesses?

i

i i

i

i

i

i

76

book 2008/10/23 page 76 i

Chapter 2. Fundamentals of Optimization

7.11. (Extended Project) Apply Newton’s method to solve f (x) = x n − a = 0 for various values of n. Experiment with the method in an attempt to understand its properties. Under what circumstances will the method converge to a root? Can you, by using complex-valued initial guesses, determine all n roots of this equation? What can you prove about the convergence of the iteration? What happens if n is not an integer? 7.12. Suppose that Newton’s method is applied to a system of nonlinear equations, where some of the equations are linear. Prove that the linear equations are satisfied at every iteration, except possibly at the initial point.

2.8

Notes

Global Optimization—Techniques for global optimization are discussed in the books by Hansen (1992) and Floudas and Pardalos (1992, reprinted 2007); Hansen and Walster (2003); Horst et al. (2000); and Liberti and Maculan (2006). A survey of results can be found in article by Rinnooy Kan and Timmer (1989). Newton’s Method—If a function is known to have a multiple root, and if the multiplicity of the root is known (e.g., if it is known to be a double root), then it is possible to adjust the formula for Newton’s method to restore the quadratic rate of convergence. (See the Exercises above.) However, on a general problem it is unlikely that this information will be available, so this is not normally a practical alternative.

i

i i

i

i

i

i

book 2008/10/23 page 77 i

Chapter 3

Representation of Linear Constraints

3.1

Basic Concepts

In this chapter we examine ways of representing linear constraints. The goal is to write the constraints in a form that makes it easy to move from one feasible point to another. The constraints specify interrelationships among the variables so that, for example, if we increase the first variable, retaining feasibility might require making a complicated sequence of changes to all the other variables. It is much easier if we express the constraints using a coordinate system that is “natural” for the constraints. Then the interrelationships among the variables are taken care of by the coordinate system, and moves between feasible points are almost as simple as for a problem without constraints. In the general case these constraints may be either equalities or inequalities. Since any inequality of the “less than or equal” type may be transformed to an equivalent constraint of the “greater or equal” type, any problem with linear constraints may be written as follows: minimize subject to

f (x) aiTx = bi , i ∈ E aiTx ≥ bi , i ∈ I.

Each ai here is a vector of length n and each bi is a scalar. E is an index set for the equality constraints and I is an index set for the inequality constraints. We denote by A the matrix whose rows are the vectors aiT and denote by b the vector of right-hand side coefficients bi . Let S be the set of feasible points. A set of this form, defined by a finite number of linear constraints, is sometimes called a polyhedron or a polyhedral set. In this chapter we are not concerned with the properties of the objective function f . Example 3.1 (Problem with Linear Constraints). Consider the problem minimize subject to

f (x) = x12 + x23 x34 x1 + 2x2 + 3x3 = 6 x1 , x2 , x3 ≥ 0. 77

i

i i

i

i

i

i

78

book 2008/10/23 page 78 i

Chapter 3. Representation of Linear Constraints

Figure 3.1. Feasible directions. For this example E = { 1 } and I = { 2, 3, 4 }. The vectors { ai } that determine the constraints are a1 = ( 1 2 3 )T , a2 = ( 1 0 0 )T a3 = ( 0 1 0 )T , a4 = ( 0 0 1 )T and the right-hand sides are b1 = 6,

b2 = 0,

b3 = 0,

and

b4 = 0.

We start by taking a closer look at the relation between a feasible point and its neighboring feasible points. We shall be interested in determining how the function value changes as we move from a feasible point x¯ to nearby feasible points. First let us look at the direction of movement. We define p to be a feasible direction at the point x¯ if a small step taken along p leads to a feasible point in the set. Mathematically, p is a feasible direction if there exists some > 0 such that x¯ + αp ∈ S for all 0 ≤ α ≤ . Thus, a small movement from x¯ along a feasible direction maintains feasibility. In addition, since the feasible set is convex, any feasible point in the set can be reached from x¯ by moving along some feasible direction. Examples of feasible directions are shown in Figure 3.1. In many applications, it is useful to maintain feasibility at every iteration. For example, the objective function may only be defined at feasible points. Or if the algorithm is terminated before an optimal solution has been found, only a feasible point may have practical value. These considerations motivate a class of methods called feasible-point methods. These methods have the following form. Algorithm 3.1. Feasible-Point Method 1. Specify some initial feasible guess of the solution x0 . 2. For k = 0, 1, . . . (i) Determine a feasible direction of descent pk at the point xk . If none exists, stop. (ii) Determine a new feasible estimate of the solution: xk+1 = xk + αk pk , where f (xk+1 ) < f (xk ).

i

i i

i

i

i

i

3.1. Basic Concepts

book 2008/10/23 page 79 i

79

In this chapter we are mainly concerned with representing feasible directions with respect to S in terms of the constraint vectors ai . We begin by characterizing feasible directions with respect to a single constraint. Specifically, we determine conditions that ensure that small movements away from a feasible point x¯ will keep the constraint satisfied. Consider first an equality constraint aiTx = bi . Let us examine the effect of taking a small positive step α in the direction p. Since aiTx¯ = bi , then aiT(x¯ + αp) = bi will hold if and only if aiTp = 0. Example 3.2 (An Equality Constraint). Suppose that we wished to solve minimize subject to

f (x1 , x2 ) x1 + x2 = 1.

For this constraint a1 = (1, 1)T and b1 = 1. Let x¯ = (0, 1)T so that x¯ satisfies the constraint. Then x¯ + αp will satisfy the constraint if and only if a1Tp = 0, that is, p1 + p2 = 0. For this example a1T(x¯ + αp) = (x¯1 + x¯2 ) + α(p1 + p2 ) = (1) + α(0) = 1, as expected. The original problem is equivalent to minimize f (x¯ + αp), α

where x¯ = (0, 1)T, as before, and where p = (1, −1)T is a vector satisfying a1Tp = 0. Expressing feasible points in the form x¯ + αp will be a way for us to transform constrained problems to equivalent problems without constraints. Continuing to inequality constraints, consider first some constraint aiTx ≥ bi which is inactive at x. ¯ Since aiTx¯ > bi , then aiT(x¯ + αp) > bi for all α sufficiently small. Thus, we can move a small distance in any direction p without violating the constraint. If the inequality constraint is active at x, ¯ we have aiTx¯ = bi . Then to guarantee that aiT(x¯ + αp) ≥ bi for small positive step lengths α, the direction p must satisfy aiTp ≥ 0. Example 3.3 (An Inequality Constraint). Suppose that we wished to solve minimize subject to

f (x1 , x2 ) x1 + x2 ≥ 1.

For this constraint a1 = (1, 1)T and b1 = 1. If x¯ = (0, 2)T, then the constraint is inactive and any nearby point is feasible. If x¯ = (0, 1)T, then the constraint is active and nearby points can be expressed in the form x¯ + αp with a1Tp ≥ 0. For this example this corresponds to the condition p1 + p2 ≥ 0, or p1 ≥ −p2 .

i

i i

i

i

i

i

80

book 2008/10/23 page 80 i

Chapter 3. Representation of Linear Constraints

In summary, we conclude that the feasible directions at a point x¯ are determined by the equality constraints and the active inequalities at that point. Let Iˆ denote the set of active inequality constraints at x. ¯ Then p is a feasible direction with respect to the feasible set at x¯ if and only if aiTp = 0,

i ∈ E,

aiTp ≥ 0,

ˆ i ∈ I.

In the following, it will be convenient to consider separately problems that have only equality constraints, or only inequality constraints. The general form of the equality-constrained problem is minimize subject to

f (x) Ax = b.

It is evident from our discussion above that a vector p is a feasible direction for the linear equality constraints if and only if Ap = 0. We call the set of all vectors p such that Ap = 0 the null space of A. A direction p is a feasible direction for the linear equality constraints if and only if it lies in the null space of A. The general form of the inequality-constrained problem is minimize subject to

f (x) Ax ≥ b.

Let x¯ be a feasible point for this problem. We have observed already that the inactive constraints at x¯ do not influence the feasible directions at this point. Let Aˆ be the submatrix of A corresponding to the rows of the active constraints at x. ¯ Then a direction p is a feasible direction for S at x¯ if and only if ˆ ≥ 0. Ap Since the inactive constraints at a point have no impact on its feasible directions, such constraints can be ignored when testing whether the point is locally optimal. In particular, if we had prior knowledge of which constraints are active at the optimum, we could cast aside the inactive constraints and treat the active constraints as equalities. A solution of the inequality-constrained problem is a solution of the equality-constrained problem defined by the active constraints. The theory for inequality-constrained problems draws on the theory for equality-constrained problems. For this reason, it is important to study problems with only equality constraints. In particular, it will be useful to study ways to represent all the vectors in the null space of a matrix. This is the topic of Sections 3.2 and 3.3. Once a feasible direction p is determined, the new estimate of the solution is of the form x¯ + αp where α ≥ 0. Since the new point must be feasible, in general there is an upper limit on how large α can be. For an equality constraint we have aiTp = 0, and so aiT(x¯ + αp) = aiTx¯ = bi

i

i i

i

i

i

i

3.1. Basic Concepts

book 2008/10/23 page 81 i

81

p

a Tx > b

1

a T p1 < 0

a T p2 = 0

aT p > 0 3

p2

p3

Figure 3.2. Movement to and away from the boundary. for all values of α. For an active inequality constraint we have aiTp ≥ 0, and so aiT(x¯ + αp) ≥ aiTx¯ ≥ bi for all values of α ≥ 0. Thus only the inactive constraints are relevant when determining an upper bound on α. Because x¯ is feasible, aiTx¯ > bi for all inactive constraints. Thus, if aiTp ≥ 0, the constraint remains satisfied for all α ≥ 0. As α increases, the movement is away from the boundary of the constraint. On the other hand, if aiTp < 0, the inequality will remain valid only if α ≤ (aiTx¯ −bi )/(−aiTp). A positive step along p is a move towards the boundary, and any step larger than this bound will violate the constraint. (See Figure 3.2.) The maximum step length α¯ that maintains feasibility is obtained from a ratio test:

α¯ = min (aiTx¯ − bi )/(−aiTp) : aiTp < 0 , where the minimum is taken over all inactive constraints. If aiTp ≥ 0 for all inactive constraints, then an arbitrarily large step can be taken without violating feasibility. Example 3.4 (Ratio Test). Let x¯ = (1, 1)T and p = (4, −2)T. Suppose that there are three inactive constraints with a1T = ( 1

4)

and

b1 = 3

= (0

3)

and

b2 = 2

= (5

1)

and

b3 = 4.

a2T a3T Then a1Tp = −4 < 0,

a2Tp = −6 < 0,

and

a3Tp = 18 > 0,

so only the first two constraints are used in the ratio test:

α¯ = min (aiTx¯ − bi )/(−aiTp) : aiTp < 0 = min { (5 − 3)/4, (3 − 2)/6 } = 1/6. Notice that the point x¯ + αp ¯ = ( 53 , 23 )T is on the boundary of the second constraint.

i

i i

i

i

i

i

82

book 2008/10/23 page 82 i

Chapter 3. Representation of Linear Constraints

Exercises 1.1. Find the sets of all feasible directions at points xa = (0, 0, 2)T, xb = (3, 0, 1)T, and xc = (1, 1, 1)T for Example 3.1. 1.2. Consider the set defined by the constraints x1 + x2 = 1, x1 ≥ 0, and x2 ≥ 0. At each of the following points determine the set of feasible directions: (a) (0, 1)T; (b) (1, 0)T; (c) (0.5, 0.5)T. 1.3. Consider the system of inequality constraints Ax ≥ b with A=

9 6 1

4 −7 6

1 9 8 −4 3 −7

−7 −6 6

and

b=

−15 −30 −20

.

For the given values of x and p, perform a ratio test to determine the maximum step length α¯ such that x + αp ¯ remains feasible. (i) (ii) (iii) (iv)

x x x x

= (8, 4, −3, 4, 1)T and p = (1, 1, 1, 1, 1)T, = (7, −4, −3, −3, 3)T and p = (3, 2, 0, 1, −2)T, = (5, 0, −6, −8, −3)T and p = (5, 0, 5, 1, 3)T, = (9, 1, −1, 6, 3)T and p = (−4, −2, 4, −2, 2)T.

1.4. What are the potential consequences of miscalculating α¯ in the ratio test? 1.5. Let S = { x : Ax ≤ b }. Derive the conditions that must be satisfied by a feasible direction at a point x¯ ∈ S. 1.6. On a computer, there is a danger that an overflow can occur during the ratio test if, in a particular ratio, the numerator is large and the denominator is small. How can the ratio test be implemented so that this danger is removed?

3.2

Null and Range Spaces

Let A be an m × n matrix with m ≤ n. We denote the null space of A by

N (A) = p ∈ n : Ap = 0 . The null space of a matrix is the set of vectors orthogonal to the rows of the matrix. Recall that the null space represents the set of feasible directions for the constraints Ax = b. It is easy to see that any linear combination of two vectors in N (A) is also in N (A), and thus the null space is a subspace of n . It can be shown that the dimension of this subspace is n − rank(A). When A has full row rank (i.e., its rows are linearly independent), this is just n − m. Another term that will be important to our discussions is the range space of a matrix. This is the set of vectors spanned by the columns of the matrix (that is, the set of all linear combinations of these columns). In particular, we are interested in the range space of AT, defined by

R(AT) = q ∈ n : q = ATλ for some λ ∈ m .

i

i i

i

i

i

i

3.2. Null and Range Spaces

book 2008/10/23 page 83 i

83

R (A T ) N(A ):a T p = 0 a p x

q

Figure 3.3. Null space and range space of A = (a T). Throughout this text, if we mention a range space without specifying a matrix, it refers to the range space of AT. The dimension of the range space is the same as the rank of AT, or equivalently the rank of A. There is an important relationship between N (A) and R(AT): they are orthogonal subspaces. This means that any vector in one subspace is orthogonal to any vector in the other. To verify this statement, we note that any vector q ∈ R(AT) can be expressed as q = ATλ for some λ ∈ m , and therefore, for any vector p ∈ N (A) we have q Tp = λTAp = 0. There is more. Because the null and range spaces are orthogonal subspaces whose dimensions sum to n, any n-dimensional vector x can be written uniquely as the sum of a null-space and a range-space component: x = p + q, where p ∈ N (A) and q ∈ R(AT). Figure 3.3 illustrates the null and range spaces for A = (a T), where a is a two-dimensional nonzero vector. Notice that the vector a is orthogonal to the null space and that any range-space vector is a scalar multiple of a. The decomposition of a vector x into null-space and range-space components is also shown in Figure 3.3. How can we represent vectors in the null space of A? For this purpose, we define a matrix Z to be a null-space matrix for A if any vector in N (A) can be expressed as a linear combination of the columns of Z. The representation of a null-space matrix is not unique. If A has full row rank m, any matrix Z of dimension n × r and rank n − m that satisfies AZ = 0 is a null-space matrix. The column dimension r must be at least (n − m). In the special case where r is equal to n − m, the columns of Z are linearly independent, and Z is then called a basis matrix for the null space of A. If Z is an n × r null-space matrix, the null space can be represented as N (A) = { p : p = Zv

for some v ∈ r } ,

thus N (A) = R(Z). This representation of the null space gives us a practical way to generate feasible points. If x¯ is any point satisfying Ax = b, then all other feasible points

i

i i

i

i

i

i

84

book 2008/10/23 page 84 i

Chapter 3. Representation of Linear Constraints

can be written as x = x¯ + Zv for some vector v. As an example consider the rank-two matrix 1 −1 0 0 . A= 0 0 1 1 The null space of A is the set of all vectors p such that ⎛ ⎞ p1 p1 − p2 0 1 −1 0 0 ⎜ p2 ⎟ ; = = Ap = ⎝ ⎠ 0 0 0 1 1 p3 p3 + p 4 p4 that is, the vector must satisfy p1 = p2 and p3 = −p4 . Thus any null-space vector must have the form ⎛ ⎞ v1 ⎜ v ⎟ p =⎝ 1⎠ v2 −v2 for some scalars v1 and v2 . A possible basis matrix for the null space of A is ⎛ ⎞ 1 0 0⎟ ⎜1 Z=⎝ ⎠ 0 1 0 −1 and the null space can be expressed as N (A) = p : p = Zv The matrix

for some v ∈ 2 .

⎛

⎞ 1 0 2 0 2⎟ ⎜1 Z¯ = ⎝ ⎠ 0 1 −1 0 −1 1 is also a null-space matrix for A, but it is not a basis matrix since its third column is a linear combination of the first two columns. The null space of A can be expressed in terms of Z¯ as

N (A) = p : p = Z¯ v¯ for some v¯ ∈ 3 .

Exercises 2.1. In each of the following cases, compute a basis matrix for the null space of the matrix A and express the points xi as xi = pi + qi where pi is in the null space of A and qi is in the range space of AT.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 85 i

85

(i)

A=

1 1 0

1 −1 1

1 −1 0

1 1 1

(ii)

,

⎛ ⎞ 1 ⎜3⎟ x1 = ⎝ ⎠ , 1 2 ⎛

A = (1

1

1

1),

⎞ −2 ⎜ 4⎟ x1 = ⎝ ⎠, 5 −2

(iii) A= (iv)

A=

1 1

1 −1

1 1 −1 1

1 1 2 0 1 −1

1 1 0 2 −1 1

⎛

⎞ 0 ⎜ −2 ⎟ x2 = ⎝ ⎠. −3 4 ⎛

⎞ 7 ⎜ 5⎟ x2 = ⎝ ⎠. −13 1

,

⎛ ⎞ 4 ⎜3⎟ x1 = ⎝ ⎠ , 4 0

⎞ −1 ⎜ 1⎟ x2 = ⎝ ⎠. 5 −5 ⎛

,

⎛ ⎞ 3 ⎜1⎟ x1 = ⎝ ⎠ , 1 2

⎛

⎞ 8 ⎜ 9⎟ x2 = ⎝ ⎠. −2 −4

2.2. Let Z be an n × r null-space matrix for the matrix A. If Y is any invertible r × r matrix, prove that Zˆ = ZY is also a null-space matrix for A. 2.3. Let A be a given m × n matrix and let Z be a null-space matrix for A. Let X be an invertible m × m matrix and let Y be an invertible n × n matrix. If a change of ˆ variable is made to transform A into Aˆ = XAY , how can Z be transformed into Z, ˆ a null-space matrix for A? 2.4. Let A be a full-rank m × n matrix and let Z be a basis matrix for the null space of A. Use the results of the previous problem to prove that, for appropriate choices of X and Y , Aˆ and Zˆ have the form In−m , Aˆ = ( 0 Im ) and Zˆ = 0 where Im and In−m are identity matrices of the appropriate size. What is the corresponding result in the case where A is not of full rank and Z is any null-space matrix? 2.5. Let A be an m × n matrix with m < n. Prove that any n-dimensional vector x can be written uniquely as the sum of a null-space and a range-space component: x = p + q, where p ∈ N (A) and q ∈ R(AT). 2.6. Suppose that you are given a matrix A and a vector p and are told that p is in the null space of A. On a computer, you cannot expect that Ap will be exactly equal to zero because of rounding errors. How large would the computed value of Ap have to be before you could conclude that p was not in the null space of A? (Your

i

i i

i

i

i

i

86

book 2008/10/23 page 86 i

Chapter 3. Representation of Linear Constraints answers should incorporate the values of the machine precision and the components of A and p.) If the computed value of Ap is zero, can you conclude that p is in the null space of A?

3.3

Generating Null-Space Matrices

We present here four commonly used methods for deriving a null-space matrix for A. The discussion assumes that A is an m×n matrix of full row rank (and hence m ≤ n). Two of the approaches, the variable reduction method and the QR factorization, yield an n × (n − m) basis matrix for N (A). The other two methods yield an n × n null-space matrix.

3.3.1 Variable Reduction Method This method is the approach used by the simplex algorithm for linear programming. It is also used in nonlinear optimization (see Section 15.6). We start with an example. Consider the linear system of equations: p1 + p2 − p3 = 0 − 2p2 + p3 = 0. This system has the form Ap = 0. We wish to generate all solutions to this system. We can solve for any two variables whose associated columns in A are linearly independent in terms of the third variable. For example, we can solve for p1 and p3 in terms of p2 as follows: p1 = p2 p3 = 2p2 . The set of all solutions to the system can be written as 1 p = 1 p2 , 2 where p2 is chosen arbitrarily. Thus Z = (1, 1, 2)T is a basis for the null space of A. Since the values of p1 and p3 depend on p2 , they are called dependent variables. They are also sometimes called basic variables. The variable p2 which can take on any value is called an independent variable, or a nonbasic variable. To generalize this, consider the m × n system Ap = 0. Select any set of m variables whose corresponding columns are linearly independent—these will be the basic variables. Denote by B the m × m matrix defined by these columns. The remaining variables will be the nonbasic variables; we denote the m × (n − m) matrix of their respective columns by N . The general solution to the system Ap = 0 is obtained by expressing the basic variables in terms of the nonbasic variables, where the nonbasic variables can take on any arbitrary value.

i

i i

i

i

i

i

3.3. Generating Null-Space Matrices

Thus

book 2008/10/23 page 87 i

87

For ease of notation we assume here that the first m variables are the basic variables. pB Ap = ( B N ) = BpB + NpN = 0. pN

Premultiplying the last equation by B −1 we get pB = −B −1 NpN . Thus the set of solutions to the system Ap = 0 is pB −B −1 N p= pN , = I pN and the n × (n − m) matrix

Z=

−B −1 N I

is a basis for the null space of A. Consider now the system Ax = b. One feasible solution is −1 B b x¯ = . 0 If x is any point that satisfies Ax = b, then x can be written in the form −1 −B −1 N B b x = x¯ + p = x¯ + ZpN = + pN . 0 I If the basis matrix B is chosen differently, then the representation of the feasible points changes, but the set of feasible points does not. In this derivation we assumed that the first m variables were the basic variables. If this is not true, the rows in Z must be reordered to correspond to the ordering of the basic and nonbasic variables. This technique is illustrated in the following example. Example 3.5 (Variable Reduction). Consider the system of constraints Ax = b with 5 1 −2 1 3 . and b = A= 6 0 1 1 4 Let B consist of the first two columns of A, and let N consist of the last two columns: 1 3 1 −2 . and N = B= 1 4 0 1 Then x¯ =

B

−1

0

b

⎛

⎞ 17 ⎜ 6⎟ =⎝ ⎠ 0 0

i

i i

i

i

i

i

88

book 2008/10/23 page 88 i

Chapter 3. Representation of Linear Constraints

and

⎛

⎞ −3 −11 −B N ⎜ −1 −4 ⎟ =⎝ Z= ⎠. I 1 0 0 1 It is easy to verify that Ax¯ = b and AZ = 0. Every point satisfying Ap = 0 is of the form ⎞ ⎛ ⎛ ⎞ −3p3 − 11p4 −3 −11 ⎜ −p3 − 4p4 ⎟ ⎜ −1 −4 ⎟ p3 =⎝ ZpN = ⎝ ⎠. ⎠ p4 p3 1 0 p4 0 1

−1

If instead B is chosen as columns 4 and 3 of A (in that order), and N as columns 2 and 1, then −2 1 3 1 . and N = B= 1 0 4 1 Care must be taken in defining x¯ and Z to ensure that their components are positioned correctly. In this case ⎛ ⎞ 0 1 ⎜0⎟ −1 B b= and x¯ = ⎝ ⎠ . 2 2 1 −1 ¯ corresponding to the Notice that the components of B b are at positions 4 and 3 in x, columns of A that were used to define B. Similarly ⎛ ⎞ 0 1 −3 1 0⎟ ⎜ 1 and Z = ⎝ −B −1 N = ⎠. 11 −4 11 −4 −3 1 The rows of −B −1 N are placed in rows 4 and 3 of Z, and the rows of I are placed in rows 2 and 1. As before, Ax¯ = b and AZ = 0. Every point satisfying Ap = 0 is of the form ⎛ ⎞ ⎛ ⎞ p1 0 1 p2 0 ⎟ p2 ⎜ ⎟ ⎜ 1 =⎝ ZpN = ⎝ ⎠. ⎠ p1 11p2 − 4p1 11 −4 −3p2 + p1 −3 1 In practice the matrix Z itself is rarely formed explicitly, since the inverse of B should not be computed. This is not a limitation; Z is only needed to provide matrix-vector products of the form p = Zv, or the form Z Tg. These computations do not require Z explicitly. For example, the vector p = Zv may be computed as follows. First we compute t = N v. Next we compute u = −B −1 t, by solving the system Bu = −t. (This should be done via a numerically stable method such as the LU factorization.) The vector p = Zv is now given by p = (uT, v T)T. The variable reduction approach for representing the null space is the method used in the simplex algorithm for linear programming. This approach has been enhanced so that ever larger problems can be solved. These enhancements exploit the sparsity that is often present in large problems, in order to reduce computational effort and increase accuracy. A more detailed exposition of these techniques is given in Chapter 7.

i

i i

i

i

i

i

3.3. Generating Null-Space Matrices

book 2008/10/23 page 89 i

89

x

Px

Ap =0

Figure 3.4. Orthogonal projection.

3.3.2

Orthogonal Projection Matrix

Let x be an n-dimensional vector, and let A be an m × n matrix of full row rank. Then x can be expressed as a sum of two components, one in N (A) and the other in R(AT): x = p + q, where Ap = 0, and q = ATλ for some m-dimensional vector λ. Multiplying this equation on the left by A gives Ax = AATλ, from which we obtain λ = (AAT)−1 Ax. Substituting for q gives the null-space component of x: p = x − AT(AAT)−1 Ax = (I − AT(AAT)−1 A)x. The n × n matrix

P = I − AT(AAT)−1 A

is called an orthogonal projection matrix into N (A). The null-space component of the vector x can be found by premultiplying x by P ; the resulting vector P x is also termed the orthogonal projection of x onto N (A) (see Figure 3.4). The orthogonal projection matrix is the unique matrix with the following properties: • It is a null-space matrix for A; • P 2 = P , which means repeated application of the orthogonal projection has no further effect; • P T = P (P is symmetric). The name “orthogonal projection” may be misleading—unless P is the identity matrix it is not orthogonal. There are a number of ways to compute the projection matrix. Selection of the method depends in general on the application, the size of m and n, as well as the sparsity of A. We

i

i i

i

i

i

i

90

book 2008/10/23 page 90 i

Chapter 3. Representation of Linear Constraints

point out that by “computing the matrix” we mean representing the matrix so that a matrixvector product of the form P x can be formed for any vector x. The projection matrix itself is rarely formed explicitly. To demonstrate this point, suppose that A consists of a single row: A = a T, where a is an n-vector. Then 1 P = I − T aa T. a a Forming P explicitly would require approximately n2 /2 multiplications and n2 /2 storage locations. Forming the product P x for some vector x would require n2 additional multiplications. These costs can be reduced dramatically if only the vector a and the scalar a Ta are stored. “Forming” P this way only requires n multiplications in the calculation of (a Ta). The matrix-vector product is computed as P x = x − a(a Tx)/(a Ta). This requires only 2n multiplications. In the example above the matrix AAT is the scalar a Ta, which is easy to invert. In the more general case where A has several rows, the task of “inverting” AAT becomes expensive, and care must be taken to perform this in a numerically stable manner. Often, this is done by the Cholesky factorization. However, if A is dense it is not advisable to form the matrix AAT explicitly, since it can be shown that its condition number is the square of that of A. A more stable approach is to use a QR factorization of AT (see Appendix A.7.3 and Section 3.3.4 below). For the case when A is large and sparse, the QR factorization may be too expensive, since it tends to produce dense factors. Special techniques that attempt to exploit the sparsity structure of A have been developed for this situation.

3.3.3

Other Projections

As before, let A be an m × n matrix of full row rank. Let D be a positive-definite n × n matrix, and consider the n × n matrix PD = I − DAT(ADAT)−1 A. It is easy to show that PD is a null-space matrix for A. Also, PD PD = PD . An n × n matrix with these two properties is called a projection matrix. An orthogonal projection is therefore a symmetric projection matrix. Many of the new interior point algorithms for optimization use projections of this form. In the case of linear programming, the matrix D is generally a diagonal matrix with positive diagonal terms. This matrix D changes from iteration to iteration, while A remains unchanged. Special techniques for computing and updating these projections have been developed.

3.3.4 The QR Factorization Again let A be an m × n matrix with full row rank. We perform an orthogonal factorization of AT : AT = QR.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 91 i

91

Let Q = (Q1 , Q2 ), where Q1 consists of the first m columns of Q, and Q2 consists of the last n − m columns. Also denote the top m × m triangular submatrix of R by R1 . The rest of R is an (n − m) × m zero matrix. Since Q is an orthogonal matrix, it follows that AQ = R T, or AQ1 = R1T and AQ2 = 0. Thus Z = Q2 is a basis for the null space of A. This basis is also known as an orthogonal basis, since Z TZ = I . Example 3.6 (Generating a Basis Matrix Using the QR Factorization). Consider the matrix 1 −1 0 0 . A= 0 0 1 1 An orthogonal factorization of AT yields ⎛ √ ⎞ 0 −1/2 −1/2 −√2/2 2/2 −1/2 −1/2 ⎟ ⎜ √0 Q=⎝ ⎠, 0 −√2/2 1/2 −1/2 0 − 2/2 −1/2 1/2 hence

⎛

−1/2 ⎜ −1/2 Z=⎝ 1/2 −1/2

⎛ √ − 2 ⎜ 0 R=⎝ 0 0

0 √

⎞

− 2⎟ ⎠, 0 0

⎞ −1/2 −1/2 ⎟ ⎠ −1/2 1/2

is a basis for the null space of A. The QR factorization method has the important advantage that the basis Z can be formed in a numerically stable manner. Moreover, computations performed with respect to the resulting basis Z are numerically stable. (For further information, see the references cited in the Notes.) However, this numerical stability comes at a price, since computing the QR factorization is relatively expensive. If m is small relative to n, some savings may be gained by not forming Q explicitly. An additional drawback of the QR method is that the basis Z can be dense even when A is sparse. As a result it may be unsuitable for large sparse problems.

Exercises 3.1. For each of the following matrices, compute a basis for the null space using variable reduction (with A written in the form (B, N )). (i)

A=

1 1 0

1 −1 1

1 1 −1 1 0 1

.

i

i i

i

i

i

i

92

book 2008/10/23 page 92 i

Chapter 3. Representation of Linear Constraints (ii) A = (1 (iii)

A=

(iv)

1 1

1

1

1 −1

1).

1 −1

1 1

.

1 1 1 1 A= 2 0 0 2 . 1 −1 −1 1 3.2. Compute the orthogonal projection matrix for each of the matrices in the previous problem. 3.3. Consider the system Ap = 0, where 1 2 0 2 . A= 2 1 2 4 Compute a basis for the null-space matrix of A using p2 and p3 as the basic variables. Use this to write a general expression for all solutions to this system. Could you do the same if p1 and p4 were the basic variables? 3.4. Let A be an m × n matrix of full row rank. Prove that the matrix AAT is positive definite, and hence its inverse exists. 3.5. Let A be an m × n matrix of full column rank. Prove that the matrix ATA is positive definite, and hence its inverse exists. 3.6. Let A be an m × n full row rank matrix and let Z be a basis for its null space. Prove that I − AT(AAT)−1 A = Z(Z TZ)−1 Z T. 3.7. Let P be the orthogonal projection matrix associated with an m × n full row rank matrix A. Prove that P has n − m linearly independent eigenvectors associated with the eigenvalue 1, and m linearly independent eigenvectors associated with the eigenvalue 0. 3.8. Prove that if P is the orthogonal projection matrix associated with N (A), then I − P is the orthogonal projection matrix associated with R(AT). 3.9. Let A = (1, 3, 2, −1)T and let x = (6, 8, −2, 1)T. Compute the orthogonal projection of x into the null space of A without explicitly forming the projection matrix. 3.10. Prove that an orthogonal projection matrix is positive semidefinite. 3.11. Let A be an m × n matrix of full row rank, and let P be the orthogonal projection matrix corresponding to A. Let a be an n-dimensional vector and suppose that a is not a linear combination of the rows of A. (i) Prove that a TP a = 0. (ii) Let Aˆ =

A aT

,

ˆ Prove that and let Pˆ be the orthogonal projection matrix corresponding to A. Pˆ = P − P a(a TP a)−1 a TP .

i

i i

i

i

i

i

3.4. Notes

book 2008/10/23 page 93 i

93

3.12. Let A be an m × n full row rank matrix and D an n × n positive-definite matrix. (i) Prove that the matrix ADAT is positive definite, and hence its inverse exists. (ii) Let PD = I −DAT(ADAT)−1 A. Prove that PD x = 0 if and only if x = DATη for some m-dimensional vector η. (iii) Prove that the matrix PD D is positive semidefinite, and x TPD Dx = 0, if and only if x = ATη for some vector η. 3.13. Compute an orthogonal basis matrix for the matrices in Exercise 3.1. 3.14. Consider the QR factorization of a full row rank matrix A. Prove that Q1 Q1 T + Q2 QT2 = I . 3.15. Consider the problem of forming the orthogonal projection matrix associated with a matrix A. One approach to avoid the potential ill-conditioning of the matrix AAT is to use the QR factorization for the matrix AAT. Assume that A has full row rank. (i) Prove that the AAT = R1TR1 and hence R1T is the lower triangular matrix of the Cholesky factorization for AAT. (ii) Prove that the resulting orthogonal projection is P = Q2 Q2 T . (iii) Prove that AT(AAT)−1 = Q1 R1−T . 3.16. Let A be a matrix with full row rank, and let Z be an orthogonal basis matrix for A. Prove that the orthogonal projection matrix associated with A satisfies P = ZZ T. 3.17. Let P be an orthogonal projection. Prove that P is unique. Hint: Let P = ZZ T where Z is an orthogonal basis matrix for the null space. Suppose that P1 is another orthogonal projection. Then P1 = ZV T for some full-rank matrix V . Now prove that V = Z. 3.18. Compute an orthogonal projection matrix for 1 1 1 1 . A= 2 2 2 2

3.4

Notes

Further information on these topics can be found in the books by Gill, Murray, and Wright (1991); Golub and Van Loan (1996); and Trefethen and Bau (1997).

i

i i

i

i

book 2008/10/23 page 94 i

i

i

i

i

i

i

i

i

i

book 2008/10/23 page 95 i

Part II

Linear Programming

i

i i

i

i

book 2008/10/23 page 96 i

i

i

i

i

i

i

i

i

i

book 2008/10/23 page 97 i

Chapter 4

Geometry of Linear Programming

4.1

Introduction

Linear programs can be studied both algebraically and geometrically. The two approaches are equivalent, but one or the other may be more convenient for answering a particular question about a linear program. The algebraic point of view is based on writing the linear program in a particular way, called standard form. Then the coefficient matrix of the constraints of the linear program can be analyzed using the tools of linear algebra. For example, we might ask about the rank of the matrix, or for a representation of its null space. It is this algebraic approach that is used in the simplex method, the topic of the next chapter. The geometric point of view is based on the geometry of the feasible region and uses ideas such as convexity to analyze the linear program. It is less dependent on the particular way in which the constraints are written. Using geometry (particularly in two-dimensional problems where the feasible region can be graphed) makes many of the concepts in linear programming easy to understand, because they can be described in terms of intuitive notions such as moving along an edge of the feasible region. There is a direct correspondence between these two points of view. This chapter will explore several aspects of this correspondence. Before giving an outline of the chapter, we show how a two-dimensional linear program can be solved graphically. Consider the problem minimize subject to

z = −x1 − 2x2 −2x1 + x2 ≤ 2 −x1 + x2 ≤ 3 x1 ≤ 3 x1 , x2 ≥ 0.

The feasible region is graphed in Figure 4.1. The figure also includes lines corresponding to various values of the objective function. For example, the line z = −2 = −x1 − 2x2 passes through the points (2, 0)T and (0, 1)T, 97

i

i i

i

i

i

i

98

book 2008/10/23 page 98 i

Chapter 4. Geometry of Linear Programming

z = -15 - 2 x1 + x2 = 2

6

5

4

3

x1 =3 2

z = -4 - x1 + x2 = 3

1

z = -2 -3

-2

0

-1 -1

1

2

3

4

z =0

Figure 4.1. Graphical solution of a linear program. and the parallel line z = 0 passes through the origin. The goal of the linear program is to minimize the value of z. As the figure illustrates, z decreases as these lines move upward and to the right. The objective z cannot be decreased indefinitely, however. Eventually the z line ceases to intersect the feasible region, indicating that there are no longer any feasible points corresponding to that particular value of z. The minimum occurs when z = −15 at the point (3, 6)T, that is, at the last point where an objective line intersects the feasible region. This is a corner of the feasible region. It is no coincidence that the solution occurred at a corner or extreme point. Proving this result will be the major goal of this chapter. To achieve this goal, we will first describe standard form, a particular way of writing a system of linear constraints. Standard form will be used to define a basic feasible solution. We will then show that the algebraic notion of a basic feasible solution is equivalent to the geometric notion of an extreme point. This equivalence is of value because, in higher dimensions, basic feasible solutions are easier to generate and identify than extreme points. It will then be shown how to represent any feasible point in terms of extreme points and directions of unboundedness (directions used in the description of unbounded feasible regions). Finally, this representation of feasible points will be used to prove that any linear program with a finite optimal solution has an optimal extreme point. This last result will, in turn, motivate our discussion of the simplex method, a method that solves linear programs by examining a sequence of basic feasible solutions, that is, extreme points.

Exercises 1.1. Solve the following linear programs graphically.

i

i i

i

i

i

i

Exercises

book 2008/10/23 page 99 i

99

(i) minimize subject to

z = 3x1 + x2 x1 − x2 ≤ 1 3x1 + 2x2 ≤ 12 2x1 + 3x2 ≤ 3 −2x1 + 3x2 ≥ 9 x1 , x2 ≥ 0.

maximize subject to

z = x1 + 2x2 2x1 + x2 ≥ 12 x1 + x2 ≥ 5 −x1 + 3x2 ≤ 3 6x1 − x2 ≥ 12 x1 , x2 ≥ 0.

(ii)

(iii) minimize subject to

z = x1 − 2x2 x1 − 2x2 ≥ 4 x1 + x2 ≤ 8 x1 , x2 ≥ 0.

minimize subject to

z = −x1 − x2 x1 − x2 ≥ 1 x1 − 2x2 ≥ 2 x1 , x2 ≥ 0.

minimize subject to

z = x 1 − x2 x1 − x2 ≥ 2 2x1 + x2 ≥ 1 x1 , x2 ≥ 0.

(iv)

(v)

(vi) minimize subject to

z = 4x1 − x2 x1 + x2 ≤ 6 x1 − x2 ≥ 3 −x1 + 2x2 ≥ 2 x1 , x2 ≥ 0.

maximize subject to

z = 6x1 − 3x2 2x1 + 5x2 ≥ 10 3x1 + 2x2 ≤ 40 x1 , x2 ≤ 15.

minimize subject to

z = x1 + 9x2 2x1 + x2 ≤ 100 x1 + x2 ≤ 80 x1 ≤ 40 x1 , x2 ≥ 0.

(vii)

(viii)

i

i i

i

i

i

i

100

book 2008/10/23 page 100 i

Chapter 4. Geometry of Linear Programming (ix) minimize subject to

(x) minimize subject to

z = 2x1 + 13x2 x1 + x2 ≤ 5 x1 + 2x2 ≤ 6 x1 , x2 ≥ 0. z = −5x1 − 7x2 −3x1 + 2x2 ≤ 30 −2x1 + x2 ≤ 12 x1 , x2 ≥ 0.

1.2. Find graphically all the values of the parameter a such that (−3, 4)T is the optimal solution of the following problem: maximize subject to

z = ax1 + (2 − a)x2 4x1 + 3x2 ≤ 0 2x1 + 3x2 ≤ 7 x1 + x2 ≤ 1.

1.3. Find graphically all the values of the parameter a such that the following systems define nonempty feasible sets. (i)

5x1 + x2 + x3 + 3x4 = a 8x1 + 3x2 + x3 + 2x4 = 2 − a x1 , x2 , x3 , x4 ≥ 0.

(ii)

ax1 + x2 + 3x3 − x4 = 2 x1 − x2 − x3 − 2x4 = 2 x1 , x2 , x3 , x4 ≥ 0.

1.4. Suppose that the linear program minimize subject to

z = cTx Ax = b x≥0

has an optimal objective z∗ . Discuss how the optimal objective would change if (a) a constraint is added to the problem; and (b) a constraint is deleted from the problem.

4.2

Standard Form

There are many different ways to represent a linear program. It is sometimes more convenient to use one instead of another, at times to make a property of the linear program more apparent, at other times to simplify the description of an algorithm. One such representation, called standard form, will be used to describe the simplex method.

i

i i

i

i

i

i

4.2. Standard Form

book 2008/10/23 page 101 i

101

In matrix-vector notation, a linear program in standard form will be written as minimize

z = cTx

subject to

Ax = b x≥0

with b ≥ 0. Here x and c are vectors of length n, b is a vector of length m, and A is an m × n matrix called the constraint matrix. The important things to notice are (i) it is a minimization problem, (ii) all the variables are constrained to be nonnegative, (iii) all the other constraints are represented as equations, and (iv) the components of the right-hand side vector b are all nonnegative. This will be the form of a linear program used within the simplex method. In other settings, other forms of a linear program may be more convenient. Example 4.1 (Standard Form). The linear program minimize

z = 4x1 − 5x2 + 3x3

subject to

3x1 − 2x2 + 7x3 = 7 8x1 + 6x2 + 6x3 = 5 x1 , x2 , x3 ≥ 0

is in standard form. In terms of the matrix-vector notation, x1 4 3 −2 7 , x = x2 , c = −5 , A = 8 6 6 3 x3

7 . b= 5

There are n = 3 variables and m = 2 constraints. All linear programs can be converted to standard form. The rules for doing this are simple and can be performed automatically by software. Most linear programming software packages allow the user to represent a linear program in any convenient way and then the software performs the conversion internally. We illustrate these techniques via examples. Justification for these rules is left to the Exercises. If the original problem is a maximization problem: maximize z = 4x1 − 3x2 + 6x3 = cTx, then the objective can be multiplied by −1 to obtain minimize zˆ = −4x1 + 3x2 − 6x3 = −cTx. After the problem has been solved, the optimal objective value must be multiplied by −1, so that z∗ = −ˆz∗ . The optimal values of the variables are the same for both objective functions. If any of the components of b are negative, then those constraints should be multiplied by −1. This will cause a constraint of the “≤” form to be converted to a “≥” constraint and vice versa. If a variable has a lower bound other than zero, say x1 ≥ 5,

i

i i

i

i

i

i

102

book 2008/10/23 page 102 i

Chapter 4. Geometry of Linear Programming

then the variable can be replaced in the problem by x1 = x1 − 5. The constraint x1 ≥ 5 is equivalent to x1 ≥ 0. An upper bound on a variab