Understanding Language Structure and Design

This document discusses the nature of language and describes how programming languages work. It covers topics like the structure of natural languages versus programming languages, how languages represent and abstract concepts, the elements that make up languages like nouns and verbs, how the syntax and semantics of languages are formally defined, and how basic concepts like variables, types, and objects are modeled in languages. The document provides an overview of the fundamental concepts underlying all languages.

Contents

I  About Language

1  The Nature of Language
   1.1  Communication
   1.2  Syntax and Semantics
   1.3  Natural Languages and Programming Languages
        1.3.1  Structure
        1.3.2  Redundancy
        1.3.3  Using Partial Information: Ambiguity and Abstraction
        1.3.4  Implicit Communication
        1.3.5  Flexibility and Nuance
        1.3.6  Ability to Change and Evolve
   1.4  The Standardization Process
        1.4.1  Language Growth and Divergence
   1.5  Nonstandard Compilers

2  Representation and Abstraction
   2.1  What Is a Program?
   2.2  Representation
        2.2.1  Semantic Intent
        2.2.2  Explicit versus Implicit Representation
        2.2.3  Coherent versus Diffuse Representation
   2.3  Language Design
        2.3.1  Competing Design Goals
        2.3.2  The Power of Restrictions
        2.3.3  Principles for Evaluating a Design
   2.4  Classifying Languages
        2.4.1  Language Families
        2.4.2  Languages Are More Alike than Different

3  Elements of Language
   3.1  The Parts of Speech
        3.1.1  Nouns
        3.1.2  Pronouns: Pointers
        3.1.3  Adjectives: Data Types
        3.1.4  Verbs
        3.1.5  Prepositions and Conjunctions
   3.2  The Metalanguage
        3.2.1  Words: Lexical Tokens
        3.2.2  Sentences: Statements
        3.2.3  Larger Program Units: Scope
        3.2.4  Comments
        3.2.5  Naming Parts of a Program
        3.2.6  Metawords That Let the Programmer Extend the Language

4  Formal Description of Language
   4.1  Foundations of Programming Languages
   4.2  Syntax
        4.2.1  Extended BNF
        4.2.2  Syntax Diagrams
   4.3  Semantics
        4.3.1  The Meaning of a Program
        4.3.2  Definition of Language Semantics
        4.3.3  The Abstract Machine
        4.3.4  Lambda Calculus: A Minimal Semantic Basis
   4.4  Extending the Semantics of a Language
        4.4.1  Semantic Extension in FORTH

II  Describing Computation

5  Primitive Types
   5.1  Primitive Hardware Types
        5.1.1  Bytes, Words, and Long Words
        5.1.2  Character Codes
        5.1.3  Numbers
   5.2  Types in Programming Languages
        5.2.1  Type Is an Abstraction
        5.2.2  A Type Provides a Physical Description
        5.2.3  What Primitive Types Should a Language Support?
        5.2.4  Emulation
   5.3  A Brief History of Type Declarations
        5.3.1  Origins of Type Ideas
        5.3.2  Type Becomes a Definable Abstraction

6  Modeling Objects
   6.1  Kinds of Objects
   6.2  Placing a Value in a Storage Object
        6.2.1  Static Initialization
        6.2.2  Dynamically Changing the Contents of a Storage Object
        6.2.3  Dereferencing
        6.2.4  Pointer Assignment
        6.2.5  The Semantics of Pointer Assignment
   6.3  The Storage Model: Managing Storage Objects
        6.3.1  The Birth and Death of Storage Objects
        6.3.2  Dangling References

7  Names and Binding
   7.1  The Problem with Names
        7.1.1  The Role of Names
        7.1.2  Definition Mechanisms: Declarations and Defaults
        7.1.3  Binding
        7.1.4  Names and Objects: Not a One-to-One Correspondence
   7.2  Binding a Name to a Constant
        7.2.1  Implementations of Constants
        7.2.2  How Constant Is a Constant?
   7.3  Survey of Allocation and Binding
   7.4  The Scope of a Name
        7.4.1  Naming Conflicts
        7.4.2  Block Structure
        7.4.3  Recursive Bindings
        7.4.4  Visibility versus Lifetime
   7.5  Implications for the Compiler/Interpreter

8  Expressions and Evaluation
   8.1  The Programming Environment
   8.2  Sequence Control and Communication
        8.2.1  Nesting
        8.2.2  Sequences of Statements
        8.2.3  Interprocess Sequence Control
   8.3  Expression Syntax
        8.3.1  Functional Expression Syntax
        8.3.2  Operator Expressions
        8.3.3  Combinations of Parsing Rules
   8.4  Function Evaluation
        8.4.1  Order of Evaluation
        8.4.2  Lazy or Strict Evaluation
        8.4.3  Order of Evaluation of Arguments

9  Functions and Parameters
   9.1  Function Syntax
        9.1.1  Fixed versus Variable Argument Functions
        9.1.2  Parameter Correspondence
        9.1.3  Indefinite-Length Parameter Lists
   9.2  What Does an Argument Mean?
        9.2.1  Call-by-Value
        9.2.2  Call-by-Name
        9.2.3  Call-by-Reference
        9.2.4  Call-by-Return
        9.2.5  Call-by-Value-and-Return
        9.2.6  Call-by-Pointer
   9.3  Higher-Order Functions
        9.3.1  Functional Arguments
        9.3.2  Currying
        9.3.3  Returning Functions from Functions

10  Control Structures
    10.1  Basic Control Structures
          10.1.1  Normal Instruction Sequencing
          10.1.2  Assemblers
          10.1.3  Sequence, Subroutine Call, IF, and WHILE Suffice
          10.1.4  Subroutine Call
          10.1.5  Jump and Conditional Jump
          10.1.6  Control Diagrams
    10.2  Conditional Control Structures
          10.2.1  Conditional Expressions versus Conditional Statements
          10.2.2  Conditional Branches: Simple Spaghetti
          10.2.3  Structured Conditionals
          10.2.4  The Case Statement
    10.3  Iteration
          10.3.1  The Infinite Loop
          10.3.2  Conditional Loops
          10.3.3  The General Loop
          10.3.4  Counted Loops
          10.3.5  The Iteration Element
    10.4  Implicit Iteration
          10.4.1  Iteration on Coherent Objects
          10.4.2  Backtracking

11  Global Control
    11.1  The GOTO Problem
          11.1.1  Faults Inherent in GOTO
          11.1.2  To GOTO or Not to GOTO
          11.1.3  Statement Labels
    11.2  Breaking Out
          11.2.1  Generalizing the BREAK
    11.3  Continuations
    11.4  Exception Processing
          11.4.1  What Is an Exception?
          11.4.2  The Steps in Exception Handling
          11.4.3  Exception Handling in Ada

III  Application Modeling

12  Functional Languages
    12.1  Denotation versus Computation
          12.1.1  Denotation
    12.2  The Functional Approach
          12.2.1  Eliminating Assignment
          12.2.2  Recursion Can Replace WHILE
          12.2.3  Sequences
    12.3  Miranda: A Functional Language
          12.3.1  Data Structures
          12.3.2  Operations and Expressions
          12.3.3  Function Definitions
          12.3.4  List Comprehensions
          12.3.5  Infinite Lists

13  Logic Programming
    13.1  Predicate Calculus
          13.1.1  Formulas
    13.2  Proof Systems
    13.3  Models
    13.4  Automatic Theorem Proving
          13.4.1  Resolution Theorem Provers
    13.5  Prolog
          13.5.1  The Prolog Environment
          13.5.2  Data Objects and Terms
          13.5.3  Horn Clauses in Prolog
          13.5.4  The Prolog Deduction Process
          13.5.5  Functions and Computation
          13.5.6  Cuts and the not Predicate
          13.5.7  Evaluation of Prolog

14  The Representation of Types
    14.1  Programmer-Defined Types
          14.1.1  Representing Types within a Translator
          14.1.2  Finite Types
          14.1.3  Constrained Types
          14.1.4  Pointer Types
    14.2  Compound Types
          14.2.1  Arrays
          14.2.2  Strings
          14.2.3  Sets
          14.2.4  Records
          14.2.5  Union Types
    14.3  Operations on Compound Objects
          14.3.1  Creating Program Objects: Value Constructors
          14.3.2  The Interaction of Dereferencing, Constructors, and Selectors
    14.4  Operations on Types

15  The Semantics of Types
    15.1  Semantic Description
          15.1.1  Domains in Early Languages
          15.1.2  Domains in Typeless Languages
          15.1.3  Domains in the 1970s
          15.1.4  Domains in the 1980s
    15.2  Type Checking
          15.2.1  Strong Typing
          15.2.2  Strong Typing and Data Abstraction
    15.3  Domain Identity: Different Domain/Same Domain?
          15.3.1  Internal and External Domains
          15.3.2  Internally Merged Domains
          15.3.3  Domain Mapping
    15.4  Programmer-Defined Domains
          15.4.1  Type Description versus Type Name
          15.4.2  Type Constructors
          15.4.3  Types Defined by Mapping
    15.5  Type Casts, Conversions, and Coercions
          15.5.1  Type Casts
          15.5.2  Type Conversions
          15.5.3  Type Coercion
    15.6  Conversions and Casts in Common Languages
          15.6.1  COBOL
          15.6.2  FORTRAN
          15.6.3  Pascal
          15.6.4  PL/1
          15.6.5  C
          15.6.6  Ada Types and Treatment of Coercion
    15.7  Evading the Type Matching Rules

16  Modules and Object Classes
    16.1  The Purpose of Modules
    16.2  Modularity Through Files and Linking
    16.3  Packages in Ada
    16.4  Object Classes
          16.4.1  Classes in C++
          16.4.2  Represented Domains
          16.4.3  Friends of Classes

17  Generics
    17.1  Generics
          17.1.1  What Is a Generic?
          17.1.2  Implementations of Generics
          17.1.3  Generics, Virtual Functions, and ADTs
          17.1.4  Generic Functions
    17.2  Limited Generic Behavior
          17.2.1  Union Data Types
          17.2.2  Overloaded Names
          17.2.3  Fixed Set of Generic Definitions, with Coercion
          17.2.4  Extending Predefined Operators
          17.2.5  Flexible Arrays
    17.3  Parameterized Generic Domains
          17.3.1  Domains with Type Parameters
          17.3.2  Preprocessor Generics in C

18  Dispatching with Inheritance
    18.1  Representing Domain Relationships
          18.1.1  The Mode Graph and the Dispatcher
    18.2  Subdomains and Class Hierarchies
          18.2.1  Subrange Types
          18.2.2  Class Hierarchies
          18.2.3  Virtual Functions in C++
          18.2.4  Function Inheritance
          18.2.5  Programmer-Defined Conversions in C++
    18.3  Polymorphic Domains and Functions
          18.3.1  Polymorphic Functions
          18.3.2  Manual Domain Representation and Dispatching
          18.3.3  Automating Ad Hoc Polymorphism
          18.3.4  Parameterized Domains
    18.4  Can We Do More with Generics?
          18.4.1  Dispatching Using the Mode Graph
          18.4.2  Generics Create Some Hard Problems

A  Exhibits Listed by Topic
   A.1  Languages
        A.1.1   Ada
        A.1.2   APL
        A.1.3   C++
        A.1.4   C and ANSI C
        A.1.5   FORTH
        A.1.6   FORTRAN
        A.1.7   LISP
        A.1.8   Miranda
        A.1.9   Pascal
        A.1.10  Prolog
        A.1.11  Scheme and T
        A.1.12  Other Languages
   A.2  Concepts
        A.2.1  Application Modeling, Generics, and Polymorphic Domains
        A.2.2  Control Structures
        A.2.3  Data Representation
        A.2.4  History
        A.2.5  Lambda Calculus
        A.2.6  Language Design and Specification
        A.2.7  Logic
        A.2.8  Translation, Interpretation, and Function Calls
        A.2.9  Types

Preface
This text is intended for a course in advanced programming languages or the structure of programming languages and should be appropriate for students at the junior, senior, or masters level. It should help the student understand the principles that underlie all languages and all language implementations.

This is a comprehensive text which attempts to dissect language and explain how a language is really built. The first eleven chapters cover the core material: language specification, objects, expressions, control, and types. The more concrete aspects of each topic are presented first, followed by a discussion of implementation strategies and the related semantic issues. Later chapters cover current topics, including modules, object-oriented programming, functional languages, and concurrency constructs.

The emphasis throughout the text is on semantics and abstraction; the syntax and historical development of languages are discussed in light of the underlying semantic concepts. Fundamental principles of computation, communication, and good design are stated and are used to evaluate various language constructs and to demonstrate that language designs are improving as these principles become widely understood.

Examples are cited from many languages, including Pascal, C, C++, FORTH, BASIC, LISP, FORTRAN, Ada, COBOL, APL, Prolog, Turing, Miranda, and Haskell. All examples are annotated so that a student who is unfamiliar with the language used can understand the meaning of the code and see how it illustrates the principle. It is the belief of the authors that the student who has a good grasp of the structure of computer languages will have the tools to master new languages easily.

The specific goals of this book are to help students learn:

- To reason clearly about programming languages.
- To develop principles of communication so that we can evaluate the wisdom and utility of the decisions made in the process of language design.
- To break down language into its major components, and each component into small pieces, so that we can focus on competing alternatives.
- To define a consistent and general set of terms for the components out of which programming languages are built, and the concepts on which they are based.
- To use these terms to describe existing languages, and in so doing clarify the conflicting terminology used by language designers, and untangle the complexities inherent in so many languages.
- To see below the surface appearance of a language to its actual structure and descriptive power.
- To understand that many language features that commonly occur together are, in fact, independent and separable.
- To appreciate the advantages and disadvantages of each feature.
- To suggest ways in which these basic building blocks can be recombined in new languages with more desirable properties and fewer faults.
- To see the similarities and differences that exist among languages students already know, and to learn new ones.
- To use the understanding so gained to suggest future trends in language design.

Acknowledgement

The authors are indebted to several people for their help and support during the years we have worked on this project. First, we wish to thank our families for their uncomplaining patience and understanding. We thank Michael J. Fischer for his help in developing the sections on lambda calculus, functional languages, and logic, and for working out several sophisticated code examples. In addition, his assistance as software and hardware systems expert and TeX guru made this work possible. Several reviewers read this work in detail and offered invaluable suggestions and corrections. We thank these people for their help. Special thanks go to Robert Fischer and Roland Lieger for reading beyond the call of duty and to Gary Walters for his advice and for the material he has contributed. Finally, we thank our students at the University of New Haven and at Sacred Heart University for their feedback on the many versions of this book. Parts of this manuscript were developed under a grant from Sacred Heart University.

Part I

About Language

Chapter 1

The Nature of Language

Overview
This chapter introduces the concept of the nature of language. The purpose of language is communication. A set of symbols, understood by both sender and receiver, is combined according to a set of rules, its grammar or syntax. The semantics of the language defines how each grammatically correct sentence is to be interpreted. Using English as a model, language structures are studied and compared. The issue of standardization of programming languages is examined. Nonstandard compilers are examples of the use of deviations from an accepted standard.

This is a book about the structure of programming languages. (For simplicity, we shall use the term language to mean programming language.) We will try to look beneath the individual quirks of familiar languages and examine the essential properties of language itself. Several aspects of language will be considered, including vocabulary, syntax rules, meaning (semantics), implementation problems, and extensibility. We will consider several programming languages, examining the choices made by language designers that resulted in the strengths, weaknesses, and particular character of each language. When possible, we will draw parallels between programming languages and natural languages. Different languages are like tools in a toolbox: although each language is capable of expressing most algorithms, some are obviously more appropriate for certain applications than others. (You can use a chisel to turn a screw, but it is not a good idea.) For example, it is commonly understood that COBOL is good for business applications. This is true because COBOL provides a large variety of symbols for controlling input and output formats, so that business reports may easily be


made to fit printed forms. LISP is good for artificial intelligence applications because it supports dynamically growing and shrinking data. We will consider how well each language models the objects, actions, and relationships inherent in various classes of applications. Rather than accept languages as whole packages, we will be asking: What design decisions make each language different from the others? Are the differences a result of minor syntactic rules, or is there an important underlying semantic issue? Is a controversial design decision necessary to make the language appropriate for its intended use, or was the decision an accident of history? Could different design decisions result in a language with more strengths and fewer weaknesses? Are the good parts of different languages mutually exclusive, or could they be effectively combined? Can a language be extended to compensate for its weaknesses?

1.1 Communication

A natural language is a symbolic communication system that is commonly understood among a group of people. Each language has a set of symbols that stand for objects, properties, actions, abstractions, relations, and the like. A language must also have rules for combining these symbols. A speaker can communicate an idea to a listener if and only if they have a common understanding of enough symbols and rules. Communication is impaired when speaker and listener interpret a symbol differently. In this case, either speaker and/or listener must use feedback to modify his or her understanding of the symbols until commonality is actually achieved. This happens when we learn a new word or a new meaning for an old word, or correct an error in our idea of the meaning of a word. English is for communication among people. Programs are written for both computers and people to understand. Using a programming language requires a mutual understanding between a person and a machine. This can be more difficult to achieve than understanding between people because machines are so much more literal than human beings. The meaning of symbols in natural language is usually defined by custom and learned by experience and feedback. In contrast, programming languages are generally defined by an authority, either an individual language designer or a committee. For a computer to understand a human language, we must devise a method for translating both the syntax and semantics of the language into machine code. Language designers build languages that they know how to translate, or that they believe they can figure out how to translate.


On the other hand, if computers were the only audience for our programs we might be writing code in a language that was trivially easy to transform into machine code. But a programmer must be able to understand what he or she is writing, and a human cannot easily work at the level of detail that machine language represents. So we use computer languages that are a compromise between the needs of the speaker (programmer) and listener (computer). Declarations, types, symbolic names, and the like are all concessions to a human's need to understand what someone has written. The concession we make for computers is that we write programs in languages that can be translated with relative ease into machine language. These languages have limited vocabulary and limited syntax. Most belong to a class called context-free languages, which can be parsed easily using a stack. Happily, as our skill at translation has increased, the variety and power of symbols in our programming languages have also increased. The language designer must define sets of rules and symbols that will be commonly understood among both human and electronic users of the language. The meaning of these symbols is generally conveyed to people by the combination of a formal semantic description, analogy with other languages, and examples. The meaning of symbols is conveyed to a computer by writing small modules of machine code that define the action to be taken for each symbol. The rules of syntax are conveyed to a computer by writing a compiler or interpreter. To learn to use a new computer language effectively, a user must learn exactly what combinations of symbols will be accepted by a compiler and what actions will be invoked for each symbol in the language. This knowledge is the required common understanding. When the human communicates with a machine, he must modify his own understanding until it matches the understanding of the machine, which is embodied in the language translator.
Occasionally the translator fails to understand a phrase correctly, as specified by the official language definition. This happens when there is an error in the translator. In this case the understanding of the translator must be corrected by the language implementor.

1.2 Syntax and Semantics

The syntax of a language is a set of rules stating how language elements may be grammatically combined. Syntax specifies how individual words may be written and the order in which words may be placed within a sentence. The semantics of a language defines how each grammatically correct sentence is to be interpreted. In a compiled language, the meaning of a sentence is the object code compiled for that sentence. In an interpreted language, it is the internal representation of the program, which is then evaluated. Semantic rules specify the meaning attached to each placement of a word in a sentence, the meaning of omitting a sentence element, and the meaning of each individual word. A speaker (or programmer) has an idea that he or she wishes to communicate. This idea is the speaker's semantic intent. The programmer must choose words that have the correct semantics so that the listener (computer) can correctly interpret the speaker's semantic intent. All languages have syntax and semantics. Chapter 4 discusses formal mechanisms for expressing


the syntax of a language. The rest of this book is primarily concerned with semantics, the semantics of particular languages, and the semantic issues involved in programming.

1.3 Natural Languages and Programming Languages

We will often use comparisons with English to encourage you to examine language structures intuitively, without preconceived ideas about what programming languages can or cannot do. The objects and functions of a program correspond to the nouns and verbs of natural language. (We will use the word functions to apply to functions, procedures, operators, and some commands. Objects include variables, constants, records, and so on.) There are a number of language traits that determine the character of a language. In this section we compare the ways in which these traits are embodied in a natural language (English) and in various programming languages. The differences between English and programming languages are real, but not as great as they might at first seem. The differences are less extreme now than they were ten years ago and will decrease as programming languages continue to evolve. Current programming language research is directed toward:
Easing the constraints on the order in which statements must be given.
Increasing the uses of symbols with multiple definitions.
Permitting the programmer to talk about and use an object without knowing details of its representation.
Facilitating the construction of libraries, thus increasing the number of words that can be understood implicitly.
Increasing the ability of the language to express varied properties of the problem situation, especially relationships among classes of objects.

1.3.1 Structure

Programs must conform to very strict structural rules. These govern the order of statements and sections of code, and particular ways to begin, punctuate, and end every program. No deviation from these rules is permitted by the language definition, and this is enforced by a compiler. The structure of English is more flexible and more varied, but rules about the structure of sentences and of larger units do exist. The overall structure of a textbook or a novel is tightly controlled. Indeed, each kind of written material has some structure it must follow. In any situation where the order of events is crucial, such as in a recipe, English sentences must be placed in the correct sequence, just like the lines in a program. Deviation from the rules of structure is permitted in informal speech, and understanding can usually still be achieved. A human listener usually attempts to correct a speaker's obvious errors.


For example, scrambled words can often be put in the right order. We can correct and understand the sentence: I yesterday finished the assignment. Spoonerisms (exchanging the first letters of nearby words, often humorously) can usually be understood. For example, I kee my sids was obviously intended to mean I see my kids. A human uses common sense, context, and poorly defined heuristics to identify and correct such errors. Most programming language translators are notable for their intolerance of a programmer's omissions and errors. A compiler will identify an error when the input text fails to correspond to the syntactic rules of the language (a syntax error) or when an object is used in the wrong context (a type error). Most translators make some guesses about what the programmer really meant, and try to continue with the translation, so that the programmer gets maximum feedback from each attempt to compile the program. However, compilers can rarely correct anything more than a trivial punctuation error. They commonly make faulty guesses which cause the generation of heaps of irrelevant and confusing error comments. Some compilers actually do attempt to correct the programmer's errors by adding, changing, respelling, or ignoring symbols so that the erroneous statement is made syntactically legal. If the attempted correction causes trouble later, the compiler may return to the line with the error and try a different correction. This effort has had some success. Errors such as misspellings and errors close to the end of the code can often be corrected and enable a successful translation. Techniques have been developed since the mid-1970s and are still being improved. Such error-correcting compilers are uncommon because of the relatively great cost for added time and extra memory needed. Some people feel that the added costs exceed the added utility.

1.3.2 Redundancy

The syntactic structure of English is highly redundant. The same information is often conveyed by several words or word endings in a sentence. If required redundancy is absent, as in the sentence I finishes the assignment tomorrow, we can identify that errors have occurred. The lack of agreement between I and finishes is a syntactic error, and the disagreement of the verb tense (present) with the meaning of tomorrow is a semantic error. [Exhibit 1.1] A human uses the redundancy in the larger context to correct errors. For example, most people would be able to understand that a single letter was omitted in the sentence The color of my coat is back. Similarly, if a listener fails to comprehend a single word, she or he can usually use the redundancy in the surrounding sentences to understand the message. If a speaker omits a word, the listener can often supply it by using context. Programming languages are also partly redundant, and the required redundancy serves as a way to identify errors. For example, the first C declaration in Exhibit 1.2 contains two indications of the intended data type of the variable named price: the type name, int, and the actual type, float, of the initial value. These two indicators conflict, and a compiler can identify this as an error. The second line contains an initializer whose length is longer than the declared size of the array named table. This lack of agreement in number is an identifiable error.

Exhibit 1.1. Redundancy in English.


The subject and verb of a sentence must agree in number. Either both must be singular or both plural:
  Correct: Mark likes the cake. Singular subject, singular verb.
  Wrong: Mark like the cake. Singular subject, plural verb.
The verb tense must agree with any time words in the sentence:
  Correct: I finished the work yesterday. Past tense, past time.
  Wrong: I finish the work yesterday. Present tense, past time.
Where categories are mentioned, words belonging to the correct categories must be used:
  Correct: The color of my coat is black. Black is a color.
  Wrong: The color of my coat is back. Back is not a color.
Sentences must supply consistent information throughout a paragraph. Pronouns refer to the preceding noun. A pronoun must not suddenly be used to refer to a different noun:
  Correct: The goalie is my son. He is the best. His name is Al.
  Wrong: The goalie is my son. He is the best. He is my father.

These errors in English have analogs in programming languages. The first error above is analogous to using a nonarray variable with a subscript. The second and third errors are similar to type errors in programming languages. The last error is analogous to faulty use of a pointer.

1.3.3 Using Partial Information: Ambiguity and Abstraction

English permits ambiguity, that is, words and phrases that have dual meanings. The listener must disambiguate the sentence, using context, and determine the actual meaning (or meanings) of the speaker.1 To a very limited extent, programming languages also permit ambiguity. Operators such as + have two definitions in many languages, integer + integer and real + real. Object-oriented languages permit programmer-defined procedure names with more than one meaning. Many languages are block-structured. They permit the user to define contexts of limited scope, called blocks. The same symbol can be given different meanings in different blocks. Context is used, as it is in English, to disambiguate the meaning of the name.
1. A pun is a statement with two meanings, both intended by the speaker, where one meaning is usually funny.

Exhibit 1.2. Violations of redundancy rules in ANSI C.

int price = 20.98;               /* Declare and initialize variable. */
int table[3] = {11, 12, 13, 14}; /* Declare and initialize an array. */


The primary differences here are that context is defined very exactly in each programming language and quite loosely in English, and that most programming languages permit only limited ambiguity. English supports abstraction, that is, the description of a quality apart from an instance. For example, the word chair can be defined as a piece of furniture consisting of a seat, legs, and back, and often arms, designed to accommodate one person.2 This definition applies to many kinds of chairs and conveys some but not all of a particular chair's properties. Older programming languages do not support this kind of abstraction. They require that all an object's properties be specified when the name for that object is defined. Some current languages support very limited forms of abstraction. For example, Ada permits names to be defined for generic objects, some of whose properties are left temporarily undefined. Later, the generic definition must be instantiated by supplying actual definitions for those properties. The instantiation process produces fully specified code with no remaining abstractions, which can then be compiled in the normal way. Smalltalk and C++ are current languages whose primary design goal was support for abstraction. A Smalltalk declaration for a class chair would be parallel to the English definition. Languages of the future will have more extensive ability to define and use partially specified objects.

1.3.4 Implicit Communication

English permits some things to be understood even if they are left unsaid. When we read between the lines in an English paragraph, we are interpreting both explicit and implicit messages. Understanding of the explicit message is derived from the words of the sentence. The implicit message is understood from the common experience of speaker and listener. People from different cultures have trouble with implicit communication because they have inadequate common understanding. Some things may be left implicit in programming languages also. Variable types in FORTRAN and the type of the result of a function in the original Kernighan and Ritchie C may or may not be defined explicitly. In these cases, as in English, the full meaning of such constructs is defined by having a mutual understanding, between speaker and listener, about the meaning of things left unspecified. A programmer learning a new language must learn its implicit assumptions, more commonly called defaults. Unfortunately, when a programmer relies on defaults to convey meaning, the compiler cannot tell the difference between the purposeful use of a default and an accidental omission of an important declaration. Many experienced programmers use explicit declarations rather than rely on defaults. Stating information explicitly is less error prone and enables a compiler to give more helpful error comments.
2. Cf. Morris [1969].


1.3.5 Flexibility and Nuance

English is very flexible: there are often many ways to say something. Programming languages have this same flexibility, as is demonstrated by the tremendous variety in the solutions handed in for one student programming problem. As another example, APL provides at least three ways to express the same simple conditional branch. Alternate ways of saying something in English usually have slightly different meanings, and subtlety and nuance are important. When different statement sequences in a programming language express the same algorithm, we can say that they have the same meaning. However, they might still differ in subtle ways, such as in the time and amount of memory required to execute the algorithm. We can call such differences nuances. The nuances of meaning in a program are of both theoretical and practical importance. We are content when the work of a beginning programmer has the correct result (a way of measuring its meaning). As programmers become more experienced, however, they become aware of the subtle implications of alternative ways of saying the same thing. They will be able to produce a program with the same meaning as the beginner's program, but with superior clarity, efficiency, and compactness.

1.3.6 Ability to Change and Evolve

Expressing an idea in any language, natural or artificial, can sometimes be difficult and awkward. A person can become speechless when speaking English. Words can fail to express the strength or complexity of the speaker's feelings. Sometimes a large number of English words are required to explain a new concept. Later, when the concept becomes well understood, a word or a few words suffice. English is constantly evolving. Old words become obsolete and new words and phrases are added. Programming languages, happily, also evolve. Consider FORTRAN for example. The original FORTRAN was a very limited language. For example, it did not support parameters and did not have an IF...THEN...ELSE statement. Programmers who needed these things surely found themselves speechless, and they had to express their logic in a wordy and awkward fashion. Useful constructs were added to FORTRAN because of popular demand. As this happened, some of the old FORTRAN words and methods became obsolete. While they have not been dropped from the language yet, that may happen someday. As applications of computers change, languages are extended to include words and concepts appropriate for the new applications. An example is the introduction of words for sound generation and graphics into Commodore BASIC when the Commodore-64 was introduced with sound and graphics hardware. One of the languages that evolves easily and constantly is FORTH. There are several public domain implementations, or dialects, used by many people and often modified to fit a user's hardware and application area. The modified dialect is then passed on to others. This process works like the process for adding new meanings to English. New words are introduced and become common


knowledge gradually as an increasing number of people learn and use them. Translators for many dialects of BASIC, LISP, and FORTH are in common use. These languages are not fully standardized. Many dialects of the original language emerge because implementors are inspired to add or redesign language features. Programs written in one dialect must be modified to be used by people whose computer understands a different dialect. When this happens we say that a program is nonportable. The cost of rewriting programs makes nonstandardized programming languages unattractive to commercial users of computers. Lack of standardization can also cause severe difficulties for programmers and publishers: the language specifications and reference material must be relearned and rewritten for each new dialect.

1.4 The Standardization Process

Once a language is in widespread use, it becomes very important to have a complete and precise definition of the language so that compatible implementations may be produced for a variety of hardware and system environments. The standardization process was developed in response to this need. A language standard is a formal definition of the syntax and semantics of a language. It must be a complete, unambiguous statement of both. Language aspects that are defined must be defined clearly, while aspects that go beyond the limits of the standard must be designated clearly as undefined. A language translator that implements the standard must produce code that conforms to all defined aspects of the standard, but for an undefined aspect, it is permitted to produce any convenient translation. The authority to define an unstandardized language or to change a language definition may belong to the individual language designer, to the agency that sponsored the language design, or to a committee of the American National Standards Institute (ANSI) or the International Standards Organization (ISO). The FORTRAN standard was originated by ANSI, the Pascal standard by ISO. The definition of Ada is controlled by the U.S. Department of Defense, which paid for the design of Ada. New or experimental languages are usually controlled by their designers. When a standards organization decides to sponsor a new standard for a language, it convenes a committee of people from industry and academia who have a strong interest in and extensive experience with that language. The standardization process is not easy or smooth. The committee must decide which dialect, or combination of ideas from different dialects, will become the standard. Committee members come to this task with different notions of what is good or bad and different priorities. Agreement at the outset is rare. The process may drag on for years as one or two committee members fight for their pet features.
This happened with the original ISO Pascal standard, the ANSI C standard, and the new FORTRAN-90 standard. After a standard is adopted by one standards organization (ISO or ANSI), the definition is considered by the other. In the best of all worlds, the new standard would be accepted by the second organization. For example, ANSI adopted the ISO standard for Pascal nearly unchanged. However, smooth sailing is not always the rule. The new ANSI C standard is not acceptable to some ISO committee members, and when ISO decides on a C standard, it may be substantially


different from ANSI C. The first standard for a language often clears up ambiguities, fixes some obvious defects, and defines a better and more portable language. The ANSI C and ANSI LISP standards do all of these things. Programmers writing new translators for this language must then conform to the common standard, as far as it goes. Implementations may also include words and structures, called extensions, that go beyond anything specified in the standard.

1.4.1 Language Growth and Divergence

After a number of years, language extensions accumulate and actual implementations diverge so much that programs again become nonportable. This has happened now with Pascal. The standard language is only minimally adequate for modern applications. For instance, it contains no support for string processing or graphics. Further, it has design faults, such as an inadequate case statement, and design shortcomings, such as a lack of static variables, initialized variables, and support for modular compilation. Virtually all implementations of Pascal for personal computers extend the language. These extensions are similar in intent and function but differ in detail. A program that uses the extensions is nonportable. One that doesn't use extensions is severely limited. We all need a new Pascal standard. When a standardized language has several divergent extensions in common use, the sponsoring standards agency may convene a new committee to reexamine and restandardize the language. The committee will consider the collection of extensions from various implementations and decide upon a new standard, which usually includes all of the old standard as a subset. Thus there is a constant tension between standardization and diversification. As our range of applications and our knowledge of language and translation techniques increase, there is pressure to extend our languages. Then the dialects in common use become diversified. When the diversity becomes too costly, the language will be restandardized.

1.5 Nonstandard Compilers

It is common for compilers to deviate from the language standard. There are three major kinds of deviations: extensions, intentional changes, and compiler bugs. The list of differences in Exhibit 1.3 was taken from the Introduction to the Turbo Pascal Reference Manual, Version 2.0. With each new version of Turbo, this list has grown in size and complexity. Turbo Pascal version 5 is a very different and much more extensive language than Standard Pascal. An extension is a feature added to the standard, as string operations and graphics primitives are often added to Pascal. Items marked with a + in Exhibit 1.3 are true extensions: they provide processing capabilities for things that are not covered by the standard but do not change the basic nature of the language. Sometimes compiler writers believe that a language, as it is officially defined, is defective; that is, some part of the design is too restrictive or too clumsy to use in a practical application environment. In these cases the implementor often redefines the language, making it nonstandard


Exhibit 1.3. Summary of Turbo Pascal deviations from the standard.

Key: + true extension, * syntactic extension or change, ! semantic change.

!  Absolute address variables
!  Bit/byte manipulation
!  Direct access to CPU memory and data ports
+  Dynamic strings
*  Free ordering of sections within declaration part
+  Full support of operating system facilities
*  In-line machine code generation
*  Include files
!  Logical operations on integers
*  Program chaining with common variables
+  Random access data files
+  Structured constants
+  Type conversion functions (to be used explicitly)

and incompatible with other translators. This is an intentional change. Items marked with a ! in Exhibit 1.3 change the semantics of the language by circumventing semantic protection mechanisms that are part of the standard. Items marked by a * are extensions and changes to the syntax of the language that do not change the semantics but, if used, do make Turbo programs incompatible with the standard. A compiler bug occurs where, unknown to the compiler writer, the compiler implements different semantics than those prescribed by the language standard. Examples of compiler bugs abound. One Pascal compiler for the Commodore 64 required a semicolon after every statement. In contrast, the Pascal standard requires semicolons only as separators between statements and forbids a semicolon before an ELSE. A program written for this nonstandard compiler cannot be compiled by a standard compiler and vice versa. An example of a common bug is implementation of the mod operator. The easy way to compute i mod j is to take the remainder after using integer division to calculate i/j. According to the Pascal standard, quoted in Exhibit 1.4,3 this computation method is correct if both i and j are positive integers. If i is negative, though, the result must be adjusted by adding in the modulus, j. The standard considers the operation to be an error if j is negative. Note that mod is only the same as the mathematical remainder function if i >= 0 and j > 0. Many compilers ignore this complexity, as shown in Exhibits 1.5 and 1.6. They simply perform an integer division operation and return the result, regardless of the signs of i and j. For example, in OSS Pascal for the Atari ST, the mod operator is defined in the usual nonstandard way. The OSS
3. Cooper [1983], page 3-1.


Exhibit 1.4. The definition of mod in Standard Pascal.

The value of i mod j is the value of i - (k*j) for an integer value k, such that 0 <= (i mod j) < j. (That is, the value is always between 0 and j.) The expression i mod j is an error if j is zero or negative.

Pascal reference manual (page 6-26) describes mod as follows: The modulus is the remainder left over after integer division. Compiling and testing a few simple expressions [Exhibit 1.5] substantiates this and shows how OSS Pascal differs from the standard. Expression 2 gives a nonstandard answer. Expressions (3) through (6) compile and run, but shouldn't. They are designated as errors in the standard, which requires the modulus to be greater than 0. These errors are not detected by the OSS Pascal compiler or run-time system, nor does the OSS Pascal reference manual state that they will not be detected, as required by the standard. In defense of this nonstandard implementation, one must note that this particular deviation is common and the function it computes is probably more useful than the standard definition for mod. The implementation of mod in Turbo Pascal is different, but also nonstandard, and may have been an unintentional deviation. It was not included on the list of nonstandard language features. [Exhibit 1.3] The author of this manual seems to have been unaware of this nonstandard nature of mod and did not even describe it adequately. The partial information given in the Turbo reference manual (pages 51-52) is as follows:
  mod is only defined for integers
  its result is an integer
  12 mod 5 = 2

Exhibit 1.5. The definition of mod in OSS Pascal for the Atari ST.

   Expression   OSS result   Answer according to Pascal Standard
1. 5 mod 2      1            Correct.
2. -5 mod 2     -1           Should be 1 (between 0 and the modulus-1).
3. 5 mod -2     1            Should be detected as an error.
4. -5 mod -2    -1           Should be detected as an error.
5. 5 mod 0      0            Should be detected as an error.
6. -5 mod 0     -1           Should be detected as an error.


Exhibit 1.6. The definition of mod in Turbo Pascal for the IBM PC.

       Expression    Turbo result       Answer according to Pascal Standard
    1.  5 mod  2         1              Correct.
    2. -5 mod  2        -1              Should be -1 + 2 = 1.
    3.  5 mod -2         1              Should be an error.
    4. -5 mod -2        -1              Should be an error.
    5.  5 mod  0     Run-time error     Correct.
    6. -5 mod  0     Run-time error     Correct.

The reference manual for Turbo Pascal version 4.0 still does not include mod on the list of nonstandard features. However, it does give an adequate definition (p. 240) of the function it actually computes for mod: the mod operator returns the remainder from dividing the operands: i mod j = i - (i/j) * j. The sign of the result is the sign of i. An error occurs if j = 0. Compiling and testing a few simple expressions [Exhibit 1.6] substantiates this definition. Expression 2 gives a nonstandard answer. Expressions (3) and (4) are designated as errors in the standard, which requires the modulus to be greater than 0. These errors are not detected by the Turbo compiler. Furthermore, its reference manual does not state that they will not be detected, as required by the standard. While Turbo Pascal will not compile a div or mod operation with 0 as a constant divisor, the result of i mod 0 can be tested by setting a variable, j, to zero, then printing the results of i mod j. This gives the results on lines (5) and (6).

Occasionally, deviations from the standard occur because an implementor believes that the standard, although unambiguous, defined an item wrong; that is, some other definition would have been more efficient or more useful. The version incorporated into the compiler is intended as an improvement over the standard. Again, the implementation of mod provides an example here. In many cases, the programmer who uses mod really wants the arithmetic remainder, and it seems foolish for the compiler to insert extra lines of code in order to compute the unwanted standard Pascal function. At least one Pascal compiler (for the Apollo workstation) provides a switch that can be set either to compile the standard meaning of mod or to compile the easy and efficient meaning. The person who wrote this compiler clearly believed that the standard was wrong to include the version it did rather than the integer remainder function.
The implementation of input and output operations in Turbo Pascal version 2.0 provides another example of a compiler writer who declined to implement the standard language because he believed his own version was clearly superior. He explains this decision as follows (Borland [1984], Appendix F):


    The standard procedures GET and PUT are not implemented. Instead, the READ and WRITE procedures have been extended to handle all I/O needs. The reason for this is threefold: Firstly, READ and WRITE gives much faster I/O, secondly variable space overhead is reduced, as file buffer variables are not required, and thirdly the READ and WRITE procedures are far more versatile and easier to understand than GET and PUT.

The actual Turbo implementation of READ did not even measure up to the standard in a minimal way, as it did not permit the programmer to read a line of input from the keyboard one character at a time. (It is surely inefficient to do so but essential in some applications.) Someone who did not know that this deviation was made on purpose would think that it was simply a compiler bug. This situation provides an excellent example of the dangers of taking the law into your own hands.

Whether or not we agree with the requirements of a language standard, we must think carefully before using nonstandard features. Every time we use a nonstandard feature or one that depends on the particular bit-level implementation of the language, it makes a program harder to port from one system to another and decreases its potential usefulness and potential lifetime. Programmers who use nonstandard features in their code should segregate the nonstandard segments and thoroughly document them.

Exercises
1. Define natural language. Define programming language. How are they different?
2. How are languages used to establish communication?
3. What is the syntax of a language? What are the semantics?
4. What are the traits that determine the character of a language?
5. How do these traits appear in programming languages?
6. What need led to standardization?
7. What is a standard for a language?
8. What does it mean when a language standard defines something to be undefined?
9. How does standardization lead to portability?
10. What three kinds of deviations are common in nonstandard compilers?
11. What are the advantages and disadvantages of using nonstandard language features?

Chapter 2

Representation and Abstraction

Overview
This chapter presents the concept of how real-world objects, actions, and changes in the state of a process are represented through a programming language on a computer. Programs can be viewed as either a set of instructions for the computer to execute or as a model of some real-world process. Languages designed to support these views will exhibit different properties. The language designer must establish a set of goals for the language and then examine them for consistency, importance, and restrictions. Principles for evaluating language design are presented. Classification of languages into groups is by no means an easy task. Categories for classifying languages are discussed.

Representation may be explicit or implicit, coherent or diffused.

2.1  What Is a Program?

We can view a program in two ways.

1. A program is a description of a set of actions that we want a computer to carry out. The actions are the primitive operations of some real or abstract machine, and they are performed using the primitive parts of a machine. Primitive actions include such things as copying data from one machine register or a memory location to another, applying an operation to a register, or activating an input or output device.



2. A program is a model of some process in the real or mathematical world. The programmer must set up a correspondence between symbols in the program and real-world objects, and between program functions and real-world processes. Executing a function represents a change in the state of the world or finding a solution to a set of specifications about elements of that world.

These two world-views are analogous to the way a builder and an architect view a house. The builder is concerned with the method for achieving a finished house. It should be built efficiently and the result should be structurally sound. The architect is concerned with the overall function and form of the house. It should carry out the architect's concepts and meet the client's needs. The two world-views lead to very different conclusions about the properties that a programming language should have.

A language supporting world-view (1) provides ready access to every part of the computer so that the programmer can prescribe in detail how the computer should go about solving a given problem. The language of a builder contains words for each material and construction method used. Similarly, a program construction language allows one to talk directly about hardware registers, memory, data movement, I/O devices, and so forth. The distinction isn't simply whether the language is low-level or high-level, for assembly language and C are both designed with the builder in mind. Assembly language is, by definition, low-level, and C is not, since it includes control structures, type definitions, support for modules, and the like. However, C permits (and forces) a programmer to work with and be aware of the raw elements of the host computer.

A language supporting world-view (2) must be able to deal with abstractions and provide a means for expressing a model of the real-world objects and processes. An architect deals with abstract concepts such as space, form, light, and functionality, and with more concrete units such as walls and windows. Blueprints, drawn using a formal symbolic language, are used to represent and communicate the plan. The builder understands the language of blueprints and chooses appropriate methods to implement them. The languages Smalltalk and Prolog were designed to permit the programmer to represent and communicate a world-model easily.
They free the programmer of concerns about the machine and let him or her deal instead with abstract concepts. In Smalltalk the programmer defines classes of objects and the processes relevant to these classes. If an abstract process is relevant to several classes, the programmer can define how it is to be accomplished for each. In Prolog the programmer represents the world using formulas of mathematical logic. In other languages, the programmer may use procedures, type declarations, structured loops, and block structure to represent and describe the application. Writing a program becomes a process of representing objects, actions, and changes in the state of the process being modeled [Exhibit 2.1].

The advantage of a builder's language is that it permits the construction of efficient software that makes effective use of the computer on which it runs. A disadvantage is that programs tailored to a particular machine cannot be expected to be well suited to another machine, and hence they are not particularly portable.


Exhibit 2.1. Modeling a charge account and relevant processes.

Objects: A program that does the accounting for a company's charge accounts must contain representations for several kinds of real-world objects: accounts, payments, the current balance, items charged, items returned, interest.

Actions: Each action to be represented involves objects from a specified class or classes. The actions to be represented here include the following:

    Credit a payment to an account.
    Send a bill to the account owner.
    Debit a purchase to an account.
    Credit a return to an account.
    Compute and debit the monthly interest due.

Changes of state: The current balance of an account, today's date, and the monthly payment date for that account encode the state of the account. The balance may be positive, negative, or zero, and a positive balance may be either ok or overdue. Purchases, returns, payments, and monthly due dates and interest dates all cause a change in the state of the account.

Moreover, a programmer using such a language is forced to organize ideas at a burdensome level of detail. Just as a builder must be concerned with numerous details such as building codes, lumber dimensions, proper nailing patterns, and so forth, the program builder likewise deals with storage allocation, byte alignment, calling sequences, word sizes, and other details which, while important to the finished product, are largely unrelated to its form and function.

By way of contrast, an architect's language frees one from concern about the underlying machine and allows one to describe a process at a greater level of abstraction, omitting the minute details. A great deal of discretion is left to the compiler designer in choosing methods to carry out the specified actions. Two compilers for the same architect's language often produce compiled code of widely differing efficiency and storage requirements. In fact, there is no necessary reason why there must be a compiler at all. One could use the architect's language to specify the form and function of the finished program and then turn the job over to a program builder. However, the computer can do a fairly good job of automatically producing a program for such languages, and the ability to have it do so gives the program architect a powerful tool not available to the construction architect: the ability to rapidly prototype designs. This is the power of the computer, and one of the aspects that makes the study of programming languages so fascinating!


Exhibit 2.2. Representations of a date in English.

Dates are abstract real-world objects. We represent them in English by specifying an era, year, month, and day of the month. The era is usually omitted, and the year is often omitted in representations of dates, because they can be deduced from context. The month is often encoded as an integer between 1 and 12.

    Full, explicit representation:  January 2, 1986 AD
    Common representations:         January 2, 1986    Jan. 2, 86    Jan. 2
                                    1-2-86             2 Jan 86      86-1-2

2.2  Representation

A representation of an object is a list of the relevant facts about that object in some language [Exhibit 2.2]. A computer representation of an object is a mapping of the relevant facts about that object, through a computer language, onto the parts of the machine. Some languages support high-level or abstract representations, which specify the functional properties of an object or the symbolic names and data types of the fields of the representation [Exhibit 2.3]. A high-level representation will be mapped onto computer memory by a translator. The actual number and order of bytes of storage that will be used to represent the object may vary from translator to translator. In contrast, a computer representation is low level if it describes a particular implementation of the object, such as the amount of storage that will be used, and the position of each field in that storage area [Exhibit 2.4]. A computer representation of a process is a sequence of program definitions, specifications, or

Exhibit 2.3. High-level computer representations of a date.

An encoding of the last representation in Exhibit 2.2 is often used in programs. In a high-level language the programmer might specify that a date will be represented by three integers, as in this Pascal example:

    TYPE date = RECORD year, month, day: integer END;
    VAR BirthDate: date;

The programmer may now refer to this object and its components as:

    BirthDate  or  BirthDate.year  or  BirthDate.month  or  BirthDate.day


Exhibit 2.4. A low-level computer representation of a date.

In a low-level language such as assembler or FORTH, the programmer specifies the exact number of bytes of storage that must be allocated (or ALLOTted) to represent the date. In the FORTH declaration below, the keyword VARIABLE causes 2 bytes to be allocated, and 4 more are explicitly allocated using ALLOT. Then the programmer must manually define selection functions that access the fields of the object by adding an appropriate offset to the base address.

    VARIABLE birth_date 4 ALLOT
    : year  0 + ;   ( Year is first -- offset is zero bytes. )
    : month 2 + ;   ( Month starts two bytes from the beginning. )
    : day   4 + ;   ( Day is the fifth and sixth bytes. )

The variable named birth_date and its component fields can now be accessed by writing:

    birth_date  or  birth_date year  or  birth_date month  or  birth_date day

statements that can be performed on representations of objects from specified sets. We say that the representation of a process is valid, or correct, if the transformed object representation still corresponds to the transformed object in the real world.

We will consider three aspects of the quality of a representation: semantic intent, explicitness, and coherence. Abstract representations have these qualities to a high degree; low-level representations often lack them.

2.2.1  Semantic Intent

A data object (variable, record, array, etc.) in a program has some intended meaning that is known to the programmer but cannot be deduced with certainty from the data representation itself. This intended meaning is the programmer's semantic intent. For example, three 2-digit integers can represent a woman's measurements in inches or a date. We can only know the intended meaning of a set of data if the programmer communicates, or declares, the context in which it should be interpreted.

A program has semantic validity if it faithfully carries out the programmer's explicitly declared semantic intent. We will be examining mechanisms in various languages for expressing semantic intent and ensuring that it is carried out. Most programming languages use a data type to encode part of the semantic intent of an object. Before applying a function to a data object, the language translator tests whether the function is defined for that object and, therefore, is meaningful in its context. An attempt to apply a function to a data object of the wrong type is identified as a semantic error. A type checking mechanism can thus help a programmer write semantically valid (meaningful) programs.


Exhibit 2.5. The structure of a table expressed implicitly.

Pascal permits the construction and use of sorted tables, but the fact that the table is sorted cannot be explicitly declared. We can deduce that the table is sorted by noting that a sort algorithm is invoked, that a binary search algorithm is used, or that a sequential search algorithm is used that can terminate a search unsuccessfully before reaching the end of the table. The order of entries (whether ascending or descending) can be deduced by careful analysis of three things:

    The comparison operator used in a search (< or >)
    The order of operands in relation to this operator
    The result (true or false) which causes the search to terminate.

Deductions of this sort are beyond the realistic present and future abilities of language translators.

2.2.2  Explicit versus Implicit Representation

The structure of a data object can be reflected implicitly in a program, by the way the statements are arranged [Exhibit 2.5], or it can be declared explicitly [Exhibit 2.6]. A language that can declare more kinds of things explicitly is more expressive. Information expressed explicitly in a program may be used by the language translator. For example, if the COBOL programmer supplies a KEY clause, the processor will permit the programmer to use the efficient built-in binary search command, because the KEY clause specifies that the file is sorted in order by that field. The less-efficient sequential search command must be used to search any table that does not have a KEY clause.

A language that permits explicit communication of information must have a translator that can identify, store, organize, and utilize that information. For example, if a language permits programmers to define their own types, the translator needs to implement type tables (where type descriptions are stored), new allocation methods that use these programmer-defined descriptions, and more elaborate rules for type checking and type errors. These translator mechanisms to identify, store, and interpret the programmer's declarations form the semantic basis of a language. Other mechanisms that are part of the semantic basis are those which implement binding (Chapters 6 and 9), type checking and automatic type conversion (Chapter 15), and module protection (Chapter 16).

2.2.3  Coherent versus Diffuse Representation

A representation is coherent if an external entity (object, idea, or process) is represented by a single symbol in the program (a name or a pointer) so that it may be referenced and manipulated as a unit [Exhibit 2.7]. A representation is diffuse if various parts of the representation are known by


Exhibit 2.6. COBOL: The structure of a table expressed explicitly.

COBOL allows explicit declaration of sorted tables. The key field(s) and the order of entries may be declared as in the following example. This table is intended to store the names of the fifty states of the United States and their two-letter abbreviations. It is to be stored so that the abbreviations are in alphabetical order.

    01 state-table.
       02 state-entry OCCURS 50 TIMES
             ASCENDING KEY state-abbrev
             INDEXED BY state-index.
          03 state-abbrev PICTURE XX.
          03 state-name   PICTURE X(20).

This table can be searched for a state abbreviation using the binary-search utility. A possible call is:

    SEARCH ALL state-entry
        AT END PERFORM failed-search-process
        WHEN state-abbrev (state-index) = search-key
            PERFORM found-process.

COBOL also permits the programmer to declare Pascal-like tables for which the sorting order and key field are not explicitly declared. The SEARCH ALL command cannot be used to search such a table; the programmer can only use the less efficient sequential search command.

different names, and no one name or symbol applies to the whole [Exhibits 2.8 and 2.9]. A representation is coherent if all the parts of the represented object can be named by one symbol. This certainly does not imply that all the parts must be stored in consecutive (or contiguous) memory locations. Thus an object whose parts are connected by links or pointers can still be coherent [Exhibit 2.10].

The older languages (FORTRAN, APL) support coherent representation of complex data objects

Exhibit 2.7. A stack represented coherently in Pascal.

A stack can be represented by an array and an integer index for the number of items currently in the array. We can represent a stack coherently by grouping the two parts together into a record. One parameter then suffices to pass this stack to a function.

    TYPE stack = RECORD
                     store: ARRAY [1..max_stack] OF stack_type;
                     top:   0..max_stack
                 END;


Exhibit 2.8. A stack represented diffusely in FORTRAN and Pascal.

A stack can be represented diffusely as an array of items and a separate index (an integer in the range 0 to the size of the array).

FORTRAN: A diffuse representation is the only representation of a stack that is possible in FORTRAN because the language does not support heterogeneous records. These declarations create two objects which, taken together, comprise a stack of 100 real numbers:

    REAL STSTORE( 100 )
    INTEGER STTOP

Pascal: A stack can be represented diffusely in Pascal. This code allocates the same amount of storage as the coherent version in Exhibit 2.7, but two parameters are required to pass this stack to a procedure.

    TYPE stack_store = ARRAY [1..max_stack] OF stack_type;
         stack_top = 0..max_stack;

Exhibit 2.9. Addition represented diffusely or coherently.

FORTH: The operation of addition is represented diffusely because addition of single-length integers, double-length integers, and mixed-length integers are named by three different symbols (+, D+, and M+), and no way is provided to refer to the general operation of integer +.

C: The operation of addition is represented coherently. Addition of single-length integers, double-length integers, and mixed-length integers are all named +. The programmer may refer to + for addition, without concern for the length of the integer operands.

Exhibit 2.10. A coherently but not contiguously represented object.

LISP: A linked tree structure or a LISP list is a coherent object because a single pointer to the head of the list allows the entire list to be manipulated. A tree or a list is generally implemented by using pointers to link storage cells at many memory locations.

C: A sentence may be represented as an array of words. One common representation is a contiguous array of pointers, each of which points to a variable-length string of characters. These strings are normally allocated separately and are not contiguous.


only if the object can be represented by a homogeneous array of items of the same data type.[1] Where an object has components represented by different types, separate variable names must be used. COBOL and all the newer languages support coherent heterogeneous groupings of data. These are called records in COBOL and Pascal, and structures in C.

The FORTRAN programmer can use a method called parallel arrays to model an array of heterogeneous records. The programmer declares one array for each field of the record, then uses a single index variable to refer to corresponding elements of the set of arrays. This diffuse representation accomplishes the same goal as a Pascal array of records. However, an array of records represents the problem more clearly and explicitly and is easier to use. For example, Pascal permits an array of records to be passed as a single parameter to a function, whereas a set of parallel arrays in FORTRAN would have to be passed as several parameters.

Some of the newest languages support coherence further by permitting a set of data representations to be grouped together with the functions that operate on them. Such a coherent grouping is called a module in Modula-2, a cluster in CLU, a class in Smalltalk, and a package in Ada.

2.3  Language Design

In this section we consider reasons why a language designer might choose to create an architect's language with a high degree of support for abstraction, or a builder's language with extensive control over low-level aspects of representation.

2.3.1  Competing Design Goals

Programming languages have evolved greatly since the late 1950s when the first high-level languages, FORTRAN and COBOL, were implemented. Much of this evolution has been made possible by the improvements in computer hardware: today's machines are inconceivably cheap, fast, and large (in memory capacity) compared to the machines available in 1960. Although those old machines were physically bulky and tremendously expensive, they were hardly more powerful than machines that today are considered to be toys.

Along with changes in hardware technology came improvements in language translation techniques. Both syntax and semantics of the early languages were ad hoc and clumsy to translate. Formal language theory and formal semantics affected language design in revolutionary ways and have resulted in better languages with cleaner semantics and a more easily translatable syntax.

There are many aspects of a language that the user cannot modify or extend, such as the data structuring facilities and the control structures. Unless a language system supports a preprocessor, the language syntax, also, is fixed. If control structures and data definition facilities are not built
[1] The EQUIVALENCE statement can be used to circumvent this weakness by defining the name of the coherent object as an overlay on the storage occupied by the parts. This does not constitute adequate support for compound heterogeneous objects.


in, they are not available. Decisions to include or exclude such features must, therefore, be made carefully. A language designer must consider several aspects of a potential feature to decide whether it supports or conflicts with the design goals.

During these thirty years of language development, a consensus has emerged about the importance of some language features, for example, type checking and structured conditionals. Most new languages include these. On other issues, there has been and remains fundamental disagreement, for instance, over the question of whether procedural or functional languages are better. No single set of value judgments has yet emerged, because different languages have different goals and different intended uses. The following are some potential language design goals:

Utility. Is a feature often useful? Can it do important things that cannot be done using other features of the language?

Convenience. Does this feature help avoid excessive writing? Does this feature add or eliminate clutter in the code?

Efficiency. Is it easy or difficult to translate this feature? Is it possible to translate this feature into efficient code? Will its use improve or degrade the performance of programs?

Portability. Will this feature be implementable on any machine?

Readability. Does this form of this feature make a program more readable? Will a programmer other than the designer understand the intent easily? Or is it cryptic?

Modeling ability. Does this feature help make the meaning of a program clear? Will this feature help the programmer model a problem more fully, more precisely, or more easily?

Simplicity. Is the language design as a whole simple, unified, and general, or is it full of dozens of special-purpose features?

Semantic clarity. Does every legal program and expression have one defined, unambiguous meaning? Is the meaning constant during the course of program execution?

These goals are all obviously desirable, but they conflict with each other.
For example, a simple language cannot possibly include all useful features, and the more features included, the more complicated the language is to learn, use, and implement. Ada illustrates this conflict. Ada was designed for the Department of Defense as a language for embedded systems, to be used in all systems development projects, on diverse kinds of hardware. Thus it necessarily reflects a high value placed on items at the beginning and middle of the preceding list of design goals. The result is a very large language with a long list of useful special features.

Some language researchers have taken as goals the fundamental properties of language shown at the end of the list of design goals. Outstanding examples include Smalltalk, a superior language for modeling objects and processes, and Miranda, which is a list-oriented functional language that achieves both great simplicity and semantic clarity.


Exhibit 2.11. A basic type not supported by Pascal.

Basic type implemented by most hardware: bit strings
Common lengths: 8, 16, and 32 bits (1, 2, and 4 bytes)

    Operations built into most hardware    Symbol in C
    a right shift n places                 a >> n
    a left shift n places                  a << n
    a and b                                a & b
    a or b                                 a | b
    a exclusive or b                       a ^ b
    complement a                           ~ a

2.3.2  The Power of Restrictions

Every language imposes restrictions on the user, both by what it explicitly prohibits and by what it simply doesn't provide. Whenever the underlying machine provides instructions or capabilities that cannot be used in a user program, the programming language is imposing a restriction on the user. For example, Pascal does not support the type bit string and does not have bit string operators [Exhibit 2.11]. Thus Pascal restricts access to the bit-level implementations of objects.

The reader must not confuse logical operators with bitwise operators. Pascal supports the logical (Boolean) data type and logical operators and, or, and not. Note that there is a difference between these and the bitwise operators [Exhibit 2.12]. Bitwise operators apply the operation between every corresponding pair of bits in the operands. Logical operators apply the operation to the operands as a whole, with 00000000 normally being interpreted as False and anything else as True.

In general, restrictions might prevent writing the following two sorts of sentences:

1. Useless or meaningless sentences such as 3 := 72.9 + a .

2. Sentences useful for modeling some problem, that could be written efficiently in assembly

Exhibit 2.12. Bitwise and logical operations.

The difference between bitwise and logical operations can be seen by comparing the input and output from these operations in C:

    Operation      Operands as bit strings    Result      Explanation
    bitwise and    10111101 &  01000010       00000000    no bit pairs match
    logical and    10111101 && 01000010       00000001    both operands represent True
    complement     ~ 01000010                 10111101    each bit is flipped
    logical not    ! 01000010                 00000000    operand is True


Exhibit 2.13. Useful pointer operations supported in C.

In the C expressions below, p is a pointer used as an index for the array named ages. Initially, p will store the machine address of the beginning of ages. To make p index the next element, p will be incremented by the number of bytes in one element of ages. The array element is accessed by dereferencing the pointer.

This code contains an error in logic, which is pointed out in the comments. It demonstrates one semantic problem that Pascal's restrictions were designed to prevent. The loop will be executed too many times, and this run-time error will not be detected. Compare this to the similar Pascal loop in Exhibit 2.14.

    int ages[10];            /* Array "ages" has subscripts 0..9.              */
    ...
    p = ages;                /* Make p point at ages[0].                       */
    end = &ages[10];         /* Compute address of eleventh array element.     */
    while (p <= end) {       /* Loop through eleventh array address.           */
        printf("%d \n", *p); /* Print array element in decimal integer format. */
        p++;                 /* Increment p to point at next element of ages.  */
    }

code but are prohibited. A good example of a useful facility that some languages prohibit is explicit address manipulation. This is supported in C [Exhibit 2.13]. The notation for pointer manipulation is convenient and is generally used in preference to subscripting when the programmer wishes to process an array sequentially. In contrast, manipulation of addresses is restricted in Pascal to prevent the occurrence of meaningless and potentially dangerous dangling pointers (see Chapter 6). Address manipulation is prohibited and address arithmetic is undefined in Pascal. Nothing comparable to the C code in Exhibit 2.13 can be written in Pascal. A pointer can't be set to point at an array element, and it cannot be incremented to index through an array. This is a significant loss in Pascal because using a subscript involves a lot of computation: the subscript must be checked against the minimum and maximum legal values, multiplied by the size of an array element, and added to the base address of the array. Checking whether a pointer has crossed an array boundary and using it to access an element could be done significantly faster.

Let us define flexibility to mean the absence of a restriction, and call a restriction good if it prevents the writing of nonsense, and bad if it prevents writing useful things. Some restrictions might have both good and bad aspects. A powerful language must have the flexibility to express a wide variety of actions, preferably a variety that approaches the power of the underlying machine. But power is not a synonym for flexibility. The most flexible of all languages is assembly language, but assemblers lack the power to express a problem solution succinctly and clearly. A

2.3. LANGUAGE DESIGN


Exhibit 2.14. A meaningless operation prohibited in Pascal but not in C. Subscripts are checked at run time in Pascal. Every subscript that is used must be within the declared bounds.

    VAR ages: array[0..9] of integer;
        p: integer;
    ...
    p := 0;
    while p <= 10 do begin   /* Loop through last array subscript. */
        writeln( ages[p] );  /* Print the array element. */
        p := p + 1           /* Make p point at next element of array. */
    end;

The last time around this loop the subscript, p, has a value that is out of range. This will be detected, and a run-time error comment will be generated. The analogous C code in Exhibit 2.13 will run and print garbage on the last iteration. The logical error will not be detected, and no error comment will be produced.

second kind of power is provided by sophisticated mechanisms in the semantic basis of a language that let the programmer express a lot by saying a little. The type definition and type checking facility in any modern language is a good example of a powerful mechanism. A third kind of power can come from good restrictions that narrow the variety of things that can be written. If a restriction can eliminate troublesome or meaningless sentences automatically, then programmers will not have to check, explicitly, whether such meaningless sections occur in their programs. Pascal programs rarely run wild and destroy memory. But C and FORTH programs, with unrestricted pointers and no subscript bounds checking, often do so.

A language should have enough good restrictions so that the programmer and translator can easily distinguish between a meaningful statement and nonsense. For example, an attempt to access an element of an array with a subscript greater than the largest array subscript is obviously meaningless in any language. The underlying machine hardware permits one to FETCH and STORE information beyond the end of an array, but this can have no possible useful meaning and is likely to foul up the further operation of the program. The semantics of standard Pascal prescribe that the actual value of each subscript expression should be checked at run time. An error comment is generated if the value is not within the declared array bounds. Thus, all subscripting in Pascal is safe and cannot lead to destruction of other information [Exhibit 2.14].

No such array bounds check is done in C. Compare Exhibits 2.13 and 2.14. These two code fragments do analogous things, but the logical error inherent in both will be trapped by Pascal and ignored by C. In C, a FETCH operation with too large a subscript can supply nonsensical information, and a STORE can destroy vital, unrelated information belonging to variables allocated before or after the array.
This situation was exploited to create the computer network worm that


invaded hundreds of computer systems in November 1988. It disabled these systems by flooding their processing queues with duplicates of itself, preventing the processing of normal programs. This escapade resulted in the arrest and conviction of the programmer.

Often, as seen in Exhibit 2.13, a single feature is both useful and dangerous. In that case, a language designer has to make a value judgement about the relative importance of the feature and the danger in that feature. If the designer considers the danger to outweigh the importance, the restriction will be included, as Wirth included the pointer restrictions in Pascal. If the need outweighs the danger, the restriction will not be included. In designing C, Kernighan and Ritchie clearly felt that address manipulation was vital, and decided that the dangers of dangling pointers would have to be avoided by careful programming, not by imposing general restrictions on pointers.

2.3.3 Principles for Evaluating a Design

In the remaining chapters of this book we will sometimes make value judgments about the particular features that a language includes or excludes. These judgments will be based on a small set of principles.

Principle of Frequency

The more frequently a language feature will be used, the more convenient its use should be, and the more lucid its syntax should be. An infrequently used feature can be omitted from the core of the language and/or be given a long name and less convenient syntax.

C provides us with examples of good and poor application of this principle. The core of the C language does not include a lot of features that are found in the cores of many other languages. For example, input/output routines and mathematical functions for scientific computation are not part of the standard language. These are relegated to libraries, which can be searched if these features are needed. There are two C libraries which are now well standardized, the math library and the C library (which includes the I/O functions).

The omission of mathematical functions from C makes good sense because the intended use of C was for systems programming, not scientific computation. Putting these functions in the math library makes them available but less convenient. To use the math library, the loader must have the library on its search path and the user must include a header file in the program which contains type declarations for the math functions.

On the other hand, most application programs use the input/output functions, so they should be maximally convenient. In C they aren't; in order to use them a programmer must include the appropriate header file containing I/O function and macro declarations, and other essential things. Thus nearly every C application program starts with the instruction #include <stdio.h>. This could be considered to be a poor design element, as it would cost relatively little to build these definitions into the translator.

Principle of Locality


A good language design enables and encourages, perhaps even enforces, locality of effects. The further the effects of an action reach in time (elapsed during execution) or in space (measured in pages of code), the more complex and harder it is to debug a program. The further an action has influence, the harder it is to remember relevant details, and the more subtle errors seem to creep into the code. To achieve locality, the use of global variables should be minimized or eliminated and all transfers of control should be short-range. A concise restatement of this principle, in practical terms, is: Keep the effects of everything confined to as local an area of the code as possible.

Here are some corollaries of the general principle, applied to lexical organization of a program that will be debugged on-line, using an ordinary nonhierarchical text editor:

- A control structure that won't fit on one screen is too long; shorten it by defining one or more scopes as subroutines.
- All variables should be defined within one screen of their use. This applies whether the user's screen is large or small; the important thing is to be able to see an entire unit at one time.
- If your subroutine won't fit on two screens, it is too long. Break it up.

Global Variables. Global variables provide a far more important example of the cost of nonlocality. A global variable can be changed or read anywhere within a program. Specifically, it can be changed accidentally (because of a typographical error or a programmer's absentmindedness) in a part of the program that is far removed from the section in which it is (purposely) used. This kind of error is hard to find. The apparent fault is in the section that is supposed to use the variable, but if that section is examined in isolation, it will work properly. To find the cause of the error, a programmer must trace the operation of the entire program. This is a tedious job. The use of unnecessary global variables is, therefore, dangerous.
If the program were rewritten to declare this variable locally within the scope in which it is used, the distant reference would promptly be identified as an error or as a reference to a semantically distinct variable that happens to have the same name.

Among existing languages are those that provide only global variables, those that provide globals but encourage use of locals and parameters, and those that provide only parameters.

Unrestricted use of global variables. A BASIC programmer cannot restrict a variable to a local scope. This is part of the reason that BASIC is not used for large systems programs.


Use of global variables permitted but use of locals encouraged. Pascal and C are block structured languages that make it easy to declare variables in the procedure in which they are used.2 Their default method of parameter passing is call-by-value. Changing a local variable or value parameter has only local effects. Programmers are encouraged to use local declarations, but they can use global variables in place of both local variables and parameters.

Use of global variables prohibited. In the modern functional languages there are no global variables. Actually, there are no variables at all, and parameter binding takes the place of assignment to variables. Assignment was excluded from this class of languages because it can have nonlocal effects. The result is languages with elegant, clean semantics.

Principle of Lexical Coherence

Sections of code that logically belong together should be physically adjacent in the program. Sections of code that are not related should not be interleaved. It should be easy to tell where one logical part of the program ends and another starts. A language design is good to the extent that it permits, requires, or encourages lexical coherence. This principle concerns only the surface syntax of the language and is, therefore, not as important as the other principles, which concern semantic power. Nonetheless, good human engineering is important in a language, and lexical coherence is important to make a language usable and readable.

Poor lexical coherence can be seen in many languages. In Pascal the declarations of local variables for the main program must be near the top of the program module, and the code for main must be at the bottom [Exhibit 2.15]. All the function and procedure definitions intervene. In a program of ordinary size, several pages of code come between the use of a variable in main and its definition. Recently, hierarchical editors have been developed for Pascal.
They allow the programmer to hide a function definition under the function header. A program is thus divided into levels, with the main program at the top level and its subroutines one level lower. If the subroutines have subroutines, they are at level three, and so on. When the main program is on the screen, only the top level code appears, and each function definition is replaced by a simple function header. This brings the main program's body back into the lexical vicinity of its declarations. When the programmer wishes to look at the function definition, simple editor commands will allow him to descend to that level and return.

A similar lack of coherence can be seen in early versions of LISP.3 LISP permits a programmer to write a function call as a literal function, called a lambda expression, followed by its actual arguments, as shown at the top of Exhibit 2.16. The dummy parameter names are separated from the matching parameter values by an arbitrarily long function body.
2. Local declarations are explained fully in Chapter 6; parameters are discussed in Chapter 9, Section 9.2.
3. McCarthy et al. [1962].


Exhibit 2.15. Poor lexical coherence for declarations and code in Pascal. The parts of a Pascal program are arranged in the order required to permit one-pass compilation:

    Constant declarations.
    Type declarations.
    Variable declarations.
    Procedure and Function declarations.
    Code.

Good programming style demands that most of the work of the program be done in subroutines, and the part of the program devoted to subroutine definitions is often many pages long. The variable declarations and code for the main program are, therefore, widely separated, producing poor lexical coherence.

Exhibit 2.16. Syntax for lambda expressions in LISP. The order of elements in the primitive syntax is:

    ((lambda ( list of dummy parameter names )
        ( body of the function ))
     list of actual parameter values )

The order of elements in the extended syntax is:

    (let ( list of dummy name - actual value pairs )
        ( body of the function ))


Exhibit 2.17. A LISP function call with poor coherence. The following literal function is written in the primitive LISP syntax. It takes two parameters, x and y. It returns their product plus their difference. It is being called with the arguments 3.5 and a + 2. Note that the parameter declarations and matching arguments are widely separated.

    ((lambda (x y)
        (+ (* x y)
           (- x y)))
     3.5
     (+ a 2))

This lack of lexical coherence makes it awkward and error prone for a human to match up the names with the values, as shown in Exhibit 2.17. The eye swims when interpreting this function call, even though it is simple and the code section is short.

Newer versions of LISP, for example Common LISP,4 offer an improved syntax with the same semantics but better lexical coherence. Using the let syntax, dummy parameter names and actual values are written in pairs at the top, followed by the code. This syntax is shown at the bottom of Exhibit 2.16, and an example of its use is shown in Exhibit 2.18.

A third, and extreme, example of poor lexical coherence is provided by the syntax for function definitions in SNOBOL. A SNOBOL IV function is defined by a function header of the following form:

    ( name ( parameter list ) local variable name list , entry label )

The code that defines the action of the subroutine can be anywhere within the program module, and it starts at the line labeled entry label. It does not even need to be all in the same place,
4. Kessler [1988], p. 59.

Exhibit 2.18. A LISP function call with good coherence. The following function call is written in LISP using the extended let syntax. It is semantically equivalent to the call in Exhibit 2.17.

    (let ((x 3.5)
          (y (+ a 2)))
       (+ (* x y)
          (- x y)))

Compare the ease of matching up parameter names and corresponding arguments here with the difficulty in Exhibit 2.17. The lexically coherent syntax is clearly better.


Exhibit 2.19. Poor lexical coherence in SNOBOL. SNOBOL has such poor lexical coherence that semantically unrelated lines can be interleaved, and no clear indication exists of the beginning or end of any program segment. This program converts English to Pig Latin. It is annotated below.

    1.        DEFINE( 'PIG(X)Y,Z', 'PIG1' )        :(MAIN)
    2. PROC   OUTPUT = PIG(IN)
    3. MAIN   IN = INPUT                           :F(END) S(PROC)
    4. PIG1   PIG = NULL
    5.        X  SPAN(' ')  =
    6. LOOP   X  BREAK(' ') . Y  SPAN(' ')  =      :F(RETURN)
    7.        Y  LEN(1) . Z  =                     :F(RETURN)
    8.        PIG = PIG Y Z 'AY'                   :(LOOP)
    9.        OUTPUT =

END

Program Notes. The main program begins on line 1, with the declaration of a header for a subroutine named PIG. Line 1 directs that execution is to continue on the line named MAIN. The subroutine declaration says that the subroutine PIG has one parameter, X, and two local variables, Y and Z. The subroutine code starts on the line with the label PIG1.

Lines 2, 3, and 9 belong to the main program. They read a series of messages, translate each to Pig Latin, write them out, and quit when a zero-length string is entered.

Lines 4 through 8 belong to the subroutine PIG. Line 4 initializes the answer to the null string. Line 5 strips leading blanks off the parameter, X. Line 6 isolates the next word in X (if any), and line 7 isolates its first letter. Finally, line 8 glues this word onto the output string with its letters in a different order and loops back to line 6.

since each of its lines may be attached to the next by a GOTO. Thus a main program and several subroutines could be interleaved. (We do admit that a sane programmer would never do such a thing.) Exhibit 2.19 shows a SNOBOL program, with subroutine, that translates an English sentence into Pig Latin. The line numbers are not part of the program but are used to key it to the program notes that follow.

Principle of Distinct Representation

Each separate semantic object should be represented by a separate syntactic item. Where a single syntactic item in the program is used for multiple semantic purposes, conflicts are bound to occur, and one or both sets of semantics will be compromised.

The line numbers in a BASIC program provide a good example. BASIC was the very first interactive programming language. It combined an on-line editor, a file system, and an interpreter to make a language in which simple problems could be programmed


Exhibit 2.20. BASIC: GOTOs and statement ordering both use line numbers. Line numbers in BASIC are used as targets of GOTO and also to define the proper sequence of the statements; the editor accepts lines in any order and arranges them by line number. Thus the user could type the following lines in any order and they would appear as follows:

    2 SUM = SUM + A
    4 PRINT SUM
    6 IF A < 10 GO TO 2
    8 STOP

Noticing that some statements have been left out, the programmer sees that three new lines must be inserted. The shortsighted programmer has only left room to insert one line between each pair, which is inadequate here, so he or she renumbers the old line 2 as 3 to make space for the insertion. The result is:

    1 LET SUM = 0
    2 LET A = 1
    3 SUM = SUM + A
    4 PRINT SUM
    5 LET A = A + 1
    6 IF A < 10 GO TO 2
    8 STOP

Notice that the loop formed by line 6 now returns to the wrong line, making an infinite loop. Languages with separate line numbers and statement labels do not have this problem.

quickly. The inclusion of an editor posed a new problem: how could the programmer modify the program and insert and delete lines? The answer chosen was to have the programmer number every line, and have the editor arrange the lines in order by increasing line number.

BASIC was developed in the context of FORTRAN, which uses numeric line numbers as statement labels. It was, therefore, natural for BASIC to merge the two ideas and use one mechanism, the monotonically increasing line number, to serve purposes (1) and (2) below. When the language was extended to include subroutines, symbolic names for them were not defined either. Rather, the same line numbers were given a third use. Line numbers in BASIC are, therefore, multipurpose:

1. They define the correct order of lines in a program.
2. They are the targets of GOTOs and IFs.
3. They define the entry points of subroutines (the targets of GOSUB).

A conflict happens because inserting code into the program requires that line numbers change, and GOTO requires that they stay constant. Because of this, adding lines to a program can be a


complicated process. Normally, BASIC programmers leave regular gaps in the line numbers to allow for inserting a few lines. However, if the gap in numbering between two successive lines is smaller than the number of lines to be inserted, something will have to be renumbered. But since the targets of GOTOs are not marked in any special way, renumbering implies searching the entire program for GOTOs and GOSUBs that refer to any of the lines whose numbers have been changed. When found, these numbers must be updated [Exhibit 2.20]. Some BASIC systems provide a renumbering utility; others don't. In contrast, lines can be added almost anywhere in a C program with minimal local adjustments.

Principle of Too Much Flexibility

A language feature is bad to the extent that it provides flexibility that is not useful to the programmer, but that is likely to cause syntactic or semantic errors.

For example, any line in a BASIC program can be the target of a GOTO or a GOSUB statement. An explicit label declaration is not needed; the programmer simply refers to the line numbers used to enter and edit the program. A careless or typographical error in a GOTO line number will not be identified as a syntactic error. Every programmer knows which lines are supposed to be the targets of GOTOs, and she or he could easily identify or label them. But BASIC supplies no way to restrict GOTOs to the lines that the programmer knows should be their targets. Thus the translator cannot help the programmer ensure valid use of labels. We would say that the ability to GOTO or GOSUB to any line in the program without writing an explicit label declaration is excessively flexible: it saves the programmer the minor trouble of declaring labels, but it leads to errors. If there were some way to restrict the set of target lines, BASIC would be a better and more powerful language.
Power comes from a translator's ability to identify and eliminate meaningless commands, as well as from a language's ability to express aspects of a model.

Another example of useless flexibility can be seen in the way APL handles GOTO and statement labels. APL provides only three control structures: the function call, sequential execution, and a GOTO statement. A GOTO can only transfer control locally, within the current function definition. All other control structures, including ordinary conditionals and loops, must be defined in terms of the conditional GOTO. As in BASIC, numeric line numbers are used both to determine the order of lines in a program and as targets of the GOTO. But the problems in BASIC with insertions and renumbering are avoided because, unlike BASIC, symbolic labels are supported. A programmer may write a symbolic label on a line and refer to it in a GOTO, and this will have the correct semantics even if lines are inserted and renumbering happens.

During compilation of a function definition (the process that happens when you leave the editor), the lines are renumbered. Each label is bound to a constant integer value: the number of the line on which it is defined. References to the label in the code are replaced by that constant, which from then on has exactly the same semantics as an integer. (Curiously, constants are not otherwise supported by the language.)


Exhibit 2.21. Strange but legal GOTOs in APL. The GOTO is written with a right-pointing arrow (→), and its target may be any expression. The statements below are all legal in APL.

    → (x + 2) × 6   Legal so long as the result is an integer.
    → 3 4 7         An array of line numbers is given; control will be transferred to the first.
    → ⍳0            ⍳N returns a vector of N numbers, ranging from 1 to N. Thus, ⍳0 returns a
                    vector of length 0, which is the null object. A branch to the null object
                    is equivalent to a no-op.

Semantic problems arise because the labels are translated into integer constants and may be operated on using integer operations such as multiplication and division! Further, the APL GOTO is completely unrestricted; it can name either a symbolic label or an integer line number, whether or not that line number is defined in that subroutine. Use of an undefined line number is equivalent to a function return. These semantics have been defined so that some interpretation is given no matter what the result of the expression is [Exhibit 2.21]. Because the target of a GOTO may be computed and may depend on variables, any line of the subroutine might potentially be its target. It is impossible at compile time to eliminate any line from the list of potential targets. Thus, at compile time, the behavior of a piece of code may be totally unpredictable.

APL aficionados love the flexibility of this GOTO. All sorts of permutations and selection may be done on an array of labels to implement every conceivable variety of conditional branch. Dozens of useful idioms, or phrases, such as the one in Exhibit 2.22, have been developed using this GOTO and published for other APL programmers to use. It is actually fun to work on and develop a new control structure idiom. Many language designers, though, question the utility and wisdom of permitting and relying on such idiomatic control structures. They must be deciphered to be understood, and the result of a mistake in definition or use is a totally wrong and unpredictable branch. Even a simple conditional branch

Exhibit 2.22. A computed GOTO idiom in APL.

    → (NEG, EQ, POS) [ 2 + × N ]

This is a three-way branch very similar to the previous example and analogous to the FORTRAN arithmetic IF. The signum function, ×, returns -1 if N is negative, +1 if N is positive, and 0 otherwise. Two is added to the result of signum, and the answer is used to subscript a vector of labels. One of the three branches is always taken.


to the top of a loop can be written with four different idioms, all in common use. This makes it difficult to learn to read someone else's code. Proofs of correctness are practically impossible.

We have shown that APL's totally unrestricted GOTO has the meaningless and useless flexibility to branch to any line of the program, and that the lack of any other control structure necessitates the use of cryptic idioms and produces programs with unpredictable behavior. These are severe semantic defects! By the principle of Too Much Flexibility, this unrestricted GOTO is bad, and APL would be a more powerful language with some form of restriction on the GOTO.

The Principle of Semantic Power

A programming language is powerful (for some application area) to the extent that it permits the programmer to write a program easily that expresses the model, the whole model, and nothing but the model.

Thus a powerful language must support explicit communication of the model, possibly by defining a general object and then specifying restrictions on it. A restriction imposed by the language can support power at the price of flexibility that might be necessary for some applications. On the other hand, a restriction imposed by the user expresses only the semantics that the user wants to achieve and does not limit him or her in ways that obstruct programming.

The programmer should be able to specify a program that computes the correct results and then be able to verify that it does so. All programs should terminate properly, not crash. Faulty results from correct data should be provably impossible. Part of a model is a description of the data that is expected. A powerful language should let the programmer write data specifications in enough detail so that garbage in is detected and does not cause garbage out.

The Principle of Portability

A portable program is one that can be compiled by many different compilers and run on different hardware, and that will work correctly on all of them.
If a program is portable, it will be more useful to more people for more years. We live in times of constant change: we cannot expect to have the same hardware or operating system available in different places or different years. But portability limits flexibility. A portable program, by definition, cannot exploit the special features of some hardware. It cannot rely on any particular bit-level representation of any object or function; therefore, it cannot manipulate such things. One might want to do so to achieve efficiency or to write low-level system programs. Languages such as Standard Pascal that restrict access to pointers and to the bit-representations of objects force the programmer to write portable code but may prohibit him or her from writing efficient code for some applications.

Sometimes features are included in a language for historical reasons, even though the language supports a different and better way to write the same thing. As languages develop, new features are added that improve on old features. However, the old ones are seldom eliminated because upward compatibility is important. We want to be able to recompile old programs on new versions of the


Exhibit 2.23. An archaic language feature in FORTRAN. The arithmetic IF statement was the only conditional statement in the original version of FORTRAN. It is a three-way conditional GOTO based directly on the conditional jump instruction of the IBM 704. An example of this statement is:

    IF (J-1) 21, 76, 76

The expression in parentheses is evaluated first. If the result is negative, control goes to the first label on the list (21). For a zero value, control goes to the second label (76), and for a positive value, to the third (76). More often than not, in practical use, two of the labels are the same.

The arithmetic IF has been superseded by the modern block IF statement. Assume that block 1 contains the statements that followed label 21 above, and block 2 contains the statements following statement 76. Then the following block IF statement is equivalent to the arithmetic IF above:

    IF (J-1 .LT. 0) THEN
        block 1
    ELSE
        block 2
    ENDIF

translator. Old languages such as COBOL and FORTRAN have been through the process of change and restandardization several times. Some features in these languages are completely archaic, and programmers should be taught not to use them [Exhibit 2.23]. Many of these features have elements that are inherently error prone, such as reliance on GOTOs. Moreover, they will eventually be dropped from the language standard. At that point, any programs that use the archaic features will require extensive modernization before they can be modified in any way. Our answer to redundant and archaic language features is simple: don't use them. Find out what constitutes modern style in a language and use it consistently. Clean programming habits and consistent programming style produce error-free programs faster.

Another kind of redundancy is seen in Pascal, which provides two ways to delimit comments:

    (* This is a comment. *)    and    { This is a comment. }

The former way was provided, as part of the standard, for systems that did not support the full ASCII character set. It will work in all Pascal implementations and is thus more portable. The latter way, however, is considered more modern and preferred by many authors. Some programmers use both: one form for comments, the other to comment out blocks of code. The language allows both kinds of comment delimiters to be used in a program. However, mixing the delimiters is a likely source of errors because they are not interchangeable. A comment must begin and end with the same kind of delimiter. Thus whatever conventions a programmer chooses should be used consistently. The programmer must choose either the more portable way or the more modern way, a true dilemma.


2.4 Classifying Languages

It is tempting to classify languages according to the most prominent feature of the language and to believe that these features make each language group fundamentally different from other groups. Such categorizations are always misleading because:

- Languages in different categories are fundamentally more alike than they are different. Believing that surface differences are important gets in the way of communication among groups of language users.
- We tend to associate things that occur together in some early example of a language category and to believe that these things must always come together. This impedes progress in language design.
- Category names are used loosely. Nobody is completely sure what these names mean, or which languages are or are not in any category.
- Languages frequently belong to more than one category. Sorting them into disjoint classes disguises real similarities among languages with different surface syntax.

2.4.1 Language Families

Students do need to understand commonly used terminology, and it is sometimes useful to discuss a group of languages having some common property. With this in mind, let us look at some of the language families that people talk about and try to give brief descriptions of the properties that characterize each family. As you read this section, remember that these are not absolute, mutually exclusive categories: categorizations are approximate, and families overlap heavily. Examples are listed for each group, and some languages are named several times.

Interactive Languages. An interactive language is enmeshed in a system that permits easy alternation between program entry, translation, and execution of code. We say that it operates using a REW cycle: the system Reads an expression, Evaluates it, and Writes the result on the terminal, then waits for another input. Programs in interactive languages are generally structured as a series of fairly short function and object definitions. Translation happens when the end of a definition is typed in. Programs are usually translated into some intermediate form, not into native machine code. This intermediate form is then interpreted. Many interactive languages, such as FORTH and Miranda, use the term "compile" to denote the translation of source code into the internal form. Examples: APL, BASIC, FORTH, LISP, T, Scheme, dBASE, Miranda.

Structured Languages. Control structures are provided that allow one to write programs without using GOTO. Procedures with call-by-value parameters (see Chapter 9) are supported. Note that we call Pascal


CHAPTER 2. REPRESENTATION AND ABSTRACTION

a structured language even though it contains a GOTO, because it is not necessary to use that GOTO to write programs. Examples: Pascal, C, FORTH, LISP, T, Scheme.

Strongly Typed Languages. Objects are named, and each name has a type. Every object belongs to exactly one type (types are disjoint). The types of actual function arguments are compared to the declared types of the dummy parameters during compilation. A mismatch in types or in number of parameters will produce an error comment. Many strongly typed languages, including Pascal, Ada, and ANSI C, include an "escape hatch", that is, some mechanism by which the normal type-checking process can be evaded. Examples: FORTRAN 77, Pascal, Ada, ANSI C (but not the original C), ML, Miranda.

Object-oriented Languages. These are extensions or generalizations of the typed languages. Objects are typed and carry their type identity with them at all times. Any given function may have several definitions, which we will call methods (the term used in Smalltalk). Each method operates on a different type of parameter and is associated with the type of its first parameter. The translator must dispatch each function call by deciding which defining method to invoke for it. The method associated with the type of the first parameter will be used, if it exists.

Object-oriented languages have nondisjoint types and function inheritance. The concept of function inheritance was introduced by Simula and popularized by Smalltalk, the first language to be called object-oriented. A type may be a subset of another type. The function dispatcher will use this subset relationship in the dispatching process. It will select a function belonging to the supertype when none is defined for the subtype. Actually, many of these characteristics also apply to APL, an old language. It has objects that carry type tags and functions with multiple definitions and automatic dispatching. It is not a full object-oriented language because it lacks definable class hierarchies.
Examples: Simula, Smalltalk, T, C++. APL is object-oriented in a restricted sense.

Procedural Languages. A program is an ordered sequence of statements and procedure calls that will be evaluated sequentially. Statements interact and communicate with each other through variables. Storing a value in a variable destroys the value that was previously stored there. (This is called destructive assignment.) Exhibit 2.24 is a diagram of the history of this language family. Modern procedural languages also contain extensive functional elements (see Chapter 8). Examples: Pascal, C, Ada, FORTRAN, BASIC, COBOL.

Functional Languages, Old Style. A program is a nested set of expressions and function calls. Call-by-value parameter binding, not assignment, is the primary mechanism used to give names to variables. Functions interact and communicate with each other through the parameter stack.


Exhibit 2.24. The development of procedural languages. Concepts and areas of concern are listed on the left. Single arrows show how these influenced language design and how some languages influenced others. Dotted double arrows indicate that a designer was strongly influenced by the bad features of an earlier language.
[The diagram is a timeline from 1950 to 1985. Along the left margin are the concepts: mathematical notation, unit record equipment, symbolic names, data specification, structured control, nonalgorithmic specification, interactive use, structured data, object-oriented programming, concurrency, and data abstraction. The languages shown, in rough chronological order, are: symbolic assemblers (mid-1950s), FORTRAN (1956), COBOL (1958), ALGOL-58, MAD (1959), ALGOL-60, APL (1962), CPL (1963), BASIC (1964), RPG (1964), PL/1 (1966), Simula (1967), BCPL (1967), ALGOL-68, B (1970), C (1972), Pascal (1973), Concurrent Pascal (1975), Modula, CLU (1977), Smalltalk, Ada (1982), True BASIC (1980s), and C++ (1985).]


Exhibit 2.25. The development of functional languages.

[The diagram is a timeline from 1959 to 1990. From LISP (McCarthy, 1959) descend MAC LISP (MIT, 1968), INTERLISP (BBN and Xerox, 1974), LOGO (recursion and graphics for children, early 1970s), Scheme (MIT, 1975, with lexical scoping), Franz LISP (1979, under Berkeley UNIX), T (Yale, 1982, object-oriented and lexically scoped), and Common LISP (1985, lexically scoped). A separate line of pure functional languages runs from ISWIM (Landin, 1966) through FP (Backus, 1978) and ML (Milner, 1978) to Miranda (1986) and Haskell (1988).]

Certain characteristics are commonly associated with functional languages. Most are interactive and oriented toward the list data structure. Functions are objects that can be freely manipulated, passed as parameters, composed, and so on. Permitting functions to be manipulated like other objects gives a language tremendous power. Exhibit 2.25 is a diagram of the development of this language family.

LISP and its modern lexically scoped descendants support destructive assignment and sequences of expressions, which are evaluated in order. When these features are used, these languages become procedural, like Pascal. These languages are, therefore, functional in the same sense that Pascal is structured. It is never necessary to use the semantically messy GOTO in Pascal: any semantics that can be expressed with it can be expressed without it. Similarly, it is not necessary to use the semantically messy destructive assignment in LISP, but it is used occasionally, to achieve efficiency, when changing one part of a large data structure. Examples: LISP, Common LISP, T, Scheme.


Functional Languages, New Style. Considerable work is now being done on developing functional languages in which sequences of statements, variables, and destructive assignment do not exist at all. Values are passed from one part of a program to another by function calls and parameter binding. There is one fundamental difference between the old- and new-style functional languages: the LISP-like languages use call-by-value parameters, and these new languages use call-by-need, or lazy evaluation (parameter passing is explained fully in Chapter 9). A parameter is not evaluated until it is needed, and its value is then kept for future use. Call-by-need is an important semantic development, permitting the use of infinite lists, which are objects that are part data and part program, where the program part is evaluated, as needed, to produce the next item on the list.

The terminology used to talk about these new functional programming languages is sometimes different from traditional programming terminology. A program is an unordered series of static definitions of objects, types, and functions. In Miranda it isn't even called a program; it is called a "script". "Executing a program" is replaced by "evaluating an expression" or "reducing an expression to its normal form". In either case, though, computation happens.

Since pure functional programming is somewhat new, it has not reached its full development yet. For example, efficient array handling has yet to be included. As the field progresses, we should find languages that are less oriented to list processing and more appropriate for modeling nonlist applications. Examples: ML, Miranda, Haskell.

Parallel Languages. These contain multitasking primitives that permit a program to fork into two or more asynchronous, communicating tasks that execute some series of computations in parallel. This class of languages is becoming increasingly important as highly parallel hardware develops. Parallel languages are being developed as extensions of other kinds of languages.
One of the intended uses for them is to program highly parallel machines such as the HyperCube. There is a great deal of interest in using such machines for massive numeric applications like weather prediction and image processing. It is not surprising, therefore, that the language developed for the HyperCube resembled a merger of the established number-oriented languages, FORTRAN and APL.

There is also strong interest in parallel languages in the artificial intelligence community, where many researchers are working on neural networks. Using parallelism is natural in such disciplines. In many situations, a programmer wishes to evaluate several possible courses of action and choose the first one to reach a goal. Some of the computations may be very long and others short, and one can't predict which are which. One cannot, therefore, specify an optimal order in which to evaluate the possibilities. The best way to express this is as a parallel computation: "Evaluate all these computations in parallel, and report to me when the first one terminates." List-oriented parallel languages will surely develop for these applications.


Finally, the clean semantics of the assignment-free functional languages are significantly easier to generalize to parallel execution, and new parallel languages will certainly be developed as extensions of functional languages. Examples: Co-Pascal, in a restricted sense; LINDA, OCCAM, FORTRAN-90.

Languages Specialized for Some Application. These languages all contain a complete general-purpose programming language as their basis and, in addition, contain a set of specialized primitives designed to make it convenient to process some particular data structure or problem area. Most contain some sophisticated and powerful higher-level commands that would require great skill and long labor to program in an unspecialized language like Pascal. An example is dBASE III, which contains a full programming language similar to BASIC and, in addition, powerful screen-handling and file-management routines. The former expedites entry and display of information; the latter supports a complex indexed file structure in which key fields can be used to relate records in different files.

Systems programming languages must contain primitives that let the programmer manipulate the bits and bytes of the underlying machine, and should be heavily standardized and widely available so that systems, once implemented, can be easily ported to other machines. Examples: C, FORTH.

Business data processing languages must contain primitives that give fine and easy control over details of input, output, file handling, and precision of numbers. The standard floating-point representations are not adequate to provide this control, and some form of fixed-point numeric representation must be provided. The kind of printer or screen output formatting provided in FORTRAN, C, and Pascal is too clumsy and does not provide enough flexibility; a better syntax and more options must be provided. Similarly, a modern language for business data processing must have a good facility for defining screens for interactive input.
A major proportion of these languages is devoted to I/O. Higher-level commands should be included for common tasks such as table handling and sorting. Finally, the language should provide good support for file handling, including primitives for handling sequential, indexed, and random access files. Examples: RPG (limited to sequential files), COBOL, Ada.

Data base languages contain extensive subsystems for handling internal files and relationships among files. Note that this is quite independent of a good subsystem for screen and printer I/O. Examples: dBASE, Framework, Structured Query Language (SQL).

List processing languages contain primitive definitions for a linked list data type and the important basic operations on lists. This structure has proven to be useful for artificial intelligence


programming. Implementations must contain powerful operations for direct input and output of lists, routines for allocation of dynamic heap storage, and a garbage collection routine for recovery of dynamically allocated storage that is no longer accessible. Examples: LISP, T, Scheme, Miranda.

Logic languages are interactive languages that use symbolic logic and set theory to model computation. Prolog was the first logic language and is still the best known. Its dominant characteristics define the language class. A Prolog program is a series of statements about logical relations that are used to establish a data base, interspersed with statements that query this data base. To evaluate a query, Prolog searches the data base for any entries that satisfy all the constraints in the query. To do this, the translator invokes an elaborate expression evaluator which performs an exhaustive search of the data base, with backtracking. Rules of logical deduction are built into the evaluator. Thus we can classify a logic language as an interactive data base language where both the operations and the data base itself are highly specialized for dealing with the language of symbolic logic and set theory. Prolog is of particular interest in the artificial intelligence community, where deductive reasoning on the basis of a set of known facts is basic to many undertakings. Examples: HASL, FUNLOG, Templog (for temporal logic), Uniform (unifies LISP and Prolog), Fresh (combines the functional language approach with logic programming), etc.

Array processing languages contain primitives for constructing and manipulating arrays and matrices. Sophisticated control structures are built in for mapping simple operations onto arrays, for composing and decomposing arrays, and for operating on whole arrays. Examples: APL, APL-2, VisiCalc, and Lotus.

String processing languages contain primitives for input, output, and processing of character strings.
Operations include searching for and extracting substrings specified by complex patterns involving string functions. Pattern matching is a powerful higher-level operation that may involve exhaustive search by backtracking. The well-known string processing languages are SNOBOL and its modern descendant, ICON.

Typesetting languages were developed because computer typesetting is becoming an economically important task. Technical papers, books, and drawings are, increasingly, prepared for print using a computer language. A document prepared in such a language is an unreadable mixture of commands and ordinary text. The commands handle files, set type fonts, position material, and control indexing, footnotes, and glossaries. Drawings are specified in a language of their own, then integrated with text. The entire finished product is output in a language that a laser printer can handle. This book was prepared using the languages mentioned below, and a drafting package named Easydraw whose output was converted to Postscript.

Examples: Postscript, TeX, LaTeX.

Command languages are little languages frequently created by extending a system's user interface. First simple commands are provided; these are extended by permitting arguments and variations, and more useful commands are added. In many cases these command interfaces develop their own syntax (usually ad hoc and fairly primitive) and truly extensive capabilities. For example, entire books have been written about UNIX shell programming. Every UNIX system includes one or several shells which accept, parse, and interpret commands. From these shells, the user may call system utilities and other small systems such as grep, make, and flex. Each one has its own syntax, switches, semantics, and defaults. Command languages tend to be arcane. In many cases, little design effort goes into them because their creators view them as simple interfaces, not as languages.

Fourth-generation Languages. This curious name was applied to diverse systems developed in the mid-1980s. Their common property was that they all contained some powerful new control structures, statements, or functions by which one could invoke, in a few words, some useful action that would take many lines to program in a language like Pascal. These languages were considered, therefore, to be especially easy to learn and "user friendly", and the natural accompaniments to fourth-generation hardware, or personal computers.

Lotus 1-2-3 and SuperCalc are good examples of fourth-generation languages. They contain a long list of commands that are very useful for creating, editing, printing, and extracting information from a two-dimensional data base called a spreadsheet, and subsystems for creating several kinds of graphs from that data. HyperCard is a data base system in which, it is said, you can write complex applications without writing a line of code. You construct the application with the mouse, not with the keyboard.
The designers of many fourth-generation languages viewed them as replacements for programming languages, not as new programming languages. The result is that their designs did not really profit as much as they could have from thirty years of experience in language design. Like COBOL and FORTRAN, these languages are ad hoc collections of useful operations.

The data base languages such as dBASE are also called fourth-generation languages, and again their designers thought of them as replacements for computer languages. Unfortunately, these languages do not eliminate the need for programming. Even with lots of special report-generating features built in, users often want something a little different from the features provided. This implies a need for a general-purpose language within the fourth-generation system in which users can define their own routines. The general-purpose language included in dBASE is primitive and lacks important control structures. Until the newest version, dBASE IV, procedures did not even have parameters, and when they were finally added, the implementation was unusual and clumsy.

The moral is that there is no free lunch. An adaptable system must contain a general-purpose language to cover applications not supported by predefined features. The whole system will be better if this general-purpose language is carefully designed.


2.4.2 Languages Are More Alike than Different

Viewing languages as belonging to language families tends to make us forget how similar all languages are. This basic similarity happens because the purpose of all languages is to communicate models from human to machine. All languages are influenced by the innate abilities and weaknesses of human beings, and are constrained by the computer's inability to handle irreducible ambiguity. Most of the differences among languages arise from the specialized nature of the objects and tasks to be communicated using a given language.

This book is not about any particular family of languages. It is primarily about the concepts and mechanisms that underlie the design and implementation of all languages, and only secondarily about the features that distinguish one family from another. Most of all, it is not about the myriad variations in syntax used to represent the same semantics in different languages. The reader is asked to try to forget syntax and focus on the underlying elements.

Exercises
1. What are the two ways to view a program?

2. How will languages supporting these views differ?

3. What is a computer representation of an object? A process?

4. Define semantic intent. Define semantic validity. What is their importance?

5. What is the difference between explicit and implicit representation? What are the implications of each?

6. What is the difference between coherent and diffuse representation?

7. What are the advantages of coherent representation?

8. How can language design goals conflict? How can the designer resolve this problem?

9. How can restrictions imposed by the language designer both aid and hamper the programmer?

10. Why is the concept of locality of effect so important in programming language design?

11. What are the dangers involved when using global variables?

12. What is lexical coherence? Give an example of poor lexical coherence.

13. What is portability? Why does it limit flexibility?

14. Why is it difficult to classify languages according to their most salient characteristics?


15. What is a structured language? Strongly typed language? Object-oriented language? Parallel language? Fourth-generation language?

16. Why are most languages more similar than they are different? From what causes do language differences arise?

17. Discuss two aspects of a language design that make it hard to read, write, or use. Give an example of each, drawn from a language with which you are familiar.

18. Choose three languages from the following list: Smalltalk, BASIC, APL, LISP, C, Pascal, Ada. Describe one feature of each that causes some people to defend it as the best language for some application. Choose features that are unusual and do not occur in many languages.

Chapter 3

Elements of Language

Overview
This chapter presents elements of language, drawing correlations between English parts of speech and words in programming languages. Metalanguages allow languages to describe themselves. Basic structural units (words, sentences, paragraphs, and references) are analogous to the lexical tokens, statements, scope, and comments of programming languages.

Languages are made of words with their definitions, rules for combining the words into meaningful larger units, and metawords (words for referring to parts of the language). In this section we examine how this is true both of English and of a variety of programming languages.

3.1 The Parts of Speech

3.1.1 Nouns

In natural languages nouns give us the ability to refer to objects. People invent names for objects so that they may catalog them and communicate information about them. Likewise, names are used for these purposes in programming languages, where they are given to program objects (functions, memory locations, etc.). A variable declaration is a directive to a translator to set aside storage to represent some real-world object, then give a name to that storage so that it may be accessed. Names can also be given to constants, functions, and types in most languages.


First-Class Objects


One of the major trends throughout the thirty-five years of language design has been to strengthen and broaden the concept of "object". In the beginning, programmers dealt directly with machine locations. Symbolic assemblers introduced the idea that these locations represented real-world data and could be named. Originally, each object had a name and corresponded to one storage location. When arrays were introduced in FORTRAN and records in COBOL, these aggregates were viewed as collections of objects, not as objects themselves. Several years and several languages later, arrays and records began to achieve the status of first-class objects that could be manipulated and processed as whole units. Languages from the early seventies, such as Pascal and C, waffled on this point, permitting some whole-object operations on aggregate objects but prohibiting others. Modern languages support aggregate objects and permit them to be constructed, initialized, assigned to, compared, passed as arguments, and returned as results with the same ease as simple objects.

More recently, the functional object, that is, an executable piece of code, has begun to achieve first-class status in some languages, which are known as functional languages. The type object has been the last kind of object to achieve first-class status. A type-object describes the type of other objects and is essential in a language that supports generic code.

Naming Objects

One of the complex aspects of programming languages that we will study in Chapter 6 involves the correspondence of names to objects. There is considerable variation among languages in the ways that names are used. In various languages a name can:

- Exist without being attached, or bound, to an object (LISP).
- Be bound simultaneously to different objects in different scopes (ALGOL, Pascal).
- Be bound to different types of objects at different times (APL, LISP).
- Be bound, through a pointer, to an object that no longer exists (C).
Conversely, in most languages, a single object can be bound to more than one name at a time, producing an alias. This occurs when a formal parameter name is bound to an actual parameter during a function call. Finally, in many languages, the storage allocated for different objects and bound to different names can overlap. Two different list heads may share the same tail section [Exhibit 3.1].

3.1.2 Pronouns: Pointers

Pronouns in natural languages correspond roughly to pointers in programming languages. Both are used to refer to different objects (nouns) at different times, and both must be bound to (defined to refer to) some object before becoming meaningful. The most important use of pointers in


Exhibit 3.1. Two overlapping objects (linked lists).

    List1: The --> Only --> Direction --> From --> Here --\
                                                           >--> Is --> Up.
    List2: Your --> Time ---------------------------------/

Both lists end with the same two cells, "Is" and "Up.": List1 reads "The Only Direction From Here Is Up." and List2 reads "Your Time Is Up."

programming languages is to label objects that are dynamically created. Because the number of these objects is not known to the programmer before execution time, he cannot provide names for them all, and pointers become the only way to reference them. When a pointer is bound to an object, the address of that object is stored in space allocated for the pointer, and the pointer refers indirectly to that object. This leads to the possibility that the pointer might refer to an object that has "died", or ceased to exist. Such a pointer is called a dangling reference. Using a dangling reference is a programming error and must be guarded against in some languages (e.g., C). In other languages (e.g., Pascal) this problem is minimized by imposing severe restrictions on the use of pointers. (Dangling references are covered in Section 6.3.2.)

3.1.3 Adjectives: Data Types

In English, adjectives describe the size, shape, and general character of objects. They correspond, in a programming language, to the many data type attributes that can be associated with an object by a declaration or by a default. In some languages, a single attribute is declared that embodies a set of properties including specifications for size, structure, and encoding [Exhibit 3.2]. In other languages, these properties are independent and are listed separately, either in variable declarations (as in COBOL) or in type declarations, as in Ada [Exhibit 3.3].

Some of the newer languages permit the programmer to define types that are related hierarchically in a tree structure. Each class of objects in the tree has well-defined properties. Each subclass has properties of its own and also inherits all the properties of the classes above it in the hierarchy. Exhibit 3.4 gives an example of such a type hierarchy in English. The root of this hierarchy is the class vertebrate, which is characterized by having a backbone. All subclasses inherit this

Exhibit 3.2. Size and encoding bundled in C. The line below declares a number that will be represented in the computer using floating-point encoding. The actual number of bytes allocated is usually four, and the precision is approximately seven digits. This declaration is the closest parallel in C to the Ada declaration in Exhibit 3.3.

    float price;


Exhibit 3.3. Size and encoding specified separately in Ada. The Ada declarations below create a new type named REAL and a REAL object, price. The use of the keyword digits indicates that this type is to be derived from some predefined type with floating-point encoding. The number seven indicates that the resulting type must have at least seven decimal digits of precision.

    type REAL is digits 7;
    price: REAL;

property. At the next level are birds, which have feathers, and mammals, which have hair. We can, therefore, conclude that robins and chickens are feathered creatures, and that human beings are hairy. Going down the tree, we see that roosters and hens inherit all properties of chickens, including being good to eat. According to the tree, adults and children are both human (although members of each subclass sometimes dispute this). Finally, at the leaf level, both male and female subclasses exist, which inherit the properties of either adults or children.

Inheritance means that any function defined for a superclass also applies to all subclasses. Thus if we know that constitutional rights are guaranteed for human beings, we can conclude that girls have these rights. Using an object-oriented language such as Smalltalk, a programmer can implement types (classes) with exactly this kind of hierarchical inheritance of type properties. (Chapter 17 deals with this topic more fully.)

Exhibit 3.4. A type hierarchy with inheritance in English.

    Vertebrate
        Bird
            Robin
            Chicken
                Rooster
                Hen
        Mammal
            Sheep
                Ewe
                Ram
            Human Being
                Adult
                    Man
                    Woman
                Child
                    Boy
                    Girl


3.1.4 Verbs

In English, verbs are words for actions or states of being. Similarly, in programming languages, we see action words such as RETURN, BREAK, STOP, GOTO, and :=. Procedure calls, function calls, and arithmetic operators all direct that some action should happen, and are like action verbs. Relational operators (=, >, etc.) denote states of being: they ask questions about the state of some program object or objects.

In semistandard terminology, a function is a program object that receives information through a list of arguments, performs a prescribed computation on that information, calculates some answer, and returns that value to the calling program. In most languages function calls can be embedded within the argument lists of other function calls, and within arithmetic expressions. Function calls are usually denoted by writing the function name followed by an appropriate series of arguments enclosed in parentheses. Expressions often contain more than one function call. In this case each language defines (or explicitly leaves undefined) the order in which the calls will be executed (this issue is discussed in Chapter 8).

A procedure is just like a function except that it does not return a value. Because no value results from executing the procedure, the procedure call constitutes an entire program statement and cannot be embedded in an expression or in the argument list of another call. An operator is a predefined function whose name is often a special symbol such as +. Most operators require either one or two arguments, which are called operands. Many languages support infix notation for operators, in which the operator symbol is written between its two operands or before or after its single operand. Rules of precedence and associativity [Chapter 8, Section 8.3.2] govern the way that infix expressions are parsed, and parentheses are used, when necessary, to modify the action of these rules.
We will use the word function as a generic word to refer to functions, operators, and procedures when the distinctions among them are not important. Some languages (e.g., FORTRAN, Pascal, and Ada) provide three different syntactic forms for operators, functions, and procedures [Exhibit 3.5]. Other languages (e.g., LISP and APL) provide only one [Exhibits 3.6 and 3.7]. To a great extent, this makes languages appear to be more different in structure than they are. The first impression of a programmer upon seeing his or her first LISP program is that LISP is full of parentheses, is cryptic, and has little in common with other languages. Actually, various front ends, or preprocessors, have been written for LISP that permit the programmer to write using a syntax that resembles ALGOL or Pascal. This kind of preprocessor changes only cosmetic aspects of the language syntax. It does not add power or supply kinds of statements that do not already exist. The LISP preprocessors do demonstrate that LISP and ALGOL have very similar semantic capabilities.
¹ This issue is discussed in Chapter 8.


CHAPTER 3. ELEMENTS OF LANGUAGE

Exhibit 3.5. Syntax for verbs in Pascal and Ada.
These languages, like most ALGOL-like languages, have three kinds of verbs, with distinct ways of invoking each.

Functions: The function name is written followed by parentheses enclosing the list of arguments. Arguments may themselves be function calls. The call must be embedded in some larger statement, such as an assignment statement or procedure call. This is a call to a function named push with an embedded call on the sin function:
    Success := push(rs, sin(x));

Procedures: A procedure call constitutes an entire program statement. The procedure name is written followed by parentheses enclosing the list of arguments, which may be function calls. This is a call on Pascal's output procedure, with an embedded function call:
    Writeln (Success, sin(x));

Operators: An operator is written between its operands, and several operators may be combined to form an expression. Operator-expressions are legal in any context in which function calls are legal:
    Success := push(rs, (x+y)/(x-y));

Exhibit 3.6. Syntax for verbs in LISP.
LISP has only one class of verb: functions. There are no procedures in LISP, as all functions return a value. In a function call, the function name and the arguments are enclosed in parentheses (first line below). Arithmetic operators are also written as function calls (second line).
    (myfun arg1 arg2 arg3)
    (+ B 1)


Exhibit 3.7. Syntax for verbs in APL.
APL provides a syntax for applying operators but not for function calls or procedure calls. Operators come in three varieties: dyadic (having two arguments), monadic (having one argument), and niladic (having no arguments).

Dyadic operators are written between their operands. Line [1] below shows + being used to add the vector (5 3) to B and add that result to A. (APL expressions are evaluated right-to-left.) Variables A and B might be scalars or length-two vectors. The result is a length-two vector. Monadic operators are written to the left of their operands. Line [2] shows the monadic operator |, or absolute value. Line [3] shows a call on a niladic operator, the read-input operator ⎕ (quad). The value read is stored in A.
    [1]  A + 5 3 + B
    [2]  | A
    [3]  A ← ⎕

The programmer may define new functions but may not use more than two arguments for those functions. Function calls are written using the syntax for operators. Thus a dyadic programmer-defined function named FUN would be called by writing:
    A FUN B
When a function requires more than two arguments, they must be packed or encoded into two bunches, sent to the function, then unpacked or decoded within the function. This is awkward and not very elegant.

The Domain of a Verb

The definition of a verb in English always includes an indication of the domain of the verb, that is, the nouns with which that verb can meaningfully be used. A dictionary provides this information, either implicitly or explicitly, as part of the definition of each verb [Exhibit 3.8]. Similarly, the domain of a programming language verb is normally specified when it is defined. This specification is part of the program in some languages, part of the documentation in others.

The domain of a function is defined in most languages by a function header, which is part of the function definition. A header specifies the number of objects required for the function to operate and the formal names by which those parameters will be known. Languages that implement strong typing also require the types of the parameters to be specified in the header. This information is used to ensure that the function is applied meaningfully, to objects of the correct types [Exhibit 3.9].


Exhibit 3.8. The domain of a verb in English.
Verb: Cry
Definition for the verb cry, paraphrased from the dictionary (Morris [1969], p. 319):
1. To make inarticulate sobbing sounds expressing grief, sorrow, or pain.
2. To weep, or shed tears.
3. To shout, or shout out.
4. To utter a characteristic sound or call (used of an animal).
The domain is defined in definitions (1) through (3) by stating that the object/creature that cries must be able to sob, express feelings, weep, or shout. Definition (4) explicitly states that the domain is an animal. Thus all of the following things can cry: human beings (by definitions 1, 2, 3), geese (4), and baby dolls (2).

The range of a function is the set of objects that may be the result of that function. This must also be specified in the function header (as in Pascal) or by default (as in C) in languages that implement type checking.

3.1.5 Prepositions and Conjunctions

In English we distinguish among the parts of speech used to denote time, position, conditionals, and the relationship of phrases in a sentence. Each programming language contains a small number of such words, used analogously to delimit phrases and denote choices and repetition (WHILE, ELSE, BY, CASE, etc.). The exact words differ from language to language. Grammatical rules state how these words may be combined with phrases and statements to form meaningful units.

Exhibit 3.9. The domains of some Pascal functions.
Predefined functions    Domains
chr                     An integer between 0 and 127.
ord                     Any object of an enumerated type.
trunc                   A real number.
A user-defined function header:
    FUNCTION search (N: name; L: list): list;
The domain of search is pairs of objects, one of type name, the other of type list. The result of search is a list; its range is, therefore, the type list.


By themselves these words have little meaning, and we will deal with them in Chapter 10, where we examine control structures.

3.2 The Metalanguage

A language needs ways to denote its structural units and to refer to its own parts. English has sentences, paragraphs, essays, and the like, each with lexical conventions that identify the unit and mark its beginning and end. Natural languages are also able to refer to these units and to the words that comprise the language, as in phrases such as "the paragraph below" and "USA is an abbreviation for the United States of America". These parts of a language that permit it to talk about itself are called a metalanguage. The metalanguage that accompanies most programming languages consists of an assortment of syntactic delimiters, metawords, and ways to refer to structural units. We consider definitions of the basic structural units to be part of the metalanguage also.

3.2.1 Words: Lexical Tokens

The smallest unit of any written language is the lexical token: the mark or series of marks that denote one symbol or word in the language. To understand a communication, first the tokens must be identified, then each one and their overall arrangement must be interpreted to arrive at the meaning of the communication. Analogously, one must separate the sounds of a spoken sentence into tokens before it can be comprehended. Sometimes it is a nontrivial task to separate the string of written marks or spoken sounds into tokens, as anyone knows who has spent a day in a foreign country.

This same process must be applied to computer programs. A human reader or a compiler must first perform a lexical analysis of the code before beginning to understand the meaning. The portion of the compiler that does this task is called the lexer.

In some languages lexical analysis is trivially simple. This is true in FORTH, which requires every lexical token to be delimited (separated from every other token) by one or more spaces. Assembly languages frequently define fixed columns for operation codes and require operands to be separated by commas. Operating system command shells usually call for the use of spaces and a half dozen punctuation marks which are tokens themselves and also delimit other tokens. Such simple languages are easy to lexically analyze, or lex. Not all programming languages are so simple, though, and we will examine the common lexical conventions and their effects on language.

The lexical rules of most languages define the lexical forms for a variety of token types:
  - Names (predefined and user-defined)
  - Special symbols
  - Numeric literals
  - Single-character literals
  - Multiple-character string literals
These rules are stated as part of the formal definition of every programming language. A lexer for a language is commonly produced by feeding these rules to a program called a lexer generator, whose output is a program (the lexer) that can perform lexical analysis on a source text string according to the given rules. The lexer is the first phase of a compiler. Its role in the compiling process is illustrated in Exhibit 4.3.

Much of the feeling and appearance of a language is a side effect of the rules for forming tokens. The most common rules for delimiting tokens are stated below. They reflect the rules of Pascal, C, and Ada.
  - Special symbols are characters or character strings that are nonalphabetic and nonnumeric. Examples are ;, +, and :=. They are all predefined by the language syntax. No new special symbols may be defined by the programmer.
  - Names must start with an alphabetic character and must not contain anything except letters, digits, and (sometimes) the _ symbol. Everything that starts with a letter is a name. Names end with a space or a special symbol.
  - Special symbols generally alternate with names and literals. Where two special symbols or two names are adjacent, they must be separated by a space.
  - Numeric literals start with a digit, a +, or a -. They may contain digits, a decimal point, and E (for exponent). Any other character ends the literal.
  - Single-character literals and multiple-character strings are enclosed in matching single or double quotes. If, as in C, a single character has different semantics from a string of length 1, then single quotes may be used to delimit one and double quotes used for the other.

Note that spaces are used to delimit some but not all tokens. This permits the programmer to write arithmetic expressions such as a*(b+c)/d the way a mathematician would write them.
If we insisted on a delimiter (such as a space) after every token, the expression would have to be written a * ( b + c ) / d, which most programmers would consider to be onerous and unnatural.

Spaces are required to delimit arithmetic operators in COBOL. The above expression in COBOL would be written a * (b + c) / d. This awkwardness is one of the reasons that programmers are uncomfortable using COBOL for numeric applications. The reason for this requirement is that the - character is ambiguous: COBOL's lexical rules permit - to be used as a hyphen in variable names, for example, hourly-rate-in. Long, descriptive variable names greatly enhance the readability of programs.


Hyphenated variable names have existed in COBOL from the beginning. When COBOL was extended at a later date to permit the use of arithmetic expressions, an ambiguity arose: the hyphen character and the subtraction operator were the same character. One way to avoid this problem is to use different characters for the two purposes. Modern languages use the - for subtraction and the underbar, _, which has no other function in the language, to achieve readability.

As you can see, the rules for delimiting tokens can be complex, and they do have varied repercussions. The three important issues here are:
  - Code should be readable.
  - The language must be translatable and, preferably, easy to lex.
  - It is preferable to use the same conventions as are used in English and/or mathematical notation.

The examples given show that a familiar, readable language may contain an ambiguous use of symbols. A few language designers have chosen to sacrifice familiarity and readability altogether in order to achieve lexical simplicity. LISP, APL, and FORTH all have simpler lexical and syntactic rules, and all are considered unreadable by some programmers because of the conflict between their prior experience and the lexical and syntactic forms of the language. Let us examine the simple lexical rule in FORTH and its effects.

In other languages the decision was made to permit arithmetic expressions to be written without delimiters between the variable names and the operators. A direct consequence is that special symbols (nonalphabetic, nonnumeric, and nonunderbar) must be prohibited in variable names. It may seem natural to prohibit the use of characters like + and ( in a name, but it is not at all necessary. FORTH requires one or more space characters or carriage returns between every pair of tokens, and because of this rule, it can permit special characters to be used in identifiers.
It makes no distinction between user-defined names and predefined tokens: either may contain any character that can be typed and displayed. The string #$% could be used as a variable or function name if the programmer so desired. The token ab* could never be confused with an arithmetic problem because the corresponding arithmetic problem, a b *, contains three tokens separated by spaces. Thus the programmer, having a much larger alphabet to use, is far freer to invent brief, meaningful names. For example, one might use a+ to name a function that increments its argument (a variable) by the value of a.

Lexical analysis is trivially easy in FORTH. Since its lexical rules treat all printing characters the same way and do not distinguish between alphabetic characters and punctuation marks, FORTH needs only three classes of lexical tokens:
  - Names (predefined or user-defined).
  - Numeric literals.
  - String literals. These can appear only after the string output command, which is ." (pronounced dot-quote). A string literal is terminated by the next " (pronounced quote).


These three token types correspond to semantically distinct classes of objects that the interpreter handles in distinct ways. Names are to be looked up in the dictionary and executed. Numeric literals are to be converted to binary and put on the stack. String literals are to be copied to the output stream. The lexical rules of the language thus correspond directly to its semantics, and the interpreter is very short and simple.

The effect of these lexical rules on people should also be noted. Although the rules are simple and easy to learn, a programmer accustomed to the conventions in other languages has a hard time learning to treat the space character as important.

3.2.2 Sentences: Statements

The earliest high-level languages reflected the linguistic idea of sentences: a FORTRAN or COBOL program is a series of sentencelike statements.² COBOL statements even end in periods. Most statements, like sentences, specify an action to perform and some object or objects on which to perform the action. A language is called procedural if a program is a sequence of statements, grouped into procedures, to be carried out using the objects specified.

In the late 1950s when FORTRAN and COBOL were developed, the punched card was the dominant medium for communication from human to computer. Programs, commands to the operating system, and data were all punched on cards. To compile and (one hoped) run a program, the programmer constructed a deck usually consisting of:
  - An ID control card, specifying time limits for the compilation and run.³
  - A control card requesting compilation, an object program listing, an error listing, and a memory map.⁴
  - A series of cards containing the program.
  - A control card requesting loading and linking of the object program.
  - A control card requesting a run and a core dump⁵ of the executable program.
² Caution: In a discussion of formal grammars and parsing, the term sentence is often used to mean the entire program, not just one statement.
³ Historical note: Some of the items on the control cards are hard to understand in today's environment. Limiting the time that a job would be allowed to run (using a job time limit) was important then because computer time was very costly. In 1962, time on the IBM 704 (a machine comparable in power to a Commodore 64) cost $600 per hour at the University of Michigan. For comparison, Porterhouse steak cost about $1 per pound. Short time limits were specified so that infinite loops would be terminated by the system as soon as possible.
⁴ The memory map listed all variable names and their memory addresses. The map, object listing, and core (memory) dump together were indispensable aids to debugging. They permitted the programmer to reconstruct the execution of the program manually.
⁵ Most debugging was done in those days by carefully analyzing the contents of a core (memory) dump. The kind of trial and error debugging that we use today was impractical because turnaround time for a trial run was rarely less than a few hours and sometimes was measured in days. In order to glean as much information as possible from each run, the programmer would analyze the core dump using the memory maps produced by the compiler and linker.


Exhibit 3.10. Field definitions in FORTRAN.
Columns   Use in a FORTRAN program
1         A C or a * here indicates a comment line.
1-5       Statement labels.
6         Statement continuation mark.
7-72      The statement itself.
73-80     Programmer ID and sequence numbers.
End of statement: at the end of the line, unless column 6 on the next card is punched to indicate a continuation of the statement.
Indenting convention: start every statement in column 7. (Indenting is not generally used.)

  - A control card marking the beginning of the data.
  - A series of cards containing the data.
  - A JOB END control card marking the end of the deck.

Control cards had a special character in column 1 by which they could be recognized. Because a deck of cards could easily become disordered by being dropped, columns 73 through 80 were conventionally reserved for identification and sequence numbers. The JOB END card had a different special mark. This made it easy for the operating system to abort remaining segments of a job after a fatal error was discovered during compilation or linking.

Because punched cards were used for programs as well as data, the physical characteristics of the card strongly influenced certain aspects of the early languages. The program statement, which was the natural program unit, became tightly associated with the 80-column card, which was the natural media unit. Many programmers wrote their code on printed coding forms, which looked like graph paper with darker lines marking the fields. This helped keypunch operators type things in the correct columns. The designers of the FORTRAN language felt that most FORTRAN statements would fit on one line and so chose to require that each statement be on a separate card. The occasional statement that was too long to fit could be continued on another card by placing a character in column six of the second card. Exhibit 3.10 lists the fields that were defined for a FORTRAN card.

COBOL was also an early fixed-format language, with similar but different fixed fields. Due to the much longer variable names permitted in COBOL, and the wordier and more complex syntax,
⁵ (continued) The programmer would trace execution of the program step by step and compare the actual contents of each memory location to what was supposed to be there. Needless to say, this was slow, difficult, and beyond the capabilities of many people. Modern advances have made computing much more accessible.


many statements would not fit on one line. A convention that imitated English was introduced: the end of each statement was marked by a period. A group of statements that would be executed sequentially was called a paragraph, and each paragraph was given an alphanumeric label. Within columns 13-72, indenting was commonly used to clarify the meaning of the statements.

Two inventions in the late 1960s combined to make the use of punched cards for programs obsolete. The remote-access terminal and the on-line, disk-based file system made it both unnecessary and impractical to use punched cards. Languages that were designed after this I/O revolution reflect the changes in the equipment used. Fixed fields disappeared, the use of indentation to clarify program structure became universal, and a character such as ; was used to separate statements or terminate each statement.⁶

3.2.3 Larger Program Units: Scope

English prepositions and conjunctions commonly control a single phrase or clause. When a larger scope of influence is needed in English, we indicate that the word pertains to a paragraph. In programming languages, units that correspond to such paragraphs are called scopes and are commonly marked by a pair of matched opening and closing marks. Exhibits 3.11, 3.13, and 3.15 show the tremendous variety of indicators used to mark the beginning and end of a scope.

In FORTRAN the concept of scope was not well abstracted, and scopes were indicated in a variety of ways, depending on the context. As new statement types were added to the language over the years, new ways were introduced to indicate their scopes. FORTRAN uses five distinct ways to delimit the scopes of the DATA statement, DO loop, implied DO loop, logical IF (true action only), and block IF (true and false actions) [Exhibit 3.11]. This nonuniformity of syntax does not occur in the newer languages.

Two different kinds of ways to end scopes are shown in Exhibit 3.11. The labeled statement at the end of a DO scope ends a specific DO. Each DO statement specifies the statement label of the line which terminates its scope. (Two DOs are allowed to name the same label, but that is not relevant here.) We say that DO has a labeled scope. In contrast, all block IF statements are ended by identical ENDIF lines. Thus an ENDIF could end any block IF statement. We say that block IF statements have unlabeled scopes.

The rules of FORTRAN do not permit either DO scopes or block IF scopes to overlap partially. That is, if the beginning of one of these scopes, say B, comes between the beginning and end of another scope, say A, then the end of scope B must come before the end of scope A. Legal and illegal nestings of labeled scopes are shown in Exhibit 3.12.

All languages designed since 1965 embody the abstraction scope.
That is, the language supplies a single way to delimit a paragraph, and that way is used uniformly wherever a scope is needed in the syntax, for example, with THEN, ELSE, WHILE, DO, and so on. For many languages, this is accomplished by having a single pair of symbols for begin-scope and end-scope, which are
⁶ Most languages did not use the . as a statement-mark because periods are used for several other purposes (decimal points and record part selection), and any syntax becomes hard to translate when symbols become heavily ambiguous.


Exhibit 3.11. Scope delimiters in FORTRAN.
The following program contains an example of each linguistic unit that has an associated scope. The line numbers at the left key the statements to the descriptions, in the table that follows, of the scope indicators used.

 1       INTEGER A, B, C(20), I
 2       DATA A, B /31, 42/
 3       READ* A, B, ( C(I), I=1,10)
 4       DO 80 I= 1, 10
 5       IF (C(I) .LT. 0) C(I+10)=0
 6       IF (C(I) .LT. 100) THEN
 7          C(I+10) = 2 * C(I)
 8       ELSE
 9          C(I+10) = C(I)/2
10       ENDIF
11    80 CONTINUE
12       END

Scope of          Begins at              Ends at                   Line #s
Dimension list    ( after array name     The next )                1
DATA values       First /                Second /                  2
Implied DO        ( in I/O list          I/O loop control          3
Subscript list    ( after array name     Matching )                3
DO loop           Line following DO      Statement with DO label   5-11
Logical IF        After ( condition )    End of line               5
Block IF (true)   After THEN             ELSE, ELSEIF, or ENDIF    7
Block IF (false)  After ELSEIF or ELSE   ELSE, ELSEIF, or ENDIF    9

Exhibit 3.12. Labeled scopes.

Correct Nesting:      Faulty Nesting:
Begin Scope A         Begin Scope A
Begin Scope B         Begin Scope B
End Scope B           End Scope A
End Scope A           End Scope B


Exhibit 3.13. Tokens used to delimit program scopes.
Language   Beginning of Scope    End of Scope
C          {                     }
LISP       (                     )
Pascal     BEGIN                 END
           RECORD                END
           CASE                  END
PL/1       DO;                   END;
           DO loop control ;     END;

used to delimit any kind of scope [Exhibit 3.13]. In these languages it is not possible to nest scopes improperly because the compiler will always interpret the nesting in the legal way. A compiler will match each end-scope to the nearest unmatched begin-scope. This design is attractive because it produces a language that is simpler to learn and simpler to translate.

If an end-scope is omitted, the next one will be used to terminate the open scope regardless of the programmer's intent [Exhibit 3.14]. Thus an end-scope that was intended to terminate an IF may instead be used to terminate a loop or a subprogram. A compiler error comment may appear on the next line because the program element written there is in an illegal context, or error comments may not appear until the translator reaches the end of the program and finds that the wrong number of end-scopes was included. If an extra end-scope appears somewhere else, improper nesting might not be detected at all. Using one uniform end-scope indicator has the severe disadvantage that a nesting error may not be identified as a syntactic error, but may become a logical error which is harder to identify and correct. The programmer has one fewer tool for communicating semantics to the compiler, and the compiler has one fewer way to help the programmer achieve semantic validity. Many experienced programmers use comments to indicate which end-scope belongs to each begin-scope. This practice makes programs more readable and therefore easier to debug, but of course does not help the compiler.

A third, intermediate way to handle scope delimiters occurs in Ada. Unlike Pascal, each kind of scope has a distinct end-scope marker. Procedures and blocks and labeled loops have fully labeled end-scopes. Unlike FORTRAN, a uniform syntax was introduced for delimiting and labeling scopes. An end-scope marker is the word END followed by the word and label, if any, associated with the beginning of the scope [Exhibit 3.15].
It is possible, in Ada, for the compiler to detect many (but not all) improperly nested scopes and often to correctly deduce where an end-scope has been omitted. This is important, since a misplaced or forgotten end-scope is one of the most common kinds of compile-time errors. A good technique for avoiding errors with paired delimiters is to type the END marker when the BEGIN is typed, and position the cursor between them. This is the idea behind the structured


Exhibit 3.14. Nested unlabeled scopes in Pascal.

                 i := 0;
                 WHILE a <= 100 DO
Begin Scope         BEGIN
                    IF a mod 7 = 0 THEN
Begin Scope            BEGIN
                       i := i + 1;
                       writeln (i, a)
End Scope              END
End Scope           END

editors. When the programmer types the beginning of a multipart control unit, the editor inserts all the keywords and scope markers necessary to complete that unit meaningfully. This prevents beginners and forgetful experts from creating malformed scopes.

3.2.4 Comments

Footnotes and bibliographic citations in English permit us to convey general information about the text. Analogously, comments, interspersed with program words, let us provide information about a program that is not part of the program. With comments, as with statements, we have the problem of identifying both the beginning and end of the unit.

Older languages (COBOL, FORTRAN) generally restrict comments to separate lines, begun by a specific comment mark in a fixed position on the line [Exhibit 3.16]. This convention was natural when programs were typed on punch cards. At the same time it is a severe restriction because it prohibits the use of brief comments placed out of the way visually. It therefore limits the usefulness of comments to explain obscure items that are embedded in the code.

The newer languages permit comments and code to be interspersed more freely. In these

Exhibit 3.15. Lexical scope delimiters in Ada. Begin-scope markers block_name : declarations BEGIN PROCEDURE proc_name LOOP label :LOOP CASE IF condition THEN or ELSIF condition THEN ELSE End-scope markers END block_name END proc_name ; END LOOP; END LOOP label ; END CASE ELSIF, ELSE, or END IF END IF


Exhibit 3.16. Comment lines in older languages.
In these languages comments must be placed on a separate line, below or above the code to which they apply.
Language       Comment line is marked by
FORTRAN        A C in column 1
COBOL          A * in column 7 at the beginning of a line
original APL   The lamp symbol: ⍝
BASIC          REM at the beginning of a line

languages, statements can be broken onto multiple lines and combined freely with short comments in order to do a superior job of clarifying the intent of the programmer. Both the beginning and end of a comment are marked [Exhibit 3.17]. Comments are permitted to appear anywhere within a program, even in the middle of a statement.

A nearly universal convention is to place the code on the left part of the page and comments on the right. Comments are used to document the semantic intent of variables, parameters, and unusual program actions, and to clarify which end-scope marker is supposed to match each begin-scope marker. Whole-line comments are used to mark and document the beginning of each program module, greatly assisting the programmer's eye in finding his or her way through the pages of code. Some comments span several lines, in which case only the beginning of the first line and end of the last line need begin- and end-comment marks. In spite of this, many programmers mark the beginning and end of every line because it is aesthetically nicer and sets the comment apart from code.

With all the advantages of these partial-line comments, one real disadvantage was introduced by permitting begin-comment and end-comment marks to appear anywhere within the code. It is not unusual for an end-comment mark to be omitted or typed incorrectly [Exhibit 3.18]. In this case all the program statements up to the end of the next comment are taken to be part of the nonterminated comment and are simply swallowed up by the comment.

Exhibit 3.17. Comment beginning and end delimiters.
These languages permit a comment and program code to be placed on the same line. Both the beginning and the end of the comment are marked.
Language   Comments are delimited by
C          /* ... */
PL/1       /* ... */
Pascal     (* ... *)  or  { ... }
FORTH      ( ... )


Exhibit 3.18. An incomplete comment swallowing an instruction.
The following Pascal code appears to be OK at first glance, but because of the mistyped end-comment mark, the computation for tot_age will be omitted. The result will be a list of family members with the wrong average age! A person_list is an array of person cells, each containing a name and an age.

PROCEDURE average_age(p: person_list);
VAR famsize, tot_age, k: integer;
BEGIN
   readln(famsize);            (* Get the number of family members to process. *)
   tot_age := 0;
   FOR k := 1 TO famsize DO
      BEGIN
      writeln( p[k].name );    (* Start with oldest family member. * )
      tot_age := tot_age + p[k].age;    (* Sum ages for average. *)
      END;
   writeln( 'Average age of family = ', tot_age/famsize)
END;

The translator may not ever detect this violation of the programmer's intent. If the next comment is relatively near, and no end-scope markers are swallowed up by the comment, the program may compile with no errors but run strangely. This can be a very difficult error to debug, since the program looks correct but its behavior is inconsistent with its appearance! Eventually the programmer will decide that he or she has clearly written a correct instruction that the compiler seems to have ignored. Since compilers do not just ignore code, this does not make sense. Finally the programmer notices that the end-comment mark that should be at the end of some prior line is missing.

This problem is an example of the cost of over-generality. Limiting comments to separate lines was too restrictive, that is, not general enough. Permitting them to begin and end anywhere on a line, though, is more general than is needed or desired. Even in languages that permit this, comments usually occupy either a full line or the right end of a line. A more desirable implementation of comments would match the comment-scope and comment placement rules with the actual conventions that most programmers use, which are:

    whole-line comments
    partial-line comments placed on the right side of the page
    multiple-line comments

Thus comments should be permitted to occur on the right end of any line, but they might as well be terminated by the end of the line. Permitting multiple-line comments to be written is important,


CHAPTER 3. ELEMENTS OF LANGUAGE

Exhibit 3.19. Comments terminating at end of line. These languages permit comments to occupy entire lines or the right end of any line. Comments start with the comment-begin mark listed and extend to the carriage return on the right end of the line. (TeX is a text processing language for typesetting mathematical text and formulas. It was used to produce this book.)

    Language              Comment-begin mark
    Ada                   --
    LISP                  ;   (This varies among implementations.)
    TeX                   %
    UNIX command shell    #
    C++                   //

but it is not a big burden to mark the beginning of every comment line, as many programmers do anyway to improve the appearance of their programs. The payoff for accepting this small restriction is that the end-of-line mark can be used as a comment-end mark. Since programmers do not forget to put carriage returns in their programs, comments can no longer swallow up entire chunks of code. Some languages that have adopted this convention are listed in Exhibit 3.19. Some languages support two kinds of comment delimiters. This permits the programmer to use the partial-line variety to delimit explanatory comments. The second kind of delimiter (with matched begin-comment and end-comment symbols) is reserved for use during debugging, when the programmer often wants to "comment out", temporarily, large sections of code.

3.2.5   Naming Parts of a Program

In order to refer to the parts of a program, we need meta-words for those parts and for whatever actions are permitted. For example, C permits parts of a program to be stored in separate files and brought into the compiler together by using #include file_name. The file name is a metaword denoting a section of the program, and #include is a metaword for the action of combining it with another section. Most procedural languages provide a GOTO instruction which transfers control to a specific labeled statement somewhere in the program. The statement label, whether symbolic or numeric, is thus a metaword that refers to a part of the program. Since the role of statement labels cannot be fully understood apart from the control structures that use them, labels are discussed with the GOTO command in Section 11.1.

3.2.6   Metawords That Let the Programmer Extend the Language

There are several levels on which a language may be extended. One might extend:

    The list of defined words (nouns, verbs, adjectives).


    The syntax but not the semantics, thus providing alternative ways of writing the same meanings one could write without the extension.

    The actual semantics of the language, with a corresponding extension either of the syntax or of the list of defined words recognized by the compiler.

Languages that permit the third kind of extension are rare because extending the semantics requires changing the translator to handle a new category of objects. Semantic extension is discussed in the next chapter.

Extending the Vocabulary

Every declaration extends the language in the sense that it permits a compiler to understand new words. Normally we are only permitted to declare a few kinds of things: nouns (variables, constants, file names), verbs (functions and procedures), and sometimes adjectives (type names) and metawords (labels). We cannot normally declare new syntactic words or new words such as array. The compiler maintains one combined list or several separate lists of these definitions. This list is usually called the symbol table, but it is actually called the dictionary in FORTH. New symbols added to this list always belong to some previously defined syntactic category with semantics defined by the compiler. Each category of symbol that can be declared must have its own keyword or syntactic marker by which the compiler can recognize that a definition of a new symbol follows. Words such as TYPE, CONST, and PROCEDURE in Pascal and INTEGER and FUNCTION in FORTRAN are metawords that mean, in part, "extend the language by putting the symbols that follow into the symbol table". As compiler technology has developed and languages have become bigger and more sophisticated, more kinds of declarable symbols have been added to languages. The original BASIC permitted no declarations: all two-letter variable names could be used without declaration, and no other symbols, even subroutine names, could be defined.
The newest versions of BASIC permit use of longer variable names, names for subroutines, and symbolic labels. FORTRAN, developed in 1954-1958, permitted declaration of names for variables and functions. FORTRAN 77 also permits declaration of names for constants and COMMON blocks. ALGOL-68 supported type declarations as a separate abstraction, not as part of some data object. Pascal, published in 1971, brought type declarations into widespread use. Modula, a newer language devised by the author of Pascal, permits declaration and naming of semantically separate modules. Ada, one of the newest languages in commercial use, permits declaration of several things missing in Pascal, including the range and precision of real variables, support for concurrent tasks, and program modules called generic packages which contain data and function declarations with type parameters.


Exhibit 3.20. Definition of a simple macro in C. In C, macro definitions start with the word #define, followed by the macro name. The string to the right of the macro name defines the meaning of the name. The #define statements below make the apparent syntax of C more like Pascal. They permit the faithful Pascal programmer to use the familiar scoping words BEGIN and END in a C program. (These words are not normally part of the C language.) During preprocessing, BEGIN will be replaced by { and END will be replaced by }.

    #define BEGIN {
    #define END   }

Syntactic Extension without Semantic Extension

Some languages contain a macro facility (in C, it is part of the preprocessor). 7 This permits the programmer to define short names for frequently used expressions. A macro definition consists of a name and a string of characters that becomes the meaning of the name [Exhibit 3.20]. To use a macro, the programmer writes its name, like a shorthand notation, in the program wherever that string of characters is to be inserted [Exhibit 3.21]. A preprocessor scans the source program, searching for macro names, before the program is
7. The C preprocessor supports various compiler directives as well as a general macro facility.

Exhibit 3.21. Use of a simple macro in C.

Macro Calls. The simple macros defined in Exhibit 3.20 are called in the following code fragment. Unfortunately, the new scope symbols, BEGIN and END, and the old ones, { and }, are now interchangeable. Our programmer can write the following code, defining two well-nested scopes. It would work, but it isn't pretty or clear.

    BEGIN
        x = y+2;
        if (x < 100) {
            x += k;
            y = 0;
        END
        else x = 0;
    }

Macro Expansion. During macro expansion the macro call is replaced by the defining string. The C translator never sees the word BEGIN.

    {
        x = y+2;
        if (x < 100) {
            x += k;
            y = 0;
        }
        else x = 0;
    }


parsed. These macro names are replaced by the defining strings. The expanded program is then parsed and compiled. Thus the preprocessor commands and macro calls form a separate, primitive, language. They are identified, expanded, and eliminated before the parser for the main language even begins its work. The syntax for a macro language, even one with macro parameters, is always simple. However, piggy-backing a macro language on top of a general programming language causes some complications. The source code will be processed by two translators, and their relationship must be made clear. Issues such as the relationship of macro calls to comments or quoted strings must be settled.

In C, preprocessor commands and macro definitions start with a # in column 1. 8 This distinguishes them from source code intended for the compiler. Custom (but not compiler rules) dictates that macro names be typed in uppercase characters and program identifiers in lowercase. Case does not matter to the translator, but this custom helps the programmer read the code. Macro calls are harder to identify than macro definitions, since they may be embedded anywhere in the code, including within a macro definition. Macro names, like program identifiers, are variable-length strings that need to be identified and separated from other symbols. Lexical analysis must, therefore, be done before macro expansion. Since the result of expansion is a source string, lexical analysis must be done again after expansion. Since macro definitions may contain macro calls, the result of macro expansion must be rescanned for more macro calls. Control must thus pass back and forth between the lexer and the macro facility. The lexical rules for the preprocessor language are necessarily the same as the rules for the main language. In the original definition of C, the relationship among the lexer, preprocessor, and parser was not completely defined.
Existing C translators thus do different things with macros, and all are correct by the language definition. Some C translators simply insert the expanded macro text back into the source text without inserting any blanks or delimiters. The effect is that characters outside a macro can become adjacent to characters produced by the macro expansion. The program line containing the expanded macro is then sent back to the lexer. When the lexer processes this, it forms a single symbol from the two character strings. This gluing action can produce strange and unexpected results.

The ANSI standard for C has clarified this situation. It states that no symbol can bridge a macro boundary. Lexical analysis on the original source string is done, and symbols are identified, before macro expansion. The source string that defines the macro can also be lexed before expansion, since characters in it can never be joined with characters outside it. These rules clean up a messy situation. The result of expanding a macro still must be rescanned for more macro calls, but it does not need to be re-lexed. The definition and call of a macro within a macro are illustrated in Exhibits 3.22 and 3.23.

A general macro facility also permits the use of parameters in macro definitions [Exhibit 3.24]. In a call, macro arguments are easily parsed, since they are enclosed in parentheses and follow the macro name [Exhibit 3.25]. To expand a macro, formal parameter names must be identified in the definition of the macro. To do this, the tokens in the macro definition must first be identified. Any
8. Newer C translators permit the # to be anywhere on the line as long as it is the first nonblank character.


Exhibit 3.22. A nest of macros in C. The macros defined here are named PI and PRINTX. PRINTX expands into a call on the library function that does formatted output, printf. The first parameter for printf must be a format string; the other parameters are expressions denoting items to be printed. Within the format, a % field defines the type and field width for each item on the I/O list. The \t prints out a tab character.

    #define PI      3.1415927
    #define PRINTX  printf("Pi times x = %8.5f\t", PI * x)

Exhibit 3.23. Use of the simple PRINTX macro. The macro named PRINTX is used below in a for loop.

    for (x=1; x<=3; x++) PRINTX;

Before compilation begins, the macro name is replaced by the string printf("Pi times x = %8.5f\t", PI * x), giving a string that still contains a macro call:

    for (x=1; x<=3; x++) printf("Pi times x = %8.5f\t", PI * x);

This string is re-scanned, and the call on the macro PI is expanded, producing macro-free source code. The compiler then compiles the statement:

    for (x=1; x<=3; x++) printf("Pi times x = %8.5f\t", 3.1415927 * x);

At run time, this code causes x to be initialized to 1 before the loop is executed. On each iteration of the loop, the value of x is compared to 3. If x does not exceed 3, the words "Pi times x = " are printed, followed by the value of 3.1415927 * x as a floating-point number with five decimal places (%8.5f), followed by a tab character (\t). The counter x is then incremented. The loop is terminated when x exceeds 3. Thus a line with five fields is printed, as follows:

    Pi times x =  3.14159    Pi times x =  6.28319    Pi times x =  9.42477


Exhibit 3.24. A macro with parameters in C. The macro defined here is named PRINT. It is similar to the PRINTX macro in Exhibit 3.22, but it has a parameter.

    #define PRINT(yy) printf(#yy " = %d\t", yy)

The definition for PRINT is written in ANSI C. References to macro parameters that occur within quoted strings are not recognized by the preprocessor. However, the # symbol in a macro definition causes the parameter following it to be converted to a quoted string. Adjacent strings are concatenated by the translator. Using both these facts, we are able to insert a parameter value into a quoted format string.

token that matches a parameter name is replaced by the corresponding argument string. Finally, the entire string of characters, with parameter substitutions, replaces the macro call.

The original definition of C did not clearly define whether tokens were identified before or after macro parameters were processed. This is important because a comment or a quoted string looks like many words but forms a single program token. If a preprocessor searches for parameter names before identifying tokens, quoted strings will be searched and parameter substitution will happen within them. Many C translators work this way; others identify tokens first. The ANSI C standard clarifies this situation. It decrees that tokenization will be done uniformly before parameter substitution.

Macro names are syntactic extensions. They are words that may be written in the program and will be recognized by the compiler. Unlike variable declarations they may stand for arbitrarily complex items, and they may expand into strings that are not even syntactically legal units when used alone. Macros can be used to shorten code with repetitive elements, to redefine the compiler words such as BEGIN, or to give symbolic names to constants. What they do not do is extend the

Exhibit 3.25. Use of the PRINT macro with parameters. The macro named PRINT is used here, with different variables supplied as parameters each time.

    PRINT(x); PRINT(y); PRINT(z);

These macro calls will be expanded and produce the following compilable code:

    printf("x = %d\t", x); printf("y = %d\t", y); printf("z = %d\t", z);

Assume that at run time the variables x, y, and z contain the values 1, 3, and 10, respectively. Then executing this code will cause one line to be printed, as follows:

    x = 1    y = 3    z = 10


semantics of the language. Since all macro calls must be expanded into compilable code, anything written with a macro call could also be written without it. No power is added to the language by a macro facility.

Exercises
1. Why are function calls considered verbs?

2. What is the domain of a verb? Define the domain and range of a function.

3. What is a data type? Inheritance?

4. What is a metalanguage?

5. What is a lexical token? How are lexical tokens formed? Use a language with which you are familiar as an example. What are delimiters?

6. How are programming language statements analogous to sentences?

7. What is the scope of a programming language unit? How is it usually denoted?

8. How is it possible to improperly nest scopes? How can this be avoided by designers of programming languages?

9. What is the purpose of a comment? How are comments traditionally handled within programs? What is the advantage of using a carriage return as a comment delimiter?

10. The language C++ is an extension of C which supports generic functions and type checking. For the most part, C++ is C with additions to implement things that the C++ designers believed are important and missing from C. One of the additions is a second way to denote a comment. In C, a comment can be placed almost anywhere in the code and is delimited at both ends. In this program fragment two comments and an assignment statement are intermingled:

    x=y*z /* Add the product of y and z */ +x;  /* to x. */

C++ supports this form but also a new form which must be placed on the right end of the line and is only delimited at the beginning by //:

    x=y*z + x;  // Add the product of y and z to x.

Briefly explain why the original comment syntax was so inadequate that a new form was needed.

11. How can we extend a language through its vocabulary? Its syntax?

12. What is a macro? How is it used within a program?

Chapter 4

Formal Description of Language

Overview
The syntax of a language is its grammatical rules. These are usually defined through EBNF (Extended Backus-Naur Form) and/or syntax diagrams, both discussed in this chapter. The meaning of a program is represented by p-code (portable code) or by a computation tree. The language syntax defines the computation tree that corresponds to each legal source program.

Semantics are the rules for interpreting the meaning of programming language statements. The semantic specification of a language defines how each computation tree is to be implemented on a machine so that it retains its meaning. Being always concerned with the portability of code, we define the semantics of a language in terms of an implementation-independent model. One such model, the abstract machine, is composed of a program environment, shared environment, stack, and streams. The semantic basis of a language means the specific version of the machine that defines the language, together with the internal data structures and interpretation procedures that implement the abstract semantics. Lambda calculus is an example of a minimal semantic basis.

A language may be extended primarily through its vocabulary and occasionally through its syntax, as in EL/1, or through its semantics, as in FORTH.



4.1   Foundations of Programming Languages

Formal methods have played a critical role in the development of modern programming languages. Formal methods were not available in the mid-1950s when the first higher-level programming languages were being created. The most notable of these efforts was FORTRAN, which survives (in greatly expanded form) to this day. Even though the syntax and semantics of the early FORTRAN were primitive by today's standards, the complexity of the language was at the limit of what could be handled by the methods then available. It was quickly realized that ad hoc methods are severely limited in what they can achieve, and a more systematic approach would be needed to handle languages of greater expressive power and correspondingly greater complexity.

Contemporaneously with the implementation of the FORTRAN language and compiler, a new language, ALGOL, was being defined using a new formal approach for the specification of syntax and semantics. Even though it required several more years of research before people learned how to compile ALGOL efficiently, the language itself had tremendous influence on the design of subsequent programming languages. Concepts such as block structure (cf. Chapter 7) and delayed evaluation of function parameters (cf. Chapter 8), introduced in ALGOL, have reappeared in many subsequent modern programming languages.

ALGOL was the first programming language whose syntax was formally described. A notation called BNF, for Backus-Naur Form, was invented for the purpose. BNF turned out to be equivalent in expressive power to context-free grammars, developed by the linguist Noam Chomsky for describing natural language, but the BNF notation turned out to be easier for people, so variations on it are still used in describing most programming languages. An attempt was made to give a rigorous English-language specification of the semantics of ALGOL.
Nevertheless, the underlying model was not well understood at the time, and ALGOL appeared at first to be difficult or impossible to implement efficiently. Syntax and semantic interpretations were specified informally for early languages. Then, motivated by the new need to describe programming languages, formal language theory flourished. Some of the major developments in the foundations of computer science are shown in Exhibit 4.1. Formal syntax and parsing methods grew from work on automata theory and linguistics [Exhibit 4.1]. Formal methods of semantic specification [Exhibit 4.2] grew from early work on logic and computability and were especially influenced by Church's work on the lambda calculus. In this chapter, we give a brief introduction to some of the formal tools that have been important to the development of modern-day programming languages.

4.2   Syntax

The rules for constructing a well-formed sentence (statement) out of words, a paragraph (module) out of sentences, and an essay (program) out of paragraphs are the syntax of the language. The syntax definitions for most programming languages take several pages of text. A few are very short, a few very long. There is at least one language (ALGOL-68) in which the syntax rules that


Exhibit 4.1. Foundations of computer science. (The original is a timeline chart, summarized here as a list.)

    Set theory: Giuseppe Peano (1895)
    Symbolic logic: Alfred North Whitehead, Bertrand Russell (1910); automated mathematics: Post
    Incompleteness theorem: Goedel (1931); Post systems
    Recursive function theory: Church, Rosser (1930s)
    Computability theory: Turing (1936)
    Formal language theory: Chomsky; information theory: Shannon; electronics; switching theory
    Formal syntactic definition: Backus and Naur (1960)
    Knuth: parsing methods, compiler theory; compiler compilers; EL/1: extensible syntax
    Automata theory: Rabin, Scott; complexity theory: Hartmanis, Blum
    Computational cryptography: Diffie, Hellman (1976); public key system: Rivest, Shamir, Adleman (1978)
    Randomized algorithms


Exhibit 4.2. Formal semantic specification. (The original is a timeline chart, summarized here as a list.)

    Post systems (1930s)
    Recursive function theory: Church, Rosser (1930s)
    Computability theory: Turing (1936)
    Lambda calculus: Church (1941)
    Program correctness and verification (1960s)
    Referential transparency: Strachey
    Formal semantic definition: SECD machine, Landin (1964)
    Vienna definition of PL/1 (1967)
    Denotational semantics: Scott, Strachey (1971)
    Milner: type theory (1978)
    Concurrency: Dijkstra (1968)
    Hoare: CSP (1978)
    Distributed computing: Lamport (1978)
    Collaborative computing (1988)
    Functional languages: ML, Miranda, Haskell

determine whether or not a statement should compile are so complicated that only an expert can understand them. It is usual to define the syntax of a programming language in a formal language. A variety of formalisms have been introduced over the years for this purpose. We present two of the most common here: Extended Backus-Naur Form (EBNF) and syntax diagrams. An EBNF language definition can be translated by a program called a parser generator 1 into a program called a parser [Exhibit 4.3]. 2 A parser reads the user's source code programs and determines the syntactic category (part of speech) of every source symbol and combination of
1. The old term was "compiler compiler". This led to the name of the UNIX parser generator, yacc, which stands for "yet another compiler compiler".
2. A parser generator can only handle grammars for context-free languages. Defining this language class is beyond the scope of this book. Note, though, that the grammars published for most programming languages are context free.


Exhibit 4.3. The compiler is produced from the language definition. In the following diagram, programs are represented by rectangles and data by circles. The lexer and parser can be automatically generated from the lexical specifications and syntax of a context-free language by a parser generator and its companion lexer generator. This is represented by the vertical arrows in the diagram. The lexer and parser are the output data of these generation steps. A code generator requires more hand work: the compiler writer must construct an assembly code translation, for every syntax rule in the grammar, which encodes the semantics of that rule in the target machine language. The lexer, parser, and code generator are programs that together comprise the compiler. The compilation process is represented by the horizontal chain in the diagram.
    [Diagram summary]
    Generation (vertical arrows):
        Lexical rules for Pascal -> Lexer Generator -> Lexer for Pascal
        EBNF rules for Pascal syntax -> Parser Generator (also known as a compiler compiler) -> Parser for Pascal
        Hand-coded semantic interpretation for each rule -> Pascal Code Generator
    Compilation (horizontal chain):
        Source code for a Pascal program -> Lexer for Pascal -> Parser for Pascal ->
        Parsed, intermediate form of program -> Pascal Code Generator -> Object code for program

symbols. Its output is the list of the symbols defined in the program and a parse tree, which specifies the role that each source symbol is serving, much like a sentence diagram of an English sentence. The parser forms the heart of any compiler or interpreter for the language.

The study of formal language theory and parsing has strongly affected language design. Older languages were not devised with modern parsing methods in mind. Their syntax was usually developed ad hoc. Consequently, a syntax definition for such a language, for example FORTRAN, is lengthy and full of special cases. By today's standards these languages are also relatively slow and difficult to parse. Newer languages are designed to be parsed easily by efficient algorithms. The syntax for Pascal is brief and elegant. Pascal compilers are small, as compilers go, and can be implemented on personal computers. The standard LISP translator 3 is only fifteen pages long!
3. Griss and Hearn [1981].


4.2.1   Extended BNF

Backus-Naur Form, or BNF, is a formal language developed by Backus and Naur for describing programming language syntax. It gained widespread influence when it was used to define ALGOL in the early 1960s. The original BNF formalism has since been extended and streamlined; a generally accepted version, named Extended BNF, is presented here.

An EBNF grammar consists of:

    A starting symbol.
    A set of terminal symbols, which are the keywords and syntactic markers of the language being defined.
    A set of nonterminal symbols, which correspond to the syntactic categories and kinds of statements of the language.
    A series of rules, called productions, that specify how each nonterminal symbol may be expanded into a phrase containing terminals and nonterminals. Every nonterminal has one production rule, which may contain alternatives.

The Syntax of EBNF

The syntax for EBNF itself is not altogether standardized; several minor variations exist. We define a commonly used version here.

The starting symbol must be defined. One nonterminal is designated as the starting symbol.

Terminal symbols will be written in boldface and enclosed in single quotes. Nonterminal symbols will be written in regular type and enclosed in angle brackets.

Production rules. The nonterminal being defined is written at the left, followed by a ::= sign (which we will pronounce as "goes to"). After this is the string, with options, which defines the nonterminal. The definition extends up to but does not include the "." that marks the end of the production. When a nonterminal is expanded it is replaced by this defining phrase. Blank spaces between the ::= and the "." are ignored.

Alternatives are separated by vertical bars. Parentheses may be used to indicate grouping. For example, the rule

    s ::= ( a | bc ) d .

indicates that an s may be replaced by an ad or a bcd.

An optional syntactic element is a something-or-nothing alternative; it may be included or not included as needs demand.
This is indicated by enclosing the optional element in square brackets, as follows:

    s ::= [a] d .

This formula indicates that an s may be replaced by an ad or simply by a d.


An unspecified number of repetitions (zero or more) of a syntactic unit is indicated by enclosing the unit in curly brackets. For example, the rule

    s ::= {a} d .

indicates that an s may be replaced by a d, an ad, an aad, or a string of any number of a's followed by a single d. A frequently occurring pattern is the following:

    s ::= t {t} .

This means that s may be replaced by one or more copies of t.

Recursive rules. Recursive production rules are permitted. For example, this rule is directly recursive because its right side contains a reference to itself:

    s ::= asz | w .

This expands into a single w, surrounded on the left and right by any number of matched pairs of a and z: awz, aawzz, aaawzzz, etc.

Tail recursion is a special kind of recursion in which the recursive reference is the last symbol in the string. Tail recursion has the same effect as a loop. This production is tail recursive:

    s ::= as | b .

This expands into a string of any number of a's followed by a b.

Mutually recursive rules are also permitted. For example, this pair of rules is mutually recursive because each rule refers to the other:

    s ::= at | b .
    t ::= bs | a .

A single s could expand into any of the following: b, aa, abb, abaa, ababb, ababaa, etc.

Combinations of alternatives, optional elements, recursions, and repetitions often occur in a production, as follows:

    s ::= {a | b} [c] d .

This rule indicates that an s may be replaced by any of the following: d, ad, bd, cd, acd, bcd, aad, abd, aacd, abcd, bad, bbd, bacd, bbcd, and many more.

Using EBNF

To illustrate the EBNF rules, we give part of the syntax for Pascal, taken from the ISO standard [Exhibit 4.4]. The first few rules of the grammar are given, followed by several rules from the middle of the grammar which define what a statement is. The complete set of EBNF grammar rules cannot be given here because it is too long. 4 Following are brief explanations of the meaning
4. It occupies nearly six pages in Cooper [1983].


Exhibit 4.4. EBNF production rules for parts of Pascal.

    program ::= program-heading ; program-block .
    program-heading ::= program identifier [ ( program-parameters ) ] .
    program-parameters ::= identifier-list .
    identifier-list ::= identifier { , identifier } .
    program-block ::= block .
    block ::= label-declaration-part constant-declaration-part
              type-declaration-part variable-declaration-part
              procedure-and-function-declaration-part statement-part .
    variable-declaration-part ::= [ var { identifier-list : typename ; } ] .
    statement-part ::= compound-statement .
    compound-statement ::= begin statement-sequence end .
    statement-sequence ::= statement { ; statement } .
    statement ::= [ label : ] ( simple-statement | structured-statement ) .
    simple-statement ::= empty-statement | assignment-statement |
                         procedure-statement | goto-statement .
    structured-statement ::= compound-statement | conditional-statement |
                             repetitive-statement | with-statement .
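Productions like compound-statement and statement-sequence translate almost mechanically into a recursive-descent recognizer, with one function per nonterminal. The sketch below is a simplified illustration, not the ISO grammar: function names are invented, and statement is reduced to a single token.

```python
# Hypothetical recursive-descent recognizer for two rules of Exhibit 4.4:
#   compound-statement ::= begin statement-sequence end .
#   statement-sequence ::= statement { ; statement } .
# Each function takes a token list and a position, and returns the
# position just past the phrase it recognized.

def parse_compound(tokens, i=0):
    """compound-statement ::= begin statement-sequence end ."""
    if i >= len(tokens) or tokens[i] != "begin":
        raise SyntaxError("expected 'begin'")
    i = parse_sequence(tokens, i + 1)
    if i >= len(tokens) or tokens[i] != "end":
        raise SyntaxError("expected 'end'")
    return i + 1

def parse_sequence(tokens, i):
    """statement-sequence ::= statement { ; statement } ."""
    i = parse_statement(tokens, i)
    while i < len(tokens) and tokens[i] == ";":
        i = parse_statement(tokens, i + 1)
    return i

def parse_statement(tokens, i):
    """Drastic simplification: a statement is one non-punctuation token."""
    if i >= len(tokens) or tokens[i] in ("begin", "end", ";"):
        raise SyntaxError("expected a statement")
    return i + 1
```

The { } repetition in the grammar becomes the while loop in parse_sequence, which is the same loop-for-recursion trade described for tail-recursive rules above.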

of these rules. The production for the starting symbol states that a program consists of a heading, a semicolon, a block and a period. The semicolon and period are terminal symbols and will form part of the finished program. The symbols program-heading and program-block are nonterminals and need further expansion.

The program-heading starts with the terminal symbol program, which is followed by the name of the program and an optional, parenthesized list of parameters, used for file names. The program parameters, if they are used, are just a list of identifiers, that is, a series of one or more identifiers separated by commas. The program block consists of a series of declarations followed by a single compound statement.

The production for compound-statement forms an indirectly recursive cycle with the rules for statement-sequence and statement. That is, a statement can be a structured statement,


which can be a compound statement, which contains a statement-sequence, which contains a statement, completing the cycle. The rule for statement contains an optional label field and the choice between simple-statement and structured-statement. The rules for simple-statement and structured-statement define all of Pascal's control structures.

Generating a Program. To generate a program (or part of a program) using a grammar, one starts with the specified starting symbol and expands it according to its production rule. The starting symbol is replaced by the string of symbols from the right side of its production rule. If the rule contains alternatives, one may use whichever option seems appropriate. The resulting expansion will contain other nonterminal symbols which then must be expanded also. When all the nonterminals have been expanded, the result is a grammatically correct program.

We illustrate this derivation process by using the EBNF grammar for ISO Standard Pascal to generate a ridiculously simple program named little. Parts, but not all, of this grammar are given in Exhibit 4.4.⁵ The starting symbol is program. Wherever possible, more than one nonterminal symbol is reduced on each line, in order to shorten the derivation.

    program
    program-heading ; program-block .
    program identifier ; block .
    program little ; label-declaration-part constant-declaration-part variable-declaration-part procedure-and-function-declaration-part statement-part .
    program little ; var variable-declaration ; compound-statement .
    program little ; var identifier-list : type-denoter ; begin statement-sequence end .

    program little ; var identifier : type-denoter ; begin statement ; statement end .
    program little ; var x : integer ; begin simple-statement ; simple-statement end .
    program little ; var x : integer ; begin assignment-statement ; procedure-statement end .
    program little ; var x : integer ; begin variable-access := expression ; procedure-identifier ( writeln-parameter-list ) end .
⁵ The complete grammar can be found in Cooper [1983], pp. 153–58.


    program little ; var x : integer ; begin entire-variable := simple-expression ; writeln ( write-parameter ) end .
    program little ; var x : integer ; begin variable-identifier := term ; writeln ( expression ) end .
    program little ; var x : integer ; begin identifier := factor ; writeln ( simple-expression ) end .
    program little ; var x : integer ; begin x := unsigned-constant ; writeln ( term ) end .
    program little ; var x : integer ; begin x := unsigned-number ; writeln ( factor ) end .
    program little ; var x : integer ; begin x := unsigned-integer ; writeln ( variable-access ) end .
    program little ; var x : integer ; begin x := 17 ; writeln ( entire-variable ) end .
    program little ; var x : integer ; begin x := 17 ; writeln ( variable-identifier ) end .
    program little ; var x : integer ; begin x := 17 ; writeln ( identifier ) end .
    program little ; var x : integer ; begin x := 17 ; writeln ( x ) end .
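This generation process can be mechanized. The following Python sketch expands a start symbol top-down, choosing among alternatives at random; the toy grammar and all names are invented here (it is not the ISO grammar), and a depth cap keeps recursive rules from expanding forever.

```python
import random

# A toy grammar in the spirit of Exhibit 4.4: each nonterminal maps to
# a list of alternatives, and each alternative is a list of symbols.
# A symbol with no rule of its own is a terminal.
GRAMMAR = {
    "compound":           [["begin", "statement-sequence", "end"]],
    "statement-sequence": [["statement"],
                           ["statement", ";", "statement-sequence"]],
    "statement":          [["assignment"], ["compound"]],
    "assignment":         [["identifier", ":=", "number"]],
    "identifier":         [["x"], ["y"]],
    "number":             [["17"], ["1"]],
}

def generate(symbol, rng, depth=10):
    """Expand a symbol top-down into a list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]                       # terminal: emit it
    alternatives = GRAMMAR[symbol]
    # When the depth budget runs out, always take the first alternative,
    # which is arranged to be the least recursive, so expansion halts.
    alternative = alternatives[0] if depth <= 0 else rng.choice(alternatives)
    tokens = []
    for sym in alternative:
        tokens.extend(generate(sym, rng, depth - 1))
    return tokens
```

Every expansion of compound begins with begin, ends with end, and contains at least one := because every derivation of statement eventually bottoms out at an assignment.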

Parsing a Program. The process of syntactic analysis is the inverse of this generation process. Syntactic analysis starts with source code. The parsing routines of a compiler determine how the source code corresponds to the grammar. The output from the parse is a tree representation of the grammatical structure of the code called a parse tree. There are several methods of syntactic analysis, which are usually studied in a compiler course and are beyond the scope of this book. The two broad categories of parsing algorithms are called bottom-up and top-down. In top-down parsing, the parser starts with the grammar's starting symbol and tries, at each step, to generate the next part of the source code string.

A brief description of a bottom-up method should serve to illustrate the parsing process. In a bottom-up parse, the parser searches the source code for a string which occurs as one alternative on the right side of some production rule. Ambiguity is resolved by looking ahead k input symbols. The matching string is replaced by the nonterminal on the left of that rule. By repeating this process, the program is eventually reduced, phrase by phrase, back to the starting symbol. Exhibit 4.5 illustrates the steps in forming a parse tree for the body of the program named little.

All syntactically correct programs can be reduced in this manner. If a compiler cannot do the reduction successfully, there is some error in the source code and the compiler produces an error


comment containing some guess about what kind of syntactic error was made. These guesses are usually close to being correct when the error is discovered near where it was made. Their usefulness decreases rapidly as the compiler works on and on through the source code without discovering the error, as often happens.
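The reduction idea can be shown with a deliberately naive sketch: repeatedly find a phrase that matches the right side of some rule and replace it with the rule's left side. This is not a real LR parser (no lookahead, no parse tables), and the tiny rule set is invented for the example.

```python
# Each rule is (left-side, right-side). Rule order matters in this
# brute-force version; real parsers use lookahead to choose correctly.
RULES = [
    ("variable",   ["x"]),
    ("factor",     ["17"]),
    ("expression", ["factor"]),
    ("assignment", ["variable", ":=", "expression"]),
    ("statement",  ["assignment"]),
    ("program",    ["begin", "statement", "end"]),
]

def reduce_once(tokens):
    """Replace the first matching right side with its left side."""
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == rhs:
                return tokens[:i] + [lhs] + tokens[i + n:]
    return None

def parse(tokens):
    """Reduce until only the start symbol remains, or no rule applies."""
    while tokens != ["program"]:
        reduced = reduce_once(tokens)
        if reduced is None:
            return tokens          # stuck: the input has a syntax error
        tokens = reduced
    return tokens
```

A run on the body of little reduces begin x := 17 end step by step, mirroring the column-by-column reductions of Exhibit 4.5; an ungrammatical input simply gets stuck partway up.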

4.2.2 Syntax Diagrams

Syntax diagrams were developed by Niklaus Wirth to define the syntax of Pascal. They are also called railroad diagrams because of their curving, branching shapes. This is the form in which Pascal syntax is usually presented in textbooks.

Syntax diagrams and EBNF can express exactly the same class of languages, but they are used for different purposes. Syntax diagrams provide a graphic, two-dimensional way to communicate a grammar, so they are used to make grammatical relationships easier for human beings to grasp. EBNF is used to write a grammar that will be the input to a parser generator. Corresponding to each production is code for the semantic action that the compiler should take when that production is parsed. The rules of an EBNF syntax are often more broken up than seems necessary, in order to provide hooks for all the semantic actions that a compiler must perform. When a grammar for the same language is presented as syntax diagrams, several EBNF productions are often condensed into one diagram, making the entire grammar shorter, less roundabout, and easier to comprehend.

A Wirth syntax diagram definition has the same elements as an EBNF grammar, as follows:

• A starting symbol.
• Terminal symbols, written in boldface but without quotes, sometimes also enclosed in round or oval boxes.
• Nonterminal symbols, written in regular type.
• Production rules, written using arrows (as in a flow chart) to indicate alternatives, options, and indefinite repetition. Each rule starts with a nonterminal symbol written at the left and ends where the arrow ends on the right. Nonterminal symbols are like subroutine calls. To expand one, you go to the correct diagram, follow the arrows through the diagram until it ends, and return to the calling point to finish the calling production. Branch points correspond to alternatives and indicate that any appropriate choice can be made. Repetition is encoded by backward-pointing arrows which form explicit loops. Direct and indirect recursion are both allowed.

Syntax diagrams are given in Exhibits 4.6 and 4.7, which correspond exactly to the EBNF grammar fragments in Exhibit 4.4. In spite of the simplicity and visual appeal of syntax diagrams, though, the official definition of Pascal grammar is written in EBNF, not syntax diagrams. EBNF is a better input language for a parser generator and provides a clearer basis for a formal definition of the semantics of the language.


Exhibit 4.5. Parsing a simple Pascal program.
We perform a bottom-up parse of part of the program named little, using standard Pascal syntax, part of which is shown in Exhibit 4.4. Starting with the expression at the top, we identify a single token or a consecutive series of tokens that correspond to the right side of a syntactic rule. This series is then reduced, or replaced by the left side of that rule. The final reduction is shown at the bottom of the diagram.

[Diagram: a parse tree built bottom-up over the tokens "begin x := 17 ; writeln ( x ) end". The x on the left of := reduces through identifier, variable-identifier, entire-variable, and variable-access; the 17 reduces through unsigned-integer, unsigned-number, unsigned-constant, factor, term, simple-expression, and expression; together they form an assignment-statement. The writeln ( x ) phrase reduces to procedure-identifier applied to a writeln-parameter-list, forming a procedure-statement. Each of these becomes a simple-statement and then a statement; the two statements and the semicolon form a statement-sequence, and the enclosing begin...end completes a compound-statement.]


Exhibit 4.6. Syntax diagram for program.
This diagram corresponds to the EBNF productions for program, program-heading, program-parameters, and identifier-list. The starting symbol is program.

[Diagram: a railroad diagram reading program, then identifier, then an optional parenthesized list of identifiers separated by commas (with a backward arrow forming the repetition loop), then ;, then program-block, then a final period.]

Exhibit 4.7. Syntax diagrams for statement.
These diagrams correspond to the EBNF productions for statement, simple-statement, structured-statement, compound-statement, and statement-sequence.

[Diagram: the statement diagram begins with an optional "label :" prefix, then branches to any one of assignment-statement, procedure-call-statement, goto-statement, compound-statement, if-statement, case-statement, with-statement, while-statement, repeat-statement, or for-statement. The compound-statement diagram reads begin, then statement, with a backward arrow through ; allowing the statement to repeat, then end.]


Exhibit 4.8. Translation: from source to machine code.
The object code was generated by OSS Pascal for the Atari ST.

Source code:

    begin
        x := 17;
        y := x+1
    end;

P-code tree: [Diagram: a ";" node with two ":=" children, one assigning 17 to x, the other assigning x + 1 to y.]

Object code:

    moveq  #17,d0
    move   d0,x
    addq   #1,d0
    move   d0,y

4.3 Semantics

4.3.1 The Meaning of a Program

A modern language translator converts a program from its source form into a tree representation. This tree representation is sometimes called p-code, a shortening of portable code, because it is completely independent of hardware. This tree represents the structure of the program. The formal syntax of the language defines the kinds of nodes in the tree and how they may be combined. In this tree, the nodes represent objects and computations, and the structure of the tree represents the (partial) order in which the computations must be done. If any part of this tree is undefined or missing, the tree may have no meaning. The formal semantics defines the meaning of this tree and, therefore, the meaning of the program. A language implementor must determine how to convert this tree to machine code for a specific machine so that his or her translation will have the same meaning as that defined by the formal semantics. This two-step approach is used because the conversion from source text to tree form can be the same for all implementations of a language. Only the second step, code generation, is hardware-dependent [Exhibit 4.8].
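The two-step scheme can be caricatured in a few lines. In this Python sketch the node classes and the toy one-register target are invented (they only resemble the 68000-style code of Exhibit 4.8); note that this naive back end reloads x into d0, where the real OSS compiler was clever enough to keep it there.

```python
from dataclasses import dataclass

# Hardware-independent tree nodes: the front end's output.
@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Add:
    left: object
    right: object          # simplification: assumed to be a Num

@dataclass
class Assign:
    target: str
    expr: object

def gen(node, out):
    """Hardware-dependent step: emit toy one-register code into out."""
    if isinstance(node, Num):
        out.append(f"moveq #{node.value},d0")
    elif isinstance(node, Var):
        out.append(f"move {node.name},d0")
    elif isinstance(node, Add):
        gen(node.left, out)                         # left operand into d0
        out.append(f"addq #{node.right.value},d0")  # add the constant
    elif isinstance(node, Assign):
        gen(node.expr, out)
        out.append(f"move d0,{node.target}")

# The tree for:  x := 17;  y := x + 1
tree = [Assign("x", Num(17)), Assign("y", Add(Var("x"), Num(1)))]
code = []
for stmt in tree:
    gen(stmt, code)
# code → ["moveq #17,d0", "move d0,x", "move x,d0", "addq #1,d0", "move d0,y"]
```

Only gen knows anything about the target machine; the tree itself could be handed to a back end for any other processor unchanged, which is the whole point of the p-code intermediate form.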

4.3.2 Definition of Language Semantics

The rules for interpreting the meaning of statements in a language are the semantics of the language. In order for a language to be meaningful and useful, the language designers, compiler writers, and programmers must share a common understanding of those semantics. If no single semantic standard exists, or no common understanding of the standard exists, various compiler writers will implement the language differently, and a programmer's knowledge of the language will not be transferable from one implementation to another. This is indeed the situation with both BASIC


and LISP; many incompatible versions exist.

Knowing the full syntax of a programming language is enough to permit an experienced person to make a guess about the semantics, but such a guess is at best rough, and it is likely to be wrong in many details and in some major ways. This is because highly similar syntactic forms in similar languages often have different semantics. The syntax of a programming language needs only to describe all strings of symbols that comprise legal programs. To define the semantics, one must either define the results of some real or abstract computer executing the program, or write a complete set of mathematical formulas that axiomatize the operation of the program and the expected results. Either way, the definition must be complete, precise, correct, and nonambiguous. Neither kind of definition is easy to make.

The semantics of a language must thus define a highly varied set of things, including but not limited to:

• What is the correct interpretation of every statement type?
• What do you mean when you write a name?
• What happens during a function call?
• In what order are computations done?
• Are there syntactically legal expressions that are not meaningful?
• In what ways does a compiler writer have freedom?
• To what extent must all compilers produce code that computes the same answers?

In general, answering such questions takes many more pages than defining the syntax of a language. For example, syntax diagrams for Pascal can be printed in eight pages, three of which also contain extensive semantic information.⁶ In contrast, a complete semantic description of Pascal, at a level that can be understood by a well-educated person, takes 142 pages.⁷ Part of the reason for this difference is the dissimilarity between the meta-languages in which syntax and semantics are defined. The semantics of natural languages are communicated to learners by a combination of examples and attempts to describe the meaning.
The examples are required because an English description of semantics will lack precision and be as ambiguous as English. Similarly, English alone is not adequate to define the semantics of a programming language because it is too vague and too ambiguous to define highly complex things in such a way that no doubt remains about their meaning. Just as it is possible to create a formal system such as EBNF to define language syntax, it is possible to create a formal system to define programming language semantics.⁸ There is a major
⁶ Dale and Lilly [1985], pages A1–A8.
⁷ Cooper [1983].
⁸ Historical note: The Vienna Definition of PL/1 defined a new language for expressing semantics and defined the semantics of PL/1 in it. ALGOL-68 also had its own, impenetrable, formal language that tried to eliminate most of the need for a semantic definition by including semantics in the syntax. The result was a book-length syntax.


difference, though. The languages used to express syntax are relatively easy to learn and can be mastered by any student with a little effort. The languages used to express semantics are very difficult to read and extremely difficult to write.

The primary use for a formal semantic definition is to establish a single, unambiguous standard for the semantics of the language, to which all other semantic descriptions must conform. It defines all details of the meaning of the language being described and provides a precise answer to any question about details of the language, even details that were never considered by the language designer or semantics writer. Precision and completeness are more important for this purpose than readability, and formal semantic definitions are not easy to read.

A definition which only experts can read can serve as a standard to determine whether a compiler implements the standard language, but it is not really adequate for general use. Someone must study the definition and provide additional explanatory material so that educated nonexperts can understand it. Following is a quote from Cooper's Preface⁹ which colorfully expresses the role of his book in providing a usable definition of Pascal semantics:

    The purpose of this manual is to provide a correct, comprehensive, and comprehensible reference for Pascal. Although the official Standard promulgated by the International Standards Organization (ISO) is correct by definition, the precision and terseness required by a formal standard makes it quite difficult to understand. This book is aimed at students and implementors with merely human powers of understanding, and only a modest capacity for fasting and prayer in the search for the syntax or semantics of a domain-type or variant selector.

Cooper's book includes the definitions from the ISO standard and provides added explanatory material and examples.
Compiler writers and textbook authors, in turn, can (but too many do not) use books such as Standard Pascal to ensure that their translations, explanations, and examples are correct.

4.3.3 The Abstract Machine

In order to make language definitions portable and not dependent on the properties of any particular hardware, the semantics of a computation tree must be defined in terms of an abstract model of a computer, rather than some specific hardware. Such a model has elements that represent the computer hardware, plus a facility for defining and using symbols. It forms a bridge between the needs of the human and the computer. On one hand, it can represent symbolic computation; on the other hand, the elements of the model are chosen so that they can be easily implemented on real hardware. We describe an abstract machine here which we will use to discuss the semantics of many languages. It has five elements: the program environment, the stack, streams, the shared environment, and the control.
⁹ Cooper [1983], p. ix.


This abstract machine resembles both the abstract machine underlying FORTH¹⁰ and the SECD machine that Landin used to formalize the semantics of LISP.¹¹ Landin's SECD machine also has a stack and a control. Its environment component is our program environment, and our streams replace Landin's dump. The FORTH model contains a dictionary which implements our program environment. FORTH has two stacks (for parameters and return addresses) which together implement our stack, except that no facility is provided for parameter names or local names.¹² The FORTH system defines input and output from files (our streams) and how a stream may be attached to a program. Finally, FORTH has an interpreter and a compiler which together define our control element. Our abstract machine has one element, the shared environment, not present in either the FORTH model or the SECD machine, as those models did not directly support multitasking.

Program Environment. This environment is the context internal to the program. It includes global definitions and dynamically allocated storage that can be reached through global objects. It is the part of the abstract machine that supports communication between any nonhierarchically nested modules in a single program. Each function, F, exists in some symbolic context. Names are defined outside of F for objects and other functions. If these names are in F's program environment, they are known to F and permit F to refer to those objects and call those functions.

The program environment is implemented by a symbol table (oblist in LISP, dictionary in FORTH). When a symbol is defined, its name is placed in the symbol table, which connects each name to its meaning. Predefined symbols are also part of the environment. The meaning of a name is stored in some memory location, either when the name is defined or later. Either this space itself (as in FORTH) or a pointer to it (as in LISP) is kept adjacent to the name in the symbol table. Depending on the language, the meaning may be stored into the space by binding and initialization and/or it may be changed by assignment.

Shared Environment. This is the context provided by the operating system or program development shell. It is the part of the abstract machine that supports communication between a program and the outside world. A model for a language that supports multitasking must include this element to enable communication between tasks. Shared objects are in the environment of two or more tasks but do not belong to any of them. Objects that can be directly accessed by the separate, asynchronous tasks that form a job are part of the shared environment. Intertask messages are examples.

The Stack. The stack is the part of the computation model that supports communication between the enclosing and enclosed function calls that form an expression. It is a segmented structure of
¹⁰ Brodie [1987], Chapter 9.
¹¹ Landin [1964].
¹² The dictionary in FORTH 83 is structured as a list of independent vocabularies, giving some support for local names.


theoretically unlimited size. The top stack segment, or frame, provides a local environment and temporary objects for the currently active function. This local environment consists of local names for objects outside the function (parameters) and for objects inside the function (local variables). Local environments for several functions can exist simultaneously and will not interfere with each other. Suspension of one function in order to execute another is possible, with later reactivation of the first in the same state as when it was suspended.

The stack is implemented by a stack. A stack pointer is used to point at the stack frame (local environment) for the current function, which points back to a prior frame. A frame for a function F is created above the prior frame upon entry to F, and is destroyed when F exits. Storage for function parameters and a function return address are allocated in this frame and initialized (and possibly later removed) by the calling program. Upon entry to F, the names of its parameters are added to the local environment by binding them to the stack locations that were set up by the calling program. The local symbols defined in F are also added to the environment and bound to additional locations allocated in F's stack frame. The symbol table is managed in such a way as to permit these names to be removed from the environment upon function exit.

Streams. Streams are one medium of communication between different tasks that are parts of a job. A program exists in the larger context of a computer system and its files. The abstract machine, therefore, must reflect mass storage and ways of achieving data input and output. A stream is a model of a sequential file, as seen by a program. It is a sequence, in time, of data objects, which can be either read or written. Symbolic names for streams and for the files to which they are bound must be part of the program environment. The concept of a stream is actually more general than the concept of a sequential file. Suppose two tasks are running concurrently on a computer system, and the output stream of one becomes the input stream of the other. A small buffer to hold the output until it is reprocessed can be enough to implement both streams.

Control. The control section of the abstract model implements the semantic rules of the language that define the order in which the pieces of the abstract computation tree will be evaluated. It defines how execution of statements and functions is to begin, proceed, and end, including the details of sequencing, conditional execution, repetition, and function evaluation. (Chapter 8 deals with expressions and function evaluation, and Chapter 10 deals with control statements.)

Three kinds of control patterns exist: functional, sequential, and asynchronous.¹³ These patterns are supported in various combinations in different languages. Each kind of control pattern is associated with its own form of communication, as diagrammed in Exhibit 4.9. Functional control elements communicate with each other by putting parameters on the stack and leaving results in a return register. In the diagram, functions F1, F2, and F3 are all part of Process_1 and have associated stack frames on the stack for Process_1. When F3 is entered, its
¹³ Developed fully in Chapter 8.


Exhibit 4.9. Communication between modules.

[Diagram: Process_1 executes hierarchically through functions F1, F2, and F3, with stack frames F3 (g: ?), F2 (t: 10), and F1 (s: 30, return: 6) on the program stack. F1's body is "Parameter s; return(s/5)"; F2's is "Parameter t; f := F1(t*3)"; F3's is "Local g; Call F2(10); g := 2*f+1; Signal(P2)". The program environment holds global variables (f: ?) and dynamic storage. The shared environment holds O.S. storage such as messages and the flag P2: ON. Process_2 waits for P2 and its output stream is piped to Process_3, which reads it as its input stream; all processes also have In and Out streams.]

stack frame is created. Then when F3 calls F2 and F2 calls F1, frames for F2 and F1 are created on the stack. The frame for F1, indicated by a <, is the current frame. Parameters are initialized during the function-calling process. When F1 returns it will return a 6 to F2. Functions within the same process share access to global variables in the program environment for that process. Sequential constructs in these functions communicate by assigning values to these variables. Function F2 communicates with F3, and sequential statements in F3 communicate with each other through the global variable named f in the program environment. F1 will return the value 6 to F2, which will assign it to a global variable, f. This variable is accessible to F3, which will use its value to compute g. Concurrent tasks communicate through the shared environment. Process_1 and Process_2 share asynchronous, concurrent execution and synchronize their operations through signals left in the shared environment. Sequential tasks communicate through streams. The output from Process_2 becomes the input for Process_3. To implement this, the operating system has connected their output and input


streams through an operating system pipe. This pipe could be implemented either by conveying the data values to Process_3 as soon as they are produced by Process_2, or by storing the output in a buffer or a file, then reading it back when the stream is closed.

A Semantic Basis. The formal semantic definition of a language must include specific definitions of the details of the abstract machine that implements its semantics. Different language models include and exclude different elements of our abstract machine. Many languages do not support a shared environment. The new functional languages do not support a program environment, except for predefined symbols. The control elements, in particular, differ greatly from one language to the next.

We define the term semantic basis of a language to mean the specific version of the abstract machine that defines the language, together with the internal data structures and interpretation procedures that implement the abstract semantics. Layered on top of the semantic basis is the syntax of the language, which specifies the particular keywords, symbols, and order of elements to be used to denote each semantic unit it supports.

The semantic basis of a language must define the kinds of objects that are supported, the primitive actions, the control structures by which the objects and actions are linked together, and the ways that the language may be extended by new definitions. The features included in a semantic basis completely determine the power of a language; items left out cannot be defined by the programmer or added by using macros. Where two different semantic units provide roughly the same power, the choice of which to include determines the character of the language and the style of programs that will be written in it. Thus a wise language designer gives careful thought to the semantic basis before beginning to define syntax.
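The frame discipline and the two environments described above can be caricatured in a few lines of Python. All names here are invented; the three functions mirror F1, F2, and F3 from the hierarchical-execution example of Exhibit 4.9, where F1(30) returns 6 and F3 computes g from the global f.

```python
# A toy model of the abstract machine's storage: a program environment
# of global variables, plus a stack holding one frame per active call.
program_environment = {"f": None, "g": None}   # global variables
stack = []                                     # one frame per active call

def call(name, local_bindings, body):
    """Push a frame on entry, run the body, pop the frame on exit."""
    stack.append({"function": name, **local_bindings})
    result = body(stack[-1])
    stack.pop()
    return result

def F1(frame):                  # Parameter s; return(s/5)
    return frame["s"] // 5

def F2(frame):                  # Parameter t; f := F1(t*3)
    program_environment["f"] = call("F1", {"s": frame["t"] * 3}, F1)

def F3(frame):                  # Local g; Call F2(10); g := 2*f+1
    call("F2", {"t": 10}, F2)
    program_environment["g"] = 2 * program_environment["f"] + 1

call("F3", {}, F3)
# Afterward: f = 6, g = 13, and the stack is empty again.
```

While F1 runs, three frames coexist on the stack without interfering, exactly as in the exhibit; function results travel back through return values, while the globals f and g carry values between sequential statements.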

4.3.4 Lambda Calculus: A Minimal Semantic Basis

It is perhaps surprising that a very small set of semantic primitives, excluding goto and assignment, can form an adequate semantic basis for a language. This was proven theoretically by Church's work on lambda calculus.¹⁴ Lambda calculus is not a programming language and is not directly concerned with computers. It has no programs or objects or execution as we understand them. It is a symbolic, logical system in which formulas are written as strings of symbols and manipulated according to logical rules.

We need to be knowledgeable about lambda calculus for three reasons. First, it is a complete system: Church has shown that it is capable of representing any computable function. Thus any language that can implement or emulate lambda calculus is also complete. Second, lambda calculus gives us a starting point by defining a minimal semantic basis for computation that is mathematically clean. As we examine real computer languages we want to distinguish between necessary features, nice features (extras), nonfeatures (things that the language
¹⁴ Church [1941].


Exhibit 4.10. Lambda calculus formulas.

    Formula             Comment
    x                   Any variable is a formula.
    (λx.((yy)x))        Lambda expressions are formulas.
    (λz.(y(λz.z)))      The body of this lambda expression is an application.
    ((λz.(zy))x)        Why is this formula an application?

would be better off without), and missing features which limit the power of the language. The lambda calculus gives us a starting point for deciding which features are necessary or missing. Finally, an extended version of lambda calculus forms the semantic basis for the modern functional languages. The Miranda compiler translates Miranda code into tree structures which can then be interpreted by an augmented lambda calculus interpreter.

Lambda calculus has taken on new importance because of the recent research on functional languages. These languages come exceedingly close to capturing the essence of lambda calculus in a real, translatable, executable computer language. Understanding the original formal system gives us some grasp of how these languages differ from C, Pascal, and LISP, and supplies some reason for the aspects of functional languages that seem strange at first.

Symbols, Functions, and Formulas

There are two kinds of symbols in lambda calculus:

• A single-character symbol, such as y, used to name a parameter and called a variable.
• Punctuation symbols: (, ), ., and λ.

These symbols can be combined into strings to form formulas according to three simple rules:

1. A variable is a formula.
2. If y is a variable and F is a formula, then (λy.F) is a formula, which is called a lambda expression; y is said to be the parameter of the lambda expression, and F is its body.
3. If F and G are formulas, then (FG) is a formula, which is called an application.

Thus every lambda calculus formula is of one of three types: a variable, a lambda expression, or an application. Examples of formulas are given in Exhibit 4.10.

Lambda calculus differs from programming languages in that its programs and its semantic domain are the same. Formulas can be thought of as programs or as the data upon which programs operate.
A lambda expression is like a function: it specifies a parameter name and has a body that usually refers to that parameter.¹⁵ An application whose first formula is a lambda expression is like
¹⁵ The syntax defined here supports only one-argument functions. There is a common variant which permits multiargument functions. This form can be mechanically converted to the single-argument syntax.


Exhibit 4.11. Lambda calculus names and symbols.

    Formula              Comment
    x, y, z, etc.        Single lowercase letters are variables.
    G = (λx.(y(yx)))     A symbolic name may be defined to stand for a formula.
    H = (GG)             Previously defined names may be used in describing formulas.

a function call: the function represented by the lambda expression is called with the second formula as an argument. Thus ((λx.F)G) intuitively means to call the function (λx.F) with argument G. However, not all formulas can be interpreted as programs. Formulas such as (xx) or (y(λx.z)) do not specify a computation; they can be thought of as data.

In order to talk about lambda formulas, we will often give them symbolic names. To avoid confusing our names, which we use to talk about formulas, with variables, which are formulas, we use uppercase letters when naming formulas. As a shorthand for the statement "let F be the formula (λx.(yx))", we will write simply F = (λx.(yx)). If we then write a phrase like "the formula (F z) is an application", the formula we are talking about is ((λx.(yx))z). In general, wherever F appears, it should be replaced by its definition. Since names are just a shorthand for formulas, a circular definition such as F = (λx.(yF)) is meaningless. Examples of symbols and definitions are shown in Exhibit 4.11.

As another shorthand, when talking about formulas, we may omit unnecessary parentheses. Thus we may write λx.y instead of (λx.y). In general, there may be more than one way to insert parentheses to make a meaningful formula. For example, λx.yx might mean either (λx.(yx)) or ((λx.y)x). We use the rules that the body of a lambda expression extends as far to the right as possible, and sequences associate to the left. Thus, in the above example, the body of the lambda expression is yx, so the fully parenthesized form is (λx.(yx)). Examples of these rules are given in Exhibit 4.12.

Free and Bound Variables. A parameter name is a purely local name. It binds all occurrences of that name on the right side of the lambda expression. A symbol on the right side of a lambda

Exhibit 4.12. Omitting parentheses when writing lambda calculus formulas.

Shorthand           Meaning
fxy                 ((fx)y)
λx.λy.x             (λx.(λy.x))
λx.xλy.y            (λx.(x(λy.y)))
(λx.(xx))(zw)       ((λx.(xx))(zw))
λx.λy.yzw           (λx.(λy.((yz)w)))

4.3. SEMANTICS


Exhibit 4.13. Lambda expressions for TRUE and FALSE.

Expressions     Comments
T = λx.λy.x     The symbol T represents the logical value TRUE. You should read the definition of T as follows: T is a function of parameters x and y. Its body ignores y and returns x. (We say the argument y is dropped.)
F = λx.λy.y     F names the lambda expression which represents FALSE.

expression is bound if it occurs as a parameter, immediately following the symbol λ, on the left side of the same expression or of an enclosing expression. The scope of a binding is the entire right side of the expression. In Exhibit 4.14, the λx defines a local name and binds all occurrences of x in the expression. We say that each bound occurrence of x refers to the particular λx that binds it. An occurrence of a variable x in F is free if x is not bound. Thus the occurrence of p in (λy.(py)) is free, but the occurrence of y in that same formula is bound (to λy).

In the formula (x(λx.((λx.x)x))), the variable x occurs five times. The second and third occurrences are bindings; the other three occurrences are uses. The first occurrence is free, since it does not lie within the scope of any λx-expression. The fourth occurrence is bound to the third occurrence, and the fifth occurrence is bound to the second occurrence.

These binding rules are the familiar scoping rules of block-structured programming languages such as Pascal. The operator λx declares a new instance of x. All occurrences of x within its scope refer to that instance, unless x is redeclared by a nested λx. In other words, an occurrence of a variable is always bound to the innermost enclosing block in which x is declared.

Representing Computation. Church invented a way to use lambda formulas to represent computation. He assigned interpretations to certain formulas, making them represent the basic elements of computation. (Some, but not all, lambda expressions have useful interpretations.) The formulas shown in this chapter are some of the most basic in Church's system, including formulas that represent truth values [Exhibit 4.13], the integers [Exhibit 4.15], and simple computations on them [Exhibit 4.16]. More advanced formulas are able to represent recursion. As you work through these examples the purpose and mechanics of these basic definitions should become clearer.
Now that we know what lambda calculus formulas are, we need to talk about what they do. Evaluation rules allow one formula to be transformed to another. A formula which cannot be transformed further is said to be in normal form. The meaning of a formula is its normal form, if it has one; otherwise, the formula is undened. An undened formula corresponds to a nonterminating computation. Exhibit 4.14 dissects an expression and looks at its parts.


CHAPTER 4. FORMAL DESCRIPTION OF LANGUAGE

Exhibit 4.14. Dissection of a lambda expression.

A lambda expression, with name:   2 = λx.λy.x(xy)
Useful interpretation:            the number two

Breakdown of elements:
2 =         Declares the symbol 2 to be a name for the following expression.
λx.         The function header names the parameter, x. Everything that follows this "." is the expression body.
λy.x(xy)    The body of the original expression is another expression with a parameter named y. Parameter names are purely arbitrary; this expression would still have the same meaning if it were rewritten with a different parameter name, as in: λq.x(xq)
x(xy)       This is the body of the inner expression. It contains a reference to the parameter y and also references to the parameter x from the enclosing expression.

Reduction. Consider a lambda expression which represents a function. At the abstract level, the meaning, or semantics, of the expression is the mathematical function that it computes when applied to an argument. Intuitively, we want to be able to freely replace an expression by a simpler expression that has the same meaning. The rules for beta and eta reduction permit us to do so.

The main evaluation rule for lambda calculus is called beta reduction, and it corresponds to the action of calling a function on its argument. A beta reducible expression is an application whose left part is a lambda expression. We also use the term beta redex as a shortening of "reducible expression". When a lambda expression is applied to an argument, the argument formula is substituted for the bound variable in the body of the expression. The result is a new formula.

A second reduction rule is called eta reduction. Eta reduction lets us eliminate one level of binding in an expression of the form λx.f(x). In words, this is a special case in which the lambda parameter is used only once, at the end of the body of the expression, and the rest of the body is a function applied to this parameter. If we apply such an expression to an argument, one beta reduction step will result in the simpler form of f applied to that argument. Eta reduction lets us make this transformation without supplying an argument. Specifically, eta reduction permits us to replace any expression of the form λx.f(x), where f represents a function, by the single symbol f.

After a reduction step, the new formula may still contain a redex. In that case, a second reduction step may be done. When the result does not contain a beta-redex or eta-redex, the reduction process is complete. We say such a formula is in normal form.

Many lambda expressions contain nested expressions. When such an expression is fully parenthesized it is clear which arguments belong to which function.
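Because Python treats functions as values, the eta rule just described can be illustrated directly: wrapping a function f in a one-parameter lambda produces a function that behaves identically to f, so the wrapper can be dropped. This is a sketch in Python, not part of the formal system:

```python
# Eta reduction, informally: "lambda x: f(x)" and "f" compute the
# same function, so the eta-expanded wrapper is redundant.
f = lambda n: n * n
eta_expanded = lambda x: f(x)   # the form the eta rule eliminates

# Both compute the same result for every argument we try:
print(all(f(v) == eta_expanded(v) for v in range(10)))   # True
```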
When parentheses are omitted, remember that function application associates to the left; that is, the leftmost argument is substituted first for the parameter in the outermost expression. We now describe in more detail how reduction works. When we reduce a formula (or subformula) of the form H = ((λx.F)G), we replace H by the formula F′, where F′ is obtained from F by

Exhibit 4.15. Lambda calculus formulas that represent numbers.

0 = λx.λy.y
1 = λx.λy.xy
2 = λx.λy.x(xy)

The formula for zero has no occurrences of its first parameter in its body. Note that it is the same as the formula for F. Zero and False are also represented identically in many programming languages. The formula for the integer one has a single x in its body, followed by a y. The formula for two has two x's. The number n will be represented by a formula in which the first parameter occurs n times in succession.

substituting G for each reference to x in F. Note that if F contains another binding λx, the references to that binding are not replaced. For example, ((λx.xy)(zw)) reduces to ((zw)y), and ((λx.x(λx.(xy)))(zz)) reduces to (zz)(λx.(xy)).

When an expression containing an unbound symbol is used as an argument to another lambda expression, special care must be taken. Any occurrence of a variable in the argument that was free before the substitution must remain free after the substitution. It is not permitted for a variable to be captured by an unrelated λ during substitution. For example, it is not permitted to apply the reduction rule to the formula ((λx.(λy.x))(zy)), since y is free in (zy), but after substitution, that occurrence of y would not be free in (λy.(zy)). To avoid this problem, the parameter must be renamed, and all of its bound occurrences must be changed to the new name. Thus ((λx.(λy.x))(zy)) could be rewritten as ((λx.(λw.x))(zy)), after which the reduction step would be legal.

Examples of Formulas and Their Reductions. The formulas T and F in Exhibit 4.13 accomplish the equivalent of branching by manipulating their parameters. They take the place of the conditional statement in a programming language. T (true) returns its first argument and discards the second. Thus it corresponds to the IF..THEN statement, which evaluates the THEN clause when the condition is true. Similarly, the formula F (false) corresponds to the IF..ELSE clause. It returns its second argument, just as an IF statement evaluates the second, or ELSE, clause when the condition is false.

The successor function, S, applied to any integer, gives us the next integer. Exhibit 4.16 shows the lambda formula that computes this function. Given any formula for a number n, it returns the formula for n + 1. The function ZeroP (zero predicate) tests whether its argument is equal to the formula for zero. If so, the result is T; if not, F. Exhibit 4.17 shows how we would call S and ZeroP.
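These encodings translate directly into any language with higher-order functions. As an illustrative sketch (not part of the original text), here are the booleans of Exhibit 4.13, the numerals of Exhibit 4.15, and S and ZeroP of Exhibit 4.16 written as Python lambdas; the helper to_int is an invented convenience for printing results:

```python
# Church booleans: T keeps its first argument, F keeps its second.
T = lambda x: lambda y: x
F = lambda x: lambda y: y

# Church numerals: n applies its first argument n times to its second.
zero = lambda x: lambda y: y
one  = lambda x: lambda y: x(y)
two  = lambda x: lambda y: x(x(y))

# S = λn.λx.λy.(n x)(xy): insert one more application of x.
succ = lambda n: lambda x: lambda y: n(x)(x(y))

# ZeroP = λn.n(λx.F)T: zero drops (λx.F) and yields T; any nonzero
# numeral applies (λx.F) at least once, yielding F.
zerop = lambda n: n(lambda x: F)(T)

def to_int(n):
    """Decode a Church numeral by counting applications."""
    return n(lambda k: k + 1)(0)

print(to_int(succ(one)))           # 2
print(zerop(zero)("yes")("no"))    # yes  (ZeroP 0 = T)
print(zerop(one)("yes")("no"))     # no   (ZeroP 1 = F)
```

Note that the selection behavior of T and F is exactly the IF..THEN..ELSE described above: applying a Church boolean to two values returns one of them.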
The process of carrying out these computations will be explained later. Church was able to show that lambda calculus can represent all computation, by representing

Exhibit 4.16. Basic arithmetic functions.

The successor function for integers. Given the formula for any integer, n, this formula adds one x and returns the formula for the next larger integer.
S = λn.(λx.λy.nx(xy))

Zero predicate. This function returns T if the argument = 0 and F otherwise.
ZeroP = λn.n(λx.F)T

numbers, conditional evaluation, and recursion. Crucial to the power of his system is that there is no distinction between objects and functions. In fact, objects, in the sense of data objects, were not defined at all. Expressions called normal forms take their place as concrete things that exist and can be tested for identity. A formula is in normal form if it contains no redexes.

Not all formulas have a normal form; some may be reduced infinitely many times. These formulas, therefore, do not represent objects. They are the analog of infinite recursions in computer languages. For example, let us define the symbol twin to be a lambda expression that duplicates its parameter:

twin = λx.xx

The function twin can be applied to itself as an argument. The application looks like this:

(twin twin)

The preceding line shows this application symbolically. Now we rewrite this formula with the

Exhibit 4.17. Some lambda applications.

An application consists of a function followed by an argument. The first three applications listed here use the number symbols defined in Exhibit 4.15 and the function symbols defined in Exhibit 4.16. These three applications are evaluated step-by-step in Exhibits 4.18, 4.19, and 4.20.

(S 1)        Apply the successor function to the function 1.
(ZeroP 0)    Apply ZeroP to 0. (Does 0 = zero?)
(ZeroP 1)    Does 1 = zero?
((GH) x)     Apply formula G to formula H, and apply the result to x.

The last application has the same meaning when written without the parentheses: GHx.

name of the function replaced by its definition. Parentheses are used, for clarity, to separate expressions:

((λx.xx)(twin))

This formula contains a redex and so it is not in normal form. When we apply the reduction rule, the function, λx.xx, makes two copies of its parameter, giving:

(twin twin)

Thus the result of reduction is the same as the formula we started with! Clearly, a normal form can never be reached.

Higher-Order Functions. If lambda calculus were a programming language, we would say that it treats functions as first-class objects and supports higher-order functions. This means that functions may take functions as parameters and return functions as results. With this potential we can do some highly powerful things. We can define a lambda expression, F, to be the composition of two other expressions, say G and H. (This means that F is the expression produced by applying G to the result of H.) This cannot be done in most programming languages. C, for example, permits you to execute a function G on the result of executing H. But C does not let you write a function that takes two functional parameters, G and H, and returns a function, F, that will later accept some argument and apply first H to it and then apply G to the result.

A formula that implements recursion can be defined as the composition of two higher-order functions. Thus lambda calculus does not need to have recursion built in; it can be defined within the system. In contrast, recursion is, and must be, built into C and Pascal.

A language with higher-order functions also permits one to curry a function. G is a currying of F if G has one fewer parameter than F and computes its result by calling F with a constant in place of the omitted parameter. Currying, combined with generic dispatching,¹⁶ is one way to implement functions with optional arguments.

Evaluation / Reduction. Any model of computation must represent action as well as objects.
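The composition and currying operations described above are easy to express in a language with first-class functions. A brief Python sketch (the function names are illustrative, not standard):

```python
def compose(g, h):
    """Return the function F with F(x) = g(h(x)): apply h first, then g."""
    return lambda x: g(h(x))

def curry_first(f, const):
    """A 'currying' in the text's sense: a function of one fewer
    parameter that calls f with const in place of the omitted one."""
    return lambda y: f(const, y)

# compose builds a new function from two functional parameters:
double_then_inc = compose(lambda n: n + 1, lambda n: 2 * n)
print(double_then_inc(5))   # 11

# curry_first fixes the first argument of a two-argument function:
add = lambda a, b: a + b
add3 = curry_first(add, 3)
print(add3(10))             # 13
```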
Actions are represented in the lambda calculus by applying the reduction rule, which requires applying the renaming and substitution rules. To reduce a formula, F, one finds a subformula, S, anywhere within F, that is reducible. To be reducible, S must consist of a lambda expression, L, followed by an argument, A. The reduction process then consists of two steps: renaming and substitution.
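The renaming and substitution steps can be made concrete with a small interpreter. The sketch below is invented for illustration (the tuple representation and all names are assumptions, not the book's notation): it performs one beta reduction step, renaming the bound variable whenever the argument contains a free variable that would otherwise be captured.

```python
# Terms: a variable is a string, ("lam", x, body) is a lambda
# expression, and ("app", f, a) is an application.
import itertools

fresh = (f"v{i}" for i in itertools.count())   # supply of new names

def free_vars(t):
    if isinstance(t, str):
        return {t}
    if t[0] == "lam":
        return free_vars(t[2]) - {t[1]}
    return free_vars(t[1]) | free_vars(t[2])

def subst(t, x, arg):
    """Substitute arg for free occurrences of x in t, renaming a bound
    variable that would capture a free variable of arg."""
    if isinstance(t, str):
        return arg if t == x else t
    if t[0] == "app":
        return ("app", subst(t[1], x, arg), subst(t[2], x, arg))
    _, y, body = t
    if y == x:                    # x is rebound here: do not descend
        return t
    if y in free_vars(arg):       # the renaming step
        z = next(fresh)
        body = subst(body, y, z)
        y = z
    return ("lam", y, subst(body, x, arg))

def beta(t):
    """Reduce one outermost redex ((lam x body) arg), if present."""
    if isinstance(t, tuple) and t[0] == "app" and \
       isinstance(t[1], tuple) and t[1][0] == "lam":
        _, x, body = t[1]
        return subst(body, x, t[2])
    return t

# ((λx.(λy.x)) (zy)) from the text: y is free in the argument, so the
# inner bound y must be renamed before substituting.
print(beta(("app", ("lam", "x", ("lam", "y", "x")), ("app", "z", "y"))))
```

Running this prints ('lam', 'v0', ('app', 'z', 'y')), the analog of rewriting the inner λy as λw before reducing.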
¹⁶ See Chapter 18.

Exhibit 4.18. Reducing (S 1).

Compute the successor of 1. The answer should be 2. For clarity, the formula for one has been written using p and q instead of x and y. (This is, of course, permitted. The symbols that are used for bound variables may be renamed at any time.)

Write out S.                        ((λn.(λx.λy.nx(xy)))1)
Substitute 1 for n, reduce.         (λx.λy.1x(xy))
Write out the definition of 1.      (λx.λy.(λp.λq.pq)x(xy))
Substitute x for p, and reduce.     (λx.λy.(λq.xq)(xy))
Substitute (xy) for q, reduce.      (λx.λy.x(xy))

The answer is the formula for 2, which is, indeed, the successor of 1.

Renaming. Renaming is required only if unbound symbols occur in A. They must not have the same name as L's parameter. If such a name conflict occurs, the parameter in L must be renamed so that the unbound symbol will not be captured by L's parameter. The new name may be any symbol whatsoever. The formula for L is simply rewritten with the new symbol in place of the old one.

Substitution. After renaming, each parameter reference on the right side of L is replaced by a copy of the entire argument-expression, and the resulting string replaces the subexpression S. The λ, the dummy parameter, and the "." are dropped.

Exhibits 4.18, 4.19, and 4.20 illustrate the reduction process. Three simple formulas are given and reduced until they are in normal form. The comments on the left in these exhibits document each choice of redex and the corresponding substitution process. The following explanations are given so that you may develop some intuition about how these functions work.

Successor. Intuitively, the successor function must take a numeric argument (a nest of two lambda expressions) and insert an additional copy of the outermost parameter into the middle of the formula. This is accomplished as follows: On the first reduction step, the formula for S embeds its argument, n, in the middle of a nested lambda expression. The symbols x and y in the formula for S are bound by the lambdas at the left. We rename the bound variables in the formula for n to avoid confusion; during the reduction process, this p and q will be eliminated.

The formula for n now forms a redex with the x in the tail end of the formula for S. Reducing this puts as many copies of x into the result as there were copies of p in n. Remember, we want to end up with exactly one additional copy of x. This added x comes from the (xy) at the right of the formula for S. The result of the preceding

Exhibit 4.19. Reducing (ZeroP 0).

Apply ZeroP to 0, that is, determine whether 0 equals zero. The answer should be T.

Write out ZeroP followed by 0.                        ((λn.n(λx.F)T)0)
Substitute 0 for n in the body of ZeroP and reduce.   (0(λx.F)T)
Write out the formula for zero.                       ((λx.λy.y)(λx.F)T)
Substitute (λx.F) for x, and reduce.                  ((λy.y)T)
Substitute T for y, reduce.                           T

So 0 does equal 0. Note that the argument, (λx.F), was dropped in the fourth step because the parameter, x, was not referenced in the body of the function.

reduction forms a redex with this (xy). When we reduce, this final x is sandwiched between the other x's and the y, as desired. Essentially, the y in a number is a growth bud that permits any number of x's to be appended to the string. It would be easy, now, to write a definition for the function plus2.

Zero predicate. Remember, 0 and F are represented by the same formula. Thus the zero predicate must turn F into T and any other numeric formula into F. (The behavior of ZeroP on nonnumeric arguments is undefined. Applying ZeroP to a nonnumber is like a type error.) Briefly, the mechanics of this computation work as follows: An integer is represented by a formula that is a nest of two lambda expressions. ZeroP takes its argument, n, and appends two expressions, λx.F and T, to n. These two

Exhibit 4.20. Reducing (ZeroP 1).

Write out ZeroP followed by 1.       ((λn.n(λx.F)T)1)
Substitute 1 for n, reduce.          (1(λx.F)T)
Write out the formula for 1.         ((λx.λy.xy)(λx.F)T)
Substitute (λx.F) for x, reduce.     ((λy.(λx.F)y)T)
Substitute T for y, reduce.          ((λx.F)T)
Substitute T for x and reduce.       F

On the last line, the parameter x does not appear in the body of the function, so the argument, T, is simply dropped. So 1 does not equal 0. Applying ZeroP to any nonzero number would give the same result, but involve one more reduction step for each x in the formula.

Exhibit 4.21. A formula with three redexes.

Assume that P3 (which adds 3 to its argument) and ∗ (which computes the product of two arguments) have already been defined. (They can be built up out of the successor function.) Then the formula

(∗ (P3 4) (P3 9))

has three reducible expressions: (P3 4), (P3 9), and (∗ (P3 4) (P3 9)).

expressions form arguments for the two lambda expressions in n. The entire unit forms two nested applications. We reduce the outermost lambda expression first, using the argument λx.F. If n is 0, this argument is discarded because the formula for zero does not contain a reference to its parameter. For nonzero arguments, this expression is kept. The inner expression (from the original argument, n) forms an application with the argument T. If n was zero, this reduces immediately to T. If n was nonzero, there is one more reduction step and the result is F.

The Order of Reductions. Not every expression has a normal form; some can be reduced forever. But if a normal form exists it can always be reached by some chain of reductions. When each lambda expression in a formula is nested fully within another, only one order of reduction is possible: from the outside in. But it is possible to have a formula with two reducible lambda expressions at the same level, side by side [Exhibit 4.21]. Further, whatever redex you select next, the normal form can still be reached. Put informally, you cannot back yourself into a corner from which you cannot escape. This important result is named the Church-Rosser Theorem after the logicians who formally proved it.

Some expressions that do have normal forms contain subexpressions that cannot be reduced to normal form. This seems like a contradiction until you realize that, in the process of evaluation, whole sections of a formula may be discarded. For example, in a conditional structure, either the then part or the else part will be skipped. The computation enclosing the conditional can still terminate successfully, even if the part that is skipped contains an infinite computation. By the Church-Rosser theorem, a normal form, if it exists, can be reached by reducing subformulas in any order until there are no reducible subformulas left.
However, although you cannot get blocked in reducing such an expression, you can waste an infinite amount of effort if you persist in reducing a nonterminating part of the formula. Since any subformula may be discarded by a conditional, and need never be evaluated, it is wiser to postpone evaluating a subexpression until it is needed. If, eventually, a nonterminating subformula must be evaluated, then the formula has no normal form. If, on the other hand, it is discarded, the formula in which this infinite

computation was embedded can still be computed (reduced to normal form).

A further theorem proves that if a normal form can be reached, then it can be reached using the outside-in order of evaluation. That is, at each step the outermost possible redex is chosen. (The formulas in Exhibits 4.20, 4.19, and 4.18 were all reduced in outside-in order.) This order is called the normal order of evaluation in lambda calculus and corresponds to call-by-name reduction order in a programming language.¹⁷ It may not be a unique order, since sometimes the outermost formula is not reducible, but may contain more than one redex side by side. In that case, either may be reduced first.

The Relevancy of Lambda Calculus. Lambda calculus has been proven to be a fully general way to symbolize any computable formula. Its semantic basis contains representations of objects (normal forms) and functions (λ expressions). Because functions are objects, and higher-order functions can be constructed, the system is able to represent conditional branching, function composition, and recursion. Computation is represented by the process of reduction, which is defined by the rules for renaming, parameter substitution, and formula rewriting. Although lambda calculus is a formal logical system for manipulating formulas and symbols, it provides a model of computation that can be and has been used as a starting point for defining programming languages. LISP was originally designed to be an implementation of lambda calculus, but it did not capture the outside-in evaluation semantics.
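The payoff of postponing evaluation can be imitated in an eager language such as Python by wrapping each branch of a conditional in a parameterless function (a thunk), so that a discarded branch is never evaluated. This is a sketch of the idea, not a claim about any particular language's implementation:

```python
def loop():
    """A nonterminating 'subformula'."""
    while True:
        pass

# Church booleans select one of two values:
T = lambda x: lambda y: x
F = lambda x: lambda y: y

def if_lazy(cond, then_thunk, else_thunk):
    # cond picks one thunk; only the chosen thunk is ever forced,
    # so the discarded branch may safely diverge.
    return cond(then_thunk)(else_thunk)()

# The enclosing computation terminates even though one branch diverges:
print(if_lazy(F, lambda: loop(), lambda: "safe"))   # safe
```

Passing the branches unwrapped (as plain values computed in advance) would correspond to applicative order and would never return here.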

4.4 Extending the Semantics of a Language

Let us define an extension to be a set of definitions which augment a language with an entirely new facility that can be used in the same way that preexisting facilities are used. Some of the earliest languages were not very extensible at all. The original FORTRAN allowed variables to be defined but not types or functions (in a general sense). Function definitions were limited to one line.

All modern languages are extensible in many ways. Any time we define a new object, a new function, or a new data type, we are extending the language. Each such definition extends the list of words that are meaningful and adds new expressive power. Pascal, LISP, and the like are extensible in this sense: by building up a vocabulary of defined functions and/or procedures, we ultimately write programs in a language that is much more extensive and powerful than the bare language provided by the compiler.

Historically, we have seen that extensibility depends on uniform, general treatment of a language feature. Any time a translator is designed to recognize a specific, fixed set of keywords or defined symbols, that portion of the language is not extensible. The earliest BASIC was not extensible at all; even variable names were all predefined (only two-letter names were permitted). FORTRAN, one of the earliest computer languages, can help us see how the design of a language and a translator
¹⁷ See Chapter 9, Section 9.2.

can create barriers to extensibility. We will look at types and functions in early FORTRAN and contrast them to the extension facilities in more modern languages.

Early FORTRAN supported a list of predefined mathematical functions. The translator recognized calls on those predefined functions, but users could not define their own. This probably happened because the designers/implementors of FORTRAN provided a static, closed list of function names instead of simply permitting a list that could grow. The mechanics of translating a function call are also simpler if only one- and two-argument functions have to be supported, rather than argument lists of unlimited size.

In contrast, consider early LISP. Functions were considered basic (as lambda expressions are basic in lambda calculus), and the user was expected to define many of them. The language as a whole was designed to accept and translate a series of definitions and enter each into an extensible table of defined functions. The syntax for function calls was completely simple and modeled after lambda calculus, which was known to be completely general. LISP was actually easier to translate than FORTRAN.

Consider type extensions. In FORTRAN, there were two recognized data types, real and integer. These were hard-wired into the language: variables whose names started with letters I through N were integers; all other variables were real. On the implementation level, FORTRAN parsers were written to look at each variable name and deduce the type from it. This was certainly a convenient system, since it made declarations unnecessary, but it was not extensible. The system fell apart when FORTRAN was extended to support alphabetic data and double-precision arithmetic.

In contrast, look at Pascal. Pascal has four primitive data types and several ways to build new simple and aggregate types out of the primitive types. The language has a clear notion of what a type is, and when a new type is or is not constructed.
Each time the programmer uses a type constructor, a new type is added to the list of defined types. Thereafter, the programmer may use the new type name in exactly the same ways that primitive type names may be used.

Although Pascal types are extensible, there are predefined, nonextensible relationships among the predefined types, just as there are in FORTRAN. Integers may be converted to reals, and vice versa, under specific, predefined circumstances. These conversion relationships are nonextensible; the triggering circumstances cannot be modified, and similar conversion relationships for other types cannot be defined.

Object-oriented languages carry type-extensibility one step farther, permitting the programmer to define relationships between types and extend the set of situations in which a conversion will take place. This is accomplished, in C++ for example, by introducing the notion of a constructor function, which builds a value of the target type out of components of the original type. The programmer may define her or his own constructors. The translator will use those constructors to avoid a type error under specified circumstances, by converting an argument of the original type to one of the target type.

In all the cases described here, extension is accomplished by allowing the programmer to define new examples of a semantic category that already exists in the translator. To enable extension, a new syntax is provided for defining new instances of existing categories. However, the programmer writes the same syntax for using an extension as for using a predefined facility. Old categories are extended; entirely new things are not added. Some languages, those with macro facilities, allow

the programmer to extend the language by supplying new notation for existing facilities. However, very few languages support additions or changes to the basic syntactic structure or the semantic basis of the language. Changing the syntactic structure would involve changing the parser, which is normally fixed. Changing the semantic basis would involve adding new kinds of tables or procedures to the translator to implement the new semantics.

What would it mean to extend the syntactic structure of a language? Consider the break instruction in C and the EXIT in Ada. These highly useful statements enable controlled exits from the middle of loops. Pascal does not have a similar statement, and an exit from the middle of a loop can be done only with a GOTO. But the GOTO lacks the safely controlled semantics of break and EXIT. Because it is so useful, EXIT is sometimes added to Pascal as a nonstandard extension. Doing this involves extending the parsing phase of the compiler to recognize a new keyword and modifying the code generation phase to generate a branch from the middle of a loop to the first statement after the loop. Of course, a programmer cannot extend a Pascal compiler like this. It can only be done when the compiler is being written.

The ANSI C dialect and the language C++ are both semantic extensions of C. ANSI C extended the original language by adding type checking for function calls and some coherent operations on structured data. C++ adds, in addition, semantically protected modules (classes), virtual functions, and polymorphic domains. This kind of semantic extension is implemented by changing the compiler and having it do work of a different nature than is done by an old C compiler. The extensions mentioned required modifying the process of translating a function call, adding new information to the symbol table, implementing new restrictions on visibility, and adding type checking and type conversion algorithms.
The code and tables of a compiler are normally off-limits to the ordinary language user. In most languages, a programmer cannot access or change the compiler's tables. The languages EL/1, FORTH, and T break this rule; EL/1¹⁸ permitted additions to the compiler's syntactic tables, with accompanying semantic extensions, and FORTH permits access to the entire compiler, including the symbol table and the semantic interpretation mechanisms.

EL/1 (Extensible Language 1) actually permitted the programmer to supply new EBNF syntax rules and their associated interpretations. The translator included a preprocessor and a compiler generator which combined the user-supplied syntax rules with the built-in ones and produced a compiler for the extended language. The semantic interpretations for the new syntactic rules, supplied by the user, were then used in the code generation phase.

A very similar thing can be done in T. T is a semantic extension of Scheme which includes data structuring primitives, object classes, and a macro preprocessor which can be used to extend the syntax of the language. Each preprocessor symbol is defined by a well-formed T expression. With these tools, extensions can be constructed that are not possible in C, Pascal, or Scheme. We could, for example, use the macro facility to define the syntax for a for loop expression and define the semantics to be a complex combination of initializations, statement executions, increments, and result-value construction.
¹⁸ Wegbreit [1970].

4.4.1 Semantic Extension in FORTH

We use FORTH to demonstrate the kind of extension that can be implemented by changing the parser and semantic interpretation mechanisms of a translator. Two kinds of limited semantic extension are possible in FORTH: we may add new kinds of information to the symbol table, with accompanying extensions to the interpreter, and we may modify the parser to translate new control structures. We shall give an example of each kind of extension below. In both cases, the extension is accomplished by using knowledge of the actual implementation of the compiler and accessing tables that would (in most compilers) be protected from user tampering.

FORTH has several unusual features that make it possible to do this kind of extension. First, like LISP, FORTH is a small, simple language with a totally simple structure. FORTH books explain the internal structure of the language and details of the operation of the compiler and interpreter. Second, the designers of FORTH anticipated the desire to extend the rather rudimentary language and included extension primitives, the words CREATE and DOES>, that denote a compiler extension, and the internal data structures to implement them. Finally, FORTH is an interpretive language. The compiler produces an efficient intermediate representation of the code, not native machine code. Control changes from the interpreter to the compiler when the interpreter reaches the ":" at the beginning of a definition, and switches back to the interpreter when the compiler reaches the ";" at the end of the definition. Words are also included that permit one to suspend a compilation in the middle, interpret some code, and return to the compilation. Thus variable declarations, ordinary function definitions, segments of code to be interpreted, and extensions to the compiler can be freely intermixed. The only requirement is that everything be defined before it is used.

New Types.
Unextended, FORTH has three semantic categories, or data types, for items in the dictionary (symbol table): constant, variable, and function. By using the words CREATE and DOES> inside what otherwise looks like a normal function definition, more types can be added. CREATE enters the name of the new type category into the dictionary. Following it must be FORTH code for any compile-time actions that must be taken to allocate and/or initialize the storage for this new type. This compile-time section is terminated by the DOES>, which marks this partial entry as a new semantic category. Finally, the definition includes FORTH code for the semantic routine that should be executed at run time when items in this category are referenced [Exhibit 4.22].

Once a type has been added, the FORTH interpreter can be extended to check the type of a function parameter and dispatch (or execute) one of several function methods, depending on the type. New data types are additional examples of a category that was built into the language. However, type checking was not built into FORTH in any way. When we implement type checking, we add a semantic mechanism to the language that did not previously exist. This is true semantic extension.

4.4. EXTENDING THE SEMANTICS OF A LANGUAGE


Exhibit 4.22. Definition in FORTH of the semantics for arrays.

 0  : 2by3array      ( The ":" marks the beginning of a definition. )
 1    create         ( Compile time actions for type declarator 2by3array. )
 2    2 , 3 ,        ( Store dimensions in the dictionary with the object. )
 3    12 allot       ( Allocate 12 bytes for 6 short integers. )
 4    does>          ( Run time actions to do a subscripted fetch. )
 5    rangecheck     ( Function call to check that both subscripts are )
 6                   ( within the legal range. )
 7    linearsub      ( Function call to compute the effective memory )
 8                   ( address, given base address of array and subscripts. )
 9  ;                ( End of data type definition. )
10
11  2by3array box    ( Declare and allocate an array variable named box. )
12  10 1 2 box !     ( Store the number 10 in box[1,2]. )

Program Notes. Comments are enclosed in parentheses. The definition of the new type declarator goes from line 0 to line 9. The word , stores the prior number in the dictionary. Lines 5 and 7 are calls on the functions rangecheck and linearsub, which the programmer must define and compile before this can be compiled. Linearsub must leave its result, the desired memory address, on the stack.

Line 11 declares a 2by3array variable named box. When this line is compiled, the code on lines 2 and 3 is run to allocate and initialize storage for the new array variable. Line 12 puts the value 10 on the stack, then the subscripts 1 and 2. When the interpreter processes the reference to box, the semantic routine for 2by3array (lines 5-8) is executed. This checks that the subscripts are within range, then computes a memory address and leaves it on the stack. Finally, that address is used by ! to store the 10 that was put on the stack earlier. ! is the assignment operation. It expects a value and an address to be on the stack and stores the value in that address.


Adding a new control structure. CREATE and DOES> provide semantic extension without corresponding syntactic extension. They permit us to extend the data-structuring capabilities of the language but not to add things like new loops that would require modifying the syntax. To the extent that the FORTH compiler's code is open and documented, though, the clever programmer can even extend the syntax in a limited way. We have code that adds a BREAK instruction to exit from the middle of a FIG FORTH loop. This code uses a compiler variable that contains the address of the end of the loop during the process of compiling the loop.

The code for BREAK cannot be added to FORTH 83. Many compiler variables that were documented in FIG FORTH are kept secret in the newer FORTH 83. These machine- and implementation-dependent things were taken out of the language documentation in order to increase the portability of programs written in FORTH, and the portability of the FORTH translator itself. Providing no documentation about the internal operation of the compiler prevents the syntax from being extended.

Exercises
1. Briefly define EBNF and syntax diagrams. How are they used, and why are they necessary?

2. Describe the compilation process from source code to object code.

3. Consider the following EBNF syntax. Rewrite this grammar as syntax diagrams.

   sneech ::= * | ( ( sneech ) ) | [ bander ] * sneech
   bander ::= { +$+ | # } | ( % bander )

4. Which of the following sentences are not legal according to the syntax for sneeches, given in question 3? Why?

   a. (*)       f. #####**
   b. (+$+*)    g. (+$+#
   c. *         h. +$+#*
   d. *****     i. *+$+#
   e. %%%**     j. %#*+$+**

5. Rewrite the following syntax diagrams as an EBNF grammar.


   (The syntax diagrams for blit, slit, and grit, which use the symbols #, Y, B, N, O, and %, cannot be reproduced in this text-only version.)

6. What is the difference between a terminal and a nonterminal symbol in EBNF?

7. What is a production? How are alternatives denoted in EBNF? Repetitions?

8. Using the production rules in Exhibit 4.4, generate the program called easy which has two variables: a, an integer, and b, a real. The program initializes a to 5. Then b gets the result of multiplying a by 2. Finally, the value of b is written to the screen, followed by a new line.

9. What are the EBNF productions for the conditional statement in Pascal? Show the corresponding syntax diagrams from a standard Pascal reference.

10. Show the syntax diagram for the For statement in Pascal. List several details of the meaning of the For that are not defined by the syntax diagram.

11. What are semantics?

12. What is the difference between a program environment and a shared environment?

13. What is a stream?

14. Why is lambda calculus relevant in a study of programming languages?

15. Show the result of substituting u for x in the following applications. Rename bound variables where necessary.

   a. ((λx.λy.x)u)
   b. ((λx.λy.z)u)
   c. (([Link])u)
   d. (([Link])u)


16. Each item below is a lambda application. We have used a lot of parentheses to help you parse the expressions. Reduce each formula until no redex remains. One of the items requires renaming of a bound variable.

   a. ((λx.λy.x(xy))(pq)q)
   b. ((λx.λy.y)(pq)q)
   c. ((λz.([Link]))([Link]))
   d. ((λx.λy.y(xy))([Link])q)

17. Verify the following equality. Start with the left-hand side and substitute the formula for twice. Then reduce the formula until it is in normal form. This may look like a circular reduction, but the formula reaches normal form after eight reduction steps.

   Let twice = λf.λx.f(f x). Show that twice twice g z = g(g(g(g z))).

   Hints: Write out the formula for twice only when you are using it as a function; keep arguments in symbolic form. Each time you write out twice, use new names for the bound variables. Be careful of the parentheses. Remember that function application associates to the left.

18. Show that 3 is the successor of 2, using the lambda calculus representations defined for integers.

19. Define the function plus2 using a lambda formula. Demonstrate that your formula works by applying it to the formula for 1.

20. Construct a lambda formula to express the following conditional expression. (Assume that x is a Boolean value, T or F.) Verify the correctness of your expression by applying it to T and F and reducing to get 0 or 2.

   If x is true then return 0 else return 2.

21. How do EL/1 and FORTH allow the semantics of the languages to be extended?

Part II

Describing Computation


Chapter 5

Primitive Types

Overview
This chapter explains the concept of types within programming languages and the hardware that supports these types. Computer memory is an array of bits usually grouped into addressable 8-bit segments called bytes. Words are groups of bytes, usually 2, 4, and sometimes 8 bytes long. All data types in programming languages must be mapped onto the bytes and words of the machine. Logical computer instructions operate on bytes and words, but other instructions operate on objects that are represented by codes which are superimposed on bit strings. Common codes include ASCII, EBCDIC, binary integer, packed decimal, and floating point.

A data type is an abstraction: a description of a set of properties independent of any specific object that has those properties. A previously defined type is referred to by a type name. A type description identifies the parts of a nonprimitive type. A specific type is a homogeneous set of objects, while a generic type is a set that includes objects of more than one specific type. Each type is a set of objects with an associated set of functions. A type defines the representation for program objects. Several attributes are defined by the type of an object, including encoding, size, and structure.

Every language supports a set of primitive data types. Usually these include integer, real, Boolean, and character or string. A language standard determines the minimum set of primitive types that the language compiler must implement. Hardware characteristics influence which types a language designer chooses to make primitive. If the hardware does not support a required type, that type may have to be emulated, that is, implemented in the software.

Type declarations have a long history, going back to the earliest languages, which supported primitive types built into the instruction set of the host machine. By the late 1960s, types were recognized as abstractions. Type declarations defined value constructors, selectors, and implicit type predicates. In the 1970s, with the emergence of Ada, types were treated as objects in a limited way. There were cleaner type compatibility rules, support for type portability, and explicit constraint on the values of a type. Recent research includes the issues of type hierarchies with inherited properties and the implementation of nonhomogeneous types in a semantically sound way.

5.1 Primitive Hardware Types

A translator must map a programmer's objects and operations onto the storage and instructions provided by the computer hardware. To understand the primitive types supported by languages, one should also understand the hardware behind those types.

5.1.1 Bytes, Words, and Long Words

Computer memory is a very long array of bits, normally organized into groups.1 Each group has an address, used to store and fetch data from it. Modern machines usually have 8-bit bytes and are byte addressable. Bytes are grouped into longer 2- and 4-byte units called words and long words. Some machines have a few hardware instructions that support double word, or 8-byte, operations. Bytes and words form the basis for all representation and all computation. They are the primitive data type onto which all other data types must be mapped.

Computer instruction sets include some instructions that operate on raw, uninterpreted bytes or words. These are called logical instructions. They include right and left shifts, and bitwise complement, and, or, and exclusive-or (xor) operations. Most computer instructions, though, are intended to operate on objects other than bit strings, such as numbers or characters. All objects must be represented by bit strings, but they have semantics above and beyond the bits that represent them. These objects are represented by codes that are superimposed on the bit strings. Common encodings include ASCII, EBCDIC, binary integer, packed decimal, and floating point.

5.1.2 Character Codes

Long before IBM built computers, it built unit record equipment, which processed data recorded on punched cards. Keypunches were used to produce this data, and line printers could copy cards to fanfold paper. Tabulating machines were used to process the data. These had plug boards,
1. The Burroughs memory of the B1700/B1800 series of computers was an undivided string of bits that was actually bit-addressable.


on which a skilled person could build programs by using plug-in cables to connect holes that represented card columns to holes that represented functions such as + and −.2

Punched cards were in common use before computers were invented and quite naturally became the common input medium for computers. The Hollerith character code, used for punched cards, was adapted for use in computers and called Binary Coded Decimal, or BCD. Hollerith code was a decimal code. It used one column with twelve punch positions to represent each digit. (These positions were interpreted as +, −, 0...9.) Alphabetic letters were represented as pairs of punches, one in the zone area (+, −, 0) and one in the digit area (1..9). This gives 27 combinations, which is one too many for our alphabet, and the 0:1 punch combination was not used for any letter. The tradition was that this combination was omitted from the alphabet because the two closely spaced punches made it physically weak. However, this combination was used to represent /. Thus the alphabet had a nonalpha character in its middle.

The entire Hollerith character set had no more than 64 codes. Letters and digits accounted for 36 of these; the rest were other punctuation and control characters and were represented by double or triple punches. The BCD code used sixty-four 6-bit codes to represent this character set.

The BCD character code was reflected in various ways in the computer hardware of the 1950s and early 1960s. It has always been practical to make word size a multiple of the character code size. Hardware was built with word lengths of 24, 36, 48, and 60 bits (making 4, 6, 8, and 10 characters per word). Floating-point encoding was invented for the IBM 704; its 36-bit words were long enough to provide adequate range and precision. Software, also, showed the effects of this character code. FORTRAN was designed around this severely limited character set (uppercase only, very few available punctuation symbols).
FORTRAN identifiers were limited to six characters because that is what would fit into one machine word on the IBM 704. COBOL implemented numeric data input formats that were exactly like Hollerith code. If you wanted to input the number −371, you punched only three columns and put the sign over the rightmost, giving the number code 37J. The number +372 was encoded as 37B. This wholly archaic code is still in use in COBOL today and is increasingly difficult to explain and justify to students.

Dissatisfaction with 6-bit character codes was rampant; sixty-four characters are just not enough. People, reasonably, wanted to use both upper- and lowercase letters, and language designers felt unreasonably restricted by the small set of punctuation and mathematical symbols that BCD provided. Two separate efforts in the early 1960s produced two new codes, EBCDIC (Extended BCD Interchange Code) and ASCII (American Standard Code for Information Interchange).

EBCDIC was produced and championed by IBM. It was an 8-bit code, but many of the 256 possible bit combinations were not assigned any interpretation. Upper- and lowercase characters were included, with ample punctuation and control characters. This code was an extension of BCD; the old BCD characters were mapped into EBCDIC in a systematic way. Certainly, that made compatibility with old equipment less of a problem.
2. These were archaic in the early 1960s, but a few early computer science students had the privilege of learning to use them.


Unfortunately, the EBCDIC code was not a sensible code, because the collating sequence was not normal alphabetical order.3 Numbers were greater than letters, and like BCD, alphabetic characters were intermingled with nonalphabetic characters.

ASCII code grew out of the old teletype code. It uses seven bits, allowing 128 characters. Upper- and lowercase letters, numerals, many mathematical symbols, a variety of useful control characters, and an escape are supported. The escape could be used to form 2-character codes for added items.4 ASCII is a sensible code; it follows the well-established English rules for alphabetization. It has now virtually replaced EBCDIC, even on IBM equipment.

An extended 8-bit version of ASCII code is now becoming common. It uses the additional 128 characters for the accented and umlauted European characters, some graphic characters, and several Greek letters and symbols used in mathematics. Hardware intended for the international market supports extended ASCII.

5.1.3 Numbers

We take integers and floating-point numbers for granted, but they are not the only ways, or even the only common and useful ways, to represent data. In the late 1950s and early 1960s, machines were designed to be either scientific computers or business computers. The memory of a scientific computer was structured as a sequence of words (commonly 36 bits per word), and its instruction set performed binary arithmetic. Instructions were fixed length and occupied one word of memory.

Packed Decimal. The memory of a business computer was a series of BCD bytes with an extra bit used to mark the beginning of each variable-length word. Objects and instructions were variable length. Numbers were represented as a series of decimal (BCD) digits. Arithmetic was done in base ten, not base two.

The distinction between scientific and business computers profoundly affected the design of programming languages. COBOL, a business language, was oriented toward variable-length objects and supported base ten arithmetic. In contrast, FORTRAN was a scientific language. Its data objects were one word long, or arrays of one-word objects, and computation was done either in binary or in floating point. Characters were not even a supported data type.

In 1964, IBM introduced a family of computers with innovative architecture, intended to serve both the business and scientific communities.5 The memory of the IBM 360 was byte-addressable. The hardware had general-purpose registers to manipulate byte, half-word (2-byte), and word (4-byte) sized objects, plus four 8-byte registers for floating-point computation. The instruction
3. The collating sequence of a code is the order determined by the < relationship. To print out a character code in collation order, start with the code 00000000, print it as a character, then add 1 and repeat, until you reach the largest code in the character set.
4. It is often used for fancy I/O device control codes, such as "reverse video on".
5. Gorsline [1986], p. 317.


Exhibit 5.1. Packed-decimal encoding.
A number is a string of digits. The sign may be omitted if the number is positive, or represented as the first or last field of the string. Each decimal digit is represented by 4 bits, and pairs of digits are packed into each byte. The string may be padded on the left to make the length even. The code 0000 represents the digit 0, and 1001 represents 9. The six remaining possible bit patterns, 1010 . . . 1111, do not represent legal digits.

set supported computation on binary integers, floating point, and integers represented in packed decimal with a trailing sign [Exhibit 5.1].

Many contemporary machines support packed-decimal computation. Although the details of packed-decimal representations vary somewhat from machine to machine, the necessary few instructions are included in the Intel chips (IBM PC), the Motorola chips (Apollo workstations, Macintosh, Atari ST), and the Data General MV machines.

Packed-decimal encoding is usually used to implement decimal fixed-point arithmetic. A decimal fixed-point number has two integer fields, one representing the magnitude, the other the scale (the position of the decimal point). The scale factors must be taken into account for every arithmetic operation. For instance, numbers must be adjusted to have the same scale factors before they can be added or subtracted. Languages such as Ada and COBOL, which support fixed-point arithmetic, do this adjustment for the programmer.6

Base two arithmetic is convenient and fast for computers, but it cannot represent most base ten fractions exactly. Furthermore, almost all input and output is done using base ten character strings. These strings must be converted to/from binary during input/output. The ASCII to floating-point conversion routines are complex and slow.

Arithmetic is slower with packed decimal than with binary integers because packed-decimal arithmetic is inherently more complex. Input and output conversions are much faster; a packed-decimal number consists of the last 4 bits of each ASCII or EBCDIC digit, packed two digits per byte. Arithmetic is done in fixed point; a specified number of digits of precision is maintained, and numbers are rounded or truncated after every computation step to the required precision. Control over rounding is easy, and no accuracy is lost in changing the base of fractions. In a data processing environment, packed decimal is often an ideal representation for numbers.
Most business applications do more input and output than computation. Some, such as banking
6. Unfortunately, the Ada standard does not require that fixed-point declarations be implemented by decimal fixed-point arithmetic! It is permissible in Ada to approximate decimal fixed-point computation using numbers represented in binary, not base ten, encoding!


Exhibit 5.2. Representable values for signed and unsigned integers.
These are the smallest and largest integer values representable on a two's complement machine.

Type       Length    Minimum          Maximum
Signed     4 bytes   -2,147,483,648   2,147,483,647
Signed     2 bytes   -32,768          32,767
Signed     1 byte    -128             127
Unsigned   4 bytes   0                4,294,967,295
Unsigned   2 bytes   0                65,535
Unsigned   1 byte    0                255

and insurance computations, require total control of precision and rounding during computation in order to meet legal standards. For these applications, binary encoding for integers and floating-point encoding for reals is simply not appropriate.

Binary Integers. Binary numbers are built into modern computers. However, there are several ways that binary numbers can be represented. They can be different lengths (2- and 4-byte lengths being the most common), and be signed or unsigned. If the numbers are signed, the negative values might be represented in several ways.

Unsigned integers are more appropriate than signed numbers for an application that simply does not deal with negative numbers, for example, a variable representing a machine address or a population count. Signed and unsigned numbers of the same length can represent exactly the same number of integers; only the range of representable numbers is different [Exhibit 5.2]. On a modern two's complement machine, unsigned arithmetic is implemented by the same machine instructions as signed arithmetic.

Some languages, for example C, support both signed and unsigned integers as primitive types. Others, for example Pascal and LISP, support only signed integers. Having unsigned integer as a primitive type is not usually necessary. Any integer that can be represented as an unsigned can also be represented as a signed number that is one bit longer. There are only a few situations in which this single bit makes a difference:

• An application where main storage must be conserved and a 1-byte or 2-byte integer could be used, but only if no bit is wasted on the sign.

• An application where very large machine addresses or very large numbers must be represented as integers, and every bit of a long integer is necessary to represent the full range of possible values.


• The intended application area for the language involves extensive use of the natural numbers (as opposed to the integers). By using type unsigned we can constrain a value to be nonnegative, thereby increasing the explicitness of the representation and the robustness of the program.

Unsigned will probably be included as a primitive type in any language whose intended applications fit one of these descriptions. C was intended for systems programming, in which access to all of a machine's capabilities is important, and so supports unsigned as a primitive type.7

Signed Binary Integers. The arithmetic instructions of a computer define the encoding used for numbers. The ADD 1 instruction determines the order of the bit patterns that represent the integers. Most computers count in binary, and thus support binary integer encoding. Most compilers use this encoding to represent integers. Although this is not the only way to represent the integers, binary is a straightforward representation that is easy for humans to learn and understand, and it is reasonably cheap and fast to implement in hardware.8

Large machines support both word and long word integers; very small ones may only support byte- or word-sized integers. On such machines, a compiler writer must use the short-word instructions to emulate arithmetic on the longer numbers required by the language standard. For example, the instruction set on the Commodore 64 supported only byte arithmetic, but Pascal translators for the Commodore implemented 2-byte integers. Adding a pair of 2-byte integers required several instructions; each half was added separately and then the carry was propagated.

Negative Numbers. One early binary integer representation was sign and magnitude. The leftmost bit was interpreted as the sign, and the rest of the bits as the magnitude of the number. The representations for +5 and −5 differed only in one bit. This representation is simple and appealing to humans, but not terrific for a computer.
An implementation of arithmetic on sign-and-magnitude numbers required a complex circuit to propagate carries during addition, and another one to do borrowing during subtraction. CPU circuitry has always been costly, and eventually designers realized that it could be made less complex and cheaper by using complement notation for negative numbers. Instead of implementing + and −, a complement machine could use + and negate. Subtraction is equivalent to negation followed by addition. Negation is trivially easy in one's complement representation: just flip the bits. Thus 00000001 represented the integer 1 and 11111110 represented negative one. A carry off the left end of the word was added back in on the right. The biggest drawback of this system is that zero has two representations, 00000000 (or +0) and 11111111 (or −0).

A further insight occurred in the early 1960s: complement arithmetic could be further simplified by using two's complement instead of one's complement. To find the two's complement of a
7. We must also note that the primitive type byte or bitstring is lacking in C, and unsigned is used instead. While this is semantically unattractive, it works.
8. Other kinds of codes have better error-correction properties or make carrying easier.

Exhibit 5.3. IEEE floating-point formats.

                                     Bit fields in representation
Format Name      Length     Sign   Exponent   Mantissa
Short real       4 bytes    31     30-23      22-0, with implicit leading 1
Long real        8 bytes    63     62-52      51-0, with implicit leading 1
Temporary real   10 bytes   79     78-64      63-0, explicit leading 1

number, complement the bits and add one. The two's complement of 00000001 (representing 1) is 11111111 (representing −1). Two's complement representation has two good properties that are missing in one's complement: there is a unique representation for zero, 00000000, and carries off the left end of a sum can simply be ignored. Two's complement encoding for integers has now become almost universal.

Floating Point. Many hardware representations of floating-point numbers have been used in computers. Before the advent of ASCII code, when characters were 6 bits long, machine words were often 36 or 48 bits long. (It has always been convenient to design a machine's word length to be a multiple of the byte length.) Thirty-six bits is enough to store a floating-point number with a good range of exponents and about eight decimal digits of precision. Forty-eight or more bits allows excellent precision. However, word size now is almost always 32 bits, which is a little too small. In order to gain the maximum accuracy and reasonable uniformity among machines, the IEEE has developed a standard for floating-point representation and computation. In this discussion, we focus primarily on this standard.

The IEEE standard covers all aspects of floating point: the use of bits, error control, and processor register requirements. It sets a high standard for quality. Several modern chips, including the Intel 8087 coprocessor, have been modeled after it.

To understand floats, you need to know both the format and the semantics of the representation. A floating-point number, N, has two parts, an exponent, e, and a mantissa, m. Both parts are signed numbers, in some base. If the base of the exponent is b, then N = m × b^e. The IEEE standard supports floats of three lengths: 4, 8, and 10 bytes. Let us number the bits of a float starting with bit 0 on the right end. The standard prescribes the float formats shown in Exhibit 5.3. The third format defines the form of the CPU register to be used during computation.
Exhibit 5.4 shows how a few numbers are represented according to this standard. The sign bit, always at the left end, is the sign of the entire number. A 1 is always used for negative, 0 for a positive number. The exponent is a signed number, often represented in bias notation. A constant, called the bias, is added to the actual exponent so that all exponent values are represented by unsigned positive numbers. In the case of bias 128, this is like two's complement with the sign bit reversed. The advantage of a bias representation is that, if an ordinary logical comparison is made,


Exhibit 5.4. Floating point on the SparcStation.
The SparcStation is a new RISC workstation built by Sun Microsystems. It has a floating-point coprocessor modeled after the IEEE standard. Using the Sun C compiler to explore its floating-point representation, we find that the C float type is implemented by a field with the IEEE 4-byte encoding. A few numbers are shown here with their representations printed in hex notation and in binary with the implied 1-bit shown to the left of the mantissa.

                              Binary Representation
Decimal   Hex        Sign   Exponent   Mantissa
 0.00     00000000   0      00000000   0.0000000 00000000 00000000
 0.25     3E800000   0      01111101   1.0000000 00000000 00000000
 0.50     3F000000   0      01111110   1.0000000 00000000 00000000
 1.00     3F800000   0      01111111   1.0000000 00000000 00000000
-1.00     BF800000   1      01111111   1.0000000 00000000 00000000
10.00     41200000   0      10000010   1.0100000 00000000 00000000
 5.00     40A00000   0      10000001   1.0100000 00000000 00000000
 2.50     40200000   0      10000000   1.0100000 00000000 00000000
 1.25     3FA00000   0      01111111   1.0100000 00000000 00000000

positive numbers are greater than negative numbers. Absolutely no special provision needs to be made for the sign of the number.

With 8 bits in the exponent, 00000000 represents the smallest possible negative exponent, and 11111111 is the largest positive exponent. 10000000 generally represents an exponent of either zero or one. When interpreted as a binary integer, 10000000 is 128. If this represents an exponent of zero, we say that the notation is bias 128, because 128 − 0 = 128. When 10000000 represents an exponent of one, we say the notation is bias 127, because 128 − 1 = 127. In the IEEE standard, the exponent is represented in bias 127 notation, and the exponent 10000000 represents +1. This can be seen easily in Exhibit 5.4. The representation for 2.50 has an exponent of 10000000. The binary point in 1.010000 must be moved one place to the right to arrive at 10.1, the binary representation of 2.50. Thus 10000000 represents +1.

Floating-point hardware performs float operations in a very long register, much longer than the 24 bits that can be stored in a float. To maintain as many bits of precision as possible, the mantissa is normalized after every operation. This means that the mantissa is shifted to the left, discarding leading 0 bits, until the leftmost bit is a 1. Then when you store the number, all bits after the twenty-fourth are truncated (discarded). A normalized mantissa always starts with a 1 bit; therefore this bit has no information value and can be regenerated by the hardware when needed. So only bits 2-24 of the mantissa are stored, in bits 22-0 of the float number.

The mantissa is a binary fraction with an implied binary point. In the IEEE standard, the point is between the implied 1 bit and the rest of the mantissa. Some representations place the

CHAPTER 5. PRIMITIVE TYPES

Exhibit 5.5. A hierarchy of abstractions.

    Electronic Device
        Computer
            Microcomputer: Apple IIE, TRS-80, IBM PC
            Minicomputer: DG MV8000
        Microphone
        Microwave Oven: Amana, Litton

binary point to the left of the implied 1 bit. These interpretations give the same precision but dierent ranges of representable numbers.

5.2 Types in Programming Languages

5.2.1 Type Is an Abstraction

An abstraction is the description of a property independent from any particular object which has that property. Natural languages contain words that form hierarchies of increasing degrees of abstraction, such as TRS-80, microcomputer, computer, and electronic device [Exhibit 5.5]. TRS-80 is itself an abstraction, like a type, describing a set of real objects, all alike. Most programming language development since the early 1980s has been aimed at increasing the ability to express and use abstractions within a program. This work has included the development of abstract data types, generic functions, and object-oriented programming. We consider these topics briefly here and more extensively later.

A data type is an abstraction: it is the common property of a set of similar data objects. This property is used to define a representation for these objects in a program. Objects are said to have, or to be of, the type to which they belong. Types can be primitive, defined by the system implementor, or they can be programmer defined. We refer to a previously defined type by using a type name. A type declaration defines a type name and associates a type description with it, which identifies the parts of a nonprimitive type [Exhibit 5.6]. The terms type and data type are often used loosely; they can refer to the type name, the type description, or the set of objects belonging to the type.

If all objects in a type have the same size, structure, and semantic intent, we call the type concrete or specific. A specific type is a homogeneous set of objects. All the primitive types in Pascal are specific types, as are Pascal arrays, sets, and ordinary records made out of these basic types. A variant record in Pascal is not a specific type, since it contains elements with different structures and meanings.


Exhibit 5.6. Type, type name, and type description.

Real-world objects: a set of rectangular boxes.

Type: We will represent a box by three real numbers: its length, width, and depth.

Type name declared in Pascal: TYPE box_type =

Possible type descriptions in Pascal:
    ARRAY [1..3] OF real
    RECORD length, width, depth: real END

A generic domain is a set that includes objects of more than one concrete type [Exhibit 5.7]. A specific type that is included in a generic domain is called an instance or species of the generic domain, as diagrammed in Exhibit 5.8. Chapters 15 and 17 explore the subjects of type abstraction and generic domains.

5.2.2 A Type Provides a Physical Description

The properties of a type are used to map its elements onto the computer's memory. Let us focus on the different attributes that are part of the type of an object. These include encoding, size, and structure.

Exhibit 5.7. Specific types and generic domains.

Specific types:
    Integer arrays of length 5
    Character arrays of length 10
    Real numbers
    Integer numbers

Generic domains:
    Intarray: The set of integer arrays, of all lengths.
    Number: All representations on which you can do arithmetic, including floating point, integer, packed decimal, etc.


Exhibit 5.8. Specific types are instances of generic domains. The generic domain Number has several specific subtypes, including Real, Integer, and Complex. Objects (variables) have been declared that belong to these types. Objects named V, W, and X belong to type Real; objects J and K belong to type Integer; and C belongs to type Complex. All six objects also belong to the generic domain Number.

    Number
        Real: V, W, X
        Integer: J, K
        Complex: C

Encoding. The instruction set of each machine includes instructions that do useful things on certain encodings (bit-level formats) of data. For example, the Data General MV8000 has instructions that perform addition if applied to numbers encoded with 4 bits per decimal digit. Because of this built-in encoding, numbers can be conveniently represented in packed-decimal encoding in Data General COBOL. Where an encoding must be implemented that is not directly supported by the hardware, the implementation tends to be inefficient.

Size. The size of an object can be described in terms of hardware quantities such as words or bytes, or in terms of something meaningful to the programmer, such as the range of values or the number of significant digits an object may take on.

Structure. An object is either simple or it is compound. A simple object has one part with no subparts. No operators exist within a language that permit the programmer to decompose simple objects. In a language that has integer as a simple type, integer is generally undecomposable. In standard Pascal, integers are simple objects, as are reals, Booleans, and characters. In various Pascal extensions, though, an integer can be decomposed into a series of bytes. In these dialects integer is not a simple type. Primitive types may or may not be simple. In both cases, integer is a primitive type; that is, it is a predefined part of the language.

A compound object is constructed of an ordered series of fields of specific types. A list of these fields describes the structure of the object. If the fields of the compound object all have the same type, it is a homogeneous compound. These are commonly called array, vector, matrix,


or string. The dimensions of an array and its base type (the type of its elements) define its structure. If the fields of a compound object have different types, it is a heterogeneous compound. These are commonly called records or structures. An ordered list of the types of each field of a record defines its structure.

The distinctions among structure, encoding, and size are seen most clearly in COBOL, where these three properties are specified separately by the programmer.

Structure in COBOL. The internal structure of each data object is defined by listing its fields and subfields, in order. The subfields of a field are listed immediately following the field and given higher level numbers to indicate that they are subordinate to it.

Encoding in COBOL. Character data has only one encoding: the character code built into the machine hardware. Depending on the compiler, several encodings may be provided for numbers, with DISPLAY being the default. The COBOL programmer may specify the encoding in a USAGE clause. In Data General COBOL, the programmer can choose from the following set:

    DISPLAY          ASCII or EBCDIC characters
    COMPUTATIONAL    binary fixed point
    COMP-2           packed binary-coded-decimal fixed point
    COMP-3           floating point

Double-precision encoding is also provided in some COBOL implementations. Each encoding has inherent advantages, which must be understood by the programmer. Input and output require operands of DISPLAY usage. Arithmetic can be done on all usages except DISPLAY. The most efficient numeric I/O conversion is between DISPLAY and COMP-2. The most efficient arithmetic is done in COMPUTATIONAL.

Conversion from one encoding to another is performed automatically when required in COBOL. If a numeric variable does not have the default usage, DISPLAY, conversion is performed during the input and output processes, as in most languages. If a numeric variable represented in DISPLAY usage is used in an arithmetic statement, it will be converted to packed decimal. (This conversion is fast and efficient.) The arithmetic will be done in packed-decimal encoding, and the result will be converted back to display usage if it is stored in a DISPLAY variable.

Size in COBOL. Size is defined by supplying a PICTURE clause for every field that has no subfields [Exhibit 5.9]. The PICTURE illustrates the largest number of significant characters or decimal digits that will ever be needed to represent the field. Note that the programmer describes the size of the object being represented, not the size, in bytes, of the representation. Different amounts of storage could be allocated for equal size specifications with different encoding specifications.

At the other extreme from COBOL, the language BASIC permits the programmer to specify only whether the object will encode numeric or alphanumeric objects, and to declare the structure of arrays (number of dimensions and size of each). The encoding is chosen by the translator and hidden from the user. Thus BASIC is simpler to use. It frees the programmer from concern about the appropriateness of the encoding. At the same time, it provides no easy or efficient control over


Exhibit 5.9. Size and encoding specifications in COBOL. Three simple variables are defined, named PRICE, DISCOUNT, and ITEM.

    01 PRICE    PICTURE 999V99.
    01 DISCOUNT PICTURE V999 USAGE COMPUTATIONAL.
    01 ITEM     PICTURE XXXX.

PRICE has a numeric-character encoding, indicated by the 9s in the PICTURE clause and the absence of a USAGE clause. The size of this variable is defined by the number of 9s given, and decimal position is marked by the V. In this case, the number has two decimal places and is less than or equal to 999.99.

DISCOUNT has binary fixed-point encoding (because of the USAGE clause). Its size is three decimal digits, with a leading decimal point.

ITEM has alphanumeric encoding, indicated by the Xs in its PICTURE. Its size is four characters. Any alphanumeric value of four or fewer characters can be stored in this variable.

precision and rounding. BASIC is thus a better tool for the beginner, but a clumsy tool for the professional.

5.2.3 What Primitive Types Should a Language Support?

The usual set of primitive data types in a language includes integer, real, Boolean, and character or string. However, Ada has many more and BASIC has fewer. A language standard determines the minimum set of primitive types that must be implemented by a compiler. Choosing this set is the job of the language designer. A language implementor may choose to support additional types, however. For example, Turbo Pascal supports a type string that is not required by the standard. The string type is a language extension.

The decision to make a type primitive in a computer language is motivated by hardware characteristics and the intended uses of the language. Compromises must often be made. A language designer must decide to include or exclude a type from the primitive category by considering the cost of implementing and using it [Exhibit 5.10] as opposed to the cost of not implementing it [Exhibit 5.11].

Types that are not primitive sometimes cannot be implemented efficiently, or even implemented at all, by the user. For example, the ANSI C standard does not support packed-decimal numbers. A user could write his or her own packed-decimal routines in C. To achieve adequate precision the user would probably map them onto integers, not floats. Masking, base 10 addition and multiplication, carrying, and the like could be implemented. However, the lack of efficiency in the finished product would be distressing, especially when you consider that many machines provide efficient hardware instructions to do this operation.

If users are expected to need a certain type frequently, the language is improved by making that type primitive. Packed decimal is not a primitive type in C because the intended usage of C


Exhibit 5.10. Costs of implementing a primitive type.

- Every added feature complicates both the language syntax and semantics. Both require added documentation. If every useful feature were supported, the language would become immense and unwieldy.
- Standardization could become more difficult, as there is one more item about which committee members could disagree. This could be an especially severe problem if a type is complex, its primitive operations are extensive, or it is unclear what the ideal representation should be.
- The compiler and/or library and/or run-time system become more complex, harder to debug, and consume more memory.
- Literals, input and output routines, and basic functions must be defined for every new primitive type.
- If typical hardware does not provide instructions to handle the type, it may be costly and inefficient to implement it. Perhaps programmers should not be encouraged to use inefficient types.

Exhibit 5.11. Costs of omitting a primitive type.

- Inefficiency: failing to include an operation that is supported by the hardware leads to a huge increase in execution time.
- Language structure may be inadequate to support the type as a user extension, as Pascal cannot support variable-length strings or bit fields with bitwise operators.
- Some built-in functions such as READ, WRITE, assignment, and comparison are generic in nature. They work on all primitive types but not necessarily on all user-defined types. If these functions cannot be extended, a user type can never be as convenient or easy to use as a primitive type.
- Primitive types have primitive syntax for writing literals. Literal syntax is often not extensible to user-defined types.


was for systems programming, not business applications. In this case, the cost of not implementing the type is low, and the cost of implementing it is increased clutter in the language.

As another example, consider the string type in Pascal. It was almost certainly a mistake to omit a string manipulation package from the standard language. Alphabetic data is very common, and many programs use string data. The Pascal standard recognizes that strings exist but does not provide a reasonable set of string manipulation primitives. The standard defines a string to be any object that is declared as a packed array[1..n] of char, where n is an integer > 1. String output is provided by Write and Writeln. String comparison and assignment are supported, but only for strings of equal length. Length adjustment, concatenation, and substrings are not supported, and Read cannot handle strings at all. A programmer using Standard Pascal must read alphabetic fields one character at a time and store each character into a character array.

Virtually all implementations of Pascal extend the language to include a full string type with reasonable operations. Unfortunately, these extensions have minor differences and are incompatible with each other. Thus there are two kinds of costs associated with omitting strings from standard Pascal:

1. User implementations of string functions are required. These execute less efficiently than system implementations could.

2. Because programmers use strings all the time, many compilers are extended to support a string type and some string functions. Using these extensions makes programs less portable because the details of the extensions vary from compiler to compiler.

Including strings in the language makes a language more complex. Both the syntax and semantic definitions become longer and require more extensive documentation. The minimal compiler implementation is bigger. In the case of Pascal and strings, none of these reasons justify the omission.
When language designers do decide to include a primitive type, they must extend the language syntax for declarations, but they have some choices about how to include the operations on that type. The meanings of operators such as < are usually extended to operate on elements of the new type. New operators may also be added. Any specific function for the new type may be omitted, added to the language core, or included in a library. The latter approach becomes more and more attractive as the number of different primitive types and functions increases. A modular design makes the language core simpler and smaller, and the library features do not add complexity or consume space unless they are needed.

For example, exponentiation is a primitive operation that is important for much scientific computation. Pascal, C, and FORTRAN all support floating-point encoding but have very unequal support for exponentiation. In Pascal, exponentiation in base 10 is not supported by the standard at all; it must be programmed using the natural logarithm and exponentiation functions (ln and exp). In C, an exponentiation function, pow, is included in the mathematics library along with the trigonometric functions. In contrast, FORTRAN's intended application was scientific computation, and the FORTRAN language includes an exponentiation operator, **, as part of the language core.

5.2.4 Emulation

The types required by a language definition may or may not be supported by the hardware of machines for which that language is implemented. For example, Pascal requires the type real, but floating-point hardware is not included on many personal computers. In such situations, data structures and operations for that type must be implemented in software. Another example: fixed-point arithmetic is part of Ada. This is no problem on hardware that supports packed-decimal encoding, but on a strictly binary machine, an Ada translator must use a software emulation or approximation of fixed-point arithmetic.

The representation for an emulated primitive type is a compromise. On the one hand, it should be as efficient as possible for the architecture of the machine. On the other hand, it should conform as closely as possible to the typical hardware implementation so that programs are portable. The hardware version and the emulation should give the same answers!

When floating point is emulated, the exponent is sometimes represented as a 1-byte integer, and the mantissa is represented by 4 or more bytes with an implied binary point at the left end. This produces an easily manipulated object with good precision. A minimum of shifting and masking is needed when this representation is used. However, it sometimes does not produce the same answers as a 4-byte hardware implementation. Other software emulations try to conform more closely to the hardware. Accurate emulation of floating-point hardware is more difficult and slower, but has the advantage that a program will give the same answers with or without a coprocessor. A good software emulation should try to imitate the IEEE hardware standard as closely as possible without sacrificing acceptable efficiency [Exhibit 5.12].

5.3 A Brief History of Type Declarations

The ways for combining individual data items into structured aggregates form an important part of the semantic basis of any language.

5.3.1 Origins of Type Ideas

Types Were Based on the Hardware. The primitive types supported by the earliest languages were the ones built into the instruction set of the host machine. Some aggregates of these types were also supported; the kinds of aggregates differed from language to language, depending on both the underlying hardware and the intended application area. In these old languages, there was an intimate connection between the hardware and the language.

For example, FORTRAN, designed for numeric computation, was first implemented on the IBM 704. This was the first machine to support floating-point arithmetic. So FORTRAN supported one-word representations of integers and floating-point numbers. The 704 hardware had index registers that were used for accessing elements of an array, so FORTRAN supported arrays.

Exhibit 5.12. An emulation of floating point. This is a brief description of the software emulation of floating point used by the Mark Williams C compiler for the Atari ST (Motorola 68000 chip). Note that it is very similar to, but not quite like, the IEEE standard shown in Exhibits 5.3 and 5.4.

    Bit  31:     Sign
    Bits 30-23:  Characteristic, base 2, bias 128
    Bits 22-0:   Normalized base 2 mantissa, implied high-order 1, binary
                 point immediately to the left of the implied 1.

    Decimal   Hex        Sign   Exponent   Mantissa
     0.00     00000000    0     00000000   .00000000 00000000 00000000
     0.25     3F800000    0     01111111   .10000000 00000000 00000000
     0.50     40000000    0     10000000   .10000000 00000000 00000000
     1.00     40800000    0     10000001   .10000000 00000000 00000000
    -1.00     C0800000    1     10000001   .10000000 00000000 00000000
    10.00     42200000    0     10000100   .10100000 00000000 00000000
     5.00     41A00000    0     10000011   .10100000 00000000 00000000
     2.50     41200000    0     10000010   .10100000 00000000 00000000
     1.25     40A00000    0     10000001   .10100000 00000000 00000000

COBOL was used to process business transactions and was implemented on byte-oriented business machines. It supported aggregate variables in the form of records and tables, represented as variable-length strings of characters. One could read or write entire COBOL records. This corresponded directly to the hardware operation of reading or writing one tape record. One could extract a field of a record. This corresponded to a hardware-level load-register-from-memory instruction. The capabilities of the language were the capabilities of the underlying hardware.

Type was not a separate idea in COBOL. A structured variable was not an example of a structured type; it was an independent object, not related to other objects. The structured variable as a whole was named, as were all of its fields, subfields, and sub-subfields. To refer to a subfield, the programmer did not need to start with the name of the whole object and give the complete pathname to that subfield; it could be referred to directly if its name was unambiguous.

FORTRAN supported arrays, and COBOL supported both arrays (called tables) and records. It would be wrong, though, to say that they supported array or record types, because the structure of these aggregates was not abstracted from the individual examples of that structure. One could use a record in COBOL, and even pass it to a subroutine, but one could not talk about the type of


Exhibit 5.13. Declaration and use of a record in COBOL. We declare a three-level record to store data about a father. If other variables were needed to store information about other family members, lines two through six would have to be repeated. COBOL provides no way to create a set of uniformly structured variables. Field names could be the same or different for a second family member. The prevailing style is to make them different by using a prefix, as in F-FIRST below.

    1 FATHER.
        2 NAME.
            3 LAST        PIC X(20).
            3 F-FIRST     PIC X(20).
            3 F-MID-INIT  PIC X.
        2 F-AGE           PIC 99.

Assume that FATHER is the only variable with a field named F-FIRST, and that MOTHER also has a field named LAST. Then we could store information in FATHER thus:

    MOVE "Charles" TO F-FIRST.
    MOVE "Brown" TO LAST IN FATHER.

Note that the second line gives just enough information to unambiguously identify the field desired; it does not specify a full pathname.

that record. Each record object had a structure, but that structure had no name and no existence apart from the object [Exhibit 5.13].

LISP Introduced Type Predicates. LISP was the earliest high-level language to support dynamic storage allocation, and it pioneered garbage collection as a storage management technique. In the original implementation of LISP, its primitive types, atom and list, were drawn directly from the machine hardware of the IBM 704. An atom was a number or an identifier. A list was a pointer to either an atom or a cell. A cell was a pair of lists, implemented by a single machine word. The 36-bit machine instruction word had four fields: operation code, address, index, and decrement. The address and decrement fields could both contain a machine address, and the hardware instruction set included instructions to fetch and store these fields. Here again we see a close relationship between the language and the underlying hardware. This two-address machine word was used to build the two-pointer LISP cell.

The three fundamental LISP functions, CAR, CDR, and CONS, were based directly on the hardware structure. CAR extracted the address field of the cell, and CDR extracted the decrement field. (Note that the A in CAR and the D in CDR came from address and decrement.) CONS constructed a cell dynamically and returned a pointer to it. This cell was initialized to point at the two arguments of CONS. Note that all LISP allocations were a fixed size: one word. Only one word was ever allocated at a time. However, the two pointers in a cell could be used to link cells together into tree structures


of indefinite size and shape.

The concept of type was more fully developed in LISP than in FORTRAN or COBOL. Types were recognized as qualities that could exist separately from objects, and LISP supported type predicates, functions that could test the type of an argument at run time. Predicates were provided for the types atom and list. These were essential for processing tree structures whose size and shape could vary dynamically.

SNOBOL: Definable Patterns. SNOBOL was another language of the early 1960s. It was designed for text processing and was the first high-level language that had dynamically allocated strings as a primitive type. This was an important step forward, since strings (unlike arrays, records, and list cells) are inherently variable-sized objects. New storage management techniques had to be developed to handle variable-sized objects. Variables were implemented as pointers to numbers or strings, which were stored in dynamically allocated space. Dynamic binding, not assignment, was used to associate a value with a variable. Storage objects were created to hold the results of computations and bound to an identifier. They died when that identifier was reused for the result of another computation. A technique was therefore needed to reclaim dead storage objects periodically. The simplest such technique, called storage compaction, involves identifying all live storage objects and moving them to one end of memory. The rest of the memory then becomes available for reuse.

A second new data type was introduced by SNOBOL: the pattern. Patterns were the first primitive data type that did not correspond at all to the computer hardware. A pattern is a string of characters interspersed with wild cards and function calls. The language included a highly powerful pattern matching operation that would compare a string to a pattern and identify a substring that matched the pattern.
During the matching process, the wild cards would be matched against first one substring and then another, until the entire pattern matched or the string was exhausted.[9]

COBOL permitted the programmer to define objects with complex structured data types but not to refer to those types. LISP provided type predicates but restricted the user to a few primitive types. PL/1 went further than either: its LIKE attribute permitted the programmer to refer to a complex user-defined type. PL/1 was developed in the mid-1960s for the IBM 360 series of machines. It was intended to be the universal language that would satisfy the needs of both business and scientific communities. For this reason, features of other popular languages were merged into one large, conglomerate design. Arithmetic expressions resembled FORTRAN, and pointers permitted the programmer to construct dynamically changing tree structures. Declarations for records and arrays were very much like COBOL declarations. Types could not be declared separately but were created as a side effect of declaring a structured variable. Once a type was created, though, more objects of the same type could be declared by saying they were LIKE the first object [Exhibit 5.14].

[9] Compare this to the pattern matching built into Prolog, Chapter 10, Section 10.4.


Exhibit 5.14. Using the LIKE attribute in PL/1. The design of PL/1 was strongly influenced by COBOL. This influence is most obvious in the declaration and handling of structures. Here we declare a record like the one in Exhibit 5.13. We go beyond the capabilities of COBOL, though, by declaring a second variable, MOTHER, of the same structured type.

    DCL 1 FATHER,
          2 NAME,
            3 LAST CHAR (20),
            3 FIRST CHAR (20),
            3 MID-INIT CHAR (1),
          2 F-AGE PIC '99';
    DCL 1 MOTHER LIKE FATHER;

To create unambiguous references, field names of both MOTHER and FATHER must be qualified by using the variable name:

    MOTHER.LAST = FATHER.LAST;

5.3.2 Type Becomes a Definable Abstraction

By the late 1960s, types were recognized as abstractions: things that could exist apart from any instances or objects. The fundamental idea, developed by C. Strachey and T. Standish, is that a type is a set of constructors (to create instances), selectors (to extract parts of a structured type), and a predicate (to test type identity). Languages began to provide ways to define, name, and use types to create homogeneous sets of objects.

ALGOL-68 and Simula were developed during these years. Simula pioneered the idea that a type definition could be grouped together with the functions that operate on that type, and objects belonging to the type, to form a class. Thus Simula was the first language to support type modules and was a forerunner of the modern object-oriented languages.[10]

ALGOL-68 contained type declarations and very carefully designed type compatibility rules. The type declarations defined constructors (specifications by which structured variables could be allocated), selectors (subscripts for arrays and part names for records), and implicit type predicates. Type identity was the basis for extensive and carefully designed type checking and compatibility rules. Some kinds of type conversions were recognized to be (usually) semantically valid, and so were supported. Other type relationships were seen as invalid. The definition of the language was immensely complex, partly because of the type extension and compatibility rules, and partly because the design goal was super-generality and power.

[10] See Chapter 17 for a discussion of object-oriented languages.

Reactions to Complexity: C and Pascal. Two languages, C and Pascal, were developed at this time as reactions against the overwhelming size and complexity of PL/1 and ALGOL-68. These were designed to achieve the maximum amount of power with the minimum amount of complexity.

C moved backwards with respect to type abstractions. The designers valued simplicity and flexibility of the language more than its ability to support semantic validity. They adopted type declarations as a way to define classes of structured objects but omitted almost all use of types to control semantics. C supported record types (structs and unions) with part selectors, and arrays with subscripting. Record types were full abstractions; they could be named, and the names used to create instances, declare parameters, and select subfields. Arrays, however, were not fully abstracted, independent types; array types could not be named and did not have an identity that was distinct from the type of the array elements.

The purpose of type declarations in C was to define the constructors and selectors for a new type. The declaration supplied information to the compiler that enabled it to allocate and access compound objects efficiently. The field names in a record were translated into offsets from the beginning of the object. The size of the base type of an array became a multiplier, to be applied to subscripts, producing an offset. At run time, when the program selected a field of the compound, the offset was added to the address of the beginning of the compound, giving an effective address.

At this period of history, types were not generally used as vehicles for expressing semantic intent. Except for computing address offsets, there were very few contexts in early C in which the type of an object made a difference in the code the translator generated.[11] Type checking was minimal or nonexistent. Thus C type declarations did not define type predicates. Type identity was not, in general, important.
The programmer could not test it directly, as was possible in LISP, nor was it checked by the compiler before performing function calls, as it is in Pascal. 12 Niklaus Wirth, who participated in the ALGOL-68 committee for some time, designed Pascal to prove that a language could be simple, powerful, and semantically sound at the same time. 13 Pascal retained both the type declarations and type checking rules of ALGOL-68 and achieved simplicity by omitting ALGOL-68s extensive type conversion rules. The resulting language is more restrictive than C, but it is far easier to understand and less error prone. Ada: The Last ALGOL-Like Language? In the late 1960s, the U.S. Department of Defense (DoD) realized that the lack of a common computer language among its installations was becoming a major problem. By 1968, small-scale research eorts were being funded to develop a core language that could be extended in various directions to meet the needs of dierent DoD groups. Design goals included generality of the core language, extensibility, and reasonable eciency.
11. The exception was automatic conversions between numeric types in mixed expressions.
12. Type checking and the semantic uses of types are discussed at length in Chapter 15.
13. It is said that he never dreamed that Pascal would achieve such widespread use as a teaching language.

5.3. A BRIEF HISTORY OF TYPE DECLARATIONS


In the early 1970s, DoD decided to strictly limit the number of languages in use and to begin design of one common language. A set of requirements for this new language was developed by analyzing the needs of various DoD groups using computers. Finalized in 1976, these requirements specified that the new language must support modern software engineering methods, provide superior error checking, and support real-time applications. After careful consideration, it was decided that no existing language met these criteria. Proposals were sought in 1977 for an ALGOL-like language design that would support reliable, maintainable, and efficient programs. Four proposals were selected, from seventeen submitted, for further development. One of these prototype languages was selected in 1979 and named Ada. Major changes were made, and a proposed language standard was published in 1980. Ada took several major steps forward in the area of data types. These included:

- Cleaner type compatibility rules.
- Explicit constraints on the values of a type.
- Support for type portability.
- Types treated as objects, in a limited way.

Ada was based on Pascal and has similar type compatibility rules. These rules are an important aid to achieving reliable programs. However, Pascal is an old language, and its compatibility rules have holes; some things are compatible that, intuitively, should not be. Ada partially rectified these problems. The idea of explicit constraints on the values belonging to a type was present in Pascal in the subrange types. In Ada, this idea is generalized; more kinds of constraints may be explicitly stated. These constraints are automatically checked at run time when a value is stored in a constrained variable.14 In the older languages, the range of values belonging to a type often depended on the hardware on which a program ran. A program, debugged on one computer, often ran incorrectly on another.
By providing a means to specify constraints, Ada lets the programmer explicitly state data characteristics so that appropriate-sized storage objects may be created regardless of the default data type sizes on any given machine. A programmer can increase the portability of code substantially by using constrained types. A Pascal type can be used to declare a parameter, but it cannot be a parameter. Ada carries the abstraction of types one step further. Ada supports modules called generic packages. These are collections of declarations for types, data, and functions which depend on type parameters and/or integer parameters. Each type declaration in a generic package defines a generic type, or a family of types, and must be instantiated, or expanded with specific parameters, to produce a specific type declaration during the first phase of compilation.15 Thus although the use of types as parameters is restricted to precompile time, Ada types are objects in a restricted sense.
14. To achieve efficiency, Ada permits this checking to be turned off after a program is considered fully debugged.
15. See Chapter 17.

Recent Developments

CHAPTER 5. PRIMITIVE TYPES

Since the early 1980s, data type research has been directed toward implementing abstract data types, type hierarchies with inherited properties, and implementing nonhomogeneous types in a semantically sound way. These issues are covered in Chapter 17.

Exercises
1. Define: bit, byte, word, long word, double word.
2. What is unique about logical instructions?
3. How are objects represented in the computer? Explain.
4. What is the purpose of computer codes? Explain.
5. What were the dissatisfactions with early computer codes?
6. What was the difference between the memory of a business and a scientific computer?
7. What is packed decimal? How is it used?
8. What is an unsigned number? How is it represented in memory?
9. What is sign and magnitude representation? One's complement? Two's complement?
10. How are negative numbers represented in modern computers?
11. What is a floating-point number? What are the problems associated with the representation of floating-point numbers?
12. Name the computer that you use. For each number below, give the representation (in binary) that is used on your computer.
    a. The largest 32-bit integer
    b. The largest positive floating-point number
    c. Negative one-half, in floating-point
    d. The smallest positive float (closest to zero)
    e. The negative floating-point number with the greatest magnitude

13. Even though FORTH does not contain semantic mechanisms that implement subscripting, array bounds, or variables with multiple slots, it can be said that FORTH has arrays. Explain.
14. What is a data type? Type declaration? Type description?


15. Compare the way that the type of an object is represented in Pascal and in APL. Point out similarities and differences.
16. What is a specific data type? Generic data type?
17. Explain the three attributes of a data type: encoding, size, and structure.
18. What determines the set of primitive data types associated with a language?
19. What is the usual set of primitive types associated with a language?
20. What is type emulation? Why is it needed?
21. How were types supported by the earliest languages? Give a specific example.
22. How is type represented in COBOL? LISP? SNOBOL?
23. Compare the pattern matching in SNOBOL to the database search in Prolog.
24. What is type checking? Type compatibility?
25. What are value constructors? Selectors?
26. What new ideas did Simula pioneer?
27. Why was Ada developed?
28. What major steps in the area of data typing were used in Ada?


Chapter 6

Modeling Objects

Overview
This chapter creates a framework for describing the semantics and implementation of objects so that the semantics actually used in any language can be understood and the advantages and drawbacks of the various implementations can be evaluated. We assume the reader is familiar with the use of objects such as variables, constants, pointers, strings, arrays, and records. When we survey the popular programming languages, we see a great deal of commonality in the semantics of these things in all languages. There are also important differences, sometimes subtle, that cause languages to feel different, or require utterly different strategies for use. A program object embodies a real-world object within a program. The program object is stored in a storage object, a collection of contiguous memory cells. Variables are storage objects that store pure values; pointer variables store references. Initialization and assignment are two processes that place a value in a storage object. Initialization stores a program object in the storage object when the storage object is created. Assignment may be destructive or coherent. Extracting the contents from a storage object is known as dereferencing. Assignment and dereferencing of pointer variables usually yield references to ordinary variables rather than pure values. Managing computer memory involves creating, destroying, and keeping storage objects available. Three strategies are static storage, stack storage, and heap storage.


Exhibit 6.1. Representing objects.


External object: a length of 2" by 4" lumber. Program object: a 32-bit floating-point value. Storage object: a memory location with four consecutive bytes reserved for this number.

External object: a charge account. Program object: a collection of values representing a customer's name, address, account number, billing date, and current balance. Storage object: a series of consecutive memory locations totaling 100 bytes.

6.1 Kinds of Objects

A program is a means of modeling processes and objects that are external to the computer. External objects might be numbers, insurance policies, alien invaders for a video game, or industrial robots. Each one may be modeled in diverse ways. We set up the model through declarations, allocation commands, the use of names, and the manipulation of pointers. Through these, we create objects in our programs, give them form, and describe their intended meaning. These objects are then manipulated by the functions and operators of a language. We start by making a distinction between the memory location in which data is stored and the data itself. The ways of getting data into and out of locations are explored. A program object is the embodiment of an object in the program. It may represent an external object, such as a number or a record, in which case it is called a pure value. It may also represent part of the computer system itself, such as a memory location, a file, or a printer. During execution, the program manipulates its program objects as a means of simulating meaningful processes on the external objects or controlling its own internal operations. It produces usable information from observed and derived facts about the program objects. A program commonly deals with many external objects, each being represented by a pure value program object [Exhibit 6.1]. While all the external objects exist at once, their representing program objects can be passed through the computer sequentially and so do not have to be simultaneously present. For example, an accounting program deals with many accounts. Representations of these accounts are put in some sequence on an input medium and become program objects one at a time. In order to manipulate program objects, the program must generally store all or part of a program object in memory. It uses a storage object for this purpose. A storage object is a collection of contiguous memory cells (bits, bytes, etc.)
in which a program object, called its value or contents, can be stored.1 A reference is the memory address of a storage object and is the handle by which the object
1. A storage object sometimes encompasses more cells than are needed to store the value. These cells, commonly added to achieve word alignment, are called padding.


Exhibit 6.2. Values, variables, and pointers. The relationship between storage objects and program objects is illustrated. Boxes represent storage objects, letters represent pure values of type character, small circles (o) from which arrows emerge represent references, and dotted lines represent dereferencing.

[Diagram: a pointer variable at address 102 contains the reference 136; dereferencing it yields a reference to a variable at address 136, an "array 3 of char", which in turn contains the pure value "CAT", itself a pure value of type "array 3 of char".]
is accessed. In older terminology, a pure value is called an r-value or right-hand-value, because it can occur to the right of an assignment operator. A reference is called an l-value or left-hand-value. A reference is created when a storage object is allocated. This reference is itself a program object and may be stored in another storage object for later use [Exhibit 6.2]. A program must possess a reference to a program object in order to use that object. The allocation process sets aside an area of unused computer memory to make a new storage object. The process is essentially the same whether it is being carried out by an interpreter, which does the allocation at run time when the command is interpreted, or by a compiler, which deals with addresses of the storage objects that will be allocated at some future time when the program is executed. Allocation procedures are usually part of the implementation of a language, not part of the language definition, so the actual allocation process often differs from one translator to the next, as well as from one language to the next. Typically, though, the allocation process will include the following actions:

1. The translator must determine N, the number of bytes of memory that are needed. In some languages the programmer communicates this information by specifying the data type of the new object. Size, in bytes, is calculated by the translator and stored as part of the definition of a type. In lower-level languages the programmer specifies the allocation size explicitly.
2. A segment of free storage is located with length L ≥ N. A reference to the first location in this segment is saved.
3. The address of the beginning of the free storage area is incremented by N, thus removing N

bytes from free storage.


4. If an initial value was defined, it is stored in the new storage object.
5. The address, or reference, saved in step 2 is returned as the result of the allocation process. It is the means by which the program is able to find the new storage object.

A variable is a storage object in which a pure value may be stored. Pure values and the variables in which they are stored have the same size and structure and are considered to be the same data type in many languages. We distinguish between them here because they have very different semantics. Operations you can perform with variables are to allocate and deallocate them and to fetch values from and store values into them. In contrast, pure values can be combined and manipulated with operators and functions, but not allocated and deallocated. A pointer variable is a storage object, or part of a storage object, in which a reference may be stored. (Often this term will be shortened to pointer.) Pointers are used to create storage structures such as game trees and linked lists and are an important means of modeling external objects.

6.2 Placing a Value in a Storage Object

6.2.1 Static Initialization

A storage object receives a value by one of two processes: initialization or assignment. Until a value is stored in a storage object, it is said to contain garbage, or to have an undefined value. (When we wish to indicate an undefined value we will write ?.) Using an undefined value is a commonly made semantic error which generally cannot be detected by a language translator. For this reason some translators initialize all variables to zero, which is the most commonly useful initial value, or to some distinctive bit pattern, so that the semantic error can be more easily detected. It is poor programming practice to depend on such automatic initialization, however. Different translators for the same language may implement different initialization policies, and the program that depends on a particular policy is not portable. Initialization stores a program object in the storage object when the storage object is created. Many languages permit the programmer to include an initializing clause in an object declaration. Typical declaration forms are shown in Exhibits 6.3 and 6.4. In each exhibit, declarations are given for an integer variable, a character string, and an array of real numbers, and initial values are declared for each. Initializing compound objects, such as arrays and records, is restricted or not allowed in some languages. Two problems are involved here: how to denote a structured value, and how to implement initialization of dynamically allocated structured objects. The FORTRAN and C examples [Exhibits 6.3 and 6.4] illustrate two approaches to defining the structure of the initializer. In FORTRAN, the programmer writes an explicit loop or nest of loops which specify the order in which the fields of an array will be initialized and then provides a series of constants that will


Exhibit 6.3. Initial value declarations in FORTRAN.

      CHARACTER*3 EOFLAG
      DIMENSION A (8)
      DATA EOFLAG, ISUM / 'NO ', 0 /, ( A(I), I=1,8) / 8.2, 2.6, 3.1, 17.0, 4 * 0.0 /

Notes: In FORTRAN, simple integers and reals may be declared implicitly. Explicit declarations must be given for arrays and strings. Initial values are given in separate DATA declarations which must follow the statements that declare the storage objects. A single DATA statement can initialize a list of objects. It must contain exactly as many initial values as fields to be initialized. Initial values may be repeated by using a repeat count with a *. An array may be initialized by giving a loop-controlling expression.

Exhibit 6.4. Initial value declarations in C.

static char end_of_file_flag [ ] = "no ";
static int isum = 0;
static float a[8] = {8.2, 2.6, 3.1, 17.0};

Notes: In C an initial value may be given as part of a variable declaration. Static arrays can be initialized by listing the correct number of values for the array enclosed in braces. (The property static is explained in Section 6.3.) The programmer may omit the array length specifier from the declaration, as in the top line, and the length of the storage object will be deduced from the length of the initial value list. If too few initializers are given to fill an array, remaining elements are initialized to zero.


evaluate to the desired initial values. A repetition count can be specified when several fields are to be initialized to the same value. Part or all of an array may be initialized this way. This is a powerful and flexible method, but it does complicate the syntax and semantics of the language. Contrast this to a C initializer. Its structure is denoted very simply by enclosing the initial values in braces, which can be nested to denote a type whose fields are themselves structured types. The same simple syntax serves to initialize both records and arrays. Initializers can be constants or constant expressions; that is, expressions that can be evaluated at compile time. In some ways, this is not as flexible a syntax as FORTRAN provides. If any field of a C object is initialized, then all fields will be initialized. If the same nonzero value is to be placed in several fields, it must be written several times. The one shortcut available is that, if the initializer has too few fields, the remaining fields will default to an initial value of zero. It is likely that the designers of C felt that FORTRAN initializers are too flexible: that they provide unnecessary flexibility, at the cost of unnecessary complication. Applying something akin to the principle of Too Much Flexibility, they chose to include the simpler, but still very useful, form in C. All data storage in FORTRAN is created and initialized at load time. A translator can evaluate the constant expressions in an initializer and generate store instructions to place the resulting values into storage when the program code is loaded. Modern languages, though, support dynamic allocation of local variables in stack frames. (These are called automatic variables in C.) The initialization process for automatic variables is more complex than for static variables. Suppose a function F contains a declaration and initializations for a local array, V. This array cannot be initialized at load time because it does not yet exist.
The translator must evaluate the initializing expressions, store the values somewhere, and generate a series of store instructions to be executed every time F is called. These copy precomputed initial values into the newly allocated area. This process was considered complex enough that the original definition of C simply did not permit initialization of automatic arrays. ANSI C, however, supports this useful facility.

6.2.2 Dynamically Changing the Contents of a Storage Object

Destructive Assignment. In many languages, one storage object can be used to store different program objects at different times. Assignment is an operation that stores a program object into an existing storage object and thus permits the programmer to change the value of a storage object dynamically. This operation is sometimes called destructive assignment because the previous contents of the storage object are lost. The storage object now represents a different external object, and we say that its meaning has changed. Functional languages are an important current research topic. The goal of this research is to build a language with a clean, simple semantic model. Destructive assignment is a problem because it causes a change in the meaning of the symbol that names the storage object. It complicates a formal semantic model considerably to have to deal with symbols that mean different things at


Exhibit 6.5. Initializing and copying a compound object in Pascal. Pascal declarations are given below for a record type named person and for two person-variables, a and b. In Pascal, compound objects cannot be initialized coherently, so three assignments are used to store a record-value into b. On the other hand, records can be assigned coherently, as shown in the last line, which copies the information from b to a.

TYPE person = RECORD age, weight: integer; sex: char END;
VAR  a, b : person;
BEGIN
    b.age := 10;
    b.weight := 70;
    b.sex := 'M';
    a := b;
    ...
END;

different times. In a functional language, parameter binding is used in place of destructive assignment to associate names with objects. At the point that a Pascal programmer would store a computed value in a variable, the functional programmer passes that value as an argument to a function. The actions following the assignment in the Pascal program, and depending on it, would form the body of the function. A series of Pascal statements with assignment gets turned outside in and becomes a nest of function calls with parameter bindings.2 This approach produces an attractive, semantically clean language because the parameter name has the same meaning from procedure entry to procedure exit. Coherent Assignment. An array or a record is a compound object: a whole made up of parts which are objects themselves. Some but not all programming languages permit coherent assignment of compound objects. In such languages an entire compound variable is considered to be a single storage object, and the programmer can refer to the compound object as a whole and assign compound values to it [Exhibits 6.5 and 6.7]. In COBOL any kind of object could be copied coherently. It is even possible to use one coherent READ statement to load an entire data table from a file into memory. In most older languages, though, assignment can only be performed on simple (single-word) objects. An array or a record is considered to be a collection of simple objects, not a coherent large object. The abstract process of placing a compound program object into its proper storage object must be accomplished by a series of assignment commands that store its individual simple components.
2. A deeply nested expression can look like a rat's nest of parentheses; deep nesting is avoided by making many short function definitions.


Exhibit 6.6. Initializing and copying a compound object in K&R C. A record type named person is defined, and two person-variables, a and b, are declared. The variable b is initialized by the declaration and copied into a by the assignment statements. The property static causes the variable to be allocated in the program environment rather than on the stack, so that it can be initialized at load time. K&R C did not support initialization of dynamically allocated structured objects.

typedef struct {int age, weight; char sex;} person;
static person a, b = {10, 70, 'M'};
{  a.age = b.age;  a.weight = b.weight;  a.sex = b.sex; ...}

An example of the lack of coherent assignment can be seen in the original Kernighan and Ritchie definition of C. Coherent assignment was not supported; to copy a record required one assignment statement for each field in the record. Thus three assignments would be required to copy the information from b to a in Exhibit 6.6. However, coherent initialization of record variables was supported, and b could be initialized coherently. Even in languages that support coherent compound assignment, the programmer is generally permitted to assign a value to one part of the compound without changing the others. In such situations, care must always be taken to ensure that a compound storage object is not left containing parts of two different program objects!

Exhibit 6.7. Initializing and copying a compound object in ANSI C. This example is written in ANSI C, which is newer than both K&R C and Pascal. The difference between this and the clumsier versions in Exhibits 6.5 and 6.6 reflects the growing understanding that coherent representations and operations are important. The type and object declarations are the same in both versions of C, as are initializations. But compound objects can be assigned coherently in ANSI C, so only one assignment is required to copy the information from b to a. Further, dynamically allocated (automatic) structs may be initialized in ANSI C.

typedef struct {int age, weight; char sex;} person;
person a, b = {10, 70, 'M'};
{  a = b;  ...}

Exhibit 6.8. Languages where assignment is a statement. A yes in the third column indicates that compound objects (such as arrays and records) may be assigned coherently, as a single action. A yes in the fourth column indicates that one ASSIGN statement may be used to store a value in several storage objects.

Language   Assignment Symbol                  Compound Assignment?   Multiple Assignment?
COBOL      MOVE                               yes                    yes
           = (in a COMPUTE statement)         no                     yes
           ADD, SUBTRACT, MULTIPLY, DIVIDE    no                     yes
FORTRAN    =                                  no                     no
ALGOL      :=                                 no                     no
PL/1       =                                  yes                    yes
FORTH      !                                  no                     no
Pascal     :=                                 yes                    no
Ada        :=                                 yes                    no

Assignment Statements versus Assignment as a Function. Assignment is invoked either by writing an explicit ASSIGN operator or by calling a READ routine. In either case, two objects are involved, a reference and a value. The reference is usually written on the left of the ASSIGN operator or as the parameter to a READ routine, and the value is written on the right of the ASSIGN or is supplied from an input medium. Assignment is one of a very small number of operations that require a reference as an argument. (Others are binding, dereference, subscript, and selection of a field of a record.) The purpose of an assignment is to modify the information in the computer's memory, not to compute a new value. It is the only operation that modifies the value of existing storage objects. For this reason, ASSIGN and READ occur in many languages as statement types or procedures rather than as functions. Exhibit 6.8 lists the symbols and semantics for the ASSIGN statements in several common programming languages. In other languages, ASSIGN is a function that returns a result and may, therefore, be included in the middle of an expression. Exhibit 6.9 shows ASSIGN functions in common programming languages. LISP returns the reference as the result of an assignment. C returns the value, so that it may be assigned to another storage object in the same expression or may be used further in computing the value of an enclosing expression. Exhibit 6.10 demonstrates how one assignment can be nested within another. When ASSIGN returns a value, as in C, a single expression may be written which assigns that value to several storage objects. We call this multiple assignment. While this facility is not essential, it is often useful, especially when several variables need to be zeroed out at once. The same end is achieved in other languages, such as COBOL, by introducing an additional syntactic rule to allow


Exhibit 6.9. Languages where assignment is a function. A yes in the third column indicates that compound objects (such as arrays and records) may be assigned coherently, as a single action.

Language   Assignment Symbol                          Compound Assignment?   Result Returned
LISP       replaca, replacd (also used for binding)   some versions          reference
APL        ←                                          yes                    value
C (1973)   =                                          no                     value
C (ANSI)   =                                          yes                    value

an ASSIGN statement to list references to several storage objects, all of which will receive the single value provided.

6.2.3 Dereferencing

Dereferencing is the act of extracting the contents from a storage object. It is performed by the FETCH operation, which takes a reference to a storage object and returns its value. When a pointer variable is dereferenced, the result is another reference. This could be a reference to a variable, which itself could be dereferenced to get a pure value, or it could be a reference to another pointer, and so forth. Whereas ASSIGN is always written explicitly in a language, its inverse, FETCH, is often invoked implicitly, simply by using the name of a storage object. Many languages (e.g., FORTRAN, Pascal, C, COBOL, BASIC, LISP) automatically dereference a storage object in any context where a program

Exhibit 6.10. Assignment as a function in C. An array length is defined as a constant at the top of the program to facilitate modifications. Then the array ar is declared to have 100 elements, with subscripts from 0 to 99. Two integers are declared and set to useful numbers: num_elements holds the number of elements in the array, and high_sub holds the subscript of the last element.

#define MAXLENGTH 100
float ar[ MAXLENGTH ];
int high_sub, num_elements;
high_sub = (num_elements = MAXLENGTH) - 1;

The last line contains two assignments. The constant MAXLENGTH is stored into the variable num_elements, and it is also returned as the result of the assignment function. This value is then decremented by one, and the result is stored in high_sub.


Exhibit 6.11. Dereferencing by context in Pascal. We analyze the dereferences triggered by evaluating this expression:

    xarray[ point_1^.number ] := eval_function( point_2 );

Assume the objects referenced have the following types:

    xarray:             An array of unspecified type.
    point_1, point_2:   Pointer to a record with a field called number.
    eval_function:      A function taking one pointer parameter and returning something of the correct type to be stored in xarray.

A variety of dereference contexts occur. Contexts (1), (3), and (4) occur together on the left, as do contexts (2) and (5) on the right.

    Reference   Is it dereferenced here?
    xarray      No, it is on the left of a := operator.
    point_1     Yes, explicitly, by the ^ operator. Although this is part of a subscript expression, explicit dereference must be used because pointer variable names are not dereferenced in a pointer expression.
    point_2     You cannot tell from this amount of context. It will not be dereferenced if the function definition specifies that it is a VAR parameter. If VAR is not specified, it will be automatically dereferenced.

object is required. Thus a variable name written in a program sometimes means a reference and sometimes a pure value, depending on context. This introduces complexity into a language. You cannot just see a symbol, as in lambda calculus, and know what it means. You must first examine where it is in the program and how it is used. To define the dereferencing rules of a language, contexts must be enumerated and described. The commonly important contexts are:

1. The left-hand side of an assignment operator.
2. The right-hand side of an assignment operator.
3. Part of a subscript expression.
4. A pointer expression.
5. A parameter in a function or procedure call.

Note that these contexts are not mutually exclusive but can occur in a confusing variety of combinations, as shown in Exhibit 6.11. Many other combinations of dereferencing contexts are, of course, possible. Whether or not a reference is dereferenced in each context varies among languages. In context (1) dereferencing is never done, as a reference is required for an ASSIGN operation. But when a subscript expression (3) occurs in context (1), dereferencing will happen within the subscript part

Exhibit 6.12. Explicit dereferencing in FIG FORTH.


All FORTH expressions are written in postfix form, so you should read and interpret the operators from left to right. The FETCH operator is @. It is written following a reference and extracts the contents from the corresponding storage object. On lines 1 and 2, variables named XX and Y are declared and initialized to 13 and 0, respectively. Line 3 dereferences the variable XX and multiplies its value by 2. The result is stored in Y, which is not dereferenced because a reference is needed for assignment.

1   13 VARIABLE XX
2   0 VARIABLE Y
3   XX @ 2 * Y !     ( Same as Y = XX * 2 in FORTRAN. )

The expression XX 2 * would multiply the address, rather than the contents, of the storage object named XX by 2.

of the expression (the subscripted variable itself will not be dereferenced). In contexts (2) and (3) most languages will automatically dereference, as long as the situation does not also involve context (4). In context (4) languages generally do not dereference automatically. They either provide an explicit FETCH operator or combine dereferencing with other functions. Examples of FETCH operators are the Pascal ^ and the C *. Examples of combined operators are -> in C, which dereferences a pointer and then returns a reference to a selected part of the resulting record, and car and cdr in LISP, which select a part of a record and then dereference it. In context (5), there is no uniformity at all among languages. The particular choices and mechanisms used in various languages are discussed fully in Chapter 8, Section 8.4, and Chapter 9, Section 9.2. There are also languages in which storage objects are never automatically dereferenced, the most common being FORTH. In such languages the dereference command must be written explicitly using a dereference operator (@ in FORTH) [Exhibit 6.12]. The great benefit of requiring explicit dereference is simplicity. A variable name always means the same thing: a reference. Considering the kind of complexity (demonstrated above) that is inherent in deriving the meaning of a reference from context, it is easy to understand the appeal of FORTH's simple method. The drawback of requiring explicit dereference is that an additional symbol must be written before most uses of a variable name, adding visual clutter to the program and becoming another likely source of error because dereference symbols are easily forgotten.

6.2.4 Pointer Assignment

Pointer assignment is ordinary assignment where the required reference is a reference to a pointer variable and the value is itself a reference, usually to an ordinary variable. Languages that support pointer variables also provide a run-time allocation function that returns a reference to the newly

6.2. PLACING A VALUE IN A STORAGE OBJECT


Exhibit 6.13. Pointer assignments in Pascal. We assume the initial state of storage shown in Exhibit 6.14.

TYPE list = ^cell;
     cell = RECORD value: char; link: list END;
VAR  P1, P2, P3: list;

Code               Comments
P2 := P1;          Dereference P1 and store its value in P2.
P3 := P1^.link;    Dereference P1, select its link field, which is a pointer
                   variable, and dereference it. Store the resulting reference
                   in P3.

P1, P2, and P3 all share storage now. We can refer to the field containing the % as P3^.value or as P1^.link^.value. Note that a pointer must be explicitly dereferenced, using ^, before accessing a field of the object to which it points.

allocated storage. This reference is then assigned to a pointer variable, which is often part of a compound storage object. Pointer assignment allows a programmer to create and link together simple storage objects into complex, dynamically changing structures of unlimited size. Multiple pointers may be attached to an object by pointer assignment. The program object of a pointer is a reference to another storage object. When the pointer assignment P2 := P1 is executed, the program object P1, which is a reference to some object, Cell1, is copied into the storage object of P2, thus creating an additional pointer to Cell1 and enabling P2 as well as P1 to refer to Cell1. Thus two objects now store references to one storage object, and we say they share storage dynamically. This is illustrated in Exhibits 6.13 and 6.14. While such sharing is obviously useful, it creates a complex situation in which the contents of the storage structure attached to a name may change without executing an assignment to that name. This makes pointer programs hard to debug and makes mathematical proofs of correctness very hard to construct. Many programmers find it impossible to construct correct pointer programs

Exhibit 6.14. Pointer structures sharing storage. Storage, diagrammed before and after the pointer assignments in Exhibit 6.13.

[Diagram: Before the assignments, P1 points at a chain of three cells containing $, %, and &, while P2 and P3 hold undefined values (?). After the assignments, P2 points at the same first cell as P1, and P3 points at the second cell, the one containing %.]


Exhibit 6.15. Pointer assignments with dereference in C. The right side of the assignment is dereferenced if it evaluates to a structure or a simple object.

typedef struct { int age; float weight; } body;
body s;           /* A variable of type body. */
body *ps, *qs;    /* Two pointers to bodies. */
int k;            /* An integer variable. */
int *p, *q;       /* Two pointers to integers. */

p = &k;     /* Store the address of k in p, that is, make p point at k.
               Note: the & operator prevents automatic dereferencing. */
ps = &s;    /* Make ps point at s. */
q = p;      /* Dereference p to get the address stored in p, and store that
               address in q, making q point at the same thing as p. */
qs = ps;    /* Make qs point at the same thing as ps. */

[Diagram: p and q both point at k, which contains 17; ps and qs both point at s, which contains 37 and 105.2.]

without making diagrams of their storage objects and pointer variables.

6.2.5 The Semantics of Pointer Assignment

There are two likely ways in which a pointer assignment could be interpreted: with and without automatic dereferencing of the right-hand side. Pascal does dereference, as is shown in Exhibit 6.13. In such a language the statement Q := P is legal if P and Q are both pointers. This makes Q point at whatever P is pointing at. The assignment P := K is illegal if P is a pointer and K is an integer. Exhibit 6.15 shows several pointer assignments in C where the right side is dereferenced. In a hypothetical language, := could be defined such that the assignment p := k would be legal and would make p point at k. In this case, pointer assignment is interpreted without dereferencing the right side. In such a language we could create a chain of pointers as follows:

k := 5.4;   -- k is type float.
p := k;     -- p must be type pointer to float.
q := p;     -- q must be type pointer to pointer to float.

These assignments, taken together, would construct a pointer structure like this:


Exhibit 6.16. Pointer assignment without dereference in C. We declare an integer variable, k; integer pointers, p1 and p2; an array of five integers, a; a function that returns an integer, f; and a pointer to a function that returns an integer, p3. The right side of a C assignment is not dereferenced if it refers to an array or a function.

int k, *p1, *p2, a[5], f(), (*p3)();

p1 = &k;      /* Make p1 point at k. */
p2 = a;       /* Make p2 point at the array.  Note absence of "&". */
p2 = &a[0];   /* Make p2 point at the zeroth element of the array.
                 This has the same effect as the line above. */
p3 = f;       /* Store a reference to the function f in pointer p3.
                 Note that f is not dereferenced. */

[Diagram: p1 points at k; p3 points at the machine code for f; p2 points at a[0], the first of the five elements of array a.]


Note that p2 = &a; is syntactically incorrect because the name of an array means the address of its zeroth element. One must either omit the & or supply a subscript.

[Diagram, for the hypothetical assignments above: q points at p, p points at k, and k contains 5.4.]

Exhibit 6.16 shows pointer assignments in C which set pointers to an array and a function. In these contexts, in C, the right side will not be dereferenced. While either interpretation of pointer assignment could make sense, we would expect to see one or the other used consistently in a language. One of the unusual and confusing facets of C is that the semantics of pointer assignment depends on the type of the expression on the right. If it denotes a simple object (such as an integer or a pointer) or an object defined as a struct, automatic dereferencing is used [Exhibit 6.15]. If the right-hand object is an array or a function, the second meaning, without dereferencing, is implemented [Exhibit 6.16].


6.3 The Storage Model: Managing Storage Objects

The differences among languages are easier to understand when the underlying mechanisms are known. A key part of any translator is managing the computer memory; storage objects must be created, kept available, and destroyed when appropriate. Three storage management strategies are in common use, with all three present in some translators but only one in others. These are static storage and two kinds of dynamic storage: stack storage and heap storage.

6.3.1 The Birth and Death of Storage Objects

A storage object is born when it is allocated, and it dies when it is no longer available for use by the program. The lifetime, or extent, of a storage object is the span of time from its birth to its death. An object that lives until the program is terminated is immortal. Most objects, however, die during program execution. It is a semantic error to attempt to reference a storage object after it has died. The run-time system will typically reuse the formerly occupied storage for other purposes, so references to a dead object will yield unpredictable results. Deallocation is the recycling process by which dead storage objects are destroyed and the storage locations they occupied are made available for reuse by the allocation process. Deallocation happens sometime, often not immediately, after death. All live objects must be simultaneously present in the computer's virtual memory. Real computers have limited memory, so it is important that the lifetimes of objects correspond to the period of time during which they are actually needed by the program. By having an object die when it is no longer useful, we can recycle the storage it formerly occupied. This enables a program to use a larger number of storage objects than would otherwise fit into memory.

Static Storage Objects

A compiler plans what storage objects will be allocated to a program at load time, when the object code is copied into computer memory, linked, and made ready to run. Such objects are allocated before execution begins and are immortal. These are called static storage objects because they stay there, unmoved, throughout execution. Static allocation is often accompanied by initialization. The compiler chooses run-time locations for the static objects and can easily put initial values for these locations into the object code. The number of static storage objects in a program is fixed throughout execution and is equal to the number of static names the programmer has used.
Global variables are static in any language. Some languages (for example, COBOL) have only static objects, while others (for example, Pascal) have no static storage except for globals. Still others (ALGOL, C) permit the programmer to declare that a nonglobal object is to be static. In ALGOL, this is done by specifying the attribute OWN as part of a variable declaration. In C, the keyword static is used for this attribute. A language with only static storage is limiting. It cannot support recursion, because storage must be allocated and exist simultaneously for the parameters of a dynamically variable number of calls on any recursive function.


A language that limits static storage to global variables is also limiting. Many complex applications can best be modeled by a set of semi-independent functions. Each one of these performs some simple, well-defined task such as filling a buffer with data or printing out data eight columns per line. Each routine needs to maintain its own data structures and buffer pointers. Ideally, these are private structures, protected from all other routines. These pointers cannot be ordinary local variables, since the current position on the line must be remembered from one call to the next, and dynamically allocated variables are deallocated between calls. On the other hand, these pointers should not be global, because global storage is subject to accidental tampering by unrelated routines. The best solution is to declare these as static local storage, which simultaneously provides both continuity and protection. Finally, the unnecessary use of static objects, either global or local, is unwise because they are immortal. Using them limits the amount of storage that can be recycled, thereby increasing the overall storage requirements of a program.

Dynamic Storage Objects

Storage objects that are born during execution are called dynamic. The number of dynamic storage objects often depends on the input data, so the storage for them cannot be planned by the compiler in advance but must be allocated at run time. The process of choosing where in memory to allocate storage objects is called memory management. A memory manager must be sure that two storage objects that are alive at the same time never occupy the same place in memory. It should also try to use memory efficiently so that the program will run with as small an amount of physical memory as possible. Memory management is a very difficult task to do well, and no single scheme is best in all circumstances. The job is considerably simplified if the memory manager knows something in advance about the lifetimes of its storage objects.
For this reason, languages typically provide several different kinds of dynamic storage objects which have different lifetime patterns. The simplest pattern is a totally unrestricted lifetime. Such an object can be born and die at any time under explicit control of the programmer. Nothing can be predicted about the lifetimes of these objects, which are generally stored in an area of memory called the heap. Whenever a new one is born, the storage manager tries to find a sufficiently large unused area of heap memory to contain it. Whenever the storage manager learns of the death of a heap object, it takes note of the fact that the memory is no longer in use. There are many problems in recycling memory. First of all, the blocks in use may be scattered about the heap, leaving many small unused holes instead of one large area. If no hole is large enough for a new storage object, then the new object cannot be created, even though the total size of all of the holes is more than adequate. This situation is called memory fragmentation. Second, a memory manager must keep track of the holes so that they can be located when needed. A third problem is that two or more adjacent small holes should be combined into one larger one. Different heap memory managers solve some or all of these problems in different ways. We will talk about some of them later in this chapter.


Because of the difficulty in managing a heap, it is desirable to use simpler, more efficient but restricted memory managers whenever possible. One particularly common pattern of lifetimes is called nested lifetimes. In this pattern, any two objects with different lifetimes that exist at the same time have well-nested lifetimes; that is, the lifetime of one is completely contained within the lifetime of the other. This pattern arises from block structure and procedure calls. Storage for local block variables and procedure parameters only needs to exist while that block or procedure is active. We say that a block is active when control resides within it or within some procedure called from it. A storage object belonging to a block can be born when the block begins and die when the block ends, so its lifetime coincides with the time that the block is active. Blocks can be nested, meaning that a block B that starts within a block A finishes before A does. It follows that the lifetimes of any storage objects created by B are contained within the lifetimes of objects created by A.

Dynamic Stack Storage

Storage for objects with nested lifetimes can be managed very simply using a stack, frequently called the run-time stack. This is an area of memory, like the heap, on which storage objects are allocated and deallocated. Since, in the world of nested-lifetime objects, younger objects always die before older ones, objects can always be allocated and deallocated from the top of the stack. For such objects, allocation and deallocation are very simple processes. The storage manager maintains a stack allocation pointer which indicates the first unused location on the stack. When a program block is entered, this pointer is incremented by the number of bytes required for the new storage object(s) during the allocation process. Deallocation is accomplished at block exit time by simply decrementing the stack allocation pointer by the same number of bytes.
This returns the newly freed storage to the storage pool, where it will be reused. In languages that support both heap and stack storage objects, the stack objects should be used wherever possible because their lifetime is tied to the code that uses them, and the birth and death processes are very efficient and automatic. (This is the reason that stack-allocated objects are called auto in C.) Storage managers typically use stack storage for a variety of purposes. When control enters a new program block, a structure called a stack frame, or activation record, is created on the top of the stack. The area past the end of the current stack frame is used for temporary buffers and for storing intermediate results while calculating long arithmetic expressions [Exhibit 6.17, right side]. A stack frame³ includes several items: parameters, local variables, the return address, and the return value (if the block is a function body). It also contains two pointers, called the static link and dynamic link [Exhibit 6.17, left side]. Let us define the lexical parent of a block to be that block which encloses it on the program listing. The lexical parent of the outermost block or blocks is the system. A lexical ancestor is a parent or the parent of a parent, and so on. The static link points to the stack frame of the current
³The rest of this section explains the structure of the stack for a lexically scoped, block-structured language.


Exhibit 6.17. The structure of the run-time stack. The order of the parts within a stack frame is arbitrary, as is the relationship of global storage and program code to the stack. The diagrams indicate a functional arrangement of the necessary kinds of information.

A single stack frame:
    Parameters
    Return Address
    Dynamic Link
    Static Link
    Return Value
    Local Variables
    <- Top of stack

The program and stack at run time:
    Program Code
    Global and static storage
    Stack frame for oldest block
    Stack frames for other blocks
    Stack frame for newest block
    Temporary locations
    <- Top of stack

block's lexical parent. At run time, these links form a chain that leads back through the stack frames for all the blocks that lexically enclose the current block. Since the location of a lexical ancestor's frame is not predictable at compile time, the chain of static links must be followed to locate a storage object that was allocated by an ancestor. This is, of course, not as efficient as finding a local object, and it is one good reason to use parameters or local variables wherever possible. The dynamic parent of a block is the block which called it during the course of execution and to which it must return at block exit time. The dynamic link points to the stack frame of the current block's dynamic parent. This link is used to pop the stack at block exit time. The static and dynamic links are created when the stack frame is allocated at run time. During this process, several things are entered into the locations just past the end of the current frame. This process uses (and increments) the local-allocation pointer which points to the first free location on the stack. Before beginning the call process, this pointer is saved. The saved value will be used later to pop the stack. The sequence of events is as follows:


1. The calling program puts the argument values on the stack using the local-allocation pointer. Typically, the last argument in the function call is loaded on the stack first, followed by the second-last, and so on. The first argument ends up at the top of the stack.

2. The return address is written at the top of the stack, above the first argument.

3. The current top-of-stack pointer is copied to the top of the stack. This will become the new dynamic link field. The address of this location is stored into the top-of-stack pointer.

4. The static link for the new frame is written on the stack. This is the same as either the static link or the dynamic link of the calling block. Code is generated at compile time to copy the appropriate link.

5. The local-allocation pointer is incremented by enough locations to store the return value and the local variables. If the locals have initializers, those values are also copied.

6. Control is transferred to the subroutine.

At block exit time, the stack frame must be deallocated. In our model, the return value is in the frame (rather than in a register), so the frame must be deallocated by the calling program. To do this, the value in the dynamic link field of the subroutine's frame is copied back into the top-of-stack pointer, and the local-allocation pointer is restored to its value prior to loading the arguments onto the stack. Stack storage enables the implementation of recursive functions by permitting new storage objects to be allocated for parameters and local variables each time the function is invoked. An unlimited number of storage objects which correspond to each parameter or local name in the recursive function can exist at the same time: one set for every time a recursive block has been entered but not exited [Exhibits 6.18 and 6.19]. Each time a recursive procedure exits, the corresponding stack frame is deallocated, and when the original recursive call returns to the calling program, the last of these frames dies.
The number of storage objects simultaneously in existence for a recursive program is limited only by the program logic and the amount of storage available for stack allocation, not by the number of declared identifiers in the program.

Dynamic Heap Storage

There are situations in which heap storage must be used because the birth or death patterns associated with stack storage are too restrictive. These include cases in which the size or number of storage objects needed is not known at block entry time and situations in which an object must outlive the block in which it was created.

Heap Allocation. Heap allocation is invoked by an explicit allocation command, which we will call ALLOC. Such commands can occur anywhere in a program, unlike local variable declarations which are restricted to the beginning of blocks. Thus a heap-allocated object can be born at any


Exhibit 6.18. A recursive function in Pascal. The following (foolish) recursive function multiplies jj inputs together. Exhibit 6.19 traces an execution of this function.

FUNCTION product (jj: integer): integer;
VAR kk: integer;
BEGIN
    IF jj <= 0
    THEN product := 1
    ELSE BEGIN
        readln(kk);
        product := kk * product(jj-1);
    END
END;

Exhibit 6.19. Stack frames for recursive calls. If the function product in Exhibit 6.18 were called with the parameter 2, two recursions would happen. Assume the inputs 25 and 7 were supplied. Just before returning from the second recursion the stack would contain three stack frames as diagrammed. The ? in a stack location indicates an undefined value.

[Diagram: Three stack frames, each with a static link (SL) pointing at the stack frame of the lexical parent and a dynamic link (DL) pointing at the caller's frame. The frame from the original call holds jj: 2, return: ?, kk: 25; the frame from the first recursive call holds jj: 1, return: ?, kk: 7; the frame from the second recursive call, nearest the top of the stack, holds jj: 0, return: 1, kk: ?, followed by temporary locations.]

Exhibit 6.20. Dynamic allocation in FORTH. Allocation: HERE expression ALLOT


Storage can be allocated dynamically in the dictionary, which stores the symbol table and all global objects. The programmer is given access to the top-of-dictionary pointer through the system variable named HERE. The code above puts the current value of HERE on the stack. Then it evaluates the expression, which must produce an integer, N. Finally, ALLOT adds N bytes to the dictionary pointer. The address of the newly allocated area is left on the stack; the user must store it in a pointer variable. Deallocation: Users must write their own storage management routines if they wish to free and reuse dynamic storage.

time. ALLOC reserves storage in the heap and returns a reference to the new storage object. The allocation process for heap storage is somewhat more complicated than that for stack storage, since there may be two places to look in the heap for available storage. Initially there is only a large, empty area with an associated allocation pointer which is incremented (like the stack pointer) when storage is allocated from that area. After some objects have died, there may also be a freelist which contains references to these formerly used locations. Clearly, items on the freelist might be scattered all over the heap and be of quite varied sizes. The memory manager must contain algorithms to keep track of the sizes and merge adjacent free areas, and these algorithms must be fast to avoid degrading the performance of the system. An ALLOC command takes some indication of the required size of the new object and finds and reserves that much memory. Either it returns a reference to the memory location, or it stores that reference in a pointer variable which thereafter gives access to the new object. The ways these actions are incorporated into current languages are truly varied [Exhibits 6.20 through 6.25]. The new storage object is used, later, by dereferencing this pointer, and it remains alive as long as the pointer or some copy of the pointer points at it.

Exhibit 6.21. Dynamic allocation in LISP. Allocation: (cons expr1 expr2 )

This allocates a new list cell and returns a reference to it. The left field of the cell is initialized to the result of evaluating expr1 and its right field to expr2.

Deallocation: Most LISP systems rely on garbage collection.


Dead Heap Objects. We call a dead heap object garbage. Management of dead heap objects is very different from stack management. A heap object dies either when the last reference to the object is destroyed (let us call this a natural death) or when it is explicitly killed using a KILL command. The run-time system of a language translator must manage heap storage allocation just as it manages stack frame allocation. However, when an object dies a natural death, both the programmer and the run-time system may be unaware of that death. A hole containing garbage is left in the heap at an unknown location. A KILL command takes a reference to a storage object, kills the storage object, and puts the reference onto the freelist, where it can be recycled. In languages that implement KILL, programmers who use extensive amounts of dynamic storage are strongly urged to keep track of their objects and KILL them when they are no longer useful. (In general, this will be well before the object dies a natural death.) It is only through an explicit KILL command that the system can reclaim the storage. Recycling a dead heap cell is more complex than recycling stack cells. The system cannot simply decrement the heap allocation pointer because, in general, dead objects are in the middle of the heap, not at the end. A data structure called a freelist is generally used to link together the recycled cells and provide a pool of cells available for future reuse. Conceptually, a freelist is just a list of reclaimed and reusable storage objects. However, that is not a simple thing to implement efficiently. The objects might all be interchangeable, or they might be of differing sizes. In any case, they are probably scattered all over heap memory. The language designer or system implementor must decide how to organize the freelist to maximize its benefits and minimize bookkeeping. Ignore dead cells. The easiest implementation of KILL is to ignore it!
Although this seems to be a misimplementation of a language, it has been done. The Pascal reference manual for the Data General MV8000 explicitly states that the Dispose command is implemented as a no-op. This compiler runs under the AOS-VS operating system, which is a time-shared, paged, virtual memory system. The philosophy of the compiler writer was that most programs don't gobble up huge amounts of storage, and those that do can be paged. Old, dead storage objects will eventually be paged out. If all objects on the page have died, that page will never again be brought into memory. Thus the compiler depends on the storage management routines of the operating system to deep-six the garbage. This can work very well if objects with similar birth times have similar useful lifetimes. If not, each of many pages might end up holding a few scattered objects, vastly increasing the memory requirements of the process and degrading the performance of the entire system. Keep one freelist. One possibility is to maintain a single list which links together all free areas. To do this, each area on the list must have at least enough bytes to store the size of the area and a link. (On most hardware that means 8 bytes.) Areas smaller than this are not reclaimable. Many or most C and Pascal compilers work this way.

Exhibit 6.22. Dynamic allocation in C.


Allocation: In the commands that follow, T and basetype are types, and N is an integer. The malloc function allocates one object of type T or of size N bytes. calloc allocates an array of N objects of type basetype and initializes the entire array to zero. Both malloc and calloc return a reference to the new storage object. The programmer must cast that reference to the desired pointer type and assign it to some pointer variable.

    malloc(sizeof(T))
    malloc(N)
    calloc(N, sizeof(basetype))

Deallocation:

    free(ptr);

ptr must be a pointer to a heap object that was previously allocated using malloc or calloc. That object is linked onto a freelist and becomes available for reuse.

A compiler could treat this 8-byte minimum object size in three ways. It could refuse to allocate anything smaller than 8 bytes; a request for a smaller area would be increased to this minimum. This is not as wasteful as it might seem. Those extra bytes often have to be allocated anyway because many machines require every object to start on a word or long-word boundary (a byte address that is divisible by 2 or 4). Alternately, the compiler could refuse to reclaim anything smaller than the minimum. If a tiny object were freed, its bytes would simply be left as a hole in the heap. The philosophy here is that tiny objects are probably not worth bothering about. It takes a very large number of dead tiny objects to fill up a modern memory. A fragmentation problem can occur with these methods for handling variable-sized dead objects.

Exhibit 6.23. Dynamic allocation in Pascal.

Allocation:    New( PtrName );

The PtrName must be declared as a pointer to some type, say T. A new cell is allocated of type T, and the resulting reference is stored in the pointer variable.

Deallocation:    Dispose( PtrName );

The object pointed at by PtrName is put onto the freelist.


With many rounds of allocation and deallocation, the average size of the objects can decrease, and the freelist may end up containing a huge number of tiny, worthless areas. If adjacent areas are not glued together, one can end up with most of the memory free but no single area large enough to allocate a large object. Joining adjacent areas is quick and easy, but one must first identify them. Ordinarily this would require keeping the freelist sorted in order of address and searching it each time an object is freed. This is certainly time-consuming, and the system designer must decide whether the time or the space is more valuable. One final implementation of variable-sized deallocation addresses this problem. In this version, each allocation request results in an 8-byte header plus the number of bytes requested, rounded up to the nearest word boundary. At first this seems very wasteful, but using the extra space permits a more satisfactory implementation of the deallocation process. The 8-byte header contains two pointers that are used to create a doubly linked circular list of dynamically allocated areas. One bit somewhere in the header is set to indicate whether the area is currently in use or free. The areas are arranged on this list in order of memory address. Areas that are adjacent in memory are adjacent in the list. Disposing of a dead object is very efficient with this implementation: one only needs to set this bit to indicate free. Then if either of the neighboring areas is also free, the two can be combined into one larger area. When a request is made for more storage, the list can be scanned for a free cell that is large enough to satisfy the new request. Scanning the list from the beginning every time would be very slow, since many areas that are in use would have to be bypassed before finding the first free area.
But a scanning pointer can be kept pointing just past the most recently allocated block, and the search for a free area can thus start at the end of the in-use area. By the time the scanner comes back around to the beginning of the list, many of the old cells will have been freed. Thus we have a typical time/space trade-off: by allocating extra space we can reduce memory-management time.

Keep several freelists. A final strategy for managing free storage is to maintain one freelist for every size or type of storage object that can be freed. Thus all cells on a given list are interchangeable, and their order doesn't matter. This simplifies reallocation, avoids the need for identifying adjacent areas, and, in general, is simpler and easier to implement. This reallocation strategy is used by Ada and Turing [Exhibits 6.24 and 6.25].

One of the problems with heap-allocated objects is in knowing when to kill them. It is all too easy to forget to kill an object at the end of its useful lifetime or to accidentally kill it too soon. This situation is complicated by the way in which pointer structures may share storage. A storage object could be shared by two data structures, one of which is no longer useful and apparently should be killed, while the other is still in use and must not be killed. If we KILL this structure we create a dangling pointer which will eventually cause trouble. Identifying such situations is difficult and error prone, but omitting KILL instructions can increase a program's storage requirements beyond what is readily available. For this reason, some languages, such as LISP, automate the process of recycling dead heap
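The several-freelists strategy amounts to a stack of interchangeable cells per object size; a minimal sketch, with names invented for illustration:

```c
#include <stddef.h>

/* One freelist per object size: every cell on a given list is
   interchangeable, so order does not matter and no coalescing or
   neighbor search is ever needed. */
struct free_cell {
    struct free_cell *next;
};

/* Return a dead cell to the freelist for its size. */
void put_free(struct free_cell **list, struct free_cell *cell) {
    cell->next = *list;
    *list = cell;
}

/* Reallocation: pop any cell.  NULL means this list is empty and the
   allocator must get fresh storage from the heap instead. */
struct free_cell *get_free(struct free_cell **list) {
    struct free_cell *cell = *list;
    if (cell != NULL)
        *list = cell->next;
    return cell;
}
```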


CHAPTER 6. MODELING OBJECTS

Exhibit 6.24. Dynamic allocation in Ada.

Allocation:    NEW type ( expression )
    This allocates an object of the type requested. If the optional expression
    is supplied, it is evaluated and the result is used to initialize the new
    storage object. NEW is a function that returns a reference to the new
    object. The programmer must assign this reference to a variable of an
    ACCESS type (a pointer).

Deallocation:  Explicit deallocation is not generally used in Ada. In most Ada
    implementations, the dynamically allocated cells in a linked structure are
    automatically deallocated when the stack frame containing the pointer to
    the beginning of the structure is deallocated. Some Ada implementations
    contain full garbage collectors, like LISP. When it is necessary to recycle
    cells explicitly, a programmer may use a generic package named
    Unchecked_Deallocation. This package must be instantiated (expanded, like a
    macro, at compile time) for each type of cell that is to be deallocated.
    Each instantiation produces a procedure, for which the programmer supplies
    a name, that puts that kind of cell on a freelist. (Different cell types go
    on different freelists.) Use of this facility is discouraged because it may
    lead to dangling pointers.

Exhibit 6.25. Dynamic allocation in Turing.

Allocation:    new collection, ptr
    To dynamically allocate cells of a particular type, the programmer must
    explicitly declare that the type forms a collection. The new command
    allocates one cell from the desired collection and stores the reference
    in ptr.

Deallocation:  free collection, ptr
    The object pointed at by ptr is returned to the collection it came from,
    where it will be available for reuse. The pointer object ptr is set to nil.


objects. In cases where a KILL command does not exist or is not used, heap objects still die, but the memory manager is not aware of the deaths when they happen. To actually recycle these dead heap cells requires a nontrivial mechanism called a garbage collector, which is invoked to recycle the dead storage objects when the heap becomes full or nearly full.

A garbage collector looks through storage, locating and marking all the live objects. It can tell that an object is alive if it is static or stack-allocated or if there is a pointer to it from some other live object anywhere in storage. The garbage collector then puts references to all of the unmarked, dead areas on the freelist, where the allocator will look for reusable cells. While this scheme offers a lot of advantages, it is still incumbent on the programmer to destroy references to objects that are no longer needed. Furthermore, garbage collection is slow and costly. On the positive side, the garbage collector needs to be run only when the supply of free storage is low, which is an infrequent problem with large, modern memories. Thus garbage collection has become a practical solution to the storage management problem.
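The mark phase of such a collector might be sketched as follows, assuming for simplicity that each cell holds at most one outgoing pointer; real collectors traverse every pointer field of every cell:

```c
#include <stddef.h>

/* A heap cell with one outgoing pointer and a mark bit.  This is a
   simplification: real cells may contain many pointers. */
struct gc_cell {
    int marked;
    struct gc_cell *ref;
};

/* Mark phase: every cell reachable from a root is alive.  Anything
   left unmarked afterward is garbage, and the sweep phase would put
   it on the freelist.  The marked test also stops cycles. */
void mark(struct gc_cell *cell) {
    while (cell != NULL && !cell->marked) {
        cell->marked = 1;
        cell = cell->ref;   /* follow the pointer to the next live cell */
    }
}
```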

6.3.2 Dangling References

The case of a name or a pointer that refers to a dead object is problematical. This can happen with heap storage where the programming language provides an explicit KILL command. The programmer could allocate a heap object, copy the resulting reference several times, then KILL the object and one of its references. The other references will still exist and point at garbage. These pointers are called dangling references or dangling pointers.

This situation could also arise if the program is able to store references to stack-allocated objects. Assume that a reference to a stack-allocated variable declared in an inner block could be stored in a pointer from an outer block. During the lifetime of the inner block, this can make good sense. When storage for the inner block is deallocated, though, the reference stored in the outer block becomes garbage. If it were then used, it would be an undefined reference. Initially, a dangling reference points at the value of the deallocated variable. Later, when the storage is reused for another block, the address will contain information that is not relevant to the pointer. Thus the pointer provides a way of accessing and modifying some random piece of storage.

Serious errors can be caused by the accidental use of a dangling reference. Because the storage belonging to any inner block might be affected, the symptoms of this kind of error are varied and confusing. The apparent error happens at a point in the program that is distant from the block containing the dangling reference. If the inner blocks are modified, the symptoms may change; the part that was malfunctioning may start to work, and some other part may suddenly malfunction. This kind of error is extremely difficult to trace to its cause and debug.

Because of the potentially severe problems involved, pointers into the stack are completely prohibited in Pascal. Pascal was designed to be simple and as foolproof as possible. The designer's opinion is that all programmers are occasionally fools, and the language should provide as much protection as possible without prohibiting useful things. Pascal completely prevents dangling pointers that point into the stack by prohibiting all pointers


to stack-allocated objects. The use of Pascal pointers is thus restricted to heap storage. Linked lists and trees, which require the use of pointers, are allocated in the heap. Simple variables and arrays can be allocated on the stack. Address arithmetic is not defined. Although this seems like a severe restriction, its primary bad effect is that subscripts must be used to process arrays, rather than the more efficient indexing methods which use pointers and address arithmetic.

In contrast, the use of pointers is not at all restricted in C. The & operator can be used freely and lets the programmer point at any object, including stack objects that have been deallocated. When control leaves an inner block, and its stack frame is deallocated, any pointer that points into that block will contain garbage. (An example of such code and corresponding diagrams are given in Exhibits 9.25 and 9.26.)

A language, such as C, which permits unrestricted use of addresses must either forgo the use of an execution stack or cope with the problem of dangling references. Allocation of parameters and local variables on the execution stack is a simple and efficient method of providing dynamically expandable storage, which is necessary to support recursion. Alternatives to using a stack exist but have high run-time overhead. The other possibility is to permit the programmer to create dangling references and make it the programmer's responsibility to avoid using them. A higher level of programming skill is then required because misuse of pointers is always possible. A high premium is placed on developing clean, structured methods for handling pointers.

One design principle behind the original C was that a systems programmer does not need a foolproof language but does need free access to all the objects in his or her programs. Permitting free use of pointers was also important in the original C because it lacked other important features. 
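The hazard can be made concrete with a short C sketch; the function name is invented, and the dangling pointer is created but deliberately never dereferenced afterward, since doing so would be undefined behavior:

```c
#include <stddef.h>

/* A sketch of how easily a dangling reference arises in C: the
   pointer p outlives the object it points at. */
int dangling_demo(void) {
    int *p = NULL;
    int seen = 0;
    {
        int inner = 10;     /* stack-allocated in an inner block      */
        p = &inner;         /* & may point at any object, even one
                               with a shorter lifetime than p has     */
        seen = (*p == 10);  /* fine here: inner is still alive        */
    }
    /* inner's lifetime is over; p still holds its old address and is
       now a dangling reference.  Dereferencing *p at this point would
       be undefined behavior, which is exactly why Pascal forbids
       pointers into the stack entirely. */
    return seen;
}
```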
Since structured objects could not be passed coherently to and from subroutines, any subroutine that worked on a structure had to communicate with its calling program using a pointer to the structure. In the new ANSI C, this weakness is changed but not eliminated. It permits coherent assignment of structures, but not arrays. Similarly, structures may be passed to and from functions without using pointers, but an array parameter is always passed by pointer. Thus if C had the same restriction on pointers that Pascal has, the language would be much less powerful, perhaps not even usable.

How, then, can Pascal avoid the need to have pointers to stack objects? It has two facilities that are missing in C:

- Compound stack-allocated objects are coherent. They can be operated on, assigned, compared, and passed as parameters coherently.
- References to objects can be passed as parameters by using the VAR parameter declarator. Unfortunately, returning a compound value from a function is not permitted in the standard language and must be accomplished by storing the answer in a VAR parameter.

Most standard algorithms and data structures can be coded easily within these restrictions, using the fact that compound objects are coherent. Some experts assert that Pascal is a better language


for these applications because the programmer does not need to exercise as much care. On the other hand, there are, occasionally, situations in which the pointer restrictions in Pascal prevent the programmer from coding an algorithm at all, and others in which the code would have been much more efficient if the programmer had used pointers to stack objects. One might say that C is a better language for these applications.
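The ANSI C asymmetry described above, coherent structures but pointer-passed arrays, can be demonstrated directly; this sketch is illustrative, not from the text:

```c
/* A compound object that C handles coherently. */
struct pair { int x, y; };

/* Structures are passed and returned by value: the callee gets a
   copy, so modifying it cannot affect the caller's struct. */
struct pair swap_pair(struct pair p) {
    struct pair r = { p.y, p.x };
    return r;
}

/* An array parameter, in contrast, is silently passed as a pointer:
   writes through it change the caller's array. */
void zero_first(int a[], int n) {
    if (n > 0)
        a[0] = 0;
}
```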

Exercises
1. Define and explain the relationship among: external object, program object, storage object, and pointer object.

2. What is the difference between a variable and a pointer variable?

3. By what two means can a value be placed in a storage object? What is the difference between the two processes?

4. What is the difference between destructive assignment and coherent assignment?

5. What is multiple assignment? How is it used?

6. Why do languages that implement assignment as a function allow the programmer more flexibility than those that implement assignment as a statement?

7. Define dereferencing. Which abstract function implements it?

8. Choose a programming language with which you are familiar. Give a sample of code in which dereferencing is explicit. Give another in which it is implicit.

9. What language contexts must be enumerated in order to define the implicit dereferencing rules in a language?

10. Give some examples of FETCH operators which are used to dereference pointer expressions.

11. How does a pointer assignment differ from an ordinary assignment?

12. Given p1 pointing at this list of integers, write legal pointer assignments for p2 and p3.

[Diagram: p1 points at the first cell of a linked list of integers, 50 -> 25 -> 62; p2 and p3 are pointer variables.]

13. In the C language, what are the meanings of & and * in pointer notation?


14. In C, what is the meaning of the name of an array? Do you need an & when assigning a pointer to an array? Why or why not?

15. What is the lifetime of a storage object?

16. What is the difference between a static storage object and a dynamic one?

17. Why is a language that only supports static storage items limiting?

18. What is memory management? Why is it important?

19. What is a run-time stack? A stack-frame? A heap?

20. What are the purposes of the static and dynamic links in a stack frame?

21. Name two languages in which local variables can be declared and are allocated each time a subroutine is entered. Give examples of local variable declarations.

22. Name a language in which local variables cannot be defined.

23. What is static local storage? In what ways is it better than global storage and ordinary local storage? Give an example, in some language, of a declaration that creates static local storage.

24. Suppose a language allows initial values to be specified for local variables, for example, the following declarations which define X and initialize it to 50:

        Language    Declaration
        Ada         X: integer := 50;
        C           int X = 50;

    When and how often does initialization happen in the following two cases?
    a. X is an ordinary local variable.
    b. X is a static local variable.

25. Explain the differences in lifetime, accessibility, and creation time between:
    a. An ordinary local variable.
    b. A static local variable.
    c. A global variable.
    See page ??.

26. Name a language that does not support dynamic storage at all. (All storage is static.) Explain two ways in which this limits the power of the language.


27. What is the purpose of an ALLOC command? What is garbage? A freelist? What is the function of a KILL command?

28. Give examples in LISP and C of expressions that allocate nonstack (heap) storage dynamically.

29. Explain the reallocation strategy used by Turing. What are its advantages?

30. What is a dangling reference, and what problems can be caused by it?

31. Name a language in which pointers exist but can only point at dynamically allocated heap objects, not at objects allocated in the stack.

32. Name a language in which a pointer can point at any variable and pointer arithmetic is possible. Give an example of code.

33. Write a paragraph discussing the following questions: In what sense does FORTH or assembly language have pointers? What can be done with them? Are there any restrictions? Name two common errors that can occur with this kind of pointers.

34. Choose three languages from the list: APL, Ada, Pascal, C, FORTH, and assembler. What restrictions are there on the use of pointers in each? What effect do these restrictions have on flexibility and ease of use of the language? What effect do they have on the safety of code?


Chapter 7

Names and Binding

Overview
This chapter discusses the definition, implementation, and semantics of names. The meaning of a symbol or name in a programming language is its binding. In most languages, all sorts of entities can have names. Binding creates an association between a name and its storage object. Binding can be static: the name is bound to the object when the object is allocated and remains bound throughout the program. Or binding can be dynamic: a name can be bound to one object, unbound, and rebound to another within the run of a program. Constant declarations bind a symbolic name to a literal value.

The scope of a name is that part of the program where a name is known by the translator. Naming conflicts occur when some name is accidentally used more than once within a linear program. Modularity and block structure allow the programmer to limit the scope of a name to a block and all its nested blocks. It is the job of the interpreter or compiler to determine the proper meaning of ambiguous names according to the semantics of the language.

7.1 The Problem with Names

This section concerns the ways that we define symbols, or names, in a programming language, give those names meaning (or meanings), and interpret references to names. In lambda calculus this issue is very simple; every name acquires a unique meaning in one of two ways:


1. Some names are defined, by declaration, to stand for formulas.

2. Parameter names acquire meaning during a reduction step. When a lambda expression is applied to an argument, that argument becomes the meaning of the parameter name.

Lambda calculus is referentially transparent: wherever a name appears in an expression, the defining formula can be substituted without changing the meaning of the expression. The reverse is also true; the meaning of an expression does not change if a defined name is substituted for a subexpression which matches its defining formula. Thus lambda calculus makes a simple one-to-one correspondence between names and meanings. Most programming languages, however, are not so simple. This section tries to explain and straighten out all the myriad ways in which real languages complicate the naming problem.
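As a small added illustration (notation supplied here, not from the text), suppose the name succ is declared to stand for the formula λx. x + 1. A reduction step then binds the parameter name:

```latex
\mathrm{succ} \;\equiv\; \lambda x.\, x + 1
\qquad
\mathrm{succ}\; 5 \;=\; (\lambda x.\, x + 1)\; 5 \;\to\; 5 + 1 \;\to\; 6
```

Referential transparency means that succ 5 and (λx. x + 1) 5 may be interchanged anywhere without changing the meaning of the enclosing expression.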

7.1.1 The Role of Names

We use names to talk about objects in a computer. In simplest terms, a name is a string of characters that a programmer can write in a program. Different languages have different rules for the construction of names, but intuitively, a name is just like a word in English: a string of letters. A name must be given a meaning before it can be used. The meaning of a name is its binding, and we say the name is bound to that meaning.

While objects can be created dynamically in most languages, names cannot. Names are written in the program, and the text of the program does not change when the program is executed. Bindings, though, change when objects are created or destroyed. They attach the changing collection of objects to the fixed collection of names.

Naming would need little explanation if languages followed a one-object, one-name rule. However, the situation is not so simple. Languages permit a bewildering mismatch between the number of objects that exist and the number of names in the program. On the one hand, an object can have no name, one name, or multiple names bound to it. On the other hand, a name can be bound to no object (a dangling pointer), one object (the usual case), or several objects (a parameter name in a recursive function). This complexity comes about because of block structure, parameters, recursion, pointers, alias commands, KILL commands, and explicit binding commands. In this section, we explore the way names are used in writing programs and the binding mechanisms provided by various languages.

Symbolic names are not necessary for a computer to execute a program: compilers commonly remove names altogether and replace them by references. Nor are names necessary for a person to write a program: the earliest method for programming computers, writing absolute machine code, did not use names. Nonsymbolic programming requires considerable skill and extraordinary attention to detail, and symbolic assemblers, which permit the programmer to define names for locations, were a great leap forward because names help the programmer write correct code. It is much easier for a human to remember a hundred names than a hundred machine addresses. In addition, names have an important semantic aspect that is appreciated by experienced programmers. A program that uses well-chosen names that are related to the meaning of the objects

Exhibit 7.1. Bad names.


You may enjoy the challenge of figuring out what this function does without using diagrams. At least one highly experienced Pascal programmer failed on his first try.

TYPE my_type = ^x_type;
     x_type  = RECORD next: char; prior: my_type END;

FUNCTION store_it (temp: my_type; jxq: char): my_type;
VAR jqx, first: my_type;
BEGIN
    first := temp;
    jqx := temp^.prior;
    WHILE (jqx <> NIL) DO BEGIN
        IF jqx^.next = jxq
        THEN jqx := NIL
        ELSE BEGIN
            first := jqx;
            jqx := jqx^.prior
        END
    END;
    store_it := first;
END;

being named is much easier to debug than a program with randomly chosen, excessively general, or overused names like J and temp. A program that uses names inappropriately can be terrible to debug, since the human working on the program can be misled by a name and fail to connect the name with an observed error. It is even harder for another programmer, unfamiliar with the program, to maintain that code.

The function definition in Exhibit 7.1 was written with names purposely chosen to disguise and confuse its purpose. The English semantics of every name used are wrong for the usage of the corresponding object. A compiler would have no trouble making sense of this clear, concise code, but a human being will be hindered by ideas of what names are supposed to mean and will have trouble understanding the code. You may enjoy trying to decode it before reading further.

Several naming sins occur in Exhibit 7.1:
- Two names were used that have subtle differences: jxq, jqx.
- Nonsuggestive names were used: temp, my_type, x_type, jxq.
- Suggestive names were inappropriately used: store_it names a search routine that does no

Exhibit 7.2. Good names.

TYPE list_type = ^cell_type;
     cell_type = RECORD value: char; next: list_type END;

FUNCTION search (letter_list: list_type; search_key: char): list_type;
VAR scanner, follower: list_type;
BEGIN
    follower := letter_list;
    scanner := letter_list^.next;
    WHILE (scanner <> NIL) DO BEGIN
        IF scanner^.value = search_key
        THEN scanner := NIL
        ELSE BEGIN
            follower := scanner;
            scanner := scanner^.next
        END
    END;
    search := follower;
END;
storing. Next names a value field, rather than the traditional pointer. Prior names a pointer field pointing at the next item in the list.
- A name was used that seemed appropriate on first use but did not reflect the actual usage of the object: first started as a pointer to the first thing in the list, but it is actually a scanning pointer.

A list of good name substitutions for the program in Exhibit 7.1 is:

    my_type = list_type     x_type = cell_type    store_it = search
    prior   = next          next   = value        temp     = letter_list
    jxq     = search_key    jqx    = scanner      first    = follower

Rewritten with the names changed, the purpose of this code should be immediately apparent. Try reading the code in Exhibit 7.2. Anyone familiar with Pascal and with list processing should understand this code readily.

7.1.2 Definition Mechanisms: Declarations and Defaults

All sorts of entities can have names in most languages: objects and files (nouns), functions and procedures (verbs), types (adjectives), and more. Depending on the rules of the language, the

Exhibit 7.3. Predefined names in Pascal.

This is a list of all the names that are predefined in UCSD Pascal.

    types       integer, real, Boolean, char, text
    constants   NIL, TRUE, FALSE, MAXINT
    files       input, output
    functions   odd, eof, eoln, abs, sqr, sqrt, sin, cos, arctan, ln, exp,
                trunc, round, ord, chr, succ, pred
    procedures  read, readln, write, writeln, get, put, rewrite, reset, page,
                new, dispose, pack, unpack

programmer might or might not be permitted to use the same name for entities in different classes. As in English, there must be some way to give meaning to a name and some way to find the meaning of a name when it is used. Declarations, defaults, and the language definition itself are the means used in programming languages to give meaning to names. The symbol table is a data structure maintained by every translator that is analogous to a dictionary. It stores names and their definitions during translation.1

A name must be defined before it can be used. In some languages this happens the first time it is used; in others all names must be explicitly declared. A declaration is a statement that causes the translator to add a new name to its list of defined names in the symbol table. Many functions, types, and constants are named by the language designer, and their definitions are built into all implementations of the language. We call these primitive symbols [Exhibits 7.3 and 7.4]. These names are not like other reserved words. They do not occur in the syntax that defines the language, and the programmer may define more names in the same categories. They are a necessary part of a language definition because they provide a basic catalog of symbols in terms of which all other symbols must be defined.
1 This structure has also been called the environment or dictionary.

Exhibit 7.4. Predefined names in C.

These are the names that are predefined in C:

    types       int, long, short, unsigned, float, double, char
    constants   NULL (TRUE, FALSE, and EOF are also defined in many versions
                of stdio.h, the header file for the standard I/O package.)
    functions   Every C implementation has a library, which contains I/O
                functions, numeric functions, and the like. The libraries are
                fairly well standardized from one implementation to another
                and are far too extensive to list here.


In many interpreted languages the programmer is not required to declare types, because allocation decisions do not have to be made in advance of execution, and at execution time, the type of a datum can often be determined by examining the datum itself. Names are generally added to the symbol table the first time they are mentioned. Thus names are typeless. These languages are sometimes called typeless because types are not declared and not stored in the symbol table with the names. Objects, on the other hand, are never typeless. Every storage object has a fixed size, and size is one aspect of type. Every program object has a defined encoding, another aspect of type. In a typeless language, the type of an object must still be recorded. Since the type is not stored with the name, it must be encoded somehow as part of the object itself.

In a compiled language, the type of each name must be supplied either by a declaration or by a default so that the compiler can know how many bytes to allocate for the associated storage object. The type is stored in the symbol table with the name, and remains unchanged throughout the rest of translation. Pascal requires that a type be declared for each name. FORTRAN permits the programmer to write explicit declarations, but if an identifier does not appear in a declaration, a default type will be used which depends on the first letter of the name. The original C permitted function return types and parameter types, but not variable types, to be declared as integer by default.

7.1.3 Binding

In compiled languages, a name exists only in the symbol table at translation time and objects exist only at run time. Names are gone before objects are created; they are not part of objects. In interpreted languages, names and objects coexist. In both cases, a name acquires one or more meanings during the course of translation, by a process called binding. Binding creates an association between a name (in the symbol table) and a storage object (an area of memory).

We can picture a binding as a pointer from the name to the storage object. A binding differs from an ordinary pointer, though, because it reaches from the system's storage area into the programmer's area. Moreover, in compiled languages, the binding spans time as well as space. At compile time it holds the location where an object will someday be allocated. Finally, bindings are unlike pointers because the translator automatically dereferences bindings but does not dereference pointers. We represent bindings in our diagrams as arrows (like pointers) but drawn in boldface, because they are not ordinary pointers.

Binding is invoked by the translator whenever a declaration is processed but can also be invoked explicitly by the programmer in many languages. A binding is static if it never changes during the lifetime of the program. Otherwise it is said to be dynamic. At any time a name might be bound to a particular object to which it refers, or it might be unbound, in which case it refers to nothing and is said to be undefined, or it might be multiply bound to different objects in different scopes.

Names of variables, types, pure values, and functions are identified and recorded in the symbol table during translation of a program. Another column of the symbol table records the bindings. Like allocation, binding can be static, block structured, or dynamic.

Exhibit 7.5. A symbol table with static bindings.

    Symbol Table                                  Run-Time Memory
    name     type                      binding
    length   real                      -------->  storage object
    ages     array[1..4] of integer    -------->  storage object

Typed Languages / Static Binding

Most of the familiar languages (COBOL, FORTRAN, ALGOL, C, Pascal, Ada) belong to a class called typed languages. In these languages each name defined in a program unit has a fixed data type associated with it, and often declared with it. In the oldest and simplest of these languages, such as symbolic assemblers and COBOL, name binding is static. A name is bound to an object when the object is allocated and remains bound to the same storage object until the end of the program. In such languages, when a meaning is given to a name, that name retains the meaning throughout the program.

Static binding occurs in typed languages that are non-block structured. In a static language, there is no concept of a program block enclosed within another program block, producing a local program scope in which a name could be redefined.2 A static binding associates a name with a storage object of fixed type and size at a fixed memory address. Static binding can be implemented simply by using three columns in the symbol table to store the name, type, and binding [Exhibit 7.5]. We can describe this kind of symbol table as flat: it has the form of a simple one-dimensional list of entries, where each entry has three fields.

Each declaration (explicit or default) specifies a name and a type. It causes the compiler to select and set aside an area of storage appropriate for an object of that type. Although this storage will not exist until run time, its address can be computed at compile time and stored in the symbol table as the binding for the name. Note, in Exhibit 7.5, that the run-time memory contains only the storage object; the symbol table no longer needs to be present. It was used to generate machine code and discarded at the end of translation.3

A Typed Language with Dynamic Binding

FORTH is an interactive, interpretive language embedded in a program development system. A complete system contains an editor, an assembler, an interpreter, and a compiler. 
This compiler does not generate machine code; rather, it lexes and parses function definitions and produces an
2 However, additional names can be bound to a COBOL object, by using REDEFINES, as explained in Section 7.1.4.
3 Some translators are embedded in systems that provide a symbolic debugger. These systems must keep the symbol table and load it along with the object code for the program.

Exhibit 7.6. Names, types, and bindings in FIG FORTH.

This dictionary segment contains two words, an integer and an array of four integers. The right-hand column has 4 bytes of memory per line in the diagram.

    name : 6length    (Length of name followed by name.)
    link :            (Pointer to previous word in dictionary.)
    type : int        (Pointer to run-time code for integer variables.)
    body :            (4 bytes, properly called the "parameter field".)

    name : 4ages      (Length of name followed by name.)
    link : length     (Pointer to previous word in dictionary.)
    type : int        (Pointer to run-time code for integer variables.)
    body :            (16 bytes of storage, enough for four variables.)

intermediate program form that can be interpreted efficiently. FORTH is a typed language. Its symbol table, called the dictionary, is only a little more complex than the simple, flat symbol table used for a static language. The dictionary is organized into several vocabularies, each containing words for a different subsystem. Each vocabulary is implemented by a simple, flat symbol table.

Unlike COBOL and assembler, though, FORTH is an interactive language system. A user wishing to create a new application subsystem is permitted to create a new vocabulary or to add to an existing vocabulary. The user may alternate between defining objects and functions, and executing those functions. The dictionary may thus grow throughout a session.

The dictionary contains an entry for each defined item [Exhibit 7.6]. Function names, variable names, and constant names are all called words. Entries for all the primitive words are loaded into the dictionary when you enter the FORTH system. A dictionary entry is created for a user-defined word when a declaration is processed, and it will remain in the dictionary until the user gives the command to FORGET the symbol. The FORTH dictionary is stack-structured; new items are added at the top of the stack and can be defined in terms of anything below them on the stack. Each entry has four fields:

- The name field holds the name of the word, stored as a string whose first byte contains the length of the name.
- The link field is used to organize the dictionary into a data structure that can be searched efficiently. Searching must be done during translation, when the definition of a function refers to a symbol, or at run time when the interpreter evaluates a symbolic expression interactively. The implementation of the link field and its position relative to the name field varies among

7.1. THE PROBLEM WITH NAMES

183

different versions and implementations of FORTH. Exhibit 7.6 shows the relationships defined for FIG FORTH. The code field is the functional equivalent of a type field. It identifies, uniquely, the kind of object this word represents (function, variable, constant, or programmer-defined type). The parameter field, or body, contains the specific meaning of the word. For constants, it is a pure value. For variables it is a storage object. For functions, it contains code that can be interpreted.

FORTH maintains a rudimentary sort of type information in the code field of each dictionary entry. This field is a pointer to a run-time routine that determines the semantics of the name. It is actually a pointer to some code which will be run whenever this word is used at run time. This code defines the interpretation method for objects of this type. Thus constants, variables, and user-defined types can be interpreted differently. Initially only the types function, variable, and constant are built in, but others can be added. When a new type declarator is defined, two pieces of code are given: one to allocate and initialize enough storage for an object of the new type, and a second to interpret run-time references to the name of an object of this type. A pointer to this second piece of code becomes the unique identifier for the new type, and also becomes the contents of the code field for all objects declared with the new type.

FORTH differs from the simple static languages in one important way: it permits the user to redefine a word that is already in the dictionary. The translator will provide a warning message, but accept the redefinition. Henceforth the new definition will be used to compile any new functions, but the old one will be used to interpret any previously compiled functions. This opens up the possibility of redefining primitive symbols. The new definition can call the original definition and, in addition, do more elaborate processing.
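As a rough sketch (in Python, with illustrative field names rather than FIG FORTH's actual byte layout), the dictionary can be modeled as a linked stack of four-field entries. Searching from the newest entry outward is what makes a redefinition shadow the old word for new compilations, while previously compiled code, which holds a direct pointer to the old entry, keeps its old meaning:

```python
# A sketch of a FORTH-style dictionary: a linked stack of entries, each with
# name, link, code (type), and body (parameter) fields. Names and layout are
# illustrative, not FIG FORTH's actual byte format.

class Entry:
    def __init__(self, name, link, code, body):
        self.name = name    # name field (length byte + characters in real FORTH)
        self.link = link    # link field: pointer to the previous entry
        self.code = code    # code field: identifies the kind of object
        self.body = body    # parameter field: value, storage, or code

class Dictionary:
    def __init__(self):
        self.top = None     # newest entry; searches begin here

    def define(self, name, code, body):
        self.top = Entry(name, self.top, code, body)   # push a new entry
        return self.top

    def find(self, name):
        # Search newest-to-oldest, so a redefinition shadows the old entry.
        e = self.top
        while e is not None:
            if e.name == name:
                return e
            e = e.link
        return None

d = Dictionary()
d.define("length", "variable", bytearray(4))    # 4 bytes of storage
d.define("ages", "variable", bytearray(16))     # 16 bytes: four variables

# "Compile" a reference to ages: the word is resolved to its entry NOW.
old_ages = d.find("ages")

# Redefine ages (FORTH would print a warning, then accept it).
d.define("ages", "variable", bytearray(4))

print(d.find("ages") is old_ages)   # False: new compilations see the new entry
print(len(old_ages.body))           # 16: previously compiled code keeps the old one
```

The two print statements illustrate the split meaning described above: the dictionary search now yields the new entry, but the pointer captured earlier still reaches the old storage.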
The simple relationship between a name and its meaning no longer holds at all. The FORGET command is an unusual feature that has no counterpart in most language translators. It does not just remove one item from the dictionary, it pops the entire dictionary stack back to the entry before its argument, forgetting everything that has been defined since! This is a rudimentary form of symbol table management which does not have either the same purpose or the same power as the stack-structured symbol tables used to implement block structure. A FORTH programmer alternates between compiling parts of his or her code and testing them. FORGET lets the programmer erase the results of part of a compilation, correct an error in that part, and recompile just one part. Thus FORGET is an important program-development tool.

Typed Languages / Block Structured Binding. The connection between a name and its meaning is further complicated by block structure. FORTH permits a new definition of a name to be given, and it will permanently replace the old version (unless it is explicitly forgotten). A block structured language permits this same kind of redefinition, but such a language will restore the original definition after exit from the block containing the redefinition. Block structure and the semantic mechanisms that implement it are taken up in Section 7.4.
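FORGET's stack-popping behavior can be sketched in a few lines of Python, with the dictionary kept as a list, oldest entries first (the word names here are invented for the example):

```python
# FORGET pops the dictionary back to the entry before its argument,
# discarding the named word and everything defined after it.
dictionary = ["DUP", "SWAP", "square", "cube", "test1"]   # oldest first

def forget(name):
    # Drop `name` itself and every entry pushed after it.
    del dictionary[dictionary.index(name):]

forget("cube")
print(dictionary)   # ['DUP', 'SWAP', 'square']
```

This mirrors the development cycle described above: the programmer can erase a faulty batch of definitions back to a known-good point and recompile just that part.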


CHAPTER 7. NAMES AND BINDING

Exhibit 7.7. A symbol table with dynamic binding.


Symbol Table                      Run-Time Memory
name      binding                 storage object : type tag
length    ------------------->    17.3 : real

Explicit Dynamic Binding. Fully dynamic binding is available only in interpreted languages or ones such as LISP with simple, uniform type structures. In such a language, types can be associated with objects, not names, and are stored with the object in memory, rather than in the symbol table. The symbol table has only two columns, the name and its current binding. The type must be stored with or encoded into the object in memory, or discarded altogether as in assembly language. This is illustrated in Exhibit 7.7.

With fully dynamic binding, a name can be unbound from one object and rebound to any other at any time, even in the middle of a block, by explicit programmer command. In such a language, the type of the object bound to a name may change dramatically, and these languages are sometimes called typeless because no definite type is associated permanently with a name. SNOBOL and APL are examples of this language class.

These typeless languages nevertheless commonly do implement objects of different types. For example, in APL there are two basic types, number and character. These types are implemented by attaching a type tag to the object itself, rather than to the name in the symbol table [Exhibit 7.8]. The symbol table contains only the name and the binding, and the programmer is permitted to bind a name to any object. Thus at different times a name may be bound to storage areas of different sizes, each with an associated type tag.

In such languages, binding often serves the same purpose as does assignment in Pascal and is often mistaken for assignment. The essential difference is that assignment does not change the storage object to which a name is bound, but changes the program object which is the contents of that storage object.

Exhibit 7.8. Names, types, and binding in APL. APL is a typeless language and so has no permanent association of types with names. Rather, a type tag is associated with each storage object, and the combination may be bound to any name.
Symbol Table                      Run-Time Memory
name      binding                 storage object : type tag
length    ------------------->    : scalar number
ages      ------------------->    : array of 4 numbers
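The two-column symbol table of Exhibits 7.7 and 7.8 can be sketched as a mapping from names to tagged storage objects; the type tag travels with the object, never with the name. (The tag strings and sample values below are illustrative, not APL's internal representation.)

```python
# Fully dynamic binding: the symbol table maps a name to an object, and the
# type tag is stored on the object itself, not in the table.
symbol_table = {}                       # name -> {"tag": ..., "value": ...}

def bind(name, tag, value):
    # Rebinding replaces the whole object; the name carries no type.
    symbol_table[name] = {"tag": tag, "value": value}

bind("length", "scalar number", 17.3)
bind("ages", "array of 4 numbers", [21, 18, 12, 9])
print(symbol_table["length"]["tag"])    # scalar number

# The same name may later be bound to an object of a wholly different type
# and size, as in the APL example that follows:
bind("length", "character array", list("aeiouy"))
print(symbol_table["length"]["tag"])    # character array
```

Note the contrast with assignment in the Pascal sense: assignment would overwrite the contents of an existing storage object, whereas bind above attaches the name to an entirely new tagged object.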

Exhibit 7.9. Dynamic binding in APL. On two successive executions of the input statement Q←⎕,


one could legally supply the following inputs: aeiouy (which is a character array) and 37.1 (which is a number). Thus we would get first the following binding:
Symbol Table                      Run-Time Memory
name      binding                 storage object : type tag
Q         ------------------->    aeiouy : character array