Evaluating and Programming the
29K RISC Family
Third Edition – DRAFT
ADVANCED MICRO DEVICES
IF YOU HAVE QUESTIONS, WE’RE HERE TO HELP YOU.
Customer Service
AMD’s customer service network includes U.S. offices, international offices, and
a customer training center. Expert technical assistance is available to answer 29K
Family hardware and software development questions from AMD’s worldwide
staff of field application engineers and factory support staff.
Hotline, Bulletin Board, and eMail Support
For answers to technical questions, AMD provides a toll-free number for direct
access to our engineering support staff. For overseas customers, the easiest way to
reach the engineering support staff with your questions is via fax with a short
description of your question. Also available is the AMD bulletin board service,
which provides the latest 29K product information, including technical information
and data on upcoming product releases. AMD 29K Family customers also
receive technical support through electronic mail. This worldwide service is available
to 29K product users via the International UNIX eMail service. To access the
service, use the AMD eMail address: “
[email protected].”
Engineering Support Staff:
(800) 292-9263 ext. 2 toll free for U.S.
(512) 602-4118 local for U.S.
0800-89-1455 toll free for UK
0031-11-1163 toll free for Japan
(512) 602-5031 FAX for overseas
Bulletin Board:
(800) 292-9263 ext. 1 toll free for U.S.
(512) 602-4898 worldwide and local for U.S.
Documentation and Literature
The 29K Family Customer Support Group responds quickly to information and
literature requests. A simple phone call will get you free 29K Family information
such as data books, user’s manuals, data sheets, application notes, the Fusion29K
Partner Solutions Catalog and Newsletter, and other literature. Internationally,
contact your local AMD sales office for complete 29K Family literature.
Customer Support Group:
(800) 292-9263 ext. 3 toll free for U.S.
(512) 602-5651 local for U.S.
(512) 602-5051 FAX for U.S.
Evaluating and Programming the
29K RISC Family
Third Edition – DRAFT
Daniel Mann
Advanced Micro Devices
© 1995 Daniel Mann
Advanced Micro Devices reserves the right to make changes in its products
without notice in order to improve design or performance characteristics.
This publication neither states nor implies any warranty of any kind, including but
not limited to implied warranties of merchantability or fitness for a particular application.
AMD assumes no responsibility for the use of any circuit other than the circuit
in an AMD product.
The author and publisher of this book have used their best efforts in preparing this
book. Although the information presented has been carefully reviewed and is believed
to be reliable, the author and publisher make no warranty of any kind, expressed
or implied, with regard to example programs or documentation contained in
this book. The author and publisher shall not be liable in any event for incidental or
consequential damages in connection with, or arising out of, the furnishing, performance,
or use of these programs.
Trademarks
29K, Am29005, Am29027, Am29050, Am29030, Am29035, Am29040, Am29200,
Am29205, Am29240, Am29243, Am29245, EZ030, SA29200, SA29240,
SA29040, MiniMON29K, XRAY29K, ASM29K, ISS, SIM29, Scalable Clocking,
Traceable Cache and UDI are trademarks of Advanced Micro Devices, Inc.
Fusion29K is a registered service trademark of Advanced Micro Devices, Inc.
AMD and Am29000 are registered trademarks of Advanced Micro Devices, Inc.
PowerPC is a trademark of International Business Machines Corp.
MRI and XRAY are trademarks of Microtec Research Inc.
High C is a registered trademark of MetaWare Inc.
i960 is a trademark of Intel, Inc.
MC68020 is a trademark of Motorola Inc.
UNIX is a trademark of AT&T.
NetROM is a trademark of XLNT Designs, Inc.
UDB and UMON are trademarks of CaseTools Inc.
Windows is a trademark of Microsoft Corp.
Product names used in this publication are for identification purposes only and may
be trademarks of their respective companies.
To my wife
Audrey
and my son
Geoffrey
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
Chapter 1
Architectural Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 A RISC DEFINITION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 FAMILY MEMBER FEATURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 THE Am29000 3–BUS MICROPROCESSOR . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 The Am29005 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 THE Am29050 3–BUS FLOATING–POINT MICROPROCESSOR . . . . . . 11
1.5 THE Am29030 2–BUS MICROPROCESSOR . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.1 Am29030 Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.2 The Am29035 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 THE Am29040 2–BUS MICROPROCESSOR . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6.1 Am29040 Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 A SUPERSCALAR 29K PROCESSOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7.1 Instruction Issue and Data Dependency . . . . . . . . . . . . . . . . . . . . . 21
1.7.2 Reservation Stations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.7.3 Register Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.7.4 Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.8 THE Am29200 MICROCONTROLLER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.8.1 ROM Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.8.2 DRAM Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.8.3 Virtual DRAM Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.8.4 PIA Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.8.5 DMA Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.8.6 16–bit I/O Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.8.7 Parallel Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.8.8 Serial Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.8.9 I/O Video Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.8.10 The SA29200 Evaluation Board . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.8.11 The Prototype Board . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.8.12 Am29200 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.8.13 The Am29205 Microcontroller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.9 THE Am29240 MICROCONTROLLER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.9.1 The Am29243 Microcontroller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.9.2 The Am29245 Microcontroller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.9.3 The Am2924x Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.10 REGISTER AND MEMORY SPACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.10.1 General Purpose Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.10.2 Special Purpose Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.10.3 Translation Look–Aside Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.10.4 External Address Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.11 INSTRUCTION FORMAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
1.12 KEEPING THE RISC PIPELINE BUSY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.13 PIPELINE DEPENDENCIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
1.14 ARCHITECTURAL SIMULATION, sim29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.14.1 The Simulation Event File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
1.14.2 Analyzing the Simulation Log File . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Chapter 2
Applications Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.1 C LANGUAGE PROGRAMMING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.1.1 Register Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.1.2 Activation Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.1.3 Spilling And Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
2.1.4 Global Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.1.5 Memory Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.2 RUN–TIME HIF ENVIRONMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.2.1 OS Preparations before Calling start In crt0 . . . . . . . . . . . . . . . . . . 97
2.2.2 crt0 Preparations before Calling main() . . . . . . . . . . . . . . . . . . . . . . 100
2.2.3 Run–Time HIF Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.2.4 Switching to Supervisor Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2.3 C LANGUAGE COMPILER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.3.1 Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.3.2 MetaWare High C 29K Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
2.3.3 Free Software Foundation, GCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
2.3.4 C++ Compiler Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.3.5 Executable Code and Source Correspondence . . . . . . . . . . . . . . . 113
2.3.6 Linking Compiled Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.4 LIBRARY SUPPORT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
2.4.1 Memory Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
2.4.2 Setjmp and Longjmp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.4.3 Support Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.5 C LANGUAGE INTERRUPT HANDLERS . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
2.5.1 An Interrupt Context Cache with High C 29K . . . . . . . . . . . . . . . . . 127
2.5.2 An Interrupt Context Cache with GNU . . . . . . . . . . . . . . . . . . . . . . . 128
2.5.3 Using Signals to Deal with Interrupts . . . . . . . . . . . . . . . . . . . . . . . . 131
2.5.4 Interrupt Tag Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
2.5.5 Overloaded INTR3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
2.5.6 A Signal Dispatcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
2.5.7 Minimizing Interrupt Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
2.5.8 Signal Processing Without a HIF Operating System . . . . . . . . . . . 153
2.5.9 An Example Am29200 Interrupt Handler . . . . . . . . . . . . . . . . . . . . . 153
2.6 SUPPORT UTILITY PROGRAMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
2.6.1 Examining Object Files (Type .o and a.out) . . . . . . . . . . . . . . . . . . 156
2.6.2 Modifying Object Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.6.3 Getting a Program into ROM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Chapter 3
Assembly Language Programming . . . . . . . . . . . . . . . . . . . . . . . . 161
3.1 INSTRUCTION SET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.1.1 Integer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.1.2 Compare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.1.3 Logical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
3.1.4 Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
3.1.5 Data Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
3.1.6 Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.1.7 Floating–point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.1.8 Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.1.9 Miscellaneous Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
3.1.10 Reserved Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.2 CODE OPTIMIZATION TECHNIQUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
3.3 AVAILABLE REGISTERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
3.3.1 Useful Macro–Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.3.2 Using Indirect Pointers and gr0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
3.3.3 Using gr1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
3.3.4 Accessing Special Register Space . . . . . . . . . . . . . . . . . . . . . . . . . . 183
3.3.5 Floating–point Accumulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
3.4 DELAYED EFFECTS OF INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . 185
3.5 TRACE–BACK TAGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.6 INTERRUPT TAGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.7 TRANSPARENT ROUTINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
3.8 INITIALIZING THE PROCESSOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
3.9 ASSEMBLER SYNTAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
3.9.1 The AMD Assembler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
3.9.2 Free Software Foundation (GNU), Assembler . . . . . . . . . . . . . . . . 192
Chapter 4
Interrupts and Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.1 29K PROCESSOR FAMILY INTERRUPT SEQUENCE . . . . . . . . . . . . . . . . 196
4.2 29K PROCESSOR FAMILY INTERRUPT RETURN . . . . . . . . . . . . . . . . . . 197
4.3 SUPERVISOR MODE HANDLERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.3.1 The Interrupt Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.3.2 Interrupt Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.3.3 Simple Freeze-mode Handlers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
4.3.4 Operating in Freeze mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.3.5 Monitor mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.3.6 Freeze-mode Clock Interrupt Handler . . . . . . . . . . . . . . . . . . . . . . . 204
4.3.7 Removing Freeze mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.3.8 Handling Nested Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.3.9 Saving Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.3.10 Enabling Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
4.3.11 Restoring Saved Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.3.12 An Interrupt Queuing model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
4.3.13 Making Timer Interrupts Synchronous . . . . . . . . . . . . . . . . . . . . . . . 221
4.4 USER-MODE INTERRUPT HANDLERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
4.4.1 Supervisor mode Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.4.2 Register Stack Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
4.4.3 SPILL and FILL Trampoline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.4.4 SPILL Handler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
4.4.5 FILL Handler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
4.4.6 Register File Inconsistencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.4.7 Preparing the C Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.4.8 Handling Setjmp and Longjmp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Chapter 5
Operating System Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
5.1 REGISTER CONTEXT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
5.2 SYNCHRONOUS CONTEXT SWITCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.2.1 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
5.3 ASYNCHRONOUS CONTEXT SWITCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
5.4 INTERRUPTING USER MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
5.4.1 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
5.5 PROCESSING SIGNALS IN USER MODE . . . . . . . . . . . . . . . . . . . . . . . . . . 254
5.6 INTERRUPTING SUPERVISOR MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
5.6.1 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
5.7 USER SYSTEM CALLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
5.8 FLOATING–POINT ISSUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
5.9 DEBUGGER ISSUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
5.10 RESTORING CONTEXT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
5.11 INTERRUPT LATENCY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
5.12 ON–CHIP CACHE SUPPORT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
5.13 INSTRUCTION CACHE MAINTENANCE . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
5.13.1 Cache Locking and Invalidating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.13.2 Instruction Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
5.13.3 Branch Target Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
5.13.4 Am29030 2–bus Microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
5.13.5 Am29240 and Am29040 Processors . . . . . . . . . . . . . . . . . . . . . . . . 277
5.14 DATA CACHE MAINTENANCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
5.14.1 Am29240 Microcontroller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
5.14.2 Am29040 2–bus Microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
5.14.3 Cache Locking and Invalidating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.14.4 Cache Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.15 SELECTING AN OPERATING SYSTEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.16 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
Chapter 6
Memory Management Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
6.1 SRAM VERSUS DRAM PERFORMANCE . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
6.2 TRANSLATION LOOK–ASIDE BUFFER (TLB) OPERATION . . . . . . . . . . 300
6.2.1 Dual TLB Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
6.2.2 Taking a TLB Trap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
6.3 PERFORMANCE EQUATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
6.4 SOFTWARE CONTROLLED CACHE MEMORY ARCHITECTURE . . . . . 310
6.4.1 Cache Page Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
6.4.2 Data Access TLB Miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
6.4.3 Instruction Access TLB Miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
6.4.4 Data Write TLB Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
6.4.5 Supervisor TLB Signal Handler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
6.4.6 Copying a Page into the Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
6.4.7 Copying a Page Out of the Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 323
6.4.8 Cache Set Locked . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
6.4.9 Returning from Signal Handler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
6.4.10 Support Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
6.4.11 Performance Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Chapter 7
Software Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
7.1 REGISTER ASSIGNMENT CONVENTION . . . . . . . . . . . . . . . . . . . . . . . . . . 331
7.2 PROCESSOR DEBUG SUPPORT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
7.2.1 Execution Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
7.2.2 Memory Access Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
7.2.3 Trace Facility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
7.2.4 Program Counter register PC2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
7.2.5 Monitor Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
7.2.6 Instruction Breakpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
7.2.7 Data Breakpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
7.3 THE MiniMON29K DEBUGGER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
7.3.1 The Target MiniMON29K Component . . . . . . . . . . . . . . . . . . . . . . . 339
7.3.2 Register Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
7.3.3 The DebugCore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
7.3.4 DebugCore installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
7.3.5 Advanced DBG and CFG Module Features . . . . . . . . . . . . . . . . . . 347
7.3.6 The Message System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
7.3.7 MSG Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
7.3.8 MSG Virtual Interrupt Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
7.4 THE OS–BOOT OPERATING SYSTEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
7.4.1 Register Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
7.4.2 OS–boot Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
7.4.3 HIF Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
7.4.4 Adding New Device Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
7.4.5 Memory Access Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
7.4.6 Down Loading a New OS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
7.5 UNIVERSAL DEBUG INTERFACE (UDI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
7.5.1 Debug Tool Developers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
7.5.2 UDI Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
7.5.3 P–trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
7.5.4 The GDB–UDI Connection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
7.5.5 The UDI–MiniMON29K Monitor Connection, MonTIP . . . . . . . . . . 365
7.5.6 The MiniMON29K User–Interface, MonDFE . . . . . . . . . . . . . . . . . 366
7.5.7 The UDI – Instruction Set Simulator Connection, ISSTIP . . . . . . 368
7.5.8 UDI Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
7.5.9 Getting Started with GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
7.5.10 GDB and MiniMON29K Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
7.6 SIMPLIFYING ASSEMBLY CODE DEBUG . . . . . . . . . . . . . . . . . . . . . . . . . . 374
7.7 SOURCE LEVEL DEBUGGING USING A WINDOW INTERFACE . . . . . . 377
7.8 TRACING PROGRAM EXECUTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
7.9 Fusion3D TOOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
7.9.1 NetROM ROM Emulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
7.9.2 HP16500B Logic Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
7.9.3 Selecting Trace Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
7.9.4 Corelis PI – Am29040 Preprocessor . . . . . . . . . . . . . . . . . . . . . . . . 406
7.9.5 Corelis PI – Am29460 Preprocessor . . . . . . . . . . . . . . . . . . . . . . . . 408
Chapter 8
Selecting a Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
8.1 THE 29K FAMILY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
8.1.1 Selecting a Microcontroller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
8.1.2 Moving up to an Am2920x Microcontroller . . . . . . . . . . . . . . . . . . . 431
8.1.3 Selecting a Microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
8.1.4 Reducing the Register Window Size . . . . . . . . . . . . . . . . . . . . . . . . 443
Appendix A
HIF Service Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
A.1 Service Call Numbers And Parameters . . . . . . . . . . . . . . . . . . . . . . 450
A.2 Error Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Appendix B
HIF Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.1 User Trampoline Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.2 Library Glue Routines to HIF Signal Services . . . . . . . . . . . . . . . . . 518
B.3 The Library signal() Routine for Registering a Handler . . . . . . . . . 519
Appendix C
Software Assigned Trap Numbers . . . . . . . . . . . . . . . . . . . . . . . . . 522
Appendix D
DebugCore 2.0 Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
D.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
D.2 REGISTER USAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
D.3 DEBUGCORE 1.0 ENHANCEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
D.3.1 Executing OS Service Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
D.3.2 Per–Process Breakpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
D.3.3 Current PID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
D.3.4 Virtual or Physical Breakpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
D.3.5 Breakpoint Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
D.4 MODULE INTERCONNECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
D.4.1 The DebugCore 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
D.4.2 The Message System 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
D.4.3 The DebugCore 2.0 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 539
References and Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
Figures
Figure 1-1. RISC Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Figure 1-2. CISC Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Figure 1-3. Processor Price–Performance Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Figure 1-4. Am29000 Processor 3–bus Harvard Memory System . . . . . . . . . . . . . . . . . . . . 9
Figure 1-5. The Instruction Window for Out–of–Order Instruction Issue . . . . . . . . . . . . . . 24
Figure 1-6. A Function Unit with Reservation Stations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 1-7. Register Dependency Resolved by Register Renaming . . . . . . . . . . . . . . . . . . . 28
Figure 1-8. Circular Reorder Buffer Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Figure 1-9. Multiple Function Units with a Reorder Buffer . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 1-10. Instruction Decode with No Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . 31
Figure 1-11. Four–Instruction Decoder with Branch Prediction . . . . . . . . . . . . . . . . . . . . . 32
Figure 1-12. Am29200 Microcontroller Address Space Regions . . . . . . . . . . . . . . . . . . . . . 35
Figure 1-13. Am29200 Microcontroller Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Figure 1-14. General Purpose Register Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 1-15. Special Purpose Register Space for the Am29000 Microprocessor . . . . . . . . 51
Figure 1-16. Am29000 Processor Program Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 1-17. Additional Special Purpose Registers for the Monitor Mode Support . . . . . . 56
Figure 1-18. Additional Special Purpose Registers for the Am29050 Microprocessor . . . 57
Figure 1-19. Additional Special Purpose Registers for Breakpoint Control . . . . . . . . . . . . 57
Figure 1-20. Additional Special Purpose Registers for On–Chip Cache Control . . . . . . . . 58
Figure 1-21. Additional Special Purpose Register for the Am29050 Microprocessor . . . . . 61
Figure 1-22. Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Figure 1-23. Frequently Occurring Instruction–Field Uses . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 1-24. Pipeline Stages for BTC Miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 1-25. Pipeline Stages for a BTC Hit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 1-26. Data Forwarding and Bad–Load Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 1-27. Register Initialization Performed by sim29 . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 2-1. Cache Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Figure 2-2. Overlapping Activation Record Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Figure 2-3. 29K Microcontroller Interrupt Control Register . . . . . . . . . . . . . . . . . . . . . . . . 141
Figure 2-4. Processing Interrupts with a Signal Dispatcher . . . . . . . . . . . . . . . . . . . . . . . . 146
Figure 3-1. The EXTRACT Instruction uses the Funnel Shifter . . . . . . . . . . . . . . . . . . . . . 166
Figure 3-2. LOAD and STORE Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Figure 3-3. General Purpose Register Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Figure 3-4. Global Register gr1 Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Figure 3-5. Trace–Back Tag Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Figure 3-6. Walking Back Through Activation Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Figure 3-7. Interrupt Procedure Tag Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Figure 4-1. Interrupt Handler Execution Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Figure 4-2. The Format of Special Registers CPS and OPS . . . . . . . . . . . . . . . . . . . . . . . . 197
Figure 4-3. Interrupted Load Multiple Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Figure 4-4. Am29000 Processor Interrupt Enable Logic . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Figure 4-5. Interrupt Queue Entry Chaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Figure 4-6. An Interrupt Queuing Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Figure 4-7. Queued Interrupt Execution Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Figure 4-8. Saved Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Figure 4-9. Register and Stack Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Figure 4-10. Stack Upon Interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Figure 4-11. Stack After Fix–up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Figure 4-12. Long–Jump to Setjmp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Figure 5-1. A Consistent Register Stack Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Figure 5-2. Current Procedures Activation Record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Figure 5-3. Overlapping Activation Records Eventual Spill Out of the
Register Stack Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Figure 5-4. Context Save PCB Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Figure 5-5. Register Stack Cut–Across . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Figure 5-6. Instruction Cache Tag and Status bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Figure 5-7. Am29240 Microcontroller Cache Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Figure 5-8. Am29240 Data Cache Tag and Status bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Figure 5-9. Am29040 2–bus Microprocessor Cache Data Flow . . . . . . . . . . . . . . . . . . . . . 284
Figure 5-10. Am29040 Data Cache Tag and Status bits . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Figure 6-1. Average Cycles per Instruction Using DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Figure 6-2. Average Cycles per Instruction Using SRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Figure 6-3. Block Diagram of Example Joint I/D System . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Figure 6-4. Average Cycles per Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Figure 6-5. Probability of a TLB Access per Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Figure 6-6. TLB Field Composition for 4K Byte Page Size . . . . . . . . . . . . . . . . . . . . . . . . . 302
Figure 6-7. Block Diagram of Am29000 processor TLB Layout . . . . . . . . . . . . . . . . . . . . . 303
Figure 6-8. Am29000 Processor TLB Register Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Figure 6-9. TLB Register Format for Processor with Two TLBs . . . . . . . . . . . . . . . . . . . . . 306
Figure 6-10. TLB Miss Ratio for Joint I/D 2–1 SRAM System . . . . . . . . . . . . . . . . . . . . . . . 309
Figure 6-11. Average Cycles Required per TLB Miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Figure 6-12. PTE Mapping to Cache Real Page Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Figure 6-13. Software Controlled Cache, K bytes paged–in . . . . . . . . . . . . . . . . . . . . . . . . 314
Figure 6-14. Probability of a Page–in Given a TLB Miss . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Figure 6-15. TLB Signal Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Figure 6-16. Cache Performance Gains with the Assembly Utility . . . . . . . . . . . . . . . . . . . 328
Figure 6-17. Cache Performance Gains with NROFF Utility . . . . . . . . . . . . . . . . . . . . . . . 329
Figure 6-18. Comparing Cache Based Systems with DRAM Only Systems . . . . . . . . . . . . . 329
Figure 7-1. 29K Development and Debug Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
Figure 7-2. MiniMON29K Debugger Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Figure 7-3. 29K Target Software Module Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Figure 7-4. Vector Table Assignment for DebugCore 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Figure 7-5. Processor Initialization Code Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Figure 7-6. Operating System Information Passed to dbg_control() . . . . . . . . . . . . . . . . . . 345
Figure 7-7. Return Structure from dbg_control() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Figure 7-8. Typical OS–boot Memory Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Figure 7-9. Currently Available Debugging Tools that Conform to UDI Specification . . . . 362
Figure 7-10. The UDB to 29K Connection via the GIO Process . . . . . . . . . . . . . . . . . . . . . 378
Figure 7-11. UDB Main Window Showing Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
Figure 7-12. UDB Window Showing the Assembly Code Associated with the
Previous Source Code Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
Figure 7-13. UDB Window Showing Global Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
Figure 7-14. HP16500B Logic Analyzer Window Showing State Listing . . . . . . . . . . . . . . . . 386
Figure 7-15. Path Taken By Am29040 Recursive Trace Processing Algorithm . . . . . . . . . . . 390
Figure 7-16. UDB Console Window Showing Processed Trace Information . . . . . . . . . . . . . 395
Figure 7-17. UDB Trace Window Showing Processed Trace Information . . . . . . . . . . . . . . 396
Figure 7-18. PI–Am29460 Preprocessor Trace Capture Scheme . . . . . . . . . . . . . . . . . . . . . 409
Figure 7-19. PI–Am29460 Preprocessor Trace Capture Timing . . . . . . . . . . . . . . . . . . . . . 410
Figure 7-20. Slave Data Supporting Am29460 Traceable Cache . . . . . . . . . . . . . . . . . . . . 412
Figure 7-21. RLE Output Queue From Reorder Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Figure 8-1. 29K Microcontrollers Running the LAPD Benchmark
With 16 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Figure 8-2. 29K Microcontrollers Running the LAPD Benchmark
With 20 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
Figure 8-3. 29K Microcontrollers Running the LAPD Benchmark
With 25 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
Figure 8-4. 29K Microcontrollers Running the LAPD Benchmark
With 33 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
Figure 8-5. Am2920x Microcontrollers Running the LAPD Benchmark with
8–bit and 16–bit Memory Systems Operating at 12 and 16 MHz . . . . . . . . . . . 432
Figure 8-6. 29K Microprocessors Running the LAPD Benchmark
with 16 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
Figure 8-7. 29K Microprocessors Running the LAPD Benchmark
with 20 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Figure 8-8. 29K Microprocessors Running the LAPD Benchmark
with 25 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
Figure 8-9. 29K Microprocessors Running the LAPD Benchmark
with 33 MHz Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Figure 8-10. Am29040 Microprocessors Running the LAPD Benchmark
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
Figure 8-11. Am29200 Microcontroller Running the LAPD Benchmark
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
Figure 8-12. Am29040 Microprocessors Running the Stanford Benchmark
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
Figure 8-13. Reduction In Worst–Case Asynchronous Task Context Switch Times
with Various Register Stack Window Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Figure A-1. HIF Register Preservation for Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Figure D-1. 29K Target Software Module Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Figure D-2. Data Structure Shared by Operating System and DebugCore 2.0 . . . . . . . . . . 529
Figure D-3. DebugCore 2.0 Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Figure D-4. OS Information Passed to dbg_control() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Figure D-5. Return Structure from dbg_control() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Figure D-6. DebugCore 2.0 Receive Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
Figure D-7. Message System 1.0 Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
Figure D-8. Configuration Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
Tables
Table 1-1. Pin Compatible 3–bus 29K Family Processors . . . . . . . . . . . . . . . . . . . . . . . . . 8
Table 1-2. Pin Compatible 2–bus 29K Family Processors . . . . . . . . . . . . . . . . . . . . . . . . . 14
Table 1-3. Am2920x Microcontroller Members of 29K Processor Family . . . . . . . . . . . . . . 34
Table 1-4. Am2924x Microcontroller Members of 29K Processor Family . . . . . . . . . . . . . . 42
Table 1-5. 3–bus Processor Memory Modeling Parameters for sim29 . . . . . . . . . . . . . . . . 76
Table 1-6. 3–bus Processor DRAM Modeling Parameters for sim29 (continued) . . . . . . . . 77
Table 1-7. 3–bus Processor Static Column Modeling Parameters for sim29 (continued) . . 77
Table 1-8. 3–bus Processor Memory Modeling Parameters for sim29 (continued) . . . . . . . 78
Table 1-9. 2–bus Processor Memory Modeling Parameters for older sim29 . . . . . . . . . . . . 78
Table 1-10. 2–bus Processor Memory Modeling Parameters for newer sim29 . . . . . . . . . . 79
Table 1-11. Microcontroller Memory Modeling Parameters for sim29 . . . . . . . . . . . . . . . . 81
Table 1-12. Microcontroller Processor Memory Modeling Parameters for newer sim29 . . 83
Table 2-1. Trap Handler Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Table 2-2. HIF Service Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Table 2-3. HIF Service Call Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Table 2-4. HIF Service Call Parameters (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Table 3-1. Integer Arithmetic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Table 3-2. Integer Arithmetic Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Table 3-3. Compare Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Table 3-4. Compare Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Table 3-5. Logical Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Table 3-6. Shift Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Table 3-7. Data Move Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Table 3-8. Data Move Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Table 3-9. Constant Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Table 3-10. Floating–Point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Table 3-11. Floating–Point Instructions (Concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Table 3-12. Branch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Table 3-13. Miscellaneous Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Table 4-1. Global Register Allocations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Table 4-2. Expanded Register Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Table 5-1. 29K Family Instruction and Data Cache Support . . . . . . . . . . . . . . . . . . . . . . . . 271
Table 5-2. Instruction Cache Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Table 5-3. Data Cache Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Table 5-4. PGM Field of the Am29040 Microprocessor TLB . . . . . . . . . . . . . . . . . . . . . . . . 286
Table 6-1. PGM Field of the Am29040 Microprocessor TLB . . . . . . . . . . . . . . . . . . . . . . . . 307
Table 7-1. 29K Family On-chip Debug Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Table 7-2. UDI–p Procedures (Version 1.2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Table 7-3. ptrace() Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
Table 7-4. GDB Remote–Target Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Table 7-5. PI–Am29040 Logic Analyzer Pod Assignment . . . . . . . . . . . . . . . . . . . . . . . . . 407
Table 7-6. PI–Am29460 Logic Analyzer Pod Assignment . . . . . . . . . . . . . . . . . . . . . . . . . 415
Table 8-1. Memory Access Times for Am2920x Microcontroller ROM Space . . . . . . . . . . . 426
Table 8-2. ROM and FLASH Memory Device Access Times . . . . . . . . . . . . . . . . . . . . . . . . 427
Table 8-3. Memory Access Times for Am2924x Microcontroller ROM Space . . . . . . . . . . . 427
Table 8-4. Cache Block Reload Times for Various Memory Types . . . . . . . . . . . . . . . . . . . . 436
Table A-1. HIF Open Service Mode Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
Table A-2. Default Signals Handled by HIF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
Table A-3. HIF Signal Return Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
Table A-4. HIF Error Numbers Assigned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Table A-5. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
Table A-6. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Table A-7. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
Table A-8. HIF Error Numbers Assigned (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
Table A-9. HIF Error Numbers Assigned (concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Table C-1. Software Assigned Trap Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
Table C-2. Software Assigned Trap Numbers (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . 523
Table C-3. Software Assigned Trap Numbers (concluded) . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Preface
The first edition of this book brought together, for the first time, a
comprehensive collection of information required by the person developing software
for the Advanced Micro Devices 29K family of RISC microprocessors and
microcontrollers. This second edition contains all the material from the first. In
addition it adds many new topics such as performance evaluation and on–chip cache
operation. Topics such as interrupt processing and software debugging are extended
with the addition of new techniques. The book is useful to the computer professional
and student interested in the 29K family RISC implementation. It does not assume
that the reader is familiar with RISC techniques.
Although certain members of the 29K family are equally suited to the
construction of a workstation or an embedded application, the material is mainly
applicable for embedded application development. This slant should be appreciated by
most readers, since from early in the 29K’s introduction AMD has promoted the family as
a collection of processors spanning a wide range of embedded performance.
Additionally, in recent years, AMD introduced a range of microcontrollers, initially with
the Am29200. The inclusion of on–chip peripherals in the microcontroller
implementations resulted in this particular extension to the family being well
received by the embedded processor community.
The success of the 29K family, and of RISC technology in general, has created
considerable interest within the microprocessor industry. A growing number of
engineers are evaluating RISC, and an increasing number are selecting RISC rather
than CISC designs for new products. Higher processor performance is the main
reason cited for adopting new RISC designs. This book describes the methods used
by the 29K family, many of which are characteristic of the RISC approach, to
obtain a performance gain over CISC processors. Many of the processor and
software features described are compared with an equivalent CISC method; this
should assist the engineer making the CISC to RISC transition.
Because the 29K family architecture reveals the processor’s internal pipeline
operation much more than a CISC architecture, a better understanding of how the
software can control the hardware and avoid resource conflicts is required to obtain
the best performance. Up to this point, software engineers have had to glean
information about programming the 29K family from scattered application notes,
conference proceedings and other publications. In addition, much of the necessary
information has never been documented. This has led to a number of difficulties,
particularly where the most efficient use of the RISC design features is sought.
The material presented is practical rather than theoretical. Each chapter is in a
somewhat standalone form, reducing the need to read earlier chapters before later
chapters are studied. Many of the code examples are directly usable in real embedded
systems rather than as student exercises. Engineers planning on using the 29K
family will be able to extract useful code sequences from the book for integration into
their own designs. Much of the material presented has been used by AMD, and other
independent companies, in building training classes for computer professionals
wishing to quickly gain an understanding of the 29K family.
This book is organized as follows:
Chapter 1 describes the architectural characteristics of the 29K RISC
microprocessor and microcontroller family. The original family member, the
Am29000 processor, is described first. Then the family tree evolution is dealt with in
terms of each member’s particular features. Although all 29K processors are
application code compatible they are not all pin compatible. The ability of the 29K
family to be flexible in its memory requirements is presented. In addition, the chapter
shows the importance of keeping the RISC pipeline busy if high performance is to be
achieved.
Chapter 2 deals with application programming. It covers the main topics
required by a software developer to produce code for execution on a 29K.
Application coding is done in a high level language and the chapter assumes the C
language is most widely used. The dual register and memory stack technique used by
the 29K procedure calling–convention is described in detail, along with the process
of maintaining the processor’s local register file as a cache for the top of the register
stack. Application programs require runtime support. The library services typically
used by developers make demands upon operating system services. The Host
Interface (HIF) specifies a set of such operating system services. The HIF services are
described and their relevance put in context.
Chapter 3 explains how to program a 29K at assembly level. Methods of
partitioning and accessing a processor’s register space are described. This includes the
special register space which can only be reached by assembly level instructions. The
reader is shown how to deal with such topics as branch delay slots and memory access
latency. Application programs are not expected to be developed in assembly
language; rather, assembly language coding skills are required by the operating
system developer. Some developers may only be required to utilize assembly coding
to implement, say, a small interrupt handler routine.
Chapter 4 deals with the complex subject of 29K interrupts. Because 29K
processors make no use of microcode, the range of interrupt handler options is
extended over the typical CISC type processor. Techniques new to the reader familiar
with CISC, such as lightweight interrupts and interrupt context caching, are
presented. Most application developers are moving toward writing interrupt
handlers in a high level language, such as C. This chapter describes the process of
preparing the 29K to handle a C level signal handler after taking an interrupt or trap.
Chapter 5 deals with operating system issues. It describes, in detail, the process
of performing an application task context switch. This is one of the major services
performed by an operating system. A detailed knowledge of the utilized
procedural–linkage mechanism and 29K architectural features is required to
implement a high performance context switch. Also dealt with are issues concerning
the operation and maintenance of on–chip instruction and data memory cache.
Chapter 6 describes the Translation Look–Aside Buffer (TLB) which is
incorporated into many of the 29K family members. Its use as a basic building block
for a Memory Management Unit (MMU) is described. This chapter also
demonstrates the use of the TLB to implement a software–controlled cache which
improves overall system performance.
Chapter 7 explains the operation of popular software debugging tools such as
MiniMON29K and GDB. The process of building a debug environment for an
embedded application is described. Also dealt with is the Universal Debug Interface
(UDI) which is used to connect the user–interface process with the process
controlling the target hardware. The use of UDI introduces new freedom in tool
choice to the embedded product developer.
Chapter 8 helps with the sometimes difficult task of processor selection.
Performance benchmarks are presented for all the current 29K family members. The
effect that on–chip cache and memory system performance have on overall system
performance is quantified. Systems are considered in terms of their performance and software
programming requirements.
Although I am the sole author of this book, I would like to thank my colleagues
at Advanced Micro Devices for their help with reviewing early manuscripts. I am
also grateful for their thoughtful suggestions, many of which were offered during the
porting of 4.3bsd UNIX to the Am29000 processor. I would also like to thank Grant
Maxwell for his helpful comments and in particular his review of chapters 1, 5 and 8.
Bob Brians also extensively reviewed the first edition and suggested a number of
improvements; he also made many helpful comments when he reviewed the
manuscript for this second edition. Mike Johnson and Steve Guccione reviewed the
section introducing superscalar processors. Chip Freitag reviewed chapter 8 and
helped me improve its quality. Discussions with Leo Lozano helped resolve many of
the issues concerning cache operation dealt with in chapter 5. Thanks also to
Embedded Systems Programming for allowing the use of material describing the
GDB debugger which first appeared in their volume 5 number 12 issue. Embedded
System Engineering is also thanked for allowing the reuse of material describing the
Am29040 processor and Architectural Simulator. Finally, I would like to thank the
Product Marketing Department of AMD’s Embedded Processor Division, for their
encouragement to complete this second edition.
Chapter 1
Architectural Overview
This chapter deals with a number of topics relevant to the selection of a 29K
family member. General RISC architecture characteristics are discussed before each
family member is described in more detail. A RISC microprocessor can achieve high
performance only if its pipeline is kept effectively busy, and the reasons for this are
explained. Finally, the architectural simulator is described; it is an important tool in
evaluating a processor’s performance.
The instruction set of the 29K family was designed to closely match the internal
representation of operations generated by optimizing compilers. Instruction
execution times are not burdened by redundant instruction formats and options. CISC
microprocessors trap computational sequences in microcode. Microcode is a set of
sequences of internal processor operations combined to perform a machine instruction.
A CISC microprocessor contains an on–chip microprogram memory to hold the
microcode required to support the complex instructions. It is difficult for a compiler to
select CISC instruction sequences which result in the microcode being efficiently
applied to the overall computational task. The myopic microcode results in processor
operational overhead. The compiler for a CISC cannot remove the overhead; it can
only reduce it by making the best selection from the array of instruction options and
formats, such as addressing modes. The compiler for a 29K RISC can exploit lean
instructions whose operation is free of microcode and always visible to the compiler
code–generator.
Each 29K processor has a 4–stage RISC pipeline consisting of a fetch stage
followed by decode, execute and write–back stages. Instructions, with few
exceptions, execute in a single cycle. Although instructions are streamlined, they still
support operations on two source operands, placing the result in a third operand.
Registers are used to supply operands for most instructions, and the processor contains a
large number of registers to reduce the need to fetch data from off–chip memory.
When external memory is accessed it is via explicit load and store operations, and
never via extended instruction addressing modes. The large number of registers
within the processor’s register file acts effectively as a cache for program data.
However, the implementation of a multiport register file is superior to a conventional data
cache as it enables simultaneous access to multiple operands.
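The benefit of single–cycle execution in a 4–stage pipeline can be illustrated with a little arithmetic. The following is a minimal sketch of an idealized pipeline model, not code taken from any AMD tool; the function names and the stall_cycles parameter are assumptions made for illustration. It assumes that one instruction enters the pipeline per cycle unless a stall occurs.

```c
#include <stdint.h>

/* Stages named in the text: fetch, decode, execute, write-back. */
enum { PIPELINE_STAGES = 4 };

/*
 * Cycles needed to complete `instructions` single-cycle instructions,
 * plus `stall_cycles` cycles in which the pipeline could not advance.
 * Idealized model; the helper name and parameters are hypothetical.
 */
uint64_t pipeline_cycles(uint64_t instructions, uint64_t stall_cycles)
{
    if (instructions == 0)
        return 0;
    /* The first instruction needs one cycle per stage; every later
       instruction completes one cycle after its predecessor. */
    return (PIPELINE_STAGES - 1) + instructions + stall_cycles;
}

/* Average cycles per instruction for the same run. */
double cycles_per_instruction(uint64_t instructions, uint64_t stall_cycles)
{
    if (instructions == 0)
        return 0.0;
    return (double)pipeline_cycles(instructions, stall_cycles)
           / (double)instructions;
}
```

Under this model, pipeline_cycles(100, 0) gives 103 cycles, so the average approaches one cycle per instruction as the run lengthens, while every stall cycle adds directly to the total. This is why keeping the pipeline busy matters so much.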
Parameter passing between procedure calls is supported by dynamically sized
register windows. Each procedure’s register window is allocated from a stack of 128
32–bit registers. This results in a very efficient procedure call mechanism, and is
responsible for considerable operational benefits compared to the typical CISC
method of pushing and popping procedure parameters from a memory stack.
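The idea of allocating variable–sized windows from a fixed pool of 128 registers can be sketched in C. This is a deliberately simplified model rather than the 29K calling convention itself, which later chapters describe in detail. The RegStack type and the window_alloc and window_free functions are hypothetical names, and the model ignores the overlap between caller and callee windows that is used for parameter passing. It only shows that a call normally costs no memory traffic, and that older window contents are moved to memory (spilled) or brought back (filled) only when the 128 registers are exhausted.

```c
#include <assert.h>
#include <stdint.h>

enum { LOCAL_REGS = 128 };  /* 32-bit registers available for windows */

/* Hypothetical bookkeeping for the sketch; not a hardware structure. */
typedef struct {
    uint32_t resident;  /* registers currently holding window data */
    uint32_t spilled;   /* words of older windows saved to memory  */
} RegStack;

/* Allocate a window of `size` registers on procedure entry.
   Returns the number of words that had to be spilled to memory. */
uint32_t window_alloc(RegStack *rs, uint32_t size)
{
    assert(size <= LOCAL_REGS);
    uint32_t spill = 0;
    if (rs->resident + size > LOCAL_REGS) {
        /* Not enough free registers: push the oldest data out. */
        spill = rs->resident + size - LOCAL_REGS;
        rs->resident -= spill;
        rs->spilled += spill;
    }
    rs->resident += size;
    return spill;
}

/* Release a window of `size` registers on procedure return.
   `caller_size` registers of the caller must be resident afterwards.
   Returns the number of words filled back from memory. */
uint32_t window_free(RegStack *rs, uint32_t size, uint32_t caller_size)
{
    assert(size <= rs->resident);
    rs->resident -= size;
    uint32_t fill = 0;
    if (rs->resident < caller_size) {
        /* Part of the caller's window was spilled earlier. */
        fill = caller_size - rs->resident;
        assert(fill <= rs->spilled);
        rs->spilled -= fill;
        rs->resident += fill;
    }
    return fill;
}
```

Starting from an empty RegStack, allocating a 100–register window and then a 40–register window spills 12 words on the second call; freeing the 40–register window with a caller size of 100 fills those 12 words back. Typical windows are far smaller, so in this model most calls and returns touch no memory at all.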
Processors in the 29K family also make use of other techniques usually
associated with RISC, such as delayed branching, to keep the instruction–hungry
RISC fed and prevent pipeline stalling.
The freedom from microcode not only benefits the effectiveness of the instruction
processing stream, but also benefits the interrupt and trap mechanism required to
support such events as external hardware interrupts. The preparations performed by
29K hardware for interrupt processing are very brief, and this lightweight approach
enables programmers to define their own interrupt architecture, selecting the
optimizations which are best for, say, interrupt throughput or short latency in
commencing handler processing.
The 29K family includes 3–bus Harvard memory architecture processors,
2–bus processors which have simplified and flexible memory system interfaces, and
microcontrollers with considerable on–chip system support. The range is extensive,
yet User mode instruction compatibility is achieved across the entire family [AMD
1993a]. Within each family–grouping, there is also pin compatibility. The family
supports the construction of a scalable product range with regard to performance and
system cost. For example, all of the performance of the top–end processor configura-
tions may not be required, or be appropriate, in a product today but it may be neces-
sary in the future. Because of the range and scalability of the family, making a com-
mitment to 29K processor technology is an investment supported by the ability to
scale–down or scale–up a design in the future. Many of the family’s advantages are
attained through the flexibility in memory architecture choice. This is significant because
of the important impact a memory system can have on performance, overall cost, and
design and test time [Olson 1988][Olson 1989].
The microcontroller family members contain all the necessary RAM and ROM
interface glue–logic on–chip, permitting memory devices to be directly connected to
the processor. Given that memory systems need only be 8–bit or 16–bit wide, the
introduction of these devices should hasten the selection of embedded RISC in future
product designs. The use of RISC need not be considered an expensive option in
terms of system cost or hardware and software design times. Selecting RISC is not
only the correct decision for expensive workstation designs, but increasingly for a
wide range of performance and price sensitive embedded products.
1.1 A RISC DEFINITION
The process of dealing with an instruction can be broken down into stages (see
Figure 1-1). An instruction must then flow through the pipeline of stages before its
processing is complete. Independent hardware is used at each pipeline stage. In-
formation is passed to subsequent pipeline stages at the completion of each processor
cycle. At any instant, the pipeline stages are processing several instructions which are
each at a different stage of completion. Pipelining increases the utilization of the pro-
cessor hardware, and effectively reduces the number of processor cycles required to
process an instruction.
Instruction #1   fetch    decode   execute  write–back
Instruction #2            fetch    decode   execute     write–back
Instruction #3                     fetch    decode      execute
cycle            t        t+1      t+2
                 (each column is 1 cycle)
Figure 1-1. RISC Pipeline
With a 4–stage pipeline an instruction takes four cycles to complete, assuming
the pipeline stages are clocked at each processor cycle. However, the processor is
able to start a new instruction at each new processor cycle, and the average proces-
sing time for an instruction is reduced to 1–cycle. Instructions which execute in
1–cycle have only 1–cycle latency as their results are available to the next instruc-
tion.
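The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustrative model, assuming an ideal pipeline which never stalls; the function names are hypothetical and are not part of any 29K tool.

```python
def pipeline_cycles(instructions, stages=4):
    """Cycles needed to run `instructions` through an ideal, stall-free pipeline."""
    if instructions == 0:
        return 0
    # The first instruction needs one cycle per stage; each later
    # instruction completes one cycle after its predecessor.
    return stages + (instructions - 1)


def average_cycles_per_instruction(instructions, stages=4):
    """Average processing time per instruction for the same ideal pipeline."""
    return pipeline_cycles(instructions, stages) / instructions
```

A single instruction takes four cycles, but over 1000 instructions the average is 1.003 cycles, which approaches the 1–cycle figure quoted above.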
The 4–stage pipeline of the 29K processor family supports a simplified execute
stage. This is made possible by simplifying instruction formats, limiting instruction
complexity and operating on data held in registers. The simplified execute stage
means that only a single processor cycle is required to complete execute–stage
processing and the cycle time is also minimized.
CISC processors support a complex execution–stage which requires several
processor cycles to complete. When an instruction is ready for execution it is broken
down into a sequence of microinstructions (see Figure 1-2). These simplified
instructions are supplied by the on–chip microprogram memory. Each microinstruc-
tion must be decoded and executed separately before the instruction execution–stage
is complete. Depending on the amount of microcode needed to implement a CISC
instruction, the number of cycles required to complete instruction processing varies
from instruction to instruction.
[Diagram: for each instruction, a fetch stage is followed by a microcode program of four decode (dec) and execute (exe) steps; the time axis is marked t, t+1, t+2, t+3 in 1–cycle units]
Figure 1-2. CISC Pipeline
Because the hardware used by the execute–stage of a CISC processor is utilized
for a number of processor cycles, the other stages of the pipeline have additional
cycles available for their own operation. For example, if an execute–stage requires four
processor cycles, the overlapping fetch–stage of the next instruction has four cycles
to complete. If the fetch–stage takes four or fewer cycles, then no stalling of the
pipeline due to execute–stage starvation will occur. Starvation or pipeline stalling occurs
when a previous stage has not completed its processing and cannot pass its results to
the input of the next pipeline stage.
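The overlap rule can be written down directly. The following Python sketch is a simplified two–stage view (fetch overlapping execute only); the function name is hypothetical.

```python
def fetch_stall_cycles(fetch_cycles, execute_cycles):
    """Cycles the execute stage is starved while waiting for the next fetch."""
    # The fetch of the next instruction overlaps the current execute stage,
    # so only the part of the fetch which outlasts it stalls the pipeline.
    return max(0, fetch_cycles - execute_cycles)
```

With a four–cycle execute stage, a fetch of four or fewer cycles costs nothing, while a six–cycle fetch would starve the execute stage for two cycles.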
During the evolution of microprocessors, earlier designs operated with slower
memories than are available today. Both processor and memory speeds have seen
great improvements in recent years. However, the low cost of high performance
memory devices now readily available has shifted microprocessor design. When
memory was slow it made sense to overlap multicycle instruction fetch stages with
multicycle execute stages. Once an instruction had been fetched it was worthwhile
getting as much execute–value as possible since the cost of fetching the instruction
was high. This approach drove processor development and led to the name Complex
Instruction Set Computer.
Faster memory means that instruction processing times are no longer fetch–
stage dominated. With a reduction in the number of cycles required by the fetch–
stage, the execute–stage becomes the dominant factor in determining processor per-
formance. Consequently attention turned to the effectiveness of the microcode se-
quences used to perform CISC instruction execution. Careful analysis of CISC
instruction usage revealed that the simpler instructions were much more frequently
used than the complex ones which required long microcode sequences. The conclusion
drawn was that microcode rarely provides the exact sequence of operations required
to support a high level language instruction.
The variable execution times of CISC instructions result in complex
pipeline management. It is also more difficult for a compiler to work out the
execution times for different combinations of CISC instructions. For that matter it is
harder for the assembly level programmer to estimate the execution times of, say, an
interrupt handler code sequence compared to the equivalent RISC code sequence.
More importantly, streamlining pipeline operations enables reduced processor cycle
times and greater control by a compiler of the processor’s operation. Given that the
execute–stage dominates performance, the RISC approach is to fetch more instruc-
tions which can be simply executed. Although a RISC program may contain 20%
more instructions than a program for a CISC, the total number of cycles required to
perform a task is reduced.
A number of processor characteristics have been proposed in the press as indica-
tive of RISC or CISC. Many of these proposals are made by marketing departments
which wish to control markets by using RISC and CISC labels as marketing rather
than engineering expressions. I consider a processor to be RISC if it is microcode free
and has a simple instruction execute–stage which can complete in a single cycle.
1.2 FAMILY MEMBER FEATURES
Although this book is about Programming the 29K RISC Family, the following
sections are not restricted to only describing features which can be utilized by soft-
ware. They also briefly describe key hardware features which affect a processor’s
performance and hence its selection.
All members of the family have User mode binary code compatibility. This
greatly simplifies the task of porting application code from one processor to another.
Some system–mode code may need to be changed due to differences in such things as
field assignments of registers in special register space.
Given the variation between family members such as the 3–bus Am29050 float-
ing–point processor and the Am29205 microcontroller, it is remarkable that there is
so much software compatibility. The number of family members is expected to continue
to grow; but already there is a wide selection enabling systems of varying
performance and cost to be constructed (see Figure 1-3). If AMD continues to grow the
family at “both ends of the performance spectrum”, we might expect to see new mi-
crocontroller family members as well as superscalar microprocessors [Johnson
1991]. AMD has stated that future microprocessors will be pin compatible with the
current 2–bus family members.
I think one of the key features of 29K family members is their ability to operate
with varying memory system configurations. It is possible to build very high performance
Harvard type architectures, or low cost (high access latency) DRAM
based systems. Two types of instruction memory caching are supported. Branch Target
Cache (BTC) memory is used in 3–bus family members to hide memory access
latencies. The 2–bus family members make use of more conventional bandwidth
improving instruction cache memory.
[Plot: 29K processor MIPS (axis marked 10 to 40) against cost, with regions labelled microcontrollers, 3–bus processors, and 2–bus processors]
Figure 1-3. Processor Price–Performance Summary
The higher performance 2–bus processors and microcontrollers have on–chip
data cache. When cache hit ratios are high, processing speeds can be decoupled from
memory system speeds; especially when the processor is clocked at a higher speed
than the off–chip memory system.
A second key feature of processors in the 29K family is that the programmer
must supply the interrupt handler save and restore mechanism. Typically a CISC type
processor will save the processor context, when an exception occurs, in accordance
with the on–chip microcode. The 29K family is free of microcode, making the user
free to tailor the interrupt and exception processing mechanism to suit the system.
This often leads to new and more efficient interrupt handling techniques. The fast
interrupt response time and large interrupt handling capacity made possible by the
flexible architecture have been cited among the key reasons for selecting a 29K
processor design.
All members of the 29K family make some use of burst–mode memory inter-
faces. Burst–mode memory accesses provide a simplified transfer mechanism for
high bandwidth memory systems. Burst–mode addressing only applies to consecutive
access sequences; it is used for all instruction fetches and for load–multiple and
store–multiple data accesses.
The 3–bus microprocessors are dependent on burst–mode addressing to free–up
the address bus after a new instruction fetch sequence has been established. The
memory system is required to supply instructions at sequential addresses without the
processor supplying any further address information; at least until a jump or call type
instruction is executed. This makes the address bus free for use in data memory ac-
cess.
The non 3–bus processors cannot simultaneously support instruction fetching
and data access from external memory. Consequently the address bus continually
supplies address information for the instruction or data access currently being sup-
ported by the external memory. However, burst–mode access signals are still sup-
plied by the processor. Indicating that the processor will require another access at the
next sequential address, after the current access is complete, is an aid in achieving
maximum memory access bandwidth. There are also a number of memory devices
available which are internally organized to give highest performance when accessed
in burst–mode.
1.3 THE Am29000 3–BUS MICROPROCESSOR
The Am29000 processor is pin compatible with other 3–bus members of the
family (see Table 1-1) [AMD 1989][Johnson 1987]. It was the first member of the
family, introduced in 1987. It is the core processor for many later designs, such as the
current 2–bus processor product line. Much of this book describes the operation of
the Am29000 processor as the framework for understanding the rest of the family.
The processor can be connected to separate instruction and data memory systems,
thus exploiting the Harvard architectural advantages (see Figure 1-4). Alternatively,
a simplified 2–bus system can be constructed by connecting the instruction and
data busses together; this enables a single memory system to be constructed.
When the full potential of the 3–bus architecture is utilized, it is usually necessary to
include in the memory system a bridge to enable instruction memory to be accessed
as data. The processor does not support any on–chip means to transfer information on the
instruction bus to the data bus.
The load and store instructions, used for all external memory access, have an
option field (OPT2–0) which is presented to device pins during the data transfer op-
eration. Option field value OPT=4 is defined to indicate the bridge should permit
ROM space to be read as if it were data. Instructions can be located in two separate
spaces: Instruction space and ROM space. Often these spaces become the same, as
the IREQT pin (instruction request type) is not decoded so as to enable distinction
between the two spaces. When ROM and Instruction spaces are not common, a range
of data memory space can be set aside for accessing Instruction space via the bridge.
It is best to avoid overlapping external address spaces if high level code is to access
any memory located in the overlapping regions (see section 1.10.4).
Table 1-1. Pin Compatible 3–bus 29K Family Processors

Processor                   Am29000                   Am29050                   Am29005
Instruction Cache           BTC, 32x4 words           BTC, 64x4 or 128x2 words  No
I–Cache Associativity       2 Way                     2 Way                     N/A
Data Cache                  –                         –                         –
D–Cache Associativity       –                         –                         –
On–Chip Floating–Point      No                        Yes                       No
On–Chip MMU                 Yes                       Yes                       No
Integer Multiply in h/w     No                        Yes                       No
Programmable Bus Sizing     No                        No                        No
On–Chip Interrupt           Yes                       Yes                       Yes
  Controller Inputs         6                         6                         6
Scalable Bus Clocking       No                        No                        No
Burst–mode Addressing       Yes, up to 1K bytes       Yes, up to 1K bytes       Yes, up to 1K bytes
Freeze Mode Processing      Yes                       Yes                       Yes
Delayed Branching           Yes                       Yes                       Yes
On–Chip Timer               Yes                       Yes                       Yes
On–Chip Memory Controller   No                        No                        No
DMA Channels                –                         –                         –
Byte Endian                 Big/Little                Big/Little                Big/Little
JTAG Debugging              No                        No                        No
Clock Speeds (MHz)          16,20,25,33               20,25,33,40               16
[Diagram: the Am29000 RISC processor and a coprocessor, with 32–bit ADDRESS, INSTRUCTION and DATA busses connecting to ROM, instruction memory, an instruction bridge, data memory and input/output]
Figure 1-4. Am29000 Processor 3–bus Harvard Memory System
All processors in the 29K family support byte and half–word size read and write
access to data memory. The original Am29000 (pre rev–D, 1990) only supported
word sized data access. This resulted in read–modify–write cycles to modify sub–
word sized objects. The processor supports insert– and extract–byte and half–word
instructions to assist with sub–word operations. These instructions are little used
today.
The processor has a Branch Target Cache (BTC) memory which is used to sup-
ply the first four instructions of previously taken branches. Successful branches are
20% of a typical instruction mix. Using burst–mode and interleaving techniques,
memory systems can sustain the high bandwidths required to keep the instruction–hungry
RISC fed. However, when a branch occurs, memory systems can present considerable
latency before supplying the first instruction of the branch target. For example,
consider an instruction memory system which has a 3–cycle first access latency
but can sustain 1–cycle access in burst–mode. Typically every 5th instruction is a
branch and, for the example, the branch instruction would effectively take 5 cycles to
complete its execution (the pipeline would be stalled for 4 cycles; see section 1.13).
If all other instructions were executed in a single cycle, the average number of cycles per
instruction would be 1.8 (i.e. 9/5), not the desired sustained single–cycle operation.
The BTC can hide all 3 cycles of memory access latency, and enable the branch
instruction to execute in a single cycle.
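The average quoted in this example can be reproduced with a short calculation. The sketch below is illustrative only; the function name and its parameters are hypothetical, and it assumes that every non–branch instruction takes the same number of cycles.

```python
def average_cpi(branch_interval, branch_cycles, other_cycles=1):
    """Average cycles per instruction when one instruction in every
    `branch_interval` is a taken branch costing `branch_cycles` cycles."""
    total = (branch_interval - 1) * other_cycles + branch_cycles
    return total / branch_interval
```

Calling `average_cpi(5, 5)` gives 9/5, the 1.8 figure above; with the branch reduced to a single cycle by a BTC hit, `average_cpi(5, 1)` returns 1.0.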
The programmer has little control over BTC operation; it is maintained internal-
ly by processor hardware. There are 32 cache entries (known as cache blocks) of four
instructions each. They are configured in a 2–way set associative arrangement. En-
tries are tagged to distinguish between accesses made in User mode and Supervisor
mode; they are also tagged to differentiate between virtual addresses and physical
addresses. Because the address in the program counter is presented to the BTC at the
same time it is presented to the MMU, the BTC does not operate with physical ad-
dresses. Entries are not tagged with per–process identifiers; consequently the BTC
cannot distinguish between identical virtual addresses belonging to different
processes operating with virtual addressing. Systems which operate with multiple tasks
using virtual addressing must invalidate the cache when a user–task context switch
occurs. Using the IRETINV (interrupt return and invalidate) instruction is one con-
venient way of doing this.
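The tagging scheme, and the reason a context switch must invalidate the cache, can be seen in a toy software model. The Python sketch below is not a description of the hardware: the class and method names are hypothetical, the choice of address bits used to select an entry is an assumption, and filling an invalid entry before evicting a valid one is also an assumption. Only the tags are modelled, not the cached instructions.

```python
import random


class BranchTargetCache:
    """Toy model of a 2-way BTC with 16 entries per set (32 entries in all)."""

    ENTRIES_PER_SET = 16
    SETS = 2

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        # Each slot holds a tag tuple, or None when the entry is invalid.
        self._tags = [[None] * self.ENTRIES_PER_SET for _ in range(self.SETS)]

    def _index_and_tag(self, address, supervisor, physical):
        # ASSUMPTION: the entry is selected by the address bits just above
        # the 16-byte (four-instruction) block offset.
        index = (address >> 4) % self.ENTRIES_PER_SET
        # Mode and address-type flags are part of the tag; there is no
        # per-process identifier.
        return index, (address, supervisor, physical)

    def lookup(self, address, supervisor=False, physical=False):
        """Return True on a hit; on a miss, install the entry and return False."""
        index, tag = self._index_and_tag(address, supervisor, physical)
        if any(self._tags[s][index] == tag for s in range(self.SETS)):
            return True
        for s in range(self.SETS):
            if self._tags[s][index] is None:   # ASSUMPTION: fill invalid slots first
                self._tags[s][index] = tag
                return False
        # Both slots are valid: replace one at random.
        self._tags[self._rng.randrange(self.SETS)][index] = tag
        return False

    def invalidate(self):
        """Discard every entry, as required on a user-task context switch."""
        for s in range(self.SETS):
            for i in range(self.ENTRIES_PER_SET):
                self._tags[s][i] = None
```

Two tasks which branch to the same virtual address produce identical tags in this model, so the second task would hit on the first task's entry unless `invalidate()` is called between them.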
The BTC is able to hold the instructions of frequently taken trap handler rou-
tines, but there is no means to lock code sequences into the cache. Entries are replaced
in the cache on a random basis, the most recently occurring branches replacing the
current entries when necessary.
The 3–bus members of the 29K family can operate the shared address bus in
a pipeline mode. If a memory system is able to latch an address before an instruction
or data transfer is complete, the address bus can be freed to start a subsequent access.
Allowing two accesses to be in progress simultaneously can be effectively used by
the separate instruction and data memory systems of a Harvard architecture.
1.3.1 The Am29005
The Am29005 is pin compatible with other 3–bus members of the family (see
Table 1-1). It is an inexpensive version of the Am29000 processor. The Translation
Look–Aside Buffer (TLB) and the Branch Target Cache (BTC) have been omitted. It
is available at a lower clock speed, and only in the less expensive plastic packaging. It
is a good choice for systems which are price sensitive and do not require Memory
Management Unit support or the performance advantages of the BTC. An Am29005
design can always be easily upgraded with an Am29000 replacement later. In fact the
superior debugging environment offered by the Am29000 or the Am29050 may
make the use of one of these processors a good choice during software debugging. The
faster processor can always be replaced by an Am29005 when production com-
mences.
1.4 THE Am29050 3–BUS FLOATING–POINT MICROPROCESSOR
The Am29050 processor is pin compatible with other 3–bus members of the
family (see Table 1-1) [AMD 1991a]. Many of the features of the Am29050 have
already been described in the section on its close relative, the Am29000.
The Am29050 processor offers a number of additional performance and system sup-
port features when compared with the Am29000. The most notable is the direct
execution of double–precision (64–bit) and single–precision (32–bit) floating–point
arithmetic on–chip. The Am29000 has to rely on software emulation or the
Am29027 floating–point coprocessor to perform floating–point operations. The
introduction of the Am29050 eliminated the need to design the Am29027 coproces-
sor into floating–point intensive systems.
The processor contains a Branch Target Cache (BTC) memory system like the
Am29000; but this time it is twice as big, with 32 entries in each of the two sets rather
than the Am29000’s 16 entries per set. BTC entries are not restricted to four instruc-
tions per entry; there is an option (bit CO in the CFG register) to arrange the BTC as
64 entries per set, with each entry containing two instructions rather than four. The
smaller entry size is more useful with lower latency memory systems. For example, if
a memory system has a 2–cycle first–access start–up latency it is more efficient to
have a larger number of 2–instruction entries. After all, for this example system, the
third and fourth instructions in a four per entry arrangement could just as efficiently
be fetched from the external memory.
The Am29050 also incorporates an Instruction Forwarding path which addi-
tionally helps to reduce the effects of instruction memory access latency. When a new
instruction fetch sequence commences, and the target of the sequence is not found in
the BTC, an external memory access is performed to start filling the Instruction Pre-
fetch Buffer (IPB). With the Am29000 processor the fetch stage of the processor
pipeline is fed from the IPB, but the Am29050 can by–pass the fetch stage and feed
the first instruction directly into the decode pipeline stage using the instruction for-
warding technique. By–passing also enables up to four cycles of external memory
latency to be hidden when a BTC hit occurs (see section 1.10).
The Am29050 incorporates a Translation Look–Aside Buffer (TLB) for
Memory Management Unit support, just like the Am29000 processor. However it
also has two region mapping registers. These permit large areas of memory to be
mapped without using up the smaller TLB entries. They are very useful for mapping
large data memory regions, and their use reduces the TLB software management
overhead.
The processor can also speed up data memory accesses by making the access
address available a cycle earlier than the Am29000. The method is used to reduce
the latency of memory load operations, which have a greater influence on pipeline
stalling than store operations. Normally the address of a load appears on the address bus at the start
of the cycle following the execution of the load instruction. If virtual addressing is in
use, then the TLB registers are used to perform address translation during the second
half of the load execute–cycle. To save a cycle, the Am29050 must make the physical
address of the load available at the start of the load instruction execution. It has two
ways of doing this.
The access address of a load instruction is specified by the RB field of the
instruction (see Figure 1–13). A 4–entry Physical Address Cache (PAC) memory is
used to store the most recent load addresses. The cache entries are tagged with RB field
register numbers. When a load instruction enters the decode stage of the pipeline, the
RB field is compared with one of the PAC entries, using a direct mapping technique,
with the lower 2–bits of the register number being used to select the PAC entry. When
a match occurs the PAC supplies the address of the load, thus avoiding the delay of
reading the register file to obtain the address from the register selected by the RB field
of the LOAD instruction. If a PAC miss occurs, the new physical address is written to
the appropriate PAC entry. The user has no means of controlling the PAC; its opera-
tion is completely determined by the processor hardware.
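The direct–mapped lookup can be sketched in software. The Python model below is illustrative only; the class and method names are hypothetical, and it covers just the behaviour described above. How the hardware keeps an entry consistent when the register is later changed is not covered in the text and is not modelled.

```python
class PhysicalAddressCache:
    """Toy model of the 4-entry, direct-mapped PAC."""

    def __init__(self):
        # Each entry is (register_number, physical_address), or None if empty.
        self._entries = [None] * 4

    def lookup(self, rb):
        """Return the cached load address for register `rb`, or None on a miss."""
        entry = self._entries[rb & 0x3]   # lower 2 bits of RB select the entry
        if entry is not None and entry[0] == rb:
            return entry[1]
        return None

    def fill(self, rb, physical_address):
        """After a miss, record the address used by the load through `rb`."""
        self._entries[rb & 0x3] = (rb, physical_address)
```

Registers whose numbers share the same lower two bits, such as 65 and 69, compete for one entry, so alternating loads through them would miss each time in this model.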
The second method used by the Am29050 processor to reduce the effect of pipe-
line stalling occurring as a result of memory load latency is the Early Address Gener-
ator (EAG). Load addresses are frequently formed by preceding the load with
CONST, CONSTH and ADD type instructions. These instructions prepare a general
purpose register with the address about to be used during the load. The EAG circuitry
continually generates addresses formed by the use of the above instructions in the
hope that a load instruction will immediately follow and use the address newly
formed by the preceding instructions. The EAG must make use of the TLB address
translation hardware in order to make the physical address available at the start of the
load instruction. This happens when, fortunately, the RB field of the load instruction
matches with the destination register of the previous address computation instruc-
tions.
Software debugging is better supported on the Am29050 processor than on any
other current 29K family member. All 29K processors have a trace facility which en-
ables single stepping of processor instructions. However, prior to the Am29050 pro-
cessor, tracing did not apply to the processor operation while the DA bit (disable all
traps and interrupts) was set in the current processor status (CPS) register. The DA bit
is typically set while the processor is operating in Freeze mode (FZ bit set in the CPS
register). Freeze mode code is used during the entry and exit of interrupt and trap
handlers, as well as other critical system support code. The introduction of Monitor
mode operation with the Am29050 enables tracing to be extended to Freeze mode
code debugging. The processor enters Monitor mode when a synchronous trap oc-
curs while the DA bit is set. The processor is equipped with a second set of PC buffer
registers, known as the shadow PC registers, which record the PC–bus activity while
the processor is operating in Monitor mode. The first set of PC buffer registers have
their values frozen when Freeze mode is entered.
The addition of two hardware breakpoint registers aids the Am29050 debug
support. As instructions move into the execute stage of the processor pipeline, the
instruction address is compared with the break address values. The processor takes a
trap when a match occurs. Software debug tools, such as monitors like Mini-
MON29K, used with other 29K family members, typically use illegal instructions to
implement breakpoints. The use of breakpoint registers has a number of advantages
over this technique. Breakpoints can be placed in read–only memories, and break
addresses need not be physical; they can be virtual, tagged with the per–process identifier.
1.5 THE Am29030 2–BUS MICROPROCESSOR
The Am29030 processor is pin compatible with other 2–bus members of the
family (see Table 1-2) [AMD 1991b]. It was the first member of the 2–bus family,
introduced in 1991. Higher device construction densities enable it to offer high
performance with a simplified system interface design. From a software point of view
the main differences between it and the Am29000 processor occur as a result of re-
placing the Branch Target Cache (BTC) memory with 8k bytes of instruction cache,
and connecting the instruction and data busses together on–chip. However, the sys-
tem interface busses have gained a number of important new capabilities.
The inclusion of an instruction cache memory reduces off–chip instruction
memory access bandwidth requirements. This enables instructions to be fetched via
the same device pins used by the data bus. Only when instructions can not be supplied
by the cache is there contention for access to external memory. Research [Hill 1987]
has shown that with cache sizes above 4k bytes, a conventional instruction cache is
more effective than a BTC. At these cache sizes the bandwidth requirements are
sufficiently reduced as to make a shared instruction/data bus practicable.

Table 1-2. Pin Compatible 2–bus 29K Family Processors

Processor                   Am29030                   Am29035                   Am29040
Instruction Cache           8K bytes                  4K bytes                  8K bytes
I–Cache Associativity       2–Way                     Direct–Mapped             2–Way
Data Cache (Physical)       –                         –                         4K bytes
D–Cache Associativity       –                         –                         2–Way
On–Chip Floating–Point      No                        No                        No
On–Chip MMU                 Yes                       Yes                       Yes
Integer Multiply in h/w     No                        No                        Yes, 2–cycles
Narrow Memory Reads         Yes, 8/16 bit             Yes, 8/16 bit             Yes, 8/16 bit
Programmable Bus Sizing     No                        Yes, 16/32 bit            Yes, 16/32 bit
On–Chip Interrupt           Yes                       Yes                       Yes
  Controller Inputs         6                         6                         6
Scalable Clocking           1x,2x                     1x,2x                     1x,2x
Burst–mode Addressing       Yes, up to 1K bytes       Yes, up to 1K bytes       Yes, up to 1K bytes
Freeze Mode Processing      Yes                       Yes                       Yes
Delayed Branching           Yes                       Yes                       Yes
On–Chip Timer               Yes                       Yes                       Yes
On–Chip Memory Controller   No                        No                        No
DMA Channels                –                         –                         –
Byte Endian                 Big/Little                Big/Little                Big/Little
JTAG Debugging              Yes                       Yes                       Yes
Clock Speeds (MHz)          20,25,33                  16                        0–33,40,50
Each cache entry (known as a block) contains four consecutive instructions.
They are tagged in a similar manner to the BTC mechanism of the Am29000 proces-
sor. This allows cache entries to be used for both User mode and Supervisor mode
code at the same time, and entries to remain valid during application system calls and
system interrupt handlers. However, since entries are not tagged with per–process
identifiers, the cache entries must be invalidated when a task context switch occurs.
The cache is 2–way set associative. The 4k bytes of instruction cache provided by
each set results in 256 entries per set (each entry being four instructions, i.e. 16 bytes).
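The entry count follows from the cache size, the number of sets and the block size. A minimal Python sketch of that arithmetic is given here; the function name is hypothetical, and it assumes 4–byte instructions as used throughout the 29K family.

```python
def entries_per_set(cache_bytes, sets, instructions_per_block, bytes_per_instruction=4):
    """Number of blocks (entries) held by each set of an instruction cache."""
    block_bytes = instructions_per_block * bytes_per_instruction
    bytes_per_set = cache_bytes // sets
    return bytes_per_set // block_bytes
```

For the 8k byte, 2–way cache with four–instruction blocks, `entries_per_set(8 * 1024, 2, 4)` returns 256.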
When a branch instruction is executed and the block containing the target
instruction sequence is not found in the cache, the processor fetches the missing
block and marks it valid. Complete blocks are always fetched, even if the target
instruction lies at the end of the block. However, the cache forwards instructions to
the decoder without waiting for the block to be reloaded. If the cache is not disabled
and the block to be replaced in the cache is not valid–and–locked, then the fetched
block is placed in the cache. The 2–way cache associativity provides two possible
cache blocks for storing any selected memory block. When a cache miss occurs, and
both associated blocks are valid but not locked, a block is chosen at random for re-
placement.
Locking valid blocks into the cache is not provided for on a per–block basis but
in terms of the complete cache or one set of the two sets. When a set is locked, valid
blocks are not replaced; invalid blocks will be replaced and marked valid and locked.
Cache locking can be used to preload the cache with instruction sequences critical to
performance. However, it is often difficult to use cache locking in a way that can out–
perform the supported random replacement algorithm.
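The replacement rules described above can be summarized in a small decision function. This Python sketch is illustrative only; the function name and the representation of a block as a (valid, locked) pair are hypothetical, and the preference for an invalid block over a valid one is an assumption rather than a documented hardware rule.

```python
import random


def choose_victim(blocks, rng=random):
    """Pick which candidate block a newly fetched block may replace.

    `blocks` is a list of (valid, locked) pairs, one per set.
    Returns an index into `blocks`, or None if nothing may be replaced.
    """
    # ASSUMPTION: an invalid block is filled before any valid block is evicted.
    for i, (valid, _locked) in enumerate(blocks):
        if not valid:
            return i
    # Valid-and-locked blocks are never replaced.
    candidates = [i for i, (_valid, locked) in enumerate(blocks) if not locked]
    if not candidates:
        return None
    # Random replacement among the unlocked blocks.
    return rng.choice(candidates)
```

When both blocks are valid and locked the function returns None, which corresponds to the fetched block being forwarded to the decoder without being placed in the cache.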
The processor supports Scalable Clocking which enables the processor to operate
at the same or twice the speed of the off–chip memory system. A 33 MHz processor
could be built around a 16.5 MHz memory system, and depending on cache utilization
there may be little drop–off in performance compared to having constructed
a 33 MHz memory system. This provides for higher system performance without in-
creasing memory system costs or design complexity. Additionally, a performance
upgrade path is provided for systems which were originally built to operate at lower
speeds. The processor need merely be replaced by a pin–compatible higher frequen-
cy device (at higher cost) to realize improved system performance.
Memory system design is further simplified by enforcing a 2–cycle minimum
access time for data and instruction accesses. Even if 1–cycle burst–mode is sup-
ported by a memory system, the first access in the burst is hardwired by the processor
to take 2–cycles. This is effective in relaxing memory system timing constraints and
generally appreciated by memory system designers. The high frequency operation of
the Am29030 processor can easily result in electrical noise [AMD 1992c]. Enforcing
2–cycle minimum access times ensures that the address bus has more time to settle
before the data bus is driven. This reduces system noise compared with the data bus
changing state during the same cycle as the address bus.
At high processor clock rates, it is likely that an interleaved memory system will
be required to obtain bandwidths able to sustain 1–cycle burst mode access. Interleaving
requires the construction of two, four or more memory systems (known as
banks), which are used in sequence. When accessed in burst–mode, each bank is giv-
en more time to provide access to its next storage location. The processor provides an
input pin, EARLYA (early address), by which a memory system can request early ad-
dress generation by the processor. This can be used to simplify the implementation of
interleaved memory systems. When requested, the processor provides early the ad-
dress of even–addressed banks, allowing the memory system to begin early accesses
to both even– and odd–addressed banks.
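The way consecutive burst accesses rotate around the banks can be sketched in Python. The mapping used here, in which the low–order bits of the word address select the bank, is an assumption made for illustration; the names bank_for and burst_banks are hypothetical and do not describe any particular memory system.

```python
WORD_BYTES = 4  # the data bus is 32 bits wide


def bank_for(address, banks=2):
    """Return the bank servicing a byte address (assumed low-order interleave)."""
    return (address // WORD_BYTES) % banks


def burst_banks(start, words, banks=2):
    """List the banks visited, in order, by a burst of consecutive words."""
    return [bank_for(start + i * WORD_BYTES, banks) for i in range(words)]


print(burst_banks(0x100, 4))           # two banks, used alternately
print(burst_banks(0x104, 4, banks=4))  # four banks, used in rotation
```

With two banks, each bank is revisited only on every second word of the burst, which is the source of the extra access time each bank is given.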
The processor can operate with memory devices which are not the full 32–bit
width of the data bus. This is achieved using the Narrow Read capability. Memory
systems which are only 8–bit or 16–bit wide are connected to the upper bits of the
data/instruction bus. They assert the RDN (read narrow) input pin along with the
RDY (ready) pin when responding to access requests. When this occurs the processor
will automatically perform the necessary sequences of accesses to assemble instruc-
tions or data which are bigger than the memory system width.
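The sequence of accesses can be pictured with a short Python sketch which assembles one 32–bit value from 8–bit or 16–bit reads. It assumes big–endian ordering (the most significant part is read first), and read_narrow is a hypothetical callback standing in for the narrow memory; the processor performs the equivalent steps in hardware.

```python
def assemble_word(read_narrow, address, width_bits):
    """Build a 32-bit value from successive narrow reads (big-endian assumed)."""
    if width_bits not in (8, 16):
        raise ValueError("narrow memory is 8 or 16 bits wide")
    step = width_bits // 8          # bytes delivered by each narrow access
    word = 0
    for offset in range(0, 4, step):
        word = (word << width_bits) | read_narrow(address + offset)
    return word


rom = bytes([0x12, 0x34, 0x56, 0x78])   # hypothetical 8-bit ROM contents
print(hex(assemble_word(lambda a: rom[a], 0, 8)))
```

An 8–bit memory therefore needs four accesses per instruction and a 16–bit memory needs two.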
The Narrow Read ability cannot be used for data writing. However, it is very
useful for interfacing to ROM which contains system boot–up code. Only a single
8–bit ROM may be required to contain all the necessary system initialization code.
This can greatly simplify system design and reduce board space and cost. The ROM can
be used to initialize system RAM which, due to its 32–bit width, will permit faster
execution.
1.5.1 Am29030 Evaluation.
AMD provides a low cost evaluation board for the Am29030 at 16 MHz, known
as the EZ030 (pronounced easy–030). Like the microcontroller evaluation board, it is
a standalone board, requiring an external 5–volt power supply and connection to a
remote computer via an RS–232 link. The board is very small, measuring about 4 inches
by 4 inches (10x10 cm). The memory system is restricted to 16 MHz operation but
with scalable clocking the processor can run at 16 MHz or 33 MHz.
It contains 128k bytes of EPROM, which is accessed via 8–bit narrow bus proto-
col. There is also 1M byte of DRAM arranged as 256kx32 bits. The DRAM is ex-
pandable to 4M bytes. The EPROM is preprogrammed with the MiniMON29K de-
bug monitor and the OS–boot operating system described in Chapter 7.
1.5.2 The Am29035
The Am29035 processor is pin compatible with other 2–bus members of the
family (see Table 1-2). As would be expected, given the AMD product number, its
operation is very similar to that of the Am29030 processor. It is only available at
lower clock frequencies than its close relative, and it has half the amount of
instruction cache memory: it contains one of the two sets provided by the Am29030.
That is, it has 4k bytes of instruction cache which is directly mapped. Consequently
it can be expected to operate with reduced overall performance.
In all other aspects it is the same as the Am29030 processor, except it has Pro-
grammable Bus Sizing which the Am29030 processor does not. Programmable Bus
Sizing provides for lower cost system designs. The processor can be dynamically
programmed (via the configuration register) to operate with a 16–bit instruction/data
bus, performing both read and write operations. When the option is selected, 32–bit
data is accessed by the processor hardware automatically performing two consecu-
tive accesses. The ability to operate with 16–bit and 32–bit memory systems makes
the 2–bus 29K family members well suited to scalable system designs, in terms of
cost and performance.
1.6 THE Am29040 2–BUS MICROPROCESSOR
The Am29040 processor is pin compatible with other 2–bus members of the
family (see Table 1-2). The processor was introduced in 1994 and offers higher per-
formance than the 2–bus Am29030; it also has a number of additional system support
facilities.
There is an enhanced instruction cache, now 8k bytes, which is tagged in much
the same way as the Am29030’s instruction cache, except there are four valid bits per
cache block (compared to the Am29030’s one bit per block). Partially filled blocks
are supported, and block reload begins with the first required instruction (target of a
branch) rather than the first instruction in the block. An additional benefit of having a
valid bit per–instruction rather than per–block is that load or store instructions can
interrupt cache reload. With the Am29030 processor, once cache reload had started,
it could not be postponed or interrupted by a higher priority LOAD instruction.
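The effect of having one valid bit per instruction can be sketched in Python. The model is a single four–word cache block; the class name ICacheBlock, the fetch callback, and the choice to fill only from the target word to the end of the block (without wrapping around) are assumptions made for illustration.

```python
class ICacheBlock:
    """One instruction-cache block with a valid bit per word (illustrative)."""

    WORDS = 4

    def __init__(self):
        self.valid = [False] * self.WORDS
        self.words = [None] * self.WORDS

    def reload(self, fetch, first_word, max_words=WORDS):
        """Fill from first_word onward; stop early after max_words (interrupted)."""
        filled = 0
        for i in range(first_word, self.WORDS):
            if filled == max_words:
                break                      # reload interrupted, block left partial
            self.words[i] = fetch(i)
            self.valid[i] = True
            filled += 1
        return filled


block = ICacheBlock()
block.reload(lambda i: 0x1000 + 4 * i, first_word=2)   # branch target is word 2
print(block.valid)
```

Because each word carries its own valid bit, a block left partially filled by an interrupted reload is still usable for the words which did arrive.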
The Am29040 was the first 29K microprocessor to have a data cache. The 4k
byte data cache is physically addressed and supports both “copy–back” and “write–
through” policies. Like other 29K Family members, the data cache always operates
with physical addresses and cache blocks are only allocated on LOAD instructions
which miss (a “read–allocate” or “load–allocate” policy). The block size is 16 bytes
and there is one valid bit per block. This means that complete data blocks must be
fetched when data cache reload occurs. Burst mode addressing is used to reload a
block, starting with the first word in the block. The addition of a data cache makes the
Am29040 particularly well–suited to high–performance data handling applications.
The default data cache policy is “copy–back”. A four word copy–back buffer is
used to improve the performance of the copy–back operation. Additionally, cache
blocks have an M bit–field, which becomes set when data in the block is modified. If
the M bit is not set when a cache block is reallocated, the out–going block is not co-
pied back.
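The interaction of the load–allocate policy with the M bit can be sketched in Python. This is a toy model under stated assumptions, not the Am29040 design: the cache is treated as direct–mapped, each block holds a single value rather than four words, and the copy–back buffer is not modelled. The class name DataCache is hypothetical.

```python
BLOCK_BYTES = 16                      # block size given in the text
NUM_BLOCKS = 4096 // BLOCK_BYTES      # 4k byte cache -> 256 blocks


class DataCache:
    """Toy copy-back cache: direct-mapped (assumed), one value per block."""

    def __init__(self, memory):
        self.memory = memory          # dict: block number -> block contents
        self.lines = {}               # index -> [tag, data, m_bit]
        self.copy_backs = 0

    def load(self, address):
        block = address // BLOCK_BYTES
        index, tag = block % NUM_BLOCKS, block // NUM_BLOCKS
        line = self.lines.get(index)
        if line is None or line[0] != tag:            # load miss: allocate
            if line is not None and line[2]:          # outgoing block modified
                self.memory[line[0] * NUM_BLOCKS + index] = line[1]
                self.copy_backs += 1
            line = [tag, self.memory.get(block, 0), False]
            self.lines[index] = line
        return line[1]

    def store(self, address, value):
        block = address // BLOCK_BYTES
        index, tag = block % NUM_BLOCKS, block // NUM_BLOCKS
        line = self.lines.get(index)
        if line is not None and line[0] == tag:       # store hit: set M bit
            line[1], line[2] = value, True
        else:                                         # store miss: no allocation
            self.memory[block] = value
```

A block is written to memory only when it is displaced with its M bit set; an unmodified block is simply overwritten.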
When a data cache is added to a processor, there can be difficulties dealing with
data consistency. Problems arise when there is more than one processor or data con-
troller (such as a DMA controller) accessing the same memory region. The Am29040
processor uses bus snooping to solve this problem. The method relies on the proces-
sor monitoring all accesses performed on the memory system. The processor inter-
venes or updates its cache when an access is attempted on a currently cached data
value. Cache consistency is dealt with in detail in section 5.14.4.
Via the MMU, each memory page can be separately marked as “non cached”,
“copy–back”, or “write–through”. A two word write–through buffer is used to assist
with writes to memory. It enables multiple store instructions to be in–execution
without the processor pipeline stalling. Data accesses which hit in the cache require
2–cycle access times. Two cycles, rather than one, are required due to the potentially
high internal clock speed. However, load instructions do not cause pipeline stalling if
the instruction immediately following the load does not require the data being
accessed. The data cache operation is explained in detail in section 5.14.2.
Scalable bus clocking is supported, enabling the processor to run at twice the
speed of the off–chip memory system. Scalable Clocking was first introduced with
the Am29030 processor, and is described in the previous section dealing with the
Am29030. If cache hit rates are sufficiently high, Scalable Clocking enables high
performance systems to be built around relatively slow memory systems. It also of-
fers an excellent upgrade path when additional performance is required in the future.
The maximum on–chip clock speed is 50 MHz.
The Am29040 processor supports integer multiply directly. A latency of two
cycles applies to integer multiply instructions (most 29K instructions require only
one cycle). Again, this is a result of the potentially high internal clocking speeds of
the processor. Most 29K processors take a trap when an integer multiply is attempted.
It is left to trapware to emulate the missing instruction. The ability to perform high
speed multiply makes the processor a better choice for calculation intensive applica-
tions such as digital signal processing. Note, floating–point performance should also
improve with the Am29040 as floating–point emulation routines can make use of the
integer multiply instruction.
The Am29040 has two Translation Look–Aside Buffers (TLBs). Having two
TLBs enables a larger number of virtual to physical address translations to be cached
(held in a TLB register) at any time. This reduces the TLB reload overhead. The TLB
format is similar to the arrangement used with the Am29243 microcontroller. Each
TLB has 16 entries (8 sets, two entries per set). The page size used by each TLB can
be the same or different. If the TLB page sizes are the same, a four–way set associa-
tive MMU can be constructed with supporting software. Alternatively one TLB can
be used for code and the second, with a larger page size, for data buffers or shared
libraries. The TLB entries have a Global Page (GLB) bit; when set, the mapped page
can be accessed by any process regardless of its process identifier (PID). The TLB
also allows parity checking to be enabled on a per–page basis, and pages can be
allocated from 16–bit or 32–bit wide memory regions.
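A lookup in one such TLB can be sketched in Python. The sketch assumes that the low–order bits of the virtual page number select one of the 8 sets and that a full set simply discards its older entry; both choices, along with the names TLBEntry and TLB, are hypothetical and are not taken from the Am29040 documentation.

```python
from dataclasses import dataclass

SETS = 8   # 8 sets, two entries per set, 16 entries in all


@dataclass
class TLBEntry:
    vpn: int            # virtual page number
    pid: int            # process identifier of the owner
    frame: int          # physical page number
    glb: bool = False   # Global Page bit


class TLB:
    def __init__(self, page_bytes):
        self.page_bytes = page_bytes
        self.sets = [[] for _ in range(SETS)]

    def insert(self, entry):
        ways = self.sets[entry.vpn % SETS]
        if len(ways) == 2:
            ways.pop(0)                 # assumed replacement: drop the older entry
        ways.append(entry)

    def translate(self, vaddr, pid):
        vpn, offset = divmod(vaddr, self.page_bytes)
        for entry in self.sets[vpn % SETS]:
            if entry.vpn == vpn and (entry.glb or entry.pid == pid):
                return entry.frame * self.page_bytes + offset
        return None                     # miss: TLB reload is required
```

Two instances of TLB with different page_bytes values would model the arrangement in which code and data use different page sizes.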
On–chip debug support is extended with the inclusion of two Instruction Break-
point Controllers and one Data Breakpoint Controller. This enables inexpensive de-
bug monitors such as the DebugCore incorporated within MiniMON29K to be used
when developing software. Breakpoints are supported when physical or virtual ad-
dressing is in use. The JTAG test interface has also been extended over other 29K
family members to include several new JTAG–processed instructions. The effective-
ness of the JTAG interface for hardware and software debugging is improved.
The Am29040 family grouping is implemented with a silicon process which en-
ables processors to operate at 3.3–volts. However, the device is tolerant of 5–volt in-
put/output signal levels. The lower power consumption achievable at 3.3–volts
makes the Am29040 suitable for hand–held type applications. Note, the device oper-
ates at a maximum clock frequency of 50 MHz.
A 29K processor enters Wait Mode when the Wait Mode bit is set in the Current
Processor Status (CPS) register. Wait Mode is extended to include a Snooze Mode
which is entered from Wait Mode while the interrupt and trap input lines are held in-
active. An interrupt is normally used to depart Wait or Snooze Mode. While in
Snooze mode, Am29040 power consumption is reduced. Returning from Snooze
mode to an interrupt processing state requires approximately 256 cycles. The proces-
sor can be prevented from entering Snooze Mode while in Wait Mode by holding, for
example, the INTR3 input pin active and setting the interrupt mask such as to disable
the INTR3 interrupt.
If the input clock is held high or low while the processor is in Snooze mode,
Sleep Mode is entered. Minimum power consumption occurs in this mode. The pro-
cessor returns to Snooze Mode when the input clock is restarted. Using Snooze and
Sleep modes enables the Am29040 processor to be used in applications which are
very power sensitive.
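The transitions between the three low–power modes can be summarized as a small state machine. The Python sketch below is only a reading of the description above; the function name next_mode, the mode strings, and the "run" result used for normal execution are hypothetical labels.

```python
def next_mode(mode, *, interrupt=False, inputs_inactive=False, clock_running=True):
    """Return the mode after one step; mode is 'wait', 'snooze' or 'sleep'."""
    if mode == "wait":
        if interrupt:
            return "run"                               # interrupt departs Wait Mode
        return "snooze" if inputs_inactive else "wait"
    if mode == "snooze":
        if interrupt:
            return "run"                               # takes roughly 256 cycles
        return "snooze" if clock_running else "sleep"
    if mode == "sleep":
        return "snooze" if clock_running else "sleep"  # clock restart leaves Sleep
    raise ValueError(f"unknown mode: {mode}")


print(next_mode("wait", inputs_inactive=True))
print(next_mode("snooze", clock_running=False))
```

Holding an interrupt input active while it is masked corresponds to inputs_inactive being false, so the processor stays in Wait Mode.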
1.6.1 Am29040 Evaluation.
Like any 29K processor, the Am29040 can be evaluated using the Architectural
Simulator. But for those who wish for real hardware, AMD manufactures a number
of evaluation boards, the most popular being the SE29040 evaluation board. The
board, originally constructed in rev–A form, supports 4M bytes of DRAM (expand-
able to 64M bytes); DRAM timing is 3/1, i.e. 3–cycle first access then 1–cycle burst.
There is also 1M byte of 32–bit wide ROM and space for 1M byte of 2/1 SRAM.
Boards are typically populated with only 128K of SRAM. The memory system clock
speed is 25 MHz and the maximum processor speed of 50 MHz is supported.
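The 3/1 and 2/1 timing notation translates directly into a cycle count for a burst. The Python sketch below works this through for a four–word burst; the burst length is an assumption chosen for illustration, the function name burst_cycles is hypothetical, and the counts are in memory system cycles rather than processor cycles.

```python
def burst_cycles(first, subsequent, words):
    """Memory cycles needed to transfer `words` consecutive words in one burst."""
    if words < 1:
        raise ValueError("a burst transfers at least one word")
    return first + subsequent * (words - 1)


print(burst_cycles(3, 1, 4))   # DRAM with 3/1 timing
print(burst_cycles(2, 1, 4))   # SRAM with 2/1 timing
```

With Scalable Clocking in use, each memory cycle spans two processor cycles, so the cost seen by the processor is doubled.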
There are connections for JTAG and a logic analyzer as well as two UARTs via
an 85C30 serial communications controller. The board requires a 5–volt power
supply and there is a small wire–wrap area for placement of additional system
components.
The later rev–B boards have an additional parallel port and Ethernet connection
(10–base–T). An AMD HiLANCE is used for Ethernet communication. The rev–B
board can also support memory system speeds up to 33 MHz.
1.7 A SUPERSCALAR 29K PROCESSOR
AMD representatives have talked at conferences and to the engineering press
about a superscalar 29K processor. No announcements have yet been made about
when such a processor will be available, but it is generally expected to be in the near
future. At the 1994 Microprocessor Forum, AMD presented a product overview, but
many of the specific details about the processor architecture were not announced.
However, piecing together available information, it is possible to form ideas about
what a superscalar 29K would look like.
This section does not describe a specific processor, but presents the superscalar
techniques which are likely to be utilized. A lead architect of the 29K family, Mike
Johnson, has written a textbook, “Superscalar Microprocessor Design” [Johnson 1991],
which covers the technology in depth. It might be expected that many of
the conclusions drawn in Johnson’s book will appear in silicon in a future 29K pro-
cessor.
AMD has stated that future microprocessors will be pin compatible with the cur-
rent 2–bus family members. This indicates that a superscalar 29K will be pin compat-
ible with the Am29030 and Am29040 processors. It is much more likely that the pro-
cessor will take 2–bus form rather than a microcontroller. User mode instruction
compatibility can also be expected. Given the usual performance increments that
accompany a new processor’s introduction, it will likely sustain twice the
performance of an Am29040 processor. This may be an underestimate, as higher clock rates
or increased use of Scalable Clocking may allow for even higher performance. The
processor is certain to have considerable on–chip instruction and data cache. AMD’s
product overview indicates that 2x, 3x and 4x Scalable Clocking will be supported
and there will be an 8K byte instruction cache and an 8K byte data cache. Also re-
ported was an internal clock speed up to 100 MHz at 3.3–volts.
A superscalar processor achieves higher performance than a conventional sca-
lar processor by executing more than one instruction per cycle. To achieve this it must
have multiple function units which can operate in parallel. AMD has indicated that
the initial superscalar 29K processor will have six function units. Since about
50% of instructions perform integer operations, there will be two integer operation
units, one integer multiplier and one funnel shifter. If a future processor supports
floating–point operations directly, we can expect to see a floating–point execution
unit added. Other execution units are included to deal with off–chip access via load
and store instructions, and to deal with branch instruction execution. All function
units, except the integer multiplier, produce their results in a single cycle.
High speed operation can only be obtained if as many as possible of the function
units can be kept productively busy during the same processor cycles. This will place
a heavy demand on instruction decoding and operand forwarding. Several instruc-
tions will have to be decoded in the same cycle and forwarded to the appropriate
execution unit. The demand for operands for these instructions will be considerably
higher than that faced by a scalar processor. The following sections describe
some of the difficulties encountered when attempting to execute more than one
instruction per cycle. Architectural techniques which overcome the inherent difficul-
ties are presented.
1.7.1 Instruction Issue and Data Dependency
The term instruction issue refers to the passing of an instruction from the
processor decode stage to an execution unit. With a scalar processor, instructions are
issued in–order; that is, in the order the decoder received the instructions from
cache or off–chip memory. Instructions naturally complete in–order. However, with a
RISC processor out–of–order completion is not unusual for certain instructions.
Typically load and store instructions are allowed to execute in parallel with other
instructions. These instructions are issued in–order; they don’t complete immediately
but some time (a few cycles) later. The instructions following loads or stores are
issued and execute in parallel unless there are data dependencies. Dependencies arise
when, for example, a load instruction is followed by an operation on the loaded data.
A superscalar processor can reduce total execution time for a code sequence if it
allows all instruction types to complete out–of–order. Instruction issue need not stop
after an instruction is issued to a function unit which takes multiple cycles to com-
plete. Consequently, function units with long latency may complete their operation
after a subsequent instruction issued to a low latency function unit. The Am29050
processor allows long latency floating–point operations to execute in parallel with
other integer operations. The processor has an additional port on its register file for
writing–back the results of floating–point operations. An additional port is required
to avoid the contention which would arise with an integer operation writing back its
result at the same time. Most instructions are issued to an integer unit which, with a
RISC processor, has only one cycle latency. However, there is very likely to be more
than one integer unit, each operating in parallel.
Write–Read Dependency
Even if a processor is able to support out–of–order instruction completion, it
still must deal with the data dependencies that flow through a program’s execution.
These flow dependencies (often known as true dependencies) represent the movement
of operands between instructions in a program. Examine the code below:
    mul gr96,lr2,lr5 ;write gr96, gr96 = lr2 * lr5
    add gr97,gr96,1 ;read gr96, write–read dependency
The first instruction would be issued to the integer multiply unit; this will have
(according to AMD’s product overview) two cycles of latency. The result is written to
register gr96. The second instruction would be issued to a different integer handling
unit. However, it has a source operand supplied in gr96. If the second instruction had
no data dependencies on the first, it would be easy to issue the instruction while the
first was still in execute. However, execution of the first instruction must complete
before the second instruction can start execution. Steps must be taken to deal with the
data dependency. This kind of dependency is also known as a write–read dependency,
because gr96 must be written by an earlier instruction before a later one can read the
result.
[Diagram: lr2 and lr5 feed mul, which produces gr96; gr96 and 1 feed add, which produces gr97.]
Some superscalar processors, such as the Intel i960 CA, use a
reduced–scoreboarding mechanism to resolve data dependencies [Thorton 1970].
When a register is required for a result, a one–bit flag is set to indicate the register is in
use. Currently in–execute instructions set the scoreboard bit for their result registers.
Before an instruction is issued the scoreboard bit is examined. Further instructions
are not issued if the scoreboard indicates that an in–execute instruction intends to
write a register which supplies a source operand for the instruction waiting for issue.
When an instruction completes, the relevant scoreboard bit is cleared. This may
result in a currently stalled instruction being issued.
It is unlikely a 29K processor will use scoreboarding, and even less likely it will
use a reduced–scoreboarding mechanism, such as that of the i960 CA, which only detects
data dependency for out–of–order instruction completion. A superscalar 29K
processor will support out–of–order instruction issue, which is described shortly.
Scoreboarding can resolve the resulting data dependencies. However, other
techniques, such as register renaming, enable instructions to be decoded and issued
further ahead than is possible with scoreboarding. This will be described in more
detail as we proceed.
Write–Write Dependency
A second type of data dependency can complicate out–of–order instruction
completion. Examine the code sequence shown below:
    mul gr96,lr2,lr5 ;write gr96, gr96 = lr2 * lr5
    add gr97,gr96,1
    add gr96,lr5,1 ;write gr96, write–write dependency
The result of the third instruction has an output dependency on the first
instruction. The third instruction cannot complete before the first. Both instructions
write their results to register gr96, and completing the first instruction last would
result in an out–of–date value being held in gr96. Steps must be taken to deal with the
data dependency. Because the completion of multiple instructions is dependent on
writing gr96 with the correct value, this kind of dependency is also known as a
write–write dependency.
Scoreboarding or reduced–scoreboarding can also resolve write–write
dependencies. Before an instruction is issued, the scoreboard bit for the result register
is tested. If there is a currently in–execute instruction planning on writing to the same
result register, the scoreboard bit will be set. This information can be used to stall
issuing until the result register is available.
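The scoreboard checks described so far can be sketched in a few lines of Python. This is an illustrative model only, with one bit per register represented by membership of a set; the class and method names are hypothetical and the sketch does not describe the i960 CA or any 29K implementation.

```python
class Scoreboard:
    """One in-use bit per register, set while a write to it is pending."""

    def __init__(self):
        self.busy = set()

    def can_issue(self, dest, sources):
        if any(src in self.busy for src in sources):
            return False            # write-read: a source is still being produced
        if dest in self.busy:
            return False            # write-write: result register already claimed
        return True

    def issue(self, dest, sources):
        if not self.can_issue(dest, sources):
            return False            # instruction stalls
        self.busy.add(dest)         # set the scoreboard bit for the result
        return True

    def complete(self, dest):
        self.busy.discard(dest)     # clear the bit; a stalled instruction may go


sb = Scoreboard()
print(sb.issue("gr96", ["lr2", "lr5"]))   # mul gr96,lr2,lr5
print(sb.issue("gr97", ["gr96"]))         # add gr97,gr96,1
print(sb.issue("gr96", ["lr5"]))          # add gr96,lr5,1
```

Only the first of the three instructions is issued; the other two wait until complete("gr96") clears the bit.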
The parallel execution possible with out–of–order completion enables higher
performance than in–order completion, but extra logic is required to deal with data
dependency checking. With in–order instruction issue, instructions can no longer be
issued when a dependency is detected. If instruction issue is to continue when data
dependencies are present, the processor architecture becomes yet more complicated;
but the performance reward is extended beyond that of out–of–order completion
with in–order issue.
Read–Write Dependency
Instruction issuing can continue even when the write–read and write–write
dependencies described above are present. The preceding discussion on data
dependency was restricted to in–order instruction issue. Certainly, when a data
dependency is detected, the unfortunate instruction cannot be issued; but this need
not mean that future instructions cannot be issued. Of course, the future instruction
must be free of any dependencies. With out–of–order instruction issue, instructions
are decoded and placed in an instruction window. Instructions can be issued from the
window when they are free of dependencies and there is an available function unit.
The processes of decoding and executing an instruction are separated by the
instruction window (see Figure 1-5). This does not add an additional pipeline stage to
the superscalar processor. The decoder places instructions into the window. When an
instruction is free of dependencies it can be issued from the window to a function unit
for execution. The instruction window could be implemented as a large buffer within the
instruction decode unit, but this leads to a complex architecture. When an instruction
is issued, the op–code and operands must be communicated to the function unit.
When multiple instructions are issued in a single cycle, a heavy demand is placed on
system busses and register file access ports. An alternative window implementation
is to hold instructions at the function units in reservation stations. This way
instructions are sent during decode to the appropriate function unit along with any
available operands. They are issued from the reservation station (really the window)
when any remaining dependencies are resolved and the function unit is available for
execution. The operation of reservation stations is described in more detail in section
1.7.2.
[Figure: Instruction Decode places the instructions mul gr96,lr2,lr5; add gr97,gr96,1; and add gr96,lr5,1 in the Instruction Window, from which they pass to Instruction Execute.]
Figure 1-5. The Instruction Window for Out–of–Order Instruction Issue
An instruction is issued from the window when its operands are available for
execution. Future instructions may be issued ahead of earlier instructions which
become blocked due to data dependencies. Executing instructions out–of–order
introduces a new form of data dependency not encountered with in–order instruction
issue. Examine the code sequence below:
    mul gr96,lr2,lr5 ;gr96 = lr2 * lr5
    add gr97,gr96,1 ;read gr96
    add gr96,lr5,1 ;write gr96, read–write dependency
The third instruction in the sequence uses gr96 for its result. The second
instruction receives an operand in the same gr96 register. The third instruction
cannot complete and write its result until the second instruction begins execution;
otherwise the second will receive the wrong operand. The result of the third
instruction has an antidependency on the operand to the second instruction. The
dependency is very much like an in–order issue dependency but reversed. This kind
of dependency is also known as a read–write dependency, because gr96 must be read by
the second instruction before the third can write its result to gr96.
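The three kinds of dependency can be found mechanically by comparing the registers each pair of instructions reads and writes. The Python sketch below does this for the example sequence; the function name and the (destination, sources) encoding are hypothetical, constants are left out of the source lists, and the sketch compares every pair of instructions without considering intervening writes.

```python
def dependencies(program):
    """program: list of (dest, sources). Returns (earlier, later, kind, register)."""
    found = []
    for j, (dest_j, src_j) in enumerate(program):
        for i in range(j):
            dest_i, src_i = program[i]
            if dest_i in src_j:
                found.append((i, j, "write-read", dest_i))    # true dependency
            if dest_i == dest_j:
                found.append((i, j, "write-write", dest_i))   # output dependency
            if dest_j in src_i:
                found.append((i, j, "read-write", dest_j))    # antidependency
    return found


sequence = [
    ("gr96", ("lr2", "lr5")),   # mul gr96,lr2,lr5
    ("gr97", ("gr96",)),        # add gr97,gr96,1
    ("gr96", ("lr5",)),         # add gr96,lr5,1
]
for dep in dependencies(sequence):
    print(dep)
```

Instructions are numbered from zero, so the read–write dependency is reported between instructions 1 and 2.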
Registers are used to hold data values. The flow of data through a program is
represented by the registers accessed by instructions. When instructions execute
out–of–order, the flow of data between instructions is restricted by the reuse of
registers to hold different data values. In the above example we want to issue the third
instruction but its use of gr96 creates a problem. The second instruction is receiving,
via gr96, a data value produced by the first instruction. The register label gr96 is
merely used as an identifier for the data flow. What is intended is that data be passed
from the first instruction to the second. If our intentions could be communicated
without restricting data passing to gr96, then the third instruction could be executed
before the second.
The problem can be overcome by using register renaming, see section 1.7.3.
Briefly, when the first instruction in the above example is issued, it writes its result to
a temporary register identified by the name gr96. The second instruction receives its
operand from the same temporary register used by the first instruction. Execution of
the third instruction need not be stalled if it writes its result to a different copy of
register gr96. So now there are multiple copies of gr96. What really happens is
temporary registers are renamed to be gr96 for the duration of the data flow. These
temporary registers play the role of registers indicated by the instruction sequence.
They are tagged to indicate the register they are duplicating.
1.7.2 Reservation Stations
Each function unit has a number of reservation stations which hold instructions
and operands waiting for execution, see Figure 1-6. All the reservation stations for
each function unit combined represent the instruction window from which
instructions are issued. The decoder places instructions into reservation stations
[Tomasulo 1967] with copies of operands, when available. Otherwise operand values
are replaced with tags indicating the register supplying the missing data. Placing a
copy of a source operand into the reservation station when an instruction is decoded
prevents the operand being updated by a future instruction, and hence eliminates
antidependency conflicts. A function unit issues instructions to its execute stage when
it is not busy and a reservation station has an instruction ready for execution. Once an
instruction is placed in a reservation station, its issue occurs regardless of any
instruction issue occurring in another function unit. There can be any number of
reservation stations attached to a function unit. The greater the number, the larger the
instruction window; and the further ahead the processor can decode and issue
instructions. Additionally, a greater number of reservation stations prevents short
term demands on a function unit resulting in decoder–stalling.
An instruction may be stalled in a reservation station when a data dependency
causes a tag, rather than data, to be placed in the operand field. The necessary data
will become available when some other instruction completes and its result is made
available. The instruction producing the required data value may be in a reservation
station or in execution in the same function unit, or in another function unit. Result
values are tagged indicating the register they should be placed in. With a scalar
processor, the result is always written to the instruction’s destination register. But
when register renaming is used by a superscalar processor, results are written to a
register which is temporarily playing the role of the destination register. These
temporary registers, known as copy or duplicate registers, are tagged to indicate the
real register they are duplicating.
[Figure: each reservation station holds an OP–code, a destination tag (reorder–buffer register) for operand C, and a tag or source value for each of operand A and operand B. The reservation stations feed the execution unit, which outputs the result and result tag information; results from other function units are also received.]
Figure 1-6. A Function Unit with Reservation Stations
When a function unit completes an instruction, it places the result along with the
tag information identifying the result register on a result bus. If several function units
complete in the same cycle, there can be competition for the limited number of result
busses. Other function units monitor the result bus (or busses). Their intention is to
obtain the missing operands for instructions held in reservation stations. When they
observe a data value tagged with a register value matching a missing operand, they
copy the data into the reservation station’s operand field. This may enable the
instruction to be issued.
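The capture of a missing operand from the result bus can be sketched in Python. Each operand field holds either a value or a tag; the class name ReservationStation, the snoop method and the tag strings are hypothetical and serve only to illustrate the matching step.

```python
class ReservationStation:
    """Holds one instruction; each operand is ("value", n) or ("tag", name)."""

    def __init__(self, opcode, operands):
        self.opcode = opcode
        self.operands = list(operands)

    def snoop(self, tag, value):
        """Copy a result seen on the result bus into any operand waiting for it."""
        self.operands = [
            ("value", value) if operand == ("tag", tag) else operand
            for operand in self.operands
        ]

    def ready(self):
        return all(kind == "value" for kind, _ in self.operands)


station = ReservationStation("add", [("tag", "RR1"), ("value", 1)])
print(station.ready())        # still waiting for the result tagged RR1
station.snoop("RR1", 42)      # another function unit broadcasts its result
print(station.ready())
```

A broadcast carrying any other tag leaves the station unchanged, so the instruction remains stalled until the matching result appears.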
Once an instruction is placed into a reservation station it will execute in
sequence with other instructions held in other reservation stations within the same
function unit. Of course exceptional events, or the placing of instructions into the
instruction window which represent over speculation, can divert the planned
execution. The instruction window supports speculative instruction decoding. It is
possible that a branch instruction can result in unsuccessful speculation; and the
window must be refilled with instructions fetched from a new instruction sequence.
If a superscalar processor’s performance is to be kept high, it is important that
speculation be successful. For this to be accomplished, branch prediction techniques
must be employed; more on this is in section 1.7.4.
1.7.3 Register Renaming
It was briefly described, in the earlier discussion of read–write
dependency (antidependency), that register renaming can help deal with the conflicts
which arise from the reuse of the same register to hold data values. Of course these
dependencies only arise from the out–of–order instruction issue which occurs with a
superscalar processor. Also described were write–write (output) dependencies,
which occur with even in–order instruction issue when more than one instruction
wishes to write the same result register. Both these types of dependency can be
grouped under the heading storage conflicts. Their interference with concurrent
instruction execution is only temporary. Duplication of the result register for the
duration of the conflict can resolve the dependency and enable superscalar
instruction execution to continue.
The temporary result registers are allocated from a reorder buffer which
consists of 10 registers and supporting tag information. Every new result value is
allocated a new copy of the original assignment register. Copies are tagged to enable
them to be used as source operands in future instructions. Register renaming is shown
for the example code sequence below.
    ;original code          ;code after register renaming
    mul gr96,lr2,lr5        mul RR1,lr2,lr5 ;gr96 = lr2 * lr5
    add gr97,gr96,1         add RR2,RR1,1
    add gr96,lr5,1          add RR3,lr5,1
The write–write dependency between the first and third instructions is resolved
by renaming register gr96 to be register RR3 in the third instruction. Renaming
gr96 to be RR3 in the third instruction also resolves the read–write dependency
between the second and third instructions. Using register renaming, execution of the
third instruction need not be stalled due to storage (register) dependency. Figure 1-7
shows the dependencies before and after register renaming.
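The renaming step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name rename_registers, the tuple layout used for an instruction, and the rule that tag RRn is taken from the instruction's position are all hypothetical, and the sketch ignores the fact that the reorder buffer holds only 10 entries.

```python
def rename_registers(instructions):
    """Give every result a fresh reorder-buffer tag (hypothetical naming).

    Each instruction is a tuple (opcode, dest, src_a, src_b). A source that
    names a register with a pending renamed value is replaced by the newest
    tag allocated for that register.
    """
    newest_tag = {}   # architectural register -> most recent RRn tag
    renamed = []
    for index, (opcode, dest, src_a, src_b) in enumerate(instructions, start=1):
        # Look up the sources before allocating the destination tag, so an
        # instruction never reads its own result.
        src_a = newest_tag.get(src_a, src_a)
        src_b = newest_tag.get(src_b, src_b)
        tag = f"RR{index}"
        newest_tag[dest] = tag
        renamed.append((opcode, tag, src_a, src_b))
    return renamed


original = [
    ("mul", "gr96", "lr2", "lr5"),
    ("add", "gr97", "gr96", "1"),
    ("add", "gr96", "lr5", "1"),
]
renamed = rename_registers(original)
for instruction in renamed:
    print(instruction)
```

Applied to the three–instruction example, the sketch yields destinations RR1, RR2 and RR3, with the second instruction reading RR1 in place of gr96.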
Let’s look in more detail at the operation of the reorder buffer. When an instruc-
tion is decoded and placed in the instruction window (in practice, a reservation sta-
tion), a register in the reorder buffer is assigned to hold the instruction result.
Figure 1-8 shows the format of information held in the reorder buffer. When the
instruction is issued from the reservation station and, at a later time, execution com-
pleted, the result is written to the assigned reorder buffer entry.
If a future instruction refers to the result of a previous instruction, the reorder
buffer is accessed to obtain the necessary value. The reorder buffer is accessed via the
contents of the destination–tag field. This is known as a content–addressable
memory access. A parallel search of the reorder buffer is performed. All memory
locations are simultaneously examined to determine if they have the requested data.
[Figure: dependency graphs for the example code sequence. Before renaming, the
mul and the second add both write “gr96” and the first add writes “gr97”; after
renaming, the three results are held in RR1, RR2 and RR3.]
Figure 1-7. Register Dependency Resolved by Register Renaming
If the instruction producing the result has not yet completed execution, then the
dispatched instruction is provided with a reorder–buffer–tag for the pending data.
For example, the second instruction in the above code sequence would receive
reorder–buffer–tag RR1.
It is likely that the reorder buffer contains entries which are destined (tagged) for
the same result register. When the reorder buffer is accessed with a destination–tag
which has multiple entries, the reorder buffer provides the most recent entry. This
ensures the most recently assigned (according to instruction decode) value is used. In
such a case, the older entry could be discarded; but it is kept in case an exceptional
event, such as an interrupt or trap, occurs.
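The selection of the most recent entry can be modeled as follows. This is a minimal sketch, assuming the buffer is held as a Python list ordered oldest first; the names Entry and lookup are hypothetical. Real hardware examines all entries in parallel, whereas the loop here only reproduces the outcome of that search, using the entries shown in Figure 1-8.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    tag: str                      # reorder buffer tag (entry name), e.g. "RR1"
    destination: str              # register the result is destined for
    value: Optional[int] = None   # None while the result is still pending


def lookup(buffer, register):
    """Return the newest entry destined for `register`, or None if absent.

    `buffer` is ordered oldest first, in instruction decode order.
    """
    for entry in reversed(buffer):
        if entry.destination == register:
            return entry
    return None


buffer = [Entry("RR1", "gr96"), Entry("RR2", "gr97"), Entry("RR3", "gr96")]
selected = lookup(buffer, "gr96")   # newest of the two gr96 entries
missing = lookup(buffer, "lr4")     # no copy held in the reorder buffer
print(selected.tag, missing)        # prints: RR3 None
```

A lookup that returns no entry corresponds to the case where the operand must be read from the real register file instead.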
[Figure: reorder buffer entries RR0 to RRn, each holding a destination tag, a
value and a status. RR1 (gr96, in use) is the older entry, RR2 (gr97) is in use,
and RR3 (gr96, in use) is the newer entry, which is the one selected for an
instruction source operand address of gr96. RRn is a free entry.]
Figure 1-8. Circular Reorder Buffer Format
When an instruction completes, the reorder buffer entry is updated with the
result value. A number of result buses are used to forward result values, and their
associated tag information, to the reorder buffer. Function units monitor the flow of
data along these buses in the hope of acquiring data values required by their
reservation stations. In this way, instructions are supplied the operands which were
missing when the instruction was decoded. When a reorder buffer entry has been
updated with a result, the entry is ready for retiring. This is the term given to writing
the result value into the real register in the register file. There is a bus for this task
which connects read ports on the reorder buffer to write ports on the register file. The
number of ports assigned to this task (2) limits the number of instructions which can
be retired in any one processor cycle. A register file with two write ports supports a
maximum of four instructions being retired during the same cycle: two instructions
which modify result registers, one store instruction, and one branch instruction (these
last two instruction types do not write to result registers). Figure 1-9 outlines the
system layout.
When the reorder buffer becomes full, no further instruction decoding can occur
until entries are made available via instruction retiring. Instructions are retired in the
order they are placed in the reorder buffer. This ensures in–order retiring of
instructions. Should an exceptional event occur during program execution, the state
of instruction retirement specifies the precise position which execution has reached
within the program. Only completed instructions, without exceptions, are retired.
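In–order retirement can be sketched as a queue that is drained only from its head. The sketch below is an assumption–laden illustration, not a description of any 29K implementation: the function name retire and the dictionary layout of an entry are hypothetical, only result–writing instructions are modeled (stores and branches are ignored), and the limit of two retirements per cycle is taken from the two register file write ports described above.

```python
from collections import deque

WRITE_PORTS = 2   # register file write ports available for retirement


def retire(buffer, register_file):
    """Retire completed entries from the head of the buffer for one cycle.

    Each entry is a dict with "dest" and "value"; a value of None means the
    result is still pending. Retirement stops at the first pending entry,
    even if younger entries behind it have already completed.
    """
    retired = 0
    while buffer and retired < WRITE_PORTS and buffer[0]["value"] is not None:
        entry = buffer.popleft()
        register_file[entry["dest"]] = entry["value"]
        retired += 1
    return retired


registers = {}
rob = deque([
    {"dest": "gr96", "value": 30},    # completed
    {"dest": "gr97", "value": None},  # still executing
    {"dest": "gr96", "value": 7},     # completed out of order
])
first = retire(rob, registers)        # only the head entry retires
after_first = dict(registers)
rob[0]["value"] = 31                  # the pending instruction completes
second = retire(rob, registers)       # both remaining entries retire
print(first, after_first, second, registers)
```

Because the younger gr96 result waits behind the pending gr97 result, the register file always reflects a precise point in the program.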
[Figure: block diagram. Instruction memory feeds the instruction cache, which
supplies four instructions at a time to instruction decode. The register file and
reorder buffer drive the operand buses to the function units and their reservation
stations; result and tag buses return results to the reorder buffer, and a
retirement bus connects the reorder buffer to the register file.]
Figure 1-9. Multiple Function Units with a Reorder Buffer
Figure 1-9 shows the operand buses supplying source operands from the
reorder buffer to the reservation stations. However, in some cases, when an
instruction is decoded and the operand register’s number is presented to the reorder
buffer, no entry is found. This indicates there is currently no copy of the required
register. Consequently, the real register in the register file must be accessed to obtain
the data. For this reason the register file is provided with read ports (4) which supply
data to the operand bus.
1.7.4 Branch Prediction
Out–of–order instruction issue places a heavy demand on instruction decoding.
If reservation stations are to be kept filled, instruction decode must proceed at a rate
equal to, or greater than, instruction execution. Otherwise, performance will be
limited by the ability to decode instructions. The major obstacle in the way of
achieving efficient decoder operation is branching instructions. Unfortunately,
instruction sequences typically contain only about five or six instructions before a
further branch–type instruction is encountered. Compilers directed to producing
code specifically for superscalar processor execution try to increase this critical
parameter. Additionally, the fact that the target of a branch instruction need not be
aligned on a cache block boundary can further reduce the efficiency of the decoding
process.
The decoder fetches instructions and places them into the instruction window
for issue by a function unit. If an average decode rate of more than two instructions
per cycle is to be achieved, it is likely that a four–instruction decoder (or better) will
be required. In fact, AMD’s product overview indicates a four–instruction decoder is
used. To study this further, first examine the code below. The first target sequence
begins at address label L13. The linker need not align the L13 label at a cache block
boundary –– a cache block size of four instructions will be assumed. The same
alignment issue occurs with the second target sequence beginning at label L14. The
decoder is presented with a complete cache block rather than sequential instructions
from within the block. This requires a 128–bit bus between the instruction cache and
the decode unit. However, this is essential if instructions are to be decoded in parallel.
Figure 1-10 shows a possible cache block assignment, assuming the target of the first
instruction sequence begins in the second entry of the cache block. The target of the
second sequence begins in the third instruction of the block.
L13: ;target of a branch
add gr98,gr98,10 ;gr98 = gr98 + 10
sll gr99,gr99,2
cpgt gr97,gr97,gr98
jmpt gr97,L14 ;conditional branch to L14
add lr4,lr4,gr99 ;branch delay slot, see section 1.13
L15:
load 0,0,gr97,lr4
store 0,0,gr97,gr96
. . .
L14: ;target of branch
jmp L16 ;unconditional branch to L16
const lr10,0 ;branch delay slot, always executed
. . .
The branch instruction from the first code sequence to label L14 is located in the
second instruction of the block. Assuming two cycles are required to fetch the target
block, the decoder is left with nothing to decode for several cycles. Additionally,
branch alignment has resulted in there being fewer than four instructions available for
decode during any cycle. The resulting decode rate is 1 instruction per cycle. This
would result in little better than scalar processor performance, much less than the
desired 2 or more instructions per cycle.
[Figure: cache blocks being decoded, against time in cycles. The first block
supplies add gr98,gr98,10 and sll gr99,gr99,2; the second supplies
cpgt gr97,gr97,gr98, jmpt gr97,L14 and add lr4,lr4,gr99; after a two–cycle delay
the third supplies jmp L16 and const lr10,0, and the block at L16 follows.
Average decode rate = 7/7 = 1 instruction/cycle.]
Figure 1-10. Instruction Decode with No Branch Prediction
In Figure 1-10 the target sequence is found in the cache. Of course the cost of the
branch would be much higher if the target instructions had to be fetched from
off–chip memory. Additionally, a two–cycle branch delay is shown. This is typically
defined as the time from decoding the branch instruction till decoding the target
instruction. The actual delay encountered is difficult to estimate, as the target address
is not known until the jump instruction is executed. Figure 1-10 shows the cycle
when the jump instruction is placed in the instruction window. When it will be issued
depends on a number of factors such as register dependency and reservation station
activity. Additionally the result of the jump must be forwarded to the decode unit
before further instruction decode can proceed. In practice, several cycles could
elapse before the decoder obtains the address of the cache block containing the target
instruction.
It is clear from the above discussion that a superscalar processor must take steps
to achieve a higher instruction decode rate. This is likely to involve some form of
branch prediction. The decoder cannot wait for the outcome of the branch instruction
to be known before it starts fetching the new instruction stream. It must examine the
instruction currently being decoded, and determine if a branch is present. When a
branching instruction is found, the decoder must predict both if the branch will be
taken and the target of the branch. This enables instructions to be fetched and
decoded along the predicted path. Of course, unconditional branches also benefit
from early fetching of their target instruction sequence; and they do not require
branch prediction support.
The instruction decode sequence for the previous code example is shown in
Figure 1-11 using branch prediction. Without waiting for the conditional–jump
instruction in the second entry of the cache block to execute, the decoder predicts the
branch will be taken and in the next cycle starts decoding the block containing the
target instruction. This results in a decode rate of 2.33 instructions per cycle. If the
prediction is correct, the decoder should be able to sustain a decode rate which
prevents starving the function units of instructions.
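The two decode rates quoted for this example can be reproduced with a short calculation. The cycle accounting below is an assumption chosen to match the totals given in Figures 1-10 and 1-11: each cache block occupies the decoder for one cycle, and a block ending in a taken branch is followed by the branch delay. The function name decode_rate and the block list are hypothetical.

```python
def decode_rate(blocks, branch_delay):
    """Average instructions decoded per cycle.

    `blocks` is a list of (instructions_decoded, ends_in_taken_branch)
    pairs, one per cache block presented to the decoder. `branch_delay` is
    the number of idle decode cycles that follow a taken branch.
    """
    instructions = sum(count for count, _ in blocks)
    cycles = sum(1 + (branch_delay if taken else 0) for _, taken in blocks)
    return instructions / cycles


# The example sequence: 2 instructions, then 3 ending in jmpt (with its
# delay slot), then 2 ending in jmp (with its delay slot).
blocks = [(2, False), (3, True), (2, True)]

no_prediction = decode_rate(blocks, branch_delay=2)    # 7 / 7
with_prediction = decode_rate(blocks, branch_delay=0)  # 7 / 3
print(f"{no_prediction:.2f} {with_prediction:.2f}")    # prints: 1.00 2.33
```

Even with perfect prediction the rate stays below four instructions per cycle, because branch alignment leaves every block in this example only partly used.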
[Figure: cache blocks being decoded, against time in cycles. The three blocks
of Figure 1-10 are decoded in consecutive cycles with no branch delay. Average
decode rate = 7/3 = 2.33 instructions/cycle.]
Figure 1-11. Four–Instruction Decoder with Branch Prediction
Branch prediction supports speculative instruction fetching. It results in
instructions being placed in the instruction window which may be speculatively
dispatched and executed. If the branch is wrongly predicted, instructions still waiting
in reservation stations must be cancelled. Any wrongly predicted instructions which
reach execution must not be retired. This requires considerable support circuitry. For
this reason scoreboarding is used by some processors to support speculative
instruction fetching. With scoreboarding the decoder sets a scoreboard bit for each
instruction’s destination register. Since there is only one bit indicating there is a
pending update, there can be only one such update per register. Consequently, the
decoder stalls when encountering an instruction required to update a register which
already has a pending update. The scoreboarding mechanism is simpler to implement
than register renaming using a reorder buffer. However, its restrictions limit the
decoder’s ability to speculatively fetch instructions further ahead of actual execution.
This has been shown to result in about 21% poorer performance when a
four–instruction decoder is used [Johnson 1991].
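The restriction imposed by a single scoreboard bit per register can be seen in a small model. This is a sketch under stated assumptions: the class name Scoreboard and its methods are hypothetical, and a Python set stands in for the per–register bits.

```python
class Scoreboard:
    """One pending-update bit per register, modeled as a set of names."""

    def __init__(self):
        self.pending = set()

    def try_decode(self, destination):
        """Return True if the instruction can be decoded, False on a stall."""
        if destination in self.pending:
            return False              # register already has a pending update
        self.pending.add(destination)
        return True

    def complete(self, destination):
        """Clear the bit once the result has been written."""
        self.pending.discard(destination)


board = Scoreboard()
# Destinations of the mul/add/add sequence used in section 1.7.3.
results = [board.try_decode(reg) for reg in ("gr96", "gr97", "gr96")]
board.complete("gr96")                # the mul result is written
retry = board.try_decode("gr96")      # the third instruction can now decode
print(results, retry)                 # prints: [True, True, False] True
```

The third instruction stalls in the decoder until the first has completed, whereas with register renaming it would simply be given a new reorder buffer entry.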
It is certain that a superscalar 29K processor will incorporate a branch
prediction technique. Given that instruction compatibility is to be maintained, it is
likely that a hardware prediction rather than a software prediction method will be
employed. This will require the processor to keep track of previous branch activity.
A simple algorithm will likely help with selecting the most frequent branch paths. For
example, branches to lower addresses, such as the jump at the bottom of a loop, are
more often taken than not.
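A predictor combining the two ideas, a record of previous branch activity and a backward–taken rule for branches not yet seen, might look like the sketch below. It is purely hypothetical: the class name BranchPredictor, the one–outcome history and the addresses used are assumptions for illustration, and nothing here describes the mechanism of an actual superscalar 29K processor.

```python
class BranchPredictor:
    """Last-outcome history with a backward-taken fallback (hypothetical)."""

    def __init__(self):
        self.last_outcome = {}   # branch address -> True if last taken

    def predict(self, branch_address, target_address):
        """Return True if the branch is predicted taken."""
        if branch_address in self.last_outcome:
            return self.last_outcome[branch_address]
        # No history yet: a branch to a lower address is assumed to be the
        # jump at the bottom of a loop, and so is predicted taken.
        return target_address < branch_address

    def update(self, branch_address, taken):
        """Record the actual outcome once the branch has executed."""
        self.last_outcome[branch_address] = taken


predictor = BranchPredictor()
loop_guess = predictor.predict(0x1040, 0x1000)      # backward: taken
forward_guess = predictor.predict(0x1080, 0x2000)   # forward: not taken
predictor.update(0x1040, False)                     # the loop has exited
learned = predictor.predict(0x1040, 0x1000)         # history now wins
print(loop_guess, forward_guess, learned)           # prints: True False False
```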
1.8 THE Am29200 MICROCONTROLLER
The Am29200 was the first of the 29K family microcontrollers (see
Table 1-3) [AMD 1992b]. To date the Am29205 is the only other microcontroller
added to the family. Being microcontrollers, many of the device pins are assigned I/O
and other dedicated support tasks which reduce system glue logic requirements. For
this reason none of the devices are pin compatible. The system support facilities, in-
cluded within the Am29200 package, make it ideal for many highly integrated and
low cost systems.
The processor supports a 32–bit address space which is divided into a number of
dedicated regions (see Figure 1-12). This means that ROM, for example, can only be
located in the region preallocated for ROM access. When an address value is gener-
ated, the associated control–logic for the region is activated and used to control data
or instruction access for the region.
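The region selection can be expressed as a simple table lookup. The sketch below takes the region starting addresses from Figure 1-12; the names REGIONS and region_of are hypothetical, and the code models only which region an address falls in, not the control logic that is then activated.

```python
# (starting address, region name), in ascending order, from Figure 1-12.
REGIONS = [
    (0x00000000, "ROM"),
    (0x40000000, "DRAM"),
    (0x50000000, "virtual-DRAM"),
    (0x60000000, "video-DRAM"),
    (0x80000000, "control regs."),
    (0x90000000, "PIA space"),
    (0x96000000, "reserved"),
]


def region_of(address):
    """Return the name of the dedicated region containing a 32-bit address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    name = REGIONS[0][1]
    for base, region in REGIONS:
        if address >= base:
            name = region        # keep the last region whose base is <= address
    return name


print(region_of(0x00000000))   # prints: ROM
print(region_of(0x40001000))   # prints: DRAM
print(region_of(0x95FFFFFF))   # prints: PIA space
```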
There is a 32–bit data bus and a separate 24–bit address bus. The rest of the 104
pins used by the device are mainly for I/O and external peripheral control tasks
associated with each of the separate address regions.
By incorporating memory interface logic within the chip, the processor enables
lower system costs and simplified designs. In fact, DRAM devices can be wired di-
rectly to the microcontroller without the need for any additional circuitry.
At the core of the microcontroller is an Am29000 processor. The additional I/O
devices and region control mechanisms supported by the chip are operated by pro-
grammable registers located in the control register region of memory space. These
control registers are accessible from alternate address locations –– for historical rea-
sons. It is best, and essential if C code is used, to access these registers from the op-
tional word–aligned addresses.
Accessing memory or peripherals located in each address region is achieved
with a dedicated region controller. While initializing the control registers for each
region it is possible to specify the access times and, say, the DRAM refresh require-
ments for memory devices located in the associated region.
Other peripheral devices incorporated in the microcontroller, such as the UART,
are accessed by specific control registers. The inclusion of popular peripheral
devices and the associated glue logic for peripheral and memory interfaces within a
single RISC chip enables higher performance at lower costs than existing systems
(see Figure 1-13). Let’s take a quick look at each of the region controllers and
specialized on–chip peripherals in turn.
Table 1-3. Am2920x Microcontroller Members of 29K Processor Family
Processor                      Am29200               Am29205
Instruction Cache              –                     –
I–Cache Associativity          –                     –
Data Cache                     –                     –
D–Cache Associativity          –                     –
On–Chip Floating–Point         No                    No
On–Chip MMU                    No                    –
Integer Multiply in h/w        No                    No
Programmable I/O               16 pins               8 pins
ROM width                      8/16/32 bit           16 bit
DRAM width                     16/32 bit             16 bit
On–Chip Interrupt Controller   Yes                   Yes
  Inputs                       14                    10
Scalable Clocking              No                    No
Burst–mode Addressing          Yes, up to 1K bytes   Yes, up to 1K bytes
Freeze Mode Processing         Yes                   Yes
Delayed Branching              Yes                   Yes
On–Chip Timer                  Yes                   Yes
On–Chip Memory Controller      Yes                   Yes
DMA Channels                   2                     1
Byte Endian                    Big                   Big
Serial Ports                   1                     1
JTAG Debugging                 Yes                   No
Clock Speeds (MHz)             16.7, 20              12.5, 16.7
Region Allocation     Starting Address
reserved              0x9600,0000 (extends to 0xffff,ffff)
PIA space             0x9000,0000
control regs.         0x8000,0000
video–DRAM            0x6000,0000
virtual–DRAM          0x5000,0000
DRAM                  0x4000,0000
ROM                   0x0
Figure 1-12. Am29200 Microcontroller Address Space Regions
1.8.1 ROM Region
The first thing to realize is that ROM space is really intended for all types of
nonmultiplexed–address devices, such as ROM and SRAM. Controlling access to
these types of memories is very similar. The region is divided into four banks. Each
bank is individually configurable in width and timing characteristics. A bank can be
associated with 8–bit, 16–bit or 32–bit memory and can contain as much as 16M
bytes of memory (enabling a 64M byte ROM region).
Bank 0, the first bank, is normally attached to ROM memory as code execution
after processor reset starts at address 0. During reset the BOOTW (boot ROM width)
input pin is tested to determine the width of Bank 0 memory. Initially the memory is
assumed to have 4–cycle access times (three wait states) and no burst–mode. The
SA29200 evaluation board contains an 8–bit EPROM at bank 0 (SA stands for stand–
alone). Other banks may contain, say, 32–bit SRAM with different wait state require-
ments. It is possible to arrange banks to form a contiguous address range.
Whenever memory in the ROM address range is accessed, the controller for the
region is activated and the required memory chip control signals such as CE (chip
enable), R/W, OE (output enable) and others are generated by the microcontroller.
Thus SRAM and EPROM devices are wired directly to pins on the microcontroller chip.
[Figure: block diagram. An Am29000 core, with buses labeled A, I and D, is
surrounded by the on–chip ROM controller, DRAM controller, DMA controller, PIA,
interrupt controller, parallel port, serial port, video interface and I/O port. The
ROM controller connects to external ROM or SRAM memory and the DRAM
controller to external DRAM memory.]
Figure 1-13. Am29200 Microcontroller Block Diagram
1.8.2 DRAM Region
In a way similar to the ROM region, there is a dedicated controller for DRAM
devices which are restricted to being located in the DRAM address region. Once
again the region is divided into four banks which may each contain as much as 16M
bytes of off–chip memory. The DRAM region controller supports 16–bit or 32–bit
wide memory banks which may be arranged to appear as contiguous in address
range.
DRAM, unlike ROM, is always assumed to have 4–cycle access times. Howev-
er, if page–mode DRAM is used it is possible to achieve 2–cycle rather than 4–cycle
burst–mode accesses. Burst–mode is used when consecutive memory addresses are
being accessed, such as during instruction fetching between program branches. The
DRAM memory is often referred to as 3/2 rather than 4/2. The four cycles consist of
1–cycle precharge and 3–cycles latency; under certain circumstances the 1–cycle of
precharge can be hidden. This is explained in section 1.14.1 under the Am29200 and
Am29205 subheading.
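The effect of these access times on a burst transfer can be estimated with a one–line formula. The sketch below assumes, as a simplification, that the first word of a burst costs the initial access time and every following word costs the burst access time; the function name dram_burst_cycles and its parameters are hypothetical.

```python
def dram_burst_cycles(words, first_access=4, burst_access=2):
    """Cycles to read `words` consecutive words from DRAM.

    The defaults model 4/2 page-mode DRAM: four cycles for the first access
    and two cycles for each subsequent burst access.
    """
    if words < 1:
        raise ValueError("at least one word must be transferred")
    return first_access + burst_access * (words - 1)


single = dram_burst_cycles(1)                       # 4 cycles
page_mode = dram_burst_cycles(4)                    # 4 + 3*2 = 10 cycles
hidden = dram_burst_cycles(4, first_access=3)       # precharge hidden: 9
no_page_mode = dram_burst_cycles(4, burst_access=4) # 4 + 3*4 = 16 cycles
print(single, page_mode, hidden, no_page_mode)      # prints: 4 10 9 16
```

For a four–word burst, page–mode operation reduces the cost from 16 cycles to 10, and hiding the precharge cycle saves one more.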
The control register associated with each DRAM bank maintains a field for
DRAM refresh support. This field indicates the number of processor cycles between
DRAM refresh. If refresh is not disabled, “CAS before RAS” cycles are performed
when required. Refresh is overlapped in the background with non–DRAM access
when possible.
If a DRAM bank contains video–DRAM rather than conventional DRAM, then
it is possible to perform data transfer to the VDRAM shift register via accesses in the
VDRAM address range. The VDRAM is aliased over the DRAM region. Accessing
the memory as VDRAM only changes the timing of memory control signals, so as
to indicate that a video shift register transfer is to take place rather than a CPU
memory access.
1.8.3 Virtual DRAM Region
A 16–Mbyte (24 address bit) virtual address space is supported via four map-
ping registers. The virtually addressed memory is divided into 64K byte (16 address
bits) memory pages which are mapped into physical DRAM. Each mapping register
has two 8–bit fields specifying the upper address bits of the mapped memory pages.
When memory is accessed in the virtual address space range, and one of the four
mapping registers contains a match for the virtually addressed page being accessed,
then the access is redirected to the physical DRAM page indicated by the mapping
register.
When no mapping register contains a currently valid address translation for the
required virtual address, a processor trap occurs. In this case memory management
support software normally updates one of the mapping registers with a valid mapping
and normal program execution is restarted.
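The translation performed by the mapping registers can be sketched as follows. The field layout is an assumption made for illustration: each register is modeled as a (valid, virtual page, physical page) tuple, with the two page numbers standing for the two 8–bit fields, and the valid flag, the names translate and TranslationTrap, and the example register contents are all hypothetical.

```python
PAGE_BITS = 16                       # 64K byte pages
PAGE_MASK = (1 << PAGE_BITS) - 1


class TranslationTrap(Exception):
    """Raised when no mapping register holds a valid translation."""


def translate(virtual_offset, mapping_registers):
    """Map a 24-bit virtual DRAM offset to a physical DRAM offset.

    `mapping_registers` holds up to four (valid, virtual_page,
    physical_page) tuples, where the page numbers are 8-bit values.
    """
    if not 0 <= virtual_offset < (1 << 24):
        raise ValueError("offset must fit in the 16M byte virtual space")
    page = virtual_offset >> PAGE_BITS
    for valid, virtual_page, physical_page in mapping_registers:
        if valid and virtual_page == page:
            # Replace the upper 8 bits; the offset within the page is kept.
            return (physical_page << PAGE_BITS) | (virtual_offset & PAGE_MASK)
    raise TranslationTrap(f"no mapping for virtual page {page:#04x}")


registers = [
    (True, 0x12, 0x03),
    (True, 0x40, 0x04),
    (False, 0x00, 0x00),
    (False, 0x00, 0x00),
]
physical = translate(0x12ABCD, registers)
print(hex(physical))                 # prints: 0x3abcd

try:
    translate(0x13ABCD, registers)
    trapped = False
except TranslationTrap:
    trapped = True                   # support software would load a mapping
```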
Only DRAM can be mapped into the virtual address space. The address region
supports functions such as image compression and decompression that yield lower
overall memory requirements and, thus, lower system costs. Images can be stored in
virtually addressed space in a compressed form, and only decompressed into
physically accessed memory when required for image manipulation or output video
imaging.
1.8.4 PIA Region
The Peripheral Interface Adapter (PIA) region is divided into six banks, each of
24–bit address space. Each bank can be directly attached to a peripheral device. The
control registers associated with the region give extra flexibility in specifying the
timing for signal pins connecting the microcontroller and PIA peripherals. The PIA
device–enable and control signals are again provided on–chip rather than in external
support circuitry.
When external DMA is utilized, transfer of data is always between DRAM or
ROM space and PIA space. More on DMA follows.
1.8.5 DMA Controller
When an off–chip device wishes to gain access to the microcontroller DRAM, it
makes use of the Direct Memory Access (DMA) Controller. On–chip peripherals can
also perform DMA transfers; this is referred to as internal DMA. DMA is initiated by
an external or internally generated peripheral DMA request.
The only internal peripherals which can generate DMA requests are the parallel
port, the serial port and the video interface. These three devices are described shortly.
There are two external DMA request pins, one for each of the two on–chip DMA con-
trol units. Internal peripherals have a control register field which specifies which
DMA controller their DMA request relates to.
The DMA controllers must be initialized by software before data transfer from,
or to, DRAM takes place. The associated control registers specify the DRAM start
address and the number of transfers to take place. Once the DMA control registers
have been prepared, a DMA transfer will commence immediately upon request
without any further CPU intervention. Once the DMA transfer is complete the DMA
controller may generate an interrupt. The processor may then refresh the DMA control
unit parameters for the next expected DMA transfer.
One of the DMA control units has the special feature of having a duplicate set of
DMA parameter registers. At the end of a DMA transfer, when the primary set of
DMA parameter registers has been exhausted, the duplicate set is immediately
copied into the primary set. This means the DMA unit is instantly refreshed and
prepared for a further DMA request. Ordinarily the DMA unit is not ready for further use
until the support software has executed, usually via an end of DMA interrupt request.
Just such an interrupt may be generated but it will now be concerned with preparing
parameters for the duplicate control registers for the one–after–next DMA request.
This DMA queue technique is very useful when DMA transfers are occurring to the
video controller. In such a case DMA cannot be postponed, as video imaging
requirements mean data must be available if image distortion is to be avoided.
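The queued behavior can be modeled in a few lines. This is a minimal sketch, assuming each parameter set is a (start address, transfer count) pair; the class name QueuedDmaChannel, its methods and the addresses used are hypothetical and do not correspond to actual control register names.

```python
class QueuedDmaChannel:
    """DMA channel with a primary and a duplicate (queued) parameter set."""

    def __init__(self, primary, queued=None):
        self.primary = primary    # parameters for the transfer in progress
        self.queued = queued      # parameters for the transfer after that

    def transfer_complete(self):
        """Model the end of a transfer.

        The duplicate set is copied into the primary set at once. Returns
        True if the channel is immediately ready for another request.
        """
        self.primary, self.queued = self.queued, None
        return self.primary is not None

    def queue(self, parameters):
        """Called by the end-of-DMA interrupt handler to refill the queue."""
        self.queued = parameters


channel = QueuedDmaChannel(primary=(0x1000, 256), queued=(0x2000, 256))
ready_1 = channel.transfer_complete()   # queued set takes over at once
channel.queue((0x3000, 256))            # handler prepares one-after-next
ready_2 = channel.transfer_complete()
ready_3 = channel.transfer_complete()   # handler did not refill in time
print(ready_1, ready_2, ready_3)        # prints: True True False
```

As long as the interrupt handler refills the duplicate set before the current transfer ends, the channel is never left waiting for software.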
External DMA can only occur between DRAM or ROM space and two of the six
PIA address space banks. DMA only supports an 8–bit address field within a PIA ad-
dress bank.
One further note on DMA: the microcontroller does support an external DMA
controller, enabling random access by the external DMA device to DRAM and
ROM. The external DMA unit must activate the associated control pins and place the
address on the microcontroller address bus. In conjunction with the microcontroller,
the external DMA unit must complete the single 32–bit data access.
1.8.6 16–bit I/O Port
The I/O port supports bit programmable access to 16 input or output pins. These
pins can also be used to generate level–sensitive or edge–sensitive interrupts. When
used as outputs, they can be actively driven or used in open collector mode.
1.8.7 Parallel Port
The parallel port is intended for connecting the microcontroller chip to a host
processor, where the controller acts as an intelligent high performance control unit.
Data can be transferred in both directions, either via software controlled 8–bit or
32–bit data words, or via DMA unit control. Once again the associated control regis-
ters give the programmer flexibility in specifying the timing requirements for con-
necting the parallel port directly to the host processor.
1.8.8 Serial Port
The on–chip serial port supports high speed full duplex, bi–directional data
transfer using the RS–232 protocol. The serial port can be used in a polled or
interrupt–driven mode. Alternatively, it may request DMA access. The lightweight
interrupt structure of the Am29000 processor core, coupled with the smart on–chip
peripherals, presents the software engineer with a wide range of options for
controlling the serial port.
1.8.9 I/O Video Interface
The video interface provides direct connection to a number of laser–beam
marking engines. It may also be used to receive data from a raster input device such as
a scanner or to serialize/deserialize a data stream. It is possible with external circuitry
support that a noninterleaved composite TV video signal could be generated.
The video shift register clock must be supplied on an asynchronous input pin,
which may be tied to the processor clock. (Note, a video image is built by serially
clocking the data in the shift register out to the imaging hardware. When the shift reg-
ister is empty it must be quickly refilled before the next shift clock occurs.) The
imaged page may be synchronized to an external page–sync signal. Horizontal and
vertical image margins as well as image scan rates are all programmable via the now
familiar on–chip control register method.
The video shift registers are duplicated, much like some of the DMA control
registers. This reduces the need for rapid software response to maintain video shift
register update. When building an image, the shift register is updated from the dupli-
cate support register. Software, possibly activated via a video–register–empty inter-
rupt, must fill the duplicate shift register before it becomes used–up. Alternatively,
the video data register can be maintained by the DMA controller without the need for
frequent CPU intervention.
1.8.10 The SA29200 Evaluation Board
The SA29200 is an inexpensive software development board utilizing the
Am29200 microcontroller. Only a 5v supply and a serial cable connection to a host
computer are required to enable board operation. Included on the board is an 8–bit
wide EPROM (128Kx8) which contains the MiniMON29K debug monitor and the
OS–boot operating system. There is also 1M byte of 32–bit DRAM (256Kx32) into
which programs can be loaded via the on–chip UART. The processor clock rate is 16
MHz and the DRAM operates with 4–cycle initial access and 2–cycle subsequent
burst accesses. So, although the performance is good, it is not as high as other mem-
bers of the 29K family.
The SA29200 board measures about 3 by 3.5 inches (9x10 cm) and has connec-
tions along both sides which enable attachment to an optional hardware prototyping
board (see following section). This extension board has additional I/O interface de-
vices and a small wire–wrap area for inclusion of application specific hardware.
1.8.11 The Prototype Board
The prototyping board is inexpensive because it contains mainly sockets, which
can support additional memory devices, and a predrilled wire–wrap area. The RISC
microcontroller signals are made available on the prototyping board pins. Some of
these signals are routed to the empty memory sockets so as to enable simple memory
expansions for 8–bit, 16–bit or 32–bit EPROM or SRAM. There is also space for up
to 16M bytes of 32–bit DRAM.
Using the wire–wrap area the microcontroller I/O signals can be connected to
devices supporting specific application tasks, such as A/D conversion or peripheral
control. This makes the board ideal for a student project. Additionally, the access
times for memory devices are programmable, thus enabling the effects of memory
performance on overall system operation to be evaluated.
1.8.12 Am29200 Evaluation
The combination of the GNU tool chain and the low cost SA29200 evaluation
board and associated prototyping board makes available an evaluation environment
for the industry’s leading embedded RISC. The cost of getting started with embedded
RISC is very low and additional high performance products can be selectively pur-
chased from specialized tool builders. The evaluation package should be of particu-
lar interest to university undergraduate and post–graduate courses studying RISC.
1.8.13 The Am29205 Microcontroller
The Am29205 is a microcontroller member of the 29K family (see Table 1-3). It
is functionally very similar to the Am29200 microcontroller. It differs as a result of
reduced system interface specifications. This reduction enables a lower device pin–
count and packaging cost. The Am29205 is available in a 100–lead Plastic Quad Flat
Pack (PQFP) package. It is suitable for use in price sensitive systems which can oper-
ate with the somewhat reduced on–chip support circuitry.
The reduction in pin count results in a 16–bit data/instruction bus. The processor
generates two consecutive memory requests to access instructions and data larger
than 16 bits. The memory system interface has also been simplified in other ways.
Only 16–bit transfers to memory are provided for; no 8–bit ROM banks are
supported. The parallel port, DMA controller, and PIA also now support transfers
limited to the 16–bit data width.
Generally the number of service support pins such as: programmable Input/Out-
put pins (now 8, 16 for the Am29200 processor); serial communication handshake
signals DTR, DSR; DMA request signals; interrupt request pins; and number of de-
coded PIA and memory banks, have all been reduced. The signal pins supporting vid-
eo–DRAM and burst–mode ROM access have also been omitted. These omissions
do not greatly restrict the suitability of the Am29205 microcontroller for many proj-
ects. The need to make two memory accesses to fetch instructions, which are not sup-
ported by an on–chip cache memory, will result in reduced performance. However,
many embedded systems do not require the full speed performance of a 32–bit RISC
processor.
AMD provides a low cost evaluation board known as the SA29205. The board is
standalone and very like the SA29200 evaluation board; in fact, it will fit with the
same prototype expansion board used by the SA29200. It is provided with a 256k
byte EPROM, organized as 128kx16 bits. The EPROM memory is socket upgradable
to 1M byte. There is 512K byte of 16–bit wide DRAM. For debugging purposes, it
can use the MiniMON29K debug monitor utilizing the on–chip serial port.
1.9 THE Am29240 MICROCONTROLLER
The Am29240 is a follow–on to the Am29200 microcontroller (see Table 1-4).
It was first introduced in 1993. The Am29240 is a member of the Am2924x family
grouping which offers increased on–chip support and greater processing power. In
terms of peripherals the Am29240 has two serial ports instead of the Am29200’s one.
It also has 4 DMA controllers instead of two.
Unlike the Am29200, all of the Am29240 DMA channels support queued data
transfer. Additionally, fly–by DMA transfers are optionally supported. Normal
DMA transfers require a read stage followed by a write stage. The data being trans-
ferred is temporarily held in an on–chip buffer after being read. With fly–by DMA the
read and write stages occur at the same time. This results in a faster DMA transfer.
However, the device being accessed must be able to transfer data at the maximum
DRAM access rate.
The Am2924x family grouping, unlike the Am2920x grouping, supports virtual
memory addressing. The Translation Look–Aside Buffer (TLB) used to construct an
MMU scheme supports larger page sizes than the Am29000 processor. The page size
can be up to 16M bytes. The large page size enables extensive memory regions to be
mapped with only a few TLB mapping entries. For this reason only 16 TLB entries
are provided (8 sets, two entries per set). A consequence of the relatively large page
size is that pages cannot be individually protected against Supervisor mode reads and
execution, as is possible with the smaller pages used by the Am29000 processor
(see section 6.2.1). This loss is outweighed by the benefits of the larger page size
which achieves virtual memory addressing with little TLB reload activity and with
only a small amount of chip area being required.

Table 1-4. Am2924x Microcontroller Members of 29K Processor Family

Processor                      Am29240               Am29243               Am29245
Instruction Cache              4K bytes              4K bytes              4K bytes
I–Cache Associativity          2–Way                 2–Way                 2–Way
Data Cache (Physical)          2K bytes              2K bytes              –
D–Cache Associativity          2–Way                 2–Way                 –
On–Chip Floating–Point         No                    No                    No
On–Chip MMU                    Yes                   Yes                   Yes
Integer Multiply in h/w        Yes, 1–cycle          Yes, 1–cycle          No
Programmable I/O               16 pins               8 pins                8 pins
ROM width                      8/16/32 bit           16/32 bit             8/16/32 bit
DRAM width                     16/32 bit             8/16/32 bit (parity)  16/32 bit
On–Chip Interrupt Controller   Yes                   Yes                   Yes
Interrupt Controller Inputs    14                    14                    14
Scalable Clocking              1x,2x                 1x,2x                 No
Burst–mode Addressing          Yes, up to 1K bytes   Yes, up to 1K bytes   Yes, up to 1K bytes
Freeze Mode Processing         Yes                   Yes                   Yes
Delayed Branching              Yes                   Yes                   Yes
On–Chip Timer                  Yes                   Yes                   Yes
On–Chip Memory Controller      Yes                   Yes                   Yes
DMA Channels                   4                     4                     2
Byte Endian                    Big                   Big                   Big
Serial Ports                   2                     2                     1
JTAG Debugging                 Yes                   Yes                   Yes
Clock Speeds (MHz)             0–20,25,33            0–20,25,33            0–16

Increased performance is achieved by the inclusion of separate 4k byte instruc-
tion and 2k byte data caches. As with all 29K instruction caches, address tags are
based on virtual addresses when address translation is turned on. The first processor
in the 29K Family to have a conventional instruction cache was the Am29030. The
Am29240 cache is similar in operation to the Am29030’s cache. However, the
Am29240 processor has four valid bits per cache entry (four instructions) in place of
the previous one bit. This offers a performance advantage as cache blocks need only
be partially filled and need not be fetched according to block boundaries (more on
this in section 5.13.5).
The data cache always operates with physical addresses. The block size is 16
bytes and there is one valid bit per block. This means that complete data blocks must
be fetched when data cache reload occurs. A “write–through” policy is supported by
the cache which ensures that external memory is always consistent with cache con-
tents. Cache blocks are only allocated for data loaded from DRAM or ROM address
regions. Access to other address regions is not cached. A two word write–through
buffer is used to assist with writes to memory. It enables multiple store instructions to
be in–execution without the processor pipeline stalling. Data accesses which hit in
the cache require 1–cycle access times. The data cache operation is explained in de-
tail in section 5.14.
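As a rough illustration of the cache geometry described above, the following C sketch splits a physical address into a block offset, a set index and a tag for a 2K byte, two–way cache with 16–byte blocks (the associativity is taken from Table 1-4). The macro and function names are invented for this example, and the exact address bits the hardware uses for set selection are an assumption; section 5.14 remains the authoritative description.

```c
#include <stdint.h>

/* Geometry from the text and Table 1-4: 2K bytes, 2-way, 16-byte blocks. */
#define DCACHE_BYTES  2048u
#define DCACHE_WAYS   2u
#define DCACHE_BLOCK  16u
#define DCACHE_SETS   (DCACHE_BYTES / (DCACHE_WAYS * DCACHE_BLOCK))  /* 64 sets */

/* Byte offset within a 16-byte block (low 4 address bits). */
static uint32_t dcache_offset(uint32_t paddr)
{
    return paddr % DCACHE_BLOCK;
}

/* Set index; assumed to be the address bits just above the block offset. */
static uint32_t dcache_set(uint32_t paddr)
{
    return (paddr / DCACHE_BLOCK) % DCACHE_SETS;
}

/* Tag: the remaining high-order physical address bits. */
static uint32_t dcache_tag(uint32_t paddr)
{
    return paddr / (DCACHE_BLOCK * DCACHE_SETS);
}
```

Under these assumptions, two addresses compete for the same pair of cache entries when they differ by a multiple of 1K bytes (64 sets of 16 bytes).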
Scalable bus clocking is supported, enabling the processor to run at twice the
speed of the off–chip memory system. Scalable Clocking was first introduced with
the Am29030 processors, and is described in the previous section covering the
Am29030. If cache hit rates are sufficiently high, Scalable Clocking enables high
performance systems to be built around relatively slow memory systems. It also offers
an excellent upgrade path when additional performance is required in the future.
Initially the ROM memory region is assumed to have four cycle access times
(three wait states) and no burst–mode –– same as Am29200. The four banks within
the region can be programmed for zero wait–state read and one wait–state write, or
another combination suitable for slower memory devices.
DRAM, unlike ROM, is always assumed to have 3–cycle access times. Howev-
er, if page–mode DRAM is used it is possible to achieve 1–cycle burst–mode ac-
cesses. Burst–mode is used when consecutive memory addresses are being accessed,
such as during instruction fetching. The Am29200 microcontroller supports 4–cycle
DRAM access with 2–cycle burst. The faster DRAM interface of the Am29240
should result in a substantial performance gain. Additionally, the 3–cycle initial
DRAM access can be reduced to 2–cycle if the required 1–cycle precharge can be
hidden. This is explained in section 1.14.1 under the Am29200 and Am29205 sub-
heading. Consequently the Am29240 DRAM is often referred to as 2/1 rather than
3/1.
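The effect of the x/y access notation can be estimated with a little arithmetic. The sketch below is a simplified model only: it assumes a burst consists of one initial access followed by back–to–back burst accesses, and it ignores refresh, page crossings and any other stalls. The function name is invented for this example.

```c
/* Cycles needed to transfer `words` consecutive words: one initial access
 * followed by (words - 1) burst accesses. Simplified model; refresh and
 * page-boundary effects are ignored. */
static unsigned burst_cycles(unsigned initial, unsigned burst, unsigned words)
{
    if (words == 0)
        return 0;
    return initial + (words - 1) * burst;
}
```

For a four–word burst this model gives burst_cycles(4, 2, 4) = 10 cycles for the Am29200’s 4/2 DRAM, burst_cycles(3, 1, 4) = 6 cycles for the Am29240’s 3/1 DRAM, and burst_cycles(2, 1, 4) = 5 cycles when the precharge is hidden (2/1).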
The Am29240 processor supports integer multiply directly in a single cycle.
Most 29K processors take a trap when an integer multiply is attempted. It is left to
trapware to emulate the missing instruction. The ability to perform high speed multi-
ply makes the processor a better choice for calculation intensive applications such as
digital signal processing. Note, floating–point performance should also improve
with the Am29240 as floating–point emulation routines can make use of the integer
multiply instruction.
The Am2924x family grouping is implemented with a silicon process which en-
ables processors to operate at 3.3–volts or 5–volts. The lower power consumption
achievable at 3.3–volts makes the Am29240 suitable for hand–held type applica-
tions.
1.9.1 The Am29243 Microcontroller
The Am29243 is an Am29240 microcontroller enhanced to deal with commu-
nication applications (see Table 1-4). For this reason the video interface is omitted.
The pins used have not been reassigned, and there is a possibility they will be allo-
cated in a future microcontroller for an additional communications support function.
Communication applications frequently require large amounts of DRAM, and it
is often critical that no corruption of the data occur. Parity error checking is often per-
formed by memory systems with the objective of detecting data corruption. It can be
difficult to build the necessary circuitry at high memory system speeds. The
Am29243 microcontroller has built–in parity generation and checking for all DRAM
accesses. When enabled by the DRAM controller, the processor will take trap num-
ber 4 when a parity error is detected. Having parity handling built–in enables single–
cycle DRAM accesses to be performed without any external circuitry required.
Because of the larger amounts of memory typically used in communication ap-
plications, the Am29243 has a second Translation Look–Aside Buffer (TLB). Hav-
ing two TLBs enables a larger number of virtual to physical address translations to be
cached (held in a TLB register) at any time. This reduces the TLB reload overhead.
The second TLB also has 16 entries (8 sets, two entries per set), and the page size can
be the same or different. If the TLB page sizes are the same, a four–way set associa-
tive MMU can be constructed with supporting software. Alternatively one TLB can
be used for code and the second, with a larger page size, for data buffers or shared
libraries. The TLB entries have a Global Page (GLB) bit; when set the mapped page
can be accessed by any process regardless of its process identifier (PID).
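The following C sketch models how a two–entry–per–set TLB of the kind described above might be searched, including the effect of the Global Page bit. It is a software model for illustration only: the structure layout, the field names and the use of the low page–number bits to select a set are all assumptions of this example and do not reflect the actual TLB register format.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_SETS 8u
#define TLB_WAYS 2u

/* Hypothetical entry layout; the real TLB register format differs. */
struct tlb_entry {
    bool     valid;
    bool     global;  /* GLB bit: ignore the PID when matching */
    uint8_t  pid;     /* process identifier of the owning process */
    uint32_t vpage;   /* virtual page number */
    uint32_t ppage;   /* physical page number */
};

struct tlb {
    unsigned page_shift;                     /* log2 of the page size in bytes */
    struct tlb_entry e[TLB_SETS][TLB_WAYS];  /* 8 sets, two entries per set */
};

/* Returns true and writes *paddr on a hit; false indicates a TLB miss,
 * which supporting software would handle by reloading an entry. */
static bool tlb_lookup(const struct tlb *t, uint32_t vaddr, uint8_t pid,
                       uint32_t *paddr)
{
    uint32_t vpage = vaddr >> t->page_shift;
    unsigned set = vpage % TLB_SETS;  /* assumed set selection */

    for (unsigned way = 0; way < TLB_WAYS; way++) {
        const struct tlb_entry *e = &t->e[set][way];
        if (e->valid && e->vpage == vpage && (e->global || e->pid == pid)) {
            uint32_t offset_mask = (UINT32_C(1) << t->page_shift) - 1;
            *paddr = (e->ppage << t->page_shift) | (vaddr & offset_mask);
            return true;
        }
    }
    return false;
}
```

With two such structures, one per TLB, software modelling the Am29243 could search the first and then the second. When both use the same page size the pair behaves like a four–way set associative arrangement.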
1.9.2 The Am29245 Microcontroller
The Am29245 is a low–cost version of the Am29240 microcontroller (see
Table 1-4). To enable the lower cost, the data cache and the integer multiply unit have
been omitted. Further, there are only two DMA channels in place of the Am29240’s
four. To further reduce cost, one of the two serial ports has also been omitted.
The Am29245 is intended for use in systems which do not need the maximum
performance of the Am29240 or all of its peripherals; and can benefit from a reduced
processor cost. The Am29245 does not support Scalable Clocking and is only avail-
able at relatively lower clock speeds.
1.9.3 The Am2924x Evaluation
AMD has a number of boards available for Am2924x evaluation. Microcontrol-
lers in this family grouping all have the same pin configuration. This enables the
boards to operate with any of the Am2924x processors. The least expensive board is
the SD29240. It is a very small board, similar in form to the SA29200 board, but it does
not have the expansion connector available with the SA29200. It is normally supplied
with an Am29240 or Am29245 installed. There is 1M byte of 32–bit wide
DRAM which operates at 16 MHz. When an Am29240 is used, Scalable Clocking
can enable the processor to operate at 32 MHz. The board also has a JTAG and
RS–232 connector. The 1M byte of 32–bit wide EPROM supplied with the board is
preprogrammed for MiniMON29K operation.
Those with more money to spend, or requiring a more flexible evaluation board,
can use the SE29240 board. It contains an Am29243 processor but can be used to
evaluate an Am29240 or Am29245. Initially the board contains 1M byte of 36–bit
wide DRAM. However, this can be expanded considerably. The DRAM is 36–bits
wide due to the additional 4–bits required for parity checking. The maximum
memory speed is 25 MHz. Scalable Clocking can be used with a 32 MHz processor
when the memory system is configured for 16 MHz operation.
The SE29240 board has greater I/O capability than the SD29240 board. There
are connectors for two RS–232 ports and a parallel port. Debugging can be achieved
via a serial or parallel port link to the MiniMON29K DebugCore located in EPROM.
Debugging is also supported via the JTAG or Logic Analyzer connections. There is a
small wire–wrap area for additional circuitry, and extra boards can be connected via
an expansion connector.
AMD also has an evaluation board intended for certain communication applica-
tions. The NET29K board has a triple processor pad–site. The board can operate with
either an Am29205, Am29200 or Am2924x (probably an Am29243) processor. The
processor pad site is concentric, the larger processor being at the outer position. The
similarity in the memory region controllers enables the construction of this unusual
board.
The memory system consists of 4M bytes of 36–bit wide DRAM, which is ex-
pandable. There is also 2M bytes of 32–bit EPROM. The EPROM can be replaced
with 1M byte of Flash programmable memory. For communications there is an AMD
MACE chip which provides an Ethernet capability via a 10–Base–T connector. Two
of the processor’s DMA channels are wired for MACE access. One channel of an
85C30 UART is connected to an RS–449 connector which supports RS–422 signal
level communication. This enables very fast UART communication. The Mini-
MON29K DebugCore and OS–boot operating system are initially installed in
EPROM (or Flash); and the DebugCore communicates via an on–chip UART con-
nected to an RS–232 (9–way) connector.
When the NET29K board is used with an Am29205 processor, the 16–bit pro-
cessor bus enables only half of the memory system to be accessed. The board is
physically small, measuring about 5 1/2 x 5 1/2 inches (14cm x 14cm). Debugging is
further supported by JTAG and Logic Analyzer connections. An inexpensive 9–volt
power supply is required.
1.10 REGISTER AND MEMORY SPACE
Most of the 29K instructions operate on information held in various processor
registers. Load and store type instructions are available for moving data between ex-
ternal memory and processor registers. Members of the 29K family generally sup-
port registers in three independent register regions which make up the 29K register
space. These regions are the General Purpose registers, Translation Look–Aside
(TLB) registers, and Special Purpose registers. Members of the 29K family which do
not support Memory Management Unit operation, do not have TLB registers imple-
mented.
There are currently two core processors within the 29K family, the Am29000
and the Am29050. Other processors are generally derived from one of these core pro-
cessors. For example, the Am29030 has an Am29000 at its core, with additional sili-
con area being used to implement instruction cache memory and a 2–bus processor
interface. The differences between the core processors and their derivatives are reflected
in expansions to the special register space.
However, the special register space does appear uniform throughout the 29K
family. Generally only those concerned with generating operating system support
code are concerned with the details of the special register space. AMD has specified a
subset of special registers which are supported on all 29K family processors. This
aids in the development and porting of Supervisor mode code.
The core processors support a 3–bus Harvard Architecture, with instructions
and data being held in separate external memory systems. There is one 32–bit bus
each for the two memory systems and a shared 32–bit address bus. Some RISC chips
have a 4–bus system, where there is an address bus for each of the two memory sys-
tems. This avoids the contention for use of a shared address bus. Unfortunately, it also
results in increased pin–count and, consequently, processor cost. The 29K 3–bus pro-
cessors avoid conflicts for the address bus by supporting burst mode addressing and a
large number of on–chip registers. It has been estimated that the Am29000 processor
loses only 5% performance as a result of the shared address bus.
All instruction fetches are directed to instruction memory; data accesses are directed
to data memory or I/O space. These two externally accessible spaces constitute
two of the four external access spaces. The other two are the ROM space and the
coprocessor space. The ROM space is accessed via the instruction bus. Like the
instruction space it covers a 2^32 range.
1.10.1 General Purpose Registers
All members of the family have general purpose registers which are made up
from 128 local registers and more than 64 global registers (see Figure 1-14). These
registers are the primary source and destination for most 29K instructions. Instructions
have three 8–bit operand fields which are used to supply the addresses of general
registers. All User mode executable instructions, and code produced by high level
language compilers, are restricted to directly accessing only general purpose registers.
The fact that these registers are all 32–bit and that there is a large number of
them, vis–a–vis CISC, reduces the need to access data held in external memory.
General purpose registers are implemented by a multiport register file. This file
has a minimum of three access ports; the Am29050 processor has an additional port
for writing–back floating–point results. Two of the three ports provide simultaneous
read access to the register file; the third port is for updating a register value. Instruc-
tions generally specify two general purpose register operands which are to be oper-
ated on. After these operands have been presented to the execution unit, the result of
the operation is made available in the following cycle. This allows the result of an
integer operation to be written back to the selected general purpose register in the
cycle following its execution. At any instant, the current cycle is used to write–back
the result of the previous computation.
The Am29050 can execute floating–point operations in parallel with integer
operations. The latency of floating–point instructions can be more than the 1–cycle
achieved by the integer operation unit. Floating–point results are written back, when
the operation is complete, via their own write–back port, without disrupting the integer
unit’s ability to write results into the general purpose register file.
Global Registers
The 8–bit operand addressing fields enable only the lower 128 of the possible
256 address values to be used for direct general purpose register addressing. This is
because the most significant address bit is used to select a register base–plus–offset
addressing mode. When the most significant bit is zero, the accessed registers are
known as Global Registers. Only the upper 64 of the global registers are implement-
ed in the register file. These registers are known as gr64–gr127. Some of the lower
address–value global registers are assigned special support tasks and are not really
general purpose registers.
The Am29050 processor supports a condition code accumulator with global
registers gr2 and gr3. The accumulator can be used to concatenate the result of several
Boolean comparison operations into a single condition code. Later the accumulated
condition can be quickly tested. These registers are little used and on the whole
other, more efficient, techniques can be found in preference to their use.

Absolute REG#   General–Purpose Register   Group
0               Indirect Pointer Access
1               Stack Pointer
2 thru 63       not implemented
64              Global Register 64         Global registers
65              Global Register 65
66              Global Register 66
...             ...
126             Global Register 126
127             Global Register 127
128             Local Register 125         Local registers
129             Local Register 126
130             Local Register 127
131             Local Register 0           (stack pointer = 131, example)
132             Local Register 1
...             ...
254             Local Register 123
255             Local Register 124

Figure 1-14. General Purpose Register Space

Local Registers
When the most significant address bit is set, the upper 128 registers in the gener-
al purpose register file are accessed. The lower 7–bits of the address are used as an
offset to a base register which points into the 128 registers. These general purpose
registers are known as the Local Registers. The base register is located at the global
register address gr1. If the addition of the 7–bit operand address value and the register
base value produces a result too big to be contained in the 7–bit local register address
space, the result is rounded modulo–128. When producing a general purpose register
address from a local register address, the most significant bit of the general purpose
register address value is always set.
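The address calculation just described can be sketched in C as follows. This is a minimal model, not a description of the hardware: the function name is invented, the base is expressed as an absolute register number in the manner of the Figure 1-14 example rather than as the actual contents of gr1, and indirect addressing via gr0 is not modelled.

```c
#include <stdint.h>

/* Map an 8-bit operand field to an absolute register number (0..255).
 * `base` is the local register base, given here as an absolute register
 * number as in the Figure 1-14 example (an assumption of this sketch). */
static unsigned absolute_register(uint8_t field, unsigned base)
{
    if ((field & 0x80u) == 0)
        return field;                 /* most significant bit clear: global */

    unsigned offset = field & 0x7Fu;  /* lower 7 bits: local register number */

    /* Add the base modulo 128, then force the most significant bit so the
     * result always falls in the upper 128 registers. */
    return 0x80u | ((offset + base) & 0x7Fu);
}
```

With a base of 131, as in Figure 1-14, local register 0 maps to absolute register 131, local register 124 maps to 255, and local register 125 wraps around to 128.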
The local register base address can be read by accessing global register gr1.
However, the base register is actually a register which shadows global register gr1.
The shadow support circuitry requires that the base be written via an ALU operation
producing a result destined for gr1. It also requires that a one cycle delay separate
the setting of the base register from any subsequent reference to local registers.
Global register address gr0 also has a special meaning. Each of the three oper-
and fields has an indirect pointer register located in the special register space. When
address gr0 is used in an operand field, the indirect pointer is used to access a general
purpose register for the associated operand. Each of the three indirect pointers has an
8–bit field and can point anywhere in the general purpose register space. When indi-
rect pointers are used, there is no distinction made between global and local registers.
All of the general purpose registers are accessible to the processor while execut-
ing in User mode unless register bank protection is applied. General purpose registers
starting with gr64 are divided into groups of 16 registers. Each group can have access
restricted to the processor operating in Supervisor mode only. The AMD high level
language calling convention specifies that global registers gr64–gr95 be reserved for
operating system support tasks. For this reason it is normal to see the special register
used to support register banking set to disable User mode access to global registers
gr64–gr95.
1.10.2 Special Purpose Registers
Special purpose register space is used to contain registers which are not ac-
cessed directly by high level languages. Registers such as the program counter and
the interrupt vector table base pointer are located in special register space. Normally
these registers are accessed by operating system code or assembly language helper
routines. Special registers can only be accessed by move–to and move–from type
instructions; except for the move–to–immediate case. Move–to and move–from
instructions require the use of a general purpose register. It is worth noting that
move–to special register instructions are among a small group of instructions which
cause processor serialization. That is, all outstanding operations, such as overlapping
load or store instructions, are completed before the serializing instruction com-
mences.
Special register space is divided into two regions (see Figure 1-15). Those reg-
isters whose address is below sr128 can only be accessed by the processor operating
in Supervisor mode. Different members of the 29K family have extensions to the
special registers shown in Figure 1-15. However, special registers sr0–sr14 are a subset
which appears in all family members. Certain, generally lower cost, family members
such as the Am29005 processor, which have no memory management unit, do
not have the relevant MMU support registers (sr13 and sr14). I shall first describe the
restricted access, or protected, special registers. I shall not go into the exact bit–field
operations in detail, for an expansion of field meanings see later chapters or the rele-
vant processor User’s Manual. The objective here is to provide a framework for bet-
ter understanding the special register space.
Special registers are not generally known by their special register number. For
example, the program counter buffer register PC1 is known as PC1 by assembly lan-
guage programming tools rather than sr11.
Vector Area Base
Special register sr0, better known as VAB, is a pointer to the base of a table of
256 address values. Each interrupt or trap is assigned a unique vector number. Vector
numbers 0–63 are assigned to specific processor support tasks. When an interrupt or
trap exception is taken, the vector number is used to index the table of address values.
The identified address value is read and used as the start address of the exception
handling routine. Alternatively with 3–bus members of the 29K family, the vector
table can contain 256 blocks of instructions. The VF bit (vector fetch) in the proces-
sor Configuration register (CFG) is used to select the vector table configuration.
Each block is limited to 64 instructions, but via this method the interrupt handler can
be reached faster as the start of, say, an interrupt handler need not be preceded by a
fetch of the address of the handler. In practice the table of vectors to handlers, rather
than handlers themselves, is predominantly used due to the more efficient use of
memory. For this reason the two later 2–bus members of the 29K family only support
the table of vectors method; and the VF bit in the CFG register is reserved and effec-
tively set.
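The two vector table configurations differ only in how far apart the entries are spaced. The C sketch below computes the address selected by a vector number in each case. The function and macro names are invented for this example, and the 4–byte size of each address entry is an assumption based on the 32–bit address width; the block size follows from the 64–instruction limit given above and the 32–bit instruction width.

```c
#include <stdint.h>

#define VECTOR_COUNT       256u
#define BYTES_PER_ADDRESS  4u   /* assumed: one 32-bit address per entry */
#define BYTES_PER_INSN     4u   /* 29K instructions are 32 bits wide */
#define INSNS_PER_BLOCK    64u  /* block limit given in the text */

/* Table of vectors: address of the entry holding the handler's start address. */
static uint32_t vector_entry_address(uint32_t vab, unsigned vector)
{
    return vab + (vector % VECTOR_COUNT) * BYTES_PER_ADDRESS;
}

/* Table of handlers: address of the first instruction of the handler block. */
static uint32_t vector_block_address(uint32_t vab, unsigned vector)
{
    return vab + (vector % VECTOR_COUNT) * INSNS_PER_BLOCK * BYTES_PER_INSN;
}
```

Under these assumptions a complete table of vectors occupies 256 x 4 = 1K bytes, while a complete table of handler blocks occupies 256 x 256 = 64K bytes, which illustrates the more efficient use of memory mentioned above.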
The first 29K processor, the Am29000, has a VAB register which requires the
base of the vector table to be aligned to a 64k byte address boundary. This can be in-
convenient and lead to memory wastage. More recent family members provide for a
1k byte boundary. Because the 3–bus family members support instructions being lo-
cated in Instruction space and ROM space (memory space is described in section
1.10.4), it is possible with these processors to specify that handler routines are in
ROM space by setting the RV bit (ROM vector area) in the CFG register when the VF
bit is zero. Or, when the more typical table of vectors method is being used, by setting
bit–1 of the handler address. Since handler routines all start on 4–byte instruction
boundaries, bits 0 and 1 of the vector address are not required to hold address information.
The 2–bus and microcontroller members of the 29K family do not support
ROM space, and the RV bit in the CFG register is reserved.

Special Purpose
Reg. No.          Protected Registers             Mnemonic
0                 Vector Area Base Address        VAB
1                 Old Processor Status            OPS
2                 Current Processor Status        CPS
3                 Configuration                   CFG
4                 Channel Address                 CHA
5                 Channel Data                    CHD
6                 Channel Control                 CHC
7                 Register Bank Protect           RBP
8                 Timer Counter                   TMC
9                 Timer Reload                    TMR
10                Program Counter 0               PC0
11                Program Counter 1               PC1
12                Program Counter 2               PC2
13                MMU Configuration               MMU
14                LRU Recommendation              LRU

                  Unprotected Registers
128               Indirect Pointer C              IPC
129               Indirect Pointer A              IPA
130               Indirect Pointer B              IPB
131               Q                               Q
132               ALU Status                      ALU
133               Byte pointer                    BP
134               Funnel Shift Count              FC
135               Load/Store Count Remaining      CR
160               Floating–Point Environment      FPE
161               Integer Environment             INTE
162               Floating–Point Status           FPS

Figure 1-15. Special Purpose Register Space for the Am29000 Microprocessor

Processor Status
Two special registers, sr1 and sr2, are provided for processor status reporting
and control. The two registers OPS (old processor status) and CPS (current processor
status) have the same bit–field format. Each bit position has been assigned a unique
task. Some bit positions are not effective with particular family members. For exam-
ple, the Am29030 processor does not use bit position 15 (CA). This bit is used to indi-
cate coprocessor activity. Only the 3–bus family members support coprocessor op-
eration in this way.
The CPS register reports and controls current processor operation. Supervisor
mode code is often involved with manipulating this register as it controls the enabling
and disabling of interrupts and address translation. When a program execution ex-
ception is taken, or an external event such as an interrupt occurs, the CPS register
value is copied to the OPS register and the processor modifies the CPS register to
enter Supervisor mode before execution continues in the selected exception handling
routine. When returning from the handler routine, the interrupted program is re-
started with an IRET type instruction. Execution of an IRET instruction causes the
OPS register to be copied back to the CPS register, helping to restore the interrupted
program context. Supervisor mode code often prepares OPS register contents before
executing an IRET and starting User mode code execution.
Configuration
Special register sr3, known as the configuration control register (CFG), esta-
blishes the selected processor operation. Such options as big or little endian byte or-
der, cache enabling, coprocessor enabling, and more are selected by the CFG setting.
Normally this register value is established at processor boot–up time and is infre-
quently modified.
The original Am29000 (rev C and later) only used the first six bits of the CFG
register for processor configuration. Later members of the family offer the selection
of additional processor options, such as instruction memory cache and early address
generation. Additional options are supported by extensions to the CFG bit–field as-
signment. Because there is no overlap with CFG bit–field assignment across the 29K
family, and family members offer a matrix of functionality, there are often reserved
bit–fields in the CFG register for any particular 29K processor. The function pro-
vided at each bit position is unique and if the function is not provided for by a proces-
sor, the bit position is reserved.
The upper 8–bits of the CFG register are used for processor version and revision
identification. The upper 3–bits of this field, known as the PRL (processor revision
level) identify the processor. The Am29000 processor is identified by processor
number 0, the Am29050 is processor number 1, and so on. The lower 5–bits of the
PRL give the revision level; a value of 3 indicates revision ‘D’. The PRL field is
read–only.
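The PRL field layout described above can be decoded with a few shifts and masks. The following C sketch is illustrative only: the function names are invented, and the mapping from revision number to letter assumes that revisions are lettered consecutively from ‘A’, which the text confirms only for the value 3.

```c
#include <stdint.h>

/* The upper 8 bits of CFG hold the PRL field. */
static unsigned cfg_prl(uint32_t cfg)
{
    return cfg >> 24;
}

/* Upper 3 bits of PRL: processor number (0 = Am29000, 1 = Am29050, ...). */
static unsigned prl_processor(unsigned prl)
{
    return (prl >> 5) & 0x7u;
}

/* Lower 5 bits of PRL: revision level. */
static unsigned prl_revision(unsigned prl)
{
    return prl & 0x1Fu;
}

/* Assumes consecutive lettering from 'A'; only 3 -> 'D' is stated in the
 * text, and values above 25 would run past 'Z'. */
static char prl_revision_letter(unsigned prl)
{
    return (char)('A' + prl_revision(prl));
}
```

For example, a CFG value whose upper byte is 0x23 would decode as processor number 1 (the Am29050) at revision level 3, that is, revision ‘D’.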
Data Access Channel
Three special registers, sr4–sr6, known as CHA (channel address), CHD (chan-
nel data) and CHC (channel control), are used to control and record all access to ex-
ternal data memory. Processors in the 29K family can perform data memory access in
parallel with instruction execution. This offers a considerable performance boost,
particularly where there is high data memory access latency. Parallel operation can
only occur if the instruction pipeline can be kept fed from the instruction prefetch
buffer (IPB), instruction memory cache, or via separate paths to data and instruction
memory (Harvard style 3–bus processors). It is an important task of a high level lan-
guage compiler to schedule load and store instructions such that they can be success-
fully overlapped with other nondependent instructions (see section 1.13).
When data memory access runs in parallel, its completion will occur some time
after the instruction originally making the data access. In fact it could be several
cycles after the original request, and it may not be possible to determine the original
instruction. On many processors, keeping track of the original instruction is required
in case the load or store operation does not complete for some reason. The original
instruction is restarted after the interrupting complication has been dealt with. How-
ever, with the 29K family the original instruction is not restarted. All access to exter-
nal memory is via the processor Data Channel. The three channel support registers
are used to restart any interrupted load or store operation. Should an exception occur
during data memory access, such as an address translation fault, memory access
violation, or external interrupt, the channel registers are updated by the processor re-
porting the state of the in–progress memory access.
The channel control register (CHC) contains a number of bit–fields. The con-
tents–valid bit (CV) indicates that the channel support registers currently describe a
valid data access. The CV bit is normally seen set when a channel operation is inter-
rupted. The ML bit indicates a load– or store–multiple operation is in progress.
LOADM and STOREM instructions set this bit when commencing and clear it when
complete. It is important to note that non–multiple LOAD and STORE instructions
do not set or clear the ML bit. When a load– or store–multiple operation is interrupted
and nested interrupt processing is supported, it is not sufficient to just clear the CV bit
to temporarily cancel the channel operation. If the ML bit were left set, a subsequent
load or store operation would become confused with a multiple type operation. The
ML bit should be cleared along with the CV bit; this is best done by writing zero into
the CHC register. (See section 4.3.8 for more information about clearing CHC.)
Integer operations complete in a single cycle, enabling the result of the previous
integer operation to be written back to the general purpose register file in the current
cycle. Because external memory reads are likely to take several cycles to complete,
and pipeline stalling is to be avoided, the accessed data value is not written back to the
global register file during the following instruction (the write–back cycle). This re-
sults in the load data being held by the processor until access to the write–back port is
available. This is certain to occur during the execution of any future load or store
instruction which itself cannot make use of its own write–back cycle. The processor
makes available via load forwarding circuitry the load data which awaits write–back
to the register file.
Register Access Protection
Special register sr7, known as RBP (register bank protect), provides a means to
restrict the access of general purpose registers by programs executing in User mode.
General purpose registers starting with gr64 are divided into groups of 16 registers.
When the corresponding bit in the RBP register is set, the associated bank of 16 regis-
ters is protected from User mode access. The RBP register is typically used to prevent
User mode programs from accessing Supervisor–maintained information held in
global registers gr64–gr95. These registers are reserved by the AMD high level lan-
guage calling convention for system level information.
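As a rough illustration of how an RBP value might be composed, the following sketch computes a protection mask for a range of general purpose registers. It assumes that bit n of RBP covers the bank holding absolute register numbers 16n to 16n+15, so that gr64 falls in bank 4; that bit numbering is an assumption of this sketch and should be confirmed against the processor manual. The helper name rbp_mask is hypothetical.

```python
BANK_SIZE = 16  # registers per protected bank, as described in the text


def rbp_mask(first_reg, last_reg):
    """Return a mask with one bit set for each bank touched by first_reg..last_reg.

    Assumption (not stated in the text): RBP bit n protects registers
    16*n through 16*n + 15.
    """
    if not 0 <= first_reg <= last_reg <= 255:
        raise ValueError("need 0 <= first_reg <= last_reg <= 255")
    mask = 0
    for bank in range(first_reg // BANK_SIZE, last_reg // BANK_SIZE + 1):
        mask |= 1 << bank
    return mask


# Protect the Supervisor-maintained registers gr64-gr95 (two banks).
print(hex(rbp_mask(64, 95)))
```

Under the stated assumption, gr64–gr95 span two adjacent banks, so two adjacent bits are set in the mask.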
On–Chip Timer Control
Special registers sr8 and sr9, known as TMC (timer counter) and TMR (timer
reload value), support a 24–bit real–time clock. The TMC register decrements at the
rate of the processor clock. When it reaches zero it will generate an interrupt if en-
abled. In conjunction with support software these two registers can be used to imple-
ment many of the functions often supported by off–chip timer circuitry.
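The arithmetic behind choosing a timer reload value can be sketched as follows. This is a simple model, assuming only what the text states: a 24–bit counter that decrements once per processor clock. The 25 MHz clock is an example figure, the function names are hypothetical, and the exact relationship between the reload value and the interrupt period (N versus N–1 counts) is not modeled.

```python
TIMER_BITS = 24
TIMER_MAX = (1 << TIMER_BITS) - 1  # largest value a 24-bit counter can hold


def timer_reload(clock_hz, period_s):
    """Return the number of processor clocks in one timer period."""
    counts = round(clock_hz * period_s)
    if not 1 <= counts <= TIMER_MAX:
        raise ValueError("period does not fit in a 24-bit counter")
    return counts


def max_period(clock_hz):
    """Longest period, in seconds, that the 24-bit counter can time."""
    return TIMER_MAX / clock_hz


print(timer_reload(25_000_000, 0.001))  # clocks in a 1 ms tick at 25 MHz
print(max_period(25_000_000))           # longest single period at 25 MHz
```

Periods longer than the counter can hold are the kind of function that the support software mentioned above would have to provide, for example by counting timer interrupts.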
Program Counter
A 29K processor contains a Master and Slave PC (program counter) address
register. The Master PC register contains the address of the instruction currently
being fetched. The Slave contains the address of the next sequential instruction. Once
an instruction flows into the execution unit, unless interrupted, the following
instruction, currently in decode, will always flow into the execution unit. This is true
for all instructions except for instructions such as IRET. Even if the instruction in execute is a
jump–type, the following instruction known as the delay–slot instruction is executed
before the jump is taken. This is known as delayed branching and can be very useful
in hiding memory access latencies, as the processor pipeline can be kept busy execut-
ing the delay–slot instruction while the new instruction sequence is fetched. It is an
important activity of high level language compilers to find useful instructions to
place in delay–slot locations.
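The execution order produced by delayed branching can be illustrated with a toy interpreter. This is only a sketch: the program representation (a list of name and jump–target pairs) and the function name run are invented for illustration, and every jump is treated as taken.

```python
def run(program, max_steps=100):
    """Execute a toy program in which each jump has one delay slot.

    program: list of (name, jump_target) pairs; jump_target is an index
    into the list, or None for a non-jump instruction.
    Returns the instruction names in the order they execute.
    """
    trace = []
    pc = 0
    pending = None  # target of a jump whose delay slot has not yet executed
    while pc < len(program) and len(trace) < max_steps:
        name, target = program[pc]
        trace.append(name)
        if pending is not None:
            # This instruction was the delay slot; the jump now takes effect.
            pc, pending = pending, None
        else:
            pc += 1
        if target is not None:
            pending = target
    return trace


program = [
    ("add", None),
    ("jmp", 4),            # jump to index 4
    ("delay_slot", None),  # still executed before the jump is taken
    ("skipped", None),
    ("target", None),
]
print(run(program))
```

The instruction after the jump executes even though the jump is taken, while the instruction after the delay slot is skipped.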
The Master PC value flows along the PC–bus and the bus activity is recorded by
the PC buffer registers, see Figure 1-16. There are three buffer registers arranged in
sequence. These buffer registers are accessible within special register space as
sr10–sr12, better known as PC0, PC1 and PC2. The PC0 register contains the address
of the instruction currently in decode; register PC1 contains the address of the
instruction currently in execute; and PC2 the instruction now in write–back.
[Figure 1-16 is a block diagram. Labels recoverable from it: Instruction Fetch or
Cache Unit; Address Generation; Fetch address; PC–BUS (with a width marking of 30);
30–bit PC Incrementer; Master PC; Slave PC; PC MUX; Return Address; Branch Address;
PC–Buffer, holding PC 0 (Decode address), PC 1 (Execute address) and PC 2
(Write–Back address); R–BUS; B–BUS, supplies branch addresses.]
Figure 1-16. Am29000 Processor Program Counter
When a program exception occurs the PC–buffer registers become frozen. This
is signified by the FZ bit in the current processor status register being set. When fro-
zen, the PC–buffer registers accumulate no new PC–bus information. The frozen PC
information can be used later to restart program execution. An IRET instruction
causes the PC1 and PC0 register information to be copied to the Master and Slave PC
registers and instruction fetching to commence. For this reason it is important to
maintain both PC1 and PC0 values when dealing with such system level activities as
nested interrupt servicing. Since the PC2 register records the address of an
instruction that has already executed, maintenance of its value is less important; but it
can play an important role in debugging.
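The behavior of the PC–buffer registers can be modeled as a three–stage shift register that stops shifting while frozen. The class below is a toy model under that reading of the text; the class and method names are hypothetical, and the returned pair simply names PC1 and PC0 in the order the text gives them, without modeling how the Master and Slave PC registers consume them.

```python
class PCBuffer:
    """Toy model of the PC0/PC1/PC2 buffer registers."""

    def __init__(self):
        self.pc0 = self.pc1 = self.pc2 = None
        self.frozen = False  # stands in for the FZ bit

    def clock(self, decode_address):
        """Shift in the address of the instruction entering decode."""
        if self.frozen:
            return  # frozen registers accumulate no new PC-bus information
        self.pc2, self.pc1, self.pc0 = self.pc1, self.pc0, decode_address

    def take_exception(self):
        """Freeze the buffer, as happens when a program exception occurs."""
        self.frozen = True

    def iret(self):
        """Unfreeze and return the (PC1, PC0) pair used to restart fetching."""
        self.frozen = False
        return self.pc1, self.pc0


buf = PCBuffer()
for address in (0x100, 0x104, 0x108):
    buf.clock(address)
buf.take_exception()
buf.clock(0x800)  # ignored while frozen
print(hex(buf.pc0), hex(buf.pc1), hex(buf.pc2))
```

A nested interrupt handler that wishes to re-enable exceptions would need to save the PC1 and PC0 values first, since they would otherwise be overwritten once the buffer is unfrozen.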
When a CALL instruction is executed, the B–bus supplies the Master PC with
the address of the new instruction stream. Earlier, when the CALL instruction en-
tered the decode stage, the PC–bus was used to fetch the delay–slot instruction; and
the address of the instruction following the delay–slot (the return address) was pre-
pared for entry into the Slave PC. On the following cycle, the CALL instruction en-
ters the execute stage and the return address enters the Return Address register. Dur-
ing CALL execution, the return address is transferred to the register file via the R–
BUS.
MMU Control
The last of the generally available special registers are concerned with memory
management unit (MMU) operation. Processors which have the Translation Look–
Aside Buffer (TLB) registers omitted will not have these two special registers. The
operation of the MMU is quite complex, and Chapter 6 is fully dedicated to the de-
scription of its operation. Many computer professionals working in real–time proj-
ects may be unfamiliar with MMU operation. The MMU enables virtual addresses
generated by the processor to be translated into physical memory addresses. Addi-
tionally, memory is divided into page sized quantities which can be individually pro-
tected against User mode or Supervisor mode read and write access.
Special register sr13, known as MMU, is used to select the page size; a mini-
mum of 1k bytes, and a maximum of 8k bytes. Also specified is the current User
mode process identifier. Each User mode process is given a unique identifier and Su-
pervisor mode processes are assumed to have identifier 0.
Certain newer 29K processors support two TLB systems on–chip. Each TLB
has an independently programmable page size. These processors, and their close
relatives, can be programmed for a maximum page size of 16M bytes.
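The effect of the page size selection can be shown by splitting a 32–bit virtual address into a page number and an offset within the page. The sketch covers only the 1k to 8k byte range described for the MMU register; the function name is hypothetical.

```python
PAGE_SIZES = (1024, 2048, 4096, 8192)  # the 1k to 8k byte range from the text


def split_virtual_address(vaddr, page_size):
    """Return (virtual page number, offset within page) for a 32-bit address."""
    if page_size not in PAGE_SIZES:
        raise ValueError("unsupported page size")
    if not 0 <= vaddr < 2**32:
        raise ValueError("address must fit in 32 bits")
    offset_bits = page_size.bit_length() - 1  # 10 bits for 1k ... 13 bits for 8k
    return vaddr >> offset_bits, vaddr & (page_size - 1)


page, offset = split_virtual_address(0x00012345, 4096)
print(hex(page), hex(offset))
```

Only the page number takes part in translation; the offset passes through unchanged, which is why a larger page size lets each translation cover more memory.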
Additional Protected Special Registers
Monitor Mode
Some newer members of the 29K family have additional Supervisor only acces-
sible special registers which are addressed above sr14. Figure 1-17 shows the addi-
tional special registers for processors which support Monitor mode. Special register
sr15, known as RSN (reason vector), records the trap number causing Monitor mode
Special Purpose
Reg. No.          Protected Registers          Mnemonic
15                Reason Vector                RSN
20                Shadow Program Counter 0     SPC0
21                Shadow Program Counter 1     SPC1
22                Shadow Program Counter 2     SPC2
Figure 1-17. Additional Special Purpose Registers for the Monitor Mode Support
to be entered. Monitor mode extends the software debugging capability of the
processor; it was briefly described in the previous section describing the processor
features, and is dealt with in detail in later chapters. The shadow Program Counter
registers constitute a second set of PC–buffer registers. They record the PC–bus activity
and are used to support Monitor mode debugging.
Am29050
Figure 1-18 shows the additional special registers used by the Am29050 pro-
cessor for region mapping. In the Am29050 case, the additional special registers sup-
port two functions: debugging and region mapping. Four special registers in the
range sr16–sr19 extend the virtual address mapping capabilities of the TLB registers.
They support the mapping of two regions which are of programmable size. Their use
reduces the demand placed on TLB registers to supply all of a system’s address
mapping and memory access protection requirements.
Special Purpose
Reg. No.          Protected Registers          Mnemonic
16                Region Mapping Address 0     RMA0
17                Region Mapping Control 0     RMC0
18                Region Mapping Address 1     RMA1
19                Region Mapping Control 1     RMC1
Figure 1-18. Additional Special Purpose Registers for the Am29050 Microprocessor
Instruction and Data Breakpoints
Figure 1-19 shows the additional special registers for processors which support
breakpoint debugging. They facilitate the control of separate instruction access
Special Purpose
Reg. No.          Protected Registers                 Mnemonic
23                Instruction Breakpoint Address 0    IBA0
24                Instruction Breakpoint Control 0    IBC0
25                Instruction Breakpoint Address 1    IBA1
26                Instruction Breakpoint Control 1    IBC1
27                Data Breakpoint Address 0           DBA0
28                Data Breakpoint Control 0           DBC0
Figure 1-19. Additional Special Purpose Registers for Breakpoint Control
breakpoints and data access breakpoints. Some 29K processors have instruction
breakpoints only; others support both types of breakpoint.
On–Chip Cache Control
Figure 1-20 shows the additional special registers required to access on–chip
cache. There are only two additional registers, sr29 and sr30, required. Both registers
are used for communicating with the instruction memory cache supported by many
29K processors. If a processor also contains data cache, the memory can similarly be
accessed via the same cache interface registers. Supervisor mode support code con-
trols cache operation via the processor configuration register (CFG), and is not likely
to make use of the cache interface registers. These registers may be used by debug-
gers and monitors to preload and examine cache memory contents.
Special Purpose
Reg. No.          Protected Registers          Mnemonic
29                Cache Interface Register     CIR
30                Cache Data Register          CDR
Figure 1-20. Additional Special Purpose Registers for On–Chip Cache Control
User Mode Accessible Special Registers
Figure 1-15 showed the special register space with its two regions. The region
addressed at sr128 and above is always accessible; registers below sr128 are
accessible to the processor only when operating in Supervisor mode.
The original Am29000 processor defined a subset of User mode accessible reg-
isters, in fact those shown in Figure 1-15. Every 29K processor supports the use of
these special registers, but only the Am29050 has the full complement implemented.
Registers in the range sr128–sr135 are always present. However, the three reg-
isters sr160–sr162 are used to support floating–point and integer operations. Only
certain members of the 29K family directly support these operations in processor
hardware. Other 29K family members virtualize these three registers. When not
available, an attempt to access them causes a protection violation trap. The trap han-
dler identifies the attempted operation and redirects the access to shadow copies of
the missing registers. The accessor is unaware that the virtualization has occurred,
except for the delay in completing the requested operation. In practice, floating–point
supporting special registers are not frequently accessed; except for the case of float-
ing–point intensive systems which tend to be constructed around an Am29050 pro-
cessor.
Indirect Pointers
Special registers sr128–sr130, better known as IPA, IPB and IPC, are the indi-
rect pointers used to access the general purpose register file. For instructions which
make use of the three operand fields, RA, RB and RC, to address general purpose
registers, the indirect pointer can be used as an alternative operand address source.
For example, the RA operand field supplies the register number for the source oper-
and–A; if global register address gr0 is used in the RA instruction field, then the oper-
and register number is provided by the IPA register.
The IPA, IPB and IPC registers are pointers into the global register file. They are
generally used to point to parameters passed to User mode helper routines. They are
also used to support instruction emulation, where trap handler routines perform in
software the missing instruction. The operands for the emulated instruction are
passed to the trap handler via the indirect pointers.
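The operand selection rule can be written down in a few lines. This sketch assumes, by analogy with the RA and IPA example in the text, that IPB and IPC play the same role for the RB and RC fields; the function name and the register numbers in the example are purely illustrative.

```python
INDIRECT = 0  # gr0 in an operand field selects the indirect pointer


def resolve_operands(rc, ra, rb, ipc, ipa, ipb):
    """Return the register numbers actually used for the RC, RA and RB fields.

    A field holding gr0 is replaced by the register number held in the
    matching indirect pointer; any other field value is used directly.
    """
    def pick(field, pointer):
        return pointer if field == INDIRECT else field

    return pick(rc, ipc), pick(ra, ipa), pick(rb, ipb)


# RC names gr0, so the destination comes from IPC; RA and RB are used as given.
print(resolve_operands(rc=0, ra=0x60, rb=0x61, ipc=0x82, ipa=0x80, ipb=0x81))
```

An instruction emulation handler could use the same rule in reverse: the indirect pointers tell it which registers the trapping instruction named as operands.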
ALU Support
Special registers sr131–sr134 support arithmetic unit operation. Register
sr131, better known as Q, is used during floating–point and integer multiply and di-
vide steps. Only the Am29050 processor can perform floating–point operations di-
rectly, that is, without coprocessor or software emulation help. It is also the only pro-
cessor which directly supports integer multiply. All other current members of the
29K family perform these operations in a sequence of steps which make use of the Q
register.
The result of a comparison instruction is placed in a general purpose register, as
well as in the condition field of the ALU status register (special register sr132). How-
ever, the ALU status register is not conveniently tested by such instructions as condi-
tional branch. Branch decisions are made on the basis of True or False values held in
general purpose registers. This makes a lot of sense, as contention for use of a single
resource such as the ALU status register would lead to a resource conflict which
would likely result in unwanted pipeline stalling.
The ALU status register controls and reports the operation of the processor inte-
ger operation unit. It is divided into a number of specialized fields which, in some
cases, can be more conveniently accessed via special registers sr133 and sr134. The
shorthand access provided by these additional registers avoids the read, shift and
mask operations normally required before writing to bit–fields in the ALU register.
Data Access Channel
The three channel control registers, CHA, CHD and CHC, were previously de-
scribed in the protected special registers section. However, User mode programs
have a need to establish load– and store–multiple operations which are controlled by
the channel support registers. Special register sr135, known as CR, provides a means
for a User mode program to set the Count Remaining field of the protected CHC reg-
ister. This field specifies the number of consecutive words transferred by the multiple
data move operation. Should the operation be interrupted for any reason, the CR field
reports the number of transfers yet to be completed. Channel operation is typically
restarted (if enabled) when an IRET type instruction is issued.
Instruction Environment Registers
Special registers sr160 and sr162, known as FPE and FPS, are the floating–
point environment and status registers. The environment register is used by User
mode programs to establish the required floating–point operations, such as double–
or single–precision, IEEE specification conformance, and exception trap enabling.
The status register reports the outcome of floating–point operations. It is typically
examined as a result of a floating–point operation exception occurring. Only proces-
sors (Am29050) which support floating–point operations directly (free of trapware)
have real sr160 and sr162 registers. All other processors appear to have these
registers via trapware support which creates virtual registers.
The integer environment is established by setting special register sr161, known
as INTE. There are two control bits which separately enable integer and multiplica-
tion overflow exceptions. If exception detection is enabled, the processor will take
an Out–of–Range trap when an overflow occurs. Only processors (Am29040,
Am29240 and Am29243) which support integer multiply directly (free of trapware)
have a real sr161 register. All other processors appear to have an sr161 register via
trapware support.
Additional User Mode Special Registers
Am29050
The Am29050 has an additional special register, shown in Figure 1-21. Register
sr164, known as EXOP, reports the instruction operation code causing a trap. It is
used by floating–point instruction exceptions. Unlike other 29K processors the
Am29050 directly executes all floating–point instructions. Exception traps can oc-
cur during these operations. When instruction emulation techniques are being used, it
is an easy matter to determine the instruction being emulated at the time of the trap.
However, with direct execution things are not as simple. The processor could ex-
amine the memory at the address indicated by the PC–buffer registers to determine
the relevant instruction opcode. But the Am29050 supports a Harvard memory archi-
tecture and there is no path within the processor to access the instruction memory as if
it were data. The EXOP register solves this problem. Whenever an exception trap is
taken, the EXOP register reports the opcode of the instruction causing the exception.
Users of other 3–bus Harvard type processors such as the Am29000 and
Am29005 should take note; virtualizing the unprotected special registers sr160–162
requires that the instruction space be readable by the processor (virtualizing, in this
case, means making registers sr160–162 appear to be accessible even when they are
not physically present). This can only be achieved by connecting the instruction and
Special Purpose
Reg. No.          Unprotected Registers        Mnemonic
164               Exception Opcode             EXOP
Figure 1-21. Additional Special Purpose Register for the Am29050 Microprocessor
data busses together (disabling the Harvard architecture advantages by creating a
2–bus system) or providing an off–chip bridge. This bridge must enable the instruction
space to be reached from within some range of data memory space, at least for
word–size read accesses, albeit with additional access time penalties.
The Am29050 processor has an additional group of registers known as the float-
ing–point accumulators. There are four 64–bit accumulators ACC3–0 which can be
used with certain floating–point operations. They can hold double– or single–precision
numbers. They are not special registers in the sense of lying in special register
space. They are located in their own register space, giving the Am29050 one more
register space than the normal three register spaces of the other 29K family members.
However, like special registers, they can only be accessed by move–to and move–
from accumulator type instructions.
Double–precision numbers (64–bit) can be moved between accumulators and
general registers in a single cycle. Global registers are used in pairs for this operation.
This is possible because the Am29050 processor is equipped with an additional
64–bit write–back port for floating–point data, and the register file is implemented
with a width of 64 bits.
1.10.3 Translation Look–Aside Registers
Although some 29K family members are equipped with region mapping regis-
ters, a Translation Look–Aside Buffer (TLB) technique is generally used to provide
virtual to physical address translation. The TLB is two–way set associative and up to
64 translations are cached in the TLB support registers.
The TLB registers form the basis for implementing a Memory Management
Unit. The scheme for reloading TLB registers is not dictated by processor microcode,
but left to the programmer to organize. This enables a number of performance boost-
ing schemes to be implemented with low overhead costs. However, it does place the
burden of creating a TLB maintenance scheme on the user. Those used to having to
work around a processor’s microcode imposed scheme will appreciate the freedom.
TLB registers can only be accessed by move–to TLB and move–from TLB
instructions executed by the processor operating in Supervisor mode. Each of the
possible 64 translation entries (less than 64 with some 29K family members) requires
a pair of TLB registers to fully describe the address translation and access permis-
sions for the mapped page. Pages are programmable in size from 1k bytes to 8k bytes
(to 16M bytes with newer 29K processors), and separate read, write and execute
permissions can be enabled for User mode and Supervisor mode access to the mapped
page.
There is only a single 32–bit virtual address space supported. This space is
mapped to real instruction, data or I/O memory. Address translation is performed in a
single cycle which is overlapped with other processor operations. As a result, use of
the MMU imposes no run–time performance penalty, except where
TLB misses occur and the TLB cache has to be refilled. Each TLB entry is tagged
with a per–process identifier, avoiding the need to flush TLB contents when a user–
task context switch occurs. Chapter 6 fully describes the operation of the TLB.
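A two–way set associative lookup with per–process tagging can be sketched as follows. This is a simplified model, not the processor's actual register layout: it assumes 32 sets of two entries (64 translations), takes the set index from the low bits of the virtual page number, and omits the access permission bits. The class and method names are hypothetical.

```python
class TLB:
    """Toy two-way set associative TLB with per-process tags."""

    SETS = 32  # 64 entries arranged as two ways per set
    WAYS = 2

    def __init__(self, page_size=4096):
        self.offset_bits = page_size.bit_length() - 1
        self.entries = [[None] * self.WAYS for _ in range(self.SETS)]

    def _split(self, vaddr):
        vpn = vaddr >> self.offset_bits
        return vpn % self.SETS, vpn // self.SETS  # (set index, tag)

    def load(self, way, vaddr, pid, frame):
        """Install a translation; the reload policy is left to software."""
        index, tag = self._split(vaddr)
        self.entries[index][way] = (tag, pid, frame)

    def translate(self, vaddr, pid):
        """Return the physical address, or None on a TLB miss."""
        index, tag = self._split(vaddr)
        for entry in self.entries[index]:
            if entry is not None and entry[:2] == (tag, pid):
                offset = vaddr & ((1 << self.offset_bits) - 1)
                return (entry[2] << self.offset_bits) | offset
        return None


tlb = TLB()
tlb.load(way=0, vaddr=0x4000, pid=7, frame=0x80)
print(hex(tlb.translate(0x4123, pid=7)))  # hit for process 7
print(tlb.translate(0x4123, pid=8))       # miss for a different process
```

Because the process identifier is part of the match, entries belonging to other processes simply fail to hit, which is why no flush is needed on a context switch.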
1.10.4 External Address Space
The 3–bus members of the 29K family support five external 32–bit address
spaces. They are:
Data Memory — accessed via the data bus.
Input/Output — also accessed via the data bus.
Instruction — accessed via the instruction bus, normally read–only.
ROM — also accessed via the instruction bus, normally read–only.
Coprocessor — accessed via both data and address busses. Note, the address
bus is only used for stores to coprocessor space. This enables 64–bit transfers
during stores and 32–bit during loads.
The address bus is used for address information when accessing all address
spaces except the coprocessor space. During load and store operations to coprocessor
space, address information can be supplied in a limited way by the OPT2–0 field of
the load and store instructions. Of course, with off–chip address decoding support,
access to coprocessor space could always be made available via a region of I/O or
data space. Coprocessors support off–chip extensions to a processor’s execution
unit(s). In the past AMD supplied one coprocessor, the Am29027, for floating–point
support. It is possible that users could construct their own coprocessor for
some specialized support task.
Earlier sections discussed the read–only nature of the instruction bus of 3–bus
processors. Instructions are fetched along the instruction bus from either the ROM
space or the Instruction space. Access to the two 32–bit spaces is distinguished by the
IREQT processor pin. The state of this pin is determined by the RE (ROM enable) bit
of the current processor status register (CPS). This bit can be set by software or via
programmed event actions, such as trap processing. ROM space is intended for sys-
tem level support code. Typically systems do not decode this pin and the two spaces
are combined into one.
The Input/Output (I/O) space can be reached by setting the AS (address space)
bit in load and store instructions. Transfers to I/O space, like coprocessor space and
data space transfers, are indicated by the appropriate value appearing on the
DREQT1–0 (data request type) processor pins. I/O space access is only convenient
for assembly level routines. There is typically no convenient way for a high level lan-
guage to indicate an access is to be performed to I/O space rather than data space. For
this reason use of I/O space is often best avoided, unless it is restricted to accessing
some Supervisor maintained peripheral which is best handled via assembly language
code.
The 2–bus 29K family processors support a reduced number of off–chip address
spaces, in fact, only two: Input/Output space, and a combined Instruction/Data
memory space. Accessing both instructions and data via a shared instruction/data bus
simplifies the memory system design. It can also simplify the software; for example,
instruction space and data space can no longer overlap. Consider a 3–bus system
which has physical memory located at address 0x10000 in instruction space and also
different memory located at address 0x10000 in data space. Software errors can
occur over which memory an access to address 0x10000 should reach. Overlapping
spaces can also complicate system tasks such as virtual memory management, where separate free–page
lists would have to be kept for the different types of memory.
The Translation Look–Aside buffer (TLB), used to support virtual memory ad-
dressing, supports separate enabling of data and instruction access via the R/W/X
(read/write/execute) enable bits. However, permission checking is only performed
after address translation is performed. It is not possible to have two valid virtual–to–
physical address translations present in the TLB at the same time for the same virtual
address, even if one physical address is for data space and the other instruction space.
This complicates accessing overlapping address spaces via a single 32–bit virtual
space.
Accessing virtual memory has similar characteristics to accessing memory via a
high level language. For example, C normally supports a single address space. It is
difficult and nonportable to have C code which can reach different address spaces.
Except for instruction fetching, all off–chip memory accesses are via load and store
type instructions. The OPT2–0 field for these instructions specifies the size of the
data being transferred: byte, half–word or 32–bit. The compiler assigns OPT field
values for all load and store instructions it generates. Other than through C language
extensions or assembly code post–processing, there is no way to set the load and store
instruction address–space–selecting options. Software is simplified by locating all
external peripherals and memory in a single address space; or when a Harvard archi-
tecture is used, by not overlapping the regions of data and instruction memory spaces
used.
1.11 INSTRUCTION FORMAT
All instructions for the Am29000 processor are 32 bits in length, and are divided
into four fields, as shown in Figure 1-22. These fields have several alternative defini-
tions, as discussed below. In certain instructions, one or more fields are not used, and
are reserved for future use. Even though they have no effect on processor operation,
bits in reserved fields should be 0 to ensure compatibility with future processor
versions.
31        24 23        16 15         8 7              0
Op  A//M     RC           RA           RB
             I17..I10     SA           RB or I
             I15..I8                   I9..I2
             VN                        I7..I0
             CE//CNTL                  UI//RND//FD//FS
Figure 1-22. Instruction Format
The instruction fields are defined as follows:
BITS 31–24
Op This field contains an operation code, defining the operation to be
performed. In some instructions, the least–significant bit of the op-
eration code selects between two possible operands. For this reason,
the least–significant bit is sometimes labeled “A” or “M”, with the
following interpretations:
A (Absolute): The A bit is used to differentiate between Program–
Counter relative (A = 0) and absolute (A = 1) instruction addresses,
when these addresses appear within instructions.
M (IMmediate): The M bit selects between a register operand (M = 0)
and an immediate operand (M = 1), when the alternative is allowed by
an instruction.
BITS 23–16
RC The RC field contains a global or local register–number, which is the
destination operand for many instructions.
I17..I10 This field contains the most–significant 8 bits of a 16–bit instruction
address. This is a word address, and may be Program–Counter rela-
tive or absolute, depending on the A bit of the operation code.
I15..I8 This field contains the most–significant 8 bits of a 16–bit instruction
constant.
VN This field contains an 8–bit trap vector number.
CE//CNTL This field controls a load or store access.
BITS 15–8
RA The RA field contains a global or local register–number, which is a
source operand for many instructions.
SA The SA field contains a special–purpose register–number.
BITS 7–0
RB The RB field contains a global or local register–number, which is a
source operand for many instructions.
RB or I This field contains either a global or local register–number, or an
8–bit instruction constant, depending on the value of the M bit of the
operation code.
I9..I2 This field contains the least–significant 8 bits of a 16–bit instruction
address. This is a word address, and may be Program–Counter rela-
tive, or absolute, depending on the A bit of the operation code.
I7..I0 This field contains the least–significant 8 bits of a 16–bit instruction
constant.
UI//RND//FD//FS
This field controls the operation of the CONVERT instruction.
The fields described above may appear in many combinations. However, cer-
tain combinations which appear frequently are shown in Figure 1-23.
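The field layout described above can be exercised with a small decoder. The sketch extracts the four 8–bit fields, the A or M bit, and the 16–bit constant and instruction address formed from the split fields. The function names are hypothetical, the example word is an arbitrary bit pattern rather than a claim about any particular instruction, and sign extension of Program–Counter relative addresses is not modeled.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into its four 8-bit fields."""
    return {
        "op": (word >> 24) & 0xFF,  # bits 31-24
        "rc": (word >> 16) & 0xFF,  # bits 23-16
        "ra": (word >> 8) & 0xFF,   # bits 15-8
        "rb": word & 0xFF,          # bits 7-0
    }


def a_or_m_bit(word):
    """Least-significant bit of the operation code (the A or M bit)."""
    return (word >> 24) & 1


def constant16(word):
    """Join I15..I8 (bits 23-16) and I7..I0 (bits 7-0) into one constant."""
    return (((word >> 16) & 0xFF) << 8) | (word & 0xFF)


def jump_byte_address(word):
    """Join I17..I10 and I9..I2 into a word address, then scale to bytes."""
    return constant16(word) << 2


word = 0x03128134  # arbitrary example pattern
print(decode_fields(word), a_or_m_bit(word))
print(hex(constant16(word)), hex(jump_byte_address(word)))
```

The same two byte positions carry both the 16–bit constant and the instruction address; only the interpretation differs, with the address treated as a word address and therefore shifted left by two bits.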
1.12 KEEPING THE RISC PIPELINE BUSY
If the external interface of a microprocessor cannot support an instruction fetch
rate of one instruction per cycle, execution rates of one per cycle cannot be sustained.
As described in detail in Chapter 6, a 4–1 DRAM (4–cycle first access, 1–cycle sub-
sequent burst–mode access) memory system used with a 3–bus Am29000 processor,
can sustain an average processing time per instruction of typically two cycles, not the
desired 1–cycle per instruction. However, a 2–1 SRAM based system comes very
close to this target. From these example systems it can be seen that even if a memory
system can support 1–cycle burst–mode access, there are other factors which prevent
the processor from sustaining single–cycle execution rates.
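The cost of first–access latency can be shown with a deliberately crude model. It counts only instruction fetch cycles for a run of sequential instructions between taken branches, and ignores data accesses, bus contention and any caching. It is not intended to reproduce the figures quoted above, only to show why the first–access time matters; the run length of four instructions is an arbitrary assumption and the function name is hypothetical.

```python
def average_cycles_per_instruction(first_access, burst_access, run_length):
    """Average fetch cycles per instruction for one sequential run.

    first_access: cycles for the first instruction after a branch
    burst_access: cycles for each following burst-mode fetch
    run_length:   instructions fetched before the next taken branch
    """
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    total = first_access + (run_length - 1) * burst_access
    return total / run_length


print(average_cycles_per_instruction(4, 1, 4))  # 4-1 DRAM
print(average_cycles_per_instruction(2, 1, 4))  # 2-1 SRAM
```

In this model the penalty of a slow first access is spread over the run, so shorter runs between branches make the average worse.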
It is important to keep the processor pipeline busy doing useful work. Pipeline
stalling is a major source of lost processor performance. Stalling occurs as a result of:
Three operands, with possible 8–bit constant:
31            24 23        16 15         8 7           0
X X X X X X X M  RC           RA           RB or I

Three operands, without constant:
31            24 23        16 15         8 7           0
X X X X X X X 0  RC           RA           RB

One register operand, with 16–bit constant:
31            24 23        16 15         8 7           0
X X X X X X X 1  I15..I8      RA           I7..I0

Jumps and calls with 16–bit instruction address:
31            24 23        16 15         8 7           0
X X X X X X X A  I17..I10     RA           I9..I2

Two operands with trap vector number:
31            24 23        16 15         8 7           0
X X X X X X X M  VN           RA           RB or I

Loads and stores:
31            24 23        16 15         8 7           0
X X X X X X X M  CE//CNTL     RA           RB or I
Figure 1-23. Frequently Occurring Instruction–Field Uses
inadequate memory bandwidth, high memory access latency, bus access contention,
excessive program branching, and instruction dependencies. To get the best from a
processor an understanding of instruction stream dependencies is required.
Processors in the 29K family all have pipeline interlocks supported by processor hardware.
The programmer does not have to ensure correct pipeline operation, as the processor
will take care of any dependencies. However, it is best that the programmer arranges
code execution to smooth the pipeline operation.
1.13 PIPELINE DEPENDENCIES
Modification of some registers has a delayed effect on processor behavior.
When developing assembly code, care must be taken to prevent unexpected
behavior. The easiest of the delayed effects to remember is the one–cycle delay that
must separate setting an indirect pointer from using it. This occurs most often with
the register stack pointer, gr1. A local register cannot be accessed in the instruction
that follows the instruction that writes to gr1. An instruction that does not require gr1 (and
that means all local registers referenced via gr1) can be placed immediately after the
instruction that updates gr1.
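A check for the gr1 rule can be sketched as a scan over an instruction list. The representation used here, a (destination, sources) pair of register names per instruction, is invented for illustration, as is the function name; a real tool would work from assembled or parsed code.

```python
def uses_local(instruction):
    """True if the instruction names any local register (lr0, lr1, ...)."""
    dest, sources = instruction
    return any(r is not None and r.startswith("lr") for r in (dest, *sources))


def gr1_hazards(program):
    """Return indexes of instructions that use a local register
    immediately after an instruction that writes gr1."""
    return [
        i + 1
        for i in range(len(program) - 1)
        if program[i][0] == "gr1" and uses_local(program[i + 1])
    ]


bad = [
    ("gr1", ("gr1",)),    # adjust the register stack pointer
    ("lr0", ("gr96",)),   # local register used one instruction too early
    ("gr97", ("gr96",)),
]
good = [bad[0], bad[2], bad[1]]  # an unrelated instruction fills the gap
print(gr1_hazards(bad), gr1_hazards(good))
```

Reordering so that an instruction not involving local registers follows the gr1 update removes the hazard without adding a cycle.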
Direct modification of the Current Processor Status (CPS) register must also be
done carefully, particularly where the Freeze (FZ) bit is reset. When the processor is
frozen, the special-purpose registers are not updated during instruction execution.
This means that the PC1 register does not reflect the actual program counter value at
the current execution address, but rather at the point where freeze mode was entered.
When the processor is unfrozen, either by an interrupt return or direct modification of
the CPS, two cycles are required before the PC1 register reflects the new execution
address. Unless the CPS register is being modified directly, this creates no problem.
Consider the following examples. If the FZ bit is reset and trace enable (TE) is
set at the same time, the next instruction should cause a trace trap, but the PC–buffer
registers frozen by the trap will not have had time to catch up with the current
execution address. Within the trap code the processor will appear to have stopped at
some random address, held in PC1. If interrupts and traps are enabled at the same
time as the FZ bit is cleared, then the next instruction may suffer an external interrupt
or an illegal instruction trap. Once again, the PC–buffer register will not reflect the
true execution address. An interrupt return would cause execution to commence at a
random address. The above problems can be avoided by resetting FZ two cycles be-
fore enabling the processor to once again enter freeze mode.
Instruction Memory Latency
The Branch Target Cache (BTC), or the Instruction Memory Cache, can be used
to remove the pipeline stalling that normally occurs when the processor executes a
branch instruction. The effects of the BTC are used here to illustrate memory access
latency. The address of a branch target appears on the address
pins at the start of the write-back stage. Figure 1-24 shows the instruction flow
through the pipeline stages, assuming the external instruction memory returns the
target of a jump during the same cycle in which it was requested. This makes the Tar-
get instruction available at the fetch stage while the Delay instruction has to be stalled
before it can enter the execute stage. In this case, execution is stalled for two cycles
when the BTC is not used to supply the target instruction.
[Figure: Am29000 pipeline stages across processor cycles, from the current cycle through a 1–cycle fetch into future cycles. Instructions occupying each stage, in cycle order:
  Instruction Fetch:      Delay, Target, Target+1, Target+2
  Instruction Decode:     Jump, Delay, Target, Target+1
  Instruction Execution:  Current, Jump, Delay, Target
  Result Write-Back:      Jump, Delay
 Legend: Delay = Delay instruction; Jump = Jump instruction; Current = Current instruction; Target = Target of jump instruction; Target+1 = 1st instruction after target; Target+2 = 2nd instruction after target; a separate symbol marks a pipeline stall.]
Figure 1-24. Pipeline Stages for BTC Miss
The address of the fetch is presented to the BTC hardware during the execute
stage of the jump instruction, at the same time as the address is presented to the memory
management unit. When a hit occurs, the target instruction is presented to the decode
stage at the next cycle, so no pipeline stalling occurs. The external instruction
memory has up to three cycles to return the instruction four words past the target
address. That is, if single-cycle burst–mode can be established in three cycles (four
cycles for the Am29050 processor) or less, then continuous execution can be
achieved. The BTC supplies the target instruction and the following three instructions,
assuming another jump is not taken. Figure 1-25 shows the flow of instructions
through the pipeline stages.
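The timing condition for a BTC hit can be written as a small check. The following is a minimal Python sketch, not part of any AMD tool. The function name is hypothetical, and the rule that each cycle beyond the budget costs one stall cycle is an assumption made for illustration; only the three–cycle budget (four for the Am29050) comes from the description above.

```python
def btc_hit_stall_cycles(cycles_to_establish_burst, processor="Am29000"):
    """Estimate pipeline stall cycles following a BTC hit (illustrative model).

    On a hit, the BTC supplies the target instruction and the three that
    follow it, so external memory has three cycles (four on the Am29050)
    to deliver the instruction four words past the target.
    ASSUMPTION: every cycle beyond that budget costs one stall cycle.
    """
    budget = 4 if processor == "Am29050" else 3
    return max(0, cycles_to_establish_burst - budget)


print(btc_hit_stall_cycles(3))             # 0: continuous execution
print(btc_hit_stall_cycles(5))             # 2: memory is two cycles too slow
print(btc_hit_stall_cycles(4, "Am29050"))  # 0: the Am29050 allows four cycles
```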
[Figure: Am29000 pipeline stages across processor cycles, from the current cycle through a 3–cycle fetch into future cycles. Instructions occupying each stage, in cycle order:
  Instruction Fetch:      Delay, Delay+1, Target+4
  Instruction Decode:     Jump, Delay, Target, Target+1, Target+2, Target+3
  Instruction Execution:  Current, Jump, Delay, Target, Target+1, Target+2
  Result Write-Back:      Jump, Delay, Target, Target+1
 Legend: Delay = Delay instruction; Jump = Jump instruction; Current = Current instruction; Target = Target of jump instruction; Target+1 = 1st instruction after target; Target+2 = 2nd instruction after target.]
Figure 1-25. Pipeline Stages for a BTC Hit
Data Dependencies
Instructions that require the result of a load should not be placed immediately
after the load instruction. The Am29000 processor can overlap load instructions with
other instructions that do not depend on the result of the load. If 4-cycle data memory
is in use, then external data loads should (if possible) have four instructions
(four cycles) between the load instruction and the first use of the data. Instructions that
depend on data whose load has not yet completed cause a pipeline stall. The stall is
minimized by forwarding the data to the execution unit as soon as it is available.
Consider the example of an instruction sequence shown in Figure 1-26. The
instruction at Load+1 is dependent on the data loaded at Load. The address of load
data appears on the address pins at the start of the write-back stage. At this point,
instruction Load+1 has reached the execution stage and is stalled until the data is for-
warded at the start of the next cycle, assuming the external data memory can return
data within one cycle.
[Figure: Am29000 pipeline stages across processor cycles, from the current cycle through a 1–cycle stall into future cycles. Instructions occupying each stage, in cycle order:
  Instruction Fetch:      Load+1, Load+2
  Instruction Decode:     Load, Load+1, Load+2, Load+2
  Instruction Execution:  Current, Load, Load+1, Load+1, Load+2
  Result Write-Back:      Current, Load, Load+1
 Legend: Load = Load instruction; Current = Current instruction.]
Figure 1-26. Data Forwarding and Bad–Load Scheduling
If the instruction were not dependent on the result of the load, it would have
executed without delay. Because of data forwarding and a 1-cycle data memory, the
load data would be available for instruction Load+2 without causing a pipeline stall.
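The scheduling rule described in this section can be sketched as a one–line model. This Python fragment is only an illustration: the function name is hypothetical, and it assumes each independent instruction placed between the load and the first use of its data hides exactly one cycle of data memory latency. That assumption agrees with the two cases given in the text (a 1–cycle stall at Load+1, none at Load+2) and with the four–instruction guideline for 4–cycle memory.

```python
def load_use_stall(data_memory_cycles, instructions_between):
    """Stall cycles suffered by the first instruction that uses load data.

    data_memory_cycles   -- access time of the external data memory
    instructions_between -- independent instructions scheduled between the
                            load and the first use of the loaded value
    ASSUMPTION: one independent instruction hides one cycle of latency.
    """
    return max(0, data_memory_cycles - instructions_between)


print(load_use_stall(1, 0))  # 1: Load+1 uses the data (Figure 1-26)
print(load_use_stall(1, 1))  # 0: Load+2 uses the data
print(load_use_stall(4, 4))  # 0: four instructions cover 4-cycle memory
```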
1.14 ARCHITECTURAL SIMULATION, sim29
AMD has long made available a 29K simulator that accurately
models the processor operation. This simulator, known as the Architectural Simula-
tor, can be configured to incorporate memory system characteristics. Since memory
system performance can greatly influence overall system performance, the use of the
simulator before making design decisions is highly recommended.
Simulation of all the 29K family members is supported, making the simulator
useful in determining processor choice [AMD 1991c][AMD 1993c]. For example,
does a floating–point intensive application require an Am29050 or will an Am29000
suffice? Alternatively, the performance penalties of connecting the data and instruc-
tion busses together on a 3–bus Harvard Architecture processor can be determined.
Because the simulator models detailed processor operation, such as pipeline
stages, cache memory, instruction prefetch, channel operation and much more, the
simulation run–times are longer than if the Instruction Set Simulator (ISS) were used.
Consequently, the Architectural Simulator is seldom used for program debugging
(the ISS is described in Chapter 7, Software Debugging). This is one of the
reasons that the Architectural Simulator does not utilize the Universal Debugger
Interface (UDI; see section 7.5). Without a UDI interface, the simulator cannot support
interactive debugging. Simulation results are directed to a log file. Interpreting their
meaning and dealing with the log file format take a little practice; more on this later.
When used with a HIF conforming operating system, the standard input and out-
put for the simulated program use the standard input and output for the executable
simulator. Additionally, the 29K program standard output is also written to the simu-
lation log file. AMD does not supply the simulator in source form; it is available in
binary for UNIX type hosts and 386 based PCs. The simulator driver, sim29, sup-
ports several command line options, as shown below. AMD updated the simulator
after version 1.1–8; the new version is compatible with the old and simulates at more
than four times the speed. The old simulator is still used with the Am29000 and
Am29050 processors. Only the new simulator models the Am2924x microcontrol-
lers and newer 2–bus processors. The following description of command line options
covers both simulator versions.
sim29 [–29000 | –29005 | –29030 | –29035 | –29050 ... –29240]
[–cfg=xx] [–d] [–e eventfile] [–f freq] [–h heapsize] [–L] [–n]
[–o outputfile] [–p from–to] [–r osboot] [–t max_sys_calls]
[–u] [–v] [–x[codes]] [–dcacheoff] [–icacheoff] [–dynmem <val>]
execfile [... optional args for executable]
OPTIONS
–29000|29005|29030|29035|29040|29050|29200|29205|29240|...
Select 29K processor, default is Am29000. Depending on the proces-
sor selected, the old or new simulator is selected.
–cfg=xx Normally the simulator starts execution at address 0, with the proces-
sor Configuration Register (CFG) set to the hardware default value.
It is the responsibility of the application code or the osboot code to modify
the CFG register as necessary. Alternatively, the CFG register can be
initialized from the command line. The –cfg option specifies the set-
ting for CFG, where xx is a 1 to 5 digit HEX number. If the –cfg option
is used, no run–time change to CFG will take effect, unless an
Am292xx processor is in use. The –cfg option is seldom used; it
should be used where an osboot file is not supplied with the –r option.
Alternatively it can be used to override the cache enable/disable operation
of osboot code. This enables the effects of cache to be determined
without the need to build a new osboot file. The –cfg option
is not supported by the newer simulator. Caches can be disabled using
the new –icacheoff and –dcacheoff options.
–d This option instructs the simulator to report the contents of processor
registers in the logfile at end of simulation.
–dcacheoff This option is only available with the newer simulator. When used it
causes the Configuration Register (CFG) to be set for data cache dis-
able.
–dynmem <val>
During execution a program may access a memory region outside
any loaded memory segment or heap and stack region. The simulator
can be instructed to automatically allocate memory for the accessed
region (val=1). Alternatively (the default, val=0), an access violation is
reported.
–e eventfile An event file is almost always used. It enables memory system char-
acteristics to be defined and the simulation to be controlled (see sec-
tion 1.14.1).
–f frequency Specify CPU frequency in MHz; the default for the Am292xx and
Am29035 is 16 MHz; the Am2900x default is 25 MHz; and the de-
fault frequency for the Am29030 and Am29050 is 40 MHz.
–h heapsize This option specifies the amount of resource memory available to the
simulated 29K system. This memory is used for the register stack and
memory stack support as well as the run–time heap. The default size
is 32 K bytes (a heapsize value of 32).
–icacheoff This option is only available with the newer simulator. When used it
causes the Configuration Register (CFG) to be set for instruction
cache disable.
–L This option is similar in nature to the –cfg option. It can be used to se-
lect the large memory model for the Am292xx memory banks. Nor-
mally this selection is performed in the osboot file. However, the –L
option can be used to override the osboot settings, without having to
build a new osboot file. This option is currently not supported in the
newer simulator.
–n Normally the simulator will allow access to the two words following
the end of a data section, without generating an access violation.
Some of the support library routines, such as strcpy(), used by 29K
application code, use a read–ahead technique to improve perfor-
mance. If the read–ahead option is not supported, then the –n option
should be used. Only the older simulator supports this option. The
newer simulator always allows access to the words just past the end of
the data section.
–o outputfile The simulator normally presents simulation results in file sim.out.
However, an alternative result file can be selected with this option.
–p from–to The simulator normally produces results of a general nature, such as
average number of instructions per second. Using this option, it is possible
to examine the operation of specific code sequences within the address
range beginning at from and ending at to.
–r osboot The simulator can load two 29K executable programs via command–
line direction: osboot and program. It is normal to load an operating
system to deal with application support services; this is accomplished
with osboot. It is sometimes referred to as the romfile, because when
used with 29K family members which support separate ROM and
Instruction spaces, osboot is loaded into ROM space. AMD supplies
a HIF conforming operating system called OS–boot which is general-
ly used with the –r option. Your simulation tool installation should
have a 29K executable file called osboot, romboot or even pumaboot
which contains the OS–boot code. Care should be taken to identify
and use the correct file. The newer simulator will automatically select
a default osboot file from the library directory if the –r option is not
used.
–t max_sys_calls
Specify maximum number of system call types that will be used during
simulation. This switch controls the internal management of the
simulator; it is seldom used and has a default value of 256. This option
is not supported by the newer simulator.
–u The Am292xx microcontroller family members have built–in ROM
and DRAM memory controllers. Programmable registers are used to
configure the ROM and DRAM region controllers. If the –u option is
used, application code in file program can modify the controller set-
tings, otherwise only code in osboot can effect changes. This protects
application code from accidentally changing the memory region con-
figuration.
–v The OS–boot operating system, normally used to implement the os-
boot file, can modify its warm–start operation depending on the value
in register gr104 (see section 7.4). The –v switch causes gr104 to be
initialized to 0. When OS–boot is configured to operate with or with-
out MMU support, a run–time gr104 value of 0 will turn off MMU
use.
–x[code] If a 29K error condition occurs during simulation, execution is not
stopped. The –x option can be used to cause execution to stop under a
selected range of error conditions. Note, the option is not supported
by the newer simulator. Each error condition is given a code letter. If
–x is used with no selected codes, then all the available codes are
assumed active. Supported codes are:
A Address error; data or instruction address out of bounds.
K Kernel error; illegal operation in Supervisor mode.
O Illegal opcode encountered.
F Floating–point exception occurred; such as divide by zero.
P A protection violation occurred in User mode.
S An event file error detected.
execfile Name of the executable program to be loaded into memory; followed
by any command–line arguments for the 29K executable. It is impor-
tant that the program be correctly linked for the intended memory sys-
tem. This is particularly true for systems based on Am292xx proces-
sors. They have ROM and DRAM regions which can have very dif-
ferent memory access performance. If SRAM devices are to be used
in the ROM region, it is important that the application be linked for
ROM region use rather than for the DRAM region.
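As an illustration of how the options above combine, the following Python sketch assembles a sim29 command line from a few of them. It is a convenience wrapper of my own, not something AMD supplies: the function name and all file names in the example are hypothetical, only options listed above are used, and it is assumed that the dashes printed in this text are typed as plain ASCII hyphens on the command line.

```python
def sim29_command(execfile, processor=None, eventfile=None, osboot=None,
                  outputfile=None, freq_mhz=None, program_args=()):
    """Build an argument list for the sim29 driver (illustrative helper).

    Only options described in the text are emitted: the processor
    selector, -e, -f, -o and -r. Everything after execfile is passed to
    the simulated 29K program.
    """
    argv = ["sim29"]
    if processor is not None:
        argv.append("-" + processor)       # e.g. "-29200"; default is Am29000
    if eventfile is not None:
        argv += ["-e", eventfile]          # memory system description
    if freq_mhz is not None:
        argv += ["-f", str(freq_mhz)]      # CPU frequency in MHz
    if outputfile is not None:
        argv += ["-o", outputfile]         # replaces the default sim.out
    if osboot is not None:
        argv += ["-r", osboot]             # cold-start with an osboot file
    argv.append(execfile)
    argv.extend(program_args)
    return argv


# Hypothetical file names, chosen only for the example.
cmd = sim29_command("dhry.out", processor="29200", eventfile="sa29200.evt",
                    osboot="osboot", outputfile="dhry.log")
print(" ".join(cmd))
# sim29 -29200 -e sa29200.evt -o dhry.log -r osboot dhry.out
```

Returning a list rather than a single string means the result can be handed directly to a process–launching routine without further quoting.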
It is best to run sim29 with the –r osboot option (this is the default operation with
the newer simulator). This is sometimes called cold–start operation. The osboot pro-
gram must perform processor initialization, bringing the processor into what is
known as the warm–start condition. At this point, execution of the loaded program
commences. It is possible to run the older simulator without the use of an osboot file;
this is known as warm–start simulation. When this is done the simulator initializes
the processor special registers CFG and CPS to a predefined warm–start condition.
AMD documentation explains the chosen settings; they are different for each proces-
sor. Basically, the processor is prepared to run in User mode with traps and interrupts
enabled and cache in use.
To support osboot operation, the simulator prepares processor registers before
osboot operation starts (see Figure 1-27).
gr105 address of end of physical memory
gr104 Operating system control info.
gr103 start of command line args (argv)
gr102 register stack size
gr101 memory stack size
gr100 first instruction of User loaded code
gr99 end address of program data
gr98 start address of program data
gr97 end address of program text
gr96 start address of program text
lr3 argument pointer, argv
lr2 argument count, argc
Figure 1-27. Register Initialization Performed by sim29
The initial register information is extracted from the program file. Via the regis-
ter data, the osboot code obtains the start address of the program code. If osboot code
is not used (no –r command–line switch when using the older simulator), the 29K
Program Counter is initialized to the start address of program code, rather than ad-
dress 0. To support direct entry into warm–start code, the program argument in-
formation is duplicated in lr2 and lr3. Normally this information is obtained by os-
boot using the data structure pointed to by gr103.
The simulator intercepts a number of HIF service calls (see section 2.2). These
services mainly relate to operating system functions which are not simulated, but
dealt with directly by the simulator. All HIF services with identification numbers be-
low 256 are intercepted. Additionally service 305, for querying the CPU frequency,
is intercepted. Operating system services that are not intercepted must be dealt with by the
osboot code. The simulator will intercept a number of traps if the –x[codes] com-
mand line option is used; otherwise all traps are directed to osboot support code, or
any other trapware installed during 29K run–time.
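The interception rule for HIF services is simple enough to state as a predicate. The Python sketch below merely restates the rule given above; the function name is hypothetical and the code is not taken from the simulator.

```python
def intercepted_by_sim29(service_number):
    """Return True if sim29 itself handles the given HIF service.

    Per the text: all services numbered below 256 are intercepted, plus
    service 305 (query CPU frequency). Anything else must be handled by
    the osboot code.
    """
    return service_number < 256 or service_number == 305


for number in (1, 255, 256, 305, 306):
    handler = "sim29" if intercepted_by_sim29(number) else "osboot"
    print(number, handler)
```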
1.14.1 The Simulation Event File
Simulation is driven by modeling the 29K processor pipeline operation. Instruc-
tions are fetched from memory, and make their way through the decode, execute and
write–back stages of the four–stage pipeline. Accurate modeling of processor inter-
nals enables the simulator to faithfully represent the operation of real hardware.
The simulator can also be driven from an event file. This file contains com-
mands which are to be performed at specified time values. All times are given in pro-
cessor cycles, with simulation starting at cycle 0. The simulator examines the event
file and performs the requested command at the indicated cycle time.
The syntax of the command file is very simple; each command is entered on a
single line preceded by an integer cycle–time value. There are about 15 to 20 different
commands; most of them enable extra information to be placed in the simulation
results file, such as records of register value changes, displays of cache
memory contents, monitoring of floating–point unit operation, and much more. A second
group of commands is mainly used with microcontroller 29K family members.
They enable the on–chip peripheral devices to be incorporated in the simulation.
For example, the Am29200 parallel port can receive and transmit data from files
representing off–chip hardware.
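The line format just described (an integer cycle time followed by a command) is easy to process with a script, which can be handy when event files are generated or checked automatically. The following Python sketch is a hypothetical helper of my own, not a description of how sim29 parses its input; it assumes fields are separated by white space and that every non–blank line starts with a cycle time.

```python
def parse_event_line(line):
    """Split one event-file line into (cycle, [command words]).

    Returns None for a blank line. A line whose first field is not an
    integer raises ValueError (ASSUMPTION: every command carries a
    cycle time, as in the older event-file syntax).
    """
    fields = line.split()
    if not fields:
        return None
    return int(fields[0]), fields[1:]


def parse_event_file(text):
    """Parse all lines of an event file, keeping them in file order."""
    events = []
    for line in text.splitlines():
        event = parse_event_line(line)
        if event is not None:
            events.append(event)
    return events


sample = """0 SET MEM IACCESS TO 2
0 SET MEM IBURST TO true
"""
for cycle, words in parse_event_file(sample):
    print(cycle, words)
```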
In practice, most of these commands are little used, with one exception: the SET
command (see note below). Most users of sim29 simply wish to determine how a
code sequence, representative of their application code, will perform on different
29K family members with varying memory system configurations. The SET command
is used to configure simulation parameters and define the characteristics of
system memory and bus arrangements. I will only describe the parameters used with
the MEM option to the SET command. The cycle–time value used with the commands
of interest is zero, as the memory system characteristics are established before
simulation commences. One other option to the SET command of interest is
SHARED_ID_BUS; when used, it indicates the Instruction and Data buses are connected
together. This option only makes sense with 3–bus members of the 29K family.
All the 2–bus members already share a single bus for data and instructions, the
second bus being used for address values. The syntax for the commands of interest is
shown below:
0 SET_SHARED_ID_BUS
0 SET MEM access TO value
Note, the SET command is accepted by both the older and newer versions of the
simulator. However, the newer version has an abbreviation to the SET command
shown below; the “SET MEM” syntax is replaced by a direct command and there is
no need for the “TO”.
0 SET MEM IWIDTH TO 32 older syntax
0 ROMWIDTH 32 newer syntax
romwidth 32 newer syntax
Am29000 and Am29050
Note, when the Instruction and Data busses are tied together with 3–bus processors,
the ROM space is still decoded separately from the Instruction space. Tying
the busses together will reduce system performance, because instructions can no
longer be fetched from Instruction space, or ROM space, while the Data bus is being
used.
Considering only the most popular event file commands simplifies the presentation
of sim29 operation and encourages its use. Those wishing to know more about
event file command options should contact AMD, which readily distributes the sim29
executable software for popular platforms, together with relevant documentation.
Table 1-5 shows the allowed access and value parameters for 3–bus members of
the 29K family, that is, the Am29000 and Am29050 processors. Off–chip memory
can exist in three separately addressed spaces: Instruction, ROM , and Data. Memory
address–decode and access times (in cycles) must be entered for each address space
which will be accessed by the processor; default values are provided.
Table 1-5. 3–bus Processor Memory Modeling Parameters for sim29
Instruction  ROM       Data       Value  Default  Operation
IDECODE      RDECODE   DDECODE    0–n    0        Decode address
IACCESS      RACCESS   DRACCESS   1–n    1        First read
                       DWACCESS   1–n    1        First write
IBURST       RBURST    DBURST     T|F    false    Burst–mode supported
IBACCESS     RBACCESS  DBRACCESS  1–n    1        Burst read
                       DBWACCESS  1–n    1        Burst write
If a memory system supports burst mode, the appropriate *BURST access pa-
rameter must be set to value TRUE. The example below sets Instruction memory ac-
cesses to two cycles; subsequent burst mode accesses are single–cycle. The example
commands only affect Instruction memory; additional commands are required to es-
tablish Data memory access characteristics. Many users of the simulator only require
memory modeling parameters from Table 1-5, even if DRAM is in use.
0 SET MEM IACCESS TO 2
0 SET MEM IBURST TO true
0 SET MEM IBACCESS TO 1
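Event–file lines of this form can be produced mechanically, and the effect of the settings can be estimated with simple arithmetic. The Python sketch below is illustrative only: both function names are hypothetical, and the cost model (first access at the *ACCESS value, each following burst access at the *BACCESS value, address decode time ignored) is my simplifying assumption rather than a statement of the simulator's exact accounting.

```python
def set_mem_commands(settings, cycle=0):
    """Format parameter settings as older-syntax 'SET MEM' event lines."""
    lines = []
    for name, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{cycle} SET MEM {name} TO {value}")
    return lines


def sequential_read_cycles(words, first_access, burst_access=None):
    """Cycles to read `words` consecutive words (simplified model).

    Without burst mode every word costs first_access cycles. With burst
    mode the first word costs first_access and the rest burst_access.
    ASSUMPTION: decode time is zero and no other bus activity intervenes.
    """
    if words <= 0:
        return 0
    later = first_access if burst_access is None else burst_access
    return first_access + (words - 1) * later


for line in set_mem_commands({"IACCESS": 2, "IBURST": True, "IBACCESS": 1}):
    print(line)
print(sequential_read_cycles(4, 2, 1))  # 5 cycles with burst mode
print(sequential_read_cycles(4, 2))     # 8 cycles without burst mode
```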
If DRAM memory devices are used, there are several additional access parame-
ters which can be used to support memory system modeling (see Table 1-6). DRAM
devices are indicated by the *PAGEMODE parameter being set. The 29K family
internally operates with a page size of 256 words; external DRAM memory always
operates with integer multiples of this value. For this reason, there is never any need to
change the *PGSIZE parameter setting from its default value. The first read access to
DRAM memory takes *PFACCESS cycles; second and subsequent read accesses
take *PSACCESS cycles. However, if the memory system supports burst mode, sub-
sequent read accesses take *PBACCESS cycles rather than *PSACCESS.
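The page–mode read costs just described can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the simulator's algorithm: the function name is hypothetical, addresses are word addresses, and it is assumed that the "first read" cost applies whenever an access enters a page different from the one last accessed.

```python
def dram_read_cycles(word_addresses, pfaccess, psaccess,
                     pbaccess=None, page_words=256):
    """Total read cycles for a sequence of DRAM word addresses.

    pfaccess -- cycles for the first read in a page (*PFACCESS)
    psaccess -- cycles for later reads in the same page (*PSACCESS)
    pbaccess -- cycles for later reads when burst mode is supported
                (*PBACCESS); None means burst mode is not available
    ASSUMPTION: entering a new page always costs pfaccess.
    """
    total = 0
    open_page = None
    for address in word_addresses:
        page = address // page_words
        if page != open_page:
            total += pfaccess
        elif pbaccess is not None:
            total += pbaccess
        else:
            total += psaccess
        open_page = page
    return total


# Four reads that straddle a 256-word page boundary.
addresses = [254, 255, 256, 257]
print(dram_read_cycles(addresses, pfaccess=3, psaccess=2))              # 10
print(dram_read_cycles(addresses, pfaccess=3, psaccess=2, pbaccess=1))  # 8
```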
If static column DRAM memories are used, then memory devices do not require
CAS signals between same–page accesses. Static column memory use is indicated by
the *STATCOL parameter. Initial page accesses suffer the additional *PRECHARGE
access penalty; subsequent accesses all have the same access latency. Note, burst
mode access can also apply to static column DRAM memory. Table 1-7 shows
memory modeling parameters for static column memories.
Table 1-6. 3–bus Processor DRAM Modeling Parameters for sim29 (continued)
Instruction  ROM        Data        Value  Default  Operation
IPAGEMODE    RPAGEMODE  DPAGEMODE   T|F    false    Memory is paged
IPGSIZE      RPGSIZE    DPGSIZE     1–n    256      Page size in words
IPFACCESS    RPFACCESS  DPFRACCESS  1–n    1        First read in page mode
                        DPFWACCESS  1–n    1        First write in page mode
IPSACCESS    RPSACCESS  DPSRACCESS  1–n    1        Secondary read within page
                        DPSWACCESS  1–n    1        Secondary write within page
IPBACCESS    RPBACCESS  DPBRACCESS  1–n    1        Burst read within page
                        DPBWACCESS  1–n    1        Burst write within page
Table 1-7. 3–bus Processor Static Column Modeling Parameters for sim29 (continued)
Instruction  ROM         Data        Value      Default  Operation
ISTATCOL     RSTATCOL    DSTATCOL    T|F        false    Static column memory used
ISMASK       RSMASK      DSMASK      0xffffff0  0        Column address mask, 64–words
IPRECHARGE   RPRECHARGE  DPRECHARGE  0–n        0        Precharge on page crossing
ISACCESS     RSACCESS    DSRACCESS   1–n        1        Read access within static column
                         DSWACCESS   1–n        1        Write access within static column
Separate regions of an address space may contain more than one type of
memory device and control mechanism. To support this, memory banking is pro-
vided for in the simulator (see Table 1-8). The [I|R|D]BANKSTART parameter is
used to specify the start address of a memory bank; a bank is a contiguous region of
memory of selectable size, within an indicated address space. Once the *BANKSTART
command has been used, all following commands relate to the current bank,
until a new bank is selected. This type of command is more frequently used with mi-
crocontroller members of the 29K family.
Table 1-8. 3–bus Processor Memory Modeling Parameters for sim29 (continued)
Instruction  ROM         Data        Value  Default  Operation
IBANKSTART   RBANKSTART  DBANKSTART  0–n    –        Start address of memory region
IBANKSIZE    RBANKSIZE   DBANKSIZE   1–n    1        Size in bytes of memory region
Am29030 and Am29035
The parameters used with the SET command, when simulating 2–bus 29K fami-
ly members are a little different from 3–bus parameters (see Table 1-9). The parame-
ters shown are for the older simulator, but they are accepted by the new simulator. For
a list of alternative parameters, which are only accepted by the newer simulator, see
the following Am29040 section. There is no longer a ROM space, and although
instructions and data can be mixed in the same memory devices, separate modeling
parameters are provided for instruction and data accesses.
Table 1-9. 2–bus Processor Memory Modeling Parameters for older sim29
Instruction  Data        Value    Default  Operation
IACCESS      DRACCESS    2–n      2        First read from SRAM
             DWACCESS    2–n      2        First write to SRAM
IBURST       DBURST      T|F      true     Burst–mode supported
IBACCESS     DBRACCESS   1–n      1        Burst read within page
             DBWACCESS   1–n      1        Burst write within page
IWIDTH       DWIDTH      8,16,32  32       Memory width
IPRECHARGE   DPRECHARGE  0–n      0        Precharge on page crossing
IPACCESS     DPRACCESS   2–n      2        First access in page mode
             DPWACCESS   2–n      2        First write in page mode
IBANKSTART   DBANKSTART  0–n      –        Start address of memory region
IBANKSIZE    DBANKSIZE   1–n      1        Size in bytes of memory region
HALFSPEED    HALFSPEED   T|F      false    Memory system is 1/2 CPU speed
Consider accessing memory for instructions. IACCESS gives the access time
unless DRAM is used, in which case the access time is given by IPACCESS. The use of
DRAM is indicated by the *PRECHARGE parameter value being non–zero. First accesses
to DRAM pages suffer an additional access delay of *PRECHARGE. If burst
mode is supported, then with all memory device types the access time for instruction
memory, other than the first access, is given by IBACCESS.
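The selection among IACCESS, IPACCESS, IPRECHARGE and IBACCESS can be summarized in a few lines of Python. This is a sketch of the rules as stated above, using the Table 1-9 defaults; the function name and its two flag arguments are hypothetical, and the simulator may account for further effects (memory width, half–speed operation) that are ignored here.

```python
def instruction_access_cycles(params, first_in_page=True,
                              burst_continuation=False):
    """Cycles for one instruction memory access on a 2-bus processor.

    params             -- dict of Table 1-9 parameter names to values;
                          missing entries take the table defaults
    first_in_page      -- True if this is the first access to a DRAM page
    burst_continuation -- True if this access continues an existing burst
    """
    if burst_continuation and params.get("IBURST", True):
        return params.get("IBACCESS", 1)
    precharge = params.get("IPRECHARGE", 0)
    if precharge == 0:                     # zero precharge means SRAM
        return params.get("IACCESS", 2)
    cycles = params.get("IPACCESS", 2)     # non-zero precharge means DRAM
    if first_in_page:
        cycles += precharge
    return cycles


dram = {"IPRECHARGE": 1, "IPACCESS": 3}
print(instruction_access_cycles({}))                             # 2
print(instruction_access_cycles(dram))                           # 4
print(instruction_access_cycles(dram, first_in_page=False))      # 3
print(instruction_access_cycles(dram, burst_continuation=True))  # 1
```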
Both the current 2–bus 29K family members support Scalable Clocking, enab-
ling a half speed external memory system. They also support narrow, 8–bit or 16–bit,
memory reads. The Am29035 processor also supports dynamic bus sizing. All exter-
nal memory accesses can be 16–bit or 32–bit; processor hardware takes care of multi-
ple memory accesses when operating on 32–bit data. As with the 3–bus 29K family
members, the simulator provides for memory banking. This enables different
memory devices to be modeled within specified address ranges.
Alternative Am29030, Am29035 and Am29040
As stated in the previous section, the newer sim29 can accept the memory mod-
eling parameters used by the older sim29. However, the newer simulator can operate
with alternative modeling commands; these are shown in Table 1-10. Commands
can be in upper or lower case, but they are shown here in lower case. A list of available
simulator commands can be obtained by issuing the command “sim29 –29040
–help”. An example of Am29040 processor simulation can be found in section 8.1.3.
Table 1-10. 2–bus Processor Memory Modeling Parameters for newer sim29
Command       Value          Operation
rombank       <adds> <size>  Size and address of ROM/SRAM
rambank       <adds> <size>  Size and address of DRAM
halfspeedbus  true|false     Scalable Clocking (default=false)
logging       true|false     Logging to file sip.log (default=false)

ROM/SRAM   DRAM           Value  Default  Operation
romread    ramread        2–n    2        First read
romwrite   ramwrite       2–n    2        First write
romburst   ramburst       T|F    true     Enable burst mode addressing
rombread   rambread       1–n    1        Burst read within page
rombwrite  rambwrite      1–n    1        Burst write within page
rompage    rampage        T|F    true     Enable page mode
rompread   rampread       2–n    2        Single read within page
rompwrite  rampwrite      2–n    2        Single write within page
romwidth   ramwidth       16,32  32       Bit width of memory
           ramprecharge   0–n    0        DRAM precharge time
           rampprecharge  0–n    0        Page mode DRAM precharge
           ramrefrate     0–n    0        DRAM refresh rate (0=off)
ROM and SRAM memory types are modeled with the same set of commands.
The simulator allocates a default ROM/SRAM memory bank starting at address 0x0.
Unless a RAMBANK command is used to allocate a DRAM memory section at a low
memory address, all code and data linked for low memory addresses will be allocated
to the default ROM/SRAM memory bank.
DRAM memory is modelled with the RAM* modelling commands. A default
DRAM memory section is established at address 0x4000,0000. Unless a
ROMBANK command is used to allocate a ROM/SRAM memory bank at this
address range, all accesses to high memory will be satisfied by the default DRAM
memory.
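The default memory layout of the newer simulator can be pictured with a trivial address classifier. This Python sketch is only a reading aid: the names are hypothetical, and because the sizes of the default banks are not given here, it assumes that every address at or above the DRAM base belongs to the default DRAM section and every lower address to the default ROM/SRAM bank. ROMBANK or RAMBANK commands would change this picture.

```python
DEFAULT_ROM_BASE = 0x00000000    # default ROM/SRAM bank starts here
DEFAULT_DRAM_BASE = 0x40000000   # default DRAM section starts here


def default_bank(address):
    """Name the default memory bank serving a 32-bit address.

    ASSUMPTION: with no rombank/rambank commands in the event file, the
    DRAM base address alone separates the two default regions.
    """
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    return "DRAM" if address >= DEFAULT_DRAM_BASE else "ROM/SRAM"


print(default_bank(0x00001000))  # ROM/SRAM
print(default_bank(0x40000000))  # DRAM
```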
The default linker command files used with the High C 29K tool chain typically
link programs for execution according to the above default memory regions. However,
older releases of the compiler tool chain (or other tool chains) may link for different
memory models. This would require the use of RAMBANK–type commands to
establish the correct memory model. Alternatively, a compiler command file could
be used to ensure a program is linked for the default simulator memory model (see
section 2.3.6).
Am29200 and Am29205
The simulator does not maintain different memory access parameters for
instruction and data access when modeling microcontroller members of the 29K
family. However, it does support separate memory modeling parameters for DRAM
and ROM address regions (see Table 1-11). Each of these two memory regions has its
own memory controller supporting up to four banks. A bank is a contiguous range of
memory within the address range accessed via the region controller. The DRAM re-
gion controller is a little more complicated than the ROM region controller. The pa-
rameters shown in Table 1-11 are for the older simulator, but they are accepted by the
new simulator. For a list of alternative parameters, which are only accepted by the
newer simulator, see the following Am29240 section.
DRAM access is fixed at four cycles (1 for precharge + 3 for latency); it cannot
be programmed. Subsequent accesses to the same page take four cycles unless
pagemode memories are supported. Note the first access is only three cycles rather
than four, as the RAS will already have met the precharge time. Basically, to precharge
the RAS bit lines, all RAS lines need to be taken high between each change of the
row address. A separate cycle is needed for precharge when back–to–back DRAM
accesses occur. Use of pagemode memories is indicated by the PAGEMODE parameter
being set; when used, the processor need not supply RAS memory strobe signals
before page CAS strobes for same–page accesses. This reduces subsequent page
access latency to three cycles. Additionally, when pagemode is used and a data burst
is attempted within a page, access latency is two cycles. The DRAM memory width
can be set to 16 or 32–bits. Of course when an Am29205 is used, all data memory
accesses are restricted by the 16–bit width of the processor data bus.
To explain further, access times to DRAM for non–pagemode memories follow
the sequence:
X,3,4,4,4,X,3,4,4,4,X,X,3,X,3,...
Where X is a non–DRAM access, say to ROM or PIA space. For DRAM sys-
tems supporting pagemode the sequence would be:
X,3,2,2,2,<boundary crossing>,4,2,2,<boundary crossing>,X,3,2,2,2
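Both latency sequences can be reproduced by a small model, which may help when predicting the cost of a particular access pattern. The Python sketch below is my reconstruction from the rules in this section, not simulator code; the function name and the way accesses are encoded are hypothetical. A non–DRAM access is written as None, and a DRAM access as a (page, is_burst) pair.

```python
def dram_latencies(accesses, pagemode):
    """Latency of each access for the microcontroller DRAM controller.

    accesses -- list whose items are None (non-DRAM access, shown as "X")
                or a (page, is_burst) tuple for a DRAM access
    pagemode -- True if pagemode DRAM is in use
    Rules taken from the text: 3 cycles when the previous access was not
    to DRAM (precharge already met), 4 cycles for back-to-back DRAM
    accesses, and with pagemode 3 cycles within a page or 2 for a burst.
    """
    result = []
    previous_page = None          # None: previous access was not DRAM
    for access in accesses:
        if access is None:
            result.append("X")
            previous_page = None
            continue
        page, is_burst = access
        if previous_page is None:
            cycles = 3
        elif pagemode and page == previous_page:
            cycles = 2 if is_burst else 3
        else:
            cycles = 4            # 1 precharge + 3 latency
        result.append(cycles)
        previous_page = page
    return result


plain = [None] + [(0, False)] * 4 + [None, (0, False)]
print(dram_latencies(plain, pagemode=False))
# ['X', 3, 4, 4, 4, 'X', 3]

paged = ([None] + [(0, True)] * 4 + [(1, True)] * 3
         + [None] + [(2, True)] * 4)
print(dram_latencies(paged, pagemode=True))
# ['X', 3, 2, 2, 2, 4, 2, 2, 'X', 3, 2, 2, 2]
```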
Memory devices located in ROM space can be modeled with a wider range of
parameter values. Both SRAM and ROM devices can be modeled in ROM space. Us-
ing the RBANKNUM parameter, the characteristics of each bank can be selectively
described. Burst–mode addressing is only supported for instruction or data reading.
When the burst option is used (RBURST set to TRUE), read accesses, other than the
first for a new burst, take RBACCESS cycles rather than the standard RRACCESS
cycles. Memory device widths can be 8, 16 or 32–bits. If an Am29205 microcontrol-
ler is being modeled, memory accesses wider than the 16–bit bus width always re-
quire the processor to perform multiple memory transfers to access the required
memory location.
Table 1-11. Microcontroller Memory Modeling Parameters for sim29

ROM/SRAM   Value     Default   DRAM       Value   Default     Operation
                                                  (Am29200)
                                                  1           Precharge on page crossing
RRACCESS   1–n       1                            3           First read
RWACCESS   2–n       2                            3           First write
RBURST     T|F       false                                    Burst address in ROM region
RBACCESS   1–n       1                            2           Burst read within page
                                                  2           Burst write within page
ROMWIDTH   8,16,32   32        DRAMWIDTH  16,32   32          Width of memory
                               PAGEMODE   T|F     false       Page mode supported
RBANKNUM   0–3       –         DBANKNUM   0–3     –           Select which memory bank
Preparing sim29 for modeling an Am29200 system is not difficult. The follow-
ing commands configure the first two ROM banks to access non–burst–mode memo-
ries which are 32–bits wide, and have a 1–cycle read access, and a 2–cycle write ac-
cess.
0 COM ROM bank 0 parameters
0 SET MEM rbanknum to 0
0 SET MEM rraccess to 1
0 SET MEM rwaccess to 2
0 COM ROM bank 1 parameters
0 SET MEM rbanknum to 1
0 SET MEM rraccess to 1
0 SET MEM rwaccess to 2
The following DRAM parameters, like the ROM parameters above, are correct
for modeling an SA29200 evaluation board. The first DRAM bank is configured to
support pagemode DRAM access, giving access latencies of 4:3:2 (4 for first, 3 for
same–page subsequent, unless they are bursts which suffer only 2–cycle latency).
0 COM DRAM bank 0 parameters
0 SET MEM dbanknum to 0
0 SET MEM dpagemode to true
Alternative Am2920x and Am2924x
As stated in the previous section, the newer sim29 can accept the memory mod-
eling parameters used by the older sim29. However, the newer simulator can operate
with alternative modeling commands; these are shown in Table 1-12. Commands
can be in upper or lower case, but they are shown here in lower case. A list of avail-
able simulator commands can be obtained by executing the command “sim29 –29240
–help”. An example of Am29200 microcontroller simulation can be found in section
8.1.
Table 1-12. Microcontroller Processor Memory Modeling Parameters for newer sim29

Command       Value                 Operation (default)
rombank       <adds> <size>         Size and address of ROM/SRAM
rambank       <adds> <size>         Size and address of DRAM
halfspeedbus  true|false            Scalable Clocking (default=false)
logging       true|false            Logging to file sip.log (default=false)
parallelin    <file> [<speed>]      Parallel port input file
parallelout   <file> [<speed>]      Parallel port output file
serialin      a|b <file> [<baud>]   Serial port, a or b, input file
serialout     a|b <file> [<baud>]   Serial port, a or b, output file

ROM/SRAM    DRAM        Value     Default     Operation
                                  (Am29240)
romread                 1–n       1           First read
romwrite                2–n       2           First write
romburst                T|F       false       Enable burst mode addressing
rombread                1–n       1           Burst read within page
            rampage     T|F       true        Enable page mode
rompwidth   ramwidth    8,16,32   32          Bit width of memory
            ramrefrate  0–n       255         DRAM refresh rate (0=off)
ROM and SRAM memory types are modeled with the same set of commands.
The simulator automatically allocates ROM/SRAM memory bank 0. Using the
ROMBANK parameter, the characteristics of each bank can be selectively de-
scribed. The default parameters are typical of a relatively fast memory system.
The DRAM memory access times are fixed by the processor specification.
However, there are some DRAM modeling commands enabling selection of
memory systems with and without pagemode devices. The simulator automatically
allocates DRAM memory bank 0 at address 0x4000,0000. All accesses to memory
above this address will be satisfied by the DRAM memory bank.
It is usually less of a problem linking programs for execution on a 29K micro-
controller, as the processor hardware dictates, to some extent, the allowed memory
regions. The default linker command files used with the High C 29K tool chain
typically link programs for execution according to the processor–specific memory
regions. Compiler command files are described in section 2.3.6.
1.14.2 Analyzing the Simulation Log File
Running the architectural simulator is simple but rather slow. The inclusion of
detail about the processor pipeline results in slow simulation speeds. For this reason,
users typically select a portion of their application code for simulation. This portion
is either representative of the overall code or subsections whose operation is critical
to overall system performance.
Older sim29 Log File Format
For demonstration purposes I have merely simulated the “hello world” program
running on an Am29000 processor. The C source file was compiled with the High C
29K compiler using the default compiler options; object file hello was produced by
the compile/link process. The memory model was the simulator default, single–cycle
operation. Given the selection of default memory parameters, there is no need for an
eventfile establishing memory parameters. However, I did use an eventfile with the
following contents:
0 log on channel
This option has not previously been described; it enables the simulator to pro-
duce an additional log file of channel activity. This can occasionally be useful when
studying memory system operation in detail. The simulator was started with the com-
mand:
sim29 –29000 –r /gnu/29k/src/osboot/sim/osboot –e eventfile hello
Two simulation result files were produced; the most important of which, the de-
fault simulation output file, sim.out, we shall briefly examine. The channel.out file
reports all instruction and data memory access activity. The contents of the sim.out
file are shown below exactly as produced by the simulator:
AMD ARCHITECTURAL SIMULATOR, V# 1.0–17PC
### T=3267 Am29000 Simulation of ”hello” complete –– successful
–––––––––––––––––––––––––––––––––––––––––––––––––––––
<<<<< S U M M A R Y S T A T I S T I C S >>>>>
CPU Frequency = 25.00MHz
Nops:50
total instructions = 2992
User Mode: 291 cycles (0.00001164 seconds)
Supervisor Mode: 2977 cycles (0.00011908 seconds)
Total: 3268 cycles (0.00013072 seconds)
Simulation speed: 22.89 MIPS (1.09 cycles per instruction)
–––––––––– Pipeline ––––––––––
8.45% idle pipeline:
6.46% Instruction Fetch Wait
0.46% Data Transaction Wait
0.18% Page Boundary Crossing Fetch Wait
0.00% Unfilled BTCache Fetch Wait
0.49% Load/Store Multiple Executing
0.03% Load/Load Transaction Wait
0.83% Pipeline Latency
Total Wait: 276 cycles (0.00001104 seconds)
–––––––––– Branch Target Cache ––––––––––
Partial hits: 0
Branch btcache access: 2418
Branch btcache hits:2143
Branch btcache hit ratio: 88.63%
–––––––––– Translation Lookaside Buffer ––––––––––
TLB access: 0
TLB hits: 0
TLB hit ratio: 0.00%
–––––––––– Bus Utilization ––––––––––
Inst Bus Utilization: 70.01%
2288 Instruction Fetches
Data Bus Utilization: 10.86%
20 Loads
335 Stores
–––––––––– Register File Spilling/Filling ––––––––––
0 Spills, 0 Fills
Opcode Histogram
ILLEGAL: CONSTN:6 CONSTH:68 CONST:121
MTSRIM:5 CONSTHZ: LOADL: LOADL:
CLZ: CLZ: EXBYTE: EXBYTE:
. . .
System Call Count Histogram
EXIT 1:1 GETARGS 260:1 SETVEC 289:2
. . .
–––––– M E M O R Y S U M M A R Y ––––––
Memory Parameters for Non–banked Regions
I_SPEED: Idecode=0 Iaccess=1 Ibaccess=1
. . .
The simulator reports the total number of processor cycles simulated. Because
our example is brief, there are few User mode cycles. Most cycles are utilized by the
osboot operating system. The operating system runs in Supervisor mode and initial-
izes the processor to run the “hello world” program in User mode. The fast memory
system has enabled the processor pipeline to be kept busy; only an 8.45% idle
pipeline is reported. A breakdown of the activities contributing to pipeline stalling
is shown.
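The summary figures in sim.out can be cross–checked from the raw counts. The sketch below is my own illustration of that arithmetic, assuming that the reported MIPS figure is simply the clock rate scaled by instructions per cycle; the function names are hypothetical and are not part of sim29.

```c
/* MIPS as clock rate (MHz) scaled by instructions completed per cycle. */
double sim_mips(double clock_mhz, unsigned long instructions,
                unsigned long cycles)
{
    return clock_mhz * (double)instructions / (double)cycles;
}

/* Average cycles per instruction. */
double sim_cpi(unsigned long instructions, unsigned long cycles)
{
    return (double)cycles / (double)instructions;
}

/* A count expressed as a percentage of a total. */
double percent_of(unsigned long part, unsigned long whole)
{
    return 100.0 * (double)part / (double)whole;
}
```

With the values from the listing (25 MHz, 2992 instructions, 3268 cycles, 276 wait cycles), sim_mips() and sim_cpi() agree with the reported 22.89 MIPS and 1.09 cycles per instruction, and percent_of(276, 3268) agrees with the 8.45% idle pipeline. The same helper applied to the Branch Target Cache counts, percent_of(2143, 2418), agrees with the 88.63% hit ratio.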
Next reported is the Branch Target Cache (BTC) activity. If a processor incorpo-
rating an Instruction Cache Memory rather than a BTC had been simulated, the corre-
sponding results would replace the BTC results shown. There were 2418 BTC ac-
cesses, of which 2143 found valid entries. This gives a hit ratio of 88.63%. Partial hits
refer to the number of BTC entries which were not fully used. This occurs when one
of the early entries in the four–entry cache block contains a jump.
If the operating system had arranged for Translation Look–Aside Buffer (TLB)
use, the next section would report its activity. In the example, the application ran
with physical addressing, which does not require TLB support. Next reported is bus
activity. The large number of processor registers results in little off–chip data
memory access, and hence low Data Bus utilization. The Instruction Bus is used to
fill the Instruction Prefetch Buffer and BTC, and shows much higher utilization.
Typically, programs are more sensitive to instruction memory performance than to
data memory performance.
The simulator then produces a histogram of instruction and system call usage.
The listing above only shows an extract of this information, as it is rather large. Ex-
amining this data can reveal useful information, such as extensive floating–point
instruction use.
Finally reported is a summary of the memory modeling parameters used during
simulation. This information should match with the default parameters or any param-
eters established by the eventfile. It is useful to have this information recorded along
with the simulation results.
Newer sim29 Log File Format
As with the previous demonstration, the “hello world” program is used here to
show the output format of the newer architectural simulator. The selected processor
is this time an Am29240 microcontroller. The C source file was compiled with the
High C 29K compiler using the –O4 compiler option; object file hello was produced
by the compile/link process. The memory model was the simulator default. Given the
selection of default memory parameters, there is no need for an eventfile to establish
memory parameters. The simulator was started with the command shown below.
Note, there is no need to use the –r option and specify an osboot file.
sim29 –29240 hello
The simulation result file, sim.out, was produced. The contents of the sim.out
file are shown below exactly as produced by the simulator:
Am292xx Architectural Simulator, Version# 2.4
Command line: /usr/29k/bin/sim240 –29240 hello
Boot file: /usr/29k/lib/osb24x
Text section: 00000000 – 0000001f
Text section: 00000020 – 00000333
Text section: 00000340 – 0000035f
Text section: 00000360 – 00006b6b
BSS section: 40000400 – 400007df
Application file: hello
Text section: 40010000 – 4001332b
Text section: 4001332c – 4001333b
Text section: 4001333c – 4001334b
Data section: 40014000 – 40014993
Lit section: 40014994 – 40014c63
BSS section: 40014c64 – 40014ca3
Argv memory: 400150a0 – 4001589f
Heap memory: 40015ca0 – 40035c9f
Memory stack: 40fbf7f0 – 40fdffef
Register stack: 40fdfff0 – 410007ef
Vector Area: 40000000 – 400003ff
ROM: Address Size Rd Wr Bmd BRd Wid
0x0 * 1 1 0 1 32
RAM: Address Size Rd Wr Pmd PRd PWr Wid Ref
0x40000000 * 2 2 1 1 1 32 255
Half speed memory = 0
Starting simulation...
hello world
HIF Exit: Value = 12
Simulation summary:
Cycles: 7101
Supervisor mode = 100.0%
User mode = 0.0%
MIPS = 18.8 (25.0 Mhz * ((5342 instructions)/(7101 cycles)))
Pipeline:
Average run length= 5.9 instructions between jumps taken
Fetches not used due to jumps = 299
PipeHold: 1759 cycles = 24.8%
Fetch waits: 1520 cycles = 21.4%
Load waits: 133 cycles = 1.9%
Store waits: 79 cycles = 1.1%
Load Multiple waits: 3 cycles = 0.0%
Store Multiple waits: 24 cycles = 0.3%
Channel: Rom: accesses = 809
Rom: average cycles per access = 1.0
Ram: accesses = 1959
Ram: average cycles per access = 1.7
Ram: average cycles waiting for precharge = 0.2
Ram: average cycles waiting for refresh = 0.2
Instruction Cache Size = 4 Kbytes
Hit ratio = 66.4% (3766/5673)
Data Cache Size = 2 Kbytes
Hit ratio = 63.6% (136/214)
The format of the log file will appear familiar to those experienced with the older
architectural simulator; the total number of processor cycles simulated is reported.
There are no User mode cycles, as the default osboot (osb24x) executed the hello
program in Supervisor mode. Most cycles are utilized by the osboot operating sys-
tem. The relatively fast memory system has enabled the processor pipeline to be kept
busy; a 24.8% idle pipeline is reported. A breakdown of the activities contributing to
pipeline stalling is shown. Most pipeline stalls are due to instruction fetching; the
DRAM memory has a 2–cycle first access time, rather than the ideal 1–cycle. The
newer simulator also reports the average number of instructions executed between
taken jump or branch instructions. The run length is shown to be 5.9 instructions,
which is typical of a 29K program.
Next reported is Channel activity. All load and store instructions make use of the
Channel. Statistics are presented separately for the ROM/SRAM and DRAM
memory systems. Typically, performance is much more sensitive to instruction
memory access than to data memory access. This is particularly true with the 29K
family due to its large number of on–chip registers.
Next reported is on–chip cache activity. There were 5673 accesses to the
instruction cache, of which 66.4% found valid entries. The Am29240 has the benefit
of a data cache and the results are shown. The hello program is small and only 214
data cache accesses were made, of which 63.6% hit in the cache.
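The percentages in the newer log can be checked against the raw counts in the same listing. The following C sketch is an illustration only; the structure pipe_waits and both functions are hypothetical names introduced here, and the rounding to one decimal place is an assumption about how the simulator formats its output.

```c
/* Wait-cycle categories listed under "PipeHold" in the newer log. */
typedef struct {
    unsigned long fetch;
    unsigned long load;
    unsigned long store;
    unsigned long load_multiple;
    unsigned long store_multiple;
} pipe_waits;

/* Total pipeline hold cycles as the sum of the individual categories. */
unsigned long pipe_hold_cycles(const pipe_waits *w)
{
    return w->fetch + w->load + w->store
         + w->load_multiple + w->store_multiple;
}

/*
 * Ratio part/whole in tenths of a percent, rounded to nearest using
 * integer arithmetic only (664 means 66.4%).
 */
unsigned long tenths_of_percent(unsigned long part, unsigned long whole)
{
    return (part * 1000UL + whole / 2UL) / whole;
}
```

Using the counts from the listing, the five wait categories (1520, 133, 79, 3 and 24 cycles) sum to the 1759 PipeHold cycles reported, and tenths_of_percent() applied to 3766/5673 and 136/214 gives values matching the 66.4% and 63.6% cache hit ratios.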
Reported in the sim.out file before simulation started are the memory modeling
parameters used during simulation. This information should match with the default
parameters or any parameters established by the eventfile. It is useful to have this in-
formation recorded along with the simulation results. The values reported are shown
again below:
ROM: Address Size Rd Wr Bmd BRd Wid
0x0 * 1 2 0 1 32
RAM: Address Size Rd Wr Pmd PRd PWr Wid Ref
0x40000000 * 2 2 1 1 1 32 255
Half speed memory = 0
The ROM section refers to both ROM and SRAM memory. The tokens used are
a little cryptic. For example, “Rd” refers to memory read cycles. And “BRd” refers to
burst mode read times. The option to use Scalable Clocking was not selected; “Half
speed memory” is set to false.
Chapter 2
Applications Programming
Application programming refers to the process of developing task–specific soft-
ware. Typical 29K tasks are controlling a real–time process, processing communica-
tions data, processing real–time digital signals, and manipulating video images.
There are many more types of applications, such as word processing, for which the
29K is suited, but the 29K is better known in the embedded engineering community,
which typically deals with real–time processing.
This chapter deals with aspects of application programming which the Software
Engineer is required to know. Generally, computer professionals spend more time
developing application code, compared to other software development projects such
as operating systems. Additionally, applications are increasingly developed in a high
level language. Since C is the dominant language for this task, I shall present code
examples in terms of C. Assembly level programming is dealt with in a separate
chapter.
The first part of this chapter deals with the mechanism by which one C proce-
dure calls another, and how they agree to communicate data and make use of proces-
sor resources [Mann et al. 1991b]. This is termed the Calling Convention. It is pos-
sible that different tool developers could construct their own calling mechanism, but
this may lead to incompatibilities in mixing routines compiled by different vendor
tools. AMD avoided this problem by devising a calling convention which was
adopted by all tool developers. Detailed knowledge of, say, the tasks assigned to
individual registers by the calling convention is not presented, except for the
register and memory stacks, which play an important role in the 29K calling
mechanism. In practice, C language developers typically do not need to be concerned
about individual
register assignments, as it is taken care of by the compiler [Mann 1991c]. Chapter 3
expands on register assignment, and it is of concern here only in terms of understand-
ing the calling convention concepts and run–time efficiencies.
Operating system support services (HIF services) are then dealt with. The tran-
sition from operating system to the application main() routine is described. Operat-
ing system services along with other support routines are normally accessed through
code libraries. These libraries are described for the predominant tool–chains. Using
the available libraries and HIF services, it is an easy task to arrange for interrupts to be
processed by C language handler routines; the mechanism is described. Finally, util-
ity programs for operations such as PROM preparation are listed and their capabili-
ties presented.
2.1 C LANGUAGE PROGRAMMING
Making a subroutine call on a processor with general-purpose registers is ex-
pensive in terms of time and resources. Because functions must compete for register
use, registers must be saved and restored through register-to-memory and memory-
to-register operations. For example, a C function call on the MC68020 processor
[Motorola 1985] might use the statements:
char bits8;
short bits16;
printf("char=%c short=%d", bits8, bits16);
After they are compiled, the above statements would generate the assembly-
level code shown below:
L15: .ascii "char=%c short=%d\0"
MOVE.W –4[A6],D0 ;copy bits16 variable
EXT.L D0 ; to register
MOVE.L D0,–[A7] ;now push on stack
MOVE.B –1[A6],D0 ;copy bits8 variable
EXTB.L D0 ; to register
MOVE.L D0,–[A7] ;now push on stack
PEA L15 ;stack text string pointer
JSR _printf
LEA 12[A7],A7 ;repair stack pointer
The assembly listing above shows how parameters pass via the memory stack to
the function being called. The LINK instruction copies the stack pointer A7 to the
local frame pointer A6 upon entry to a routine. Within the printf() routine, the param-
eters passed and local variables in memory are referenced relative to register A6.
To reduce future access delays, the printf() routine will normally copy data to
general-purpose registers before using them. For instance, using a memory-to-
memory operation when moving data from the local frame of the function call stack
would reduce the number of instructions executed. However, these are CISC instruc-
tions that require several machine cycles before completion.
In the example, the C function call passes two variables, bits8 and bits16, to the
library function printf(). The following assembly code shows part of the printf()
function for the MC68020.
_printf:
LINK A6,#–32 ;local variable space
LEA 8[A6],A0 ;unstack string pointer
. . .
UNLK A6
RTS ;return
Several multi–cycle instructions (like LINK and UNLK) are required to pass
the parameters and establish the function context. Unlike the variable instruction for-
mat in the MC68020, the 29K processor family has a fixed 32–bit instruction format
(see section 1.11). The same C statements compiled for the Am29000 processor gen-
erate the following assembly code for passing the parameters and establishing the
function context:
L1: .ascii "char=%c short=%d\0"
const lr2,L1
consth lr2,L1
add lr3,lr6,0 ;move bits8 and bits16
add lr4,lr8,0 ;to bottom of the
;activation record
call lr0,printf ;return address in lr0
The number of instructions required is certainly fewer, and they are all simple
single–cycle RISC instructions. However, to better understand just how parameters
are passed during a function call, an explanation of the procedure activation records
and their use of the local register file is first required.
2.1.1 Register Stack
A register stack is assigned an area of memory used to pass parameters and allo-
cate working registers to each procedure. The register cache replaces the top of the
register stack, as shown in Figure 2-1. All 29K processors have a 128–word local
register file; these registers are used to implement the cache for the top of the register
stack. Note, if desired only a portion of the 128–word register file need be allocated to
register cache use (see section 2.3.2).
The global registers rab (gr126) and rfb (gr127) point to the top and the bottom
of the register cache. Global register rsp (also known as gr1) points to the top of the
register stack. The register cache, or stack window, moves up and down the register
stack as the stack grows and shrinks. Use of the register cache, rather than the
memory portion of the register stack, allows data to be accessed through local regis-
ters at high speed. On–chip triple–porting of the register file (two read ports and one
write port for most 29K family members), enables the register stack to perform better
than a data memory cache, which cannot support read and write operations in the
same cycle.
[Figure: the register stack in external memory, drawn from high address (top) to
low address (bottom); the stack grows down. The cache register window holds the
cache-resident part of the register stack and moves up and down the stack; the
part of the stack above the window is memory-resident. rfb points to the bottom
of the cache register window, rab points to the top of the cache register window,
and rsp points to the top of the stack. The window entries between rsp and rab
are empty.]
Figure 2-1. Cache Window
2.1.2 Activation Records
A 29K processor does not apply push or pop instructions to external memory
when passing procedure parameters. Instead each function is allocated an activation
record in the register cache at compile time. Activation records hold any local vari-
ables and parameters passed to functions.
The caller stores its outgoing arguments at the bottom of the activation re-
cord. The called function establishes a new activation record below the caller’s re-
cord. The top of the new record overlaps the bottom of the old record, so that the out-
going parameters of the calling function are visible within the called function’s
activation record.
Although the activation record can be any size within the limits of the physical
cache, the compiler will not allocate more than 16 registers to the parameter-passing
part of the activation record. Functions that cannot pass all of their outgoing parame-
ters in registers must use a memory stack for additional parameters; global register
msp (gr125) points to the top of the memory stack. This happens infrequently, but is
required for parameters that have their address taken (for example in C, &variable).
Data parameters at known addresses cannot be supported in register address space
because data addresses always refer to memory, not to registers.
The following code shows part of the printf() function for the 29K family:
printf:
sub gr1,gr1,16 ;function prologue
asgeu V_SPILL,gr1,rab ;compare with top of window
add lr1,gr1,36 ;rab is gr126
. . .
jmpi lr0 ;return
asleu V_FILL,lr1,rfb ;compare with bottom
;of window gr127
The register stack pointer, rsp, points to the bottom of the current activation re-
cord. All local registers are referenced relative to rsp. Four new registers are required
to support the function call shown, so rsp is decremented 16 bytes. Register rsp per-
forms a role similar to the MC68000’s A7 and A6 registers, except that it points to data
in high-speed registers, not data in external memory.
The compiler reserves local registers lr0 and lr1 for special duties within each
activation record. Register lr0 contains the address at which execution resumes when
the function returns to the caller. Register lr1 points to the top of the caller’s
activation record. The new frame allocates local registers lr2 and lr3 to hold
printf() function local variables.
As Figure 2-2 shows, the positions of five registers overlap. The three printf()
parameters enter from lr2, lr3 and lr4 of the caller’s activation record and appear as
lr6, lr7 and lr8 of the printf() function activation record.
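The register renumbering caused by the overlap follows directly from the amount by which the prologue lowers gr1. The C sketch below only illustrates that arithmetic; the function names are hypothetical, and the code is not part of any 29K tool.

```c
/* Each local register holds one 32-bit word; gr1 is a byte address. */
#define WORD_BYTES 4

/*
 * Number under which the callee sees one of the caller's local
 * registers, given the byte count subtracted from gr1 in the prologue.
 */
int callee_view_of_caller_reg(int caller_lr, int prologue_bytes)
{
    return caller_lr + prologue_bytes / WORD_BYTES;
}

/*
 * Value placed in lr1 by the prologue: the new gr1 plus the size of
 * the callee's activation record in bytes.
 */
unsigned long frame_pointer(unsigned long new_gr1, int record_words)
{
    return new_gr1 + (unsigned long)record_words * WORD_BYTES;
}
```

For the printf() prologue shown earlier, gr1 is lowered by 16 bytes, so the caller's lr2, lr3 and lr4 appear as lr6, lr7 and lr8, and the caller's lr0 and lr1 appear as lr4 and lr5. The "add lr1,gr1,36" instruction corresponds to frame_pointer() with a record of 9 words.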
2.1.3 Spilling And Filling
If not enough registers are available in the cache when it moves down the regis-
ter stack, then a V_SPILL trap is taken, and the registers spill out of the cache into
memory. Only procedure calls that require more registers than currently are available
in the cache suffer this overhead.
                          higher addresses

printf()                                caller's
register   contents                     register
lr8        incoming parameter           lr4     <- top of activation record
lr7        incoming parameter           lr3
lr6        incoming parameter           lr2
lr5        frame pointer                lr1
lr4        return address               lr0     <- base of caller's activation record
lr3        local                                   (gr1 before printf() is called)
lr2        local
lr1        frame pointer
lr0                                             <- base of printf() activation record
                                                   (gr1 (rsp) when printf() executes)

The printf() activation record is 9 words. Register gr1 is lowered 4 words
(16 bytes) in the prologue of printf().

Figure 2-2. Overlapping Activation Record Registers

Once a spill occurs, a fill (V_FILL trap) can be expected at a later time. The fill
does not happen when the function call causing the spill returns, but rather when
some earlier function that requires data held in a previous activation record (just be-
low the cache window) returns. Just before a function returns, the lr1 register, which
points to the top of the caller’s activation record, is compared with the pointer to the
bottom of the cache window (rfb). If the activation record is not stored completely in
the cache, then a fill overhead occurs.
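The two comparisons made by the prologue and epilogue of printf() can be expressed in C. This is a sketch under the assumption that the asgeu and asleu assert instructions trap when the stated unsigned comparison is false; the function names are hypothetical, and the spill size calculation is my reading of what a spill handler would need rather than a quotation of one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Prologue: "asgeu V_SPILL,gr1,rab" traps unless gr1 >= rab. */
bool spill_needed(uint32_t gr1, uint32_t rab)
{
    return gr1 < rab;
}

/* Epilogue: "asleu V_FILL,lr1,rfb" traps unless lr1 <= rfb. */
bool fill_needed(uint32_t lr1, uint32_t rfb)
{
    return lr1 > rfb;
}

/*
 * Assumed spill size: the number of words by which the new stack
 * pointer has moved below the top of the cache window.
 */
uint32_t spill_words(uint32_t gr1, uint32_t rab)
{
    return spill_needed(gr1, rab) ? (rab - gr1) / 4u : 0u;
}
```

When gr1 stays at or above rab and lr1 stays at or below rfb, neither trap is taken and the call and return complete without any memory access.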
The register stack improves the performance of call operations because most
calls and returns proceed without any memory access. The register cache contains
128 registers, so very few function calls or returns require register spilling or filling.
Because most of the data required by a function resides in local registers, there is
no need for elaborate memory addressing modes, which increase access latency. The
function-call overhead in the 29K family consists of a small number of single-cycle
instructions; the overhead in the MC68020 requires a greater number of multi-cycle
instructions.
2.1.4 Global Registers
In the discussion of activation records (section 2.1.2), it was stated that func-
tions can use activation space (local registers) to hold procedure variables. This is
true, but procedures can also use processor global registers to hold variables. Each
29K processor has a group of registers (global registers) which are located in the reg-
ister file, but are not part of the register cache. Global registers gr96–gr127 are used
by application programs. When developing software in C, there is no need to know
just how the compiler makes use of these global registers; the Assembly Level Pro-
gramming chapter, section 3.3, discusses register allocation in detail.
Data held in global registers, unlike procedure activation records, do not survive
procedure calls. The compiler has 25 global registers available for holding temporary
variables. These registers perform a role very similar to the eight–data and eight–ad-
dress general purpose registers of the MC68020. The first 16 of the global registers,
gr96–gr111, are used for procedure return value passing. Return objects larger than
16 words must use the memory stack to return data (see section 3.3).
An extension to some C compilers has been made (High C 29K compiler for
one), enabling a calling procedure to assume that some global registers will survive a
procedure call. If the called function is defined before calls are made to it, the compil-
er can determine its register usage. This enables the global register usage of the call-
ing function to be restricted to available registers, and the calling function need only
save in local registers those global registers it knows are used by the callee.
2.1.5 Memory Stack
Because the register cache is limited in size, a separate memory stack is used to
hold large local variables (structs or arrays), as well as any incoming parameters be-
yond the 16th parameter. (Note, small structs can still be passed in local registers as
procedure parameters). Register msp is the memory stack pointer. (Note, having two
stacks generally requires several operating system support mechanisms not required
by a single stack CISC based system.)
2.2 RUN–TIME HIF ENVIRONMENT
Application programs need to interact with peripheral devices which support
communication and other control functions. Traditionally embedded program devel-
opers have not been well served by the tools available to tackle the related software
development. For example, performing the popular C library service printf(), using
a peripheral UART device, may involve developing the printf() library code and
then the underlying operating system code which controls the communications UART.
One solution to the problem is to purchase a real–time operating system. They are
normally supplied with libraries which support printf() and other popular library ser-
vices. In addition, operating systems contain code to perform task context switching
and interrupt handling.
Typically, operating system vendors have their own operating system interface
specification. This means that library code, like printf(), which ultimately makes op-
erating system service requests, is not easily ported between different operating sys-
tems. In addition, compiler vendors which typically develop library code for the tar-
get processor for sale along with the compiler, can not be assured of a standard inter-
face to the available operating system services.
AMD wished to relieve this problem and allow library code to be used on any
target 29K platform. In addition AMD wished to ensure a number of services would
be available. These operating system services were considered necessary to enable
performance benchmarking of application code (for example the cycles service re-
turns a 56–bit elapsed processor cycle count). The result was the Host Interface spec-
ification, known as HIF. It specifies a number of operating system services which
must always be present. The list is very small, but it enables library producers to be
assured that their code will run on any 29K platform. The HIF specification states
how a system call will be made, how parameters will be passed to the operating sys-
tem, and how results will be returned. Operating system vendors need not support
HIF–conforming services if they do not wish to; they could just continue to use their
own operating system interface and related library routines. But to make use of the
popular
library routines from the Metaware High C 29K compiler company, the operating
system company must virtualize the HIF services on top of the underlying operating
system services.
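A 56–bit cycle count does not fit in a single 32–bit register, so a benchmarking program has to assemble it from two words. The sketch below assumes that the count is delivered as a high word holding the upper 24 bits and a low word holding the lower 32 bits; that split, and the function names, are assumptions made for illustration and are not taken from the HIF specification.

```c
#include <stdint.h>

/*
 * Combine the two halves of a 56-bit cycle count into one 64-bit value.
 * Only the low 24 bits of the high word are used (assumed layout).
 */
uint64_t cycles56(uint32_t high24, uint32_t low32)
{
    return ((uint64_t)(high24 & 0x00FFFFFFu) << 32) | (uint64_t)low32;
}

/* Convert an elapsed cycle count to seconds for a given clock rate. */
double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz;
}
```

Taking two such counts, one before and one after the code of interest, and passing their difference to cycles_to_seconds() with the processor clock rate gives the elapsed time; at 25 MHz, 3268 cycles correspond to about 0.00013072 seconds.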
The original specification grew into what is now known as HIF 2.0. The specifi-
cation includes services for signal handling (see following sections on C language
interrupt handlers), memory management support, run–time environment initializa-
tion and other processor configuration options. Much of this development was a re-
sult of AMD developing a small collection of routines known as OS–boot (see sec-
tion 7.4). This code can take control of the processor from RESET, prepare the run–
time environment for a HIF conforming application program, and support any HIF
request made by the application. OS–boot effectively implements a single applica-
tion–task operating system. It is adequate for many user requirements, which may be
merely to benchmark 29K applications. With small additions and changes it is ade-
quate for many embedded products. However, some of the HIF 2.0 services, re-
quested by the community who saw OS–boot as an adequate operating system, were
of such a nature that they often cannot be implemented in an operating system
vendor’s product. For example, the settrap service enables an entry to be placed directly
into the processor’s interrupt vector table; some operating systems, UNIX for
example, will not permit this, as it is a security risk and, if improperly used, an
effective way to crash the system.
There are standard memory, register and other initializations that must be per-
formed by a HIF-conforming operating system before entry into a user program. In C
language programs, this is usually performed by the module crt0.s. This module re-
ceives control when an application program is invoked, and executes prior to invoca-
tion of the user’s main() function. Other high-level languages have similar modules.
The following three sections describe: what a HIF conforming operating system
must perform before code in crt0.s starts executing; what is typically achieved in
crt0.s code; and finally, what run–time services are specified in HIF 2.0.
2.2.1 OS Preparations before Calling start in crt0
According to the HIF specification, operating system initialization procedures
must establish appropriate values for the general registers mentioned below before
execution of a user’s application code commences. Linked application code normal-
ly commences at address label start in module crt0.s. This module is automatically
linked with application code modules and libraries when the compiler is used to pro-
duce the final application executable. In addition, file descriptors for the standard in-
put and output devices must be opened, and any Am29027 floating–point coproces-
sor support as well as other trapware support must be initialized.
Register Stack Pointer (gr1)
Register gr1 points to the top of the register stack. It contains the main memory
address in which the local register lr0 will be saved, should it be spilled, and from
which it will be restored. The processor can also use the gr1 register as the base in
base–plus–offset addressing of the local register file. The content of rsp is compared
to the content of rab to determine when it is necessary to spill part of the local register
stack to memory. On startup, the values in rab, rsp, and rfb should be initialized to
prevent a spill trap from occurring on entry to the crt0 code, as shown by the follow-
ing relations:
((64*4) + rab) ≤ rsp < rfb
rfb = rab + 512
This provides the crt0 code with at least 64 registers on entry, which should be a
sufficient number to accomplish its purpose. Note that rab and rfb are normally set to
be a window distance apart, 128 words (512 bytes), but this is not the only valid
setting; see sections 2.3.2 and 4.3.1.
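The startup relations above can be checked mechanically. The following C sketch is illustrative only: the function name regstack_init_ok and the use of plain 32-bit integers to stand for the register values are assumptions made for the example, not part of the HIF specification.

```c
#include <stdint.h>
#include <stdbool.h>

#define WINDOW_BYTES   512u        /* 128 words: the normal rab-to-rfb distance */
#define CRT0_MIN_BYTES (64u * 4u)  /* 64 registers available to crt0 on entry   */

/* Hypothetical check: returns true when rab, rsp and rfb satisfy
 * ((64*4) + rab) <= rsp < rfb  and  rfb = rab + 512. */
bool regstack_init_ok(uint32_t rab, uint32_t rsp, uint32_t rfb)
{
    return (rab + CRT0_MIN_BYTES <= rsp)
        && (rsp < rfb)
        && (rfb == rab + WINDOW_BYTES);
}
```

With rab = 0x1000 and rfb = 0x1200, any rsp from 0x1100 up to but not including 0x1200 passes the check; a lower rsp would leave crt0 with fewer than 64 registers.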
Register Free Bound (gr127)
The register stack free–bound pointer, rfb, contains the register stack address of
the lowest-addressed word not contained within the register file. Register rfb is refer-
enced in the epilog of most user program functions to determine whether a register
fill operation is necessary to restore previously spilled registers needed by the func-
tion’s caller. The rfb register should be initialized to point to the highest address of the
memory region allocated for register stack use. It is recommended that this memory
region not be less than 6k bytes.
Register Allocate Bound (gr126)
The register stack allocate–bound pointer, rab, contains the register stack ad-
dress of the lowest-addressed word contained within the register file. Register rab is
referenced in the prolog of most user program functions to determine whether a regis-
ter spill operation is necessary to accommodate the local register requirements of the
called function. Register rab is normally initialized to be a window distance (512
bytes) below the rfb register value.
Memory Stack Pointer (gr125)
The memory stack pointer (msp) register points to the top of the memory stack,
which is the lowest-addressed entry on the memory stack. Register msp should be
initialized to point to the highest address in the memory region allocated for memory
stack use. It is recommended that this region not be less than 2k bytes.
Am29027 Floating–Point Coprocessor Support
The Am29027 floating–point coprocessor has a mode register which has a
cumbersome access procedure. To avoid accessing the mode register, a shadow copy
is kept by the operating system and accessed in preference when a mode register read
is required. The operating system shadow mode value is not accessible to User mode
code; therefore, an application must maintain its own shadow mode register value.
The floating–point library code which maintains and accesses the shadow mode
value is passed the mode setting, initialized by the operating system, when crt0 code
commences. Before entering crt0, the Am29027 mode register value is copied into
global registers gr96 and gr97. Register gr96 contains the most significant half of the
mode register value, and gr97 contains the least significant half.
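As a rough illustration of the register convention just described, the C sketch below rebuilds a single value from the two halves. The helper name am29027_mode_from_halves is hypothetical, and treating the mode value as one 64-bit quantity made of two 32-bit register words is an assumption of the example.

```c
#include <stdint.h>

/* Hypothetical helper: combine the two words passed to crt0.
 * gr96 holds the most significant half, gr97 the least significant half. */
uint64_t am29027_mode_from_halves(uint32_t gr96, uint32_t gr97)
{
    return ((uint64_t)gr96 << 32) | (uint64_t)gr97;
}
```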
Open File Descriptors
File descriptor 0 (corresponding to the standard input device) must be opened
for text mode input. File descriptors 1 and 2 (corresponding to standard output and
standard error devices) must be opened for text mode output prior to entry to the
user’s program. File descriptors 0, 1, and 2 are expected to be in COOKED mode (see
Appendix A, ioctl() service), and file descriptor 0 should also select ECHO mode, so
that input from the standard input device (stdin) is echoed to the standard output de-
vice (stdout).
Software Emulation and Trapware Support
A 29K processor may take a trap in support of the procedure call prologue and
epilogue mechanism. A HIF conforming operating system supports the associated
SPILL and FILL traps by normally maintaining two global registers (in the
gr64–gr95 range) which contain the addresses of the user’s spill and fill code. Keeping
these addresses available in registers reduces the delay in reaching the typically User
mode support code. A HIF conforming operating system also installs the SPILL and
FILL trap handler code which bounces execution to the maintained handler address-
es.
Table 2-1. Trap Handler Vectors
Trap Description
32 MULTIPLY
33 DIVIDE
34 MULTIPLU
35 DIVIDU
36 CONVERT
42 FEQ
43 DEQ
44 FGT
45 DGT
46 FGE
47 DGE
48 FADD
49 DADD
50 FSUB
51 DSUB
52 FMUL
53 DMUL
54 FDIV
55 DDIV
64 V_SPILL (Set up by the user’s task through a setvec call)
65 V_FILL (Set up by the user’s task through a setvec call)
69 HIF System Call
Note: The V_SPILL (64) and V_FILL (65) traps are returned to the user’s code to perform the trap
handling functions. Application code normally runs in User mode.
Additionally, the trapware code enabling HIF operating system calls is
installed. Also, all HIF conforming operating systems provide unaligned memory
access trap handlers.
A number of 29K processors do not directly support floating–point instructions
in hardware (see section 3.1.7). However, the HIF environment requires that all
Am29000 User mode accessible instructions be implemented across the entire 29K
family. This means that unless an Am29050 processor is being used, trapware must
be installed to emulate in software the floating–point instructions not directly
supported by the hardware. Table 2-1 lists the traps for which a HIF conforming
operating system must establish support before calling crt0 code.
When a 29K processor is supported by an Am29027 floating–point coprocessor,
the operating system may choose to use the coprocessor to support floating–point
instruction emulation. For example, the trapware routine used for emulating the
MULTIPLY instruction is known as Emultiply; however, if the coprocessor is
available the E7multiply routine is used.
2.2.2 crt0 Preparations before Calling main()
Application code normally begins execution at address start in the crt0.s mod-
ule. The previous section described the environment prepared by a HIF conforming
operating system before the code in crt0.s is executed. The crt0.s code makes final
preparations before the application main() procedure is called.
The code in crt0.s first copies the Am29027 shadow mode register value, passed
in gr96 and gr97, to memory location __29027Mode. If a system does not have an
Am29027 floating–point coprocessor then there is no useful data passed in these
registers. However, application code linked with floating–point libraries which make
use of the Am29027 coprocessor will access the shadow memory location to
determine the coprocessor operating mode value.
The setvec system call is then used to supply the operating system with the ad-
dresses of the user’s SPILL and FILL handler code which is located in crt0.s. Be-
cause this code normally runs in User mode address space, and the user has the option
to tailor the operation of this code, an operating system cannot know in advance
(pre–crt0.s) the required SPILL and FILL handler code operation.
When procedure main() is called, it is passed two parameters: the first parameter,
argc, indicates the number of elements in argv; the second parameter, argv, is a
pointer to an array of character strings:
main(argc, argv)
int argc;
char* argv[];
The getargs HIF service is used to get the address of the argv array. In many
real–time applications there are no parameters passed to main(). However, to support
porting of benchmark application programs, many systems arrange for main() pa-
rameters to be loaded into a user’s data space. The crt0.s code walks through the
array looking for the terminating NULL pointer; in so doing, it determines the argc
value. The register stack pointer was lowered by the start() procedure’s prologue code
to create a procedure activation record for passing parameters to main().
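The walk through the argument array can be pictured in C as follows. This is only a sketch of the idea; the real work is done in assembly language inside crt0.s, and the function name count_args is hypothetical.

```c
#include <stddef.h>

/* Count the entries of a NULL-terminated argument vector, in the way the
 * text describes crt0.s deriving argc from the array located by getargs. */
int count_args(char *const argv[])
{
    int argc = 0;

    if (argv == NULL)           /* no argument vector supplied at all */
        return 0;
    while (argv[argc] != NULL)  /* stop at the terminating NULL entry */
        argc++;
    return argc;
}
```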
To aid run–time libraries a memory variable, __LibInit, is defined in uninitial-
ized data memory space (BSS) by the library code. If any library code needs initial-
ization before use, then the __LibInit variable will be assigned to point to a library
routine which will perform all necessary initialization. This is accomplished by the
linker matching–up the BSS __LibInit variable with an initialized __LibInit variable
defined in the library code. The crt0.s code checks to see if the __LibInit variable
contains a nonzero address; if so, the procedure is called.
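The test made by crt0.s amounts to a guarded call through a function pointer. The C sketch below shows the pattern under stated assumptions: lib_init_fn, run_lib_init, sample_lib_init and lib_ready are hypothetical names standing in for the __LibInit variable and a library's initialization routine.

```c
#include <stddef.h>

/* Stand-in for the type of the __LibInit variable: a pointer to an
 * initialization routine, zero unless a library supplies one. */
typedef void (*lib_init_fn)(void);

int lib_ready;                      /* set by the sample routine below */

/* Hypothetical library initialization routine. */
void sample_lib_init(void)
{
    lib_ready = 1;
}

/* Call the hook only if it holds a nonzero address.
 * Returns 1 if the routine was called, 0 otherwise. */
int run_lib_init(lib_init_fn hook)
{
    if (hook != NULL) {
        hook();
        return 1;
    }
    return 0;
}
```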
The application main() procedure is ready to be called by start(). It is not ex-
pected that main() will return. Real–time programs typically never exit. However,
benchmark programs do, and this is accomplished by calling the HIF exit service. If a
main() routine returns without explicitly calling exit, control passes back to start(),
which then calls exit on its behalf.
2.2.3 Run–Time HIF Services
Table 2-2 lists the HIF system call services, calling parameters, and the returned
values. If a column entry is blank, it means the register is not used or is undefined.
Table 2-3 describes the parameters given in Table 2-2. Before invoking a HIF
service, the service number and any input parameters passed to the operating system
are loaded into their assigned registers. Each HIF service is identified by its associated
service number which is placed in global register gr121. Parameters are passed, as
with procedure calls, in local registers starting with lr2. Application programs do not
need to issue ASSERT instructions directly when making service calls. They normal-
ly use a library of assembly code glue routines. The write service glue routine is
shown below:
__write:                        ;HIF assembly glue routine for write service
        const   gr121,20        ;tav is gr121
        asneq   69,gr1,gr1      ;system call trap
        jmpti   gr121,lr0       ;return if successful
        const   gr122,_errno    ;pass error number
        consth  gr122,_errno
        store   0,0,gr121,gr122 ;store error number
        jmpi    lr0             ;return if failure
        constn  gr96,-1
Application programs need merely call the _write() leaf routine to issue the ser-
vice request. The system call convention states that return values are placed in global
registers starting with gr96; this makes the transfer of return data by the assembly
glue routine very simple and efficient. If a service fails, due to, say, bad input parame-
ters, global register gr121 is returned with an error number supplied by the operating
system. If the service was successful, gr121 is set to Boolean TRUE (0x80000000).
The glue routines check the gr121 value, and if it is not TRUE, copy the value to
memory location errno. This location, unlike gr121, is directly accessible by a C
language application which requested the service.
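The return path of a glue routine can be modelled in C. This is a sketch for explanation only, not a replacement for the assembly code: hif_glue_result and hif_errno are hypothetical names, and the register values are passed in as ordinary arguments. The test on the most significant bit mirrors the TRUE value (0x80000000) quoted above.

```c
#include <stdint.h>

#define HIF_TRUE 0x80000000u   /* Boolean TRUE: most significant bit set */

int hif_errno;                 /* stand-in for the C library errno location */

/* Model of the glue routine's return path. gr121 and gr96 are the values
 * left in those registers by the operating system after the trap. */
int32_t hif_glue_result(uint32_t gr121, int32_t gr96)
{
    if (gr121 & HIF_TRUE)      /* success: pass the gr96 result straight back */
        return gr96;
    hif_errno = (int)gr121;    /* failure: publish the error number */
    return -1;                 /* and return -1, as the glue code does */
}
```

A caller therefore sees the usual C library convention: a result of -1 signals failure, and the error number can then be read from the errno location.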
Run–time HIF services are divided into two groups, distinguished by their
service number. Numbers 255 and less require the support of complex operating
system services such as file system management. Service numbers 256 and higher relate
to simpler service tasks. Note that AMD reserves service numbers 0–127 and 256–383
for HIF use. Users are free to extend operating system services using the unreserved
service numbers. Operating systems which implement HIF, OS–boot for example,
do not always directly support services 255 and lower. These HIF services are often
virtualized, that is, translated into native operating system calls. For
example, when a HIF conforming application program is running on a UNIX–based
system, the HIF services are translated into the underlying UNIX services. OS–boot
supports the more complex services by making use of the MiniMON29K message
system to communicate the service request to a debug support host processor (see
Chapter 7). For this reason, services 255 and lower are not always available. Services
Table 2-2. HIF Service Calls
          Calling Parameters                   Returned Values
Service   gr121  lr2       lr3       lr4       gr96         gr97         gr121
exit      1      exitcode                      Service does not return
open      17     pathname  mode      pflag     fileno                    errcode
close     18     fileno                        retval                    errcode
read      19     fileno    buffptr   nbytes    count                     errcode
write     20     fileno    buffptr   nbytes    count                     errcode
lseek     21     fileno    offset    orig      where                     errcode
remove    22     pathname                      retval                    errcode
rename    23     oldfile   newfile             retval                    errcode
ioctl     24     fileno    mode                                          errcode
iowait    25     fileno    mode                count                     errcode
iostat    26     fileno                        iostat                    errcode
tmpnam    33     addrptr                       filename                  errcode
time      49                                   secs                      errcode
getenv    65     name                          addrptr                   errcode
gettz     67                                   zonecode     dstcode      errcode
sysalloc  257    nbytes                        addrptr                   errcode
sysfree   258    addrptr   nbytes              retval                    errcode
getpsize  259                                  pagesize                  errcode
getargs   260                                  baseaddr                  errcode
clock     273                                  msecs                     errcode
cycles    274                                  LSBs cycles  MSBs cycles  errcode
setvec    289    trapno    funaddr             trapaddr                  errcode
settrap   290    trapno    trapaddr            trapaddr                  errcode
setim     291    mask      di                  mask                      errcode
query     305    capcode                       hifvers                   errcode
                 capcode                       cpuvers                   errcode
                 capcode                       027vers                   errcode
                 capcode                       clkfreq                   errcode
                 capcode                       memenv                    errcode
signal    321    newsig                        oldsig                    errcode
sigdfl    322    [gr125 points to HIF signal frame]  Service does not return
sigret    323    [gr125 points to HIF signal frame]  Service does not return
sigrep    324    [gr125 points to HIF signal frame]  Service does not return
sigskp    325    [gr125 points to HIF signal frame]  Service does not return
sendsig   326    sig                                                     errcode
with numbers 256 and higher do not require the support of a remote host processor.
These services are implemented directly by OS–boot. If an underlying operating sys-
tem, such as UNIX, is being used, then some of these services may not be available as
they may violate the underlying operating system’s security.
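The numbering rules described in this section can be expressed as two small predicates. The C sketch below is illustrative; the function names hif_needs_complex_os and hif_reserved_by_amd are hypothetical and are not part of any HIF library.

```c
#include <stdbool.h>

/* Services numbered 255 and lower need complex operating system support
 * (under OS-boot, a remote host); 256 and higher are the simpler services. */
bool hif_needs_complex_os(unsigned service)
{
    return service <= 255u;
}

/* AMD reserves 0-127 and 256-383 for HIF use; all other numbers are free
 * for user extensions. */
bool hif_reserved_by_amd(unsigned service)
{
    return service <= 127u || (service >= 256u && service <= 383u);
}
```

For example, write (service 20) falls in the first group, while cycles (service 274) does not; a user extension could safely take a number such as 200 or 400.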
When application benchmark programs use HIF services, care should be taken.
If a program requests a service such as time (service 49) it will suffer the delays of
communicating the service request to a remote host if the OS–boot operating system
is used. This can greatly affect the performance of a program, as execution will be
delayed until the remote host responds to the service request. It is better to use ser-
vices such as cycles (service 274) or clock (service 273) which are executed by the
29K processor and do not suffer the delays of remote host communication.
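When the cycles service is used for timing, the two returned words must be joined before two readings can be subtracted. The following C sketch shows one way to do this. The function names are hypothetical, and the example assumes that the word returned in gr97 carries the upper 24 bits of the 56-bit count.

```c
#include <stdint.h>

/* Combine the two words returned by the cycles service (gr96 = least
 * significant bits, gr97 = most significant bits) into one count.
 * Assumption: only the low 24 bits of the upper word are significant. */
uint64_t hif_cycles_combine(uint32_t lsbs, uint32_t msbs)
{
    return ((uint64_t)(msbs & 0x00FFFFFFu) << 32) | lsbs;
}

/* Elapsed cycles between two readings, taken modulo 2^56 so that a single
 * counter wrap between the readings still gives the right difference. */
uint64_t hif_cycles_elapsed(uint64_t start, uint64_t end)
{
    return (end - start) & 0x00FFFFFFFFFFFFFFull;
}
```

Dividing the elapsed count by the clock frequency reported by the query service would then give the elapsed time in seconds.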
The assembly level glue routines for HIF services 255 and lower are rarely re-
quested directly by an application program. They are more frequently called upon by
library routines. For example, use of the library printf() routine is the typical way of
generating a write HIF service request. The mapping between library routines and
HIF services may not always be direct. The printf() routine, when used with a device
operating in COOKED mode, may only request write services when flushing buffers
supporting device communication. Appendix A contains a detailed description of
each HIF service in terms of input and output parameters, as well as error codes.
2.2.4 Switching to Supervisor Mode
Operating systems which conform to HIF normally run application code in User
mode. However, many real–time applications require access to resources which are
restricted to Supervisor mode. If the HIF settrap service is supported, it is easy to
install a trap handler which causes application code to commence execution in Su-
pervisor mode. The example code sequence below uses the settrap() HIF library rou-
tine to install a trap handler for trap number 70. The trap is then asserted using assem-
bly language glue routine assert_70().
extern int super_mode();        /* Here in User mode */
_settrap(70,super_mode);        /* install trap handler */
assert_70();                    /* routine to assert trap */
. . .                           /* Here in Supervisor mode */
The trap handler is shown below. Its operation is very simple; it sets the
Supervisor mode bit in the old processor status register (OPS) before issuing a trap
return instruction (IRET). Other application status information is not affected. For
example, if the application was running with address translation turned on, then it will
continue to run with address translation on, but now in Supervisor mode.
In fact the example relies on application code running with physical addressing;
or if the Memory Management Unit is used to perform address translation, then virtu-
al addresses are mapped directly to physical addresses. This is because the Freeze
mode handler, super_mode(), runs in Supervisor mode with address translation
turned off. But the settrap system call, which installs the super_mode() handler ad-
dress, runs in User mode and thus operates with User mode address values.
        .global _super_mode
_super_mode:                    ;gr64 is an OS temporary
        mfsr    gr64,ops        ;read the OPS register
        or      gr64,gr64,0x10  ;set SM bit in OPS
        mtsr    ops,gr64        ;write the OPS register back
        iret                    ;iret back to Supervisor mode
The super_mode() and assert_70() routines have to be written in assembly lan-
guage. The IRET instruction in super_mode() starts execution of the JMPI instruc-
tion in the assert_70() routine shown below. The method shown of forcing a trap can
be used to test a system’s interrupt and trap support software.
.global _assert_70
_assert_70: ;leaf routine
asneq 70,gr96,gr96 ;force trap 70
jmpi lr0 ;return
nop
Table 2-3. HIF Service Call Parameters
Parameter Description
027vers The version number of the installed Am29027 arithmetic accelerator chip (if any)
addrptr A pointer to an allocated memory area, a command-line-argument array, a path-
name buffer, or a NULL-terminated environment variable name string.
baseaddr The base address of the command-line-argument vector returned by the getargs
service.
buffptr A pointer to the buffer area where data is to be read from or written to during the
execution of I/O services, or the buffer area referenced by the wait service.
capcode The capabilities request code passed to the query service. Code values are: 0 (re-
quest HIF version), 1 (request CPU version), 2 (request Am29027 arithmetic accel-
erator version), 3 (request CPU clock frequency), and 4 (request memory environ-
ment).
clkfreq The CPU clock frequency (in Hertz) returned by the query service.
count The number of bytes actually read from file or written to a file.
cpuvers The CPU family and version number returned by the query service.
cycles The number of processor cycles (returned value).
di The disable interrupts parameter to the setim service.
dstcode The daylight savings time in effect flag returned by the gettz service.
errcode The error code returned by the service. These are usually the same as the codes
returned in the UNIX errno variable.
exitcode The exit code of the application program.
filename A pointer to a NULL-terminated ASCII string that contains the directory path of a tem-
porary filename.
fileno The file descriptor which is a small integer number. File descriptors 0, 1, and 2 are
guaranteed to exist and correspond to open files on program entry (0 refers to the
UNIX equivalent of stdin and is opened for input; 1 refers to the UNIX stdout, and is
opened for output; 2 refers to the UNIX stderr, and is opened for output).
funaddr A pointer to the address of a spill or fill handler passed to the setvec service.
hifvers The version of the current HIF implementation returned by the query service.
iostat The input/output status returned by the iostat service.
mask The interrupt mask value passed to and returned by the setim service.
memenv The memory environment returned by the query service.
mode A series of option flags whose values represent the operation to be performed. Used
in the open, ioctl, and wait services to specify the operating mode.
msecs Milliseconds returned by the clock service.
name A pointer to a NULL-terminated ASCII string that contains an environment variable
name.
nbytes The number of data bytes requested to be read from or written to a file, or the number
of bytes to allocate or deallocate from the heap.
newfile A pointer to a NULL-terminated ASCII string that contains the directory path of a new
filename.
newsig The address of the new user signal handler passed to the signal service.
offset The number of bytes from a specified position (orig) in a file, passed to the lseek ser-
vice.
oldfile A pointer to a NULL-terminated ASCII string that contains the directory path of the old
filename.
oldsig The address of the previous user signal handler returned by the signal service.
orig A value of 0, 1, or 2 that refers to the beginning, the current position, or the position of
the end of a file.
pagesize The memory page size in bytes returned by the getpsize service.
pathname A pointer to a NULL-terminated ASCII string that contains the directory path of a file-
name.
pflag The UNIX file access permission codes passed to the open service.
retval The return value that indicates success or failure.
secs The seconds count returned by the time service.
sig A signal number passed to the sendsig service.
trapaddr The trap address returned by the setvec and settrap services; also the trap
address passed to the settrap service.
trapno The trap number passed to the setvec and settrap services.
where The current position in a specified file returned by the lseek service.
zonecode The time zone minutes correction value returned by the gettz service.
2.3 C LANGUAGE COMPILER
I know of six C language compilers producing code for the 29K family. The
most widely used of these are: the High C 29K compiler developed by Metaware Inc;
and GNU supported by the Free Software Foundation and Cygnus Support Inc. De-
velopers of 29K software normally operate in a cross development environment,
editing and compiling code on one machine which is intended to run on 29K target
hardware. The High C 29K compiler is sold by a number of companies, including
AMD, and packaged along with other vendor tools. High C 29K can produce code
for both big– and little–endian 29K operation. The GNU compiler, gcc, currently
(version 2.5) produces big–endian code. This does not present a problem as the 29K
is used predominantly in big–endian mode.
2.3.1 Compiler Optimizations
A RISC chip is very sensitive to code optimization. This is not surprising since
the RISC philosophy gives software greater access to a processor’s internals relative
to most CISC processors. Compilers apply a number of code optimization
techniques which are difficult for the assembly language programmer to apply
consistently. Some of these techniques are briefly described below.
Common Sub–Expression Elimination
...
c=a+b;
...
d=a+b; /* sub-expression used again */
...
The expression a+b is a common sub-expression; it does not need to be
evaluated twice. A more efficient compiler would store the result of the first evaluation
in a local or global register and reuse the value in the second expression. Temporary
variables used during interim calculations are optimized by the compiler. These com-
piler-generated temporaries are allocated to register cache locations.
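The effect can be shown by writing out by hand what the compiler does. In the C sketch below, the function names and the temporary t are hypothetical; both functions compute the same results, but the second evaluates a+b only once.

```c
/* As written by the programmer: a+b is evaluated twice. */
int sums_naive(int a, int b, int *c, int *d)
{
    *c = a + b;
    *d = a + b;      /* sub-expression used again */
    return *c + *d;
}

/* After common sub-expression elimination: the first result is kept in a
 * temporary, which the compiler would hold in a register, and reused. */
int sums_cse(int a, int b, int *c, int *d)
{
    int t = a + b;

    *c = t;
    *d = t;
    return t + t;
}
```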
Strength Reduction
Whenever possible, “strength reduction” is performed. This refers to replacing
expensive instructions with less expensive ones. For example, multiplies by
powers of two are replaced with more efficient shift instructions.
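A minimal C illustration of the idea follows. The function names are hypothetical, and unsigned arithmetic is used so that the multiply and the shift agree for every input value.

```c
#include <stdint.h>

/* Multiplication by a power of two, written with a multiply. */
uint32_t scale_mul(uint32_t x)
{
    return x * 8u;
}

/* The strength-reduced form: a left shift by log2(8) = 3 bit positions. */
uint32_t scale_shift(uint32_t x)
{
    return x << 3;
}
```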
Loop Invariant Code Motion
Sometimes a C programmer will place code in a loop which could have been
located outside of the loop. For example, variable initialization need not be repeated-
ly executed in a loop. The loop invariant initialization would be located before the
loop code. Hence, the amount of code required to support each loop iteration is mini-
mized.
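The transformation can be sketched in C as a before and after pair. The function and variable names here are hypothetical; the point is that the product scale * offset does not depend on the loop variable, so it can be computed once.

```c
#include <stddef.h>

/* As written: scale * offset does not change inside the loop, yet it is
 * recomputed on every iteration. */
long sum_naive(const int *v, size_t n, int scale, int offset)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        int k = scale * offset;   /* loop invariant */
        total += v[i] + k;
    }
    return total;
}

/* After code motion: the invariant is computed once, before the loop. */
long sum_hoisted(const int *v, size_t n, int scale, int offset)
{
    long total = 0;
    int k = scale * offset;

    for (size_t i = 0; i < n; i++)
        total += v[i] + k;
    return total;
}
```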
Loop Unrolling
There are a number of optimization techniques applicable to code loops. The
objective is always the same: to replace the loop with a sequence of faster executing
code. This often involves unrolling the loop partially or completely. For example, the
compiler may determine a loop is traversed, say, three times. It may be more effective
to replace the loop with three in–line versions of the loop. This would eliminate the
branching required by the loop. Additionally, when a loop is unrolled there are
generally increased opportunities to apply optimizations not available to the looped
alternative. Consequently, sections of the expanded loop need not be just
duplications of a single loop iteration, but something smaller and more register
efficient.
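The three-iteration case mentioned above might look like this when written out in C. The function names are hypothetical; the second version is what a compiler could produce after complete unrolling and further simplification.

```c
/* Loop form: three iterations, each ending in a branch. */
int sum3_loop(const int v[3])
{
    int total = 0;

    for (int i = 0; i < 3; i++)
        total += v[i];
    return total;
}

/* Fully unrolled: no loop counter and no branches, and the three additions
 * have been merged into a single expression. */
int sum3_unrolled(const int v[3])
{
    return v[0] + v[1] + v[2];
}
```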
Dead–Code Elimination
Code which can never be executed is eliminated. This saves on memory usage.
Unexecutable code can result from a branch which can never be taken. Compilers
generally issue a warning when they detect “unreachable code”. Additionally, result
values which are never used can be eliminated; this can remove unneeded store
instructions.
Improved Register Allocation
A processor’s registers are a critical resource in determining performance.
Accessing registers is very much more efficient than accessing off–chip memory.
The ability of the compiler to devise schemes to keep data within the available
registers is critical. Additionally, given that the 29K compiler determines the size of a
procedure’s register window, it is important to minimize register allocation if spilling
and filling are to be avoided.
Constant Propagation And Folding
Variables are often assigned constant values. Later, the variable is used in a cal-
culation. The 29K instruction format supports 8–bit immediate data constants. Ap-
plying constant variables as immediate data rather than holding the variable in a reg-
ister can be more efficient. Additionally, propagating an immediate value may enable
it to be combined with another immediate value at compile time. This is better than
performing a run–time calculation.
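A small C example makes the two steps visible. The names are hypothetical; the second function shows the form the compiler can reach once the constants have been propagated and the product folded at compile time.

```c
/* As written: two variables hold constants that never change. */
int area_naive(int x)
{
    int width = 6;
    int height = 7;

    return x + width * height;
}

/* After propagation and folding: 6 * 7 is computed at compile time, and
 * the result, 42, is small enough for an 8-bit immediate field. */
int area_folded(int x)
{
    return x + 42;
}
```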
Register–to–Register Copying (Copy Propagation)
When examining compiler generated code, particularly if the target is a CISC
processor, it is not unusual to see stores of register data to memory locations. This
makes the register available for reuse. Later, the stored data is reloaded for further
processing. The better RISC compilers try to keep data in registers longer, and use
register–to–register copying rather than register–to–memory.
Memory Shadowing
The performance impact of a memory access is reduced when the access is per-
formed to a copy–back data cache. However, most processors do not have this advan-
tage available to them. The term “memory shadowing” refers to the increased use of
registers for data variable storage. Again, directing accesses to registers rather than
off–chip memory has significant performance advantages. Of course, if a variable is
declared volatile, it cannot be held in a register.
Memory References Are Coalesced and Aligned
Data memory can be most efficiently accessed using burst–mode addressing.
This requires the use of load– and store–multiple instructions. When a sufficiently
large data object is being moved between memory and registers, it is best to use the
burst–mode supported instructions. The compiler can also arrange for frequently ac-
cessed data to be located (coalesced) in adjacent memory locations, even if the data
variables were not consecutively defined.
There are also performance benefits to be had by aligning target instructions on
cache block boundaries. For example, a procedure can be aligned to start on a 4–word
boundary. This improves cache utilization and performance, particularly with
caches which do not support partially filled cache blocks.
Delay Slot Filling
The compilers perform “delay slot filling” (see section 3.1.8). Delay slots occur
whenever a 29K processor experiences a disruption in consecutive instruction execu-
tion. The processor always executes the instruction in the decode pipeline stage, even
if the execute stage contains a jump instruction. Delay slot is the term given to the
instruction following the jump or conditional branch instruction. Effectively, the
branch instruction is delayed one cycle. Unlike assembly language programmers, the
compiler easily finds useful instructions to insert after branching instructions. These
instructions, which are executed regardless of the branch condition, are effectively
achieved at no cost. Typically, an instruction that is invariant to the branch outcome is
moved into the delay slot just after the branch or jump instruction.
Jump Optimizations
Because of the pipeline stalling effects of jump instructions, scheduling these
instructions can achieve significant performance improvements. The objective is to
reduce the number of taken branches. For example, code loops typically have condi-
tional tests at the top of the loop to test for loop completion. This results in branch
instructions at the top and the bottom of the loop. If the conditional branch is moved
to the bottom of the loop then the number of branches is reduced.
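Moving the test to the bottom of the loop can be sketched at the C level. The function names are hypothetical, and the comments describe the branches a straightforward translation of each form would contain.

```c
/* Test at the top: each iteration executes the conditional branch at the
 * top of the loop plus a jump back from the bottom. */
int count_down_top(int n)
{
    int steps = 0;

    while (n > 0) {
        n--;
        steps++;
    }
    return steps;
}

/* Test moved to the bottom: one guard before entry, then a single
 * conditional branch per iteration. */
int count_down_bottom(int n)
{
    int steps = 0;

    if (n > 0) {
        do {
            n--;
            steps++;
        } while (n > 0);
    }
    return steps;
}
```

The guard before the do loop preserves the behavior of the original when the loop body should not execute at all.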
Instruction Scheduling
The 29K allows load and store instructions to be overlapped with other instruc-
tions that do not depend on the load or store data. Ordinarily, a processor will load
data into a register before it makes use of it in the subsequent instruction. To enable
overlapping of the external memory access, the load instruction must be executed at
an earlier stage, before it is required. Best results are obtained if code motion tech-
niques are used to push the load instruction back by as many instructions as there are
memory access delay cycles (another name for this technique is instruction pre-
scheduling). This will prevent processor pipeline stalling caused by an operand value
not being available. Once again, code motion is best left to the compiler to worry
about.
Leaf Procedure Optimization
Leaf procedures are procedures which do not call other procedures; at least they
do not contain any C level procedure calls. However, they can contain transparent
routine calls inserted by the compiler. Because of this unique characteristic of leaf
routines, a number of optimizations can be applied, for example, a simplified
procedure prologue and epilogue, and alternative register usage. When a leaf is static in
scope (only known within the defining module) alternative parameter passing and
register allocation schemes can be applied.
With newer versions of the High C 29K compiler, it is possible to construct
simple procedures as transparent routines (see section 3.7). If a procedure qualifies
for a transparent–type implementation, then its parent (in the calling sequence) may
itself become a leaf procedure. This propagates the benefits obtained by leaf
procedures.
In–lining Simple Functions
The program may call a procedure but the compiler can replace the call with
equivalent in–line code. For very small procedures this can be a performance
advantage. However, as the called procedure grows in size and in–lining is frequently
applied, then code space requirements will increase. In–lining is frequently utilized
with C++ code which often has classes with small member functions. The register
requirements of a procedure can grow when it has to deal with in–line code rather
than a procedure call. This does not present much difficulty for a 29K processor as it
can individually tailor the register allocation to each procedure’s requirements with
dynamically sized register windows.
As stated above, it is possible to construct simple functions as transparent
routines (see section 3.7). This is not really in–lining, but it does further reduce the
overhead associated with even a leaf procedure. Additionally, placing code in a
transparent routine, which is shared, helps reduce the code expansion which occurs
with in–lining. For this reason using the C language key word _Transparent to define
the type of small procedures, may be a performance advantage when used with C++
object member functions.
Global Function In–lining
When code in–lining is applied, it is typically limited to functions defined and
used within a single module. More elaborate schemes enable a function to be
defined in one module and the related code to be inserted in–line even if the call to the
function appears in another file. Applying function in–lining in this global fashion
can greatly extend the benefits of in–lining.
Two–pass Code Compilation
Most compilers apply their optimizations statically; that is, entirely at compile
time. However, by observing the program in execution, optimizations can be further
refined. For example, branch prediction can be applied statically, but observing the
frequency of actual branching reveals the most traversed code paths. Additionally,
the data which is most frequently accessed can be determined. With this information
a second pass of the compiler can be applied and further code optimizations incorpo-
rated.
Superblock Formation
Software optimizations are normally only applied within a code block. A block
is a code sequence which is bounded by a single entry point (at the top –– a lower
address) and one or more exit points (a jump or call instruction). Instruction
scheduling and other optimizations can be better utilized if an instruction block is
large. For this reason techniques which enlarge a block’s size and create a superblock
are important.
A superblock may contain a number of basic blocks, yet code optimizations can
be applied over the larger superblock code sequence. Creation of a superblock can
require duplication of code. Typically the tail of a superblock will be duplicated (tail
duplication) to eliminate side entry points to the superblock. Optimization techniques
which help superblock creation are: loop unrolling, function in–lining, jump
elimination, code duplication, code migration, and code profiling.
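Loop unrolling is the easiest of these techniques to show at source level. The sketch below is hand written for illustration (the function names are hypothetical); the unrolled body forms one larger block with a single branch for every four elements, giving the scheduler more instructions to work with.

```c
#include <stddef.h>

/* One element per iteration: small block, one branch per element. */
long sum_rolled(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Unrolled by four: larger block, one branch per four elements. */
long sum_unrolled4(const int *v, size_t n)
{
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += v[i];
        sum += v[i + 1];
        sum += v[i + 2];
        sum += v[i + 3];
    }
    for (; i < n; i++)        /* remaining 0 to 3 elements */
        sum += v[i];
    return sum;
}
```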
2.3.2 Metaware High C 29K Compiler
The Metaware Inc. compiler, invoked with the hc29 driver, has held the position
as the top performing 29K compiler for a number of years. It generally produces the
fastest code, which is of the smallest size. It is available on SUN and HP workstation
platforms as well as IBM PC–AT machines. It may be made available on other plat-
forms depending on customer demand. A number of companies resell the compiler
along with other tools, such as debuggers and emulators.
The compiler typically allocates about 12 registers for use by each new
procedure. However, a very large procedure could be allocated up to 128 registers.
This requires the register–stack cache be assigned the maximum window size of 128
registers. The “lregs=n” compiler switch (minimum n=36) enables the maximum
number of registers allocated to a procedure to be limited to less than 128. If the
“lregs” switch is used, it is possible to operate with a reduced window size. This
would increase the frequency of stack spilling and filling (and hence reduce effective
execution speeds) but would enable a faster task context switch time (see section
8.1.4). The maximum number of local registers which would require saving or
restoring would be limited to the reduced window size (window size = rfb – rab).
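The arithmetic relating register counts to window size is simple, since each 29K register holds 32 bits (4 bytes). The helpers below are a hypothetical sketch; in particular, reducing the window to exactly n registers when “lregs=n” is used is an assumption made for the example.

```c
#include <stdint.h>

enum { REG_BYTES = 4 };   /* each register is 32 bits wide */

/* Window size in bytes: the difference of the two bound registers. */
uint32_t window_size_bytes(uint32_t rfb, uint32_t rab)
{
    return rfb - rab;
}

/* Bytes of register stack cache needed for a window of nregs registers;
 * also the most data a context switch would have to save or restore. */
uint32_t window_bytes_for_regs(uint32_t nregs)
{
    return nregs * REG_BYTES;
}
```

For the maximum window of 128 registers this gives 512 bytes; for a window of 36 registers it gives 144 bytes.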
A number of the example code sequences shown in this book, and provided by
AMD, are configured to operate with a fixed window size of 512 bytes; in particular,
repair_R_stack in file signal.s and signal_associate in file sig_code.s. These files
should be modified to reflect the reduced window size. Ideally a Supervisor mode
accessible memory location, say WindowSize, should be initialized by the operating
system to the chosen window size, and all subsequent code should access
WindowSize to determine the window size in use. Additionally, the spill handler
routine must be replaced with the code shown below. The replacement handler
requires three additional instructions. But, unlike the more frequently used spill
handler (section 4.4.4), it is not restricted to operating with a fixed window size of
512 bytes.
spill_handler:
sub tav,rab,rsp ;calculate size of spill
srl gr96,tav,0x2
sub gr96,gr96,0x1
mtsr cr,gr96 ;number of words
sub tav,rfb,tav ;determine new rfb position
const gr96,0x200
or gr96,tav,gr96
mtsr ipa,gr96 ;point into register file
add rab,rsp,0x0 ;adjust rab position
storem 0,0x0,gr0,tav ;move data
jmpi tpc
add rfb,tav,0x0 ;adjust rfb position
The above spill handler code may fail if a called procedure does not itself use
the gr96 register. In that case the compiler may hold a value in gr96 across the
function call and expect it to survive, yet the call may result in spill handler
execution, which overwrites gr96. This is not likely, but the use of gr96 above
must be done with care.
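The address arithmetic performed by the handler can be modelled in C. This is a sketch of the bookkeeping only: the struct and function names are hypothetical, and the STOREM data movement itself is not modelled. It shows that rab and rfb are lowered by the same amount, so the window size (rfb – rab) is unchanged whatever size is in use.

```c
#include <stdint.h>

struct reg_stack {
    uint32_t rsp;   /* register stack pointer (gr1) */
    uint32_t rab;   /* register allocate bound */
    uint32_t rfb;   /* register free bound */
};

/* Update the bounds as the handler does; returns the number of words
 * spilled (the handler loads this count minus one into CR). */
uint32_t spill_model(struct reg_stack *s)
{
    uint32_t bytes = s->rab - s->rsp;    /* sub tav,rab,rsp */
    uint32_t new_rfb = s->rfb - bytes;   /* sub tav,rfb,tav */
    s->rab = s->rsp;                     /* add rab,rsp,0x0 */
    s->rfb = new_rfb;                    /* add rfb,tav,0x0 */
    return bytes / 4;                    /* srl gr96,tav,0x2 */
}

/* Convenience wrapper: window size in bytes after a spill. */
uint32_t window_after_spill(uint32_t rsp, uint32_t rab, uint32_t rfb)
{
    struct reg_stack s = { rsp, rab, rfb };
    (void)spill_model(&s);
    return s.rfb - s.rab;
}
```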
A number of non–standard C features have been added to the compiler. These
features are often useful, but their use reduces the portability of code between
different C compilers. For example, the High C 29K compiler does not normally pack
data structures. The type modifier _Packed can be used to specify packing on a
per–structure basis. If structure packing is selected on the compiler command line,
unpacked structures can be selectively specified with the _Unpacked type modifier.
For example:
typedef _Packed struct packet_str /* packed structure */
{ char A;
int B;
. . .
} packet_t;
A HIF conforming operating system provides unaligned memory access trap
handlers –– any 29K operating system may choose to do this. Hence, if an object larg-
er than a byte is accessed and the object is not aligned to an object–sized boundary,
then a trap will be taken and the trap handler will perform the required access in
stages if necessary. The trap handler will require several processor cycles to perform
its task. To the programmer, the required data is accessed as if it were aligned on the
correct address boundary. In the example above, structure member B is of size int but
is not aligned on an int–sized boundary (given object A is a char and it is aligned on a
word–sized boundary).
Of course there is a performance penalty for use of trap handlers. For this reason,
packed data structures are seldom used. However, their use does reduce data
memory requirements, and for this reason data is often sent between processors in
packed data packets. When a data packet is received, its contents can be accessed as
bytes without any data alignment difficulties. Access of data larger than bytes may
require unaligned trap handler support, and thus suffer a performance penalty.
The High C 29K compiler offers a solution to the performance problem with the
type modifiers _ASSUME_ALIGNED and _ASSUME_UNALIGNED. They enable a
pointer to an unaligned structure to be declared. For example:
receive_packet(packet_p)
_ASSUME_UNALIGNED packet_t* packet_p;
{
int data = packet_p–>B;/* unaligned access */
. . .
The receive_packet() procedure is passed a pointer to a data structure which is
known to be unaligned. Normally, when member B of the packet structure is
accessed, an unaligned trap occurs. However, informing the compiler of the
unaligned nature of the data enables the compiler to replace the normal load
instruction used to read the B data with a transparent helper routine call (see section
3.7). The transparent helper routine performs the same task as the trap handler but
with a reduced overhead.
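The text does not list the helper routine, but the kind of work it performs, an access carried out in stages, can be sketched in portable C. The function name is hypothetical and big–endian byte order is an assumption of this example.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four single-byte loads, most significant
 * byte first. No load is wider than a byte, so the address p may have
 * any alignment. */
uint32_t load_u32_unaligned(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

Reading member B of the packed structure above would correspond to calling such a routine with the address one byte beyond member A.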
2.3.3 Free Software Foundation, GCC
The GNU compiler, gcc, can be obtained from any existing user who is in a
position, and has the time, to duplicate their copy. Alternatively, the Free Software
Foundation can be contacted. For a small fee, Cygnus Support Inc. will ship you a
copy along with their documentation. The GNU compiler is available in source form,
and currently runs on UNIX type host machines as well as 386 based IBM PCs and
compatibles.
Considering the Stanford University benchmark suite, the gcc compiler (ver-
sion 2.3) produces code which is on average 15–20% slower in execution compared
to hc29. The GNU compiler also uses considerably more memory to contain the
compiled code. Of course your application program may experience somewhat dif-
ferent results.
2.3.4 C++ Compiler Selection
Programmers first started developing C++ code for the 29K in 1988; they used
the AT&T preprocessor, cfront, along with the High C 29K compiler. A number of
support utilities were developed at that time to enable the use of cfront: nm29,
munch29, and szal29, which gave the size and alignment of 29K data objects (re-
quired for cross development environments).
Because the GNU tool chains can support C++ code development directly with
the GCC compiler, there is little use being made of the AT&T cfront preprocessor.
Additionally, MRI and Metaware have recently announced upgrades to their prod-
ucts which now enable C++ code development. (C++ makes extensive use of dynam-
ic memory resources, see section 2.4.1.)
2.3.5 Executable Code and Source Correspondence
The typically high levels of optimization applied by a compiler producing code
for RISC execution can make it difficult to identify the relationship between 29K
instructions and the source level code. When looking at the 29K instructions
produced by the compiler, it is not always easy to identify the assembly instructions
which correspond to each line of C code. Optimizations such as: code motion,
sub–expression elimination, loop unrolling, instruction scheduling and more, all add
to the difficulty.
Fortunately, there is usually little need to study the resulting instructions
produced after compilation. However, it can occasionally be worth studying
compiler output when trying to understand the performance of critical code
segments. It is difficult to obtain a small example of C code which demonstrates all
the potential code optimizations. The example below is interesting, but illustrates
only a few of the difficulties of relating source code to 29K instructions.
int strcmp(s1, s2) /* file strcmp.c */
char *s1,*s2;
{
int cnt=0;
for(cnt=0;;cnt++)
{ if(s1[cnt]!=s2[cnt])
return –1;
if(s1[cnt]==’\0’ || s2[cnt]==’\0’) /* line 8 */
if(s1[cnt]==’\0’ && s2[cnt]==’\0’)
return 0;
else
return –1;
}
} /* line 14 */
The procedure, strcmp(), is similar to the ANSI library routine of the same
name. It is passed the address of two strings. The strings are compared to determine if
they are the same. If they are the same, zero is returned, otherwise –1 is returned. This
is not exactly the same behavior as the ANSI routine.
The procedure is based on a for–loop statement which compares characters in
the two strings until they are found to be different or one of the strings is terminated.
The algorithm used by the C code is not optimal. But this makes the example more
interesting as it challenges the compiler to produce the minimum code sequence. The
Metaware compiler was first used to compile the code with a high level of
optimization selected (–O7). The command line used was “hc29 –S –Hanno –O7
strcmp.c”. The “–S” switch causes the compiler to stop after it has produced 29K
assembly code –– no linking with libraries is performed. The “–Hanno” switch
causes the source C code to be embedded in the output assembly code. This helps
identify the assembly code corresponding to each line of C code. The assembly code
produced is shown below. Note that some assembly level comment statements have
been added to help explain the code operation.
.text
.word 0x40000 ; Tag: argcnt=2 msize=0
.global _strcmp
_strcmp:
;4 | int cnt=0;
;5 | for(cnt=0;;cnt++)
jmp L2
const gr97,0 ;cnt=0
L3: ;top of for–loop
L2:
;6 | { if(s1[cnt]!=s2[cnt])
add gr96,lr2,gr97
load 0,1,gr99,gr96 ;load s1[cnt]
add gr96,lr3,gr97
load 0,1,gr98,gr96 ;load s2[cnt]
cpeq gr96,gr99,gr98 ;compare characters
jmpf gr96,L4 ; jump if different
cpeq gr96,gr99,0 ;test if s1[cnt] == ’\0’
;8 | if(s1[cnt]==’\0’ || s2[cnt]==’\0’)
jmpt gr96,L5 ; jump if string end
cpneq gr96,gr98,0 ;test s2[cnt]
jmpt gr96,L3 ;for–loop if not end
add gr97,gr97,1 ;increment cnt
L5:
;9 | if(s1[cnt]==’\0’ && s2[cnt]==’\0’)
cpneq gr96,gr99,0 ;here is at end of string
jmpt gr96,L4 ;jump if s1[]!=’\0’
cpneq gr96,gr98,0
jmpt gr96,L7 ;jump if s2[]!=’\0’
constn gr96,–1
;10 | return 0;
jmpi lr0 ;strings match
const gr96,0 ;return 0
L4:
constn gr96,–1 ;no match
L7:
;12 | return –1;
jmpi lr0
nop
The body of the for–loop is contained between address labels L3 and L5. The
compiler has filled the delay slot of jump instructions with other useful instructions.
Within the for–loop, LOAD instructions are used to access the characters of each
string. Register gr97 is used to hold the loop–count value, cnt. The count value is
incremented each time round the for–loop. The value in gr97 is added to the base of
each string (lr2 and lr3) to obtain the address of each character required for compari-
son. The LOAD instructions have been scheduled to somewhat reduce conflict for
off–chip access and reduce the pipeline stalling effects of LOAD instructions.
Within the body of the loop three tests are applied: one to determine if the char-
acters at the current position in the string match; the remaining two, to determine if
the termination character has been reached for either of the strings. The assembly
code after label L5 selects the correct return value when the tested characters do not
match or string termination is reached. There is unnecessary use of jump instructions
in the code following label L5 and also in the initial code jumping to label L2. It is
somewhat fortunate that this less optimal code does not appear within the more fre-
quently executed for–loop body.
The same code was compiled with the GNU compiler using command “gcc –S
–O4 strcmp.c”. The assembly code produced is shown below; it is quite different
from the Metaware produced code.
.text
.align 4
.global _strcmp
.word 0x40000
_strcmp:
L2: ;top of for–loop
load 0,1,gr117,lr2 ;load s1[cnt]
load 0,1,gr116,lr3 ;load s2[cnt]
cpneq gr116,gr117,gr116 ;compare characters
jmpf gr116,L5 ;jump if match
cpneq gr116,gr117,0 ;test for s1[] end
jmpi lr0 ;no match
constn gr96,65535 ; return –1
L5: ;here if s1[cnt]==s2[cnt]
jmpfi gr116,lr0 ;return if at string end
const gr96,0
add lr3,lr3,1 ;next s2[] character
jmp L2 ;for–loop
add lr2,lr2,1 ;next s1[] character
All of the code is contained in the body of the for–loop. A for–loop transition
consists of 10 instructions, a decrease of one compared to the Metaware code. How-
ever, LOAD instructions are now placed back–to–back, and loaded data is used im-
mediately. Additionally, the normal path through the for–loop contains an additional
jump to label L5. This will increase the actual number of cycles required to execute a
single for–loop to more than 10 cycles. It is likely the Metaware produced code will
execute in a shorter time.
No register (previously gr97) is used to contain the cnt value. The pointers to the
passed strings, lr2 and lr3, are advanced to point to the next character within the for–
loop. Delay slot instructions are productively filled and there are no unnecessary
jump instructions.
Lines 8 through 12 of the source code are only applied if the tested characters are
found to match. Consequently, it is redundant to test if either string has reached
the termination character –– if one has, they both have. This optimization should
have been reflected in the source code. However, the GNU compiler has identified
that it need only test string s1[] for termination. This results in the elimination of 29K
instructions relating to later C code lines. For example, there is no code relating to the
if–statement on line 9. If an attempt is made to place a breakpoint on source line 9
using the GDB source level debugger, then no breakpoint will be installed. Other de-
buggers may give a warning message or place a breakpoint at the first line before or
after the requested source line.
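The redundancy can be removed at source level. The routine below is a sketch of how the example might be rewritten so that only s1[] is tested for termination; the name str_match is hypothetical, and ANSI style parameter declarations are used rather than the older style of the original listing.

```c
/* Returns 0 if the two strings are identical, -1 otherwise. */
int str_match(const char *s1, const char *s2)
{
    int cnt;
    for (cnt = 0; ; cnt++) {
        if (s1[cnt] != s2[cnt])
            return -1;               /* characters differ */
        if (s1[cnt] == '\0')         /* s2[cnt] is equal, so it has ended too */
            return 0;
    }
}
```

With the source written this way, each source line has corresponding instructions, which makes it more likely that breakpoints can be placed as expected.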
Programmers familiar with older generation compilers applied to CISC code
generation will notice the increased complexity in associating 29K instructions to
source C statements –– even for the simple example shown. As procedures become
larger and more complex, code association becomes increasingly difficult. The
quality of 29K code produced by the better compilers available makes it very difficult
to consistently (or frequently) produce better code via hand crafting 29K instruc-
tions. Because of the difficulty of understanding the compiler generated code, it is
best to only incorporate hand–built code as separate procedures which comply with
the C language calling convention.
2.3.6 Linking Compiled Code
After application code modules have been compiled or assembled, they must be
linked together to form an executable file. There are three widely used linker tools:
Microtec Research Inc. developed ld29; Information Processing Corp. developed
ld29i; and the GNU tool chain offers gld. Sometimes these tools are repackaged by
vendors and made available under different names. They all operate on AMD COFF
formatted files. However, they each have different command line options and link
command–file formats. A further limitation when mixing the use of these tools is that
ld29 operates with a different library format compared to the others. It uses an MRI
format which is maintained by the lib29 tool. The others use a UNIX System V for-
mat supported by the well known ar librarian tool.
It is best to drive the linker from the compiler command line, rather than invok-
ing the linker directly. The compiler driver program, gcc or hc29 for example, can
build the necessary link command file and include the necessary libraries. This is the
ideal way to link programs, even if assembly language modules are to be named on
the compiler command line. Note that the default link command files frequently
align text (ALIGN .text=8192) and data sections to 8k (8192) byte boundaries. This
is because the OS–boot operating system (see Chapter 7) normally operates with ad-
dress translation turned on. The maximum (for the Am29000 processor) page size of
8k bytes is used to reduce run–time Memory Management Unit support overheads.
Different 29K evaluation boards can have different memory maps. AMD nor-
mally supplies the High C 29K linker in a configuration which produces a final code
image linked for a popular evaluation board –– many boards share the same memory
map. Additionally, AMD supplies linker command files for currently available
boards, such as the EZ030 and SA29200 boards. The linker command files are lo-
cated in the installation/lib directory; each command file ends with the file extension
.cmd. For example, the mentioned boards have command files: ez030.cmd and
sa200.cmd, respectively. The linker command files can be specified when the com-
piler is invoked. For example, the command “hc29 –o file –cmdez030.cmd file.c”
will cause the final image to be linked using the ez030.cmd command file. Using the
supplied linker command files is a convenient way to ensure a program is correctly
linked for the available memory resources.
The GNU compiler also allows options to be passed to the linker via the
“–Xlinker” flag. For example, the command line “gcc –Xlinker –c –Xlinker
ez030.cmd –o file file.c” will compile and link file.c. The linker will be passed the
option “–c ez030.cmd”. The GNU linker documentation claims the linker can
operate on MRI formatted command files. In practice, at least for the 29K, this is not
the case. The GNU linker expects MRI–MC68000 formatted command files, which
are a little different from MRI–29K formatted command files. Known differences are
the use of the “*” character rather than “#” before comments, and the key word
PUBLIC must be upper case. Those using the GNU tool chain generally prefer to use
the GNU linker command file syntax rather than attempt to use the AMD supplied
command files.
When developing software for embedded applications there is always the prob-
lem of what to do with initialized data variables. The problem arises because vari-
ables must be located in RAM, but embedded programs are typically not loaded by an
operating system which prepares the data memory locations with initialized values.
Embedded programs are stored in ROM; this means there is no problem with pro-
gram instructions unless a program wishes to modify its own code at run–time.
Embedded system support tools typically provide a means of locating initial-
ized data in ROM; and transferring the ROM contents to RAM locations before pro-
gram execution starts. The High C 29K linker, ld29, provides the INITDATA com-
mand for this purpose. Programs must be linked such that all references to writeable
data occur to RAM addresses. The INITDATA command scans a list of sections and transfers
the data variables found into a new .initdat section. The list contains the names of
sections containing initialized data. The linker is then directed to locate the new
.initdat section in ROM. The start address of the new section is marked with symbol
initdat.
Developers are provided with the source to a program called initcopy() which
must be included in the application program. This program accesses the data in ROM
starting at label initdat and transfers the data to RAM locations. The format of the
data located in the .initdat section is understood by the initcopy() routine. This rou-
tine must be run before the application main() program. A user could place a call to
the initialization routine inside crt0.s.
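The general shape of such a copy routine can be sketched in C. The record layout used here is purely hypothetical, since the actual format of the .initdat section is not described in this text; the names init_record, copy_init_data and copy_init_data_demo are likewise invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout; NOT the real .initdat format. */
struct init_record {
    size_t dest_offset;          /* where the data belongs within RAM */
    size_t length;               /* number of bytes; 0 ends the table */
    const unsigned char *data;   /* initial values held in ROM */
};

/* Copy every record into the RAM image; returns total bytes copied. */
size_t copy_init_data(const struct init_record *table, unsigned char *ram)
{
    size_t total = 0;
    for (; table->length != 0; table++) {
        memcpy(ram + table->dest_offset, table->data, table->length);
        total += table->length;
    }
    return total;
}

/* Small self-check: two records copied into a 16-byte RAM image. */
int copy_init_data_demo(void)
{
    static const unsigned char rom_a[] = { 1, 2, 3, 4 };
    static const unsigned char rom_b[] = { 9, 8 };
    const struct init_record table[] = {
        { 0,  sizeof rom_a, rom_a },
        { 10, sizeof rom_b, rom_b },
        { 0,  0,            NULL  }   /* terminator */
    };
    unsigned char ram[16] = { 0 };
    size_t copied = copy_init_data(table, ram);
    return copied == 6 && ram[3] == 4 && ram[10] == 9 &&
           ram[11] == 8 && ram[4] == 0;
}
```

A routine of this kind has to run before main(), because main() and everything it calls may read the initialized variables.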
Note, because initcopy() must be able to read the appropriate ROM devices,
these devices must be placed in an accessible address space. This is not a problem for
2–bus members of the 29K family, but 3–bus members can have a problem if the
.initdat section is located in a ROM device along with program code. Processors with
3–bus architectures, such as the Am29000, have separately addressed Instruction
and ROM spaces which are used for all instruction accesses. The Am29000 proces-
sor has no means of reading these two spaces to access data unless an external bridge
is provided. If program code and initialized data are located in the same ROM device,
the initcopy() program can only be used if an external bridge is provided. This bridge
connects the Am29000 processor data memory bus to the instruction memory bus. If
a 3–bus system does not have a bridge the romcoff utility can be used to initialize data
memory.
The romcoff utility can be used when the ld29 linker is not available and the
INITDATA linker command option is not provided. Besides being able to work with
3–bus architectures which have no bridge, it can be used to process program sections
other than just initialized data. Sections which ultimately must reside in RAM can be
initialized from code located in ROM.
Fully linked executables are processed by romcoff to produce a new linkable
COFF file. This new module has a section called RI_text which contains a routine
called RAMInit(). When invoked, this routine initializes the processed sections during
preparation of the relevant RAM regions. The new COFF file produced by romcoff
must be relinked with the originally linked modules. Additionally, a call to
RAMInit() must be placed in crt0.s or in the processor boot–up code (cold–start code) if
the linked executable is intended to control the processor during the processor
RESET code sequence.
When romcoff is not used with the “–r” option, it assumes that the ROM
memory is not readable. This results in a RAMInit() function which uses CONST
and CONSTH instructions to produce the data values to be initialized in RAM. This
results in extra ROM memory requirements to contain the very much larger
RAMInit() routine, but ensures that 3–bus architectures which do not incorporate a bridge
can initialize their RAM memory.
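The growth in size follows from how a 32–bit value is built without reading data memory: CONST supplies the low 16 bits (clearing the upper half of the register) and CONSTH then replaces the upper 16 bits. Each data word therefore costs at least two instruction words, plus the instructions needed to store it. The C functions below model the split; their names are hypothetical.

```c
#include <stdint.h>

/* Low half: the 16-bit field a CONST instruction would carry. */
uint16_t const_field(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}

/* High half: the 16-bit field a CONSTH instruction would carry. */
uint16_t consth_field(uint32_t value)
{
    return (uint16_t)(value >> 16);
}

/* Register contents after CONST followed by CONSTH. */
uint32_t const_then_consth(uint16_t lo, uint16_t hi)
{
    uint32_t reg = lo;                              /* CONST: upper half cleared */
    reg = (reg & 0xFFFFu) | ((uint32_t)hi << 16);   /* CONSTH: replace upper half */
    return reg;
}
```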
2.4 LIBRARY SUPPORT
2.4.1 Memory Allocation
The HIF specification requires that conforming operating systems maintain a
memory heap. An application program can acquire memory during execution by us-
ing the malloc() library routine. This routine makes use of the underlying sysalloc
HIF service. The malloc() call is passed the number of consecutive memory bytes
required; it returns a pointer to the start of the memory allocated from the heap.
Calls to malloc() should be matched with calls to library routine free(). This
routine is passed the start address of the previously allocated memory. The free()
routine is supported by the sysfree HIF service, which is passed the start address
along with the number of bytes acquired.
The HIF specification states “no dynamic memory allocation structure is implied by
this service”. This means the sysfree may do nothing; in fact, this service with OS–
boot (version 0.5) simply returns. Continually using memory without ever releasing
it and thus making it reusable, will be a serious problem for some application pro-
grams, in particular C++ which frequently constructs and destructs objects in heap
memory.
For this reason the library routines which interface to the HIF services perform
their own heap management. The first call to malloc() results in a sysalloc HIF re-
quest for 8k bytes, even if the malloc() call was for only a few bytes. Further malloc()
calls do not result in a sysalloc request until the 8k byte pool is used up. Calls to free()
enable previously allocated memory to be returned to the pool maintained by the li-
brary.
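The pooling behaviour can be sketched as follows. This is not the library source: backing_alloc is a stand–in for the sysalloc HIF service (here it simply calls the host malloc()), all names are hypothetical, and the free list that lets free() return blocks to the pool is omitted for brevity.

```c
#include <stddef.h>
#include <stdlib.h>

enum { POOL_BYTES = 8192 };          /* pool size named in the text */

static unsigned char *pool_base;     /* current pool, NULL until first use */
static size_t pool_used;             /* bytes handed out from the pool */
static int backing_requests;         /* calls made to the backing service */

/* Stand-in for the sysalloc HIF service. */
static void *backing_alloc(size_t nbytes)
{
    backing_requests++;
    return malloc(nbytes);
}

/* Hand out nbytes from the pool, refilling only when it is used up. */
void *pool_alloc(size_t nbytes)
{
    nbytes = (nbytes + 7u) & ~(size_t)7u;       /* keep blocks 8-byte aligned */
    if (nbytes == 0 || nbytes > POOL_BYTES)
        return NULL;                            /* outside this sketch's scope */
    if (pool_base == NULL || pool_used + nbytes > POOL_BYTES) {
        pool_base = backing_alloc(POOL_BYTES);  /* old pool is simply abandoned */
        if (pool_base == NULL)
            return NULL;
        pool_used = 0;
    }
    void *p = pool_base + pool_used;
    pool_used += nbytes;
    return p;
}

int pool_backing_requests(void)
{
    return backing_requests;
}
```

Two small requests are served from a single 8k byte pool; only a request that no longer fits triggers a second call to the backing service.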
The alloca() library routine provides a means of acquiring memory from the
memory stack rather than the heap. A pointer to the memory region within the calling
procedure’s memory stack frame, is returned by alloca(). The advantage of this
method is that there is no need to call a corresponding free routine. The temporary
memory space is automatically freed when the calling procedure returns. Users of the
alloca() service must be careful to remember the limited lifetime of data objects
maintained on the memory stack. After returning from the procedure calling alloca(),
all related data variables cease to exist and should not be referenced.
2.4.2 Setjmp and Longjmp
The setjmp() and longjmp() library routines provide a means to jump from the
current procedure environment to a previous procedure environment. The setjmp()
routine is used to mark the position which a longjmp() will return to. A call to
setjmp() is made by a procedure, passing it a pointer to an environment buffer, as
shown below:
int setjmp(env)
jmp_buf env;
The buffer definition is shown below. It records the value of register stack and
memory stack support registers in use at the time of the setjmp() call. The setjmp()
call returns a value zero.
typedef struct jmp_buf_str
{ int* gr1;
int* msp;
int* lr0;
int* lr1;
} *jmp_buf;
The setjmp() routine is very simple. It is listed below to assist with the under-
standing of the longjmp() routine. It is important to be aware that setjmp(),
longjmp(), SPILL and FILL handlers, along with the signal trampoline code (see
section 2.5.3) form a matched set of routines. Their operation is interdependent. Any
change to one may require changes to the others to ensure proper system operation.
_setjmp: ;lr2 points to buffer
store 0,0,gr1,lr2 ;copy gr1 to buffer
add lr2,lr2,4
store 0,0,msp,lr2 ;copy msp
add lr2,lr2,4
store 0,0,lr0,lr2 ;copy lr0
add lr2,lr2,4
store 0,0,lr1,lr2 ;copy lr1
jmpi lr0 ;return
const gr96,0
When longjmp() is called it is passed a pointer to an environment buffer which
was initialized with a previous setjmp() call. The longjmp() call does not return di-
rectly. It does return, but as if from the corresponding setjmp() call which established the buffer data.
The longjmp() return–as–setjmp() can be distinguished from a setjmp() return as
itself, because the longjmp() appears as a setjmp() return with a non–zero value. In
fact the value parameter passed to longjmp() becomes the setjmp() return value. A
C language outline for the longjmp() routine is shown below:
void longjmp(env, value)
jmp_buf env;
int value;
{
gr1 = env–>gr1;
lr2addr = env–>gr1 + 8;
msp = env–>msp;
/* saved lr1 is invalid if saved lr2address > rfb */
if (lr2addr > rfb) {
/*
* None of the registers are useful.
* Set rfb to lr2address & rab to rfb–WindowSize
* the FILL assert will take care of filling
*/
lr1 = env–>lr1;
rab = lr2addr – WindowSize;
rfb = lr2addr;
}
lr0 = env–>lr0;
if (rfb < lr1)
raise V_FILL;
return value;
}
The actual longjmp() routine code, shown below, is written in assembly lan-
guage. This is because the sequence of modifying the register stack support registers
is very important. An interrupt could occur during the longjmp() operation. That in-
terrupt may require a C language interrupt handler to run. The signal trampoline code
is required to understand all the possible register stack conditions, and fix–up the
stack support registers to enable further C procedure calls to be made.
_longjmp:
load 0,0,tav,lr2 ;gr1 = env–>gr1
add gr97,lr2,4 ;gr97 now points to msp
cpeq gr96,lr3,0 ;test return ”value”, it must
srl gr96,gr96,31 ; be non zero
or gr96,lr3,gr96 ;gr96 has return value
add gr1,tav,0 ; gr1 = env–>gr1;
add tav,tav,8 ;lr2address =env–>gr1+8
load 0,0,msp,gr97 ;msp = env–>msp
cpleu gr99,tav,rfb ;if (lr2address > rfb)
jmpt gr99,$1
;{
add gr97,gr97,4 ;gr97 points to lr0
add gr98,gr97,4 ;gr98 points to lr1
load 0,0,lr1,gr98 ;lr1 = value from jmpbuf
sub gr99,rfb,rab ;gr99 has WindowSize
sub rab,tav,gr99 ;rab = lr2address–WindowSize
add rfb,tav,0 ;rfb = lr2address
$1: ;}
load 0,0,lr0,gr97 ;lr0 = env–>lr0
jmpi lr0 ;return
asgeu V_FILL,rfb,lr1 ;if (rfb < lr1) raise V_FILL;
; may fill from rfb to lr1
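The three instructions which prepare gr96 (CPEQ, SRL, OR) ensure that the value seen at the setjmp() return is never zero: a value of zero is replaced by one. The portable C sketch below models that calculation and demonstrates the behaviour with the standard setjmp.h interface; the function names are hypothetical.

```c
#include <setjmp.h>

static jmp_buf env;

/* Mirrors the CPEQ/SRL/OR sequence: a zero "value" is forced to 1. */
int longjmp_return_value(int value)
{
    return value | (value == 0);
}

/* Returns 1 when control comes back through longjmp(), which happens
 * even if the value passed to longjmp() is zero. */
int came_back_via_longjmp(int value)
{
    if (setjmp(env) != 0)
        return 1;              /* second return, caused by longjmp() */
    longjmp(env, value);       /* does not return here */
    return 0;                  /* not reached */
}
```

Because a direct setjmp() call always returns zero, the non–zero value is what allows the caller to tell the two kinds of return apart.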
2.4.3 Support Libraries
The GNU tool chain is supported with a single library, libc.a. However the High
C 29K tool chain is supported with a range of library options. It is best to use the com-
piler driver, hc29, to select the appropriate library. This avoids having to master the
library naming rules and build linker command files.
The GNU libraries do not support word–sized–access–only memory systems.
Originally, the Am29000 processor could not support byte–sized accesses and all
memory accesses were performed on word sized objects. This required read–
modify–write access sequences to manipulate byte sized objects located in memory.
Because all current 29K processors support byte–size access directly, there is no need
to have specialized libraries for accessing bytes. However, the High C 29K tool chain
still ships the old libraries to support existing (pre–Rev D, 1990) Am29000 proces-
sors.
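The read-modify-write sequence needed on a word-only memory system can be sketched in C as follows. This is only an illustration: the function names are hypothetical, memory is modeled as an array of 32-bit words, and big-endian byte numbering (byte 0 in the most significant position) is assumed.

```c
#include <stdint.h>

/* Store one byte into word-only memory: read the containing word,
   merge the new byte into it, and write the whole word back. */
void store_byte_rmw(uint32_t *mem, uint32_t byte_addr, uint8_t value)
{
    uint32_t index = byte_addr >> 2;                /* word holding the byte */
    uint32_t shift = (3u - (byte_addr & 3u)) * 8u;  /* big-endian lane (assumed) */
    uint32_t word  = mem[index];                    /* read */

    word &= ~(0xFFu << shift);                      /* modify: clear old byte */
    word |= (uint32_t)value << shift;               /*         insert new byte */
    mem[index] = word;                              /* write */
}

/* Load one byte by reading the containing word and extracting the lane. */
uint8_t load_byte_word_only(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t shift = (3u - (byte_addr & 3u)) * 8u;

    return (uint8_t)(mem[byte_addr >> 2] >> shift);
}
```

A processor with direct byte access performs the same store in a single memory operation, which is why the specialized libraries are no longer needed.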
The hc29 driver normally links with three libraries: the ANSI standard C support
library (libs*.lib), the IEEE floating-point routine library (libieee.lib), and the
HIF system call interface library (libos.lib). There are actually eight ANSI libraries,
and the driver selects the appropriate one depending on the selected switches. So
many libraries are needed to support the old word-only memory systems, the option
to talk with an Am29027 coprocessor directly, and the option to select Am29050
processor optimized code.
The ANSI library includes transcendental routines (sin(), cos(), etc.) which
were developed by Kulus Inc. These routines are generally faster than the
transcendental routines developed by QTC Inc., which were at one time shipped with
High C 29K. The QTC transcendentals are still supplied as the libq*.lib libraries, but
must now be explicitly linked. The Kulus transcendentals also have the advantage
that they support both double and single floating-point precision. The routines are
named slightly differently, and the compiler automatically selects the correct routine
depending on parameter type. The GNU libraries (version 2.1) include the QTC
transcendental routines.
Most 29K processors do not support floating-point instructions directly (see
section 3.1.7). When a non-implemented floating-point instruction is encountered,
the processor takes a trap, and operating system routines emulate the operation in
trapware code. If a system has an Am29027 floating-point coprocessor available,
then the trapware can make use of the coprocessor to achieve faster instruction
emulation; this is generally five times faster than software-based emulation.
Keeping the presence of the Am29027 coprocessor hidden in operating system
support trapware enables application programs to be easily moved between systems
with and without a coprocessor.
However, an additional speed-up (about two times) can be achieved by application
programs talking to the Am29027 coprocessor directly, rather than via trapware.
When the High C 29K compiler is used with the “-29027” or “-f027” switches, inline
code is produced for floating-point operations which directly access the coprocessor.
Unfortunately, the compiled code cannot be run on a system which has no
coprocessor. The ANSI standard C support libraries also support inline Am29027
coprocessor access with the libs*7.lib library. When instructed to produce direct
coprocessor access code, the compiler also instructs the linker to use this library in
place of the standard library, libs*0.lib.
The Am29050 processor supports integer multiply directly in hardware rather
than via trapware. It also supports integer divide by converting the operands to
floating-point before dividing and converting the result back to integer. The High C
29K compiler performs integer multiply and divide by using transparent helper
routines (see section 3.7); this is faster than the trapware method used by the GNU
compiler. When the High C 29K compiler is used with the “-29050” switch, or the
GNU compiler with the “-m29050” switch, code optimized for the Am29050
processor is produced. This code may not run on other 29K family members, as the
Am29050 processor has some additional instructions (see sections 3.1.6 and 3.1.7).
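The idea of dividing integers by way of floating-point can be modeled in portable C. The text does not give the actual Am29050 instruction sequence, so the sketch below is only a model of the technique; the function name div_via_double() is hypothetical. A double-precision value holds any 32-bit integer exactly, so both conversions are exact and the truncated quotient matches ordinary integer division.

```c
#include <stdint.h>

/* Signed 32-bit divide performed in double precision, then truncated.
   The caller must ensure b != 0 and must avoid INT32_MIN / -1, exactly
   as for the built-in integer divide. */
int32_t div_via_double(int32_t a, int32_t b)
{
    double q = (double)a / (double)b;  /* operands convert without rounding */

    return (int32_t)q;                 /* conversion truncates toward zero */
}
```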
2.5 C LANGUAGE INTERRUPT HANDLERS
Embedded application code developers typically have to deal with interrupts
from peripheral devices requiring attention. As with general code development there
is a desire to deal with interrupts using C language code rather than assembly lan-
guage code. Compared to CISC type processors, which generally do not have a regis-
ter stack, this is a little more difficult to achieve with the 29K family. In addition, 29K
processors do not have microcode to automatically save their interrupted context.
The interrupt architecture of a 29K processor is very flexible and is dealt with in de-
tail in Chapter 4. This section presents two useful techniques enabling C language
code to be used for interrupts supported by a HIF conforming operating system.
The characteristics of the C handler function are important in determining the
steps which must be taken before the handler can execute. It is desirable that the C
handler run in Freeze mode because this will reduce the overhead costs. These costs
are incurred because interrupts may occur at times when the processor is operating in
a condition not suitable for immediately commencing interrupt processing. Most of
these overheads are concerned with register stack support and are described in detail
in section 4.4. This section deals with establishing an interrupt handler which can run
in Freeze mode. The following section 2.5.3 deals with all other types of C language
interrupt handlers.
A C language interrupt handler qualifies for Freeze mode execution if it meets
all of the following criteria:
• It is a small leaf routine which does not attempt to lower the register stack
  pointer. This means that, should the interrupt have occurred during a critical
  stage in register stack management, the stack need not be brought to a valid
  condition.
• Floating-point instructions not directly supported by the processor are not used.
  Many members of the 29K family emulate floating-point instructions in
  software (see Chapter 3).
• Instructions which may result in a trap are not used. All interrupts and traps are
  disabled while in Freeze mode. This means the Memory Management Unit
  cannot be used for memory access protection and address translation.
• The handler's execution is short. Because the handler is to be run in Freeze mode,
  its execution time will add to the system interrupt latency.
• The handler does not attempt to execute LOADM and STOREM instructions
  while in Freeze mode. When a performance gain can be had, the High C 29K
  compiler will use these instructions to move blocks of data; this does not
  typically happen with short Freeze mode interrupt handlers. However, the High
  C 29K compiler supports the _LOADM_STOREM pragma which can be used
  to turn off or on (default) the use of LOADM and STOREM instructions.
• Transparent procedure calls are not used (see section 3.7). They typically
  require the support of indirect pointers, which are not temporarily saved by the
  code presented in this section.
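The criteria above can be collected into a simple checklist. The structure and function below are hypothetical; they merely record what a programmer would find by examining the handler's assembly code and are not part of any 29K tool or library.

```c
#include <stdbool.h>

/* Hypothetical summary of what inspection of a compiled handler found. */
struct handler_traits {
    bool is_leaf;                 /* makes no procedure calls */
    bool lowers_register_stack;   /* lowers the register stack pointer */
    bool uses_emulated_fp;        /* floating-point emulated in software */
    bool may_trap;                /* e.g. relies on MMU protection/translation */
    bool uses_loadm_storem;       /* LOADM or STOREM appears in the code */
    bool uses_transparent_calls;  /* transparent helper routines are called */
    bool is_short;                /* acceptable addition to interrupt latency */
};

/* Returns true only if every Freeze mode criterion is satisfied. */
bool qualifies_for_freeze_mode(const struct handler_traits *h)
{
    return h->is_leaf
        && !h->lowers_register_stack
        && !h->uses_emulated_fp
        && !h->may_trap
        && !h->uses_loadm_storem
        && !h->uses_transparent_calls
        && h->is_short;
}
```

A handler failing any single test must be installed using the signal method instead.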
The methods shown in this and the following section rely on application code
running with physical addressing; or if the Memory Management Unit is used to per-
form address translation, then virtual addresses are mapped directly to physical ad-
dresses. This is because the macros used to install the Freeze Mode trap handlers are
used to generate code in User mode and thus operate with User mode address values.
However, Freeze mode code runs in Supervisor mode with address translation turned
off.
The Metaware High C 29K and GCC compilers prior to release 3.2 have no C
language extension to aid with interrupt handling. Release 3.2 or newer supports the
key word _Interrupt as a procedure return type. Use of this C language extension
results in additional tag data (see section 3.6) preceding the interrupt handler routine.
Without the interrupt tag data, the only way to determine whether a handler routine
qualifies for Freeze mode handler status is to compile it with the “-S” option and
examine the assembly language code. Handler routines which make function calls
can be eliminated immediately as unsuitable for operation in Freeze mode.
Examining the assembly language code also enables the nregs value used in the
following code to be determined. Small leaf routines operate with global registers
only; nregs is the number of global registers, starting with gr96, used by a C leaf
handler routine.
The interrupt_handler macro defined below can be used to install a C level
interrupt handler which is called when the appropriate trap or interrupt occurs. The
code is written in assembly language because it must use a carefully crafted
instruction sequence. The first part uses the HIF settrap service to install, in the
processor vector table, the address ($1) which will be vectored to when the interrupt
occurs. The necessary code is written as a macro rather than a procedure call because
the second part of the macro contains the start of the actual interrupt handler code.
This code, starting at address $1, is unique to each interrupt and cannot be shared.
Note that the code makes use of push and pop macro instructions to transfer data
between registers and the memory stack. These assembly macros are described in
section 3.3.1.
.reg it0,gr64;freeze mode interrupt
.reg it1,gr65;temporary registers
;install interrupt handler
.macro interrupt_handler, trap_number, C_handler, nregs
sub gr1,gr1,4*4 ;get lr0–lr3 space
asgeu V_SPILL,gr1,rab ;check for stack spill
add lr1,gr121,0 ;save gr121
add lr0,gr96,0 ;save gr96
const gr121,290 ;HIF 2.0 SETTRAP service
const lr2,trap_number ;trap number, macro parameter
const lr3,$1 ;trap handler address
consth lr3,$1
asneq 69,gr1,gr1 ;HIF service request
add gr121,lr1,0 ;restore gr121
add gr96,lr0,0 ;restore gr96
add gr1,gr1,4*4 ;restore stack
jmp $2 ;macro code finished
asleu V_FILL,lr1,rfb ;check for stack fill
$1: push msp,lr0 ;start of Interrupt handler
pushsr msp,it1,ipa ;save special reg. ipa
const it0,nregs-2 ;number of regs. to save
const it1,96<<2 ;starting with gr96
$3: mtsr ipa,it1
add it1,it1,1<<2 ;increment ipa
sub msp,msp,4 ;decrement stack pointer
jmpfdec it0,$3
store 0,0,gr0,msp ;save global registers
;
const lr0,C_handler
consth lr0,C_handler
calli lr0,lr0 ;call C level handler
nop
;
const it0,nregs–2 ;number of global registers
const it1,(96+nregs–1)<<2
$4: mtsr ipa,it1
load 0,0,gr0,msp ;restore global register
sub it1,it1,1<<2 ;decrement ipa
jmpfdec it0,$4
add msp,msp,4 ;increment stack pointer
popsr ipa,it0,msp ;restore ipa
pop lr0,msp ;restore lr0
iret
$2:
.endm
Because the C level handler is intended to run in Freeze mode, there is very little
code before the required handler, C_handler, is called. Registers lr0 and IPA are
saved on the memory stack before they are temporarily used. Then the required num-
ber of global registers (nregs) starting with gr96 are also saved on the stack. The pro-
grammer must determine the nregs value by examining the handler routine assembly
code.
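The save and restore loops in the macro can be modeled in C to check the loop count. The sketch below is an illustration only: the register file and memory stack are modeled as arrays, the stack pointer is an array index rather than a byte address, and the function names are hypothetical. It relies on the JMPFDEC behavior of branching while the counter is non-negative and then decrementing it, which is why the counter starts at nregs-2.

```c
#include <stdint.h>

#define NUM_GLOBALS 128

/* Save nregs (>= 1) global registers, starting with gr96, on a descending
   stack. Returns the new stack index. */
int save_globals(const uint32_t gr[NUM_GLOBALS], uint32_t *stack,
                 int msp, int nregs)
{
    int it0 = nregs - 2;       /* const it0,nregs-2 */
    int ipa = 96;              /* const it1,96<<2 (as a register number) */

    do {
        msp -= 1;              /* sub msp,msp,4 */
        stack[msp] = gr[ipa];  /* store 0,0,gr0,msp (register chosen by IPA) */
        ipa += 1;              /* add it1,it1,1<<2 */
    } while (it0-- >= 0);      /* jmpfdec: branch if non-negative, decrement */
    return msp;
}

/* Restore the same registers in the opposite order. */
int restore_globals(uint32_t gr[NUM_GLOBALS], const uint32_t *stack,
                    int msp, int nregs)
{
    int it0 = nregs - 2;       /* const it0,nregs-2 */
    int ipa = 96 + nregs - 1;  /* const it1,(96+nregs-1)<<2 */

    do {
        gr[ipa] = stack[msp];  /* load 0,0,gr0,msp */
        ipa -= 1;              /* sub it1,it1,1<<2 */
        msp += 1;              /* add msp,msp,4 (delay slot) */
    } while (it0-- >= 0);
    return msp;
}
```

Each loop body executes exactly nregs times, so the memory stack pointer returns to its original value after the restore.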
The interrupt_handler macro must be used in an assembly language module.
Alternatively, a C language compiler extension can be used. The High C 29K compil-
er supports an extension which enables assembly code to be directly inserted into C
code modules. This enables a C macro to be defined which will call upon the assem-
bly language macro code. The example code below shows the C macro definition.
#define interrupt_handler(trap_number, C_handler, nregs) \
/*int trap_number; \
void (*C_handler)(); \
int nregs; */ \
_ASM(" interrupt_handler "#trap_number","#C_handler","#nregs);
Alternatively the C macro could contain the assembly macro code directly. Us-
ing the technique shown, C modules which use the macro must be first compiled with
the “–S” option; this results in an assembly language output file. The assembly lan-
guage file (.s file) is then assembled with an include file which contains the macro
definition. Note, C modules which use the macro must use the _ASM(“assembly–
string”) C extension to include the assembly language macro file (shown below) for
its later use by the assembler. The GCC compiler supports the asm(“assembly–
string”) C extension which achieves the same result as the High C 29K _ASM(“as-
sembly–string”) extension.
_ASM(" .include \"interrupt_macros.h\"");
/* int2_handler uses 8 regs. and is called
when hardware trap number 18 occurs */
interrupt_handler(18,int2_handler,8);
2.5.1 An Interrupt Context Cache with High C 29K
The interrupt_handler macro code, described in the previous section, pre-
pares the processor to handle a C language interrupt handler which can operate within
the processor Freeze mode restrictions. The code saves the interrupted processor
context onto the current memory stack position before calling the C handler.
The interrupt_cache macro shown below can be used in place of the previously
described macro. Its use is also restricted to preparing the processor to handle a C
level handler which meets the Freeze mode execution criteria. However, its operation
is considerably faster due to the use of an Interrupt Context Cache. Section 4.3.9
describes context caching in more detail. A cache is used here only to save sufficient
context to enable a non-interruptible C level handler to execute.
The cache is implemented using operating system registers gr64-gr80. Registers
gr64-gr79 (also known as it0-it3 and kt0-kt11) are considered operating system
temporaries. Register gr80 (known as ks0) is generally used to hold operating system
static data (see section 3.3). Processors which do not directly support floating-point
operations rely on instruction emulation software (trapware) which normally uses
registers in the gr64-gr79 range. Because application code can perform a
floating-point operation at any time, an operating system cannot assume that the
contents of these registers remain static after application code has run. For this
reason, and because floating-point trapware normally runs with interrupts turned off,
it is convenient to use these registers for interrupted context caching.
The interrupt_handler macro uses a loop to preserve the global registers used
by the Freeze mode interrupt handler. The interrupt_cache macro unrolls the loop
and uses register–to–register operations rather than register–to–memory. In place of
traversing the loop nregs times, the nregs value is used to determine the required
entry point to the unrolled code. These techniques reduce interrupt preparation times
and interrupt latency.
.macro interrupt_cache, trap_number, C_handler, nregs
sub gr1,gr1,4*4 ;get lr0–lr3 space
asgeu V_SPILL,gr1,rab ;check for stack spill
add lr1,gr121,0 ;save gr121
add lr0,gr96,0 ;save gr96
const gr121,290 ;HIF 2.0 SETTRAP service
const lr2,trap_number ;trap number, macro parameter
const lr3,$1–(nregs*4) ;trap handler address
consth lr3,$1–(nregs*4)
asneq 69,gr1,gr1 ;HIF service request
add gr121,lr1,0 ;restore gr121
add gr96,lr0,0 ;restore gr96
add gr1,gr1,4*4 ;restore stack
jmp $2 ;macro code finished
asleu V_FILL,lr1,rfb ;check for stack fill
add gr80,gr111,0 ;save gr111 to interrupt
add gr79,gr110,0 ; context cache
add gr78,gr109,0
add gr77,gr108,0 ;the interrupt handler starts
add gr76,gr107,0 ;somewhere in this code range
add gr75,gr106,0 ;depending on the register
add gr74,gr105,0 ;usage of the C level code
add gr73,gr104,0
add gr72,gr103,0
add gr71,gr102,0
add gr70,gr101,0
add gr69,gr100,0
add gr68,gr99,0
add gr67,gr98,0
add gr66,gr97,0 ;save gr97
add gr64,lr0,0 ;save lr0
$1:
const lr0,C_handler
consth lr0,C_handler
calli lr0,lr0 ;call C level handler
add gr65,gr96,0 ;save gr96
;
jmp $2–4–(nregs*4) ;determine registers used
add lr0,gr64,0 ;restore lr0
add gr111,gr80,0 ;restore gr111 from interrupt
add gr110,gr79,0 ; context cache
add gr109,gr78,0
add gr108,gr77,0
add gr107,gr76,0
add gr106,gr75,0
add gr105,gr74,0
add gr104,gr73,0
add gr103,gr72,0
add gr102,gr71,0
add gr101,gr70,0
add gr100,gr69,0
add gr99,gr68,0
add gr98,gr67,0
add gr97,gr66,0
add gr96,gr65,0 ;restore gr96
iret
$2:
.endm
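The entry-point arithmetic used by the macro can be expressed in C. The helper names below are hypothetical and exist only to make the address calculation and the register mapping explicit; they are derived from the macro listing above.

```c
#include <stdint.h>

/* Byte distance from label $1 back to the first instruction executed.
   nregs instructions precede $1 on the chosen path: the lr0 save plus one
   save for each of gr97..gr(96+nregs-1). gr96 itself is saved in the
   delay slot of the calli instruction. Valid for 1 <= nregs <= 16. */
uint32_t cache_entry_offset(unsigned nregs)
{
    return nregs * 4u;          /* each instruction occupies 4 bytes */
}

/* Cache register holding global register gr (96..111) during the handler. */
unsigned cache_register_for(unsigned gr)
{
    return gr - 31u;            /* gr96 -> gr65, gr97 -> gr66, ..., gr111 -> gr80 */
}
```

With nregs equal to 1 the handler is entered at the lr0 save, 4 bytes before $1; with nregs equal to 16 it is entered at the gr111 save, 64 bytes before $1.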
2.5.2 An Interrupt Context Cache with GNU
The previous section presented interrupt context caching when using the
Metaware High C 29K compiler. Global register assignment with the Free Software
Foundation compiler, GCC, is very different from High C 29K. GCC is very frugal
in its use of global registers: gr96-gr111 are little used, except for return values, and
it mainly uses global registers gr116-gr120. High C 29K, in contrast, uses global
registers in the gr96-gr111 range as temporaries before starting to use gr116-gr120.
This difference affects the interrupt preparation code required for Freeze mode C
level handlers. The reduced use of global registers might make GCC a better choice
for building Freeze mode C-level interrupt handlers.
The assembler, as29, supplied with the GCC compiler chain does not support
macros directly. But it is possible to use the C preprocessor, CPP, to do macro instruc-
tion expansion. The interrupt_cache macro shown below demonstrates the use of
CPP with 29K assembly code. The macro is used to install a C handler for the selected
trap_number. The early part of the macro code requests the HIF settrap service be
used to insert the interrupt handler address into the processor vector table. The actual
address inserted depends on the register usage of the C handler.
The handler must be examined to determine the registers used. Parameter nregs
is used to specify the number of registers used in the gr116–gr120 range. The handler
preparation code saves the necessary global registers in an interrupt context cache
before calling the C code. Global registers gr96–gr111 are not saved in the cache, as it
is likely that they are not used by the handler –– it certainly has no return value.
The context cache is formed with global registers gr64–gr80. Registers
gr64–gr79 are used by floating–point emulation routines, and hence their contents
are available for use between floating–point trap instructions. This assumes that the
trapware runs with interrupts turned off which is normally the case. For more details
see section 2.5. Saving the registers used by the handler in this way is much faster
than pushing the registers onto an off–chip memory stack.
#define interrupt_cache(trap_number, C_handler, nregs)\
;start of interrupt_cache macro, nregs must be >=1 _CR_\
nop ;delay slot protection _CR_\
sub gr1,gr1,4*4 ;get lr0–lr3 space _CR_\
asgeu V_SPILL,gr1,rab ;check for stack spill _CR_\
add lr1,gr121,0 ;save gr121 _CR_\
add lr0,gr96,0 ;save gr96 _CR_\
const gr121,290 ;HIF 2.0 SETTRAP service _CR_\
const lr2,trap_number ;trap number, macro parameter_CR_\
const lr3,cache_##trap_number–(nregs*4) ;handler adds._CR_\
consth lr3,cache_##trap_number–(nregs*4) ; _CR_\
asneq 69,gr1,gr1 ;HIF service request _CR_\
add gr121,lr1,0 ;restore gr121 _CR_\
add gr96,lr0,0 ;restore gr96 _CR_\
add gr1,gr1,4*4 ;restore stack _CR_\
jmp cache_end_##trap_number ;install code finished _CR_\
asleu V_FILL,lr1,rfb ;check for stack fill _CR_\
;START of interrupt handler code_CR_\
add gr70,gr120,0 ;save gr120 _CR_\
add gr69,gr119,0 ;save gr119 _CR_\
add gr68,gr118,0 ;save gr118 _CR_\
add gr67,gr117,0 ;save gr117 _CR_\
add gr64,lr0,0 ;save lr0 _CR_\
cache_##trap_number: ;gr96–gr111 not saved in cache _CR_\
const lr0,C_handler ;call C–level handler_CR_\
consth lr0,C_handler ; _CR_\
calli lr0,lr0 ;call C level handler _CR_\
add gr66,gr116,0 ;save gr116 _CR_\
; _CR_\
jmp cache_end_##trap_number-4-(nregs*4) ;determine registers used _CR_\
add lr0,gr64,0 ;restore lr0 _CR_\
add gr120,gr70,0 ;restore gr120 from cache _CR_\
add gr119,gr69,0 ; _CR_\
add gr118,gr68,0 ; _CR_\
add gr117,gr67,0 ; _CR_\
add gr116,gr66,0 ; _CR_\
iret ; _CR_\
cache_end_##trap_number: ;end of interrupt cache macro _CR_
The code example below shows how the macro can be invoked. The routine
install_handlers() is written in assembly code. It includes a macro for a C level inter-
rupt handler, int2_handler(), assigned to 29K interrupt INTR2. The C level handler
was examined and found to be a qualifying leaf routine using only two global regis-
ters.
.text
.extern _int2_handler
.global _install_handlers
_install_handlers:
sub gr1,gr1,2*4 ;prologue not really needed
asgeu V_SPILL,gr1,gr126 ;lower stack pointer
interrupt_cache(18,_int2_handler,2) ;macro example
add gr1,gr1,2*4 ;raise stack pointer
constn gr96,–1 ;return TRUE value
jmpi lr0 ;return
asleu V_FILL,lr1,rfb ;procedure epilogue
The C preprocessor is invoked with the app shell script program shown below.
This is a convenient way of directing CPP to process an assembly program source
file. The use of CPP has one problem; macros are expanded into long lines. The car-
riage returns in the macro source file do not appear in the expanded code. To reinsert
the carriage returns and make the assembly code lines compatible with assembler
syntax, each assembly line in the macro is marked with the token _CR_. The UNIX
stream editor, sed, is then used to replace the _CR_ with a carriage return.
#!/bin/sh
#start of app shell script
#example, "app file_in.s"
prams=$*
tmp=/tmp/expand.$$
cpp -P $prams > $tmp #invoke CPP
sed 's/_CR_/\
/g' $tmp
rm $tmp
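The substitution performed by sed in the script above is simple enough to sketch in C, which may help where sed is not available. The function name expand_cr() is hypothetical; it copies a string and replaces each _CR_ token with a newline character.

```c
#include <string.h>

/* Copy src to dst, replacing every "_CR_" token with a newline.
   dst must have room for at least strlen(src) + 1 bytes, because the
   output is never longer than the input. */
void expand_cr(const char *src, char *dst)
{
    while (*src != '\0') {
        if (strncmp(src, "_CR_", 4) == 0) {
            *dst++ = '\n';      /* token becomes a line break */
            src += 4;
        } else {
            *dst++ = *src++;    /* everything else is copied unchanged */
        }
    }
    *dst = '\0';
}
```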
2.5.3 Using Signals to Deal with Interrupts
Some C language interrupt handlers will not be able to run in Freeze mode
because (as described in section 2.5) they are unsuitable leaf routines, or are not leaf
routines and thus require use of the register stack. In this case the signal trampoline
code described in section 4.4 and Appendix B must be used. The trampoline code is
called by the Freeze mode interrupt handler after critical registers have been saved on
the memory stack. The C language handler is called by the trampoline code after the
register stack is prepared for further use. Note that interrupts can occur at times when
the register stack condition is not immediately usable by a C language handler.
The signal mechanism works by registering a signal handler function address
for use when a particular signal number occurs. This is done with the library routine
signal(). Signals are normally generated by abnormal events, and the signal() routine
registers a user-supplied routine which the operating system will call to deal with
the event. The signal() function uses the signal HIF service to supply the address
of a library routine (sigcode) which will be called for all signals generated. (Note,
only the signal, settrap and sigret-type subset of HIF services are required.) The
library routine is then responsible for calling the appropriate C handler from a table
of handlers indexed by the signal number. When signal() is used, a table entry is
constructed for the indicated signal.
signal(sig_number, func)
int sig_number;
void (*func)(sig_number);
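The table-of-handlers arrangement described above can be sketched in C. This is not the library's actual implementation: the names register_handler(), dispatch_signal(), and example_handler(), and the table size, are assumptions made for illustration.

```c
#include <stddef.h>

#define MAX_SIGNALS 32                  /* assumed table size */

typedef void (*sig_handler_t)(int);

static sig_handler_t handler_table[MAX_SIGNALS];  /* NULL: no handler */

/* Plays the role of signal(): record func for sig_number and return the
   previous entry (NULL if there was none or the number is out of range). */
sig_handler_t register_handler(int sig_number, sig_handler_t func)
{
    sig_handler_t old;

    if (sig_number < 0 || sig_number >= MAX_SIGNALS)
        return NULL;
    old = handler_table[sig_number];
    handler_table[sig_number] = func;
    return old;
}

/* Plays the role of the common library routine called for every signal.
   Returns 1 if a registered handler ran, 0 if default action is needed. */
int dispatch_signal(int sig_number)
{
    if (sig_number < 0 || sig_number >= MAX_SIGNALS
        || handler_table[sig_number] == NULL)
        return 0;
    handler_table[sig_number](sig_number);  /* handler receives the number */
    return 1;
}

/* Example handler that simply records which signal it was called for. */
int last_signal = -1;
void example_handler(int sig_number)
{
    last_signal = sig_number;
}
```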
A signal can only be generated for an interrupt if the code vectored to by the in-
terrupt calls the shared library routine known as the trampoline code. It is known as
the trampoline code because signals bounce from this code to the registered signal
handler. To ensure that the trampoline code is called when an interrupt occurs, the
Freeze mode code vectored to by the interrupt must pass execution to the trampoline
code, indicating the signal which has occurred. The signal_associate macro shown
below can be used to install the Freeze Mode code and associate a signal number with
the interrupt or trap hardware number.
.reg it0,gr64;freeze mode interrupt
.reg it1,gr65;temporary registers
.macro signal_associate, trap_number, sig_number
sub gr1,gr1,4*4 ;get lr0–lr3 space
asgeu V_SPILL,gr1,rab ;check for stack spill
add lr1,gr121,0 ;save gr121
add lr0,gr96,0 ;save gr96
const gr121,290 ;HIF 2.0 SETTRAP service
const lr2,trap_number ;trap number, macro parameter
const lr3,$1 ;trap handler address
consth lr3,$1
asneq 69,gr1,gr1 ;HIF service request
add gr121,lr1,0 ;restore gr121
add gr96,lr0,0 ;restore gr96
add gr1,gr1,4*4 ;restore stack
jmp $2 ;macro code finished
asleu V_FILL,lr1,rfb ;check for stack fill
$1: const it0,sig_number ;start of Interrupt handler
push msp,it0 ;push sig_number on
push msp,gr1 ; interrupt context frame.
push msp,rab ;use push macro,
const it0,512 ; see section 3.3.1
sub rab,rfb,it0 ;set rab = rfb–WindowSize
;
pushsr msp,it0,pc0 ;push special registers
pushsr msp,it0,pc1
pushsr msp,it0,pc2
pushsr msp,it0,cha
pushsr msp,it0,chd
pushsr msp,it0,chc
pushsr msp,it0,alu
pushsr msp,it0,ops
push msp,tav ;push tav (gr121)
; set DI in CPS, but timer
mfsr it0,ops ; interrupts are still on
or it0,it0,0x2 ;this disables interrupts
mtsr ops,it0 ; in signal trampoline code
;
mtsrim chc,0 ;the trampoline code is
const it1,RegSigHand ; described in section 4.4.1
consth it1,RegSigHand ;RegSigHand is a library
load 0,0,it1,it1 ; variable
cpeq it0,it1,0 ;test for no handler
jmpt it0,SigDfl ;jump if no handler(s)
add it0,it1,4 ;it1 has trampoline address
mtsr pc1,it1 ;IRET to signal
mtsr pc0,it0 ; trampoline code
iret
$2: ;end of macro
.endm
The above macro code does not disable the interrupt from the requesting device.
Removing the request is necessary for external interrupts; reenabling interrupts
without first having removed the current interrupt request will cause the interrupt to
be taken again immediately. The code sets the DI-bit in the OPS special register; this
means interrupts will remain disabled in the trampoline code. It is the responsibility
of the C language handler to clear the interrupt request; this may require accessing an
off-chip peripheral device. An alternative is to clear the interrupt request in the above
Freeze mode code and not set the DI-bit in the OPS. This would enable the
trampoline and C language handler code to execute with interrupts enabled, which
leads to the possibility of nested signal events; however, the signal trampoline code
is able to deal with such complex events.
With the example signal_associate macro the trampoline code and the C
handler run in the processor mode in effect at the time the interrupt occurred. They
can be forced to run in Supervisor mode by setting the Supervisor mode bit (SM-bit)
when OR-ing the DI-bit in the OPS register. Supervisor mode may be required to
enable accessing of the interrupting device when disabling the interrupt request. The
address translation bits (PI and PD) may also be set at this time to turn off virtual
addressing during interrupt processing. To make these changes to the above example
code, the value 0x72 should be OR-ed with the OPS register rather than the 0x2
value shown.
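The 0x72 value can be decomposed into individual status bits. The sketch below assumes the usual processor status bit assignments (DI in bit 1, SM in bit 4, and the two physical addressing bits, called PI and PD here, in bits 5 and 6); these positions are consistent with the 0x2 and 0x72 values quoted in the text but should be checked against the processor manual. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed OPS/CPS bit assignments; verify against the processor manual. */
#define CPS_DI 0x02u   /* disable interrupts */
#define CPS_SM 0x10u   /* Supervisor mode */
#define CPS_PI 0x20u   /* physical addressing for instructions */
#define CPS_PD 0x40u   /* physical addressing for data */

/* Compute the OPS value the Freeze mode code would write back.
   supervisor_physical selects the 0x72 variant described in the text. */
uint32_t ops_for_trampoline(uint32_t ops, int supervisor_physical)
{
    uint32_t mask = CPS_DI;                 /* the 0x2 case in the macro */

    if (supervisor_physical)
        mask |= CPS_SM | CPS_PI | CPS_PD;   /* together with DI: 0x72 */
    return ops | mask;
}
```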
As described in section 2.5, a C language macro can be used to access the assem-
bly level macro instruction. When the High C 29K compiler is being used, the defini-
tion of the C macro is shown below. Users of the GCC compiler should replace the
_ASM() call with the equivalent asm() C language extension.
#define signal_associate(trap_number, sig_number) \
/*int trap_number; \
int sig_number; */ \
_ASM(" signal_associate "#trap_number","#sig_number);
When the macro is used to associate a signal number with a processor trap num-
ber, it is also necessary to supply the address of the C language signal handler called
when the signal occurs. The following example associates trap number 18 (floating–
point exception) with signal number 8. This signal is known to UNIX and HIF users
as SIGFPE; when it occurs, the C handler sigfpe_handler is called.
_ASM(" .include \"interrupt_macros.h\"");
signal_associate(18,8); /* trap 18, F-P */
signal(8,sigfpe_handler); /* signal 8 handler */
C language signal handlers are free of many of the restrictions which apply to
Freeze mode interrupt handlers. However, the HIF specification still restricts their
operation to some extent. Signal handlers can only use HIF services with service
numbers greater than 256. This means that printf() cannot be used. The reason for
this is HIF services below 256 are not reentrant, and a signal may occur while just
such a HIF service request was being processed. Return from the signal handler must
be via one of the signal return services: sigdfl, sigret, sigrep or sigskp. If the signal
handler simply returns, the trampoline code will issue a sigdfl service request on be-
half of the signal handler.
A single C level signal routine can be used to dispatch several C language inter-
rupt handlers. Section 4.3.12 describes an interrupt queuing method, where interrupt
handlers run in Freeze mode and build an interrupt descriptor (bead). Each descriptor
is placed in a list (string of beads) and a Dispatcher routine is used to process descrip-
tors. The signal handling method described above can be used to register a C level
Dispatcher routine. This results in C level context being prepared only once and the
Dispatcher routine calling the appropriate C handler.
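The dispatching arrangement can be pictured with a short C sketch. It is not the method of section 4.3.12; the structure layout and all names are hypothetical, and the sketch ignores the interrupt masking a real system needs while the list is being modified.

```c
#include <stddef.h>

/* Hypothetical interrupt descriptor ("bead"). */
struct bead {
    struct bead *next;
    void (*handler)(struct bead *);  /* C handler for this interrupt */
    int data;                        /* e.g. a value read from the device */
};

struct bead_string {                 /* the list ("string of beads") */
    struct bead *head;
    struct bead *tail;
};

/* Performed by the Freeze mode code: queue a descriptor at the tail. */
void string_append(struct bead_string *s, struct bead *b)
{
    b->next = NULL;
    if (s->tail != NULL)
        s->tail->next = b;
    else
        s->head = b;
    s->tail = b;
}

/* Registered as the signal handler: C level context is prepared once,
   then every queued descriptor is processed. Returns the number handled. */
int dispatcher(struct bead_string *s)
{
    int handled = 0;

    while (s->head != NULL) {
        struct bead *b = s->head;

        s->head = b->next;
        if (s->head == NULL)
            s->tail = NULL;
        b->handler(b);
        handled++;
    }
    return handled;
}

/* Example C handler that accumulates the data carried by each bead. */
int bead_total = 0;
void sum_handler(struct bead *b)
{
    bead_total += b->data;
}
```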
2.5.4 Interrupt Tag Words
Release 3.2, or newer, of the High C 29K compiler supports routines of defined
return–type _Interrupt. The use of this C language extension causes an additional tag
word to be placed ahead of the procedure code. Section 3.6 explains the format of the
interrupt tag in detail. Note, to use the _Interrupt key word with a PC hosted compiler,
it is necessary to add the line “#define _Interrupt _CC(_INTERRUPT)” to file
29k/bin/hc29.pro. The interrupt key word in conjunction with some simple support
routines presented below make optimizing of interrupt preparation very easy. By
examining the interrupt tag word it is possible to determine if a handler routine
qualifies for Freeze mode execution or will require HIF signal processing. The
example code shown below is for a HIF conforming operating system. However, a
different operating system may choose to respond to interrupt tag information in a
somewhat different manner. Only the signal, settrap and sigret–type subset of HIF
services are required. A different operating system may have equivalent support
services.
When an interrupt occurs, it would be possible to examine the interrupt tag word
of the assigned handler. However, this would be an overhead encountered at each
interrupt and it would increase interrupt processing time. It is better to examine the
tag word at interrupt installation time and determine the necessary interrupt
preparation code. Preceding sections have described interrupt context caching and
signal processing. It would be possible to examine the tag word in more detail than
the following example code undertakes. This would produce additional intermediate
performance points in the spectrum of interrupt preparation code; context caching
being the fastest point on the spectrum and signal processing the slowest. However,
signal processing can always be used and is free of the restrictions which apply to the
use of interrupt context caching, and context caching is frequently adequate. These
two spectrum points are therefore the most practical choices.
The example below shows two C language interrupt handler routines. The first,
f_handler(), looks like it will qualify for Freeze mode execution. The key word
_Interrupt has been used during the procedure definition and this will result in an
interrupt tag word. The second function, s_handler(), is not a leaf procedure and this
fact will be reported in its interrupt tag word. Being a non leaf routine, it will be
processed as a signal handler. Such routines receive a single parameter –– the signal
number.
extern int sig_sig; /* defined in library code */
extern int sig_intr0; /* signal for INTR0 */
extern char *uart_p; /* pointer to UART */
char recv_data[50];
_Interrupt f_handler() /* Freeze mode handler */
{
    static int count=0;
    recv_data[count]=*uart_p; /* read from UART */
    if(recv_data[count]=='\n') /* test for end */
    {   sig_sig=sig_intr0; /* signal #30 */
        count=0; /* reset counter */
    }
    else count++;
}
_Interrupt s_handler(sig_number) /* signal handler */
int sig_number; /* for sig_intr0 */
{
    printf("in signal handler number=%d\n", sig_number);
    printf("received string=%s\n", recv_data);
    _sigret();
}
Most programmers do not want to become concerned with the details of
interrupt preparation. They simply wish to call an operating system service routine
which will examine the interrupt tag word and select the appropriate interrupt
preparation code. The library procedure, interrupt(), shown below, is just such a
service routine. The operation of this procedure will be described a little later. The
procedure ensures that either interrupt context caching or signal processing will be
applied for the supplied handler and selected 29K trap number. The interrupt()
routine must be executed during the system initialization stage, before traps or
interrupts are expected to occur. An example initialization sequence is shown below:
int sig_intr0;
main()
{
. . .
sig_intr0=interrupt(16,s_handler); /* INTR0 */
interrupt(17,f_handler); /* INTR1 */
. . .
Interrupt tag words are dealt with at interrupt installation time, and not at
program assembly or link time. There have been discussions about adding a compiler
pragma option to High C 29K release 4.0 which, when switched on, will cause a
macro instruction to be placed in output assembly code rather than an interrupt tag
word. This requires that the relevant C code be compiled, then assembled with an
include file which defines the replacement code for the interrupt macro instruction.
This technique has some disadvantages, principally that the macro must understand
the capabilities of the operating system and how it intends to deal with interrupts. In
particular, it must know whether the interrupt should be processed in User or
Supervisor mode, with interrupts enabled or disabled, with or without address
translation, and so on. Use of a
macro does have the advantage that the interrupt preparation code appears in the final
linked program image. The tag word method relies on the preparation code being
built in heap memory during interrupt installation. The preparation code is built in
consultation with the operating system and is thus more portable between different
operating systems which support somewhat different interrupt processing
environments.
Fortunately for the user, library routines are responsible for installing the
executable code into heap memory. The code itself is similar to the code of previous
sections. A portion of the code is linked into text space of the program image. At
installation time the code is copied into heap memory and further optimized. The
code sequence below is used for interrupt context caching.
.text
.align 4
.global _interrupt_cache_code
.global _interrupt_cache_end
.extern _sig_sig
_interrupt_cache_code:
add gr80,gr111,0 ;save gr111 to interrupt
add gr79,gr110,0 ; context cache
add gr78,gr109,0
add gr77,gr108,0
add gr76,gr107,0
add gr75,gr106,0
add gr74,gr105,0
add gr73,gr104,0
add gr72,gr103,0
add gr71,gr102,0
add gr70,gr101,0
add gr69,gr100,0
add gr68,gr99,0
add gr67,gr98,0
add gr66,gr97,0
add gr64,lr0,0 ;save lr0
;
const lr0,0 ;const and consth
consth lr0,0 ; need to be modified
calli lr0,lr0 ;call C handler
add gr65,gr96,0
;
add gr111,gr80,0 ;restore gr111 from
add gr110,gr79,0 ; context cache
add gr109,gr78,0
add gr108,gr77,0
add gr107,gr76,0
add gr106,gr75,0
add gr105,gr74,0
add gr104,gr73,0
add gr103,gr72,0
add gr102,gr71,0
add gr101,gr70,0
add gr100,gr69,0
add gr99,gr68,0
add gr98,gr67,0
add gr97,gr66,0
add gr96,gr65,0
add lr0,gr64,0 ;restore lr0
const gr64,_sig_sig ;the following eight
consth gr64,_sig_sig ; instructions deal with
load 0,0,gr64,gr64 ; sig_sig testing
cpeq gr65,gr64,0 ;test for zero
const gr66,_signal_associate_code + 4 ;no relative
consth gr66,_signal_associate_code + 4 ; addressing
jmpfi gr65,gr66 ;jump if sig_sig != 0
nop
iret
_interrupt_cache_end:
The context cache code is a little different from the code shown in section 2.5.1.
Eight extra instructions have been added to support a memory variable called sig_sig.
It supports a very useful technique of two–level interrupt processing. Predominantly
a Freeze mode interrupt handler is used alone. However, when the sig_sig variable is
set to a signal number before the Freeze mode handler completes, a signal is
generated causing a second signal handler routine to execute after the Freeze mode
handler returns.
Examine the example handler routines. When interrupt INTR1 (vector 17)
occurs, the Freeze mode handler, f_handler(), normally accesses the interrupting
UART and receives a character; it then increments the count value and returns. The
process of accessing the UART causes the interrupt request to be deasserted. This
results in a very fast interrupt handler written in C. However, when the received
character is a ‘\n’ (newline), sig_sig is set to the signal number allocated to the
INTR0 signal handler. This causes the s_handler() to be executed in response to the
signal. The occurrence of interrupt INTR0 (vector 16) also causes s_handler() to
execute as a signal handler associated with trap 16. The example interrupt() service
automatically allocates signal numbers, starting with SIGUSR1, to handler routines
which are to be processed via signal trampoline code. The interrupt() procedure
returns the selected signal number; zero is returned if a Freeze mode handler is
selected. An interrupt handler can be restricted to fast Freeze mode processing and
when more extensive processing is required the sig_sig variable can be set and a
second level handler invoked. (Note, the s_handler() routine calls the printf()
library routine. This is not permitted with the High C 29K library routines as the
printf() routine is not reentrant. However, the use of printf() helps illustrate the
two–stage principle.)
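The two–level idea can be tried out on a host machine before moving to target hardware. The following is only a minimal sketch under stated assumptions: model_interrupt() is a hypothetical stand-in for the interrupt preparation code, SIG_INTR0 is an assumed signal number, and the handlers are ordinary C functions rather than _Interrupt routines. Unlike the listing above, the model also bounds the buffer and terminates the string so that the result can be inspected.

```c
#include <assert.h>
#include <string.h>

#define SIG_INTR0 30                 /* assumed signal number (SIGUSR1 in the text) */

static int  sig_sig = 0;             /* request word set by the first-level handler */
static char recv_data[50];           /* characters collected so far */
static char last_string[50];         /* copy taken by the second-level handler */
static int  second_level_calls = 0;

/* First level: store one character; request the second level at end of line
   (or when the buffer is about to fill, a safety check added for the model). */
static void f_handler(char from_uart)
{
    static int count = 0;
    recv_data[count] = from_uart;
    if (from_uart == '\n' || count == (int)sizeof recv_data - 2) {
        recv_data[count + 1] = '\0'; /* terminate so the string can be copied */
        sig_sig = SIG_INTR0;
        count = 0;
    } else {
        count++;
    }
}

/* Second level: runs only when the first level has requested it. */
static void s_handler(int sig_number)
{
    assert(sig_number == SIG_INTR0);
    second_level_calls++;
    strcpy(last_string, recv_data);
}

/* Hypothetical model of the preparation code: run the first-level handler,
   then test sig_sig and, if it is set, clear it and run the second level. */
void model_interrupt(char from_uart)
{
    f_handler(from_uart);
    if (sig_sig != 0) {
        int sig = sig_sig;
        sig_sig = 0;
        s_handler(sig);
    }
}

int second_level_count(void)     { return second_level_calls; }
const char *last_received(void)  { return last_string; }
```

Feeding the model one character per call shows that the second-level handler stays idle until a newline arrives, which is the behavior the sig_sig test at the end of the context cache code is intended to provide.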
To perform signal processing, the trampoline code shown below is placed in
heap memory. It is similar to the code of section 2.5.3. Interrupts are disabled before
signal processing commences; this is not necessary if a Freeze mode handler has
already requested the interrupting device to deassert the interrupt request. If a Freeze
mode handler is always executed before the associated signal handler, the three
indicated lines of code can be removed. Doing so enables nested interrupts to be
supported without explicitly reenabling interrupts in the signal handler. However, if
the signal preparation code is called directly from the interrupt vector table (via an
interrupting device) then interrupts must be initially disabled by the shared signal
preparation code.
.global _signal_associate_code
.global _signal_associate_end
.reg it0,gr64
.reg it1,gr65
_signal_associate_code: ;signal number in it0
const gr64,0 ;push signal number on stack
const it1,0 ;clear sig_sig variable
const it2,_sig_sig ; need not do this if signal
consth it2,_sig_sig ; handler is called directly
store 0,0,it1,it2 ; from vector table entry
push msp,it0 ;interrupt context stack
push msp,gr1 ;use 'push' macro,
push msp,rab ; see section 3.3.1
const it0,512
sub rab,rfb,it0 ;set rab=rfb–WindowSize
;
pushsr msp,it0,pc0 ;push special registers
pushsr msp,it0,pc1
pushsr msp,it0,pc2
pushsr msp,it0,cha
pushsr msp,it0,chd
pushsr msp,it0,chc
pushsr msp,it0,alu
pushsr msp,it0,ops
push msp,tav ;push tav (gr121)
; set DI in CPS, but timer
mfsr it0,ops ; interrupts are still on
or it0,it0,0x2 ;this disables interrupts
mtsr ops,it0 ; in signal trampoline code
;
mtsrim chc,0 ;the trampoline code is
const it1,RegSigHand ; described in section 4.4.1
consth it1,RegSigHand ;RegSigHand is a library
load 0,0,it1,it1 ; variable
add it0,it1,4 ;IRET to signal
mtsr pc1,it1 ; trampoline code
mtsr pc0,it0
iret
_signal_associate_end:
All of the code presented is available from AMD in source and linkable library
form. The interrupt() install routine itself is listed below and is
surprisingly short. Its operation is simple: it examines the interrupt tag word of the
supplied C handler. Note that it assumes that the interrupt procedure has a one–word
procedure tag preceded by an interrupt tag word –– this is almost always the case. If
no interrupt tag is found then signal handling is selected. This would be the case if the
handler routine had been built with the GNU compiler, which does not currently
support interrupt tag words.
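The tag word decision can be isolated and tested on its own. The sketch below assumes only the bit masks that appear in the interrupt() listing: a non-zero top byte means no interrupt tag word is present, and a word with nothing set outside bits 15 to 8 selects Freeze mode, with those bits giving the number of global registers to save. The names classify_tag(), tag_glob_regs() and the prep_kind enumeration are hypothetical and are not part of the AMD library.

```c
#include <assert.h>
#include <stdint.h>

enum prep_kind { PREP_SIGNAL, PREP_FREEZE };

/* Choose the preparation code from the word found two words before the
   handler's entry point. The masks are taken from the interrupt() listing. */
enum prep_kind classify_tag(uint32_t tag_word)
{
    if ((tag_word & 0xff000000u) != 0)   /* not an interrupt tag word */
        return PREP_SIGNAL;
    if ((tag_word & 0xffff00ffu) == 0)   /* only the register-count field is used */
        return PREP_FREEZE;
    return PREP_SIGNAL;                  /* tag reports something Freeze mode cannot handle */
}

/* Number of global registers to save in the interrupt context cache. */
unsigned tag_glob_regs(uint32_t tag_word)
{
    return (unsigned)((tag_word & 0x0000ff00u) >> 8);
}
```

Keeping the decision in a small pure function makes it easy to add intermediate preparation schemes later without touching the code that copies instructions into heap memory.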
Depending on the tag word, Freeze mode or signal processing is selected and the
appropriate code copied into heap memory space. For Freeze mode processing, only
the required number of global registers is saved in the interrupt context cache
(gr64–gr80). Additionally, only the minimum required amount of heap memory is
requested via the HIF–library malloc() service. After copying code into heap
memory, some instruction patching is performed to correctly reference the assigned
C handler. Finally the HIF–library _settrap() service is used to assign a trap handler
address to the requested trap number. Note that when the copying is performed, the
heap memory is only written to and never read. This will prevent the code being
placed into on–chip data cache, as 29K family data caches only allocate cache blocks
on data reads. Avoiding caching of the relevant heap memory ensures that the new
code will be fetched from instruction memory (see sections 5.13.2 and 5.14.4).
int interrupt(trap_number, C_handler)
int trap_number;
void (*C_handler)();
{
    int *tag_p=(int*)C_handler - 2;
    int ret_sig;                          /* return signal value */
    int tag_word = *tag_p;
    int glob_regs, *trap_handler, i, size;
    _LOCK volatile int *code_p, *mem_p;   /* see section 5.14.1 */
    if((tag_word & 0xff000000) != 0)
        tag_word = -1;                    /* no interrupt tag word */
    if((tag_word & 0xffff00ff)==0)
    {   glob_regs=(tag_word & 0xff00) >> 8;
        code_p=&interrupt_cache_code;
        size=4*((2*glob_regs)+6+8);       /* 8 for sig_sig code support */
        mem_p=(int*)malloc(size);         /* get heap memory */
        trap_handler=mem_p;
        code_p=code_p+(16-glob_regs);     /* find start of save */
        for(i=1; i <=glob_regs; i++)      /* copy save code */
            *mem_p++=*code_p++;
        /* supply address to CONST instruction */
        *mem_p++ =*code_p++ | ( (((int)C_handler&0xff00)<<8)
                              + ((int)C_handler&0xff) );
        /* supply address to CONSTH inst. */
        *mem_p++ =*code_p++ | ( (((int)C_handler&0xff000000) >>8)
                              + (((int)C_handler&0xff0000) >>16) );
        for(i=1; i <=(4-2); i++)          /* copy the call code */
            *mem_p++=*code_p++;
        code_p=code_p + (16-glob_regs);   /* find start of restore */
        for(i=1;i<=(glob_regs+2+8);i++)   /* copy restore code; */
            *mem_p++=*code_p++;           /* 8 required for sig_sig code support */
        ret_sig=0;
    }
    else
    {   static int sig_number=30;         /* SIGUSR1 in SigEntry */
        ret_sig=sig_number;
        signal(sig_number,C_handler);
        size=4*(signal_associate_end-signal_associate_code);
        mem_p=(int*)malloc(size);         /* get heap memory */
        trap_handler=mem_p;
        code_p=signal_associate_code;
        /* supply sig_number to CONST instruction */
        *mem_p++ = *code_p++ | ( ((sig_number&0xff00)<<8)
                               + (sig_number&0xff) );
        for(i=1; i <=((size/4)-1); i++)   /* copy rest of code; size is in bytes */
            *mem_p++ = *code_p++;
        sig_number++;
    }
    _settrap(trap_number,(void(*)())trap_handler); /* HIF service */
    return ret_sig;
}
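The instruction patching step in the listing above splits a 16–bit value across two byte fields of the instruction word: the high byte is shifted into bits 23 to 16 and the low byte stays in bits 7 to 0. The sketch below restates that arithmetic with unsigned types so it can be checked on a host machine. The helper names are hypothetical, and the instruction words used to exercise it are placeholders rather than values taken from an assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Merge a 16-bit immediate into an instruction word whose immediate
   fields are currently zero: high byte -> bits 23..16, low byte -> bits 7..0. */
uint32_t patch_imm16(uint32_t insn, uint16_t imm)
{
    assert((insn & 0x00ff00ffu) == 0);   /* fields must be empty before patching */
    return insn | ((uint32_t)(imm & 0xff00u) << 8) | (uint32_t)(imm & 0x00ffu);
}

/* Recover the 16-bit immediate from a patched instruction word. */
uint16_t read_imm16(uint32_t insn)
{
    return (uint16_t)(((insn >> 8) & 0xff00u) | (insn & 0x00ffu));
}

/* Patch a CONST/CONSTH pair with the low and high halves of an address,
   in the way interrupt() supplies the C handler address. */
void patch_address(uint32_t *const_insn, uint32_t *consth_insn, uint32_t address)
{
    *const_insn  = patch_imm16(*const_insn,  (uint16_t)(address & 0xffffu));
    *consth_insn = patch_imm16(*consth_insn, (uint16_t)(address >> 16));
}
```

Working in unsigned arithmetic avoids any dependence on how a right shift treats a negative int, which matters if a handler address has its top bit set.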
Users of the above code who do not want to make use of the two–level
interrupt processing supported via the sig_sig variable can remove the extra eight
instructions in the interrupt_cache_code and should also remove the extra code
copying indicated in the listing above. This will slightly improve interrupt
processing times for Freeze mode handlers. Other users who want to further exploit
the two–level approach can assign a single handler for all second level interrupt
processing; this is discussed in section 4.3.12. Interrupts are first dealt with in Freeze
mode by building an interrupt descriptor bead; then a second level Dispatcher routine
is responsible for popping beads off a string and calling the assigned second level
handler. Alternatively, a signal dispatcher technique can be applied; section 2.5.6
describes the method. Signal dispatching can be achieved entirely with support
routines accessible from C level –– this makes signal dispatching particularly
attractive.
If the interrupt() routine is used extensively for multiple signal handlers, it will
be necessary to increase the size of the signal handler array (SigEntry, described in
Appendix B). The array is normally large enough to hold signal numbers 1 through
32. Unless signal allocation is started at a number less than SIGUSR1 (30), there is
normally only sufficient space for two signal handlers.
2.5.5 Overloaded INTR3
The microcontroller members of the 29K family contain several on–chip
peripherals. These peripherals can generate interrupts which are all directed to the core
29K processor via interrupt line INTR3. This causes overloading of the INTR3
vector handler. When a microcontroller receives an INTR3 interrupt, it must examine its
Interrupt Control Register (ICR) to determine the source of the interrupt. This
requires all interrupts to initially be processed via the INTR3 vector handler. The
INTR3 handler must call the appropriate device service routine. The service routine
first clears the interrupt request by writing a one to the correct bit in the ICR; it can
then reenable interrupts and service the current request. The general format of the
ICR is shown in Figure 2-3.
[Figure: ICR bit layout, bits 31 to 0. Fields shown: reserved, IOPI, and a processor–specific region containing bits such as VDI and DMA0I. Vector numbers marked: 220, 224, 228, 237.]
Figure 2-3. 29K Microcontroller Interrupt Control Register
The overloading of INTR3 adds complexity to the task of building a Freeze
mode interrupt handler for each interrupting device. The problem can be resolved by
allocating a region of the vector table for use by the interrupting devices sharing
INTR3. The code below (intr3) reserves 33 vector table entries starting with vector
220 –– these vectors are not normally used by a 29K based system. When an INTR3
occurs, the code examines the ICR register with a Count Leading Zeros (CLZ)
instruction. This assigns the highest priority to the leftmost set bit (interrupt)
in the ICR register. The value produced by the CLZ instruction is added to the base
value of 220 and the result used to obtain the correct vector entry from the vector
table.
.reg it0,gr64
.reg it1,gr65
.global _intr3
_intr3:
const it0,0x80000028 ;Interrupt Control register address
consth it0,0x80000028
load 0,0,it1,it0
clz it1,it1 ;priority order index
;
const it0,220 ;base vector number
add it1,it0,it1 ;add offset to base
sll it1,it1,2 ;convert to word offset
mfsr it0,vab ;get vector table base
add it1,it0,it1 ;get handler address
load 0,0,it1,it1 ; from vector table
jmpi it1 ;jump to interrupt
nop ; handler
The intr3 code completes by jumping to the selected vector handler. Note, the
code makes use of the four interrupt temporary registers (it0–it3, gr64–gr67)
normally reserved by an operating system for interrupt handling. Each peripheral device
which can set a bit in the ICR register is assigned a unique vector handler number in
the range 220–252. If no bit is found to be set in the ICR register, vector 252 is
selected.
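The vector selection performed by the intr3 code can be expressed in a few lines of C for checking on a host machine. This is a sketch only: intr3_vector() and count_leading_zeros() are hypothetical names, the CLZ instruction is modeled with a portable loop, and the ICR value is passed in as a parameter instead of being read from the register.

```c
#include <assert.h>
#include <stdint.h>

#define INTR3_BASE_VECTOR 220u       /* first of the 33 reserved vectors */

/* Portable model of the CLZ instruction: returns 32 for a zero word. */
static unsigned count_leading_zeros(uint32_t word)
{
    unsigned n = 0;
    while (n < 32 && (word & 0x80000000u) == 0) {
        word <<= 1;
        n++;
    }
    return n;
}

/* Vector number selected for a given ICR value: the leftmost set bit wins,
   and an ICR with no bits set selects the last reserved vector. */
unsigned intr3_vector(uint32_t icr)
{
    unsigned vector = INTR3_BASE_VECTOR + count_leading_zeros(icr);
    assert(vector >= 220u && vector <= 252u);
    return vector;
}
```

With this model, an ICR value with bit 27 set selects vector 224 and one with bit 14 set selects vector 237; when both are set the bit nearer the left takes priority.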
Using the intr3 code, it is possible to use the previously described interrupt()
library routine to deal with interrupts. A call to the HIF library procedure
_settrap() is required to install the intr3 code for INTR3 handling. After this is done,
the interrupt() routine can be used to assign interrupt handlers for the selected vector
numbers in the 220–252 range, as shown below.
main()
{
. . .
_settrap(19,intr3); /* INTR3 handler */
interrupt(224,VD_handler); /* VDI */
interrupt(237,DMA_handler); /* DMA0I */
. . .
The intr3 code does not clear the interrupt request in the ICR register; this is left
to the specific interrupt handler. However, this is insufficient for level sensitive I/O
port interrupts. In this case the interrupting condition must first be cleared for the
corresponding PIO signal before the ICR bit is cleared. Consequently, the clearing of the
bit in the ICR register is redundant.
AMD evaluation boards are normally supplied with a combined OS–boot
operating system and MiniMON29K DebugCore in the ROM memory. When the target
processor is a microcontroller, the message system used to support OS–boot and
DebugCore communication with MonTIP typically uses an on–chip UART. All on–chip
peripheral generated interrupts are handled via INTR3. MiniMON29K bundle 3.0,
and earlier versions, are built using OS–boot version 2.0. This version of OS–boot
assigned the INTR3 handler for MiniMON29K’s sole use. This makes it very
difficult to add additional interrupt handlers for on–chip peripherals. The problem can be
solved by applying the code shown below.
main()
{
void (*V_minimon)();
. . .
V_minimon=(void(*)())_settrap(19,intr3);/* INTR3 */
_settrap(220+24,V_minimon); /* RXSI interrupt */
_settrap(220+25,V_minimon); /* RXDI interrupt */
_settrap(220+26,V_minimon); /* TXDI interrupt */
. . .
The _settrap() HIF library service is used to install a new INTR3 handler; the
address of the old handler is returned. The MiniMON29K code is used to process
three peripheral interrupts via INTR3. The _settrap() service is used again to
separately reinstall the handlers required by MiniMON29K. New interrupt handlers for
additional on–chip peripherals can then be installed with further calls to _settrap() or
interrupt().
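The handler chaining described above depends on the install service returning the previously installed handler. The sketch below models that behavior with an ordinary array so the sequence can be checked on a host machine. The names model_settrap(), model_gettrap(), stub_intr3() and stub_minimon() are hypothetical stand-ins; the real _settrap() is a HIF service and is not reimplemented here.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*handler_t)(void);

#define NUM_VECTORS 256
static handler_t vector_table[NUM_VECTORS];   /* model of the vector table */

/* Install a handler and return the one it replaces. */
handler_t model_settrap(int trap, handler_t handler)
{
    handler_t old;
    assert(trap >= 0 && trap < NUM_VECTORS);
    old = vector_table[trap];
    vector_table[trap] = handler;
    return old;
}

handler_t model_gettrap(int trap)
{
    assert(trap >= 0 && trap < NUM_VECTORS);
    return vector_table[trap];
}

void stub_intr3(void)   { }   /* stands in for the intr3 dispatch code */
void stub_minimon(void) { }   /* stands in for the monitor's INTR3 handler */

/* Take over INTR3 (vector 19), then give the monitor back its three
   peripheral interrupts through the reserved vector range. */
void share_intr3(void)
{
    handler_t old = model_settrap(19, stub_intr3);
    model_settrap(220 + 24, old);   /* RXSI */
    model_settrap(220 + 25, old);   /* RXDI */
    model_settrap(220 + 26, old);   /* TXDI */
}
```

After share_intr3() runs, vector 19 refers to the new dispatch code while the three UART vectors still reach the monitor, leaving the remaining reserved vectors free for other peripherals.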
2.5.6 A Signal Dispatcher
Release 3.2, or newer, of the High C 29K compiler supports routines of defined
return–type _Interrupt. The use of this non–standard keyword was explained in
section 2.5.4. The keyword is used here to support a signal dispatcher. The method relies
on interrupts being processed in two stages. The first stage operates in Freeze mode.
It responds immediately to the interrupting device, captures any critical data and
deactivates the interrupt request. The second stage, if required, takes the form of a
signal handler. The sig_sig variable is used by the Freeze mode handler to request
signal handler execution. A signal handler cannot be executed without a Freeze mode
handler making the necessary request. This is because interrupts are not disabled in
the signal associate code.
The technique has a number of benefits. It is seldom necessary to disable
interrupts for long periods, as asynchronous interrupt events are only initially dealt with in
Freeze mode. This reduces interrupt latency. Signal handlers can be queued for
processing when nested interrupts would occur. This eliminates the need to prepare a C
level interrupt processing environment for each interrupt. A C level environment
need only be built for a Signal Dispatcher routine. The Signal Dispatcher is then
responsible for calling the appropriate signal handler for all signals generated by
interrupts. The Signal Dispatcher is started in response to the first signal occurring. The
dispatcher causes execution of the first signal handler, then determines if other signal
handlers have been requested while the current signal handler was executing. The
dispatcher continues to process signals until there are none remaining. At this point
the original interrupted state is restored, that is, the processor state at
the time the first interrupt in the sequence occurred. The first interrupt occurred while
no interrupt or signal handler was being processed, and it caused the Signal
Dispatcher to start execution.
Avoiding nested interrupts, other than for Freeze mode handling, is most
beneficial when large numbers of multiply nested interrupts are expected, and the
cost of preparing C level context for interrupt processing is high. For example, using
interrupt context caching, the processor can be prepared for Freeze mode interrupt
processing in 1–2 microseconds (at 16 MHz). However, with an Am29205
microcontroller which has a 16–bit off–chip bus and relatively slow DRAM memory,
as much as 40 microseconds can be required to prepare the processor for a C level
signal handler. In this case it is best to prepare for C level interrupt handling only
once. Nested interrupts are avoided by adding new interrupts to a stack when further
interrupts occur while the Signal Dispatcher is executing.
As explained in section 2.5.4, a signal handler is requested when the sig_sig
variable is set by a Freeze mode handler. Previous example code showed how the
signal handler could be started immediately after the Freeze mode handler completes.
The alternative code, shown here, causes the signal to be added to a stack of signals
waiting for processing. Both methods can coexist: setting the sig_sig variable to a
signal number ORed with 0x80000000 indicates the signal should be queued (if
necessary) rather than processed immediately.
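The request word convention can be captured in three small helpers. This is a sketch under stated assumptions: the helper names are hypothetical, and the 0xff mask used to recover the signal number is taken from the queue_sig assembly code later in this section, so it assumes signal numbers fit in eight bits.

```c
#include <assert.h>
#include <stdint.h>

#define SIG_QUEUE_FLAG 0x80000000u   /* most significant bit: queue the signal */

/* Build the value a Freeze mode handler would store in sig_sig. */
uint32_t make_request(unsigned sig_number, int queue)
{
    assert(sig_number != 0 && sig_number <= 0xffu);
    return (queue ? SIG_QUEUE_FLAG : 0u) | sig_number;
}

/* Non-zero when the request asks for queued rather than immediate handling. */
int request_is_queued(uint32_t request)
{
    return (request & SIG_QUEUE_FLAG) != 0;
}

/* Signal number carried by the request, with the flag removed. */
unsigned request_signal(uint32_t request)
{
    return (unsigned)(request & 0xffu);
}
```

Because the flag lives in the same word as the signal number, a single store to sig_sig is enough to make the request, which keeps the Freeze mode handler short.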
First, examine the two interrupt handlers shown below. The Freeze mode
handlers, uart_handler() and timer_handler(), use the _Interrupt keyword. They both
qualify for Freeze mode execution. The UART handler is similar to the example of
section 2.5.4. However, this time sig_sig is set to the signal number held in uart_sig
and the most significant bit is also set when the end of a string is encountered. This
will request the associated signal handler to be placed in the signal queue.
_Interrupt uart_handler()          /* Freeze mode interrupt handler */
{
    static int count=0;
    recv_data[count]=*uart_p;      /* access UART */
    if(recv_data[count]=='\n')     /* end of string ? */
    {   count=0;
        sig_sig=0x80000000 | uart_sig;
    }
    else count++;
}
The Freeze mode timer handler reloads the on–chip timer control registers for
repeated timer operation. Each timer interrupt causes the tick variable to be
incremented, and when a tick value of 100 is reached, signal timer_sig is added to the signal
queue. The Freeze mode handler is written in C. However, it needs to access special
register 9 (TMR, the Timer Reload register) which is not normally accessible from C.
The problem is overcome by using the C language extensions _mfsr() and _mtsr().
They enable special registers to be read and written.
_Interrupt timer_handler()         /* Freeze mode interrupt handler */
{
    static int tick=0;
    int tmr;
    tmr=_mfsr(9);                  /* read TMR special register */
    tmr=tmr&(-1-0x02000000);       /* clear IN bit-field */
    _mtsr(9,tmr);                  /* write to TMR register */
    if(++tick >= 100)              /* a tick value of 100 reached */
    {   tick=0;
        sig_sig=0x80000000 | timer_sig;
    }
}
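The expression used in the timer handler to clear the IN bit, a mask formed as minus one minus the bit value, is equivalent to ANDing with the complement of the bit. The short sketch below checks that equivalence on a host machine. The function names are hypothetical, and the bit position 0x02000000 is simply the value used in the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define TMR_IN_BIT 0x02000000u       /* IN bit position used in the listing */

/* Clear the IN bit using the complement of the bit mask. */
uint32_t clear_in_bit(uint32_t tmr)
{
    return tmr & ~TMR_IN_BIT;
}

/* Clear the IN bit using the arithmetic form that appears in the listing. */
uint32_t clear_in_bit_as_listed(uint32_t tmr)
{
    return tmr & (uint32_t)(-1 - 0x02000000);
}
```

Both forms leave every other bit of the register value untouched, so the choice between them is a matter of style.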
The second stage of the UART interrupt handler, the signal handler, is shown
below. Note, the sig_uart() routine calls the printf() library routine. This is not
permitted with the High C 29K library routines as the printf() routine is not reentrant.
However, the use of printf() helps illustrate the operating principle. Normally a
signal handler must use the _sigret() signal return service, at least with a HIF
conforming operating system. However, when a signal handler is called from the dispatcher,
the signal return service should not be used. It is possible to determine if the
dispatcher is in use by testing the variable dispatcher_running; it becomes nonzero when
the dispatcher is in use. However, testing the dispatcher_running flag may be
insufficient in some circumstances. It is possible that the Signal Dispatcher is running and
initiating signal handler execution. At the same time a signal handler may be
requested directly by, say, an interrupt. The Dispatcher is running but the directly
requested signal handler must use the signal return service.
Signals need not always be queued for processing. If a very high priority
(immediate) interrupt occurs and further signal processing is necessary, sig_sig should
be simply set to the signal number. In this case it is important that the signal handler
use the _sigret() service.
_Interrupt sig_uart(sig_number)    /* signal handler for UART */
int sig_number;
{
    printf("in signal handler number=%d\n", sig_number);
    printf("received string=%s\n", recv_data);
    if(!dispatcher_running)_sigret(); /* no _sigret() service call */
}
The Signal Dispatcher is implemented as a signal handler. The dispatcher
removes signals from a stack and calls the appropriate signal handler. When a signal
handler is requested by a Freeze mode handler, and the Signal Dispatcher is not
currently executing, the requested signal (sig_sig value) is not immediately started. In its
place the dispatcher signal handler is initiated.
Figure 2-4 shows an example of the Signal Dispatcher in operation. The
first interrupt is from the UART. It is dealt with entirely in Freeze mode; the sig_sig
variable is not set, so no second stage signal handler is requested. The UART
generates the second interrupt. This time the sig_sig variable is set to request that the sig_uart()
signal handler be started by the Signal Dispatcher. While the second stage handler is
running, a timer interrupt occurs. The Freeze mode timer handler requests that a second
stage handler be started by the Signal Dispatcher. When the dispatcher completes the
currently executing second stage handler (the UART’s), it initiates the timer’s second
stage handler. When there are no remaining second stage handler requests, the
dispatcher issues a signal–return service request. The original program’s context is then
restored and its execution restarted.
[Figure: timeline of the main program, Freeze mode code, and full C–context code. First UART interrupt: uart_handler() leaves sig_sig=0. Second UART interrupt: uart_handler() sets sig_sig=uart_sig|0x8..; signal_associate_code pushes uart_sig on the stack and starts sig_dispatcher() using dispatcher_sig; the Signal Dispatcher pops uart_sig and calls sig_uart(). Timer interrupt: timer_handler() sets sig_sig=timer_sig|0x8..; signal_associate_code pushes timer_sig on the stack; the Signal Dispatcher pops timer_sig and calls sig_timer(), then calls the _sigret() signal return service and the main program resumes.]
Figure 2-4. Processing Interrupts with a Signal Dispatcher
Integer variable dispatcher_sig holds the signal number used by the Signal
Dispatcher. The user must select a signal number. The example code below uses 7
(SIGEMT). The signal() library routine is used to assign procedure sig_dispatcher()
to signal number 7. Before signal and trap handlers can be installed, the procedures
and variables defined in the support libraries must be declared external; as shown
below.
extern void signal(int, void (*handler)(int));
extern int interrupt(int, _Interrupt (*C_handler)(int));
extern void sig_dispatcher(int);
extern int sig_sig;
extern int dispatcher_sig; /* dispatcher signal number */
int uart_sig, timer_sig;
During program initialization, after main() is called, the handler routines and
other support services must be installed. The code below uses the interrupt() library
routine to install a signal handler (sig_timer() not shown) for timer interrupt support.
The call to interrupt() returns the allocated signal number, and this number is saved
in timer_sig. The timer Freeze mode handler uses the timer_sig value to request the
timer signal handler be executed. The interrupt() service is called a second time to
install the Freeze mode handler, timer_handler(). The second call causes vector
table entry 14 to be reassigned the address of the Freeze mode handler.
The UART handlers are installed using an alternative method. The signal()
service rather than the interrupt() service is used to assign the sig_uart() signal handler
to signal number SIGUSR2. This method allows a specific signal number to be
selected, rather than using the interrupt() service to allocate the next available signal
number. Most users will prefer the previous method, which automatically selects
signal numbers.
main()
{
    _settrap(218,_disable);
    _settrap(217,_enable);
    _settrap(216,_timer_init);
    dispatcher_sig=7;                    /* select signal number for dispatcher */
    signal(dispatcher_sig,sig_dispatcher);
    timer_sig=interrupt(14,sig_timer);   /* install signal handler */
    if(interrupt(14,timer_handler))      /* install Freeze handler */
        printf("ERROR: Freeze mode handler not built for trap 14\n");
    if(interrupt(15,uart_handler))       /* install Freeze handler */
        printf("ERROR: Freeze mode handler not built for trap 15\n");
    uart_sig=SIGUSR2;                    /* choose a signal number */
    signal(uart_sig,sig_uart);           /* install signal handler */
    timer_init();                        /* initialize the timer */
    . . .
The sig_dispatcher() requires two helper services, disable() and enable().
They are described in more detail shortly, but are simply used to enable and disable
processor interrupts. The _settrap() service is used above to install trap handlers for
these services. The timer_init() routine is not required by the Signal Dispatcher. It is
included to simply make the example more complete.
The interrupt() routine uses the signal_associate method of assigning a trap
number to a signal handler. The code was described in section 2.5.4, but a few small
additions are required to support the Signal Dispatcher. The modified code is shown
below. There are two changes: Interrupts are not disabled (requiring that a Freeze
mode handler always be used for interrupt deactivation). A call to queue_sig is made
if the most significant bit of the signal number is set.
.reg it0,gr64
.reg it1,gr65
_signal_associate_code: ;signal number in it0
const gr64,0 ;push signal number on stack
;
const it1,0 ;clear sig_sig variable
const it2,_sig_sig ; need not do this if signal
consth it2,_sig_sig ; handler is called directly
store 0,0,it1,it2 ; from vector table entry
;
const it1,_queue_sig
consth it1,_queue_sig
jmpti gr64,it1 ;jump if msb–bit set
nop
push msp,it0 ;interrupt context stack
push msp,gr1 ;use 'push' macro,
push msp,rab ; see section 3.3.1
const it0,512
sub rab,rfb,it0 ;set rab=rfb–WindowSize
;
pushsr msp,it0,pc0 ;push special registers
pushsr msp,it0,pc1
pushsr msp,it0,pc2
pushsr msp,it0,cha
pushsr msp,it0,chd
pushsr msp,it0,chc
pushsr msp,it0,alu
pushsr msp,it0,ops
push msp,tav ;push tav (gr121)
;
mtsrim chc,0 ;the trampoline code is
const it1,RegSigHand ; described in section 4.4.1
consth it1,RegSigHand ;RegSigHand is a library
load 0,0,it1,it1 ; variable
add it0,it1,4 ;IRET to signal
mtsr pc1,it1 ; trampoline code
mtsr pc0,it0
iret
_signal_associate_end:
The queue_sig routine is shown below. It pushes the signal number on a signal
stack and advances a stack pointer, sig_stack_p. The operation is performed while
still in Freeze mode and is therefore not interruptible. The variable
dispatcher_running is then tested. If it is set to TRUE, an interrupt return (IRET)
instruction is issued. If it is FALSE, the dispatcher_sig number is obtained and the
signal_associate code continues the process of starting a signal handler; but the
signal number now in use will cause the Signal Dispatcher (sig_dispatcher()) to
commence execution.
_queue_sig: ;jump here from signal_associate
and it0,it0,0xff ;clear msb–bit
;
const it3,_sig_stack_p
consth it3,_sig_stack_p
load 0,0,it2,it3 ;get pointer value
store 0,0,it0,it2 ;store signal number on stack
add it2,it2,4 ;advance stack pointer
store 0,0,it2,it3
;
const it3,_dispatcher_running
consth it3,_dispatcher_running
load 0,0,it2,it3 ;test if signal dispatcher
cpeq it2,it2,0 ; already running
jmpt it2,_start_dispatcher
constn it2,-1
iret ;IRET if running
;
_start_dispatcher:
store 0,0,it2,it3 ;set dispatcher_running
const it3,_dispatcher_sig
consth it3,_dispatcher_sig
const it1,_signal_associate_code+5*4
consth it1,_signal_associate_code+5*4
jmpi it1 ;start signal handler
load 0,0,it0,it3 ;signal=dispatcher_sig
Before the signal_associate code starts the dispatcher signal handler, the
dispatcher_running variable is set to TRUE. Until this variable is cleared, further
signal requests (if the most significant bit of the signal number is set) will be added to
the queue of signals waiting for processing. The process of adding a signal to the
queue is kept simple –– a stack is used. Reducing the amount of code required results
in less interrupt latency as the queue_sig code runs in Freeze mode.
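The queue_sig logic can also be modeled in C for experimentation on a host machine. The sketch below is not part of the book's support code. It mirrors the assembly steps under the assumptions that TRUE is represented by -1 and that the caller performs the actual IRET or handler start. The name queue_sig_model and its return convention are hypothetical.

```c
#include <assert.h>

#define SIG_STACK_SIZE 200

int sig_stack[SIG_STACK_SIZE];      /* signal stack, as in the text */
int *sig_stack_p = &sig_stack[0];   /* next free slot */
int dispatcher_running = 0;         /* 0 = FALSE, -1 = TRUE */
int dispatcher_sig = 7;             /* example dispatcher signal number */

/* Hypothetical host-side model of queue_sig.
 * Returns 0 when the caller should simply return (the IRET path),
 * or the dispatcher signal number when the dispatcher must be started. */
int queue_sig_model(unsigned int sig_sig)
{
    int sig = (int)(sig_sig & 0xff);    /* clear msb-bit, keep signal number */

    /* model-only overflow check; the assembly code has no such test */
    assert(sig_stack_p < &sig_stack[SIG_STACK_SIZE]);
    *sig_stack_p++ = sig;               /* push and advance stack pointer */

    if (dispatcher_running != 0)
        return 0;                       /* dispatcher already active */

    dispatcher_running = -1;            /* set dispatcher_running to TRUE */
    return dispatcher_sig;              /* proceed as if dispatcher_sig was requested */
}
```

Calling queue_sig_model() twice in a row shows the intended behavior: the first call starts the dispatcher, the second only queues the signal.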
The signal handler which performs the dispatch operation is written in C. The
code is shown below. It requires some simple assembly–level support routines which
are described later. Having the code in C is a convenience as it simplifies the task of
modifying the code. Modification is necessary if a different execution schedule is
required for signals waiting in the signal stack. The variables used in the Signal
Dispatcher routine are described below. Note that sig_stack_p and
dispatcher_running are defined volatile. This is because they may also be modified
by a Freeze mode interrupt handler. It is important that the C compiler be informed
about this possibility. Otherwise it may perform optimizations which prevent value
changes from being observed, such as holding a copy of sig_stack_p in a
register and repeatedly accessing the register.
extern void (*_SigEntry[])(int); /* defined in HIF libraries */
int sig_stack[200]; /* signal stack */
volatile int *sig_stack_p=&sig_stack[0];
volatile int dispatcher_running; /* dispatcher running flag */
int sig_sig=0;
int dispatcher_sig; /* dispatcher signal number */
The example sig_dispatcher() is relatively simple but effective. It first disables
interrupts before removing all current signals from the stack. The signal values are
transferred to an array. Interrupts are then reenabled. Performing this procedure with
interrupts disabled prevents other signals being added to the stack while the transfer
operation is being performed. Signals are transferred to the array in the reverse order
they were placed on the stack. This ensures that signals are ultimately processed in
the order in which they were originally requested.
No attempt is made to apply a priority order to pending signals. The necessary
code can be applied after the signals have been removed from the stack. Performing
priority ordering at C–level rather than in the queue_sig code has the advantage of
reducing interrupt latency. Due to the fast operation of 29K processors, the need to
priority order signals is not high, as a signal request is not likely to be kept waiting
very long.
void sig_dispatcher(sig)                 /* Signal Dispatcher */
int sig;
{
    int cps;
    int *sig_p;                          /* array of signals */
    static int sig_array[20];            /* needing processing */
    cps=disable(0x20002);                /* set DI and TD in CPS */
    for(;;)
    {   sig_p=&sig_array[0];             /* mark array empty */
        while(sig_stack_p!=&sig_stack[0])/* remove signals from */
        {   --sig_stack_p;               /* stack */
            *sig_p++=*(int*)sig_stack_p; /* copy from */
        }                                /* stack to array */
        enable(cps);                     /* enable interrupts */
        while(sig_p!=&sig_array[0])      /* process signals removed */
        {   --sig_p;                     /* from stack */
            (*_SigEntry[(*sig_p)-1])(*sig_p);
        }
        cps=disable(0x20002);            /* disable interrupts */
        if(sig_stack_p==&sig_stack[0])   /* stack empty ? */
            break;
    }
    dispatcher_running=0;
    enable(cps);                         /* enable interrupts */
    _sigret();                           /* _sigret() HIF service */
}                                        /* would restore interrupted cps */
When there are no remaining signals to process, the dispatcher requests the
_sigret() signal–return service. The dispatcher_running flag is also cleared. It is
possible that a new signal arrives just after the flag is cleared but before the
signal–return service is complete; this cannot be avoided. It does not create a
problem (other than a loss of performance), as a new dispatcher signal handler is
simply started.
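The ordering property of the dispatcher (signals are popped from the stack into an array, then popped again from the array) can be checked on a host machine. The following is a minimal sketch, not the library code: the handler call is replaced by a hypothetical log array named processed, and interrupt disabling is omitted because nothing runs concurrently in the model.

```c
#include <assert.h>

#define MAX_PENDING 20

int sig_stack[200];
int *sig_stack_p = &sig_stack[0];

int processed[MAX_PENDING];   /* hypothetical log standing in for handler calls */
int n_processed = 0;

/* Drain the signal stack as sig_dispatcher() does, but record each
 * signal instead of calling its handler. Returns the number drained. */
int drain_in_request_order(void)
{
    int sig_array[MAX_PENDING];
    int *sig_p = &sig_array[0];
    int count;

    while (sig_stack_p != &sig_stack[0]) {          /* first pop: newest first */
        assert(sig_p < &sig_array[MAX_PENDING]);    /* model-only bounds check */
        *sig_p++ = *--sig_stack_p;
    }
    count = (int)(sig_p - &sig_array[0]);

    while (sig_p != &sig_array[0]) {                /* second pop: oldest first */
        assert(n_processed < MAX_PENDING);
        processed[n_processed++] = *--sig_p;
    }
    return count;
}
```

Two last-in, first-out passes cancel out, so the log ends up in request order. A priority scheme, if one is wanted, could reorder sig_array between the two loops without touching the Freeze mode code.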
The disable() and enable() support routines are used by the Signal Dispatcher to
disable and enable interrupts around critical code. Interrupts are disabled by setting
the DI bit in the Current Processor Status (CPS) register. Freeze mode handler routines
can use the _mtsr() C language extension to modify special registers. However,
this extension cannot be used by the dispatcher routine, as it may be operating in User mode.
Accessing special register space from User mode would create a protection violation.
The problem is overcome by installing assembly level trap handlers which perform
the necessary special register access. The _settrap() HIF service is used to install the
trap handlers. Further assembly routines are required to assert the selected trap number.
The code for disable() is shown below.
.global _disable
_disable:
asneq 218,gr96,gr96
jmpi lr0
nop
.global __disable
__disable:
mfsr gr96,ops ;read OPS
or gr97,gr96,lr2 ;OR with passed value
mtsr ops,gr97
iret ;copy OPS to CPS
A single parameter is passed to disable(). The parameter is ORed with the CPS
value and the CPS register updated. Since this task is performed by a trap handler, the
OPS register is actually modified; and OPS is copied to CPS when an IRET is issued.
There is a further advantage in using a trap handler to perform the task: the operation
cannot be interrupted, so the read/modify/write of the CPS is atomic.
The code for enable() is similar to disable(). In this case the passed parameter is
simply copied to the CPS. The disable() routine returns the CPS value before modifying
it. The value is normally stored and later passed to enable(). In this way only the
DI and TD (timer disable) bits in the CPS are temporarily modified. Note that older
members of the 29K family do not support the TD bit. In such cases, the interrupt disable
code used by the example sig_dispatcher() routine does not prevent interrupts
being generated by the on–chip timer. The problem can be resolved by modifying
the __disable and __enable assembly routines to clear and set, respectively, the
interrupt enable (IE) bit in the Timer Reload register.
.global _enable
_enable:
asneq 217,gr96,gr96
jmpi lr0
nop
.global __enable
__enable:
mtsr ops,lr2
iret
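The save and restore convention used by disable() and enable() can be illustrated with a host-side model. This is only a sketch: cps_model is a hypothetical variable standing in for the CPS register, and the only bit mask used is the 0x20002 value (DI and TD) taken from the sig_dispatcher() example.

```c
#include <assert.h>

#define CPS_DI_TD 0x20002u    /* DI and TD together, as passed to disable() in the text */

unsigned int cps_model;       /* hypothetical stand-in for the CPS register */

/* Model of disable(): OR the mask into CPS and return the previous value. */
unsigned int disable_model(unsigned int mask)
{
    unsigned int old = cps_model;
    cps_model = old | mask;
    return old;
}

/* Model of enable(): copy the saved value back into CPS. */
void enable_model(unsigned int saved)
{
    cps_model = saved;
}

/* Nested critical sections unwind correctly because each level
 * restores the value it saw on entry. */
void nested_example(void)
{
    unsigned int outer = disable_model(CPS_DI_TD);
    unsigned int inner = disable_model(CPS_DI_TD);

    enable_model(inner);
    assert((cps_model & CPS_DI_TD) == CPS_DI_TD);   /* still disabled here */
    enable_model(outer);                            /* original state restored */
}
```

Because enable() restores a saved value rather than clearing bits, code that was entered with interrupts already disabled leaves them disabled on exit.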
2.5.7 Minimizing Interrupt Latency
Interrupt latency is minimized if interrupts are never disabled. In practice this
can be difficult to achieve. There are often critical code sections which must run to
completion without interruption. Traditionally, interrupts are disabled before enter-
ing such code sections and reenabled upon critical section completion. However, if
interrupts are processed using the two–stage method described in section 2.5.6 (A
Signal Dispatcher), interrupt disabling can be eliminated.
In place of disabling interrupts around a critical code section, the Signal Dispatcher
is effectively disabled. This allows a first stage interrupt handler to interrupt a
critical code section. Second stage interrupt handlers (signal handlers) are not initiated
during the critical code section, as the Dispatcher is disabled. It is easy to disable
the Dispatcher by simply indicating that it is already active; this will prevent its activation,
which can occur when the first stage handler is completed (if the sig_sig variable
is set). First stage handlers execute in Freeze mode and can be configured to
avoid access to the shared resource being accessed by critical code sections. The example
below shows how the Signal Dispatcher can be deactivated around a critical
code section.
#define TRUE -1
#define FALSE 0
. . . interruptible code
dispatcher_running=TRUE; /* disable Dispatcher */
. . . start of critical code section
/* code only interruptible by
Freeze mode handler */
. . . end of critical code section
dispatcher_running=FALSE; /* enable Dispatcher */
if(sig_stack_p!=&sig_stack[0])
_sendsig(dispatcher_sig);
. . .
When the critical task has been accomplished, the Dispatcher is reenabled by
clearing the dispatcher_running variable. It is possible that one or more signal num-
bers were pushed on the signal stack during the critical stage. Hence, when the Dis-
patcher is reenabled, the signal stack must be tested to determine if there are any
pending signals. If there are, then the Signal Dispatcher must be started using the
_sendsig() HIF service.
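The enter and leave steps can be packaged as a pair of helper routines so that the pending-signal test is never forgotten. The sketch below is an illustration rather than supplied library code: critical_enter(), critical_leave() and sendsig_stub() are hypothetical names, and the stub merely counts requests in place of the real _sendsig() HIF service.

```c
#include <assert.h>

#define TRUE  (-1)
#define FALSE 0

int sig_stack[200];
volatile int *sig_stack_p = &sig_stack[0];
volatile int dispatcher_running = FALSE;
int dispatcher_sig = 7;

int sendsig_count = 0;                  /* hypothetical: counts _sendsig() requests */

static void sendsig_stub(int sig)       /* stands in for the _sendsig() HIF service */
{
    assert(sig == dispatcher_sig);
    sendsig_count++;
}

void critical_enter(void)
{
    dispatcher_running = TRUE;          /* disable Dispatcher */
}

void critical_leave(void)
{
    dispatcher_running = FALSE;         /* enable Dispatcher */
    if (sig_stack_p != &sig_stack[0])   /* signals queued in the meantime? */
        sendsig_stub(dispatcher_sig);
}
```

A signal pushed between the two calls results in exactly one dispatcher request when the critical section ends.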
The method minimizes the latency in starting Freeze mode interrupt handlers,
since their commencement is never disabled, unless by another Freeze mode handler.
The latency in starting a second stage handler is not reduced. Further restrictions
may have to be applied to first stage handlers to disallow access to resources which
must be atomically manipulated within critical code sections, such as linked–list
data structures.
2.5.8 Signal Processing Without a HIF Operating System
A signal processing technique is recommended for dealing with complex C level
interrupt handlers. The previous sections have described in detail how signal processing
can be performed. AMD and other tool providers supply the necessary support
code which has been well tested and is known to be reliable. However, some developers
may select an operating system which does not support the HIF services required
by the previous example code. Additionally, many embedded systems are dependent
on simple home–made boot–up code, which provides few support services.
A commercial operating system will implement its own interrupt processing
services. It is likely these services will be somewhat based on the signal processing
code described in this book. However, the provided services should be used in prefer-
ence to the HIF services. In fact, the chosen operating system may not provide any
support for HIF services.
When building simple boot–up and run–time support code for a small
embedded system, it is best to provide the necessary HIF services required for signal
processing. If the boot–up code is based on AMD’s OS–boot product, then all HIF
services will be provided. If OS–boot is not used, it is important that limited HIF
support be included in the developed code. Only the signal, settrap, sysalloc and
sigret–type subset of HIF services are required. A trap handler for HIF trap number
69 should be installed, and the code required to process the HIF service request
installed. Very little code is required and example code can be taken from OS–boot.
2.5.9 An Example Am29200 Interrupt Handler
The following example makes use of the code presented in the previous sections
of this chapter. The Programmable I/O (PIO) port of an Am29200 microcontroller is
configured such that PIO signal–pin PIO0 is an output, and PIO signal–pin PIO15 an
input. The system hardware ensures that the two pins are wired together. A two stage
interrupt handler is assigned to processing interrupts generated by a rising edge on
pin PIO15. By first clearing pin PIO0 and then setting it to one, an interrupt will be
generated.
First, a number of include files must be accessed to declare external data and
procedure type information. Newer versions of file signal.h contain the extern
declarations listed below. Hence, only when using an older signal.h file need the
extern statement be explicitly included.
#include <hif.h>
#include <signal.h>
extern int interrupt(int, _Interrupt (*C_handler)(int));
extern int sig_sig;
extern int dispatcher_sig;
extern void intr3(void);
extern void _enable(void);
extern void _disable(void);
extern void enable(int);
extern int disable(int);
extern void sig_dispatcher(int);
It is best to access the Programmable I/O port via support macros or procedures.
Macros have a speed advantage (unless in–line procedures are used). Below are a
number of macros and support data structures which simplify control of the PIO port.
typedef volatile struct PIO_str /* PIO class */
{
unsigned int poct;
unsigned int pin;
unsigned int pout;
unsigned int poen;
} PIO_t;
PIO_t *PIO_p=(PIO_t*)0x800000d0; /* PIO object */
/* ICR pntr. */
volatile unsigned int* ICR_p=(unsigned int*)0x80000028;
#define PIO_enable_m(port) PIO_p->poen |= (1 << (port))
#define PIO_disable_m(port) PIO_p->poen &= ~(1 << (port))
#define PIO_rising_m(port) \
    PIO_p->poct |= (0x2 << (2* (port))); \
    PIO_p->poct &= ~(1 << (port));
#define PIO_falling_m(port) \
    PIO_p->poct |= (0x2 << (2* (port))); \
    PIO_p->poct |= (1 << (port));
#define PIO_high_m(port) \
    PIO_p->poct |= (0x1 << (2* (port))); \
    PIO_p->poct |= (1 << (port));
#define PIO_out_m(port, val) \
    { unsigned int tmp = PIO_p->pout; \
      tmp &= ~(1 << (port)); \
      tmp |= (((val) & 1) << (port)); \
      PIO_p->pout = tmp; \
    }
#define ICR_clear_m(vec) *ICR_p |= (1<<(251-(vec)))
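The bit manipulation performed by these macros can be exercised on a host machine by pointing them at ordinary memory instead of the real port. The sketch below makes two assumptions that are not in the original code: the registers are replaced by the hypothetical variable fake_pio, and the output macro is wrapped in do { } while (0) so that it behaves as a single statement inside an if/else. The range check in icr_bit_for_vector() only reflects the limits of a 32-bit shift, not a documented hardware constraint.

```c
#include <assert.h>

typedef struct {                 /* same field order as PIO_str in the text */
    unsigned int poct;
    unsigned int pin;
    unsigned int pout;
    unsigned int poen;
} pio_regs_t;

pio_regs_t fake_pio;             /* hypothetical: plain memory, not the real port */

/* Same logic as PIO_out_m, wrapped so it expands to one statement. */
#define PIO_OUT_SAFE(regs, port, val)                       \
    do {                                                    \
        unsigned int tmp_ = (regs)->pout;                   \
        tmp_ &= ~(1u << (port));                            \
        tmp_ |= (((unsigned int)(val) & 1u) << (port));     \
        (regs)->pout = tmp_;                                \
    } while (0)

/* Bit mask that ICR_clear_m(vec) would OR into the ICR. */
unsigned int icr_bit_for_vector(int vec)
{
    assert(vec >= 220 && vec <= 251);   /* keeps the shift count within 0..31 */
    return 1u << (251 - vec);
}
```

For the vector used in this example, icr_bit_for_vector(228) selects bit 23 of the register.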
Using the _Interrupt keyword, first and second stage interrupt handlers are
defined below for the PIO15 interrupt. No real work is performed by the example
second stage handler, but it does demonstrate how a full–C–context handler can be
reached. The second stage handler does not qualify as a Freeze mode interrupt
handler because it is not a leaf routine.
int PIO15_sig; /* signal number allocated to second stage */
_Interrupt PIO15_handler() /* first stage interrupt handler */
{
    ICR_clear_m(228); /* clear interrupt request */
    PIO_out_m(0,0); /* clear PIO0 port bit */
    sig_sig=0x80000000|PIO15_sig; /* request second stage */
}
_Interrupt sig_PIO15(sig_number) /* second stage handler */
int sig_number;
{
    printf("Running PIO15 signal handler\n");
}
Before the interrupt mechanism can be put to work, the various support handlers
must be installed as shown below. The program is being developed with the
MiniMON29K DebugCore and this requires that the OS–boot support interrupt
handlers be preserved before the new interrupt handlers are added. The PIO support
macros are then used to establish the correct PIO port operation before an
interrupt is generated by forcing a 0–1 level transition on PIO pin PIO0.
int main()
{
void (*V_minimon)();
V_minimon=(void(*)())_settrap(19,intr3); /* INTR3 */
_settrap(220+24,V_minimon); /* MiniMON support interrupts */
_settrap(220+25,V_minimon); /* see section 2.5.5 */
_settrap(220+26,V_minimon);
_settrap(218,_disable); /* signal dispatcher support */
_settrap(217,_enable); /* see section 2.5.6 */
dispatcher_sig=7; /* signal number for dispatcher */
signal(dispatcher_sig,sig_dispatcher);
/* application interrupt handlers for I/O port PIO15 */
PIO15_sig = interrupt(228,sig_PIO15); /* second stage */
if(interrupt(228,PIO15_handler)) /* first stage */
printf("ERROR installing Freeze mode handler\n");
/* configure PIO port operation */
PIO_p–>poct=0; /* clear control register */
PIO_enable_m(0); /* enable PIO0 output */
PIO_rising_m(15); /* PIO15 edge sensitive */
PIO_out_m(0,0); /* PIO0 = 0 */
PIO_out_m(0,1); /* generate an interrupt */
}
Users of the High C 29K tool chain could test the interrupt handling mechanism
without first building the necessary hardware by asserting the assigned trap number
as shown below.
_ASM(" asneq 228,gr1,gr1"); /* test interrupt mechanism */
2.6 SUPPORT UTILITY PROGRAMS
There are a number of important utility programs available to the software developer.
These tools are generally available on all development platforms and are
shared by different tool vendors. Most of the programs operate on object files produced
by the assembler or linker. All linkable object files and executable files are
maintained in AMD Common Object File Format (COFF). This standard is very
closely based on the AT&T standard used with UNIX System V. Readers wishing to
know more about the details of the format may consult the High C 29K documentation
or the AT&T Programmer’s Guide for UNIX System V. The coff.h include file,
found on most tool distributions, describes the C language data structures used by the
COFF standard, often described as the COFF wrappers.
2.6.1 Examining Object Files (Type .o and a.out)
nm29
The nm29 utility program is used to examine the symbol table contained in a
binary COFF file produced by the compiler, assembler or linker. The format is
very much like the UNIX nm utility. Originally nm29 was written to supply
symbol table information to the munch29 utility in support of the AT&T C++
cfront program. A number of command line options have been added to enable
additional information to be printed, such as symbol type and section type.
One useful way to use nm29 is to pipe the output to the sort utility, for example:
“nm29 a.out | sort | more”; each symbol is printed preceded by its value. The sort
utility arranges for symbol table entries to be presented in ascending value.
Since most symbols are associated with address labels, this is a useful way to
locate an address relative to its nearest address labels.
munch29
This utility is used with the AT&T C++ preprocessor. This program is known as
cfront and converts C++ programs into C. After the C++ program has been
converted and linked with other modules and libraries, it is examined with
nm29 to determine the names of any static constructor and destructor functions.
The C++ translator builds these functions as necessary and tags their names with
predefined character sequences. The output from nm29 is passed to munch29
which looks for constructor and destructor names. If found, munch29 builds C
procedures which call all the identified object constructors and destructors.
Because the constructor functions must execute before the application main()
program, the original program is relinked with the constructor procedures being
called before main(). The main() entry is replaced with _main(). This also
enables the call to destructor procedures to be made in _main() when main()
returns.
Because G++ is now available for C++ code development (note, G++ is
incorporated into the GCC compiler), there is little use being made of the AT&T
cfront preprocessor. Additionally, MRI and Metaware are expected to shortly
have commercial C++ compilers available.
rdcoff
The rdcoff utility is only available to purchasers of the High C 29K product.
This utility prints the contents of a COFF conforming object file. Each COFF
file section is presented in an appropriate format. For example, text sections are
disassembled. If the symbol table has not been stripped from the COFF file, then
symbol values are shown. The utility is useful for examining COFF header
information, such as the text and data region start addresses. Those using GNU
tools can use the coff and objdump utilities to obtain this information.
coff
This utility is a shorthand way of examining COFF files. It reports a summary of
COFF header information, followed by similar reports for each of the sections
found in the object file. The utility is useful for quickly checking the link
mapping of a.out type files; especially when a project is using a number of
different 29K target systems which have different memory system layouts,
requiring different program linkage.
objdump
This utility is supplied with the GNU tool chain. It can be used to examine
selected parts of object files. It has an array of command line options which are
compatible with the UNIX System V utility of the same name. In a similar way
to the rdcoff utility it attempts to format selected information in a meaningful
way.
swaf
This utility is used to produce a General–Purpose ASCII (GPA) symbols file for
use with Hewlett–Packard’s B3740A Software Analyzer tool. This tool enables
a 16500B card cage along with a selection of logic analyzer cards to support
high level software debugging. The swaf utility builds a GPA symbols file from
information extracted from a linked COFF file. When the GPA file is loaded into
the analyzer it is possible to display address values in symbol format rather than,
say, hex based integers. Via a remote computer, the HP16500B can be used to
support execution trace at source level.
mksym
This utility is required to build symbol table information for the UDB debugger.
The UDB debugger does not directly operate with COFF symbol information. A
mksym command is typically placed in a makefile; after the 29K program has
been linked a new symbol table file should be built.
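The nm29 entry above suggests sorting the symbol table by value in order to locate an address relative to its nearest labels. The same lookup can be sketched in C. The code below is an illustration only: the sym_t type and the nearest_label() function are hypothetical and do not correspond to any structure in coff.h. It assumes the symbols have already been sorted into ascending value order, as the sort utility would arrange.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long value;     /* symbol value (normally an address) */
    const char   *name;      /* symbol name */
} sym_t;

/* Return the entry with the largest value <= addr, or NULL if addr lies
 * below every symbol. The table must be sorted by ascending value. */
const sym_t *nearest_label(const sym_t *tab, size_t n, unsigned long addr)
{
    size_t lo = 0, hi = n;                 /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* spot check of the sort precondition */
        assert(mid == 0 || tab[mid - 1].value <= tab[mid].value);
        if (tab[mid].value <= addr)
            lo = mid + 1;                  /* candidate is at mid or above */
        else
            hi = mid;                      /* candidate is below mid */
    }
    return lo == 0 ? NULL : &tab[lo - 1];
}
```

The offset of the address from the returned label is simply addr minus the label's value, which is the figure usually wanted when interpreting a trap address.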
2.6.2 Modifying Object Files
cvcoff
The COFF specification states that object file information is maintained in the
endian of the host processor. This need not be the endian of the target 29K
processor. As described in Chapter 1, 29K processors can run in big– or
little–endian but are almost exclusively used in big–endian format. Endian
refers to which byte position in a word is considered the byte of lowest address.
With big–endian, bytes further left have lower addresses. Machines such as
VAXs and IBM–PCs operate with little–endian; and machines from SUN and
HP tend to operate with big–endian.
What this means to the 29K software developer is that COFF files on, say, a PC
will have little–endian COFF wrappers. And COFF files on, say, a SUN
machine will have big–endian wrappers, regardless of the endianness of the 29K
target code. When object files or libraries containing object files are moved
between host machines of different endianness, the cvcoff utility must be used
to convert the endianness of the COFF wrapper information. The cvcoff utility
can also be used to check the endianness of an object file. Most utility programs
and software development tools expect to operate on object files which are in
host endian; however, there are a few tools which can operate on COFF files of
either host endianness. In practice this reduces the need to use the cvcoff utility.
strpcoff
This utility can be used to remove unnecessary information from a COFF file.
When programs are compiled with the “–g” option, additional symbol
information is added to the COFF file. The strpcoff utility can be used to
remove this information and any other details such as relocation data and
line–number pointers. Typically linkers have an option to automatically strip
this information after linking. (ld29 has the “–s” option.) The COFF file header
information needed for loading a program is not stripped.
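The wrapper conversion described for cvcoff comes down to byte swapping the multi-byte header fields. The sketch below shows the basic operations in portable C. It is not taken from cvcoff: the function names are hypothetical, and no COFF field offsets or magic numbers are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 on a little-endian host, 0 on a big-endian host. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Exchange the two bytes of a 16-bit field. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32-bit field. */
uint32_t swap32(uint32_t v)
{
    return (v << 24) | ((v & 0x0000ff00u) << 8) |
           ((v >> 8) & 0x0000ff00u) | (v >> 24);
}

/* Read a 32-bit field from a file image stored with the given endianness,
 * independent of the endianness of the host running this code. */
uint32_t read_field32(const unsigned char *p, int file_is_little_endian)
{
    assert(p != 0);
    if (file_is_little_endian)
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

A tool written with read_field32()-style accessors can operate on wrappers of either endianness, which is the property attributed above to the few tools that do not require conversion.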
2.6.3 Getting a Program into ROM
After a program has been finally linked, and possibly adjusted to deal with any
data initialization problems (see section 2.3.6), it must be transferred into ROM de-
vices. This is part of the typical software development cycle for embedded processor
products. A number of manufacturers make equipment for programming PROM de-
vices. They normally operate with data files which must be appropriately formatted.
Tektronix Hex format and Motorola S3 Records are two of the commonly used file
formats. The coff2hex utility can be used to convert the COFF formatted executable
file produced by the linker into a new file which is correctly formatted for the selected
PROM programmer. If more than one PROM is required to store the program,
coff2hex can be instructed to divide the COFF data into a set of appropriate files.
Alternatively, this task can be left to more sophisticated programming equipment. The
utility has a number of command line options: the width and size of PROM devices
can be chosen, or specific products can be selected by manufacturer part
number.
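To give a feel for what a PROM programmer data file contains, the following sketch formats a single Motorola S3 record. It follows the general S-record convention (a byte count, a 32-bit address, the data, and a one's complement checksum) and is not derived from coff2hex, whose record lengths and options may differ. The function name s3_record() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format one S3 record (32-bit address) holding len data bytes.
 * out must have room for 2 + 2 + 8 + 2*len + 2 characters plus a terminator.
 * Returns the number of characters written, excluding the terminator. */
int s3_record(char *out, uint32_t addr, const unsigned char *data, unsigned len)
{
    unsigned count = 4 + len + 1;       /* address + data + checksum bytes */
    unsigned sum = count;               /* checksum covers count, address, data */
    unsigned i;
    int n;

    assert(len <= 250);                 /* count must fit in one byte */
    n = sprintf(out, "S3%02X%08lX", count, (unsigned long)addr);
    for (i = 0; i < 4; i++)
        sum += (addr >> (8 * i)) & 0xffu;
    for (i = 0; i < len; i++) {
        n += sprintf(out + n, "%02X", (unsigned)data[i]);
        sum += data[i];
    }
    /* checksum is the one's complement of the low byte of the sum */
    n += sprintf(out + n, "%02X", (~sum) & 0xffu);
    return n;
}
```

Four data bytes 01 02 03 04 at address 0 produce the record S3090000000001020304EC.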
Chapter 3
Assembly Language Programming
Most developers of software for the 29K family will use a high level language,
such as C, for the majority of code development. This makes sense for a number of
reasons: Using a high level language enables a different processor to be selected at
some future date. The code, if written in a portable way, need only be recompiled for
the new target processor. The ever increasing size of embedded software projects
makes the higher productivity achievable with a high level language attractive. And
additionally, the 29K family has a RISC instruction set which can be efficiently used
by a high level language compiler [Mann et al 1991b].
However, the software developer must resort to the use of assembly code pro-
gramming in a number of special cases. Because of the relentless efficiency of the
current C language compilers for the 29K, it is difficult for a programmer to out–per-
form the code generating abilities of a compiler for any reasonably sized program.
For this reason it is best to limit the use of assembly code as much as possible. Some
of the support tasks which do require assembly coding are:
Low–level support routines for interrupts and traps (see Chapter 4).
Operating system support services such as system calls and application–task
context switching (see Chapter 5). Also, taking control of the processor during
the power–up and initialization sequence.
Memory Management Unit trapware (see Chapter 6).
Floating–point and complex integer operation trapware, where the 29K family
member does not support the operation directly in hardware.
High performance versions of critical routines. In some cases it may be possible
to enhance a routine’s performance by implementing assembly code short–cuts
not identified by a compiler.
This chapter deals with aspects of assembly level programming. There are some
differences between 29K family members, particularly in the area of on–chip periph-
erals for microcontrollers. The chapter does not go into details peculiar to individual
family members; for that it is best to study the processor User’s Manual.
The material covered is relevant to all 29K family members.
3.1 INSTRUCTION SET
The Am29000 microprocessor implements 112 instructions. All hardware implemented
instructions execute in a single cycle, except for IRET, IRETINV,
LOADM and STOREM. Instruction format was discussed in section 1.11. All
instructions have a fixed 32–bit format, with an 8–bit opcode field and three 8–bit operand
fields. Field–C specifies the result operand register (DEST); field–A and field–B
supply the source operands (SRCA and SRCB). Most instructions operate on data
held in global or local registers, and there are no complex addressing modes sup-
ported. Field–B, or field–B and field–A combined, can be used to provide 8–bit or
16–bit immediate data for instructions. Access to external memory can only be per-
formed with the LOAD[M] and STORE[M] instructions. There are a number of
instructions, mostly used by operating system code, for accessing the processor spe-
cial registers.
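The fixed format makes instruction words easy to take apart in software, for example in a simple disassembler or trace tool. The sketch below splits a 32-bit word into its four 8-bit fields. The bit positions used (opcode in the most significant byte, followed by field–C, field–A and field–B) are an assumption made for this illustration and should be checked against section 1.11. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t opcode;   /* operation code */
    uint8_t c;        /* field-C: DEST register */
    uint8_t a;        /* field-A: SRCA register */
    uint8_t b;        /* field-B: SRCB register or 8-bit immediate */
} insn_fields_t;

/* Assumed layout: opcode in bits 31..24, then C, A and B in descending order. */
insn_fields_t decode_fields(uint32_t word)
{
    insn_fields_t f;

    f.opcode = (uint8_t)(word >> 24);
    f.c      = (uint8_t)(word >> 16);
    f.a      = (uint8_t)(word >> 8);
    f.b      = (uint8_t)word;
    return f;
}

/* Reassemble the four fields into an instruction word. */
uint32_t encode_fields(insn_fields_t f)
{
    return ((uint32_t)f.opcode << 24) | ((uint32_t)f.c << 16) |
           ((uint32_t)f.a << 8) | (uint32_t)f.b;
}

/* Self-check: decoding followed by encoding must reproduce the word. */
void check_roundtrip(uint32_t word)
{
    assert(encode_fields(decode_fields(word)) == word);
}
```

Because every instruction is the same width and the fields never move, no length decoding step is needed before the fields can be extracted.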
The following sections deal with the different instruction classes. Some of the
instructions described are not directly supported by all members of the 29K family.
In particular, many of the floating–point instructions are only directly executed by
the Am29050 processor. If an instruction is not directly supported by the processor
hardware, then a trap is generated during instruction execution. In this case, the oper-
ating system uses trapware to implement the instruction’s operation in software.
Emulating nonimplemented instructions in software means some instruction execu-
tion speeds are reduced, but the instruction set is compatible across all family mem-
bers.
3.1.1 Integer Arithmetic
The Integer Arithmetic instructions perform add, subtract, multiply, and divide
operations on word–length (32–bit) integers. All instructions in this class set the
ALU Status Register. The integer arithmetic instructions are shown in Tables 3–1 and
3–2.
The MULTIPLU, MULTIPLY, DIVIDE, and DIVIDU instructions are not im-
plemented directly on most 29K family members, but are supported by traps. To de-
termine if your processor directly supports these instructions, check with the proces-
sor User’s Manual or the tables in Chapter 1. The Am29050 microprocessor supports
the multiply instructions directly but not the divide instructions.
Table 3-1. Integer Arithmetic Instructions
Mnemonic   Operation Description
ADD        DEST <– SRCA + SRCB
ADDS       DEST <– SRCA + SRCB
           IF signed overflow THEN Trap (Out Of Range)
ADDU       DEST <– SRCA + SRCB
           IF unsigned overflow THEN Trap (Out Of Range)
ADDC       DEST <– SRCA + SRCB + C (from ALU)
ADDCS      DEST <– SRCA + SRCB + C (from ALU)
           IF signed overflow THEN Trap (Out Of Range)
ADDCU      DEST <– SRCA + SRCB + C (from ALU)
           IF unsigned overflow THEN Trap (Out Of Range)
SUB        DEST <– SRCA – SRCB
SUBS       DEST <– SRCA – SRCB
           IF signed overflow THEN Trap (Out Of Range)
SUBU       DEST <– SRCA – SRCB
           IF unsigned underflow THEN Trap (Out Of Range)
SUBC       DEST <– SRCA – SRCB – 1 + C (from ALU)
SUBCS      DEST <– SRCA – SRCB – 1 + C (from ALU)
           IF signed overflow THEN Trap (Out Of Range)
SUBCU      DEST <– SRCA – SRCB – 1 + C (from ALU)
           IF unsigned underflow THEN Trap (Out Of Range)
SUBR       DEST <– SRCB – SRCA
SUBRS      DEST <– SRCB – SRCA
           IF signed overflow THEN Trap (Out Of Range)
SUBRU      DEST <– SRCB – SRCA
           IF unsigned underflow THEN Trap (Out Of Range)
SUBRC      DEST <– SRCB – SRCA – 1 + C (from ALU)
(continued)
Table 3-2. Integer Arithmetic Instructions (Concluded)
Mnemonic   Operation Description
SUBRCS     DEST <– SRCB – SRCA – 1 + C (from ALU)
           IF signed overflow THEN Trap (Out Of Range)
SUBRCU     DEST <– SRCB – SRCA – 1 + C (from ALU)
           IF unsigned underflow THEN Trap (Out Of Range)
MULTIPLU   Q//DEST <– SRCA * SRCB (unsigned)
MULTIPLY   Q//DEST <– SRCA * SRCB (signed)
MUL        Perform one–bit step of a multiply operation (signed)
MULL       Complete a sequence of multiply steps
MULU       Perform one–bit step of a multiply operation (unsigned)
DIVIDE     DEST <– (Q//SRCA)/SRCB (signed)
           Q <– Remainder
DIVIDU     DEST <– (Q//SRCA)/SRCB (unsigned)
           Q <– Remainder
DIV0       Initialize for a sequence of divide steps (unsigned)
DIV        Perform one–bit step of a divide operation (unsigned)
DIVL       Complete a sequence of divide steps (unsigned)
DIVREM     Generate remainder for divide operation (unsigned)
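The carry forms of the add and subtract instructions exist to support multi-word arithmetic. The sketch below models a 64-bit add and subtract built from 32-bit steps. It is a host-side illustration, not processor documentation: it assumes that the subtract instructions are evaluated as SRCA + ~SRCB + 1, so that C = 1 after a subtract means no borrow occurred, which is the reading suggested by the "– 1 + C" form in the tables. The variable alu_c and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

unsigned alu_c;   /* models the carry (C) bit of the ALU Status Register */

/* ADD: DEST <- SRCA + SRCB, carry-out recorded in C. */
uint32_t add32(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    alu_c = (r < a);
    return r;
}

/* ADDC: DEST <- SRCA + SRCB + C. */
uint32_t addc32(uint32_t a, uint32_t b)
{
    uint64_t r;

    assert(alu_c <= 1);
    r = (uint64_t)a + b + alu_c;
    alu_c = (unsigned)(r >> 32);
    return (uint32_t)r;
}

/* SUB: modeled as SRCA + ~SRCB + 1 (assumption: C = 1 means no borrow). */
uint32_t sub32(uint32_t a, uint32_t b)
{
    uint64_t r = (uint64_t)a + (uint32_t)~b + 1u;
    alu_c = (unsigned)(r >> 32);
    return (uint32_t)r;
}

/* SUBC: DEST <- SRCA - SRCB - 1 + C, modeled as SRCA + ~SRCB + C. */
uint32_t subc32(uint32_t a, uint32_t b)
{
    uint64_t r;

    assert(alu_c <= 1);
    r = (uint64_t)a + (uint32_t)~b + alu_c;
    alu_c = (unsigned)(r >> 32);
    return (uint32_t)r;
}
```

A 64-bit operation is then the low words combined with add32() or sub32(), followed by the high words combined with addc32() or subc32().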
3.1.2 Compare
The Compare instructions test for various relationships between two values.
For all Compare instructions except the CPBYTE instruction, the comparisons are
performed on word–length signed or unsigned integers. There are two types of
compare instruction. The first writes a Boolean value into the result register (selected
by the instruction DEST operand) depending on the result of the comparison. A
Boolean TRUE value is represented by a 1 in the most significant bit position. A
Boolean FALSE is defined as a 0 in the most significant bit. The 29K uses a global or
local register to contain the comparison result rather than the ALU status register.
This offers a performance advantage as there is less conflict over access to a single
shared resource. Compare instructions are frequently followed by conditional Jump
or Call instructions which depend on the contents of the compare result register.
The second type of compare instruction incorporates a conditional test in the
same instruction cycle accomplishing the comparison. Instructions of this type,
known as Assert instructions, allow instruction execution to continue only if the
result of the comparison is TRUE. Otherwise a trap to operating system code is taken.
The trap number is supplied in the field–C (DEST) operand position of the instruction.
Trap numbers 0 to 63 are reserved for Supervisor mode program use. If an Assert
instruction with a trap number less than 64 is attempted while the processor is
operating in User mode, a protection violation trap will be taken. Note that this will occur
even if the assertion would have been TRUE. Assert instructions are used in procedure
prologue and epilogue routines to perform register stack bounds checking (see
Chapter 2). Their fast operation makes them ideal for reducing the overhead of register
stack support. They are also used as a means of requesting an operating system
support service (system call). In this case a condition known to be FALSE is asserted,
and the trap number for the system call is supplied in instruction field–C. The
Compare instructions are shown in Tables 3–3 and 3–4.
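The Boolean convention and the behavior of Assert instructions can be captured in a small C model. This is a sketch for reasoning about code, not a description of the hardware implementation; the function names and the assert_result_t type are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BOOL_TRUE  0x80000000u   /* 1 in the most significant bit */
#define BOOL_FALSE 0x00000000u

/* Compare instructions of the first type write a Boolean to DEST. */
uint32_t cpeq(uint32_t a, uint32_t b)  { return a == b ? BOOL_TRUE : BOOL_FALSE; }
uint32_t cpneq(uint32_t a, uint32_t b) { return a != b ? BOOL_TRUE : BOOL_FALSE; }

/* A conditional jump examines only the most significant bit. */
int jump_taken_if_true(uint32_t reg)
{
    return (reg & BOOL_TRUE) != 0;
}

typedef enum { AS_CONTINUE, AS_TRAP_TAKEN, AS_PROTECTION_VIOLATION } assert_result_t;

/* Model of an Assert instruction with the given trap number in field-C. */
assert_result_t assert_insn(int trap_number, uint32_t condition, int user_mode)
{
    assert(trap_number >= 0 && trap_number <= 255);   /* field-C is 8 bits wide */
    if (user_mode && trap_number < 64)
        return AS_PROTECTION_VIOLATION;   /* even if the assertion is TRUE */
    return (condition & BOOL_TRUE) ? AS_CONTINUE : AS_TRAP_TAKEN;
}
```

The model shows why "asneq 218,gr96,gr96" always traps: a register is never unequal to itself, so the asserted condition is FALSE. It also shows why a value such as 0x80000000|PIO15_sig is treated as TRUE by a conditional jump.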
The CPBYTE instruction performs four comparisons simultaneously. Each of the four bytes in the
SRCA operand is compared with the corresponding byte of the SRCB operand, and if any pair matches then
Boolean TRUE is placed in the DEST register. The instruction can be very efficiently
used when scanning character strings. In particular, the C programming language
marks the end of character strings with a 0 value. Using the CPBYTE instruction with
SRCB supplying an immediate value 0, the string length can be quickly determined.
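The string scanning technique can be modeled in C as follows. This is a sketch only: cpbyte() imitates the instruction by comparing corresponding bytes of two words, and word_strlen() is a hypothetical routine that examines four characters per step. The model assumes the string buffer is padded to a multiple of four bytes, so that whole words can be read safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOL_TRUE  0x80000000u
#define BOOL_FALSE 0x00000000u

/* Model of CPBYTE: TRUE if any byte of a equals the corresponding byte of b. */
uint32_t cpbyte(uint32_t a, uint32_t b)
{
    int shift;

    for (shift = 0; shift < 32; shift += 8)
        if (((a >> shift) & 0xffu) == ((b >> shift) & 0xffu))
            return BOOL_TRUE;
    return BOOL_FALSE;
}

/* String length found one word at a time. The buffer holding s must be
 * readable up to the next multiple of four bytes past the terminator. */
size_t word_strlen(const unsigned char *s)
{
    size_t n = 0;

    assert(s != 0);
    for (;;) {
        /* assemble a big-endian word from four consecutive characters */
        uint32_t w = ((uint32_t)s[n] << 24) | ((uint32_t)s[n + 1] << 16) |
                     ((uint32_t)s[n + 2] << 8) | (uint32_t)s[n + 3];
        if (cpbyte(w, 0) & BOOL_TRUE)   /* some byte in this word is 0 */
            break;
        n += 4;
    }
    while (s[n] != 0)                   /* locate the terminator within the word */
        n++;
    return n;
}
```

Only the final word needs a byte-by-byte search; every earlier word is dismissed with a single comparison.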
3.1.3 Logical
The Logical instructions perform a set of bit–by–bit Boolean functions on
word–length bit strings. All instructions in this class set the ALU Status Register.
These instructions are shown in Table 3-5.
3.1.4 Shift
The Shift instructions (Table 3-6) perform arithmetic and logical shifts on glob-
al and local register data. The one exception is the EXTRACT instruction which op-
erates on double–word data. When EXTRACT is used, SRCA and SRCB operand
registers are concatenated to form a 64–bit data value. This value is then shifted by
the funnel shifter by the amount specified by the Funnel Shift Count register (FC).
The high order 32–bits of the shifted result are placed in the DEST register. The fun-
nel shifter can be used to perform barrel shift and rotate operations in a single cycle.
Note, when the SRCA and SRCB operands are the same register, the 32–bit operand
is effectively rotated. The result may be written back to the same register or placed in
a different global or local register (see Figure 3-1). The funnel shifter is useful for
fixing–up unaligned memory accesses. The two memory words holding the un-