Graphics Programming Black Book-Gpbb11
Family Matters
While the x86 family is a large one, only a few members of the family (which includes the 8088, 8086, 80188, 80186, 286, 386SX, 386DX, numerous permutations of the 486, and now the Pentium) really matter.
The 8088 is now all but extinct in the PC arena. The 8086 was used fairly widely for a while, but has now all but disappeared. The 80186 and 80188 never really caught on for use in PCs and don't require further discussion.
That leaves us with the high-end chips: the 286, the 386SX, the 386, the 486, and the Pentium. At this writing, the 386SX is fast going the way of the 8088; people are realizing that its relatively small cost advantage over the 386 isn't enough to offset its relatively large performance disadvantage. After all, the 386SX suffers from the same debilitating problem that looms over the 8088: a too-small bus. Internally, the 386SX is a 32-bit processor, but externally, it's a 16-bit processor, a non-optimal architecture, especially for 32-bit code.
I'm not going to discuss the 386SX in detail. If you do find yourself programming for the 386SX, follow the same general rules you should follow for the 8088: use short instructions, use the registers as heavily as possible, and don't branch. In other words, avoid memory, since the 386SX is by definition better at processing data internally than it is at accessing memory.
The 486 is a world unto itself for the purposes of optimization, and the Pentium is a universe unto itself. We'll treat them separately in later chapters.
This leaves us with just two processors: the 286 and the 386. Each was the PC standard in its day. The 286 is no longer used in new systems, but there are millions of 286-based systems still in daily use. The 386 is still being used in new systems, although it's on the downhill leg of its lifespan, and it is in even wider use than the 286. The future clearly belongs to the 486 and Pentium, but the 286 and 386 are still very much a part of the present-day landscape.
208 Chapter 11
can't be set to arbitrary values. That means that segments can't be used for temporary storage or as part of a fast indivisible 32-bit load from memory, as in:

        les     ax,dword ptr [LongVar]
        mov     dx,es
Protected mode uses those altered segment registers to offer access to a great deal more memory than real mode: The 286 supports 16 megabytes of memory, while the 386 supports 4 gigabytes (4K megabytes) of physical memory and 64 terabytes (64K gigabytes!) of virtual memory.
In protected mode, your programs generally run under an operating system (OS/2, Unix, Windows NT, or the like) that exerts much more control over the computer than does MS-DOS. Protected mode operating systems can generally run multiple programs simultaneously, and the performance of any one program may depend far less on code quality than on how efficiently the program uses operating system services and how often and under what circumstances the operating system preempts the program. Protected mode programs are often mostly collections of operating system calls, and the performance of whatever code isn't operating-system oriented may depend primarily on how large a time slice the operating system gives that code to run in.
In short, taken as a whole, protected mode programming is a different kettle of fish altogether from what I've been describing in this book. There's certainly a knack to optimizing specifically for protected mode under a given operating system...but it's not what we've been learning, and now is not the time to pursue it further. In general, though, the optimization strategies discussed in this book still hold true in protected mode; it's just issues specific to protected mode or a particular operating system that we won't discuss.
that offer more performance for the dollar, but less performance overall, than zero-wait-state memory. (It is possible to build zero-wait-state systems for the 286 and 386; it's just so expensive that it's rarely done.)
The IBM AT and true compatibles use one-wait-state memory (some AT clones use zero-wait-state memory, but such clones are less common than one-wait-state AT clones). The 386 systems use a wide variety of memory systems, including high-speed caches, interleaved memory, and static-column RAM, that insert anywhere from 0 to about 5 wait states (and many more if 8 or 16-bit memory expansion cards are used); the exact number of wait states inserted at any given time depends on the interaction between the code being executed and the memory system it's running on.

The performance of most 386 memory systems can vary greatly from one memory access to another, depending on factors such as what data happens to be in the cache and which interleaved bank and/or RAM column was accessed last.
The many memory systems in use make it impossible for us to optimize for 286/386 computers with the precision that's possible on the 8088. Instead, we must write code that runs reasonably well under the varying conditions found in the 286/386 arena.
The wait states that occur on most accesses to system memory in 286 and 386 computers mean that nearly every access to system memory (memory in DOS's normal 640K memory area) is slowed down. (Accesses in computers with high-speed caches may be wait-state-free if the desired data is already in the cache, but will certainly encounter wait states if the data isn't cached; this phenomenon produces highly variable instruction execution times.) While this is our first encounter with system memory wait states, we have run into a wait-state cycle-eater before: the display adapter cycle-eater, which we discussed along with the other 8088 cycle-eaters way back in Chapter 4. System memory generally has fewer wait states per access than display memory. However, system memory is also accessed far more often than display memory, so system memory wait states hurt plenty, and the place they hurt most is instruction fetching.
Consider this: The 286 can store an immediate value to memory, as in MOV [WordVar],0, in just 3 cycles. However, that instruction is 6 bytes long. The 286 is capable of fetching 1 word every 2 cycles; however, the one-wait-state architecture of the AT stretches that to 3 cycles. Consequently, nine cycles are needed to fetch the six instruction bytes. On top of that, 3 cycles are needed to write to memory, bringing the total memory access time to 12 cycles. On balance, memory access time (especially instruction prefetching) greatly exceeds execution time, to the extent that this particular instruction can take up to four times as long to run as it does to execute in the Execution Unit.
And that, my friend, is unmistakably the prefetch queue cycle-eater. I might add that the prefetch queue cycle-eater is in rare good form in the above example.

LISTING 11.1 L11-1.ASM

; Measures the performance of an immediate move to
; memory, in order to demonstrate that the prefetch
; queue cycle-eater is alive and well on the AT.
;
        jmp     Skip
        even            ;keep the word-sized variable word-aligned
WordVar dw      0
Skip:
        call    ZTimerOn
        rept    1000
        mov     [WordVar],0
        endm
        call    ZTimerOff
What does this mean? It means that, practically speaking, the 286 as used in the AT doesn't have a 16-bit bus. From a performance perspective, the 286 in an AT has two-thirds of a 16-bit bus (a 10.7-bit bus?), since every bus access on an AT takes 50 percent longer than it should. A 286 running at 10 MHz should be able to access memory at a maximum rate of 1 word every 200 ns; in a 10 MHz AT, however, that rate is reduced to 1 word every 300 ns by the one-wait-state memory.
In short, a close relative of our old friend the 8-bit bus cycle-eater, the system memory wait state cycle-eater, haunts us still on all but zero-wait-state 286 and 386 computers, and that means that the prefetch queue cycle-eater is alive and well. (The system memory wait state cycle-eater isn't really a new cycle-eater, but rather a variant of the general wait state cycle-eater, of which the display adapter cycle-eater is yet another variant.) While the 286 in the AT can fetch instructions much faster than can the 8088 in the PC, it can execute those instructions faster still.
The picture is less clear in the 386 world since there are so many different memory architectures, but similar problems can occur in any computer built around a 286 or 386. The prefetch queue cycle-eater is even a factor, albeit a lesser one, on zero-wait-state machines, both because branching empties the queue and because some instructions can outrun even zero-wait-state instruction fetching. (Listing 11.1 would take at least 8 cycles per instruction on a zero-wait-state AT, 5 cycles longer than the official execution time.)
To summarize:
• Memory-accessing instructions don't run at their official speeds on non-zero-wait-state 286/386 computers.
• The prefetch queue cycle-eater reduces performance on 286/386 computers, particularly when non-zero-wait-state memory is used.
• Branches often execute at less than their rated speeds on the 286 and 386 since the prefetch queue is emptied.
• The extent to which the prefetch queue and wait states affect performance varies from one 286/386 computer to another, making precise optimization impossible.
What's to be learned from all this? Several things:
• Keep your instructions short.
• Keep it in the registers; avoid memory, since memory generally can't keep up with the processor.
• Don't jump.
Of course, those are exactly the rules that apply to 8088 optimization as well. Isn't it convenient that the same general rules apply across the board?
Data Alignment
Thanks to its 16-bit bus, the 286 can access word-sized memory variables just as fast as byte-sized variables. There's a catch, however: That's only true for word-sized variables that start at even addresses. When the 286 is asked to perform a word-sized access starting at an odd address, it actually performs two separate accesses, each of which fetches 1 byte, just as the 8088 does for all word-sized accesses.
Figure 11.1 illustrates this phenomenon. The conversion of word-sized accesses to odd addresses into double byte-sized accesses is transparent to memory-accessing instructions; all any instruction knows is that the requested word has been accessed, no matter whether 1 word-sized access or 2 byte-sized accesses were required to accomplish it.
The penalty for performing a word-sized access starting at an odd address is easy to calculate: Two accesses take twice as long as one access.
That, in a nutshell, is the data alignment cycle-eater, the one new cycle-eater of the 286 and 386. (The data alignment cycle-eater is a close relative of the 8088's 8-bit bus cycle-eater, but since it behaves differently, occurring only at odd addresses, and is avoided with a different workaround, we'll consider it to be a new cycle-eater.)
[Figure 11.1: The 80286 reads the word value 8382h at address 1FFFFh with two byte-sized accesses, since that word value starts at an odd address.]
The way to deal with the data alignment cycle-eater is straightforward: Don't perform word-sized accesses to odd addresses on the 286 if you can help it. The easiest way to avoid the data alignment cycle-eater is to place the directive EVEN before each of your word-sized variables. EVEN forces the offset of the next byte assembled to be even by inserting a NOP if the current offset is odd; consequently, you can ensure that any word-sized variable can be accessed efficiently by the 286 simply by preceding it with EVEN.
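For instance, a word-sized variable can be forced onto an even boundary like this (a minimal sketch; ByteVar and WordVar are hypothetical names):

ByteVar db      0               ;1 byte; may leave the next offset odd
        even                    ;inserts a NOP if the offset is odd
WordVar dw      0               ;now starts at an even address, so the
                                ; 286 can read or write it with a single
                                ; word-sized memory access

Without the EVEN, WordVar could start at an odd address, and every word-sized access to it would then take two memory accesses instead of one.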
Listing 11.2, which accesses memory a word at a time with each word starting at an odd address, runs on a 10 MHz AT clone in 1.27 µs per repetition of MOVSW, or 0.64 µs per word-sized memory access. That's 6-plus cycles per word-sized access, which breaks down to two separate memory accesses (3 cycles to access the high byte of each word and 3 cycles to access the low byte of each word, the inevitable result of non-word-aligned word-sized memory accesses) plus a bit extra for DRAM refresh.
LISTING 11.2 L11-2.ASM

; *** Listing 11.2 ***
; Measures the performance of accesses to word-sized
; variables that start at odd addresses (are not
; word-aligned).

Skip:
        push    ds
        pop     es
        mov     si,1    ;source and destination are the same
        mov     di,si   ; and both are not word-aligned
        mov     cx,1000 ;move 1000 words
        cld
        call    ZTimerOn
        rep     movsw
        call    ZTimerOff
On the other hand, Listing 11.3, which is exactly the same as Listing 11.2 save that the memory accesses are word-aligned (start at even addresses), runs in 0.64 µs per repetition of MOVSW, or 0.32 µs per word-sized memory access. That's 3 cycles per word-sized access, exactly twice as fast as the non-word-aligned accesses of Listing 11.2, just as we predicted.
Skip:
        push    ds
        pop     es
        sub     si,si   ;source and destination are the same
        mov     di,si   ; and both are word-aligned
        mov     cx,1000 ;move 1000 words
        cld
        call    ZTimerOn
        rep     movsw
        call    ZTimerOff
Code Alignment
Lack of word alignment can also interfere with instruction fetching on the 286, although not to the extent that it interferes with access to word-sized memory variables.
[Figure: On a branch to 20101, only one useful instruction byte is fetched by the first instruction fetch after the branch, since the other byte in the word-aligned word that covers address 20101 precedes the branch destination and is therefore of no use as an instruction byte after the branch.]
Now, this code should run in, say, about 12 cycles per loop at most. Instead, it took over 14 cycles per loop, an execution time that I could not explain in any way. After rolling it around in my head for a while, I took a look at the code under a debugger...and the answer leaped out at me. The loop began at an odd address! That meant that two instruction fetches were required each time through the loop: one to get the opcode byte of the LOOP instruction, which resided at the end of one word-aligned word, and another to get the displacement byte, which resided at the start of the next word-aligned word.
One simple change brought the execution time down to a reasonable 12.5 cycles per loop:

        mov     cx,1000
        call    ZTimerOn
        even
LoopTop:
        loop    LoopTop
        call    ZTimerOff
I recommend that you only go out of your way to word-align the start offsets of your subroutines, as in:

        even
FindChar proc near
In my experience, this simple practice is the one form of code alignment that consistently provides a reasonable return for bytes and effort expended, although sometimes it also pays to word-align tight, time-critical loops.
improved the performance of a complex application on the AT simply by forcing the Forth interpreter to maintain an even stack pointer at all times.
An interesting corollary to this rule is that you shouldn't INC SP twice to add 2, even though that takes fewer bytes than ADD SP,2. The stack pointer is odd between the first and second INC, so any interrupt occurring between the two instructions will be serviced more slowly than it normally would. The same goes for decrementing twice; use SUB SP,2 instead.
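In code, the difference looks like this (a minimal sketch; the point is the window between the two one-byte instructions):

;risky: SP is odd between the two INCs, so an interrupt
; arriving in that window pushes its stack frame to odd
; addresses and is serviced more slowly than normal
;       inc     sp
;       inc     sp
;safe: SP goes from one even value straight to another
        add     sp,2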
And that explains the second viewpoint expressed above regarding the display adapter cycle-eater vis-a-vis the 286 and 386. The display adapter cycle-eater, as measured in cycles lost to wait states, is indeed much worse on AT-class computers than it is on the PC, and it's worse still on more powerful computers.
How bad is the display adapter cycle-eater on an AT? It's this bad: Based on my (not inconsiderable) experience in timing display adapter access, I've found that the display adapter cycle-eater can slow an AT, or even a 386 computer, to near-PC speeds when display memory is accessed.
I know that's hard to believe, but the display adapter cycle-eater gives out just so many display memory accesses in a given time, and no more, no matter how fast the processor is. In fact, the faster the processor, the more the display adapter cycle-eater hurts the performance of instructions that access display memory. The display adapter cycle-eater is not only still present in 286/386 computers, it's worse than ever.
What can we do about this new, more virulent form of the display adapter cycle-eater? The workaround is the same as it was on the PC: Access display memory as little as you possibly can.
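One common way to follow that advice is to compose each frame in a system-memory buffer and copy the finished result to display memory in a single pass, so that every display byte is touched exactly once. A sketch (Buffer, ScreenSeg, and the 2,000-word text-page size are assumptions for illustration):

        mov     si,offset Buffer ;system-memory working copy of the screen
        mov     ax,ScreenSeg    ;display memory segment (assumed)
        mov     es,ax
        sub     di,di
        mov     cx,2000         ;words in one 80x25 text page
        cld
        rep     movsw           ;one burst through display memory; the
                                ; display adapter's wait states are paid
                                ; only once per word

All the drawing and redrawing happens at full system-memory speed in Buffer; only the final copy pays the display adapter cycle-eater's toll.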
Note well: Those tricks don't necessarily work with system software such as Windows, so I'd recommend against using them. If you want 4-gigabyte segments, use a 32-bit environment such as Win32.
Detailed Optimization
While the major 8088 optimization rules hold true on computers built around the 286 and 386, many of the instruction-specific optimizations no longer hold, for the execution times of most instructions are quite different on the 286 and 386 than on the 8088. We have already seen one such example of the sometimes vast difference between 8088 and 286/386 instruction execution times: MOV [WordVar],0, which has an Execution Unit execution time of 20 cycles on the 8088, has an EU execution time of just 3 cycles on the 286 and 2 cycles on the 386.
In fact, the performance of virtually all memory-accessing instructions has been improved enormously on the 286 and 386. The key to this improvement is the near elimination of effective address (EA) calculation time. Where an 8088 takes from 5 to 12 cycles to calculate an EA, a 286 or 386 usually takes no time whatsoever to perform the calculation. If a base+index+displacement addressing mode, such as MOV AX,[WordArray+BX+SI], is used on a 286 or 386, 1 cycle is taken to perform the EA calculation, but that's both the worst case and the only case in which there's any EA overhead at all.
The elimination of EA calculation time means that the EU execution time of memory-addressing instructions is much closer to the EU execution time of register-only instructions. For instance, on the 8088 ADD [WordVar],100H is a 31-cycle instruction, while ADD DX,100H is a 4-cycle instruction, a ratio of nearly 8 to 1. By contrast,
        call    ZTimerOn
        rept    1000
        add     dx,100h
        endm
        call    ZTimerOff
        jmp     Skip
        even
WordVar dw      0
Skip:
        call    ZTimerOn
        rept    1000
        add     [WordVar],100h
        endm
        call    ZTimerOff
What's going on? Simply this: Instruction fetching is controlling overall execution time on both processors. Both the 8088 in a PC and the 286 in an AT can execute the bytes of the instructions in Listings 11.4 and 11.5 faster than they can be fetched. Since the instructions are exactly the same lengths on both processors, it stands to reason that the ratio of the overall execution times of the instructions should be the same on both processors as well. Instruction length controls execution time, and the instruction lengths are the same; therefore the ratios of the execution times are the same.
The 286 can both fetch and execute instruction bytes faster than the 8088 can, so code executes much faster on the 286; nonetheless, because the 286 can also execute those instruction bytes much faster than it can fetch them, overall performance is still largely determined by the size of the instructions.
Is this always the case? No. When the prefetch queue is full, memory-accessing instructions on the 286 and 386 are much faster (relative to register-only instructions) than they are on the 8088. Given the system wait states prevalent on 286 and 386 computers, however, the prefetch queue is likely to be empty quite a bit, especially when code consisting of instructions with short EU execution times is executed. Of course, that's just the sort of code we're likely to write when we're optimizing, so the performance of high-speed code is more likely to be controlled by instruction size than by EU execution time on most 286 and 386 computers, just as it is on the PC.
All of which is just a way of saying that faster memory access and EA calculation notwithstanding, it's just as desirable to keep instructions short and memory accesses to a minimum on the 286 and 386 as it is on the 8088. And the way to do that is to use the registers as heavily as possible, use string instructions, use short forms of instructions, and the like.
The more things change, the more they remain the same....
ending up at the instruction after the workaround code with the flags popped. That's just the result that would have occurred had we executed POPF, with the bonus that no interrupts can accidentally occur when the Interrupt flag is 0 both before and after the pop.
How can we push the segment:offset of the next instruction? Well, finding the offset of the next instruction by performing a near call to that instruction is a tried-and-true trick. We can do something similar here, but in this case we need a far call, since IRET requires both a segment and an offset. We'll also branch backward, so that the return address the call pushes is the address of the instruction immediately following the workaround code.

[Figure: the stack and memory contents as the emulated POPF executes.]
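Assembled into a macro, the full emulation might look like this. This is a sketch reconstructed from the description in the text (a far call to an IRET, three branches in all, no registers altered); the label names are illustrative:

EMULATE_POPF    macro
        local   popfiret,popfskip
        jmp     short popfskip  ;skip over the IRET
popfiret:
        iret                    ;pops the return offset, the pushed CS,
                                ; and, finally, the flags
popfskip:
        push    cs              ;push the segment of the next instruction
        call    near ptr popfiret
                                ;push the offset of the instruction after
                                ; the macro, then branch back to the IRET,
                                ; which pops the flags and returns there
        endm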
By the way, the flags can be popped much more quickly if you're willing to alter a register in the process. For example, the following macro emulates POPF with just one branch, but wipes out AX:

EMULATE_POPF_TRASH_AX   macro
        push    cs
        mov     ax,offset $+5
        push    ax
        iret
        endm
EMULATE_POPF    macro
        push    cs
        push    offset $+4
        iret
        endm

which is shorter still, alters no registers, and branches just once. (Of course, this version of EMULATE_POPF won't work on an 8088, which lacks the push-immediate instruction.)
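A hypothetical use site, with the macro standing in for POPF in a stretch of code that must stay uninterruptible (ProtectedWork is an assumed placeholder routine):

        pushf                   ;save the flags; the Interrupt flag is
                                ; assumed to be 0 at this point
        cli                     ;make sure interrupts stay disabled
        call    ProtectedWork   ;code that must not be interrupted
        EMULATE_POPF            ;restore the flags; unlike the buggy
                                ; POPF, no interrupt can sneak in when
                                ; the Interrupt flag is 0 both before
                                ; and after the pop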
[Figure: the stack and memory contents as the PUSH CS/PUSH offset/IRET sequence pops the flags.]
The standard version of EMULATE_POPF is 6 bytes longer than POPF and much slower, as you'd expect given that it involves three branches. Anyone in his/her right mind would prefer POPF to a larger, slower, three-branch macro, given a choice. In non-interruptible code, however, there's no choice here; the safer, if slower, approach is the best. (Having people associate your programs with crashed computers is not a desirable situation, no matter how unfair the circumstances under which it occurs.)
And now you know the nature of and the workaround for the POPF bug. Whether you ever need the workaround or not, it's a neatly packaged example of the tremendous flexibility of the x86 instruction set.