
Chapter 11: Pushing the 286 and 386
New Registers, New Instructions, New Timings, New Complications
This chapter, adapted from my earlier book Zen of Assembly Language (1989; now out of print), provides an overview of the 286 and 386, often contrasting those processors with the 8088. At the time I originally wrote this, the 8088 was the king of processors, and the 286 and 386 were the new kids on the block. Today, of course, all three processors are past their primes, but many millions of each are still in use, and the 386 in particular is still well worth considering when optimizing software.
This chapter provides an interesting look at the evolution of the x86 architecture, to a greater degree than you might expect, for the x86 family came into full maturity with the 386; the 486 and the Pentium are really nothing more than faster 386s, with very little in the way of new functionality. In contrast, the 286 added a number of instructions, respectable performance, and protected mode to the 8088's capabilities, and the 386 added more instructions and a whole new set of addressing modes, and brought the x86 family into the 32-bit world that represents the future (and, increasingly, the present) of personal computing. This chapter also provides insight into the effects on optimization of the variations in processors and memory architectures that are common in the PC world. So, although the 286 and 386 no longer represent the mainstream of computing, this chapter is a useful mix of history lesson, x86 overview, and details on two workhorse processors that are still in wide use.

Family Matters
While the x86 family is a large one, only a few members of the family-which includes the 8088, 8086, 80188, 80186, 286, 386SX, 386DX, numerous permutations of the 486, and now the Pentium-really matter.
The 8088 is now all but extinct in the PC arena. The 8086 was used fairly widely for a while, but has now all but disappeared. The 80186 and 80188 never really caught on for use in PCs, and don't require further discussion.
That leaves us with the high-end chips: the 286, the 386SX, the 386, the 486, and the Pentium. At this writing, the 386SX is fast going the way of the 8088; people are realizing that its relatively small cost advantage over the 386 isn't enough to offset its relatively large performance disadvantage. After all, the 386SX suffers from the same debilitating problem that looms over the 8088-a too-small bus. Internally, the 386SX is a 32-bit processor, but externally, it's a 16-bit processor, a non-optimal architecture, especially for 32-bit code.
I'm not going to discuss the 386SX in detail. If you do find yourself programming for the 386SX, follow the same general rules you should follow for the 8088: use short instructions, use the registers as heavily as possible, and don't branch. In other words, avoid memory, since the 386SX is by definition better at processing data internally than it is at accessing memory.
The 486 is a world unto itself for the purposes of optimization, and the Pentium is a universe unto itself. We'll treat them separately in later chapters.
This leaves us with just two processors: the 286 and the 386. Each was the PC standard in its day. The 286 is no longer used in new systems, but there are millions of 286-based systems still in daily use. The 386 is still being used in new systems, although it's on the downhill leg of its lifespan, and it is in even wider use than the 286. The future clearly belongs to the 486 and Pentium, but the 286 and 386 are still very much a part of the present-day landscape.

Crossing the Gulf to the 286 and the 386


Apart from vastly improved performance, the biggest difference between the 8088 and the 286 and 386 (as well as the later Intel CPUs) is that the 286 introduced protected mode, and the 386 greatly expanded the capabilities of protected mode. We're only going to talk about real-mode operation of the 286 and 386 in this book, however. Protected mode offers a whole new memory management scheme, one that isn't supported by the 8088. Only code specifically written for protected mode can run in that mode; it's an alien and hostile environment for MS-DOS programs.
In particular, segments are different creatures in protected mode. They're selectors-indexes into a table of segment descriptors-rather than plain old registers, and can't be set to arbitrary values. That means that segments can't be used for temporary storage or as part of a fast indivisible 32-bit load from memory, as in
        les     ax,dword ptr [LongVar]
        mov     dx,es

which loads LongVar into DX:AX faster than this:


        mov     ax,word ptr [LongVar]
        mov     dx,word ptr [LongVar+2]

Protected mode uses those altered segment registers to offer access to a great deal more memory than real mode: The 286 supports 16 megabytes of memory, while the 386 supports 4 gigabytes (4K megabytes) of physical memory and 64 terabytes (64K gigabytes!) of virtual memory.
In protected mode, your programs generally run under an operating system (OS/2, Unix, Windows NT or the like) that exerts much more control over the computer than does MS-DOS. Protected mode operating systems can generally run multiple programs simultaneously, and the performance of any one program may depend far less on code quality than on how efficiently the program uses operating system services and how often and under what circumstances the operating system preempts the program. Protected mode programs are often mostly collections of operating system calls, and the performance of whatever code isn't operating-system oriented may depend primarily on how large a time slice the operating system gives that code to run in.
In short, taken as a whole, protected mode programming is a different kettle of fish altogether from what I've been describing in this book. There's certainly a knack to optimizing specifically for protected mode under a given operating system...but it's not what we've been learning, and now is not the time to pursue it further. In general, though, the optimization strategies discussed in this book still hold true in protected mode; it's just issues specific to protected mode or a particular operating system that we won't discuss.

In the Lair of the Cycle-Eaters, Part II


Under the programming interface, the 286 and 386 differ considerably from the 8088. Nonetheless, with one exception and one addition, the cycle-eaters remain much the same on computers built around the 286 and 386. Next, we'll review each of the familiar cycle-eaters I covered in Chapter 4 as they apply to the 286 and 386, and we'll look at the new member of the gang, the data alignment cycle-eater.
The one cycle-eater that vanishes on the 286 and 386 is the 8-bit bus cycle-eater. The 286 is a 16-bit processor both internally and externally, and the 386 is a 32-bit processor both internally and externally, so the Execution Unit/Bus Interface Unit size mismatch that plagues the 8088 is eliminated. Consequently, there's no longer any need to use byte-sized memory variables in preference to word-sized variables, at least so long as word-sized variables start at even addresses, as we'll see shortly. On the other hand, access to byte-sized variables still isn't any slower than access to word-sized variables, so you can use whichever size suits a given task best.
You might think that the elimination of the 8-bit bus cycle-eater would mean that the prefetch queue cycle-eater would also vanish, since on the 8088 the prefetch queue cycle-eater is a side effect of the 8-bit bus. That would seem all the more likely given that both the 286 and the 386 have larger prefetch queues than the 8088 (6 bytes for the 286, 16 bytes for the 386) and can perform memory accesses, including instruction fetches, in far fewer cycles than the 8088.
However, the prefetch queue cycle-eater doesn't vanish on either the 286 or the 386, for several reasons. For one thing, branching instructions still empty the prefetch queue, so instruction fetching still slows things down after most branches; when the prefetch queue is empty, it doesn't much matter how big it is. (Even apart from emptying the prefetch queue, branches aren't particularly fast on the 286 or the 386, at a minimum of seven-plus cycles apiece. Avoid branching whenever possible.)
After a branch it does matter how fast the queue can refill, and there we come to the second reason the prefetch queue cycle-eater lives on: The 286 and 386 are so fast that sometimes the Execution Unit can execute instructions faster than they can be fetched, even though instruction fetching is much faster on the 286 and 386 than on the 8088.
(All other things being equal, too-slow instruction fetching is more of a problem on the 286 than on the 386, since the 386 fetches 4 instruction bytes at a time versus the 2 instruction bytes fetched per memory access by the 286. However, the 386 also typically runs at least twice as fast as the 286, meaning that the 386 can easily execute instructions faster than they can be fetched unless very high-speed memory is used.)
The most significant reason that the prefetch queue cycle-eater not only survives but prospers on the 286 and 386, however, lies in the various memory architectures used in computers built around the 286 and 386. Due to the memory architectures, the 8-bit bus cycle-eater is replaced by a new form of the wait state cycle-eater: wait states on accesses to normal system memory.

System Wait States


The 286 and 386 were designed to lose relatively little performance to the prefetch queue cycle-eater...when used with zero-wait-state memory: memory that can complete memory accesses so rapidly that no wait states are needed. Unfortunately, true zero-wait-state memory is almost never used with those processors. Why? Because memory that can keep up with a 286 is fairly expensive, and memory that can keep up with a 386 is very expensive. Instead, computer designers use alternative memory architectures that offer more performance for the dollar-but less performance overall-than zero-wait-state memory. (It is possible to build zero-wait-state systems for the 286 and 386; it's just so expensive that it's rarely done.)
The IBM AT and true compatibles use one-wait-state memory (some AT clones use zero-wait-state memory, but such clones are less common than one-wait-state AT clones). 386 systems use a wide variety of memory systems-including high-speed caches, interleaved memory, and static-column RAM-that insert anywhere from 0 to about 5 wait states (and many more if 8- or 16-bit memory expansion cards are used); the exact number of wait states inserted at any given time depends on the interaction between the code being executed and the memory system it's running on.

The performance of most 386 memory systems can vary greatly from one memory access to another, depending on factors such as what data happens to be in the cache and which interleaved bank and/or RAM column was accessed last.

The many memory systems in use make it impossible for us to optimize for 286/386 computers with the precision that's possible on the 8088. Instead, we must write code that runs reasonably well under the varying conditions found in the 286/386 arena.
The wait states that occur on most accesses to system memory in 286 and 386 computers mean that nearly every access to system memory-memory in the DOS's normal 640K memory area-is slowed down. (Accesses in computers with high-speed caches may be wait-state-free if the desired data is already in the cache, but will certainly encounter wait states if the data isn't cached; this phenomenon produces highly variable instruction execution times.) While this is our first encounter with system memory wait states, we have run into a wait-state cycle-eater before: the display adapter cycle-eater, which we discussed along with the other 8088 cycle-eaters way back in Chapter 4. System memory generally has fewer wait states per access than display memory. However, system memory is also accessed far more often than display memory, so system memory wait states hurt plenty-and the place they hurt most is instruction fetching.
Consider this: The 286 can store an immediate value to memory, as in MOV [WordVar],0, in just 3 cycles. However, that instruction is 6 bytes long. The 286 is capable of fetching 1 word every 2 cycles; however, the one-wait-state architecture of the AT stretches that to 3 cycles. Consequently, nine cycles are needed to fetch the six instruction bytes. On top of that, 3 cycles are needed to write to memory, bringing the total memory access time to 12 cycles. On balance, memory access time-especially instruction prefetching-greatly exceeds execution time, to the extent that this particular instruction can take up to four times as long to run as it does to execute in the Execution Unit.
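To make the arithmetic explicit, here's that tally laid out as a back-of-the-envelope budget (simply a restatement of the figures just given, not an official timing table):

; Cycle budget for MOV [WordVar],0 on a one-wait-state 10 MHz AT:
;   instruction fetch: 6 bytes = 3 word fetches x 3 cycles = 9 cycles
;   data write:        1 word x 3 cycles                   = 3 cycles
;                                      total memory access = 12 cycles
;   official Execution Unit execution time                 = 3 cycles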
And that, my friend, is unmistakably the prefetch queue cycle-eater. I might add that the prefetch queue cycle-eater is in rare good form in the above example: A 4-to-1 ratio of instruction fetch time to execution time is in a class with the best (or worst!) that's found on the 8088.
Let’s check out the prefetch queue cycle-eater in action. Listing 11.1 times MOV
WordVar1,O. The Zen timer reports that on a one-wait-state 10 MHz 286-based AT
clone (the computerused for all tests in this chapter), Listing 11.1 runs in 1.27 ps
per instruction. That’s12.7 cycles per instruction, just as we calculated. (That extra
seven-tenths of a cycle comes fromDRAM refresh, which we’ll get to shortly.)

LISTING 11.1 L11-1.ASM


; *** Listing 11.1 ***
; Measures the performance of an immediate move to
; memory, in order to demonstrate that the prefetch
; queue cycle-eater is alive and well on the AT.
;
        jmp     Skip
;
        even            ;always make sure word-sized memory
                        ; variables are word-aligned!
WordVar dw      0
;
Skip:
        call    ZTimerOn
        rept    1000
        mov     [WordVar],0
        endm
        call    ZTimerOff

What does this mean? It means that, practically speaking, the 286 as used in the AT doesn't have a 16-bit bus. From a performance perspective, the 286 in an AT has two-thirds of a 16-bit bus (a 10.7-bit bus?), since every bus access on an AT takes 50 percent longer than it should. A 286 running at 10 MHz should be able to access memory at a maximum rate of 1 word every 200 ns; in a 10 MHz AT, however, that rate is reduced to 1 word every 300 ns by the one-wait-state memory.
In short, a close relative of our old friend the 8-bit bus cycle-eater-the system memory wait state cycle-eater-haunts us still on all but zero-wait-state 286 and 386 computers, and that means that the prefetch queue cycle-eater is alive and well. (The system memory wait state cycle-eater isn't really a new cycle-eater, but rather a variant of the general wait state cycle-eater, of which the display adapter cycle-eater is yet another variant.) While the 286 in the AT can fetch instructions much faster than can the 8088 in the PC, it can execute those instructions faster still.
The picture is less clear in the 386 world since there are so many different memory architectures, but similar problems can occur in any computer built around a 286 or 386. The prefetch queue cycle-eater is even a factor-albeit a lesser one-on zero-wait-state machines, both because branching empties the queue and because some instructions can outrun even zero-wait-state instruction fetching. (Listing 11.1 would take at least 8 cycles per instruction on a zero-wait-state AT, 5 cycles longer than the official execution time.)
To summarize:
Memory-accessing instructions don’t run at their official speeds on non-zero-
wait-state 286/386 computers.
The prefetch queue cycle-eater reduces performance on 286/386 computers, particularly when non-zero-wait-state memory is used.
Branches often execute at less than their rated speeds on the 286 and 386 since
the prefetch queue is emptied.
The extent to which the prefetch queue and wait states affect performance varies
from one 286/386 computer to another, making precise optimization impossible.
What’s to be learned fromall this? Several things:
Keepyourinstructionsshort.
Keep it in the registers; avoid memory, since memory generally can’t keep up
with the processor.
Don’t jump.
Of course, those are exactly the rules that apply to 8088 optimization as well. Isn't it
convenient that the same general rules apply across the board?

Data Alignment
Thanks to its 16-bit bus, the 286 can access word-sized memory variables just as fast as byte-sized variables. There's a catch, however: That's only true for word-sized variables that start at even addresses. When the 286 is asked to perform a word-sized access starting at an odd address, it actually performs two separate accesses, each of which fetches 1 byte, just as the 8088 does for all word-sized accesses.
Figure 11.1 illustrates this phenomenon. The conversion of word-sized accesses to odd addresses into double byte-sized accesses is transparent to memory-accessing instructions; all any instruction knows is that the requested word has been accessed, no matter whether 1 word-sized access or 2 byte-sized accesses were required to accomplish it.
The penalty for performing a word-sized access starting at an odd address is easy to calculate: Two accesses take twice as long as one access.

In other words, the effective capacity of the 286's external data bus is halved when a word-sized access to an odd address is performed.

That, in a nutshell, is the data alignment cycle-eater, the one new cycle-eater of the 286 and 386. (The data alignment cycle-eater is a close relative of the 8088's 8-bit bus cycle-eater, but since it behaves differently-occurring only at odd addresses-and is avoided with a different workaround, we'll consider it to be a new cycle-eater.)



Figure 11.1: The data alignment cycle-eater. The 80286 reads the word value 8382h at address 20000h with a single word-sized access, since that word value starts at an even address; it reads the word value 8382h at address 1FFFFh with two byte-sized accesses, since that word value starts at an odd address.

The way to deal with the data alignment cycle-eater is straightforward: Don't perform word-sized accesses to odd addresses on the 286 if you can help it. The easiest way to avoid the data alignment cycle-eater is to place the directive EVEN before each of your word-sized variables. EVEN forces the offset of the next byte assembled to be even by inserting a NOP if the current offset is odd; consequently, you can ensure that any word-sized variable can be accessed efficiently by the 286 simply by preceding it with EVEN.
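For example, a declaration along these lines (WordCount is just an illustrative name, not a variable used elsewhere in this chapter) guarantees alignment:

        even                    ;pad with a NOP if needed so the next
                                ; byte assembled lands at an even offset
WordCount dw    0               ;word-aligned; the 286 can access it
                                ; with a single word-sized memory access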
Listing 11.2, which accesses memory a word at a time with each word starting at an odd address, runs on a 10 MHz AT clone in 1.27 µs per repetition of MOVSW, or 0.64 µs per word-sized memory access. That's 6-plus cycles per word-sized access, which breaks down to two separate memory accesses-3 cycles to access the high byte of each word and 3 cycles to access the low byte of each word, the inevitable result of non-word-aligned word-sized memory accesses-plus a bit extra for DRAM refresh.

LISTING 11.2 L11-2.ASM
; *** Listing 11.2 ***
; Measures the performance of accesses to word-sized
; variables that start at odd addresses (are not
; word-aligned).
;
Skip:
        push    ds
        pop     es
        mov     si,1    ;source and destination are the same
        mov     di,si   ; and both are not word-aligned
        mov     cx,1000 ;move 1000 words
        cld
        call    ZTimerOn
        rep     movsw
        call    ZTimerOff

On the other hand, Listing 11.3, which is exactly the same as Listing 11.2 save that the memory accesses are word-aligned (start at even addresses), runs in 0.64 µs per repetition of MOVSW, or 0.32 µs per word-sized memory access. That's 3 cycles per word-sized access-exactly twice as fast as the non-word-aligned accesses of Listing 11.2, just as we predicted.

LISTING 11.3 L11-3.ASM


; *** Listing 11.3 ***
; Measures the performance of accesses to word-sized
; variables that start at even addresses (are word-aligned).
;
Skip:
        push    ds
        pop     es
        sub     si,si   ;source and destination are the same
        mov     di,si   ; and both are word-aligned
        mov     cx,1000 ;move 1000 words
        cld
        call    ZTimerOn
        rep     movsw
        call    ZTimerOff

The data alignment cycle-eater has intriguing implications for speeding up 286/386 code. The expenditure of a little care and a few bytes to make sure that word-sized variables and memory blocks are word-aligned can literally double the performance of certain code running on the 286. Even if it doesn't double performance, word alignment usually helps and never hurts.

Code Alignment
Lack of word alignment can also interfere with instruction fetching on the 286, although not to the extent that it interferes with access to word-sized memory variables.



The 286 prefetches instructions a word at a time; even if a given instruction doesn't begin at an even address, the 286 simply fetches the first byte of that instruction at the same time that it fetches the last byte of the previous instruction, as shown in Figure 11.2, then separates the bytes internally. That means that in most cases, instructions run just as fast whether they're word-aligned or not.
There is, however, a non-word-alignment penalty on branches to odd addresses. On a branch to an odd address, the 286 is only able to fetch 1 useful byte with the first instruction fetch following the branch, as shown in Figure 11.3. In other words, lack of word alignment of the target instruction for any branch effectively cuts the instruction-fetching power of the 286 in half for the first instruction fetch after that branch. While that may not sound like much, you'd be surprised at what it can do to tight loops; in fact, a brief story is in order.
When I was developing the Zen timer, I used my trusty 10 MHz 286-based AT clone to verify the basic functionality of the timer by measuring the performance of simple instruction sequences. I was cruising along with no problems until I timed the following code:
        mov     cx,1000
        call    ZTimerOn
LoopTop:
        loop    LoopTop
        call    ZTimerOff

Figure 11.2: Word-aligned prefetching on the 286. The last byte of mov ax,1 and the first byte of mov bx,2, which together form a word-aligned word, are prefetched with a single word-sized access; the 286 later splits the bytes apart internally in the prefetch queue.

Figure 11.3: How instruction bytes are fetched after a branch. On a branch to 20101h, only one useful instruction byte is fetched by the first instruction fetch after the branch, since the other byte in the word-aligned word that covers address 20101h precedes the branch destination and is therefore of no use as an instruction byte after the branch.

Now, this code should run in, say, about 12 cycles per loop at most. Instead, it took over 14 cycles per loop, an execution time that I could not explain in any way. After rolling it around in my head for a while, I took a look at the code under a debugger...and the answer leaped out at me. The loop began at an odd address! That meant that two instruction fetches were required each time through the loop; one to get the opcode byte of the LOOP instruction, which resided at the end of one word-aligned word, and another to get the displacement byte, which resided at the start of the next word-aligned word.
One simple change brought the execution time down to a reasonable 12.5 cycles per loop:
        mov     cx,1000
        call    ZTimerOn
        even
LoopTop:
        loop    LoopTop
        call    ZTimerOff

While word-aligning branch destinations can improve branching performance, it's a nuisance and can increase code size a good deal, so it's not worth doing in most code. Besides, EVEN inserts a NOP instruction if necessary, and the time required to execute a NOP can sometimes cancel the performance advantage of having a word-aligned branch destination.

Consequently, it's best to word-align only those branch destinations that can be reached solely by branching.

I recommend that you only go out of your way to word-align the start offsets of your subroutines, as in:
even
FindChar proc near

In my experience, this simple practice is the one form of code alignment that consistently provides a reasonable return for bytes and effort expended, although sometimes it also pays to word-align tight time-critical loops.

Alignment and the 386


So far we’ve only discussed alignment as it pertains to the 286. What, you may well
ask, of the 386?
The 386 adds theissue of doubleword alignment (thatis, alignment to addresses that
are multiples of four.) The rule for the 386 is: Word-sized memory accesses should
be word-aligned (it’s impossible for word-aligned word-sized accesses to cross
doubleword boundaries) , and doubleword-sized memory accesses should be
doubleword-aligned. However, in real (as opposed to 32-bit protected) mode,
doubleword-sized memory accesses are rare,so the simple word-alignment rule we’ve
developed for the 286 servesfor the 386 in real mode as well.
As for code alignment...the subroutine-start word-alignment rule of the 286 serves reasonably well there too since it avoids the worst case, where just 1 byte is fetched on entry to a subroutine. While optimum performance would dictate doubleword alignment of subroutines, that takes 3 bytes, a high price to pay for an optimization that improves performance only on the post-286 processors.

Alignment and the Stack


One side-effect of the data alignment cycle-eater of the 286 and 386 is that you should never allow the stack pointer to become odd. (You can make the stack pointer odd by adding an odd value to it or subtracting an odd value from it, or by loading it with an odd value.) An odd stack pointer on the 286 or 386 (or a non-doubleword-aligned stack in 32-bit protected mode on the 386, 486, or Pentium) will significantly reduce the performance of PUSH, POP, CALL, and RET, as well as INT and IRET, which are executed to invoke DOS and BIOS functions, handle keystrokes and incoming serial characters, and manage the mouse. I know of a Forth programmer who vastly improved the performance of a complex application on the AT simply by forcing the Forth interpreter to maintain an even stack pointer at all times.
An interesting corollary to this rule is that you shouldn't INC SP twice to add 2, even though that takes fewer bytes than ADD SP,2. The stack pointer is odd between the first and second INC, so any interrupt occurring between the two instructions will be serviced more slowly than it normally would. The same goes for decrementing twice; use SUB SP,2 instead.
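Here's the idea in miniature (an illustrative fragment, not one of the chapter's listings):

; Discarding a word from the stack--don't do this:
        inc     sp              ;SP is odd here; an interrupt arriving
        inc     sp              ; between these two INCs is serviced slowly
; ...do this instead:
        add     sp,2            ;SP never becomes odd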

Keep the stack pointer aligned at all times.

The DRAM Refresh Cycle-Eater: Still an Act of God


The DRAM refresh cycle-eater is the cycle-eater that's least changed from its 8088 form on the 286 and 386. In the AT, DRAM refresh uses a little over five percent of all available memory accesses, slightly less than it uses in the PC, but in the same ballpark. While the DRAM refresh penalty varies somewhat on various AT clones and 386 computers (in fact, a few computers are built around static RAM, which requires no refresh at all; likewise, caches are made of static RAM so cached systems generally suffer less from DRAM refresh), the 5 percent figure is a good rule of thumb.
Basically, the effect of the DRAM refresh cycle-eater is pretty much the same throughout the PC-compatible world: fairly small, so it doesn't greatly affect performance; unavoidable, so there's no point in worrying about it anyway; and a nuisance, since it results in fractional cycle counts when using the Zen timer. Just as with the PC, a given code sequence on the AT can execute at varying speeds at different times as a result of the interaction between the code and DRAM refresh.
There’s nothing much new with DRAMrefresh on 286/386 computers, then. Be aware
of it, but don’toverly concern yourself-DRAM refresh is stillan act of God, and there’s
not a blessed thing you can do aboutit. Happily, the internal cachesof the 486 and
Pentium make DRAM refresh largely a performance non-issue on those processors.

The Display Adapter Cycle-Eater


Finally we come to the last of the cycle-eaters, the display adapter cycle-eater. There are two ways of looking at this cycle-eater on 286/386 computers: (1) It's much worse than it was on the PC, or (2) it's just about the same as it was on the PC.
Either way, the display adapter cycle-eater is extremely bad news on 286/386 computers and on 486s and Pentiums as well. In fact, this cycle-eater on those systems is largely responsible for the popularity of VESA local bus (VLB).
The two ways of looking at the display adapter cycle-eater on 286/386 computers are actually the same. As you'll recall from my earlier discussion of the matter in Chapter 4, display adapters offer only a limited number of accesses to display memory during any given period of time. The 8088 is capable of making use of most but not all of those slots with REP MOVSW, so the number of memory accesses allowed by a display adapter such as a standard VGA is reasonably well-matched to an 8088's memory access speed. Granted, access to a VGA slows the 8088 down considerably-but, as we're about to find out, "considerably" is a relative term. What a VGA does to PC performance is nothing compared to what it does to faster computers.
Under ideal conditions, a 286 can access memory much, much faster than an 8088. A 10 MHz 286 is capable of accessing a word of system memory every 0.20 µs with REP MOVSW, dwarfing the 1 byte every 1.31 µs that the 8088 in a PC can manage. However, access to display memory is anything but ideal for a 286. For one thing, most display adapters are 8-bit devices, although newer adapters are 16-bit in nature. One consequence of that is that only 1 byte can be read or written per access to display memory; word-sized accesses to 8-bit devices are automatically split into 2 separate byte-sized accesses by the AT's bus. Another consequence is that accesses are simply slower; the AT's bus inserts additional wait states on accesses to 8-bit devices since it must assume that such devices were designed for PCs and may not run reliably at AT speeds.
However, the 8-bit size of most display adapters is but one of the two factors that reduce the speed with which the 286 can access display memory. Far more cycles are eaten by the inherent memory-access limitations of display adapters-that is, the limited number of display memory accesses that display adapters make available to the 286. Look at it this way: If REP MOVSW on a PC can use more than half of all available accesses to display memory, then how much faster can code running on a 286 or 386 possibly run when accessing display memory?
That's right-less than twice as fast.
In other words, instructions that access display memory won't run a whole lot faster on ATs and faster computers than they do on PCs. That explains one of the two viewpoints expressed at the beginning of this section: The display adapter cycle-eater is just about the same on high-end computers as it is on the PC, in the sense that it allows instructions that access display memory to run at just about the same speed on all computers.
Of course, the picture is quite a bit different when you compare the performance of instructions that access display memory to the maximum performance of those instructions. Instructions that access display memory receive many more wait states when running on a 286 than they do on an 8088. Why? While the 286 is capable of accessing memory much more often than the 8088, we've seen that the frequency of access to display memory is determined not by processor speed but by the display adapter itself. As a result, both processors are actually allowed just about the same maximum number of accesses to display memory in any given time. By definition, then, the 286 must spend many more cycles waiting than does the 8088.

And that explains the second viewpoint expressed above regarding the display adapter cycle-eater vis-a-vis the 286 and 386. The display adapter cycle-eater, as measured in cycles lost to wait states, is indeed much worse on AT-class computers than it is on the PC, and it's worse still on more powerful computers.

How bad is the display adapter cycle-eater on an AT? It's this bad: Based on my (not inconsiderable) experience in timing display adapter access, I've found that the display adapter cycle-eater can slow an AT-or even a 386 computer-to near-PC speeds when display memory is accessed.

I know that’s hard to believe, but the display adapter cycle-eater gives out just so
many displaymemory accesses in agiven time, and no more, no matter how fast the
processor is. In fact, the faster the processor, the more the display adapter cycleeater
hurts the performance of instructions that access display memory. The display adapter
cycle-eater is not only still present in 286/386 computers,it’s worsethan ever.
What can we do about this new, more virulent form of the display adapter cycle-
eater? The workaround is the same as it was on the PC: Access display memory as
little as you possibly can.

New Instructions and Features: The 286


The 286 and 386 offer a number of new instructions. The 286 has a relatively small number of instructions that the 8088 lacks, while the 386 has those instructions and quite a few more, along with new addressing modes and data sizes. We'll discuss the 286 and the 386 separately in this regard.
The 286 has a number of instructions designed for protected-mode operation. As I've said, we're not going to discuss protected mode in this book; in any case, protected-mode instructions are generally used only by operating systems. (I should mention that the 286's protected mode brings with it the ability to address 16 MB of memory, a considerable improvement over the 8088's 1 MB. In real mode, however, programs are still limited to 1 MB of addressable memory on the 286. In either mode, each segment is still limited to 64K.)
There are also a handful of 286-specific real-mode instructions, and they can be quite useful. BOUND checks array bounds. ENTER and LEAVE support compact and speedy stack frame construction and removal, ideal for interfacing to high-level languages such as C and Pascal (although these instructions are actually relatively slow on the 386 and its successors, and should be used with caution when performance matters). INS and OUTS are new string instructions that support efficient data transfer between memory and I/O ports. Finally, PUSHA and POPA push and pop all eight general-purpose registers.
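As a quick sketch (not one of the chapter's numbered listings; the procedure name and frame size are invented purely for illustration), several of these instructions might be used together like this:

SampleProc proc near
        enter   8,0             ;open a stack frame with 8 bytes of locals
                                ; (replaces push bp/mov bp,sp/sub sp,8)
        pusha                   ;save all eight general-purpose registers
                                ; with a single instruction
        ;--- body of the routine would go here ---
        popa                    ;restore all eight registers at once
        leave                   ;close the frame (mov sp,bp/pop bp)
        ret
SampleProc endp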



A couple of old instructions gain new features on the 286. For one, the 286 version of PUSH is capable of pushing a constant on the stack. For another, the 286 allows all shifts and rotates to be performed for not just 1 bit or the number of bits specified by CL, but for any constant number of bits.
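For instance (an illustrative fragment, not from the listings):

        push    1234h           ;286 and later: push an immediate value
        shl     ax,4            ;286 and later: shift by any constant count,
                                ; not just by 1 or by the count in CL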

New Instructions and Features: The 386


The 386 is somewhat more complex than the 286 regarding new features. Once again, we won't discuss protected mode, which on the 386 comes with the ability to address up to 4 gigabytes per segment and 64 terabytes in all. In real mode (and in virtual-86 mode, which allows the 386 to multitask MS-DOS applications, and which is identical to real mode so far as MS-DOS programs are concerned), programs running on the 386 are still limited to 1 MB of addressable memory and 64K per segment.
The 386 has many new instructions, as well as new registers, addressing modes and data sizes that have trickled down from protected mode. Let's take a quick look at these new real-mode features.
Even in real mode, it's possible to access many of the 386's new and extended registers. Most of these registers are simply 32-bit extensions of the 16-bit registers of the 8088. For example, EAX is a 32-bit register containing AX as its lower 16 bits, EBX is a 32-bit register containing BX as its lower 16 bits, and so on. There are also two new segment registers: FS and GS.
The 386 also comes with a slew of new real-mode instructions beyond those supported by the 8088 and 286. These instructions can scan data on a bit-by-bit basis, set the Carry flag to the value of a specified bit, sign-extend or zero-extend data as it's moved, set a register or memory variable to 1 or 0 on the basis of any of the conditions that can be tested with conditional jumps, and more. (Again, beware: Many of these complex 386-specific instructions are slower than equivalent sequences of simple instructions on the 486 and especially on the Pentium.) What's more, both old and new instructions support 32-bit operations on the 386. For example, it's relatively simple to copy data in chunks of 4 bytes on a 386, even in real mode, by using the MOVSD ("move string double") instruction, or to negate a 32-bit value with NEG EAX.
Finally, it’s possible in real mode to use the 386’s new addressing modes, in which
any 32-bit general-purpose register or pair of registers can be used toaddress memory.
What’s more, multiplicationof memory-addressing registers by 2,4, or8 for look-ups
in word, doubleword, or quadword tables can be built right into the memory ad-
dressing mode. (The 32-bit addressing modes are discussed further in later chapters.)
In protected mode, these new addressing modes allow youto address a full 4 gigabytes
per segment, but in real mode you’re still limited to 64K, even with 32-bitregisters
and the new addressing modes, unless you play some unorthodox tricks with the
segment registers.
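Here's a rough sketch of what such addressing looks like in source code; WordTable is a hypothetical word array, and a .386 directive is assumed:

.386
        movzx   ebx,bx                  ;make sure the upper half of EBX is 0
        mov     ax,[WordTable+ebx*2]    ;scale the index by 2 right in the
                                        ; addressing mode; no separate shift
        mov     cl,[ebx+edx]            ;any two 32-bit general-purpose
                                        ; registers can combine to address memory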

Note well: Those tricks don't necessarily work with system software such as Windows, so I'd recommend against using them. If you want 4-gigabyte segments, use a 32-bit environment such as Win32.

Optimization Rules: The More Things Change...


Let’s see what we’ve learned about286/386 optimization. Mostly what we’ve learned
is that our familiar PC cycle-eaters still apply,
although in somewhat differentforms,
and that the major optimization rules for the PC hold true on ATs and 386-based
computers. You won’t go wrong on any of these computers if you keep your instruc-
tions short, use the registers heavily and avoid memory, don’t branch, and avoid
accessing displaymemory like the plague.
Although we haven’t touched on them, repeated string instructions are still desir-
able on the 286 and 386 since they provide a greatdeal of functionality per instruction
byte and eliminate both the prefetch queue cycle-eater and branching. However,
string instructions are notquite so spectacularly superior on the 286 and 386 as they
are on the8088 since non-string memory-accessing instructions have been speeded
up considerably on thenewer processors.
There’s one cycle-eater with newimplications on the 286 and 386, and that’s the data
alignment cycle-eater. From the data alignment cycle-eater we get a new rule: Word-
align your word-sized variables, and start your subroutines at even addresses.

Detailed Optimization
While the major 8088 optimization rules hold true on computers built around the 286 and 386, many of the instruction-specific optimizations no longer hold, for the execution times of most instructions are quite different on the 286 and 386 than on the 8088. We have already seen one such example of the sometimes vast difference between 8088 and 286/386 instruction execution times: MOV [WordVar],0, which has an Execution Unit execution time of 20 cycles on the 8088, has an EU execution time of just 3 cycles on the 286 and 2 cycles on the 386.
In fact, the performance of virtually all memory-accessing instructions has been improved enormously on the 286 and 386. The key to this improvement is the near elimination of effective address (EA) calculation time. Where an 8088 takes from 5 to 12 cycles to calculate an EA, a 286 or 386 usually takes no time whatsoever to perform the calculation. If a base+index+displacement addressing mode, such as MOV AX,[WordArray+BX+SI], is used on a 286 or 386, 1 cycle is taken to perform the EA calculation, but that's both the worst case and the only case in which there's any EA overhead at all.
The elimination of EA calculation time means that the EU execution time of memory-addressing instructions is much closer to the EU execution time of register-only instructions. For instance, on the 8088 ADD [WordVar],100H is a 31-cycle instruction, while ADD DX,100H is a 4-cycle instruction-a ratio of nearly 8 to 1. By contrast, on the 286 ADD [WordVar],100H is a 7-cycle instruction, while ADD DX,100H is a 3-cycle instruction-a ratio of just 2.3 to 1.
It would seem, then, that it's less necessary to use the registers on the 286 than it was on the 8088, but that's simply not the case, for reasons we've already seen. The key is this: The 286 can execute memory-addressing instructions so fast that there's no spare instruction prefetching time during those instructions, so the prefetch queue runs dry, especially on the AT, with its one-wait-state memory. On the AT, the 6-byte instruction ADD [WordVar],100H is effectively at least a 15-cycle instruction, because 3 cycles are needed to fetch each of the three instruction words and 6 more cycles are needed to read WordVar and write the result back to memory.
Granted, the register-only instruction ADD DX,100H also slows down-to 6 cycles-because of instruction prefetching, leaving a ratio of 2.5 to 1. Now, however, let's look at the performance of the same code on an 8088. The register-only code would run in 16 cycles (4 instruction bytes at 4 cycles per byte), while the memory-accessing code would run in 40 cycles (6 instruction bytes at 4 cycles per byte, plus 2 word-sized memory accesses at 8 cycles per word). That's a ratio of 2.5 to 1, exactly the same as on the 286.
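Laid out side by side, the arithmetic of the last two paragraphs looks like this (simply a restatement of the figures above):

; 8088 (4 cycles per instruction byte fetched, 8 cycles per word of data):
;   ADD DX,100H         4 bytes x 4                  = 16 cycles
;   ADD [WordVar],100H  6 bytes x 4 + 2 words x 8    = 40 cycles   (2.5:1)
; one-wait-state 10 MHz 286 (3 cycles per word fetched or accessed):
;   ADD DX,100H         2 word fetches x 3           =  6 cycles
;   ADD [WordVar],100H  3 word fetches x 3 + 2 x 3   = 15 cycles   (2.5:1)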
This is all theoretical. We put our trust not in theory but in actual performance, so let's run this code through the Zen timer. On a PC, Listing 11.4, which performs register-only addition, runs in 3.62 ms, while Listing 11.5, which performs addition to a memory variable, runs in 10.05 ms. On a 10 MHz AT clone, Listing 11.4 runs in 0.64 ms, while Listing 11.5 runs in 1.80 ms. Obviously, the AT is much faster...but the ratio of Listing 11.5 to Listing 11.4 is virtually identical on both computers, at 2.78 for the PC and 2.81 for the AT. If anything, the register-only form of ADD has a slightly larger advantage on the AT than it does on the PC in this case.
Theory confirmed.

LISTING 11.4 L11-4.ASM


; *** Listing 11.4 ***
; Measures the performance of adding an immediate value
; to a register, for comparison with Listing 11.5, which
; adds an immediate value to a memory variable.
;
        call    ZTimerOn
        rept    1000
        add     dx,100h
        endm
        call    ZTimerOff

LISTING 11.5 L11-5.ASM


; *** Listing 11.5 ***
; Measures the performance of adding an immediate value
; to a memory variable, for comparison with Listing 11.4,
; which adds an immediate value to a register.
;
        jmp     Skip
;
        even            ;always make sure word-sized memory
                        ; variables are word-aligned!
WordVar dw      0
;
Skip:
        call    ZTimerOn
        rept    1000
        add     [WordVar],100h
        endm
        call    ZTimerOff

What’s going on? Simply this: Instruction fetching is controlling overall execution
time on both processors. Boththe 8088 in a PC and the 286 in an AT can execute the bytes
of the instructions inListings 11.4 and 11.5faster than they can be fetched. Since the
instructions areexactly the same lengths on bothprocessors, it standsto reason that
the ratio of the overall execution times of the instructions should be the same on
both processors as [Link] length controls execution time, and theinstruc-
tion lengths are thesame-therefore the ratios of the execution times are thesame.
The 286 can both fetch and execute instruction bytes faster than the 8088 can, so
code executes much faster on the 286; nonetheless, because the 286 can also ex-
ecute those instruction bytes much faster than it can fetchthem, overall performance
is still largely determined by the size of the instructions.
Is this always the case? No. When the prefetch queue is full, memory-accessing instructions on the 286 and 386 are much faster (relative to register-only instructions) than they are on the 8088. Given the system wait states prevalent on 286 and 386 computers, however, the prefetch queue is likely to be empty quite a bit, especially when code consisting of instructions with short EU execution times is executed. Of course, that's just the sort of code we're likely to write when we're optimizing, so the performance of high-speed code is more likely to be controlled by instruction size than by EU execution time on most 286 and 386 computers, just as it is on the PC.
All of which is just a way of saying that faster memory access and EA calculation notwithstanding, it's just as desirable to keep instructions short and memory accesses to a minimum on the 286 and 386 as it is on the 8088. And the way to do that is to use the registers as heavily as possible, use string instructions, use short forms of instructions, and the like.
The more things change, the more they remain the same....

POPF and the 286


We’ve one final 286-related item to discuss: the hardware malfunctionof POPF un-
der certain circumstanceson the 286.
The problem is this: Sometimes POPF permits interrupts to occur when interrupts
are initially off and thesetting popped into the Interruptflag from the stack keeps

Pushing the 286 and 386 225


interrupts off. In other words, an interrupt can happen even though the Interrupt
flag is never set to1. Now, I don’t want to blow this particular bug outof proportion.
It only causes problems in code that cannot tolerate interrupts underany circum-
stances, and that’s a rare sortof code, especially in user programs. However, some
code really does need to have interrupts absolutely disabled, with no chance of an
interrupt sneaking through. For example, a critical portion of a disk BIOS might
need to retrieve data from the disk controller the instant it becomes available; even
a few hundred microseconds of delay could result in asector’s worth of data mis-
read. In this case, one misplaced interrupt during aPOPF could result in a trashed
hard disk if that interruptoccurs while the disk BIOS isreading a sectorof the File
Allocation Table.
There is a workaround for the POPF bug. While the workaround is easy to use, it's considerably slower than POPF, and costs a few bytes as well, so you won't want to use it in code that can tolerate interrupts. On the other hand, in code that truly cannot be interrupted, you should view those extra cycles and bytes as cheap insurance against mysterious and erratic program crashes.
One obvious reason to discuss the POPF workaround is that it's useful. Another reason is that the workaround is an excellent example of Zen-level assembly coding, in that there's a well-defined goal to be achieved but no obvious way to do so. The goal is to reproduce the functionality of the POPF instruction without using POPF, and the place to start is by asking exactly what POPF does.
All POPF does is pop the word on top of the stack into the FLAGS register, as shown in Figure 11.4. How can we do that without POPF? Of course, the 286's designers intended us to use POPF for this purpose, and didn't intentionally provide any alternative approach, so we'll have to devise an alternative approach of our own. To do that, we'll have to search for instructions that contain some of the same functionality as POPF, in the hope that one of those instructions can be used in some way to replace POPF.
Well, there’s only one instruction other thanPOPF that loads the FLAGS register
directly from the stack, and that’s IRET, which loads the FLAGS register from the
stack as it branches, as shown in Figure 11.5. IRET has no known bugs of the sort
that plague POPF, so it’s certainly a candidate to replace POPF in non-interruptible
applications. Unfortunately,IRET loads theFLAGS register with the third word down
on thestack, not theword on topof the stack, as isthe case withPOPF; the far return
address that IRET pops into CS:IP lies between the top of the stack and the word
popped into theFLAGS register.
Obviously, the segment:offset that IRET expects to find on the stack abovethe pushed
flags isn’tpresent when the stack is set up forPOPF, so we’ll have to adjustthe stack
a bit before we can substitute IRET for POPF. What we’ll have to do is push the
segment:offset of the instruction after our workaround code onto the stack right
above the pushed flags. IRET will then branch to that address and pop the flags,

Figure 11.4: The operation of POPF. (POPF pops the word on top of the stack into the FLAGS register.)

ending up at the instruction after the workaround code with the flags popped. That's just the result that would have occurred had we executed POPF-with the bonus that no interrupts can accidentally occur when the Interrupt flag is 0 both before and after the pop.
How can we push the segment:offset of the next instruction? Well, finding the offset of the next instruction by performing a near call to that instruction is a tried-and-true trick. We can do something similar here, but in this case we need a far call, since IRET requires both a segment and an offset. We'll also branch backward so that the



Figure 11.5: The operation of IRET. (IRET pops a far return address into CS:IP as it branches, then pops the word below that address, the third word down on the original stack, into the FLAGS register.)
address pushed on the stack will point to the instruction we want to continue with. The code works out like this:
        jmp     short popfskip
popfiret:
        iret            ;branches to the instruction after the
                        ; call, popping the word below the address
                        ; pushed by CALL into the FLAGS register
popfskip:
        call    far ptr popfiret
                        ;pushes the segment:offset of the next
                        ; instruction on the stack just above
                        ; the flags word, setting things up so
                        ; that IRET will branch to the next
                        ; instruction and pop the flags
; When execution reaches the instruction following this comment,
; the word that was on top of the stack when JMP SHORT POPFSKIP
; was reached has been popped into the FLAGS register, just as
; if a POPF instruction had been executed.

The operation of this code is illustrated in Figure 11.6.


The POPF workaround can best be implemented as a macro; we can also emulate a far call by pushing CS and performing a near call, thereby shrinking the workaround code by 1 byte:
EMULATE_POPF    macro
        local   popfskip, popfiret
        jmp     short popfskip
popfiret:
        iret
popfskip:
        push    cs
        call    popfiret
        endm

By the way, the flags can be popped much more quickly if you're willing to alter a register in the process. For example, the following macro emulates POPF with just one branch, but wipes out AX:
EMULATE_POPF_TRASH_AX   macro
        push    cs
        mov     ax,offset $+5
        push    ax
        iret
        endm

POPF, since POPF doesn’t alterany registers, but it’s


It’s not a perfect substitute for
faster and shorter than EMULATE-POPF when you can spare theregister. If you’re
using 286-specific instructions, you can use
.286

EMULATE-POPF macro
p u sc hs
p u s ho f f s e t $+4

Pushing the 286 and 386 229


ir e t
endm

which is shorter still, alters no registers, and branches just once. (Of course, this version of EMULATE_POPF won't work on an 8088.)

Figure 11.6: Workaround code for the POPF bug. (The figure traces the stack as the far call pushes the segment:offset of the instruction following it just above the pushed flags, so that IRET both pops the flags and branches to that instruction.)


The standard version of EMULATE_POPF is 6 bytes longer than POPF and much slower, as you'd expect given that it involves three branches. Anyone in his/her right mind would prefer POPF to a larger, slower, three-branch macro-given a choice. In non-interruptible code, however, there's no choice here; the safer-if slower-approach is the best. (Having people associate your programs with crashed computers is not a desirable situation, no matter how unfair the circumstances under which it occurs.)
And now you know the nature of and the workaround for the POPF bug. Whether you ever need the workaround or not, it's a neatly packaged example of the tremendous flexibility of the x86 instruction set.

