INTRODUCTION :
Very Long Instruction Word (VLIW) processors [2, 3] are examples of architectures for which the program provides explicit information regarding parallelism. The compiler identifies the parallelism in the program and communicates it to the hardware by specifying which operations are independent of one another. This information is of direct value to the hardware, since it knows with no further checking which operations it can start executing in the same cycle. In this report, we introduce the Explicitly Parallel Instruction Computing (EPIC) style of architecture, an evolution of VLIW which has absorbed many of the best ideas of superscalar processors, albeit in a form adapted to the EPIC philosophy. EPIC is not so much an architecture as it is a philosophy of how to build ILP processors, along with a set of architectural features that support this philosophy. In this sense EPIC is like RISC; it denotes a class of architectures, all of which subscribe to a common architectural philosophy. Just as there are many distinct RISC architectures (Hewlett-Packard's PA-RISC, Silicon Graphics' MIPS and Sun's SPARC), there can be more than one instruction set architecture (ISA) within the EPIC fold.
A. Instruction-level parallelism
A common design goal for general-purpose processors is to maximize throughput, which may be defined broadly as the amount of work performed in a given time. Average processor throughput is a function of two variables: the average number of clock cycles required to execute an instruction, and the frequency of clock cycles. To increase throughput, then, a designer could increase the clock rate of the architecture, or increase the average instruction-level parallelism (ILP) of the architecture. Modern processor design has focused on executing more instructions in a given number of clock cycles, that is, increasing ILP. A number of techniques may be used. One technique, pipelining, is particularly popular because it is relatively simple, and can be used in conjunction with superscalar and VLIW techniques. All modern CPU architectures are pipelined.
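The relationship between the two variables can be made concrete with a short Python sketch. The function name and the numbers (a 1 GHz clock and an average of 2 cycles per instruction) are illustrative assumptions, not measurements of any real processor.

```python
def instructions_per_second(clock_hz: float, cycles_per_instruction: float) -> float:
    """Average throughput = clock frequency / average cycles per instruction."""
    if cycles_per_instruction <= 0:
        raise ValueError("cycles_per_instruction must be positive")
    return clock_hz / cycles_per_instruction


# Hypothetical baseline: 1 GHz clock, 2 cycles per instruction on average.
baseline = instructions_per_second(1e9, 2.0)
# Option 1: raise the clock rate by 50% and leave the cycle count alone.
faster_clock = instructions_per_second(1.5e9, 2.0)
# Option 2: keep the clock and halve the average cycles per instruction (more ILP).
more_ilp = instructions_per_second(1e9, 1.0)

print(baseline, faster_clock, more_ilp)
```

Under these assumed figures, either lever raises throughput; the second corresponds to the ILP-oriented approach discussed in the rest of this section.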
B. Pipelining
All instructions are executed in multiple stages. For example, a simple processor may have five stages: first the instruction must be fetched from cache, then it must be decoded, the instruction must be executed, and any memory referenced by the instruction must be loaded or stored. Finally the result of the instruction is stored in registers. The output from one stage serves as the input to the next stage, forming a pipeline of instruction implementation. These stages are frequently independent of each other, so, if separate hardware is used to perform each stage, multiple instructions may be "in flight" at once, with each instruction at a different stage in the pipeline. Ignoring potential problems, the theoretical increase in speed is proportional to the length of the pipeline: longer pipelines mean more simultaneous in-flight instructions and therefore fewer average cycles per instruction.
The major potential problem with pipelining is the potential for hazards. A hazard occurs when an instruction in the pipeline cannot be executed. Hennessy and Patterson identify three types of hazards: structural hazards, where there simply isn't sufficient hardware to execute all parallelizable instructions at once; data hazards, where an instruction depends on the result of a previous instruction; and control hazards, which arise from instructions which change the program counter (i.e., branch instructions). Various techniques exist for managing hazards. The simplest of these is to stall the pipeline until the instruction causing the hazard has completed.
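The cycle counts behind this argument can be sketched in Python. This is a simplified model under stated assumptions: every stage takes one cycle, a data hazard is detected only between adjacent instructions, and each hazard costs a fixed number of stall cycles (`stall_per_hazard`, a hypothetical parameter; the real penalty depends on the pipeline design).

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int, stall_cycles: int = 0) -> int:
    """The first instruction fills the pipeline; each later one completes one
    cycle after its predecessor, plus any cycles lost to stalls."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def count_data_hazard_stalls(program, stall_per_hazard: int = 2) -> int:
    """program is a list of (dest_register, source_registers) in program order.
    Assumption: a stall occurs when an instruction reads the register written
    by the instruction immediately before it."""
    stalls = 0
    for (prev_dest, _), (_, curr_sources) in zip(program, program[1:]):
        if prev_dest in curr_sources:
            stalls += stall_per_hazard
    return stalls


# Hypothetical three-instruction program; the second reads r1 from the first.
program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),
    ("r6", ("r7", "r8")),
]
stalls = count_data_hazard_stalls(program)
print(unpipelined_cycles(3, 5))        # 15
print(pipelined_cycles(3, 5))          # 7
print(pipelined_cycles(3, 5, stalls))  # 9
```

For long instruction sequences without hazards, the ratio of the two cycle counts approaches the number of stages, which is the sense in which the ideal speedup is proportional to pipeline length.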
VLIW :
The additional hardware needed to manage hazards is complex, and contributes to the transistor count of the processor. All other things being equal, more transistors mean more power consumption, more heat, and less on-die space for cache. Thus it seems beneficial to expose more of the architecture's parallelism to the programmer. This way, not only is the architecture simplified, but programmers have more control over the hardware, and can take better advantage of it. VLIW is an architecture designed to help software designers extract more parallelism from their software than would be possible using a traditional RISC design. It is an alternative to better-known superscalar architectures. VLIW is considerably simpler than superscalar designs, but has not so far been commercially successful. The figure shows a typical VLIW architecture. Note the simplified instruction decode and dispatch.
A. ILP in VLIW
VLIW and superscalar approach the ILP problem differently. The key difference between the two is where instruction scheduling is performed: in a superscalar architecture, scheduling is performed in hardware (and is called dynamic scheduling, because the schedule of a given piece of code may differ depending on the code path followed), whereas in a VLIW, scheduling is performed in software (static scheduling, because the schedule is "built in to the binary" by the compiler or assembly language programmer).
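To illustrate what static scheduling involves, the following Python sketch packs operations into fixed-width bundles at "compile time". It is a minimal greedy scheduler, not the algorithm of any particular compiler, and it rests on simplifying assumptions: every operation has a latency of one cycle, each register is written only once (so only true data dependences matter), and any operation may occupy any slot. The names `schedule_bundles`, `ops` and `width` are hypothetical.

```python
def schedule_bundles(ops, width):
    """Greedily pack operations into bundles of at most `width` operations.

    ops: list of (name, dest_register, source_registers) in program order.
    Returns a list of bundles; bundle i holds the names issued in cycle i.
    """
    bundles = []
    produced_in = {}  # register -> index of the bundle that writes it
    for name, dest, sources in ops:
        # An operation may issue one cycle after its latest producer.
        earliest = 0
        for src in sources:
            if src in produced_in:
                earliest = max(earliest, produced_in[src] + 1)
        # Take the first bundle at or after `earliest` with a free slot.
        cycle = earliest
        while cycle < len(bundles) and len(bundles[cycle]) >= width:
            cycle += 1
        while len(bundles) <= cycle:
            bundles.append([])
        bundles[cycle].append(name)
        produced_in[dest] = cycle
    return bundles


# Hypothetical straight-line code: t3 needs t1 and t2; t5 needs t3 and t4.
ops = [
    ("a", "t1", ("x", "y")),
    ("b", "t2", ("z", "w")),
    ("c", "t3", ("t1", "t2")),
    ("d", "t4", ("u", "v")),
    ("e", "t5", ("t3", "t4")),
]
print(schedule_bundles(ops, width=2))  # [['a', 'b'], ['c', 'd'], ['e']]
```

The resulting bundle layout is fixed in the binary. In a real VLIW encoding, unused slots such as the second slot of the last bundle would typically be filled with no-ops.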
B. Superscalar
Usually, the execution phase of the pipeline takes the longest. On modern hardware, the execution of the instruction may be performed by one of a number of functional units. For example, integer instructions may be executed by the ALU, whereas floating-point operations are performed by the FPU. On a traditional, scalar pipelined architecture, either one or the other of these units will always be idle, depending on the instruction being executed. On a superscalar architecture, instructions may be executed in parallel on multiple functional units. The pipeline is essentially split after instruction issue.
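The benefit of splitting the pipeline can be sketched with a toy issue model in Python. It assumes a machine with exactly one ALU and one FPU, in-order issue, single-cycle execution, and no data dependences between the instructions shown; real superscalar issue logic is considerably more involved. The function names are hypothetical.

```python
def scalar_cycles(kinds):
    """A scalar pipeline issues one instruction per cycle, so one unit idles."""
    return len(kinds)


def dual_issue_cycles(kinds):
    """Issue two adjacent instructions together when they need different units.

    kinds: list of "int" or "fp" tags in program order.
    """
    cycles = 0
    i = 0
    while i < len(kinds):
        if i + 1 < len(kinds) and kinds[i] != kinds[i + 1]:
            i += 2  # one instruction goes to the ALU, the other to the FPU
        else:
            i += 1  # the next instruction needs the same unit, so issue alone
        cycles += 1
    return cycles


stream = ["int", "fp", "int", "int", "fp", "fp"]
print(scalar_cycles(stream))      # 6
print(dual_issue_cycles(stream))  # 4
```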
C. Interlocking
Another architectural feature present in some RISC and VLIW architectures, but never in superscalars, is the lack of interlocks. In a pipelined processor, it is important to ensure that a stall somewhere in the pipeline won't result in the machine performing incorrectly. This could happen if later stages of the pipeline do not detect the stall, and thus proceed as if the stalled stage had completed. To prevent this, most architectures incorporate interlocks on the pipeline stages. Removing interlocks from the architecture is beneficial, because they complicate the design and can take time to set up, lowering the overall clock rate. However, doing so means that the compiler (or assembly-language programmer) must know details about the timing of pipeline stages for each instruction in the processor, and insert NOPs into the code to ensure correctness. This makes code incredibly hardware-specific. Both the architectures studied in detail below are fully interlocked, though Sun's ill-fated MAJC architecture was not, and relied on fast, universal JIT compilation to solve the hardware problems.
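The burden this places on the compiler can be sketched in Python. The example assumes a single-issue machine without interlocks, in which a result becomes readable a fixed number of cycles after its instruction issues. The opcodes and both latency tables are invented for illustration and do not describe any real processor.

```python
NOP = ("nop", None, ())


def insert_nops(program, latency):
    """Pad a program with NOPs so no instruction reads a result too early.

    program: list of (opcode, dest_register, source_registers) in order.
    latency: opcode -> cycles after issue until the result may be read.
    """
    scheduled = []
    ready_cycle = {}  # register -> first cycle in which it may be read
    for opcode, dest, sources in program:
        cycle = len(scheduled)  # cycle in which this instruction would issue
        needed = max((ready_cycle.get(s, 0) for s in sources), default=0)
        scheduled.extend([NOP] * max(0, needed - cycle))
        issue = len(scheduled)
        scheduled.append((opcode, dest, sources))
        if dest is not None:
            ready_cycle[dest] = issue + latency[opcode]
    return scheduled


program = [
    ("load", "r1", ("r0",)),
    ("add", "r2", ("r1", "r3")),
    ("mul", "r4", ("r2", "r2")),
    ("add", "r5", ("r6", "r7")),
]
# Two hypothetical implementations of the same instruction set.
slow_loads = {"load": 3, "add": 1, "mul": 2}
fast_loads = {"load": 1, "add": 1, "mul": 2}

print([op for op, _, _ in insert_nops(program, slow_loads)])
# ['load', 'nop', 'nop', 'add', 'mul', 'add']
print([op for op, _, _ in insert_nops(program, fast_loads)])
# ['load', 'add', 'mul', 'add']
```

The same source program yields different instruction streams for the two latency tables, which is why code compiled for a machine without interlocks is tied to one specific implementation.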