Intel Architecture: 2.1. Brief History of the IA-32 Architecture
computer architecture, as measured by the computers in use and total computing power available in the world. The two major factors that drive the popularity of IA-32 architecture are: (1) software compatibility and (2) the fact that each generation of IA-32 processors delivers significantly higher performance than its predecessor. This chapter provides a brief historical summary of the IA-32 architecture, from the Intel 8086 processor to the latest version implemented in the Pentium 4 and Intel Xeon processors.
One of the most important achievements of the IA-32 architecture is that object code created for processors released in 1978 still executes on the latest processors in the IA-32 architecture family.
2.1.1.
The exponential growth of computing power and personal computer ownership made the computer one of the most important forces that shaped business and society in the second half of the twentieth century. Computers are expected to continue to play crucial roles in the growth of technology, business, and new arenas.
The IA-32 architecture can be traced to Intel 8085 and 8080 microprocessors and to the Intel 4004 microprocessor (the first microprocessor, designed by Intel in 1969). The IA-32 architecture family was preceded by 16-bit processors which include the 8086 and the 8088 processors. The 8086 has 16-bit registers and a 16-bit external data bus, with 20-bit addressing giving a 1-MByte address space. The 8088 is similar to the 8086 except it has an 8-bit external data bus. These processors introduced segmentation to the IA-32 architecture. With segmentation, a 16-bit segment register contains a pointer to a memory segment of up to 64 KBytes. Using four segment registers at a time, the 8086/8088 processors are able to address up to 256 KBytes without switching between segments. The 20-bit addresses that can be formed using a segment register and an additional 16-bit pointer provide a total address range of 1 MByte.
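The way a segment register and a 16-bit pointer combine into a 20-bit address can be sketched as follows. The text above does not spell out the combining rule; the sketch assumes the usual 8086 convention that the segment value is shifted left by four bits before the offset is added. The function name is hypothetical.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Form a 20-bit 8086-style address from a 16-bit segment and a 16-bit offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # Assumption: segment * 16 + offset, truncated to the 20 address bits.
    return ((segment << 4) + offset) & 0xFFFFF


SEGMENT_SIZE = 64 * 1024          # each segment spans up to 64 KBytes
REACHABLE = 4 * SEGMENT_SIZE      # four segment registers in use at once

print(hex(real_mode_address(0x1234, 0x0010)))  # 0x12350
print(REACHABLE // 1024, "KBytes reachable without reloading a segment register")
```

With four segment registers loaded, 4 x 64 KBytes = 256 KBytes is reachable at once, which matches the figure quoted above.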
The Intel386 processor was the first 32-bit processor in the IA-32 architecture family. Introduced in 1985, it provided 32-bit registers for use both to hold operands and for addressing. The lower half of each 32-bit Intel386 register retains the properties of the 16-bit registers of earlier generations. This permits complete backward compatibility. The processor also provides a virtual-8086 mode that allows for greater efficiency when executing programs created for the 8086 and 8088 processors.

The Intel386 processor has a 32-bit address bus and supports up to 4 GBytes of physical memory. Logical address space is provided for each software process. The 32-bit architecture supports both a segmented-memory model and a flat memory model (see footnote 1). In the flat memory model, segment registers point to the same address. All 4 GBytes of addressable space within each segment are accessible. Earlier 16-bit instructions were enhanced with new Intel386 32-bit operands and addressing forms. The processor also introduced paging, with the fixed 4-KByte page size providing a method for virtual memory management that is superior to using segments for this purpose.

The Intel386 processor was the first IA-32 processor to include a number of parallel stages. The six stages are: the bus interface unit (accesses memory and I/O for the other units), the code prefetch unit (receives object code from the bus unit and puts it into a 16-byte queue), the instruction decode unit (decodes object code from the prefetch unit into microcode), the execution unit (executes the microcode instructions), the segment unit (translates logical addresses to linear addresses and does protection checks), and the paging unit (translates linear addresses to physical addresses, does page-based protection checks, and contains a cache with information for up to the 32 most recently accessed pages).
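As a rough illustration of what the paging unit works with, the sketch below splits a 32-bit linear address into its fields for 4-KByte pages. The 10/10/12-bit split (page directory index, page table index, byte offset) is the conventional two-level layout and is stated here as an assumption, since the text above only gives the page size. The function name is hypothetical.

```python
from typing import Tuple

PAGE_SIZE = 4096  # fixed 4-KByte pages


def split_linear_address(linear: int) -> Tuple[int, int, int]:
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    if not 0 <= linear <= 0xFFFFFFFF:
        raise ValueError("linear address must fit in 32 bits")
    directory = (linear >> 22) & 0x3FF   # top 10 bits (assumed layout)
    table = (linear >> 12) & 0x3FF       # next 10 bits (assumed layout)
    offset = linear & (PAGE_SIZE - 1)    # low 12 bits select a byte in the page
    return directory, table, offset


print(split_linear_address(0x00403025))  # (1, 3, 37)
```

Only the offset passes through translation unchanged; the two indexes select the page, which is why a small cache of recently used page translations can serve many accesses.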
The Intel486 processor, introduced in 1989, added additional parallel execution capability by expanding the Intel386 processor's instruction decode and execution units into five pipelined stages. Each stage operates in parallel with the others on up to five instructions in different stages of execution. Each stage can do its work on one instruction in one clock, so the Intel486 processor can execute as rapidly as one instruction per clock cycle. An 8-KByte on-chip first-level cache was added to the Intel486 processor to greatly increase the percentage of instructions that could execute at the scalar rate of one per clock. Memory access instructions are included if the operand is in the first-level cache. The Intel486 processor also added an integrated x87 FPU.

Subsequent generations of the Intel486 processor incorporated new power saving and system management capabilities. These features were initially developed for processors targeted at the notebook PC market (the Intel386 SL and Intel486 SL processors). They include: System Management Mode (triggered by a dedicated interrupt pin), the Stop Clock, and Auto Halt Powerdown.

1. Requires only one 32-bit address component to access anywhere in the linear address space.
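The "one instruction per clock" rate of a five-stage pipeline can be illustrated with a small idealized model. It assumes no stalls, cache misses, or branches, so it gives a best case rather than a description of real Intel486 timing. The function names are hypothetical.

```python
def pipelined_cycles(instructions: int, stages: int = 5) -> int:
    """Cycles for an ideal pipeline: fill the pipe once, then finish one per clock."""
    if instructions <= 0:
        return 0
    return stages + instructions - 1


def unpipelined_cycles(instructions: int, stages: int = 5) -> int:
    """Cycles if each instruction had to pass through every stage before the next starts."""
    return max(instructions, 0) * stages


n = 100
print(pipelined_cycles(n), unpipelined_cycles(n))  # 104 500
```

For long instruction sequences the pipelined cost per instruction approaches one clock, which is the scalar rate the paragraph above refers to.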
The introduction of the Intel Pentium processor in 1993 added a second execution pipeline to achieve superscalar performance (two pipelines, known as u and v, together can execute two instructions per clock). The on-chip first-level cache doubled, with 8 KBytes devoted to code and another 8 KBytes devoted to data. The data cache uses the MESI protocol to support the more efficient write-back cache in addition to the write-through cache previously used by the Intel486 processor. Branch prediction with an on-chip branch table was added to increase performance in looping constructs. Extensions were added to make the virtual-8086 mode more efficient and to allow for 4-MByte as well as 4-KByte pages. The processor registers are still 32 bits, but internal data paths of 128 and 256 bits add speed to internal data transfers. The burstable external data bus was increased to 64 bits. An Advanced Programmable Interrupt Controller (APIC) was added to support systems with multiple Pentium processors. New pins and a special mode (dual processing) were designed in to support glueless two-processor systems.

A subsequent stepping of the Pentium family introduced Intel MMX Technology (the Pentium Processor with MMX technology). Intel MMX technology uses the single-instruction, multiple-data (SIMD) execution model to perform parallel computations on packed integer data contained in 64-bit MMX registers. This technology greatly enhanced the performance in advanced media, image processing, and data compression applications.
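To make the SIMD idea concrete, the sketch below models one packed operation on a 64-bit value that holds four independent 16-bit integers. It is a software model only: the choice of 16-bit lanes and wraparound arithmetic is an assumption made for illustration, and the function name is hypothetical.

```python
def packed_add_16(a: int, b: int, lanes: int = 4) -> int:
    """Add corresponding 16-bit lanes of two packed values, wrapping within each lane."""
    result = 0
    for i in range(lanes):
        shift = 16 * i
        lane_sum = ((a >> shift) & 0xFFFF) + ((b >> shift) & 0xFFFF)
        # Keep only 16 bits so a carry never spills into the neighbouring lane.
        result |= (lane_sum & 0xFFFF) << shift
    return result


a = 0x0001_FFFF_0003_0004
b = 0x0001_0001_0001_0001
print(hex(packed_add_16(a, b)))  # 0x2000000040005
```

One call stands in for a single SIMD instruction: all four additions happen as one operation, and the lane holding 0xFFFF wraps to 0 without disturbing the others.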
Intel introduced the P6 family of processors in 1995. This processor family was based on a superscalar micro-architecture that set new performance standards. One of the goals in the design of the P6 family micro-architecture was to exceed the performance of the Pentium processor significantly while using the same 0.6-micrometer, four-layer, metal BICMOS manufacturing process. This meant that performance gains could only be achieved through substantial advances in the micro-architecture.
The Intel Pentium Pro processor was the first processor based on the P6 micro-architecture. Subsequent members of the P6 processor family include: the Intel Pentium II, Intel Pentium II Xeon, Intel Celeron, Intel Pentium III, and Intel Pentium III Xeon processors.

The Pentium Pro processor is three-way superscalar. By using parallel processing techniques, the processor is able on average to decode, dispatch, and complete execution of (retire) three instructions per clock cycle. The processor also introduced dynamic execution (micro-data flow analysis, out-of-order execution, superior branch prediction, and speculative execution) in a superscalar implementation. Three instruction decode units work in parallel to decode object code into smaller operations called micro-ops (micro-architecture op-codes). These micro-ops are fed into an instruction pool and (when interdependencies permit) can be executed out of order by the five parallel execution units (two integer, two FPU and one memory interface unit). The Retirement Unit retires completed micro-ops in their original program order, taking branches into account.

The Pentium Pro processor was further enhanced by its caches. It has the same two on-chip 8-KByte 1st-Level caches as the Pentium processor and an additional 256-KByte 2nd-Level cache in the same package as the processor. The 256-KByte 2nd-Level cache uses a dedicated 64-bit backside (cache-bus) full clock speed bus. The 1st-Level cache is dual-ported; the 2nd-Level cache supports up to 4 concurrent accesses. The 64-bit external data bus is transaction-oriented, meaning that each access is handled as a separate request and response with numerous requests allowed while awaiting a response. The Pentium Pro processor's expanded 36-bit address bus gives a maximum physical address space of 64 GBytes.

The Intel Pentium II processor added Intel MMX Technology to the P6 family processors along with new packaging and several hardware enhancements. The processor core is packaged in the single edge contact cartridge (SECC), enabling ease of design and flexible motherboard architecture. The 1st-Level data and instruction caches were enlarged to 16 KBytes each, and 2nd-Level cache sizes of 256 KBytes, 512 KBytes, and 1 MByte are supported. A half-clock speed backside bus connects the 2nd-Level cache to the processor. Multiple low-power states such as AutoHALT, Stop-Grant, Sleep, and Deep Sleep are supported to conserve power when idling.

The Pentium II Xeon processor combined premium characteristics of previous generations of Intel processors. These include 4-way, 8-way (and up) scalability and a 2-MByte 2nd-Level cache running on a full-clock speed backside bus.

The Intel Celeron processor family focused the IA-32 architecture on the desktop or value PC market segment. It offers an integrated 128 KBytes of Level 2 cache and a plastic pin grid array (PPGA) form factor to lower system design cost.

The Pentium III processor introduced the Streaming SIMD Extensions (SSE) to the IA-32 architecture. SSE extensions expand the SIMD execution model introduced with the Intel MMX technology by providing a new set of 128-bit registers and the ability to perform SIMD operations on packed single-precision floating-point values.

The Pentium III Xeon processor extended the performance levels of the IA-32 processors with a full-speed, on-die Advanced Transfer Cache.
In 2000, the Intel Pentium 4 processor introduced the Intel NetBurst micro-architecture. The Intel NetBurst micro-architecture allows processors to operate at significantly higher clock speeds and performance levels than previous IA-32 processors. The processor has the following features:
- Intel NetBurst micro-architecture (see Section 2.2.3., The Intel NetBurst Micro-Architecture, for a detailed description):
  - Rapid Execution Engine
  - Hyper Pipelined Technology
  - Advanced Dynamic Execution
  - Innovative cache subsystem (see footnote 2)
- Streaming SIMD Extensions 2 (SSE2):
  - Extends the Intel MMX Technology and the SSE extensions with 144 new instructions; these include support for:
    - 128-bit SIMD integer arithmetic operations
    - 128-bit SIMD double precision floating point operations
    - Cache and memory management operations
  - Enhances and accelerates video, speech, encryption, image and photo processing
- A 400 MHz Intel NetBurst micro-architecture system bus; this includes:
  - 3.2 GBytes per second throughput (3 times faster than the Pentium III processor)
  - Quad-pumped 100 MHz scalable bus clock achieves 400 MHz effective speed
  - Split-transaction, pipelined
  - 64-byte line size with 128-byte accesses
  - Support for higher data throughput with higher bus clock
- Support for Hyper-Threading Technology (see Section 2.2.4., Hyper-Threading Technology)
- Compatible with applications and operating systems written to run on Intel IA-32 architecture processors
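The system bus figures in the list above are related by simple arithmetic, sketched below. The 64-bit (8-byte) data path width is an assumption carried over from the external data bus width given for earlier processors; the list itself states only the clock, the quad-pumping, and the resulting throughput. The function name is hypothetical.

```python
def bus_bandwidth_bytes_per_s(clock_hz: int, transfers_per_clock: int, width_bits: int) -> int:
    """Peak bus throughput: transfers per second times bytes per transfer."""
    return clock_hz * transfers_per_clock * width_bits // 8


BUS_CLOCK_HZ = 100_000_000   # 100 MHz bus clock
PUMPING = 4                  # quad-pumped: four transfers per bus clock
DATA_WIDTH_BITS = 64         # assumed data path width

effective_hz = BUS_CLOCK_HZ * PUMPING
peak = bus_bandwidth_bytes_per_s(BUS_CLOCK_HZ, PUMPING, DATA_WIDTH_BITS)
print(effective_hz // 1_000_000, "MHz effective")   # 400 MHz effective
print(peak / 1e9, "GBytes per second")              # 3.2 GBytes per second
```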
2. The Intel Pentium 4 processor uses a cache line size of 64 bytes throughout its cache hierarchy. The larger unified cache levels use a sectored implementation, where each 128-byte cache sector consists of two associated 64-byte cache lines.
The Intel Pentium M processor is a high performance, low power mobile processor with microarchitectural enhancements over previous generations of Intel mobile processors. The Pentium M processor includes the following features:
- Supports Intel Architecture with Dynamic Execution
- High performance, low-power core
- On-die, primary 32-KByte instruction cache and 32-KByte write-back data cache
- On-die, 1-MByte second level cache with Advanced Transfer Cache Architecture
- Advanced Branch Prediction and Data Prefetch Logic
- Streaming SIMD Extensions 2 (SSE2)
- 400 MHz, Source-Synchronous Processor System Bus
- Advanced Power Management features including Enhanced Intel SpeedStep Technology

The Intel Pentium M processor is manufactured using Intel's advanced 0.13 micron process technology with copper interconnect. The processor supports MMX Technology, Streaming SIMD instructions, and the SSE2 instruction set. It is fully compatible with IA-32 software. The high performance core features innovations like Micro-op Fusion and Advanced Stack Management. These reduce the number of µops handled by the processor, which results in more efficient scheduling and better performance at low power.

On-die 32-KByte first-level instruction and data caches and a 1-MByte second-level cache with Advanced Transfer Cache Architecture deliver significant performance improvements over previous generations of mobile Intel processors. The processor also features advanced branch prediction architecture that significantly reduces the number of mispredicted branches. The processor's Data Prefetch Logic speculatively fetches data to the second-level cache before a cache request to the first-level data cache occurs. This results in reduced bus cycle penalties.
The following sections provide more information on major additions to the IA-32 architecture.
Figure 2-1. The P6 Processor Micro-Architecture with Advanced Transfer Cache Enhancement
To ensure a steady supply of instructions and data for the instruction execution pipeline, the P6 processor micro-architecture incorporates two cache levels. The Level 1 cache provides an 8-KByte instruction cache and an 8-KByte data cache, both closely coupled to the pipeline. The second-level cache provides 256-KByte, 512-KByte, or 1-MByte static RAM that is coupled to the core processor through a full clock-speed 64-bit cache bus.
The centerpiece of the P6 processor micro-architecture is an out-of-order execution mechanism called dynamic execution. Dynamic execution incorporates three data-processing concepts:
- Deep branch prediction allows the processor to decode instructions beyond branches to keep the instruction pipeline full. The P6 processor family implements a highly optimized branch prediction algorithm to predict the direction of the instruction.
- Dynamic data flow analysis requires real-time analysis of the flow of data through the processor to determine dependencies and to detect opportunities for out-of-order instruction execution. The out-of-order execution core can monitor many instructions and execute these instructions in the order that best optimizes the use of the processor's multiple execution units, while maintaining the data integrity.
- Speculative execution refers to the processor's ability to execute instructions that lie beyond a conditional branch that has not yet been resolved, and ultimately to commit the results in the order of the original instruction stream.

To make speculative execution possible, the P6 processor micro-architecture decouples the dispatch and execution of instructions from the commitment of results. The processor's out-of-order execution core uses data-flow analysis to execute all available instructions in the instruction pool and store the results in temporary registers. The retirement unit then linearly searches the instruction pool for completed instructions that no longer have data dependencies with other instructions or unresolved branch predictions. When completed instructions are found, the retirement unit commits the results of these instructions to memory and/or the IA-32 registers (the processor's eight general-purpose registers and eight x87 FPU data registers) in the order they were originally issued and retires the instructions from the instruction pool.
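The data-flow idea described above (an instruction may execute as soon as its inputs are available, regardless of program order) can be sketched with a toy scheduler. This is a simplified model and not the P6 mechanism itself: it assumes unlimited execution units, ignores branches, and tracks only read-after-write dependencies. The instruction format and all names are hypothetical.

```python
from typing import Dict, List, Tuple

# One instruction: (destination register, source registers, latency in cycles)
Instr = Tuple[str, Tuple[str, ...], int]


def earliest_start_cycles(program: List[Instr]) -> List[int]:
    """Return, for each instruction in program order, the first cycle its inputs are ready."""
    ready_at: Dict[str, int] = {}   # register -> cycle when its latest value is available
    starts: List[int] = []
    for dest, sources, latency in program:
        # Registers never written inside the program are treated as ready at cycle 0.
        start = max((ready_at.get(src, 0) for src in sources), default=0)
        starts.append(start)
        ready_at[dest] = start + latency
    return starts


program = [
    ("r1", ("r0",), 3),   # slow operation
    ("r2", ("r1",), 1),   # must wait for r1
    ("r3", ("r4",), 1),   # independent of the first two
    ("r5", ("r3",), 1),   # depends only on r3
]
print(earliest_start_cycles(program))  # [0, 3, 0, 1]
```

In this run the third and fourth instructions can start before the second one, even though they come later in program order. Committing their results in the original order is then the job of the retirement step described above.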
In addition to new 128-bit SIMD instructions, there are 128-bit enhancements to 68 integer SIMD instructions. These operated solely on 64-bit MMX registers in the Pentium II and Pentium III processors. Those 64-bit integer SIMD instructions were enhanced to support operation on 128-bit XMM registers in the Pentium 4 processor. These enhanced integer SIMD instructions allow software developers to have maximum flexibility by writing SIMD code with either XMM registers or MMX registers.

The Intel Pentium 4 processor's features enable software developers to deliver new levels of performance in multimedia applications ranging from 3-D graphics and video decoding/encoding to speech recognition. The processor's packed double-precision floating-point instructions enhance performance for applications that require greater range and precision, including scientific and engineering applications and advanced 3-D geometry techniques, such as ray tracing.

To speed up processing and improve cache usage, the SSE2 extensions offer several new instructions that allow application programmers to control the cacheability of data. These instructions provide the ability to stream data in and out of the registers without disrupting the caches and the ability to prefetch data before it is actually used.

The new architectural features introduced with the SSE2 extensions do not require new operating system support. This is because the SSE2 extensions do not introduce new architectural states, and the FXSAVE/FXRSTOR instructions, which support the SSE extensions, also support the SSE2 extensions and are sufficient for saving and restoring the state of the XMM registers, the MMX registers, and the x87 FPU registers during a context switch. The CPUID instruction has been enhanced to allow operating systems or applications to identify the existence of the SSE and SSE2 features.

The SSE2 extensions are accessible in all IA-32 architecture operating modes on the Intel Pentium 4 and Intel Xeon processors. Both processors maintain IA-32 software compatibility, meaning all existing software continues to run correctly, without modification, on the Pentium 4, Intel Xeon, and future IA-32 processors that incorporate the SSE2 extensions. Also, existing software continues to run correctly in the presence of applications that make use of the SSE2 instructions.
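The feature check that CPUID makes possible can be sketched as a decoder for the feature-flag word. Python cannot execute CPUID directly, so the sketch takes the EDX value returned by CPUID function 1 as a plain integer obtained elsewhere. The bit positions used (23 for MMX, 24 for FXSAVE/FXRSTOR, 25 for SSE, 26 for SSE2) are the commonly documented ones and are not given in this chapter, so treat them as an assumption and confirm them against the CPUID reference. All names are hypothetical.

```python
from typing import Dict

# Assumed bit positions in EDX after CPUID function 1.
FEATURE_BITS = {"MMX": 23, "FXSR": 24, "SSE": 25, "SSE2": 26}


def decode_feature_flags(edx: int) -> Dict[str, bool]:
    """Map each feature name to True if its flag bit is set in the EDX value."""
    return {name: bool((edx >> bit) & 1) for name, bit in FEATURE_BITS.items()}


# Example EDX value built by hand: MMX, FXSR and SSE set, SSE2 clear.
sample_edx = (1 << 23) | (1 << 24) | (1 << 25)
flags = decode_feature_flags(sample_edx)
print(flags)
```

An application would normally test the SSE2 flag once at start-up and choose between an SSE2 code path and a fallback accordingly.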
- The Rapid Execution Engine
  - Arithmetic Logic Units (ALUs) run at twice the processor frequency
  - Basic integer operations execute in 1/2 processor clock tick
  - Provides higher throughput and reduced latency of execution
- Hyper-Pipelined Technology
  - Twenty-stage pipeline to enable industry-leading clock rates for desktop PCs and servers
- Up to 126 instructions in flight
- Up to 48 loads and 24 stores in pipeline
- Enhanced branch prediction capability
  - Reduces the misprediction penalty associated with deeper pipelines
  - Advanced branch prediction algorithm
  - 4K-entry branch target array
- Advanced Execution Trace Cache stores decoded instructions
- Execution Trace Cache removes decoder latency from main execution loops
- Execution Trace Cache integrates path of program execution flow into a single line
- Low latency data cache with 2 cycle latency
- Second level cache
  - Full-speed, unified 8-way Level 2 on-die Advanced Transfer Cache
  - Bandwidth and performance increase with processor frequency
- High-performance, quad-pumped bus interface to the Intel NetBurst micro-architecture system bus
  - Supports quad-pumped, scalable bus clock to achieve 4X effective speed
  - Capable of delivering up to 3.2 GB of bandwidth per second (Pentium 4 and Intel Xeon processors)
- Superscalar issue to enable parallelism
- Expanded hardware registers with renaming to avoid register name space limitations
- 128-byte cache line size (two 64-byte sectors)

Figure 2-2 is an overview of the Intel NetBurst micro-architecture. This micro-architecture pipeline is made up of three sections: (1) the front end pipeline, (2) the out-of-order execution core, and (3) the retirement unit.
[Figure 2-2: overview of the Intel NetBurst micro-architecture. Blocks shown: 2nd Level Cache, 1st Level Cache, Front End (Fetch/Decode, Trace Cache, Microcode ROM), Out-Of-Order Core (Execution).]
2.2.3.1. THE FRONT END PIPELINE

The front end supplies instructions in program order to the out-of-order execution core. It performs a number of functions:

- Prefetches IA-32 instructions that are likely to be executed
- Fetches instructions that have not already been prefetched
- Decodes IA-32 instructions into micro-operations
- Generates microcode for complex instructions and special-purpose code
- Delivers decoded instructions from the execution trace cache
- Predicts branches using a highly advanced algorithm

The pipeline is designed to address common problems in high-speed, pipelined microprocessors. Two of these problems contribute to major sources of delays:

- Time to decode instructions fetched from the target
- Wasted decode bandwidth due to branches or branch target in the middle of cache lines

The operation of the pipeline's trace cache addresses these issues. Instructions are constantly being fetched and decoded by the translation engine (part of the fetch/decode logic) and built into sequences of µops called traces. At any time, multiple traces (representing prefetched branches) are being stored in the trace cache. The trace cache is searched for the instruction that follows the active branch. If the instruction also appears as the first instruction in a pre-fetched branch, the fetch and decode of instructions from the memory hierarchy ceases and the prefetched branch becomes the new source of instructions (see Figure 2-2).

The trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear addresses using branch target buffers (BTBs) and fetched as soon as possible.

2.2.3.2. OUT-OF-ORDER EXECUTION CORE

The out-of-order execution core's ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed, other µops may proceed around it. The processor employs several buffers to smooth the flow of µops.

The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle (this exceeds trace cache and retirement µop bandwidth). Most pipelines can start executing a new µop every cycle, so several instructions can be in flight at a time for each pipeline. A number of arithmetic logical unit (ALU) instructions can start at two per cycle; many floating-point instructions can start once every two cycles.

Note that µops can begin execution, out of order, as soon as their data inputs are ready and resources are available.

2.2.3.3. RETIREMENT UNIT

The retirement unit receives the results of the executed µops from the out-of-order execution core and processes the results so that the architectural state updates according to the original program order.

Instructions retire in program order. IA-32 exceptions always occur in program order. This means that exceptions do not occur speculatively; they must occur in correct order so that there can be an appropriate restart after an exception.

When a µop completes and writes its result, it is retired. Up to three µops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the BTB. The BTB then purges pre-fetched traces that are no longer needed.
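The in-order retirement behaviour described above can be sketched with a small reorder-buffer model: µops may finish executing in any order, but they leave the buffer strictly in program order, at most three per cycle. This is an illustrative simplification; it assumes a µop may retire in the same cycle it completes and ignores exceptions and branch recovery. All names are hypothetical.

```python
from collections import deque
from typing import List


def retire_in_order(completion_cycle: List[int], width: int = 3) -> List[int]:
    """Given the cycle each µop finishes executing (indexed in program order),
    return the cycle in which each µop retires."""
    rob = deque(range(len(completion_cycle)))   # oldest µop sits at the left
    retired_at = [0] * len(completion_cycle)
    cycle = 0
    while rob:
        retired_this_cycle = 0
        # Only the oldest entry may retire, and only once it has completed.
        while rob and retired_this_cycle < width and completion_cycle[rob[0]] <= cycle:
            retired_at[rob.popleft()] = cycle
            retired_this_cycle += 1
        cycle += 1
    return retired_at


# µops 1, 2 and 3 finish early but must wait behind µop 0, which finishes at cycle 5.
print(retire_in_order([5, 1, 2, 0, 6]))  # [5, 5, 5, 6, 6]
```

The example shows both limits at work: nothing retires until the oldest µop completes, and the fourth µop slips to the next cycle because only three may retire per cycle.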
Hyper-Threading (HT) Technology was developed to improve the performance of IA-32 processors when executing multi-threaded operating system and application code or single-threaded applications under multi-tasking environments. The technology enables a single physical processor to execute two or more separate code streams (threads) concurrently.
Architecturally, an IA-32 processor that supports HT Technology consists of two or more logical processors, each of which has its own IA-32 architectural state. Each logical processor consists of the IA-32 data registers, segment registers, control registers, debug registers and most of the MSRs. Each also has its own advanced programmable interrupt controller (APIC). Figure 2-3 shows a comparison of an IA-32 processor with HT Technology (implemented with two logical processors) and a traditional dual processor system.
[Figure 2-3 shows two configurations. IA-32 processor with Hyper-Threading Technology: two logical processors, each with its own IA-32 architectural state (AS), share a single processor core. Traditional multiple processor (MP) system: each IA-32 processor is a separate physical package with its own architectural state and processor core.]

Figure 2-3. Comparison of an IA-32 Processor with Hyper-Threading Technology and a Traditional Dual Processor System
Unlike a traditional MP system configuration that uses two or more separate physical IA-32 processors, the logical processors in an IA-32 processor with HT Technology share the core resources of the physical processor. This includes the execution engine and the system bus interface. After power up and initialization, each logical processor can be independently directed to execute a specified thread, interrupted, or halted.

HT Technology leverages the process and thread-level parallelism found in contemporary operating systems and high-performance applications by providing two or more logical processors on a single chip. This configuration allows two or more threads (see footnote 3) to be executed simultaneously on each physical processor. Each logical processor executes instructions from an application thread using the resources in the processor core. The core executes these threads concurrently, using out-of-order instruction scheduling to maximize the use of execution units during each clock cycle.

3. In the remainder of this document, the term thread will be used as a general term for the terms process and thread.

2.2.4.1. NOTES ON IMPLEMENTATION

Hyper-Threading (HT) Technology was introduced into the IA-32 architecture in the Intel Xeon processor MP and in later steppings of the Intel Xeon processor. It is also supported by the Intel Pentium 4 processor at 3.06 GHz or higher. All HT Technology configurations require a chipset and BIOS that utilize the technology, and an operating system that includes optimizations for HT Technology. See www.intel.com/info/hyperthreading for more information.

At the firmware (BIOS) level, the basic procedures to initialize the logical processors in a processor supporting HT Technology are the same as those for a traditional DP or MP platform (see footnote 4). The same mechanisms that are described in the Multiprocessor Specification Version 1.4 to power-up and initialize physical processors in an MP system apply to the logical processors in an HT Technology-enabled processor.

An operating system designed to run on a traditional DP or MP platform uses the CPUID instruction to detect the presence of an IA-32 processor with Hyper-Threading Technology and the number of logical processors it provides.

Although existing operating system and application code should run correctly on a processor that supports HT Technology, some code modifications are recommended to get the optimum benefit. These modifications are discussed in the IA-32 Intel Architecture Software Developer's Manual, Volume 3, in the section titled Required Operating System Support in Chapter 7, Multiple Processor Management.
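How an operating system might interpret the CPUID results mentioned above can be sketched as follows. The sketch works on register values passed in as integers, since Python cannot issue CPUID itself. The specific fields used (bit 28 of EDX as the Hyper-Threading flag and bits 23:16 of EBX as the logical processor count, both from CPUID function 1) are the commonly documented locations and are not stated in this chapter, so they are labeled here as assumptions. The function name is hypothetical.

```python
def logical_processors_per_package(edx: int, ebx: int) -> int:
    """Return the number of logical processors reported for one physical package."""
    htt_flag = (edx >> 28) & 1          # assumed: EDX bit 28 signals HT Technology
    if not htt_flag:
        return 1                        # no HT support: one logical processor
    return (ebx >> 16) & 0xFF           # assumed: EBX[23:16] holds the count


# Hand-built register values for illustration only.
edx_with_ht = 1 << 28
ebx_two_logical = 2 << 16
print(logical_processors_per_package(edx_with_ht, ebx_two_logical))  # 2
print(logical_processors_per_package(0, ebx_two_logical))            # 1
```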
In the mid-1960s, Intel cofounder and Chairman Emeritus Gordon Moore made this observation: the number of transistors that would be incorporated on a silicon die would double every 18 months for the next several years. Over the past three and a half decades, this prediction, known as Moore's Law, has continued to hold true. The computing power and the complexity (or roughly, the number of transistors per processor) of Intel architecture processors have grown in close relation to Moore's law. By taking advantage of new process technology and new micro-architecture designs, each new generation of IA-32 processors has demonstrated frequency-scaling headroom and new performance levels over the previous generation processors.

The key features of the Intel Pentium 4 processor, Intel Xeon processor, Intel Xeon processor MP, Pentium III and Pentium III Xeon processors with advanced transfer cache are shown in Table 2-1. Older generation IA-32 processors, which do not employ on-die Level 2 cache, are shown in Table 2-2.
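The growth rate can be checked against the transistor counts listed in Tables 2-1 and 2-2 with a short calculation. The sketch below derives the average doubling period implied by two data points; it is plain arithmetic on the table values (29 K transistors for the 8086 in 1978, 42 M for the Pentium 4 in 2000) and makes no claim beyond them. The function name is hypothetical.

```python
import math


def doubling_period_months(count_start: float, count_end: float, years: float) -> float:
    """Average number of months per doubling between two transistor counts."""
    doublings = math.log2(count_end / count_start)
    return years * 12 / doublings


# 8086 (1978, 29 K transistors) to Pentium 4 (2000, 42 M transistors)
period = doubling_period_months(29_000, 42_000_000, 2000 - 1978)
print(round(period, 1), "months per doubling")
```

For these two endpoints the result is roughly two years per doubling, which is of the same order as the 18-month figure quoted above. Different endpoints give somewhat different periods.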
4. Some relatively simple enhancements to the MP initialization algorithm are needed.
INTRODUCTION TO THE IA-32 INTEL ARCHITECTURE

Table 2-1. Key Features of Most Recent IA-32 Processors

| Intel Processor | Date Introduced | Micro-Architecture | Clock Frequency at Introduction | Transistors per Die | Register Sizes (note 1) | System Bus Bandwidth | Max. Extern. Addr. Space | On-Die Caches (note 2) |
|---|---|---|---|---|---|---|---|---|
| Pentium 4 Processor | 2000 | Intel NetBurst Micro-architecture | 1.50 GHz | 42 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.2 GB/s | 64 GB | 12K µop Execution Trace Cache; 8-KB L1; 256-KB L2 |
| Intel Xeon Processor | 2001 | Intel NetBurst Micro-architecture | 1.70 GHz | 42 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.2 GB/s | 64 GB | 12K µop Trace Cache; 8-KB L1; 256-KB L2 |
| Intel Xeon Processor | 2002 | Intel NetBurst Micro-architecture; Hyper-Threading Technology | 2.20 GHz | 55 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.2 GB/s | 64 GB | 12K µop Trace Cache; 8-KB L1; 512-KB L2 |
| Intel Xeon Processor MP | 2002 | Intel NetBurst Micro-architecture; Hyper-Threading Technology | 1.60 GHz | 108 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.2 GB/s | 64 GB | 12K µop Trace Cache; 8-KB L1; 256-KB L2; 1-MB L3 |
| Intel Pentium 4 Processor supporting Hyper-Threading Technology | 2002 | Intel NetBurst Micro-architecture; Hyper-Threading Technology | 3.06 GHz | 55 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 4.2 GB/s | 64 GB | 12K µop Execution Trace Cache; 8-KB L1; 512-KB L2 |
| Intel Pentium M Processor | 2003 | Intel Pentium M Processor | 1.60 GHz | 77 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.2 GB/s | 64 GB | L1: 64 KB; L2: 1 MB |

NOTES
1. The register size and external data bus size are given in bits.
2. First level cache is denoted using the abbreviation L1, 2nd level cache is denoted as L2. The size of L1 includes the first-level data cache and the instruction cache where applicable, but does not include the trace cache.
Table 2-2. Key Features of Previous Generations of IA-32 Processors

| Intel Processor | Date Introduced | Max. Clock Frequency at Introduction | Transistors per Die | Register Sizes (note 1) | Ext. Data Bus Size (note 2) | Max. Extern. Addr. Space | Caches |
|---|---|---|---|---|---|---|---|
| 8086 | 1978 | 8 MHz | 29 K | 16 GP | 16 | 1 MB | None |
| Intel 286 | 1982 | 12.5 MHz | 134 K | 16 GP | 16 | 16 MB | Note 3 |
| Intel386 DX Processor | 1985 | 20 MHz | 275 K | 32 GP | 32 | 4 GB | Note 3 |
| Intel486 DX Processor | 1989 | 25 MHz | 1.2 M | 32 GP, 80 FPU | 32 | 4 GB | L1: 8 KB |
| Pentium Processor | 1993 | 60 MHz | 3.1 M | 32 GP, 80 FPU | 64 | 4 GB | L1: 16 KB |
| Pentium Pro Processor | 1995 | 200 MHz | 5.5 M | 32 GP, 80 FPU | 64 | 64 GB | L1: 16 KB; L2: 256 KB or 512 KB |
| Pentium II Processor | 1997 | 266 MHz | 7 M | 32 GP, 80 FPU, 64 MMX | 64 | 64 GB | L1: 32 KB; L2: 256 KB or 512 KB |
| Pentium III Processor | 1999 | 500 MHz | 8.2 M | 32 GP, 80 FPU, 64 MMX, 128 XMM | 64 | 64 GB | L1: 32 KB; L2: 512 KB |
| Pentium III and Pentium III Xeon Processors | 1999 | 700 MHz | 28 M | 32 GP, 80 FPU, 64 MMX, 128 XMM | 64 | 64 GB | L1: 32 KB; L2: 256 KB |