Table of Contents
• Index
PCI Express System Architecture
By MindShare, Inc., Ravi Budruk,
Don Anderson, Tom Shanley
Publisher: Addison-Wesley
Pub Date: September 04, 2003
ISBN: 0-321-15630-7
Pages: 1120
Copyright
Figures
Tables
Acknowledgments
About This Book
The MindShare Architecture Series
Cautionary Note
Intended Audience
Prerequisite Knowledge
Topics and Organization
Documentation Conventions
Visit Our Web Site
We Want Your Feedback
Part One. The Big Picture
Chapter 1. Architectural Perspective
This Chapter
The Next Chapter
Introduction To PCI Express
Predecessor Buses Compared
I/O Bus Architecture Perspective
The PCI Express Way
PCI Express Specifications
Chapter 2. Architecture Overview
Previous Chapter
This Chapter
The Next Chapter
Introduction to PCI Express Transactions
PCI Express Device Layers
Example of a Non-Posted Memory Read Transaction
Hot Plug
PCI Express Performance and Data Transfer Efficiency
Chapter 9. Interrupts
The Previous Chapter
This Chapter
The Next Chapter
Two Methods of Interrupt Delivery
Message Signaled Interrupts
Legacy PCI Interrupt Delivery
Devices May Support Both MSI and Legacy Interrupts
Special Consideration for Base System Peripherals
Appendices
Appendix A. Test, Debug and Verification
Scope
Serial Bus Topology
Dual-Simplex
Setting Up the Analyzer, Capturing and Triggering
Link Training, the First Step in Communication
Slot Connector vs. Mid-Bus Pad
Exercising: In-Depth Verification
Signal Integrity, Design and Measurement
Index
Copyright
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in this book, and Addison-Wesley
was aware of the trademark claim, the designations have been printed in initial capital letters or
all capital letters.
The authors and publisher have taken care in the preparation of this book, but make no expressed
or implied warranty of any kind and assume no responsibility for errors or omissions. No liability
is assumed for incidental or consequential damages in connection with or arising out of the use
of the information or programs contained herein.
The publisher offers discounts on this book when ordered in quantity for bulk purchases and
special sales. For more information, please contact:
International Sales
(317) 581-3793
[email protected]
Budruk, Ravi.
PCI express system architecture / MindShare, Inc., Ravi Budruk ... [et al.].
p. cm.
Includes index.
ISBN 0-321-15630-7 (alk. paper)
1. Computer architecture. 2. Microcomputers - Buses. 3. Computer architecture. I.
Budruk, Ravi. II. MindShare, Inc. III. Title.
QA76.9.A73P43 2003
004.2'2 dc22 2003015461
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording,
or otherwise, without the prior consent of the publisher. Printed in the United States of America.
Published simultaneously in Canada.
For information on obtaining permission for use of material from this work, please submit a
written request to:
1 2 3 4 5 6 7 8 9 10  CRS  07 06 05 04 03
Dedication
To my parents Aruna and Shripal Budruk who started me on the path to Knowledge
Figures
1-1 Comparison of Performance Per Pin for Various Buses
3-15 PCI Express Devices And Type 0 And Type 1 Header Use
5-7 Example 1 that Shows Transmitter Behavior with Receipt of an ACK DLLP
5-8 Example 2 that Shows Transmitter Behavior with Receipt of an ACK DLLP
5-11 Example that Shows Receiver Behavior with Receipt of Good TLP
5-12 Example that Shows Receiver Behavior When It Receives Bad TLPs
6-19 Software checks Port Arbitration Capabilities and Selects the Scheme to be Used
7-12 Devices Confirm that Flow Control Initialization is Completed for a Given Buffer
9-11 Legacy Devices Use INTx Messages to Virtualize INTA#-INTD# Signal Transitions
10-12 Link Retraining Status Bits within the Link Status Register
11-6 TLP and DLLP Packet Framing with Start and End Control Characters
11-13 Scrambler
12-14 Screen Capture of a Bad Eye Showing Effect of Jitter, Noise and Signal
Attenuation (With no De-emphasis Shown)
14-4 Five Ordered-Sets Used in the Link Training and Initialization Process
16-1 Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI
16-2 Example of OS Powering Down All Functions On PCI Express Links and then the
Links Themselves
16-3 Example of OS Restoring a PCI Express Function To Full Power
16-10 PM Registers
16-20 Config. Registers Used for ASPM Exit Latency Management and Reporting
16-21 Devices Transition to L1 When Software Changes their Power Level from D0
A-1 PCI Parallel Bus Start and End of a Transaction Easily Identified
A-7 SKIP
3-9 Results Of Reading The BAR Pair after Writing All "1s" To Both
D-19 Class Code 11h: Data Acquisition and Signal Processing Controllers
Jay Trodden for his contribution in developing the chapter on Transaction Routing and Packet-
Based Transactions.
Mike Jackson for his contribution in preparing the Card Electromechanical chapter.
Thanks also to the PCI SIG for giving permission to use some of the mechanical drawings from
the specification.
About This Book
Cautionary Note
Intended Audience
Prerequisite Knowledge
Documentation Conventions
Part 2: PCI Express Transaction Protocol. Includes packet format and field definition and
use, along with transaction and link layer functions.
Part 3: Physical Layer Description. Describes the physical layer functions, link training and
initialization, reset, and electrical signaling.
Part 5: Optional Topics. Discusses the major features of PCI Express that are optional,
including Hot Plug and Expansion Card implementation details.
Appendix:
PCI Express™
PCI Express™ is a trademark of the PCI SIG. This book takes the liberty of abbreviating PCI
Express as "PCI-XP" primarily in illustrations where limited space is an issue.
Hexadecimal Notation
All hex numbers are followed by a lower case "h." For example:
89F2BD02h
0111h
Binary Notation
All binary numbers are followed by a lower case "b." For example:
01b
Decimal Notation
Numbers without any suffix are decimal. When required for clarity, decimal numbers are followed
by a lower case "d." Examples:
15
512d
Megabits/second = Mb/s
Megabytes/second = MB/s
Bit Fields
Groups of bits are represented with the high-order bits first, followed by the low-order bits, and
enclosed in brackets. For example: [7:0]
Signals that are active low are followed by #, as in PERST# and WAKE#. Active high signals
have no suffix, such as POWERGOOD.
Visit Our Web Site
Our web site lists all of our courses and the delivery options available for each course:
Technical papers
All of our books are listed and can be ordered in bound or e-book versions.
www.mindshare.com
We Want Your Feedback
MindShare values your comments and suggestions. Contact us at:
Mailing Address:
MindShare, Inc.
4285 Slash Pine Drive
Colorado Springs, CO 80908
Part One: The Big Picture
This Chapter
The PCI Express architects have carried forward the most beneficial features from previous
generation bus architectures and have also taken advantages of new developments in computer
architecture.
For example, PCI Express employs the same usage model and load-store communication
model as PCI and PCI-X. PCI Express supports familiar transactions such as memory
read/write, IO read/write and configuration read/write transactions. The memory, IO and
configuration address space model is the same as PCI and PCI-X address spaces. By
maintaining the address space model, existing OSs and driver software will run in a PCI
Express system without any modifications. In other words, PCI Express is software backwards
compatible with PCI and PCI-X systems. In fact, a PCI Express system will boot an existing
OS with no changes to current drivers and application programs. Even PCI/ACPI power
management software will still run.
Like predecessor buses, PCI Express supports chip-to-chip interconnect and board-to-board
interconnect via cards and connectors. The connector and card structure are similar to PCI and
PCI-X connectors and cards. A PCI Express motherboard will have a similar form factor to
existing FR4 ATX motherboards, which are encased in the familiar PC package.
To improve bus performance, reduce overall system cost and take advantage of new
developments in computer design, the PCI Express architecture had to be significantly re-
designed from its predecessor buses. PCI and PCI-X buses are multi-drop parallel interconnect
buses in which many devices share one bus.
PCI Express on the other hand implements a serial, point-to-point type interconnect for
communication between two devices. Multiple PCI Express devices are interconnected via the
use of switches which means one can practically connect a large number of devices together in
a system. A point-to-point interconnect implies limited electrical load on the link, allowing
transmission and reception frequencies to scale much higher. Currently, the PCI Express
transmission and reception data rate is 2.5 Gbits/sec. A serial interconnect between two
devices results in fewer pins per device package, which reduces PCI Express chip and board
design cost and board design complexity. PCI Express performance is also highly scalable.
This scalability is achieved by varying the number of pins and signal Lanes implemented per
interconnect based on the communication performance requirements for that interconnect.
The configuration address space available per function is extended to 4KB, allowing designers
to define additional registers. However, new software is required to access this extended
configuration register space.
In the future, PCI Express communication frequencies are expected to double and quadruple to
5 Gbits/sec and 10 Gbits/sec. Taking advantage of these frequencies will require Physical
Layer re-design of a device with no changes necessary to the higher layers of the device
design.
Additional mechanical form factors are expected, including a Server IO Module, NEWCARD
(PC Card style), and cable form factors.
Predecessor Buses Compared
In an effort to compare and contrast features of predecessor buses, the next section of this
chapter describes some of the key features of IO bus architectures defined by the PCI Special
Interest Group (PCISIG). These buses, shown in Table 1-1 on page 12, include the PCI 33
MHz bus, PCI 66 MHz bus, PCI-X 66 MHz/133 MHz buses, PCI-X 266/533 MHz buses and
finally PCI Express.
Author's Disclaimer
In comparing these buses, it is not the authors' intention to suggest that any one bus is better
than any other bus. Each bus architecture has its advantages and disadvantages. After
evaluating the features of each bus architecture, a particular bus architecture may turn out to
be more suitable for a specific application than another bus architecture. For example, it is the
system designer's responsibility to determine whether to implement a PCI-X bus or PCI Express
for the I/O interconnect in a high-end server design. Our goal in this chapter is to document the
features of each bus architecture so that the designer can evaluate the various bus
architectures.
Table 1-2 on page 13 shows the various bus architectures defined by the PCISIG. The table
shows the evolution of bus frequencies and bandwidths. As is obvious, increasing bus
frequency results in increased bandwidth. However, increasing bus frequency compromises the
number of electrical loads or number of connectors allowable on a bus at that frequency. At
some point, for a given bus architecture, there is an upper limit beyond which one cannot further
increase the bus frequency, hence requiring the definition of a new bus architecture.
Bus Type Clock Frequency Peak Bandwidth [*] Number of Card Slots per Bus
[*] Double all these bandwidth numbers for 64-bit bus implementations
A PCI Express interconnect that connects two devices together is referred to as a Link. A Link
consists of either x1, x2, x4, x8, x12, x16 or x32 signal pairs in each direction. These signals
are referred to as Lanes. A designer determines how many Lanes to implement based on the
targeted performance benchmark required on a given Link.
Table 1-3 shows aggregate bandwidth numbers for various Link width implementations. As is
apparent from this table, the peak bandwidth achievable with PCI Express is significantly higher
than any existing bus today.
Let us consider how these bandwidth numbers are calculated. The transmission/reception rate
is 2.5 Gbits/sec per Lane per direction. To support a greater degree of robustness during data
transmission and reception, each byte of data transmitted is converted into a 10-bit code (via
an 8b/10b encoder in the transmitter device). In other words, for every Byte of data to be
transmitted, 10 bits of encoded data are actually transmitted. The result is 25% additional
overhead to transmit a byte of data. Table 1-3 accounts for this encoding overhead in
transmission performance.
PCI Express implements a dual-simplex Link which implies that data is transmitted and received
simultaneously on a transmit and receive Lane. The aggregate bandwidth assumes
simultaneous traffic in both directions.
To obtain the aggregate bandwidth numbers in Table 1-3, multiply 2.5 Gbits/sec by 2 (for each
direction), then multiply by the number of Lanes, and finally divide by 10 bits per Byte (to account
for the 8-to-10 bit encoding).
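To make the arithmetic concrete, the short Python sketch below reproduces the calculation just described (2.5 Gbits/sec per Lane per direction, both directions, 10 encoded bits per byte); it is an illustration of the formula, not anything defined by the specification itself.

    # Aggregate Generation 1 PCI Express bandwidth, per the calculation above.
    GEN1_RATE_GBPS = 2.5         # Gbits/sec per Lane per direction
    ENCODED_BITS_PER_BYTE = 10   # 8b/10b: each byte travels as a 10-bit symbol

    def aggregate_bandwidth_gbytes(lanes):
        """Peak aggregate bandwidth (GBytes/sec) for a Link of the given width."""
        return GEN1_RATE_GBPS * 2 * lanes / ENCODED_BITS_PER_BYTE

    for width in (1, 2, 4, 8, 12, 16, 32):
        print(f"x{width}: {aggregate_bandwidth_gbytes(width):.1f} GBytes/sec")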
Table 1-3. PCI Express Aggregate Throughput for Various Link Widths
As is apparent from Figure 1-1, PCI Express achieves the highest bandwidth per pin. This
results in a device package with fewer pins and a motherboard implementation with fewer wires
and hence overall reduced system cost per unit bandwidth.
In Figure 1-1, the first 7 bars are associated with PCI and PCI-X buses where we assume 84
pins per device. This includes 46 signal pins, interrupt and power management pins, error pins
and the remainder are power and ground pins. The last bar associated with a x8 PCI Express
Link assumes 40 pins per device which include 32 signal lines (8 differential pairs per direction)
and the rest are power and ground pins.
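Using the pin counts quoted above (84 pins for a PCI/PCI-X device, 40 pins for a x8 PCI Express device) and representative peak bandwidths, a rough bandwidth-per-pin comparison can be sketched as follows; the exact bar heights in Figure 1-1 may differ from these illustrative numbers.

    # Rough bandwidth-per-pin comparison using the pin counts quoted for Figure 1-1.
    # Peak bandwidths: 133 MBytes/sec for 32-bit/33 MHz PCI, 1064 MBytes/sec for
    # 64-bit/133 MHz PCI-X, 4000 MBytes/sec aggregate for a x8 Link (Table 1-3).
    buses = {
        "PCI 32-bit/33 MHz":    (133,  84),   # (peak MBytes/sec, package pins)
        "PCI-X 64-bit/133 MHz": (1064, 84),
        "PCI Express x8":       (4000, 40),
    }
    for name, (mbytes_per_sec, pins) in buses.items():
        print(f"{name}: {mbytes_per_sec / pins:.1f} MBytes/sec per pin")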
I/O Bus Architecture Perspective
Figure 1-2 on page 17 is a 33 MHz PCI bus based system. The PCI system consists of a Host
(CPU) bus-to-PCI bus bridge, also referred to as the North bridge. Associated with the North
bridge is the system memory bus, graphics (AGP) bus, and a 33 MHz PCI bus. I/O devices
share the PCI bus and are connected to it in a multi-drop fashion. These devices are either
connected directly to the PCI bus on the motherboard or by way of a peripheral card plugged
into a connector on the bus. Devices connected directly to the motherboard consume one
electrical load while connectors are accounted for as 2 loads. A South bridge bridges the PCI
bus to the ISA bus where slower, lower performance peripherals exist. Associated with the
south bridge is a USB and IDE bus. A CD or hard disk is associated with the IDE bus. The
South bridge contains an interrupt controller (not shown) to which interrupt signals from PCI
devices are connected. The interrupt controller is connected to the CPU via an INTR signal or
an APIC bus. The South bridge is the central resource that provides the source of reset,
reference clock, and error reporting signals. Boot ROM exists on the ISA bus along with a
Super IO chip, which includes keyboard, mouse, floppy disk controller and serial/parallel bus
controllers. The PCI bus arbiter logic is included in the North bridge.
Figure 1-3 on page 18 represents a typical PCI bus cycle. The PCI bus clock is 33 MHz. The
address bus width is 32-bits (4GB memory address space), although PCI optionally supports
64-bit address bus. The data bus width is implemented as either 32-bits or 64-bits depending
on bus performance requirement. The address and data bus signals are multiplexed on the
same pins (AD bus) to reduce pin count. Command signals (C/BE#) encode the transaction
type of the bus cycle that master devices initiate. PCI supports 12 transaction types that
include memory, IO, and configuration read/write bus cycles. Control signals such as FRAME#,
DEVSEL#, TRDY#, IRDY#, STOP# are handshake signals used during bus cycles. Finally, the
PCI bus consists of a few optional error related signals, interrupt signals and power
management signals. A PCI master device implements a minimum of 49 signals.
Any PCI master device that wishes to initiate a bus cycle first arbitrates for use of the PCI bus
by asserting a request (REQ#) to the arbiter in the North bridge. After receiving a grant (GNT#)
from the arbiter and checking that the bus is idle, the master device can start a bus cycle.
PCI implements reflected-wave switching signal drivers. The driver drives a half signal swing
signal on the rising edge of PCI clock. The signal propagates down the PCI bus transmission
line and is reflected at the end of the transmission line where there is no termination. The
reflection causes the half swing signal to double. The doubled (full signal swing) signal must
settle to a steady state value with sufficient setup time prior to the next rising edge of PCI clock
where receiving devices sample the signal. The total time from when a driver drives a signal
until the receiver detects a valid signal (including propagation time and reflection delay plus
setup time) must be less than the clock period of 30 ns.
The more electrical loads on a bus, the longer it takes for the signal to propagate and double
with sufficient setup to the next rising edge of clock. As mentioned earlier, a 33 MHz PCI bus
meets signal timing with no more than 10-12 loads. Connectors on the PCI bus are counted as
2 loads because the connector is accounted for as one load and the peripheral card with a PCI
device is the second load. As indicated in Table 1-2 on page 13 a 33 MHz PCI bus can be
designed with a maximum of 4-5 connectors.
To connect any more than 10-12 loads in a system requires the implementation of a PCI-to-PCI
bridge as shown in Figure 1-4. This permits an additional 10-12 loads to be connected on the
secondary PCI bus 1. The PCI specification theoretically supports up to 256 buses in a system,
which means that PCI enumeration software must be able to detect and configure up to 256
PCI buses per system.
Consider an example in which the CPU communicates with a PCI peripheral such as an
Ethernet device shown in Figure 1-5. Transaction 1 shown in the figure, which is initiated by the
CPU and targets a peripheral device, is referred to as a programmed IO transaction. Software
commands the CPU to initiate a memory or IO read/write bus cycle on the host bus targeting
an address mapped in a PCI device's address space. The North bridge arbitrates for use of the
PCI bus and when it wins ownership of the bus generates a PCI memory or IO read/write bus
cycle represented in Figure 1-3 on page 18. During the first clock of this bus cycle (known as
the address phase), all target devices decode the address. One target (the Ethernet device in
this example) decodes the address and claims the transaction. The master (North bridge in this
case) communicates with the claiming target (Ethernet controller). Data is transferred between
master and target in subsequent clocks after the address phase of the bus cycle. Either 4
bytes or 8 bytes of data are transferred per clock tick depending on the PCI bus width. The
bus cycle is referred to as a burst bus cycle if data is transferred back-to-back between
master and target during multiple data phases of that bus cycle. Burst bus cycles result in the
most efficient use of PCI bus bandwidth.
Efficiency of the PCI bus for data payload transport is on the order of 50%. Efficiency is defined
as the number of clocks during which data is transferred divided by the total number of clocks,
times 100. The lost performance is due to bus idle time between bus cycles, arbitration time,
time lost in the address phase of a bus cycle, wait states during data phases, delays during
transaction retries (not discussed yet), as well as latencies through PCI bridges.
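As a simple illustration of this definition, the Python sketch below computes efficiency for a hypothetical bus cycle; the clock counts used are made up purely for illustration.

    def pci_bus_efficiency(data_transfer_clocks, total_clocks):
        """Efficiency as defined above: clocks that move data / total clocks x 100."""
        return 100.0 * data_transfer_clocks / total_clocks

    # Hypothetical cycle: 2 idle clocks, 1 arbitration clock, 1 address phase,
    # 8 data phases and 4 wait-state clocks -> 8 useful clocks out of 16.
    print(pci_bus_efficiency(data_transfer_clocks=8, total_clocks=16))   # 50.0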
Data transfer between a PCI device and system memory is accomplished in two ways:
The first less efficient method uses programmed IO transfers as discussed in the previous
section. The PCI device generates an interrupt to inform the CPU that it needs data
transferred. The device interrupt service routine (ISR) causes the CPU to read from the PCI
device into one of its own registers. The ISR then tells the CPU to write from its register to
memory. Similarly, if data is to be moved from memory to the PCI device, the ISR tells the CPU
to read from memory into its own register. The ISR then tells the CPU to write from its register
to the PCI device. It is apparent that the process is very inefficient for two reasons. First, there
are two bus cycles generated by the CPU for every data transfer, one to memory and one to
the PCI device. Second, the CPU is busy transferring data rather than performing its primary
function of executing application code.
The second more efficient method to transfer data is the DMA (direct memory access) method
illustrated by Transaction 2 in Figure 1-5 on page 20, where the PCI device becomes a bus
master. Upon command by a local application (software) running on the PCI peripheral, or by the
PCI peripheral hardware itself, the PCI device may initiate a bus cycle to talk to memory.
PCI bus master device (SCSI device in this example) arbitrates for the PCI bus, wins
ownership of the bus and initiates a PCI memory bus cycle. The North bridge which decodes
the address acts as the target for the transaction. In the data phase of the bus cycle, data is
transferred between the SCSI master and the North bridge target. The bridge in turn generates
a DRAM bus cycle to communicate with system memory. The PCI peripheral generates an
interrupt to inform the system software that the data transfer has completed. This bus master
or DMA method of data transport is more efficient because the CPU is not involved in the data
move and further only one burst bus cycle is generated to move a block of data.
A PCI device that wishes to initiate a bus cycle arbitrates for use of the bus first. The arbiter
implements an arbitration algorithm with which it decides who to grant the bus to next. The
arbiter is able to grant the bus to the next requesting device while a bus cycle is in progress.
This arbitration protocol is referred to as hidden bus arbitration. Hidden bus arbitration allows
for more efficient hand over of the bus from one bus master device to another with only one idle
clock between two bus cycles (referred to as back-to-back bus cycles). PCI protocol does not
provide a standard mechanism by which system software or device drivers can configure the
arbitration algorithm in order to provide for differentiated class of service for various
applications.
When a PCI master initiates a transaction to access a target device and the target device is not
ready, the target signals a transaction retry. This scenario is illustrated in Figure 1-7.
Consider the following example in which the North bridge initiates a memory read transaction to
read data from the Ethernet device. The Ethernet target claims the bus cycle. However, the
Ethernet target does not immediately have the data to return to the North bridge master. The
Ethernet device has two choices by which to delay the data transfer. The first is to insert wait-
states in the data phase. If only a few wait-states are needed, then the data is still transferred
efficiently. If however the target device requires more time (more than 16 clocks from the
beginning of the transaction), then the second option the target has is to signal a retry with a
signal called STOP#. A retry tells the master to end the bus cycle prematurely without
transferring data. Doing so prevents the bus from being held for a long time in wait-states,
which compromises the bus efficiency. The bus master that is retried by the target waits a
minimum of 2 clocks and must once again arbitrate for use of the bus to re-initiate the identical
bus cycle. During the time that the bus master is retried, the arbiter can grant the bus to other
requesting masters so that the PCI bus is more efficiently utilized. By the time the retried
master is granted the bus and it re-initiates the bus cycle, hopefully the target will claim the
cycle and will be ready to transfer data. The bus cycle goes to completion with data transfer.
Otherwise, if the target is still not ready, it retries the master's bus cycle again and the process
is repeated until the master successfully transfers data.
When a PCI master initiates a transaction to access a target device and if the target device is
able to transfer at least one doubleword of data but cannot complete the entire data transfer, it
disconnects the bus cycle at the point at which it cannot continue the data transfer. This
scenario is illustrated in Figure 1-8.
Consider the following example in which the North bridge initiates a burst memory read
transaction to read data from the Ethernet device. The Ethernet target device claims the bus
cycle and transfers some data, but then runs out of data to transfer. The Ethernet device has
two choices to delay the data transfer. The first option is to insert wait-states during the current
data phase while waiting for additional data to arrive. If the target needs to insert only a few
wait-states, then the data is still transferred efficiently. If however the target device requires
more time (the PCI specification allows maximum of 8 clocks in the data phase), then the target
device must signal a disconnect. To do this the target asserts STOP# in the middle of the bus
cycle to tell the master to end the bus cycle prematurely. A disconnect results in some data being
transferred, whereas a retry transfers none. A disconnect frees the bus from long periods of wait states.
The disconnected master waits a minimum of 2 clocks before once again arbitrating for use of
the bus and continuing the bus cycle at the disconnected address. During the time that the bus
master is disconnected, the arbiter may grant the bus to other requesting masters so that the
PCI bus is utilized more efficiently. By the time the disconnected master is granted the bus and
continues the bus cycle, hopefully the target is ready to continue the data transfer until it is
completed. Otherwise, the target once again retries or disconnects the master's bus cycle and
the process is repeated until the master successfully transfers all its data.
Central to the PCI interrupt handling protocol is the interrupt controller shown in Figure 1-9. PCI
devices use one of four interrupt signals (INTA#, INTB#, INTC#, INTD#) to trigger an interrupt
request to the interrupt controller. In turn, the interrupt controller asserts INTR to the CPU. If
the architecture supports an APIC (Advanced Programmable Interrupt Controller) then it sends
an APIC message to the CPU as opposed to asserting the INTR signal. The interrupted CPU
determines the source of the interrupt, saves its state and services the device that generated
the interrupt. Interrupts on PCI INTx# signals are sharable. This allows multiple devices to
generate their interrupts on the same interrupt signal. OS software incurs the overhead of
determining which of the devices sharing the interrupt signal actually generated the interrupt. This is
accomplished by polling the Interrupt Pending bit mapped in a device's memory space. Doing
so incurs additional latency in servicing the interrupting device.
PCI devices are optionally designed to detect address and data phase parity errors during
transactions. Even parity is generated on the PAR signal during each bus cycle's address and
data phases. The device that receives the address or data during a bus cycle uses the parity
signal to determine if a parity error has occurred due to noise on the PCI bus. If a device
detects an address phase parity error, it asserts SERR#. If a device detects a data phase
parity error, it asserts PERR#. The PERR# and SERR# signals are connected to the error logic
(in the South bridge) as shown in Figure 1-10 on page 27. In many systems, the error logic
asserts the NMI signal (non-maskable interrupt signal) to the CPU upon detecting PERR# or
SERR#. This interrupt results in notification of a parity error and the system shuts down (we all
know the blue screen of death). Kind of draconian, don't you agree?
PCI architecture supports 3 address spaces shown in Figure 1-11. These are the memory, IO
and configuration address spaces. The memory address space goes up to 4 GB for systems
that support 32-bit memory addressing and optionally up to 16 EB (exabytes) for systems that
support 64-bit memory addressing. PCI supports up to 4GB of IO address space, however,
many platforms limit IO space to 64 KB because x86 CPUs only support 64 KB of IO address
space. PCI devices are configured to map to a configurable region within either the memory or
IO address space.
Step 1. The CPU generates an IO write to the Address Port at IO address CF8h in the
North bridge. The data written to the Address Port is the configuration register address to
be accessed.
Step 2. The CPU either generates an IO read or IO write to the Data Port at location
CFCh in the North bridge. The North bridge in turn then generates either a configuration
read or configuration write transaction on the PCI bus.
The address for the configuration transaction address phase is obtained from the contents of
the Address register. During the configuration bus cycle, one of the point-to-point IDSEL signals
shown in Figure 1-12 on page 29 is asserted to select the device whose register is being
accessed. That PCI target device claims the configuration cycle and fulfills the request.
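For illustration, the Python sketch below shows how the value written to the Address Port can be formed under the standard Configuration Mechanism #1 layout (enable bit in bit 31, bus number in bits 23:16, device number in bits 15:11, function number in bits 10:8, dword-aligned register number in bits 7:2); the bus and device numbers in the example are arbitrary.

    def config_address(bus, device, function, register):
        """Build the 32-bit value software writes to the Address Port at CF8h."""
        assert bus < 256 and device < 32 and function < 8 and register < 256
        return (1 << 31) | (bus << 16) | (device << 11) | (function << 8) | (register & 0xFC)

    # Example: address the register at offset 0 of bus 0, device 2, function 0,
    # then read or write the Data Port at CFCh to perform the actual access.
    print(hex(config_address(bus=0, device=2, function=0, register=0)))   # 0x80001000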
Figure 1-12. PCI Configuration Cycle Generation
Each PCI device contains up to 256 Bytes of configuration register space. The first 64 bytes
are configuration header registers and the remaining 192 Bytes are device specific registers.
The header registers are configured at boot time by the Boot ROM configuration firmware and
by the OS. The device specific registers are configured by the device's device driver that is
loaded and executed by the OS at boot time.
Software instructions may cause the CPU to generate memory or IO read/write bus cycles.
The North bridge decodes the address of the resulting CPU bus cycles, and if the address
maps to PCI address space, the bridge in turn generates a PCI memory or IO read/write bus
cycle. A target device on the PCI bus claims the cycle and completes the transfer. In summary,
the CPU communicates with any PCI device via the North bridge, which generates PCI memory
or IO bus cycles on behalf of the CPU.
An intelligent PCI device that includes a local processor or bus master state machine (typically
intelligent IO cards) can also initiate PCI memory or IO transactions on the PCI bus. These
masters can communicate directly with any other devices, including system memory associated
with the North bridge.
A device driver executing on the CPU configures the device-specific configuration register space
of an associated PCI device. A configured PCI device that is bus master capable can initiate its
own transactions, which allows it to communicate with any other PCI target device including
system memory associated with the North bridge.
The CPU can access configuration space as described in the previous section.
PCI Express architecture assumes the identical programming model as the PCI programming
model described above. In fact, current OSs written for PCI systems can boot a PCI Express
system. Current PCI device drivers will initialize PCI Express devices without any driver
changes. PCI configuration and enumeration firmware will function unmodified on a PCI Express
system.
As indicated in Table 1-2 on page 13, peak bandwidth achievable on a 64-bit 33 MHz PCI bus
is 266 Mbytes/sec. Current high-end workstation and server applications require greater
bandwidth.
Applications such as gigabit Ethernet and high performance disk transfers in RAID and SCSI
configurations require greater bandwidth capability than the 33 MHz PCI bus offers.
Figure 1-14 shows an example of a later generation Intel PCI chipset. The two shaded devices
are NOT the North bridge and South bridge shown in earlier diagrams. Instead, one device is
the Memory Controller Hub (MCH) and the other is the IO Controller Hub (ICH). The two chips
are connected by a proprietary Intel high throughput, low pin count bus called the Hub Link.
The ICH includes the South bridge functionality but does not support the ISA bus. Other buses
associated with ICH include LPC (low pin count) bus, AC'97, Ethernet, Boot ROM, IDE, USB,
SMbus and finally the PCI bus. The advantage of this architecture over previous architectures is
that the IDE, USB, Ethernet and audio devices do not transfer their data through the PCI bus to
memory as is the case with earlier chipsets. Instead they do so through the Hub Link. Hub Link
is a higher performance bus compared to PCI. In other words, these devices bypass the PCI
bus when communicating with memory. The result is improved performance.
High end systems that require better IO bandwidth implement a 66 MHz, 64-bit PCI bus. This
PCI bus supports a peak data transfer rate of 533 MBytes/sec.
The PCI 2.1 specification released in 1995 added 66MHz PCI support.
Figure 1-15 shows an example of a 66 MHz PCI bus based system. This system has similar
features to that described in Figure 1-14 on page 32. However, the MCH chip in this example
supports two additional Hub Link buses that connect to P64H (PCI 64-bit Hub) bridge chips,
providing access to the 64-bit, 66 MHz buses. These buses each support 1 connector in which
a high-end peripheral card may be installed.
The PCI clock period at 66 MHz is 15 ns. Recall that PCI supports reflected-wave signaling
drivers that are weaker drivers, which have slower rise and fall times as compared to incident-
wave signaling drivers. It is a challenge to design a 66 MHz device or system that satisfies the
signal timing requirements.
A 66 MHz PCI based motherboard is routed with shorter signal traces to ensure shorter signal
propagation delays. In addition, the bus is loaded with fewer loads in order to ensure faster
signal rise and fall times. Taking into account typical board impedances and minimum signal
trace lengths, it is possible to interconnect a maximum of four to five 66 MHz PCI devices. Only
one or two connectors may be connected on a 66 MHz PCI bus. This is a significant limitation
for a system which requires multiple devices interconnected.
The solution requires the addition of PCI bridges and hence multiple buses to interconnect
devices. This solution is expensive and consumes additional board real estate. In addition,
transactions between devices on opposite sides of a bridge complete with greater latency
because bridges implement delayed transactions. This requires bridges to retry all transactions
that must cross to the other side (with the exception of memory writes which are posted).
The maximum frequency achievable with the PCI architecture is 66 MHz. This is a result of the
static clock method of driving and latching signals and because reflected-wave signaling is
used.
PCI bus efficiency is on the order of 50% to 60%. Some of the factors that contribute towards
this reduced efficiency are listed below.
The PCI specification allows master and target devices to insert wait-states during data phases
of a bus cycle. Slow devices will add wait-states which reduces the efficiency of bus cycles.
PCI bus cycles do not indicate transfer size. This makes buffer management within master and
target devices inefficient.
Delayed transactions on PCI are handled inefficiently. When a master is retried, it guesses
when to try again. If the master tries too soon, the target may retry the transaction again. If the
master waits too long to retry, the latency to complete a data transfer is increased. Similarly, if
a target disconnects a transaction the master must guess when to resume the bus cycle at a
later time.
All PCI bus master accesses to system memory result in a snoop access to the CPU cache.
Doing so results in additional wait states during PCI bus master accesses of system memory.
The North bridge or MCH must assume all system memory address space is cachable even
though this may not be the case. PCI bus cycles provide no mechanism by which to indicate an
access to non-cachable memory address space.
PCI architecture observes strict ordering rules as defined by the specification. Even if a PCI
application does not require observation of these strict ordering rules, PCI bus cycles do not
provide a mechanism to allow relaxed ordering rules. Observing relaxed ordering rules allows
bus cycles (especially those that cross a bridge) to complete with reduced latency.
PCI interrupt handling architecture is inefficient especially because multiple devices share a PCI
interrupt signal. Additional software latency is incurred while software discovers which device or
devices that share an interrupt signal actually generated the interrupt.
The processor's NMI interrupt input is asserted when a PCI parity or system error is detected.
Ultimately the system shuts down when an error is detected. This is a severe response. A more
appropriate response might be to detect the error and attempt error recovery. PCI does not
require error recovery features, nor does it support an extensive register set for documenting a
variety of detectable errors.
These limitations above have been resolved in the next generation bus architectures, namely
PCI-X and PCI Express.
Figure 1-16 on page 36 is an example of an Intel 7500 server chipset based system. This
chipset has similarities to the 8XX chipset described earlier. MCH and ICH chips are connected
via a Hub Link 1.0 bus. Associated with ICH is a 32-bit 33 MHz PCI bus. The 7500 MCH chip
includes 3 additional high performance Hub Link 2.0 ports. These Hub Link ports are connected
to 3 Hub Link-to-PCI-X Hub 2 bridges (P64H2). Each P64H2 bridge supports 2 PCI-X buses
that can run at frequencies up to 133MHz. Hub Link 2.0 Links can sustain the higher bandwidth
requirements for PCI-X traffic that targets system memory.
The PCI-X bus is a higher frequency, higher performance, higher efficiency bus compared to
the PCI bus.
PCI-X devices can be plugged into PCI slots and vice-versa. PCI-X and PCI slots employ the
same connector format. Thus, PCI-X is 100% backwards compatible to PCI from both a
hardware and software standpoint. The device drivers, OS, and applications that run on a PCI
system also run on a PCI-X system.
PCI-X signals are registered. A registered signal requires smaller setup time to sample the
signal as compared with a non-registered signal employed in PCI. Also, PCI-X devices employ
PLLs that are used to pre-drive signals with smaller clock-to-out time. The time gained from
reduced setup time and clock-to-out time is used towards increased clock frequency capability
and the ability to support more devices on the bus at a given frequency compared to PCI. PCI-
X supports 8-10 loads or 4 connectors at 66 MHz and 3-4 loads or 1-2 connectors at 133 MHz.
The peak bandwidth achievable with 64-bit 133 MHz PCI-X is 1064 MBytes/sec.
Following the first data phase, the PCI-X bus does not allow wait states during subsequent
data phases.
Most PCI-X bus cycles are burst cycles and data is generally transferred in blocks of no less
than 128 Bytes. This results in higher bus utilization. Further, the transfer size is specified in the
attribute phase of PCI-X transactions. This allows for more efficient device buffer management.
Figure 1-17 is an example of a PCI-X burst memory read transaction.
Figure 1-17. Example PCI-X Burst Memory Read Bus Cycle
Consider an example of the split transaction protocol supported by PCI-X for delaying
transactions. This protocol is illustrated in Figure 1-18. A requester initiates a read transaction.
The completer that claims the bus cycle may be unable to return the requested data
immediately. Rather than signaling a retry as would be the case in PCI protocol, the completer
memorizes the transaction (address, transaction type, byte count, requester ID are memorized)
and signals a split response. This prompts the requester to end the bus cycle, and the bus
goes idle. The PCI-X bus is now available for other transactions, resulting in more efficient bus
utilization. Meanwhile, the requester simply waits for the completer to supply it the requested
data at a later time. Once the completer has gathered the requested data, it then arbitrates
and obtains bus ownership and initiates a split completion bus cycle during which it returns the
requested data. The requester claims the split completion bus cycle and accepts the data from
the completer.
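The exchange can be caricatured in a few lines of Python; this is a conceptual model of the request, split response, and split completion sequence, not a cycle-accurate one, and the class and method names are invented for illustration.

    class Requester:
        def split_completion(self, address, data):
            # The requester claims the Split Completion bus cycle and accepts the data.
            print(f"Split Completion for address {address:#x}: {data!r}")

    class Completer:
        def __init__(self):
            self.pending = []

        def memory_read(self, requester, address, byte_count):
            # Data not immediately available: memorize the request (requester,
            # address, byte count) and signal a Split Response so the bus can go
            # idle and be granted to other masters.
            self.pending.append((requester, address, byte_count))
            return "split response"

        def data_ready(self, data):
            # Later, the completer arbitrates for the bus itself and initiates a
            # Split Completion bus cycle to return the requested data.
            requester, address, _byte_count = self.pending.pop(0)
            requester.split_completion(address, data)

    completer = Completer()
    print(completer.memory_read(Requester(), 0x1000, 128))   # "split response"
    completer.data_ready(b"requested data")                  # Split Completion follows later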
PCI-X devices must support Message Signaled Interrupt (MSI) architecture, which is a more
efficient architecture than the legacy interrupt architecture described in the PCI architecture
section. To generate an interrupt request, a PCI-X device initiates a memory write transaction
targeting the Host (North) bridge. The data written is a unique interrupt vector associated with
the device generating the interrupt. The Host bridge interrupts the CPU and the vector is
delivered to the CPU in a platform specific manner. With this vector, the CPU is immediately
able to run an interrupt service routine to service the interrupting device. There is no software
overhead in determining which device generated the interrupt. Also, unlike in the PCI
architecture, no interrupt pins are required.
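Conceptually, MSI generation reduces to an ordinary memory write of a software-assigned vector to a software-assigned address. The Python sketch below models only that idea; the register layout of the real MSI capability structure is not shown, and the address and data values used are hypothetical.

    class MsiCapability:
        """Holds the message address and data the OS programs into the device."""
        def __init__(self, message_address, message_data):
            self.message_address = message_address   # where the write is targeted
            self.message_data = message_data         # the interrupt vector

    def signal_msi(msi, memory_write):
        # Instead of asserting an interrupt pin, the device performs a memory
        # write of its vector to the programmed address.
        memory_write(msi.message_address, msi.message_data)

    msi = MsiCapability(message_address=0xFEE00000, message_data=0x41)  # hypothetical values
    signal_msi(msi, lambda addr, data: print(f"MemWr addr={addr:#x} data={data:#x}"))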
PCI Express architecture implements the MSI protocol, resulting in reduced interrupt servicing
latency and elimination of interrupt signals.
PCI Express architecture also supports the RO bit and NS bit feature with the result that those
transactions with either NS=1 or RO=1 complete with better performance than transactions
with NS=0 or RO=0. PCI transactions by definition assume NS=0 and RO=0.
NS No Snoop (NS) may be used when accessing system memory. PCI-X bus masters can
use the NS bit to indicate whether the region of memory being accessed is cachable
(NS=0) or not (NS=1). For those transactions with NS=1, the Host bridge does not snoop
the processor cache. The result is improved performance during accesses to non-cachable
memory.
RO Relaxed Ordering (RO) allows transactions that do not have any order of completion
requirements to complete more efficiently. We will not get into the details here. Suffice it to
say that transactions with the RO bit set can complete on the bus in any order with respect
to other transactions that are pending completion.
The PCI-X 2.0 specification released in Q1 2002 was designed to further increase the
bandwidth capability of PCI-X bus. This bus is described next.
A design requiring greater than 1 GByte/sec bus bandwidth can implement the DDR or QDR
protocol. As indicated in Table 1-2 on page 13, PCI-X 2.0 peak bandwidth capability is 4256
MBytes/sec for a 64-bit 533 MHz effective PCI-X bus. With the aid of a strobe clock, data is
transferred two times or four times per 133 MHz clock.
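The 4256 MBytes/sec figure follows directly from the bus width and the transfer rate; the short Python calculation below shows the arithmetic for both the DDR and QDR cases.

    # PCI-X 2.0 peak bandwidth: 64-bit (8-byte) bus, 133 MHz clock, data moved
    # two (DDR) or four (QDR) times per clock with the aid of the strobe clock.
    clock_mhz = 133
    bus_width_bytes = 8
    for transfers_per_clock, name in ((2, "PCI-X 266 (DDR)"), (4, "PCI-X 533 (QDR)")):
        print(f"{name}: {clock_mhz * bus_width_bytes * transfers_per_clock} MBytes/sec")
    # -> 2128 MBytes/sec and 4256 MBytes/sec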
PCI-X 2.0 devices also support ECC generation and checking. This allows auto-correction of
single bit errors and detection and reporting of multi-bit errors. Error handling is more robust
than PCI and PCI-X 1.0 systems making this bus more suited for high-performance, robust,
non-stop server applications.
A noteworthy point to remember is that, with very fast signal timing, it is only possible to
support one connector on the PCI-X 2.0 bus. This implies that a PCI-X 2.0 bus essentially
becomes a point-to-point connection with no multi-drop capability as with its predecessor
buses.
PCI-X 2.0 bridges are essentially switches with one primary bus and one or more downstream
secondary buses as shown in Figure 1-19 on page 40.
The PCI Express Way
PCI Express provides a high-speed, high-performance, point-to-point, dual simplex, differential
signaling Link for interconnecting devices. Data is transmitted from a device on one set of
signals, and received on another set of signals.
As shown in Figure 1-20, a PCI Express interconnect consists of either a x1, x2, x4, x8, x12,
x16 or x32 point-to-point Link. A PCI Express Link is the physical connection between two
devices. A Lane consists of a signal pair in each direction. A x1 Link consists of 1 Lane or 1
differential signal pair in each direction for a total of 4 signals. A x32 Link consists of 32 Lanes
or 32 signal pairs for each direction for a total of 128 signals. The Link supports a symmetric
number of Lanes in each direction. During hardware initialization, the Link is initialized for Link
width and frequency of operation automatically by the devices on opposite ends of the Link. No
OS or firmware is involved during Link level initialization.
Differential Signaling
PCI Express devices employ differential drivers and receivers at each port. Figure 1-21 shows
the electrical characteristics of a PCI Express signal. A positive voltage difference between the
D+ and D- terminals implies Logical 1. A negative voltage difference between D+ and D- implies
a Logical 0. No voltage difference between D+ and D- means that the driver is in the high-
impedance tristate condition, which is referred to as the electrical-idle and low-power state of
the Link.
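The signaling rules above can be summarized in a few lines of pseudo-decode; the voltage threshold used here to detect electrical idle is an arbitrary illustrative value, not a number from the specification.

    def decode_differential(d_plus_volts, d_minus_volts, idle_threshold=0.05):
        """Interpret a differential pair per the description above (illustrative only)."""
        diff = d_plus_volts - d_minus_volts
        if abs(diff) < idle_threshold:
            return "electrical idle"      # driver tristated; low-power Link state
        return 1 if diff > 0 else 0

    print(decode_differential(0.25, -0.25))   # 1
    print(decode_differential(-0.25, 0.25))   # 0
    print(decode_differential(0.0, 0.0))      # electrical idle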
Rather than bus cycles we are familiar with from PCI and PCI-X architectures, PCI Express
encodes transactions using a packet based protocol. Packets are transmitted and received
serially and byte striped across the available Lanes of the Link. The more Lanes implemented
on a Link the faster a packet is transmitted and the greater the bandwidth of the Link. The
packets are used to support the split transaction protocol for non-posted transactions. Various
types of packets such as memory read and write requests, IO read and write requests,
configuration read and write requests, message requests and completions are defined.
As is apparent from Table 1-3 on page 14, the aggregate bandwidth achievable with PCI
Express is significantly higher than any bus available today. The PCI Express 1.0 specification
supports 2.5 Gbits/sec/lane/direction transfer rate.
No clock signal exists on the Link. Each packet to be transmitted over the Link consists of bytes
of information. Each byte is encoded into a 10-bit symbol. All symbols are guaranteed to have
one-zero transitions. The receiver uses a PLL to recover a clock from the 0-to-1 and 1-to-0
transitions of the incoming bit stream.
Address Space
PCI Express supports the same address spaces as PCI: memory, IO and configuration
address spaces. In addition, the maximum configuration address space per device function is
extended from 256 Bytes to 4 KBytes. New OS, drivers and applications are required to take
advantage of this additional configuration address space. Also, a new messaging transaction
and address space provides messaging capability between devices. Some messages are PCI
Express standard messages used for error reporting, interrupt and power management
messaging. Other messages are vendor defined messages.
PCI Express supports the same transaction types supported by PCI and PCI-X. These include
memory read and memory write, I/O read and I/O write, configuration read and configuration
write. In addition, PCI Express supports a new transaction type called Message transactions.
These transactions are encoded using the packet-based PCI Express protocol described later.
PCI Express transactions can be divided into two categories. Those transactions that are non-
posted and those that are posted. Non-posted transactions, such as memory reads, implement
a split transaction communication model similar to the PCI-X split transaction protocol. For
example, a requester device transmits a non-posted type memory read request packet to a
completer. The completer returns a completion packet with the read data to the requester.
Posted transactions, such as memory writes, consist of a memory write packet transmitted uni-
directionally from requester to completer with no completion packet returned from completer to
requester.
CRC fields are embedded within each packet transmitted. One of the CRC fields supports a
Link-level error checking protocol whereby each receiver of a packet checks for Link-level CRC
errors. Packets transmitted over the Link in error are recognized with a CRC error at the
receiver. The transmitter of the packet is notified of the error by the receiver. The transmitter
automatically retries sending the packet (with no software involvement), hopefully resulting in
auto-correction of the error.
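A rough Python sketch of this hop-by-hop retry idea is shown below. The real mechanism uses the LCRC field and the ACK/NAK DLLP protocol defined by the specification; this sketch substitutes a generic CRC-32 and a simple boolean acknowledge purely to illustrate the retransmit-until-accepted behavior.

    import zlib

    def send_with_retry(payload, transmit, max_attempts=4):
        """Append a CRC and retransmit until the receiver accepts the packet."""
        packet = payload + zlib.crc32(payload).to_bytes(4, "little")
        for _ in range(max_attempts):
            if transmit(packet):          # True models receipt of an ACK
                return True
        return False                      # give up; real hardware would report an error

    def receiver(packet):
        payload, crc = packet[:-4], int.from_bytes(packet[-4:], "little")
        return zlib.crc32(payload) == crc # ACK only if the CRC checks out

    print(send_with_retry(b"TLP payload", receiver))   # True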
In addition, an optional CRC field within a packet allows for end-to-end data integrity checking
required for high availability applications.
Error handling on PCI Express can be as rudimentary as PCI level error handling described
earlier or can be robust enough for server-level requirements. A rich set of error logging
registers and error reporting mechanisms provide for improved fault isolation and recovery
solutions required by RAS (Reliable, Available, Serviceable) applications.
The Quality of Service feature of PCI Express refers to the capability of routing packets from
different applications through the fabric with differentiated priorities and deterministic latencies
and bandwidth. For example, it may be desirable to ensure that Isochronous applications, such
as video data packets, move through the fabric with higher priority and guaranteed bandwidth,
while control data packets may not have specific bandwidth or latency requirements.
PCI Express packets contain a Traffic Class (TC) number between 0 and 7 that is assigned by
the device's application or device driver. Packets with different TCs can move through the fabric
with different priority, resulting in varying performances. These packets are routed through the
fabric by utilizing virtual channel (VC) buffers implemented in switches, endpoints and root
complex devices.
Each Traffic Class is individually mapped to a Virtual Channel (a VC can have several TCs
mapped to it, but a TC cannot be mapped to multiple VCs). The TC in each packet is used by
the transmitting and receiving ports to determine which VC buffer to drop the packet into.
Switches and devices are configured to arbitrate and prioritize between packets from different
VCs before forwarding. This arbitration is referred to as VC arbitration. In addition, packets
arriving at different ingress ports are forwarded to their own VC buffers at the egress port.
These transactions are prioritized based on the ingress port number when being merged into a
common VC output buffer for delivery across the egress link. This arbitration is referred to as
Port arbitration.
The result is that packets with different TC numbers could observe different performance when
routed through the PCI Express fabric.
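A TC-to-VC map is simply a function from the eight Traffic Classes onto the VCs a port implements: several TCs may share a VC, but no TC may map to more than one VC. The mapping in the Python sketch below is hypothetical, for a port implementing four VCs.

    # Hypothetical TC-to-VC map for a port with four VCs. Because it is a plain
    # dictionary, each TC necessarily maps to exactly one VC, while a VC may
    # serve several TCs.
    TC_TO_VC = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}

    def vc_buffer_for(packet_tc):
        """Select the VC buffer a packet is placed into, based on its TC field."""
        return TC_TO_VC[packet_tc]

    assert vc_buffer_for(5) == 2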
Flow Control
A packet transmitted by a device is received into a VC buffer in the receiver at the opposite end
of the Link. The receiver periodically updates the transmitter with information regarding the
amount of buffer space it has available. The transmitter device will only transmit a packet to the
receiver if it knows that the receiving device has sufficient buffer space to hold the next
transaction. The protocol by which the transmitter ensures that the receiving buffer has
sufficient space available is referred to as flow control. The flow control mechanism guarantees
that a transmitted packet will be accepted by the receiver, barring error conditions. As such, the
PCI Express transaction protocol does not require support of packet retry (unless an error
condition is detected in the receiver), thereby improving the efficiency with which packets are
forwarded to a receiver via the Link.
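A minimal Python sketch of the credit idea is shown below. Real PCI Express flow control tracks credits separately per VC and per packet type (headers versus data); this model collapses all of that into a single credit pool purely to illustrate the rule that a packet is transmitted only when the receiver has advertised room for it.

    class FlowControlledTransmitter:
        def __init__(self, advertised_credits):
            self.credits = advertised_credits      # learned during flow control initialization

        def try_transmit(self, packet_credits):
            if packet_credits > self.credits:
                return False                       # hold the packet; receiver buffer is full
            self.credits -= packet_credits         # consume buffer space at the receiver
            return True

        def on_credit_update(self, returned_credits):
            self.credits += returned_credits       # receiver drained its VC buffer

    tx = FlowControlledTransmitter(advertised_credits=4)
    print(tx.try_transmit(3), tx.try_transmit(2))  # True False
    tx.on_credit_update(2)
    print(tx.try_transmit(2))                      # True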
MSI Style Interrupt Handling Similar to PCI-X
Interrupt handling is accomplished in-band via a PCI-X-like MSI protocol. PCI Express devices use
a memory write packet to transmit an interrupt vector to the root complex host bridge device,
which in-turn interrupts the CPU. PCI Express devices are required to implement the MSI
capability register block. PCI Express also supports legacy interrupt handling in-band by
encoding interrupt signal transitions (for INTA#, INTB#, INTC# and INTD#) using Message
transactions. Only endpoint devices that must support legacy functions and PCI Express-to-PCI
bridges are allowed to support legacy interrupt generation.
Power Management
The PCI Express fabric consumes less power because the interconnect consists of fewer
signals that have smaller signal swings. Each device's power state is individually managed.
PCI/PCI Express power management software determines the power management capability
of each device and manages it individually in a manner similar to PCI. Devices can notify
software of their current power state, and power management software can propagate a
wake-up event through the fabric to power up a device or group of devices. Devices can also
signal a wake-up event using an in-band mechanism or a side-band signal.
With no software involvement, devices place a Link into a power savings state after a time-out
when they recognize that there are no packets to transmit over the Link. This capability is
referred to as Active State power management.
PCI Express supports device power states: D0, D1, D2, D3-Hot and D3-Cold, where D0 is the
full-on power state and D3-Cold is the lowest power state.
PCI Express also supports the following Link power states: L0, L0s, L1, L2 and L3, where L0
is the full-on Link state and L3 is the Link-Off power state.
PCI Express supports hot plug and surprise hot unplug without usage of sideband signals. Hot
plug interrupt messages, communicated in-band to the root complex, trigger hot plug software
to detect a hot plug or removal event. Rather than implementing a centralized hot plug controller
as exists in PCI platforms, the hot plug controller function is distributed to the port logic
associated with a hot plug capable port of a switch or root complex. Two colored LEDs, a
Manually-operated Retention Latch (MRL), MRL sensor, attention button, power control signal
and PRSNT2# signal are some of the elements of a hot plug capable port.
PCI Compatible Software Model
PCI Express employs the same programming model as PCI and PCI-X systems described
earlier in this chapter. The memory and IO address space remains the same as PCI/PCI-X.
The first 256 Bytes of configuration space per PCI Express function is the same as PCI/PCI-X
device configuration address space, thus ensuring that current OSs and device drivers will run
on a PCI Express system. PCI Express architecture extends the configuration address space
to 4 KB per functional device. Updated OSs and device drivers are required to take advantage
of and access this additional configuration address space.
1. PCI compatible configuration model which is 100% compatible with existing OSs and
bus enumeration and configuration software for PCI/PCI-X systems.
PCI Express architecture supports multiple platform interconnects such as chip-to-chip, board-
to-peripheral card via PCI-like connectors and Mini PCI Express form factors for the mobile
market. Specifications for these are fully defined. See "Add-in Cards and Connectors" on page
685 for details on PCI Express peripheral card and connector definition.
Currently, x1, x4, x8 and x16 PCI-like connectors are defined along with associated peripheral
cards. Desktop computers implementing PCI Express can have the same look and feel as
current computers with no changes required to existing system form factors. PCI Express
motherboards can have an ATX-like motherboard form factor.
The Mini PCI Express connector and add-in card implement a subset of the signals that exist on a
standard PCI Express connector and add-in card. The form factor, as the name implies, is
much smaller. This form factor targets the mobile computing market. The Mini PCI Express slot
supports x1 PCI Express signals including power management signals. In addition, the slot
supports LED control signals, a USB interface and an SMBus interface. The Mini PCI Express
module is similar but smaller than a PC Card.
Mechanical Form Factors Pending Release
As of May 2003, specifications for two new form factors have not been released. Below is a
summary of publicly available information about these form factors.
Another new module form factor that will service both mobile and desktop markets is the
NEWCARD form factor. This is a PCMCIA PC Card style form factor, but nearly half the size,
that will support x1 PCI Express signals including power management signals. In addition, the
slot supports USB and SMBus interfaces. There are two size form factors defined, a narrower
version and a wider version though the thickness and depth remain the same. Although similar
in appearance to Mini PCI Express Module, this is a different form factor.
Also pending is a family of modules that targets the workstation and server market. These
modules are designed with future support for wider PCI Express Links and for bit rates beyond
the 2.5 Gbits/s Generation 1 transmission rate. Four form factors are under consideration:
base-height single- and double-width modules, and full-height single- and double-width
modules.
Major components in the PCI Express system shown in Figure 1-22 include a root complex,
switches, and endpoint devices.
The root complex implements central resources such as a hot plug controller, power management
controller, interrupt controller, and error detection and reporting logic. The root complex initializes
with a bus number, device number and function number, which are used to form a requester ID
or completer ID. The root complex bus, device and function numbers initialize to all 0s.
A Hierarchy is a fabric of all the devices and Links associated with a root complex that are
either directly connected to the root complex via its port(s) or indirectly connected via switches
and bridges. In Figure 1-22 on page 48, the entire PCI Express fabric associated with the root
is one hierarchy.
A Hierarchy Domain is a fabric of devices and Links that are associated with one port of the
root complex. For example in Figure 1-22 on page 48, there are 3 hierarchy domains.
Endpoints are devices other than root complex and switches that are requesters or
completers of PCI Express transactions. They are peripheral devices such as Ethernet, USB or
graphics devices. Endpoints initiate transactions as a requester or respond to transactions as a
completer. Two types of endpoints exist, PCI Express endpoints and legacy endpoints. Legacy
Endpoints may support IO transactions. They may support locked transaction semantics as a
completer but not as a requester. Interrupt capable legacy devices may support legacy style
interrupt generation using message requests but must in addition support MSI generation using
memory write transactions. Legacy devices are not required to support 64-bit memory
addressing capability. PCI Express Endpoints must not support IO or locked transaction
semantics and must support MSI style interrupt generation. PCI Express endpoints must
support 64-bit memory addressing capability in prefetchable memory address space, though
their non-prefetchable memory address space is permitted to be mapped below the 4 GByte boundary.
Both types of endpoints implement Type 0 PCI configuration headers and respond to
configuration transactions as completers. Each endpoint is initialized with a device ID
(requester ID or completer ID) which consists of a bus number, device number, and function
number. Endpoints are always device 0 on a bus.
Multi-Function Endpoints. Like PCI devices, PCI Express devices may support up to 8
functions per endpoint, with at least function number 0 implemented. However, a PCI Express Link
supports only one endpoint, numbered device 0.
PCI Express-to-PCI(-X) Bridge is a bridge between PCI Express fabric and a PCI or PCI-X
hierarchy.
A Requester is a device that originates a transaction in the PCI Express fabric. Root complex
and endpoints are requester type devices.
A Port is the interface between a PCI Express component and the Link. It consists of
differential transmitters and receivers. An Upstream Port is a port that points in the direction of
the root complex. A Downstream Port is a port that points away from the root complex. An
endpoint port is an upstream port. Root complex ports are downstream ports. An Ingress
Port is a port that receives a packet. An Egress Port is a port that transmits a packet.
A Switch can be thought of as consisting of two or more logical PCI-to-PCI bridges, each
bridge associated with a switch port. Each bridge implements Type 1 configuration header registers.
Configuration and enumeration software detects and initializes these header registers
at boot time. The 4-port switch shown in Figure 1-22 on page 48 consists of 4 virtual bridges.
These bridges are internally connected via a non-defined bus. One port of a switch pointing in
the direction of the root complex is an upstream port. All other ports pointing away from the
root complex are downstream ports.
Standard PCI Plug and Play enumeration software can enumerate a PCI Express system. The
Links are numbered in a manner similar to the PCI depth first search enumeration algorithm. An
example of the bus numbering is shown in Figure 1-22 on page 48. Each PCI Express Link is
equivalent to a logical PCI bus. In other words, each Link is assigned a bus number by the bus
enumerating software. A PCI Express endpoint is device 0 on a PCI Express Link of a given
bus number. Only one device (device 0) exists per PCI Express Link. The internal bus within a
switch that connects all the virtual bridges together is also numbered. The first Link associated
with the root complex is numbered bus 1. Bus 0 is an internal virtual bus within the root complex.
Buses downstream of a PCI Express-to-PCI(-X) bridge are enumerated the same way as in a
PCI(-X) system.
Endpoints and PCI(-X) devices may implement up to 8 functions per device. Only 1 device is
supported per PCI Express Link though PCI(-X) buses may theoretically support up to 32
devices per bus. A system could theoretically include up to 256 PCI Express Links and PCI(-X)
buses.
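To make the depth-first numbering concrete, consider the following C sketch. It is purely illustrative: the tree structure, field names and the assumption of exactly one device per Link are invented for the example and do not represent any particular BIOS or OS enumeration code.

    #include <stdio.h>

    /* Hypothetical tree node: a virtual PCI-PCI bridge (root port or switch
     * port) or an endpoint. Field names are invented for this example.       */
    struct node {
        int is_bridge;            /* 1 = virtual bridge, 0 = endpoint           */
        int secondary;            /* bus number assigned to the bus it creates  */
        int subordinate;          /* highest bus number reachable below it      */
        int child_count;
        struct node *children[4];
    };

    /* Assign bus numbers depth-first, the way PCI-style enumeration software
     * does. 'bus' is the bus this node lives on; returns the highest bus
     * number found at or below this node.                                     */
    static int enumerate(struct node *n, int bus, int *next_bus)
    {
        int highest = bus;
        if (!n->is_bridge)
            return highest;              /* an endpoint is device 0 on 'bus'    */
        n->secondary = (*next_bus)++;    /* bus (Link) directly below the bridge */
        if (n->secondary > highest)
            highest = n->secondary;
        for (int i = 0; i < n->child_count; i++) {
            int h = enumerate(n->children[i], n->secondary, next_bus);
            if (h > highest)
                highest = h;
        }
        n->subordinate = highest;        /* everything below is now numbered    */
        return highest;
    }

    int main(void)
    {
        /* Root port -> switch upstream bridge -> two downstream bridges,
         * each with one endpoint (one device per Link).                        */
        struct node ep1 = {0}, ep2 = {0};
        struct node dn1 = {1, 0, 0, 1, {&ep1}};
        struct node dn2 = {1, 0, 0, 1, {&ep2}};
        struct node sw_up = {1, 0, 0, 2, {&dn1, &dn2}};
        struct node root_port = {1, 0, 0, 1, {&sw_up}};

        int next_bus = 1;                /* bus 0 is internal to the root complex */
        enumerate(&root_port, 0, &next_bus);
        printf("root port: secondary=%d subordinate=%d\n",
               root_port.secondary, root_port.subordinate);
        printf("switch upstream bridge: secondary=%d subordinate=%d\n",
               sw_up.secondary, sw_up.subordinate);
        return 0;
    }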
Figure 1-23 on page 52 is a block diagram of a low cost PCI Express based system. As of the
writing of this book (April 2003) no real life PCI Express chipset architecture designs were
publicly disclosed. The author describes here a practical low cost PCI Express chipset whose
architecture is based on existing non-PCI Express chipset architectures. In this solution, AGP
which connects MCH to a graphics controller in earlier MCH designs (see Figure 1-14 on page
32) is replaced with a PCI Express Link. The Hub Link that connects MCH to ICH is replaced
with a PCI Express Link. And in addition to a PCI bus associated with ICH, the ICH chip
supports 4 PCI Express Links. Some of these Links can connect directly to devices on the
motherboard and some can be routed to connectors where peripheral cards are installed.
The CPU can communicate with PCI Express devices associated with ICH as well as the PCI
Express graphics controller. PCI Express devices can communicate with system memory or the
graphics controller associated with MCH. PCI devices may also communicate with PCI Express
devices and vice versa. In other words, the chipset supports peer-to-peer packet routing
between PCI Express endpoints and PCI devices, memory and graphics. It is yet to be
determined whether first-generation PCI Express chipsets will support peer-to-peer packet routing
between PCI Express endpoints. Remember that the specification does not require the root
complex to support peer-to-peer packet routing between the multiple Links associated with the
root complex.
This design does not require the use of switches if the number of PCI Express devices to be
connected does not exceed the number of Links available in this design.
Figure 1-24 on page 53 is a block diagram of another low cost PCI Express system. In this
design, the Hub Link connects the root complex to an ICH device. The ICH device may be an
existing design which has no PCI Express Link associated with it. Instead, all PCI Express
Links are associated with the root complex. One of these Links connects to a graphics
controller. The other Links directly connect to PCI Express endpoints on the motherboard or
connect to PCI Express endpoints on peripheral cards inserted in slots.
Figure 1-25 shows a more complex system requiring a large number of devices connected
together. Multi-port switches are a necessary design feature to accomplish this. To support PCI
or PCI-X buses, a PCI Express-to-PCI(-X) bridge is connected to one switch port. PCI Express
packets can be routed from any device to any other device because switches support peer-to-peer
packet routing (only multi-port root complex devices are not required to support peer-to-peer
functionality).
PCI Express Specifications
As of the writing of this book (May 2003), several PCI Express specifications had already been
released by the PCI-SIG.
As of May 2003, the specifications pending release are: the PCI Express-to-PCI Bridge
specification, Server IO Module specification, Cable specification, Backplane specification,
updated Mini PCI Express specification, and NEWCARD specification.
Chapter 2. Architecture Overview
Previous Chapter
This Chapter
Hot Plug
Table 2-1 (excerpt). Transaction types:
IO Read: Non-Posted
IO Write: Non-Posted
Message: Posted
Table 2-2 lists all of the TLP request and TLP completion packets. These packets are used in
the transactions referenced in Table 2-1. Our goal in this section is to describe how these
packets are used to complete transactions at a system level and not to describe the packet
routing through the PCI Express fabric nor to describe packet contents in any detail.
Table 2-2 (excerpt). TLP packet types and their abbreviations:
IO Read: IORd
IO Write: IOWr
Completion without Data (associated with Locked Memory Read Requests): CplLk
Completion with Data (associated with Locked Memory Read Requests): CplDLk
Non-Posted Read Transactions
Figure 2-1 shows the packets transmitted by a requester and completer to complete a non-
posted read transaction. To complete this transfer, a requester transmits a non-posted read
request TLP to a completer it intends to read data from. Non-posted read request TLPs include
memory read request (MRd), IO read request (IORd), and configuration read request type 0 or
type 1 (CfgRd0, CfgRd1) TLPs. Requesters may be root complex or endpoint devices
(endpoints do not initiate configuration read/write requests however).
The request TLP is routed through the fabric of switches using information in the header portion
of the TLP. The packet makes its way to a targeted completer. The completer can be a root
complex, a switch, a bridge or an endpoint.
When the completer receives the packet and decodes its contents, it gathers the amount of
data specified in the request from the targeted address. The completer creates a single
completion TLP or multiple completion TLPs with data (CplD) and sends it back to the
requester. The completer can return up to 4 KBytes of data per CplD packet.
The completion packet contains routing information necessary to route the packet back to the
requester. This completion packet travels through the same path and hierarchy of switches as
the request packet.
The requester uses the tag field in the completion to associate it with a request TLP of the same
tag value transmitted earlier. Use of a tag in the request and completion TLPs allows a
requester to manage multiple outstanding transactions.
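The following C sketch illustrates, in software terms, how a requester might use tags to track multiple outstanding non-posted requests. The table size and field names are invented for the example; real devices implement this tracking in hardware.

    #include <stdio.h>

    #define MAX_TAGS 32                      /* illustrative table size         */

    struct outstanding {
        int in_use;
        unsigned long long address;          /* address in the original request */
        unsigned length_dw;                  /* amount of data requested, in DW */
    };

    static struct outstanding table[MAX_TAGS];

    /* Allocate a tag for a new non-posted request; returns -1 if none free.   */
    static int send_request(unsigned long long addr, unsigned len_dw)
    {
        for (int tag = 0; tag < MAX_TAGS; tag++) {
            if (!table[tag].in_use) {
                table[tag] = (struct outstanding){1, addr, len_dw};
                return tag;                  /* tag travels in the request TLP  */
            }
        }
        return -1;
    }

    /* A completion arrives carrying the tag the requester transmitted earlier. */
    static void receive_completion(int tag)
    {
        if (tag < 0 || tag >= MAX_TAGS || !table[tag].in_use) {
            printf("unexpected completion, tag %d\n", tag);
            return;
        }
        printf("completion for request to 0x%llx (%u DW), tag %d\n",
               table[tag].address, table[tag].length_dw, tag);
        table[tag].in_use = 0;               /* tag may now be reused           */
    }

    int main(void)
    {
        int t0 = send_request(0x80000000ULL, 16);
        int t1 = send_request(0x90000000ULL, 4);
        receive_completion(t1);              /* completions may arrive out of order */
        receive_completion(t0);
        return 0;
    }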
If a completer is unable to obtain requested data as a result of an error, it returns a completion
packet without data (Cpl) and an error status indication. The requester determines how to
handle the error at the software layer.
Figure 2-2 on page 60 shows packets transmitted by a requester and completer to complete a
non-posted locked read transaction. To complete this transfer, a requester transmits a memory
read locked request (MRdLk) TLP. The requester can only be a root complex which initiates a
locked request on the behalf of the CPU. Endpoints are not allowed to initiate locked requests.
The locked memory read request TLP is routed downstream through the fabric of switches
using information in the header portion of the TLP. The packet makes its way to a targeted
completer. The completer can only be a legacy endpoint. The entire path from root complex to
the endpoint (for TCs that map to VC0) is locked including the ingress and egress port of
switches in the pathway.
When the completer receives the packet and decodes its contents, it gathers the amount of
data specified in the request from the targeted address. The completer creates one or more
locked completion TLPs with data (CplDLk) along with a completion status. The completion is
sent back to the root complex requester via the path and hierarchy of switches as the original
request.
The CplDLk packet contains routing information necessary to route the packet back to the
requester. The requester uses the tag field in the completion to associate it with a request TLP of
the same tag value it transmitted earlier. Use of a tag in the request and completion TLPs
allows a requester to manage multiple outstanding transactions.
If the completer is unable to obtain the requested data as a result of an error, it returns a
completion packet without data (CplLk) and an error status indication within the packet. The
requester who receives the error notification via the CplLk TLP must assume that atomicity of
the lock is no longer guaranteed and thus determine how to handle the error at the software
layer.
The path from requester to completer remains locked until the requester at a later time
transmits an unlock message to the completer. The path and ingress/egress ports of a switch
that the unlock message passes through are unlocked.
Figure 2-3 on page 61 shows the packets transmitted by a requester and completer to
complete a non-posted write transaction. To complete this transfer, a requester transmits a
non-posted write request TLP to a completer it intends to write data to. Non-posted write
request TLPs include IO write request (IOWr), configuration write request type 0 or type 1
(CfgWr0, CfgWr1) TLPs. Memory write request and message requests are posted requests.
Requesters may be a root complex or endpoint device (though not for configuration write
requests).
A request packet with data is routed through the fabric of switches using information in the
header of the packet. The packet makes its way to a completer.
When the completer receives the packet and decodes its contents, it accepts the data. The
completer creates a single completion packet without data (Cpl) to confirm reception of the
write request. This is the purpose of the completion.
The completion packet contains routing information necessary to route the packet back to the
requester. This completion packet will propagate through the same hierarchy of switches that
the request packet went through before making its way back to the requester. The requester
gets confirmation notification that the write request did make its way successfully to the
completer.
If the completer is unable to successfully write the data in the request to the final destination or
if the write request packet reaches the completer in error, then it returns a completion packet
without data (Cpl) but with an error status indication. The requester who receives the error
notification via the Cpl TLP determines how to handle the error at the software layer.
Memory write requests shown in Figure 2-4 are posted transactions. This implies that the
completer returns no completion notification to inform the requester that the memory write
request packet has reached its destination successfully. No time is wasted in returning a
completion, thus back-to-back posted writes complete with higher performance relative to non-
posted transactions.
The write request packet which contains data is routed through the fabric of switches using
information in the header portion of the packet. The packet makes its way to a completer. The
completer accepts the specified amount of data within the packet. Transaction over.
If the write request is received by the completer in error, or the completer is unable to write the
posted write data to the final destination due to an internal error, the requester is not informed via the
hardware protocol. The completer could log an error and generate an error message
notification to the root complex. Error handling software manages the error.
Message requests are also posted transactions as pictured in Figure 2-5 on page 64. There
are two categories of message request TLPs, Msg and MsgD. Some message requests
propagate from requester to completer, some are broadcast requests from the root complex to
all endpoints, some are transmitted by an endpoint to the root complex. Message packets may
be routed to completer(s) based on the message's address, device ID or routed implicitly.
Message request routing is covered in Chapter 3.
The completer accepts any data that may be contained in the packet (if the packet is MsgD)
and/or performs the task specified by the message.
Message request support eliminates the need for side-band signals in a PCI Express system.
They are used for PCI style legacy interrupt signaling, power management protocol, error
signaling, unlocking a path in the PCI Express fabric, slot power support, hot plug protocol, and
vendor-defined purposes.
This section describes a few transaction examples showing packets transmitted between
requester and completer to accomplish a transaction. The examples consist of a memory read,
IO write, and Memory write.
Figure 2-6 shows an example of packet routing associated with completing a memory read
transaction. The root complex on the behalf of the CPU initiates a non-posted memory read
from the completer endpoint shown. The root complex transmits an MRd packet which contains
amongst other fields, an address, TLP type, requester ID (of the root complex) and length of
transfer (in doublewords) field. Switch A, which is a 3-port switch, receives the packet on its
upstream port. The switch logically appears as three virtual bridges connected by an
internal bus. The logical bridges within the switch contain memory and IO base and limit
address registers within their configuration space similar to PCI bridges. The MRd packet
address is decoded by the switch and compared with the base/limit address range registers of
the two downstream logical bridges. The switch internally forwards the MRd packet from the
upstream ingress port to the correct downstream port (the left port in this example). The MRd
packet is forwarded to switch B. Switch B decodes the address in a similar manner. Assume
the MRd packet is forwarded to the right-hand port so that the completer endpoint receives
the MRd packet.
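The address decode performed by the switch's virtual bridges can be sketched in C as shown below. The register values and port numbering are invented, and only a single memory window per bridge is modeled; real bridges also decode IO and prefetchable memory ranges.

    #include <stdio.h>

    /* One virtual PCI-PCI bridge per downstream switch port (illustrative).  */
    struct vbridge {
        int port;                        /* downstream port behind this bridge */
        unsigned long long mem_base;     /* memory base register               */
        unsigned long long mem_limit;    /* memory limit register (inclusive)  */
    };

    /* Return the downstream port whose base/limit window claims the address,
     * or -1 if no downstream bridge claims it.                                */
    static int route_by_address(const struct vbridge *br, int n,
                                unsigned long long addr)
    {
        for (int i = 0; i < n; i++)
            if (addr >= br[i].mem_base && addr <= br[i].mem_limit)
                return br[i].port;
        return -1;
    }

    int main(void)
    {
        /* A 3-port switch: one upstream port plus two downstream bridges.    */
        struct vbridge downstream[2] = {
            {1, 0x80000000ULL, 0x8FFFFFFFULL},   /* left downstream port      */
            {2, 0x90000000ULL, 0x9FFFFFFFULL},   /* right downstream port     */
        };
        unsigned long long mrd_addr = 0x84000000ULL;
        printf("MRd to 0x%llx forwarded to downstream port %d\n",
               mrd_addr, route_by_address(downstream, 2, mrd_addr));
        return 0;
    }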
The completer decodes the contents of the header within the MRd packet, gathers the
requested data and returns a completion packet with data (CplD). The header portion of the
completion TLP contains the requester ID copied from the original request TLP. The requester
ID is used to route the completion packet back to the root complex.
The logical bridges within Switch B compare the bus number field of the requester ID in the
CplD packet with the secondary and subordinate bus number configuration registers. The CplD
packet is forwarded to the appropriate port (in this case the upstream port). The CplD packet
moves to Switch A which forwards the packet to the root complex. The requester ID field of the
completion TLP matches the root complex's ID. The root complex checks the completion status
(hopefully "successful completion") and accepts the data. This data is returned to the CPU in
response to its pending memory read transaction.
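Completion routing by bus number can be sketched the same way. The secondary/subordinate values below are invented; the point is simply that a virtual bridge forwards a completion downstream when the bus number in the requester ID falls within its secondary-to-subordinate range, and the switch otherwise forwards it out its upstream port.

    #include <stdio.h>

    struct vbridge {
        int port;            /* downstream port behind this virtual bridge     */
        int secondary;       /* secondary bus number configuration register    */
        int subordinate;     /* subordinate bus number configuration register  */
    };

    /* Route a completion using the bus number portion of the requester ID;
     * port 0 stands for the upstream port in this sketch.                     */
    static int route_by_id(const struct vbridge *br, int n, int req_bus)
    {
        for (int i = 0; i < n; i++)
            if (req_bus >= br[i].secondary && req_bus <= br[i].subordinate)
                return br[i].port;       /* claimed: forward downstream        */
        return 0;                        /* not claimed: forward upstream      */
    }

    int main(void)
    {
        struct vbridge downstream[2] = {
            {1, 3, 3},                   /* left port leads to bus 3           */
            {2, 4, 4},                   /* right port leads to bus 4          */
        };
        /* A CplD whose requester ID carries bus number 0 (the root complex)
         * is not claimed by either downstream bridge, so it goes upstream.    */
        printf("CplD for bus 0 routed to port %d\n", route_by_id(downstream, 2, 0));
        printf("CplD for bus 4 routed to port %d\n", route_by_id(downstream, 2, 4));
        return 0;
    }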
In a similar manner, the endpoint device shown in Figure 2-7 on page 67 initiates a memory
read request (MRd). This packet contains amongst other fields in the header, the endpoint's
requester ID, targeted address and amount of data requested. It forwards the packet to Switch
B which decodes the memory address in the packet and compares it with the memory
base/limit address range registers within the virtual bridges of the switch. The packet is
forwarded to Switch A which decodes the address in the packet and forwards the packet to the
root complex completer.
The root complex obtains the requested data from system memory and creates a completion
TLP with data (CplD). The bus number portion of the requester ID in the completion TLP is
used to route the packet through the switches to the endpoint.
A requester endpoint can also communicate with another peer completer endpoint. For example
an endpoint attached to switch B can talk to an endpoint connected to switch C. The request
TLP is routed using an address. The completion is routed using bus number. Multi-port root
complex devices are not required to support port-to-port packet routing, in which case peer-to-peer
transactions between endpoints associated with two different ports of the root complex are
not supported.
IO requests can only be initiated by a root complex or a legacy endpoint. PCI Express
endpoints do not initiate IO transactions. IO transactions are intended for legacy support.
Native PCI Express devices are not prohibited from implementing IO space, but the
specification states that a PCI Express Endpoint must not depend on the operating system
allocating I/O resources that are requested.
IO requests are routed by switches in a similar manner to memory requests. Switches route IO
request packets by comparing the IO address in the packet with the IO base and limit address
range registers in the virtual bridge configuration space associated with a switch.
Figure 2-8 on page 68 shows routing of packets associated with an IO write transaction. The
CPU initiates an IO write on the Front Side Bus (FSB). The write contains a target IO address
and up to 4 Bytes of data. The root complex creates an IO Write request TLP (IOWr) using
address and data from the CPU transaction. It uses its own requester ID in the packet header.
This packet is routed through switch A and B. The completer endpoint returns a completion
without data (Cpl) and completion status of 'successful completion' to confirm the reception of
good data from the requester.
Memory write (MWr) requests (and message requests Msg or MsgD) are posted transactions.
This implies that the completer does not return a completion. The MWr packet is routed through
the PCI Express fabric of switches in the same manner as described for memory read
requests. The requester root complex can write up to 4 KBytes of data with one MWr packet.
Figure 2-9 on page 69 shows a memory write transaction originated by the CPU. The root
complex creates a MWr TLP on behalf of the CPU using target address and data from the CPU
FSB transaction. This packet is routed through switch A and B. The packet reaches the
endpoint and the transaction is complete.
Overview
The PCI Express specification defines a layered architecture for device design as shown in
Figure 2-10 on page 70. The layers consist of a Transaction Layer, a Data Link Layer and a
Physical layer. The layers can be further divided vertically into two, a transmit portion that
processes outbound traffic and a receive portion that processes inbound traffic. However, a
device design does not have to implement a layered architecture as long as the functionality
required by the specification is supported.
The goal of this section is to describe the function of each layer and to describe the flow of
events to accomplish a data transfer. Packet creation at a transmitting device and packet
reception and decoding at a receiving device are also explained.
The receiver device decodes the incoming packet contents in the Physical Layer and forwards
the resulting contents to the upper layers. The Data Link Layer checks for errors in the
incoming packet and if there are no errors forwards the packet up to the Transaction Layer.
The Transaction Layer buffers the incoming TLPs and converts the information in the packet to
a representation that can be processed by the device core and application.
Three categories of packets are defined, each one is associated with one of the three device
layers. Associated with the Transaction Layer is the Transaction Layer Packet (TLP).
Associated with the Data Link Layer is the Data Link Layer Packet (DLLP). Associated with the
Physical Layer is the Physical Layer Packet (PLP). These packets are introduced next.
PCI Express transactions employ TLPs which originate at the Transaction Layer of a
transmitter device and terminate at the Transaction Layer of a receiver device. This process is
represented in Figure 2-11 on page 72. The Data Link Layer and Physical Layer also contribute
to TLP assembly as the TLP moves through the layers of the transmitting device. At the other
end of the Link where a neighbor receives the TLP, the Physical Layer, Data Link Layer and
Transaction Layer disassemble the TLP.
A TLP that is transmitted on the Link appears as shown in Figure 2-12 on page 73.
The software layer/device core sends to the Transaction Layer the information required to
assemble the core section of the TLP which is the header and data portion of the packet. Some
TLPs do not contain a data section. An optional End-to-End CRC (ECRC) field is calculated and
appended to the packet. The ECRC field is used by the ultimate targeted device of this packet
to check for CRC errors in the header and data portion of the TLP.
The core section of the TLP is forwarded to the Data Link Layer which then appends a
sequence ID and an LCRC field. The LCRC field is used by the neighboring receiver
device at the other end of the Link to check for CRC errors in the core section of the TLP plus
the sequence ID. The resultant TLP is forwarded to the Physical Layer which concatenates a
Start and End framing character of 1 byte each to the packet. The packet is encoded and
differentially transmitted on the Link using the available number of Lanes.
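A minimal C sketch of this layered append process follows. The field sizes, placeholder CRC values and packet contents are invented for illustration; the framing values shown are the 8-bit codes prior to 8b/10b encoding.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: build an outbound byte stream layer by layer. The
     * header and data come from the Transaction Layer, the sequence ID and
     * LCRC from the Data Link Layer, and the framing from the Physical Layer.
     * CRC values here are placeholders, not real calculations.               */
    int main(void)
    {
        uint8_t pkt[64];
        size_t n = 0;

        uint8_t  header[12] = {0};      /* 3 DW header (contents omitted)      */
        uint8_t  data[8]    = {0};      /* data payload                        */
        uint32_t ecrc = 0xDEADBEEF;     /* optional ECRC / digest (placeholder) */
        uint16_t seq  = 0x123;          /* 12-bit sequence ID                  */
        uint32_t lcrc = 0xCAFEF00D;     /* LCRC (placeholder)                  */

        pkt[n++] = 0xFB;                              /* Start framing symbol  */
        pkt[n++] = (uint8_t)((seq >> 8) & 0x0F);      /* sequence ID, 12 bits  */
        pkt[n++] = (uint8_t)(seq & 0xFF);
        memcpy(pkt + n, header, sizeof header); n += sizeof header;
        memcpy(pkt + n, data, sizeof data);     n += sizeof data;
        memcpy(pkt + n, &ecrc, 4); n += 4;            /* from Transaction Layer */
        memcpy(pkt + n, &lcrc, 4); n += 4;            /* from Data Link Layer   */
        pkt[n++] = 0xFD;                              /* End framing symbol    */

        printf("framed TLP occupies %zu bytes on the Link\n", n);
        return 0;
    }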
A neighboring receiver device receives the incoming TLP bit stream. As shown in Figure 2-13
on page 74 the received TLP is decoded by the Physical Layer and the Start and End frame
fields are stripped. The resultant TLP is sent to the Data Link Layer. This layer checks for any
errors in the TLP and strips the sequence ID and LCRC field. Assuming there are no LCRC
errors, the TLP is forwarded up to the Transaction Layer. If the receiving device is a
switch, the packet is routed from an ingress port of the switch to an egress port based on
address information contained in the header portion of the TLP. Switches are allowed to check
for ECRC errors and even report any errors they find. However, a switch is not allowed
to modify the ECRC; that way, the targeted device of this TLP will still detect an ECRC error if
one exists.
The ultimate targeted device of this TLP checks for ECRC errors in the header and data portion
of the TLP. The ECRC field is stripped, leaving the header and data portion of the packet. It is
this information that is finally forwarded to the Device Core/Software Layer.
Data Link Layer Packets (DLLPs)
Another PCI Express packet called DLLP originates at the Data Link Layer of a transmitter
device and terminates at the Data Link Layer of a receiver device. This process is represented
in Figure 2-14 on page 75. The Physical Layer also contributes to DLLP assembly and
disassembly as the DLLP moves from one device to another via the PCI Express Link.
DLLPs are used for Link Management functions including TLP acknowledgement associated
with the ACK/NAK protocol, power management, and exchange of Flow Control information.
DLLPs are transferred between Data Link Layers of the two directly connected components on
a Link. Unlike TLPs, which travel through the PCI Express fabric, DLLPs do not pass through
switches. DLLPs do not contain routing information. These packets are smaller than TLPs,
8 bytes to be precise.
DLLP Assembly
The DLLP shown in Figure 2-15 on page 76 originates at the Data Link Layer. There are
various types of DLLPs, some of which include Flow Control DLLPs (FCx), acknowledge/no-acknowledge
DLLPs which confirm reception of TLPs (ACK and NAK), and power management
DLLPs (PMx). A DLLP type field identifies various types of DLLPs. The Data Link Layer
appends a 16-bit CRC used by the receiver of the DLLP to check for CRC errors in the DLLP.
DLLP Disassembly
The DLLP is received by the Physical Layer of a receiving device. The received bit stream is
decoded and the Start and End frame fields are stripped as depicted in Figure 2-16. The
resultant packet is sent to the Data Link Layer. This layer checks for CRC errors and strips the
CRC field. The Data Link Layer is the destination layer for DLLPs; they are not forwarded up to
the Transaction Layer.
Another PCI Express packet called PLP originates at the Physical Layer of a transmitter device
and terminates at the Physical Layer of a receiver device. This process is represented in Figure
2-17 on page 77. The PLP is a very simple packet that starts with a 1 byte COM character
followed by 3 or more other characters that define the PLP type as well as contain other
information. The PLP is a multiple of 4 bytes in size, an example of which is shown in Figure 2-
18 on page 78. The specification refers to this packet as the Ordered-Set. PLPs do not contain
any routing information. They are not routed through the fabric and do not propagate through a
switch.
Some PLPs are used during the Link Training process described in "Ordered-Sets Used During
Link Training and Initialization" on page 504. Another PLP is used for clock tolerance
compensation. PLPs are used to place a Link into the electrical idle low power state or to wake
up a link from this low power state.
Figure 2-19 on page 79 is a more detailed block diagram of a PCI Express Device's layers.
This block diagram is used to explain key functions of each layer and explain the function of
each layer as it relates to generation of outbound traffic and response to inbound traffic. The
layers consist of Device Core/Software Layer, Transaction Layer, Data Link Layer and Physical
Layer.
The Device Core consists of, for example, the root complex core logic or an endpoint core logic
such as that of an Ethernet controller, SCSI controller, USB controller, etc. To design a PCI
Express endpoint, a designer may reuse the Device Core logic from a PCI or PCI-X core logic
design and wrap around it the PCI Express layered design described in this section.
Transmit Side
The Device Core logic in conjunction with local software provides the necessary information
required by the PCI Express device to generate TLPs. This information is sent via the Transmit
interface to the Transaction Layer of the device. Examples of information transmitted to the
Transaction Layer include: transaction type to inform the Transaction Layer what type of TLP
to generate, address, amount of data to transfer, data, traffic class, message index, etc.
Receive Side
The Device Core logic is also responsible for receiving information sent by the Transaction Layer
via the Receive interface. This information includes: type of TLP received by the Transaction
Layer, address, amount of data received, data, traffic class of received TLP, message index,
error conditions etc.
Transaction Layer
The Transaction Layer shown in Figure 2-19 is responsible for generation of outbound TLP
traffic and reception of inbound TLP traffic. The Transaction Layer supports the split transaction
protocol for non-posted transactions. In other words, the Transaction Layer associates an
inbound completion TLP of a given tag value with an outbound non-posted request TLP of the
same tag value transmitted earlier.
The Transaction Layer contains virtual channel buffers (VC buffers) to store outbound TLPs that
await transmission and also to store inbound TLPs received from the Link. The flow control
protocol associated with these virtual channel buffers ensures that a remote transmitter does
not transmit too many TLPs and cause the receiver virtual channel buffers to overflow. The
Transaction Layer also orders TLPs according to ordering rules before transmission. It is this
layer that supports the Quality of Service (QoS) protocol.
The Transaction Layer supports 4 address spaces: memory address, IO address, configuration
address and message space. Message packets contain a message.
Transmit Side
The Transaction Layer receives information from the Device Core and generates outbound
request and completion TLPs which it stores in virtual channel buffers. This layer assembles
Transaction Layer Packets (TLPs). The major components of a TLP are: Header, Data Payload
and an optional ECRC (specification also uses the term Digest) field as shown in Figure 2-20.
The Header is 3 doublewords or 4 doublewords in size and may include information such as:
Address, TLP type, transfer size, requester ID/completer ID, tag, traffic class, byte enables,
completion codes, and attributes (including "no snoop" and "relaxed ordering" bits). The TLP
types are defined in Table 2-2 on page 57.
The address is a 32-bit memory address or an extended 64-bit address for memory requests.
It is a 32-bit address for IO requests. For configuration transactions the address is an ID
consisting of Bus Number, Device Number and Function Number plus a configuration register
address of the targeted register. For completion TLPs, the address is the requester ID of the
device that originally made the request. For message transactions the address used for routing
is the destination device's ID consisting of Bus Number, Device Number and Function Number of
the device targeted by the message request. Message requests could also be broadcast or
routed implicitly by targeting the root complex or an upstream port.
The transfer size or length field indicates the amount of data to transfer calculated in
doublewords (DWs). The data transfer length can be between 1 to 1024 DWs. Write request
TLPs include data payload in the amount indicated by the length field of the header. For a read
request TLP, the length field indicates the amount of data requested from a completer. This
data is returned in one or more completion packets. Read request TLPs do not include a data
payload field. Byte enables specify byte level address resolution.
Request packets contain a requester ID (bus#, device#, function #) of the device transmitting
the request. The tag field in the request is memorized by the completer and the same tag is
used in the completion.
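The header fields described above can be pictured as a simple C structure. This is not the on-the-wire bit layout, which is covered in a later chapter; it merely names the fields a 3 DW request header carries, with illustrative values.

    #include <stdio.h>

    /* Simplified view of the fields carried in a 3 DW request TLP header.
     * This is NOT the on-the-wire bit layout; it just names the fields the
     * text describes, with illustrative values.                              */
    struct tlp_header_3dw {
        unsigned char  fmt_type;      /* TLP type, e.g. MRd, MWr, IORd ...     */
        unsigned char  tc;            /* traffic class, 0-7                    */
        unsigned char  td;            /* 1 = ECRC/digest present               */
        unsigned char  attr;          /* relaxed ordering / no snoop bits      */
        unsigned int   length_dw;     /* transfer size, 1-1024 DWs             */
        unsigned short requester_id;  /* bus#, device#, function#              */
        unsigned char  tag;           /* matches the completion to the request */
        unsigned char  first_be;      /* first DW byte enables                 */
        unsigned char  last_be;       /* last DW byte enables                  */
        unsigned int   address;       /* 32-bit address (64-bit uses 4 DWs)    */
    };

    /* Pack bus/device/function into a 16-bit ID: bus[15:8] dev[7:3] fn[2:0].  */
    static unsigned short make_id(unsigned bus, unsigned dev, unsigned fn)
    {
        return (unsigned short)((bus << 8) | (dev << 3) | fn);
    }

    int main(void)
    {
        struct tlp_header_3dw h = {
            .fmt_type = 0x00,                 /* 3 DW memory read request      */
            .tc = 0, .td = 1, .attr = 0,
            .length_dw = 16,
            .requester_id = make_id(0, 0, 0), /* root complex: bus 0, device 0 */
            .tag = 5, .first_be = 0xF, .last_be = 0xF,
            .address = 0x80000000u,
        };
        printf("MRd: %u DW from 0x%08x, requester ID 0x%04x, tag %u\n",
               h.length_dw, h.address, (unsigned)h.requester_id, (unsigned)h.tag);
        return 0;
    }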
A bit in the Header (TD = TLP Digest) indicates whether this packet contains an ECRC field,
also referred to as the Digest. This field is 32 bits wide and contains an End-to-End CRC (ECRC).
The ECRC field is generated by the Transaction Layer at the time the outbound TLP is created.
It is calculated over the entire TLP, from the first byte of the header to the last byte of the data
payload (with the exception of the EP bit and bit 0 of the Type field, which are always
considered to be 1 for the ECRC calculation). The TLP never changes as it traverses the
fabric (with the possible exception of the two bits just mentioned). The
receiver device checks for an ECRC error that may occur as the packet moves through the
fabric.
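As an illustration of a 32-bit CRC of the kind used for the ECRC, the C sketch below computes a conventional CRC-32 (polynomial 0x04C11DB7 in its reflected form) over a placeholder buffer. The exact bit mapping and the treatment of the EP and Type[0] bits are defined by the specification and are not modeled here.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-32 (reflected form of polynomial 0x04C11DB7, all-ones seed).
     * Illustrative only: the specification defines the exact bit ordering and
     * forces the EP and Type[0] bits to 1 before the ECRC calculation.        */
    static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int b = 0; b < 8; b++)
                crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
        return crc;
    }

    int main(void)
    {
        uint8_t tlp[20] = {0};           /* header + payload bytes (placeholder) */
        uint32_t ecrc = ~crc32_update(0xFFFFFFFFu, tlp, sizeof tlp);
        printf("ECRC (illustrative) = 0x%08X\n", ecrc);
        return 0;
    }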
Receiver Side
The receiver side of the Transaction Layer stores inbound TLPs in receiver virtual channel
buffers. The receiver checks for CRC errors based on the ECRC field in the TLP. If there are
no errors, the ECRC field is stripped and the resultant information in the TLP header as well as
the data payload is sent to the Device Core.
Flow Control
The Transaction Layer ensures that it does not transmit a TLP over the Link to a remote
receiver device unless the receiver device has virtual channel buffer space to accept TLPs (of a
given traffic class). The protocol for guaranteeing this mechanism is referred to as the "flow
control" protocol. If the transmitter device does not observe this protocol, a transmitted TLP will
cause the receiver virtual channel buffer to overflow. Flow control is automatically managed at
the hardware level and is transparent to software. Software is only involved to enable additional
buffers beyond the default set of virtual channel buffers (referred to as VC 0 buffers). The
default buffers are enabled automatically after Link training, thus allowing TLP traffic to flow
through the fabric immediately after Link training. Configuration transactions use the default
virtual channel buffers and can begin immediately after the Link training process. The Link training
process is described in Chapter 14, entitled "Link Initialization & Training," on page 499.
Refer to Figure 2-21 on page 82 for an overview of the flow control process. A receiver device
transmits DLLPs called Flow Control Packets (FCx DLLPs) to the transmitter device on a
periodic basis. The FCx DLLPs contain flow control credit information that updates the
transmitter regarding how much buffer space is available in the receiver virtual channel buffer.
The transmitter keeps track of this information and will only transmit TLPs out of its Transaction
Layer if it knows that the remote receiver has buffer space to accept the transmitted TLP.
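A minimal C sketch of this credit gating follows. The field names are invented and a cost of one credit per TLP is assumed; the specification actually tracks header and data credits separately for posted, non-posted and completion traffic.

    #include <stdio.h>

    /* Minimal sketch of credit gating for one virtual channel (simplified).   */
    struct fc_state {
        unsigned credits_advertised;   /* total credits granted via FCx DLLPs   */
        unsigned credits_consumed;     /* credits used by TLPs already sent     */
    };

    /* Called when an UpdateFC DLLP arrives from the receiver at the far end.  */
    static void on_update_fc(struct fc_state *fc, unsigned newly_freed)
    {
        fc->credits_advertised += newly_freed;
    }

    /* Transmit only if the remote VC buffer is known to have room for the TLP. */
    static int try_transmit(struct fc_state *fc, unsigned cost)
    {
        if (fc->credits_consumed + cost > fc->credits_advertised)
            return 0;                          /* stall: not enough credits     */
        fc->credits_consumed += cost;
        return 1;                              /* TLP handed to Data Link Layer */
    }

    int main(void)
    {
        struct fc_state vc0 = { .credits_advertised = 2, .credits_consumed = 0 };
        printf("send #1: %s\n", try_transmit(&vc0, 1) ? "ok" : "stalled");
        printf("send #2: %s\n", try_transmit(&vc0, 1) ? "ok" : "stalled");
        printf("send #3: %s\n", try_transmit(&vc0, 1) ? "ok" : "stalled");
        on_update_fc(&vc0, 1);                 /* receiver freed buffer space   */
        printf("send #3 retry: %s\n", try_transmit(&vc0, 1) ? "ok" : "stalled");
        return 0;
    }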
Consider Figure 2-22 on page 83 in which the video camera and SCSI device shown need to
transmit write request TLPs to system DRAM. The camera data is time critical isochronous
data which must reach memory with guaranteed bandwidth otherwise the displayed image will
appear choppy or unclear. The SCSI data is not as time sensitive and only needs to get to
system memory correctly without errors. It is clear that the video data packet should have
higher priority when routed through the PCI Express fabric, especially through switches. QoS
refers to the capability of routing packets from different applications through the fabric with
differentiated priorities and deterministic latencies and bandwidth. PCI and PCI-X systems do
not support QoS capability.
As TLPs from these two applications (video and SCSI applications) move through the fabric,
the switches post incoming packets moving upstream into their respective VC buffers (VC0 and
VC7). The switch uses a priority based arbitration mechanism to determine which of the two
incoming packets to forward with greater priority to a common egress port. Assume VC7 buffer
contents are configured with higher priority than VC0. Whenever two incoming packets are to
be forwarded to one upstream port, the switch will always pick the VC7 packet, the video data,
over the VC0 packet, the SCSI data. This guarantees greater bandwidth and reduced latency
for video data compared to SCSI data.
A PCI Express device that implements more than one set of virtual channel buffers has the
ability to arbitrate between TLPs from different VC buffers. VC buffers have configurable
priorities. Thus traffic flowing through the system in different VC buffers will observe
differentiated performances. The arbitration mechanism between TLP traffic flowing through
different VC buffers is referred to as VC arbitration.
Also, multi-port switches have the ability to arbitrate between traffic coming in on two ingress
ports but using the same VC buffer resource on a common egress port. This configurable
arbitration mechanism between ports supported by switches is referred to as Port arbitration.
TC is a TLP header field transmitted within the packet unmodified end-to-end through the
fabric. Local application software and system software based on performance requirements
decides what TC label a TLP uses. VCs are physical buffers that provide a means to support
multiple independent logical data flows over the physical Link via the use of transmit and
receiver virtual channel buffers.
PCI Express devices may implement up to 8 VC buffers (VC0-VC7). The TC field is a 3-bit field
that allows differentiation of traffic into 8 traffic classes (TC0-TC7). Devices must implement
VC0. Similarly, a device is required to support TC0 (best effort general purpose service class).
The other optional TCs may be used to provide differentiated service through the fabric.
Associated with each implemented VC ID, a transmit device implements a transmit buffer and a
receive device implements a receive buffer.
Devices or switches implement TC-to-VC mapping logic by which a TLP of a given TC number
is forwarded through the Link using a particular VC numbered buffer. PCI Express provides the
capability of mapping multiple TCs onto a single VC, thus reducing device cost by allowing a
device to implement a limited number of VC buffers. TC/VC mapping is configured by system
software through configuration registers. It is up to the device application software to determine
TC label for TLPs and TC/VC mapping that meets performance requirements. In its simplest
form TC/VC mapping registers can be configured with a one-to-one mapping of TC to VC.
Consider the example illustrated in Figure 2-23 on page 85. The TC/VC mapping registers in
Device A are configured to map TLPs with TC[2:0] to VC0 and TLPs with TC[7:3] to VC1. The
TC/VC mapping registers in receiver Device B must be configured identically to those in Device A.
The same numbered VC buffers are enabled both in transmitter Device A and receiver Device
B.
If Device A needs to transmit a TLP with TC label of 7 and another packet with TC label of 0,
the two packets will be placed in VC1 and VC0 buffers, respectively. The arbitration logic
arbitrates between the two VC buffers. Assume VC1 buffer is configured with higher priority
than VC0 buffer. Thus, Device A will forward the TC7 TLPs in VC1 to the Link ahead of the TC0
TLPs in VC0.
When the TLPs arrive in Device B, the TC/VC mapping logic decodes the TC label in each TLP
and places the TLPs in their associated VC buffers.
In this example, TLP traffic with TC[7:3] label will flow through the fabric with higher priority
than TC[2:0] traffic. Within each TC group however, TLPs will flow with equal priority.
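The mapping in this example can be expressed as a simple lookup, as the C sketch below shows. The table contents mirror the Figure 2-23 example; the register encoding used by real devices is different and is set up through configuration space.

    #include <stdio.h>

    /* Illustrative TC-to-VC map for the Figure 2-23 example: TC0-TC2 map to
     * VC0 and TC3-TC7 map to VC1. A real device holds the equivalent of this
     * table in configuration registers.                                       */
    static const int tc_to_vc[8] = { 0, 0, 0, 1, 1, 1, 1, 1 };

    /* Both ends of the Link must use an identical mapping so that a TLP of a
     * given TC lands in the same-numbered VC buffer at the receiver.          */
    static int vc_for_tc(int tc)
    {
        return tc_to_vc[tc & 7];
    }

    int main(void)
    {
        printf("TC7 TLPs are placed in VC%d, TC0 TLPs in VC%d\n",
               vc_for_tc(7), vc_for_tc(0));
        return 0;
    }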
Packets of different TCs are routed through the fabric of switches with different priority based
on arbitration policy implemented in switches. Packets coming in from ingress ports heading
towards a particular egress port compete for use of that egress port.
Switches implement two types of arbitration for each egress port: Port Arbitration and VC
Arbitration. Consider Figure 2-24 on page 86.
Port arbitration is arbitration between two packets arriving on different ingress ports but that
map to the same virtual channel (after going through TC-to-VC mapping) of the common egress
port. The port arbiter implements round-robin, weighted round-robin or programmable time-
based round-robin arbitration schemes selectable through configuration registers.
VC arbitration takes place after port arbitration. For a given egress port, packets from all VCs
compete to transmit on the same egress port. VC arbitration resolves the order in which TLPs
in different VC buffers are forwarded on to the Link. VC arbitration policies supported include
strict priority, round-robin and weighted round-robin arbitration schemes selectable through
configuration registers.
Independent of arbitration, each VC must observe transaction ordering and flow control rules
before it can make pending TLP traffic visible to the arbitration mechanism.
Endpoint devices and a root complex with only one port do not support port arbitration. They
only support VC arbitration in the Transaction Layer.
Transaction Ordering
The PCI Express protocol implements the PCI/PCI-X compliant producer/consumer ordering model for
transaction ordering, with a provision to support relaxed ordering similar to PCI-X architecture.
Transaction ordering rules guarantee that TLP traffic associated with a given traffic class is
routed through the fabric in the correct order to prevent potential deadlock or live-lock
conditions from occurring. Traffic associated with different TC labels has no ordering
relationship. Chapter 8, entitled "Transaction Ordering," on page 315 describes these ordering
rules.
The Transaction Layer ensures that TLPs for a given TC are ordered correctly with respect to
other TLPs of the same TC label before forwarding to the Data Link Layer and Physical Layer
for transmission.
Power Management
Configuration Registers
A device's configuration registers are associated with the Transaction Layer. The registers are
configured during initialization and bus enumeration. They are also configured by device drivers
and accessed by runtime software/OS. Additionally, the registers store negotiated Link
capabilities, such as Link width and frequency. Configuration registers are described in Part 6
of the book.
Transmit Side
The Transaction Layer must observe the flow control mechanism before forwarding outbound
TLPs to the Data Link Layer. If sufficient credits exist, a TLP stored within the virtual channel
buffer is passed from the Transaction Layer to the Data Link Layer for transmission.
Consider Figure 2-25 on page 88 which shows the logic associated with the ACK-NAK
mechanism of the Data Link Layer. The Data Link Layer is responsible for TLP CRC generation
and TLP error checking. For outbound TLPs from transmit Device A, a Link CRC (LCRC) is
generated and appended to the TLP. In addition, a sequence ID is appended to the TLP. Device
A's Data Link Layer preserves a copy of the TLP in a replay buffer and transmits the TLP to
Device B. The Data Link Layer of the remote Device B receives the TLP and checks for CRC
errors.
If on the other hand a CRC error is detected in the TLP received at the remote Device B, then
a NAK DLLP with a sequence ID is returned to Device A. An error has occurred during TLP
transmission. Device A's Data Link Layer replays associated TLPs from the replay buffer. The
Data Link Layer generates error indications for error reporting and logging mechanisms.
In summary, the replay mechanism uses the sequence ID field within received ACK/NAK DLLPs
to associate them with outbound TLPs stored in the replay buffer. Reception of an ACK DLLP
causes the associated TLPs to be cleared from the replay buffer. Reception of a NAK DLLP
causes the replay buffer to replay the associated TLPs.
For a given TLP in the replay buffer, if the transmitter device receives a NAK 4 times and the
TLP is replayed 3 additional times as a result, then the Data Link Layer logs the error, reports
a correctable error, and re-trains the Link.
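The following C sketch models the transmit-side replay behavior just described. Sequence number wrap-around, the replay counter and Link retraining are omitted; the names and buffer depth are invented for the example.

    #include <stdio.h>

    #define BUF 8   /* illustrative replay buffer depth */

    /* Simplified transmit-side replay buffer: sequence numbers never wrap in
     * this sketch and "transmit" just prints. Real hardware handles the 12-bit
     * wrap, the replay counter and Link retraining described in the text.     */
    struct replay_buf {
        int seq[BUF];
        int count;
    };

    static void transmit_tlp(struct replay_buf *rb, int seq)
    {
        rb->seq[rb->count++] = seq;             /* keep a copy until ACKed      */
        printf("TX TLP, sequence %d\n", seq);
    }

    /* An ACK acknowledges every stored TLP up to and including 'seq'.          */
    static void on_ack(struct replay_buf *rb, int seq)
    {
        int kept = 0;
        for (int i = 0; i < rb->count; i++)
            if (rb->seq[i] > seq)
                rb->seq[kept++] = rb->seq[i];   /* still outstanding            */
        rb->count = kept;
        printf("ACK %d: %d TLP(s) remain in the replay buffer\n", seq, rb->count);
    }

    /* A NAK acknowledges up to 'seq' and causes everything after it to replay. */
    static void on_nak(struct replay_buf *rb, int seq)
    {
        on_ack(rb, seq);
        for (int i = 0; i < rb->count; i++)
            printf("REPLAY TLP, sequence %d\n", rb->seq[i]);
    }

    int main(void)
    {
        struct replay_buf rb = { .count = 0 };
        transmit_tlp(&rb, 1);
        transmit_tlp(&rb, 2);
        transmit_tlp(&rb, 3);
        on_ack(&rb, 1);      /* TLP 1 confirmed; 2 and 3 remain buffered      */
        on_nak(&rb, 2);      /* TLP 2 confirmed implicitly; TLP 3 is replayed */
        return 0;
    }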
Receive Side
The receive side of the Data Link Layer is responsible for LCRC error checking on inbound
TLPs. If no error is detected, the device schedules an ACK DLLP for transmission back to the
remote transmitter device. The receiver strips the TLP of the LCRC field and sequence ID.
If a CRC error is detected, it schedules a NAK to return back to the remote transmitter. The
TLP is eliminated.
The receive side of the Data Link Layer also receives ACKs and NAKs from a remote device. If
an ACK is received the receive side of the Data Link layer informs the transmit side to clear an
associated TLP from the replay buffer. If a NAK is received, the receive side causes the replay
buffer of the transmit side to replay associated TLPs.
The receive side is also responsible for checking the sequence ID of received TLPs to check
for dropped or out-of-order TLPs.
The Data Link Layer concatenates a 12-bit sequence ID and 32-bit LCRC field to an outbound
TLP that arrives from the Transaction Layer. The resultant TLP is shown in Figure 2-26 on page
90. The sequence ID is used to associate a copy of the outbound TLP stored in the replay
buffer with a received ACK/NAK DLLP inbound from a neighboring remote device. The
ACK/NAK DLLP confirms arrival of the outbound TLP in the remote device.
Figure 2-26. TLP and DLLP Structure at the Data Link Layer
The 32-bit LCRC is calculated based on all bytes in the TLP including the sequence ID.
A DLLP shown in Figure 2-26 on page 90 is a 4 byte packet with a 16-bit CRC field. The 8-bit
DLLP Type field indicates various categories of DLLPs. These include: ACK, NAK, Power
Management related DLLPs (PM_Enter_L1, PM_Enter_L23, PM_Active_State_Request_L1,
PM_Request_Ack) and Flow Control related DLLPs (InitFC1-P, InitFC1-NP, InitFC1-Cpl,
InitFC2-P, InitFC2-NP, InitFC2-Cpl, UpdateFC-P, UpdateFC-NP, UpdateFC-Cpl). The 16-bit
CRC is calculated using all 4 bytes of the DLLP. Received DLLPs which fail the CRC check are
discarded. The loss of information from discarding a DLLP is self-repairing in that a
successive DLLP will supersede the lost information. ACK and NAK DLLPs contain a sequence
ID field (shown as Misc. field in Figure 2-26) used by the device to associate an inbound
ACK/NAK DLLP with a stored copy of a TLP in the replay buffer.
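A DLLP, with its framing stripped, can be pictured as the small C structure below. The field names are descriptive rather than the exact encodings defined by the specification, and the type code and CRC value shown are placeholders rather than real calculations.

    #include <stdio.h>
    #include <stdint.h>

    /* Descriptive view of a DLLP before Physical Layer framing: a 1-byte type,
     * 3 bytes of type-specific content (for example the sequence number of an
     * ACK/NAK, or credit counts for FCx DLLPs), and a 16-bit CRC.              */
    struct dllp {
        uint8_t  type;        /* ACK, NAK, PM_x, InitFCx, UpdateFCx ...          */
        uint8_t  misc[3];     /* type-specific payload                           */
        uint16_t crc16;       /* CRC calculated over the 4 preceding bytes       */
    };

    int main(void)
    {
        /* An ACK-style DLLP acknowledging sequence number 0x123; the type code
         * and CRC below are placeholders rather than the spec encodings.       */
        struct dllp ack = { .type = 0x00, .misc = { 0x00, 0x01, 0x23 }, .crc16 = 0xBEEF };
        printf("DLLP core is %zu bytes, plus Start/End framing on the Link\n",
               sizeof ack.type + sizeof ack.misc + sizeof ack.crc16);
        return 0;
    }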
Step 1a. Requester transmits a memory read request TLP (MRd). Switch receives the
MRd TLP and checks for CRC error using the LCRC field in the MRd TLP.
Step 1b. If no error then switch returns ACK DLLP to requester. Requester discards copy
of the TLP from its replay buffer.
Step 2a. Switch forwards the MRd TLP to the correct egress port using memory address
for routing. Completer receives MRd TLP. Completer checks for CRC errors in received
MRd TLP using LCRC.
Step 2b. If no error then completer returns ACK DLLP to switch. Switch discards copy of
the MRd TLP from its replay buffer.
Step 3a. Completer checks for CRC error using optional ECRC field in MRd TLP. Assume
no End-to-End error. Completer returns Completion (CplD) with Data TLP whenever it has
the requested data. Switch receives CplD TLP and checks for CRC error using LCRC.
Step 3b. If no error then switch returns ACK DLLP to completer. Completer discards copy
of the CplD TLP from its replay buffer.
Step 4a. Switch decodes Requester ID field in CplD TLP and routes the packet to the
correct egress port. Requester receives CplD TLP. Requester checks for CRC errors in
received CplD TLP using LCRC.
Step 4b. If no error then requester returns ACK DLLP to switch. Switch discards copy of
the CplD TLP from its replay buffer. Requester determines if there is an error in the CplD TLP
using the optional ECRC field. Assume no End-to-End error. Requester checks
completion error code in CplD. Assume completion code of 'Successful Completion'. To
associate the completion with the original request, requester matches tag in CplD with
original tag of MRd request. Requester accepts data.
Below are the steps involved in completing a memory write request between a requester and a
completer on the far end of a switch. Figure 2-28 on page 92 shows the activity on the Link to
complete this transaction:
Step 1a. Requester transmits a memory write request TLP (MWr) with data. Switch
receives MWr TLP and checks for CRC error with LCRC field in the TLP.
Step 1b. If no error then switch returns ACK DLLP to requester. Requester discards copy
of the TLP from its replay buffer.
Step 2a. Switch forwards the MWr TLP to the correct egress port using memory address
for routing. Completer receives MWr TLP. Completer checks for CRC errors in received
MWr TLP using LCRC.
Step 2b. If no error then completer returns ACK DLLP to switch. Switch discards copy of
the MWr TLP from its replay buffer. Completer checks for CRC error using optional digest
field in MWr TLP. Assume no End-to-End error. Completer accepts data. There is no
completion associated with this transaction.
Flow control for the default virtual channel VC0 is initialized first. In addition, when additional
VCs are enabled by software, the flow control initialization process is repeated for each newly
enabled VC. Since VC0 is enabled before all other VCs, no TLP traffic will be active prior to
initialization of VC0.
Physical Layer
Refer to Figure 2-19 on page 79 for a block diagram of a device's Physical Layer. Both TLP
and DLLP type packets are sent from the Data Link Layer to the Physical Layer for
transmission over the Link. Also, packets are received by the Physical Layer from the Link and
sent to the Data Link Layer.
The Physical Layer is divided in two portions, the Logical Physical Layer and the Electrical
Physical Layer. The Logical Physical Layer contains digital logic associated with processing
packets before transmission on the Link, or processing packets inbound from the Link before
sending to the Data Link Layer. The Electrical Physical Layer is the analog interface of the
Physical Layer that connects to the Link. It consists of differential drivers and receivers for each
Lane.
Transmit Side
TLPs and DLLPs from the Data Link Layer are clocked into a buffer in the Logical Physical
Layer. The Physical Layer frames the TLP or DLLP with a Start and an End character. Each of
these characters is a framing symbol, a code byte which a receiver device uses to detect the start
and end of a packet.
The Start and End characters are shown appended to a TLP and DLLP in Figure 2-29 on page
94. The diagram shows the size of each field in a TLP or DLLP.
Each byte of a packet is then scrambled with the aid of a Linear Feedback Shift Register (LFSR) type
scrambler. By scrambling the bytes, repeated bit patterns on the Link are eliminated, thus
reducing the average EMI noise generated.
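The scrambler can be modeled as a 16-bit LFSR, as in the C sketch below. This assumes a Fibonacci-style LFSR with taps corresponding to the polynomial X^16 + X^5 + X^4 + X^3 + 1 and an all-ones seed; the specification's exact bit ordering, its reset on COM symbols and the K characters that bypass scrambling are not modeled.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative Fibonacci-style LFSR with taps corresponding to the
     * polynomial X^16 + X^5 + X^4 + X^3 + 1, seeded to all ones.              */
    static uint16_t lfsr = 0xFFFF;

    static uint8_t scramble_byte(uint8_t in)
    {
        uint8_t out = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t prn = (uint8_t)((lfsr >> 15) & 1);          /* pseudo-random bit */
            uint8_t fb  = (uint8_t)(((lfsr >> 15) ^ (lfsr >> 4) ^
                                     (lfsr >> 3)  ^ (lfsr >> 2)) & 1);
            lfsr = (uint16_t)((lfsr << 1) | fb);
            out |= (uint8_t)((((in >> i) & 1u) ^ prn) << i);    /* XOR data with PRN */
        }
        return out;
    }

    int main(void)
    {
        /* Scrambling a run of identical bytes breaks up the repeated pattern. */
        for (int i = 0; i < 4; i++)
            printf("0x00 -> 0x%02X\n", scramble_byte(0x00));
        return 0;
    }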
The resultant bytes are encoded into a 10b code by the 8b/10b encoding logic. The primary
purpose of encoding 8b characters to 10b symbols is to create sufficient 1-to-0 and 0-to-1
transition density in the bit stream to facilitate recreation of a receive clock with the aid of a PLL
at the remote receiver device. Note that data is not transmitted along with a clock. Instead, the
bit stream contains sufficient transitions to allow the receiver device to recreate a receive clock.
The parallel-to-serial converter generates a serial bit stream of the packet on each Lane and
transmits it differentially at 2.5 Gbits/s.
Receive Side
The receive Electrical Physical Layer clocks in a packet arriving differentially on all Lanes. The
serial bit stream of the packet is converted into a 10b parallel stream using the serial-to-parallel
converter. The receiver logic also includes an elastic buffer which accommodates for clock
frequency variation between a transmit clock with which the packet bit stream is clocked into a
receiver and the receiver clock. The 10b symbol stream is decoded back to the 8b
representation of each symbol with the 8b/10b decoder. The 8b characters are de-scrambled.
The byte un-striping logic re-creates the original packet stream transmitted by the remote
device.
An additional function of the Physical Layer is Link initialization and training. Link initialization and
training is a Physical Layer controlled process that configures and initializes each Link for
normal operation. This process is automatic and does not involve software. The following are
determined during the Link initialization and training process:
Link width
Lane reversal
Polarity inversion.
Link width. Two devices with a different number of Lanes per Link may be connected. E.g. one
device has a x2 port and is connected to a device with a x4 port. After initialization the Physical
Layer of both devices determines and sets the Link width to the minimum Lane width of x2.
Other Link negotiated behaviors include Lane reversal and splitting of ports into multiple Links.
Lane reversal is an optional feature. Lanes are numbered, and a designer may not wire the
Lanes of two ports to match. In that case, training allows the Lane numbers to be reversed so
that the Lane numbers of the ports on each end of the Link
match up. Part of the same process may allow for a multi-Lane Link to be split into multiple
Links.
Polarity inversion. The D+ and D- differential pair terminals of two devices may not be
connected correctly. In that case, during the training sequence the receiver reverses the polarity
on its differential receiver.
Link data rate. Training is completed at a data rate of 2.5 Gbit/s. In the future, higher data rates
of 5 Gbit/s and 10 Gbit/s will be supported. During training, each node advertises its highest
data rate capability. The Link is initialized with the highest common frequency that devices at
opposite ends of a Link support.
Lane-to-Lane De-skew. Due to Link wire length variations and different driver/receiver
characteristics on a multi-Lane Link, bit streams on each Lane will arrive at a receiver skewed
with respect to other Lanes. The receiver circuit must compensate for this skew by
adding/removing delays on each Lane. Relaxed routing rules allow Link wire lengths on the order
of 20 to 30 inches.
Reset
A cold/warm reset, also called a Fundamental Reset, occurs following a device being
powered on (cold reset) or due to a reset without cycling power (warm reset).
On exit from reset (cold, warm, or hot), all state machines and configuration registers (hot
reset does not reset sticky configuration registers) are initialized.
The transmitter of one device is AC coupled to the receiver of another device at the opposite
end of the Link as shown in Figure 2-30. The AC coupling capacitor is between 75 and 200 nF. The
transmitter DC common mode voltage is established during Link training and initialization. The
DC common mode impedance is typically 50 ohms while the differential impedance is 100 ohms
typical. This impedance is matched with a standard FR4 board.
Refer to Figure 2-31. The requester Device Core or Software Layer sends the following
information to the Transaction Layer:
32-bit or 64-bit memory address, the memory read request transaction type, the amount of data to
read (in doublewords), the traffic class if other than TC0, byte enables, and attributes indicating
whether the 'relaxed ordering' and 'no snoop' bits should be set or cleared.
The Transaction layer uses this information to build a MRd TLP. The exact TLP packet format is
described in a later chapter. A 3 DW or 4 DW header is created depending on address size
(32-bit or 64-bit). In addition, the Transaction Layer adds its requester ID (bus#, device#,
function#) and an 8-bit tag to the header. It sets the TD (transaction digest present) bit in the
TLP header if a 32-bit End-to-End CRC is added to the tail portion of the TLP. The TLP does
not have a data payload. The TLP is placed in the appropriate virtual channel buffer ready for
transmission. The flow control logic confirms there are sufficient "credits" available (obtained
from the completer device) for the virtual channel associated with the traffic class used.
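The header content just described can be pictured as a simple data structure. The following sketch is
purely illustrative; the struct layout and field names are invented for clarity, and the actual bit-level
header format is described in Chapter 4:

    /* Illustrative only: the struct layout and field names are invented; the
     * real bit-level header format is described in Chapter 4. */
    #include <stdint.h>
    #include <stdio.h>

    struct mrd_header {               /* 3DW header for a 32-bit MRd          */
        uint8_t  fmt, type;           /* 00b / 0 0000b = MRd, 3DW, no payload */
        uint8_t  tc, td, attr;        /* traffic class, digest present, attrs */
        uint16_t length_dw;           /* amount of data requested, in DW      */
        uint16_t requester_id;        /* bus#/device#/function# of requester  */
        uint8_t  tag;                 /* identifies this outstanding request  */
        uint8_t  first_be, last_be;   /* byte enables                         */
        uint32_t address;             /* DW-aligned 32-bit target address     */
    };

    static struct mrd_header build_mrd(uint32_t addr, uint16_t dw,
                                       uint16_t req_id, uint8_t tag)
    {
        struct mrd_header h = {0};    /* fmt/type/tc/td/attr default to 0     */
        h.length_dw    = dw;
        h.requester_id = req_id;      /* added by the Transaction Layer       */
        h.tag          = tag;
        h.first_be     = 0xF;
        h.last_be      = (dw > 1) ? 0xF : 0x0;
        h.address      = addr & ~0x3u;   /* low two address bits are reserved */
        return h;
    }

    int main(void)
    {
        struct mrd_header h = build_mrd(0xFEDC0000u, 16, 0x0100, 7);
        printf("MRd: addr %08X, %u DW, tag %u\n",
               (unsigned)h.address, (unsigned)h.length_dw, (unsigned)h.tag);
        return 0;
    }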
Only when sufficient credits are available is the memory read request TLP sent to the Data Link
Layer. The Data Link Layer adds a 12-bit sequence ID and a 32-bit LCRC calculated over the entire
packet. A copy of the TLP, with its sequence ID and LCRC, is stored in the replay buffer.
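A rough model of this Data Link Layer step follows. The sequence numbering and replay buffer are
simplified, and the checksum shown is only a placeholder, not the actual LCRC algorithm (covered in
Chapter 5):

    /* Illustrative model of the transmit-side Data Link Layer step; the
     * checksum below is a placeholder, not the actual LCRC algorithm. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define REPLAY_SLOTS 32

    struct replay_entry { uint16_t seq; uint8_t tlp[64]; size_t len; uint32_t lcrc; };

    static struct replay_entry replay_buf[REPLAY_SLOTS]; /* copies kept until ACKed */
    static uint16_t next_seq;                             /* 12-bit sequence counter */

    static uint32_t lcrc32(const uint8_t *p, size_t n)    /* placeholder CRC         */
    {
        uint32_t c = 0xFFFFFFFFu;
        while (n--) c = (c << 5) ^ (c >> 7) ^ *p++;
        return c;
    }

    static void dll_transmit(const uint8_t *tlp, size_t len)
    {
        struct replay_entry *e = &replay_buf[next_seq % REPLAY_SLOTS];
        e->seq   = next_seq;
        next_seq = (next_seq + 1) & 0x0FFF;     /* sequence ID wraps at 12 bits */
        memcpy(e->tlp, tlp, len);
        e->len   = len;
        e->lcrc  = lcrc32(tlp, len);
        printf("TLP seq %u, LCRC %08X handed to the Physical Layer\n",
               (unsigned)e->seq, (unsigned)e->lcrc);
    }

    int main(void)
    {
        uint8_t mrd[16] = {0};                  /* stand-in for a 3DW MRd TLP  */
        dll_transmit(mrd, sizeof mrd);
        return 0;
    }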
This packet is forwarded to the Physical Layer, which appends a Start symbol and an End symbol to
the packet. The packet is byte striped across the available Lanes, scrambled, and 8b/10b encoded.
Finally, the packet is converted to a serial bit stream on all Lanes and transmitted differentially
across the Link to the neighboring completer device.
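Byte striping itself is a simple round-robin distribution of packet bytes across the configured Lanes,
as the following illustrative snippet shows (the buffer sizes and function name are arbitrary):

    /* Illustrative only: consecutive packet bytes are dealt round-robin across
     * the configured Lanes (byte 0 -> Lane 0, byte 1 -> Lane 1, ...). */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static void byte_stripe(const uint8_t *pkt, size_t len, int lanes,
                            uint8_t lane_buf[][64], size_t lane_len[])
    {
        memset(lane_len, 0, (size_t)lanes * sizeof lane_len[0]);
        for (size_t i = 0; i < len; i++) {
            int lane = (int)(i % (size_t)lanes);
            lane_buf[lane][lane_len[lane]++] = pkt[i];
        }
    }

    int main(void)
    {
        uint8_t pkt[12] = { 0,1,2,3,4,5,6,7,8,9,10,11 };
        uint8_t lane_buf[4][64];
        size_t  lane_len[4];
        byte_stripe(pkt, sizeof pkt, 4, lane_buf, lane_len);
        for (int l = 0; l < 4; l++)             /* Lane 0 carries bytes 0,4,8  */
            printf("Lane %d carries %zu bytes\n", l, lane_len[l]);
        return 0;
    }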
The completer converts the incoming serial bit stream back to 10b symbols while assembling
the packet in an elastic buffer. The 10b symbols are converted back to bytes and the bytes
from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and
removed. The resultant TLP is sent to the Data Link Layer.
The completer Data Link Layer checks for LCRC errors in the received TLP and checks the
Sequence ID for missing or out-of-sequence TLPs. Assume no error. The Data Link Layer
creates an ACK DLLP which contains the same sequence ID as contained in the memory read
request TLP received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the
Physical Layer which transmits the ACK DLLP to the requester.
The requester Physical Layer reassembles the incoming ACK DLLP and sends it up to the Data Link
Layer, which evaluates the sequence ID and compares it with the TLPs stored in the replay buffer.
The stored memory read request TLP associated with the received ACK is discarded from the
replay buffer. Had a NAK DLLP been received instead, the requester would have re-sent a copy of
the stored memory read request TLP.
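This replay-buffer bookkeeping amounts to the short sketch below. The two helper functions are
stubs standing in for real replay-buffer management; the timers and full ACK/NAK state machine are
described in Chapter 5:

    /* Sketch of the replay-buffer bookkeeping; the two helpers are stubs
     * standing in for real replay-buffer management. */
    #include <stdint.h>
    #include <stdio.h>

    static void purge_replay_buffer_through(uint16_t seq)
    {
        printf("retire stored TLPs with sequence ID <= %u\n", (unsigned)seq);
    }

    static void resend_replay_buffer(void)
    {
        printf("replay remaining TLPs in original order\n");
    }

    static void dll_handle_ack_nak(uint16_t acked_seq, int is_nak)
    {
        purge_replay_buffer_through(acked_seq); /* ACK and NAK both retire TLPs
                                                   up to and including acked_seq */
        if (is_nak)
            resend_replay_buffer();             /* NAK: resend what remains      */
    }

    int main(void)
    {
        dll_handle_ack_nak(41, 0);    /* an ACK for sequence ID 41               */
        dll_handle_ack_nak(57, 1);    /* a NAK: retire through 57, replay the rest */
        return 0;
    }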
In the meantime, the completer's Data Link Layer strips the sequence ID and LCRC fields from the
memory read request TLP and forwards it to the Transaction Layer.
The Transaction Layer receives the memory read request TLP in the appropriate virtual channel
buffer associated with the TLP's traffic class. The Transaction Layer checks for ECRC errors. It
forwards the contents of the header (address, requester ID, memory read transaction type,
amount of data requested, traffic class etc.) to the completer Device Core/Software Layer.
Refer to Figure 2-32 on page 99 during the following discussion. To service the memory read
request, the completer Device Core/Software Layer sends the following information to the
Transaction Layer:
Figure 2-32. Completion with Data Phase
Requester ID and Tag copied from the original memory read request, transaction type of
completion with data (CplD), requested amount of data with data length field, traffic class if
other than TC0, attributes to indicate if 'relaxed ordering' and 'no snoop' bits should be set or
clear (these bits are copied from the original memory read request). Finally, a completion
status of successful completion (SC) is sent.
The Transaction layer uses this information to build a CplD TLP. The exact TLP packet format is
described in a later chapter. A 3 DW header is created. In addition, the Transaction Layer adds
its own completer ID to the header. The TD (transaction digest present) bit in the TLP header is
set if a 32-bit End-to-End CRC is added to the tail portion of the TLP. The TLP includes the
data payload. The flow control logic confirms sufficient "credits" are available (obtained from
the requester device) for the virtual channel associated with the traffic class used.
Only when sufficient credits are available is the CplD TLP sent to the Data Link Layer. The Data
Link Layer adds a 12-bit sequence ID and a 32-bit LCRC calculated over the entire packet. A copy
of the TLP, with its sequence ID and LCRC, is stored in the replay buffer.
This packet is forwarded to the Physical Layer, which appends a Start symbol and an End symbol to
the packet. The packet is byte striped across the available Lanes, scrambled, and 8b/10b encoded.
Finally, the CplD packet is converted to a serial bit stream on all Lanes and transmitted differentially
across the Link to the neighboring requester device.
The requester converts the incoming serial bit stream back to 10b symbols while assembling
the packet in an elastic buffer. The 10b symbols are converted back to bytes and the bytes
from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and
removed. The resultant TLP is sent to the Data Link Layer.
The Data Link Layer checks for LCRC errors in the received CplD TLP and checks the
Sequence ID for missing or out-of-sequence TLPs. Assume no error. The Data Link Layer
creates an ACK DLLP which contains the same sequence ID as contained in the CplD TLP
received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the Physical
Layer which transmits the ACK DLLP to the completer.
The completer Physical Layer reassembles the incoming ACK DLLP and sends it up to the Data Link
Layer, which evaluates the sequence ID and compares it with the TLPs stored in the replay buffer.
The stored CplD TLP associated with the received ACK is discarded from the replay buffer. Had a
NAK DLLP been received instead, the completer would have re-sent a copy of the stored CplD
TLP.
In the meantime, the requester Transaction Layer receives the CplD TLP in the virtual channel
buffer mapped to the TLP's traffic class. The Transaction Layer uses the tag in the CplD TLP header
to associate the completion with the original request, and checks for ECRC errors. It forwards the
header contents and data payload, including the Completion Status, to the requester Device
Core/Software Layer. The memory read transaction is done.
Hot Plug
PCI Express supports native hot plug, though hot-plug support in a device is not mandatory.
Some of the elements found in a PCI Express hot-plug system are:
Indicators which show the power and attention state of the slot.
An MRL (Manually-operated Retention Latch) Sensor that allows the port and system software to detect the MRL being opened.
An Electromechanical Interlock which prevents removal of add-in cards while the slot is powered.
When a port has no connection or a removal event occurs, the port transmitter moves to the
electrical high impedance detect state. The receiver remains in the electrical low impedance
state.
PCI Express Performance and Data Transfer Efficiency
As of May 2003, no realistic performance and efficiency numbers were available. However,
Table 2-3 shows aggregate bandwidth numbers for various Link widths after factoring in the
overhead of 8b/10b encoding.
Table 2-3. PCI Express Aggregate Throughput for Various Link Widths
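The aggregate numbers follow directly from the 2.5 Gbit/s per-Lane signaling rate and the 20%
8b/10b symbol overhead. The small illustrative calculation below reproduces them for the common
Link widths:

    /* Illustrative calculation: 2.5 Gbit/s per Lane per direction, times 8/10
     * for 8b/10b symbol overhead, summed over both directions of the
     * dual-simplex Link. */
    #include <stdio.h>

    int main(void)
    {
        const int widths[] = { 1, 2, 4, 8, 12, 16, 32 };
        for (int i = 0; i < 7; i++) {
            int w = widths[i];
            double per_dir   = 2.5 * (8.0 / 10.0) / 8.0 * w;  /* Gbit/s -> GByte/s */
            double aggregate = 2.0 * per_dir;
            printf("x%-2d  %5.2f GB/s per direction  %5.2f GB/s aggregate\n",
                   w, per_dir, aggregate);
        }
        return 0;   /* x1 -> 0.25 / 0.50 GB/s ... x32 -> 8.00 / 16.00 GB/s */
    }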
DLLPs are 2 doublewords in size. The ACK/NAK and flow control protocol utilize DLLPs, but it
is not expected that these DLLPs will use up a significant portion of the bandwidth.
The remainder of the bandwidth is available for TLPs. Six to seven doublewords of each TLP are
overhead associated with the Start and End framing symbols, sequence ID, TLP header, ECRC,
and LCRC fields. The remainder of the TLP contains between 0 and 1024 doublewords of data
payload. Bus efficiency is therefore quite low when small packets are transmitted, and very high
when TLPs carry large data payloads.
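A back-of-the-envelope calculation makes the efficiency point, using the 6-7 DW overhead figure
from above (the percentages are illustrative only):

    /* Illustrative efficiency estimate using the 6-7 DW overhead figure above. */
    #include <stdio.h>

    static double tlp_efficiency(int payload_dw, int overhead_dw)
    {
        return (double)payload_dw / (double)(payload_dw + overhead_dw);
    }

    int main(void)
    {
        printf("1 DW payload:    %5.1f%%\n", 100.0 * tlp_efficiency(1, 7));
        printf("64 DW payload:   %5.1f%%\n", 100.0 * tlp_efficiency(64, 7));
        printf("1024 DW payload: %5.1f%%\n", 100.0 * tlp_efficiency(1024, 7));
        return 0;    /* roughly 12%, 90%, and 99% respectively */
    }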
Packets can be transmitted back-to-back without the Link going idle. Thus the Link can be
100% utilized.
The switch does not introduce any arbitration overhead when forwarding incoming packets from
multiple ingress ports to one egress port. However, the effect of the Quality of Service protocol on
actual bandwidth for a given application remains to be seen.
There is overhead associated with the split transaction protocol, especially for read
transactions. For a read request TLP, the data payload is contained in the completion. This
factor has to be accounted for when determining the effective performance of the bus. Posted
write transactions improve the efficiency of the fabric.
Switches support cut-through mode. That is to say that an incoming packet can be immediately
forwarded to an egress port for transmission without the switch having to buffer up the packet.
The latency for packet forwarding through a switch can be very small allowing packets to travel
from one end of the PCI Express fabric to another end with very small latency.
Part Two: Transaction Protocol
Chapter 3. Address Spaces & Transaction Routing
This Chapter
Introduction
As illustrated in Figure 3-1 on page 106, a PCI Express topology consists of independent,
point-to-point links connecting each device with one or more neighbors. As traffic arrives at the
inbound side of a link interface (called the ingress port), the device checks for errors, then
makes one of three decisions:
Accept (consume) the traffic because the device itself is the intended target
Forward the traffic toward the intended target through an appropriate egress port
Reject the traffic because it is neither the intended target nor an interface to it (note that
there are also other reasons why traffic may be rejected)
Assuming a link is fully operational, the physical layer receiver interface of each device is
prepared to monitor the logical idle condition and detect the arrival of the three types of link
traffic: Ordered Sets, DLLPs, and TLPs. Using control (K) symbols which accompany the traffic
to determine framing boundaries and traffic type, PCI Express devices then make a distinction
between traffic which is local to the link vs. traffic which may require routing to other links (e.g.
TLPs). Local link traffic, which includes Ordered Sets and Data Link Layer Packets (DLLPs),
isn't forwarded and carries no routing information. Transaction Layer Packets (TLPs) can and
do move from link to link, using routing information contained in the packet headers.
It should be apparent in Figure 3-1 on page 106 that devices with multiple PCI Express ports
are responsible for handling their own traffic as well as forwarding other traffic between ingress
ports and any enabled egress ports. Also note that while peer-peer transaction support is
required of switches, it is optional for a multi-port Root Complex. It is up to the system designer
to account for peer-to-peer traffic when selecting devices and laying out a motherboard.
It should also be apparent in Figure 3-1 on page 106 that endpoint devices have a single link
interface and lack the ability to route inbound traffic to other links. For this reason, and because
they don't reside on shared busses, endpoints never expect to see ingress port traffic which is
not intended for them (this is different than shared-bus PCI(X), where devices commonly
decode addresses and commands not targeting them). Endpoint routing is limited to accepting
or rejecting transactions presented to them.
Ordered Sets
These are sent by each physical layer transmitter to the physical layer of the corresponding
receiver to initiate link training, compensate for clock tolerance, or transition a link to and from
the Electrical Idle state. As indicated in Table 3-1 on page 109, there are five types of Ordered
Sets.
Each ordered set is constructed of 10-bit control (K) symbols that are created within the
physical layer. These symbols have a common name as well as an alphanumeric code that
defines the 10-bit pattern of 1s and 0s of which they are composed. For example, the SKP
(Skip) symbol has a 10-bit value represented as K28.0.
Figure 3-2 on page 110 illustrates the transmission of Ordered Sets. Note that each ordered
set is fixed in size, consisting of 4 or 16 characters. Again, the receiver is required to consume
them as they are sent. Note that the COM control symbol (K28.5) is used to indicate the start
of any ordered set.
Training Sequence One (TS1): COM, Lane ID, 14 more symbols. Used in link training to align and synchronize the incoming bit stream at startup, convey reset, and other functions.
Electrical Idle (IDLE): COM, 3 IDL symbols. Indicates that the link should be brought to a lower power state (L0s, L1, L2).
Acknowledge (Ack): The receiver's Data Link Layer sends an Ack to indicate that no CRC or other errors have been encountered in received TLP(s). The transmitter retains a copy of each TLP until it is Ack'd.
No Acknowledge (Nak): The receiver's Data Link Layer sends a Nak to indicate that a TLP was received with a CRC or other error. All TLPs remaining in the transmitter's Retry Buffer must be resent, in the original order.
PM_Enter_L1; PM_Enter_L23: Following a software configuration space access that causes a device power management event, a downstream device requests entry to the Link L1 or L2/L3 state.
InitFC1-P, InitFC1-NP, InitFC1-Cpl: Flow Control Initialization Type One DLLP awarding posted (P), non-posted (NP), or completion (Cpl) flow control credits.
InitFC2-P, InitFC2-NP, InitFC2-Cpl: Flow Control Initialization Type Two DLLP confirming the award of InitFC1 posted (P), non-posted (NP), or completion (Cpl) flow control credits.
UpdateFC-P, UpdateFC-NP, UpdateFC-Cpl: Flow Control Credit Update DLLP awarding posted (P), non-posted (NP), or completion (Cpl) flow control credits.
As described in Table 3-2 on page 111 and shown in Figure 3-3 on page 112, there are three
major types of DLLPs: Ack/Nak, Power Management (several variants), and Flow Control. In
addition, a vendor-specific DLLP is permitted in the specification. Each DLLP is 8 bytes,
including a Start Of DLLP (SDP) byte, 2-byte CRC, and an End Of Packet (END) byte in
addition to the 4 byte DLLP core (which includes the type field and any required attributes).
Refer to "Data Link Layer Packets" on page 198 for a thorough discussion of Data Link Layer
packets.
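Seen at the byte level, a framed DLLP can be pictured roughly as follows; the struct and its field
names are illustrative only, and on the wire the SDP and END framing symbols are actually 10-bit
control symbols:

    /* Illustrative byte-level view; on the wire the SDP and END framing
     * symbols are 10-bit control symbols, shown here as one byte each. */
    #include <stdio.h>
    #include <stdint.h>

    struct framed_dllp {
        uint8_t sdp;        /* Start Of DLLP framing symbol                  */
        uint8_t core[4];    /* DLLP type field plus type-specific attributes */
        uint8_t crc[2];     /* 16-bit CRC computed over the 4-byte core      */
        uint8_t end;        /* End Of Packet framing symbol                  */
    };

    int main(void)
    {
        printf("framed DLLP occupies %zu bytes\n", sizeof(struct framed_dllp));
        return 0;                               /* prints 8, as described above */
    }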
Transaction Layer Packet Routing Basics
The third class of link traffic originates in the Transaction Layer of one device and targets the
Transaction Layer of another device. These Transaction Layer Packets (TLPs) are forwarded
from one link to another as necessary, subject to the routing mechanisms and rules described in
the following sections. Note that other chapters in this book describe additional aspects of
Transaction Layer Packet handling, including Flow Control, Quality Of Service, Error Handling,
Ordering rules, etc. The term transaction is used here to describe the exchange of information
using Transaction Layer Packets. Because Ordered Sets and DLLPs carry no routing
information and are not forwarded, the routing rules described in the following sections apply
only to TLPs.
As transactions are carried out between PCI Express requesters and completers, four
separate address spaces are used: Memory, IO, Configuration, and Message. The basic use
of each address space is described in Table 3-3 on page 113.
Memory (Read, Write): Transfer data to or from a location in the system memory map.
IO (Read, Write): Transfer data to or from a location in the system IO map.
Configuration (Read, Write): Transfer data to or from a location in the configuration space of a PCI-compatible device.
Message (Baseline, Vendor-specific): General in-band messaging and event reporting (without consuming memory or IO address resources).
Accesses to the four address spaces in PCI Express are accomplished using split-transaction
requests and completions.
Split Transactions: Better Performance, More Overhead
The split transaction protocol is an improvement over earlier bus protocols (e.g. PCI) which
made extensive use of bus wait-states or delayed transactions (retries) to deal with latencies in
accessing targets. In PCI Express, the completion following a request is initiated by the
completer only when it has data and/or status ready for delivery. The fact that the completion is
separated in time from the request which caused it also means that two separate TLPs are
generated, with independent routing for the request TLP and the Completion TLP. Note that
while a link is free for other activity in the time between a request and its subsequent
completion, a split-transaction protocol involves some additional overhead as two complete
TLPs must be generated to carry out a single transaction.
Figure 3-4 on page 115 illustrates the request-completion phases of a PCI Express split
transaction. This example represents an endpoint read from system memory.
To mitigate the penalty of the request-completion latency, messages and some write
transactions in PCI Express are posted, meaning the write request (including data) is sent, and
the transaction is over from the requester's perspective as soon as the request is sent out of
the egress port; responsibility for delivery is now the problem of the next device. In a multi-level
topology, this has the advantage of being much faster than waiting for the entire request-
completion transit, but as in all posting schemes uncertainty exists concerning when (and if)
the transaction completed successfully at the ultimate recipient.
In PCI Express, write posting to memory is considered acceptable in exchange for the higher
performance. On the other hand, writes to IO and configuration space may change device
behavior, and write posting is not permitted. A completion will always be sent to report status of
the IO or configuration write operation.
Table 3-4 on page 116 lists PCI Express posted and non-posted transactions.
Memory Write: All Memory Write requests are posted. No completion is expected or sent.
Memory Read, Memory Read Lock: All memory read requests are non-posted. A completion with data (CplD or CplDLk) will be returned by the completer with the requested data and to report the status of the memory read.
IO Write: All IO Write requests are non-posted. A completion without data (Cpl) will be returned by the completer to report the status of the IO write operation.
IO Read: All IO read requests are non-posted. A completion with data (CplD) will be returned by the completer with the requested data and to report the status of the IO read operation.
Configuration Write (Type 0 and Type 1): All Configuration Write requests are non-posted. A completion without data (Cpl) will be returned by the completer to report the status of the configuration space write operation.
Configuration Read (Type 0 and Type 1): All configuration read requests are non-posted. A completion with data (CplD) will be returned by the completer with the requested data and to report the status of the read operation.
Message, Message With Data: While the routing method varies, all message transactions are handled in the same manner as memory writes in that they are considered posted requests.
Memory Read (MRd), Memory Read Lock (MRdLk), Memory Write (MWr): Address Routing
Configuration Read Type 0 (CfgRd0), Configuration Read Type 1 (CfgRd1), Configuration Write Type 0 (CfgWr0), Configuration Write Type 1 (CfgWr1): ID Routing
As indicated in Table 3-5 on page 117, memory and IO transactions are routed through the PCI
Express topology using address routing to reference system memory and IO maps, while
configuration cycles use ID routing to reference the completer's (target's) logical position within
the PCI-compatible bus topology (using Bus Number, Device Number, Function Number in place
of a linear address). Both address routing and ID routing are completely compatible with routing
methods used in the PCI and PCIX protocols when performing memory, IO, or configuration
transactions. PCI Express completions also use the ID routing scheme.
PCI Express adds the third routing method, implicit routing, which is an option when sending
messages. In implicit routing, neither address nor ID routing information applies; the packet is
routed based on a code in the packet header indicating it is destined for device(s) with known,
fixed locations (the Root Complex, the next receiver, etc.).
While limited in the cases it can support, implicit routing simplifies routing of messages. Note
that messages may optionally use address or ID routing instead.
The target claims the transaction based on decoding and comparing the transaction start
address with ranges it has been programmed to respond to in its configuration space Base
Address Registers.
If the transaction involves bursting, then addresses are indexed after each data transfer.
While PCI Express also supports load and store transactions with its memory and IO
transactions, it adds in-band messages. The main reason for this is that the PCI Express
protocol seeks to (and does) eliminate many of the sideband signals related to interrupts, error
handling, and power management which are found in PCI(X)-based systems. Elimination of
signals is very important in an architecture with the scalability possible with PCI Express. It
would not be efficient to design a PCI Express device with a two lane link and then saddle it
with numerous additional signals to handle auxiliary functions.
The PCI Express protocol replaces most sideband signals with a variety of in-band packet
types; some of these are conveyed as Data Link Layer packets (DLLPs) and some as
Transaction Layer packets (TLPs).
One side effect of using in-band messages in place of hard-wired sideband signals is the
problem of delivering the message to the proper recipient in a topology consisting of numerous
point-to-point links. The PCI Express protocol provides maximum flexibility in routing message
TLPs; they may use address routing, ID routing, or the third method, implicit routing. Implicit
routing takes advantage of the fact that, due to their architecture, switches and other multi-port
devices have a fundamental sense of upstream and downstream, and where the Root Complex
is to be found. Because of this, a message header can be routed implicitly with a simple code
indicating that it is intended for the Root Complex, a broadcast downstream message, should
terminate at the next receiver, etc.
The advantage of implicit routing is that it eliminates the need to assign a set of memory
mapped addresses for all of the possible message variants and program all of the devices to
use them.
Figure 3-5. Transaction Layer Packet Generic 3DW And 4DW Headers
General
As TLPs arrive at an ingress port, they are first checked for errors at both the physical and
data link layers of the receiver. Assuming there are no errors, TLP routing is performed; basic
steps include:
1. The TLP header Type and Format fields in the first DWord are examined to
determine the size and format of the remainder of the packet.
2. Depending on the routing method associated with the packet, the device determines if it is
the intended recipient; if so, it will accept (consume) the TLP. If it is not the recipient, and it is a
multi-port device, it will forward the TLP to the appropriate egress port--subject to the rules for
ordering and flow control for that egress port.
3. If it is neither the intended recipient nor a device in the path to it, it will generally reject the
packet as an Unsupported Request (UR).
Table 3-6 on page 120 below summarizes the encodings used in TLP header Type and Format
fields. These two fields, used together, indicate TLP format and routing to the receiver.
Memory Read Request (MRd): Fmt = 00 (3DW, no data) or 01 (4DW, no data); Type = 0 0000
Memory Read Lock Request (MRdLk): Fmt = 00 (3DW, no data) or 01 (4DW, no data); Type = 0 0001
Memory Write Request (MWr): Fmt = 10 (3DW, w/ data) or 11 (4DW, w/ data); Type = 0 0000
Message Request (Msg): Fmt = 01 (4DW, no data); Type = 1 0RRR* (for RRR, see routing subfield)
Message Request W/Data (MsgD): Fmt = 11 (4DW, w/ data); Type = 1 0RRR* (for RRR, see routing subfield)
Address Routing
PCI Express transactions using address routing reference the same system memory and IO
maps that PCI and PCIX transactions do. Address routing is used to transfer data to or from
memory, memory mapped IO, or IO locations. Memory transaction requests may carry either
32 bit addresses using the 3DW TLP header format, or 64 bit addresses using the 4DW TLP
header format. IO transaction requests are restricted to 32 bits of address using the 3DW TLP
header format, and should only target legacy devices.
Figure 3-6 on page 122 depicts generic system memory and IO maps. Note that the size of the
system memory map is a function of the range of addresses that devices are capable of
generating (often dictated by the CPU address bus). As in PCI and PCI-X, PCI Express permits
either 32 bit or 64 bit memory addressing. The size of the system IO map is limited to 32 bits
(4GB), although in many systems only the lower 16 bits (64KB) are used.
If the Type field in a received TLP indicates address routing is to be used, then the Address
Fields in the header are used to perform the routing check. There are two cases: 32-bit
addresses and 64-bit addresses.
For IO or 32-bit memory requests, only 32 bits of address are contained in the header.
Devices targeted with these TLPs will reside below the 4GB memory or IO address boundary.
Figure 3-7 on page 123 depicts this case.
For 64-bit memory requests, 64 bits of address are contained in the header. Devices targeted
with these TLPs will reside above the 4GB memory boundary. Figure 3-8 on page 124 shows
this case.
If the Type field in a received TLP indicates address routing is to be used, then an endpoint
device simply checks the address in the packet header against each of its implemented BARs
in its Type 0 configuration space header. As it has only one link interface, it will either claim the
packet or reject it. Figure 3-9 on page 125 illustrates this case.
General
If the Type field in a received TLP indicates address routing is to be used, then a switch first
checks to see if it is the intended completer. It compares the header address against target
addresses programmed in its two BARs. If the address falls within the range, it consumes the
packet. This case is indicated by (1) in Figure 3-10 on page 126. If the header address field
does not match a range programmed in a BAR, it then checks the Type 1 configuration space
header for each downstream link. It checks the non-prefetchable memory (MMIO) and
prefetchable Base/Limit registers if the transaction targets memory, or the I/O Base and Limit
registers if the transaction targets I/O address space. This check is indicated by (2) in Figure
3-10 on page 126.
1. If the address-routed packet address falls in the range of one of its secondary
bridge interface Base/Limit register sets, it will forward the packet downstream.
2. If the address-routed packet was moving downstream (was received on the primary
interface) and it does not map to any BAR or downstream link Base/Limit registers, it will be
handled as an unsupported request on the primary link.
3. Upstream address-routed packets are always forwarded to the upstream link if they do not
target an internal location or another downstream link.
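Taken together, the two checks described above reduce to the following sketch. The structures and
names are simplified, invented stand-ins for the real Type 1 header registers, and only the
prefetchable memory window is modeled; a real switch applies the same test to the non-prefetchable
(MMIO) and IO Base/Limit registers as well:

    /* Illustrative only: simplified stand-ins for the Type 1 header registers;
     * just the prefetchable window is modeled here. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    struct downstream_port {                  /* one secondary bridge interface */
        uint64_t pref_base, pref_limit;       /* prefetchable memory window     */
    };

    enum route { CONSUME, FORWARD_DOWN, FORWARD_UP, UNSUPPORTED };

    static enum route route_mem_addr(uint64_t addr, bool from_primary,
                                     uint64_t bar_base, uint64_t bar_size,
                                     const struct downstream_port *p, int nports,
                                     int *egress)
    {
        if (addr >= bar_base && addr < bar_base + bar_size)
            return CONSUME;                                     /* check (1) */
        for (int i = 0; i < nports; i++)
            if (addr >= p[i].pref_base && addr <= p[i].pref_limit) {
                *egress = i;
                return FORWARD_DOWN;                            /* check (2) */
            }
        return from_primary ? UNSUPPORTED : FORWARD_UP;
    }

    int main(void)
    {
        struct downstream_port ports[2] = {
            { 0x180000000ull, 0x1FFFFFFFFull },   /* port 0: 6GB up to 8GB      */
            { 0x200000000ull, 0x2FFFFFFFFull },   /* port 1: 8GB up to 12GB     */
        };
        int egress = -1;
        enum route r = route_mem_addr(0x1C0000000ull, true,
                                      0xF0000000ull, 0x1000, ports, 2, &egress);
        printf("decision %d, egress port %d\n", (int)r, egress);
        return 0;              /* decision 1 (forward downstream), egress port 0 */
    }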
ID Routing
ID routing is based on the logical position (Bus Number, Device Number, Function Number) of a
device function within the PCI bus topology. ID routing is compatible with routing methods used
in the PCI and PCIX protocols when performing Type 0 or Type 1 configuration transactions. In
PCI Express, it is also used for routing completions and may be used in message routing as
well.
ID Bus Number, Device Number, Function Number Limits
PCI Express supports the same basic topology limits as PCI and PCI-X:
A maximum of 32 devices per bus/link. Of course, while a PCI(X) bus or the internal bus of a
switch may host more than one downstream bridge interface, external PCI Express links are
always point-to-point with only two devices per link. The downstream device on an external link
is device 0.
A significant difference in PCI Express over PCI is the provision for extending the amount of
configuration space per function from 256 bytes to 4KB. Refer to the "Configuration Overview"
on page 711 for a detailed description of the compatible and extended areas of PCI Express
configuration space.
If the Type field in a received TLP indicates ID routing is to be used, then the ID fields in the
header are used to perform the routing check. There are two cases: ID routing with a 3DW
header and ID routing with a 4DW header.
Figure 3-11 on page 128 illustrates a TLP using ID routing and the 3DW header.
Figure 3-12 on page 129 illustrates a TLP using ID routing and the 4DW header.
If the Type field in a received TLP indicates ID routing is to be used, then an endpoint device
simply checks the ID field in the packet header against its own Bus Number, Device Number,
and Function Number(s). In PCI Express, each device "captures" (and remembers) its own Bus
Number and Device Number contained in TLP header bytes 8-9 each time a configuration write
(Type 0) is detected on its primary link. At reset, all bus and device numbers in the system
revert to 0, so a device will not respond to transactions other than configuration cycles until at
least one configuration write cycle (Type 0) has been performed. Note that the PCI Express
protocol does not define a configuration space location where the device function is required to
store the captured Bus Number and Device Number information, only that it must do it.
Once again, as it has only one link interface, an endpoint will either claim an ID-routed packet
or reject it. Figure 3-11 on page 128 illustrates this case.
A Switch Receives an ID-Routed TLP: Two Checks
If the Type field in a received TLP indicates ID routing is to be used, then a switch first checks
to see if it is the intended completer. It compares the header ID field against its own Bus
Number, Device Number, and Function Number(s). This is indicated by (1) in Figure 3-13 on
page 131. As in the case of an endpoint, a switch captures its own Bus Number and Device
number each time a configuration write (Type 0) is detected on its primary link interface. If the
header ID agrees with the ID of the switch, it consumes the packet. If the ID field does not
match its own, it then checks the Secondary-Subordinate Bus Number registers in the
configuration space for each downstream link. This check is indicated by (2) in Figure 3-13 on
page 131.
1. If the ID-routed packet matches the range of one of its secondary bridge interface
Secondary-Subordinate registers, it will forward the packet downstream.
2. If the ID-routed packet was moving downstream (was received on the primary interface) and
it does not map to any downstream interface, it will be handled as an unsupported request on
the primary link.
3. Upstream ID-routed packets are always forwarded to the upstream link if they do not target
an internal location or another downstream link.
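The corresponding ID-routing decision reduces to the sketch below, again with simplified, invented
structures standing in for the captured Bus/Device numbers and the configuration registers:

    /* Illustrative only: invented structures in place of the real captured
     * Bus/Device numbers and Secondary/Subordinate Bus Number registers. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    struct ds_bridge { uint8_t secondary, subordinate; };

    enum id_route { ID_CONSUME, ID_FORWARD_DOWN, ID_FORWARD_UP, ID_UNSUPPORTED };

    static enum id_route route_by_id(uint8_t bus, uint8_t dev, bool from_primary,
                                     uint8_t my_bus, uint8_t my_dev,
                                     const struct ds_bridge *br, int nbr, int *egress)
    {
        if (bus == my_bus && dev == my_dev)        /* check (1): our captured ID */
            return ID_CONSUME;
        for (int i = 0; i < nbr; i++)              /* check (2): Sec..Sub ranges */
            if (bus >= br[i].secondary && bus <= br[i].subordinate) {
                *egress = i;
                return ID_FORWARD_DOWN;
            }
        return from_primary ? ID_UNSUPPORTED : ID_FORWARD_UP;
    }

    int main(void)
    {
        struct ds_bridge br[2] = { { 2, 4 }, { 5, 8 } };
        int egress = -1;
        enum id_route r = route_by_id(6, 0, true, 1, 0, br, 2, &egress);
        printf("decision %d, egress port %d\n", (int)r, egress);
        return 0;              /* decision 1 (forward downstream), egress port 1 */
    }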
Implicit Routing
Implicit routing is based on the intrinsic knowledge PCI Express devices are required to have
concerning upstream and downstream traffic and the existence of a single PCI Express Root
Complex at the top of the PCI Express topology. This awareness allows limited routing of
packets without the need to assign and include addresses with certain message packets.
Because the Root Complex generally implements power management and interrupt controllers,
as well as system error handling, it is either the source or recipient of most PCI Express
messages.
With the elimination of many sideband signals in the PCI Express protocol, alternate methods
are required to inform the host system when devices need service with respect to interrupts,
errors, power management, etc. PCI Express addresses this by defining a number of special
TLPs which may be used as virtual wires in conveying sideband events. Message groups
currently defined include:
Power Management
Error signaling
Vendor-specific messages
In systems where all or some of this event traffic should target the system memory map or a
logical location in the PCI bus topology, address routing and ID routing may be used in place of
implicit routing. If address or ID routing is chosen for a message, then the routing mechanisms
just described are applied in the same way as they would for other posted write packets.
As a message TLP moves between PCI Express devices, packet header fields indicate both
that it is a message, and whether it should be routed using address, ID, or implicitly.
If the Type field in a received message TLP indicates implicit routing is to be used, then the
routing sub-field in the header is also used to determine the message destination when the
routing check is performed. Figure 3-14 on page 133 illustrates a message TLP using implicit
routing.
Table 3-7 on page 134 summarizes the use of the TLP header Type field when a message is
being sent. As shown, the upper two bits of the 5 bit Type field indicate the packet is a
message, and the lower three bits are the routing sub-field which specify the routing method to
apply. Note that the 4DW header is always used with message TLPs, regardless of the routing
option selected.
If the Type field in a received message TLP indicates implicit routing is to be used, then an
endpoint device simply checks that the routing sub-field is appropriate for it. For example, an
endpoint may accept a broadcast message or a message which terminates at the receiver; it
won't accept messages which implicitly target the Root Complex.
If the Type field in a received message TLP indicates implicit routing is to be used, then a
switch device simply considers the ingress port it arrived on and whether the routing sub-field
code is appropriate for it. Some examples:
The switch may accept messages indicating implicit routing to the root complex on
secondary links; it will forward all of these upstream because it "knows" the location of the Root
Complex is on its primary side. It would not accept messages routed implicitly to the Root
Complex if they arrived on the primary link receive interface.
If the implicitly-routed message arrives on either upstream or downstream ingress ports, the
switch may consume the packet if routing indicates it should terminate at receiver.
If messages are routed using address or ID methods, a switch will simply perform normal
address checks in deciding whether to accept or forward it.
Plug-And-Play Configuration of Routing Options
PCI-compatible configuration space and PCI Express extended configuration space are
covered in detail in Part 6. For reference, the programming of three sets of configuration
space registers related to routing is summarized here.
PCI Express supports the basic 256 byte PCI configuration space common to all compatible
devices, including the Type 0 and Type 1 PCI configuration space header formats used by non-
bridge and switch/bridge devices, respectively. Devices may implement basic PCI-equivalent
functionality with no change to drivers or Operating System software.
PCI Express endpoint devices support a single PCI Express link and use the Type 0 (non-
bridge) format header. Switch/bridge devices support multiple links, and implement a Type 1
format header for each link interface. Figure 3-15 on page 136 illustrates a PCI Express
topology and the use of configuration space Type 0 and Type 1 header formats.
Figure 3-15. PCI Express Devices And Type 0 And Type 1 Header Use
Three sets of Base/Limit Register pairs are supported in the Type 1 header of switch/bridge
devices.
Figure 3-16 on page 137 illustrates the Type 0 and Type 1 PCI Express Configuration Space
header formats. Key routing registers are indicated.
Figure 3-16. PCI Express Configuration Space Type 0 and Type 1 Headers
The first of the configuration space registers related to routing are the Base Address Registers
(BARs). These are marked "<1" in Figure 3-16 on page 137, and are implemented by all
devices which require system memory, IO, or memory mapped IO (MMIO) addresses
allocated to them as targets. The location and use of BARs is compatible with PCI and PCI-X.
As shown in Figure 3-16 on page 137, a Type 0 configuration space header has 6 BARs
available for the device designer (at DW 4-9), while a Type 1 header has only two BARs (at
DW 4-5).
After discovering device resource requirements, system software programs each BAR with a
start address for a range of addresses the device may respond to as a completer (target).
Setup of BARs involves several things:
1. The device designer uses a BAR to hard-code a request for an allocation of one
block of prefetchable or non-prefetchable memory, or of IO addresses in the system
memory or IO map. A pair of adjacent BARs are concatenated if a 64-bit memory
request is being made.
Hard-coded bits in the BAR include an indication of the request type, the size of the request,
and whether the target device may be considered prefetchable (memory requests only).
During enumeration, all PCI-compatible devices are discovered and the BARs are examined by
system software to decode the request. Once the system memory and IO maps are
established, software programs upper bits in implemented BARs with the start address for the
block allocated to the target.
Figure 3-17 depicts the basic steps in setting up a BAR which is being used to track a 1 MB
block of prefetchable addresses for a device residing in the system memory map. In the
diagram, the BAR is shown at three points in the configuration process:
1. The uninitialized BAR in Figure 3-17 is as it looks after power-up or a reset. While
the designer has tied lower bits to indicate the request type and size, there is no
requirement about how the upper bits (which are read-write) must come up in a
BAR, so these bits are indicated with XXXXX. System software will first write all 1's
to the BAR to set all read-write bits = 1. Of course, the hard-coded lower bits are
not affected by the configuration write.
The second view of the BAR shown in Figure 3-17 is as it looks after configuration software
has performed the write of all 1's to it. The next step in configuration is a read of the BAR to
check the request. Table 3-8 on page 140 summarizes the results of this configuration read.
The third view of the BAR shown in Figure 3-17 on page 139 is as it looks after configuration
software has performed another configuration write (Type 0) to program the start address for
the block. In this example, the device start address is 2GB, so bit 31 is written = 1 (2^31 = 2GB)
and all other upper bits are written = 0's.
At this point the configuration of the BAR is complete. Once software enables memory address
decoding in the PCI command register, the device will claim memory transactions in the range
2GB to 2GB+1MB.
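Configuration software's probe-and-assign sequence can be modeled in a few lines. The snippet
below simulates the 1MB prefetchable BAR of this example rather than touching real hardware;
register layouts follow the description above:

    /* Simulates the 1MB, 32-bit, prefetchable memory BAR of this example:
     * bits 19:0 are hard-wired by the designer, bits 31:20 are read-write. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t bar_rw_bits;                  /* writable upper bits          */
    static const uint32_t HARDWIRED = 0x8;        /* prefetchable, 32-bit, memory */

    static uint32_t bar_read(void)        { return (bar_rw_bits & 0xFFF00000u) | HARDWIRED; }
    static void     bar_write(uint32_t v) { bar_rw_bits = v; }

    int main(void)
    {
        bar_write(0xFFFFFFFFu);                   /* step 1: write all 1s         */
        uint32_t probe = bar_read();              /* step 2: read the BAR back    */
        uint32_t size  = ~(probe & ~0xFu) + 1;    /* lowest writable bit -> size  */
        printf("request size = %u bytes (%u MB)\n",
               (unsigned)size, (unsigned)(size >> 20));

        bar_write(0x80000000u);                   /* step 3: program the 2GB base */
        printf("BAR now reads %08Xh\n", (unsigned)bar_read());
        return 0;            /* request size 1MB; BAR reads 80000008h afterwards */
    }

Running it prints a request size of 1MB, then shows the BAR reading back 80000008h once the 2GB
start address has been programmed.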
Table 3-8. Results Of Reading The BAR after Writing All "1s" To It
BAR Bits 2:1 - Read back as 00b, indicating the target only supports a 32-bit address decoder.
BAR Bits 19:4 - All read back as "0"; used to help indicate the size of the request (also see bit 20).
BAR Bits 31:20 - All read back as "1" because software has not yet programmed the upper bits with a start address for the block. Because bit 20 was the first bit (above bit 3) to read back as written (=1), the memory request size is 1MB (2^20 = 1MB).
Figure 3-18 on page 141 depicts the basic steps in setting up a pair of BARs being used to
track a 64 MB block of prefetchable addresses for a device residing in the system memory
map. In the diagram, the BARs are shown at three points in the configuration process:
1. The uninitialized BARs are as they look after power-up or a reset. The designer has
hard-coded lower bits of the lower BAR to indicate the request type and size; the
upper BAR bits are all read-write. System software will first write all 1's to both
BARs to set all read-write bits = 1. Of course, the hard-coded bits in the lower BAR
are unaffected by the configuration write.
The second view of the BARs in Figure 3-18 on page 141 shows them as they look after
configuration software has performed the write of all 1's to both. The next step in configuration
is a read of the BARs to check the request. Table 3-9 on page 142 summarizes the results of
this configuration read.
The third view of the BAR pair in Figure 3-18 on page 141 shows conditions after
configuration software has performed two configuration writes (Type 0) to program the two
halves of the 64-bit start address for the block. In this example, the device start address is
16GB, so bit 1 of the Upper BAR (address bit 33 in the BAR pair) is written = 1 (2^33 = 16GB);
all other read-write bits in both BARs are written = 0's.
Table 3-9. Results Of Reading The BAR Pair after Writing All "1s" To Both
Lower BAR, Bits 2:1 - Read back as 10b, indicating the target supports a 64-bit address decoder and that the first BAR is concatenated with the next.
Lower BAR, Bits 25:4 - All read back as "0"; used to help indicate the size of the request (also see bit 26).
Lower BAR, Bits 31:26 - All read back as "1" because software has not yet programmed the upper bits with a start address for the block. Because bit 26 was the first bit (above bit 3) to read back as written (=1), the memory request size is 64MB (2^26 = 64MB).
Upper BAR, Bits 31:0 - All read back as "1". These bits will be used as the upper 32 bits of the 64-bit start address programmed by system software.
BAR Setup Example Three: 256-Byte IO Request
Figure 3-19 on page 143 depicts the basic steps in setting up a BAR which is being used to
track a 256 byte block of IO addresses for a legacy PCI Express device residing in the system
IO map. In the diagram, the BAR is shown at three points in the configuration process:
1. The uninitialized BAR in Figure 3-19 is as it looks after power-up or a reset. System
software first writes all 1's to the BAR to set all read-write bits = 1. Of course, the
hard-coded bits are unaffected by the configuration write.
The second view of the BAR shown in Figure 3-19 on page 143 is as it looks after
configuration software has performed the write of all 1's to it. The next step in configuration is a
read of the BAR to check the request. Table 3-10 on page 144 summarizes the results of this
configuration read.
The third view of the BAR, shown in Figure 3-19 on page 143, is as it looks after configuration
software has performed another configuration write (Type 0) to program the start address for
the IO block. In this example, the device start address is 16KB, so bit 14 is written = 1 (2^14 =
16KB); all other upper bits are written = 0's.
At this point the configuration of the IO BAR is complete. Once software enables IO address
decoding in the PCI command register, the device will claim IO transactions in the range 16KB
to 16KB+256.
Table 3-10. Results Of Reading The IO BAR after Writing All "1s" To It
BAR Bits 7:2 - All read back as "0"; used to help indicate the size of the request (also see bit 8).
BAR Bits 31:8 - All read back as "1" because software has not yet programmed the upper bits with a start address for the block. Because bit 8 was the first bit (above bit 1) to read back as written (=1), the IO request size is 256 bytes (2^8 = 256).
General
The second set of configuration registers related to routing are also found in Type 1
configuration headers and used when forwarding address-routed TLPs. Marked "<2" in Figure
3-16 on page 137, these are the three sets of Base/Limit registers programmed in each bridge
interface to enable a switch/bridge to claim and forward address-routed TLPs to a secondary
bus. Three sets of Base/Limit Registers are needed because transactions are handled
differently (e.g. prefetching, write-posting, etc.) in the prefetchable memory, non-prefetchable
memory (MMIO), and IO address domains. The Base Register in each pair establishes the
start address for the community of downstream devices and the Limit Register defines the
upper address for that group of devices. The three sets of Base/Limit Registers include:
The Prefetchable Memory Base/Limit registers are located at DW 9, and the Prefetchable Memory
Base/Limit Upper registers at DW 10-11, within the Type 1 header. These registers track all
downstream prefetchable memory devices. Either 32-bit or 64-bit addressing can be supported
by these registers. If the Upper registers are not implemented, only 32 bits of memory
addressing are available, and the TLP headers mapping to this space will use the 3DW format. If
the Upper registers are implemented and system software maps the device above the 4GB
boundary, TLPs accessing the device will carry the 4DW header format. In the example shown in
Figure 3-20 on page 145, a 6GB prefetchable address range is being set up for the secondary link
of a switch.
Register programming in the example shown in Figure 3-20 on page 145 is summarized in Table
3-11.
Prefetchable Memory Base = 8001h: The upper 3 nibbles (800h) provide the most significant 3 digits of the 32-bit Base Address for prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be 00000h. The least significant nibble of this register value (1h) indicates that a 64-bit address decoder is supported and that the Upper Base/Limit registers are also used.
Prefetchable Memory Limit = FFF1h: The upper 3 nibbles (FFFh) provide the most significant 3 digits of the 32-bit Limit Address for prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be FFFFFh. The least significant nibble of this register value (1h) indicates that a 64-bit address decoder is supported and that the Upper Base/Limit registers are also used.
Prefetchable Memory Base Upper 32 Bits = 00000001h: Upper 32 bits of the 64-bit Base address for prefetchable memory behind this switch.
Prefetchable Memory Limit Upper 32 Bits = 00000002h: Upper 32 bits of the 64-bit Limit address for prefetchable memory behind this switch.
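The register values in this example can be derived mechanically from the window's start and end
addresses. The short sketch below assumes the layout described above (address bits 31:20 in
register bits 15:4, decoder type in the low nibble, address bits 63:32 in the Upper registers) and
reproduces the four values:

    /* Illustrative derivation of the register values in the 6GB example above. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t base  = 0x180000000ull;   /* window start: 6GB                 */
        uint64_t limit = 0x2FFFFFFFFull;   /* window end (inclusive)            */

        uint16_t base_reg    = (uint16_t)((((base  >> 20) & 0xFFF) << 4) | 0x1);
        uint16_t limit_reg   = (uint16_t)((((limit >> 20) & 0xFFF) << 4) | 0x1);
        uint32_t base_upper  = (uint32_t)(base  >> 32);
        uint32_t limit_upper = (uint32_t)(limit >> 32);

        printf("Base %04Xh  Limit %04Xh  Base Upper %08Xh  Limit Upper %08Xh\n",
               (unsigned)base_reg, (unsigned)limit_reg,
               (unsigned)base_upper, (unsigned)limit_upper);
        return 0;   /* Base 8001h, Limit FFF1h, Base Upper 00000001h, Limit Upper 00000002h */
    }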
Non-Prefetchable Memory Base/Limit (at DW 8). These registers are used to track all
downstream non-prefetchable memory (memory mapped IO) devices. Non-prefetchable
memory devices are limited to 32 bit addressing; TLPs targeting them always use the 3DW
header format.
Register programming in the example shown in Figure 3-21 on page 147 is summarized in Table
3-12.
Memory Base (Non-Prefetchable) = 1210h: The upper 3 nibbles (121h) provide the most significant 3 digits of the 32-bit Base Address for non-prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be 00000h. The least significant nibble of this register value (0h) is reserved and should be set = 0.
Memory Limit (Non-Prefetchable) = 1220h: The upper 3 nibbles (122h) provide the most significant 3 digits of the 32-bit Limit Address for non-prefetchable memory behind this switch; the lower 5 digits of the address are assumed to be FFFFFh. The least significant nibble of this register value (0h) is reserved and should be set = 0.
IO Base/Limit Registers
IO Base/Limit (at DW 7) and IO Base/Limit Upper registers (at DW 12). These registers are
used to track all downstream IO target devices. If the Upper Registers are used, then IO
address space may be extended to a full 32 bits (4GB). If they are not implemented, then IO
address space is limited to 16 bits (64KB). In either case, TLPs targeting these IO devices
always carry the 3DW header format.
Register programming in the example shown in Figure 3-22 on page 149 is summarized in Table
3-13 on page 150.
IO Limit = 41h: The upper nibble (4h) specifies the most significant hex digit of the 32-bit IO Limit address (the lower digits are FFFh). The lower nibble (1h) indicates that the device supports 32-bit IO behind the bridge interface. This also means the device implements the Upper IO Base/Limit register set, and those registers are concatenated with the Base/Limit registers.
IO Base Upper 16 Bits = 0000h: Upper 16 bits of the 32-bit Base address for IO behind this switch.
IO Limit Upper 16 Bits = 0000h: Upper 16 bits of the 32-bit Limit address for IO behind this switch.
The third set of configuration registers related to routing are used when forwarding ID-routed
TLPs, including configuration cycles and completions and optionally messages. These are
marked "<3" in Figure 3-16 on page 137. As in PCI, a switch/bridge interface requires three
registers: Primary Bus Number, Secondary Bus Number, and Subordinate bus number. The
function of these registers is summarized here.
The Primary Bus Number register contains the bus (link) number to which the upstream side of
a bridge (switch) is connected. In PCI Express, the primary bus is the one in the direction of the
Root Complex and host processor.
The Secondary Bus Number register contains the bus (link) number to which the downstream
side of a bridge (switch) is connected.
The Subordinate Bus Number register contains the highest bus (link) number on the
downstream side of a bridge (switch). The Subordinate and Secondary Bus Number registers
will contain the same value unless there is another bridge (switch) on the secondary side.
A Switch Is a Two-Level Bridge Structure
Because PCI does not natively support bridges with multiple downstream ports, PCI Express
switch devices appear logically as two-level PCI bridge structures, consisting of a single bridge
to the primary link and an internal PCI bus which hosts one or more virtual bridges to secondary
interfaces. Each bridge interface has an independent Type 1 format configuration header with
its own sets of Base/Limit Registers and Bus Number Registers. Figure 3-23 on page 152
illustrates the bus numbering associated with the external links and internal bus of a switch.
Note that the secondary bus on the primary link interface is the internal virtual bus, and that the
primary interface of all downstream link interfaces connect to the internal bus logically.
This Chapter
With the exception of the logical idle indication and physical layer Ordered Sets, all information
moves across an active PCI Express link in fundamental chunks called packets which are
comprised of 10 bit control (K) and data (D) symbols. The two major classes of packets
exchanged between two PCI Express devices are high level Transaction Layer Packets (TLPs),
and low-level link maintenance packets called Data Link Layer Packets (DLLPs). Collectively,
the various TLPs and DLLPs allow two devices to perform memory, IO, and Configuration
Space transactions reliably and use messages to initiate power management events, generate
interrupts, report errors, etc. Figure 4-1 on page 155 depicts TLPs and DLLPs on a PCI
Express link.
Some early bus protocols (e.g. PCI) allow transfers of indeterminate (and unlimited) size,
making identification of payload boundaries impossible until the end of the transfer. In addition,
an early transaction end might be signaled by either agent (e.g. target disconnect on a write or
pre-emption of the initiator during a read), resulting in a partial transfer. In these cases, it is
difficult for the sender of data to calculate and send a checksum or CRC covering an entire
payload, when it may terminate unexpectedly. Instead, PCI uses a simple parity scheme which
is applied and checked for each bus phase completed.
In contrast, each PCI Express packet has a known size and format, and the packet header--
positioned at the beginning of each DLLP and TLP packet-- indicates the packet type and
presence of any optional fields. The size of each packet field is either fixed or defined by the
packet type. The size of any data payload is conveyed in the TLP header Length field. Once a
transfer commences, there are no early transaction terminations by the recipient. This
structured packet format makes it possible to insert additional information into the packet into
prescribed locations, including framing symbols, CRC, and a packet sequence number (TLPs
only).
Each TLP and DLLP packet sent is framed with a Start and End control symbol, clearly defining
the packet boundaries to the receiver. Note that the Start and End control (K) symbols
appended to packets by the transmitting device are 10 bits each. This is a big improvement
over PCI and PCI-X which use the assertion and de-assertion of a single FRAME# signal to
indicate the beginning and end of a transaction. A glitch on the FRAME# signal (or any of the
other PCI/PCIX control signals) could cause a target to misconstrue bus events. In contrast, a
PCI Express receiver must properly decode a complete 10 bit symbol before concluding link
activity is beginning or ending. Unexpected or unrecognized control symbols are handled as
errors.
Unlike the side-band parity signals used by PCI devices during the address and each data
phase of a transaction, the in-band 16-bit or 32-bit PCI Express CRC value "protects" the entire
packet (other than framing symbols). In addition to CRC, TLP packets also have a packet
sequence number appended to them by the transmitter so that if an error is detected at the
receiver, the specific packet(s) which were received in error may be resent. The transmitter
maintains a copy of each TLP sent in a Retry Buffer until it is checked and acknowledged by
the receiver. This TLP acknowledgement mechanism (sometimes referred to as the Ack/Nak
protocol) forms the basis of link-level TLP error correction and is very important in deep
topologies, where devices may be many links away from the host and where recovering from a
transmission error would otherwise require CPU intervention.
Transaction Layer Packets
In PCI Express terminology, high-level transactions originate at the device core of the
transmitting device and terminate at the core of the receiving device. The Transaction Layer is
the starting point in the assembly of outbound Transaction Layer Packets (TLPs), and the end
point for disassembly of inbound TLPs at the receiver. Along the way, the Data Link Layer and
Physical Layer of each device contribute to the packet assembly and disassembly as described
below.
Figure 4-2 on page 158 depicts the general flow of TLP assembly at the transmit side of a link
and disassembly at the receiver. The key stages in Transaction Layer Packet protocol are
listed below. The numbers correspond to those in Figure 4-2.
1. Device B's core passes a request for service to the PCI Express hardware interface.
How this is done is not covered by the PCI Express Specification, and is device-
specific. General information contained in the request would include:
- Attributes of the transfer: No Snoop bit set?, Relaxed Ordering set?, etc.
2. The Transaction Layer builds the TLP header, data payload, and digest based on the
request from the core. Before sending a TLP to the Data Link Layer, flow control credits and
ordering rules must be applied.
3. When the TLP is received at the Data Link Layer, a Sequence Number is assigned and a
Link CRC is calculated for the TLP (including the Sequence Number). The TLP is then passed on to
the Physical Layer.
4. At the Physical Layer, byte striping, scrambling, encoding, and serialization are performed.
STP and END control (K) characters are appended to the packet. The packet is sent out on the
transmit side of the link.
5. At the Physical Layer receiver of Device A, de-serialization, framing symbol check, decoding,
and byte un-striping are performed. Note that at the Physical Layer, the first level of error
checking is performed (on the control codes).
6. The Data Link Layer of the receiver calculates the CRC and checks it against the received value.
It also checks the Sequence Number of the TLP for violations. If there are no errors, it passes
the TLP up to the Transaction Layer of the receiver. The information is decoded and passed to
the core of Device A. The Data Link Layer of the receiver will also notify the transmitter of the
success or failure in processing the TLP by sending an Ack or Nak DLLP to the transmitter. In
the event of a Nak (No Acknowledge), the transmitter will re-send all TLPs in its Retry Buffer.
Transactions are carried out between PCI Express requesters and completers, using four
separate address spaces: Memory, IO, Configuration, and Message. (See Table 4-1.)
Memory (Read, Write): Transfer data to or from a location in the system memory map. The protocol also supports a locked memory read transaction.
IO (Read, Write): Transfer data to or from a location in the system IO map. PCI Express permits IO address assignment to legacy devices; IO addressing is not permitted for native PCI Express devices.
Configuration (Read, Write): Transfer data to or from a location in the configuration space of a PCI Express device. As in PCI, configuration is used to discover device capabilities, program plug-and-play features, and check status using the 4KB PCI Express configuration space.
Message (Baseline, Vendor-specific): Provides in-band messaging and event reporting (without consuming memory or IO address resources). These are handled the same as posted write transactions.
In accessing the four address spaces, PCI Express Transaction Layer Packets (TLPs) carry a
header field, called the Type field, which encodes the specific command variant to be used.
Table 4-2 on page 160 summarizes the allowed transactions.
TLP Structure
The basic usage of each component of a Transaction Layer Packet is defined in Table 4-3 on
page 161.
Component (Protocol Layer): Use

Header (Transaction Layer): 3DW or 4DW (12 or 16 bytes) in size. Format varies with type, but the Header defines transaction parameters: transaction type, ordering attributes, Traffic Class, and so on.

Data (Transaction Layer): Optional field. 0-1024 DW payload, which may be further qualified with Byte Enables to get byte address and byte transfer size resolution.

Digest (Transaction Layer): Optional field. If present, always 1 DW in size. Used for end-to-end CRC (ECRC) and data poisoning.
Figure 4-3 on page 162 illustrates the format and contents of a generic TLP 3DW header. In
this section, fields common to nearly all transactions are summarized. In later sections, header
format differences associated with the specific transaction types are covered.
Table 4-4 on page 163 summarizes the size and use of each of the generic TLP header fields.
Note that fields marked "R" in Figure 4-3 on page 162 are reserved and should be set = 0.
Header Field (Location): Use

Length 9:0 (Byte 3 Bit 7:0 and Byte 2 Bit 1:0): TLP data payload transfer size, in DW. The field is 10 bits wide, so the maximum transfer size is 2^10 = 1024 DW (4KB).

Attr 1:0 (Attributes) (Byte 2 Bit 5:4): Bit 5 = Relaxed Ordering. When set = 1, PCI-X relaxed ordering is enabled for this TLP; if set = 0, strict PCI ordering is used. Bit 4 = No Snoop. When set = 1, the requester indicates that no host cache coherency issues exist with respect to this TLP, and system hardware is not required to snoop processor caches for coherency; when set = 0, PCI-type cache snoop protection is required.

EP (Poisoned Data) (Byte 2 Bit 6): If set = 1, the data accompanying this TLP should be considered invalid, although the transaction is allowed to complete normally.

TD (TLP Digest Field Present) (Byte 2 Bit 7): If set = 1, the optional 1 DW TLP Digest field containing an ECRC value is included with this TLP. Some rules:

- Presence of the Digest field must be checked by all receivers (using this bit).
- A TLP with TD = 1 but no Digest field is handled as a Malformed TLP.
- If a device supports checking ECRC and TD = 1, it must perform the ECRC check.
- If a device does not support checking ECRC (optional) at the ultimate destination, the device must ignore the digest.

TC 2:0 (Traffic Class) (Byte 1 Bit 6:4): These three bits encode the traffic class to be applied to this TLP and to the completion associated with it (if any). TC 0 is the default class, and TC 1-7 are used in providing differentiated services. See "Traffic Classes and Virtual Channels" on page 256 for additional information.

Type 4:0 (Byte 0 Bit 4:0): These 5 bits encode the transaction variant used with this TLP. The Type field is used with the Fmt[1:0] field to specify transaction type, header size, and whether a data payload is present. See below for additional information on Type/Fmt encoding for each transaction type.

Fmt 1:0 (Byte 0 Bit 6:5): These two bits encode information about header size and whether a data payload will be part of the TLP. See below for additional information on Type/Fmt encoding for each transaction type.

First DW Byte Enables 3:0 (Byte 7 Bit 3:0): These four high-true bits map one-to-one to the bytes within the first double word of payload. Bit 3 = 1: Byte 3 in the first DW is valid; otherwise not. Bit 2 = 1: Byte 2 in the first DW is valid; otherwise not. Bit 1 = 1: Byte 1 in the first DW is valid; otherwise not. Bit 0 = 1: Byte 0 in the first DW is valid; otherwise not.

Last DW Byte Enables 3:0 (Byte 7 Bit 7:4): These four high-true bits map one-to-one to the bytes within the last double word of payload.
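To make the layout above concrete, the following minimal sketch (ours, not taken from the specification) packs and unpacks the common fields of header DW0, assuming the byte/bit positions summarized in Table 4-4. It is illustrative only and is not a complete header generator.

# Illustrative sketch: pack/unpack the generic fields of TLP header DW0
# using the byte/bit positions summarized above (an assumption drawn from
# this table, not a complete header generator).

def pack_dw0(fmt, tlp_type, tc, td, ep, attr, length_dw):
    """Return bytes 0-3 of a TLP header (in transmission order)."""
    byte0 = ((fmt & 0x3) << 5) | (tlp_type & 0x1F)            # Fmt[1:0], Type[4:0]
    byte1 = (tc & 0x7) << 4                                    # TC[2:0] in bits 6:4
    length = length_dw & 0x3FF                                 # 10-bit Length
    byte2 = ((td & 1) << 7) | ((ep & 1) << 6) | ((attr & 0x3) << 4) | (length >> 8)
    byte3 = length & 0xFF
    return bytes([byte0, byte1, byte2, byte3])

def unpack_dw0(dw0):
    b0, b1, b2, b3 = dw0
    return {
        "fmt":    (b0 >> 5) & 0x3,
        "type":   b0 & 0x1F,
        "tc":     (b1 >> 4) & 0x7,
        "td":     (b2 >> 7) & 1,
        "ep":     (b2 >> 6) & 1,
        "attr":   (b2 >> 4) & 0x3,   # bit 5 = Relaxed Ordering, bit 4 = No Snoop
        "length": ((b2 & 0x3) << 8) | b3,
    }

# Example: a 3DW Memory Write (Fmt = 10b, Type = 0 0000b) of 4 DW, TC0, No Snoop set.
hdr = pack_dw0(fmt=0b10, tlp_type=0b00000, tc=0, td=0, ep=0, attr=0b01, length_dw=4)
assert unpack_dw0(hdr)["length"] == 4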
Table 4-5 on page 165 summarizes the encodings used in TLP header Type and Format (Fmt)
fields.
TLP Type: Fmt[1:0] / Type[4:0]

Memory Read Request (MRd): 00 (3DW, no data) or 01 (4DW, no data) / 0 0000

Memory Read Lock Request (MRdLk): 00 (3DW, no data) or 01 (4DW, no data) / 0 0001

Memory Write Request (MWr): 10 (3DW, w/ data) or 11 (4DW, w/ data) / 0 0000

Message Request (Msg): 01 (4DW, no data) / 1 0rrr* (for rrr, see routing subfield)

Message Request W/Data (MsgD): 11 (4DW, w/ data) / 1 0rrr* (for rrr, see routing subfield)
This book does not detail the algorithm and process of calculating ECRC; it is defined within the specification. ECRC covers all fields that do not change as the TLP is forwarded across the fabric: it includes all invariant fields of the TLP header and the data payload, if present. All variant fields are set to 1 when calculating the ECRC, including:

Bit 0 of the Type field is variant; this bit changes when the transaction type is altered for a packet. For example, a configuration transaction being forwarded to a remote link (across one or more switches) begins as a Type 1 configuration transaction. When the transaction reaches the destination link, it is converted to a Type 0 configuration transaction by changing bit 0 of the Type field.

The Error/Poisoned (EP) bit is variant; it can be set as a TLP traverses the fabric in the event that the data field associated with the packet has been corrupted. This is also referred to as error forwarding.

The ECRC check is intended for the device that is the ultimate recipient of the TLP. Link CRC checking verifies that a TLP traverses a given link without error before being forwarded to the next link, but ECRC is intended to verify that the packet has not been altered in its journey between the Requester and Completer. Switches in the path must maintain the integrity of the TD bit, because corruption of TD would cause an error at the ultimate target device.
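The sketch below (ours) illustrates only the idea of forcing the variant bits to 1 before computing the digest. It uses Python's zlib.crc32 as a stand-in checksum, so it does not produce a specification-conformant ECRC value; the real polynomial handling and bit ordering are defined in the specification.

import zlib

# Illustration only: force the variant bits to 1 before computing a CRC.
# zlib.crc32 is a stand-in; it is NOT the PCI Express ECRC algorithm.

def ecrc_input(header, payload=b""):
    hdr = bytearray(header)
    hdr[0] |= 0x01        # Type bit 0 is variant: treat as 1
    hdr[2] |= 0x40        # EP bit (Byte 2, bit 6) is variant: treat as 1
    return bytes(hdr) + payload

def illustrative_ecrc(header, payload=b""):
    return zlib.crc32(ecrc_input(header, payload)) & 0xFFFFFFFF

# A Type 1 configuration TLP and the same TLP after conversion to Type 0
# (bit 0 of the Type field flipped) produce the same digest input, as intended.
type1 = bytearray(16)
type1[0] = 0b00000101        # hypothetical header byte 0: Fmt=00, Type=0 0101 (Type 1 config read)
type0 = bytearray(type1)
type0[0] &= 0xFE             # converted to Type 0 by clearing bit 0 of the Type field
assert illustrative_ecrc(type1) == illustrative_ecrc(type0)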
The specification makes two statements regarding a Switch's role in ECRC checking:
A switch that supports ECRC checking performs this check on TLPs destined to a location
within the Switch itself. "On all other TLPs a Switch must preserve the ECRC (forward it
untouched) as an integral part of the TLP."
"Note that a Switch may perform ECRC checking on TLPs passing through the Switch.
ECRC Errors detected by the Switch are reported in the same way any other device would
report them, but do not alter the TLPs passage through the Switch."
These statements may appear to contradict each other. However, the first statement does not
explicitly state that an ECRC check cannot be made in the process of forwarding the TLP
untouched. The second statement clarifies that it is possible for switches, as well as the
ultimate target device, to check and report ECRC.
As in the PCI protocol, PCI Express requires a mechanism for reconciling its DW addressing
and data transfers with the need, at times, for byte resolution in transfer sizes and transaction
start/end addresses. To achieve byte resolution, PCI Express makes use of the two Byte
Enable fields introduced earlier in Figure 4-3 on page 162 and in Table 4-4 on page 163.
The First DW Byte Enable field and the Last DW Byte Enable fields allow the requester to
qualify the bytes of interest within the first and last double words transferred; this has the effect
of allowing smaller transfers than a full double word and offsetting the start and end addresses
from DW boundaries.
1. Byte enable bits are high true. A value of "0" indicates the corresponding byte in the data payload should not be written by the completer; a value of "1" indicates it should.

2. If the valid data transferred is all within a single aligned double word, the Last DW Byte Enable field must be = 0000b.

3. If the header Length field indicates a transfer of more than 1DW, the First DW Byte Enable field must have at least one bit enabled.

4. If the Length field indicates a transfer of 3DW or more, then neither the First DW Byte Enable field nor the Last DW Byte Enable field may have discontinuous byte enable bits set. In these cases, the Byte Enable fields are only being used to offset the effective start address of a burst transaction.

5. Discontinuous byte enable bit patterns in the First DW Byte Enable field are allowed if the transfer is 1DW.

6. Discontinuous byte enable bit patterns in both the First and Last DW Byte Enable fields are allowed only if the transfer is Quadword aligned (2DWs).

7. A write request with a transfer length of 1DW and no byte enables set is legal, but has no effect on the completer.

8. If a read request of 1DW is done with no byte enable bits set, the completer returns a 1DW data payload of undefined data. This may be used as a Flush mechanism: because of ordering rules, a flush may be used to force all previously posted writes to memory before the completion is returned.
An example of byte enable use in this case is illustrated in Figure 4-4 on page 168. Note that
the transfer length must extend from the first DW with any valid byte enabled to the last DW
with any valid bytes enabled. Because the transfer is more than 2DW, the byte enables may
only be used to specify the start address location (2d) and end address location (34d) of the
transfer.
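As a hedged illustration of the rules above, the sketch below derives the DW start address, Length (in DW), and First/Last DW Byte Enables for a contiguous byte-resolution transfer. It reproduces the Figure 4-4 example values (start byte address 2d, end byte address 34d) but ignores the single-DW special cases such as discontiguous enables and zero-length reads.

def dw_request_fields(byte_addr, byte_count):
    """Derive DW start address, Length (in DW), and First/Last DW Byte Enables
    from a byte-resolution transfer. Simplified sketch: assumes a contiguous
    transfer and byte_count >= 1."""
    first_off = byte_addr & 0x3                    # offset into first DW
    dw_addr   = byte_addr & ~0x3                   # DW-aligned start address
    last_byte = byte_addr + byte_count - 1
    length_dw = (last_byte // 4) - (byte_addr // 4) + 1

    first_be = (0xF << first_off) & 0xF            # enable bytes from the offset upward
    last_be  = 0xF >> (3 - (last_byte & 0x3))      # enable bytes down to the end offset

    if length_dw == 1:
        first_be &= last_be                        # merge into the First DW BE field
        last_be = 0b0000                           # rule: Last DW BE = 0000b for a 1 DW transfer
    return dw_addr, length_dw, first_be, last_be

# The Figure 4-4 example: a transfer starting at byte address 2 and ending at
# byte address 34 (33 bytes) spans 9 DWs with First BE = 1100b, Last BE = 0111b.
assert dw_request_fields(2, 33) == (0, 9, 0b1100, 0b0111)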
Transaction ID
This consists of the Bus, Device, and Function Number of the TLP requester combined with the Tag field of the TLP.
Traffic Class
Traffic Class (TC 0 -7) is inserted in the TLP by the requester, and travels unmodified through
the topology to the completer. At every link, Traffic Class is mapped to one of the available
virtual channels.
Transaction Attributes
These consist of the Relaxed Ordering and No Snoop bits. These are also set by the requester
and travel with the packet to the completer.
1. The Length field refers to the data payload only; the Digest field (if present) is not included in the Length.

2. The first byte of data in the payload (immediately after the header) is always associated with the lowest (start) address.

3. The Length field always represents an integral number of doublewords (DW) transferred. Partial doublewords are qualified using the First and Last DW Byte Enable fields.
The PCI Express specification states that when multiple completions are returned by a completer in response to a single memory read request, each intermediate completion must end on a naturally-aligned 64 or 128 byte address boundary for a root complex (this is termed the Read Completion Boundary, or RCB). All other devices must break such transactions at naturally-aligned 128 byte boundaries. This behavior promotes system performance related to cache lines.

The Length field is reserved when sending message TLPs using the Msg transaction. The Length field is valid when sending a message with the data variant, MsgD.
PCI Express supports load tuning of links: the data payload of a TLP must not exceed the current value in the Max_Payload_Size field of the Device Control Register. Only write requests carry data payloads, so this restriction does not apply to read requests. A receiver is required to check for violations of the Max_Payload_Size limit during writes; violations are handled as Malformed TLPs.
Receivers also must check for discrepancies between the value in the Length field and the
actual amount of data transferred in a TLP with data. Violations are also handled as Malformed
TLPs.
Requests must not mix combinations of start address and transfer length which will cause a
memory space access to cross a 4KB boundary. While checking is optional in this case,
receivers checking for violations of this rule will report it as a Malformed TLP.
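A minimal sketch of the receiver checks just described is shown below; the function and parameter names are illustrative, not taken from any particular implementation.

def check_memory_write(length_field_dw, actual_payload_dw, start_addr,
                       max_payload_size_dw):
    """Receiver-side checks described above; returns the reasons the TLP
    would be treated as Malformed. Sketch only."""
    errors = []
    if actual_payload_dw > max_payload_size_dw:
        errors.append("payload exceeds Max_Payload_Size")
    if actual_payload_dw != length_field_dw:
        errors.append("Length field does not match actual payload size")
    end_addr = start_addr + length_field_dw * 4 - 1
    if (start_addr // 4096) != (end_addr // 4096):
        errors.append("access crosses a 4KB boundary (optional check)")
    return errors

# Example: a 256 DW (1KB) write starting 16 bytes below a 4KB boundary,
# on a link whose Max_Payload_Size is 128 DW (512 bytes).
print(check_memory_write(256, 256, 0x0FF0, 128))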
In this section, the formats of the 3DW and 4DW headers used to accomplish specific transaction types are described. Many of the generic fields described previously apply, but the emphasis is placed on the fields which are handled differently between transaction types.
IO Requests
While the PCI Express specification discourages the use of IO transactions, an allowance is
made for legacy devices and software which may rely on a compatible device residing in the
system IO map rather than the memory map. While the IO transactions can technically access
a 32-bit IO range, in reality many systems (and CPUs) restrict IO access to the lower 16 bits
(64KB) of this range. Figure 4-6 on page 171 depicts the system IO map and the 16/32 bit
address boundaries. PCI Express non-legacy devices are memory-mapped, and not permitted
to make requests for IO address allocation in their configuration Base Address Registers.
Figure 4-6. System IO Map
Figure 4-7 on page 172 depicts the format of the 3DW IO request header. Each field in the
header is described in the section that follows.
Table 4-6 on page 173 describes the location and use of each field in an IO request header.
Table 4-6. IO Request Header Fields
Field Name (Header Byte/Bit): Function

Length 9:0 (Byte 3 Bit 7:0 and Byte 2 Bit 1:0): Indicates data payload size in DW. For IO requests, this field is always = 1. Byte Enables are used to qualify bytes within the DW.

EP (Byte 2 Bit 6): If = 1, indicates the data payload (if present) is poisoned.

TD (Byte 2 Bit 7): If = 1, indicates the presence of a digest field (1 DW) at the end of the TLP (preceding the LCRC and END).

Type 4:0 (Byte 0 Bit 4:0): TLP packet type field. Always set to 00010b for IO requests.

1st DW BE 3:0 (First DW Byte Enables) (Byte 7 Bit 3:0): These high-true bits map one-to-one to qualify bytes within the DW payload. For IO requests, any bit combination is valid (including none).

Last BE 3:0 (Last DW Byte Enables) (Byte 7 Bit 7:4): These high-true bits map one-to-one to qualify bytes within the last DW transferred. For IO requests, these bits must be 0000b (single DW).

Tag 7:0 (Byte 6 Bit 7:0): These bits are used to identify each outstanding request issued by the requester. As non-posted requests are sent, the next sequential tag is assigned. Default: only bits 4:0 are used (32 outstanding transactions at a time). If the Extended Tag bit in the PCI Express Control Register is set = 1, then all 8 bits may be used (256 tags).

Address 31:2 (Byte 8 Bit 7:0, Byte 9 Bit 7:0, Byte 10 Bit 7:0, Byte 11 Bit 7:2): The upper 30 bits of the 32-bit start address for the IO transfer. Note that the lower two bits of the 32-bit address are reserved (00b), forcing the start address to be DW aligned.
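The sketch below assembles a 3DW IO read request header using the field positions in Table 4-6. It is an illustration under stated assumptions (notably that the Requester ID occupies Bytes 4-5, as in the other 3DW request headers), not production code.

import struct

def io_read_request(addr, requester_id, tag, first_be):
    """Build a 3DW IO read request header following Table 4-6. Simplified
    sketch: Fmt = 00b (3DW, no data), Type = 00010b, Length = 1, Last BE = 0000b.
    The Requester ID is assumed to occupy Bytes 4-5."""
    assert addr & 0x3 == 0, "IO start address must be DW aligned"
    byte0 = (0b00 << 5) | 0b00010          # Fmt = 00 (3DW, no data), Type = IO request
    byte1 = 0                              # TC0
    byte2 = 0x00                           # TD = 0, EP = 0, Attr = 00, Length[9:8] = 0
    byte3 = 0x01                           # Length = 1 DW
    dw1   = (requester_id << 16) | ((tag & 0xFF) << 8) | (0x0 << 4) | (first_be & 0xF)
    return struct.pack(">BBBBII", byte0, byte1, byte2, byte3, dw1, addr)

hdr = io_read_request(addr=0x03F8, requester_id=0x0100, tag=0x07, first_be=0b0011)
assert len(hdr) == 12 and hdr[0] == 0b00000010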
Memory Requests
PCI Express memory transactions include two classes: Read Request/Completion and Write
Request. Figure 4-8 on page 175 depicts the system memory map and the 3DW and 4DW
memory request packet formats. When requesting a memory data transfer, it is important to remember that memory transactions are never permitted to cross 4KB boundaries.
The location and use of each field in a 4DW memory request header is listed in Table 4-7 on
page 176.
Note: The difference between a 3DW header and a 4DW header is the location and size of the
starting Address field:
For a 3DW header (32-bit addressing): Address bits 31:2 are in Bytes 8-11; Bytes 12-15 are not present.
For a 4DW header (64 bit addressing): Address bits 31:2 are in Bytes 12-15, and address
bits 63:32 are in Bytes 8-11.
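A small sketch of that address-placement choice, under the assumption that a requester uses the 3DW format whenever the address fits in 32 bits:

def memory_request_address_dws(addr):
    """Place the start address per the note above: a 3DW header (32-bit
    addressing) when the address fits in 32 bits, otherwise a 4DW header
    with the upper 32 bits placed first. Sketch only."""
    assert addr & 0x3 == 0, "start address must be DW aligned"
    if addr < (1 << 32):
        return [addr]                         # 3DW header: Address[31:2] in Bytes 8-11
    return [addr >> 32, addr & 0xFFFFFFFF]    # 4DW header: 63:32 in Bytes 8-11, 31:2 in Bytes 12-15

assert memory_request_address_dws(0x100000000) == [0x1, 0x0]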
Field Name (Header Byte/Bit): Function

Length 9:0 (Byte 3 Bit 7:0 and Byte 2 Bit 1:0): TLP data payload transfer size, in DW. The field is 10 bits wide, so the maximum transfer size is 2^10 = 1024 DW (4KB).

Attr 1:0 (Attributes) (Byte 2 Bit 5:4): Bit 5 = Relaxed Ordering. When set = 1, PCI-X relaxed ordering is enabled for this TLP; if set = 0, strict PCI ordering is used. Bit 4 = No Snoop. When set = 1, the requester indicates that no host cache coherency issues exist with respect to this TLP, and system hardware is not required to snoop processor caches for coherency; when set = 0, PCI-type cache snoop protection is required.

EP (Poisoned Data) (Byte 2 Bit 6): If set = 1, the data accompanying this TLP should be considered invalid, although the transaction is allowed to complete normally.

TD (TLP Digest Field Present) (Byte 2 Bit 7): If set = 1, the optional 1 DW TLP Digest field is included with this TLP. Some rules:

- Presence of the Digest field must be checked by all receivers (using this bit).
- A TLP with TD = 1 but no Digest field is handled as a Malformed TLP.
- If a device supports checking ECRC and TD = 1, it must perform the ECRC check.
- If a device does not support checking ECRC (optional) at the ultimate destination, the device must ignore the digest field.

TC 2:0 (Traffic Class) (Byte 1 Bit 6:4): These three bits encode the traffic class to be applied to this TLP and to the completion associated with it (if any). TC 0 is the default class, and TC 1-7 are used in providing differentiated services. See "Traffic Classes and Virtual Channels" on page 256 for additional information.

Type 4:0 and Fmt 1:0 (Byte 0 Bit 4:0 and Bit 6:5): The Type field is used with the Fmt[1:0] field to specify transaction type, header size, and whether a data payload is present.

1st DW BE 3:0 (First DW Byte Enables) (Byte 7 Bit 3:0): These high-true bits map one-to-one to qualify bytes within the first DW of the payload.

Last BE 3:0 (Last DW Byte Enables) (Byte 7 Bit 7:4): These high-true bits map one-to-one to qualify bytes within the last DW transferred.

Tag 7:0 (Byte 6 Bit 7:0): These bits are used to identify each outstanding request issued by the requester. As non-posted requests are sent, the next sequential tag is assigned. Default: only bits 4:0 are used (32 outstanding transactions at a time). If the Extended Tag bit in the PCI Express Control Register is set = 1, then all 8 bits may be used (256 tags).

Address 31:2 (Byte 15 Bit 7:2, Byte 14 Bit 7:0, Byte 13 Bit 7:0, Byte 12 Bit 7:0): The lower 32 bits of the 64-bit start address for the memory transfer. Note that the lower two bits of the 32-bit address are reserved (00b), forcing the start address to be DW aligned.

Address 63:32 (Byte 11 Bit 7:0, Byte 10 Bit 7:0, Byte 9 Bit 7:0, Byte 8 Bit 7:0): The upper 32 bits of the 64-bit start address for the memory transfer.
All memory mapped writes are posted, resulting in much higher performance.
Either 32 bit or 64 bit addressing may be used. The 3DW header format supports 32 bit
addresses and the 4DW header supports 64 bits.
The full capability of burst transfers is available with a transfer length of 0-1024 DW (0-4KB).
Advanced PCI Express Quality of Service features, including up to 8 traffic classes and virtual channels, may be implemented.
The No Snoop attribute bit in the header may be set = 1, relieving the system hardware from
the burden of snooping processor caches when PCI Express transactions target main memory.
Optionally, the bit may be deasserted in the packet, providing PCI-like cache coherency
protection.
The Relaxed Ordering bit may also be set = 1, permitting devices in the path between the requester and the packet's destination to apply the relaxed ordering rules available in PCI-X. If deasserted, strong PCI producer-consumer ordering is enforced.
Configuration Requests
To maintain compatibility with PCI, PCI Express supports both Type 0 and Type 1 configuration
cycles. A Type 1 cycle propagates downstream until it reaches the bridge interface hosting the
bus (link) that the target device resides on. The configuration transaction is converted on the
destination link from Type 1 to Type 0 by the bridge. The bridge forwards and converts
configuration cycles using previously programmed Bus Number registers that specify its
primary, secondary, and subordinate buses. Refer to the "PCI-Compatible Configuration
Mechanism" on page 723 for a discussion of routing these transactions.
Figure 4-9 on page 180 illustrates a Type 1 configuration cycle making its way downstream. At
the destination link, it is converted to Type 0 and claimed by the endpoint device. Note that
unlike PCI, only one device (other than the bridge) resides on a link. For this reason, no IDSEL
or other hardware indication is required to instruct the device to claim the Type 0 cycle; any
Type 0 configuration cycle a device sees on its primary link will be claimed.
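The following sketch (ours) shows the forwarding decision a bridge or switch port might make for an inbound Type 1 configuration request, based on its programmed secondary and subordinate bus numbers. It is a simplification that ignores device and function decoding.

def route_config_request(target_bus, secondary, subordinate):
    """Bridge/switch port decision for a Type 1 configuration request arriving
    on its primary side, based on the bus number registers described above.
    Illustration only."""
    if target_bus == secondary:
        return "convert to Type 0 and deliver on the secondary link"
    if secondary < target_bus <= subordinate:
        return "forward as Type 1 toward the subordinate bus"
    return "do not claim (request is not within this bridge's bus range)"

# Example: a switch downstream port programmed with secondary = 2, subordinate = 4.
assert route_config_request(2, 2, 4).startswith("convert")
assert route_config_request(3, 2, 4).startswith("forward")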
Table 4-8 on page 181 describes the location and use of each field in the configuration request
header illustrated in Figure 4-9 on page 180.
Field Name (Header Byte/Bit): Function

Length 9:0 (Byte 3 Bit 7:0 and Byte 2 Bit 1:0): Indicates data payload size in DW. For configuration requests, this field is always = 1. Byte Enables are used to qualify bytes within the DW (any combination is legal).

EP (Byte 2 Bit 6): If = 1, indicates the data payload (if present) is poisoned.

TD (Byte 2 Bit 7): If = 1, indicates the presence of a digest field (1 DW) at the end of the TLP (preceding the LCRC and END).

TC 2:0 (Traffic Class) (Byte 1 Bit 6:4): Indicates the traffic class for the packet. TC is = 0 for all configuration requests.

1st DW BE 3:0 (First DW Byte Enables) (Byte 7 Bit 3:0): These high-true bits map one-to-one to qualify bytes within the DW payload. For configuration requests, any bit combination is valid (including none).

Last BE 3:0 (Last DW Byte Enables) (Byte 7 Bit 7:4): These high-true bits map one-to-one to qualify bytes within the last DW transferred. For configuration requests, these bits must be 0000b (single DW).

Tag 7:0 (Byte 6 Bit 7:0): These bits are used to identify each outstanding request issued by the requester. As non-posted requests are sent, the next sequential tag is assigned. Default: only bits 4:0 are used (32 outstanding transactions at a time). If the Extended Tag bit in the PCI Express Control Register is set = 1, then all 8 bits may be used (256 tags).

Register Number (Byte 11 Bit 7:2): These bits provide the lower 6 bits of the DW configuration space offset. The Register Number is used in conjunction with the Ext Register Number to provide the full 10 bits of offset needed for the 1024 DW (4096 byte) PCI Express configuration space.

Ext Register Number (Extended Register Number) (Byte 10 Bit 3:0): These bits provide the upper 4 bits of the DW configuration space offset. The Ext Register Number is used in conjunction with the Register Number to provide the full 10 bits of offset needed for the 1024 DW (4096 byte) PCI Express configuration space. For compatibility, this field can be set = 0, so that only the lower 64 DW (256 bytes) are seen when indexing with the Register Number.

Completer ID 15:0 (Byte 8 Bit 7:0 and Byte 9 Bit 7:0): Identifies the completer being accessed with this configuration cycle. The Bus and Device Numbers in this field are "captured" by the device on each Type 0 configuration write. Byte 8, Bit 7:0 = Bus Number; Byte 9, Bit 7:3 = Device Number; Byte 9, Bit 2:0 = Function Number.
Configuration requests always use the 3DW header format and are routed by the contents of
the ID field.
All devices "capture" the Bus Number and Device Number information provided by the upstream
device during each Type 0 configuration write cycle. Information is contained in Byte 8-9
(Completer ID) of configuration request.
Completions
Table 4-9 on page 185 describes the location and use of each field in a completion header.
Field Name (Header Byte/Bit): Function

Length 9:0 (Byte 3 Bit 7:0 and Byte 2 Bit 1:0): Indicates data payload size in DW. For completions, this field reflects the size of the data payload associated with this completion.

TD (Byte 2 Bit 7): If = 1, indicates the presence of a digest field (1 DW) at the end of the TLP (preceding the LCRC and END).

TC 2:0 (Traffic Class) (Byte 1 Bit 6:4): Indicates the traffic class for the packet. For a completion, TC is set to the same value as in the request.

Type 4:0 (Byte 0 Bit 4:0): TLP packet type field. Always set to 01010b for a completion.

Byte Count 11:0 (Byte 6 Bit 3:0 and Byte 7 Bit 7:0): This is the remaining byte count until a read request is satisfied. Generally, it is derived from the original request Length field. See "Data Returned For Read Requests:" on page 188 for special cases caused by multiple completions.

BCM (Byte Count Modified) (Byte 6 Bit 4): Set = 1 only by PCI-X completers. Indicates that the Byte Count field (see previous field) reflects the first transfer payload rather than the total payload remaining. See "Using The Byte Count Modified Bit" on page 188.

Completion Status 2:0 (Byte 6 Bit 7:5): These bits are encoded by the completer to indicate success in fulfilling the request (see the status codes below).

Completer ID 15:0 (Byte 4 Bit 7:0 and Byte 5 Bit 7:0): Identifies the completer. While not needed for routing a completion, this information may be useful when debugging bus traffic. Byte 4, Bit 7:0 = Completer Bus Number; Byte 5, Bit 7:3 = Completer Device Number; Byte 5, Bit 2:0 = Completer Function Number.

Lower Address 6:0 (Byte 11 Bit 6:0): The lower 7 bits of the address for the first enabled byte of data returned with a read. Calculated from the request Length and Byte Enables, it is used to determine the next legal Read Completion Boundary. See "Calculating Lower Address Field" on page 187.

Tag 7:0 (Byte 10 Bit 7:0): These bits are set to reflect the Tag received with the request. The requester uses them to associate the inbound completion with an outstanding request.

Requester ID 15:0 (Byte 8 Bit 7:0 and Byte 9 Bit 7:0): Copied from the request into this field to be used in routing the completion back to the original requester. Byte 8, Bit 7:0 = Requester Bus Number; Byte 9, Bit 7:3 = Requester Device Number; Byte 9, Bit 2:0 = Requester Function Number.

The defined Completion Status codes are:
000b (SC) Successful Completion code indicates the original request completed properly
at the target.
001b (UR) Unsupported Request code indicates the original request failed at the target because it targeted an unsupported address, used an unsupported request type, etc. This is handled as an uncorrectable error. See "Unsupported Request" on page 365 for details.
010b (CRS) Configuration Request Retry Status indicates target was temporarily off-line
and the attempt should be retried. (e.g. initialization delay after reset, etc.).
100b (CA) Completer Abort code indicates that completer is off-line due to an error (much
like target abort in PCI). The error will be logged and handled as an uncorrectable error.
Refer to the Lower Address field in Table 4-9 on page 185. The Lower Address field is set up
by the completer during completions with data (CplD) to reflect the address of the first enabled
byte of data being returned in the completion payload. This must be calculated in hardware by
considering both the DW start address and the byte enable pattern in the First DW Byte Enable
field provided in the original request. Basically, the address is an offset from the DW start
address:
If the First DW Byte Enable field is 1111b, all bytes are enabled in the first DW and the
offset is 0. The byte start address is = DW start address.
If the First DW Byte Enable field is 1110b, the upper three bytes are enabled in the first
DW and the offset is 1. The byte start address is = DW start address + 1.
If the First DW Byte Enable field is 1100b, the upper two bytes are enabled in the first DW
and the offset is 2. The byte start address is = DW start address + 2.
If the First DW Byte Enable field is 1000b, only the upper byte is enabled in the first DW
and the offset is 3. The byte start address is = DW start address + 3.
Once calculated, the lower 7 bits are placed in the Lower Address field of the completion
header in the event the start address was not aligned on a Read Completion Boundary (RCB)
and the read completion must break off at the first RCB. Knowledge of the RCB is necessary
because breaking a transaction must be done on RCBs which are based on start address--not
transfer size.
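A minimal sketch of that calculation, assuming one of the contiguous byte enable patterns listed above:

def lower_address(dw_start_addr, first_dw_be):
    """Compute the 7-bit Lower Address field from the DW start address and the
    First DW Byte Enables, per the offsets listed above. Sketch only; assumes a
    contiguous byte enable pattern (a non-zero read)."""
    offsets = {0b1111: 0, 0b1110: 1, 0b1100: 2, 0b1000: 3}
    return (dw_start_addr + offsets[first_dw_be & 0xF]) & 0x7F

assert lower_address(0x40, 0b1110) == 0x41   # first enabled byte is the start address + 1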
Refer to the Byte Count Modified Bit in Table 4-9 on page 185. This bit is only set by a PCI-X
completer (e.g. a bridge from PCI Express to PCI-X) in a particular circumstance. Rules for its
assertion include:
The BCM bit is only set for the first completion of the series. It is set to indicate that the first
completion contains a Byte Count field that reflects the first completion payload rather than the
total remaining (as it would in normal PCI Express protocol). The receiver then recognizes that
the completion will be followed by others to satisfy the original request as required.
For the second and any other completions in the series, the BCM bit must be deasserted
and the Byte Count field will reflect the total remaining count--just as in normal PCI Express
protocol.
PCI Express devices receiving completions with the BCM bit set must interpret this case
properly.
The Lower Address field is set up by the completer during completions with data (CplD) to reflect the address of the first enabled byte of data being returned.

1. Completions for read requests may be broken into multiple completions, but the total data transferred must equal the size of the original request.

2. The Read Completion Boundary (RCB) must be observed when handling a read request with multiple completions. The RCB is 64 bytes or 128 bytes for the root complex; the value used should be visible in a configuration register.

3. Bridges and endpoints may implement a bit for selecting the RCB size (64 or 128 bytes) under software control.

4. Completions that do not cross an aligned RCB boundary must complete in one transfer.

5. Multiple completions for a single read request must return data in increasing address order.
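The sketch below illustrates rules 2, 4, and 5: it splits the data for a read request into completions that each end on a naturally aligned RCB boundary. It is an illustration only, not a completer implementation.

def completion_boundaries(start_addr, byte_count, rcb=128):
    """Split the data returned for a read request into completions that each
    end on a naturally aligned RCB boundary (64 or 128 bytes). Returns
    (address, length-in-bytes) pairs in increasing address order. Sketch only."""
    assert rcb in (64, 128)
    chunks, addr, remaining = [], start_addr, byte_count
    while remaining:
        boundary = (addr // rcb + 1) * rcb          # next aligned RCB boundary
        size = min(boundary - addr, remaining)
        chunks.append((addr, size))
        addr += size
        remaining -= size
    return chunks

# A 256-byte read starting 16 bytes below a 128-byte boundary yields three
# completions: 16 bytes, 128 bytes, and the remaining 112 bytes.
assert completion_boundaries(0x70, 256) == [(0x70, 16), (0x80, 128), (0x100, 112)]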
When the Root Complex receives a CRS status during a configuration cycle, its handling of the event is not defined, except after reset, when a period is defined during which it must allow the retry.

If CRS is received for a request other than a configuration request, it is handled as a Malformed TLP.

If a read completion is received with a status other than Successful Completion (SC), no data is returned with the completion, and a Cpl (or CplLk) is returned in place of a CplD (or CplDLk).
In the event multiple completions are being returned for a read request, a completion status
other than Successful Completion (SC) immediately ends the transaction. Device handling of
data received prior to the error is implementation-specific.
In maintaining compatibility with PCI, a Root Complex may be required to synthesize a read value of all 1's when a configuration cycle ends with a completion indicating an Unsupported Request. (This is analogous to the master aborts which occur when PCI enumeration probes devices which are not in the system.)
Message Requests
Message requests replace many of the interrupt, error, and power management sideband
signals used on earlier bus protocols. All message requests use the 4DW header format, and
are handled much the same as posted memory write transactions. Messages may be routed
using address, ID, or implicit routing. The routing subfield in the packet header indicates the
routing method to apply, and which additional header registers are in use (address registers,
etc.). Figure 4-11 on page 190 depicts the message request header format.
Table 4-10 on page 191 describes the location and use of each field in a message request
header.
Field Name (Header Byte/Bit): Function

Length 9:0 (Byte 3 Bit 7:0 and Byte 2 Bit 1:0): Indicates data payload size in DW. For message requests, this field is always 0 (no data) or 1 (one DW of data).

EP (Byte 2 Bit 6): If = 1, indicates the data payload (if present) is poisoned.

TD (Byte 2 Bit 7): If = 1, indicates the presence of a digest field (1 DW) at the end of the TLP (preceding the LCRC and END).

TC 2:0 (Traffic Class) (Byte 1 Bit 6:4): Indicates the traffic class for the packet. TC is = 0 for all message requests.

Type 4:0 (Byte 0 Bit 4:0): Bits 4:3 = 10b indicate a message request (other values are reserved); bits 2:0 carry the routing subfield (rrr), which selects address, ID, or implicit routing.

Message Code 7:0 (Byte 7 Bit 7:0): This field contains the code indicating the type of message being sent.

Tag 7:0 (Byte 6 Bit 7:0): As all message requests are posted, no tag is assigned to them. These bits should be = 0.

Address 31:2 (Byte 15 Bit 7:2, Byte 14 Bit 7:0, Byte 13 Bit 7:0, Byte 12 Bit 7:0): If address routing was selected for the message (see the Type 4:0 field above), then this field contains the lower part of the 64-bit starting address. Otherwise, this field is not used.

Address 63:32 (Byte 11 Bit 7:0, Byte 10 Bit 7:0, Byte 9 Bit 7:0, Byte 8 Bit 7:0): If address routing was selected for the message (see the Type 4:0 field above), then this field contains the upper 32 bits of the 64-bit starting address. Otherwise, this field is not used.
Message Notes
The following tables specify the message coding used for each of the seven message groups, and are based on the Message Code field listed in Table 4-10 on page 191. The defined groups include:

INTx Interrupt Signaling

Power Management

Error Signaling

Unlock (Locked Transaction Support)

Slot Power Limit

Hot Plug Signaling

Vendor-Defined Messages
While many devices are capable of using the PCI 2.3 Message Signaled Interrupt (MSI)
method of delivering interrupts, some devices may not support it. PCI Express defines a virtual
wire alternative in which devices simulate the assertion and deassertion of the INTx (INTA-
INTD) interrupt signals seen in PCI-based systems. Basically, a message is sent to inform the
upstream device an interrupt has been asserted. After servicing, the device which sent the
interrupt sends a second message indicating the virtual interrupt signal is being released. Refer
to the "Message Signaled Interrupts" on page 331 for details. Table 4-11 summarizes the INTx
message coding at the packet level.
Assert_INTx and Deassert_INTx are only issued by upstream ports. Checking violations of
this rule is optional. If checked, a TLP violation is handled as a Malformed TLP.
These messages are required to use the default traffic class, TC0. Receivers must check for
violation of this rule (handled as Malformed TLPs).
Components at both ends of the link must track the current state of the four virtual interrupts.
If the logical state of one of the interrupts changes at the upstream port, the port must send the
appropriate INTx message to the downstream port on the same link.
INTx signaling is disabled when the Interrupt Disable bit of the Command Register is set = 1
(just as it would be if physical interrupt lines are used).
If any virtual INTx signals are active when the Interrupt Disable bit is set in the device, the
device must transmit a corresponding Deassert_INTx message onto the link.
Switches must track the state of the four INTx signals independently for each downstream
port and combine the states for the upstream link.
The Root Complex must track the state of the four INTx lines independently and convert
them into system interrupts in a system-specific way.
Because of switches in the path, the Requester ID in an INTx message may be the last
transmitter, not the original requester.
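As a rough illustration of the switch rule above, the sketch below combines per-port virtual INTx states into the state reported on the upstream link. It is a simplification; a real switch must also handle INTx remapping and message generation on state changes.

def collapse_intx(downstream_states):
    """Combine per-downstream-port virtual INTA-INTD states into the state the
    switch reports on its upstream link: a virtual wire is asserted upstream if
    any downstream port has it asserted. Illustrative sketch only."""
    wires = ("INTA", "INTB", "INTC", "INTD")
    return {w: any(port.get(w, False) for port in downstream_states.values())
            for w in wires}

# Example: port 1 asserts INTA, port 2 asserts nothing; upstream reports INTA only.
upstream = collapse_intx({"port1": {"INTA": True}, "port2": {}})
assert upstream == {"INTA": True, "INTB": False, "INTC": False, "INTD": False}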
PCI Express is compatible with PCI power management, and adds the PCI Express active link
management mechanism. Refer to Chapter 16, entitled "Power Management," on page 567 for
a description of power management. Table 4-12 on page 194 summarizes the four power
management message types.
1. Power Management message types do not include a data payload. The Length field is reserved.

2. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as Malformed TLPs).

3. PME_TO_Ack is sent upstream by an endpoint. For a switch with devices attached to multiple downstream ports, this message is not sent upstream until it has first been received from all downstream ports.
Error Messages
Error messages are sent upstream by enabled devices that detect correctable, non-fatal uncorrectable, or fatal uncorrectable errors. The device detecting the error is identified by the Requester ID field in the message header. Table 4-13 on page 195 describes the three error message types.

1. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as Malformed TLPs).

2. This message type does not include a data payload. The Length field is reserved.
Unlock Message
The Unlock message is sent to a completer to release it from lock as part of the PCI Express
Locked Transaction sequence. Table 4-14 on page 196 summarizes the coding for this
message.
1. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as Malformed TLPs).

2. This message type does not include a data payload. The Length field is reserved.
This message is sent from a downstream switch or Root Complex port to the upstream port of
the device attached to it. It conveys a slot power limit which the downstream device then copies
into the Device Capabilities Register for its upstream port. Table 4-15 summarizes the coding
for this message.
1. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as Malformed TLPs).

2. This message type carries a data payload of 1 DW. The Length field is set = 1. Only the lower 10 bits of the 32-bit data payload are used for slot power scaling; the upper bits in the data payload must be set = 0.

3. This message is sent automatically any time the link transitions to DL_Up status, or when a configuration write to the Slot Capabilities Register occurs while the Data Link Layer reports DL_Up status.

4. If a card in a slot consumes less power than the power limit specified for the card/form factor, it may ignore the message.
These messages are passed between downstream ports of switches and Root Ports that
support Hot Plug Event signaling. Table 4-16 summarizes the Hot Plug message types.
The Attention and Power indicator messages are all driven by the switch/root complex port
to the card.
The Attention Button message is driven upstream by a slot device that implements a
switch.
Data Link Layer Packets
The primary responsibility of the PCI Express Data Link Layer is to assure that integrity is
maintained when TLPs move between two devices. It also has link initialization and power
management responsibilities, including tracking of the link state and passing messages and
status between the Transaction Layer above and the Physical Layer below.
In performing its role, the Data Link Layer exchanges traffic with its neighbor using Data Link
Layer Packets (DLLPs). DLLPs originate and terminate at the Data Link Layer of each device,
without involvement of the Transaction Layer. DLLPs and TLPs are interleaved on the link.
Figure 4-12 on page 198 depicts the transmission of a DLLP from one device to another.
Types Of DLLPs
DLLPs have a simple packet format. Unlike TLPs, they carry no target information because
they are used for nearest-neighbor communications only.
The following rules apply when a DLLP is sent from transmitter to receiver:
1. As DLLPs arrive at the receiver, they are immediately processed. They cannot be flow controlled.

2. All received DLLPs are checked for errors. This includes a control symbol check at the Physical Layer after deserialization, followed by a CRC check at the receiver Data Link Layer. A 16-bit CRC is calculated and sent with the packet by the transmitter; the receiver calculates its own DLLP checksum and compares it to the received value.

3. Any DLLPs that fail the CRC check are discarded. There are several reportable errors associated with DLLPs.

4. Unlike TLPs, there is no acknowledgement protocol for DLLPs. The PCI Express specification has time-out mechanisms which are intended to allow recovery from lost or discarded DLLPs.

5. Assuming no errors occur, the DLLP type is determined and it is passed to the appropriate internal logic:

- Power Management DLLPs are passed to the device power management logic.

- Flow Control DLLPs are passed to the Transaction Layer so credits may be updated.

- Ack/Nak DLLPs are routed to the Data Link Layer transmit interface so TLPs in the retry buffer may be discarded or resent.
DLLPs are assembled on the transmit side and disassembled on the receiver side of a link.
These packets originate at the Data Link Layer and are passed to the Physical Layer. There,
framing symbols are added before the packet is sent. Figure 4-13 on page 200 depicts a
generic DLLP in transit from Device B to Device A.
1. A 1 DW core (4 bytes) consisting of the one-byte Type field and three additional bytes of attributes. The attributes vary with the DLLP type.

2. A 16-bit CRC value which is calculated based on the DW core contents, then appended to it.
These 6 bytes are then passed to the Physical Layer where a Start Of DLLP (SDP) control
symbol and an End Of Packet (END) control symbol are added to it. Before transmission, the
Physical Layer encodes the 8 bytes of information into eight 10-bit symbols for transmission to
the receiver.
Note that there is never a data payload with a DLLP; all information of interest is carried in the
Type and Attribute fields.
Ack (TLP Acknowledge): Type code 0000 0000b. Used for TLP transmission integrity.

Nak (TLP No Acknowledge): Type code 0001 0000b. Used for TLP transmission integrity.
Field Name (DLLP Byte/Bit): Function

AckNak_Seq_Num 11:0 (Byte 3 Bit 7:0 and Byte 2 Bit 3:0):

For an ACK DLLP:

- For good TLPs received with Sequence Number = NEXT_RCV_SEQ count (count before incrementing), use NEXT_RCV_SEQ count - 1 (count after incrementing minus 1).

- For a TLP received with a Sequence Number earlier than the NEXT_RCV_SEQ count (a duplicate TLP), use NEXT_RCV_SEQ count - 1.

For a NAK DLLP:

- Associated with a TLP that failed the CRC check, use NEXT_RCV_SEQ count - 1.

- For a TLP received with a Sequence Number later than the NEXT_RCV_SEQ count, use NEXT_RCV_SEQ count - 1.

Upon receipt, the transmitter purges TLPs with equal or earlier Sequence Numbers and replays the remaining TLPs.

16-bit CRC (Byte 5 Bit 7:0 and Byte 4 Bit 7:0): 16-bit CRC used to protect the contents of this DLLP. The calculation is made on Bytes 0-3 of the ACK/NAK.
Power Management DLLP Packet Format
PCI Express power management DLLPs and TLPs replace most signals associated with power
management state changes. The format of the DLLP used for power management is illustrated
in Figure 4-15.
Field Name (DLLP Byte/Bit): Function

Type (Byte 0 Bit 7:0): This field indicates the type of DLLP and selects the specific Power Management DLLP variant.

Link CRC (Byte 5 Bit 7:0 and Byte 4 Bit 7:0): 16-bit CRC sent to protect the contents of this DLLP. The calculation is made on Bytes 0-3, regardless of whether the fields are used.
PCI Express eliminates many of the inefficiencies of earlier bus protocols through the use of a
credit-based flow control scheme. This topic is covered in detail in Chapter 7, entitled "Flow
Control," on page 285. Three slightly different DLLPs are used to initialize the credits and to
update them as receiver buffer space becomes available. The two flow control initialization
packets are referred to as InitFC1 and InitFC2. The Update DLLP is referred to as UpdateFC.
The generic DLLP format for all three flow control DLLP variants is illustrated in Figure 4-16 on
page 205.
Table 4-20 on page 206 describes the fields contained in a flow control DLLP.
Field Name (DLLP Byte/Bit): Function

DataFC 11:0 (Byte 3 Bit 7:0 and Byte 2 Bit 3:0): This field contains the credits associated with data storage. Data credits are in units of 16 bytes per credit, and are applied to the flow control counter for the virtual channel indicated in VC[2:0] and for the traffic type indicated by the code in Byte 0, Bits 7:4.

HdrFC 7:0 (Byte 2 Bit 7:6 and Byte 1 Bit 5:0): This field contains the credits associated with header storage. Header credits are in units of one header (including digest) per credit, and are applied to the flow control counter for the virtual channel indicated in VC[2:0] and for the traffic type indicated by the code in Byte 0, Bits 7:4.

VC 2:0 (Byte 0 Bit 2:0): This field indicates the virtual channel (VC 0-7) receiving the credits.

Link CRC (Byte 5 Bit 7:0 and Byte 4 Bit 7:0): 16-bit CRC sent to protect the contents of this DLLP. The calculation is made on Bytes 0-3, regardless of whether the fields are used.
PCI Express reserves a DLLP type for vendor specific use. Only the Type code is defined. The
Vendor DLLP is illustrated in Figure 4-17.
Table 4-21 on page 207 describes the fields contained in a Vendor-Specific DLLP.

Field Name (DLLP Byte/Bit): Function

Link CRC (Byte 5 Bit 7:0 and Byte 4 Bit 7:0): 16-bit CRC sent to protect the contents of this DLLP. The calculation is made on Bytes 0-3, regardless of whether the fields are used.
Chapter 5. ACK/NAK Protocol
The Previous Chapter
This Chapter
This chapter describes the ACK/NAK protocol, which provides:

- 'Reliable' transport of TLPs from one device to another device across the Link.

- Delivery of TLPs to the receiver's Transaction Layer in the same order that the transmitter sent them. The Data Link Layer must preserve this order despite any occurrence of errors that require TLPs to be replayed (retried).
The ACK/NAK protocol associated with the Data Link Layer is described with the aid of Figure
5-2 on page 211 which shows sub-blocks with greater detail. For every TLP that is sent from
one device (Device A) to another (Device B) across one Link, the receiver checks for errors in
the TLP (using the TLP's LCRC field). The receiver Device B notifies transmitter Device A on
good or bad reception of TLPs by returning an ACK or a NAK DLLP. Reception of an ACK
DLLP by the transmitter indicates that the receiver has received one or more TLP(s)
successfully. Reception of a NAK DLLP by the transmitter indicates that the receiver has
received one or more TLP(s) in error. Device A, upon receiving a NAK DLLP, re-sends the associated TLP(s), which will hopefully arrive at the receiver without error.
Definition: As used in this chapter, the term Transmitter refers to the device that sends TLPs.
Definition: As used in this chapter, the term Receiver refers to the device that receives TLPs.
Elements of the ACK/NAK Protocol
Figure 5-3 is a block diagram of a transmitter and a remote receiver connected via a Link. The
diagram shows all of the major Data Link Layer elements associated with reliable TLP transfer
from the transmitter's Transaction Layer to the receiver's Transaction Layer. Packet order is
maintained by the transmitter's and receiver's Transaction Layer.
Figure 5-4 on page 215 illustrates the transmitter Data Link Layer elements associated with
processing of outbound TLPs and inbound ACK/NAK DLLPs.
The replay buffer stores TLPs with all fields including the Data Link Layer-related Sequence
Number and LCRC fields. The TLPs are saved in the order of arrival from the Transaction
Layer before transmission. Each TLP in the Replay Buffer contains a Sequence Number which
is incrementally greater than the sequence number of the previous TLP in the buffer.
When the transmitter receives acknowledgement via an ACK DLLP that TLPs have reached the
receiver successfully, it purges the associated TLPs from the Replay Buffer. If, on the other
hand, the transmitter receives a NAK DLLP, it replays (i.e., re-transmits) the contents of the
buffer.
NEXT_TRANSMIT_SEQ Counter
This counter generates the Sequence Number assigned to each new transmitted TLP. The
counter is a 12-bit counter that is initialized to 0 at reset, or when the Data Link Layer is in the
inactive state. It increments until it reaches 4095 and then rolls over to 0 (i.e., it is a modulo
4096 counter).
LCRC Generator
The LCRC Generator provides a 32-bit LCRC for the TLP. The LCRC is calculated using all
fields of the TLP including the Header, Data Payload, ECRC and Sequence Number. The
receiver uses the TLP's LCRC field to check for a CRC error in the received TLP.
REPLAY_NUM Count
This 2-bit counter stores the number of replay attempts following either reception of a NAK
DLLP, or a REPLAY_TIMER time-out. When the REPLAY_NUM count rolls over from 11b to
00b, the Data Link Layer triggers a Physical Layer Link-retrain (see the description of the
LTSSM recovery state on page 532). It waits for completion of re-training before attempting to
transmit TLPs once again. The REPLAY_NUM counter is initialized to 00b at reset, or when the
Data Link Layer is inactive. It is also reset whenever an ACK is received, indicating that forward
progress is being made in transmitting TLPs.
REPLAY_TIMER Count
The REPLAY_TIMER is used to measure the time from when a TLP is transmitted until an
associated ACK or NAK DLLP is received. The REPLAY_TIMER is started (or restarted, if
already running) when the last Symbol of any TLP is sent. It restarts from 0 each time that
there are outstanding TLPs in the Replay Buffer and an ACK DLLP is received that references
a TLP still in the Replay Buffer. It resets to 0 and holds (until the restart conditions are met) when there are no outstanding TLPs in the Replay Buffer, for each NAK received (except during a replay), and when the REPLAY_TIMER expires. It is not advanced (i.e., its value remains fixed) during Link re-training.
ACKD_SEQ Count
This 12-bit register tracks or stores the Sequence Number of the most recently received ACK
or NAK DLLP. It is initialized to all 1s at reset, or when the Data Link Layer is inactive. This
register is updated with the AckNak_Seq_Num [11:0] field of a received ACK or NAK DLLP.
The ACKD_SEQ count is compared with the NEXT_TRANSMIT_SEQ count.
This block checks for CRC errors in DLLPs returned from the receiver. Good DLLPs are further
processed. If a DLLP CRC error is detected, the DLLP is discarded and an error reported. No
further action is taken.
Definition: The Data Link Layer is in the inactive state when the Physical Layer reports that the
Link is non-operational or nothing is connected to the Port. The Physical Layer is in the non-
operational state when the Link Training and Status State Machine (LTSSM) is in the Detect,
Polling, Configuration, Disabled, Reset or Loopback states during which LinkUp = 0 (see
Chapter 14 on 'Link Initialization and Training'). While in the inactive state, the Data Link Layer
state machines are initialized to their default values and the Replay Buffer is cleared. The Data
Link Layer exits the inactive state when the Physical Layer reports LinkUp = 1 and the Link
Disable bit of the Link Control register = 0.
Figure 5-5 on page 218 illustrates the receiver Data Link Layer elements associated with
processing of inbound TLPs and outbound ACK/NAK DLLPs.
The receive buffer temporarily stores received TLPs while TLP CRC and Sequence Number
checks are performed. If there are no errors, the TLP is processed and transferred to the
receiver's Transaction Layer. If there are errors associated with the TLP, it is discarded and a
NAK DLLP may be scheduled (more on this later in this chapter). If the TLP is a duplicate TLP
(more on this later in this chapter), it is discarded and an ACK DLLP is scheduled. If the TLP is
a 'nullified' TLP, it is discarded and no further action is taken (see "Switch Cut-Through Mode"
on page 248).
This block checks for LCRC errors in the received TLP using the TLP's 32-bit LCRC field.
NEXT_RCV_SEQ Count
The 12-bit NEXT_RCV_SEQ counter keeps track of the next expected TLP's Sequence
Number. This counter is initialized to 0 at reset, or when the Data Link Layer is inactive. This
counter is incremented once for each good TLP received that is forwarded to the Transaction
Layer. The counter rolls over to 0 after reaching a value of 4095. The counter is not
incremented for TLPs received with CRC error, nullified TLPs, or TLPs with an incorrect
Sequence Number.
After the CRC error check, this block verifies that a received TLP's Sequence Number matches
the NEXT_RCV_SEQ count.
If the TLP's Sequence Number = NEXT_RCV_SEQ count, the TLP is accepted, processed and forwarded to the Transaction Layer, and the NEXT_RCV_SEQ count is incremented. The receiver continues to process inbound TLPs and does not have to return an ACK DLLP until the ACKNAK_LATENCY_TIMER expires or exceeds its set value.

If the TLP's Sequence Number is earlier than the NEXT_RCV_SEQ count, the TLP is a duplicate; it is discarded and an ACK DLLP is scheduled for return to the transmitter.

If the TLP's Sequence Number is later than the NEXT_RCV_SEQ count, or for any case other than the above conditions, the TLP is discarded and a NAK DLLP may be scheduled (more on this later) for return to the transmitter.
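The decision above can be sketched as follows (our illustration; the 'earlier versus later' test uses modulo-4096 arithmetic, and the exact threshold convention is defined by the specification, not by this code):

def receiver_check(tlp_seq, next_rcv_seq, crc_ok):
    """Receiver-side handling sketch based on the rules above. Returns the
    action plus the new NEXT_RCV_SEQ value. Illustrative only."""
    if not crc_ok:
        return "discard, schedule NAK (AckNak_Seq_Num = NEXT_RCV_SEQ - 1)", next_rcv_seq
    if tlp_seq == next_rcv_seq:
        return "accept, forward to Transaction Layer", (next_rcv_seq + 1) % 4096
    # Modulo-4096 distance: a small positive offset means 'earlier' (a duplicate).
    if (next_rcv_seq - tlp_seq) % 4096 <= 2048:
        return "discard duplicate, schedule ACK", next_rcv_seq
    return "discard, schedule NAK (AckNak_Seq_Num = NEXT_RCV_SEQ - 1)", next_rcv_seq

assert receiver_check(5, 5, True)[1] == 6            # in-order TLP accepted
assert "duplicate" in receiver_check(3, 5, True)[0]  # earlier TLP treated as a duplicate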
NAK_SCHEDULED Flag
The NAK_SCHEDULED flag is set when the receiver schedules a NAK DLLP to return to the
remote transmitter. It is cleared when the receiver sees the first TLP associated with the replay
of a previously-Nak'd TLP. The specification is unclear about whether the receiver should
schedule additional NAK DLLPs for bad TLPs received while the NAK_SCHEDULED flag is set.
It is the authors' interpretation that the receiver must not schedule the return of additional NAK
DLLPs for subsequently received TLPs while the NAK_SCHEDULED flag remains set.
ACKNAK_LATENCY_TIMER
The ACKNAK_LATENCY_TIMER monitors the elapsed time since the last ACK or NAK DLLP
was scheduled to be returned to the remote transmitter. The receiver uses this timer to ensure
that it processes TLPs promptly and returns an ACK or a NAK DLLP when the timer expires or
exceeds its set value. The timer value is set based on a formula described in "Receivers
ACKNAK_LATENCY_TIMER" on page 237.
This block generates the ACK or NAK DLLP upon command from the LCRC or Sequence
Number check block. The ACK or NAK DLLP contains an AckNak_Seq_Num[11:0] field derived from the NEXT_RCV_SEQ counter; its value is always equal to NEXT_RCV_SEQ count - 1.
ACK/NAK DLLP Format
The format of an ACK or NAK DLLP is illustrated in Figure 5-6 on page 219. Table 5-6
describes the ACK or NAK DLLP Fields.
Field Name (DLLP Byte/Bit): Function

AckNak_Seq_Num 11:0 (Byte 3 Bit 7:0 and Byte 2 Bit 3:0):

For an ACK DLLP:

- For good TLPs received with Sequence Number = NEXT_RCV_SEQ count (count before incrementing), use NEXT_RCV_SEQ count - 1 (count after incrementing minus 1).

- For a TLP received with a Sequence Number earlier than the NEXT_RCV_SEQ count (a duplicate TLP), use NEXT_RCV_SEQ count - 1.

For a NAK DLLP:

- Associated with a TLP that fails the CRC check, use NEXT_RCV_SEQ count - 1.

- For a TLP received with a Sequence Number later than the NEXT_RCV_SEQ count, use NEXT_RCV_SEQ count - 1.

Upon receipt, the transmitter purges TLPs with equal or earlier Sequence Numbers and replays the remaining TLPs.

16-bit CRC (Byte 5 Bit 7:0 and Byte 4 Bit 7:0): 16-bit CRC used to protect the contents of this DLLP. The calculation is made on Bytes 0-3 of the ACK/NAK.
ACK/NAK Protocol Details
This section describes the detailed transmitter and receiver behavior in processing TLPs and
ACK/NAK DLLPs. The examples demonstrate flow of TLPs from transmitter to the remote
receiver in both the normal non-error case, as well as the error cases.
This section delves deeper into the ACK/NAK protocol. Consider the transmit side of a device's
Data Link Layer shown in Figure 5-4 on page 215.
Sequence Number
Before a transmitter sends TLPs delivered by the Transaction Layer, the Data Link Layer appends a 12-bit Sequence Number to each TLP. The Sequence Number is generated by the 12-bit NEXT_TRANSMIT_SEQ counter. The counter is initialized to 0 at reset, or when the Data Link Layer is in the inactive state. It increments after each new TLP is transmitted until it reaches its maximum value of 4095, and then rolls over to 0. For each new TLP sent, the transmitter appends the Sequence Number from the NEXT_TRANSMIT_SEQ counter.

Keep in mind that an incremented Sequence Number does not necessarily mean a numerically greater Sequence Number (since the counter rolls over to 0 after it reaches its maximum value of 4095).
32-Bit LCRC
The transmitter also appends a 32-bit LCRC (Link CRC) calculated based on TLP contents
which include the Header, Data Payload, ECRC and Sequence Number.
General
Before a device transmits a TLP, it stores a copy of the TLP in a buffer associated with the
Data Link Layer referred to as the Replay Buffer (the specification uses the term Retry Buffer).
Each buffer entry stores a complete TLP with all of its fields including the Header (up to 16
bytes), an optional Data Payload (up to 4KB), an optional ECRC (up to four bytes), the
Sequence Number (12-bits wide, but occupies two bytes) and the LCRC field (four bytes). The
buffer size is unspecified. The buffer should be big enough to store transmitted TLPs that have
not yet been acknowledged via ACK DLLPs.
When the transmitter receives an ACK DLLP, it purges from the Replay Buffer the TLPs with Sequence Numbers equal to or earlier than the Sequence Number received with the ACK DLLP.
When the transmitter receives NAK DLLPs, it purges the Replay Buffer of TLPs with Sequence
Numbers that are equal to or earlier than the Sequence Number that arrives with the NAK and
replays (re-transmits) TLPs of later Sequence Numbers (the remainder TLPs in the Replay
Buffer). This implies that a NAK DLLP inherently acknowledges TLPs with equal to or earlier
Sequence Numbers than the AckNak_Seq_Num[11:0] of the NAK DLLP and replays the
remainder TLPs in the Replay Buffer. Efficient replay strategies are discussed later.
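A minimal sketch of that purge-and-replay behavior, using a modulo-4096 'equal to or earlier' test (our illustration, not the specification's pseudocode):

from collections import OrderedDict

def seq_at_or_before(seq, ref):
    """True if 'seq' is equal to or earlier than 'ref' in modulo-4096 order."""
    return (ref - seq) % 4096 < 2048

def handle_ack_nak(replay_buffer, acknak_seq_num, is_nak):
    """Transmitter behavior sketched from the description above: purge TLPs
    acknowledged (explicitly by an ACK, or implicitly by a NAK), then, for a
    NAK, return the remaining TLPs for replay in order. Illustrative only."""
    for seq in list(replay_buffer):
        if seq_at_or_before(seq, acknak_seq_num):
            del replay_buffer[seq]                 # acknowledged: purge from the Replay Buffer
    return list(replay_buffer.items()) if is_nak else []

# Example: TLPs 4094, 4095, 0, 1, 2 are outstanding; a NAK with
# AckNak_Seq_Num = 4095 purges 4094-4095 and replays 0, 1, 2.
buf = OrderedDict((s, b"tlp") for s in (4094, 4095, 0, 1, 2))
assert [s for s, _ in handle_ack_nak(buf, 4095, is_nak=True)] == [0, 1, 2]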
The Replay Buffer should be large enough so that, under normal operating conditions, TLP
transmissions are not throttled due to a Replay Buffer full condition. To determine what buffer
size to implement, one must consider the following:
Delays caused by the physical Link interconnect and the Physical Layer implementations.

Receiver L0s exit latency to L0; i.e., the buffer should ideally be big enough to hold TLPs while a Link that is in L0s returns to L0.
General
If the transmitter receives an ACK DLLP, it has positive confirmation that its transmitted TLP(s)
have reached the receiver successfully. The transmitter associates the Sequence Number
contained in the ACK DLLP with TLP entries contained in the Replay Buffer.
A single ACK DLLP returned by the receiver Device B may be used to acknowledge multiple
TLPs. It is not necessary that every TLP transmitted must have a corresponding ACK DLLP
returned by the remote receiver. This is done to conserve bandwidth by reducing the ACK
DLLP traffic on the bus. The receiver gathers multiple TLPs and then collectively acknowledges
them with one ACK DLLP that corresponds to the last received good TLP. In InfiniBand, this is
referred to as ACK coalescing.
The transmitter's response to reception of an ACK DLLP is illustrated by the following examples.
Example 1
Consider Figure 5-7 on page 223, with the emphasis on the transmitter Device A.
Device B receives TLPs with Sequence Numbers 3, 4, 5 in that order. TLPs 6 and 7 are still en
route.
Device B performs the error checks and collectively acknowledges good receipt of TLPs 3,
4, 5 with the return of an ACK DLLP with a Sequence Number of 5.
When Device B receives TLPs 6 and 7, steps 3 through 5 may be repeated for those packets as
well.
Figure 5-7. Example 1 that Shows Transmitter Behavior with Receipt of an
ACK DLLP
Example 2
1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, 2 where TLP
4094 is the first TLP sent and TLP 2 is the last TLP sent.
Device B receives TLPs with Sequence Numbers 4094, 4095, 0, 1 in that order. TLP 2 is still
en route.
Device B performs the error checks and collectively acknowledges good receipt of TLPs
4094, 4095, 0, 1 with the return of an ACK DLLP with a Sequence Number of 1.
When Device B ultimately receives TLP 2, steps 3 through 5 may be repeated for TLP 2.
A NAK DLLP received by the transmitter implies that a TLP transmitted at an earlier time was
received in error at the receiver. The transmitter first purges from the Replay Buffer any TLPs
with Sequence Numbers equal to or earlier than the NAK DLLP's AckNak_Seq_Num[11:0]. It
then replays (retries) the remaining TLPs, starting with the TLP whose Sequence Number
immediately follows the AckNak_Seq_Num[11:0] of the NAK DLLP and ending with the newest
TLP. The transmitter's additional responses to reception of a NAK DLLP are described in the
sections that follow.
TLP Replay
When a Replay becomes necessary, the transmitter blocks the delivery of new TLPs by the
Transaction Layer. It then replays (re-sends or retries) the contents of the Replay Buffer
starting with the earliest TLP first (of Sequence Number = AckNak_Seq_Num[11:0] + 1) until
the remainder of the Replay Buffer is replayed. After the replay event, the Data Link Layer
unblocks acceptance of new TLPs from the Transaction Layer. The transmitter continues to
save the TLPs just replayed until they are finally acknowledged at a later time.
A more efficient design might begin processing the ACK/NAK DLLPs while the transmitter is still
in the act of replaying. By doing so, newly received ACK DLLPs are used to purge the Replay
Buffer even while replay is in progress. If another NAK DLLP is received in the meantime, at the
very least, the TLPs that were acknowledged have been purged and would not be replayed.
During replay, if multiple ACK DLLPs are received, the last-received ACK DLLP (with the latest
Sequence Number) collapses the earlier ACK DLLPs with earlier Sequence Numbers. During the
replay, the transmitter can concurrently purge TLPs with Sequence Numbers equal to or earlier
than the AckNak_Seq_Num[11:0] of the last received ACK DLLP.
1. Device A transmits TLPs with Sequence Number 4094, 4095, 0, 1, and 2, where TLP
4094 is the first TLP sent and TLP 2 is the last TLP sent.
Device B receives TLPs 4094, 4095, and 0 in that order. TLPs 1 and 2 are still en route.
Device B receives TLP 4094 with no error and hence the NEXT_RCV_SEQ count increments to
4095.
Device B detects an error in TLP 4095 and therefore schedules the return of a NAK DLLP with
Sequence Number 4094 (NEXT_RCV_SEQ count - 1).
Device A receives NAK 4094 and blocks acceptance of new TLPs from its Transaction Layer
until replay completes.
Device A first purges TLP 4094 (and earlier TLPs; none in this example).
Device A then replays TLPs 4095, 0, 1, and 2, but does not purge them.
Each time the transmitter receives a NAK DLLP, it replays the Replay Buffer contents. The
transmitter uses a 2-bit Replay Number counter, referred to as the REPLAY_NUM counter, to
keep track of the number of replay events. Reception of a NAK DLLP increments
REPLAY_NUM. This counter is initialized to 0 at reset, or when the Data Link Layer is inactive.
It is also reset if an ACK or NAK DLLP is received with a later Sequence Number than that
contained in the ACKD_SEQ register. As long as forward progress is made in transmitting
TLPs, the REPLAY_NUM counter resets. When a fourth NAK is received, indicating that no
forward progress has been made after several tries, the counter rolls over to zero. The
transmitter does not replay the TLPs a fourth time but instead signals a Replay Number
Rollover error. The
device assumes that the Link is non-functional or that there is a Physical Layer problem at
either the transmitter or receiver end.
A transmitter's Data Link Layer triggers the Physical Layer to re-train the Link. The Physical
Layer Link Training and Status State Machine (LTSSM) enters the Recovery State (see
"Recovery State" on page 532). The Replay Number Rollover error bit is set ("Advanced
Correctable Error Handling" on page 384) in the Advanced Error Reporting registers (if
implemented). The Replay Buffer contents are preserved and the Data Link Layer is not
initialized by the re-training process. Upon Physical Layer re-training exit, assuming that the
problem has been cleared, the transmitter resumes the same replay process again. Hopefully,
the TLPs can be re-sent successfully on this attempt.
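A minimal sketch of the REPLAY_NUM bookkeeping just described (the class and method names are illustrative, not from the specification):

class ReplayNumTracker:
    # Illustrative model of the 2-bit REPLAY_NUM counter behavior.

    def __init__(self):
        self.replay_num = 0  # 2-bit counter: 0..3

    def on_forward_progress(self):
        # An ACK or NAK with a Sequence Number later than ACKD_SEQ indicates
        # forward progress, so the counter is reset.
        self.replay_num = 0

    def on_replay_trigger(self):
        # Called on NAK reception or REPLAY_TIMER expiration. Returns True on
        # the fourth consecutive trigger (rollover from 11b to 00b), meaning
        # the device signals a Replay Number Rollover error and requests Link
        # re-training instead of replaying a fourth time.
        self.replay_num = (self.replay_num + 1) & 0b11
        return self.replay_num == 0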
The specification does not address a device's handling of repeated re-train attempts. The
author recommends that a device track the number of re-train attempts. After a re-train count
rollover, the device could signal a Data Link Layer protocol error with a severity of
Uncorrectable Fatal Error.
The transmitter implements a REPLAY_TIMER to measure the time from when a TLP is
transmitted until the transmitter receives an associated ACK or NAK DLLP from the remote
receiver. A formula (described below) determines the timer's expiration period. Timer expiration
triggers a replay event and the REPLAY_NUM count increments. A time-out may arise if an
ACK or NAK DLLP is lost en route, or because of an error in the receiver that prevents it from
returning an ACK or NAK DLLP. Timer-related rules are:
The Timer starts (if not already started) when the last symbol of any TLP is transmitted.
The Timer restarts from zero when:
- A Replay event occurs and the last symbol of the first replayed TLP is transmitted.
- An ACK DLLP is received, as long as there are unacknowledged TLPs remaining in the
Replay Buffer.
REPLAY_TIMER Equation
The timer is loaded with a value that reflects the worst-case latency for the return of an ACK or
NAK DLLP. This time depends on the maximum data payload allowed for a TLP and the width
of the Link.
- TLP Overhead includes the additional TLP fields beyond the data payload (header,
digest, LCRC, and Start/End framing symbols). In the specification, the overhead value is
treated as a constant of 28 symbols.
- The Ack Factor is a fudge factor that represents the number of maximum-sized TLPs
(based on Max_Payload) that can be received before an ACK DLLP must be sent. The AF
value ranges from 1.0 to 3.0 and is used to balance Link bandwidth efficiency and Replay
Buffer size. Figure 5-10 on page 229 summarizes the Ack Factor values for various Link
widths and payloads. These Ack Factor values are chosen to allow implementations to
achieve good performance without requiring a large uneconomical buffer.
- Internal Delay is the receiver's internal delay between receiving a TLP, processing it at
the Data Link Layer, and returning an ACK or NAK DLLP. It is treated as a constant of 19
symbol times in these calculations.
- Rx_L0s_Adjustment is the time required by the receive circuits to exit from L0s to L0,
expressed in symbol times.
REPLAY_TIMER Summary Table
Figure 5-10 on page 229 is a summary table that shows possible timer load values with various
variables plugged into the REPLAY_TIMER equation.
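As a sketch of that calculation, assuming the REPLAY_TIMER formula given in the PCI Express Base Specification (the division by Link width converts the byte counts to symbol times per Lane; the constants are the 28- and 19-symbol values quoted above, and the example Ack Factor is illustrative):

def replay_timer_load(max_payload, link_width, ack_factor, rx_l0s_adjustment=0):
    # Sketch of the REPLAY_TIMER load value in symbol times, assuming the Base
    # Specification formula and the constant overhead and internal delay above.
    TLP_OVERHEAD = 28
    INTERNAL_DELAY = 19
    per_tlp = (max_payload + TLP_OVERHEAD) * ack_factor / link_width
    return (per_tlp + INTERNAL_DELAY) * 3 + rx_l0s_adjustment

# Illustrative example: x4 Link, 256-byte Max_Payload, Ack Factor of 1.4
# (the actual Ack Factor for a given Link width and payload appears in
# Figure 5-10).
print(replay_timer_load(max_payload=256, link_width=4, ack_factor=1.4))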
The DLLP CRC Error Checking block determines whether there is a CRC error in the received
DLLP. The DLLP includes a 16-bit CRC for this purpose (see Table 5-1 on page 219). If there
are no DLLP CRC errors, then the DLLPs are further processed. If a DLLP CRC error is
detected, the DLLP is discarded, and the error is reported as a DLLP CRC error to the error
handling logic which logs the error in the optional Advanced Error Reporting registers (see Bad
DLLP in "Advanced Correctable Error Handling" on page 384). No further action is taken.
Discarding an ACK or NAK DLLP received in error is not a severe response because a
subsequently received DLLP will accomplish the same goal as the discarded DLLP. The side
effect of this action is that associated TLPs are purged a little later than they would have been
or that a replay happens at a later time. If a subsequent DLLP is not received in time, the
transmitter REPLAY_TIMER expires anyway, and the TLPs are replayed.
Consider the receive side of a device's Data Link Layer shown in Figure 5-5 on page 218.
TLPs received at the Physical Layer are checked for STP and END framing errors as well as
other receiver errors such as disparity errors. If there are no errors, the TLPs are passed to
the Data Link Layer. If there are any errors, the TLP is discarded and the allocated storage is
freed up. The Data Link Layer is informed of this error so that it can schedule a NAK DLLP.
(see "Receiver Schedules a NAK" on page 233).
The receiver accepts TLPs from the Link into a receiver buffer and checks for CRC errors. The
receiver calculates an expected LCRC value based on the received TLP (excluding the LCRC
field) and compares this value with the TLP's 32-bit LCRC. If the two match, the TLP is good. If
the two LCRC values do not match, the received TLP is bad and the receiver schedules a NAK
DLLP to be returned to the remote transmitter. The receiver also checks for other types of non-
CRC related errors (such as those described in the next section).
The receiver keeps track of the next expected TLP's Sequence Number via a 12-bit counter
referred to as the NEXT_RCV_SEQ counter. This counter is initialized to 0 at reset, or when
the Data Link Layer is inactive. This counter is incremented once for each good TLP that is
received and forwarded to the Transaction Layer. The counter rolls over to 0 after reaching a
value of 4095.
The receiver uses the NEXT_RCV_SEQ counter to identify the Sequence Number that should
be in the next received TLP. If a received TLP has no LCRC error, the device compares its
Sequence Number with the NEXT_RCV_SEQ count. Under normal operational conditions, these
two numbers should match. If this is the case, the receiver accepts the TLP, forwards the TLP
to the Transaction Layer, increments the NEXT_RCV_SEQ counter and is ready for the next
TLP. An ACK DLLP may be scheduled for return if the ACKNAK_LATENCY_TIMER expires or
exceeds its set value. The receiver is ready to perform a comparison on the next received
TLP's Sequence Number.
In some cases, a received TLP's Sequence Number may not match the NEXT_RCV_SEQ
count. The received TLP's Sequence Number may be either logically greater than or logically
less than NEXT_RCV_SEQ count (a logical number in this case accounts for the count rollover,
so in fact a logically greater number may actually be a lower number if the count rolls over).
See "Receiver Sequence Number Check" on page 234 for details on these two abnormal
conditions.
For a TLP received with a CRC error, a nullified TLP, or a TLP that fails the Sequence
Number check described above, the NEXT_RCV_SEQ counter is not incremented.
If the receiver does not detect an LCRC error (see "Received TLP Error Check" on page 230)
or a Sequence Number related error (see "Next Received TLP's Sequence Number" on page
230) associated with a received TLP, it accepts the TLP and sends it to the Transaction Layer.
The NEXT_RCV_SEQ counter is incremented and the receiver is ready for the next TLP. At this
point, the receiver can schedule an ACK DLLP with the Sequence Number of the received TLP
(see the AckNak_Seq_Num[11:0] field described in Table 5-1 on page 219). Alternatively, the
receiver could also wait for additional TLPs and schedule an ACK DLLP with the Sequence
Number of the last good TLP received.
The receiver is allowed to accumulate a number of good TLPs and then send one aggregate
ACK DLLP with the Sequence Number of the latest good TLP received. The coalesced ACK
DLLP acknowledges the good receipt of a collection of TLPs starting with the oldest TLP in the
transmitter's Replay Buffer and ending with the TLP being acknowledged by the current ACK
DLLP. By doing so, the receiver optimizes the use of Link bandwidth by reducing ACK DLLP
traffic. The frequency with which ACK DLLPs are scheduled for return is described in
"Receivers ACKNAK_LATENCY_TIMER" on page 237. When the ACKNAK_LATENCY_TIMER
expires or exceeds its set value and TLPs have been received, an ACK DLLP with the Sequence
Number of the last good TLP is returned to the transmitter.
When the receiver schedules an ACK DLLP to be returned to the remote transmitter, the
receiver might have other packets (TLPs, DLLPs or PLPs) enqueued that also have to be
transmitted on the Link in the same direction as the ACK DLLP. This implies that the receiver
may not immediately return the ACK DLLP to the transmitter, especially if a large TLP (with up
to a 4KB data payload) is already being transmitted (see "Recommended Priority To Schedule
Packets" on page 244).
The receiver continues to receive TLPs and as long as there are no detected errors (LCRC or
Sequence Number errors), it forwards the TLPs to the Transaction Layer. When the receiver
has the opportunity to return the ACK DLLP to the remote transmitter, it appends the Sequence
Number of the latest good TLP received and returns the ACK DLLP. Upon receipt of the ACK
DLLP, the remote transmitter purges its Replay Buffer of the TLPs with matching Sequence
Numbers and all TLPs transmitted earlier than the acknowledged TLP.
Example: Consider Figure 5-11 on page 233, with focus on the receiver Device B.
1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP
4094 is the first TLP sent and TLP 2 is the last TLP sent.
Device B receives TLPs with Sequence Numbers 4094, 4095, 0, and 1, in that order.
NEXT_RCV_SEQ count increments to 2. TLP 2 is still en route.
Device B performs error checks and issues a coalesced ACK to collectively acknowledge
receipt of TLPs 4094, 4095, 0, and 1, with the return of an ACK DLLP with Sequence Number
of 1.
When Device B ultimately receives TLP 2, steps 3 and 4 may be repeated for TLP 2.
Figure 5-11. Example that Shows Receiver Behavior with Receipt of Good
TLP
NAK Scheduled Flag
The receiver implements a Flag bit referred to as the NAK_SCHEDULED flag. When a receiver
detects a TLP CRC error, or any other non-CRC related error that requires it to schedule a
NAK DLLP to be returned, the receiver sets the NAK_SCHEDULED flag and clears it when the
receiver detects replayed TLPs from the transmitter for which there are no CRC errors.
Upon receipt of a TLP, the first type of error condition the receiver may detect is a TLP LCRC
error (see "Received TLP Error Check" on page 230). The receiver discards the bad TLP. If the
NAK_SCHEDULED flag is clear, it schedules a NAK DLLP to return to the transmitter. The
NAK_SCHEDULED flag is then set. The receiver uses the NEXT_RCV_SEQ count - 1 value
as the AckNak_Seq_Num[11:0] field in the NAK DLLP (Table 5-1 on page 219). At the
time the receiver schedules a NAK DLLP to return to the transmitter, the Link may be in use to
transmit other queued TLPs, DLLPs or PLPs. In that case, the receiver delays the transmission
of the NAK DLLP (see "Recommended Priority To Schedule Packets" on page 244). When the
Link becomes available, however, it sends the NAK DLLP to the remote transmitter. The
transmitter replays the TLPs from the Replay Buffer (see "TLP Replay" on page 225).
In the meantime, TLPs currently en route continue to arrive at the receiver. These TLPs have
later Sequence Numbers than the NEXT_RCV_SEQ count. The receiver discards them. The
specification is unclear about whether the receiver should schedule a NAK DLLP for these
TLPs. It is the authors' interpretation that the receiver must not schedule the return of additional
NAK DLLPs for subsequently received TLPs while the NAK_SCHEDULED flag remains set.
The receiver detects a replayed TLP when it receives a TLP with a Sequence Number that
matches the NEXT_RCV_SEQ count. If the replayed TLPs arrive with no errors, the receiver
increments NEXT_RCV_SEQ count and clears the NAK_SCHEDULED flag. The receiver may
schedule an ACK DLLP for return to the transmitter if the ACKNAK_LATENCY_TIMER expires.
The good replayed TLPs are forwarded to the Transaction Layer.
There is a second scenario under which the receiver schedules NAK DLLPs to return to the
transmitter. If the receiver detects a TLP whose Sequence Number is later than the next
expected Sequence Number indicated by the NEXT_RCV_SEQ count, or whose Sequence
Number is separated from the NEXT_RCV_SEQ count by more than 2048, the procedure
described above is repeated. See "Receiver Sequence Number Check" below for the
reasons why this could happen.
The two error conditions just described wherein a NAK DLLP is scheduled for return are
reported as errors associated with the Data Link Layer. The error reported is a bad TLP error
with a severity of correctable.
Every received TLP that passes the CRC check goes through a Sequence Number check. The
received TLP's Sequence Number is compared with the NEXT_RCV_SEQ count. Below are
three possibilities (a code sketch follows this list):
TLP Sequence Number equals NEXT_RCV_SEQ count. This situation results when a
good TLP is received. It also occurs when a replayed TLP is received. The TLP is
accepted and forwarded to the Transaction Layer. NEXT_RCV_SEQ count is incremented
and an ACK DLLP may be scheduled (according to the ACK DLLP scheduling rules
described in "Receiver Schedules An ACK DLLP" on page 231).
TLP Sequence Number is logically earlier than NEXT_RCV_SEQ count (separated by 2048
or less). The TLP is a duplicate of one already received. The receiver discards it and
schedules an ACK DLLP with a Sequence Number of NEXT_RCV_SEQ count - 1.
TLP Sequence Number is logically later than NEXT_RCV_SEQ count. One or more TLPs
have been lost en route. The receiver discards the TLP and, if the NAK_SCHEDULED flag
is clear, sets it and schedules a NAK DLLP with a Sequence Number of NEXT_RCV_SEQ
count - 1.
In addition to guaranteeing reliable TLP transport, the ACK/NAK protocol preserves packet
ordering. The receiver's Transaction Layer receives TLPs in the same order that the transmitter
sent them.
A transmitter correctly orders TLPs according to the ordering rules before transmission in order
to maintain correct program flow and to eliminate potential deadlock and
livelock conditions (see Chapter 8, entitled "Transaction Ordering," on page 315). The Receiver
is required to preserve TLP order (otherwise, application program flow is altered). To
preserve this order, the receiver applies three rules:
When the receiver detects a bad TLP, it discards the TLP and all new TLPs that follow in
the pipeline until the replayed TLPs are detected.
TLPs that arrive after one or more TLPs have been lost are discarded.
For TLPs that arrive after the first bad TLP, the motivation to discard these TLPs, not forward
them to the Transaction Layer, and schedule a NAK DLLP is as follows. When the receiver
detects a bad TLP, it discards it and any new TLPs in the pipeline. The receiver then waits for
the TLP replay. After verifying that there are no errors in the replayed TLP(s), the receiver
forwards them to the Transaction Layer and resumes acceptance of new TLPs in the pipeline.
Doing so preserves TLP receive and acceptance order at the receiver's Transaction Layer.
Example: Consider Figure 5-12 on page 237 with emphasis on the receiver Device B.
1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP
4094 is the first TLP sent and TLP 2 is the last TLP sent.
Device B receives TLPs 4094, 4095, and 0, in that order. TLPs 1 and 2 are still in flight.
Device B receives TLP 4094 with no errors and forwards it to the Transaction Layer.
NEXT_RCV_SEQ count increments to 4095.
Device B detects an LCRC error in TLP 4095 and hence returns a NAK DLLP with a
Sequence Number of 4094 (NEXT_RCV_SEQ count - 1). The NAK_SCHEDULED flag is set.
NEXT_RCV_SEQ count does not increment.
Device B also discards TLP 0, even though it is a good TLP. TLPs 1 and 2 are also discarded
when they arrive.
Device B does not schedule a NAK DLLP for TLPs 0, 1, and 2 because the
NAK_SCHEDULED flag is set.
Device A does not accept any new TLPs from its Transaction Layer.
Device A then replays TLPs 4095, 0, 1, and 2, but continues to save these TLPs in the
Replay Buffer. It then accepts TLPs from the Transaction Layer.
After verifying that there are no CRC errors in the received TLPs, Device B detects TLP
4095 as a replayed TLP because it has a Sequence Number equal to the NEXT_RCV_SEQ count.
The NAK_SCHEDULED flag is cleared.
Device B forwards these TLPs to the Transaction Layer in this order: 4095, 0, 1, and 2.
Figure 5-12. Example that Shows Receiver Behavior When It Receives Bad
TLPs
Receivers ACKNAK_LATENCY_TIMER
The ACKNAK_LATENCY_TIMER measures the duration since an ACK or NAK DLLP was
scheduled for return to the remote transmitter. This timer has a value that is approximately 1/3
that of the transmitter REPLAY_TIMER. When the timer expires, the receiver schedules an
ACK DLLP with a Sequence Number of the last good unacknowledged TLP received. The timer
guarantees that the receiver schedules an ACK or NAK DLLP for a received TLP before the
transmitter's REPLAY_TIMER expires causing it to replay.
The timer resets to 0 and restarts when an ACK or NAK DLLP is scheduled.
ACKNAK_LATENCY_TIMER Equation
The receiver's ACKNAK_ LATENCY_TIMER is loaded with a value that reflects the worst-case
transmission latency in sending an ACK or NAK in response to a received TLP. This time
depends on the anticipated maximum payload size and the width of the Link.
TLP Overhead includes the additional TLP fields beyond the data payload (header, digest,
LCRC, and Start/End framing symbols). In the specification, the overhead value is treated
as a constant of 28 symbols.
The Ack Factor is the largest number of maximum-sized TLPs (based on Max_Payload)
that can be received before an ACK DLLP is sent. The AF value (a fudge factor)
ranges from 1.0 to 3.0, and is used to balance Link bandwidth efficiency and Replay Buffer
size. Figure 5-10 on page 229 summarizes the Ack Factor values for various Link widths
and payloads. These Ack Factor values are chosen to allow implementations to achieve
good performance without requiring a large, uneconomical buffer.
Internal Delay is the receiver's internal delay between receiving a TLP, processing it at the
Data Link Layer, and returning an ACK or NAK DLLP. It is treated as a constant of 19
symbol times in these calculations.
Tx_L0s_Adjustment: If L0s is enabled, the time required for the transmitter to exit L0s,
expressed in symbol times. Note that setting the Extended Sync bit of the Link Control
register affects the exit time from L0s and must be taken into account in this adjustment.
Figure 5-13 on page 239 is a summary table that shows possible timer load values with various
variables plugged into the ACKNAK_LATENCY_TIMER equation. It turns out that the entries in
this table are approximately a third of the REPLAY_TIMER latency values in Figure 5-10 on
page 229.
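A sketch of the corresponding calculation, assuming the Base Specification form of the equation (the function name is illustrative):

def acknak_latency_load(max_payload, link_width, ack_factor, tx_l0s_adjustment=0):
    # Sketch of the ACKNAK_LATENCY_TIMER load value in symbol times, assuming
    # the Base Specification formula; without the L0s adjustments it is one
    # third of the corresponding REPLAY_TIMER value, as noted above.
    TLP_OVERHEAD = 28
    INTERNAL_DELAY = 19
    per_tlp = (max_payload + TLP_OVERHEAD) * ack_factor / link_width
    return per_tlp + INTERNAL_DELAY + tx_l0s_adjustment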
Problem: A TLP is corrupted en route to the receiver.
Solution: The receiver detects the LCRC error and schedules a NAK DLLP with Sequence
Number = NEXT_RCV_SEQ count - 1. The transmitter replays TLPs.
Problem: One or more TLPs are lost en route to the receiver.
Solution: The receiver performs a sequence number check on all received TLPs. The
receiver expects each arriving TLP to carry a 12-bit Sequence Number one greater (with
rollover) than that of the previous TLP. If one or more TLPs are lost en route, a TLP will
arrive with a Sequence Number later than the expected Sequence Number reflected in the
NEXT_RCV_SEQ count. The receiver schedules a NAK DLLP with a Sequence Number =
NEXT_RCV_SEQ count - 1. The transmitter replays the Replay Buffer contents.
Problem: Receiver returns an ACK DLLP, but it is corrupted en route to the transmitter.
The remote Transmitter detects a CRC error in the DLLP (DLLP is covered by 16-bit CRC,
see "ACK/NAK DLLP Format" on page 219). In fact, the transmitter does not know that the
malformed DLLP just received is supposed to be an ACK DLLP. All it knows is that the
packet is a DLLP.
Solution:
- Case 1: The Transmitter discards the DLLP. A subsequent ACK DLLP received with
a later Sequence Number causes the transmitter Replay Buffer to purge all TLPs with
equal and earlier generated Sequence Numbers. The transmitter never knew that
anything went wrong.
- Case 2: The Transmitter discards the DLLP. A subsequent NAK DLLP received with
a later generated Sequence Number causes the transmitter Replay Buffer to purge
TLPs with equal to or earlier Sequence Numbers. The transmitter then replays all TLPs
with later Sequence Numbers, up to the last TLP in the Replay Buffer. The transmitter
never knew that anything went wrong.
Problem: ACK or NAK DLLP for received TLPs are not returned by the receiver by the
proper ACKNAK_LATENCY_TIMER time-out. The associated TLPs remain in the
transmitter Replay Buffer.
Solution: The REPLAY_TIMER times-out and the transmitter replays its Replay Buffer.
Problem: The Receiver returns a NAK DLLP but it is corrupted en route to the transmitter.
The remote Transmitter detects a CRC error in the DLLP. In fact, the transmitter does not
know that the DLLP received is supposed to be an NAK DLLP. All it knows is that the
packet is a DLLP.
Solution: The Transmitter discards the DLLP. The receiver discards all subsequently
received TLPs and awaits the replay. Given that the NAK was discarded by the transmitter,
its REPLAY_TIMER expires and triggers the replay.
Problem: Due to an error in the receiver, it is unable to schedule an ACK or NAK DLLP for
a received TLP.
Solution: The transmitter REPLAY_TIMER will expire and result in TLP replay.
ACK/NAK Protocol Summary
Refer to Figure 5-3 on page 212 and the following subsections for a summary of the elements
of the Data Link Layer.
Transmitter Side
Unless blocked by the Data Link Layer, the Transaction Layer passes down the Header,
Data, and Digest information for each TLP to be sent.
A check is made to see if the acceptance of new TLPs from the Transaction Layer should
be blocked. The transmitter performs a modulo 4096 subtraction of the ACKD_SEQ count
from the NEXT_TRANSMIT_SEQ count to see if the result is >= 2048d. If it is, further
TLPs are blocked until incoming ACK/NAK DLLPs make the condition untrue.
The NEXT_TRANSMIT_SEQ counter increments by one for each TLP processed. Note: if
the transmitter wants to nullify a TLP being sent, it sends an inverted CRC to the physical
layer and indicates an EDB end (End Bad Packet) symbol should be used
(NEXT_TRANSMIT_SEQ is not incremented). See the "Switch Cut-Through Mode" on
page 248 for details.
A 32-bit LCRC value is calculated for the TLP (the LCRC calculation includes the Sequence
Number).
A copy of the TLP is placed in the Replay Buffer and the TLP is forwarded to the Physical
Layer for transmission.
The Physical Layer adds STP and END framing symbols, then transmits the packet.
At a later time, assume the transmitter receives an ACK DLLP from the receiver. It
performs a CRC error check and, if the check fails, discards the ACK DLLP (the same
holds true if a bad NAK DLLP is received). If the check is OK, it purges the Replay buffer
of TLPs from the oldest TLP up to and including the TLP with Sequence Number that
matches the Sequence Number in the ACK DLLP.
Error Case (NAK DLLP Management)
Repeat the process described in the previous section, but this time, assume that the transmitter
receives a NAK DLLP:
Upon receipt of the NAK DLLP with no CRC error, the transmitter performs the following
sequence of steps for the Replay. NOTE: this is the same sequence of events
that would occur if the REPLAY_TIMER expires instead.
- If the REPLAY_NUM count rolls over from 11b to 00b, the transmitter instructs the
Physical Layer to re-train the Link.
- Purge any TLPs of equal or earlier Sequence Numbers than NAK DLLP's
AckNak_Seq_Num[11:0].
- Re-transmit TLPs with later Sequence Numbers than the NAK DLLP's
AckNak_Seq_Num[11:0].
- ACK DLLPs or NAK DLLPs received during replay must be processed. The
transmitter may defer processing them until replay is complete or use them during replay to
skip transmission of newly acknowledged TLPs. Earlier Sequence Numbers can be
collapsed when an ACK DLLP is received with a later Sequence Number. Also, ACK
DLLPs with later Sequence Numbers than a NAK DLLP received earlier supersede the
earlier NAK DLLP.
- When the replay is complete, unblock TLPs and return to normal operation.
Receiver Side
Non-Error Case
TLPs are received at the Physical Layer where they are checked for framing errors and other
receiver-related errors. Assume that there are no errors. If the Physical Layer reports the end
symbol was EDB and the CRC value was inverted, this is not an error condition; discard the
packet and free any allocated space (see "Switch Cut-Through Mode" on page 248). There will
be no ACK or NAK DLLP returned for this case.
Calculate the CRC for the incoming TLP and check it against the LCRC provided with the
packet. If the CRC passes, go to the next step.
Compare the Sequence Number for the inbound packet against the current value in the
NEXT_RCV_SEQ count.
If they are the same, this is the next expected TLP. Forward the TLP to the Transaction
Layer. Also increment the NEXT_RCV_SEQ count.
Error Case
TLPs are received at the Physical Layer where they are checked for framing errors and other
receiver-related errors. In the event of an error, the Physical Layer discards the packet, reports
the error, and frees any storage allocated for the TLP. If the EDB is set and the CRC is not
inverted, this is a bad packet: discard the TLP and set the error flag. If the NAK_SCHEDULED
flag is clear, set it, and schedule a NAK DLLP with the NEXT_RCV_SEQ count - 1 value used
as the Sequence Number.
If there are no Physical Layer errors detected, forward the TLP to the Data Link Layer.
Calculate the CRC for the incoming TLP and check it against the LCRC provided with the
packet. If the CRC fails, set the NAK_SCHEDULED flag. Schedule a NAK DLLP with
NEXT_RCV_SEQ count - 1 used as the Sequence Number. If LCRC error check passes,
go to the next bullet.
If the LCRC check passes, then compare the Sequence Number for the inbound packet
against the current value in the NEXT_RCV_SEQ count. If the TLP Sequence Number is
not equal to NEXT_RCV_SEQ count and if (NEXT_RCV_SEQ - TLP Sequence Number)
mod 4096 <= 2048, the TLP is a duplicate TLP. Discard the TLP, and schedule an ACK
with NEXT_RCV_SEQ count - 1 value used as AckNak_Seq_Num[11:0].
Discard TLPs received with Sequence Number other than the Sequence Number described
by the above bullet. If the NAK_SCHEDULED flag is clear, set it, and schedule a NAK
DLLP with NEXT_RCV_SEQ count - 1 used as AckNak_Seq_Num[11:0]. If the NAK
_SCHEDULED flag bit is already set, keep it set and do not schedule a NAK DLLP.
Recommended Priority To Schedule Packets
A device may have many types of TLPs, DLLPs and PLPs to transmit on a given Link. The
following is a recommended but not required set of priorities for scheduling packets:
PLP transmissions.
NAK DLLP.
ACK DLLP.
Lost TLP
Consider Figure 5-14 on page 245 which shows the ACK/NAK protocol for handling lost TLPs.
Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0. These TLPs are
forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of
NEXT_RCV_SEQ count is 1. Device B is ready to receive TLP 1.
Seeing ACK 0, Device A purges TLPs 4094, 4095, and 0 from its replay buffer.
TLP 2 arrives instead. Upon performing a Sequence Number check, Device B realizes that
TLP 2's Sequence Number is greater than NEXT_RCV_SEQ count.
TLPs 1 and 2 arrive without error at Device B and are forwarded to the Transaction Layer.
Consider Figure 5-15 on page 246 which shows the ACK/NAK protocol for handling a lost ACK
DLLP.
Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0. These TLPs are
forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of
NEXT_RCV_SEQ count is set to 1.
ACK 0 is lost en route. TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.
Device B returns ACK 2 and sends TLPs 1 and 2 to the Transaction Layer.
If ACK 2 is also lost or corrupted, and no further ACK or NAK DLLPs are returned to Device A,
its REPLAY_TIMER will expire. This results in replay of its entire buffer. Device B receives TLP
4094, 4095, 0, 1 and 2 and detects them as duplicate TLPs because their Sequence Numbers
are earlier than NEXT_RCV_SEQ count of 3. These TLPs are discarded and ACK DLLPs with
AckNak_Seq_Num[11:0] = 2 are returned to Device A for each duplicate TLP.
Consider Figure 5-16 on page 247 which shows the ACK/NAK protocol for handling a lost ACK
DLLP followed by a valid NAK DLLP.
Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0. These TLPs are
forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of
NEXT_RCV_SEQ count is 1.
ACK 0 is lost en route. TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.
TLPs 1 and 2 arrive at Device B shortly thereafter. TLP 1 is good and NEXT_RCV_SEQ
count increments to 2. TLP 1 is forwarded to the Transaction Layer.
Device B accepts good TLP 2 and forwards it to the Transaction Layer. NEXT_RCV_SEQ
increments to 3.
Background
Consider an example where a large TLP needs to pass through a switch from one port to
another. Until the tail end of the TLP is received by the switch's ingress port, the switch is
unable to determine if there is a CRC error. Typically, the switch will not forward the packet
through the egress port until it determines that there is no CRC error. This implies that the
latency through the switch is at least the time to clock the packet into the switch. If the packet
needs to pass through many switches to get to the final destination, the latencies would add up,
increasing the time to get from source to destination.
Possible Solution
One option to reduce latency would be to start forwarding the TLP through the switch's egress
port before the tail end of the TLP has been received by the switch ingress port. This is fine as
long as the packet is not corrupted. Consider what would happen if the TLP were corrupt. The
packet would begin transmitting through the egress port before the switch realized that there is
an error. After the switch detects the CRC error, it would return a NAK to the TLP source and
discard the packet, but part of the packet has already been transmitted and its transmission
cannot be cleanly aborted in mid-transmit. There is no point keeping a copy of the bad TLP in
the egress port Replay Buffer because it is bad. The TLP source port would at a later time
replay after receiving the NAK DLLP. The TLP is already outbound and en route to the Endpoint
destination. The Endpoint receives the packet, detects a CRC error, and returns a NAK to the
switch. The switch is expected to replay the TLP, but the switch has already discarded the TLP
due to the detected error on the inbound TLP. The switch is stuck between a rock and a hard
place!
Background
The PCI Express protocol permits the implementation of an optional feature referred to as cut-
through mode. Cut-through is the ability to start streaming a packet through a switch without
waiting for the receipt of the tail end of the packet. If, ultimately, a CRC error is detected when
the CRC is received at the tail end of the packet, the packet that has already begun
transmission from the switch egress port can be 'nullified'.
A nullified packet is a packet that terminates with an EDB symbol as opposed to an END. It
also has an inverted 32-bit LCRC.
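A receiver (or a switch egress port) can therefore recognize a nullified TLP from this pair of markers. A minimal sketch, assuming the check is made after the expected LCRC has been computed (the function name is illustrative):

def is_nullified(end_symbol, received_lcrc, computed_lcrc):
    # A TLP is nullified when it is framed with EDB instead of END and its
    # 32-bit LCRC is the bit-wise inverse of the correctly computed value.
    return end_symbol == "EDB" and received_lcrc == (computed_lcrc ^ 0xFFFFFFFF)

# A nullified TLP is silently discarded; no ACK or NAK DLLP is returned.
# An EDB-framed TLP whose LCRC is not inverted is instead treated as a bad TLP.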
Consider the example in Figure 5-17 that illustrates the cut-through mode of a switch.
A TLP with large data payload passes from the left, through the switch, to the Endpoint on the
right. The steps as the packet is routed through the switch are as follows:
The TLP header at the head of the TLP is decoded by the switch and the packet is
forwarded to the egress port before the switch becomes aware of a CRC error. Finally, the tail
end of the packet arrives in the switch ingress port and it is able to complete a CRC check.
The switch detects a CRC error for which the switch returns a NAK DLLP to the TLP source.
On the egress port, the switch replaces the END framing symbol at the tail end of the bad
TLP with the EDB (End Bad) symbol. The CRC is also inverted from what it would normally be.
The TLP is now 'nullified'. Once the TLP has exited the switch, the switch discards its copy
from the Replay Buffer.
The nullified packet arrives at the Endpoint. The Endpoint detects the EDB symbol and the
inverted CRC and discards the packet.
The Endpoint does not return a NAK DLLP (otherwise the switch would be obliged to
replay).
When the TLP source device receives the NAK DLLP, it replays the packet. This time no error
occurs on the switch's ingress port. As the packet arrives in the switch, the header is decoded
and the TLP is forwarded to the egress port with very short latency. When the tail end of the
TLP arrives at the switch, a CRC check is performed. There is no error, so an ACK is returned
to the TLP source which then purges its replay buffer. The switch stores a copy of the TLP in
its egress port Replay Buffer. When the TLP reaches the destination Endpoint, the Endpoint
device performs a CRC check. The packet is a good packet terminated with the END framing
symbol. There are no CRC errors and so the Endpoint returns an ACK DLLP to the switch. The
switch purges the copy of the TLP from its Replay Buffer. The packet has been routed from
source to destination with minimal latency.
Chapter 6. QoS/TCs/VCs and Arbitration
The Previous Chapter
This Chapter
Quality of Service
Arbitration
The Previous Chapter
The previous chapter detailed the Ack/Nak Protocol that verifies the delivery of TLPs between
each port as they travel between the requester and completer devices. It also detailed the
hardware retry mechanism that is automatically triggered when a TLP transmission error is
detected on a given link.
This Chapter
This chapter discusses Traffic Classes, Virtual Channels, and Arbitration that support Quality of
Service concepts in PCI Express implementations. The concept of Quality of Service in the
context of PCI Express is an attempt to predict the bandwidth and latency associated with the
flow of different transaction streams traversing the PCI Express fabric. The use of QoS is
based on application-specific software assigning Traffic Class (TC) values to transactions,
which define the priority of each transaction as it travels between the Requester and Completer
devices. Each TC is mapped to a Virtual Channel (VC) that is used to manage transaction
priority via two arbitration schemes called port and VC arbitration.
The Next Chapter
The next chapter discusses the purposes and detailed operation of the Flow Control Protocol.
This protocol requires each device to implement credit-based link flow control for each virtual
channel on each port. Flow control guarantees that transmitters will never send Transaction
Layer Packets (TLPs) that the receiver can't accept. This prevents receive buffer over-runs and
eliminates the need for inefficient disconnects, retries, and wait-states on the link. Flow Control
also helps enable compliance with PCI Express ordering rules by maintaining separate virtual
channel Flow Control buffers for three types of transactions: Posted (P), Non-Posted (NP) and
Completions (Cpl).
Quality of Service
Quality of Service (QoS) is a generic term that normally refers to the ability of a network or
other entity (in our case, PCI Express) to provide predictable latency and bandwidth. QoS is of
particular interest when applications require guaranteed bus bandwidth at regular intervals,
such as audio data. To help deal with this type of requirement PCI Express defines isochronous
transactions that require a high degree of QoS. However, QoS can apply to any transaction or
series of transactions that must traverse the PCI Express fabric. Note that QoS can only be
supported when the system and device-specific software is PCI Express aware.
QoS involves the predictability of parameters such as:
Transmission rate
Effective Bandwidth
Latency
Error rate
Several features of PCI Express architecture provide the mechanisms that make QoS
achievable. The PCI Express features that support QoS include:
Traffic Classes (TCs)
Virtual Channels (VCs)
VC Arbitration
Port Arbitration
PCI Express uses these features to support two general classes of transactions that can
benefit from the PCI Express implementation of QoS.
Isochronous Transactions: derived from iso (same) + chronous (time), these transactions require
constant bus bandwidth at regular intervals along with guaranteed latency. Isochronous
transactions are most often used when a synchronous connection is required between two
devices. For example, a CD-ROM drive containing a music CD may be sourcing data to
speakers. A synchronous connection exists when a headset is plugged directly into the drive.
However, when the audio card is used to deliver the audio information to a set of external
speakers, isochronous transactions may be used to simplify the delivery of the data.
PCI Express supports QoS and the associated TC, VC, and arbitration mechanisms so that
isochronous transactions can be performed. A classic example of a device that benefits from
isochronous transaction support is a video camera attached to a tape deck. This real-time
application requires that image and audio data be transferred at a constant rate (e.g., 64
frames/second). This type of application is typically supported via a direct synchronous
attachment between the two devices.
Two devices connected directly perform synchronous transfers. A synchronous source delivers
data directly to the synchronous sink through use of a common reference clock. In our example,
the video camera (synchronous source) sends audio and video data to the tape deck
(synchronous sink), which immediately stores the data in real time with little or no data
buffering, and with only a slight delay due to signal propagation.
When these devices are connected via PCI Express a synchronous connection is not possible.
Instead, PCI Express emulates synchronous connections through the use of isochronous
transactions and data buffering. In this scenario, isochronous transactions can be used to
ensure that a constant amount of data is delivered at specified intervals (100µs in this
example), thus achieving the required transmission characteristics. Consider the following
sequence (Refer to Figure 6-1 on page 254):
1. The synchronous source (video camera and PCI Express interface) accumulates
data in Buffer A during service interval 1 (SI 1).
The camera delivers the accumulated data to the synchronous sink (tape deck) sometime
during the next service interval (SI 2). The camera also accumulates the next block of data in
Buffer B as the contents of Buffer A are delivered.
The tape deck buffers the incoming data (in its Buffer A), which can then be delivered
synchronously for recording on tape during service interval 3. During SI 3 the camera once
again accumulates data into Buffer A, and the cycle repeats.
Differentiated Services
Various types of asynchronous traffic (all traffic other than isochronous) have different priorities
from the system perspective. For example, Ethernet traffic requires higher priority (smaller
latencies) than mass storage transactions. PCI Express software can establish different TC
values and associated virtual channels and can set up the communications paths to ensure
different delivery policies are established as required. Note that the specification does not
define specific methods for identifying delivery requirements or the policies to be used when
setting up differentiated services.
Perspective on QOS/TC/VC and Arbitration
PCI does not include any QoS-related features similar to those defined by PCI Express. Many
questions arise regarding the need for such an elaborate scheme for managing traffic flow
based on QoS and differentiated services. Even without implementing these new features, the
bandwidth available with a PCI Express system is far greater and latencies much shorter than in
PCI-based implementations, due primarily to the topology and higher delivery rates.
Consequently, aside from the possible advantage of isochronous transactions, there appears to
be little advantage to implementing systems that support multiple Traffic Classes and Virtual
Channels.
While this may be true for most desktop PCs, other high-end applications may benefit
significantly from these new features. The PCI Express specification also opens the door to
applications that demand the ability to differentiate and manage system traffic based on Traffic
Class prioritization.
Traffic Classes and Virtual Channels
During initialization a PCI Express device-driver communicates the levels of QoS that it desires
for its transactions, and the operating system returns TC values that correspond to the QoS
requested. The TC value ultimately determines the relative priority of a given transaction as it
traverses the PCI Express fabric. Two hardware mechanisms provide guaranteed isochronous
bandwidth and differentiated services:
VC Arbitration
Port Arbitration
The TC value is carried in the transaction packet header and can contain one of eight values
(TC0-TC7). TC0 must be implemented by all PCI Express devices and the system makes a
"best effort" when delivering transactions with the TC0 label. TC values of TC1-TC7 are
optional and provide seven levels of arbitration for differentiating between packet streams that
require varying amounts of bandwidth. Similarly, eight VC numbers (VC0-VC7) are specified,
with VC0 required and VC1-VC7 optional. ("VC Assignment and TC Mapping" on page 258
discusses VC initialization).
Note that TC0 is hardwired to VC0 in all devices. If configuration software is not PCI Express
aware all transactions will use the default TC0 and VC0; thereby eliminating the possibility of
supporting differentiated services and isochronous transactions. Furthermore, the specification
requires some transaction types to use TC0/VC0 exclusively:
Configuration
I/O
INTx Message
Unlock Message
Set_Slot_Power_Limit Message
Configuration software designed for PCI Express sets up virtual channels for each link in the
fabric. Recall that the default TC and VC assignments following Cold Reset will be TC0 and
VC0, which is used when the configuration software is not PCI Express aware. The number of
virtual channels used depends on the greatest capability shared by the two devices attached to
a given link. Software assigns an ID for each VC and maps one or more TCs to each.
Software checks the number of VCs supported by the devices attached to a common link and
assigns the greatest number of VCs that both devices have in common. For example, consider
the three devices attached to the switch in Figure 6-3 on page 259. In this example, the switch
supports all 8 VCs on each of its ports, while Device A supports only the default VC, Device B
supports 4 VCs, and Device C supports 8 VCs. When configuring VCs for each link, software
determines the maximum number of VCs supported by both devices at each end of the link and
assigns that number to both devices. The VC assignment applies to transactions flowing across
a link in both directions.
Note that even though switch port A supports all 8 VCs, Device A supports a single VC, leaving
7 VCs unused within switch port A. Similarly, only 4 VCs are used by switch port B. Software of
course configures and enables all 8 VCs within switch port C.
Configuration software determines the maximum number of VCs supported by each port
interface by reading its Extended VC Count field contained within the "Virtual Channel
Capability" registers. The smaller of the two values governs the maximum number of VCs
supported by this link for both transmission and reception of transactions. Figure 6-4 on page
260 illustrates the location and format of the Extended VC Count field. Software may restrict
the number of VCs configured and enabled to fewer than actually allowed. This may be done to
achieve the QoS desired for a given platform or application.
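A sketch of that selection, assuming Extended VC Count reports the number of VCs a port supports beyond VC0 (the function name is illustrative):

def negotiate_vc_count(ext_vc_count_a, ext_vc_count_b, software_limit=None):
    # Illustrative selection of the number of VCs to enable on a link. The
    # total per port is ext_vc_count + 1 (VC0 is always present). The smaller
    # capability governs, and software may further restrict the count.
    common = min(ext_vc_count_a, ext_vc_count_b) + 1
    if software_limit is not None:
        common = min(common, software_limit)
    return common

# Example from Figure 6-3: switch port (8 VCs) and Device B (4 VCs) -> 4 VCs.
print(negotiate_vc_count(7, 3))  # prints 4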
Configuration software must assign VC numbers or IDs to each of the virtual channels, except
VC0 which is always hardwired. As illustrated in Figure 6-5 on page 261, the VC Capabilities
registers include 3 DWs used for configuring each VC. The first set of registers (starting at
offset 10h) always applies to VC0. The Extended VC Count field (described above) defines
the number of additional VC register sets implemented by this port, each of which permits
configuration of an additional VC. Note that these register sets are mapped in configuration
space directly following the VC0 registers. The mapping is expressed as an offset from each of
the three VC0 DW registers:
10h + (n*0Ch)
14h + (n*0Ch)
18h + (n*0Ch)
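In other words, each additional VC register set sits at a fixed 0Ch stride from the VC0 registers. A small illustrative helper (the dictionary keys are descriptive labels, not field names taken from the figure):

def vc_register_offsets(n):
    # Byte offsets, within the Virtual Channel Capability structure, of the
    # three DW registers for VC resource n; n = 0 addresses the VC0 set and
    # n = 1..Extended VC Count addresses the additional VCs.
    base = 0x10 + n * 0x0C
    return {"resource_capability": base,
            "resource_control": base + 0x04,
            "resource_status": base + 0x08}

print(vc_register_offsets(1))  # first extended VC: 0x1C, 0x20, 0x24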
Software assigns a VC ID for each of the additional VCs being used via the VC ID field within
the VCn Resource Control Register. (See Figure 6-5) These IDs are not required to be
assigned contiguous values, but the same VC value can be used only once.
The Traffic Class value assigned by a requester to each transaction must be associated with a
VC as it traverses each link on its journey to the recipient. Also, the VC ID associated with a
given TC may change from link to link. Configuration software establishes this association
during initialization via the TC/VC Map field of the VC Resource Control Register. This 8-bit field
permits any TC value to be mapped to the selected VC, where each bit position represents the
corresponding TC value (i.e., bit 0 = TC0:: bit 7 = TC7). Setting a bit assigns the corresponding
TC value to the VC ID. Figure 6-6 shows a mapping example where TC0 and TC1 are mapped
to VC0 and TC2::TC4 are mapped to VC3.
Figure 6-6. TC to VC Mapping Example
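The TC/VC Map field can be viewed as a simple bit mask, one bit per Traffic Class. A minimal sketch that builds the masks for the Figure 6-6 example (the helper name is illustrative):

def tc_vc_map(tcs):
    # Build an 8-bit TC/VC Map value: bit n set means TCn is mapped to this VC.
    mask = 0
    for tc in tcs:
        mask |= 1 << tc
    return mask

# Figure 6-6 example: TC0 and TC1 map to VC0, TC2 through TC4 map to VC3.
vc0_map = tc_vc_map([0, 1])     # 0b00000011
vc3_map = tc_vc_map([2, 3, 4])  # 0b00011100
assert vc0_map & vc3_map == 0   # a TC must not be mapped to more than one VC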
Software is permitted a great deal of flexibility in assigning VC IDs and mapping the associated
TCs. However, the specification states several rules associated with the TC/VC mapping:
TC/VC mapping must be identical for the two ports attached to the same link.
One TC must not be mapped to multiple VCs in any PCI Express Port.
Table 6-1 on page 263 lists a variety of combinations that may be implemented. This is
intended only to illustrate a few combinations, and many more are possible.
TC          VC Assignment   Comment

TC0-TC1     VC0             VCs are not required to be assigned consecutively. Multiple TCs
TC2-TC7     VC7             can be assigned to a single VC.

TC0         VC0             Several transaction types must use TC0/VC0. TCs are not required
TC1         VC1             to be assigned consecutively. Some TC/VC combinations can be used
TC6         VC6             to support an isochronous connection.
TC7         VC7

TC0-TC7     VC0-VC7         All TCs can be assigned to the corresponding VC numbers
                            (TC0 to VC0, TC1 to VC1, and so on).

TC0         VC0             The VC number that is assigned need not match one of the
TC1-TC4     VC6             corresponding TC numbers.

TC0         VC0             Illegal. A TC number can be assigned to only one VC number. This
TC1-TC2     VC1             example shows TC2 mapped to both VC1 and VC2, which is not
TC2         VC2             allowed.
Arbitration
Two types of transaction arbitration provide the method for managing isochronous transactions
and differentiated services:
VC Arbitration determines the priority with which the virtual channels of a transmitting port
are serviced when sending transactions onto the Link.
Port Arbitration determines the priority of transactions with the same VC assignment at
the egress port, based on the priority of the port at which the transactions arrived. Port
arbitration applies to transactions that have the same VC ID at the egress port; therefore, a
port arbitration mechanism exists for each virtual channel supported by the egress port.
Arbitration is also affected by the requirements associated with transaction ordering and flow
control. These additional requirements are discussed in subsequent chapters, but are
mentioned in the context of arbitration as required in the following discussions.
In addition to supporting QoS objectives, VC arbitration should also ensure that forward
progress is made for all transactions. This prevents inadvertent split transaction time-outs. Any
device that both initiates transactions and supports two or more VCs must implement VC
arbitration. Furthermore, other device types that support more than one VC (e.g., switches)
must also support VC arbitration.
Each VC supported and enabled provides its own buffers and flow control.
Transactions mapped to the same VC are issued in strict order (unless the "Relaxed
Ordering" attribute bit is set).
Figure 6-7 on page 265 illustrates the concept of VC arbitration. In this example two VCs are
implemented (VC0 and VC1) and transmission priority is based on a 3:1 ratio, where 3 VC1
transactions are sent to each VC0 transaction. The device core issues transactions (that
include a TC value) to the TC/VC Mapping logic. Based on the associated VC value, the
transaction is routed to the appropriate VC buffer where it awaits transmission. The VC arbiter
determines the VC buffer priority when sending transactions.
This example illustrates the flow of transactions in only one direction. The same logic exists for
transmitting transactions simultaneously in the opposite direction. That is, the root port also
contains transmit buffers and an arbiter and the endpoint device contains receive buffers.
Split Priority Arbitration: VCs are segmented into low- and high-priority groups. The low-
priority group uses some form of round robin arbitration and the high-priority group uses
strict priority.
The originating port can manage the injection rate of high priority transactions, to permit
greater bandwidth for lower priority transactions.
Switches can regulate multiple data flows at the egress port that are vying for link
bandwidth. This method may limit the throughput from high bandwidth applications and
devices that attempt to exceed the limitations of the available bandwidth.
The designer of a device may also limit the number of VCs that participate in strict priority by
specifying a split between the low- and high-priority VCs as discussed in the next section.
As depicted in Figure 6-11 on page 269, the high-priority VCs continue to use strict priority
arbitration, while the low-priority arbitration group uses one of the other prioritization methods
supported by the device. VC Capability Register 2 reports which alternate arbitration methods
are supported for the low priority group, and the VC Control Register permits selection of the
method to be used by this group. See Figure 6-10 on page 268. The low-priority arbitration
schemes include:
Hardware-Based Fixed Arbitration Scheme: the specification permits the vendor to define a
hardware-based fixed arbitration scheme that provides all VCs with the same priority (e.g.,
round robin).
Weighted Round Robin (WRR): with WRR, some VCs can be given higher priority than
others because they have more positions within the round robin than others. The
specification defines three WRR configurations, each with a different number of entries (or
phases).
The weighted round robin (WRR) approach permits software to configure the VC Arbitration
table. The number of arbitration table entries supported by the design is reported in the VC
Arbitration Capability field of Port VC Capability Register 2. The table size is selected by
writing the corresponding value into the VC Arbitration Select field of the Port VC Control
Register. See Figure 6-10 on page 268. Each entry in the table represents one phase that
software loads with a low priority VC ID value. The VC arbiter repeatedly scans all table entries
in a sequential fashion and sends transactions from the VC buffer specified in the table entries.
Once a transaction has been sent, the arbiter immediately proceeds to the next phase.
Software can set up the VC arbitration table such that some VCs are listed in more entries than
others; thereby, allowing differentiation of QoS between the VCs. This gives software
considerable flexibility in establishing the desired priority. Figure 6-12 on page 270 depicts the
weighted round robin VC arbitration concept.
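A minimal sketch of how an arbiter might walk such a table (the function and the queue representation are illustrative; real hardware operates on flow control buffers, not Python lists):

def wrr_vc_arbiter(arbitration_table, vc_queues):
    # Repeatedly scan the VC Arbitration Table phases in order, sending one
    # pending transaction from the VC named by each phase. Weighting comes
    # from listing some VC IDs in more phases than others.
    sent = []
    while any(vc_queues.values()):
        for vc_id in arbitration_table:  # one pass over all phases
            if vc_queues.get(vc_id):
                sent.append((vc_id, vc_queues[vc_id].pop(0)))
    return sent

# Example: VC1 occupies three of four phases (roughly 3:1 weighting over VC0).
table = [1, 1, 1, 0]
queues = {0: ["a", "b"], 1: ["x", "y", "z"]}
print(wrr_vc_arbiter(table, queues))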
The hardware designer may choose to implement one of the round robin forms of VC
arbitration for all VCs. This is accomplished by specifying the highest VC number supported by
the device as a member of the low-priority group (via the Lowest Priority Extended Count field).
In this case, all VC priorities are managed via the VC arbitration table. Note that the VC
arbitration table is not used when the Hardware Fixed Round Robin scheme is selected. See
page 269.
The VC Arbitration Table (VAT) is located at an offset from the beginning of the extended
configuration space as indicated by the VC Arbitration Table Offset field. This offset is
contained within Port VC Capability Register 2. (See Figure 6-13 on page 271.)
The specification does not state how an endpoint should manage the arbitration of data flows
from different functions within an endpoint. However it does state that "Multi-function
Endpoints... should support PCI Express VC-based arbitration control mechanisms if multiple
VCs are implemented for the PCI Express Link." VC arbitration when there are multiple
functions raises interesting questions about the approach to be taken. Of course when the
device functions support only VC0, no VC arbitration is necessary. The specification leaves the
approach open to the designer.
Figure 6-15 on page 274 shows a functional block diagram of an example implementation in
which two functions are implemented within an endpoint device, each of which supports two
VCs. The example approach is based upon the goal of using a standard PCI Express core to
interface both functions to the link. The transaction layer within the core performs the TC/VC
mapping and VC arbitration. The device-specific portion of the design is the function arbiter that
determines the priority of data flows from the functions to the transaction layer of the core.
Following are key considerations for such an approach:
Rather than duplicating the TC/VC mapping within each function, the standard device core
performs the task. An important consideration for this decision is that all functions must use
the same TC/VC mapping. The specification requires that the TC/VC mapping be the same
for devices at each end of a link. This means that each function within the endpoint must
have the same mappings.
The function arbiter uses TC values to determine the priority of transactions being
delivered from the two functions, and selects the highest priority transaction from the
functions when forwarding transactions to the transaction layer of the PCI Express core.
The arbitration algorithm is hardwired based on the applications associated with each
function; a minimal sketch of such a TC-based arbiter follows this list.
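As a rough illustration of such a hardwired policy, the sketch below assumes that a higher TC value means higher priority; the function names, queue contents, and the "highest TC wins" rule are hypothetical, since the specification leaves the policy to the designer.

```python
# Hypothetical head-of-queue transactions presented by two functions sharing
# one PCI Express core. The TC-to-priority mapping is an assumption.
pending = {"F0": {"tc": 1, "tlp": "bulk DMA write"},
           "F1": {"tc": 5, "tlp": "isochronous audio sample"}}

def function_arbiter(pending):
    """Pick the function whose head-of-queue TLP carries the highest TC and
    hand that TLP to the transaction layer of the shared core."""
    winner = max(pending, key=lambda f: pending[f]["tc"])
    return winner, pending[winner]

if __name__ == "__main__":
    func, entry = function_arbiter(pending)
    print(f"{func} wins arbitration: TC{entry['tc']} -> {entry['tlp']}")
```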
Port Arbitration
When traffic from multiple ports vie for limited bandwidth associated with a common egress
port, arbitration is required. The concept of port arbitration is pictured in Figure 6-16 on page
275. Note that port arbitration exists in three locations within a system:
Switch egress ports
Root Complex egress ports that support peer-to-peer transfers between root ports
Root Complex egress ports (configured via the RCRB) that lead to shared resources such as main memory
Figure 6-16. Port Arbitration Concept
Port arbitration requires software configuration. It is handled via PCI-to-PCI bridge (PPB)
configuration registers for switch egress ports and for peer-to-peer transfers within the Root
Complex, and via the Root Complex Register Block (RCRB) when accessing shared Root
Complex resources such as main memory. Port arbitration occurs independently for each virtual channel supported by the egress
port. In the example below, root port 2 supports peer-to-peer transfers from root ports 1 and 2;
however, peer-to-peer transfer support between root complex ports is not required.
Because port arbitration is managed independently for each VC of the egress port or RCRB, a
port arbitration table is required for each VC that supports programmable port arbitration as
illustrated in Figure 6-17 on page 276. Port arbitration tables are supported only by switches
and RCRBs and are not allowed for endpoints, root ports and PCI Express bridges.
1. Transactions arriving at the ingress ports are directed to the appropriate flow
control buffers based on the TC/VC mapping.
Transactions are forwarded from the flow control buffers to the routing logic, which is consulted
to determine the egress port.
Transactions are routed to the egress port (3) where TC/VC mapping determines into which
VC buffer the transactions should be placed.
A set of VC buffers is associated with each of the egress ports. Note that the ingress port
number is tracked until transactions are placed in their VC buffer.
Port arbitration logic determines the order in which transactions are sent from each group of
VC buffers.
The actual port arbitration mechanisms defined by the specification are similar to the models
used for VC arbitration and include:
Non-configurable hardware-fixed arbitration (e.g., round robin)
Weighted Round Robin (WRR) arbitration with a choice of 32, 64, 128, or 256 phases
Time-based Weighted Round Robin arbitration with up to 128 phases
Configuration software must determine the port arbitration capability for a switch or RCRB and
select the port arbitration scheme to be used for each enabled VC. Figure 6-19 on page 278
illustrates the registers and fields involved in determining port arbitration capabilities and
selecting the port arbitration scheme to be used by each VC.
Figure 6-19. Software checks Port Arbitration Capabilities and Selects the
Scheme to be Used
Non-Configurable Hardware-Fixed Arbitration
This port arbitration mechanism does not require configuration of the port arbitration table.
Once selected by software, the mechanism is managed solely by hardware. The actual
arbitration scheme is based on a round-robin or similar approach where each port has the
same priority. This type of mechanism ensures a type of fairness and ensures that all
transactions can make forward progress. However, it does not service the goals of
differentiated services and does not support isochronous transactions.
Like the weighted round robin mechanism used for VC arbitration, software loads the port
arbitration table such that some ports can receive higher priority than others based on the
number of phases in the round robin that are allocated for each port. This approach allows
software to facilitate differentiated services by assigning different weights to traffic coming from
different ports.
As the table is scanned, each table phase specifies a port number that identifies the VC buffer
from which the next transaction is sent. Once the transaction is delivered, arbitration control
logic immediately proceeds to the next phase. For a given port, if no transaction is pending
transmission, the arbiter advances immediately to the next phase.
The specification defines four table lengths for WRR port arbitration, determined by the number
of phases used by the table. The table length selections include:
32 phases
64 phases
128 phases
256 phases
Time-based weighted round robin adds the element of a virtual timeslot for each arbitration
phase. Just as in WRR the port arbiter delivers one transaction from the Ingress Port VC buffer
indicated by the Port Number of the current phase. However, rather than immediately advancing
to the next phase, the time-based arbiter waits until the current virtual timeslot elapses before
advancing. This ensures that transactions are accepted from the ingress port buffer at regular
intervals. Note that the timeslot does not govern the duration of the transfer, but rather the
interval between transfers. The maximum duration of a transaction is the time it takes to
complete the round robin and return to the original timeslot. Each timeslot is defined as 100ns.
Also, it is possible that no transaction is delivered during a timeslot, resulting in an idle timeslot.
This occurs when:
no transaction is pending for the selected ingress port during the current phase, or
insufficient flow control credits are available at the egress port to send the pending transaction.
Time-based WRR arbitration supports a maximum table length of 128 phases. The actual
number of phases implemented is reported via the Maximum Time Slot field of each virtual
channel that supports Timed WRR arbitration. See Figure 6-20 on page 280, which illustrates
the Maximum Time Slots field within the VCn Resource Capability register. See MindShare's
website for a white paper on example applications of Time-Based WRR.
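The timeslot behavior can be sketched with a small model. The assumptions below (that the arbiter always waits out the full 100 ns timeslot, delivers at most one transaction per phase, and marks a phase idle when nothing is pending) follow the description above; the table and pending counts are hypothetical.

```python
TIMESLOT_NS = 100   # each phase occupies one 100 ns virtual timeslot

# Hypothetical time-based port arbitration table (phases name ingress ports)
# and hypothetical pending-transaction counts per ingress port for this VC.
port_arb_table = [0, 1, 0, 1, 0, 2, 0, 1]
pending = {0: 3, 1: 2, 2: 0}

def time_based_wrr(table, pending, passes=1):
    """At most one transaction per timeslot; the arbiter always waits out the
    full timeslot before advancing, so ingress traffic is accepted at regular
    intervals rather than back to back."""
    t_ns, schedule = 0, []
    for _ in range(passes):
        for port in table:
            if pending.get(port, 0) > 0:
                pending[port] -= 1
                schedule.append((t_ns, port))    # transaction delivered
            else:
                schedule.append((t_ns, None))    # idle timeslot
            t_ns += TIMESLOT_NS                  # advance regardless
    return schedule

if __name__ == "__main__":
    for t, port in time_based_wrr(port_arb_table, pending):
        print(f"t={t:4d} ns: " + ("idle" if port is None else f"ingress port {port}"))
```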
A port arbitration table is required for each VC supported by the egress port.
The actual size and format of the Port Arbitration Tables are a function of the number of phases
and the number of ingress ports supported by the Switch, RCRB, or Root Port that supports
peer-to-peer transfers. The maximum number of ingress ports supported by the Port Arbitration
Table is 256 ports. The actual number of bits within each table entry is design dependent and
governed by the number of ingress ports whose transactions can be delivered to the egress
port. The size of each table entry is reported in the 2-bit Port Arbitration Table Entry Size field
of Port VC Capability Register 1. The permissible values are:
00b 1 bit
01b 2 bits
10b 4 bits
11b 8 bits
Configuration software loads each table with port numbers to accomplish the desired port
priority for each VC supported. As illustrated in Figure 6-21 on page 281, the port arbitration
table format depends on the size of each entry and the number of time slots supported by this
design.
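The following sketch shows one way the packed table entries might be decoded from the entry-size encodings listed above. The bit-packing order within each byte is an assumption made for illustration; the excerpt does not define the physical register layout.

```python
# Port Arbitration Table Entry Size field encodings (Port VC Capability Reg 1).
ENTRY_SIZE_BITS = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}

def read_phase(table_bytes, phase, entry_size_field):
    """Extract the ingress port number stored in a given phase of a packed
    port arbitration table (entries assumed packed LSB-first within bytes)."""
    bits = ENTRY_SIZE_BITS[entry_size_field]
    bit_off = phase * bits
    byte_off, shift = divmod(bit_off, 8)
    mask = (1 << bits) - 1
    return (table_bytes[byte_off] >> shift) & mask

if __name__ == "__main__":
    # Hypothetical 8-phase table with 4-bit entries: ports 0,1,0,2,0,1,0,3.
    packed = bytes([0x10, 0x20, 0x10, 0x30])
    print([read_phase(packed, p, 0b10) for p in range(8)])
```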
This section provides an example of a three-port switch with both Port and VC arbitration
illustrated. The example presumes that packets arriving on ingress ports 0 and 1 are moving in
the upstream direction and port 2 is the egress port facing the Root Complex. This example
serves to summarize port and VC arbitration and illustrate their use within a PCI Express
switch. Refer to Figure 6-22 on page 283 during the following discussion.
1. Packets arrive at ingress port 0 and are placed in a receiver flow control buffer
based on TC/VC mapping associated with port 0. As indicated, TLPs carrying traffic
class TC0 or TC1 are sent to the VC0 receiver flow control buffers. TLPs carrying
traffic class TC3 or TC5 are sent to the VC1 receiver flow control buffers. No other
TCs are permitted on this link.
Packets arrive at ingress port 1 and are placed in a receiver flow control buffer based on
port 1 TC/VC mapping. As indicated, TLPs carrying traffic class TC0 are sent to the VC0
receiver flow control buffers. TLPs carrying traffic class TC2-TC4 are sent to the VC3 receiver
flow control buffers. NO OTHER TCs are permitted on this link.
The target egress port is determined from routing information in each packet. Address
routing is applied to memory or IO request TLPs, ID routing is applied to configuration or
completion TLPs, etc.
All packets destined for egress port 2 are subjected to the TC/VC mapping for that port. As
shown, TLPs carrying traffic class TC0-TC2 are managed as virtual channel 0 (VC0) traffic,
TLPs carrying traffic class TC3-TC7 are managed as VC1 traffic.
Independent Port Arbitration is applied to packets within each VC. This may be a fixed or
weighted round robin arbitration used to select packets from all possible different ingress ports.
Port arbitration ultimately results in all VCs of a given type being routed to the same VC buffer.
Following Port Arbitration, VC arbitration determines the order in which transactions pending
transmission within the individual VC buffers will be transferred across the link. The arbitration
algorithm may be fixed or weighted round robin. The arbiter selects transactions from the head
of each VC buffer based on the priority scheme implemented.
Note that the VC arbiter selects packets for transmission only if sufficient flow control credits
exist.
Chapter 7. Flow Control
This Chapter
Because PCI Express is a point-to-point implementation, the Flow Control mechanism would be
ineffective if only one transaction stream were pending transmission across a link. That is, if the
receive buffer were temporarily full, the transmitter would be prevented from sending a
subsequent transaction due to transaction ordering requirements, thereby blocking any further
transfers. PCI Express improves link efficiency by implementing multiple flow-control buffers for
separate transaction streams (virtual channels). Because Flow Control is managed separately
for each virtual channel implemented for a given link, if the Flow Control buffer for one VC is full,
the transmitter can advance to another VC buffer and send transactions associated with it.
The link Flow Control mechanism uses a credit-based mechanism that allows the transmitting
port to check buffer space availability at the receiving port. During initialization each receiver
reports the size of its receive buffers (in Flow Control credits) to the port at the opposite end of
the link. The receiving port continues to update the transmitting port regularly by transmitting the
number of credits that have been freed up. This is accomplished via Flow Control DLLPs.
Flow control logic is located in the transaction layer of the transmitting and receiving devices.
Both transmitter and receiver sides of each device are involved in flow control. Refer to Figure
7-1 on page 287 during the following descriptions.
Devices Report Buffer Space Available The receiver of each node contains the Flow
Control buffers. Each device must report the amount of flow control buffer space it has
available to the device on the opposite end of the link. Buffer space is reported in units
called Flow Control Credits (FCCs). The number of Flow Control Credits within each buffer
is forwarded from the transaction layer to the transmit side of the link layer as illustrated in
Figure 7-1. The link layer creates a Flow Control DLLP that carries this credit information to the
receiver at the opposite end of the link. This is done for each Flow Control Buffer.
Receiving Credits Notice that the receiver in Figure 7-1 also receives Flow Control
DLLPs from the device at the opposite end of the link. This information is transferred to the
transaction layer to update the Flow Control Counters that track the amount of Flow
Control Buffer space in the other device.
Credit Checks Made Before sending a transaction, the transmitter consults the Flow Control Counters to check
available credits. If sufficient credits are available to receive the transaction pending
delivery then the transaction is forwarded to the link layer and is ultimately sent to the
opposite device. If enough credits are not available the transaction is temporarily blocked
until additional Flow Control credits are reported by the receiving device.
Each VC Flow Control buffer at the receiver is managed for each category of transaction
flowing through the virtual channel. These categories are:
Posted Transactions Memory Writes and Messages
Non-Posted Transactions Memory Reads, Configuration Reads and Writes, and I/O
Reads and Writes
Completions Read and Write Completions
In addition, each of these categories is separated into header and data portions of each
transaction. Flow control operates independently for each of the six buffers listed below (also
see Figure 7-2 on page 289).
Posted Header
Posted Data
Non-Posted Header
Non-Posted Data
Completion Header
Completion Data
Buffer space is reported by the receiver in units called Flow Control credits. The unit value of
Flow Control credits (FCCs) may differ between header and data as listed below:
Header credit one maximum-size header plus digest (4DW header + 1DW digest = 5DW, or 20 bytes)
Data credit 4DW (16 bytes) of payload data
Flow control credits are passed within the header of the link layer Flow Control Packets. Note
that DLLPs do not require Flow Control credits because they originate and terminate at the link
layer.
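As a rough illustration of how a single TLP's size translates into credits, the sketch below assumes the 20-byte header unit and 16-byte data unit listed above; the rounding-up of the payload to whole data credits is the natural consequence of fixed-size credit units.

```python
import math

# One header credit covers one TLP header plus the optional digest (5DW =
# 20 bytes). One data credit covers 4DW = 16 bytes of payload.
HDR_CREDIT_BYTES = 20
DATA_CREDIT_BYTES = 16

def credits_for_tlp(payload_bytes):
    """Credits a single TLP consumes: always one header credit, plus enough
    16-byte data credits to cover the payload (zero for header-only TLPs)."""
    hdr = 1
    data = math.ceil(payload_bytes / DATA_CREDIT_BYTES)
    return hdr, data

if __name__ == "__main__":
    print(credits_for_tlp(0))      # (1, 0)  e.g. a memory read request
    print(credits_for_tlp(1024))   # (1, 64) a maximum-sized 1KB memory write
```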
Maximum Flow Control Buffer Size
The maximum buffer size that can be reported via the Flow Control Initialization and Update
packets for the header and data portions of a transaction are as follows:
Header: 128 credits @ 20 bytes/credit = 2,560 bytes
Data: 2048 credits @ 16 bytes/credit = 32KB
The reason for these limits is discussed in the section entitled "Stage 1 Flow Control Following
Initialization" on page 296, step 2.
Introduction to the Flow Control Mechanism
The specification defines the requirements of the Flow Control mechanism by describing
conceptual registers and counters along with procedures and mechanisms for reporting,
tracking, and calculating whether a transaction can be sent. These elements define the
functional requirements; however, the actual implementation may vary from the conceptual
model. This section introduces the specified model that serves to explain the concept and
define the requirements. The approach taken focuses on a single flow control example for a
non-posted header. The concepts discussed apply to all Flow Control buffer types.
Figure 7-3 identifies and illustrates the elements used by the transmitter and receiver when
managing flow control. This diagram illustrates transactions flowing in a single direction across
a link, but of course another set of these elements is used to support transfers in the opposite
direction. The primary function of each element within the transmitting and receiving devices is
listed below. Note that for a single direction these Flow Control elements are duplicated for
each Flow Control receive buffer, yielding six sets of elements. This example deals with non-
posted header flow control.
Transmitter Elements
Pending Transaction Buffer holds transactions that are pending transfer within the same
virtual channel.
Credit Consumed Counter tracks the size of all transactions sent from the VC buffer (of
the specified type, e.g., non-posted headers) in Flow Control credits. This count is
abbreviated "CC."
Credit Limit Register this register is initialized by the receiving device when it sends Flow
Control initialization packets to report the size of the corresponding Flow Control receive
buffer. Following initialization, Flow Control update packets are sent periodically to add
more Flow Control credits as they become available at the receiver. This value is
abbreviated "CL."
Flow Control Gating Logic performs the calculations to determine if the receiver has
sufficient Flow Control credits to receive the pending TLP (PTLP). In essence, this check
ensures that the total CREDITS_CONSUMED (CC) plus the credit required for the next
packet pending transmission (PTLP) does not exceed the CREDIT_LIMIT (CL). The
specification defines the following equation for performing the check, with all values
represented in credits:
(CREDIT_LIMIT - (CREDITS_CONSUMED + credits required for the PTLP)) mod 2^[field size] <= 2^[field size]/2
where the field size is 8 for header credits and 12 for data credits.
For an example application of this equation, see "Stage 1 Flow Control Following Initialization"
on page 294.
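The gating check can be expressed directly in a few lines of code. The sketch below simply restates the equation above using unsigned modulo arithmetic; the numeric values echo the 2KB-buffer example (66h header credits) worked through later in this chapter.

```python
def credits_ok(cl, cc, cr, field_bits=8):
    """Transmitter gating check: the TLP may be sent only if
    (CL - (CC + CR)) mod 2^N is <= 2^N / 2, where N is 8 for header
    credits and 12 for data credits."""
    mod = 1 << field_bits
    diff = (cl - (cc + cr)) % mod
    return diff <= mod // 2

if __name__ == "__main__":
    # A 2KB non-posted header buffer advertises 0x66 credits at init time.
    print(credits_ok(cl=0x66, cc=0x00, cr=1))   # True : 0x65 <= 0x80, send
    print(credits_ok(cl=0x66, cc=0x66, cr=1))   # False: 0xFF >  0x80, blocked
```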
Receiver Elements
Credit Allocated This counter tracks the total Flow Control credits that have been
allocated (made available) since initialization. It is initialized by hardware to reflect the size
of the associated Flow Control buffer. As the buffer fills the amount of available buffer
space decreases until transactions are removed from the buffer. The number of Flow
Control credits associated with each transaction removed from the buffer is added to the
CREDIT_ALLOCATED counter; thereby keeping a running count of new credits made
available.
Credits Received Counter (optional) this counter keeps track of the total size of all data
received from the transmitting device and placed into the Flow Control buffer (in Flow
Control credits). When flow control is functioning properly, the CREDITS_RECEIVED count
should be the same as CREDITS_CONSUMED count at the transmitter and be equal to or
less than the CREDIT_ALLOCATED count. If this is not true, then a flow control buffer
overflow has occurred and an error is detected. Although optional, the specification
recommends its use.
Flow control management is based on keeping track of Flow Control credits using modulo
counters. Consequently, the counters are designed to roll over when the count saturates. The
width of the counters depends on whether flow control is tracking transaction headers or data:
8 bits for header credits
12 bits for data credits
In addition, all calculations are made using unsigned arithmetic. The operation of the counters
and the calculations are explained by example on page 290.
Flow Control Packets
The transmit side of a device reports flow control credit information from its receive buffers to
the opposite device. The specification defines three types of Flow Control packets:
Flow Control Init1 used to report the size of the Flow Control buffers for a given virtual
channel
Flow Control Init2 same as Flow Control Init1 except it is used to verify completion of flow
control initialization at each end of the link (receiving device ignores flow control credit
information)
Flow Control Update used after initialization to report additional Flow Control credits that
have been freed up at the receiver
Each Flow Control packet contains the header and data flow control credit information for a
given virtual channel and credit type. The packet fields that carry the header and
data Flow Control credits reflect the counter width as discussed in the previous section. Figure
7-4 pictures the format and content of these packets.
Stage One Immediately following initialization, several transactions are tracked to explain
the basic operation of the counters and registers as transactions are sent
across the link. In this stage, data is accumulating within the Flow Control buffer, but no
transactions are being removed.
Stage Two If the transmitter sends non-posted transactions at a rate such that the Flow
Control buffer is filled faster than the receiver can forward transactions from the buffer, the
buffer will fill. Stage two describes this circumstance.
Stage Three The modulo counters are designed to roll over and continue counting from zero.
This stage describes the flow control operation at the point of the CREDITS_ALLOCATED
count rolling over to zero.
Stage Four The specification describes the optional error check that can be made by the
receiver in the event of a Flow Control buffer overflow. This error check is described in this
section.
The assumption made in this example is that flow control initialization has just completed and
the devices are ready for normal operation. The Flow Control buffer is presumed to be 2KB in
size, which represents 102d (66h) Flow Control units with 20 bytes/header. Figure 7-5 on page
295 illustrates the elements involved with the values that would be in each counter and register
following flow control initialization.
The credit check is made using unsigned arithmetic (2's complement addition) to verify that
sufficient credits are available. For the first header transaction following initialization
(CREDITS_CONSUMED = 0, one header credit required):
CL 01100110 (66h)
CR 11111111 (add the 2's complement of 01h)
   01100101 = 65h (credits are available; send the transaction)
The result of the subtraction must be equal to or less than 1/2 the maximum value that can be
tracked with a modulo 256 counter (128). This approach is taken to ensure unique results from
the unsigned arithmetic. For example, unsigned 2's-complement subtraction yields the same
results for both 0 - 128 and 255 - 127 (both yield 128).
To ensure that conflicts such as the one above do not occur, the maximum number of unused
credits that can be reported is limited to 2^8/2 (128) credits for headers and 2^12/2 (2048) credits
for data. This means that the CREDITS_ALLOCATED count must never exceed the
CREDITS_CONSUMED count by more than 128 for headers and 2048 for data. This ensures
that any result < 1/2 the maximum register count is a positive number and represents credits
available, and results > 1/2 the maximum count are negative numbers that indicate credits not
available.
The CREDITS_CONSUMED count increments by one when the transaction is forwarded to
the link layer.
When the transaction arrives at the receiver, the transaction header is placed into the Flow
Control buffer and the CREDITS_RECEIVED counter (optional) increments by one. Note that
CREDIT_ALLOCATED does not change.
Figure 7-6 on page 297 illustrates the Flow Control elements following transfer of the first
transaction.
This example presumes that the receiving device has been unable to move transactions from
the Flow Control buffer since initialization. This could be caused if the device core was
temporarily busy and unable to process transactions. Consequently, the Flow Control buffer has
completely filled. Figure 7-7 on page 299 illustrates this scenario.
Figure 7-7. Flow Control Elements with Flow Control Buffer Filled
Again the transmitter checks Flow Control credits to determine if the next pending TLP can be
sent. The unsigned arithmetic is performed to subtract the Credits Required from the
CREDIT_LIMIT:
CL 01100110 (66h)
CR 10011001 (add the 2's complement of 67h, the credits consumed plus the credit required)
   11111111 = FFh <= 80h? No - do not send the packet
Not until the receiver moves one or more transactions from the Flow Control buffer can the
pending transaction be sent. When the first transaction is moved from the Flow Control buffer,
the CREDIT_ALLOCATED count is increased to 67h. When the Update Flow Control packet is
delivered to the transmitter, the new CREDIT_LIMIT will be loaded into the CL register. The
resulting check will pass the test, thereby permitting the packet to be sent.
CL 01100111 (67h)
CR 10011001 (add the 2's complement of 67h)
   00000000 = 00h <= 80h (send the transaction)
The CREDIT_LIMIT (CL), which reflects the credits granted by the receiver, always runs ahead
of (or is equal to) the CREDITS_CONSUMED (CC) count. Each time the transmitter performs a credit check, it adds
the credits required (CR) for a TLP to the current CREDITS_CONSUMED count and subtracts
the result from the current CREDIT_LIMIT to determine if enough credits are available to send
the TLP.
Because both the CL count and the CC count only index up, they are allowed to roll over from
maximum count back to 0. A problem appears to arise when the CL count (which, again, is
running ahead) has rolled over and the CC has not. Figure 7-8 shows the CL and CC counts
before and after CL rollover.
If a simple signed subtraction were performed in the rollover case, the result would be negative,
which would falsely indicate that credits are not available. However, because unsigned
arithmetic is used, the problem does not arise. See below:
CL 00001000 (08h)
CR 11111000 (F8h); 2's complement = 00000111 + 1 = 00001000 (08h)
CL 00001000 (08h)
   00001000 (add the 2's complement of CR)
   00010000 = 10h <= 80h (send the transaction)
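Reworking the same rollover numbers in a couple of lines confirms that unsigned modulo arithmetic still yields the correct headroom; the values are taken from the example above.

```python
FIELD_BITS = 8
MOD = 1 << FIELD_BITS                 # header credit counters are 8 bits wide

cl = 0x08           # CREDIT_LIMIT has rolled over past 0xFF to 0x08
cc_plus_cr = 0xF8   # CREDITS_CONSUMED plus the credit needed for the pending TLP

headroom = (cl - cc_plus_cr) % MOD
print(hex(headroom), headroom <= MOD // 2)   # 0x10 True -> send the TLP
```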
The specification recommends implementation of the optional FC buffer overflow error checking
mechanism. These optional elements include:
CREDITS_RECEIVED counter
Error Check Logic
These elements permit the receiver to track Flow Control credits in the same manner as the
transmitter. That is, the transmitter CREDIT_LIMIT count should be the same as the receiver's
CREDITS_ALLOCATED count (after an Update DLLP is sent) and the receiver's
CREDITS_RECEIVED count should be the same as the transmitter's CREDITS_CONSUMED
count. If flow control is working correctly, the following will be true:
(CREDITS_ALLOCATED - CREDITS_RECEIVED) mod 2^[field size] <= 2^[field size]/2
An overflow condition is detected when the following formula is satisfied. Note that the field size
is either 8 (headers) or 12 (data):
(CREDITS_ALLOCATED - CREDITS_RECEIVED) mod 2^[field size] > 2^[field size]/2
If the formula is true, then the result is negative; thus, more credits have been sent to the FC
buffer than were available and an overflow has occurred. Note that the 1.0a version of the
specification defines the equation with >= rather than > as shown above. This appears to be an
error, because when CA = CR no overflow condition exists. For example, for the case right after
initialization where the receiver advertises that it has 128 credits for the transmitter to use, CA
= 128 and CR = 0 because it hasn't received anything yet, and the >= form of the equation
evaluates true. That would mean the buffer has overflowed, when actually all we have done is
advertise our maximum allowed number of credits. If the equation evaluates for only > and not
>=, then everything works.
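A sketch of the receiver-side check, written with the ">" form argued for above, is shown below; the numeric cases are illustrative.

```python
def overflow(credits_allocated, credits_received, field_bits=8):
    """Receiver overflow check (using '>' rather than the 1.0a spec's '>='):
    a 'negative' modulo result means the transmitter consumed more credits
    than were ever allocated."""
    mod = 1 << field_bits
    return (credits_allocated - credits_received) % mod > mod // 2

if __name__ == "__main__":
    print(overflow(128, 0))      # False: just initialized, nothing received yet
    print(overflow(0x66, 0x66))  # False: buffer exactly full, no overflow
    print(overflow(0x66, 0x67))  # True : one more credit received than allocated
```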
Infinite Flow Control Advertisement
PCI Express defines an infinite Flow Control credit value. A device that advertises infinite Flow
Control credits need not send Flow Control Update packets following initialization and the
transmitter will never be blocked from sending transactions. During flow control initialization, a
device advertises "infinite" credits by delivering a zero in the credit field of the FC_INIT1 DLLP.
It's interesting to note that the minimum Flow Control credits that must be advertised includes
infinite credits for completion transactions in certain situations. See Table 7-1 on page 303.
These requirements involve devices that originate requests for which completions are expected
to be returned (i.e., Endpoints and root ports that do not support peer-to-peer transfers). It
does not include devices that merely forward completions (switches and root ports that support
peer-to-peer transfers). This implies a requirement that any device initiating a request must
commit buffer space for the expected completion header and data (if applicable). This
guarantees that no throttling would ever occur when completions cross the final link to the
original requester. This type of rule is required of PCI-X devices that initiate split transactions.
Multiple searches of the specification failed to reveal this requirement explicitly stated for PCI
Express devices; however, it is implied by the requirement to advertise infinite Flow Control
credits.
Note also that infinite flow control credits can only be advertised during initialization. This must be
true, because the CA counter in the receiver could rollover to 00h and send an Update FC
packet with the credit field set to 00h. If the Link is in the DL_Init state, this means infinite
credits, but if the Link is in the DL_Active state, this does not mean infinite credits.
The specification points out a special consideration for devices that do not need to implement
all the FC buffer types for all VCs. For example, the only Non-Posted writes are I/O Writes and
Configuration Writes, both of which are permitted only on VC0. Thus, Non-Posted data buffers
are not needed for VC1 - VC7. Because no Flow Control tracking is needed, a device can
simply advertise infinite Flow Control credits during initialization, thereby eliminating the need to
send needless FC_Update packets.
An infinite Flow Control advertisement might be sent for either the data or the header buffers (of
the same FC type) but not both. In this case, Update DLLPs are required for one buffer but not the
other. This simply means that the device requiring credits will send an Update DLLP with the
corresponding field containing the CREDITS_ALLOCATED credit information, and the other
field must be set to zero (consistent with its advertisement).
The Minimum Flow Control Advertisement
The minimum number of credits that can be reported for the different Flow Control buffer types
is listed in Table 7-1 on page 303.
Posted Request Header (PH) 1 unit. Credit Value = one 4DW HDR + Digest = 5DW.
Posted Request Data (PD) Largest possible setting of the Max_Payload_Size for the component, divided by the FC Unit Size (4DW). Example: If the largest Max_Payload_Size value supported is 1024 bytes, the smallest permitted initial credit value would be 040h.
Non-Posted Request HDR (NPH) 1 unit. Credit Value = one 4DW HDR + Digest = 5DW.
Non-Posted Request Data (NPD) 1 unit. Credit Value = 4DW.
Completion HDR (CPLH) 1 unit. Credit Value = one 3DW HDR + Digest = 4DW; for Root Complex with peer-to-peer support and Switches. Infinite units (Initial Credit Value = all 0's) for Root Complex with no peer-to-peer support and Endpoints.
Completion Data (CPLD) n units. Value of the largest possible setting of Max_Payload_Size or the size of the largest Read Request (whichever is smaller) divided by the FC Unit Size (4DW); for Root Complex with peer-to-peer support and Switches. Infinite units (Initial Credit Value = all 0's) for Root Complex with no peer-to-peer support and Endpoints.
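The PD entry can be checked with a one-line calculation, assuming the 16-byte (4DW) FC unit used throughout this chapter; the 1024-byte example value comes from the table above.

```python
FC_UNIT_BYTES = 16   # one data credit = 4DW

def min_posted_data_credits(max_payload_size_bytes):
    """Minimum initial PD credit advertisement: the component's largest
    supported Max_Payload_Size divided by the FC unit size."""
    return max_payload_size_bytes // FC_UNIT_BYTES

if __name__ == "__main__":
    print(hex(min_posted_data_credits(1024)))   # 0x40, as in Table 7-1
```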
Flow Control Initialization
Prior to sending any transactions, flow control initialization must be performed. Initialization
occurs for each link in the system and involves a handshake between the devices attached to
the same link. TLPs associated with the virtual channel being initialized cannot be forwarded
across the link until Flow Control Initialization is performed successfully.
Once initiated, the flow control initialization procedure is fundamentally the same for all Virtual
Channels. The small differences that exist are discussed later. Initialization of VC0 (default VC)
must be done in hardware so that configuration transactions can traverse the PCI Express
fabric. Other VCs initialize once configuration software has set up and enabled the VCs at both
ends of the link. Enabling a VC triggers hardware to perform flow control initialization for this
VC.
Figure 7-9 pictures the Flow Control counters within the devices at both ends of the link, along
with the state of flag bits used during initialization.
PCI Express defines two stages in flow control initialization: FC_INIT1 and FC_INIT2. Each
stage of course involves the use of the Flow Control packets (FCPs).
Flow Control Init1 reports the size of the Flow Control buffers for a given virtual channel
Flow Control Init2 verifies that the device transmitting the Init2 packet has completed the
flow control initialization for the specified VC and buffer type.
During the FC_INIT1 state, a device continuously outputs a sequence of 3 InitFC1 Flow Control
packets advertising its posted, non-posted, and completion receiver buffer sizes. (See Figure
7-10.) Each device also waits to receive a similar sequence from its neighbor. Once a device
has received the complete sequence and sent its own, it initializes transmit counters, sets an
internal flag FI1, and exits FC_INIT1. This process is illustrated in Figure 7-11 on page 306 and
described below. The example shows Device A reporting Non-Posted Buffer Credits and Device
B reporting Posted Buffer Credits. This illustrates that the devices need not be in
synchronization regarding what they are reporting. In fact, the two devices will typically not start
the flow control initialization process at the same time.
1. Each device sends InitFC1 type Flow Control packets (FCPs) to advertise the size of
its respective receive buffers. A separate FCP for posted requests (P), non-posted
requests (NP) and completion (CPL) packet types is required. The order in which
this sequence of three FCPs is sent is:
Header and Data buffer credit units for Posted Requests (P)
Header and Data buffer credit units for Non-Posted Requests (NP)
Header and Data buffer credit units for Completions (CPL)
The sequence of FCPs is repeated continuously until a device leaves the FC_INIT1
initialization state.
In the meantime, devices take the credit information and initialize the transmit credit limit
registers. In this example, Device A loads its PH transmit Credit Limit register with a value of 4,
which was reported by Device B for its posted request header FC buffer. It also loads its PD
Credit Limit register with a value of 64d credits (1024 bytes worth of data) for accompanying
posted data. Similarly, Device B loads its NPH transmit Credit Limit counter with a value of 2 for
non-posted request headers and its NPD transmit counter with a value of 32d credits (512
bytes worth of data) for accompanying non-posted data.
Note that when this process is complete, the Credits Allocated counter in the receivers and
the corresponding Credit Limit counters in the transmitters will be equal.
Once a device receives Init1 FC values for a given buffer type (e.g., Posted) and has
recorded them, the FC_INIT1 state is complete for that Flow Control buffer. Once all FC
buffers for a given VC have completed the FC_INIT1 state, Flag 1 (Fl1) is set and the device
ceases to send FCInit1 DLLPs and advances to the FC_INIT2 state. Note that receipt of an Init2 FC
packet may also cause Fl1 to be set. This can occur if the neighboring device has already
advanced to the FC Init2 state.
PCI Express defines the InitFC2 state that is used for feedback to verify the Flow Control
initialization has been successful for a given VC. During FC_INIT2, each device continuously
outputs a sequence of 3 InitFC2 Flow Control packets; however, credit values are discarded
during the FC_INIT2 state. Note that devices are permitted to send TLPs upon entering the
FC_INIT2 state. Figure 7-12 illustrates InitFC2 behavior, which is described following the
illustration.
1. At the start of initialization state FC_INIT2, each device commences sending InitFC2
type Flow Control packets (FCPs) to indicate it has completed the FC_INIT1 state.
Devices use the same repetitive sequence when sending FCPs in this state as
before:
Header and Data buffer credit allocation for Posted Requests (P)
Header and Data buffer credit allocation for Non-Posted Requests (NP)
Header and Data buffer credit allocation for Completions (CPL)
All credits reported in InitFC2 FCPs may be discarded, as the transmitter Credit Limit
counters were already set up in FC_INIT1.
Once a device receives an FC_INIT2 packet for any buffer type, it sets an internal flag (Fl2).
(It doesn't wait to receive an FC_Init2 for each type.) Note that Fl2 is also set upon receipt of
an UpdateFC packet or TLP.
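A simplified model of the receive-side handshake flags described above is sketched below. It ignores the transmit side, the sequencing of the three credit types within a sequence, and the resend timers; the credit values are hypothetical.

```python
class FcInitState:
    """Minimal model of the per-VC flow control initialization flags."""
    def __init__(self):
        self.credit_limit = {}   # transmit CL values recorded per credit type
        self.fl1 = False         # InitFC1 credits recorded for P, NP and CPL
        self.fl2 = False         # an InitFC2 / UpdateFC / TLP has been received

    def receive(self, dllp_type, credits=None):
        if dllp_type == "InitFC1" and not self.fl1:
            self.credit_limit.update(credits)            # record advertised sizes
            if set(self.credit_limit) == {"P", "NP", "CPL"}:
                self.fl1 = True                          # leave FC_INIT1
        elif dllp_type in ("InitFC2", "UpdateFC", "TLP"):
            self.fl1 = True                              # neighbor is ahead of us
            self.fl2 = True                              # leave FC_INIT2

    @property
    def initialized(self):
        return self.fl1 and self.fl2

if __name__ == "__main__":
    dev_a = FcInitState()
    dev_a.receive("InitFC1", {"P": (4, 64)})
    dev_a.receive("InitFC1", {"NP": (2, 32)})
    dev_a.receive("InitFC1", {"CPL": (0, 0)})   # zero = infinite credits
    dev_a.receive("InitFC2")
    print(dev_a.initialized, dev_a.credit_limit)
```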
The specification defines the latency between sending FC_INIT DLLPs as follows:
VC0. Hardware initiated flow control of VC0 requires that FC_INIT1 and FC_INIT2 packets
be transmitted "continuously at the maximum rate possible." That is, the resend timer is set
to a value of zero.
VC1-VC7. When software initiates flow control initialization, the FC_INIT sequence is
repeated "when no other TLPs or DLLPs are available for transmission." However, the
latency between the beginning of one sequence to the next can be no greater than 17µs.
A violation of the flow control initialization protocol can be optionally checked by a device. An
error detected can be reported as a Data Link Layer protocol error. See "Link Flow Control-
Related Errors" on page 363.
Flow Control Updates Following FC_INIT
The receiver must continually update its neighboring device to report additional Flow Control
credits that have accumulated as a result of moving transactions from the Flow Control buffer.
Figure 7-13 on page 309 illustrates an example where the transmitter was previously blocked
from sending header transactions because the Flow Control buffer was full. In the example, the
receiver has just removed three headers from the Flow Control buffer. More space is now
available, but the neighboring device has no knowledge of this. As each header is removed
from the Flow Control buffer, the CREDITS_ALLOCATED count increments. The new count is
delivered to the CREDIT_LIMIT register of the neighboring device via an update Flow Control
packet. The updated credit limit allows transmission of additional transactions.
Recall that update Flow Control packets, like the Flow Control initialization packets, contain two
update fields, one for header and one for data, for the selected credit type (P, NP, or Cpl).
Figure 7-14 on page 310 depicts the content of the update packet. The receiver's
CREDITS_ALLOCATED counts that are reported in the HdrFC and DataFC fields may have
been updated many times or not at all since the last update packet sent.
The specification defines a variety of rules and suggested implementations that govern when
and how often Flow Control Update DLLPs should be sent. The motivation includes:
Notifying the transmitting device as early as possible about new credits allocated, which
allows previously blocked transactions to continue.
Balancing the requirements and variables associated with flow control operation. This
involves:
the need to report credits available often enough to prevent transaction blocking
the desire to reduce the link bandwidth required to send FC_Update DLLPs
The update frequency limits specified assume that the link is in the active state (L0 or L0s
(standby)). All other link states represent more aggressive power management with longer
recovery latencies that require link recovery prior to sending packets.
Maximum Packet Size = 1 Credit. When packet transmission is blocked due to a buffer
full condition for non-infinite NPH, NPD, PH, and CPLH buffer types, an UpdateFC packet
must be scheduled for transmission when one or more credits are made available
(allocated) for that buffer type.
Maximum Packet Size = Max_Payload_Size. Flow Control buffer space may decrease
to the extent that a maximum-sized packet cannot be sent for non-infinite PD and CPLD
credit types. In this case, when one or more additional credits are allocated, an Update
FCP must be scheduled for transmission.
An Update FCP for each (non-infinite) FC credit type must be scheduled for transmission at
least once every 30 µs (-0%/+50%). If the Extended Sync bit within the Link Control register is
set, Updates must be scheduled no later than every 120 µs
(-0%/+50%). Note that Update FCPs may be scheduled for transmission more frequently than
is required.
The specification offers a formula for calculating the frequency at which update packets need to
be sent for maximum data payload sizes and link widths. The formula, shown below, defines
FC Update delivery intervals in symbol times (4 ns each).
where:
MaxPayloadSize = The value in the Max_Payload_Size field of the Device Control register
TLPOverhead = the constant value (28 symbols) representing the additional TLP
components that consume Link bandwidth (header, LCRC, framing Symbols)
UpdateFactor = the number of maximum-size TLPs sent during the interval between
UpdateFC Packets received. This number balances link bandwidth efficiency and receive
buffer sizes; the value varies with Max_Payload_Size and Link width
The simple relationship defined by the formula shows that for a given data payload and buffer
size, the frequency of update packet delivery becomes higher as the link width increases. This
relatively simple approach suggests a timer implementation that triggers scheduling of update
packets. Note that this formula does not account for delays associated with the receiver or
transmitter being in the L0s power management state.
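The sketch below follows the general shape of such an interval calculation. Only the 28-symbol TLPOverhead constant comes from the text above; the division by link width, the trailing internal-delay term (assumed here to be 19 symbol times), and the example UpdateFactor value are assumptions made for illustration.

```python
TLP_OVERHEAD_SYMBOLS = 28   # header, LCRC and framing symbols, per the text

def updatefc_interval_symbol_times(max_payload_size, update_factor,
                                   link_width, internal_delay=19):
    """Rough sketch of an UpdateFC scheduling interval, in symbol times.
    The link-width divisor and internal_delay term are assumptions not
    spelled out in this excerpt."""
    return ((max_payload_size + TLP_OVERHEAD_SYMBOLS) * update_factor
            / link_width) + internal_delay

if __name__ == "__main__":
    # Hypothetical case: 128-byte payloads, UpdateFactor of 1.4, a x4 link.
    symbols = updatefc_interval_symbol_times(128, 1.4, 4)
    print(f"{symbols:.0f} symbol times (~{symbols * 4:.0f} ns at 2.5 Gb/s)")
```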
The specification recognizes that the formula will be inadequate for many applications such as
those that stream large blocks of data. These applications may require buffer sizes larger than
the minimum specified, as well as a more sophisticated update policy in order to optimize
performance and reduce power consumption. Because a given solution is dependent on the
particular requirements of an application, no definition for such policies is provided.
The specification defines an optional time-out mechanism that is highly recommended, so much
so that the specification points out that it is expected to become a requirement in future
versions of the spec. This mechanism detects prolonged absences of Flow Control packets.
The maximum latency between FC packets for a given Flow Control credit type is specified to
be no greater than 120µs. This error detection timer has a maximum limit of 200µs, and it gets
reset any time a Flow Control packet of any type is received. If a time-out occurs, this
suggests a serious problem with a device's ability to report Flow Control credits. Consequently,
a time-out triggers the Physical Layer to enter its Recovery state which retrains the link and
hopefully clears the error condition. Characteristics of this timer include:
operational only when the link is in its active state (L0 or L0s)
timer is reset when any Init or Update FCP is received, or optionally the timer may be
reset by the receipt of any type of DLLP
when the timer expires, the Physical Layer enters the Link Training and Status State Machine
(LTSSM) Recovery state
Chapter 8. Transaction Ordering
The Previous Chapter
This Chapter
Introduction
Producer/Consumer Model
Relaxed Ordering
Ensuring that the completion of transactions is deterministic and in the sequence intended
by the programmer.
Avoiding deadlock conditions.
Maintaining compatibility with ordering already used on legacy buses (e.g., PCI, PCI-X,
and AGP).
PCI Express ordering is based on the same Producer/Consumer model as PCI. The split
transaction protocol and related ordering rules are fairly straightforward when restricting the
discussion to transactions involving only native PCI Express devices. However, ordering
becomes more complex when including support for the legacy buses mentioned in bullet three
above.
Rather than presenting the ordering rules defined by the specification and attempting to explain
the rationale for each rule, this chapter takes the building block approach. Each major ordering
concern is introduced one at a time. The discussion begins with the most conservative (and
safest) approach to ordering, progresses to a more aggressive approach (to improve
performance), and culminates with the ordering rules presented in the specification. The
discussion is segmented into the following sections:
The fundamental PCI Express device ordering requirements that ensure the
Producer/Consumer model functions correctly.
The Relaxed Ordering feature that permits violation of the Producer/Consumer ordering
when the device issuing a request knows that the transaction is not part of a
Producer/Consumer programming sequence.
1. A network adapter begins to receive a stream of compressed video data over the
network and performs a series of memory write transactions to deliver the stream
of compressed video data into a Data buffer in memory (in other words the network
adapter is the Producer of the data).
After the Producer moves the data to memory, it performs a memory write transaction to
set an indicator (or Flag) in a memory location (or a register) to indicate that the data is ready
for processing.
Another requester (referred to as the Consumer) periodically performs a memory read from
the Flag location to see if there's any data to be processed. In this example, this requester is a
video decompressor that will decompress and display the data.
When it sees that the Flag has been set by the Producer, it performs a memory write to
clear the Flag, followed by a burst memory read transaction to read the compressed data (it
consumes the data; hence the name Consumer) from the Data buffer in memory.
When it is done consuming the Data, the Consumer writes the completion status into the
Status location. It then resumes periodically reading the Flag location to determine when more
data needs to be processed.
In the meantime, the Producer has been reading periodically from the Status location to
see if data processing has been completed by the other requester (the Consumer). This
location typically contains zero until the other requester completes the data processing and
writes the completion status into it. When the Producer reads the Status and sees that the
Consumer has completed processing the Data, the Producer then performs a memory write
to clear the Status location.
The process then repeats whenever the Producer has more data to be processed.
Ordering rules are required to ensure that the Producer/Consumer model works correctly no
matter where the Producer, the Consumer, the Data buffer, the Flag location, and the Status
location are located in the system (in other words, no matter how they are distributed on
various links in the system).
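The hardware flow just described can be mimicked with a small software stand-in. The shared locations, the initial Status value, and the polling loop below are illustrative only, not part of the specification; they simply restate the Flag/Status handshake in executable form.

```python
# Software stand-ins for the shared memory locations in the example above.
memory = {"data": [], "flag": 0, "status": 1}   # status=1: Consumer is idle

def producer_step(frame):
    """Producer: check Status, clear it, write the Data, then set the Flag."""
    if memory["status"] == 1 and memory["flag"] == 0:
        memory["status"] = 0
        memory["data"].append(frame)       # memory writes fill the Data buffer
        memory["flag"] = 1                 # final write sets the Flag
        return True
    return False                           # previous frame still being consumed

def consumer_step(out):
    """Consumer: poll the Flag, clear it, read the Data, write the Status."""
    if memory["flag"] == 1:
        memory["flag"] = 0
        out.append(memory["data"].pop(0))  # consume the Data
        memory["status"] = 1               # report completion
        return True
    return False

if __name__ == "__main__":
    frames, out, i = ["frame0", "frame1", "frame2"], [], 0
    while len(out) < len(frames):
        if i < len(frames) and producer_step(frames[i]):
            i += 1
        consumer_step(out)
    print(out)                             # ['frame0', 'frame1', 'frame2']
```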
Native PCI Express Ordering Rules
PCI Express transaction ordering for native devices can be summarized with four simple rules:
Transactions with the same TC value that are moving in the same direction must remain
strongly ordered.
There is no ordering relationship between transactions with different TC values.
The ordering rules apply in the same way to all types of transactions: memory, IO,
configuration, and messages.
Under limited circumstances, transactions with the Relaxed Ordering attribute bit set can be
ordered ahead of other transactions with the same TC.
These fundamental rules ensure that transactions always complete in the order intended by
software. However, these rules are extremely conservative and do not necessarily result in
optimum performance. For example, when transactions from many devices merge within
switches, there may be no ordering relationship between transactions from these different
devices. In such cases, more aggressive rules can be applied to improve performance as
discussed in "Modified Ordering Rules Improve Performance" on page 322.
Because the Producer/Consumer model depends on strong ordering, when the following
conditions are met native PCI Express devices support this model without additional ordering
rules:
1. All elements associated with the Producer/Consumer model reside within native PCI
Express devices.
All transactions associated with the operation of the Producer/Consumer model traverse
only PCI Express links within the same fabric.
All associated transactions have the same TC values. If different TC values are used, then
the strong ordering relationship between the transactions is no longer guaranteed.
The Relaxed Ordering (RO) attribute bit of the transactions must be cleared to avoid
reordering the transactions that are part of the Producer/Consumer transaction series.
When PCI legacy devices reside within a PCI Express system, the ordering rules become more
involved. Consequently, additional ordering rules apply because of PCI's delayed transaction
protocol. Without ordering rules, this protocol could permit Producer/Consumer transactions to
complete out of order and cause the programming model to break.
Relaxed Ordering
PCI Express supports the Relaxed Ordering mechanism introduced by PCI-X; however, PCI
Express introduces some changes (discussed later in this chapter). The concept of Relaxed
Ordering in the PCI Express environment allows switches in the path between the Requester
and Completer to reorder some transactions just received before others that were previously
enqueued.
The ordering rules that exist to support the Producer/Consumer model may result in
transactions being blocked, when in fact the blocked transactions are completely unrelated to
any Producer/Consumer transaction sequence. Consequently, in certain circumstances, a
transaction with its Relaxed Ordering (RO) attribute bit set can be re-ordered ahead of other
transactions.
The Relaxed Ordering bit may be set by the device if its device driver has enabled it to do so
(by setting the Enable Relaxed Ordering bit in the Device Control register; see Table 24-3 on
page 906). Relaxed ordering gives switches and the Root Complex permission to move this
transaction ahead of others, an action that is normally prohibited.
PCI Express Switches and the Root Complex are affected by memory write and message
transactions that have their RO bit set. Memory write and Message transactions are treated
the same in most respects: both are handled as posted operations, both are received into the
same Posted buffer, and both are subject to the same ordering requirements. When the RO bit
is set, switches handle these transactions as follows:
Switches are permitted to reorder memory write transactions just posted ahead of
previously posted memory write transactions or message transactions. Similarly, message
transactions just posted may be ordered ahead of previously posted memory write or
message transactions. Switches must also forward the RO bit unmodified. The ability to
reorder these transactions within switches is not supported by PCI-X bridges. In PCI-X, all
posted writes must be forwarded in the exact order received. Another difference between
the PCI-X and PCI Express implementations is that message transactions are not defined
for PCI-X.
The Root Complex is permitted to order a just-posted write transaction ahead of another
write transaction that was received earlier in time. Also, when receiving write requests
(with RO set), the Root Complex is required to write the data payload to the specified
address location within system memory, but is permitted to write each byte to memory in
any address order.
RO Effects on Memory Read Transactions
All read transactions in PCI Express are handled as split transactions. When a device issues a
memory read request with the RO bit set, the request may traverse one or more switches on
its journey to the Completer. The Completer returns the requested read data in a series of one
or more split completion transactions, and uses the same RO setting as in the request. Switch
behavior for the example stated above is as follows:
1. A switch that receives a memory read request with the RO bit set must forward the
request in the order received, and must not reorder it ahead of memory write
transactions that were previously posted. This action guarantees that all write
transactions moving in the direction of the read request are pushed ahead of the
read. Such actions are not necessarily part of the Producer/Consumer programming
sequence, but software may depend on this flushing action taking place. Also, the
RO bit must not be modified by the switch.
When the Completer receives the memory read request, it fetches the requested read data
and delivers a series of one or more memory read Completion transactions with the RO bit set
(because it was set in the request).
A switch receiving the memory read Completion(s) detects the RO bit set and knows that it
is allowed to order the read Completion(s) ahead of previously posted memory writes moving in
the direction of the Completion. If the memory write transaction were blocked (due to flow
control), then the memory read Completion would also be blocked if the RO bit were not set.
Relaxed ordering in this case improves read performance.
The PCI Express specification defines strong ordering rules associated with transactions that
are assigned the same TC value, and further defines a Relaxed Ordering attribute that can be
used when a device knows that a transaction has no ordering relationship to other transactions
with the same TC value. Table 8-2 on page 322 summarizes the PCI Express ordering rules
that satisfy the Producer/Consumer model and also provides for Relaxed Ordering. The table
represents a draconian approach to ordering and does not consider issues of performance,
preventing deadlocks, etc.
The table applies to transactions with the same TC assignment that are moving in the same
direction. These rules ensure that transactions will complete in the intended program order and
eliminates the possibility of deadlocks in a pure PCI Express implementation (i.e., systems with
no PCI Bridges). Columns 2 - 6 represent transactions that have been previously latched by a PCI
Express device, while column 1 represents subsequently-latched transactions. The ordering
relationship between the transaction in column 1 to other transactions previously enqueued is
expressed in the table on a row-by-row basis. Note that these rules apply uniformly to all
transaction types (Memory, Messages, IO, and Configuration). The table entries are defined as
follows:
No The transaction in column 1 must not be permitted to proceed ahead of the previously
enqueued transaction in the corresponding columns (2-6).
Y/N (Yes/No) The transaction in column 1 is allowed to proceed ahead of the previously
enqueued transaction because its Relaxed Ordering bit is set (1), but it is not required to do so.
Note that the shaded area represents the ordering requirements that ensure the
Producer/Consumer model functions correctly and is consistent with the basic rules associated
with strong ordering. The transaction ordering associated with columns 3 - 6 plays no role in the
Producer/Consumer model.
Modified Ordering Rules Improve Performance
This section describes how temporary transaction blocking can occur when the strong ordering
rules listed in Table 8-2 are rigorously enforced. Modification of strong ordering between
transactions that do not violate the Producer/Consumer programming model can eliminate many
blocking conditions and improve link efficiency.
Maintaining the strong ordering relationship between transactions would likely result in instances
where all transactions would be blocked due to a single receive buffer being full. The strong
ordering requirements to support the Producer/Consumer model cannot be modified (except in
the case of relaxed ordering described previously). However, transaction sequences that do not
occur within the Producer/Consumer programming model can be modified to a weakly ordered
scheme that can lead to improved performance.
The Problem
Consider the following example illustrated in Figure 8-1 on page 323 when strong ordering is
maintained for all transaction sequences. This example depicts transmitter and receiver buffers
associated with the delivery of transactions in a single direction (from left to right) for a single
Virtual Channel (VC), and the transmit and receive buffers are organized in the same way. Also,
recall that each of the transaction types (Posted, Non-Posted, and Completions) have
independent flow control within the same VC. The numbers within the transmit buffers show the
order in which these transactions were issued to the transmitter. In addition, the non-posted
receive buffer is currently full. Consider the following sequence.
1. Transaction 1 (a memory read request) is blocked because the non-posted receive buffer
at the opposite end of the link is full.
Transaction 2 (a posted memory write) is the next transaction pending. When consulting
Table 8-2 (based on strong ordering), entry A3 specifies that a memory write must not pass a
previously enqueued read request.
Because, under strong ordering, the applicable entries in Table 8-2 are "No", all transactions
are blocked while the non-posted receive buffer remains full.