Table of Contents
• Index
PCI Express System Architecture
By MindShare, Inc , Ravi Budruk,
Don Anderson, Tom Shanley

Publisher : Addison Wesley
Pub Date : September 04, 2003
ISBN : 0-321-15630-7
Pages : 1120

"We have always recommended these books to our customers and


even our own engineers for developing a better understanding of
technologies and specifications. We find the latest PCI Express
book from MindShare to have the same content and high quality as
all the others."
​Nader Saleh, CEO/President, Catalyst Enterprises, Inc.

PCI Express is the third-generation Peripheral Component Interconnect technology for a wide range of systems and peripheral devices. Incorporating recent advances in high-speed, point-to-point interconnects, PCI Express provides significantly higher performance, reliability, and enhanced capabilities at a lower cost than the previous PCI and PCI-X standards. Therefore, anyone working on next-generation PC systems, BIOS and device driver development, and peripheral device design will need to have a thorough understanding of PCI Express.

PCI Express System Architecture provides an in-depth description and comprehensive reference to the PCI Express standard. The book contains information needed for design, verification, and test, as well as background information essential for writing low-level BIOS and device drivers. In addition, it offers valuable insight into the technology's evolution and cutting-edge features.

Following an overview of the PCI Express architecture, the book moves on to cover transaction protocols, the physical/electrical layer, power management, configuration, and more. Specific topics covered include:

Split transaction protocol

Packet format and definition, including use of each field

ACK/NAK protocol

Traffic Class and Virtual Channel applications and use

Flow control initialization and operation

Error checking mechanisms and reporting options

Switch design issues

Advanced Power Management mechanisms and use

Active State Link power management

Hot Plug design and operation


Message transactions

Physical layer functions

Electrical signaling characteristics and issues

PCI Express enumeration procedures

Configuration register definitions

Thoughtfully organized, featuring a plethora of illustrations, and comprehensive in scope, PCI Express System Architecture is an essential resource for anyone working with this important technology.

MindShare's PC System Architecture Series is a crisply written and comprehensive set of guides to the most important PC hardware standards. Books in the series are intended for use by hardware and software designers, programmers, and support personnel.

• Table of Contents
• Index
PCI Express System Architecture
By MindShare, Inc , Ravi Budruk,
Don Anderson, Tom Shanley

Publisher : Addison Wesley
Pub Date : September 04, 2003
ISBN : 0-321-15630-7
Pages : 1120

Copyright
Figures
Tables
Acknowledgments
About This Book
The MindShare Architecture Series
Cautionary Note
Intended Audience
Prerequisite Knowledge
Topics and Organization
Documentation Conventions
Visit Our Web Site
We Want Your Feedback
Part One. The Big Picture
Chapter 1. Architectural Perspective
This Chapter
The Next Chapter
Introduction To PCI Express
Predecessor Buses Compared
I/O Bus Architecture Perspective
The PCI Express Way
PCI Express Specifications
Chapter 2. Architecture Overview
Previous Chapter
This Chapter
The Next Chapter
Introduction to PCI Express Transactions
PCI Express Device Layers
Example of a Non-Posted Memory Read Transaction
Hot Plug
PCI Express Performance and Data Transfer Efficiency

Part Two. Transaction Protocol


Chapter 3. Address Spaces & Transaction Routing
The Previous Chapter
This Chapter
The Next Chapter
Introduction
Two Types of Local Link Traffic
Transaction Layer Packet Routing Basics
Applying Routing Mechanisms
Plug-And-Play Configuration of Routing Options

Chapter 4. Packet-Based Transactions


The Previous Chapter
This Chapter
The Next Chapter
Introduction to the Packet-Based Protocol
Transaction Layer Packets
Data Link Layer Packets
Chapter 5. ACK/NAK Protocol
The Previous Chapter
This Chapter
The Next Chapter
Reliable Transport of TLPs Across Each Link
Elements of the ACK/NAK Protocol
ACK/NAK DLLP Format
ACK/NAK Protocol Details
Error Situations Reliably Handled by ACK/NAK Protocol
ACK/NAK Protocol Summary
Recommended Priority To Schedule Packets
Some More Examples
Switch Cut-Through Mode
Chapter 6. QoS/TCs/VCs and Arbitration
The Previous Chapter
This Chapter
The Next Chapter
Quality of Service
Perspective on QOS/TC/VC and Arbitration
Traffic Classes and Virtual Channels
Arbitration
Chapter 7. Flow Control

The Previous Chapter


This Chapter
The Next Chapter
Flow Control Concept
Flow Control Buffers
Introduction to the Flow Control Mechanism
Flow Control Packets
Operation of the Flow Control Model - An Example
Infinite Flow Control Advertisement
The Minimum Flow Control Advertisement
Flow Control Initialization
Flow Control Updates Following FC_INIT
Chapter 8. Transaction Ordering
The Previous Chapter
This Chapter
The Next Chapter
Introduction
Producer/Consumer Model
Native PCI Express Ordering Rules
Relaxed Ordering
Modified Ordering Rules Improve Performance
Support for PCI Buses and Deadlock Avoidance

Chapter 9. Interrupts
The Previous Chapter
This Chapter
The Next Chapter
Two Methods of Interrupt Delivery
Message Signaled Interrupts
Legacy PCI Interrupt Delivery
Devices May Support Both MSI and Legacy Interrupts
Special Consideration for Base System Peripherals

Chapter 10. Error Detection and Handling


The Previous Chapter
This Chapter
The Next Chapter
Background
Introduction to PCI Express Error Management
Sources of PCI Express Errors
Error Classifications
How Errors are Reported
Baseline Error Detection and Handling
Advanced Error Reporting Mechanisms
Summary of Error Logging and Reporting

Part Three. The Physical Layer


Chapter 11. Physical Layer Logic
The Previous Chapter
This Chapter
The Next Chapter
Physical Layer Overview
Transmit Logic Details
Receive Logic Details
Physical Layer Error Handling
Chapter 12. Electrical Physical Layer
The Previous Chapter
This Chapter
The Next Chapter
Electrical Physical Layer Overview
High Speed Electrical Signaling
LVDS Eye Diagram
Transmitter Driver Characteristics
Input Receiver Characteristics
Electrical Physical Layer State in Power States

Chapter 13. System Reset


The Previous Chapter
This Chapter
The Next Chapter
Two Categories of System Reset
Reset Exit
Link Wakeup from L2 Low Power State
Chapter 14. Link Initialization & Training
The Previous Chapter
This Chapter
The Next Chapter
Link Initialization and Training Overview
Ordered-Sets Used During Link Training and Initialization
Link Training and Status State Machine (LTSSM)
Detailed Description of LTSSM States
LTSSM Related Configuration Registers

Part Four. Power-Related Topics


Chapter 15. Power Budgeting
The Previous Chapter
This Chapter
The Next Chapter
Introduction to Power Budgeting
The Power Budgeting Elements
Slot Power Limit Control
The Power Budget Capabilities Register Set

Chapter 16. Power Management


The Previous Chapter
This Chapter
The Next Chapter
Introduction
Primer on Configuration Software
Function Power Management
Introduction to Link Power Management
Link Active State Power Management
Software Initiated Link Power Management
Link Wake Protocol and PME Generation

Part Five. Optional Topics


Chapter 17. Hot Plug
The Previous Chapter
This Chapter
The Next Chapter
Background
Hot Plug in the PCI Express Environment
Elements Required to Support Hot Plug
Card Removal and Insertion Procedures
Standardized Usage Model
Standard Hot Plug Controller Signaling Interface
The Hot-Plug Controller Programming Interface
Slot Numbering
Quiescing Card and Driver
The Primitives
Chapter 18. Add-in Cards and Connectors
The Previous Chapter
This Chapter
The Next Chapter
Introduction
Form Factors Under Development

Part Six. PCI Express Configuration


Chapter 19. Configuration Overview
The Previous Chapter
This Chapter
The Next Chapter
Definition of Device and Function
Definition of Primary and Secondary Bus
Topology Is Unknown At Startup
Each Function Implements a Set of Configuration Registers
Host/PCI Bridge's Configuration Registers
Configuration Transactions Are Originated by the Processor
Configuration Transactions Are Routed Via Bus, Device, and Function Number
How a Function Is Discovered
How To Differentiate a PCI-to-PCI Bridge From a Non-Bridge Function

Chapter 20. Configuration Mechanisms


The Previous Chapter
This Chapter
The Next Chapter
Introduction

PCI-Compatible Configuration Mechanism


PCI Express Enhanced Configuration Mechanism
Type 0 Configuration Request
Type 1 Configuration Request
Example PCI-Compatible Configuration Access
Example Enhanced Configuration Access
Initial Configuration Accesses

Chapter 21. PCI Express Enumeration


The Previous Chapter
This Chapter
The Next Chapter
Introduction
Enumerating a System With a Single Root Complex
Enumerating a System With Multiple Root Complexes
A Multifunction Device Within a Root Complex or a Switch
An Endpoint Embedded in a Switch or Root Complex
Memorize Your Identity
Root Complex Register Blocks (RCRBs)
Miscellaneous Rules

Chapter 22. PCI Compatible Configuration Registers


The Previous Chapter
This Chapter
The Next Chapter
Header Type 0
Header Type 1
PCI-Compatible Capabilities
Chapter 23. Expansion ROMs
The Previous Chapter
This Chapter
The Next Chapter
ROM Purpose - Device Can Be Used In Boot Process
ROM Detection
ROM Shadowing Required
ROM Content
Execution of Initialization Code
Introduction to Open Firmware
Chapter 24. Express-Specific Configuration Registers
The Previous Chapter
This Chapter
Introduction
PCI Express Capability Register Set
PCI Express Extended Capabilities
RCRB

Appendices
Appendix A. Test, Debug and Verification

Scope
Serial Bus Topology

Dual-Simplex
Setting Up the Analyzer, Capturing and Triggering
Link Training, the First Step in Communication
Slot Connector vs. Mid-Bus Pad
Exercising: In-Depth Verification
Signal Integrity, Design and Measurement

Appendix B. Markets & Applications for the PCI Express™ Architecture


Introduction
Enterprise Computing Systems
Embedded Control
Storage Systems
Communications Systems
Summary
Appendix C. Implementing Intelligent Adapters and Multi-Host Systems With PCI Express™ Technology
Introduction
Usage Models
The History - Multi-Processor Implementations Using PCI
Implementing Multi-host/Intelligent Adapters in PCI Express Base Systems
Summary
Address Translation

Appendix D. Class Codes


Appendix E. Locked Transactions Series
Introduction
Background
The PCI Express Lock Protocol
Summary of Locking Rules

Index
Copyright
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in this book, and Addison-Wesley
was aware of the trademark claim, the designations have been printed in initial capital letters or
all capital letters.

The authors and publisher have taken care in the preparation of this book, but make no expressed
or implied warranty of any kind and assume no responsibility for errors or omissions. No liability
is assumed for incidental or consequential damages in connection with or arising out of the use
of the information or programs contained herein.

The publisher offers discounts on this book when ordered in quantity for bulk purchases and
special sales. For more information, please contact:

U.S. Corporate and Government Sales


(800) 382-3419
[email protected]

For sales outside of the U.S., please contact:

International Sales
(317) 581-3793
[email protected]

Visit Addison-Wesley on the Web: www.awprofessional.com

Library of Congress Cataloging-in-Publication Data

Budruk, Ravi.
PCI express system architecture / Mindshare, Inc., Ravi Budruk ... [et al.].
p. cm.
Includes index.
ISBN 0-321-15630-7 (alk. paper)
1. Computer architecture. 2. Microcomputers--buses. 3. Computer architecture. I.
Budruk, Ravi II. Mindshare, Inc. III. Title.
QA76.9.A73P43 2003
004.2'2--dc22    2003015461

Copyright © 2004 by MindShare, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording,
or otherwise, without the prior consent of the publisher. Printed in the United States of America.
Published simultaneously in Canada.
For information on obtaining permission for use of material from this work, please submit a
written request to:

Pearson Education, Inc.


Rights and Contracts Department
75 Arlington Street, Suite 300
Boston, MA 02116
Fax: (617) 848-7047

Set in 10 point Palatino by MindShare, Inc.

1 2 3 4 5 6 7 8 9 10    CRS    0706050403

First printing, September 2003

Dedication
To my parents Aruna and Shripal Budruk who started me on the path to Knowledge
Figures
1-1 Comparison of Performance Per Pin for Various Buses

1-2 33 MHz PCI Bus Based Platform

1-3 Typical PCI Burst Memory Read Bus Cycle

1-4 33 MHz PCI Based System Showing Implementation of a PCI-to-PCI Bridge

1-5 PCI Transaction Model

1-6 PCI Bus Arbitration

1-7 PCI Transaction Retry Mechanism

1-8 PCI Transaction Disconnect Mechanism

1-9 PCI Interrupt Handling

1-10 PCI Error Handling Protocol

1-11 Address Space Mapping

1-12 PCI Configuration Cycle Generation

1-13 256 Byte PCI Function Configuration Register Space

1-14 Latest Generation of PCI Chipsets

1-15 66 MHz PCI Bus Based Platform

1-16 66 MHz/133 MHz PCI-X Bus Based Platform

1-17 Example PCI-X Burst Memory Read Bus Cycle

1-18 PCI-X Split Transaction Protocol

1-19 Hypothetical PCI-X 2.0 Bus Based Platform

1-20 PCI Express Link


1-21 PCI Express Differential Signal

1-22 PCI Express Topology

1-23 Low Cost PCI Express System

1-24 Another Low Cost PCI Express System

1-25 PCI Express High-End Server System

2-1 Non-Posted Read Transaction Protocol

2-2 Non-Posted Locked Read Transaction Protocol

2-3 Non-Posted Write Transaction Protocol

2-4 Posted Memory Write Transaction Protocol

2-5 Posted Message Transaction Protocol

2-6 Non-Posted Memory Read Originated by CPU and Targeting an Endpoint

2-7 Non-Posted Memory Read Originated by Endpoint and Targeting Memory

2-8 IO Write Transaction Originated by CPU, Targeting Legacy Endpoint

2-9 Memory Write Transaction Originated by CPU, Targeting Endpoint

2-10 PCI Express Device Layers

2-11 TLP Origin and Destination

2-12 TLP Assembly

2-13 TLP Disassembly

2-14 DLLP Origin and Destination

2-15 DLLP Assembly

2-16 DLLP Disassembly


2-17 PLP Origin and Destination

2-18 PLP or Ordered-Set Structure

2-19 Detailed Block Diagram of PCI Express Device's Layers

2-20 TLP Structure at the Transaction Layer

2-21 Flow Control Process

2-22 Example Showing QoS Capability of PCI Express

2-23 TC Numbers and VC Buffers

2-24 Switch Implements Port Arbitration and VC Arbitration Logic

2-25 Data Link Layer Replay Mechanism

2-26 TLP and DLLP Structure at the Data Link Layer

2-27 Non-Posted Transaction on Link

2-28 Posted Transaction on Link

2-29 TLP and DLLP Structure at the Physical Layer

2-30 Electrical Physical Layer Showing Differential Transmitter and Receiver

2-31 Memory Read Request Phase

2-32 Completion with Data Phase

3-1 Multi-Port PCI Express Devices Have Routing Responsibilities

3-2 PCI Express Link Local Traffic: Ordered Sets

3-3 PCI Express Link Local Traffic: DLLPs

3-4 PCI Express Transaction Request And Completion TLPs

3-5 Transaction Layer Packet Generic 3DW And 4DW Headers


3-6 Generic System Memory And IO Address Maps

3-7 3DW TLP Header Address Routing Fields

3-8 4DW TLP Header Address Routing Fields

3-9 Endpoint Checks Routing Of An Inbound TLP Using Address Routing

3-10 Switch Checks Routing Of An Inbound TLP Using Address Routing

3-11 3DW TLP Header ID Routing Fields

3-12 4DW TLP Header ID Routing Fields

3-13 Switch Checks Routing Of An Inbound TLP Using ID Routing

3-14 4DW Message TLP Header Implicit Routing Fields

3-15 PCI Express Devices And Type 0 And Type 1 Header Use

3-16 PCI Express Configuration Space Type 0 and Type 1 Headers

3-17 32-Bit Prefetchable Memory BAR Set Up

3-18 64-Bit Prefetchable Memory BAR Set Up

3-19 IO BAR Set Up

3-20 6GB, 64-Bit Prefetchable Memory Base/Limit Register Set Up

3-21 2MB, 32-Bit Non-Prefetchable Base/Limit Register Set Up

3-22 IO Base/Limit Register Set Up

3-23 Bus Number Registers In A Switch

4-1 TLP And DLLP Packets

4-2 PCI Express Layered Protocol And TLP Assembly/Disassembly

4-3 Generic TLP Header Fields


4-4 Using First DW and Last DW Byte Enable Fields

4-5 Transaction Descriptor Fields

4-6 System IO Map

4-7 3DW IO Request Header Format

4-8 3DW And 4DW Memory Request Header Formats

4-9 3DW Configuration Request And Header Format

4-10 3DW Completion Header Format

4-11 4DW Message Request Header Format

4-12 Data Link Layer Sends A DLLP

4-13 Generic Data Link Layer Packet Format

4-14 Ack Or Nak DLLP Packet Format

4-15 Power Management DLLP Packet Format

4-16 Flow Control DLLP Packet Format

4-17 Vendor Specific DLLP Packet Format

5-1 Data Link Layer

5-2 Overview of the ACK/NAK Protocol

5-3 Elements of the ACK/NAK Protocol

5-4 Transmitter Elements Associated with the ACK/NAK Protocol

5-5 Receiver Elements Associated with the ACK/NAK Protocol

5-6 Ack Or Nak DLLP Packet Format

5-7 Example 1 that Shows Transmitter Behavior with Receipt of an ACK DLLP
5-8 Example 2 that Shows Transmitter Behavior with Receipt of an ACK DLLP

5-9 Example that Shows Transmitter Behavior on Receipt of a NAK DLLP

5-10 Table and Equation to Calculate REPLAY_TIMER Load Value

5-11 Example that Shows Receiver Behavior with Receipt of Good TLP

5-12 Example that Shows Receiver Behavior When It Receives Bad TLPs

5-13 Table to Calculate ACKNAK_LATENCY_TIMER Load Value

5-14 Lost TLP Handling

5-15 Lost ACK DLLP Handling

5-16 Lost ACK DLLP Handling

5-17 Switch Cut-Through Mode Showing Error Handling

6-1 Example Application of Isochronous Transaction

6-2 VC Configuration Registers Mapped in Extended Configuration Address Space

6-3 The Number of VCs Supported by Device Can Vary

6-4 Extended VCs Supported Field

6-5 VC Resource Control Register

6-6 TC to VC Mapping Example

6-7 Conceptual VC Arbitration Example

6-8 Strict Arbitration Priority

6-9 Low Priority Extended VC Count

6-10 Determining VC Arbitration Capabilities and Selecting the Scheme

6-11 VC Arbitration with Low-and High-Priority Implementations


6-12 Weighted Round Robin Low-Priority VC Arbitration Table Example

6-13 VC Arbitration Table Offset and Load VC Arbitration Table Fields

6-14 Loading the VC Arbitration Table Entries

6-15 Example Multi-Function Endpoint Implementation with VC Arbitration

6-16 Port Arbitration Concept

6-17 Port Arbitration Tables Needed for Each VC

6-18 Port Arbitration Buffering

6-19 Software checks Port Arbitration Capabilities and Selects the Scheme to be Used

6-20 Maximum Time Slots Register

6-21 Format of Port Arbitration Table

6-22 Example of Port and VC Arbitration within A Switch

7-1 Location of Flow Control Logic

7-2 Flow Control Buffer Organization

7-3 Flow Control Elements

7-4 Types and Format of Flow Control Packets

7-5 Flow Control Elements Following Initialization

7-6 Flow Control Elements Following Delivery of First Transaction

7-7 Flow Control Elements with Flow Control Buffer Filled

7-8 Flow Control Rollover Problem

7-9 Initial State of Example FC Elements

7-10 INIT1 Flow Control Packet Format and Contents


7-11 Devices Send and Initialize Flow Control Registers

7-12 Device Confirm that Flow Control Initialization is Completed for a Given Buffer

7-13 Flow Control Update Example

7-14 Update Flow Control Packet Format and Contents

8-1 Example of Strongly Ordered Transactions that Results in Temporary Blocking

9-1 Native PCI Express and Legacy PCI Interrupt Delivery

9-2 64-bit MSI Capability Register Format

9-3 32-bit MSI Capability Register Set Format

9-4 Message Control Register

9-5 Device MSI Configuration Process

9-6 Format of Memory Write Transaction for Native-Device MSI Delivery

9-7 Interrupt Pin Register within PCI Configuration Header

9-8 INTx Signal Routing is Platform Specific

9-9 Configuration Command Register - Interrupt Disable Field

9-10 Configuration Status Register - Interrupt Status Field

9-11 Legacy Devices Use INTx Messages to Virtualize INTA#-INTD# Signal Transitions

9-12 Switch Collapses INTx Message to Achieve Wired-OR Characteristics

9-13 INTx Message Format and Types

9-14 PCI Express System with PCI-Based IO Controller Hub

10-1 The Scope of PCI Express Error Checking and Reporting

10-2 Location of PCI Express Error-Related Configuration Registers


10-3 The Error/Poisoned Bit within Packet Headers

10-4 Basic Format of the Error Messages

10-5 Completion Status Field within the Completion Header

10-6 PCI-Compatible Configuration Command Register

10-7 PCI-Compatible Status Register (Error-Related Bits)

10-8 PCI Express Capability Register Set

10-9 Device Control Register Bit Fields Related to Error Handling

10-10 Device Status Register Bit Fields Related to Error Handling

10-11 Link Control Register Allows Retraining of Link

10-12 Link Retraining Status Bits within the Link Status Register

10-13 Root Control Register

10-14 Advanced Error Capability Registers

10-15 The Advanced Error Capability & Control Register

10-16 Advanced Correctable Error Status Register

10-17 Advanced Correctable Error Mask Register

10-18 Advanced Uncorrectable Error Status Register

10-19 Advanced Uncorrectable Error Severity Register

10-20 Advanced Uncorrectable Error Mask Register

10-21 Root Error Status Register

10-22 Advanced Source ID Register

10-23 Advanced Root Error Command Register


10-24 Error Handling Flow Chart

11-1 Physical Layer

11-2 Logical and Electrical Sub-Blocks of the Physical Layer

11-3 Physical Layer Details

11-4 Physical Layer Transmit Logic Details

11-5 Transmit Logic Multiplexer

11-6 TLP and DLLP Packet Framing with Start and End Control Characters

11-7 x1 Byte Striping

11-8 x4 Byte Striping

11-9 x8, x12, x16, x32 Byte Striping

11-10 x1 Packet Format

11-11 x4 Packet Format

11-12 x8 Packet Format

11-13 Scrambler

11-14 Example of 8-bit Character of 00h Encoded to 10-bit Symbol

11-15 Preparing 8-bit Character for Encode

11-16 8-bit to 10-bit (8b/10b) Encoder

11-17 Example 8-bit/10-bit Encodings

11-18 Example 8-bit/10-bit Transmission

11-19 SKIP Ordered-Set

11-20 Physical Layer Receive Logic Details


11-21 Receiver Logic's Front End Per Lane

11-22 Receiver's Link De-Skew Logic

11-23 8b/10b Decoder per Lane

11-24 Example of Delayed Disparity Error Detection

11-25 Example of x8 Byte Un-Striping

12-1 Electrical Sub-Block of the Physical Layer

12-2 Differential Transmitter/Receiver

12-3 Receiver DC Common Mode Voltage Requirement

12-4 Receiver Detection Mechanism

12-5 Pictorial Representation of Differential Peak-to-Peak and Differential Peak Voltages

12-6 Electrical Idle Ordered-Set

12-7 Transmission with De-emphasis

12-8 Problem of Inter-Symbol Interference

12-9 Solution is Pre-emphasis

12-10 LVDS (Low-Voltage Differential Signal) Transmitter Eye Diagram

12-11 Transmitter Eye Diagram Jitter Indication

12-12 Transmitter Eye Diagram Noise/Attenuation Indication

12-13 Screen Capture of a Normal Eye (With no De-emphasis Shown)

12-14 Screen Capture of a Bad Eye Showing Effect of Jitter, Noise and Signal
Attenuation (With no De-emphasis Shown)

12-15 Compliance Test/Measurement Load


12-16 Receiver Eye Diagram

12-17 L0 Full-On Link State

12-18 L0s Low Power Link State

12-19 L1 Low Power Link State

12-20 L2 Low Power Link State

12-21 L3 Link Off State

13-1 PERST# Generation

13-2 TS1 Ordered-Set Showing the Hot Reset Bit

13-3 Secondary Bus Reset Register to Generate Hot Reset

13-4 Switch Generates Hot Reset on One Downstream Port

13-5 Switch Generates Hot Reset on All Downstream Ports

14-1 Link Training and Status State Machine Location

14-2 Example Showing Lane Reversal

14-3 Example Showing Polarity Inversion

14-4 Five Ordered-Sets Used in the Link Training and Initialization Process

14-5 Link Training and Status State Machine (LTSSM)

14-6 Detect State Machine

14-7 Polling State Machine

14-8 Configuration State Machine

14-9 Combining Lanes to form Links

14-10 Example 1 Link Numbering and Lane Numbering


14-11 Example 2 Link Numbering and Lane Numbering

14-12 Example 3 Link Numbering and Lane Numbering

14-13 Recovery State Machine

14-14 L0s Transmitter State Machine

14-15 L0s Receiver State Machine

14-16 L1 State Machine

14-17 L2 State Machine

14-18 Hot Reset State Machine

14-19 Disable State Machine

14-20 Loopback State Machine

14-21 Link Capabilities Register

14-22 Link Status Register

14-23 Link Control Register

15-1 System Allocated Bit

15-2 Elements Involved in Power Budget

15-3 Slot Power Limit Sequence

15-4 Power Budget Capability Registers

15-5 Power Budget Data Field Format and Definition

16-1 Relationship of OS, Device Drivers, Bus Driver, PCI Express Registers, and ACPI

16-2 Example of OS Powering Down All Functions On PCI Express Links and then the
Links Themselves
16-3 Example of OS Restoring a PCI Express Function To Full Power

16-4 OS Prepares a Function To Cause System WakeUp On Device-Specific Event

16-5 PCI Power Management Capability Register Set

16-6 PCI Express Function Power Management State Transitions

16-7 PCI Function's PM Registers

16-8 Power Management Capabilities (PMC) Register - Read Only

16-9 Power Management Control/Status (PMCSR) Register - R/W

16-10 PM Registers

16-11 ASPM Link State Transitions

16-12 ASPM Support

16-13 Active State PM Control Field

16-14 Ports that Initiate L1 ASPM Transitions

16-15 Negotiation Sequence Required to Enter L1 Active State PM

16-16 Negotiation Sequence Resulting in Rejection to Enter L1 ASPM State

16-17 Switch Behavior When Downstream Component Signals L1 Exit

16-18 Switch Behavior When Upstream Component Signals L1 Exit

16-19 Example of Total L1 Latency

16-20 Config. Registers Used for ASPM Exit Latency Management and Reporting

16-21 Devices Transition to L1 When Software Changes their Power Level from D0

16-22 Software Placing a Device into a D2 State and Subsequent Transition to L1

16-23 Procedure Used to Transition a Link from the L0 to L1 State


16-24 Link States Transitions Associated with Preparing Devices for Removal of the
Reference Clock and Power

16-25 Negotiation for Entering L2/L3 Ready State

16-26 State Transitions from L2/L3 Ready When Power is Removed

16-27 PME Message Format

16-28 WAKE# Signal Implementations

16-29 Auxiliary Current Enable for Devices Not Supporting PMEs

17-1 PCI Hot Plug Elements

17-2 PCI Express Hot-Plug Hardware/Software Elements

17-3 Hot Plug Control Functions within a Switch

17-4 PCI Express Configuration Registers Used for Hot-Plug

17-5 Attention Button and Hot Plug Indicators Present Bits

17-6 Slot Control Register Fields

17-7 Slot Status Register Fields

17-8 Location of Attention Button and Indicators

17-9 Hot-Plug Capability Bits for Server IO Modules

17-10 Hot Plug Message Format

18-1 PCI Express x1 connector

18-2 PCI Express Connectors on System Board

18-3 PERST Timing During Power Up

18-4 PERST# Timing During Power Management States


18-5 Example of WAKE# Circuit Protection

18-6 Presence Detect

18-7 PCI Express Riser Card

18-8 Mini PCI Express Add-in Card Installed in a Mobile Platform

18-9 Mini PCI Express Add-in Card Photo 1

18-10 Mini PCI Express Add-in Card Photo 2

19-1 Example System

19-2 Topology View At Startup

19-3 4KB Configuration Space per PCI Express Function

19-4 Header Type Register

20-1 A Function's Configuration Space

20-2 Configuration Address Port at 0CF8h

20-3 Example System

20-4 Peer Root Complexes

20-5 Type 0 Configuration Read Request Packet Header

20-6 Type 0 Configuration Write Request Packet Header

20-7 Type 1 Configuration Read Request Packet Header

20-8 Type 1 Configuration Write Request Packet Header

20-9 Example Configuration Access

21-1 Topology View At Startup

21-2 Example System Before Bus Enumeration


21-3 Example System After Bus Enumeration

21-4 Header Type Register

21-5 Capability Register

21-6 Header Type 0

21-7 Header Type 1

21-8 Peer Root Complexes

21-9 Multifunction Bridges in Root Complex

21-10 First Example of a Multifunction Bridge In a Switch

21-11 Second Example of a Multifunction Bridge In a Switch

21-12 Embedded Root Endpoint

21-13 Embedded Switch Endpoint

21-14 Type 0 Configuration Write Request Packet Header

21-15 RCRB Example

22-1 Header Type 0

22-2 Class Code Register

22-3 Header Type Register Bit Assignment

22-4 BIST Register Bit Assignment

22-5 Status Register

22-6 General Format of a New Capabilities List Entry

22-7 Expansion ROM Base Address Register Bit Assignment

22-8 Command Register


22-9 PCI Configuration Status Register

22-10 32-Bit Memory Base Address Register Bit Assignment

22-11 64-Bit Memory Base Address Register Bit Assignment

22-12 IO Base Address Register Bit Assignment

22-13 Header Type 1

22-14 IO Base Register

22-15 IO Limit Register

22-16 Example of IO Filtering Actions

22-17 Prefetchable Memory Base Register

22-18 Prefetchable Memory Limit Register

22-19 Memory-Mapped IO Base Register

22-20 Memory-Mapped IO Limit Register

22-21 Command Register

22-22 Bridge Control Register

22-23 Primary Interface Status Register

22-24 Secondary Status Register

22-25 Format of the AGP Capability Register Set

22-26 VPD Capability Registers

22-27 Chassis and Slot Number Registers

22-28 Main Chassis

22-29 Expansion Slot Register


22-30 Slot Capability Register

22-31 PCI Express Capabilities Register

22-32 Chassis Example One

22-33 Chassis Example Two

23-1 Expansion ROM Base Address Register Bit Assignment

23-2 Header Type Zero Configuration Register Format

23-3 Multiple Code Images Contained In One Device ROM

23-4 Code Image Format

23-5 AX Contents On Entry To Initialization Code

24-1 Function's Configuration Space Layout

24-2 PCI Express Capability Register Set

24-3 PCI Express Capabilities Register

24-4 Device Capabilities Register

24-5 Device Control Register

24-6 Device Status Register

24-7 Link Capabilities Register

24-8 Link Control Register

24-9 Link Status Register

24-10 Slot Capabilities Register

24-11 Slot Control Register

24-12 Slot Status Register


24-13 Root Control Register

24-14 Root Status Register

24-15 Enhanced Capability Header Register

24-16 Advanced Error Reporting Capability Register Set

24-17 Advanced Error Reporting Enhanced Capability Header

24-18 Advanced Error Capabilities and Control Register

24-19 Advanced Error Correctable Error Mask Register

24-20 Advanced Error Correctable Error Status Register

24-21 Advanced Error Uncorrectable Error Mask Register

24-22 Advanced Error Uncorrectable Error Severity Register

24-23 Advanced Error Uncorrectable Error Status Register

24-24 Advanced Error Root Error Command Register

24-25 Advanced Error Root Error Status Register

24-26 Advanced Error Correctable and Uncorrectable Error Source ID Registers

24-27 Port and VC Arbitration

24-28 Virtual Channel Capability Register Set

24-29 VC Enhanced Capability Header

24-30 Port VC Capability Register 1 (Read-Only)

24-31 Port VC Capability Register 2 (Read-Only)

24-32 Port VC Control Register (Read-Write)

24-33 Port VC Status Register (Read-Only)


24-34 VC Resource Capability Register

24-35 VC Resource Control Register (Read-Write)

24-36 VC Resource Status Register (Read-Only)

24-37 Device Serial Number Enhanced Capability Header

24-38 Device Serial Number Register

24-39 EUI-64 Format

24-40 Power Budget Register Set

24-41 Power Budgeting Enhanced Capability Header

24-42 Power Budgeting Data Register

24-43 Power Budgeting Capability Register

24-44 RCRB Example

A-1 PCI Parallel Bus Start and End of a Transaction Easily Identified

A-2 PCI Express Serial Bit Stream

A-3 PCI Express Dual-Simplex Bus

A-4 Capturing All Patterns on PCI Express

A-5 Specific Trigger Definition for Upstream or Downstream Pair

A-6 Start with TS1

A-7 SKIP

A-8 Completion of 1024 TS1

A-9 Lane Number Declaration

A-10 Start of TS2


A-11 Initialization of Flow Control 1

A-12 Initialization of Flow Control 2

A-13 Flow Control Updates

A-14 Alternate Display in Listing Format

A-15 Mid-bus Pad Definition

A-16 Mid-Bus Suggested Signal Assignment

A-17 Exerciser Covering All Possible Commands

A-18 Exerciser Bit Level Manipulation Allowing Various Options

A-19 Supporting All Layers, Simultaneously

A-20 Jitter Analysis of a Transceiver Source Clock - Acceptable (for a specific device)

A-21 Jitter Analysis of a Transceiver Source Clock - Unacceptable (for a specific device)

B-1 Migration from PCI to PCI Express

B-2 PCI Express in a Desktop System

B-3 PCI Express in a Server System

B-4 PCI Express in Embedded-Control Applications

B-5 PCI Express in a Storage System

B-6 PCI Express in Communications Systems

C-1 Enumeration Using Transparent Bridges

C-2 Direct Address Translation

C-3 Look Up Table Translation Creates Multiple Windows

C-4 Intelligent Adapters in PCI and PCI Express Systems

C-5 Host Failover in PCI and PCI Express Systems

C-6 Dual Host in a PCI and PCI Express System

C-7 Dual-Star Fabric

C-8 Direct Address Translation

C-9 Lookup Table Based Translation

C-10 Use of Limit Register

D-1 Class Code Register

E-1 Lock Sequence Begins with Memory Read Lock Request

E-2 Lock Completes with Memory Write Followed by Unlock Message


Tables
1 PC Architecture Book Series

1-1 Bus Specifications and Release Dates

1-2 Comparison of Bus Frequency, Bandwidth and Number of Slots

1-3 PCI Express Aggregate Throughput for Various Link Widths

2-1 PCI Express Non-Posted and Posted Transactions

2-2 PCI Express TLP Packet Types

2-3 PCI Express Aggregate Throughput for Various Link Widths

3-1 Ordered Set Types

3-2 Data Link Layer Packet (DLLP) Types

3-3 PCI Express Address Space And Transaction Types

3-4 PCI Express Posted and Non-Posted Transactions

3-5 PCI Express TLP Variants And Routing Options

3-6 TLP Header Type and Format Field Encodings

3-7 Message Request Header Type Field Usage

3-8 Results Of Reading The BAR after Writing All "1s" To It

3-9 Results Of Reading The BAR Pair after Writing All "1s" To Both

3-10 Results Of Reading The IO BAR after Writing All "1s" To It

3-11 6 GB, 64-Bit Prefetchable Base/Limit Register Setup

3-12 2MB, 32-Bit Non-Prefetchable Base/Limit Register Setup

3-13 256 Byte IO Base/Limit Register Setup


4-1 PCI Express Address Space And Transaction Types

4-2 TLP Header Type Field Defines Transaction Variant

4-3 TLP Header Type Field Defines Transaction Variant

4-4 Generic Header Field Summary

4-5 TLP Header Type and Format Field Encodings

4-6 IO Request Header Fields

4-7 4DW Memory Request Header Fields

4-8 Configuration Request Header Fields

4-9 Completion Header Fields

4-10 Message Request Header Fields

4-11 INTx Interrupt Signaling Message Coding

4-12 Power Management Message Coding

4-13 Error Message Coding

4-14 Unlock Message Coding

4-15 Slot Power Limit Message Coding

4-16 Hot Plug Message Coding

4-17 DLLP Packet Types

4-18 Ack or Nak DLLP Fields

4-19 Power Management DLLP Fields

4-20 Flow Control DLLP Fields

4-21 Vendor-Specific DLLP Fields


5-1 Ack or Nak DLLP Fields

6-1 Example TC to VC Mappings

7-1 Required Minimum Flow Control Advertisements

8-1 Transactions That Can Be Reordered Due to Relaxed Ordering

8-2 Fundamental Ordering Rules Based on Strong Ordering and RO Attribute

8-3 Weak Ordering Rules Enhance Performance

8-4 Ordering Rules with Deadlock Avoidance Rules

9-1 Format and Usage of Message Control Register

9-2 INTx Message Codes

10-1 Error Message Codes and Description

10-2 Completion Code and Description

10-3 Error-Related Command Register Bits

10-4 Description of PCI-Compatible Status Register Bits for Reporting Errors

10-5 Default Classification of Errors

10-6 Transaction Layer Errors That are Logged

11-1 5-bit to 6-bit Encode Table for Data Characters

11-2 5-bit to 6-bit Encode Table for Control Characters

11-3 3-bit to 4-bit Encode Table for Data Characters

11-4 3-bit to 4-bit Encode Table for Control Characters

11-5 Control Character Encoding and Definition

12-1 Output Driver Characteristics


12-2 Input Receiver Characteristics

14-1 Summary of TS1 and TS2 Ordered-Set Contents

15-1 Maximum Power Consumption for System Board Expansion Slots

16-1 Major Software/Hardware Elements Involved In PC PM

16-2 System PM States as Defined by the OnNow Design Initiative

16-3 OnNow Definition of Device-Level PM States

16-4 Concise Description of OnNow Device PM States

16-5 Default Device Class PM States

16-6 D0 Power Management Policies

16-7 D1 Power Management Policies

16-8 D2 Power Management Policies

16-9 D3hot Power Management Policies

16-10 D3cold Power Management Policies

16-11 Description of Function State Transitions

16-12 Function State Transition Delays

16-13 The PMC Register Bit Assignments

16-14 PM Control/Status Register (PMCSR) Bit Assignments

16-15 Data Register Interpretation

16-16 Relationship Between Device and Link Power States

16-17 Link Power State Characteristics

16-18 Active State Power Management Control Field Definition


17-1 Introduction to Major Hot-Plug Software Elements

17-2 Major Hot-Plug Hardware Elements

17-3 Behavior and Meaning of the Slot Attention Indicator

17-4 Behavior and Meaning of the Power Indicator

17-5 Slot Capability Register Fields and Descriptions

17-6 Slot Control Register Fields and Descriptions

17-7 Slot Status Register Fields and Descriptions

17-8 The Primitives

18-1 PCI Express Connector Pinout

18-2 PCI Express Connector Auxiliary Signals

18-3 Power Supply Requirements

18-4 Add-in Card Power Dissipation

18-5 Card Interoperability

20-1 Enhanced Configuration Mechanism Memory-Mapped IO Address Range

21-1 Capability Register's Device/Port Type Field Encoding

22-1 Defined Class Codes

22-2 BIST Register Bit Assignment

22-3 Currently-Assigned Capability IDs

22-4 Command Register

22-5 Status Register

22-6 Bridge Command Register Bit Assignment


22-7 Bridge Control Register Bit Assignment

22-8 Bridge Primary Side Status Register

22-9 Bridge Secondary Side Status Register

22-10 AGP Status Register (Offset CAP_PTR + 4)

22-11 AGP Command Register (Offset CAP_PTR + 8)

22-12 Basic Format of VPD Data Structure

22-13 Format of the Identifier String Tag

22-14 Format of the VPD-R Descriptor

22-15 General Format of a Read or a Read/Write Keyword Entry

22-16 List of Read-Only VPD Keywords

22-17 Extended Capability (CP) Keyword Format

22-18 Format of Checksum Keyword

22-19 Format of the VPD-W Descriptor

22-20 List of Read/Write VPD Keywords

22-21 Example VPD List

22-22 Slot Numbering Register Set

22-23 Expansion Slot Register Bit Assignment

23-1 PCI Expansion ROM Header Format

23-2 PC-Compatible Processor/Architecture Data Area In ROM Header

23-3 PCI Expansion ROM Data Structure Format

24 - 1 PCI Express Capabilities Register


24 - 2 Device Capabilities Register (read-only)

24 - 3 Device Control Register (read/write)

24 - 4 Device Status Register

24 - 5 Link Capabilities Register

24 - 6 Link Control Register

24 - 7 Link Status Register

24 - 8 Slot Capabilities Register (all fields are HWInit)

24 - 9 Slot Control Register (all fields are RW)

24 - 10 Slot Status Register

24 - 11 Root Control Register (all fields are RW)

24 - 12 Root Status Register

24 - 13 Advanced Error Reporting Capability Register Set

24 - 14 Port VC Capability Register 1 (Read-Only)

24 - 15 Port VC Capability Register 2 (Read-Only)

24 - 16 Port VC Control Register (Read-Write)

24 - 17 Port VC Status Register (Read-Only)

24 - 18 VC Resource Capability Register

24 - 19 VC Resource Control Register (Read-Write)

24 - 20 VC Resource Status Register (Read-Only)

D-1 Defined Class Codes

D-2 Class Code 0 (PCI rev 1.0)


D-3 Class Code 1: Mass Storage Controllers

D-4 Class Code 2: Network Controllers

D-5 Class Code 3: Display Controllers

D-6 Class Code 4: Multimedia Devices

D-7 Class Code 5: Memory Controllers

D-8 Class Code 6: Bridge Devices

D-9 Class Code 7: Simple Communications Controllers

D-10 Class Code 8: Base System Peripherals

D-11 Class Code 9: Input Devices

D-12 Class Code A: Docking Stations

D-13 Class Code B: Processors

D-14 Class Code C: Serial Bus Controllers

D-15 Class Code D: Wireless Controllers

D-16 Class Code E: Intelligent IO Controllers

D-17 Class Code F: Satellite Communications Controllers

D-18 Class Code 10h: Encryption/Decryption Controllers

D-19 Class Code 11h: Data Acquisition and Signal Processing Controllers

D-20 Definition of IDE Programmer's Interface Byte Encoding


Acknowledgments
Thanks to those who made significant contributions to this book:

Joe Winkles - for his superb job of technical editing.

Jay Trodden - for his contribution in developing the chapters on Transaction Routing and Packet-Based Transactions.

Mike Jackson - for his contribution in preparing the Card Electromechanical chapter.

Dave Dzatko - for research and editing.

Special thanks to Catalyst Enterprises, Inc. for supplying:

Appendix A: Test, Debug and Verification

Special thanks to PLX Technology for contributing two appendices:

Appendix B: Markets & Applications for the PCI Express™ Architecture

Appendix C: Implementing Intelligent Adapters and Multi-Host Systems With PCI Express™ Technology

Thanks also to the PCI SIG for giving permission to use some of the mechanical drawings from
the specification.
About This Book

The MindShare Architecture Series

Cautionary Note

Intended Audience

Prerequisite Knowledge

Topics and Organization

Documentation Conventions

Visit Our Web Site

We Want Your Feedback


The MindShare Architecture Series
The MindShare Architecture book series currently includes the books listed in Table 1 below.
The entire book series is published by Addison-Wesley.

Table 1. PC Architecture Book Series

Processor Architecture
    80486 System Architecture, 3rd Edition, ISBN 0-201-40994-1
    Pentium Processor System Architecture, 2nd Edition, ISBN 0-201-40992-5
    Pentium Pro and Pentium II System Architecture, 2nd Edition, ISBN 0-201-30973-4
    PowerPC System Architecture, 1st Edition, ISBN 0-201-40990-9

Bus Architecture
    PCI System Architecture, 4th Edition, ISBN 0-201-30974-2
    PCI-X System Architecture, 1st Edition, ISBN 0-201-72682-3
    EISA System Architecture, Out-of-print, ISBN 0-201-40995-X
    Firewire System Architecture: IEEE 1394a, 2nd Edition, ISBN 0-201-48535-4
    ISA System Architecture, 3rd Edition, ISBN 0-201-40996-8
    Universal Serial Bus System Architecture 2.0, 2nd Edition, ISBN 0-201-46137-4
    HyperTransport System Architecture, 1st Edition, ISBN 0-321-16845-3
    PCI Express System Architecture, 1st Edition, ISBN 0-321-15630-7

Network Architecture
    Infiniband Network Architecture, 1st Edition, ISBN 0-321-11765-4

Other Architectures
    PCMCIA System Architecture: 16-Bit PC Cards, 2nd Edition, ISBN 0-201-40991-7
    CardBus System Architecture, 1st Edition, ISBN 0-201-40997-6
    Plug and Play System Architecture, 1st Edition, ISBN 0-201-41013-3
    Protected Mode Software Architecture, 1st Edition, ISBN 0-201-55447-X
    AGP System Architecture, 1st Edition, ISBN 0-201-37964-3


Cautionary Note
The reader should keep in mind that MindShare's book series often details rapidly evolving
technologies, as is the case with PCI Express. This being the case, it should be recognized that
the book is a "snapshot" of the state of the technology at the time the book was completed. We
make every attempt to produce our books on a timely basis, but a new revision of the
specification may be released too late to be incorporated into the current edition. This PCI
Express book complies with revision 1.0a of the PCI Express™ Base Specification released and
trademarked by the PCI Special Interest Group. Several expansion card form-factor
specifications are planned for PCI Express, but only the Electromechanical specification,
revision 1.0, had been released when this book was completed. However, the chapter covering
the Card Electromechanical topic reviews several form factors that were under development at
the time of writing.
Intended Audience
This book is intended for use by hardware and software design and support personnel. The
tutorial approach taken may also make it useful to technical personnel not directly involved in
design, verification, and other support functions.
Prerequisite Knowledge
It is recommended that the reader have a reasonable background in PC architecture, including
experience with or knowledge of an I/O bus and related protocol. Because PCI Express maintains
several levels of compatibility with the original PCI design, critical background information
regarding PCI has been incorporated into this book. However, the reader may find it beneficial
to read the MindShare publication entitled PCI System Architecture, which focuses on and
details the PCI architecture.
Topics and Organization
Topics covered in this book and the flow of the book are as follows:

Part 1: Background and Comprehensive Overview. Provides an architectural perspective of
the PCI Express technology by comparing and contrasting it with the PCI and PCI-X buses. It
also introduces the major features of the PCI Express architecture.

Part 2: PCI Express Transaction Protocol. Includes packet format and field definition and
use, along with transaction and link layer functions.

Part 3: Physical Layer Description. Describes the physical layer functions, link training and
initialization, reset, and electrical signaling.

Part 4: Power-Related Topics. Discusses Power Budgeting and Power Management.

Part 5: Optional Topics. Discusses the major features of PCI Express that are optional,
including Hot Plug and Expansion Card implementation details.

Part 6: PCI Express Configuration. Discusses the configuration process, accessing
configuration space, and details the content and use of all configuration registers.

Appendix:

Test, Debug, and Verification

Markets & Applications for the PCI Express™ Architecture

Implementing Intelligent Adapters and Multi-Host Systems With PCI Express™ Technology

PCI Express Class Codes

Legacy Support for Locking


Documentation Conventions
This section defines the typographical conventions used throughout this book.

PCI Express™

PCI Express™ is a trademark of the PCI SIG. This book takes the liberty of abbreviating PCI
Express as "PCI-XP", primarily in illustrations where limited space is an issue.

Hexadecimal Notation

All hex numbers are followed by a lower case "h." For example:

89F2BD02h

0111h

Binary Notation

All binary numbers are followed by a lower case "b." For example:

1000 1001 1111 0010b

01b

Decimal Notation

Numbers without any suffix are decimal. When required for clarity, decimal numbers are followed
by a lower case "d." Examples:

15

512d

Bits Versus Bytes Notation


This book represents bits with a lower case "b" and bytes with an upper case "B." For example:

Megabits/second = Mb/s

Megabytes/second = MB/s

Bit Fields

Groups of bits are represented with the high-order bits first, followed by the low-order bits, and
enclosed in brackets. For example:

[7:0] = bits 0 through 7
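
As an illustration only (not part of the original text), the short C fragment below shows how a bit field written in this [high:low] notation maps onto shift-and-mask operations; the register value reused here is the hex example from above, and the helper function name is made up for the sketch.

#include <stdio.h>
#include <stdint.h>

/* Extract bits [hi:lo] from a 32-bit value, matching the [high-order:low-order]
   bit-field notation used in this book. */
static uint32_t get_field(uint32_t value, unsigned hi, unsigned lo)
{
    uint32_t width = hi - lo + 1;
    uint32_t mask  = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (value >> lo) & mask;
}

int main(void)
{
    uint32_t reg = 0x89F2BD02u;                        /* 89F2BD02h           */
    printf("[7:0]  = %02Xh\n", get_field(reg, 7, 0));  /* prints [7:0]  = 02h */
    printf("[15:8] = %02Xh\n", get_field(reg, 15, 8)); /* prints [15:8] = BDh */
    return 0;
}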

Active Signal States

Signals that are active low are followed by #, as in PERST# and WAKE#. Active high signals
have no suffix, such as POWERGOOD.
Visit Our Web Site
Our web site lists all of our courses and the delivery options available for each course:

Information on MindShare courses:

Self-paced DVDs and CDs

Live web-delivered classes

Live on-site classes.

Free short courses on selected topics

Technical papers

Errata for a number of our books

All of our books are listed and can be ordered in bound or e-book versions.

www.mindshare.com
We Want Your Feedback
MindShare values your comments and suggestions. Contact us at:

Phone: (719) 487-1417 or within the U.S. (800) 633-1440

Fax: (719) 487-1434

Technical seminars: E-mail [email protected]

Technical questions: E-mail [email protected] or [email protected]

General information: E-mail [email protected]

Mailing Address:

MindShare, Inc.
4285 Slash Pine Drive
Colorado Springs, CO 80908
Part One: The Big Picture

Chapter 1. Architectural Perspective

Chapter 2. Architecture Overview


Chapter 1. Architectural Perspective

This Chapter

The Next Chapter

Introduction To PCI Express

Predecessor Buses Compared

I/O Bus Architecture Perspective

The PCI Express Way

PCI Express Specifications


This Chapter
This chapter describes performance advantages and key features of the PCI Express (PCI-XP)
Link. To highlight these advantages, this chapter describes performance characteristics and
features of predecessor buses such as PCI and PCI-X buses with the goal of discussing the
evolution of PCI Express from these predecessor buses. The reader will be able to compare
and contrast features and performance points of PCI, PCI-X and PCI Express buses. The key
features of a PCI Express system are described. In addition, the chapter describes some
examples of PCI Express system topologies.
The Next Chapter
The next chapter describes in further detail the features of the PCI Express bus. It describes
the layered architecture of a device design while providing a brief functional description of each
layer. The chapter provides an overview of packet formation at a transmitter device, the
transmission and reception of the packet over the PCI Express Link and packet decode at a
receiver device.
Introduction To PCI Express
PCI Express is the third generation high performance I/O bus used to interconnect peripheral
devices in applications such as computing and communication platforms. The first generation
buses include the ISA, EISA, VESA, and Micro Channel buses, while the second generation
buses include PCI, AGP, and PCI-X. PCI Express is an all encompassing I/O device
interconnect bus that has applications in the mobile, desktop, workstation, server, embedded
computing and communication platforms.

The Role of the Original PCI Solution

Don't Throw Away What is Good! Keep It

The PCI Express architects have carried forward the most beneficial features from previous
generation bus architectures and have also taken advantage of new developments in computer
architecture.

For example, PCI Express employs the same usage model and load-store communication
model as PCI and PCI-X. PCI Express supports familiar transactions such as memory
read/write, IO read/write and configuration read/write transactions. The memory, IO and
configuration address space model is the same as PCI and PCI-X address spaces. By
maintaining the address space model, existing OSs and driver software will run in a PCI
Express system without any modifications. In other words, PCI Express is software backwards
compatible with PCI and PCI-X systems. In fact, a PCI Express system will boot an existing
OS with no changes to current drivers and application programs. Even PCI/ACPI power
management software will still run.

Like predecessor buses, PCI Express supports chip-to-chip interconnect and board-to-board
interconnect via cards and connectors. The connector and card structure are similar to PCI and
PCI-X connectors and cards. A PCI Express motherboard will have a form factor similar to
existing FR4 ATX motherboards, encased in the familiar PC package.

Make Improvements for the Future

To improve bus performance, reduce overall system cost and take advantage of new
developments in computer design, the PCI Express architecture had to be significantly re-
designed from its predecessor buses. PCI and PCI-X buses are multi-drop parallel interconnect
buses in which many devices share one bus.

PCI Express, on the other hand, implements a serial, point-to-point interconnect for
communication between two devices. Multiple PCI Express devices are interconnected via the
use of switches, which means one can practically connect a large number of devices together in
a system. A point-to-point interconnect implies limited electrical load on the link, allowing
transmission and reception frequencies to scale to much higher values. Currently, the PCI
Express transmission and reception data rate is 2.5 Gbits/sec. A serial interconnect between
two devices results in fewer pins per device package, which reduces PCI Express chip and
board design cost and reduces board design complexity. PCI Express performance is also
highly scalable. This is achieved by implementing scalable numbers of pins and signal Lanes
per interconnect based on the communication performance requirements for that interconnect.

PCI Express implements switch-based technology to interconnect a large number of devices.


Communication over the serial interconnect is accomplished using a packet-based
communication protocol. Quality Of Service (QoS) features provide differentiated transmission
performance for different applications. Hot Plug/Hot Swap support enables "always-on"
systems. Advanced power management features allow one to design for low power mobile
applications. RAS (Reliability, Availability, Serviceability) error handling features make PCI Express
suitable for robust high-end server applications. Hot plug, power management, error handling
and interrupt signaling are accomplished in-band using packet based messaging rather than
side-band signals. This keeps the device pin count low and reduces system cost.

The configuration address space available per function is extended to 4KB, allowing designers
to define additional registers. However, new software is required to access this extended
configuration register space.
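
A hedged sketch, not text from the book: the enhanced configuration mechanism covered in Part Six maps each function's 4KB configuration space into memory, so software can reach the extended registers by computing an address from the bus, device, function, and register offset. The base address below is a hypothetical platform-specific value, and the function name is made up for the example.

#include <stdint.h>

/* Sketch of the memory-mapped (enhanced) configuration address layout:
   each function owns a 4KB window within the platform's configuration
   memory region. ECAM_BASE is an assumed, platform-specific value. */
#define ECAM_BASE 0xE0000000ull

static uint64_t cfg_register_address(uint32_t bus, uint32_t dev,
                                     uint32_t func, uint32_t offset)
{
    return ECAM_BASE
         + ((uint64_t)bus  << 20)    /* up to 256 buses             */
         + ((uint64_t)dev  << 15)    /* up to 32 devices per bus    */
         + ((uint64_t)func << 12)    /* up to 8 functions, 4KB each */
         + (offset & 0xFFFu);        /* offset within the 4KB space */
}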

Looking into the Future

In the future, PCI Express communication frequencies are expected to double and quadruple to
5 Gbits/sec and 10 Gbits/sec. Taking advantage of these frequencies will require Physical
Layer re-design of a device with no changes necessary to the higher layers of the device
design.

Additional mechanical form factors are expected, including a Server IO Module, Newcard
(PC Card style), and cable form factors.
Predecessor Buses Compared
In an effort to compare and contrast features of predecessor buses, the next section of this
chapter describes some of the key features of IO bus architectures defined by the PCI Special
Interest Group (PCISIG). These buses, shown in Table 1-1 on page 12, include the PCI 33
MHz bus, PCI 66 MHz bus, PCI-X 66 MHz/133 MHz buses, PCI-X 266/533 MHz buses and
finally PCI Express.

Table 1-1. Bus Specifications and Release Dates

Bus Type                      Specification Release    Date of Release
PCI 33 MHz                    2.0                      1993
PCI 66 MHz                    2.1                      1995
PCI-X 66 MHz and 133 MHz      1.0                      1999
PCI-X 266 MHz and 533 MHz     2.0                      Q1, 2002
PCI Express                   1.0                      Q2, 2002

Author's Disclaimer

In comparing these buses, it is not the authors' intention to suggest that any one bus is better
than any other bus. Each bus architecture has its advantages and disadvantages. After
evaluating the features of each bus architecture, a particular bus architecture may turn out to
be more suitable for a specific application than another bus architecture. For example, it is the
system designer's responsibility to determine whether to implement a PCI-X bus or PCI Express
for the I/O interconnect in a high-end server design. Our goal in this chapter is to document the
features of each bus architecture so that the designer can evaluate the various bus
architectures.

Bus Performances and Number of Slots Compared

Table 1-2 on page 13 shows the various bus architectures defined by the PCISIG. The table
shows the evolution of bus frequencies and bandwidths. As is obvious, increasing bus
frequency results in increased bandwidth. However, increasing bus frequency compromises the
number of electrical loads or number of connectors allowable on a bus at that frequency. At
some point, for a given bus architecture, there is an upper limit beyond which one cannot further
increase the bus frequency, hence requiring the definition of a new bus architecture.

Table 1-2. Comparison of Bus Frequency, Bandwidth and Number of Slots

Bus Type        Clock Frequency      Peak Bandwidth [*]    Number of Card Slots per Bus
PCI 32-bit      33 MHz               133 MBytes/sec        4-5
PCI 32-bit      66 MHz               266 MBytes/sec        1-2
PCI-X 32-bit    66 MHz               266 MBytes/sec        4
PCI-X 32-bit    133 MHz              533 MBytes/sec        1-2
PCI-X 32-bit    266 MHz effective    1066 MBytes/sec       1
PCI-X 32-bit    533 MHz effective    2131 MBytes/sec       1

[*] Double all these bandwidth numbers for 64-bit bus implementations.

PCI Express Aggregate Throughput

A PCI Express interconnect that connects two devices together is referred to as a Link. A Link
consists of either x1, x2, x4, x8, x12, x16 or x32 signal pairs in each direction. These signals
are referred to as Lanes. A designer determines how many Lanes to implement based on the
targeted performance benchmark required on a given Link.

Table 1-3 shows aggregate bandwidth numbers for various Link width implementations. As is
apparent from this table, the peak bandwidth achievable with PCI Express is significantly higher
than any existing bus today.

Let us consider how these bandwidth numbers are calculated. The transmission/reception rate
is 2.5 Gbits/sec per Lane per direction. To support a greater degree of robustness during data
transmission and reception, each byte of data transmitted is converted into a 10-bit code (via
an 8b/10b encoder in the transmitter device). In other words, for every Byte of data to be
transmitted, 10 bits of encoded data are actually transmitted. The result is 25% additional
overhead (two extra bits for every eight bits of data). Table 1-3 accounts for this encoding
overhead.

PCI Express implements a dual-simplex Link which implies that data is transmitted and received
simultaneously on a transmit and receive Lane. The aggregate bandwidth assumes
simultaneous traffic in both directions.

To obtain the aggregate bandwidth numbers in Table 1-3, multiply 2.5 Gbits/sec by 2 (for the
two directions), then multiply by the number of Lanes, and finally divide by 10 bits per Byte (to
account for the 8-to-10 bit encoding).
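
As a quick illustration, the short routine below (not from the specification, simply the above
arithmetic expressed in C) reproduces the numbers in Table 1-3.

/* Reproduces the Table 1-3 aggregate bandwidth numbers:
   2.5 Gbits/s per Lane per direction, times 2 directions,
   divided by 10 bits per Byte of data (8b/10b encoding). */
#include <stdio.h>

int main(void)
{
    const double gbits_per_lane_per_dir = 2.5;   /* Generation 1 signaling rate */
    const int lane_widths[] = { 1, 2, 4, 8, 12, 16, 32 };

    for (int i = 0; i < 7; i++) {
        double gbytes = gbits_per_lane_per_dir * 2.0 * lane_widths[i] / 10.0;
        printf("x%-2d Link: %.1f GBytes/sec aggregate\n", lane_widths[i], gbytes);
    }
    return 0;
}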

Table 1-3. PCI Express Aggregate Throughput for Various Link Widths

PCI Express Link Width           | x1  | x2 | x4 | x8 | x12 | x16 | x32
Aggregate Bandwidth (GBytes/sec) | 0.5 | 1  | 2  | 4  | 6   | 8   | 16

Performance Per Pin Compared

As is apparent from Figure 1-1, PCI Express achieves the highest bandwidth per pin. This
results in a device package with fewer pins and a motherboard implementation with fewer wires,
and hence an overall reduced system cost per unit bandwidth.

Figure 1-1. Comparison of Performance Per Pin for Various Buses

In Figure 1-1, the first 7 bars are associated with PCI and PCI-X buses where we assume 84
pins per device. This includes 46 signal pins, interrupt and power management pins, error pins
and the remainder are power and ground pins. The last bar associated with a x8 PCI Express
Link assumes 40 pins per device which include 32 signal lines (8 differential pairs per direction)
and the rest are power and ground pins.
I/O Bus Architecture Perspective

33 MHz PCI Bus Based System

Figure 1-2 on page 17 is a 33 MHz PCI bus based system. The PCI system consists of a Host
(CPU) bus-to-PCI bus bridge, also referred to as the North bridge. Associated with the North
bridge is the system memory bus, graphics (AGP) bus, and a 33 MHz PCI bus. I/O devices
share the PCI bus and are connected to it in a multi-drop fashion. These devices are either
connected directly to the PCI bus on the motherboard or by way of a peripheral card plugged
into a connector on the bus. Devices connected directly to the motherboard consume one
electrical load while connectors are accounted for as 2 loads. A South bridge bridges the PCI
bus to the ISA bus where slower, lower performance peripherals exist. Associated with the
South bridge are USB and IDE buses. A CD or hard disk is attached to the IDE bus. The
South bridge contains an interrupt controller (not shown) to which interrupt signals from PCI
devices are connected. The interrupt controller is connected to the CPU via an INTR signal or
an APIC bus. The South bridge is the central resource that provides the source of reset,
reference clock, and error reporting signals. Boot ROM exists on the ISA bus along with a
Super IO chip, which includes keyboard, mouse, floppy disk controller and serial/parallel bus
controllers. The PCI bus arbiter logic is included in the North bridge.

Figure 1-2. 33 MHz PCI Bus Based Platform

Figure 1-3 on page 18 represents a typical PCI bus cycle. The PCI bus clock is 33 MHz. The
address bus width is 32-bits (4GB memory address space), although PCI optionally supports
a 64-bit address bus. The data bus width is implemented as either 32 bits or 64 bits depending
on bus performance requirement. The address and data bus signals are multiplexed on the
same pins (AD bus) to reduce pin count. Command signals (C/BE#) encode the transaction
type of the bus cycle that master devices initiate. PCI supports 12 transaction types that
include memory, IO, and configuration read/write bus cycles. Control signals such as FRAME#,
DEVSEL#, TRDY#, IRDY#, STOP# are handshake signals used during bus cycles. Finally, the
PCI bus consists of a few optional error related signals, interrupt signals and power
management signals. A PCI master device implements a minimum of 49 signals.

Figure 1-3. Typical PCI Burst Memory Read Bus Cycle

Any PCI master device that wishes to initiate a bus cycle first arbitrates for use of the PCI bus
by asserting a request (REQ#) to the arbiter in the North bridge. After receiving a grant (GNT#)
from the arbiter and checking that the bus is idle, the master device can start a bus cycle.

Electrical Load Limit of a 33 MHz PCI Bus


The PCI specification theoretically supports 32 devices per PCI bus. This means that PCI
enumeration software will detect and recognize up to 32 devices per bus. However, as a rule of
thumb, a PCI bus can support a maximum of 10-12 electrical loads (devices) at 33 MHz. PCI
implements a static clocking protocol with a clock period of 30 ns at 33 MHz.

PCI implements reflected-wave switching signal drivers. The driver drives a half-swing signal
on the rising edge of the PCI clock. The signal propagates down the PCI bus transmission
line and is reflected at the end of the transmission line where there is no termination. The
reflection causes the half swing signal to double. The doubled (full signal swing) signal must
settle to a steady state value with sufficient setup time prior to the next rising edge of PCI clock
where receiving devices sample the signal. The total time from when a driver drives a signal
until the receiver detects a valid signal (including propagation time and reflection delay plus
setup time) must be less than the clock period of 30 ns.

The more electrical loads on a bus, the longer it takes for the signal to propagate and double
with sufficient setup time before the next rising edge of the clock. As mentioned earlier, a 33 MHz PCI bus
meets signal timing with no more than 10-12 loads. Connectors on the PCI bus are counted as
2 loads because the connector is accounted for as one load and the peripheral card with a PCI
device is the second load. As indicated in Table 1-2 on page 13 a 33 MHz PCI bus can be
designed with a maximum of 4-5 connectors.

To connect any more than 10-12 loads in a system requires the implementation of a PCI-to-PCI
bridge as shown in Figure 1-4. This permits an additional 10-12 loads to be connected on the
secondary PCI bus 1. The PCI specification theoretically supports up to 256 buses in a system.
This means that PCI enumeration software will detect and recognize up to 256 PCI buses per
system.

Figure 1-4. 33 MHz PCI Based System Showing Implementation of a PCI-to-PCI Bridge
PCI Transaction Model - Programmed IO

Consider an example in which the CPU communicates with a PCI peripheral such as an
Ethernet device shown in Figure 1-5. Transaction 1 shown in the figure, which is initiated by the
CPU and targets a peripheral device, is referred to as a programmed IO transaction. Software
commands the CPU to initiate a memory or IO read/write bus cycle on the host bus targeting
an address mapped in a PCI device's address space. The North bridge arbitrates for use of the
PCI bus and when it wins ownership of the bus generates a PCI memory or IO read/write bus
cycle represented in Figure 1-3 on page 18. During the first clock of this bus cycle (known as
the address phase), all target devices decode the address. One target (the Ethernet device in
this example) decodes the address and claims the transaction. The master (North bridge in this
case) communicates with the claiming target (Ethernet controller). Data is transferred between
master and target in subsequent clocks after the address phase of the bus cycle. Either 4
bytes or 8 bytes of data are transferred per clock tick depending on the PCI bus width. The
bus cycle is referred to as a burst bus cycle if data is transferred back-to-back between
master and target during multiple data phases of that bus cycle. Burst bus cycles result in the
most efficient use of PCI bus bandwidth.

Figure 1-5. PCI Transaction Model


At 33 MHz and a bus width of 32 bits (4 Bytes), the peak achievable bandwidth is 4 Bytes x 33
MHz = 133 MBytes/sec. Peak bandwidth on a 64-bit bus is 266 MBytes/sec. See Table 1-2 on
page 13.

Efficiency of the PCI bus for data payload transport is on the order of 50%. Efficiency is defined
as number of clocks during which data is transferred divided by the number of total clocks,
times 100. The lost performance is due to bus idle time between bus cycles, arbitration time,
time lost in the address phase of a bus cycle, wait states during data phases, delays during
transaction retries (not discussed yet), as well as latencies through PCI bridges.

PCI Transaction Model - Direct Memory Access (DMA)

Data transfer between a PCI device and system memory is accomplished in two ways:

The first less efficient method uses programmed IO transfers as discussed in the previous
section. The PCI device generates an interrupt to inform the CPU that it needs data
transferred. The device interrupt service routine (ISR) causes the CPU to read from the PCI
device into one of its own registers. The ISR then tells the CPU to write from its register to
memory. Similarly, if data is to be moved from memory to the PCI device, the ISR tells the CPU
to read from memory into its own register. The ISR then tells the CPU to write from its register
to the PCI device. It is apparent that the process is very inefficient for two reasons. First, there
are two bus cycles generated by the CPU for every data transfer, one to memory and one to
the PCI device. Second, the CPU is busy transferring data rather than performing its primary
function of executing application code.

The second more efficient method to transfer data is the DMA (direct memory access) method
illustrated by Transaction 2 in Figure 1-5 on page 20, where the PCI device becomes a bus
master. Upon command from a local application (software) running on the PCI peripheral, or
from the peripheral hardware itself, the PCI device may initiate a bus cycle to access memory. The
PCI bus master device (SCSI device in this example) arbitrates for the PCI bus, wins
ownership of the bus and initiates a PCI memory bus cycle. The North bridge which decodes
the address acts as the target for the transaction. In the data phase of the bus cycle, data is
transferred between the SCSI master and the North bridge target. The bridge in turn generates
a DRAM bus cycle to communicate with system memory. The PCI peripheral generates an
interrupt to inform the system software that the data transfer has completed. This bus master
or DMA method of data transport is more efficient because the CPU is not involved in the data
move and further only one burst bus cycle is generated to move a block of data.

PCI Transaction Model - Peer-to-Peer

A Peer-to-peer transaction, shown as Transaction 3 in Figure 1-5 on page 20, is the direct
transfer of data between two PCI devices. A master that wishes to initiate a transaction,
arbitrates, wins ownership of the bus and starts a transaction. A target PCI device that
recognizes the address claims the bus cycle. For a write bus cycle, data is moved from master
to target. For a read bus cycle, data is moved from target to master.

PCI Bus Arbitration

A PCI device that wishes to initiate a bus cycle arbitrates for use of the bus first. The arbiter
implements an arbitration algorithm with which it decides who to grant the bus to next. The
arbiter is able to grant the bus to the next requesting device while a bus cycle is in progress.
This arbitration protocol is referred to as hidden bus arbitration. Hidden bus arbitration allows
for more efficient hand over of the bus from one bus master device to another with only one idle
clock between two bus cycles (referred to as back-to-back bus cycles). PCI protocol does not
provide a standard mechanism by which system software or device drivers can configure the
arbitration algorithm in order to provide for differentiated class of service for various
applications.

Figure 1-6. PCI Bus Arbitration


PCI Delayed Transaction Protocol

PCI Retry Protocol

When a PCI master initiates a transaction to access a target device and the target device is not
ready, the target signals a transaction retry. This scenario is illustrated in Figure 1-7.

Figure 1-7. PCI Transaction Retry Mechanism

Consider the following example in which the North bridge initiates a memory read transaction to
read data from the Ethernet device. The Ethernet target claims the bus cycle. However, the
Ethernet target does not immediately have the data to return to the North bridge master. The
Ethernet device has two choices by which to delay the data transfer. The first is to insert wait-
states in the data phase. If only a few wait-states are needed, then the data is still transferred
efficiently. If however the target device requires more time (more than 16 clocks from the
beginning of the transaction), then the second option the target has is to signal a retry with a
signal called STOP#. A retry tells the master to end the bus cycle prematurely without
transferring data. Doing so prevents the bus from being held for a long time in wait-states,
which compromises the bus efficiency. The bus master that is retried by the target waits a
minimum of 2 clocks and must once again arbitrate for use of the bus to re-initiate the identical
bus cycle. During the time that the bus master is retried, the arbiter can grant the bus to other
requesting masters so that the PCI bus is more efficiently utilized. By the time the retried
master is granted the bus and it re-initiates the bus cycle, hopefully the target will claim the
cycle and will be ready to transfer data. The bus cycle goes to completion with data transfer.
Otherwise, if the target is still not ready, it retries the master's bus cycle again and the process
is repeated until the master successfully transfers data.

PCI Disconnect Protocol

When a PCI master initiates a transaction to access a target device, and the target device can
transfer at least one doubleword of data but cannot complete the entire transfer, the target
disconnects the bus cycle at the point at which it can no longer continue. This scenario is
illustrated in Figure 1-8.

Figure 1-8. PCI Transaction Disconnect Mechanism

Consider the following example in which the North bridge initiates a burst memory read
transaction to read data from the Ethernet device. The Ethernet target device claims the bus
cycle and transfers some data, but then runs out of data to transfer. The Ethernet device has
two choices to delay the data transfer. The first option is to insert wait-states during the current
data phase while waiting for additional data to arrive. If the target needs to insert only a few
wait-states, then the data is still transferred efficiently. If however the target device requires
more time (the PCI specification allows a maximum of 8 clocks per data phase), then the target
device must signal a disconnect. To do this, the target asserts STOP# in the middle of the bus
cycle to tell the master to end the bus cycle prematurely. A disconnect results in some data
being transferred, while a retry does not. Disconnect frees the bus from long periods of wait states.
The disconnected master waits a minimum of 2 clocks before once again arbitrating for use of
the bus and continuing the bus cycle at the disconnected address. During the time that the bus
master is disconnected, the arbiter may grant the bus to other requesting masters so that the
PCI bus is utilized more efficiently. By the time the disconnected master is granted the bus and
continues the bus cycle, hopefully the target is ready to continue the data transfer until it is
completed. Otherwise, the target once again retries or disconnects the master's bus cycle and
the process is repeated until the master successfully transfers all its data.

PCI Interrupt Handling

Central to the PCI interrupt handling protocol is the interrupt controller shown in Figure 1-9. PCI
devices use one of four interrupt signals (INTA#, INTB#, INTC#, INTD#) to trigger an interrupt
request to the interrupt controller. In turn, the interrupt controller asserts INTR to the CPU. If
the architecture supports an APIC (Advanced Programmable Interrupt Controller) then it sends
an APIC message to the CPU as opposed to asserting the INTR signal. The interrupted CPU
determines the source of the interrupt, saves its state and services the device that generated
the interrupt. Interrupts on PCI INTx# signals are sharable. This allows multiple devices to
generate their interrupts on the same interrupt signal. OS software incurs the overhead of
determining which of the devices sharing the interrupt signal actually generated the interrupt.
This is accomplished by polling an Interrupt Pending bit mapped in each device's memory
space, which adds latency to servicing the interrupting device.

Figure 1-9. PCI Interrupt Handling


PCI Error Handling

PCI devices are optionally designed to detect address and data phase parity errors during
transactions. Even parity is generated on the PAR signal during each bus cycle's address and
data phases. The device that receives the address or data during a bus cycle uses the parity
signal to determine if a parity error has occurred due to noise on the PCI bus. If a device
detects an address phase parity error, it asserts SERR#. If a device detects a data phase
parity error, it asserts PERR#. The PERR# and SERR# signals are connected to the error logic
(in the South bridge) as shown in Figure 1-10 on page 27. In many systems, the error logic
asserts the NMI signal (non-maskable interrupt signal) to the CPU upon detecting PERR# or
SERR#. This interrupt notifies the system of a parity error, and the system typically shuts down
(we all know the blue screen of death). Kind of draconian, don't you agree?

Figure 1-10. PCI Error Handling Protocol


Unfortunately, PCI error detection and reporting is not robust. PCI errors are fatal
uncorrectable errors that often result in system shutdown. Further, parity checking can only
detect an error when an odd number of signals is affected by noise. Given this weak PCI error detection
protocol and error handling policies, many system designs either disable or do not support error
checking and reporting.

PCI Address Space Map

PCI architecture supports 3 address spaces shown in Figure 1-11. These are the memory, IO
and configuration address spaces. The memory address space goes up to 4 GB for systems
that support 32-bit memory addressing and optionally up to 16 EB (exabytes) for systems that
support 64-bit memory addressing. PCI supports up to 4GB of IO address space; however,
many platforms limit IO space to 64 KB because x86 CPUs support only 64 KB of IO address
space. PCI devices are configured to map to a configurable region within either the memory or
IO address space.

Figure 1-11. Address Space Mapping


PCI device configuration registers map to a third space called configuration address space.
Each PCI function may have up to 256 Bytes of configuration address space. The configuration
address space is 16 MBytes. This is calculated by multiplying 256 Bytes, by 8 functions per
device, by 32 devices per bus, by 256 buses per system. An x86 CPU can access memory or
IO address space but does not support configuration address space directly. Instead, CPUs
access PCI configuration space indirectly by indexing through an IO mapped Address Port and
Data Port in the host bridge (North bridge or MCH). The Address Port is located at IO address
CF8h-CFBh and the Data Port is mapped to location CFCh-CFFh.

PCI Configuration Cycle Generation

PCI configuration cycle generation involves two steps.

Step 1. The CPU generates an IO write to the Address Port at IO address CF8h in the
North bridge. The data written to the Address Port identifies the target bus, device, function,
and configuration register to be accessed.

Step 2. The CPU either generates an IO read or IO write to the Data Port at location
CFCh in the North bridge. The North bridge in turn then generates either a configuration
read or configuration write transaction on the PCI bus.

The address for the configuration transaction address phase is obtained from the contents of
the Address register. During the configuration bus cycle, one of the point-to-point IDSEL signals
shown in Figure 1-12 on page 29 is asserted to select the device whose register is being
accessed. That PCI target device claims the configuration cycle and fulfills the request.
Figure 1-12. PCI Configuration Cycle Generation
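
The sketch below illustrates this two-step mechanism in C. It assumes the x86 port IO helpers
outl()/inl() (for example from <sys/io.h> on Linux, which also requires IO privilege via iopl());
the helper names pci_config_read32()/pci_config_write32() are hypothetical.

#include <stdint.h>
#include <sys/io.h>                    /* outl()/inl(); assumes x86 and IO privilege */

#define PCI_CONFIG_ADDRESS  0xCF8      /* Address Port (CF8h-CFBh) */
#define PCI_CONFIG_DATA     0xCFC      /* Data Port (CFCh-CFFh)    */

/* Build the Address Port value: enable bit, bus, device, function and
   DWORD-aligned register offset. */
static uint32_t config_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg)
{
    return (1u << 31) | ((uint32_t)bus << 16) | ((uint32_t)dev << 11) |
           ((uint32_t)func << 8) | (reg & 0xFC);
}

uint32_t pci_config_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg)
{
    outl(config_address(bus, dev, func, reg), PCI_CONFIG_ADDRESS); /* Step 1 */
    return inl(PCI_CONFIG_DATA);                                   /* Step 2 */
}

void pci_config_write32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg, uint32_t val)
{
    outl(config_address(bus, dev, func, reg), PCI_CONFIG_ADDRESS); /* Step 1 */
    outl(val, PCI_CONFIG_DATA);                                    /* Step 2 */
}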

PCI Function Configuration Register Space

Each PCI function contains up to 256 Bytes of configuration register space. The first 64 Bytes
are configuration header registers and the remaining 192 Bytes are device specific registers.
The header registers are configured at boot time by the Boot ROM configuration firmware and
by the OS. The device specific registers are configured by the device's device driver that is
loaded and executed by the OS at boot time.
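
For reference, the following sketch shows the standard 64-Byte Type 0 header layout as a C
structure. The offsets are the standard ones; the C field names themselves are illustrative.

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint16_t vendor_id;        /* 00h */
    uint16_t device_id;        /* 02h */
    uint16_t command;          /* 04h */
    uint16_t status;           /* 06h */
    uint8_t  revision_id;      /* 08h */
    uint8_t  class_code[3];    /* 09h-0Bh */
    uint8_t  cache_line_size;  /* 0Ch */
    uint8_t  latency_timer;    /* 0Dh */
    uint8_t  header_type;      /* 0Eh */
    uint8_t  bist;             /* 0Fh */
    uint32_t bar[6];           /* 10h-27h: Base Address Registers */
    uint32_t cardbus_cis_ptr;  /* 28h */
    uint16_t subsys_vendor_id; /* 2Ch */
    uint16_t subsys_id;        /* 2Eh */
    uint32_t expansion_rom;    /* 30h */
    uint8_t  capabilities_ptr; /* 34h */
    uint8_t  reserved[7];      /* 35h-3Bh */
    uint8_t  interrupt_line;   /* 3Ch */
    uint8_t  interrupt_pin;    /* 3Dh */
    uint8_t  min_gnt;          /* 3Eh */
    uint8_t  max_lat;          /* 3Fh */
} pci_type0_header;            /* 64 Bytes; the remaining 192 Bytes are device specific */
#pragma pack(pop)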

Figure 1-13. 256 Byte PCI Function Configuration Register Space


Within the header space, the Base Address registers are among the most important registers
configured by the 'Plug and Play' configuration software. It is via these registers that software
assigns a device its memory and/or IO address space within the system's memory and IO
address space. No two devices are assigned the same address range, thus ensuring the 'plug
and play' nature of the PCI system.
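
As an aside, the sketch below shows the common BAR sizing technique such configuration
software typically uses: write all 1s to a BAR, read back the result, and decode the read-only
low bits. It relies on the hypothetical configuration-access helpers sketched earlier.

#include <stdint.h>

uint32_t pci_config_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg);
void     pci_config_write32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg, uint32_t val);

uint32_t pci_bar_size(uint8_t bus, uint8_t dev, uint8_t func, uint8_t bar_offset)
{
    uint32_t orig = pci_config_read32(bus, dev, func, bar_offset);
    pci_config_write32(bus, dev, func, bar_offset, 0xFFFFFFFFu);
    uint32_t readback = pci_config_read32(bus, dev, func, bar_offset);
    pci_config_write32(bus, dev, func, bar_offset, orig);        /* restore original value */

    /* Mask off the read-only type bits: low 2 bits for an IO BAR,
       low 4 bits for a memory BAR. */
    uint32_t mask = (orig & 0x1u) ? (readback & ~0x3u) : (readback & ~0xFu);
    return ~mask + 1u;                                           /* size in bytes */
}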

PCI Programming Model

Software instructions may cause the CPU to generate memory or IO read/write bus cycles.
The North bridge decodes the address of the resulting CPU bus cycles, and if the address
maps to PCI address space, the bridge in turn generates a PCI memory or IO read/write bus
cycle. A target device on the PCI bus claims the cycle and completes the transfer. In summary,
the CPU communicates with any PCI device via the North bridge, which generates PCI memory
or IO bus cycles on the behalf of the CPU.

An intelligent PCI device that includes a local processor or bus master state machine (typically
intelligent IO cards) can also initiate PCI memory or IO transactions on the PCI bus. These
masters can communicate directly with any other devices, including system memory associated
with the North bridge.

A device driver executing on the CPU configures the device-specific configuration register space
of an associated PCI device. A configured PCI device that is bus master capable can initiate its
own transactions, which allows it to communicate with any other PCI target device including
system memory associated with the North bridge.

The CPU can access configuration space as described in the previous section.
PCI Express architecture adopts the identical programming model to the PCI programming
model described above. In fact, current OSs written for PCI systems can boot a PCI Express
system. Current PCI device drivers will initialize PCI Express devices without any driver
changes. PCI configuration and enumeration firmware will function unmodified on a PCI Express
system.

Limitations of a 33 MHz PCI System

As indicated in Table 1-2 on page 13, peak bandwidth achievable on a 64-bit 33 MHz PCI bus
is 266 MBytes/sec. Current high-end workstation and server applications require greater
bandwidth.

Applications such as gigabit Ethernet and high performance disk transfers in RAID and SCSI
configurations require greater bandwidth capability than the 33 MHz PCI bus offers.

Latest Generation of Intel PCI Chipsets

Figure 1-14 shows an example of a later generation Intel PCI chipset. The two shaded devices
are NOT the North bridge and South bridge shown in earlier diagrams. Instead, one device is
the Memory Controller Hub (MCH) and the other is the IO Controller Hub (ICH). The two chips
are connected by a proprietary Intel high throughput, low pin count bus called the Hub Link.

Figure 1-14. Latest Generation of PCI Chipsets

The ICH includes the South bridge functionality but does not support the ISA bus. Other buses
associated with ICH include LPC (low pin count) bus, AC'97, Ethernet, Boot ROM, IDE, USB,
SMbus and finally the PCI bus. The advantage of this architecture over previous architectures is
that the IDE, USB, Ethernet and audio devices do not transfer their data through the PCI bus to
memory as is the case with earlier chipsets. Instead they do so through the Hub Link. Hub Link
is a higher performance bus compared to PCI. In other words, these devices bypass the PCI
bus when communicating with memory. The result is improved performance.

66 MHz PCI Bus Based System

High end systems that require better IO bandwidth implement 66 MHz, 64-bit PCI buses. This
PCI bus supports a peak data transfer rate of 533 MBytes/sec.

The PCI 2.1 specification, released in 1995, added 66 MHz PCI support.

Figure 1-15 shows an example of a 66 MHz PCI bus based system. This system has similar
features to that described in Figure 1-14 on page 32. However, the MCH chip in this example
supports two additional Hub Link buses that connect to P64H (PCI 64-bit Hub) bridge chips,
providing access to the 64-bit, 66 MHz buses. These buses each support 1 connector in which
a high-end peripheral card may be installed.

Figure 1-15. 66 MHz PCI Bus Based Platform

Limitations of 66 MHz PCI bus

The PCI clock period at 66 MHz is 15 ns. Recall that PCI uses reflected-wave signaling, whose
weaker drivers have slower rise and fall times compared to incident-wave signaling drivers.
It is a challenge to design a 66 MHz device or system that satisfies the
signal timing requirements.

A 66 MHz PCI based motherboard is routed with shorter signal traces to ensure shorter signal
propagation delays. In addition, the bus is loaded with fewer loads in order to ensure faster
signal rise and fall times. Taking into account typical board impedances and minimum signal
trace lengths, it is possible to interconnect a maximum of four to five 66 MHz PCI devices. Only
one or two connectors may be installed on a 66 MHz PCI bus. This is a significant limitation
for a system that requires multiple interconnected devices.

The solution requires the addition of PCI bridges and hence multiple buses to interconnect
devices. This solution is expensive and consumes additional board real estate. In addition,
transactions between devices on opposite sides of a bridge complete with greater latency
because bridges implement delayed transactions. This requires bridges to retry all transactions
that must cross to the other side (with the exception of memory writes which are posted).

Limitations of PCI Architecture

The maximum frequency achievable with the PCI architecture is 66 MHz. This is a result of the
static clock method of driving and latching signals and because reflected-wave signaling is
used.

PCI bus efficiency is on the order of 50% to 60%. Some of the factors that contribute to this
reduced efficiency are listed below.

The PCI specification allows master and target devices to insert wait-states during data phases
of a bus cycle. Slow devices will add wait-states which reduces the efficiency of bus cycles.

PCI bus cycles do not indicate transfer size. This makes buffer management within master and
target devices inefficient.

Delayed transactions on PCI are handled inefficiently. When a master is retried, it guesses
when to try again. If the master tries too soon, the target may retry the transaction again. If the
master waits too long to retry, the latency to complete a data transfer is increased. Similarly, if
a target disconnects a transaction the master must guess when to resume the bus cycle at a
later time.

All PCI bus master accesses to system memory result in a snoop access to the CPU cache.
Doing so results in additional wait states during PCI bus master accesses of system memory.
The North bridge or MCH must assume all system memory address space is cachable even
though this may not be the case. PCI bus cycles provide no mechanism by which to indicate an
access to non-cachable memory address space.

PCI architecture observes strict ordering rules as defined by the specification. Even if a PCI
application does not require observation of these strict ordering rules, PCI bus cycles do not
provide a mechanism to allow relaxed ordering. Relaxed ordering allows bus cycles (especially
those that cross a bridge) to complete with reduced latency.

PCI interrupt handling architecture is inefficient especially because multiple devices share a PCI
interrupt signal. Additional software latency is incurred while software discovers which device or
devices sharing an interrupt signal actually generated the interrupt.

The processor's NMI interrupt input is asserted when a PCI parity or system error is detected.
Ultimately the system shuts down when an error is detected. This is a severe response. A more
appropriate response might be to detect the error and attempt error recovery. PCI does not
require error recovery features, nor does it support an extensive register set for documenting a
variety of detectable errors.

The limitations above have been addressed in the next generation bus architectures, namely
PCI-X and PCI Express.

66 MHz and 133 MHz PCI-X 1.0 Bus Based Platforms

Figure 1-16 on page 36 is an example of an Intel 7500 server chipset based system. This
chipset has similarities to the 8XX chipset described earlier. MCH and ICH chips are connected
via a Hub Link 1.0 bus. Associated with ICH is a 32-bit 33 MHz PCI bus. The 7500 MCH chip
includes 3 additional high performance Hub Link 2.0 ports. These Hub Link ports are connected
to 3 Hub Link-to-PCI-X Hub 2 bridges (P64H2). Each P64H2 bridge supports 2 PCI-X buses
that can run at frequencies up to 133MHz. Hub Link 2.0 Links can sustain the higher bandwidth
requirements for PCI-X traffic that targets system memory.

Figure 1-16. 66 MHz/133 MHz PCI-X Bus Based Platform


PCI-X Features

The PCI-X bus is a higher frequency, higher performance, higher efficiency bus compared to
the PCI bus.

PCI-X devices can be plugged into PCI slots and vice-versa. PCI-X and PCI slots employ the
same connector format. Thus, PCI-X is 100% backwards compatible with PCI from both a
hardware and software standpoint. The device drivers, OS, and applications that run on a PCI
system also run on a PCI-X system.

PCI-X signals are registered. A registered signal requires a smaller setup time at the receiver
than the non-registered signals employed in PCI. Also, PCI-X devices employ PLLs to pre-drive
signals with a smaller clock-to-out time. The time gained from
reduced setup time and clock-to-out time is used towards increased clock frequency capability
and the ability to support more devices on the bus at a given frequency compared to PCI. PCI-
X supports 8-10 loads or 4 connectors at 66 MHz and 3-4 loads or 1-2 connectors at 133 MHz.

The peak bandwidth achievable with 64-bit 133 MHz PCI-X is 1064 MBytes/sec.

Following the first data phase, the PCI-X bus does not allow wait states during subsequent
data phases.

Most PCI-X bus cycles are burst cycles and data is generally transferred in blocks of no less
than 128 Bytes. This results in higher bus utilization. Further, the transfer size is specified in the
attribute phase of PCI-X transactions. This allows for more efficient device buffer management.
Figure 1-17 is an example of a PCI-X burst memory read transaction.
Figure 1-17. Example PCI-X Burst Memory Read Bus Cycle

PCI-X Requester/Completer Split Transaction Model

Consider an example of the split transaction protocol supported by PCI-X for delaying
transactions. This protocol is illustrated in Figure 1-18. A requester initiates a read transaction.
The completer that claims the bus cycle may be unable to return the requested data
immediately. Rather than signaling a retry as would be the case in PCI protocol, the completer
memorizes the transaction (address, transaction type, byte count, requester ID are memorized)
and signals a split response. This prompts the requester to end the bus cycle, and the bus
goes idle. The PCI-X bus is now available for other transactions, resulting in more efficient bus
utilization. Meanwhile, the requester simply waits for the completer to supply it the requested
data at a later time. Once the completer has gathered the requested data, it then arbitrates
and obtains bus ownership and initiates a split completion bus cycle during which it returns the
requested data. The requester claims the split completion bus cycle and accepts the data from
the completer.

Figure 1-18. PCI-X Split Transaction Protocol


The split completion bus cycle is very much like a write bus cycle. Exactly two bus transactions
are needed to complete the entire data transfer. In between these two bus transactions (the
read request and the split completion transaction) the bus is utilized for other transactions. The
requester also receives the requested data in a very efficient manner.

PCI Express architecture employs a similar transaction protocol.

These performance enhancement features described so far contribute towards an increased
transfer efficiency of 85% for PCI-X as compared to 50%-60% with PCI protocol.

PCI-X devices must support Message Signaled Interrupt (MSI) architecture, which is a more
efficient architecture than the legacy interrupt architecture described in the PCI architecture
section. To generate an interrupt request, a PCI-X device initiates a memory write transaction
targeting the Host (North) bridge. The data written is a unique interrupt vector associated with
the device generating the interrupt. The Host bridge interrupts the CPU and the vector is
delivered to the CPU in a platform specific manner. With this vector, the CPU is immediately
able to run an interrupt service routine to service the interrupting device. There is no software
overhead in determining which device generated the interrupt. Also, unlike in the PCI
architecture, no interrupt pins are required.

PCI Express architecture implements the MSI protocol, resulting in reduced interrupt servicing
latency and elimination of interrupt signals.
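
For reference, a sketch of the MSI capability register block (Capability ID 05h) as it appears in
configuration space is shown below. The 64-bit address form is shown, and the C field names
are illustrative.

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  cap_id;          /* 05h identifies the MSI capability             */
    uint8_t  next_cap_ptr;    /* offset of the next capability structure       */
    uint16_t message_control; /* MSI enable bit, multiple-message fields       */
    uint32_t message_addr_lo; /* target address for the interrupt memory write */
    uint32_t message_addr_hi; /* upper 32 bits (64-bit address capable form)   */
    uint16_t message_data;    /* interrupt vector written to the address above */
} msi_capability;
#pragma pack(pop)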

PCI Express architecture also supports the RO (Relaxed Ordering) and NS (No Snoop) bits, with
the result that transactions with either NS=1 or RO=1 can complete with better performance than
transactions with NS=0 and RO=0. PCI transactions by definition assume NS=0 and RO=0.

NS - No Snoop (NS) may be used when accessing system memory. PCI-X bus masters can
use the NS bit to indicate whether the region of memory being accessed is cachable
(NS=0) or not (NS=1). For those transactions with NS=1, the Host bridge does not snoop
the processor cache. The result is improved performance during accesses to non-cachable
memory.

RO - Relaxed Ordering (RO) allows transactions that do not have any order of completion
requirements to complete more efficiently. We will not get into the details here. Suffice it to
say that transactions with the RO bit set can complete on the bus in any order with respect
to other transactions that are pending completion.

The PCI-X 2.0 specification, released in Q1 2002, was designed to further increase the
bandwidth capability of the PCI-X bus. This bus is described next.

DDR and QDR PCI-X 2.0 Bus Based Platforms


Figure 1-19 shows a hypothetical PCI-X 2.0 system. This diagram is the author's best guess as
to what a PCI-X 2.0 system will look like. PCI-X 2.0 devices and connectors are 100%
hardware and software backwards compatible with PCI-X 1.0 as well as PCI devices and
connectors. A PCI-X 2.0 bus supports either Dual Data Rate (DDR) or Quad Data Rate (QDR)
data transport using a PCI-X 133 MHz clock and strobes that are phase shifted to provide the
necessary clock edges.

Figure 1-19. Hypothetical PCI-X 2.0 Bus Based Platform

A design requiring greater than 1 GByte/sec bus bandwidth can implement the DDR or QDR
protocol. As indicated in Table 1-2 on page 13, PCI-X 2.0 peak bandwidth capability is 4256
MBytes/sec for a 64-bit 533 MHz effective PCI-X bus. With the aid of a strobe clock, data is
transferred two times or four times per 133 MHz clock.
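
The quoted peak bandwidth follows directly from the clock rate, the number of transfers per
clock and the bus width, as the small calculation below (illustrative only) shows.

#include <stdio.h>

int main(void)
{
    const int base_clock_mhz      = 133;  /* PCI-X base clock        */
    const int transfers_per_clock = 4;    /* QDR via shifted strobes */
    const int bytes_per_transfer  = 8;    /* 64-bit bus              */

    int peak_mbytes = base_clock_mhz * transfers_per_clock * bytes_per_transfer;
    printf("Peak bandwidth: %d MBytes/sec\n", peak_mbytes);   /* prints 4256 */
    return 0;
}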

PCI-X 2.0 devices also support ECC generation and checking. This allows auto-correction of
single bit errors and detection and reporting of multi-bit errors. Error handling is more robust
than in PCI and PCI-X 1.0 systems, making this bus better suited for high-performance, robust,
non-stop server applications.

A noteworthy point to remember is that with such fast signal timing, it is only possible to
support one connector on the PCI-X 2.0 bus. This implies that a PCI-X 2.0 bus essentially
becomes a point-to-point connection, without the multi-drop capability of its predecessor
buses.

PCI-X 2.0 bridges are essentially switches with one primary bus and one or more downstream
secondary buses as shown in Figure 1-19 on page 40.
The PCI Express Way
PCI Express provides a high-speed, high-performance, point-to-point, dual simplex, differential
signaling Link for interconnecting devices. Data is transmitted from a device on one set of
signals, and received on another set of signals.

The Link - A Point-to-Point Interconnect

As shown in Figure 1-20, a PCI Express interconnect consists of either a x1, x2, x4, x8, x12,
x16 or x32 point-to-point Link. A PCI Express Link is the physical connection between two
devices. A Lane consists of one differential signal pair in each direction. A x1 Link consists of 1 Lane, or 1
differential signal pair in each direction for a total of 4 signals. A x32 Link consists of 32 Lanes
or 32 signal pairs for each direction for a total of 128 signals. The Link supports a symmetric
number of Lanes in each direction. During hardware initialization, the Link is initialized for Link
width and frequency of operation automatically by the devices on opposite ends of the Link. No
OS or firmware is involved during Link level initialization.

Figure 1-20. PCI Express Link

Differential Signaling

PCI Express devices employ differential drivers and receivers at each port. Figure 1-21 shows
the electrical characteristics of a PCI Express signal. A positive voltage difference between the
D+ and D- terminals implies Logical 1. A negative voltage difference between D+ and D- implies
a Logical 0. No voltage difference between D+ and D- means that the driver is in the high-
impedance tristate condition, which is referred to as the electrical-idle and low-power state of
the Link.

Figure 1-21. PCI Express Differential Signal


The PCI Express differential peak-to-peak signal voltage at the transmitter ranges from 800
mV to 1200 mV, while the differential peak voltage is one-half these values. The common mode
voltage can be any voltage between 0 V and 3.6 V. The differential driver is DC isolated from
the differential receiver at the opposite end of the Link by placing a capacitor at the driver side
of the Link. Two devices at opposite ends of a Link may support different DC common mode
voltages. The differential impedance at the receiver is matched with the board impedance to
prevent reflections from occurring.

Switches Used to Interconnect Multiple Devices

Switches are implemented in systems requiring multiple devices to be interconnected. Switches
can range from a 2-port device to an n-port device, where each port connects to a PCI Express
Link. The specification does not indicate a maximum number of ports a switch can implement. A
switch may be incorporated into a Root Complex device (Host bridge or North bridge
equivalent), resulting in a multi-port root complex. Figure 1-23 on page 52 and Figure 1-25 on
page 54 are examples of PCI Express systems showing multi-ported devices such as the root
complex or switches.

Figure 1-23. Low Cost PCI Express System


Figure 1-25. PCI Express High-End Server System

Packet Based Protocol

Rather than the bus cycles we are familiar with from PCI and PCI-X architectures, PCI Express
encodes transactions using a packet-based protocol. Packets are transmitted and received
serially and byte striped across the available Lanes of the Link. The more Lanes implemented
on a Link the faster a packet is transmitted and the greater the bandwidth of the Link. The
packets are used to support the split transaction protocol for non-posted transactions. Various
types of packets such as memory read and write requests, IO read and write requests,
configuration read and write requests, message requests and completions are defined.

Bandwidth and Clocking

As is apparent from Table 1-3 on page 14, the aggregate bandwidth achievable with PCI
Express is significantly higher than any bus available today. The PCI Express 1.0 specification
supports 2.5 Gbits/sec/lane/direction transfer rate.

No clock signal exists on the Link. Each packet to be transmitted over the Link consists of bytes
of information. Each byte is encoded into a 10-bit symbol. All symbols are guaranteed to contain
bit transitions. The receiver uses a PLL to recover a clock from the 0-to-1 and 1-to-0
transitions of the incoming bit stream.

Address Space
PCI Express supports the same address spaces as PCI: memory, IO and configuration
address spaces. In addition, the maximum configuration address space per device function is
extended from 256 Bytes to 4 KBytes. New OSs, drivers and applications are required to take
advantage of this additional configuration address space. Also, a new messaging transaction
and address space provides messaging capability between devices. Some messages are PCI
Express standard messages used for error reporting, interrupt and power management
messaging. Other messages are vendor defined messages.

PCI Express Transactions

PCI Express supports the same transaction types supported by PCI and PCI-X. These include
memory read and memory write, I/O read and I/O write, configuration read and configuration
write. In addition, PCI Express supports a new transaction type called Message transactions.
These transactions are encoded using the packet-based PCI Express protocol described later.

PCI Express Transaction Model

PCI Express transactions can be divided into two categories: posted and non-posted.
Non-posted transactions, such as memory reads, implement
a split transaction communication model similar to the PCI-X split transaction protocol. For
example, a requester device transmits a non-posted type memory read request packet to a
completer. The completer returns a completion packet with the read data to the requester.
Posted transactions, such as memory writes, consist of a memory write packet transmitted uni-
directionally from requester to completer with no completion packet returned from completer to
requester.
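
A minimal sketch of this distinction is shown below. The enum lists the request types named in
the text (Message requests, described elsewhere, are generally posted as well), and the
function name is illustrative.

#include <stdbool.h>

typedef enum {
    MEMORY_READ,    /* non-posted: a completion with data is returned        */
    MEMORY_WRITE,   /* posted: no completion packet is returned              */
    IO_READ,        /* non-posted */
    IO_WRITE,       /* non-posted: a completion (without data) confirms it   */
    CONFIG_READ,    /* non-posted */
    CONFIG_WRITE    /* non-posted */
} request_type;

static bool expects_completion(request_type t)
{
    return t != MEMORY_WRITE;   /* among these, only memory writes are posted */
}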

Error Handling and Robustness of Data Transfer

CRC fields are embedded within each packet transmitted. One of the CRC fields supports a
Link-level error checking protocol whereby each receiver of a packet checks for Link-level CRC
errors. Packets transmitted over the Link in error are recognized with a CRC error at the
receiver. The transmitter of the packet is notified of the error by the receiver. The transmitter
automatically retries sending the packet (with no software involvement), hopefully resulting in
auto-correction of the error.

In addition, an optional CRC field within a packet allows for end-to-end data integrity checking
required for high availability applications.

Error handling on PCI Express can be as rudimentary as PCI level error handling described
earlier or can be robust enough for server-level requirements. A rich set of error logging
registers and error reporting mechanisms provide for improved fault isolation and recovery
solutions required by RAS (Reliable, Available, Serviceable) applications.

Quality of Service (QoS), Traffic Classes (TCs) and Virtual


Channels (VCs)

The Quality of Service feature of PCI Express refers to the capability of routing packets from
different applications through the fabric with differentiated priorities and deterministic latencies
and bandwidth. For example, it may be desirable to ensure that Isochronous applications, such
as video data packets, move through the fabric with higher priority and guaranteed bandwidth,
while control data packets may not have specific bandwidth or latency requirements.

PCI Express packets contain a Traffic Class (TC) number between 0 and 7 that is assigned by
the device's application or device driver. Packets with different TCs can move through the fabric
with different priority, resulting in varying performance. These packets are routed through the
fabric by utilizing virtual channel (VC) buffers implemented in switches, endpoints and root
complex devices.

Each Traffic Class is individually mapped to a Virtual Channel (a VC can have several TCs
mapped to it, but a TC cannot be mapped to multiple VCs). The TC in each packet is used by
the transmitting and receiving ports to determine which VC buffer to drop the packet into.
Switches and devices are configured to arbitrate and prioritize between packets from different
VCs before forwarding. This arbitration is referred to as VC arbitration. In addition, packets
arriving at different ingress ports are forwarded to their own VC buffers at the egress port.
These transactions are prioritized based on the ingress port number when being merged into a
common VC output buffer for delivery across the egress link. This arbitration is referred to as
Port arbitration.

The result is that packets with different TC numbers could observe different performance when
routed through the PCI Express fabric.
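
A minimal sketch of the TC-to-VC mapping rule is shown below. The particular mapping table is
illustrative only and is not mandated by the specification.

#include <stdint.h>

#define NUM_TCS 8

/* tc_to_vc[tc] gives the VC buffer used for packets carrying that TC.
   Example mapping: TC0-TC6 share VC0, while TC7 gets its own VC1 for
   isochronous traffic. Several TCs may map to one VC, but a given TC
   maps to exactly one VC. */
static const uint8_t tc_to_vc[NUM_TCS] = { 0, 0, 0, 0, 0, 0, 0, 1 };

static inline uint8_t vc_for_packet(uint8_t tc)
{
    return tc_to_vc[tc & 0x7];   /* the TC field is 3 bits (0-7) */
}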

Flow Control

A packet transmitted by a device is received into a VC buffer in the receiver at the opposite end
of the Link. The receiver periodically updates the transmitter with information regarding the
amount of buffer space it has available. The transmitter device will only transmit a packet to the
receiver if it knows that the receiving device has sufficient buffer space to hold the next
transaction. The protocol by which the transmitter ensures that the receiving buffer has
sufficient space available is referred to as flow control. The flow control mechanism guarantees
that a transmitted packet will be accepted by the receiver, barring error conditions. As such, the
PCI Express transaction protocol does not require support of packet retry (unless an error
condition is detected in the receiver), thereby improving the efficiency with which packets are
forwarded to a receiver via the Link.
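
Conceptually, the flow control rule reduces to a simple credit check, as in the sketch below.
The structure and field names are illustrative; the actual protocol tracks credits separately per
VC and per packet type, using modular counters.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t credits_advertised; /* cumulative credits granted by the receiver  */
    uint32_t credits_consumed;   /* cumulative credits used by the transmitter  */
} fc_state;

/* Returns true if a packet costing 'credits_required' may be transmitted. */
static bool fc_can_transmit(const fc_state *fc, uint32_t credits_required)
{
    uint32_t available = fc->credits_advertised - fc->credits_consumed;
    return available >= credits_required;
}
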
MSI Style Interrupt Handling Similar to PCI-X

Interrupt handling is accomplished in-band via a PCI-X-like MSI protocol. PCI Express devices use
a memory write packet to transmit an interrupt vector to the root complex host bridge device,
which in turn interrupts the CPU. PCI Express devices are required to implement the MSI
capability register block. PCI Express also supports legacy interrupt handling in-band by
encoding interrupt signal transitions (for INTA#, INTB#, INTC# and INTD#) using Message
transactions. Only endpoint devices that must support legacy functions and PCI Express-to-PCI
bridges are allowed to support legacy interrupt generation.

Power Management

The PCI Express fabric consumes less power because the interconnect consists of fewer
signals that have smaller signal swings. Each device's power state is individually managed.
PCI/PCI Express power management software determines the power management capability
of each device and manages it individually in a manner similar to PCI. Devices can notify
software of their current power state, and power management software can propagate a
wake-up event through the fabric to power up a device or group of devices. Devices can also
signal a wake-up event using an in-band mechanism or a side-band signal.

With no software involvement, devices place a Link into a power savings state after a time-out
when they recognize that there are no packets to transmit over the Link. This capability is
referred to as Active State power management.

PCI Express supports the following device power states: D0, D1, D2, D3-Hot and D3-Cold, where D0 is the
full-on power state and D3-Cold is the lowest power state.

PCI Express also supports the following Link power states: L0, L0s, L1, L2 and L3, where L0
is the full-on Link state and L3 is the Link-Off power state.

Hot Plug Support

PCI Express supports hot plug and surprise hot unplug without usage of sideband signals. Hot
plug interrupt messages, communicated in-band to the root complex, trigger hot plug software
to detect a hot plug or removal event. Rather than implementing a centralized hot plug controller
as exists in PCI platforms, the hot plug controller function is distributed to the port logic
associated with a hot plug capable port of a switch or root complex. Two colored LEDs, a
Manually-operated Retention Latch (MRL), an MRL sensor, an attention button, a power control
signal and a PRSNT2# signal are some of the elements of a hot plug capable port.
PCI Compatible Software Model

PCI Express employs the same programming model as PCI and PCI-X systems described
earlier in this chapter. The memory and IO address space remains the same as PCI/PCI-X.
The first 256 Bytes of configuration space per PCI Express function are the same as the PCI/PCI-X
device configuration address space, thus ensuring that current OSs and device drivers will run
on a PCI Express system. PCI Express architecture extends the configuration address space
to 4 KB per function. Updated OSs and device drivers are required to take advantage of and
access this additional configuration address space.

PCI Express configuration model supports two mechanisms:

1. The PCI compatible configuration model, which is 100% compatible with existing OSs and
bus enumeration and configuration software for PCI/PCI-X systems.

2. The PCI Express enhanced configuration mechanism, which provides access to additional
configuration space beyond the first 256 Bytes and up to 4 KBytes per function.

Mechanical Form Factors

PCI Express architecture supports multiple platform interconnects such as chip-to-chip, board-
to-peripheral card via PCI-like connectors and Mini PCI Express form factors for the mobile
market. Specifications for these are fully defined. See "Add-in Cards and Connectors" on page
685 for details on PCI Express peripheral card and connector definition.

PCI-like Peripheral Card and Connector

Currently, x1, x4, x8 and x16 PCI-like connectors are defined along with associated peripheral
cards. Desktop computers implementing PCI Express can have the same look and feel as
current computers with no changes required to existing system form factors. PCI Express
motherboards can have an ATX-like motherboard form factor.

Mini PCI Express Form Factor

The Mini PCI Express connector and add-in card implement a subset of the signals that exist on a
standard PCI Express connector and add-in card. The form factor, as the name implies, is
much smaller. This form factor targets the mobile computing market. The Mini PCI Express slot
supports x1 PCI Express signals including power management signals. In addition, the slot
supports LED control signals, a USB interface and an SMBus interface. The Mini PCI Express
module is similar to, but smaller than, a PC Card.
Mechanical Form Factors Pending Release

As of May 2003, specifications for two new form factors have not been released. Below is a
summary of publicly available information about these form factors.

NEWCARD Form Factor

Another new module form factor that will service both mobile and desktop markets is the
NEWCARD form factor. This is a PCMCIA PC Card style form factor of nearly half the size,
supporting x1 PCI Express signals including power management signals. In addition, the
slot supports USB and SMBus interfaces. There are two size form factors defined, a narrower
version and a wider version though the thickness and depth remain the same. Although similar
in appearance to Mini PCI Express Module, this is a different form factor.

Server IO Module (SIOM) Form Factor

These are a family of modules that target the workstation and server market. They are
designed with future support of larger PCI Express Lane widths and higher frequency bit rates
beyond the 2.5 Gbits/s Generation 1 transmission rate. Four form factors are under
consideration: single- and double-width base modules, and single- and double-width full-height
modules.

PCI Express Topology

Major components in the PCI Express system shown in Figure 1-22 include a root complex,
switches, and endpoint devices.

Figure 1-22. PCI Express Topology


The Root Complex denotes the device that connects the CPU and memory subsystem to the
PCI Express fabric. It may support one or more PCI Express ports. The root complex in this
example supports 3 ports. Each port is connected to an endpoint device or a switch which
forms a sub-hierarchy. The root complex generates transaction requests on behalf of the CPU:
it is capable of initiating configuration transaction requests, memory and IO requests, and
locked transaction requests. As a completer, however, the root complex does not respond to
locked requests. The root complex transmits packets out of its ports and receives packets on
its ports, which it forwards to memory. A multi-port root complex may also route packets from one port to
another port but is NOT required by the specification to do so.

The root complex implements central resources such as a hot plug controller, power management
controller, interrupt controller, and error detection and reporting logic. The root complex initializes
with a bus number, device number and function number which are used to form a requester ID
or completer ID. The root complex bus, device and function numbers initialize to all 0s.

A Hierarchy is a fabric of all the devices and Links associated with a root complex that are
either directly connected to the root complex via its port(s) or indirectly connected via switches
and bridges. In Figure 1-22 on page 48, the entire PCI Express fabric associated with the root
is one hierarchy.

A Hierarchy Domain is a fabric of devices and Links that are associated with one port of the
root complex. For example in Figure 1-22 on page 48, there are 3 hierarchy domains.

Endpoints are devices, other than the root complex and switches, that are requesters or
completers of PCI Express transactions. They are peripheral devices such as Ethernet, USB or
graphics devices. Endpoints initiate transactions as a requester or respond to transactions as a
completer. Two types of endpoints exist, PCI Express endpoints and legacy endpoints. Legacy
Endpoints may support IO transactions. They may support locked transaction semantics as a
completer but not as a requester. Interrupt capable legacy devices may support legacy style
interrupt generation using message requests but must in addition support MSI generation using
memory write transactions. Legacy devices are not required to support 64-bit memory
addressing capability. PCI Express Endpoints must not support IO or locked transaction
semantics and must support MSI style interrupt generation. PCI Express endpoints must
support 64-bit memory addressing capability in prefetchable memory address space, though
their non-prefetchable memory address space is permitted to map below the 4 GByte boundary.
Both types of endpoints implement Type 0 PCI configuration headers and respond to
configuration transactions as completers. Each endpoint is initialized with a device ID
(requester ID or completer ID) which consists of a bus number, device number, and function
number. Endpoints are always device 0 on a bus.

Multi-Function Endpoints. Like PCI devices, PCI Express devices may support up to 8 functions per endpoint, with at least function 0 implemented. However, a PCI Express Link supports only one endpoint, numbered device 0.

PCI Express-to-PCI(-X) Bridge is a bridge between PCI Express fabric and a PCI or PCI-X
hierarchy.

A Requester is a device that originates a transaction in the PCI Express fabric. Root complex
and endpoints are requester type devices.

A Completer is a device addressed or targeted by a requester. A requester reads data from a completer or writes data to a completer. Root complex and endpoints are completer type devices.

A Port is the interface between a PCI Express component and the Link. It consists of
differential transmitters and receivers. An Upstream Port is a port that points in the direction of
the root complex. A Downstream Port is a port that points away from the root complex. An
endpoint port is an upstream port. Root complex ports are downstream ports. An Ingress Port is a port that receives a packet. An Egress Port is a port that transmits a packet.

A Switch can be thought of as consisting of two or more logical PCI-to-PCI bridges, each
bridge associated with a switch port. Each bridge implements configuration header 1 registers.
Configuration and enumeration software will detect and initialize each of the header 1 registers
at boot time. The 4-port switch shown in Figure 1-22 on page 48 consists of 4 virtual bridges.
These bridges are internally connected via a non-defined bus. One port of a switch pointing in
the direction of the root complex is an upstream port. All other ports pointing away from the
root complex are downstream ports.

A switch forwards packets in a manner similar to PCI bridges using memory, IO or configuration address based routing. Switches must forward all types of transactions from any ingress port to any egress port. Switches forward these packets based on one of three routing mechanisms: address routing, ID routing, or implicit routing. The logical bridges within the switch implement PCI configuration header 1. The configuration header contains memory and IO base and limit address registers as well as primary bus number, secondary bus number and subordinate bus number registers. These registers are used by the switch to aid in packet routing and forwarding.
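A minimal sketch of how a logical bridge's configuration registers might drive these two routing decisions is shown below; the structure and function names are invented for illustration and do not correspond to any particular implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of one logical PCI-to-PCI bridge inside a switch. */
struct bridge_cfg {
    uint64_t mem_base;    /* start of the memory window claimed downstream */
    uint64_t mem_limit;   /* end of that memory window                     */
    uint8_t  secondary;   /* bus number immediately below this bridge      */
    uint8_t  subordinate; /* highest bus number below this bridge          */
};

/* Address routing: a memory request is forwarded downstream only if its
 * address falls inside the bridge's base/limit window. */
static bool route_mem_downstream(const struct bridge_cfg *br, uint64_t addr)
{
    return addr >= br->mem_base && addr <= br->mem_limit;
}

/* ID routing (used, for example, for completions): a TLP is forwarded
 * downstream only if its target bus number lies between the secondary
 * and subordinate bus numbers. */
static bool route_id_downstream(const struct bridge_cfg *br, uint8_t bus)
{
    return bus >= br->secondary && bus <= br->subordinate;
}
```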
Switches implement two arbitration mechanisms, port arbitration and VC arbitration, by which
they determine the priority with which to forward packets from ingress ports to egress ports.
Switches support locked requests.

Enumerating the System

Standard PCI Plug and Play enumeration software can enumerate a PCI Express system. The
Links are numbered in a manner similar to the PCI depth first search enumeration algorithm. An
example of the bus numbering is shown in Figure 1-22 on page 48. Each PCI Express Link is
equivalent to a logical PCI bus. In other words, each Link is assigned a bus number by the bus
enumerating software. A PCI Express endpoint is device 0 on a PCI Express Link of a given
bus number. Only one device (device 0) exists per PCI Express Link. The internal bus within a
switch that connects all the virtual bridges together is also numbered. The first Link associated
with the root complex is numbered bus 1. Bus 0 is an internal virtual bus within the root complex.
Buses downstream of a PCI Express-to-PCI(-X) bridge are enumerated the same way as in a
PCI(-X) system.

Endpoints and PCI(-X) devices may implement up to 8 functions per device. Only 1 device is
supported per PCI Express Link though PCI(-X) buses may theoretically support up to 32
devices per bus. A system could theoretically include up to 256 buses in total, counting both PCI Express Links and PCI(-X) buses.
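The following C sketch illustrates the depth-first numbering idea on a hypothetical tree of bridges (root complex ports and switch virtual bridges). It only shows how secondary and subordinate bus numbers could be assigned; the actual configuration-cycle mechanics are omitted.

```c
#include <stdint.h>

#define MAX_CHILDREN 8

/* Hypothetical model of a bridge: a root complex port or a switch's
 * virtual bridge. Real enumeration issues configuration reads/writes;
 * this sketch only shows the depth-first bus number assignment. */
struct bridge {
    uint8_t primary;      /* bus number on the upstream side            */
    uint8_t secondary;    /* bus number immediately downstream          */
    uint8_t subordinate;  /* highest bus number downstream of this one  */
    int nchildren;
    struct bridge *child[MAX_CHILDREN];
};

/* Assign bus numbers depth-first. 'next' is the next free bus number;
 * the return value is the next free number after this subtree. */
static uint8_t enumerate(struct bridge *br, uint8_t primary, uint8_t next)
{
    br->primary = primary;
    br->secondary = next++;                 /* the Link/bus below this bridge */
    for (int i = 0; i < br->nchildren; i++)
        next = enumerate(br->child[i], br->secondary, next);
    br->subordinate = (uint8_t)(next - 1);  /* deepest bus found in subtree   */
    return next;
}
```

Calling enumerate() on a root complex port with primary 0 and next 1 numbers the first Link bus 1, then walks each branch completely before backtracking, matching the depth-first numbering shown in Figure 1-22.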

PCI Express System Block Diagram

Low Cost PCI Express Chipset

Figure 1-23 on page 52 is a block diagram of a low cost PCI Express based system. As of the writing of this book (April 2003), no real-life PCI Express chipset architecture designs had been publicly disclosed. The author describes here a practical low cost PCI Express chipset whose
architecture is based on existing non-PCI Express chipset architectures. In this solution, AGP
which connects MCH to a graphics controller in earlier MCH designs (see Figure 1-14 on page
32) is replaced with a PCI Express Link. The Hub Link that connects MCH to ICH is replaced
with a PCI Express Link. And in addition to a PCI bus associated with ICH, the ICH chip
supports 4 PCI Express Links. Some of these Links can connect directly to devices on the
motherboard and some can be routed to connectors where peripheral cards are installed.

The CPU can communicate with PCI Express devices associated with ICH as well as the PCI
Express graphics controller. PCI Express devices can communicate with system memory or the
graphics controller associated with MCH. PCI devices may also communicate with PCI Express
devices and vice versa. In other words, the chipset supports peer-to-peer packet routing
between PCI Express endpoints and PCI devices, memory and graphics. It is yet to be
determined if the first generation PCI Express chipsets, will support peer-to-peer packet routing
between PCI Express endpoints. Remember that the specification does not require the root
complex to support peer-to-peer packet routing between the multiple Links associated with the
root complex.

This design does not require the use of switches if the number of PCI Express devices to be
connected does not exceed the number of Links available in this design.

Another Low Cost PCI Express Chipset

Figure 1-24 on page 53 is a block diagram of another low cost PCI Express system. In this
design, the Hub Link connects the root complex to an ICH device. The ICH device may be an
existing design which has no PCI Express Link associated with it. Instead, all PCI Express
Links are associated with the root complex. One of these Links connects to a graphics
controller. The other Links directly connect to PCI Express endpoints on the motherboard or
connect to PCI Express endpoints on peripheral cards inserted in slots.

Figure 1-24. Another Low Cost PCI Express System

High-End Server System

Figure 1-25 shows a more complex system requiring a large number of devices connected
together. Multi-port switches are a necessary design feature to accomplish this. To support PCI
or PCI-X buses, a PCI Express-to-PCI(-X) bridge is connected to one switch port. PCI Express
packets can be routed from any device to any other device because switches support peer-to-peer packet routing (only multi-port root complex devices are not required to support peer-to-peer functionality).
PCI Express Specifications
As of the writing of this book (May 2003), the following specifications had been released by the PCISIG.

PCI Express 1.0a Base Specification released Q2, 2003

PCI Express 1.0a Card Electromechanical Specification released Q2, 2002

PCI Express 1.0 Base Specification released Q2, 2002

PCI Express 1.0 Card Electromechanical Specification released Q2, 2002

Mini PCI Express 1.0 Specification released Q2, 2003

As of May 2003, the specifications pending release are: the PCI Express-to-PCI Bridge
specification, Server IO Module specification, Cable specification, Backplane specification,
updated Mini PCI Express specification, and NEWCARD specification.
Chapter 2. Architecture Overview

Previous Chapter

This Chapter

The Next Chapter

Introduction to PCI Express Transactions

PCI Express Device Layers

Example of a Non-Posted Memory Read Transaction

Hot Plug

PCI Express Performance and Data Transfer Efficiency


Previous Chapter
The previous chapter described performance advantages and key features of the PCI Express
(PCI-XP) Link. To highlight these advantages, the chapter described performance
characteristics and features of predecessor buses such as PCI and PCI-X buses with the goal
of discussing the evolution of PCI Express from these predecessor buses. It compared and
contrasted features and performance points of PCI, PCI-X and PCI Express buses. The key
features of a PCI Express system were described. In addition, the chapter described some examples of PCI Express system topologies.
This Chapter
This chapter is an introduction to the PCI Express data transfer protocol. It describes the
layered approach to PCI Express device design while describing the function of each device
layer. Packet types employed in accomplishing data transfers are described without getting into
packet content details. Finally, this chapter outlines the process of a requester initiating a
transaction such as a memory read to read data from a completer across a Link.
The Next Chapter
The next chapter describes how packets are routed through a PCI Express fabric consisting of
switches. Packets are routed based on a memory address, IO address, device ID or implicitly.
Introduction to PCI Express Transactions
PCI Express employs packets to accomplish data transfers between devices. A root complex
can communicate with an endpoint. An endpoint can communicate with a root complex. An
endpoint can communicate with another endpoint. Communication involves the transmission and
reception of packets called Transaction Layer packets (TLPs).

PCI Express transactions can be grouped into four categories: 1) memory, 2) IO, 3) configuration, and 4) message transactions. Memory, IO and configuration transactions are supported in PCI and PCI-X architectures, but the message transaction is new to PCI Express. Transactions are defined as a series of one or more packet transmissions required to complete an information transfer between a requester and a completer. Table 2-1 is a more detailed list of transactions. These transactions can be categorized into non-posted transactions and posted transactions.

Table 2-1. PCI Express Non-Posted and Posted Transactions

Transaction Type                            Non-Posted or Posted
Memory Read                                 Non-Posted
Memory Write                                Posted
Memory Read Lock                            Non-Posted
IO Read                                     Non-Posted
IO Write                                    Non-Posted
Configuration Read (Type 0 and Type 1)      Non-Posted
Configuration Write (Type 0 and Type 1)     Non-Posted
Message                                     Posted

For Non-posted transactions, a requester transmits a TLP request packet to a completer. At a later time, the completer returns a TLP completion packet back to the requester. Non-posted transactions are handled as split transactions similar to the PCI-X split transaction model described on page 37 in Chapter 1. The purpose of the completion TLP is to confirm to the requester that the completer has received the request TLP. In addition, non-posted read transactions contain data in the completion TLP. Non-Posted write transactions contain data in the write request TLP.
For Posted transactions, a requester transmits a TLP request packet to a completer. The
completer however does NOT return a completion TLP back to the requester. Posted
transactions are optimized for best performance in completing the transaction at the expense of
the requester not having knowledge of successful reception of the request by the completer.
Posted transactions may or may not contain data in the request TLP.
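The posted/non-posted distinction in Table 2-1 boils down to a simple rule that can be expressed as a helper function; the enum names below are illustrative abbreviations mirroring Table 2-2, not identifiers from any real header.

```c
#include <stdbool.h>

/* TLP request types, mirroring the abbreviations in Table 2-2. */
enum tlp_req_type {
    MRD, MRDLK, MWR,                  /* memory read, locked read, write */
    IORD, IOWR,                       /* IO read and write               */
    CFGRD0, CFGRD1, CFGWR0, CFGWR1,   /* configuration read/write        */
    MSG, MSGD                         /* message without / with data     */
};

/* A request is posted when the completer never returns a completion:
 * memory writes and messages. Every other request is non-posted and is
 * answered with a Cpl or CplD TLP. */
static bool is_posted(enum tlp_req_type t)
{
    return t == MWR || t == MSG || t == MSGD;
}
```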

PCI Express Transaction Protocol

Table 2-2 lists all of the TLP request and TLP completion packets. These packets are used in
the transactions referenced in Table 2-1. Our goal in this section is to describe how these
packets are used to complete transactions at a system level and not to describe the packet
routing through the PCI Express fabric nor to describe packet contents in any detail.

Table 2-2. PCI Express TLP Packet Types

TLP Packet Type                                                         Abbreviated Name
Memory Read Request                                                     MRd
Memory Read Request - Locked access                                     MRdLk
Memory Write Request                                                    MWr
IO Read                                                                 IORd
IO Write                                                                IOWr
Configuration Read (Type 0 and Type 1)                                  CfgRd0, CfgRd1
Configuration Write (Type 0 and Type 1)                                 CfgWr0, CfgWr1
Message Request without Data                                            Msg
Message Request with Data                                               MsgD
Completion without Data                                                 Cpl
Completion with Data                                                    CplD
Completion without Data - associated with Locked Memory Read Requests   CplLk
Completion with Data - associated with Locked Memory Read Requests      CplDLk
Non-Posted Read Transactions

Figure 2-1 shows the packets transmitted by a requester and completer to complete a non-
posted read transaction. To complete this transfer, a requester transmits a non-posted read
request TLP to a completer it intends to read data from. Non-posted read request TLPs include
memory read request (MRd), IO read request (IORd), and configuration read request type 0 or
type 1 (CfgRd0, CfgRd1) TLPs. Requesters may be root complex or endpoint devices
(endpoints, however, do not initiate configuration read/write requests).

Figure 2-1. Non-Posted Read Transaction Protocol

The request TLP is routed through the fabric of switches using information in the header portion
of the TLP. The packet makes its way to a targeted completer. The completer can be a root
complex, switch, bridge, or endpoint.

When the completer receives the packet and decodes its contents, it gathers the amount of
data specified in the request from the targeted address. The completer creates a single
completion TLP or multiple completion TLPs with data (CplD) and sends it back to the
requester. The completer can return up to 4 KBytes of data per CplD packet.
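As a rough sketch of the splitting idea, the code below plans how a read could be returned in several CplD packets, assuming a hypothetical per-packet payload ceiling (never more than the 4 KBytes mentioned above; the actual ceiling is configuration dependent).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: show how a read of 'total_bytes' could be
 * returned across several CplD packets, given a per-packet payload
 * ceiling (never more than 4 KBytes). */
static void plan_completions(uint32_t total_bytes, uint32_t max_per_cpld)
{
    uint32_t remaining = total_bytes;
    int n = 0;
    while (remaining > 0) {
        uint32_t chunk = remaining < max_per_cpld ? remaining : max_per_cpld;
        printf("CplD %d carries %u bytes\n", ++n, chunk);
        remaining -= chunk;
    }
}

int main(void)
{
    /* A 4 KByte read returned with a 256-byte per-packet ceiling
     * would arrive as 16 CplD packets. */
    plan_completions(4096, 256);
    return 0;
}
```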

The completion packet contains routing information necessary to route the packet back to the
requester. This completion packet travels through the same path and hierarchy of switches as
the request packet.

Requesters use a tag field in the completion to associate it with a request TLP of the same
tag value it transmitted earlier. Use of a tag in the request and completion TLPs allows a
requester to manage multiple outstanding transactions.
If a completer is unable to obtain requested data as a result of an error, it returns a completion
packet without data (Cpl) and an error status indication. The requester determines how to
handle the error at the software layer.
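A minimal sketch of tag-based tracking on the requester side is shown below; the table size, structure, and function names are invented for illustration, and real devices implement this in hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TAGS 32   /* assume a small tag space for this sketch */

/* One outstanding non-posted request per tag value. */
struct outstanding {
    bool     in_use;
    uint64_t addr;         /* address of the original request           */
    uint16_t requester_id; /* our own ID, echoed back in the completion */
};

static struct outstanding pending[MAX_TAGS];

/* Pick a free tag when transmitting a non-posted request. Returns -1 if
 * every tag is already in flight (the requester must then wait). */
static int allocate_tag(uint64_t addr, uint16_t requester_id)
{
    for (int t = 0; t < MAX_TAGS; t++) {
        if (!pending[t].in_use) {
            pending[t] = (struct outstanding){ true, addr, requester_id };
            return t;
        }
    }
    return -1;
}

/* On receiving a completion, match its tag to the original request. */
static bool retire_tag(uint8_t tag)
{
    if (tag >= MAX_TAGS || !pending[tag].in_use)
        return false;              /* unexpected or stale completion */
    pending[tag].in_use = false;
    return true;
}
```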

Non-Posted Read Transaction for Locked Requests

Figure 2-2 on page 60 shows packets transmitted by a requester and completer to complete a
non-posted locked read transaction. To complete this transfer, a requester transmits a memory
read locked request (MRdLk) TLP. The requester can only be a root complex which initiates a
locked request on behalf of the CPU. Endpoints are not allowed to initiate locked requests.

Figure 2-2. Non-Posted Locked Read Transaction Protocol

The locked memory read request TLP is routed downstream through the fabric of switches
using information in the header portion of the TLP. The packet makes its way to a targeted
completer. The completer can only be a legacy endpoint. The entire path from root complex to
the endpoint (for TCs that map to VC0) is locked including the ingress and egress port of
switches in the pathway.

When the completer receives the packet and decodes its contents, it gathers the amount of
data specified in the request from the targeted address. The completer creates one or more
locked completion TLPs with data (CplDLk) along with a completion status. The completion is sent back to the root complex requester via the same path and hierarchy of switches as the original request.

The CplDLk packet contains routing information necessary to route the packet back to the
requester. Requesters use a tag field in the completion to associate it with a request TLP of
the same tag value it transmitted earlier. Use of a tag in the request and completion TLPs
allows a requester to manage multiple outstanding transactions.

If the completer is unable to obtain the requested data as a result of an error, it returns a
completion packet without data (CplLk) and an error status indication within the packet. The
requester who receives the error notification via the CplLk TLP must assume that atomicity of
the lock is no longer guaranteed and thus determine how to handle the error at the software
layer.

The path from requester to completer remains locked until the requester at a later time
transmits an unlock message to the completer. The path and ingress/egress ports of a switch
that the unlock message passes through are unlocked.

Non-Posted Write Transactions

Figure 2-3 on page 61 shows the packets transmitted by a requester and completer to
complete a non-posted write transaction. To complete this transfer, a requester transmits a
non-posted write request TLP to a completer it intends to write data to. Non-posted write
request TLPs include IO write request (IOWr), configuration write request type 0 or type 1
(CfgWr0, CfgWr1) TLPs. Memory write request and message requests are posted requests.
Requesters may be a root complex or endpoint device (though not for configuration write
requests).

Figure 2-3. Non-Posted Write Transaction Protocol

A request packet with data is routed through the fabric of switches using information in the
header of the packet. The packet makes its way to a completer.

When the completer receives the packet and decodes its contents, it accepts the data. The
completer creates a single completion packet without data (Cpl) to confirm reception of the
write request. This is the purpose of the completion.
The completion packet contains routing information necessary to route the packet back to the
requester. This completion packet will propagate through the same hierarchy of switches that
the request packet went through before making its way back to the requester. The requester
gets confirmation notification that the write request did make its way successfully to the
completer.

If the completer is unable to successfully write the data in the request to the final destination or
if the write request packet reaches the completer in error, then it returns a completion packet
without data (Cpl) but with an error status indication. The requester who receives the error
notification via the Cpl TLP determines how to handle the error at the software layer.

Posted Memory Write Transactions

Memory write requests shown in Figure 2-4 are posted transactions. This implies that the
completer returns no completion notification to inform the requester that the memory write
request packet has reached its destination successfully. No time is wasted in returning a
completion, thus back-to-back posted writes complete with higher performance relative to non-
posted transactions.

Figure 2-4. Posted Memory Write Transaction Protocol

The write request packet which contains data is routed through the fabric of switches using
information in the header portion of the packet. The packet makes its way to a completer. The
completer accepts the specified amount of data within the packet. Transaction over.

If the write request is received by the completer in error, or if the completer is unable to write the posted write
data to the final destination due to an internal error, the requester is not informed via the
hardware protocol. The completer could log an error and generate an error message
notification to the root complex. Error handling software manages the error.

Posted Message Transactions

Message requests are also posted transactions as pictured in Figure 2-5 on page 64. There
are two categories of message request TLPs, Msg and MsgD. Some message requests
propagate from requester to completer, some are broadcast requests from the root complex to
all endpoints, some are transmitted by an endpoint to the root complex. Message packets may
be routed to completer(s) based on the message's address, device ID or routed implicitly.
Message request routing is covered in Chapter 3.

Figure 2-5. Posted Message Transaction Protocol

The completer accepts any data that may be contained in the packet (if the packet is MsgD)
and/or performs the task specified by the message.

Message request support eliminates the need for side-band signals in a PCI Express system.
They are used for PCI style legacy interrupt signaling, power management protocol, error
signaling, unlocking a path in the PCI Express fabric, slot power support, hot plug protocol, and
vendor-defined purposes.

Some Examples of Transactions

This section describes a few transaction examples showing packets transmitted between
requester and completer to accomplish a transaction. The examples consist of a memory read,
IO write, and memory write.

Memory Read Originated by CPU, Targeting an Endpoint

Figure 2-6 shows an example of packet routing associated with completing a memory read
transaction. The root complex, on behalf of the CPU, initiates a non-posted memory read
from the completer endpoint shown. The root complex transmits an MRd packet which contains
amongst other fields, an address, TLP type, requester ID (of the root complex) and length of
transfer (in doublewords) field. Switch A, which is a 3-port switch, receives the packet on its upstream port. The switch logically appears as three virtual bridges connected by an internal bus. The logical bridges within the switch contain memory and IO base and limit
address registers within their configuration space similar to PCI bridges. The MRd packet
address is decoded by the switch and compared with the base/limit address range registers of
the two downstream logical bridges. The switch internally forwards the MRd packet from the
upstream ingress port to the correct downstream port (the left port in this example). The MRd
packet is forwarded to switch B. Switch B decodes the address in a similar manner. Assume
the MRd packet is forwarded to the right-hand port so that the completer endpoint receives
the MRd packet.

Figure 2-6. Non-Posted Memory Read Originated by CPU and Targeting an Endpoint

The completer decodes the contents of the header within the MRd packet, gathers the
requested data and returns a completion packet with data (CplD). The header portion of the
completion TLP contains the requester ID copied from the original request TLP. The requester
ID is used to route the completion packet back to the root complex.
The logical bridges within Switch B compare the bus number field of the requester ID in the
CplD packet with the secondary and subordinate bus number configuration registers. The CplD
packet is forwarded to the appropriate port (in this case the upstream port). The CplD packet
moves to Switch A which forwards the packet to the root complex. The requester ID field of the
completion TLP matches the root complex's ID. The root complex checks the completion status
(hopefully "successful completion") and accepts the data. This data is returned to the CPU in
response to its pending memory read transaction.

Memory Read Originated by Endpoint, Targeting System Memory

In a similar manner, the endpoint device shown in Figure 2-7 on page 67 initiates a memory
read request (MRd). This packet contains amongst other fields in the header, the endpoint's
requester ID, targeted address and amount of data requested. It forwards the packet to Switch
B which decodes the memory address in the packet and compares it with the memory
base/limit address range registers within the virtual bridges of the switch. The packet is
forwarded to Switch A which decodes the address in the packet and forwards the packet to the
root complex completer.

Figure 2-7. Non-Posted Memory Read Originated by Endpoint and Targeting Memory

The root complex obtains the requested data from system memory and creates a completion
TLP with data (CplD). The bus number portion of the requester ID in the completion TLP is
used to route the packet through the switches to the endpoint.

A requester endpoint can also communicate with another peer completer endpoint. For example
an endpoint attached to switch B can talk to an endpoint connected to switch C. The request
TLP is routed using an address. The completion is routed using bus number. Multi-port root
complex devices are not required to support port-to-port packet routing. In which case, peer-to-
peer transactions between endpoints associated with two different ports of the root complex is
not supported.

IO Write Initiated by CPU, Targeting an Endpoint

IO requests can only be initiated by a root complex or a legacy endpoint. PCI Express
endpoints do not initiate IO transactions. IO transactions are intended for legacy support.
Native PCI Express devices are not prohibited from implementing IO space, but the
specification states that a PCI Express Endpoint must not depend on the operating system
allocating I/O resources that are requested.

IO requests are routed by switches in a similar manner to memory requests. Switches route IO
request packets by comparing the IO address in the packet with the IO base and limit address
range registers in the virtual bridge configuration space associated with a switch.

Figure 2-8 on page 68 shows routing of packets associated with an IO write transaction. The
CPU initiates an IO write on the Front Side Bus (FSB). The write contains a target IO address
and up to 4 Bytes of data. The root complex creates an IO Write request TLP (IOWr) using
address and data from the CPU transaction. It uses its own requester ID in the packet header.
This packet is routed through switch A and B. The completer endpoint returns a completion
without data (Cpl) and completion status of 'successful completion' to confirm the reception of
good data from the requester.

Figure 2-8. IO Write Transaction Originated by CPU, Targeting Legacy Endpoint
Memory Write Transaction Originated by CPU and Targeting an Endpoint

Memory write (MWr) requests (and message requests Msg or MsgD) are posted transactions.
This implies that the completer does not return a completion. The MWr packet is routed through
the PCI Express fabric of switches in the same manner as described for memory read
requests. The requester root complex can write up to 4 KBytes of data with one MWr packet.

Figure 2-9 on page 69 shows a memory write transaction originated by the CPU. The root
complex creates a MWr TLP on behalf of the CPU using target address and data from the CPU
FSB transaction. This packet is routed through switch A and B. The packet reaches the
endpoint and the transaction is complete.

Figure 2-9. Memory Write Transaction Originated by CPU, Targeting Endpoint
PCI Express Device Layers

Overview

The PCI Express specification defines a layered architecture for device design as shown in
Figure 2-10 on page 70. The layers consist of a Transaction Layer, a Data Link Layer and a
Physical layer. The layers can be further divided vertically into two, a transmit portion that
processes outbound traffic and a receive portion that processes inbound traffic. However, a
device design does not have to implement a layered architecture as long as the functionality
required by the specification is supported.

Figure 2-10. PCI Express Device Layers

The goal of this section is to describe the function of each layer and to describe the flow of
events to accomplish a data transfer. Packet creation at a transmitting device and packet
reception and decoding at a receiving device are also explained.

Transmit Portion of Device Layers


Consider the transmit portion of a device. Packet contents are formed in the Transaction Layer
with information obtained from the device core and application. The packet is stored in buffers
ready for transmission to the lower layers. This packet is referred to as a Transaction Layer
Packet (TLP), described in the earlier section of this chapter. The Data Link Layer appends to the packet additional information required for error checking at the receiver device. The packet
is then encoded in the Physical layer and transmitted differentially on the Link by the analog
portion of this Layer. The packet is transmitted using the available Lanes of the Link to the
receiving device which is its neighbor.

Receive Portion of Device Layers

The receiver device decodes the incoming packet contents in the Physical Layer and forwards
the resulting contents to the upper layers. The Data Link Layer checks for errors in the
incoming packet and if there are no errors forwards the packet up to the Transaction Layer.
The Transaction Layer buffers the incoming TLPs and converts the information in the packet to
a representation that can be processed by the device core and application.

Device Layers and their Associated Packets

Three categories of packets are defined, each associated with one of the three device
layers. Associated with the Transaction Layer is the Transaction Layer Packet (TLP).
Associated with the Data Link Layer is the Data Link Layer Packet (DLLP). Associated with the
Physical Layer is the Physical Layer Packet (PLP). These packets are introduced next.

Transaction Layer Packets (TLPs)

PCI Express transactions employ TLPs which originate at the Transaction Layer of a
transmitter device and terminate at the Transaction Layer of a receiver device. This process is
represented in Figure 2-11 on page 72. The Data Link Layer and Physical Layer also contribute
to TLP assembly as the TLP moves through the layers of the transmitting device. At the other
end of the Link where a neighbor receives the TLP, the Physical Layer, Data Link Layer and
Transaction Layer disassemble the TLP.

Figure 2-11. TLP Origin and Destination


TLP Packet Assembly

A TLP that is transmitted on the Link appears as shown in Figure 2-12 on page 73.

Figure 2-12. TLP Assembly

The software layer/device core sends to the Transaction Layer the information required to
assemble the core section of the TLP which is the header and data portion of the packet. Some
TLPs do not contain a data section. An optional End-to-End CRC (ECRC) field is calculated and
appended to the packet. The ECRC field is used by the ultimate targeted device of this packet
to check for CRC errors in the header and data portion of the TLP.
The core section of the TLP is forwarded to the Data Link Layer which then appends a
sequence ID and another LCRC field. The LCRC field is used by the neighboring receiver
device at the other end of the Link to check for CRC errors in the core section of the TLP plus
the sequence ID. The resultant TLP is forwarded to the Physical Layer which concatenates a
Start and End framing character of 1 byte each to the packet. The packet is encoded and
differentially transmitted on the Link using the available number of Lanes.

TLP Packet Disassembly

A neighboring receiver device receives the incoming TLP bit stream. As shown in Figure 2-13
on page 74 the received TLP is decoded by the Physical Layer and the Start and End frame
fields are stripped. The resultant TLP is sent to the Data Link Layer. This layer checks for any
errors in the TLP and strips the sequence ID and LCRC field. Assume there are no LCRC
errors, then the TLP is forwarded up to the Transaction Layer. If the receiving device is a
switch, then the packet is routed from one port of the switch to an egress port based on
address information contained in the header portion of the TLP. Switches are allowed to check
for ECRC errors and even report any errors they find. However, a switch is not allowed to modify the ECRC; that way, the device ultimately targeted by this TLP will still detect an ECRC error if one exists.

Figure 2-13. TLP Disassembly

The ultimate targeted device of this TLP checks for ECRC errors in the header and data portion
of the TLP. The ECRC field is stripped, leaving the header and data portion of the packet. It is
this information that is finally forwarded to the Device Core/Software Layer.
Data Link Layer Packets (DLLPs)

Another PCI Express packet called DLLP originates at the Data Link Layer of a transmitter
device and terminates at the Data Link Layer of a receiver device. This process is represented
in Figure 2-14 on page 75. The Physical Layer also contributes to DLLP assembly and
disassembly as the DLLP moves from one device to another via the PCI Express Link.

Figure 2-14. DLLP Origin and Destination

DLLPs are used for Link Management functions including TLP acknowledgement associated
with the ACK/NAK protocol, power management, and exchange of Flow Control information.

DLLPs are transferred between Data Link Layers of the two directly connected components on
a Link. Unlike TLPs, which travel through the PCI Express fabric, DLLPs do not pass through switches. DLLPs do not contain routing information. These packets are smaller than TLPs, 8 bytes to be precise.

DLLP Assembly

The DLLP shown in Figure 2-15 on page 76 originates at the Data Link Layer. There are
various types of DLLPs, including Flow Control DLLPs (FCx), acknowledge/no-acknowledge DLLPs that confirm reception of TLPs (ACK and NAK), and power management DLLPs (PMx). A DLLP type field identifies the type of each DLLP. The Data Link Layer
appends a 16-bit CRC used by the receiver of the DLLP to check for CRC errors in the DLLP.

Figure 2-15. DLLP Assembly


The DLLP content along with a 16-bit CRC is forwarded to the Physical Layer which appends a
Start and End frame character of 1 byte each to the packet. The packet is encoded and
differentially transmitted on the Link using the available number of Lanes.

DLLP Disassembly

The DLLP is received by the Physical Layer of a receiving device. The received bit stream is
decoded and the Start and End frame fields are stripped as depicted in Figure 2-16. The
resultant packet is sent to the Data Link Layer. This layer checks for CRC errors and strips the
CRC field. The Data Link Layer is the destination layer for DLLPs, so a DLLP is not forwarded up to the Transaction Layer.

Figure 2-16. DLLP Disassembly


Physical Layer Packets (PLPs)

Another PCI Express packet called PLP originates at the Physical Layer of a transmitter device
and terminates at the Physical Layer of a receiver device. This process is represented in Figure
2-17 on page 77. The PLP is a very simple packet that starts with a 1 byte COM character
followed by 3 or more other characters that define the PLP type as well as contain other
information. The PLP is a multiple of 4 bytes in size, an example of which is shown in Figure 2-
18 on page 78. The specification refers to this packet as the Ordered-Set. PLPs do not contain
any routing information. They are not routed through the fabric and do not propagate through a
switch.

Figure 2-17. PLP Origin and Destination


Figure 2-18. PLP or Ordered-Set Structure

Some PLPs are used during the Link Training process described in "Ordered-Sets Used During
Link Training and Initialization" on page 504. Another PLP is used for clock tolerance
compensation. PLPs are used to place a Link into the electrical idle low power state or to wake
up a link from this low power state.

Function of Each PCI Express Device Layer

Figure 2-19 on page 79 is a more detailed block diagram of a PCI Express Device's layers.
This block diagram is used to explain the key functions of each layer as they relate to the generation of outbound traffic and the handling of inbound traffic. The
layers consist of Device Core/Software Layer, Transaction Layer, Data Link Layer and Physical
Layer.

Figure 2-19. Detailed Block Diagram of PCI Express Device's Layers


Device Core / Software Layer

The Device Core consists of, for example, the root complex core logic or an endpoint core logic
such as that of an Ethernet controller, SCSI controller, USB controller, etc. To design a PCI
Express endpoint, a designer may reuse the Device Core logic from a PCI or PCI-X core logic
design and wrap around it the PCI Express layered design described in this section.

Transmit Side

The Device Core logic in conjunction with local software provides the necessary information
required by the PCI Express device to generate TLPs. This information is sent via the Transmit
interface to the Transaction Layer of the device. Example of information transmitted to the
Transaction Layer includes: transaction type to inform the Transaction Layer what type of TLP
to generate, address, amount of data to transfer, data, traffic class, message index etc.

Receive Side

The Device Core logic is also responsible for receiving information sent by the Transaction Layer
via the Receive interface. This information includes: type of TLP received by the Transaction
Layer, address, amount of data received, data, traffic class of received TLP, message index,
error conditions etc.
Transaction Layer

The Transaction Layer shown in Figure 2-19 is responsible for generation of outbound TLP
traffic and reception of inbound TLP traffic. The Transaction Layer supports the split transaction
protocol for non-posted transactions. In other words, the Transaction Layer associates an
inbound completion TLP of a given tag value with an outbound non-posted request TLP of the
same tag value transmitted earlier.

The transaction layer contains virtual channel buffers (VC Buffers) to store outbound TLPs that
await transmission and also to store inbound TLPs received from the Link. The flow control
protocol associated with these virtual channel buffers ensures that a remote transmitter does
not transmit too many TLPs and cause the receiver virtual channel buffers to overflow. The
Transaction Layer also orders TLPs according to ordering rules before transmission. It is this
layer that supports the Quality of Service (QoS) protocol.

The Transaction Layer supports 4 address spaces: memory, IO, configuration, and message space. Message packets carry the message itself rather than targeting a memory or IO address.

Transmit Side

The Transaction Layer receives information from the Device Core and generates outbound
request and completion TLPs which it stores in virtual channel buffers. This layer assembles
Transaction Layer Packets (TLPs). The major components of a TLP are: Header, Data Payload
and an optional ECRC field (the specification also uses the term Digest), as shown in Figure 2-20.

Figure 2-20. TLP Structure at the Transaction Layer

The Header is 3 doublewords or 4 doublewords in size and may include information such as:
Address, TLP type, transfer size, requester ID/completer ID, tag, traffic class, byte enables,
completion codes, and attributes (including "no snoop" and "relaxed ordering" bits). The TLP
types are defined in Table 2-2 on page 57.

The address is a 32-bit memory address or an extended 64-bit address for memory requests.
It is a 32-bit address for IO requests. For configuration transactions the address is an ID
consisting of Bus Number, Device Number and Function Number plus a configuration register
address of the targeted register. For completion TLPs, the address is the requester ID of the
device that originally made the request. For message transactions the address used for routing
is the destination device's ID consisting of Bus Number, Device Number and Function Number of
the device targeted by the message request. Message requests could also be broadcast or
routed implicitly by targeting the root complex or an upstream port.

The transfer size or length field indicates the amount of data to transfer calculated in
doublewords (DWs). The data transfer length can range from 1 to 1024 DWs. Write request
TLPs include data payload in the amount indicated by the length field of the header. For a read
request TLP, the length field indicates the amount of data requested from a completer. This
data is returned in one or more completion packets. Read request TLPs do not include a data
payload field. Byte enables specify byte level address resolution.

Request packets contain a requester ID (bus#, device#, function #) of the device transmitting
the request. The tag field in the request is memorized by the completer and the same tag is
used in the completion.

A bit in the Header (TD = TLP Digest) indicates whether this packet contains an ECRC field
also referred to as Digest. This field is 32-bits wide and contains an End-to-End CRC (ECRC).
The ECRC field is generated by the Transaction Layer at time of creation of the outbound TLP.
It is generated based on the entire TLP from first byte of header to last byte of data payload
(with the exception of the EP bit and bit 0 of the Type field; these two bits are always treated as 1 for the ECRC calculation). The TLP never changes as it traverses the fabric, with the possible exception of those two bits. The
receiver device checks for an ECRC error that may occur as the packet moves through the
fabric.
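The header fields just described can be gathered into a logical structure. The C sketch below is purely illustrative: field names and widths follow the text (3-bit traffic class, 10-bit length in doublewords, 16-bit requester ID), but this is not the bit-exact on-the-wire header layout defined by the specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Logical view of the information carried in a TLP header. This is a
 * convenience structure for illustration only; the real header is a
 * packed 3 DW or 4 DW on-the-wire format defined by the specification. */
struct tlp_header {
    uint8_t  type;          /* MRd, MWr, IORd, CfgRd0, Msg, Cpl, ...      */
    uint8_t  traffic_class; /* TC0-TC7 (3 bits)                           */
    bool     has_digest;    /* TD bit: an ECRC/Digest field is appended   */
    uint8_t  attr;          /* "no snoop" and "relaxed ordering" bits     */
    uint16_t length_dw;     /* transfer size, 1 to 1024 doublewords       */
    uint16_t requester_id;  /* bus/device/function of the requester       */
    uint8_t  tag;           /* matches a completion to its request        */
    uint8_t  byte_enables;  /* first/last doubleword byte enables         */
    uint64_t address;       /* 32-bit or extended 64-bit address, or an
                               ID for configuration and message routing   */
};
```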

Receiver Side

The receiver side of the Transaction Layer stores inbound TLPs in receiver virtual channel
buffers. The receiver checks for CRC errors based on the ECRC field in the TLP. If there are
no errors, the ECRC field is stripped and the resultant information in the TLP header as well as
the data payload is sent to the Device Core.

Flow Control

The Transaction Layer ensures that it does not transmit a TLP over the Link to a remote
receiver device unless the receiver device has virtual channel buffer space to accept TLPs (of a
given traffic class). The protocol for guaranteeing this mechanism is referred to as the "flow
control" protocol. If the transmitter device does not observe this protocol, a transmitted TLP will
cause the receiver virtual channel buffer to overflow. Flow control is automatically managed at
the hardware level and is transparent to software. Software is only involved to enable additional
buffers beyond the default set of virtual channel buffers (referred to as VC 0 buffers). The
default buffers are enabled automatically after Link training, thus allowing TLP traffic to flow
through the fabric immediately after Link training. Configuration transactions use the default
virtual channel buffers and can begin immediately after the Link training process. The Link training process is described in Chapter 14, entitled "Link Initialization & Training," on page 499.

Refer to Figure 2-21 on page 82 for an overview of the flow control process. A receiver device
transmits DLLPs called Flow Control Packets (FCx DLLPs) to the transmitter device on a
periodic basis. The FCx DLLPs contain flow control credit information that updates the
transmitter regarding how much buffer space is available in the receiver virtual channel buffer.
The transmitter keeps track of this information and will only transmit TLPs out of its Transaction
Layer if it knows that the remote receiver has buffer space to accept the transmitted TLP.
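A simplified sketch of the credit bookkeeping follows. It collapses everything into a single counter pair, whereas the flow control DLLPs listed later in this chapter (UpdateFC-P/NP/Cpl and friends) imply separate accounting per traffic category and per VC; all names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* One simplified credit counter. The receiver advertises credits via
 * FCx DLLPs; the transmitter sends a TLP only if enough credit remains.
 * Real hardware keeps separate counters per VC and per traffic type. */
struct fc_state {
    uint32_t credits_advertised; /* cumulative credits granted by receiver */
    uint32_t credits_consumed;   /* cumulative credits used by sent TLPs   */
};

/* Called when an FCx DLLP arrives with a new cumulative credit total. */
static void fc_update(struct fc_state *fc, uint32_t new_advertised)
{
    fc->credits_advertised = new_advertised;
}

/* Gate transmission: send only if the TLP fits in the remaining credit. */
static bool fc_may_transmit(const struct fc_state *fc, uint32_t tlp_credits)
{
    return fc->credits_consumed + tlp_credits <= fc->credits_advertised;
}

/* Account for a TLP that has just been handed to the Data Link Layer. */
static void fc_consume(struct fc_state *fc, uint32_t tlp_credits)
{
    fc->credits_consumed += tlp_credits;
}
```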

Figure 2-21. Flow Control Process

Quality of Service (QoS)

Consider Figure 2-22 on page 83 in which the video camera and SCSI device shown need to
transmit write request TLPs to system DRAM. The camera data is time critical isochronous
data which must reach memory with guaranteed bandwidth; otherwise, the displayed image will
appear choppy or unclear. The SCSI data is not as time sensitive and only needs to get to
system memory correctly without errors. It is clear that the video data packet should have
higher priority when routed through the PCI Express fabric, especially through switches. QoS
refers to the capability of routing packets from different applications through the fabric with
differentiated priorities and deterministic latencies and bandwidth. PCI and PCI-X systems do
not support QoS capability.

Figure 2-22. Example Showing QoS Capability of PCI Express


Consider this example. Application driver software in conjunction with the OS assigns the video
data packets a traffic class of 7 (TC7) and the SCSI data packet a traffic class of 0 (TC0).
These TC numbers are embedded in the TLP header. Configuration software uses TC/VC
mapping device configuration registers to map TC0 related TLPs to virtual channel 0 buffers
(VC0) and TC7 related TLPs to virtual channel 7 buffers (VC7).

As TLPs from these two applications (video and SCSI applications) move through the fabric,
the switches post incoming packets moving upstream into their respective VC buffers (VC0 and
VC7). The switch uses a priority based arbitration mechanism to determine which of the two
incoming packets to forward with greater priority to a common egress port. Assume VC7 buffer
contents are configured with higher priority than VC0. Whenever two incoming packets are to
be forwarded to one upstream port, the switch will always pick the VC7 packet, the video data,
over the VC0 packet, the SCSI data. This guarantees greater bandwidth and reduced latency
for video data compared to SCSI data.

A PCI Express device that implements more than one set of virtual channel buffers has the
ability to arbitrate between TLPs from different VC buffers. VC buffers have configurable
priorities. Thus traffic flowing through the system in different VC buffers will observe
differentiated performance. The arbitration mechanism between TLP traffic flowing through
different VC buffers is referred to as VC arbitration.

Also, multi-port switches have the ability to arbitrate between traffic coming in on two ingress
ports but using the same VC buffer resource on a common egress port. This configurable
arbitration mechanism between ports supported by switches is referred to as Port arbitration.

Traffic Classes (TCs) and Virtual Channels (VCs)

TC is a TLP header field transmitted within the packet unmodified end-to-end through the
fabric. Local application software and system software decide, based on performance requirements, what TC label a TLP uses. VCs are physical buffers that provide a means to support
multiple independent logical data flows over the physical Link via the use of transmit and
receiver virtual channel buffers.

PCI Express devices may implement up to 8 VC buffers (VC0-VC7). The TC field is a 3-bit field
that allows differentiation of traffic into 8 traffic classes (TC0-TC7). Devices must implement
VC0. Similarly, a device is required to support TC0 (best effort general purpose service class).
The other optional TCs may be used to provide differentiated service through the fabric.
Associated with each implemented VC ID, a transmit device implements a transmit buffer and a
receive device implements a receive buffer.

Devices or switches implement TC-to-VC mapping logic by which a TLP of a given TC number
is forwarded through the Link using a particular VC numbered buffer. PCI Express provides the
capability of mapping multiple TCs onto a single VC, thus reducing device cost by allowing a device to provide only a limited number of VC buffers. TC/VC mapping is configured by system
software through configuration registers. It is up to the device application software to determine
TC label for TLPs and TC/VC mapping that meets performance requirements. In its simplest
form TC/VC mapping registers can be configured with a one-to-one mapping of TC to VC.

Consider the example illustrated in Figure 2-23 on page 85. The TC/VC mapping registers in
Device A are configured to map TLPs with TC[2:0] to VC0 and TLPs with TC[7:3] to VC1. The TC/VC mapping registers in receiver Device B must be configured identically to those in Device A.
The same numbered VC buffers are enabled both in transmitter Device A and receiver Device
B.

Figure 2-23. TC Numbers and VC Buffers

If Device A needs to transmit a TLP with TC label of 7 and another packet with TC label of 0,
the two packets will be placed in VC1 and VC0 buffers, respectively. The arbitration logic
arbitrates between the two VC buffers. Assume VC1 buffer is configured with higher priority
than VC0 buffer. Thus, Device A will forward the TC7 TLPs in VC1 to the Link ahead of the TC0
TLPs in VC0.

When the TLPs arrive in Device B, the TC/VC mapping logic decodes the TC label in each TLP
and places the TLPs in their associated VC buffers.

In this example, TLP traffic with TC[7:3] label will flow through the fabric with higher priority
than TC[2:0] traffic. Within each TC group however, TLPs will flow with equal priority.
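The mapping used in this example can be expressed as a small lookup table; the sketch below assumes the TC[2:0]-to-VC0 and TC[7:3]-to-VC1 configuration described above and is not tied to any real register layout.

```c
#include <stdint.h>

#define NUM_TCS 8

/* TC-to-VC map configured by system software. The index is the TC label
 * carried in the TLP header; the value is the VC buffer to use. This
 * table reflects the example: TC0-TC2 -> VC0, TC3-TC7 -> VC1. */
static const uint8_t tc_to_vc[NUM_TCS] = { 0, 0, 0, 1, 1, 1, 1, 1 };

/* On transmit in Device A (and again on receive in Device B), select the
 * VC buffer for a TLP by looking up its traffic class. */
static uint8_t select_vc(uint8_t traffic_class)
{
    return tc_to_vc[traffic_class & (NUM_TCS - 1)];
}
```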

Port Arbitration and VC Arbitration

The goals of arbitration support in the Transaction Layer are:

To provide differentiated services between data flows within the fabric.

To provide guaranteed bandwidth with deterministic and minimal end-to-end transaction latency.

Packets of different TCs are routed through the fabric of switches with different priority based
on arbitration policy implemented in switches. Packets coming in from ingress ports heading
towards a particular egress port compete for use of that egress port.

Switches implement two types of arbitration for each egress port: Port Arbitration and VC
Arbitration. Consider Figure 2-24 on page 86.

Figure 2-24. Switch Implements Port Arbitration and VC Arbitration Logic

Port arbitration is arbitration between two packets arriving on different ingress ports but that
map to the same virtual channel (after going through TC-to-VC mapping) of the common egress
port. The port arbiter implements round-robin, weighted round-robin or programmable time-
based round-robin arbitration schemes selectable through configuration registers.

VC arbitration takes place after port arbitration. For a given egress port, packets from all VCs
compete for transmission on that port. VC arbitration resolves the order in which TLPs in different VC buffers are forwarded on to the Link. VC arbitration policies supported include
strict priority, round-robin and weighted round-robin arbitration schemes selectable through
configuration registers.

Independent of arbitration, each VC must observe transaction ordering and flow control rules
before it can make pending TLP traffic visible to the arbitration mechanism.

Endpoint devices and a root complex with only one port do not support port arbitration. They
only support VC arbitration in the Transaction Layer.
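A sketch of a strict-priority VC arbiter of the kind described above is shown below, using the two-VC configuration from the earlier example; the queue representation is invented for illustration.

```c
#include <stdbool.h>

#define NUM_VCS 2   /* VC0 and VC1, matching the earlier example */

/* Minimal view of a VC transmit buffer: all we need to know is whether a
 * TLP is pending that has already cleared ordering and flow control. */
struct vc_queue {
    bool has_eligible_tlp;
};

/* Strict-priority VC arbitration: scan from the highest-numbered
 * (highest-priority) VC downward and pick the first one with an eligible
 * TLP. Returns the VC to service next, or -1 if nothing is pending. */
static int vc_arbitrate(const struct vc_queue vcs[NUM_VCS])
{
    for (int vc = NUM_VCS - 1; vc >= 0; vc--)
        if (vcs[vc].has_eligible_tlp)
            return vc;
    return -1;
}
```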

Transaction Ordering

The PCI Express protocol implements the PCI/PCI-X compliant producer-consumer ordering model for transaction ordering, with a provision to support relaxed ordering similar to PCI-X architecture.
Transaction ordering rules guarantee that TLP traffic associated with a given traffic class is
routed through the fabric in the correct order to prevent potential deadlock or live-lock
conditions from occurring. Traffic associated with different TC labels has no ordering
relationship. Chapter 8, entitled "Transaction Ordering," on page 315 describes these ordering
rules.

The Transaction Layer ensures that TLPs for a given TC are ordered correctly with respect to
other TLPs of the same TC label before forwarding to the Data Link Layer and Physical Layer
for transmission.

Power Management

The Transaction Layer supports ACPI/PCI power management, as dictated by system software. Hardware within the Transaction Layer autonomously power manages a device to minimize power during full-on power states. This automatic power management is referred to as Active State Power Management and does not involve software. Power management software associated with the OS manages a device's power states through power management configuration registers. Power management is described in Chapter 16.

Configuration Registers

A device's configuration registers are associated with the Transaction Layer. The registers are
configured during initialization and bus enumeration. They are also configured by device drivers
and accessed by runtime software/OS. Additionally, the registers store negotiated Link
capabilities, such as Link width and frequency. Configuration registers are described in Part 6
of the book.

Data Link Layer


Refer to Figure 2-19 on page 79 for a block diagram of a device's Data Link Layer. The
primary function of the Data Link Layer is to ensure data integrity during packet transmission
and reception on each Link. If a transmitter device sends a TLP to a remote receiver device at
the other end of a Link and a CRC error is detected, the transmitter device is notified with a
NAK DLLP. The transmitter device automatically replays the TLP. This time hopefully no error
occurs. With error checking and automatic replay of packets received in error, PCI Express
ensures very high probability that a TLP transmitted by one device will make its way to the final
destination with no errors. This makes PCI Express ideal for low error rate, high-availability
systems such as servers.

Transmit Side

The Transaction Layer must observe the flow control mechanism before forwarding outbound
TLPs to the Data Link Layer. If sufficient credits exist, a TLP stored within the virtual channel
buffer is passed from the Transaction Layer to the Data Link Layer for transmission.

Consider Figure 2-25 on page 88 which shows the logic associated with the ACK-NAK
mechanism of the Data Link Layer. The Data Link Layer is responsible for TLP CRC generation
and TLP error checking. For outbound TLPs from transmit Device A, a Link CRC (LCRC) is
generated and appended to the TLP. In addition, a sequence ID is appended to the TLP. Device
A's Data Link Layer preserves a copy of the TLP in a replay buffer and transmits the TLP to
Device B. The Data Link Layer of the remote Device B receives the TLP and checks for CRC
errors.

Figure 2-25. Data Link Layer Replay Mechanism


If there is no error, the Data Link Layer of Device B returns an ACK DLLP with a sequence ID
to Device A. Device A has confirmation that the TLP has reached Device B (not necessarily the
final destination) successfully. Device A clears its replay buffer of the TLP associated with that
sequence ID.

If on the other hand a CRC error is detected in the TLP received at the remote Device B, then
a NAK DLLP with a sequence ID is returned to Device A. An error has occurred during TLP
transmission. Device A's Data Link Layer replays associated TLPs from the replay buffer. The
Data Link Layer generates error indications for error reporting and logging mechanisms.

In summary, the replay mechanism uses the sequence ID field within received ACK/NAK DLLPs
to associate them with outbound TLPs stored in the replay buffer. Reception of an ACK DLLP causes the replay buffer to clear the associated TLPs from the buffer. Reception of a NAK DLLP causes the replay buffer to replay the associated TLPs.

For a given TLP in the replay buffer, if the transmitter device receives a NAK 4 times and the
TLP is replayed 3 additional times as a result, then the Data Link Layer logs the error, reports
a correctable error, and re-trains the Link.
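A simplified sketch of the transmit-side bookkeeping follows: stored TLPs are keyed by sequence ID, an ACK retires them, and a NAK triggers retransmission. The 12-bit sequence arithmetic and the buffer organization are deliberately simplified, and the names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_MOD   4096u  /* sequence IDs are 12 bits wide       */
#define BUF_SLOTS 64     /* illustrative replay buffer capacity */

struct replay_entry {
    bool     valid;
    uint16_t seq;   /* sequence ID appended to the stored TLP */
    /* ... a copy of the TLP itself would be stored here ...  */
};

static struct replay_entry replay[BUF_SLOTS];

/* ACK with sequence 'seq': the receiver has that TLP, so its stored copy
 * can be discarded (an ACK also implicitly covers earlier outstanding
 * TLPs; that cumulative behavior is omitted here for brevity). */
static void on_ack(uint16_t seq)
{
    for (int i = 0; i < BUF_SLOTS; i++)
        if (replay[i].valid && replay[i].seq == (seq % SEQ_MOD))
            replay[i].valid = false;
}

/* NAK: a CRC error was detected by the receiver. Real hardware uses the
 * sequence ID in the NAK to decide exactly which stored TLPs to resend,
 * in order; this sketch simply replays every TLP still in the buffer. */
static void on_nak(void (*retransmit)(const struct replay_entry *))
{
    for (int i = 0; i < BUF_SLOTS; i++)
        if (replay[i].valid)
            retransmit(&replay[i]);
}
```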

Receive Side

The receive side of the Data Link Layer is responsible for LCRC error checking on inbound
TLPs. If no error is detected, the device schedules an ACK DLLP for transmission back to the
remote transmitter device. The receiver strips the TLP of the LCRC field and sequence ID.

If a CRC error is detected, the receiver schedules a NAK to return to the remote transmitter and the TLP is discarded.

The receive side of the Data Link Layer also receives ACKs and NAKs from a remote device. If
an ACK is received the receive side of the Data Link layer informs the transmit side to clear an
associated TLP from the replay buffer. If a NAK is received, the receive side causes the replay
buffer of the transmit side to replay associated TLPs.

The receive side is also responsible for checking the sequence ID of received TLPs to check
for dropped or out-of-order TLPs.

Data Link Layer Contribution to TLPs and DLLPs

The Data Link Layer concatenates a 12-bit sequence ID and 32-bit LCRC field to an outbound
TLP that arrives from the Transaction Layer. The resultant TLP is shown in Figure 2-26 on page
90. The sequence ID is used to associate a copy of the outbound TLP stored in the replay
buffer with a received ACK/NAK DLLP inbound from a neighboring remote device. The
ACK/NAK DLLP confirms arrival of the outbound TLP in the remote device.

Figure 2-26. TLP and DLLP Structure at the Data Link Layer

The 32-bit LCRC is calculated based on all bytes in the TLP including the sequence ID.

A DLLP shown in Figure 2-26 on page 90 is a 4 byte packet with a 16-bit CRC field. The 8-bit
DLLP Type field indicates various categories of DLLPs. These include: ACK, NAK, Power
Management related DLLPs (PM_Enter_L1, PM_Enter_L23, PM_Active_State_Request_L1,
PM_Request_Ack) and Flow Control related DLLPs (InitFC1-P, InitFC1-NP, InitFC1-Cpl,
InitFC2-P, InitFC2-NP, InitFC2-Cpl, UpdateFC-P, UpdateFC-NP, UpdateFC-Cpl). The 16-bit
CRC is calculated using all 4 bytes of the DLLP. Received DLLPs which fail the CRC check are
discarded. The loss of information from discarding a DLLP is self-repairing, in that a subsequent DLLP will supersede any information lost. ACK and NAK DLLPs contain a sequence
ID field (shown as Misc. field in Figure 2-26) used by the device to associate an inbound
ACK/NAK DLLP with a stored copy of a TLP in the replay buffer.
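The DLLP format just described can be pictured with a small structure; this is a logical sketch rather than a bit-exact wire layout, and the Start and End framing characters are added by the Physical Layer to give the 8 bytes seen on the Link.

```c
#include <stdint.h>

/* Logical view of a Data Link Layer Packet. On the wire the Physical
 * Layer adds a 1-byte Start and a 1-byte End framing character, giving
 * the 8 bytes seen on the Link: Start + 4-byte DLLP + 2-byte CRC + End. */
struct dllp {
    uint8_t  type;     /* ACK, NAK, PM_x, InitFC1/2-x, UpdateFC-x, ...  */
    uint8_t  misc[3];  /* type-specific contents, e.g. the sequence ID
                          carried by ACK and NAK DLLPs                  */
    uint16_t crc16;    /* CRC calculated over the 4 DLLP bytes          */
};
```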

Non-Posted Transaction Showing ACK-NAK Protocol


Next the steps required to complete a memory read request between a requester and a
completer on the far end of a switch are described. Figure 2-27 on page 91 shows the activity
on the Link to complete this transaction:

Step 1a. Requester transmits a memory read request TLP (MRd). Switch receives the
MRd TLP and checks for CRC error using the LCRC field in the MRd TLP.

Step 1b. If no error then switch returns ACK DLLP to requester. Requester discards copy
of the TLP from its replay buffer.

Step 2a. Switch forwards the MRd TLP to the correct egress port using memory address
for routing. Completer receives MRd TLP. Completer checks for CRC errors in received
MRd TLP using LCRC.

Step 2b. If no error then completer returns ACK DLLP to switch. Switch discards copy of
the MRd TLP from its replay buffer.

Step 3a. Completer checks for CRC error using the optional ECRC field in the MRd TLP. Assume
no End-to-End error. Completer returns a Completion with Data TLP (CplD) when it has
the requested data. Switch receives the CplD TLP and checks for CRC error using LCRC.

Step 3b. If no error then switch returns ACK DLLP to completer. Completer discards copy
of the CplD TLP from its replay buffer.

Step 4a. Switch decodes Requester ID field in CplD TLP and routes the packet to the
correct egress port. Requester receives CplD TLP. Requester checks for CRC errors in
received CplD TLP using LCRC.

Step 4b. If no error then requester returns ACK DLLP to switch. Switch discards copy of
the CplD TLP from its replay buffer. Requester checks for error in the CplD TLP using the
optional ECRC field. Assume no End-to-End error. Requester checks the completion
error code in the CplD. Assume a completion code of 'Successful Completion'. To
associate the completion with the original request, requester matches the tag in the CplD with
the original tag of the MRd request. Requester accepts data.

Figure 2-27. Non-Posted Transaction on Link


Posted Transaction Showing ACK-NAK Protocol

Below are the steps involved in completing a memory write request between a requester and a
completer on the far end of a switch. Figure 2-28 on page 92 shows the activity on the Link to
complete this transaction:

Step 1a. Requester transmits a memory write request TLP (MWr) with data. Switch
receives MWr TLP and checks for CRC error with LCRC field in the TLP.

Step 1b. If no error then switch returns ACK DLLP to requester. Requester discards copy
of the TLP from its replay buffer.

Step 2a. Switch forwards the MWr TLP to the correct egress port using memory address
for routing. Completer receives MWr TLP. Completer checks for CRC errors in received
MWr TLP using LCRC.

Step 2b. If no error then completer returns ACK DLLP to switch. Switch discards copy of
the MWr TLP from its replay buffer. Completer checks for CRC error using optional digest
field in MWr TLP. Assume no End-to-End error. Completer accepts data. There is no
completion associated with this transaction.

Figure 2-28. Posted Transaction on Link

Other Functions of the Data Link Layer


Following power-up or Reset, the flow control mechanism described earlier is initialized by the
Data Link Layer. This process is accomplished automatically at the hardware level and has no
software involvement.

Flow control for the default virtual channel VC0 is initialized first. In addition, when additional
VCs are enabled by software, the flow control initialization process is repeated for each newly
enabled VC. Since VC0 is enabled before all other VCs, no TLP traffic will be active prior to
initialization of VC0.

Physical Layer

Refer to Figure 2-19 on page 79 for a block diagram of a device's Physical Layer. Both TLP
and DLLP type packets are sent from the Data Link Layer to the Physical Layer for
transmission over the Link. Also, packets are received by the Physical Layer from the Link and
sent to the Data Link Layer.

The Physical Layer is divided in two portions, the Logical Physical Layer and the Electrical
Physical Layer. The Logical Physical Layer contains digital logic associated with processing
packets before transmission on the Link, or processing packets inbound from the Link before
sending to the Data Link Layer. The Electrical Physical Layer is the analog interface of the
Physical Layer that connects to the Link. It consists of differential drivers and receivers for each
Lane.

Transmit Side

TLPs and DLLPs from the Data Link Layer are clocked into a buffer in the Logical Physical
Layer. The Physical Layer frames the TLP or DLLP with a Start and an End character. Each of
these characters is a framing symbol which a receiver device uses to detect the start and end of a packet.
The Start and End characters are shown appended to a TLP and DLLP in Figure 2-29 on page
94. The diagram shows the size of each field in a TLP or DLLP.

Figure 2-29. TLP and DLLP Structure at the Physical Layer


The transmit logical sub-block conditions the received packet from the Data Link Layer into the
correct format for transmission. Packets are byte striped across the available Lanes on the
Link.

Each byte of a packet is then scrambled with the aid of a Linear Feedback Shift Register (LFSR)
based scrambler. By scrambling the bytes, repeated bit patterns on the Link are eliminated, thus
reducing the average EMI noise generated.
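A minimal sketch of such a scrambler is shown below. The 16-bit polynomial x^16 + x^5 + x^4 + x^3 + 1 and the 0xFFFF seed are the values commonly cited for 2.5 Gb/s PCI Express scrambling, but the bit ordering and the rules governing which symbols are scrambled are simplified here.

    # Minimal sketch of an additive LFSR scrambler (bit ordering simplified).
    def scramble(data: bytes, seed: int = 0xFFFF) -> bytes:
        lfsr = seed
        out = bytearray()
        for byte in data:
            scrambled = 0
            for bit in range(8):
                tap = (lfsr >> 15) & 1                            # scrambling bit from the LFSR
                scrambled |= (((byte >> bit) & 1) ^ tap) << bit   # XOR data bit with LFSR output
                # advance the Galois LFSR for x^16 + x^5 + x^4 + x^3 + 1
                lfsr = ((lfsr << 1) & 0xFFFF) ^ (0x0039 if tap else 0)
            out.append(scrambled)
        return bytes(out)

    # De-scrambling is the same operation with an identically seeded LFSR:
    # scramble(scramble(b"\x12\x34")) == b"\x12\x34"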

The resultant bytes are encoded into a 10b code by the 8b/10b encoding logic. The primary
purpose of encoding 8b characters to 10b symbols is to create sufficient 1-to-0 and 0-to-1
transition density in the bit stream to facilitate recreation of a receive clock with the aid of a PLL
at the remote receiver device. Note that data is not transmitted along with a clock. Instead, the
bit stream contains sufficient transitions to allow the receiver device to recreate a receive clock.

The parallel-to-serial converter generates a serial bit stream of the packet on each Lane and
transmits it differentially at 2.5 Gbits/s.

Receive Side

The receive Electrical Physical Layer clocks in a packet arriving differentially on all Lanes. The
serial bit stream of the packet is converted into a 10b parallel stream using the serial-to-parallel
converter. The receiver logic also includes an elastic buffer which compensates for clock
frequency variation between the transmit clock with which the packet bit stream was clocked
into the receiver and the receiver's local clock. The 10b symbol stream is decoded back to the 8b
representation of each symbol with the 8b/10b decoder. The 8b characters are de-scrambled.
The byte un-striping logic re-creates the original packet stream transmitted by the remote
device.

Link Training and Initialization

An additional function of the Physical Layer is Link initialization and training. Link initialization and
training is a Physical Layer controlled process that configures and initializes each Link for
normal operation. This process is automatic and does not involve software. The following are
determined during the Link initialization and training process:

Link width

Link data rate

Lane reversal

Polarity inversion

Bit lock per Lane

Symbol lock per Lane

Lane-to-Lane de-skew within a multi-Lane Link

Link width. Two devices with a different number of Lanes per Link may be connected. For
example, a device with a x2 port may be connected to a device with a x4 port. After initialization,
the Physical Layers of both devices determine and set the Link width to the minimum common
width of x2. Other negotiated Link behaviors include Lane reversal and the splitting of a port into multiple Links.

Lane reversal. Lane reversal is an optional feature. Lanes are numbered, and a board designer
may not wire the Lanes of two ports so that their numbers correspond. In that case, training
allows the Lane numbering to be reversed so that the Lane numbers of adjacent ports on each
end of the Link match up. Part of the same process may allow a multi-Lane Link to be split into
multiple Links.

Polarity inversion. The D+ and D- terminals of a differential pair connecting two devices may not
be wired correctly. In that case, the receiver detects this during training and reverses the polarity
on the differential receiver.

Link data rate. Training is completed at a data rate of 2.5 Gbit/s. In the future, higher data rates
of 5 Gbit/s and 10 Gbit/s will be supported. During training, each node advertises its highest
data rate capability. The Link is initialized with the highest common data rate that the devices at
opposite ends of the Link support.
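The outcome of the width and data rate negotiation can be summarized in a trivial sketch (the helper below is hypothetical; real negotiation is carried out by the Physical Layer training sequences, not by software):

    # Hypothetical summary of the negotiated Link width and data rate.
    def negotiate(width_a, width_b, rates_a, rates_b):
        link_width = min(width_a, width_b)              # e.g. x2 port + x4 port -> x2 Link
        link_rate = max(set(rates_a) & set(rates_b))    # highest common data rate
        return link_width, link_rate

    # negotiate(2, 4, [2.5], [2.5]) -> (2, 2.5)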

Lane-to-Lane De-skew. Due to Link wire length variations and different driver/receiver
characteristics on a multi-Lane Link, bit streams on each Lane will arrive at a receiver skewed
with respect to other Lanes. The receiver circuit must compensate for this skew by
adding/removing delays on each Lane. Relaxed routing rules allow Link wire lengths on the order
of 20"-30".

Link Power Management


The normal power-on operation of a Link is called the L0 state. Lower power states of the Link
in which no packets are transmitted or received are L0s, L1, L2 and L3 power states. The L0s
power state is automatically entered when a time-out occurs after a period of inactivity on the
Link. Entering and exiting this state does not involve software and the exit latency is the
shortest. L1 and L2 are lower power states than L0s, but exit latencies are greater. The L3
power state is the full off power state from which a device cannot generate a wake up event.

Reset

Two types of reset are supported:

Cold/warm reset, also called a Fundamental Reset, which occurs following a device being
powered on (cold reset) or due to a reset without cycling power (warm reset).

Hot reset, sometimes referred to as protocol reset, which is an in-band method of propagating
reset. Transmission of an ordered-set is used to signal a hot reset. Software initiates hot
reset generation.

On exit from reset (cold, warm, or hot), all state machines and configuration registers (hot
reset does not reset sticky configuration registers) are initialized.

Electrical Physical Layer

The transmitter of one device is AC coupled to the receiver of another device at the opposite
end of the Link as shown in Figure 2-30. The AC coupling capacitor is between 75 and 200 nF. The
transmitter DC common mode voltage is established during Link training and initialization. The
DC common mode impedance is typically 50 ohms while the differential impedance is 100 ohms
typical. This impedance is matched with a standard FR4 board.

Figure 2-30. Electrical Physical Layer Showing Differential Transmitter and Receiver

Example of a Non-Posted Memory Read Transaction
Let us now use what we have covered so far to describe the set of events that take place from the
time a requester device initiates a memory read request until it obtains the requested data from a
completer device. Given that such a transaction is a non-posted transaction, there are two
phases to the read process. The first phase is the transmission of a memory read request TLP
from requester to completer. The second phase is the reception of a completion with data from
the completer.

Memory Read Request Phase

Refer to Figure 2-31. The requester Device Core or Software Layer sends the following
information to the Transaction Layer:

Figure 2-31. Memory Read Request Phase

32-bit or 64-bit memory address, transaction type of memory read request, amount of data to
read expressed in doublewords, traffic class if other than TC0, byte enables, and attributes that
indicate whether the 'relaxed ordering' and 'no snoop' attribute bits should be set or clear.

The Transaction layer uses this information to build a MRd TLP. The exact TLP packet format is
described in a later chapter. A 3 DW or 4 DW header is created depending on address size
(32-bit or 64-bit). In addition, the Transaction Layer adds its requester ID (bus#, device#,
function#) and an 8-bit tag to the header. It sets the TD (transaction digest present) bit in the
TLP header if a 32-bit End-to-End CRC is added to the tail portion of the TLP. The TLP does
not have a data payload. The TLP is placed in the appropriate virtual channel buffer ready for
transmission. The flow control logic confirms there are sufficient "credits" available (obtained
from the completer device) for the virtual channel associated with the traffic class used.

Only then is the memory read request TLP sent to the Data Link Layer. The Data Link Layer
adds a 12-bit sequence ID and a 32-bit LCRC which is calculated based on the entire packet. A
copy of the TLP with sequence ID and LCRC is stored in the replay buffer.

This packet is forwarded to the Physical Layer, which frames the packet with a Start symbol and
an End symbol. The packet is byte striped across the available Lanes, scrambled and 8b/10b
encoded. Finally the packet is converted to a serial bit stream on all Lanes and transmitted
differentially across the Link to the neighboring completer device.
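The layering described in the last two paragraphs can be sketched as follows. Everything in this fragment is illustrative: the LCRC stand-in is a simple checksum rather than the real CRC-32, and the Start/End framing byte values are shown only as placeholders.

    # Illustrative composition of an outbound memory read request packet.
    def build_mrd_packet(header_3dw: bytes, seq_id: int) -> bytes:
        assert len(header_3dw) == 12                       # 3DW MRd header, no data payload
        seq = seq_id.to_bytes(2, "big")                    # 12-bit sequence ID carried in 2 bytes
        lcrc = (sum(seq + header_3dw) & 0xFFFFFFFF).to_bytes(4, "big")  # stand-in for the real LCRC-32
        dll_packet = seq + header_3dw + lcrc               # Data Link Layer adds sequence ID + LCRC
        STP, END = b"\xFB", b"\xFD"                        # Start/End framing bytes (values illustrative)
        return STP + dll_packet + END                      # Physical Layer frames the packet

    # build_mrd_packet(bytes(12), seq_id=5) -> 20-byte framed packet ready for striping/encoding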

The completer converts the incoming serial bit stream back to 10b symbols while assembling
the packet in an elastic buffer. The 10b symbols are converted back to bytes and the bytes
from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and
removed. The resultant TLP is sent to the Data Link Layer.

The completer Data Link Layer checks for LCRC errors in the received TLP and checks the
Sequence ID for missing or out-of-sequence TLPs. Assume no error. The Data Link Layer
creates an ACK DLLP which contains the same sequence ID as contained in the memory read
request TLP received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the
Physical Layer which transmits the ACK DLLP to the requester.

The requester Physical Layer reassembles the ACK DLLP and sends it up to the Data Link
Layer which evaluates the sequence ID and compares it with TLPs stored in the replay buffer.
The stored memory read request TLP associated with the ACK received is discarded from the
replay buffer. If a NAK DLLP was received by the requester instead, it would re-send a copy of
the stored memory read request TLP.

In the meantime, the Data Link Layer of the completer strips the sequence ID and LCRC field
from the memory read request TLP and forwards it to the Transaction Layer.

The Transaction Layer receives the memory read request TLP in the appropriate virtual channel
buffer associated with the TC of the TLP. The Transaction Layer checks for ECRC errors. It
forwards the contents of the header (address, requester ID, memory read transaction type,
amount of data requested, traffic class etc.) to the completer Device Core/Software Layer.

Completion with Data Phase

Refer to Figure 2-32 on page 99 during the following discussion. To service the memory read
request, the completer Device Core/Software Layer sends the following information to the
Transaction Layer:
Figure 2-32. Completion with Data Phase

Requester ID and Tag copied from the original memory read request, transaction type of
completion with data (CplD), the requested amount of data along with the data length field,
traffic class if other than TC0, and attributes that indicate whether the 'relaxed ordering' and
'no snoop' bits should be set or clear (these bits are copied from the original memory read
request). Finally, a completion status of Successful Completion (SC) is sent.

The Transaction layer uses this information to build a CplD TLP. The exact TLP packet format is
described in a later chapter. A 3 DW header is created. In addition, the Transaction Layer adds
its own completer ID to the header. The TD (transaction digest present) bit in the TLP header is
set if a 32-bit End-to-End CRC is added to the tail portion of the TLP. The TLP includes the
data payload. The flow control logic confirms sufficient "credits" are available (obtained from
the requester device) for the virtual channel associated with the traffic class used.

Only then is the CplD TLP sent to the Data Link Layer. The Data Link Layer adds a 12-bit
sequence ID and a 32-bit LCRC which is calculated based on the entire packet. A copy of the
TLP with sequence ID and LCRC is stored in the replay buffer.

This packet is forwarded to the Physical Layer, which frames the packet with a Start symbol and
an End symbol. The packet is byte striped across the available Lanes, scrambled and 8b/10b
encoded. Finally the CplD packet is converted to a serial bit stream on all Lanes and
transmitted differentially across the Link to the neighboring requester device.

The requester converts the incoming serial bit stream back to 10b symbols while assembling
the packet in an elastic buffer. The 10b symbols are converted back to bytes and the bytes
from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and
removed. The resultant TLP is sent to the Data Link Layer.

The Data Link Layer checks for LCRC errors in the received CplD TLP and checks the
Sequence ID for missing or out-of-sequence TLPs. Assume no error. The Data Link Layer
creates an ACK DLLP which contains the same sequence ID as contained in the CplD TLP
received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the Physical
Layer which transmits the ACK DLLP to the completer.

The completer Physical Layer reassembles the ACK DLLP and sends it up to the Data Link
Layer which evaluates the sequence ID and compares it with TLPs stored in the replay buffer.
The stored CplD TLP associated with the ACK received is discarded from the replay buffer. If a
NAK DLLP was received by the completer instead, it would re-send a copy of the stored CplD
TLP.

In the meantime, the requester Transaction Layer receives the CplD TLP in the appropriate
virtual channel buffer mapped to the TLP's TC. The Transaction Layer uses the tag in the header
of the CplD TLP to associate the completion with the original request. The Transaction Layer
checks for ECRC errors. It forwards the header contents and data payload, including the
Completion Status, to the requester Device Core/Software Layer. The memory read transaction is now complete.
Hot Plug
PCI Express supports native hot plug, though hot-plug support in a device is not mandatory.
Some of the elements found in a PCI Express hot plug system are:

Indicators which show the power and attention state of the slot.

Manually-operated Retention Latch (MRL) that holds add-in cards in place.

MRL Sensor that allows the port and system software to detect the MRL being opened.

Electromechanical Interlock which prevents removal of add-in cards while the slot is powered.

Attention Button that allows the user to request hot-plug operations.

Software User Interface that allows the user to request hot-plug operations.

Slot Numbering for visual identification of slots.

When a port has no connection or a removal event occurs, the port transmitter moves to the
electrical high impedance detect state. The receiver remains in the electrical low impedance
state.
PCI Express Performance and Data Transfer Efficiency
As of May 2003, no realistic performance and efficiency numbers were available. However,
Table 2-3 shows aggregate bandwidth numbers for various Link widths after factoring in the
overhead of 8b/10b encoding.

Table 2-3. PCI Express Aggregate Throughput for Various Link Widths

PCI Express Link Width               x1    x2    x4    x8    x12   x16   x32

Aggregate Bandwidth (GBytes/sec)     0.5   1     2     4     6     8     16

DLLPs are 2 doublewords in size. The ACK/NAK and flow control protocol utilize DLLPs, but it
is not expected that these DLLPs will use up a significant portion of the bandwidth.

The remainder of the bandwidth is available for TLPs. Between 6 and 7 doublewords of each TLP
are overhead associated with the Start and End framing symbols, sequence ID, TLP header,
ECRC and LCRC fields. The remainder of the TLP contains between 0 and 1024 doublewords of
data payload. It is apparent that bus efficiency is quite low when small packets are transmitted,
while efficiency is very high when TLPs contain large data payloads.
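These observations are easy to quantify with a rough sketch. The overhead figure below follows the 7 doubleword worst case just described, and the bandwidth formula reproduces the Table 2-3 numbers; both are estimates, not measurements.

    # Rough TLP efficiency estimate: payload vs. per-packet overhead.
    # Overhead assumed: 2 framing bytes + 2 sequence ID bytes + 16-byte (4DW)
    # header + 4-byte ECRC + 4-byte LCRC = 28 bytes, i.e. 7 DW.
    def tlp_efficiency(payload_dw: int, overhead_dw: int = 7) -> float:
        return payload_dw / (payload_dw + overhead_dw)

    # Each Lane carries 2.5 Gb/s per direction; 8b/10b leaves 2.0 Gb/s = 0.25 GB/s,
    # or 0.5 GB/s aggregate (both directions), matching Table 2-3.
    def aggregate_bandwidth_gbytes(link_width: int) -> float:
        return link_width * 2.5e9 * (8 / 10) / 8 / 1e9 * 2

    print(tlp_efficiency(1))               # ~0.125: a 1 DW payload is mostly overhead
    print(tlp_efficiency(1024))            # ~0.993: large payloads approach full efficiency
    print(aggregate_bandwidth_gbytes(16))  # 8.0 GB/s, as shown for a x16 Link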

Packets can be transmitted back-to-back without the Link going idle. Thus the Link can be
100% utilized.

The switch does not introduce any arbitration overhead when forwarding incoming packets from
multiple ingress ports to one egress port. However, it remains to be seen what effect the
Quality of Service protocol has on actual bandwidth numbers for a given application.

There is overhead associated with the split transaction protocol, especially for read
transactions. For a read request TLP, the data payload is contained in the completion. This
factor has to be accounted for when determining the effective performance of the bus. Posted
write transactions improve the efficiency of the fabric.

Switches support cut-through mode. That is, an incoming packet can be immediately
forwarded to an egress port for transmission without the switch having to buffer up the packet.
The forwarding latency through a switch can therefore be very small, allowing packets to travel
from one end of the PCI Express fabric to the other with very little delay.
Part Two: Transaction Protocol
Chapter 3. Address Spaces & Transaction Routing

Chapter 4. Packet-Based Transactions

Chapter 5. ACK/NAK Protocol

Chapter 6. QoS/TCs/VCs and Arbitration

Chapter 7. Flow Control

Chapter 8. Transaction Ordering

Chapter 9. Interrupts

Chapter 10. Error Detection and Handling


Chapter 3. Address Spaces & Transaction Routing
The Previous Chapter

This Chapter

The Next Chapter

Introduction

Two Types of Local Link Traffic

Transaction Layer Packet Routing Basics

Applying Routing Mechanisms

Plug-And-Play Configuration of Routing Options


The Previous Chapter
The previous chapter introduced the PCI Express data transfer protocol. It described the
layered approach to PCI Express device design while describing the function of each device
layer. Packet types employed in accomplishing data transfers were described without getting
into packet content details. Finally, it outlined the process of a requester initiating a
transaction such as a memory read to read data from a completer across a Link.
This Chapter
This chapter describes the general concepts of PCI Express transaction routing and the
mechanisms used by a device in deciding whether to accept, forward, or reject a packet
arriving at an ingress port. Because Data Link Layer Packets (DLLPs) and Physical Layer
ordered set link traffic are never forwarded, the emphasis here is on Transaction Layer Packet
(TLP) types and the three routing methods associated with them: address routing, ID routing,
and implicit routing. Included is a summary of configuration methods used in PCI Express to set
up PCI-compatible plug-and-play addressing within system IO and memory maps, as well as
key elements in the PCI Express packet protocol used in making routing decisions.
The Next Chapter
The next chapter details the two major classes of packets: Transaction Layer Packets
(TLPs) and Data Link Layer Packets (DLLPs). The use and format of each TLP and DLLP
packet type is covered, along with definitions of the fields within the packets.
Introduction
Unlike shared-bus architectures such as PCI and PCI-X, where traffic is visible to each device
and routing is mainly a concern of bridges, PCI Express devices are dependent on each other
to accept traffic or forward it in the direction of the ultimate recipient.

As illustrated in Figure 3-1 on page 106, a PCI Express topology consists of independent,
point-to-point links connecting each device with one or more neighbors. As traffic arrives at the
inbound side of a link interface (called the ingress port), the device checks for errors, then
makes one of three decisions:

1. Accept the traffic and use it internally.

2. Forward the traffic to the appropriate outbound (egress) port.

3. Reject the traffic because it is neither the intended target nor an interface to it (note that
there are also other reasons why traffic may be rejected).

Figure 3-1. Multi-Port PCI Express Devices Have Routing Responsibilities


Receivers Check For Three Types of Link Traffic

Assuming a link is fully operational, the physical layer receiver interface of each device is
prepared to monitor the logical idle condition and detect the arrival of the three types of link
traffic: Ordered Sets, DLLPs, and TLPs. Using control (K) symbols which accompany the traffic
to determine framing boundaries and traffic type, PCI Express devices then make a distinction
between traffic which is local to the link vs. traffic which may require routing to other links (e.g.
TLPs). Local link traffic, which includes Ordered Sets and Data Link Layer Packets (DLLPs),
isn't forwarded and carries no routing information. Transaction Layer Packets (TLPs) can and
do move from link to link, using routing information contained in the packet headers.

Multi-port Devices Assume the Routing Burden

It should be apparent in Figure 3-1 on page 106 that devices with multiple PCI Express ports
are responsible for handling their own traffic as well as forwarding other traffic between ingress
ports and any enabled egress ports. Also note that while peer-to-peer transaction support is
required of switches, it is optional for a multi-port Root Complex. It is up to the system designer
to account for peer-to-peer traffic when selecting devices and laying out a motherboard.

Endpoints Have Limited Routing Responsibilities

It should also be apparent in Figure 3-1 on page 106 that endpoint devices have a single link
interface and lack the ability to route inbound traffic to other links. For this reason, and because
they don't reside on shared busses, endpoints never expect to see ingress port traffic which is
not intended for them (this is different from shared-bus PCI(X), where devices commonly
decode addresses and commands not targeting them). Endpoint routing is limited to accepting
or rejecting transactions presented to them.

System Routing Strategy Is Programmed

Before transactions can be generated by a requester, accepted by the completer, and
forwarded by any devices in the path between the two, all devices must be configured to
enforce the system transaction routing scheme. Routing is based on traffic type, system
memory and IO address assignments, etc. In keeping with PCI plug-and-play configuration
methods, each PCI Express device is discovered, memory and IO address resources are
assigned to it, and switch/bridge devices are programmed to forward transactions on its
behalf. Once routing is programmed, bus mastering and target address decoding are enabled.
Thereafter, devices are prepared to generate, accept, forward, or reject transactions as
necessary.
Two Types of Local Link Traffic
Local traffic occurs between the transmit interface of one device and the receive interface of its
neighbor for the purpose of managing the link itself. This traffic is never forwarded or flow
controlled; when sent, it must be accepted. Local traffic is further classified as Ordered Sets
exchanged between the Physical Layers of two devices on a link or Data Link Layer packets
(DLLPs) exchanged between the Data Link Layers of the two devices.

Ordered Sets

These are sent by each physical layer transmitter to the physical layer of the corresponding
receiver to initiate link training, compensate for clock tolerance, or transition a link to and from
the Electrical Idle state. As indicated in Table 3-1 on page 109, there are five types of Ordered
Sets.

Each ordered set is constructed of 10-bit control (K) symbols that are created within the
physical layer. These symbols have a common name as well as an alphanumeric code that
defines the 10-bit pattern of 1s and 0s of which they are composed. For example, the SKP
(Skip) symbol has a 10-bit value represented as K28.0.

Figure 3-2 on page 110 illustrates the transmission of Ordered Sets. Note that each ordered
set is fixed in size, consisting of 4 or 16 characters. Again, the receiver is required to consume
them as they are sent. Note that the COM control symbol (K28.5) is used to indicate the start
of any ordered set.

Figure 3-2. PCI Express Link Local Traffic: Ordered Sets


Refer to the "8b/10b Encoding" on page 419 for a thorough discussion of Ordered Sets.

Table 3-1. Ordered Set Types

Fast Training Sequence (FTS): COM, 3 FTS. Quick synchronization of the bit stream when leaving the L0s power state.

Training Sequence One (TS1): COM, Lane ID, 14 more. Used in link training to align and synchronize the incoming bit stream at startup, convey reset, and other functions.

Training Sequence Two (TS2): COM, Lane ID, 14 more. See TS1.

Electrical Idle (IDLE): COM, 3 IDL. Indicates that the link should be brought to a lower power state (L0s, L1, L2).

Skip (SKP): COM, 3 SKP. Inserted periodically to compensate for clock tolerances.

Data Link Layer Packets (DLLPs)


The other type of local traffic sent by a device's transmit interface to the corresponding receiver
of the device attached to it is the Data Link Layer Packet (DLLP). DLLPs are also used in link
management, although they are sourced at the device's Data Link Layer rather than the Physical
Layer. The main functions of DLLPs are to facilitate Link Power Management, TLP Flow
Control, and the acknowledgement of successful TLP delivery across the link.

Table 3-2. Data Link Layer Packet (DLLP) Types

Acknowledge (Ack): Receiver Data Link Layer sends an Ack to indicate that no CRC or other errors have been encountered in received TLP(s). The transmitter retains a copy of each TLP until it is Ack'd.

No Acknowledge (Nak): Receiver Data Link Layer sends a Nak to indicate that a TLP was received with a CRC or other error. All TLPs remaining in the transmitter's Retry Buffer must be resent, in the original order.

PM_Enter_L1; PM_Enter_L23: Following a software configuration space access that causes a device power management event, a downstream device requests entry to the link L1 or Level 2-3 state.

PM_Active_State_Req_L1: Downstream device autonomously requests the L1 Active State.

PM_Request_Ack: Upstream device acknowledges the transition to the L1 state.

Vendor-Specific DLLP: Reserved for vendor-specific purposes.

InitFC1-P, InitFC1-NP, InitFC1-Cpl: Flow Control Initialization Type One DLLPs awarding posted (P), non-posted (NP), or completion (Cpl) flow control credits.

InitFC2-P, InitFC2-NP, InitFC2-Cpl: Flow Control Initialization Type Two DLLPs confirming the award of InitFC1 posted (P), non-posted (NP), or completion (Cpl) flow control credits.

UpdateFC-P, UpdateFC-NP, UpdateFC-Cpl: Flow Control Credit Update DLLPs awarding posted (P), non-posted (NP), or completion (Cpl) flow control credits.

As described in Table 3-2 on page 111 and shown in Figure 3-3 on page 112, there are three
major types of DLLPs: Ack/Nak, Power Management (several variants), and Flow Control. In
addition, a vendor-specific DLLP is permitted in the specification. Each DLLP is 8 bytes,
including a Start Of DLLP (SDP) byte, 2-byte CRC, and an End Of Packet (END) byte in
addition to the 4 byte DLLP core (which includes the type field and any required attributes).

Figure 3-3. PCI Express Link Local Traffic: DLLPs


Note that unlike Ordered Sets, DLLPs always carry a 16-bit CRC which is verified by the
receiver before carrying out the required operation. If an error is detected by the receiver of a
DLLP, it is dropped. Even though DLLPs are not acknowledged, time-out mechanisms built into
the specification permit recovery from dropped DLLPs due to CRC errors.

Refer to "Data Link Layer Packets" on page 198 for a thorough discussion of Data Link Layer
packets.
Transaction Layer Packet Routing Basics
The third class of link traffic originates in the Transaction Layer of one device and targets the
Transaction Layer of another device. These Transaction Layer Packets (TLPs) are forwarded
from one link to another as necessary, subject to the routing mechanisms and rules described in
the following sections. Note that other chapters in this book describe additional aspects of
Transaction Layer Packet handling, including Flow Control, Quality Of Service, Error Handling,
Ordering rules, etc. The term transaction is used here to describe the exchange of information
using Transaction Layer Packets. Because Ordered Sets and DLLPs carry no routing
information and are not forwarded, the routing rules described in the following sections apply
only to TLPs.

TLPs Used to Access Four Address Spaces

As transactions are carried out between PCI Express requesters and completers, four
separate address spaces are used: Memory, IO, Configuration, and Message. The basic use
of each address space is described in Table 3-3 on page 113.

Table 3-3. PCI Express Address Space And Transaction Types

Memory: Read, Write. Transfer data to or from a location in the system memory map.

IO: Read, Write. Transfer data to or from a location in the system IO map.

Configuration: Read, Write. Transfer data to or from a location in the configuration space of a PCI-compatible device.

Message: Baseline, Vendor-specific. General in-band messaging and event reporting (without consuming memory or IO address resources).

Split Transaction Protocol Is Used

Accesses to the four address spaces in PCI Express are accomplished using split-transaction
requests and completions.
Split Transactions: Better Performance, More Overhead

The split transaction protocol is an improvement over earlier bus protocols (e.g. PCI) which
made extensive use of bus wait-states or delayed transactions (retries) to deal with latencies in
accessing targets. In PCI Express, the completion following a request is initiated by the
completer only when it has data and/or status ready for delivery. The fact that the completion is
separated in time from the request which caused it also means that two separate TLPs are
generated, with independent routing for the request TLP and the Completion TLP. Note that
while a link is free for other activity in the time between a request and its subsequent
completion, a split-transaction protocol involves some additional overhead as two complete
TLPs must be generated to carry out a single transaction.

Figure 3-4 on page 115 illustrates the request-completion phases of a PCI Express split
transaction. This example represents an endpoint read from system memory.

Figure 3-4. PCI Express Transaction Request And Completion TLPs

Write Posting: Sometimes a Completion Isn't Needed

To mitigate the penalty of the request-completion latency, messages and some write
transactions in PCI Express are posted, meaning the write request (including data) is sent, and
the transaction is over from the requester's perspective as soon as the request is sent out of
the egress port; responsibility for delivery is now the problem of the next device. In a multi-level
topology, this has the advantage of being much faster than waiting for the entire request-
completion transit, but, as in all posting schemes, uncertainty exists concerning when (and if)
the transaction completed successfully at the ultimate recipient.

In PCI Express, write posting to memory is considered acceptable in exchange for the higher
performance. On the other hand, writes to IO and configuration space may change device
behavior, and write posting is not permitted. A completion will always be sent to report status of
the IO or configuration write operation.

Table 3-4 on page 116 lists PCI Express posted and non-posted transactions.

Table 3-4. PCI Express Posted and Non-Posted Transactions

Memory Write: All Memory Write requests are posted. No completion is expected or sent.

Memory Read, Memory Read Lock: All memory read requests are non-posted. A completion with data (CplD or CplDLk) will be returned by the completer with the requested data and to report the status of the memory read.

IO Write: All IO Write requests are non-posted. A completion without data (Cpl) will be returned by the completer to report the status of the IO write operation.

IO Read: All IO read requests are non-posted. A completion with data (CplD) will be returned by the completer with the requested data and to report the status of the IO read operation.

Configuration Write (Type 0 and Type 1): All Configuration Write requests are non-posted. A completion without data (Cpl) will be returned by the completer to report the status of the configuration space write operation.

Configuration Read (Type 0 and Type 1): All configuration read requests are non-posted. A completion with data (CplD) will be returned by the completer with the requested data and to report the status of the read operation.

Message, Message With Data: While the routing method varies, all message transactions are handled in the same manner as memory writes in that they are considered posted requests.

Three Methods of TLP Routing


All of the TLP variants, targeting any of the four address spaces, are routed using one of the
three possible schemes: Address Routing, ID Routing, and Implicit Routing. Table 3-5 on page
117 summarizes the PCI Express TLP header type variants and the routing method used for
each. Each of these is described in the following sections.

Table 3-5. PCI Express TLP Variants And Routing Options

Memory Read (MRd), Memory Read Lock (MRdLk), Memory Write (MWr): Address Routing

IO Read (IORd), IO Write (IOWr): Address Routing

Configuration Read Type 0 (CfgRd0), Configuration Read Type 1 (CfgRd1), Configuration Write Type 0 (CfgWr0), Configuration Write Type 1 (CfgWr1): ID Routing

Message (Msg), Message With Data (MsgD): Address Routing, ID Routing, or Implicit Routing

Completion (Cpl), Completion With Data (CplD): ID Routing

PCI Express Routing Is Compatible with PCI

As indicated in Table 3-5 on page 117, memory and IO transactions are routed through the PCI
Express topology using address routing to reference system memory and IO maps, while
configuration cycles use ID routing to reference the completer's (target's) logical position within
the PCI-compatible bus topology (using Bus Number, Device Number, Function Number in place
of a linear address). Both address routing and ID routing are completely compatible with routing
methods used in the PCI and PCI-X protocols when performing memory, IO, or configuration
transactions. PCI Express completions also use the ID routing scheme.

PCI Express Adds Implicit Routing for Messages

PCI Express adds the third routing method, implicit routing, which is an option when sending
messages. In implicit routing, neither address nor ID routing information applies; the packet is
routed based on a code in the packet header indicating it is destined for device(s) with known,
fixed locations (the Root Complex, the next receiver, etc.).

While limited in the cases it can support, implicit routing simplifies routing of messages. Note
that messages may optionally use address or ID routing instead.

Why Were Messages Added to PCI Express Protocol?


PCI and PCI-X protocols support load and store memory and IO read-write transactions, which
have the following features:

1. The transaction initiator drives out a memory or IO start address selecting a location within
the desired target.

2. The target claims the transaction based on decoding and comparing the transaction start
address with ranges it has been programmed to respond to in its configuration space Base
Address Registers.

3. If the transaction involves bursting, then addresses are indexed after each data transfer.

While PCI Express also supports load and store transactions with its memory and IO
transactions, it adds in-band messages. The main reason for this is that the PCI Express
protocol seeks to (and does) eliminate many of the sideband signals related to interrupts, error
handling, and power management which are found in PCI(X)-based systems. Elimination of
signals is very important in an architecture with the scalability possible with PCI Express. It
would not be efficient to design a PCI Express device with a two lane link and then saddle it
with numerous additional signals to handle auxiliary functions.

The PCI Express protocol replaces most sideband signals with a variety of in-band packet
types; some of these are conveyed as Data Link Layer packets (DLLPs) and some as
Transaction Layer packets (TLPs).

How Implicit Routing Helps with Messages

One side effect of using in-band messages in place of hard-wired sideband signals is the
problem of delivering the message to the proper recipient in a topology consisting of numerous
point-to-point links. The PCI Express protocol provides maximum flexibility in routing message
TLPs; they may use address routing, ID routing, or the third method, implicit routing. Implicit
routing takes advantage of the fact that, due to their architecture, switches and other multi-port
devices have a fundamental sense of upstream and downstream, and where the Root Complex
is to be found. Because of this, a message header can be routed implicitly with a simple code
indicating that it is intended for the Root Complex, a broadcast downstream message, should
terminate at the next receiver, etc.

The advantage of implicit routing is that it eliminates the need to assign a set of memory
mapped addresses for all of the possible message variants and program all of the devices to
use them.

Header Fields Define Packet Format and Routing


As depicted in Figure 3-5 on page 119, each Transaction Layer Packet contains a three or four
double word (12 or 16 byte) header. Included in the 3DW or 4DW header are two fields, Type
and Format (Fmt), which define the format of the remainder of the header and the routing
method to be used on the entire TLP as it moves between devices in the PCI Express topology.

Figure 3-5. Transaction Layer Packet Generic 3DW And 4DW Headers

Using TLP Header Information: Overview

General

As TLPs arrive at an ingress port, they are first checked for errors at both the physical and
data link layers of the receiver. Assuming there are no errors, TLP routing is performed; basic
steps include:

1. The TLP header Type and Format fields in the first DWord are examined to determine the
size and format of the remainder of the packet.

2. Depending on the routing method associated with the packet, the device determines if it is
the intended recipient; if so, it accepts (consumes) the TLP. If it is not the recipient, and it is a
multi-port device, it forwards the TLP to the appropriate egress port, subject to the rules for
ordering and flow control for that egress port.

3. If it is neither the intended recipient nor a device in the path to it, it will generally reject the
packet as an Unsupported Request (UR).

Header Type/Format Field Encodings

Table 3-6 on page 120 below summarizes the encodings used in TLP header Type and Format
fields. These two fields, used together, indicate TLP format and routing to the receiver.

Table 3-6. TLP Header Type and Format Field Encodings

Memory Read Request (MRd): FMT[1:0] = 00 (3DW, no data) or 01 (4DW, no data); TYPE[4:0] = 0 0000

Memory Read Lock Request (MRdLk): FMT[1:0] = 00 (3DW, no data) or 01 (4DW, no data); TYPE[4:0] = 0 0001

Memory Write Request (MWr): FMT[1:0] = 10 (3DW, w/ data) or 11 (4DW, w/ data); TYPE[4:0] = 0 0000

IO Read Request (IORd): FMT[1:0] = 00 (3DW, no data); TYPE[4:0] = 0 0010

IO Write Request (IOWr): FMT[1:0] = 10 (3DW, w/ data); TYPE[4:0] = 0 0010

Config Type 0 Read Request (CfgRd0): FMT[1:0] = 00 (3DW, no data); TYPE[4:0] = 0 0100

Config Type 0 Write Request (CfgWr0): FMT[1:0] = 10 (3DW, w/ data); TYPE[4:0] = 0 0100

Config Type 1 Read Request (CfgRd1): FMT[1:0] = 00 (3DW, no data); TYPE[4:0] = 0 0101

Config Type 1 Write Request (CfgWr1): FMT[1:0] = 10 (3DW, w/ data); TYPE[4:0] = 0 0101

Message Request (Msg): FMT[1:0] = 01 (4DW, no data); TYPE[4:0] = 1 0RRR (for RRR, see routing subfield)

Message Request W/Data (MsgD): FMT[1:0] = 11 (4DW, w/ data); TYPE[4:0] = 1 0RRR (for RRR, see routing subfield)

Completion (Cpl): FMT[1:0] = 00 (3DW, no data); TYPE[4:0] = 0 1010

Completion W/Data (CplD): FMT[1:0] = 10 (3DW, w/ data); TYPE[4:0] = 0 1010

Completion-Locked (CplLk): FMT[1:0] = 00 (3DW, no data); TYPE[4:0] = 0 1011

Completion-Locked W/Data (CplDLk): FMT[1:0] = 10 (3DW, w/ data); TYPE[4:0] = 0 1011
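The encodings above can be folded into a small decoding sketch. The helper below is illustrative only; it classifies a TLP's routing method from its Type field (the Fmt field distinguishes 3DW/4DW headers and data presence and is not needed to pick the routing method).

    # Sketch: classify a TLP's routing method from its Type field (see Table 3-6).
    def routing_method(type_field: int) -> str:
        if type_field >> 3 == 0b10:             # 1 0RRR: Message; routing subfield decides
            rrr = type_field & 0b111
            # 001b/010b select address/ID routing; other codes are treated as implicit here
            return {0b001: "address", 0b010: "id"}.get(rrr, "implicit")
        if type_field in (0b00000, 0b00001):    # MRd/MWr and MRdLk
            return "address"
        if type_field == 0b00010:               # IORd/IOWr
            return "address"
        if type_field in (0b00100, 0b00101):    # CfgRd0/CfgWr0 and CfgRd1/CfgWr1
            return "id"
        if type_field in (0b01010, 0b01011):    # Cpl/CplD and CplLk/CplDLk
            return "id"
        raise ValueError("unrecognized Type field")

    # routing_method(0b00000) -> "address"   (memory request)
    # routing_method(0b10000) -> "implicit"  (message routed to the Root Complex)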


Applying Routing Mechanisms
Once configuration of the system routing strategy is complete and transactions are enabled,
PCI Express devices decode inbound TLP headers and use corresponding fields in
configuration space Base Address Registers, Base/Limit registers, and Bus Number registers
to apply address, ID, and implicit routing to the packet. Note that there are actually two levels
of decision: the device first determines if the packet targets an internal location; if not, and the
device is a switch, it will evaluate the packet to see if it should be forwarded out of an egress
port. A third possibility is that the packet has been received in error or is malformed; in this
case, it will be handled as a receive error. There are a number of cases when this may happen,
and a number of ways it may be handled. Refer to "PCI Express Error Checking Mechanisms"
on page 356 for a description of error checking and handling. The following sections describe
the basic features of each routing mechanism; we will assume no errors are encountered.

Address Routing

PCI Express transactions using address routing reference the same system memory and IO
maps that PCI and PCI-X transactions do. Address routing is used to transfer data to or from
memory, memory mapped IO, or IO locations. Memory transaction requests may carry either
32 bit addresses using the 3DW TLP header format, or 64 bit addresses using the 4DW TLP
header format. IO transaction requests are restricted to 32 bits of address using the 3DW TLP
header format, and should only target legacy devices.

Memory and IO Address Maps

Figure 3-6 on page 122 depicts generic system memory and IO maps. Note that the size of the
system memory map is a function of the range of addresses that devices are capable of
generating (often dictated by the CPU address bus). As in PCI and PCI-X, PCI Express permits
either 32 bit or 64 bit memory addressing. The size of the system IO map is limited to 32 bits
(4GB), although in many systems only the lower 16 bits (64KB) are used.

Figure 3-6. Generic System Memory And IO Address Maps


Key TLP Header Fields in Address Routing

If the Type field in a received TLP indicates address routing is to be used, then the Address
Fields in the header are used to perform the routing check. There are two cases: 32-bit
addresses and 64-bit addresses.

TLPs with 3DW, 32-Bit Address

For IO or 32-bit memory requests, only 32 bits of address are contained in the header.
Devices targeted by these TLPs will reside below the 4GB memory or IO address boundary.
Figure 3-7 on page 123 depicts this case.

Figure 3-7. 3DW TLP Header Address Routing Fields


TLPs With 4DW, 64-Bit Address

For 64-bit memory requests, 64 bits of address are contained in the header. Devices targeted
with these TLPs will reside above the 4GB memory boundary. Figure 3-8 on page 124 shows
this case.

Figure 3-8. 4DW TLP Header Address Routing Fields


An Endpoint Checks an Address-Routed TLP

If the Type field in a received TLP indicates address routing is to be used, then an endpoint
device simply checks the address in the packet header against each of its implemented BARs
in its Type 0 configuration space header. As it has only one link interface, it will either claim the
packet or reject it. Figure 3-9 on page 125 illustrates this case.
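A sketch of that check is shown below; the BAR ranges are hypothetical values chosen for the example.

    # Sketch: an endpoint accepts an address-routed TLP only if the address
    # falls inside one of its programmed BAR ranges (ranges shown are illustrative).
    def endpoint_claims(addr: int, bars: list[tuple[int, int]]) -> bool:
        return any(base <= addr < base + size for base, size in bars)

    bars = [(0xF000_0000, 0x1000), (0xF100_0000, 0x10000)]   # hypothetical BAR allocations
    endpoint_claims(0xF000_0040, bars)   # True  -> consume the TLP
    endpoint_claims(0x8000_0000, bars)   # False -> reject (Unsupported Request)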

Figure 3-9. Endpoint Checks Routing Of An Inbound TLP Using Address Routing

A Switch Receives an Address Routed TLP: Two Checks

General

If the Type field in a received TLP indicates address routing is to be used, then a switch first
checks to see if it is the intended completer. It compares the header address against target
addresses programmed in its two BARs. If the address falls within the range, it consumes the
packet. This case is indicated by (1) in Figure 3-10 on page 126. If the header address field
does not match a range programmed in a BAR, it then checks the Type 1 configuration space
header for each downstream link. It checks the non-prefetchable memory (MMIO) and
prefetchable memory Base/Limit registers if the transaction targets memory, or the I/O Base and Limit
registers if the transaction targets I/O address space. This check is indicated by (2) in Figure
3-10 on page 126.

Figure 3-10. Switch Checks Routing Of An Inbound TLP Using Address Routing

Other Notes About Switch Address-Routing

The following notes also apply to switch address routing:

1. If the address-routed packet's address falls in the range of one of its secondary
bridge interface Base/Limit register sets, it will forward the packet downstream.

2. If the address-routed packet was moving downstream (was received on the primary
interface) and it does not map to any BAR or downstream link Base/Limit registers, it will be
handled as an unsupported request on the primary link.

3. Upstream address-routed packets are always forwarded to the upstream link if they do not
target an internal location or another downstream link.
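The two-stage check can be sketched as follows. The function and register names are simplified stand-ins for the BARs and the Base/Limit registers held in each downstream Type 1 header.

    # Sketch of the two-stage address check a switch performs (see Figure 3-10).
    def switch_route_address(addr, own_bars, downstream_ranges, arrived_on_primary):
        if any(base <= addr < base + size for base, size in own_bars):
            return "consume"                       # (1) the switch itself is the completer
        for port, (base, limit) in downstream_ranges.items():
            if base <= addr <= limit:
                return f"forward to {port}"        # (2) falls in a secondary Base/Limit range
        if arrived_on_primary:
            return "unsupported request"           # downstream-moving packet with no match
        return "forward upstream"                  # upstream packets default to the upstream link

    downstream_ranges = {"port 1": (0xF000_0000, 0xF0FF_FFFF)}   # hypothetical assignment
    switch_route_address(0xF001_0000, [], downstream_ranges, arrived_on_primary=True)
    # -> 'forward to port 1'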

ID Routing

ID routing is based on the logical position (Bus Number, Device Number, Function Number) of a
device function within the PCI bus topology. ID routing is compatible with routing methods used
in the PCI and PCI-X protocols when performing Type 0 or Type 1 configuration transactions. In
PCI Express, it is also used for routing completions and may be used in message routing as
well.
ID Bus Number, Device Number, Function Number Limits

PCI Express supports the same basic topology limits as PCI and PCI-X:

1. A maximum of 256 busses/links in a system. This includes busses created by bridges to
other PCI-compatible protocols such as PCI, PCI-X, AGP, etc.

2. A maximum of 32 devices per bus/link. While a PCI(X) bus or the internal bus of a
switch may host more than one downstream bridge interface, external PCI Express links are
always point-to-point with only two devices per link. The downstream device on an external link
is device 0.

3. A maximum of 8 internal functions per device.

A significant difference in PCI Express over PCI is the provision for extending the amount of
configuration space per function from 256 bytes to 4KB. Refer to the "Configuration Overview"
on page 711 for a detailed description of the compatible and extended areas of PCI Express
configuration space.

Key TLP Header Fields in ID Routing

If the Type field in a received TLP indicates ID routing is to be used, then the ID fields in the
header are used to perform the routing check. There are two cases: ID routing with a 3DW
header and ID routing with a 4DW header.

3DW TLP, ID Routing

Figure 3-11 on page 128 illustrates a TLP using ID routing and the 3DW header.

Figure 3-11. 3DW TLP Header ID Routing Fields


4DW TLP, ID Routing

Figure 3-12 on page 129 illustrates a TLP using ID routing and the 4DW header.

Figure 3-12. 4DW TLP Header ID Routing Fields


An Endpoint Checks an ID-Routed TLP

If the Type field in a received TLP indicates ID routing is to be used, then an endpoint device
simply checks the ID field in the packet header against its own Bus Number, Device Number,
and Function Number(s). In PCI Express, each device "captures" (and remembers) its own Bus
Number and Device Number contained in TLP header bytes 8-9 each time a configuration write
(Type 0) is detected on its primary link. At reset, all bus and device numbers in the system
revert to 0, so a device will not respond to transactions other than configuration cycles until at
least one configuration write cycle (Type 0) has been performed. Note that the PCI Express
protocol does not define a configuration space location where the device function is required to
store the captured Bus Number and Device Number information, only that it must do it.

Once again, as it has only one link interface, an endpoint will either claim an ID-routed packet
or reject it. Figure 3-11 on page 128 illustrates this case.
A Switch Receives an ID-Routed TLP: Two Checks

If the Type field in a received TLP indicates ID routing is to be used, then a switch first checks
to see if it is the intended completer. It compares the header ID field against its own Bus
Number, Device Number, and Function Number(s). This is indicated by (1) in Figure 3-13 on
page 131. As in the case of an endpoint, a switch captures its own Bus Number and Device
number each time a configuration write (Type 0) is detected on its primary link interface. If the
header ID agrees with the ID of the switch, it consumes the packet. If the ID field does not
match its own, it then checks the Secondary-Subordinate Bus Number registers in the
configuration space for each downstream link. This check is indicated by (2) in Figure 3-13 on
page 131.

Figure 3-13. Switch Checks Routing Of An Inbound TLP Using ID Routing

Other Notes About Switch ID Routing

1. If the ID-routed packet matches the range of one of its secondary bridge interface
Secondary-Subordinate registers, it will forward the packet downstream.

2. If the ID-routed packet was moving downstream (was received on the primary interface) and
it does not map to any downstream interface, it will be handled as an unsupported request on
the primary link.

3. Upstream ID-routed packets are always forwarded to the upstream link if they do not target
an internal location or another downstream link.
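A corresponding sketch for ID routing at a switch is shown below; bus number assignments are hypothetical, and the function number comparison is omitted for brevity.

    # Sketch of ID routing at a switch: compare against the switch's own captured ID,
    # then against each downstream port's Secondary-Subordinate bus number range.
    def switch_route_id(bus, dev, own_id, downstream_bus_ranges, arrived_on_primary):
        if (bus, dev) == own_id:
            return "consume"                           # (1) the switch is the target
        for port, (secondary, subordinate) in downstream_bus_ranges.items():
            if secondary <= bus <= subordinate:
                return f"forward to {port}"            # (2) target bus lives below this port
        return "unsupported request" if arrived_on_primary else "forward upstream"

    ranges = {"port 1": (2, 4), "port 2": (5, 8)}      # hypothetical bus number assignments
    switch_route_id(bus=6, dev=0, own_id=(1, 0),
                    downstream_bus_ranges=ranges, arrived_on_primary=True)
    # -> 'forward to port 2'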

Implicit Routing

Implicit routing is based on the intrinsic knowledge PCI Express devices are required to have
concerning upstream and downstream traffic and the existence of a single PCI Express Root
Complex at the top of the PCI Express topology. This awareness allows limited routing of
packets without the need to assign and include addresses with certain message packets.
Because the Root Complex generally implements power management and interrupt controllers,
as well as system error handling, it is either the source or recipient of most PCI Express
messages.

Only Messages May Use Implicit Routing

With the elimination of many sideband signals in the PCI Express protocol, alternate methods
are required to inform the host system when devices need service with respect to interrupts,
errors, power management, etc. PCI Express addresses this by defining a number of special
TLPs which may be used as virtual wires in conveying sideband events. Message groups
currently defined include:

Power Management

INTx legacy interrupt signaling

Error signaling

Locked Transaction support

Hot Plug signaling

Vendor-specific messages

Slot Power Limit messages

Messages May Also Use Address or ID Routing

In systems where all or some of this event traffic should target the system memory map or a
logical location in the PCI bus topology, address routing and ID routing may be used in place of
implicit routing. If address or ID routing is chosen for a message, then the routing mechanisms
just described are applied in the same way as they would for other posted write packets.

Routing Sub-Field in Header Indicates Routing Method

As a message TLP moves between PCI Express devices, packet header fields indicate both
that it is a message, and whether it should be routed using address, ID, or implicitly.

Key TLP Header Fields in Implicit Routing

If the Type field in a received message TLP indicates implicit routing is to be used, then the
routing sub-field in the header is also used to determine the message destination when the
routing check is performed. Figure 3-14 on page 133 illustrates a message TLP using implicit
routing.

Figure 3-14. 4DW Message TLP Header Implicit Routing Fields

Message Type Field Summary

Table 3-7 on page 134 summarizes the use of the TLP header Type field when a message is
being sent. As shown, the upper two bits of the 5 bit Type field indicate the packet is a
message, and the lower three bits are the routing sub-field which specify the routing method to
apply. Note that the 4DW header is always used with message TLPs, regardless of the routing
option selected.

Table 3-7. Message Request Header Type Field Usage

Bits 4:3 - Defines the type of transaction:

10b = Message Transaction

Bits 2:0 - Message Routing Subfield R[2:0], used to select message routing:

000b = Route to Root Complex

001b = Use Address Routing

010b = Use ID Routing

011b = Route as a Broadcast Message from Root Complex

100b = Local message; terminate at receiver (INTx messages)

101b = Gather & route to Root Complex (PME_TO_Ack message)

An Endpoint Checks a TLP Routed Implicitly

If the Type field in a received message TLP indicates implicit routing is to be used, then an
endpoint device simply checks that the routing sub-field is appropriate for it. For example, an
endpoint may accept a broadcast message or a message which terminates at the receiver; it
won't accept messages which implicitly target the Root Complex.

A Switch Receives a TLP Routed Implicitly

If the Type field in a received message TLP indicates implicit routing is to be used, then a
switch device simply considers the ingress port it arrived on and whether the routing sub-field
code is appropriate for it. Some examples:

1. The upstream link interface of a switch may legitimately receive a broadcast message routed
implicitly from the Root Complex. If it does, it forwards the message intact onto all downstream
links. It should not see an implicitly routed broadcast message arrive on a downstream ingress
port, and will handle such a packet as a malformed TLP.

2. The switch may accept messages indicating implicit routing to the Root Complex on secondary
links; it will forward all of these upstream because it "knows" the location of the Root Complex
is on its primary side. It would not accept messages routed implicitly to the Root Complex if
they arrived on the primary link receive interface.

3. If an implicitly routed message arrives on either the upstream or a downstream ingress port,
the switch may consume the packet if the routing sub-field indicates it should terminate at the
receiver (see the sketch following this list).

4. If messages are routed using address or ID methods, a switch simply performs the normal
address or ID checks in deciding whether to accept or forward them.
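
The decision sketch below (hypothetical; the enums and the function route_implicit_msg are
illustrative, not defined by the specification) condenses the cases above into a single check
based on the ingress port direction and the routing sub-field.

    #include <stdint.h>

    enum ingress_dir { FROM_UPSTREAM_PORT, FROM_DOWNSTREAM_PORT };

    enum msg_disposition {
        MSG_CONSUME,            /* terminates at this switch                    */
        MSG_FORWARD_UPSTREAM,   /* send toward the Root Complex                 */
        MSG_FORWARD_ALL_DOWN,   /* replicate onto all downstream links          */
        MSG_MALFORMED,          /* illegal combination of port and sub-field    */
        MSG_NOT_IMPLICIT        /* sub-field selects address/ID routing instead */
    };

    static enum msg_disposition route_implicit_msg(enum ingress_dir ingress, uint8_t r)
    {
        switch (r & 0x7) {
        case 0x0: /* 000b: route to Root Complex            */
        case 0x5: /* 101b: gather and route to Root Complex */
            return (ingress == FROM_DOWNSTREAM_PORT) ? MSG_FORWARD_UPSTREAM
                                                     : MSG_MALFORMED;
        case 0x3: /* 011b: broadcast from Root Complex      */
            return (ingress == FROM_UPSTREAM_PORT) ? MSG_FORWARD_ALL_DOWN
                                                   : MSG_MALFORMED;
        case 0x4: /* 100b: local - terminate at receiver    */
            return MSG_CONSUME;
        default:  /* 001b/010b use address or ID routing; other codes reserved */
            return MSG_NOT_IMPLICIT;
        }
    }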
Plug-And-Play Configuration of Routing Options

PCI-compatible configuration space and PCI Express extended configuration space are covered in
detail in Part 6. For reference, the programming of the three sets of configuration space
registers related to routing is summarized here.

Routing Configuration Is PCI-Compatible

PCI Express supports the basic 256 byte PCI configuration space common to all compatible
devices, including the Type 0 and Type 1 PCI configuration space header formats used by non-
bridge and switch/bridge devices, respectively. Devices may implement basic PCI-equivalent
functionality with no change to drivers or Operating System software.

Two Configuration Space Header Formats: Type 0, Type 1

PCI Express endpoint devices support a single PCI Express link and use the Type 0 (non-
bridge) format header. Switch/bridge devices support multiple links, and implement a Type 1
format header for each link interface. Figure 3-15 on page 136 illustrates a PCI Express
topology and the use of configuration space Type 0 and Type 1 header formats.

Figure 3-15. PCI Express Devices And Type 0 And Type 1 Header Use

Routing Registers Are Located in Configuration Header


As with PCI, registers associated with transaction routing are located in the first 64 bytes (16
DW) of configuration space (referred to in PCI Express as the PCI 2.3 compatible header area).
The three sets of registers of principal interest are:

1. Base Address Registers (BARs), found in both Type 0 and Type 1 headers.

2. Three sets of Base/Limit Register pairs, supported in the Type 1 header of switch/bridge
devices.

3. Three Bus Number Registers, also found in Type 1 headers of switch/bridge devices.

Figure 3-16 on page 137 illustrates the Type 0 and Type 1 PCI Express Configuration Space
header formats. Key routing registers are indicated.

Figure 3-16. PCI Express Configuration Space Type 0 and Type 1 Headers

Base Address Registers (BARs): Type 0, 1 Headers


General

The first of the configuration space registers related to routing are the Base Address Registers
(BARs). These are marked "<1" in Figure 3-16 on page 137, and are implemented by all
devices which require system memory, IO, or memory mapped IO (MMIO) addresses
allocated to them as targets. The location and use of BARs is compatible with PCI and PCI-X.
As shown in Figure 3-16 on page 137, a Type 0 configuration space header has 6 BARs
available for the device designer (at DW 4-9), while a Type 1 header has only two BARs (at
DW 4-5).

After discovering device resource requirements, system software programs each BAR with the start
address for a range of addresses the device may respond to as a completer (target). Set up of
BARs involves several things:

1. The device designer uses a BAR to hard-code a request for an allocation of one block of
prefetchable or non-prefetchable memory, or of IO addresses, in the system memory or IO map. A
pair of adjacent BARs is concatenated if a 64-bit memory request is being made.

2. Hard-coded bits in the BAR include an indication of the request type, the size of the request,
and whether the target device may be considered prefetchable (memory requests only).

3. During enumeration, all PCI-compatible devices are discovered and the BARs are examined by
system software to decode the request. Once the system memory and IO maps are established,
software programs the upper bits in implemented BARs with the start address for the block
allocated to the target.

BAR Setup Example One: 1MB, Prefetchable Memory Request

Figure 3-17 depicts the basic steps in setting up a BAR which is being used to track a 1 MB
block of prefetchable addresses for a device residing in the system memory map. In the
diagram, the BAR is shown at three points in the configuration process:

1. The uninitialized BAR in Figure 3-17 is as it looks after power-up or a reset. While the
designer has tied lower bits to indicate the request type and size, there is no requirement about
how the upper (read-write) bits must come up in a BAR, so these bits are indicated with XXXXX.
System software will first write all 1s to the BAR to set all read-write bits = 1. Of course, the
hard-coded lower bits are not affected by the configuration write.

2. The second view of the BAR shown in Figure 3-17 is as it looks after configuration software
has performed the write of all 1s to it. The next step in configuration is a read of the BAR to
check the request. Table 3-8 on page 140 summarizes the results of this configuration read.

3. The third view of the BAR shown in Figure 3-17 on page 139 is as it looks after configuration
software has performed another configuration write (Type 0) to program the start address for the
block. In this example, the device start address is 2GB, so bit 31 is written = 1 (2^31 = 2GB)
and all other upper bits are written = 0.

Figure 3-17. 32-Bit Prefetchable Memory BAR Set Up

At this point the configuration of the BAR is complete. Once software enables memory address
decoding in the PCI command register, the device will claim memory transactions in the range
2GB to 2GB+1MB.

Table 3-8. Results Of Reading The BAR after Writing All "1s" To It

BAR Bits   Meaning
0          Read back as 0, indicating a memory request.
2:1        Read back as 00b, indicating the target only supports a 32-bit address decoder.
3          Read back as 1, indicating the request is for prefetchable memory.
19:4       All read back as 0; used to help indicate the size of the request (also see bit 20).
31:20      All read back as 1 because software has not yet programmed the upper bits with a start
           address for the block. Because bit 20 is the first bit (above bit 3) to read back as
           written (= 1), the memory request size is 1MB (2^20 = 1MB).
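
The sizing step described above is commonly coded as "write all 1s, read back, invert, add one."
The following C sketch models the 1MB BAR of this example with an illustrative register variable
and hypothetical cfg_read32()/cfg_write32() accessors (not a real configuration API); it is a
minimal illustration, not a complete enumeration routine.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative model of the 1MB prefetchable BAR in Figure 3-17: bits 31:20
     * are read-write, the low bits are hard-wired to 1000b (prefetchable,
     * 32-bit decoder, memory request). */
    #define BAR_HARDWIRED 0x00000008u
    #define BAR_WRITABLE  0xFFF00000u
    static uint32_t bar_reg = BAR_HARDWIRED;

    static uint32_t cfg_read32(uint16_t offset)  { (void)offset; return bar_reg; }
    static void cfg_write32(uint16_t offset, uint32_t v)
    {
        (void)offset;
        bar_reg = (v & BAR_WRITABLE) | BAR_HARDWIRED;  /* hard-coded bits unaffected */
    }

    /* Write all 1s, read back, strip the 4 low encoding bits; the lowest
     * writable bit then gives the request size. */
    static uint32_t mem_bar_size(uint16_t bar_offset)
    {
        cfg_write32(bar_offset, 0xFFFFFFFFu);
        uint32_t readback = cfg_read32(bar_offset);   /* FFF00008h in this example */
        return ~(readback & ~0xFu) + 1u;
    }

    int main(void)
    {
        printf("request size = %u bytes\n", mem_bar_size(0x10));  /* 1048576 (1MB) */
        return 0;
    }

For the example above, the read-back value is FFF00008h and the computed size is 1MB, matching
Table 3-8.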

BAR Setup Example Two: 64-Bit, 64MB Memory Request

Figure 3-18 on page 141 depicts the basic steps in setting up a pair of BARs being used to
track a 64 MB block of prefetchable addresses for a device residing in the system memory
map. In the diagram, the BARs are shown at three points in the configuration process:

1. The uninitialized BARs are as they look after power-up or a reset. The designer has hard-coded
the lower bits of the lower BAR to indicate the request type and size; the upper BAR bits are all
read-write. System software will first write all 1s to both BARs to set all read-write bits = 1.
Of course, the hard-coded bits in the lower BAR are unaffected by the configuration write.

2. The second view of the BARs in Figure 3-18 on page 141 shows them as they look after
configuration software has performed the write of all 1s to both. The next step in configuration
is a read of the BARs to check the request. Table 3-9 on page 142 summarizes the results of this
configuration read.

3. The third view of the BAR pair in Figure 3-18 on page 141 indicates conditions after
configuration software has performed two configuration writes (Type 0) to program the two halves
of the 64-bit start address for the block. In this example, the device start address is 16GB, so
the Upper BAR bit corresponding to address bit 34 is written = 1 (2^34 = 16GB); all other
read-write bits in both BARs are written = 0.

Figure 3-18. 64-Bit Prefetchable Memory BAR Set Up


At this point the configuration of the BAR pair is complete. Once software enables memory
address decoding in the PCI command register, the device will claim memory transactions in
the range 16GB to 16GB+64MB.

Table 3-9. Results Of Reading The BAR Pair after Writing All "1s" To Both

BAR     Bits    Meaning
Lower   0       Read back as 0, indicating a memory request.
Lower   2:1     Read back as 10b, indicating the target supports a 64-bit address decoder and that
                this BAR is concatenated with the next one.
Lower   3       Read back as 1, indicating the request is for prefetchable memory.
Lower   25:4    All read back as 0; used to help indicate the size of the request (also see bit 26).
Lower   31:26   All read back as 1 because software has not yet programmed the upper bits with a
                start address for the block. Because bit 26 is the first bit (above bit 3) to read
                back as written (= 1), the memory request size is 64MB (2^26 = 64MB).
Upper   31:0    All read back as 1. These bits will be used as the upper 32 bits of the 64-bit
                start address programmed by system software.
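
For a 64-bit request, the two BARs are sized as a single 64-bit register. The short sketch below
(illustrative only; it starts from the read-back values rather than performing the configuration
accesses) shows the combined calculation for the values in Table 3-9.

    #include <stdint.h>
    #include <stdio.h>

    /* Given the values read back from a 64-bit BAR pair after writing all 1s to
     * both (lower BAR then upper BAR, as in Figure 3-18), return the size. */
    static uint64_t mem_bar64_size(uint32_t lower_readback, uint32_t upper_readback)
    {
        uint64_t combined = ((uint64_t)upper_readback << 32) | lower_readback;
        return ~(combined & ~0xFull) + 1u;   /* strip the low encoding bits, then size */
    }

    int main(void)
    {
        /* The 64MB example above: lower BAR reads back FC00000Ch, upper FFFFFFFFh. */
        printf("size = %llu bytes\n",
               (unsigned long long)mem_bar64_size(0xFC00000Cu, 0xFFFFFFFFu));
        return 0;
    }

With these read-back values the function returns 67108864 bytes (64MB), matching Table 3-9.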
BAR Setup Example Three: 256-Byte IO Request

Figure 3-19 on page 143 depicts the basic steps in setting up a BAR which is being used to
track a 256 byte block of IO addresses for a legacy PCI Express device residing in the system
IO map. In the diagram, the BAR is shown at three points in the configuration process:

1. The uninitialized BAR in Figure 3-19 is as it looks after power-up or a reset. System software
first writes all 1s to the BAR to set all read-write bits = 1. Of course, the hard-coded bits are
unaffected by the configuration write.

2. The second view of the BAR shown in Figure 3-19 on page 143 is as it looks after configuration
software has performed the write of all 1s to it. The next step in configuration is a read of the
BAR to check the request. Table 3-10 on page 144 summarizes the results of this configuration
read.

3. The third view of the BAR shown in Figure 3-19 on page 143 is as it looks after configuration
software has performed another configuration write (Type 0) to program the start address for the
IO block. In this example, the device start address is 16KB, so bit 14 is written = 1 (2^14 =
16KB); all other upper bits are written = 0.

Figure 3-19. IO BAR Set Up

At this point the configuration of the IO BAR is complete. Once software enables IO address
decoding in the PCI command register, the device will claim IO transactions in the range 16KB
to 16KB+256.

Table 3-10. Results Of Reading The IO BAR after Writing All "1s" To It

BAR Bits   Meaning
0          Read back as 1, indicating an IO request.
1          Reserved. Tied low and read back as 0.
7:2        All read back as 0; used to help indicate the size of the request (also see bit 8).
31:8       All read back as 1 because software has not yet programmed the upper bits with a start
           address for the block. Because bit 8 is the first bit (above bit 1) to read back as
           written (= 1), the IO request size is 256 bytes (2^8 = 256).

Base/Limit Registers, Type 1 Header Only

General

The second set of configuration registers related to routing is found only in Type 1
configuration headers and is used when forwarding address-routed TLPs. Marked "<2" in Figure
3-16 on page 137, these are the three sets of Base/Limit registers programmed in each bridge
interface to enable a switch/bridge to claim and forward address-routed TLPs to a secondary
bus. Three sets of Base/Limit Registers are needed because transactions are handled
differently (e.g. prefetching, write-posting, etc.) in the prefetchable memory, non-prefetchable
memory (MMIO), and IO address domains. The Base Register in each pair establishes the
start address for the community of downstream devices and the Limit Register defines the
upper address for that group of devices. The three sets of Base/Limit Registers include:

Prefetchable Memory Base and Limit Registers

Non-Prefetchable Memory Base and Limit Register

I/O Base and Limit Registers

Prefetchable Memory Base/Limit Registers

The Prefetchable Memory Base/Limit registers are located at DW 9, and the Prefetchable Memory
Base/Limit Upper registers at DW 10-11, within the Type 1 header. These registers track all
downstream prefetchable memory devices. Either 32-bit or 64-bit addressing can be supported by
these registers. If the Upper registers are not implemented, only 32 bits of memory addressing
are available, and the TLP headers mapping to this space will use the 3DW format. If the Upper
registers are implemented and system software maps the device above the 4GB boundary, TLPs
accessing the device will carry the 4DW header format. In the example shown in Figure 3-20 on
page 145, a 6GB prefetchable address range is being set up for the secondary link of a switch.

Figure 3-20. 6GB, 64-Bit Prefetchable Memory Base/Limit Register Set Up

Register programming in the example shown in Figure 3-20 on page 145 is summarized in Table
3-11.

Table 3-11. 6GB, 64-Bit Prefetchable Base/Limit Register Setup

Prefetchable Memory Base = 8001h
    The upper 3 nibbles (800h) provide the most significant 3 digits of the lower 32 bits of the
    Base Address for prefetchable memory behind this switch; the lower 5 digits of the address are
    assumed to be 00000h. The least significant nibble of this register value (1h) indicates that
    a 64-bit address decoder is supported and that the Upper Base/Limit registers are also used.

Prefetchable Memory Limit = FFF1h
    The upper 3 nibbles (FFFh) provide the most significant 3 digits of the lower 32 bits of the
    Limit Address for prefetchable memory behind this switch; the lower 5 digits of the address
    are assumed to be FFFFFh. The least significant nibble of this register value (1h) indicates
    that a 64-bit address decoder is supported and that the Upper Base/Limit registers are also
    used.

Prefetchable Memory Base Upper 32 Bits = 00000001h
    Upper 32 bits of the 64-bit Base address for prefetchable memory behind this switch.

Prefetchable Memory Limit Upper 32 Bits = 00000002h
    Upper 32 bits of the 64-bit Limit address for prefetchable memory behind this switch.
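
The four register values in Table 3-11 can be derived mechanically from the window's 64-bit start
and inclusive end addresses, given the 1MB granularity of the Base/Limit registers. The C sketch
below is illustrative only (the struct and function names are not from the specification) and
assumes a 1MB-aligned window.

    #include <stdint.h>
    #include <stdio.h>

    struct pf_window_regs {
        uint16_t base;        /* Prefetchable Memory Base               */
        uint16_t limit;       /* Prefetchable Memory Limit              */
        uint32_t base_upper;  /* Prefetchable Base Upper 32 Bits        */
        uint32_t limit_upper; /* Prefetchable Limit Upper 32 Bits       */
    };

    static struct pf_window_regs encode_pf_window(uint64_t start, uint64_t end_inclusive)
    {
        struct pf_window_regs r;
        /* Address bits 31:20 land in register bits 15:4; bit 0 = 1 flags a
         * 64-bit decoder and use of the Upper registers. */
        r.base  = (uint16_t)(((start         >> 16) & 0xFFF0u) | 0x1u);
        r.limit = (uint16_t)(((end_inclusive >> 16) & 0xFFF0u) | 0x1u);
        r.base_upper  = (uint32_t)(start         >> 32);
        r.limit_upper = (uint32_t)(end_inclusive >> 32);
        return r;
    }

    int main(void)
    {
        /* The 6GB window of Figure 3-20: 1_8000_0000h .. 2_FFFF_FFFFh */
        struct pf_window_regs r = encode_pf_window(0x180000000ull, 0x2FFFFFFFFull);
        printf("Base=%04Xh Limit=%04Xh BaseUpper=%08lXh LimitUpper=%08lXh\n",
               (unsigned)r.base, (unsigned)r.limit,
               (unsigned long)r.base_upper, (unsigned long)r.limit_upper);
        return 0;
    }

For this window the program prints 8001h, FFF1h, 00000001h, and 00000002h, matching Table 3-11.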

Non-Prefetchable Memory Base/Limit Registers

The Non-Prefetchable Memory Base/Limit registers are located at DW 8. These registers are used to
track all downstream non-prefetchable memory (memory-mapped IO) devices. Non-prefetchable memory
devices are limited to 32-bit addressing; TLPs targeting them always use the 3DW header format.

Register programming in the example shown in Figure 3-21 on page 147 is summarized in Table
3-12.

Figure 3-21. 2MB, 32-Bit Non-Prefetchable Base/Limit Register Set Up

Table 3-12. 2MB, 32-Bit Non-Prefetchable Base/Limit Register Setup

Memory Base (Non-Prefetchable) = 1210h
    The upper 3 nibbles (121h) provide the most significant 3 digits of the 32-bit Base Address
    for non-prefetchable memory behind this switch; the lower 5 digits of the address are assumed
    to be 00000h. The least significant nibble of this register value (0h) is reserved and should
    be set = 0.

Memory Limit (Non-Prefetchable) = 1220h
    The upper 3 nibbles (122h) provide the most significant 3 digits of the 32-bit Limit Address
    for non-prefetchable memory behind this switch; the lower 5 digits of the address are assumed
    to be FFFFFh. The least significant nibble of this register value (0h) is reserved and should
    be set = 0.

IO Base/Limit Registers

The IO Base/Limit registers are located at DW 7 and the IO Base/Limit Upper registers at DW 12.
These registers are used to track all downstream IO target devices. If the Upper registers are
used, the IO address space may be extended to a full 32 bits (4GB). If they are not implemented,
the IO address space is limited to 16 bits (64KB). In either case, TLPs targeting these IO
devices always carry the 3DW header format.

Register programming in the example shown in Figure 3-22 on page 149 is summarized in Table
3-13 on page 150.

Figure 3-22. IO Base/Limit Register Set Up

Table 3-13. 256 Byte IO Base/Limit Register Setup

IO Base = 21h
    The upper nibble (2h) specifies the most significant hex digit of the 32-bit IO Base address
    (the lower digits are 000h). The lower nibble (1h) indicates that the device supports 32-bit
    IO behind the bridge interface. This also means the device implements the Upper IO Base/Limit
    register set, and those registers will be concatenated with the Base/Limit registers.

IO Limit = 41h
    The upper nibble (4h) specifies the most significant hex digit of the 32-bit IO Limit address
    (the lower digits are FFFh). The lower nibble (1h) indicates that the device supports 32-bit
    IO behind the bridge interface. This also means the device implements the Upper IO Base/Limit
    register set, and those registers will be concatenated with the Base/Limit registers.

IO Base Upper 16 Bits = 0000h
    Upper 16 bits of the 32-bit Base address for IO behind this switch.

IO Limit Upper 16 Bits = 0000h
    Upper 16 bits of the 32-bit Limit address for IO behind this switch.
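
Once the three Base/Limit windows are programmed, the forwarding decision for an address-routed
TLP reduces to simple range checks. The sketch below is a simplified illustration (the struct and
function names are invented, and it assumes the register values have already been expanded into
full byte addresses with inclusive limits).

    #include <stdbool.h>
    #include <stdint.h>

    struct addr_window { uint64_t base, limit; };   /* limit is inclusive */

    struct bridge_windows {
        struct addr_window prefetch;   /* Prefetchable Memory Base/Limit     */
        struct addr_window mmio;       /* Non-Prefetchable Memory Base/Limit */
        struct addr_window io;         /* IO Base/Limit                      */
    };

    static bool in_window(const struct addr_window *w, uint64_t addr)
    {
        return addr >= w->base && addr <= w->limit;
    }

    /* True if an address-routed TLP should be claimed and forwarded to the
     * secondary side of this bridge interface. */
    static bool claim_downstream(const struct bridge_windows *bw,
                                 uint64_t addr, bool is_io)
    {
        if (is_io)
            return in_window(&bw->io, addr);
        return in_window(&bw->prefetch, addr) || in_window(&bw->mmio, addr);
    }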

Bus Number Registers, Type 1 Header Only

The third set of configuration registers related to routing are used when forwarding ID-routed
TLPs, including configuration cycles and completions and optionally messages. These are
marked "<3" in Figure 3-16 on page 137. As in PCI, a switch/bridge interface requires three
registers: Primary Bus Number, Secondary Bus Number, and Subordinate Bus Number. The
function of these registers is summarized here.

Primary Bus Number

The Primary Bus Number register contains the bus (link) number to which the upstream side of
a bridge (switch) is connected. In PCI Express, the primary bus is the one in the direction of the
Root Complex and host processor.

Secondary Bus Number

The Secondary Bus Number register contains the bus (link) number to which the downstream
side of a bridge (switch) is connected.

Subordinate Bus Number

The Subordinate Bus Number register contains the highest bus (link) number on the
downstream side of a bridge (switch). The Subordinate and Secondary Bus Number registers
will contain the same value unless there is another bridge (switch) on the secondary side.
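
A minimal sketch (illustrative names only, not a spec-defined interface) of how a bridge
interface might apply these three registers to an ID-routed TLP such as a completion or a
configuration request:

    #include <stdint.h>

    enum id_route { CLAIM_FOR_SECONDARY_BUS, FORWARD_DOWNSTREAM, NOT_MINE };

    struct bus_regs {
        uint8_t primary;      /* bus (link) on the upstream side            */
        uint8_t secondary;    /* bus (link) immediately below this interface */
        uint8_t subordinate;  /* highest bus below this interface            */
    };

    static enum id_route route_by_id(const struct bus_regs *b, uint8_t target_bus)
    {
        if (target_bus == b->secondary)
            return CLAIM_FOR_SECONDARY_BUS;   /* target lives directly on the secondary bus */
        if (target_bus > b->secondary && target_bus <= b->subordinate)
            return FORWARD_DOWNSTREAM;        /* target is behind another bridge below      */
        return NOT_MINE;                      /* target stays on the primary side           */
    }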
A Switch Is a Two-Level Bridge Structure

Because PCI does not natively support bridges with multiple downstream ports, PCI Express
switch devices appear logically as two-level PCI bridge structures, consisting of a single bridge
to the primary link and an internal PCI bus which hosts one or more virtual bridges to secondary
interfaces. Each bridge interface has an independent Type 1 format configuration header with
its own sets of Base/Limit Registers and Bus Number Registers. Figure 3-23 on page 152
illustrates the bus numbering associated with the external links and internal bus of a switch.
Note that the secondary bus of the primary link interface is the internal virtual bus, and that
the primary interfaces of all downstream link interfaces connect logically to the internal bus.

Figure 3-23. Bus Number Registers In A Switch


Chapter 4. Packet-Based Transactions
The Previous Chapter

This Chapter

The Next Chapter

Introduction to the Packet-Based Protocol

Transaction Layer Packets

Data Link Layer Packets


The Previous Chapter
The previous chapter described the general concepts of PCI Express transaction routing and
the mechanisms used by a device in deciding whether to accept, forward, or reject a packet
arriving at an ingress port. Because Data Link Layer Packets (DLLPs) and Physical Layer
ordered set link traffic are never forwarded, the emphasis here is on Transaction Layer Packet
(TLP) types and the three routing methods associated with them: address routing, ID routing,
and implicit routing. Included is a summary of configuration methods used in PCI Express to set
up PCI-compatible plug-and-play addressing within system IO and memory maps, as well as
key elements in the PCI Express packet protocol used in making routing decisions.
This Chapter
Information moves between PCI Express devices in packets, and the two major classes of
packets are Transaction Layer Packets (TLPs), and Data Link Layer Packets (DLLPs). The
use, format, and definition of all TLP and DLLP packet types and their related fields are
detailed in this chapter.
The Next Chapter
The next chapter discusses the Ack/Nak Protocol that verifies the delivery of TLPs between
each port as they travel between the requester and completer devices. This chapter details the
hardware retry mechanism that is automatically triggered when a TLP transmission error is
detected on a given link.
Introduction to the Packet-Based Protocol
The PCI Express protocol improves upon methods used by earlier buses (e.g. PCI) to exchange data
and to signal system events. In addition to supporting basic memory, IO, and configuration
read/write transactions, PCI Express eliminates many sideband signals and replaces them with
in-band messages carried on the links.

With the exception of the logical idle indication and physical layer Ordered Sets, all information
moves across an active PCI Express link in fundamental chunks called packets which are
comprised of 10 bit control (K) and data (D) symbols. The two major classes of packets
exchanged between two PCI Express devices are high level Transaction Layer Packets (TLPs),
and low-level link maintenance packets called Data Link Layer Packets (DLLPs). Collectively,
the various TLPs and DLLPs allow two devices to perform memory, IO, and Configuration
Space transactions reliably and use messages to initiate power management events, generate
interrupts, report errors, etc. Figure 4-1 on page 155 depicts TLPs and DLLPs on a PCI
Express link.

Figure 4-1. TLP And DLLP Packets

Why Use A Packet-Based Transaction Protocol


There are some distinct advantages in using a packet-based protocol, especially when it comes
to data integrity. Three important aspects of PCI Express packet protocol help promote data
integrity during link transmission:

Packet Formats Are Well Defined

Some early bus protocols (e.g. PCI) allow transfers of indeterminate (and unlimited) size,
making identification of payload boundaries impossible until the end of the transfer. In addition,
an early transaction end might be signaled by either agent (e.g. target disconnect on a write or
pre-emption of the initiator during a read), resulting in a partial transfer. In these cases, it is
difficult for the sender of data to calculate and send a checksum or CRC covering an entire
payload, when it may terminate unexpectedly. Instead, PCI uses a simple parity scheme which
is applied and checked for each bus phase completed.

In contrast, each PCI Express packet has a known size and format, and the packet header,
positioned at the beginning of each DLLP and TLP packet, indicates the packet type and the
presence of any optional fields. The size of each packet field is either fixed or defined by the
packet type. The size of any data payload is conveyed in the TLP header Length field. Once a
transfer commences, there are no early transaction terminations by the recipient. This structured
packet format makes it possible to insert additional information into the packet at prescribed
locations, including framing symbols, CRC, and a packet sequence number (TLPs only).

Framing Symbols Indicate Packet Boundaries

Each TLP and DLLP packet sent is framed with a Start and End control symbol, clearly defining
the packet boundaries to the receiver. Note that the Start and End control (K) symbols
appended to packets by the transmitting device are 10 bits each. This is a big improvement
over PCI and PCI-X which use the assertion and de-assertion of a single FRAME# signal to
indicate the beginning and end of a transaction. A glitch on the FRAME# signal (or any of the
other PCI/PCI-X control signals) could cause a target to misconstrue bus events. In contrast, a
PCI Express receiver must properly decode a complete 10 bit symbol before concluding link
activity is beginning or ending. Unexpected or unrecognized control symbols are handled as
errors.

CRC Protects Entire Packet

Unlike the side-band parity signals used by PCI devices during the address and each data
phase of a transaction, the in-band 16-bit or 32-bit PCI Express CRC value "protects" the entire
packet (other than framing symbols). In addition to CRC, TLP packets also have a packet
sequence number appended to them by the transmitter so that if an error is detected at the
receiver, the specific packet(s) which were received in error may be resent. The transmitter
maintains a copy of each TLP sent in a Retry Buffer until it is checked and acknowledged by
the receiver. This TLP acknowledgement mechanism (sometimes referred to as the Ack/Nak protocol)
forms the basis of link-level TLP error correction. It is very important in deep topologies,
where the device reporting an error may be many links away from the host and CPU intervention in
the recovery would otherwise be needed.
Transaction Layer Packets
In PCI Express terminology, high-level transactions originate at the device core of the
transmitting device and terminate at the core of the receiving device. The Transaction Layer is
the starting point in the assembly of outbound Transaction Layer Packets (TLPs), and the end
point for disassembly of inbound TLPs at the receiver. Along the way, the Data Link Layer and
Physical Layer of each device contribute to the packet assembly and disassembly as described
below.

TLPs Are Assembled And Disassembled

Figure 4-2 on page 158 depicts the general flow of TLP assembly at the transmit side of a link
and disassembly at the receiver. The key stages in Transaction Layer Packet protocol are
listed below. The numbers correspond to those in Figure 4-2.

1. Device B's core passes a request for service to the PCI Express hardware interface. How this
is done is not covered by the PCI Express Specification, and is device-specific. General
information contained in the request would include:

- The PCI Express command to be performed

- Start address or ID of target (if address routing or ID routing are used)

- Transaction type (memory read or write, configuration cycle, etc.)

- Data payload size (and the data to send, if any)

- Virtual Channel/Traffic Class information

- Attributes of the transfer: No Snoop bit set?, Relaxed Ordering set?, etc.

2. The Transaction Layer builds the TLP header, data payload, and digest based on the request
from the core. Before sending a TLP to the Data Link Layer, flow control credits and ordering
rules must be applied.

3. When the TLP is received at the Data Link Layer, a Sequence Number is assigned and a Link CRC
is calculated for the TLP (the calculation includes the Sequence Number). The TLP is then passed
on to the Physical Layer.

4. At the Physical Layer, byte striping, scrambling, encoding, and serialization are performed.
STP and END control (K) characters are appended to the packet. The packet is sent out on the
transmit side of the link.

5. At the Physical Layer receiver of Device A, de-serialization, framing symbol checks, decoding,
and byte un-striping are performed. Note that the first level of error checking is performed at
the Physical Layer (on the control codes).

6. The Data Link Layer of the receiver calculates the CRC and checks it against the received
value. It also checks the Sequence Number of the TLP for violations. If there are no errors, it
passes the TLP up to the Transaction Layer of the receiver. The information is decoded and passed
to the core of Device A. The Data Link Layer of the receiver also notifies the transmitter of the
success or failure in processing the TLP by sending an Ack or Nak DLLP. In the event of a Nak (No
Acknowledge), the transmitter will re-send all TLPs in its Retry Buffer.

Figure 4-2. PCI Express Layered Protocol And TLP Assembly/Disassembly

Device Core Requests Access to Four Spaces

Transactions are carried out between PCI Express requesters and completers, using four
separate address spaces: Memory, IO, Configuration, and Message. (See Table 4-1.)

Table 4-1. PCI Express Address Space And Transaction Types

Memory (Read, Write)
    Transfer data to or from a location in the system memory map. The protocol also supports a
    locked memory read transaction.

IO (Read, Write)
    Transfer data to or from a location in the system IO map. PCI Express permits IO address
    assignment only to legacy devices; IO addressing is not permitted for native PCI Express
    devices.

Configuration (Read, Write)
    Transfer data to or from a location in the configuration space of a PCI Express device. As in
    PCI, configuration is used to discover device capabilities, program plug-and-play features,
    and check status, using the 4KB PCI Express configuration space.

Message (Baseline, Vendor-specific)
    Provides in-band messaging and event reporting (without consuming memory or IO address
    resources). Messages are handled the same as posted write transactions.

TLP Transaction Variants Defined

In accessing the four address spaces, PCI Express Transaction Layer Packets (TLPs) carry a
header field, called the Type field, which encodes the specific command variant to be used.
Table 4-2 on page 160 summarizes the allowed transactions:

Table 4-2. TLP Header Type Field Defines Transaction Variant

TLP Type Acronym

Memory Read Request (MRd)

Memory Read Lock Request (MRdLk)

Memory Write Request (MWr)

IO Read Request (IORd)

IO Write Request (IOWr)

Config Type 0 Read Request (CfgRd0)

Config Type 0 Write Request (CfgWr0)

Config Type 1 Read Request (CfgRd1)

Config Type 1 Write Request (CfgWr1)

Message Request (Msg)

Message Request W/Data (MsgD)

Completion (Cpl)

Completion W/Data (CplD)


Completion-Locked (CplLk)

Completion W/Data - Locked (CplDLk)

TLP Structure

The basic usage of each component of a Transaction Layer Packet is defined in Table 4-3 on
page 161.

Table 4-3. TLP Components And Their Use

Component: Header (Transaction Layer)
    3DW or 4DW (12 or 16 bytes) in size. The format varies with packet type, but the header
    defines the transaction parameters: transaction type; intended recipient address, ID, etc.;
    transfer size (if any) and Byte Enables; ordering attribute; cache coherency attribute; and
    Traffic Class.

Component: Data (Transaction Layer)
    Optional field. 0-1024 DW payload, which may be further qualified with Byte Enables to get
    byte address and byte transfer size resolution.

Component: Digest (Transaction Layer)
    Optional field. If present, always 1 DW in size. Used for end-to-end CRC (ECRC) and data
    poisoning.

Generic TLP Header Format

Figure 4-3 on page 162 illustrates the format and contents of a generic TLP 3DW header. In
this section, fields common to nearly all transactions are summarized. In later sections, header
format differences associated with the specific transaction types are covered.

Figure 4-3. Generic TLP Header Fields


Generic Header Field Summary

Table 4-4 on page 163 summarizes the size and use of each of the generic TLP header fields.
Note that fields marked "R" in Figure 4-3 on page 162 are reserved and should be set = 0.

Table 4-4. Generic Header Field Summary

Length[9:0] (Byte 3 bits 7:0, Byte 2 bits 1:0)
    TLP data payload transfer size, in DW. The field is 10 bits, so the maximum transfer size is
    2^10 = 1024 DW (4KB). Encoding:
        00 0000 0001b = 1 DW
        00 0000 0010b = 2 DW
        ...
        11 1111 1111b = 1023 DW
        00 0000 0000b = 1024 DW

Attr (Attributes) (Byte 2 bits 5:4)
    Bit 5 = Relaxed Ordering. When set = 1, PCI-X relaxed ordering is enabled for this TLP. If set
    = 0, strict PCI ordering is used.
    Bit 4 = No Snoop. When set = 1, the requester is indicating that no host cache coherency
    issues exist with respect to this TLP, and system hardware is not required to cause a
    processor cache snoop for coherency. When set = 0, PCI-type cache snoop protection is
    required.

EP (Poisoned Data) (Byte 2 bit 6)
    If set = 1, the data accompanying this TLP should be considered invalid, although the
    transaction is allowed to complete normally.

TD (TLP Digest Field Present) (Byte 2 bit 7)
    If set = 1, the optional 1 DW TLP Digest field containing an ECRC value is included with this
    TLP. Some rules:
        Presence of the Digest field must be checked by all receivers (using this bit).
        A TLP with TD = 1 but no Digest field is handled as a Malformed TLP.
        If a device supports checking ECRC and TD = 1, it must perform the ECRC check.
        If a device at the ultimate destination does not support checking ECRC (optional), it must
        ignore the digest.

TC (Traffic Class) (Byte 1 bits 6:4)
    These three bits encode the traffic class to be applied to this TLP and to the completion
    associated with it (if any):
        000b = Traffic Class 0 (default)
        ...
        111b = Traffic Class 7
    TC 0 is the default class, and TC 1-7 are used in providing differentiated services. See
    "Traffic Classes and Virtual Channels" on page 256 for additional information.

Type[4:0] (Byte 0 bits 4:0)
    These 5 bits encode the transaction variant used with this TLP. The Type field is used with
    the Fmt[1:0] field to specify transaction type, header size, and whether a data payload is
    present. See below for additional information on Type/Fmt encoding for each transaction type.

Fmt[1:0] (Format) (Byte 0 bits 6:5)
    These two bits encode information about header size and whether a data payload is part of the
    TLP:
        00b = 3DW header, no data
        01b = 4DW header, no data
        10b = 3DW header, with data
        11b = 4DW header, with data
    See below for additional information on Type/Fmt encoding for each transaction type.

First DW Byte Enables (Byte 7 bits 3:0)
    These four high-true bits map one-to-one to the bytes within the first doubleword of payload:
        Bit 3 = 1: Byte 3 in first DW is valid; otherwise not
        Bit 2 = 1: Byte 2 in first DW is valid; otherwise not
        Bit 1 = 1: Byte 1 in first DW is valid; otherwise not
        Bit 0 = 1: Byte 0 in first DW is valid; otherwise not
    See below for details on Byte Enable use.

Last DW Byte Enables (Byte 7 bits 7:4)
    These four high-true bits (the field's bits 3:0, located at header Byte 7 bits 7:4) map
    one-to-one to the bytes within the last doubleword of payload:
        Bit 3 = 1: Byte 3 in last DW is valid; otherwise not
        Bit 2 = 1: Byte 2 in last DW is valid; otherwise not
        Bit 1 = 1: Byte 1 in last DW is valid; otherwise not
        Bit 0 = 1: Byte 0 in last DW is valid; otherwise not
    See below for details on Byte Enable use.
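
As an illustration of the field positions in Table 4-4, the following C sketch (a hypothetical
helper, not a spec-defined interface) extracts the generic fields from the first header DW, with
the header bytes numbered as drawn in Figure 4-3.

    #include <stdint.h>

    struct tlp_hdr_dw0 {
        uint8_t  fmt;      /* Byte 0, bits 6:5 */
        uint8_t  type;     /* Byte 0, bits 4:0 */
        uint8_t  tc;       /* Byte 1, bits 6:4 */
        uint8_t  td, ep;   /* Byte 2, bits 7 and 6 */
        uint8_t  attr;     /* Byte 2, bits 5:4 */
        uint16_t length;   /* Byte 2 bits 1:0 (high), Byte 3 bits 7:0 (low), in DW */
    };

    static struct tlp_hdr_dw0 parse_dw0(const uint8_t h[4])
    {
        struct tlp_hdr_dw0 f;
        f.fmt    = (h[0] >> 5) & 0x3;
        f.type   =  h[0]       & 0x1F;
        f.tc     = (h[1] >> 4) & 0x7;
        f.td     = (h[2] >> 7) & 0x1;
        f.ep     = (h[2] >> 6) & 0x1;
        f.attr   = (h[2] >> 4) & 0x3;
        f.length = (uint16_t)(((h[2] & 0x3) << 8) | h[3]);
        if (f.length == 0)     /* 00 0000 0000b encodes the maximum, 1024 DW */
            f.length = 1024;
        return f;
    }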


Header Type/Format Field Encodings

Table 4-5 on page 165 summarizes the encodings used in TLP header Type and Format (Fmt)
fields.

Table 4-5. TLP Header Type and Format Field Encodings

TLP                                       Fmt[1:0]               Type[4:0]
Memory Read Request (MRd)                 00 = 3DW, no data      0 0000
                                          01 = 4DW, no data
Memory Read Lock Request (MRdLk)          00 = 3DW, no data      0 0001
                                          01 = 4DW, no data
Memory Write Request (MWr)                10 = 3DW, w/ data      0 0000
                                          11 = 4DW, w/ data
IO Read Request (IORd)                    00 = 3DW, no data      0 0010
IO Write Request (IOWr)                   10 = 3DW, w/ data      0 0010
Config Type 0 Read Request (CfgRd0)       00 = 3DW, no data      0 0100
Config Type 0 Write Request (CfgWr0)      10 = 3DW, w/ data      0 0100
Config Type 1 Read Request (CfgRd1)       00 = 3DW, no data      0 0101
Config Type 1 Write Request (CfgWr1)      10 = 3DW, w/ data      0 0101
Message Request (Msg)                     01 = 4DW, no data      1 0rrr (rrr = routing sub-field)
Message Request W/Data (MsgD)             11 = 4DW, w/ data      1 0rrr (rrr = routing sub-field)
Completion (Cpl)                          00 = 3DW, no data      0 1010
Completion W/Data (CplD)                  10 = 3DW, w/ data      0 1010
Completion-Locked (CplLk)                 00 = 3DW, no data      0 1011
Completion W/Data - Locked (CplDLk)       10 = 3DW, w/ data      0 1011

The Digest and ECRC Field


The Digest field and End-to-End CRC (ECRC) are optional, as is a device's ability to generate and
check ECRC. If ECRC is supported and enabled by software, a device must calculate and apply ECRC
for all TLPs that it originates. Also, devices that support ECRC checking must support Advanced
Error Reporting.

ECRC Generation and Checking

This book does not detail the algorithm and process of calculating ECRC; it is defined within the
specification. ECRC covers all fields that do not change as the TLP is forwarded across the
fabric: it includes all invariant fields of the TLP header and the data payload, if present. The
variant fields, which are set to 1 when calculating the ECRC, are:

Bit 0 of the Type field. This bit changes when the transaction type is altered for a packet. For
example, a configuration transaction being forwarded to a remote link (across one or more
switches) begins as a Type 1 configuration transaction. When the transaction reaches the
destination link, it is converted to a Type 0 configuration transaction by changing bit 0 of the
Type field.

The Error/Poisoned (EP) bit. This bit can be set as a TLP traverses the fabric in the event that
the data field associated with the packet has been corrupted. This is also referred to as error
forwarding.

Who Can Check ECRC?

The ECRC check is intended for the device that is the ultimate recipient of the TLP. Link CRC
checking verifies that a TLP traverses a given link without error before being forwarded to the
next link, but ECRC is intended to verify that the packet has not been altered in its journey
between the Requester and Completer. Switches in the path must maintain the integrity of the TD
bit, because corruption of TD will cause an error at the ultimate target device.

The specification makes two statements regarding a Switch's role in ECRC checking:

A switch that supports ECRC checking performs this check on TLPs destined to a location
within the Switch itself. "On all other TLPs a Switch must preserve the ECRC (forward it
untouched) as an integral part of the TLP."

"Note that a Switch may perform ECRC checking on TLPs passing through the Switch.
ECRC Errors detected by the Switch are reported in the same way any other device would
report them, but do not alter the TLPs passage through the Switch."

These statements may appear to contradict each other. However, the first statement does not
explicitly state that an ECRC check cannot be made in the process of forwarding the TLP
untouched. The second statement clarifies that it is possible for switches, as well as the
ultimate target device, to check and report ECRC.

Using Byte Enables

As in the PCI protocol, PCI Express requires a mechanism for reconciling its DW addressing
and data transfers with the need, at times, for byte resolution in transfer sizes and transaction
start/end addresses. To achieve byte resolution, PCI Express makes use of the two Byte
Enable fields introduced earlier in Figure 4-3 on page 162 and in Table 4-4 on page 163.

The First DW Byte Enable field and the Last DW Byte Enable fields allow the requester to
qualify the bytes of interest within the first and last double words transferred; this has the effect
of allowing smaller transfers than a full double word and offsetting the start and end addresses
from DW boundaries.

Byte Enable Rules

1. Byte enable bits are high true. A value of 0 indicates the corresponding byte in the data
payload should not be written by the completer; a value of 1 indicates it should.

2. If the valid data transferred is all within a single aligned doubleword, the Last DW Byte
Enable field must be = 0000b.

3. If the header Length field indicates a transfer of more than 1DW, the First DW Byte Enable
field must have at least one bit enabled.

4. If the Length field indicates a transfer of 3DW or more, then neither the First DW Byte Enable
field nor the Last DW Byte Enable field may have discontiguous byte enable bits set. In these
cases, the Byte Enable fields are only being used to offset the effective start address of a
burst transaction.

5. Discontiguous byte enable bit patterns in the First DW Byte Enable field are allowed if the
transfer is 1DW.

6. Discontiguous byte enable bit patterns in both the First and Last DW Byte Enable fields are
allowed only if the transfer is quadword aligned and 2DW in length.

7. A write request with a transfer length of 1DW and no byte enables set is legal, but has no
effect on the completer.

8. If a read request of 1DW is made with no byte enable bits set, the completer returns a 1DW
data payload of undefined data. This may be used as a flush mechanism: because of ordering rules,
a flush may be used to force all previously posted writes to memory before the completion is
returned.

An example of byte enable use in this case is illustrated in Figure 4-4 on page 168, and a short
calculation sketch follows the figure. Note that the transfer length must extend from the first
DW with any valid byte enabled to the last DW with any valid byte enabled. Because the transfer
is more than 2DW, the byte enables may only be used to specify the start address location (2d)
and end address location (34d) of the transfer.

Figure 4-4. Using First DW and Last DW Byte Enable Fields
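
The sketch below (illustrative helper names; it assumes a Length of at least 2 DW and at least
one bit set in each Byte Enable field, as in the Figure 4-4 example) computes the byte-resolution
start and end addresses implied by the Length and the two Byte Enable fields.

    #include <stdint.h>

    static unsigned lowest_set(uint8_t be)  { unsigned i = 0; while (!(be & (1u << i))) i++; return i; }
    static unsigned highest_set(uint8_t be) { unsigned i = 3; while (!(be & (1u << i))) i--; return i; }

    struct byte_extent { uint64_t start, end; };    /* end is inclusive */

    static struct byte_extent tlp_byte_extent(uint64_t dw_aligned_addr,
                                              uint16_t length_dw,
                                              uint8_t first_be, uint8_t last_be)
    {
        struct byte_extent e;
        e.start = dw_aligned_addr + lowest_set(first_be & 0xF);
        e.end   = dw_aligned_addr + (uint64_t)(length_dw - 1) * 4
                                  + highest_set(last_be & 0xF);
        return e;   /* e.g. Length = 9 DW, First BE = 1100b, Last BE = 0111b -> bytes 2..34 */
    }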

Transaction Descriptor Fields

As transactions move between requester and completer, it is important to uniquely identify a


transaction, since many split transactions may be pending at any instant. To this end, the
specification defines several important header fields that when used together form a unique
Transaction Descriptor as illustrated in Figure 4-5.

Figure 4-5. Transaction Descriptor Fields


While the Transaction Descriptor fields are not in adjacent header locations, collectively they
describe key transaction attributes, including:

Transaction ID

This is comprised of the Bus, Device, and Function Number of the TLP requester AND the Tag
field of the TLP.

Traffic Class

Traffic Class (TC 0 -7) is inserted in the TLP by the requester, and travels unmodified through
the topology to the completer. At every link, Traffic Class is mapped to one of the available
virtual channels.

Transaction Attributes

These consist of the Relaxed Ordering and No Snoop bits. These are also set by the requester
and travel with the packet to the completer.

Additional Rules For TLPs With Data Payloads

The following rules apply when a TLP includes a data payload.

1. The Length field refers to the data payload only; the Digest field (if present) is not
included in the Length.

2. The first byte of data in the payload (immediately after the header) is always associated with
the lowest (start) address.

3. The Length field always represents an integral number of doublewords (DW) transferred. Partial
doublewords are qualified using the First and Last DW Byte Enable fields.

4. The PCI Express specification states that when multiple completions are returned by a
completer in response to a single memory read request, each intermediate completion must end on a
naturally aligned 64-byte or 128-byte address boundary for a Root Complex (this is termed the
Read Completion Boundary, or RCB). All other devices must break such transactions at naturally
aligned 128-byte boundaries. This behavior promotes system performance related to cache lines.

5. The Length field is reserved when sending message TLPs using the Msg transaction. The Length
field is valid when sending the message-with-data variant, MsgD.

6. PCI Express supports tuning the maximum payload carried on a link. The data payload of a TLP
must not exceed the current value in the Max_Payload_Size field of the Device Control Register.
Only write transactions have data payloads, so this restriction does not apply to reads. A
receiver is required to check for violations of the Max_Payload_Size limit during writes;
violations are handled as Malformed TLPs.

7. Receivers must also check for discrepancies between the value in the Length field and the
actual amount of data transferred in a TLP with data. Violations are also handled as Malformed
TLPs.

8. Requests must not mix combinations of start address and transfer length which will cause a
memory space access to cross a 4KB boundary. While checking is optional in this case, receivers
checking for violations of this rule will report them as Malformed TLPs. (A sketch of the
payload-size and 4KB-crossing checks from rules 6 and 8 follows this list.)
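
A sketch of the two receiver-side checks from rules 6 and 8 above (the function names are
illustrative; a receiver detecting either condition would handle the packet as a Malformed TLP):

    #include <stdbool.h>
    #include <stdint.h>

    /* Rule 6: payload size (Length, in DW) must not exceed Max_Payload_Size. */
    static bool payload_exceeds_mps(uint16_t length_dw, uint16_t max_payload_bytes)
    {
        return (uint32_t)length_dw * 4u > max_payload_bytes;
    }

    /* Rule 8: a memory access must not cross a 4KB boundary. */
    static bool crosses_4kb(uint64_t start_addr, uint16_t length_dw)
    {
        uint64_t last_byte = start_addr + (uint64_t)length_dw * 4u - 1u;
        return (start_addr >> 12) != (last_byte >> 12);
    }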

Building Transactions: TLP Requests & Completions

In this section, the format of 3DW and 4DW headers used to accomplish specific transaction
types are described. Many of the generic fields described previously apply, but an emphasis is
placed on the fields which are handled differently between transaction types.

IO Requests

While the PCI Express specification discourages the use of IO transactions, an allowance is
made for legacy devices and software which may rely on a compatible device residing in the
system IO map rather than the memory map. While the IO transactions can technically access
a 32-bit IO range, in reality many systems (and CPUs) restrict IO access to the lower 16 bits
(64KB) of this range. Figure 4-6 on page 171 depicts the system IO map and the 16/32 bit
address boundaries. PCI Express non-legacy devices are memory-mapped, and not permitted
to make requests for IO address allocation in their configuration Base Address Registers.
Figure 4-6. System IO Map

IO Request Header Format

Figure 4-7 on page 172 depicts the format of the 3DW IO request header. Each field in the
header is described in the section that follows.

Figure 4-7. 3DW IO Request Header Format

Definitions Of IO Request Header Fields

Table 4-6 on page 173 describes the location and use of each field in an IO request header.
Table 4-6. IO Request Header Fields

Length[9:0] (Byte 3 bits 7:0, Byte 2 bits 1:0)
    Indicates data payload size in DW. For IO requests this field is always = 1; Byte Enables are
    used to qualify bytes within the DW.

Attr[1:0] (Attributes) (Byte 2 bits 5:4)
    Attribute 1: Relaxed Ordering bit. Attribute 0: No Snoop bit. Both of these bits are always
    = 0 in IO requests.

EP (Byte 2 bit 6)
    If = 1, indicates the data payload (if present) is poisoned.

TD (Byte 2 bit 7)
    If = 1, indicates the presence of a Digest field (1 DW) at the end of the TLP (preceding LCRC
    and END).

TC[2:0] (Traffic Class) (Byte 1 bits 6:4)
    Indicates the Traffic Class for the packet. TC is = 0 for all IO requests.

Type[4:0] (Byte 0 bits 4:0)
    TLP packet Type field. Always set to 0 0010b for IO requests.

Fmt[1:0] (Format) (Byte 0 bits 6:5)
    Packet format. IO requests are:
        00b = IO Read (3DW, no data)
        10b = IO Write (3DW, with data)

First DW Byte Enables (Byte 7 bits 3:0)
    These high-true bits map one-to-one to qualify bytes within the DW payload. For IO requests,
    any bit combination is valid (including none).

Last DW Byte Enables (Byte 7 bits 7:4)
    These high-true bits map one-to-one to qualify bytes within the last DW transferred. For IO
    requests, these bits must be 0000b (single DW).

Tag[7:0] (Byte 6 bits 7:0)
    These bits are used to identify each outstanding request issued by the requester. As
    non-posted requests are sent, the next sequential tag is assigned. By default only bits 4:0
    are used (32 outstanding transactions at a time); if the Extended Tag bit in the PCI Express
    Control Register is set = 1, all 8 bits may be used (256 tags).

Requester ID[15:0] (Byte 4 bits 7:0, Byte 5 bits 7:0)
    Identifies the requester so a completion may be returned, etc.
        Byte 4, bits 7:0 = Bus Number
        Byte 5, bits 7:3 = Device Number
        Byte 5, bits 2:0 = Function Number

Address[31:2] (Byte 8 bits 7:0 through Byte 11 bits 7:2)
    The upper 30 bits of the 32-bit start address for the IO transfer. The lower two bits of the
    32-bit address are reserved (00b), forcing the start address to be DW aligned.
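
Pulling the fixed values of Table 4-6 together, the sketch below builds the 12 bytes of a 3DW IO
read request header. It is illustrative only: the byte ordering follows the header drawings in
this chapter, and the caller supplies the Requester ID, Tag, DW-aligned IO address, and First DW
Byte Enables.

    #include <stdint.h>
    #include <string.h>

    static void build_io_read_hdr(uint8_t hdr[12], uint16_t requester_id,
                                  uint8_t tag, uint32_t io_addr, uint8_t first_be)
    {
        memset(hdr, 0, 12);                    /* TC, Attr, TD, EP all 0 for IO requests  */
        hdr[0]  = 0x02;                        /* Fmt = 00b (3DW, no data), Type = 0 0010b */
        hdr[3]  = 0x01;                        /* Length = 1 DW                            */
        hdr[4]  = (uint8_t)(requester_id >> 8);/* Requester Bus Number                     */
        hdr[5]  = (uint8_t)requester_id;       /* Requester Device/Function                */
        hdr[6]  = tag;                         /* Tag                                      */
        hdr[7]  = first_be & 0x0F;             /* First DW BE; Last DW BE must be 0000b    */
        hdr[8]  = (uint8_t)(io_addr >> 24);    /* Address[31:2], DW aligned                */
        hdr[9]  = (uint8_t)(io_addr >> 16);
        hdr[10] = (uint8_t)(io_addr >> 8);
        hdr[11] = (uint8_t)(io_addr & 0xFC);   /* low two address bits reserved = 00b      */
    }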

Memory Requests

PCI Express memory transactions include two classes: Read Request/Completion and Write Request.
Figure 4-8 on page 175 depicts the system memory map and the 3DW and 4DW memory request packet
formats. When requesting memory data transfers, it is important to remember that memory
transactions are never permitted to cross 4KB boundaries.

Figure 4-8. 3DW And 4DW Memory Request Header Formats

Description of 3DW And 4DW Memory Request Header Fields

The location and use of each field in a 4DW memory request header is listed in Table 4-7 on
page 176.

Note: The difference between a 3DW header and a 4DW header is the location and size of the
starting Address field:

For a 3DW header (32 bit addressing): Address bits 31:2 are in Bytes 8-11, and 12-15 are
not used.

For a 4DW header (64 bit addressing): Address bits 31:2 are in Bytes 12-15, and address
bits 63:32 are in Bytes 8-11.

Otherwise the header fields are the same.
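
A one-line sketch of the header-size decision implied by the note above (the helper name is
illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Addresses that fit in 32 bits use the 3DW header; addresses above the
     * 4GB boundary require 64-bit addressing and the 4DW header. */
    static bool needs_4dw_header(uint64_t addr)
    {
        return (addr >> 32) != 0;
    }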

Table 4-7. 4DW Memory Request Header Fields

Length[9:0] (Byte 3 bits 7:0, Byte 2 bits 1:0)
    TLP data payload transfer size, in DW. The field is 10 bits, so the maximum transfer size is
    2^10 = 1024 DW (4KB). Encoding:
        00 0000 0001b = 1 DW
        00 0000 0010b = 2 DW
        ...
        11 1111 1111b = 1023 DW
        00 0000 0000b = 1024 DW

Attr (Attributes) (Byte 2 bits 5:4)
    Bit 5 = Relaxed Ordering. When set = 1, PCI-X relaxed ordering is enabled for this TLP. If set
    = 0, strict PCI ordering is used.
    Bit 4 = No Snoop. When set = 1, the requester is indicating that no host cache coherency
    issues exist with respect to this TLP, and system hardware is not required to cause a
    processor cache snoop for coherency. When set = 0, PCI-type cache snoop protection is
    required.

EP (Poisoned Data) (Byte 2 bit 6)
    If set = 1, the data accompanying this TLP should be considered invalid, although the
    transaction is allowed to complete normally.

TD (TLP Digest Field Present) (Byte 2 bit 7)
    If set = 1, the optional 1 DW TLP Digest field is included with this TLP. Some rules:
        Presence of the Digest field must be checked by all receivers (using this bit).
        A TLP with TD = 1 but no Digest field is handled as a Malformed TLP.
        If a device supports checking ECRC and TD = 1, it must perform the ECRC check.
        If a device at the ultimate destination does not support checking ECRC (optional), it must
        ignore the Digest field.

TC (Traffic Class) (Byte 1 bits 6:4)
    These three bits encode the traffic class to be applied to this TLP and to the completion
    associated with it (if any):
        000b = Traffic Class 0 (default)
        ...
        111b = Traffic Class 7
    TC 0 is the default class, and TC 1-7 are used in providing differentiated services. See
    "Traffic Classes and Virtual Channels" on page 256 for additional information.

Type[4:0] (Byte 0 bits 4:0)
    TLP packet Type field:
        0 0000b = Memory Read or Write
        0 0001b = Memory Read Locked
    The Type field is used with the Fmt[1:0] field to specify transaction type, header size, and
    whether a data payload is present.

Fmt[1:0] (Format) (Byte 0 bits 6:5)
    Packet format:
        00b = Memory Read (3DW, no data)
        10b = Memory Write (3DW, with data)
        01b = Memory Read (4DW, no data)
        11b = Memory Write (4DW, with data)

First DW Byte Enables (Byte 7 bits 3:0)
    These high-true bits map one-to-one to qualify bytes within the first DW of payload.

Last DW Byte Enables (Byte 7 bits 7:4)
    These high-true bits map one-to-one to qualify bytes within the last DW transferred.

Tag[7:0] (Byte 6 bits 7:0)
    These bits are used to identify each outstanding request issued by the requester. As
    non-posted requests are sent, the next sequential tag is assigned. By default only bits 4:0
    are used (32 outstanding transactions at a time); if the Extended Tag bit in the PCI Express
    Control Register is set = 1, all 8 bits may be used (256 tags).

Requester ID[15:0] (Byte 4 bits 7:0, Byte 5 bits 7:0)
    Identifies the requester so a completion may be returned, etc.
        Byte 4, bits 7:0 = Bus Number
        Byte 5, bits 7:3 = Device Number
        Byte 5, bits 2:0 = Function Number

Address[31:2] (Byte 12 bits 7:0 through Byte 15 bits 7:2)
    The lower 32 bits of the 64-bit start address for the memory transfer. The lower two bits of
    the address are reserved (00b), forcing the start address to be DW aligned.

Address[63:32] (Byte 8 bits 7:0 through Byte 11 bits 7:0)
    The upper 32 bits of the 64-bit start address for the memory transfer.

Memory Request Notes

Features of memory requests include:

1. Memory transfers are never permitted to cross a 4KB boundary.

2. All memory-mapped writes are posted, resulting in much higher performance.

3. Either 32-bit or 64-bit addressing may be used. The 3DW header format supports 32-bit
addresses and the 4DW header supports 64-bit addresses.

4. The full capability of burst transfers is available, with transfer lengths of up to 1024 DW
(4KB).

5. Advanced PCI Express Quality of Service features, including up to 8 Traffic Classes and
Virtual Channels, may be implemented.

6. The No Snoop attribute bit in the header may be set = 1, relieving the system hardware of the
burden of snooping processor caches when PCI Express transactions target main memory. Optionally,
the bit may be left clear in the packet, providing PCI-like cache coherency protection.

7. The Relaxed Ordering bit may also be set = 1, permitting devices in the path to the packet's
destination to apply the relaxed ordering rules available in PCI-X. If the bit is clear, strong
PCI producer-consumer ordering is enforced.

Configuration Requests

To maintain compatibility with PCI, PCI Express supports both Type 0 and Type 1 configuration
cycles. A Type 1 cycle propagates downstream until it reaches the bridge interface hosting the
bus (link) that the target device resides on. The configuration transaction is converted on the
destination link from Type 1 to Type 0 by the bridge. The bridge forwards and converts
configuration cycles using previously programmed Bus Number registers that specify its
primary, secondary, and subordinate buses. Refer to the "PCI-Compatible Configuration
Mechanism" on page 723 for a discussion of routing these transactions.

Figure 4-9 on page 180 illustrates a Type 1 configuration cycle making its way downstream. At
the destination link, it is converted to Type 0 and claimed by the endpoint device. Note that
unlike PCI, only one device (other than the bridge) resides on a link. For this reason, no IDSEL
or other hardware indication is required to instruct the device to claim the Type 0 cycle; any
Type 0 configuration cycle a device sees on its primary link will be claimed.

Figure 4-9. 3DW Configuration Request And Header Format
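
The conversion decision described above can be sketched with the bridge's Secondary and
Subordinate Bus Number registers (the function and enum names are illustrative; the Type field
change is the 0 0101b to 0 0100b conversion listed in Table 4-5):

    #include <stdint.h>

    enum cfg_action { CONVERT_TO_TYPE0, FORWARD_TYPE1, DO_NOT_FORWARD };

    static enum cfg_action handle_type1_cfg(uint8_t secondary, uint8_t subordinate,
                                            uint8_t target_bus)
    {
        if (target_bus == secondary)
            return CONVERT_TO_TYPE0;    /* clear bit 0 of Type: 0 0101b becomes 0 0100b */
        if (target_bus > secondary && target_bus <= subordinate)
            return FORWARD_TYPE1;       /* target is on a link further downstream       */
        return DO_NOT_FORWARD;          /* target is not behind this bridge interface   */
    }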

Definitions Of Configuration Request Header Fields

Table 4-8 on page 181 describes the location and use of each field in the configuration request
header illustrated in Figure 4-9 on page 180.

Table 4-8. Configuration Request Header Fields

Length[9:0] (Byte 3 bits 7:0, Byte 2 bits 1:0)
    Indicates data payload size in DW. For configuration requests this field is always = 1; Byte
    Enables are used to qualify bytes within the DW (any combination is legal).

Attr[1:0] (Attributes) (Byte 2 bits 5:4)
    Attribute 1: Relaxed Ordering bit. Attribute 0: No Snoop bit. Both of these bits are always
    = 0 in configuration requests.

EP (Byte 2 bit 6)
    If = 1, indicates the data payload (if present) is poisoned.

TD (Byte 2 bit 7)
    If = 1, indicates the presence of a Digest field (1 DW) at the end of the TLP (preceding LCRC
    and END).

TC[2:0] (Traffic Class) (Byte 1 bits 6:4)
    Indicates the Traffic Class for the packet. TC is = 0 for all configuration requests.

Type[4:0] (Byte 0 bits 4:0)
    TLP packet Type field. Set to:
        0 0100b = Type 0 configuration request
        0 0101b = Type 1 configuration request

Fmt[1:0] (Format) (Byte 0 bits 6:5)
    Packet format. Always a 3DW header:
        00b = configuration read (no data)
        10b = configuration write (with data)

First DW Byte Enables (Byte 7 bits 3:0)
    These high-true bits map one-to-one to qualify bytes within the DW payload. For configuration
    requests, any bit combination is valid (including none).

Last DW Byte Enables (Byte 7 bits 7:4)
    These high-true bits map one-to-one to qualify bytes within the last DW transferred. For
    configuration requests, these bits must be 0000b (single DW).

Tag[7:0] (Byte 6 bits 7:0)
    These bits are used to identify each outstanding request issued by the requester. As
    non-posted requests are sent, the next sequential tag is assigned. By default only bits 4:0
    are used (32 outstanding transactions at a time); if the Extended Tag bit in the PCI Express
    Control Register is set = 1, all 8 bits may be used (256 tags).

Requester ID[15:0] (Byte 4 bits 7:0, Byte 5 bits 7:0)
    Identifies the requester so a completion may be returned, etc.
        Byte 4, bits 7:0 = Bus Number
        Byte 5, bits 7:3 = Device Number
        Byte 5, bits 2:0 = Function Number

Register Number (Byte 11 bits 7:2)
    These bits provide the lower 6 bits of the DW configuration space offset. The Register Number
    is used in conjunction with the Ext Register Number to provide the full 10 bits of offset
    needed for the 1024 DW (4096 byte) PCI Express configuration space.

Ext Register Number (Extended Register Number) (Byte 10 bits 3:0)
    These bits provide the upper 4 bits of the DW configuration space offset. The Ext Register
    Number is used in conjunction with the Register Number to provide the full 10 bits of offset
    needed for the 1024 DW (4096 byte) PCI Express configuration space. For compatibility, this
    field can be set = 0, and only the lower 64 DW (256 bytes) will be seen when indexing with the
    Register Number.

Completer ID[15:0] (Byte 8 bits 7:0, Byte 9 bits 7:0)
    Identifies the completer being accessed with this configuration cycle. The Bus and Device
    Numbers in this field are "captured" by the device on each Type 0 configuration write.
        Byte 8, bits 7:0 = Bus Number
        Byte 9, bits 7:3 = Device Number
        Byte 9, bits 2:0 = Function Number

Configuration Request Notes

Configuration requests always use the 3DW header format and are routed by the contents of
the ID field.

All devices "capture" the Bus Number and Device Number information provided by the upstream device during each Type 0 configuration write cycle. This information is carried in Bytes 8-9 (the Completer ID field) of the configuration request.
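
As an informal illustration of the field locations above, the C sketch below packs a 3DW Type 0 configuration read header into its 12 header bytes. The byte positions follow Table 4-8; the function name and the byte-array representation are illustrative assumptions only, not anything defined by the specification.

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal sketch: pack a 3DW Type 0 Configuration Read header into the
       12 header bytes described in Table 4-8. Field positions follow the
       table; names such as cfg_rd0_header() are illustrative only. */
    static void cfg_rd0_header(uint8_t hdr[12],
                               uint16_t requester_id, uint8_t tag,
                               uint16_t completer_id, uint16_t dw_offset)
    {
        hdr[0]  = 0x04;                    /* Fmt = 00b (3DW, no data), Type = 00100b */
        hdr[1]  = 0x00;                    /* TC = 0 for configuration requests       */
        hdr[2]  = 0x00;                    /* TD = 0, EP = 0, Attr = 00b, Length[9:8] */
        hdr[3]  = 0x01;                    /* Length = 1 DW (always, for config)      */
        hdr[4]  = requester_id >> 8;       /* Requester Bus Number                    */
        hdr[5]  = requester_id & 0xFF;     /* Requester Device/Function               */
        hdr[6]  = tag;                     /* Tag identifying this outstanding request*/
        hdr[7]  = 0x0F;                    /* Last BE = 0000b, First BE = 1111b       */
        hdr[8]  = completer_id >> 8;       /* Completer Bus Number                    */
        hdr[9]  = completer_id & 0xFF;     /* Completer Device/Function               */
        hdr[10] = (dw_offset >> 6) & 0x0F; /* Ext Register Number (offset bits 9:6)   */
        hdr[11] = (dw_offset & 0x3F) << 2; /* Register Number (offset bits 5:0)       */
    }

    int main(void)
    {
        uint8_t hdr[12];
        /* Read DW 0 (Vendor/Device ID) of bus 1, device 0, function 0. */
        cfg_rd0_header(hdr, 0x0000, 0x01, 0x0100, 0);
        for (int i = 0; i < 12; i++)
            printf("byte %2d = 0x%02X\n", i, hdr[i]);
        return 0;
    }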

Completions

Completions are returned following each non-posted request:

Memory Read request may result in completion with data (CplD)

IO Read request may result in a completion with data (CplD) or, if an error occurred, a completion without data (Cpl)

IO Write request may result in a completion without data (Cpl)

Configuration Read request may result in a completion with data (CplD)

Configuration Write request may result in a completion without data (Cpl)


Many of the fields in the completion must have the same values as the associated request,
including Traffic Class, Attribute bits, and the original Requester ID which is used to route the
completion back to the original requester. Figure 4-10 on page 184 depicts a completion
returning after a non-posted request, as well as the 3DW completion header format.

Figure 4-10. 3DW Completion Header Format

Definitions Of Completion Header Fields

Table 4-9 on page 185 describes the location and use of each field in a completion header.

Table 4-9. Completion Header Fields

Field Name (Header Byte/Bit): Function

Length 9:0 (Byte 3, Bit 7:0; Byte 2, Bit 1:0): Indicates the data payload size in DW. For completions, this field reflects the size of the data payload associated with this completion.

Attr 1:0 (Attributes) (Byte 2, Bit 5:4): Attribute 1 is the Relaxed Ordering bit; Attribute 0 is the No Snoop bit. For a completion, both of these bits are set to the same state as in the request.

EP (Byte 2, Bit 6): If = 1, indicates the data payload is poisoned.

TD (Byte 2, Bit 7): If = 1, indicates the presence of a digest field (1 DW) at the end of the TLP (preceding LCRC and END).

TC 2:0 (Traffic Class) (Byte 1, Bit 6:4): Indicates the traffic class for the packet. For a completion, TC is set to the same value as in the request.

Type 4:0 (Byte 0, Bit 4:0): TLP packet type field. Always set to 01010b for a completion.

Fmt 1:0 (Format) (Byte 0, Bit 6:5): Packet format. Always a 3DW header: 00b = completion without data (Cpl); 10b = completion with data (CplD).

Byte Count (Byte 6, Bit 3:0; Byte 7, Bit 7:0): This is the remaining byte count until a read request is satisfied. Generally, it is derived from the original request Length field. See "Data Returned For Read Requests:" on page 188 for special cases caused by multiple completions.

BCM (Byte Count Modified) (Byte 6, Bit 4): Set = 1 only by PCI-X completers. Indicates that the Byte Count field (see previous field) reflects the first transfer payload rather than the total payload remaining. See "Using The Byte Count Modified Bit" on page 188.

CS 2:0 (Completion Status Code) (Byte 6, Bit 7:5): These bits are encoded by the completer to indicate success in fulfilling the request: 000b = Successful Completion (SC); 001b = Unsupported Request (UR); 010b = Configuration Request Retry Status (CRS); 100b = Completer Abort (CA); others reserved. See "Summary of Completion Status Codes:" on page 187.

Completer ID 15:0 (Byte 4, Bit 7:0; Byte 5, Bit 7:0): Identifies the completer. While not needed for routing a completion, this information may be useful when debugging bus traffic. Byte 4, 7:0 = Completer Bus Number; Byte 5, 7:3 = Completer Device Number; Byte 5, 2:0 = Completer Function Number.

Lower Address 6:0 (Byte 11, Bit 6:0): The lower 7 bits of the address for the first enabled byte of data returned with a read. Calculated from the request Length and Byte Enables, it is used to determine the next legal Read Completion Boundary. See "Calculating Lower Address Field" on page 187.

Tag 7:0 (Byte 10, Bit 7:0): These bits are set to reflect the Tag received with the request. The requester uses them to associate the inbound completion with an outstanding request.

Requester ID 15:0 (Byte 8, Bit 7:0; Byte 9, Bit 7:0): Copied from the request into this field to be used in routing the completion back to the original requester. Byte 8, 7:0 = Requester Bus Number; Byte 9, 7:3 = Requester Device Number; Byte 9, 2:0 = Requester Function Number.

Summary of Completion Status Codes

(Refer to the Completion Status field in Table 4-9 on page 185).

000b (SC) Successful Completion code indicates the original request completed properly
at the target.

001b (UR) Unsupported Request code indicates the original request failed at the target
because it targeted an unsupported address, was an unsupported request type, etc. This is
handled as an uncorrectable error. See "Unsupported Request" on page 365 for details.

010b (CRS) Configuration Request Retry Status indicates the target was temporarily off-line
and the attempt should be retried (e.g., due to an initialization delay after reset).

100b (CA) Completer Abort code indicates that the completer is off-line due to an error (much
like a target abort in PCI). The error will be logged and handled as an uncorrectable error.
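
A minimal C sketch of how receiving logic might map the 3-bit Completion Status field to the cases just listed. The enum and function names are illustrative assumptions, not part of the specification.

    #include <stdio.h>

    /* Completion Status codes from Table 4-9 (reserved codes alias to UR). */
    enum cpl_status { CPL_SC = 0x0, CPL_UR = 0x1, CPL_CRS = 0x2, CPL_CA = 0x4 };

    /* Illustrative decode helper (name is an assumption, not spec-defined). */
    static const char *cpl_status_name(unsigned cs)
    {
        switch (cs & 0x7) {
        case CPL_SC:  return "Successful Completion";
        case CPL_CRS: return "Configuration Request Retry Status";
        case CPL_CA:  return "Completer Abort";
        case CPL_UR:  /* fall through */
        default:      return "Unsupported Request (or reserved, treated as UR)";
        }
    }

    int main(void)
    {
        for (unsigned cs = 0; cs < 8; cs++)
            printf("%u -> %s\n", cs, cpl_status_name(cs));
        return 0;
    }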

Calculating The Lower Address Field (Byte 11, bits 6:0)

Refer to the Lower Address field in Table 4-9 on page 185. The Lower Address field is set up
by the completer during completions with data (CplD) to reflect the address of the first enabled
byte of data being returned in the completion payload. This must be calculated in hardware by
considering both the DW start address and the byte enable pattern in the First DW Byte Enable
field provided in the original request. Basically, the address is an offset from the DW start
address:

If the First DW Byte Enable field is 1111b, all bytes are enabled in the first DW and the
offset is 0. The byte start address is = DW start address.

If the First DW Byte Enable field is 1110b, the upper three bytes are enabled in the first
DW and the offset is 1. The byte start address is = DW start address + 1.

If the First DW Byte Enable field is 1100b, the upper two bytes are enabled in the first DW
and the offset is 2. The byte start address is = DW start address + 2.

If the First DW Byte Enable field is 1000b, only the upper byte is enabled in the first DW
and the offset is 3. The byte start address is = DW start address + 3.

Once calculated, the lower 7 bits are placed in the Lower Address field of the completion
header in the event the start address was not aligned on a Read Completion Boundary (RCB)
and the read completion must break off at the first RCB. Knowledge of the RCB is necessary
because breaking a transaction must be done on RCBs which are based on start address--not
transfer size.
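
The offset rule above reduces to counting the trailing zero bits of the First DW Byte Enable pattern. A minimal C sketch follows; the function name is an illustrative assumption.

    #include <stdint.h>
    #include <stdio.h>

    /* Compute the 7-bit Lower Address field for a CplD from the DW-aligned
       start address and the First DW Byte Enables (see Table 4-9). The offset
       within the first DW is the number of trailing zeros in the BE pattern. */
    static uint8_t lower_address(uint64_t dw_start_addr, uint8_t first_dw_be)
    {
        unsigned offset = 0;
        if (first_dw_be != 0)                 /* all-zero BEs: no byte enabled */
            while (((first_dw_be >> offset) & 1) == 0)
                offset++;                     /* 1111b->0, 1110b->1, 1100b->2, 1000b->3 */
        return (uint8_t)((dw_start_addr + offset) & 0x7F);  /* lower 7 bits only */
    }

    int main(void)
    {
        printf("0x%02X\n", lower_address(0x1000, 0xF)); /* prints 0x00 */
        printf("0x%02X\n", lower_address(0x1000, 0xC)); /* prints 0x02 */
        return 0;
    }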

Using The Byte Count Modified Bit

Refer to the Byte Count Modified Bit in Table 4-9 on page 185. This bit is only set by a PCI-X
completer (e.g. a bridge from PCI Express to PCI-X) in a particular circumstance. Rules for its
assertion include:

1. It is only set = 1 by a PCI-X completer if a read request is going to be broken into multiple completions.

The BCM bit is only set for the first completion of the series. It is set to indicate that the first completion contains a Byte Count field that reflects the first completion payload rather than the total remaining (as it would in normal PCI Express protocol). The receiver then recognizes that the completion will be followed by others to satisfy the original request as required.

For the second and any other completions in the series, the BCM bit must be deasserted and the Byte Count field will reflect the total remaining count, just as in normal PCI Express protocol.

PCI Express devices receiving completions with the BCM bit set must interpret this case properly.

Data Returned For Read Requests:

1. Completions for read requests may be broken into multiple completions, but the total data transfer must equal the size of the original request.

Completions for multiple requests may not be combined.

IO and Configuration reads are always 1 DW, so they will always be satisfied with a single completion.

A completion with a Status Code other than SC (Successful Completion) terminates the transaction.

The Read Completion Boundary (RCB) must be observed when handling a read request with
multiple completions. The RCB is 64 bytes or 128 bytes for the root complex; the value used
should be visible in a configuration register.

Bridges and endpoints may implement a bit for selecting the RCB size (64 or 128 bytes)
under software control.

Completions that do not cross an aligned RCB boundary must complete in one transfer.

Multiple completions for a single read request must return data in increasing address order.
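
One simple completer policy that satisfies the RCB rules above is to break the returned data at every RCB-aligned address. The C sketch below illustrates that policy, assuming a 64-byte RCB; the function and variable names are illustrative, and real completers may legally return larger RCB-aligned segments.

    #include <stdint.h>
    #include <stdio.h>

    /* Split one read completion into segments that respect a Read Completion
       Boundary (RCB). Each segment ends on an RCB-aligned address except the
       last; data is returned in increasing address order. Illustrative only. */
    static void split_on_rcb(uint64_t start, uint32_t byte_count, uint32_t rcb)
    {
        uint64_t addr = start;
        uint32_t remaining = byte_count;

        while (remaining) {
            uint32_t to_boundary = rcb - (uint32_t)(addr % rcb);
            uint32_t seg = (remaining < to_boundary) ? remaining : to_boundary;
            printf("CplD: addr 0x%llx, %u bytes, byte count remaining %u\n",
                   (unsigned long long)addr, seg, remaining);
            addr += seg;
            remaining -= seg;
        }
    }

    int main(void)
    {
        split_on_rcb(0x1030, 0x100, 64);  /* unaligned start: short first segment */
        return 0;
    }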

Receiver Completion Handling Rules:

1. A completion received without a match to an outstanding request is an Unexpected Completion. It will be handled as an error.

Completions with a completion status other than Successful Completion (SC) or Configuration Request Retry Status (CRS) will be handled as an error, and the buffer space associated with them will be released.

When the Root Complex receives a CRS status during a configuration cycle, its handling of the event is not defined, except after reset, when a period is defined during which it must allow such retries.

If CRS is received for a request other than configuration, it is handled as a Malformed TLP.

Completions received with a status equal to a reserved code are treated as aliases to Unsupported Request status.

If a read completion is received with a status other than Successful Completion (SC), no data is returned with the completion, and a Cpl (or CplLk) is returned in place of a CplD (or CplDLk).

In the event multiple completions are being returned for a read request, a completion status
other than Successful Completion (SC) immediately ends the transaction. Device handling of
data received prior to the error is implementation-specific.
To maintain compatibility with PCI, a Root Complex may be required to synthesize a read value of all 1s (FFFFFFFFh) when a configuration read ends with a completion indicating an Unsupported Request. (This is analogous to the master abort that occurs when PCI enumeration probes devices which are not in the system.)

Message Requests

Message requests replace many of the interrupt, error, and power management sideband
signals used on earlier bus protocols. All message requests use the 4DW header format, and
are handled much the same as posted memory write transactions. Messages may be routed
using address, ID, or implicit routing. The routing subfield in the packet header indicates the
routing method to apply, and which additional header registers are in use (address registers,
etc.). Figure 4-11 on page 190 depicts the message request header format.

Figure 4-11. 4DW Message Request Header Format

Definitions Of Message Request Header Fields

Table 4-10 on page 191 describes the location and use of each field in a message request
header.

Table 4-10. Message Request Header Fields


Field Name (Header Byte/Bit): Function

Length 9:0 (Byte 3, Bit 7:0; Byte 2, Bit 1:0): Indicates the data payload size in DW. For message requests, this field is always 0 (no data) or 1 (one DW of data).

Attr 1:0 (Attributes) (Byte 2, Bit 5:4): Attribute 1 is the Relaxed Ordering bit; Attribute 0 is the No Snoop bit. Both of these bits are always = 0 in message requests.

EP (Byte 2, Bit 6): If = 1, indicates the data payload (if present) is poisoned.

TD (Byte 2, Bit 7): If = 1, indicates the presence of a digest field (1 DW) at the end of the TLP (preceding LCRC and END).

TC 2:0 (Traffic Class) (Byte 1, Bit 6:4): Indicates the traffic class for the packet. TC is = 0 for all message requests.

Type 4:0 (Byte 0, Bit 4:0): TLP packet type field. Bits 4:3 are set to 10b (Msg). Bits 2:0 form the message routing subfield: 000b = routed to Root Complex; 001b = routed by address; 010b = routed by ID; 011b = Root Complex broadcast message; 100b = local, terminate at receiver; 101b = gathered and routed to Root Complex; others = reserved.

Fmt 1:0 (Format) (Byte 0, Bit 6:5): Packet format. Always a 4DW header: 01b = message request without data; 11b = message request with data.

Message Code 7:0 (Byte 7, Bit 7:0): This field contains the code indicating the type of message being sent: 0000 0000b = Unlock message; 0001 xxxxb = Power Management message; 0010 0xxxb = INTx message; 0011 00xxb = Error message; 0100 xxxxb = Hot Plug message; 0101 0000b = Slot Power Limit message; 0111 1110b = Vendor-Defined Type 0 message; 0111 1111b = Vendor-Defined Type 1 message.

Tag 7:0 (Byte 6, Bit 7:0): Because all message requests are posted, no tag is assigned to them. These bits should be = 0.

Requester ID 15:0 (Byte 4, Bit 7:0; Byte 5, Bit 7:0): Identifies the requester sending the message. Byte 4, 7:0 = Requester Bus Number; Byte 5, 7:3 = Requester Device Number; Byte 5, 2:0 = Requester Function Number.

Address 63:32 (Byte 8, Bit 7:0 through Byte 11, Bit 7:0): If address routing was selected for the message (see the Type 4:0 field above), this field contains the upper 32 bits of the 64-bit starting address. Otherwise, this field is not used.

Address 31:2 (Byte 12, Bit 7:0 through Byte 15, Bit 7:2): If address routing was selected for the message (see the Type 4:0 field above), this field contains the lower portion of the 64-bit starting address. Otherwise, this field is not used.

Message Notes

The following tables specify the message coding used for each of the seven message groups, based on the Message Code field listed in Table 4-10 on page 191. The defined groups include:

1. INTx Interrupt Signaling

2. Power Management

3. Error Signaling

4. Lock Transaction Support

5. Slot Power Limit Support

6. Vendor-Defined Messages

7. Hot Plug Signaling

INTx Interrupt Signaling

While many devices are capable of using the PCI 2.3 Message Signaled Interrupt (MSI)
method of delivering interrupts, some devices may not support it. PCI Express defines a virtual
wire alternative in which devices simulate the assertion and deassertion of the INTx (INTA-
INTD) interrupt signals seen in PCI-based systems. Basically, a message is sent to inform the
upstream device an interrupt has been asserted. After servicing, the device which sent the
interrupt sends a second message indicating the virtual interrupt signal is being released. Refer
to the "Message Signaled Interrupts" on page 331 for details. Table 4-11 summarizes the INTx
message coding at the packet level.

Table 4-11. INTx Interrupt Signaling Message Coding

INTx Message Message Code 7:0 Routing 2:0

Assert_INTA 0010 0000b 100b

Assert_INTB 0010 0001b 100b

Assert_INTC 0010 0010b 100b

Assert_INTD 0010 0011b 100b

Deassert_INTA 0010 0100b 100b

Deassert_INTB 0010 0101b 100b

Deassert_INTC 0010 0110b 100b

Deassert_INTD 0010 0111b 100b

Other INTx Rules


1. The INTx Message type does not include a data payload. The Length field is
reserved.

Assert_INTx and Deassert_INTx messages are only issued by upstream ports. Checking for violations of this rule is optional. If checked, a violating TLP is handled as a Malformed TLP.

These messages are required to use the default traffic class, TC0. Receivers must check for
violation of this rule (handled as Malformed TLPs).

Components at both ends of the link must track the current state of the four virtual interrupts.
If the logical state of one of the interrupts changes at the upstream port, the port must send the
appropriate INTx message to the downstream port on the same link.

INTx signaling is disabled when the Interrupt Disable bit of the Command Register is set = 1
(just as it would be if physical interrupt lines are used).

If any virtual INTx signals are active when the Interrupt Disable bit is set in the device, the
device must transmit a corresponding Deassert_INTx message onto the link.

Switches must track the state of the four INTx signals independently for each downstream
port and combine the states for the upstream link.

The Root Complex must track the state of the four INTx lines independently and convert
them into system interrupts in a system-specific way.

Because of switches in the path, the Requester ID in an INTx message may be that of the last transmitter, not the original requester.
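
A minimal C sketch of the switch tracking rule above: the four virtual interrupts are ORed across downstream ports, and an Assert_INTx or Deassert_INTx message is forwarded upstream only when the combined state changes. The names, array sizes, and structure are illustrative assumptions, not a specification-defined implementation.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_DOWNSTREAM_PORTS 4

    /* Per-downstream-port state of the four virtual interrupt wires (INTA..INTD). */
    static bool intx_state[NUM_DOWNSTREAM_PORTS][4];
    /* Last combined state reported on the upstream link. */
    static bool upstream_state[4];

    /* Called when an Assert_INTx/Deassert_INTx message arrives on a downstream
       port. Recomputes the OR across ports and emits the corresponding message
       upstream only when the combined state changes (illustrative only).      */
    static void intx_message_received(int port, int intx, bool asserted)
    {
        intx_state[port][intx] = asserted;

        bool combined = false;
        for (int p = 0; p < NUM_DOWNSTREAM_PORTS; p++)
            combined = combined || intx_state[p][intx];

        if (combined != upstream_state[intx]) {
            upstream_state[intx] = combined;
            printf("send %s_INT%c upstream\n",
                   combined ? "Assert" : "Deassert", 'A' + intx);
        }
    }

    int main(void)
    {
        intx_message_received(0, 0, true);   /* Assert_INTA goes upstream        */
        intx_message_received(1, 0, true);   /* no change seen upstream          */
        intx_message_received(0, 0, false);  /* still asserted by port 1         */
        intx_message_received(1, 0, false);  /* Deassert_INTA goes upstream      */
        return 0;
    }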

Power Management Messages

PCI Express is compatible with PCI power management, and adds the PCI Express active link
management mechanism. Refer to Chapter 16, entitled "Power Management," on page 567 for
a description of power management. Table 4-12 on page 194 summarizes the four power
management message types.

Table 4-12. Power Management Message Coding

Power Management Message Message Code 7:0 Routing 2:0

PM_Active_State_Nak 0001 0100b 100b

PM_PME 0001 1000b 000b


PM_Turn_Off 0001 1001b 011b

PME_TO_Ack 0001 1011b 101b

Other Power Management Message Rules

1. Power Management Message type does not include a data payload. The Length field
is reserved.

These messages are required to use the default traffic class, TC0. Receivers must check for
violation of this rule (handled as Malformed TLPs).

PM_PME is sent upstream by the component requesting a power management event.

PM_Turn_Off is broadcast downstream.

PME_TO_Ack is sent upstream by an endpoint. For a switch with devices attached to multiple downstream ports, this message is not sent upstream until it has first been received from all downstream ports.

Error Messages

Error messages are sent upstream by enabled devices that detect correctable, non-fatal uncorrectable, and fatal uncorrectable errors. The device detecting the error is identified by the Requester ID field in the message header. Table 4-13 on page 195 describes the three error message types.

Table 4-13. Error Message Coding

Error Message Message Code 7:0 Routing 2:0

ERR_COR 0011 0000b 000b

ERR_NONFATAL 0011 0001b 000b

ERR_FATAL 0011 0011b 000b

Other Error Signaling Message Rules

1. These messages are required to use the default traffic class, TC0. Receivers must
check for violation of this rule (handled as Malformed TLPs).
This message type does not include a data payload. The Length field is reserved.

The Root Complex converts error messages into system-specific events.

Unlock Message

The Unlock message is sent to a completer to release it from lock as part of the PCI Express
Locked Transaction sequence. Table 4-14 on page 196 summarizes the coding for this
message.

Table 4-14. Unlock Message Coding

Unlock Message Message Code 7:0 Routing 2:0

Unlock 0000 0000b 011b

Other Unlock Message Rules

1. These messages are required to use the default traffic class, TC0. Receivers must
check for violation of this rule (handled as Malformed TLPs).

This message type does not include a data payload. The Length field is reserved.

Slot Power Limit Message

This message is sent from a downstream switch or Root Complex port to the upstream port of
the device attached to it. It conveys a slot power limit which the downstream device then copies
into the Device Capabilities Register for its upstream port. Table 4-15 summarizes the coding
for this message.

Table 4-15. Slot Power Limit Message Coding

Slot Power Limit Message Message Code 7:0 Routing 2:0

Set_Slot_Power_Limit 0101 0000b 100b

Other Set_Slot_Power_Limit Message Rules

1. These messages are required to use the default traffic class, TC0. Receivers must check for violation of this rule (handled as Malformed TLPs).

This message type carries a data payload of 1 DW. The Length field is set = 1. Only the lower 10 bits of the 32-bit data payload are used for slot power scaling; the upper bits in the data payload must be set = 0.

This message is sent automatically anytime the link transitions to DL_Up status or if a
configuration write to the Slot Capabilities Register occurs when the Data Link Layer reports
DL_Up status.

If a card in a slot consumes less power than the power limit specified for the card/form
factor, it may ignore the message.

Hot Plug Signaling Message

These messages are passed between downstream ports of switches and Root Ports that
support Hot Plug Event signaling. Table 4-16 summarizes the Hot Plug message types.

Table 4-16. Hot Plug Message Coding

Hot Plug Message Message Code 7:0 Routing 2:0

Attention_Indicator_On 0100 0001b 100b

Attention_Indicator_Blink 0100 0011b 100b

Attention_Indicator_Off 0100 0000b 100b

Power_Indicator_On 0100 0101b 100b

Power_Indicator_Blink 0100 0111b 100b

Power_Indicator_Off 0100 0100b 100b

Attention_Button_Pressed 0100 1000b 100b

Other Hot Plug Message Rules

The Attention and Power indicator messages are all driven by the switch/root complex port
to the card.

The Attention Button message is driven upstream by a slot device that implements the Attention Button.
Data Link Layer Packets
The primary responsibility of the PCI Express Data Link Layer is to assure that integrity is
maintained when TLPs move between two devices. It also has link initialization and power
management responsibilities, including tracking of the link state and passing messages and
status between the Transaction Layer above and the Physical Layer below.

In performing its role, the Data Link Layer exchanges traffic with its neighbor using Data Link
Layer Packets (DLLPs). DLLPs originate and terminate at the Data Link Layer of each device,
without involvement of the Transaction Layer. DLLPs and TLPs are interleaved on the link.
Figure 4-12 on page 198 depicts the transmission of a DLLP from one device to another.

Figure 4-12. Data Link Layer Sends A DLLP

Types Of DLLPs

There are three important groups of DLLPs used in managing a link:

1. TLP Acknowledgement Ack/Nak DLLPs

Power Management DLLPs

Flow Control Packet DLLPs


In addition, the specification defines a vendor-specific DLLP.

DLLPs Are Local Traffic

DLLPs have a simple packet format. Unlike TLPs, they carry no target information because
they are used for nearest-neighbor communications only.

Receiver handling of DLLPs

The following rules apply when a DLLP is sent from transmitter to receiver:

1. As DLLPs arrive at the receiver, they are immediately processed. They cannot be
flow controlled.

All received DLLPs are checked for errors. This includes a control symbol check at the
Physical Layer after deserialization, followed by a CRC check at the receiver Data Link Layer.
A 16 bit CRC is calculated and sent with the packet by the transmitter; the receiver calculates
its own DLLP checksum and compares it to the received value.

Any DLLPs that fail the CRC check are discarded. There are several reportable errors
associated with DLLPs.

Unlike TLPs, there is no acknowledgement protocol for DLLPs. The PCI Express specification provides time-out mechanisms which are intended to allow recovery from lost or discarded DLLPs.

Assuming no errors occur, the DLLP type is determined and it is passed to the appropriate
internal logic:

- Power Management DLLPs are passed to the device power management logic

- Flow Control DLLPs are passed to the Transaction Layer so credits may be updated.

- Ack/Nak DLLPs are routed to the Data Link Layer transmit interface so TLPs in the retry
buffer may be discarded or resent.

Sending A Data Link Layer Packet

DLLPs are assembled on the transmit side and disassembled on the receiver side of a link.
These packets originate at the Data Link Layer and are passed to the Physical Layer. There,
framing symbols are added before the packet is sent. Figure 4-13 on page 200 depicts a
generic DLLP in transit from Device B to Device A.

Figure 4-13. Generic Data Link Layer Packet Format

Fixed DLLP Packet Size: 8 Bytes

All Data Link Layer Packets consist of the following components:

1. A 1 DW core (4 bytes) consisting of the one byte Type field and three additional
bytes of attributes. The attributes vary with the DLLP type.

A 16 bit CRC value which is calculated based on the DW core contents, then appended to it.

These 6 bytes are then passed to the Physical Layer where a Start Of DLLP (SDP) control
symbol and an End Of Packet (END) control symbol are added to it. Before transmission, the
Physical Layer encodes the 8 bytes of information into eight 10-bit symbols for transmission to
the receiver.

Note that there is never a data payload with a DLLP; all information of interest is carried in the
Type and Attribute fields.
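
A minimal C sketch of the 8-byte on-the-wire layout just described. The struct is purely illustrative: in real hardware the SDP and END framing exist only as 10-bit control symbols at the Physical Layer, and the CRC algorithm itself is not shown.

    #include <stdint.h>

    /* Generic DLLP as it appears on the wire (Figure 4-13): an SDP framing
       symbol, a 4-byte core (Type byte plus three attribute bytes), a 16-bit
       CRC calculated over the core, and an END framing symbol. Sketch only. */
    struct dllp_on_wire {
        uint8_t sdp;            /* Start Of DLLP framing symbol (Physical Layer)  */
        uint8_t type;           /* DLLP Type field (see Table 4-17)               */
        uint8_t attr[3];        /* attribute bytes; meaning varies with the type  */
        uint8_t crc16[2];       /* 16-bit CRC calculated over the 4-byte core     */
        uint8_t end;            /* End Of Packet framing symbol (Physical Layer)  */
    };

    int main(void)
    {
        /* Total of 8 bytes before 8b/10b encoding into ten-bit symbols. */
        return sizeof(struct dllp_on_wire) == 8 ? 0 : 1;
    }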

DLLP Packet Types


The three groups of DLLPs are defined with a number of variants. Table 4-17 summarizes each
variant as well as their DLLP Type field coding.

Table 4-17. DLLP Packet Types

DLLP Type Type Field Encoding Purpose

Ack
0000 0000b TLP transmission integrity
(TLP Acknowledge)

Nak
0001 0000b TLP transmission integrity
(TLP No Acknowledge)

PM_Enter_L1 0010 0000b Power Management

PM_Enter_L23 0010 0001b Power Management

PM_Active_State_Request_L1 0010 0011b Power Management

PM_Request_Ack 0010 0100b Power Management

Vendor Specific 0011 0000b Vendor

InitFC1-P xxx=VC # 0100 0xxxb TLP Flow Control

InitFC1-NP xxx=VC # 0101 0xxxb TLP Flow Control

InitFC1-Cpl xxx=VC # 0110 0xxxb TLP Flow Control

InitFC2-P xxx=VC # 1100 0xxxb TLP Flow Control

InitFC2-NP xxx=VC # 1101 0xxxb TLP Flow Control

InitFC2-Cpl xxx=VC # 1110 0xxxb TLP Flow Control

UpdateFC-P xxx=VC # 1000 0xxxb TLP Flow Control

UpdateFC-NP xxx=VC # 1001 0xxxb TLP Flow Control

UpdateFC-Cpl xxx=VC # 1010 0xxxb TLP Flow Control

Reserved Others Reserved

Ack Or Nak DLLP Packet Format


The format of the DLLP used by a receiver to Ack or Nak the delivery of a TLP is illustrated in
Figure 4-14.

Figure 4-14. Ack Or Nak DLLP Packet Format

Definitions Of Ack Or Nak DLLP Fields

Table 4-18 describes the fields contained in an Ack or Nak DLLP.

Table 4-18. Ack or Nak DLLP Fields

Field Name (Header Byte/Bit): DLLP Function

AckNak_Seq_Num [11:0] (Byte 2, Bit 3:0; Byte 3, Bit 7:0): For an ACK DLLP: for good TLPs received with Sequence Number = NEXT_RCV_SEQ count (count before incrementing), use NEXT_RCV_SEQ count - 1 (count after incrementing, minus 1); for a TLP received with a Sequence Number earlier than NEXT_RCV_SEQ count (duplicate TLP), use NEXT_RCV_SEQ count - 1. For a NAK DLLP: associated with a TLP that failed the CRC check, use NEXT_RCV_SEQ count - 1; for a TLP received with a Sequence Number later than NEXT_RCV_SEQ count, use NEXT_RCV_SEQ count - 1. Upon receipt, the transmitter will purge TLPs with equal or earlier Sequence Numbers and replay the remaining TLPs.

Type 7:0 (Byte 0, Bit 7:0): Indicates the type of DLLP. For the Ack/Nak DLLPs: 0000 0000b = ACK DLLP; 0001 0000b = NAK DLLP.

16-bit CRC (Byte 4, Bit 7:0; Byte 5, Bit 7:0): 16-bit CRC used to protect the contents of this DLLP. The calculation is made on Bytes 0-3 of the ACK/NAK.
Power Management DLLP Packet Format

PCI Express power management DLLPs and TLPs replace most signals associated with power
management state changes. The format of the DLLP used for power management is illustrated
in Figure 4-15.

Figure 4-15. Power Management DLLP Packet Format

Definitions Of Power Management DLLP Fields

Table 4-19 describes the fields contained in a Power Management DLLP.

Table 4-19. Power Management DLLP Fields

Field Name (Header Byte/Bit): DLLP Function

Type 7:0 (Byte 0, Bit 7:0): This field indicates the type of DLLP. For the Power Management DLLPs: 0010 0000b = PM_Enter_L1; 0010 0001b = PM_Enter_L23; 0010 0011b = PM_Active_State_Request_L1; 0010 0100b = PM_Request_Ack.

Link CRC (Byte 4, Bit 7:0; Byte 5, Bit 7:0): 16-bit CRC sent to protect the contents of this DLLP. The calculation is made on Bytes 0-3, regardless of whether fields are used.

Flow Control Packet Format

PCI Express eliminates many of the inefficiencies of earlier bus protocols through the use of a
credit-based flow control scheme. This topic is covered in detail in Chapter 7, entitled "Flow
Control," on page 285. Three slightly different DLLPs are used to initialize the credits and to
update them as receiver buffer space becomes available. The two flow control initialization
packets are referred to as InitFC1 and InitFC2. The Update DLLP is referred to as UpdateFC.

The generic DLLP format for all three flow control DLLP variants is illustrated in Figure 4-16 on
page 205.

Figure 4-16. Flow Control DLLP Packet Format

Definitions Of Flow Control DLLP Fields

Table 4-20 on page 206 describes the fields contained in a flow control DLLP.

Table 4-20. Flow Control DLLP Fields

Field Name (Header Byte/Bit): DLLP Function

DataFC 11:0 (Byte 2, Bit 3:0; Byte 3, Bit 7:0): This field contains the credits associated with data storage. Data credits are in units of 16 bytes per credit, and are applied to the flow control counter for the virtual channel indicated in VC[2:0], and for the traffic type indicated by the code in Byte 0, Bits 7:4.

HdrFC 7:0 (Byte 1, Bit 5:0; Byte 2, Bit 7:6): This field contains the credits associated with header storage. Header credits are in units of 1 header (including digest) per credit, and are applied to the flow control counter for the virtual channel indicated in VC[2:0], and for the traffic type indicated by the code in Byte 0, Bits 7:4.

VC [2:0] (Byte 0, Bit 2:0): This field indicates the virtual channel (VC 0-7) receiving the credits.

Type 3:0 (Byte 0, Bit 7:4): This field contains a code indicating the type of FC DLLP: 0100b = InitFC1-P (Posted Requests); 0101b = InitFC1-NP (Non-Posted Requests); 0110b = InitFC1-Cpl (Completions); 1100b = InitFC2-P (Posted Requests); 1101b = InitFC2-NP (Non-Posted Requests); 1110b = InitFC2-Cpl (Completions); 1000b = UpdateFC-P (Posted Requests); 1001b = UpdateFC-NP (Non-Posted Requests); 1010b = UpdateFC-Cpl (Completions).

Link CRC (Byte 4, Bit 7:0; Byte 5, Bit 7:0): 16-bit CRC sent to protect the contents of this DLLP. The calculation is made on Bytes 0-3, regardless of whether fields are used.
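
A minimal C sketch of pulling the FC type, VC number, and credit counts out of the 4-byte flow control DLLP core, following the bit positions in Table 4-20. The function name and the example credit values are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* Decode the 4-byte core of a flow control DLLP (Table 4-20). */
    static void decode_fc_dllp(const uint8_t core[4])
    {
        unsigned fc_type = core[0] >> 4;          /* e.g. 1000b = UpdateFC-P      */
        unsigned vc      = core[0] & 0x07;        /* virtual channel 0-7          */
        unsigned hdrfc   = ((core[1] & 0x3F) << 2) | (core[2] >> 6);  /* 8 bits   */
        unsigned datafc  = ((core[2] & 0x0F) << 8) | core[3];         /* 12 bits  */

        printf("FC type %X, VC%u: %u header credits, %u data credits (x16 bytes)\n",
               fc_type, vc, hdrfc, datafc);
    }

    int main(void)
    {
        /* Example: UpdateFC-P for VC0 granting 8 header credits and 64 data credits. */
        const uint8_t example[4] = { 0x80, 0x02, 0x00, 0x40 };
        decode_fc_dllp(example);
        return 0;
    }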

Vendor Specific DLLP Format

PCI Express reserves a DLLP type for vendor specific use. Only the Type code is defined. The
Vendor DLLP is illustrated in Figure 4-17.

Figure 4-17. Vendor Specific DLLP Packet Format

Definitions Of Vendor Specific DLLP Fields

Table 4-21 on page 207 describes the fields contained in a Vendor-Specific DLLP

Table 4-21. Vendor-Specific DLLP Fields

Field Name (Header Byte/Bit): DLLP Function

Type 7:0 (Byte 0, Bit 7:0): This field contains the code indicating the Vendor-specific DLLP: 0011 0000b = Vendor-specific DLLP.

Link CRC (Byte 4, Bit 7:0; Byte 5, Bit 7:0): 16-bit CRC sent to protect the contents of this DLLP. The calculation is made on Bytes 0-3, regardless of whether fields are used.
Chapter 5. ACK/NAK Protocol
The Previous Chapter

This Chapter

The Next Chapter

Reliable Transport of TLPs Across Each Link

Elements of the ACK/NAK Protocol

ACK/NAK DLLP Format

ACK/NAK Protocol Details

Error Situations Reliably Handled by ACK/NAK Protocol

ACK/NAK Protocol Summary

Recommended Priority To Schedule Packets

Some More Examples

Switch Cut-Through Mode


The Previous Chapter
Information moves between PCI Express devices in packets. The two major classes of packets
are Transaction Layer Packets (TLPs), and Data Link Layer Packets (DLLPs). The use,
format, and definition of all TLP and DLLP packet types and their related fields were detailed in
that chapter.
This Chapter
This chapter describes a key feature of the Data Link Layer: 'reliable' transport of TLPs from
one device to another device across the Link. The use of ACK DLLPs to confirm reception of
TLPs and the use of NAK DLLPs to indicate error reception of TLPs is explained. The chapter
describes the rules for replaying TLPs in the event that a NAK DLLP is received.
The Next Chapter
The next chapter discusses Traffic Classes, Virtual Channels, and Arbitration that support
Quality of Service concepts in PCI Express implementations. The concept of Quality of Service
in the context of PCI Express is an attempt to predict the bandwidth and latency associated
with the flow of different transaction streams traversing the PCI Express fabric. The use of
QoS is based on application-specific software assigning Traffic Class (TC) values to
transactions, which define the priority of each transaction as it travels between the Requester
and Completer devices. Each TC is mapped to a Virtual Channel (VC) that is used to manage
transaction priority via two arbitration schemes called port and VC arbitration.
Reliable Transport of TLPs Across Each Link
The function of the Data Link Layer (shown in Figure 5-1 on page 210) is twofold:

'Reliable' transport of TLPs from one device to another device across the Link.

The receiver's Transaction Layer should receive TLPs in the same order that the
transmitter sent them. The Data Link Layer must preserve this order despite any
occurrence of errors that require TLPs to be replayed (retried).

Figure 5-1. Data Link Layer

The ACK/NAK protocol associated with the Data Link Layer is described with the aid of Figure
5-2 on page 211 which shows sub-blocks with greater detail. For every TLP that is sent from
one device (Device A) to another (Device B) across one Link, the receiver checks for errors in
the TLP (using the TLP's LCRC field). The receiver Device B notifies transmitter Device A on
good or bad reception of TLPs by returning an ACK or a NAK DLLP. Reception of an ACK
DLLP by the transmitter indicates that the receiver has received one or more TLP(s)
successfully. Reception of a NAK DLLP by the transmitter indicates that the receiver has
received one or more TLP(s) in error. Device A, upon receiving a NAK DLLP, then re-sends the
associated TLP(s), which will hopefully arrive at the receiver without error.

Figure 5-2. Overview of the ACK/NAK Protocol


The error checking capability in the receiver and the transmitter's ability to re-send TLPs if a
TLP is not received correctly is the core of the ACK/NAK protocol described in this chapter.

Definition: As used in this chapter, the term Transmitter refers to the device that sends TLPs.

Definition: As used in this chapter, the term Receiver refers to the device that receives TLPs.
Elements of the ACK/NAK Protocol
Figure 5-3 is a block diagram of a transmitter and a remote receiver connected via a Link. The
diagram shows all of the major Data Link Layer elements associated with reliable TLP transfer
from the transmitter's Transaction Layer to the receiver's Transaction Layer. Packet order is
maintained by the transmitter's and receiver's Transaction Layer.

Figure 5-3. Elements of the ACK/NAK Protocol

Transmitter Elements of the ACK/NAK Protocol

Figure 5-4 on page 215 illustrates the transmitter Data Link Layer elements associated with
processing of outbound TLPs and inbound ACK/NAK DLLPs.

Figure 5-4. Transmitter Elements Associated with the ACK/NAK Protocol


Replay Buffer

The replay buffer stores TLPs with all fields including the Data Link Layer-related Sequence
Number and LCRC fields. The TLPs are saved in the order of arrival from the Transaction
Layer before transmission. Each TLP in the Replay Buffer contains a Sequence Number which
is incrementally greater than the sequence number of the previous TLP in the buffer.

When the transmitter receives acknowledgement via an ACK DLLP that TLPs have reached the
receiver successfully, it purges the associated TLPs from the Replay Buffer. If, on the other
hand, the transmitter receives a NAK DLLP, it replays (i.e., re-transmits) the contents of the
buffer.

NEXT_TRANSMIT_SEQ Counter

This counter generates the Sequence Number assigned to each new transmitted TLP. The
counter is a 12-bit counter that is initialized to 0 at reset, or when the Data Link Layer is in the
inactive state. It increments until it reaches 4095 and then rolls over to 0 (i.e., it is a modulo
4096 counter).
LCRC Generator

The LCRC Generator provides a 32-bit LCRC for the TLP. The LCRC is calculated using all
fields of the TLP including the Header, Data Payload, ECRC and Sequence Number. The
receiver uses the TLP's LCRC field to check for a CRC error in the received TLP.

REPLAY_NUM Count

This 2-bit counter stores the number of replay attempts following either reception of a NAK
DLLP, or a REPLAY_TIMER time-out. When the REPLAY_NUM count rolls over from 11b to
00b, the Data Link Layer triggers a Physical Layer Link-retrain (see the description of the
LTSSM recovery state on page 532). It waits for completion of re-training before attempting to
transmit TLPs once again. The REPLAY_NUM counter is initialized to 00b at reset, or when the
Data Link Layer is inactive. It is also reset whenever an ACK is received, indicating that forward
progress is being made in transmitting TLPs.

REPLAY_TIMER Count

The REPLAY_TIMER is used to measure the time from when a TLP is transmitted until an
associated ACK or NAK DLLP is received. The REPLAY_TIMER is started (or restarted, if
already running) when the last Symbol of any TLP is sent. It restarts from 0 each time that
there are outstanding TLPs in the Replay Buffer and an ACK DLLP is received that references
a TLP still in the Replay Buffer. It resets to 0 and holds when there are no outstanding TLPs in
the Replay Buffer, when a NAK DLLP is received (restarting once the resulting replay begins),
or when the REPLAY_TIMER expires. It is not advanced (i.e., its value remains fixed)
during Link re-training.

ACKD_SEQ Count

This 12-bit register tracks or stores the Sequence Number of the most recently received ACK
or NAK DLLP. It is initialized to all 1s at reset, or when the Data Link Layer is inactive. This
register is updated with the AckNak_Seq_Num [11:0] field of a received ACK or NAK DLLP.
The ACKD_SEQ count is compared with the NEXT_TRANSMIT_SEQ count.

If (NEXT_TRANSMIT_SEQ - ACKD_SEQ) mod 4096 >= 2048, then new TLPs from the
Transaction Layer are not accepted by the Data Link Layer until this condition is no longer true. In
addition, a Data Link Layer protocol error, which is a fatal uncorrectable error, is reported. This
error condition occurs if there is a separation greater than 2047 between
NEXT_TRANSMIT_SEQ and ACKD_SEQ, i.e., a separation greater than 2047 between the
sequence number of a TLP being transmitted and that of a TLP in the replay buffer that
receives an ACK or NAK DLLP.
Also, the ACKD_SEQ count is used to check for forward progress made in transmitting TLPs.
If no forward progress is made after 3 additional replay attempts, the Link is re-trained.
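
A minimal C sketch of the modulo-4096 separation check described above; the helper name is an illustrative assumption.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* True when the separation between the next transmit Sequence Number and
       the last acknowledged Sequence Number reaches 2048 or more (mod 4096).
       In that case the Data Link Layer must stop accepting new TLPs from the
       Transaction Layer. Illustrative sketch of the rule only.                */
    static bool must_block_new_tlps(uint16_t next_transmit_seq, uint16_t ackd_seq)
    {
        return (((next_transmit_seq - ackd_seq) & 0xFFF) >= 2048);
    }

    int main(void)
    {
        printf("%d\n", must_block_new_tlps(100, 4095));   /* 0: separation 101  */
        printf("%d\n", must_block_new_tlps(2147, 99));    /* 1: separation 2048 */
        return 0;
    }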

DLLP CRC Check

This block checks for CRC errors in DLLPs returned from the receiver. Good DLLPs are further
processed. If a DLLP CRC error is detected, the DLLP is discarded and an error reported. No
further action is taken.

Definition: The Data Link Layer is in the inactive state when the Physical Layer reports that the
Link is non-operational or nothing is connected to the Port. The Physical Layer is in the non-
operational state when the Link Training and Status State Machine (LTSSM) is in the Detect,
Polling, Configuration, Disabled, Reset or Loopback states during which LinkUp = 0 (see
Chapter 14 on 'Link Initialization and Training'). While in the inactive state, the Data Link Layer
state machines are initialized to their default values and the Replay Buffer is cleared. The Data
Link Layer exits the inactive state when the Physical Layer reports LinkUp = 1 and the Link
Disable bit of the Link Control register = 0.

Receiver Elements of the ACK/NAK Protocol

Figure 5-5 on page 218 illustrates the receiver Data Link Layer elements associated with
processing of inbound TLPs and outbound ACK/NAK DLLPs.

Figure 5-5. Receiver Elements Associated with the ACK/NAK Protocol


Receive Buffer

The receive buffer temporarily stores received TLPs while TLP CRC and Sequence Number
checks are performed. If there are no errors, the TLP is processed and transferred to the
receiver's Transaction Layer. If there are errors associated with the TLP, it is discarded and a
NAK DLLP may be scheduled (more on this later in this chapter). If the TLP is a duplicate TLP
(more on this later in this chapter), it is discarded and an ACK DLLP is scheduled. If the TLP is
a 'nullified' TLP, it is discarded and no further action is taken (see "Switch Cut-Through Mode"
on page 248).

LCRC Error Check

This block checks for LCRC errors in the received TLP using the TLP's 32-bit LCRC field.

NEXT_RCV_SEQ Count
The 12-bit NEXT_RCV_SEQ counter keeps track of the next expected TLP's Sequence
Number. This counter is initialized to 0 at reset, or when the Data Link Layer is inactive. This
counter is incremented once for each good TLP received that is forwarded to the Transaction
Layer. The counter rolls over to 0 after reaching a value of 4095. The counter is not
incremented for TLPs received with CRC error, nullified TLPs, or TLPs with an incorrect
Sequence Number.

Sequence Number Check

After the CRC error check, this block verifies that a received TLP's Sequence Number matches
the NEXT_RCV_SEQ count.

If the TLP's Sequence Number = NEXT_RCV_SEQ count, the TLP is accepted, processed
and forwarded to the Transaction Layer. NEXT_RCV_SEQ count is incremented. The
receiver continues to process inbound TLPs and does not have to return an ACK DLLP
until the ACKNAK_LATENCY_TIMER expires or exceeds its set value.

If the TLP's Sequence Number is an earlier Sequence Number than NEXT_RCV_SEQ
count, with a separation of no more than 2048 from NEXT_RCV_SEQ count, the TLP
is a duplicate TLP. It is discarded and an ACK DLLP is scheduled for return to the
transmitter.

If the TLP's Sequence Number is a later Sequence Number than NEXT_RCV_SEQ count,
or for any other case other than the above two conditions, the TLP is discarded and a NAK
DLLP may be scheduled (more on this later) for return to the transmitter.
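
A minimal C sketch of the three-way decision above, using the same modulo-4096 arithmetic; the enum and function names are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    enum rcv_action { ACCEPT_TLP, DISCARD_SEND_ACK, DISCARD_SCHEDULE_NAK };

    /* Classify a received TLP's Sequence Number against NEXT_RCV_SEQ.
       Matches are accepted; earlier numbers within half the sequence space
       are duplicates (re-ACKed); everything else is NAKed. Sketch only.    */
    static enum rcv_action check_sequence(uint16_t tlp_seq, uint16_t next_rcv_seq)
    {
        uint16_t diff = (next_rcv_seq - tlp_seq) & 0xFFF;   /* modulo 4096 */

        if (diff == 0)
            return ACCEPT_TLP;          /* expected TLP: forward, increment count */
        if (diff <= 2048)
            return DISCARD_SEND_ACK;    /* duplicate: discard, schedule an ACK    */
        return DISCARD_SCHEDULE_NAK;    /* later or out of range: schedule a NAK  */
    }

    int main(void)
    {
        printf("%d\n", check_sequence(10, 10));   /* ACCEPT_TLP           */
        printf("%d\n", check_sequence(9, 10));    /* DISCARD_SEND_ACK     */
        printf("%d\n", check_sequence(11, 10));   /* DISCARD_SCHEDULE_NAK */
        return 0;
    }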

NAK_SCHEDULED Flag

The NAK_SCHEDULED flag is set when the receiver schedules a NAK DLLP to return to the
remote transmitter. It is cleared when the receiver sees the first TLP associated with the replay
of a previously-Nak'd TLP. The specification is unclear about whether the receiver should
schedule additional NAK DLLP for bad TLPs received while the NAK_SCHEDULED flag is set.
It is the authors' interpretation that the receiver must not schedule the return of additional NAK
DLLPs for subsequently received TLPs while the NAK_SCHEDULED flag remains set.

ACKNAK_LATENCY_TIMER

The ACKNAK_LATENCY_TIMER monitors the elapsed time since the last ACK or NAK DLLP
was scheduled to be returned to the remote transmitter. The receiver uses this timer to ensure
that it processes TLPs promptly and returns an ACK or a NAK DLLP when the timer expires or
exceeds its set value. The timer value is set based on a formula described in "Receivers
ACKNAK_LATENCY_TIMER" on page 237.

ACK/NAK DLLP Generator

This block generates the ACK or NAK DLLP upon command from the LCRC or Sequence
Number check block. The ACK or NAK DLLP contains an AckNak_Seq_Num[11:0] field
obtained from the NEXT_RCV_SEQ counter. ACK or NAK DLLPs contain a
AckNak_Seq_Num[11:0] value equal to NEXT_RCV_SEQ count - 1.
ACK/NAK DLLP Format
The format of an ACK or NAK DLLP is illustrated in Figure 5-6 on page 219. Table 5-1
describes the ACK or NAK DLLP fields.

Figure 5-6. Ack Or Nak DLLP Packet Format

Table 5-1. Ack or Nak DLLP Fields

Field Name (Header Byte/Bit): DLLP Function

AckNak_Seq_Num [11:0] (Byte 2, Bit 3:0; Byte 3, Bit 7:0): For an ACK DLLP: for good TLPs received with Sequence Number = NEXT_RCV_SEQ count (count before incrementing), use NEXT_RCV_SEQ count - 1 (count after incrementing, minus 1); for a TLP received with a Sequence Number earlier than NEXT_RCV_SEQ count (duplicate TLP), use NEXT_RCV_SEQ count - 1. For a NAK DLLP: associated with a TLP that fails the CRC check, use NEXT_RCV_SEQ count - 1; for a TLP received with a Sequence Number later than NEXT_RCV_SEQ count, use NEXT_RCV_SEQ count - 1. Upon receipt, the transmitter will purge TLPs with equal or earlier Sequence Numbers and replay the remaining TLPs.

Type 7:0 (Byte 0, Bit 7:0): Indicates the type of DLLP. For the Ack/Nak DLLPs: 0000 0000b = ACK DLLP; 0001 0000b = NAK DLLP.

16-bit CRC (Byte 4, Bit 7:0; Byte 5, Bit 7:0): 16-bit CRC used to protect the contents of this DLLP. The calculation is made on Bytes 0-3 of the ACK/NAK.
ACK/NAK Protocol Details
This section describes the detailed transmitter and receiver behavior in processing TLPs and
ACK/NAK DLLPs. The examples demonstrate flow of TLPs from transmitter to the remote
receiver in both the normal non-error case, as well as the error cases.

Transmitter Protocol Details

This section delves deeper into the ACK/NAK protocol. Consider the transmit side of a device's
Data Link Layer shown in Figure 5-4 on page 215.

Sequence Number

Before a transmitter sends TLPs delivered by the Transaction Layer, the Data Link Layer
appends a 12-bit Sequence Number to each TLP. The Sequence Number is generated by the
12-bit NEXT_TRANSMIT_SEQ counter. The counter is initialized to 0 at reset, or when the
Data Link Layer is in the inactive state. It increments after each new TLP is transmitted until it
reaches its maximum value of 4095, and then rolls over to 0. For each new TLP sent, the
transmitter appends the Sequence Number from the NEXT_TRANSMIT_SEQ counter.

Keep in mind that an incremented Sequence Number does not necessarily mean a greater
Sequence Number (since the counter rolls over to 0 after it reaches its maximum value of
4095).

32-Bit LCRC

The transmitter also appends a 32-bit LCRC (Link CRC) calculated based on TLP contents
which include the Header, Data Payload, ECRC and Sequence Number.

Replay (Retry) Buffer

General

Before a device transmits a TLP, it stores a copy of the TLP in a buffer associated with the
Data Link Layer referred to as the Replay Buffer (the specification uses the term Retry Buffer).
Each buffer entry stores a complete TLP with all of its fields including the Header (up to 16
bytes), an optional Data Payload (up to 4KB), an optional ECRC (up to four bytes), the
Sequence Number (12-bits wide, but occupies two bytes) and the LCRC field (four bytes). The
buffer size is unspecified. The buffer should be big enough to store transmitted TLPs that have
not yet been acknowledged via ACK DLLPs.

When the transmitter receives an ACK DLLP, it purges from the Replay Buffer TLPs with equal
to or earlier Sequence Numbers than the Sequence Number received with the ACK DLLPs.

When the transmitter receives NAK DLLPs, it purges the Replay Buffer of TLPs with Sequence
Numbers that are equal to or earlier than the Sequence Number that arrives with the NAK and
replays (re-transmits) TLPs of later Sequence Numbers (the remainder TLPs in the Replay
Buffer). This implies that a NAK DLLP inherently acknowledges TLPs with equal to or earlier
Sequence Numbers than the AckNak_Seq_Num[11:0] of the NAK DLLP and replays the
remainder TLPs in the Replay Buffer. Efficient replay strategies are discussed later.

Replay Buffer Sizing

The Replay Buffer should be large enough so that, under normal operating conditions, TLP
transmissions are not throttled due to a Replay Buffer full condition. To determine what buffer
size to implement, one must consider the following:

ACK DLLP delivery Latency from the receiver.

Delays caused by the physical Link interconnect and the Physical Layer implementations.

Receiver L0s exit latency to L0, i.e., the buffer should ideally be big enough to hold TLPs
while a Link that is in L0s returns to L0.

Transmitter's Response to an ACK DLLP

General

If the transmitter receives an ACK DLLP, it has positive confirmation that its transmitted TLP(s)
have reached the receiver successfully. The transmitter associates the Sequence Number
contained in the ACK DLLP with TLP entries contained in the Replay Buffer.

A single ACK DLLP returned by the receiver Device B may be used to acknowledge multiple
TLPs. It is not necessary that every TLP transmitted must have a corresponding ACK DLLP
returned by the remote receiver. This is done to conserve bandwidth by reducing the ACK
DLLP traffic on the bus. The receiver gathers multiple TLPs and then collectively acknowledges
them with one ACK DLLP that corresponds to the last received good TLP. In InfiniBand, this is
referred to as ACK coalescing.
The transmitter's response to reception of an ACK DLLP include:

Load ACKD_SEQ register with AckNak_Seq_Num[11:0] of the ACK DLLP.

Reset the REPLAY_NUM counter and REPLAY_TIMER to 0.

Purge the Replay Buffer as described below.

Purging the Replay Buffer

An ACK DLLP of a given Sequence Number (contained in the AckNak_Seq_Num[11:0] field)
acknowledges the receipt of a TLP with that Sequence Number in the transmitter Replay
Buffer, PLUS all TLPs with earlier Sequence Numbers. In other words, an ACK DLLP with a
given Sequence Number not only acknowledges a specific TLP in the Replay Buffer (the one
with that Sequence Number), but it also acknowledges TLPs of earlier (logically lower)
Sequence Numbers. The transmitter purges the Replay Buffer of all TLPs acknowledged by the
ACK DLLP.
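
A minimal C sketch of this purge rule, treating the Replay Buffer as a queue of sequence numbers ordered oldest-first and using modulo-4096 comparison. The types, names, and array model are illustrative assumptions only.

    #include <stdint.h>
    #include <stdio.h>

    /* Sequence number a is "at or before" b when their modulo-4096 distance
       (b - a) is less than 2048. Illustrative helper for the purge rule.     */
    static int seq_at_or_before(uint16_t a, uint16_t b)
    {
        return ((b - a) & 0xFFF) < 2048;
    }

    /* Purge every TLP whose Sequence Number is equal to or earlier than the
       AckNak_Seq_Num carried by a received ACK DLLP. The buffer is modeled
       as an array of sequence numbers, oldest entry first. Sketch only.      */
    static unsigned purge_replay_buffer(uint16_t *buf, unsigned count, uint16_t acked)
    {
        unsigned purged = 0;
        while (purged < count && seq_at_or_before(buf[purged], acked))
            purged++;
        /* Shift the surviving (unacknowledged) TLPs to the front. */
        for (unsigned i = purged; i < count; i++)
            buf[i - purged] = buf[i];
        return count - purged;            /* TLPs still awaiting an ACK */
    }

    int main(void)
    {
        uint16_t buf[] = { 4094, 4095, 0, 1, 2 };
        unsigned remaining = purge_replay_buffer(buf, 5, 1);  /* ACK 1 received */
        printf("%u TLPs remain, first is %u\n", remaining, (unsigned)buf[0]);
        return 0;                         /* prints: 1 TLPs remain, first is 2 */
    }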

Examples of Transmitter ACK DLLP Processing

Example 1

Consider Figure 5-7 on page 223, with the emphasis on the transmitter Device A.

1. Device A transmits TLPs with Sequence Numbers 3, 4, 5, 6, 7, where TLP 3 is the first TLP sent and TLP 7 is the last TLP sent.

2. Device B receives TLPs with Sequence Numbers 3, 4, 5 in that order. TLPs 6 and 7 are still en route.

3. Device B performs the error checks and collectively acknowledges good receipt of TLPs 3, 4, 5 with the return of an ACK DLLP with a Sequence Number of 5.

4. Device A receives ACK 5.

5. Device A purges TLPs 3, 4, 5 from the Replay Buffer.

6. When Device B receives TLPs 6 and 7, steps 3 through 5 may be repeated for those packets as well.

Figure 5-7. Example 1 that Shows Transmitter Behavior with Receipt of an ACK DLLP

Example 2

Consider Figure 5-8, with the emphasis on the transmitter Device A.

1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent.

2. Device B receives TLPs with Sequence Numbers 4094, 4095, 0, 1 in that order. TLP 2 is still en route.

3. Device B performs the error checks and collectively acknowledges good receipt of TLPs 4094, 4095, 0, 1 with the return of an ACK DLLP with a Sequence Number of 1.

4. Device A receives ACK 1.

5. Device A purges TLPs 4094, 4095, 0, 1 from the Replay Buffer.

6. When Device B ultimately receives TLP 2, steps 3 through 5 may be repeated for TLP 2.

Figure 5-8. Example 2 that Shows Transmitter Behavior with Receipt of an ACK DLLP
Transmitter's Response to a NAK DLLP

A NAK DLLP received by the transmitter implies that a TLP transmitted at an earlier time was
received by the receiver in error. The transmitter first purges from the Replay Buffer any TLP
with Sequence Numbers equal to or earlier than the NAK DLLP's AckNak_Seq_Num[11:0]. It
then replays (retries) the remainder TLPs starting with the TLP with Sequence Number
immediately after the AckNak_Seq_Num[11:0] of the NAK DLLP until the newest TLP. In
addition, the transmitter's response to reception of a NAK DLLP include:

Reset REPLAY_NUM and REPLAY_TIMER to 0 only if the NAK DLLP's
AckNak_Seq_Num[11:0] is later than the current ACKD_SEQ value (forward progress is
made in transmitting TLPs).

Load ACKD_SEQ register with AckNak_Seq_Num[11:0] of the NAK DLLP.

TLP Replay

When a Replay becomes necessary, the transmitter blocks the delivery of new TLPs by the
Transaction Layer. It then replays (re-sends or retries) the contents of the Replay Buffer
starting with the earliest TLP first (of Sequence Number = AckNak_Seq_Num[11:0] + 1) until
the remainder of the Replay Buffer is replayed. After the replay event, the Data Link Layer
unblocks acceptance of new TLPs from the Transaction Layer. The transmitter continues to
save the TLPs just replayed until they are finally acknowledged at a later time.

Efficient TLP Replay


ACK DLLPs or NAK DLLPs received during replay must be processed. This means that the
transmitter must process the DLLPs and, at the very least, store them until the replay is
finished. After replay is complete, the transmitter evaluates the ACK or NAK DLLPs and
performs the appropriate processing.

A more efficient design might begin processing the ACK/NAK DLLPs while the transmitter is still
in the act of replaying. By doing so, newly received ACK DLLPs are used to purge the Replay
Buffer even while replay is in progress. If another NAK DLLP is received in the meantime, at the
very least, the TLPs that were acknowledged have been purged and would not be replayed.

During replay, if multiple ACK DLLPs are received, the ACK DLLP received last with the latest
Sequence Number can collapse earlier ACK DLLPs of earlier Sequence Numbers. During the
replay, the transmitter can concurrently purge TLPs of Sequence Number equal to and earlier
than the AckNak_Seq_Num[11:0] of the last received ACK DLLP.

Example of Transmitter NAK DLLP Processing

Consider Figure 5-9 on page 226, with focus on transmitter Device A.

1. Device A transmits TLPs with Sequence Number 4094, 4095, 0, 1, and 2, where TLP
4094 is the first TLP sent and TLP 2 is the last TLP sent.

Device B receives TLPs 4094, 4095, and 0 in that order. TLP 1, 2 are still en route.

Device B receives TLP 4094 with no error and hence NEXT_RCV_SEQ count increments to
4095

Device B receives TLP 4095 with a CRC error.

Device B schedules the return of a NAK DLLP with Sequence Number 4094
(NEXT_RCV_SEQ count - 1).

Device A receives NAK 4094 and blocks acceptance of new TLPs from its Transaction Layer
until replay completes.

Device A first purges TLP 4094 (and earlier TLPs; none in this example).

Device A then replays TLPs 4095, 0, 1, and 2, but does not purge them.

Figure 5-9. Example that Shows Transmitter Behavior on Receipt of a NAK DLLP
Repeated Replay of TLPs

Each time the transmitter receives a NAK DLLP, it replays the Replay Buffer contents. The
transmitter uses a 2-bit Replay Number counter, referred to as the REPLAY_NUM counter, to
keep track of the number of replay events. Reception of a NAK DLLP increments
REPLAY_NUM. This counter is initialized to 0 at reset, or when the Data Link Layer is inactive.
It is also reset if an ACK or NAK DLLP is received with a later Sequence Number than that
contained in the ACKD_SEQ register. As long as forward progress is made in transmitting
TLPs the REPLAY_NUM counter resets. When a fourth NAK is received, indicating no forward
progress has been made after several tries, the counter rolls over to zero. The transmitter will
not replay the TLPs a fourth time but instead it signals a replay number rollover error. The
device assumes that the Link is non-functional or that there is a Physical Layer problem at
either the transmitter or receiver end.

What Happens After the Replay Number Rollover?

A transmitter's Data Link Layer triggers the Physical Layer to re-train the Link. The Physical
Layer Link Training and Status State Machine (LTSSM) enters the Recovery State (see
"Recovery State" on page 532). The Replay Number Rollover error bit is set ("Advanced
Correctable Error Handling" on page 384) in the Advanced Error Reporting registers (if
implemented). The Replay Buffer contents are preserved and the Data Link Layer is not
initialized by the re-training process. Upon Physical Layer re-training exit, assuming that the
problem has been cleared, the transmitter resumes the same replay process again. Hopefully,
the TLPs can be re-sent successfully on this attempt.

The specification does not address a device's handling of repeated re-train attempts. The
author recommends that a device track the number of re-train attempts. After a re-train rollover
the device could signal a Data Link Layer protocol error indicating the severity as an
Uncorrectable Fatal Error.

Transmitter's Replay Timer

The transmitter implements a REPLAY_TIMER to measure the time from when a TLP is
transmitted until the transmitter receives an associated ACK or NAK DLLP from the remote
receiver. A formula (described below) determines the timer's expiration period. Timer expiration
triggers a replay event and the REPLAY_NUM count increments. A time-out may arise if an
ACK or NAK DLLP is lost en route, or because of an error in the receiver that prevents it from
returning an ACK or NAK DLLP. Timer-related rules are:

The Timer starts (if not already started) when the last symbol of any TLP is transmitted.

The Timer is reset to 0 and restarted when:

- A Replay event occurs and the last symbol of the first TLP is replayed.

- For each ACK DLLP received, as long as there are unacknowledged TLPs in the
Replay Buffer,

The Timer is reset and held when:

- There are no TLPs to transmit, or when the Replay Buffer is empty.

- A NAK DLLP is received. The timer restarts when replay begins.

- When the timer expires.

- The Data Link Layer is inactive.

Timer is Held during Link training or re-training.

REPLAY_TIMER Equation

The timer is loaded with a value that reflects the worst-case latency for the return of an ACK or
NAK DLLP. This time depends on the maximum data payload allowed for a TLP and the width
of the Link.

The equation used to calculate the required REPLAY_TIMER value is:

REPLAY_TIMER value = ((Max_Payload_Size + TLP Overhead) * Ack Factor / Link Width + Internal Delay) * 3 + Rx_L0s_Adjustment

The resulting value is expressed in symbol times (one symbol time = 4ns).

The equation fields are defined as follows:

- Max_Payload_Size is the value in the Max_Payload_Size field of the Device Control Register ("Device Capabilities Register" on page 900).

- TLP Overhead includes the additional TLP fields beyond the data payload (header,
digest, LCRC, and Start/End framing symbols). In the specification, the overhead value is
treated as a constant of 28 symbols.

- The Ack Factor is a fudge factor that represents the number of maximum-sized TLPs
(based on Max_Payload) that can be received before an ACK DLLP must be sent. The AF
value ranges from 1.0 to 3.0 and is used to balance Link bandwidth efficiency and Replay
Buffer size. Figure 5-10 on page 229 summarizes the Ack Factor values for various Link
widths and payloads. These Ack Factor values are chosen to allow implementations to
achieve good performance without requiring a large uneconomical buffer.

Figure 5-10. Table and Equation to Calculate REPLAY_TIMER Load Value

- Link Width is the number of Lanes on the Link, ranging from x1 to x32.

- Internal Delay is the receiver's internal delay between receiving a TLP, processing it at
the Data Link Layer, and returning an ACK or NAK DLLP. It is treated as a constant of 19
symbol times in these calculations.

- Rx_L0s_Adjustment is the time required by the receive circuits to exit from L0s to L0,
expressed in symbol times.

REPLAY_TIMER Summary Table

Figure 5-10 on page 229 is a summary table that shows possible timer load values with various
variables plugged into the REPLAY_TIMER equation.
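
As a concrete illustration, the C sketch below evaluates the REPLAY_TIMER equation given above. The constants come from the field definitions; the Ack Factor and Link width used in main() are arbitrary example inputs, not values taken from the Figure 5-10 table.

#include <stdio.h>

#define TLP_OVERHEAD_SYMBOLS   28  /* header, digest, LCRC, framing */
#define INTERNAL_DELAY_SYMBOLS 19  /* receiver internal processing  */

/* REPLAY_TIMER load value in symbol times (one symbol time = 4 ns). */
static double replay_timer_symbols(unsigned max_payload, double ack_factor,
                                   unsigned link_width, unsigned rx_l0s_adjust)
{
    double per_ack = ((double)(max_payload + TLP_OVERHEAD_SYMBOLS) * ack_factor)
                     / (double)link_width
                     + INTERNAL_DELAY_SYMBOLS;
    return per_ack * 3.0 + rx_l0s_adjust;
}

int main(void)
{
    /* Example: 256-byte Max_Payload_Size, an assumed AF of 1.4, x4 Link, L0s disabled. */
    double symbols = replay_timer_symbols(256, 1.4, 4, 0);
    printf("REPLAY_TIMER load: %.0f symbol times (%.0f ns)\n", symbols, symbols * 4.0);
    return 0;
}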

Transmitter DLLP Handling

The DLLP CRC Error Checking block determines whether there is a CRC error in the received
DLLP. The DLLP includes a 16-bit CRC for this purpose (see Table 5-1 on page 219). If there
are no DLLP CRC errors, then the DLLPs are further processed. If a DLLP CRC error is
detected, the DLLP is discarded, and the error is reported as a DLLP CRC error to the error
handling logic which logs the error in the optional Advanced Error Reporting registers (see Bad
DLLP in "Advanced Correctable Error Handling" on page 384). No further action is taken.

Discarding an ACK or NAK DLLP received in error is not a severe response because a
subsequently received DLLP will accomplish the same goal as the discarded DLLP. The side
effect of this action is that associated TLPs are purged a little later than they would have been
or that a replay happens at a later time. If a subsequent DLLP is not received in time, the
transmitter REPLAY_TIMER expires anyway, and the TLPs are replayed.

Receiver Protocol Details

Consider the receive side of a device's Data Link Layer shown in Figure 5-5 on page 218.

TLP Received at Physical Layer

TLPs received at the Physical Layer are checked for STP and END framing errors as well as
other receiver errors such as disparity errors. If there are no errors, the TLPs are passed to
the Data Link Layer. If there are any errors, the TLP is discarded and the allocated storage is
freed up. The Data Link Layer is informed of this error so that it can schedule a NAK DLLP.
(see "Receiver Schedules a NAK" on page 233).

Received TLP Error Check

The receiver accepts TLPs from the Link into a receiver buffer and checks for CRC errors. The
receiver calculates an expected LCRC value based on the received TLP (excluding the LCRC
field) and compares this value with the TLP's 32-bit LCRC. If the two match, the TLP is good. If
the two LCRC values do not match, the received TLP is bad and the receiver schedules a NAK
DLLP to be returned to the remote transmitter. The receiver also checks for other types of non-
CRC related errors (such as that described in the next section).

Next Received TLP's Sequence Number

The receiver keeps track of the next expected TLP's Sequence Number via a 12-bit counter
referred to as the NEXT_RCV_SEQ counter. This counter is initialized to 0 at reset, or when
the Data Link Layer is inactive. This counter is incremented once for each good TLP that is
received and forwarded to the Transaction Layer. The counter rolls over to 0 after reaching a
value of 4095.

The receiver uses the NEXT_RCV_SEQ counter to identify the Sequence Number that should
be in the next received TLP. If a received TLP has no LCRC error, the device compares its
Sequence Number with the NEXT_RCV_SEQ count. Under normal operational conditions, these
two numbers should match. If this is the case, the receiver accepts the TLP, forwards the TLP
to the Transaction Layer, increments the NEXT_RCV_SEQ counter and is ready for the next
TLP. An ACK DLLP may be scheduled for return if the ACKNAK_LATENCY_TIMER expires or
exceeds its set value. The receiver is ready to perform a comparison on the next received
TLP's Sequence Number.

In some cases, a received TLP's Sequence Number may not match the NEXT_RCV_SEQ
count. The received TLP's Sequence Number may be either logically greater than or logically
less than NEXT_RCV_SEQ count (a logical number in this case accounts for the count rollover,
so in fact a logically greater number may actually be a lower number if the count rolls over).
See "Receiver Sequence Number Check" on page 234 for details on these two abnormal
conditions.

For a TLP received with a CRC error, a nullified TLP, or a TLP for which the Sequence
Number check described above fails, the NEXT_RCV_SEQ counter is not incremented.
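
The logical comparison can be expressed with modulo-4096 arithmetic. The C sketch below classifies a received Sequence Number against NEXT_RCV_SEQ using the duplicate-TLP rule quoted later in "Receiver Sequence Number Check"; the enum and function names are illustrative only.

#include <stdint.h>

enum seq_check { SEQ_EXPECTED, SEQ_DUPLICATE, SEQ_LATER };

/* Classify a received TLP's 12-bit Sequence Number against NEXT_RCV_SEQ.
 * A duplicate satisfies (NEXT_RCV_SEQ - seq) mod 4096 <= 2048 (and is not
 * the expected number); anything else is logically later (lost TLPs). */
static enum seq_check check_sequence(uint16_t next_rcv_seq, uint16_t seq)
{
    uint16_t diff = (uint16_t)((next_rcv_seq - seq) & 0xFFF);   /* mod 4096 */

    if (diff == 0)
        return SEQ_EXPECTED;    /* accept, forward, increment NEXT_RCV_SEQ */
    if (diff <= 2048)
        return SEQ_DUPLICATE;   /* discard, schedule an ACK                */
    return SEQ_LATER;           /* discard, schedule a NAK if none pending */
}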

Receiver Schedules An ACK DLLP

If the receiver does not detect an LCRC error (see "Received TLP Error Check" on page 230)
or a Sequence Number related error (see "Next Received TLP's Sequence Number" on page
230) associated with a received TLP, it accepts the TLP and sends it to the Transaction Layer.
The NEXT_RCV_SEQ counter is incremented and the receiver is ready for the next TLP. At this
point, the receiver can schedule an ACK DLLP with the Sequence Number of the received TLP
(see the AckNak_Seq_Num[11:0] field described in Table 5-1 on page 219). Alternatively, the
receiver could also wait for additional TLPs and schedule an ACK DLLP with the Sequence
Number of the last good TLP received.

The receiver is allowed to accumulate a number of good TLPs and then send one aggregate
ACK DLLP with a Sequence Number of the latest good TLP received. The coalesced ACK
DLLP acknowledges the good receipt of a collection of TLPs starting with the oldest TLP in the
transmitter's Replay Buffer and ending with the TLP being acknowledged by the current ACK
DLLP. By doing so, the receiver optimizes the use of Link bandwidth due to reduced ACK DLLP
traffic. The frequency with which ACK DLLPs are scheduled for return is described in
"Receivers ACKNAK_LATENCY_TIMER" on page 237. When the ACKNAK_LATENCY_ TIMER
expires or exceeds its set value and TLPs are received, an ACK DLLP with a Sequence
Number of the last good TLP is returned to the transmitter.

When the receiver schedules an ACK DLLP to be returned to the remote transmitter, the
receiver might have other packets (TLPs, DLLPs or PLPs) enqueued that also have to be
transmitted on the Link in the same direction as the ACK DLLP. This implies that the receiver
may not immediately return the ACK DLLP to the transmitter, especially if a large TLP (with up
to a 4KB data payload) is already being transmitted (see "Recommended Priority To Schedule
Packets" on page 244).

The receiver continues to receive TLPs and as long as there are no detected errors (LCRC or
Sequence Number errors), it forwards the TLPs to the Transaction Layer. When the receiver
has the opportunity to return the ACK DLLP to the remote transmitter, it appends the Sequence
Number of the latest good TLP received and returns the ACK DLLP. Upon receipt of the ACK
DLLP, the remote transmitter purges its Replay Buffer of the TLPs with matching Sequence
Numbers and all TLPs transmitted earlier than the acknowledged TLP.
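
A minimal sketch of the corresponding purge operation on the transmitter side is shown below, assuming the Replay Buffer is represented as a simple array of entries (an illustrative representation, not a required organization).

#include <stdbool.h>
#include <stdint.h>

struct replay_entry {
    uint16_t seq;    /* 12-bit Sequence Number assigned to the buffered TLP */
    bool     valid;  /* still awaiting acknowledgement                      */
};

/* True if sequence number 'a' was issued no later than 'b' (modulo 4096). */
static bool seq_le(uint16_t a, uint16_t b)
{
    return (uint16_t)((b - a) & 0xFFF) < 2048;
}

/* On receipt of an ACK DLLP carrying AckNak_Seq_Num 'acked', purge every
 * buffered TLP whose Sequence Number is equal to or earlier than 'acked'. */
static void purge_on_ack(struct replay_entry *buf, unsigned n, uint16_t acked)
{
    for (unsigned i = 0; i < n; i++) {
        if (buf[i].valid && seq_le(buf[i].seq, acked))
            buf[i].valid = false;   /* acknowledged: free the stored copy */
    }
}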

Example of Receiver ACK Scheduling

Example: Consider Figure 5-11 on page 233, with focus on the receiver Device B.

1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent.

2. Device B receives TLPs with Sequence Numbers 4094, 4095, 0, and 1, in that order. NEXT_RCV_SEQ count increments to 2. TLP 2 is still en route.

3. Device B performs error checks and issues a coalesced ACK to collectively acknowledge receipt of TLPs 4094, 4095, 0, and 1, with the return of an ACK DLLP with Sequence Number of 1.

4. Device B forwards TLPs 4094, 4095, 0, and 1 to its Transaction Layer.

5. When Device B ultimately receives TLP 2, steps 3 and 4 may be repeated for TLP 2.

Figure 5-11. Example that Shows Receiver Behavior with Receipt of Good TLP

NAK Scheduled Flag

The receiver implements a Flag bit referred to as the NAK_SCHEDULED flag. When a receiver
detects a TLP CRC error, or any other non-CRC related error that requires it to schedule a
NAK DLLP to be returned, the receiver sets the NAK_SCHEDULED flag and clears it when the
receiver detects replayed TLPs from the transmitter for which there are no CRC errors.

Receiver Schedules a NAK

Upon receipt of a TLP, the first type of error condition the receiver may detect is a TLP LCRC
error (see "Received TLP Error Check" on page 230). The receiver discards the bad TLP. If the
NAK_SCHEDULED flag is clear, it schedules a NAK DLLP to return to the transmitter. The
NAK_SCHEDULED flag is then set. The receiver uses NEXT_RCV_SEQ count - 1 as the
AckNak_Seq_Num[11:0] field in the NAK DLLP (Table 5-1 on page 219). At the
time the receiver schedules a NAK DLLP to return to the transmitter, the Link may be in use to
transmit other queued TLPs, DLLPs or PLPs. In that case, the receiver delays the transmission
of the NAK DLLP (see "Recommended Priority To Schedule Packets" on page 244). When the
Link becomes available, however, it sends the NAK DLLP to the remote transmitter. The
transmitter replays the TLPs from the Replay Buffer (see "TLP Replay" on page 225).

In the meantime, TLPs currently en route continue to arrive at the receiver. These TLPs have
later Sequence Numbers than the NEXT_RCV_SEQ count. The receiver discards them. The
specification is unclear about whether the receiver should schedule a NAK DLLP for these
TLPs. It is the authors' interpretation that the receiver must not schedule the return of additional
NAK DLLPs for subsequently received TLPs while the NAK_SCHEDULED flag remains set.

The receiver detects a replayed TLP when it receives a TLP with a Sequence Number that
matches the NEXT_RCV_SEQ count. If the replayed TLPs arrive with no errors, the receiver
increments NEXT_RCV_SEQ count and clears the NAK_SCHEDULED flag. The receiver may
schedule an ACK DLLP for return to the transmitter if the ACKNAK_LATENCY_TIMER expires.
The good replayed TLPs are forwarded to the Transaction Layer.

There is a second scenario under which the receiver schedules a NAK DLLP for return to the
transmitter. If the receiver detects a TLP with a Sequence Number later than the next expected
Sequence Number indicated by the NEXT_RCV_SEQ count (that is, a Sequence Number
separated from the NEXT_RCV_SEQ count by more than 2048), the procedure described above
is repeated. See "Receiver Sequence Number Check" below for the reasons why this could happen.

The two error conditions just described wherein a NAK DLLP is scheduled for return are
reported as errors associated with the Data Link Layer. The error reported is a bad TLP error
with a severity of correctable.
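
The following C sketch captures this NAK scheduling behavior, including the authors' interpretation that only one NAK is scheduled per replay cycle; all names are illustrative.

#include <stdbool.h>
#include <stdint.h>

struct dl_rx_state {
    uint16_t next_rcv_seq;    /* 12-bit NEXT_RCV_SEQ counter */
    bool     nak_scheduled;   /* NAK_SCHEDULED flag          */
};

/* Called when a received TLP fails the LCRC check or the Sequence Number
 * check. Returns the AckNak_Seq_Num to place in the NAK DLLP, or -1 if no
 * NAK should be scheduled because one is already outstanding. */
static int schedule_nak(struct dl_rx_state *rx)
{
    if (rx->nak_scheduled)
        return -1;                              /* one NAK per replay cycle   */
    rx->nak_scheduled = true;
    return (rx->next_rcv_seq - 1) & 0xFFF;      /* NEXT_RCV_SEQ - 1, mod 4096 */
}

/* Called when an error-free replayed TLP with the expected Sequence Number
 * arrives: the flag clears and normal acceptance resumes. */
static void on_good_replayed_tlp(struct dl_rx_state *rx)
{
    rx->nak_scheduled = false;
    rx->next_rcv_seq = (rx->next_rcv_seq + 1) & 0xFFF;
}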

Receiver Sequence Number Check

Every received TLP that passes the CRC check goes through a Sequence Number check. The
received TLP's Sequence Number is compared with the NEXT_RCV_SEQ count. Below are
three possibilities:

TLP Sequence Number equals NEXT_RCV_SEQ count. This situation results when a
good TLP is received. It also occurs when a replayed TLP is received. The TLP is
accepted and forwarded to the Transaction Layer. NEXT_RCV_SEQ count is incremented
and an ACK DLLP may be scheduled (according to the ACK DLLP scheduling rules
described in "Receiver Schedules An ACK DLLP" on page 231).

TLP Sequence Number is logically less than NEXT_RCV_SEQ count (earlier Sequence
Number). This situation results when a duplicate TLP is received as the result
of a replay event. The duplicate TLP is discarded. NEXT_RCV_SEQ count is not
incremented. An ACK DLLP is scheduled so that the transmitter can purge its Replay
Buffer of the duplicate TLP(s). The receiver uses the NEXT_RCV_SEQ count - 1 in the
ACK DLLP's AckNak_Seq_Num[11:0] field. What scenario results in a duplicate TLP being
received? Consider this example. A receiver accepts a TLP and returns an associated
ACK DLLP and increments the NEXT_RCV_SEQ count. The ACK DLLP is lost en route to
the transmitter. As a result, this TLP remains in the remote transmitter's Replay Buffer. The
transmitter's REPLAY_TIMER expires when no further ACK DLLPs are received. This
causes the transmitter to replay the entire contents of the Replay Buffer. The receiver sees
these TLPs with earlier Sequence Numbers than the NEXT_RCV_SEQ count and discards
them because they are duplicate TLPs. More precisely, a TLP is a duplicate TLP if:
(NEXT_RCV_SEQ - TLP Sequence Number) mod 4096 <= 2048.

An ACK DLLP is returned for every duplicate TLP received.

TLP Sequence Number is logically greater than NEXT_RCV_SEQ count (later Sequence
Number). This situation results when one or more TLPs are lost en route. The
receiver schedules a NAK DLLP for return to the transmitter if NAK_SCHEDULED flag is
clear (see NAK DLLP scheduling rules described in "Receiver Schedules a NAK" on page
233). NEXT_RCV_SEQ count does not increment when the receiver receives such TLPs of
later Sequence Number.

Receiver Preserves TLP Ordering

In addition to guaranteeing reliable TLP transport, the ACK/NAK protocol preserves packet
ordering. The receiver's Transaction Layer receives TLPs in the same order that the transmitter
sent them.

A transmitter correctly orders TLPs according to the ordering rules before transmission in order
to maintain correct program flow and to eliminate the occurrence of potential deadlock and
livelock conditions (see Chapter 8, entitled "Transaction Ordering," on page 315). The Receiver
is required to preserve TLP order (otherwise, application program flow is altered). To
preserve this order, the receiver applies three rules:

When the receiver detects a bad TLP, it discards the TLP and all new TLPs that follow in
the pipeline until the replayed TLPs are detected.

Also, duplicate TLPs are discarded.

TLPs that arrive after one or more TLPs have been lost are discarded.

The motivation for discarding TLPs that arrive after the first bad TLP (rather than forwarding
them to the Transaction Layer) and for scheduling a NAK DLLP is as follows. When the receiver
detects a bad TLP, it discards it and any new TLPs in the pipeline, and then waits for the TLP
replay. After verifying that there are no errors in the replayed TLP(s), the receiver forwards
them to the Transaction Layer and resumes acceptance of new TLPs in the pipeline. Doing so
preserves TLP receive and acceptance order at the receiver's Transaction Layer.

Example of Receiver NAK Scheduling

Example: Consider Figure 5-12 on page 237 with emphasis on the receiver Device B.

1. Device A transmits TLPs with Sequence Numbers 4094, 4095, 0, 1, and 2, where TLP 4094 is the first TLP sent and TLP 2 is the last TLP sent.

2. Device B receives TLPs 4094, 4095, and 0, in that order. TLPs 1 and 2 are still in flight.

3. Device B receives TLP 4094 with no errors and forwards it to the Transaction Layer. NEXT_RCV_SEQ count increments to 4095.

4. Device B detects an LCRC error in TLP 4095 and hence returns a NAK DLLP with a Sequence Number of 4094 (NEXT_RCV_SEQ count - 1). The NAK_SCHEDULED flag is set. NEXT_RCV_SEQ count does not increment.

5. Device B discards TLP 4095.

6. Device B also discards TLP 0, even though it is a good TLP. TLPs 1 and 2 are also discarded when they arrive.

7. Device B does not schedule a NAK DLLP for TLPs 0, 1, and 2 because the NAK_SCHEDULED flag is set.

8. Device A receives NAK 4094.

9. Device A does not accept any new TLPs from its Transaction Layer.

10. Device A first purges TLP 4094.

11. Device A then replays TLPs 4095, 0, 1, and 2, but continues to save these TLPs in the Replay Buffer. It then accepts TLPs from the Transaction Layer.

12. Replayed TLPs 4095, 0, 1, and 2 arrive at Device B in that order.

13. After verifying that there are no CRC errors in the received TLPs, Device B detects TLP 4095 as a replayed TLP because it has a Sequence Number equal to the NEXT_RCV_SEQ count. The NAK_SCHEDULED flag is cleared.

14. Device B forwards these TLPs to the Transaction Layer in this order: 4095, 0, 1, and 2.

Figure 5-12. Example that Shows Receiver Behavior When It Receives Bad TLPs

Receivers ACKNAK_LATENCY_TIMER

The ACKNAK_LATENCY_TIMER measures the duration since an ACK or NAK DLLP was
scheduled for return to the remote transmitter. This timer has a value that is approximately 1/3
that of the transmitter REPLAY_TIMER. When the timer expires, the receiver schedules an
ACK DLLP with a Sequence Number of the last good unacknowledged TLP received. The timer
guarantees that the receiver schedules an ACK or NAK DLLP for a received TLP before the
transmitter's REPLAY_TIMER expires and triggers a replay.

The timer resets to 0 and restarts when an ACK or NAK DLLP is scheduled.

The timer resets to 0 and holds when:

All received TLPs have been acknowledged.

The Data Link Layer is in the inactive state.

ACKNAK_LATENCY_TIMER Equation

The receiver's ACKNAK_LATENCY_TIMER is loaded with a value that reflects the worst-case
transmission latency in sending an ACK or NAK in response to a received TLP. This time
depends on the anticipated maximum payload size and the width of the Link.

The equation to calculate the ACKNAK_LATENCY_TIMER value required is:

ACKNAK_LATENCY_TIMER = (Max_Payload_Size + TLP Overhead) * Ack Factor / Link Width + Internal Delay + Tx_L0s_Adjustment

The value in the timer is expressed in symbol times (one symbol time = 4 ns).

The fields above are defined as follows:

Max_Payload_Size is the value in the Max_Payload_Size field of the Device Control Register (see page 900).

TLP Overhead includes the additional TLP fields beyond the data payload (header, digest,
LCRC, and Start/End framing symbols). In the specification, the overhead value is treated
as a constant of 28 symbols.

The Ack Factor is the largest number of maximum-sized TLPs (based on Max_Payload)
that can be received before an ACK DLLP is sent. The AF value (a fudge factor)
ranges from 1.0 to 3.0 and is used to balance Link bandwidth efficiency and Replay Buffer
size. Figure 5-10 on page 229 summarizes the Ack Factor values for various Link widths
and payloads. These Ack Factor values are chosen to allow implementations to achieve
good performance without requiring a large, uneconomical buffer.

Link Width is the number of Lanes on the Link, ranging from x1 to x32.

Internal Delay is the receiver's internal delay between receiving a TLP, processing it at the
Data Link Layer, and returning an ACK or NAK DLLP. It is treated as a constant of 19
symbol times in these calculations.

Tx_L0s_Adjustment: If L0s is enabled, the time required for the transmitter to exit L0s,
expressed in symbol times. Note that setting the Extended Sync bit of the Link Control
register affects the exit time from L0s and must be taken into account in this adjustment.

It turns out that the entries in this table are approximately one third of the
REPLAY_TIMER latency values in Figure 5-10 on page 229.
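
A sketch of the corresponding calculation, using the equation as given above (one third of the REPLAY_TIMER term plus the transmitter L0s adjustment), is shown below; the inputs in main() are the same illustrative values used in the earlier REPLAY_TIMER sketch.

#include <stdio.h>

#define TLP_OVERHEAD_SYMBOLS   28
#define INTERNAL_DELAY_SYMBOLS 19

/* ACKNAK_LATENCY_TIMER load value in symbol times (one symbol time = 4 ns). */
static double acknak_latency_symbols(unsigned max_payload, double ack_factor,
                                     unsigned link_width, unsigned tx_l0s_adjust)
{
    return ((double)(max_payload + TLP_OVERHEAD_SYMBOLS) * ack_factor)
           / (double)link_width
           + INTERNAL_DELAY_SYMBOLS
           + tx_l0s_adjust;
}

int main(void)
{
    double symbols = acknak_latency_symbols(256, 1.4, 4, 0);
    printf("ACKNAK_LATENCY_TIMER load: %.0f symbol times (%.0f ns)\n",
           symbols, symbols * 4.0);
    return 0;
}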

ACKNAK_LATENCY_TIMER Summary Table

Figure 5-13 on page 239 is a summary table that shows possible timer load values with various
variables plugged into the ACKNAK_LATENCY_TIMER equation.

Figure 5-13. Table to Calculate ACKNAK_LATENCY_TIMER Load Value


Error Situations Reliably Handled by ACK/NAK Protocol
This section describes the possible sources of errors that may occur in delivery of TLPs from a
transmitter to a receiver across a Link. The ACK/NAK protocol guarantees reliable delivery of
TLPs despite the unlikely event that these errors occur. Below is a bullet list of errors and the
related error correction mechanism the protocol uses to resolve the error:

Problem: CRC error occurs in transmission of a TLP (see "Transmitter's Response to a NAK DLLP" on page 224 and "Receiver Schedules a NAK" on page 233).

Solution: Receiver detects LCRC error and schedules a NAK DLLP with Sequence
Number = NEXT_RCV_SEQ count - 1. Transmitter replays TLPs.

Problem: One or more TLPs are lost en route to the receiver.

Solution: The receiver performs a Sequence Number check on all received TLPs. The
receiver expects each arriving TLP to carry a 12-bit Sequence Number incremented by one
from that of the previous TLP. If one or more TLPs are lost en route, a received TLP will
carry a Sequence Number later than the expected Sequence Number reflected in the
NEXT_RCV_SEQ count. The receiver schedules a NAK DLLP with a Sequence Number =
NEXT_RCV_SEQ count - 1. The transmitter replays the Replay Buffer contents.

Problem: Receiver returns an ACK DLLP, but it is corrupted en route to the transmitter.
The remote Transmitter detects a CRC error in the DLLP (DLLP is covered by 16-bit CRC,
see "ACK/NAK DLLP Format" on page 219). In fact, the transmitter does not know that the
malformed DLLP just received is supposed to be an ACK DLLP. All it knows is that the
packet is a DLLP.

Solution:

- Case 1: The Transmitter discards the DLLP. A subsequent ACK DLLP received with
a later Sequence Number causes the transmitter's Replay Buffer to purge all TLPs with
equal or earlier Sequence Numbers. The transmitter never knew that anything went wrong.

- Case 2: The Transmitter discards the DLLP. A subsequent NAK DLLP received with
a later Sequence Number causes the transmitter's Replay Buffer to purge TLPs with
equal or earlier Sequence Numbers. The transmitter then replays all TLPs with later
Sequence Numbers, up to the last TLP in the Replay Buffer. The transmitter never knew
that anything went wrong.

Problem: An ACK or NAK DLLP for received TLPs is not returned by the receiver within
the ACKNAK_LATENCY_TIMER time-out period. The associated TLPs remain in the
transmitter's Replay Buffer.

Solution: The REPLAY_TIMER times out and the transmitter replays its Replay Buffer.

Problem: The Receiver returns a NAK DLLP but it is corrupted en route to the transmitter.
The remote Transmitter detects a CRC error in the DLLP. In fact, the transmitter does not
know that the DLLP received is supposed to be an NAK DLLP. All it knows is that the
packet is a DLLP.

Solution: The Transmitter discards the DLLP. The receiver discards all subsequently
received TLPs and awaits the replay. Because the NAK was discarded by the transmitter,
its REPLAY_TIMER expires and triggers the replay.

Problem: Due to an error in the receiver, it is unable to schedule an ACK or NAK DLLP for
a received TLP.

Solution: The transmitter REPLAY_TIMER will expire and result in TLP replay.
ACK/NAK Protocol Summary
Refer to Figure 5-3 on page 212 and the following subsections for a summary of the elements
of the Data Link Layer.

Transmitter Side

Non-Error Case (ACK DLLP Management)

Unless blocked by the Data Link Layer, the Transaction Layer passes down the Header,
Data, and Digest information for each TLP to be sent.

Each TLP is assigned a 12-bit Sequence Number using the current NEXT_TRANSMIT_SEQ count.

A check is made to see if the acceptance of new TLPs from the Transaction Layer should
be blocked. The transmitter performs a modulo 4096 subtraction of the ACKD_SEQ count
from the NEXT_TRANSMIT_SEQ count to see if the result is >= 2048d. If it is, further
TLPs are blocked until incoming ACK/NAK DLLPs reduce the difference below 2048 (see
the C sketch following this list).

The NEXT_TRANSMIT_SEQ counter increments by one for each TLP processed. Note: if
the transmitter wants to nullify a TLP being sent, it sends an inverted CRC to the physical
layer and indicates an EDB end (End Bad Packet) symbol should be used
(NEXT_TRANSMIT_SEQ is not incremented). See the "Switch Cut-Through Mode" on
page 248 for details.

A 32-bit LCRC value is calculated for the TLP (the LCRC calculation includes the Sequence
Number).

A copy of the TLP is placed in the Replay Buffer and the TLP is forwarded to the Physical
Layer for transmission.

The Physical Layer adds STP and END framing symbols, then transmits the packet.

At a later time, assume the transmitter receives an ACK DLLP from the receiver. It
performs a CRC error check and, if the check fails, discards the ACK DLLP (the same
holds true if a bad NAK DLLP is received). If the check is OK, it purges the Replay buffer
of TLPs from the oldest TLP up to and including the TLP with Sequence Number that
matches the Sequence Number in the ACK DLLP.
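
The blocking check and Sequence Number assignment described in the preceding list can be sketched in C as follows; the function names are illustrative and not part of the specification.

#include <stdbool.h>
#include <stdint.h>

/* True if the Data Link Layer must stop accepting new TLPs from the
 * Transaction Layer: half or more of the 4096-entry sequence space is
 * outstanding (unacknowledged). */
static bool must_block_new_tlps(uint16_t next_transmit_seq, uint16_t ackd_seq)
{
    return (uint16_t)((next_transmit_seq - ackd_seq) & 0xFFF) >= 2048;
}

/* Assign the next 12-bit Sequence Number to an outgoing (non-nullified) TLP. */
static uint16_t assign_sequence(uint16_t *next_transmit_seq)
{
    uint16_t seq = *next_transmit_seq;
    *next_transmit_seq = (seq + 1) & 0xFFF;   /* wraps to 0 after 4095 */
    return seq;
}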
Error Case (NAK DLLP Management)

Repeat the process described in the previous section, but this time, assume that the transmitter
receives a NAK DLLP:

Upon receipt of the NAK DLLP with no CRC error, the transmitter performs the Replay
using the following sequence of steps. NOTE: the same sequence of events occurs if the
REPLAY_TIMER expires instead.

- The REPLAY_NUM is incremented. The maximum number of attempts to clear (ACK) all unacknowledged TLPs in the Replay Buffer is four.

- If the REPLAY_NUM count rolls over from 11b to 00b, the transmitter instructs the
Physical Layer to re-train the Link.

- If REPLAY_NUM does not roll over, proceed.

- Block acceptance of new TLPs from the Transaction Layer.

- Complete transmission of any TLPs in progress.

- Purge any TLPs with Sequence Numbers equal to or earlier than the NAK DLLP's
AckNak_Seq_Num[11:0].

- Re-transmit TLPs with later Sequence Numbers than the NAK DLLP's
AckNak_Seq_Num[11:0].

- ACK DLLPs or NAK DLLPs received during replay must be processed. The
transmitter may disregard them until replay is complete or use them during replay to
skip transmission of newly acknowledged TLPs. Earlier Sequence Numbers can be
collapsed when an ACK DLLP is received with a later Sequence Number. Also, ACK
DLLPs with later Sequence Numbers than a NAK DLLP received earlier supersede the
earlier NAK DLLP.

- When the replay is complete, unblock TLPs and return to normal operation.

Receiver Side

Non-Error Case

TLPs are received at the Physical Layer where they are checked for framing errors and other
receiver-related errors. Assume that there are no errors. If the Physical Layer reports the end
symbol was EDB and the CRC value was inverted, this is not an error condition; discard the
packet and free any allocated space (see "Switch Cut-Through Mode" on page 248). There will
be no ACK or NAK DLLP returned for this case.

The sequence of steps performed is as follows:

Calculate the CRC for the incoming TLP and check it against the LCRC provided with the
packet. If the CRC passes, go to the next step.

Compare the Sequence Number for the inbound packet against the current value in the
NEXT_RCV_SEQ count.

If they are the same, this is the next expected TLP. Forward the TLP to the Transaction
Layer. Also increment the NEXT_RCV_SEQ count.

Clear the NAK_SCHEDULED flag if set.

If the ACKNAK_LATENCY_TIMER expires, schedule an ACK DLLP with AckNak_Seq_Num[11:0] = NEXT_RCV_SEQ count - 1.

Error Case

TLPs are received at the Physical Layer where they are checked for framing errors and other
receiver-related errors. In the event of an error, the Physical Layer discards the packet, reports
the error, and frees any storage allocated for the TLP. If the packet ends with the EDB symbol
but the LCRC is not inverted, this is a bad packet: discard the TLP and set the error flag. If the NAK_SCHEDULED
flag is clear, set it, and schedule a NAK DLLP with the NEXT_RCV_SEQ count - 1 value used
as the Sequence Number.

If there are no Physical Layer errors detected, forward the TLP to the Data Link Layer.

Calculate the CRC for the incoming TLP and check it against the LCRC provided with the
packet. If the CRC fails, set the NAK_SCHEDULED flag. Schedule a NAK DLLP with
NEXT_RCV_SEQ count - 1 used as the Sequence Number. If the LCRC check passes,
go to the next bullet.

If the LCRC check passes, then compare the Sequence Number for the inbound packet
against the current value in the NEXT_RCV_SEQ count. If the TLP Sequence Number is
not equal to NEXT_RCV_SEQ count and if (NEXT_RCV_SEQ - TLP Sequence Number)
mod 4096 <= 2048, the TLP is a duplicate TLP. Discard the TLP, and schedule an ACK
with NEXT_RCV_SEQ count - 1 value used as AckNak_Seq_Num[11:0].

Discard TLPs received with any other Sequence Number (i.e., a Sequence Number logically
later than the NEXT_RCV_SEQ count). If the NAK_SCHEDULED flag is clear, set it, and
schedule a NAK DLLP with NEXT_RCV_SEQ count - 1 used as AckNak_Seq_Num[11:0]. If the
NAK_SCHEDULED flag is already set, keep it set and do not schedule a NAK DLLP.

Recommended Priority To Schedule Packets
A device may have many types of TLPs, DLLPs and PLPs to transmit on a given Link. The
following is a recommended but not required set of priorities for scheduling packets:

1. Completion of any TLP or DLLP currently in progress (highest priority).

2. PLP transmissions.

3. NAK DLLP.

4. ACK DLLP.

5. FC (Flow Control) DLLP.

6. Replay Buffer re-transmissions.

7. TLPs that are waiting in the Transaction Layer.

8. All other DLLP transmissions (lowest priority).


Some More Examples
To demonstrate the reliable TLP delivery capability provided by the ACK/NAK Protocol, the
following examples are provided.

Lost TLP

Consider Figure 5-14 on page 245 which shows the ACK/NAK protocol for handling lost TLPs.

1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.

2. Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0. These TLPs are forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of NEXT_RCV_SEQ count is 1. Device B is ready to receive TLP 1.

3. Seeing ACK 0, Device A purges TLPs 4094, 4095, and 0 from its replay buffer.

4. TLP 1 is lost en route.

5. TLP 2 arrives instead. Upon performing a Sequence Number check, Device B realizes that TLP 2's Sequence Number is greater than NEXT_RCV_SEQ count.

6. Device B discards TLP 2 and schedules NAK 0 (NEXT_RCV_SEQ count - 1).

7. Upon receipt of NAK 0, Device A replays TLPs 1 and 2.

8. TLPs 1 and 2 arrive without error at Device B and are forwarded to the Transaction Layer.

Figure 5-14. Lost TLP Handling


Lost ACK DLLP or ACK DLLP with CRC Error

Consider Figure 5-15 on page 246 which shows the ACK/NAK protocol for handling a lost ACK
DLLP.

1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.

2. Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0. These TLPs are forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of NEXT_RCV_SEQ count is set to 1.

3. ACK 0 is lost en route. TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.

4. TLPs 1 and 2 arrive at Device B shortly thereafter. NEXT_RCV_SEQ count increments to 3.

5. Device B returns ACK 2 and sends TLPs 1 and 2 to the Transaction Layer.

6. ACK 2 arrives at Device A.

7. Device A purges its Replay Buffer of TLPs 4094, 4095, 0, 1, and 2.

Figure 5-15. Lost ACK DLLP Handling


The example would be the same if a CRC error existed in ACK packet 0. Device A would
detect the CRC error in ACK 0 and discard it. When received later, ACK 2 would cause the
Replay Buffer to purge all TLPs (4094 through 2).

If ACK 2 is also lost or corrupted, and no further ACK or NAK DLLPs are returned to Device A,
its REPLAY_TIMER will expire. This results in replay of its entire buffer. Device B receives TLP
4094, 4095, 0, 1 and 2 and detects them as duplicate TLPs because their Sequence Numbers
are earlier than NEXT_RCV_SEQ count of 3. These TLPs are discarded and ACK DLLPs with
AckNak_Seq_Num[11:0] = 2 are returned to Device A for each duplicate TLP.

Lost ACK DLLP followed by NAK DLLP

Consider Figure 5-16 on page 247 which shows the ACK/NAK protocol for handling a lost ACK
DLLP followed by a valid NAK DLLP.

1. Device A transmits TLPs 4094, 4095, 0, 1, and 2.

2. Device B receives TLPs 4094, 4095, and 0, for which it returns ACK 0. These TLPs are forwarded to the Transaction Layer. NEXT_RCV_SEQ is incremented and the next value of NEXT_RCV_SEQ count is 1.

3. ACK 0 is lost en route. TLPs 4094, 4095, and 0 remain in Device A's Replay Buffer.

4. TLPs 1 and 2 arrive at Device B shortly thereafter. TLP 1 is good and NEXT_RCV_SEQ count increments to 2. TLP 1 is forwarded to the Transaction Layer.

5. TLP 2 is corrupt. NEXT_RCV_SEQ count remains at 2.

6. Device B returns a NAK with a Sequence Number of 1 and discards TLP 2.

7. NAK 1 arrives at Device A.

8. Device A first purges TLPs 4094, 4095, 0, and 1.

9. Device A replays TLP 2.

10. TLP 2 arrives at Device B. The NEXT_RCV_SEQ count is 2.

11. Device B accepts good TLP 2 and forwards it to the Transaction Layer. NEXT_RCV_SEQ increments to 3.

12. Device B may return an ACK with a Sequence Number of 2 if the ACKNAK_LATENCY_TIMER expires.

13. Upon receipt of ACK 2, Device A purges TLP 2.

Figure 5-16. Lost ACK DLLP Handling


Switch Cut-Through Mode
PCI Express supports a switch-related feature that allows TLP transfer latency through a
switch to be significantly reduced. This feature is referred to as the 'cut-through' mode. Without
this feature, the propagation time through a switch could be significant.

Without Cut-Through Mode

Background

Consider an example where a large TLP needs to pass through a switch from one port to
another. Until the tail end of the TLP is received by the switch's ingress port, the switch is
unable to determine if there is a CRC error. Typically, the switch will not forward the packet
through the egress port until it determines that there is no CRC error. This implies that the
latency through the switch is at least the time to clock the packet into the switch. If the packet
needs to pass through many switches to get to the final destination, the latencies would add up,
increasing the time to get from source to destination.

Possible Solution

One option to reduce latency would be to start forwarding the TLP through the switch's egress
port before the tail end of the TLP has been received by the switch ingress port. This is fine as
long as the packet is not corrupted. Consider what would happen if the TLP were corrupt. The
packet would begin transmitting through the egress port before the switch realized that there is
an error. After the switch detects the CRC error, it would return a NAK to the TLP source and
discard the packet, but part of the packet has already been transmitted and its transmission
cannot be cleanly aborted in mid-transmit. There is no point keeping a copy of the bad TLP in
the egress port Replay Buffer because it is bad. The TLP source port would at a later time
replay after receiving the NAK DLLP. The TLP is already outbound and en route to the Endpoint
destination. The Endpoint receives the packet, detects a CRC error, and returns a NAK to the
switch. The switch is expected to replay the TLP, but the switch has already discarded the TLP
due to the detected error on the inbound TLP. The switch is stuck between a rock and a hard
place!

Switch Cut-Through Mode

Background

The PCI Express protocol permits the implementation of an optional feature referred to as cut-
through mode. Cut-through is the ability to start streaming a packet through a switch without
waiting for the receipt of the tail end of the packet. If, ultimately, a CRC error is detected when
the CRC is received at the tail end of the packet, the packet that has already begun
transmission from the switch egress port can be 'nullified'.

A nullified packet is a packet that terminates with an EDB symbol as opposed to an END. It
also has an inverted 32-bit LCRC.
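
A receiver's decision between a good, nullified, and bad TLP can be sketched as follows; the framing-symbol enum and the externally computed LCRC value are assumptions made purely for illustration.

#include <stdint.h>

enum end_symbol  { END_GOOD, END_EDB };                /* framing symbol at packet end */
enum tlp_verdict { TLP_GOOD, TLP_NULLIFIED, TLP_BAD };

/* Classify a received TLP given its terminating framing symbol, the LCRC
 * carried in the packet, and the LCRC computed locally by the receiver.
 * A nullified TLP ends with EDB and carries the bit-wise inverse of the
 * expected LCRC; it is discarded silently with no ACK or NAK returned. */
static enum tlp_verdict classify_tlp(enum end_symbol end,
                                     uint32_t received_lcrc,
                                     uint32_t computed_lcrc)
{
    if (end == END_EDB && received_lcrc == ~computed_lcrc)
        return TLP_NULLIFIED;    /* discard, no ACK/NAK                     */
    if (end == END_GOOD && received_lcrc == computed_lcrc)
        return TLP_GOOD;         /* forward to the Transaction Layer        */
    return TLP_BAD;              /* discard, schedule a NAK if none pending */
}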

Example That Demonstrates Switch Cut-Through Feature

Consider the example in Figure 5-17 that illustrates the cut-through mode of a switch.

Figure 5-17. Switch Cut-Through Mode Showing Error Handling

A TLP with large data payload passes from the left, through the switch, to the Endpoint on the
right. The steps as the packet is routed through the switch are as follows:

1. A TLP is inbound to a switch. While en route, the packet's contents are corrupted.

2. The TLP header at the head of the TLP is decoded by the switch and the packet is forwarded to the egress port before the switch becomes aware of a CRC error. Finally, the tail end of the packet arrives at the switch ingress port and the switch is able to complete a CRC check.

3. The switch detects a CRC error, for which it returns a NAK DLLP to the TLP source.

4. On the egress port, the switch replaces the END framing symbol at the tail end of the bad TLP with the EDB (End Bad) symbol. The CRC is also inverted from what it would normally be. The TLP is now 'nullified'. Once the TLP has exited the switch, the switch discards its copy from the Replay Buffer.

5. The nullified packet arrives at the Endpoint. The Endpoint detects the EDB symbol and the inverted CRC and discards the packet.

6. The Endpoint does not return a NAK DLLP (otherwise the switch would be obliged to replay).

When the TLP source device receives the NAK DLLP, it replays the packet. This time no error
occurs on the switch's ingress port. As the packet arrives in the switch, the header is decoded
and the TLP is forwarded to the egress port with very short latency. When the tail end of the
TLP arrives at the switch, a CRC check is performed. There is no error, so an ACK is returned
to the TLP source which then purges its replay buffer. The switch stores a copy of the TLP in
its egress port Replay Buffer. When the TLP reaches the destination Endpoint, the Endpoint
device performs a CRC check. The packet is a good packet terminated with the END framing
symbol. There are no CRC errors and so the Endpoint returns an ACK DLLP to the switch. The
switch purges the copy of the TLP from its Replay Buffer. The packet has been routed from
source to destination with minimal latency.
Chapter 6. QoS/TCs/VCs and Arbitration
The Previous Chapter

This Chapter

The Next Chapter

Quality of Service

Perspective on QOS/TC/VC and Arbitration

Traffic Classes and Virtual Channels

Arbitration
The Previous Chapter
The previous chapter detailed the Ack/Nak Protocol that verifies the delivery of TLPs between
each pair of ports as they travel between the requester and completer devices, along with the
hardware retry mechanism that is automatically triggered when a TLP transmission error is
detected on a given link.
This Chapter
This chapter discusses Traffic Classes, Virtual Channels, and Arbitration that support Quality of
Service concepts in PCI Express implementations. The concept of Quality of Service in the
context of PCI Express is an attempt to predict the bandwidth and latency associated with the
flow of different transaction streams traversing the PCI Express fabric. The use of QoS is
based on application-specific software assigning Traffic Class (TC) values to transactions,
which define the priority of each transaction as it travels between the Requester and Completer
devices. Each TC is mapped to a Virtual Channel (VC) that is used to manage transaction
priority via two arbitration schemes called port and VC arbitration.
The Next Chapter
The next chapter discusses the purposes and detailed operation of the Flow Control Protocol.
This protocol requires each device to implement credit-based link flow control for each virtual
channel on each port. Flow control guarantees that transmitters will never send Transaction
Layer Packets (TLPs) that the receiver can't accept. This prevents receive buffer over-runs and
eliminates the need for inefficient disconnects, retries, and wait-states on the link. Flow Control
also helps enable compliance with PCI Express ordering rules by maintaining separate virtual
channel Flow Control buffers for three types of transactions: Posted (P), Non-Posted (NP) and
Completions (Cpl).
Quality of Service
Quality of Service (QoS) is a generic term that normally refers to the ability of a network or
other entity (in our case, PCI Express) to provide predictable latency and bandwidth. QoS is of
particular interest when applications require guaranteed bus bandwidth at regular intervals,
such as streaming audio data. To help deal with this type of requirement, PCI Express defines isochronous
transactions that require a high degree of QoS. However, QoS can apply to any transaction or
series of transactions that must traverse the PCI Express fabric. Note that QoS can only be
supported when the system and device-specific software is PCI Express aware.

QoS can involve many elements of performance including:

Transmission rate

Effective Bandwidth

Latency

Error rate

Other parameters that affect performance

Several features of PCI Express architecture provide the mechanisms that make QoS
achievable. The PCI Express features that support QoS include:

Traffic Classes (TCs)

Virtual Channels (VCs)

Port Arbitration

Virtual Channel Arbitration

Link Flow Control

PCI Express uses these features to support two general classes of transactions that can
benefit from the PCI Express implementation of QoS.

Isochronous Transactions: from Iso (same) + chronous (time), these transactions require a
constant bus bandwidth at regular intervals along with guaranteed latency. Isochronous
transactions are most often used when a synchronous connection is required between two
devices. For example, a CD-ROM drive containing a music CD may be sourcing data to
speakers. A synchronous connection exists when a headset is plugged directly into the drive.
However, when the audio card is used to deliver the audio information to a set of external
speakers, isochronous transactions may be used to simplify the delivery of the data.

Asynchronous Transactions: This class of transactions involves a wide variety of applications
that have widely varying requirements for bandwidth and latency. QoS can provide the more
demanding applications (those requiring higher bandwidth and shorter latencies) with higher
priority than the less demanding applications. In this way, software can establish a hierarchy of
traffic classes for transactions that permits differentiation of transaction priority based on their
requirements. The specification refers to this capability as differentiated services.

Isochronous Transaction Support

PCI Express supports QoS and the associated TC, VC, and arbitration mechanisms so that
isochronous transactions can be performed. A classic example of a device that benefits from
isochronous transaction support is a video camera attached to a tape deck. This real-time
application requires that image and audio data be transferred at a constant rate (e.g., 64
frames/second). This type of application is typically supported via a direct synchronous
attachment between the two devices.

Synchronous Versus Isochronous Transactions

Two devices connected directly perform synchronous transfers. A synchronous source delivers
data directly to the synchronous sink through use of a common reference clock. In our example,
the video camera (synchronous source) sends audio and video data to the tape deck
(synchronous sink), which immediately stores the data in real time with little or no data
buffering, and with only a slight delay due to signal propagation.

When these devices are connected via PCI Express a synchronous connection is not possible.
Instead, PCI Express emulates synchronous connections through the use of isochronous
transactions and data buffering. In this scenario, isochronous transactions can be used to
ensure that a constant amount of data is delivered at specified intervals (100µs in this
example), thus achieving the required transmission characteristics. Consider the following
sequence (Refer to Figure 6-1 on page 254):

1. The synchronous source (video camera and PCI Express interface) accumulates data in Buffer A during service interval 1 (SI 1).

2. The camera delivers the accumulated data to the synchronous sink (tape deck) sometime during the next service interval (SI 2). The camera also accumulates the next block of data in Buffer B as the contents of Buffer A are delivered.

3. The tape deck buffers the incoming data (in its Buffer A), which can then be delivered synchronously for recording on tape during service interval 3. During SI 3 the camera once again accumulates data into Buffer A, and the cycle repeats.

Figure 6-1. Example Application of Isochronous Transaction

Isochronous Transaction Management

Management of an isochronous communications channel is based on a Traffic Class (TC) value
and an associated Virtual Channel (VC) number that software assigns during initialization.
Hardware components, including the Requester of a transaction and all devices in the path
between the requester and completer, are configured to transport the isochronous transactions
from link to link via a high-priority virtual channel.
The requester initiates isochronous transactions that include a TC value representing the
desired QoS. The Requester injects isochronous packets into the fabric at the required rate
(service interval), and all devices in the path between the Requester and Completer must be
configured to support the transport of the isochronous transactions at the specified interval. Any
intermediate device along the path must convert the TC to the associated VC used to control
transaction arbitration. This arbitration results in the desired bandwidth and latency for
transactions with the assigned TC. Note that the TC value remains constant for a given
transaction while the VC number may change from link to link.

Differentiated Services

Various types of asynchronous traffic (all traffic other than isochronous) have different priority
from the system perspective. For example, Ethernet traffic requires higher priority (smaller
latencies) than mass storage transactions. PCI Express software can establish different TC
values and associated virtual channels and can set up the communications paths to ensure
different delivery policies are established as required. Note that the specification does not
define specific methods for identifying delivery requirements or the policies to be used when
setting up differentiated services.
Perspective on QOS/TC/VC and Arbitration
PCI does not include any QoS-related features similar to those defined by PCI Express. Many
questions arise regarding the need for such an elaborate scheme for managing traffic flow
based on QoS and differentiated services. Even without these new features, the bandwidth
available in a PCI Express system is far greater, and latencies are much shorter, than in
PCI-based implementations, due primarily to the topology and higher delivery rates.
Consequently, aside from the possible advantage of isochronous transactions, there appears to
be little advantage to implementing systems that support multiple Traffic Classes and Virtual
Channels.

While this may be true for most desktop PCs, other high-end applications may benefit
significantly from these new features. The PCI Express specification also opens the door to
applications that demand the ability to differentiate and manage system traffic based on Traffic
Class prioritization.
Traffic Classes and Virtual Channels
During initialization a PCI Express device-driver communicates the levels of QoS that it desires
for its transactions, and the operating system returns TC values that correspond to the QoS
requested. The TC value ultimately determines the relative priority of a given transaction as it
traverses the PCI Express fabric. Two hardware mechanisms provide guaranteed isochronous
bandwidth and differentiated services:

Virtual Channel Arbitration

Port Arbitration

These arbitration mechanisms use VC numbers to manage transaction priority. System
configuration software must assign VC IDs and set up the association between the traffic class
assigned to a transaction and the virtual channel to be used when traversing each link. This is
done via VC configuration registers mapped within the extended configuration address space.
The list of these registers and their location within configuration space is illustrated in Figure 6-2.

Figure 6-2. VC Configuration Registers Mapped in Extended Configuration Address Space

The TC value is carried in the transaction packet header and can contain one of eight values
(TC0-TC7). TC0 must be implemented by all PCI Express devices and the system makes a
"best effort" when delivering transactions with the TC0 label. TC values of TC1-TC7 are
optional and provide seven levels of arbitration for differentiating between packet streams that
require varying amounts of bandwidth. Similarly, eight VC numbers (VC0-VC7) are specified,
with VC0 required and VC1-VC7 optional. ("VC Assignment and TC Mapping" on page 258
discusses VC initialization).

Note that TC0 is hardwired to VC0 in all devices. If configuration software is not PCI Express
aware, all transactions will use the default TC0 and VC0, thereby eliminating the possibility of
supporting differentiated services and isochronous transactions. Furthermore, the specification
requires some transaction types to use TC0/VC0 exclusively:

Configuration

I/O

INTx Message

Power Management Message

Error Signaling Message

Unlock Message

Set_Slot_Power_Limit Message

VC Assignment and TC Mapping

Configuration software designed for PCI Express sets up virtual channels for each link in the
fabric. Recall that the default TC and VC assignments following Cold Reset will be TC0 and
VC0, which is used when the configuration software is not PCI Express aware. The number of
virtual channels used depends on the greatest capability shared by the two devices attached to
a given link. Software assigns an ID for each VC and maps one or more TCs to each.

Determining the Number of VCs to be Used

Software checks the number of VCs supported by the devices attached to a common link and
assigns the greatest number of VCs that both devices have in common. For example, consider
the three devices attached to the switch in Figure 6-3 on page 259. In this example, the switch
supports all 8 VCs on each of its ports, while Device A supports only the default VC, Device B
supports 4 VCs, and Device C supports 8 VCs. When configuring VCs for each link, software
determines the maximum number of VCs supported by both devices at each end of the link and
assigns that number to both devices. The VC assignment applies to transactions flowing across
a link in both directions.

Figure 6-3. The Number of VCs Supported by Device Can Vary

Note that even though switch port A supports all 8 VCs, Device A supports only a single VC, leaving
7 VCs unused within switch port A. Similarly, 4 VCs are used by switch port B. Software of
course configures and enables all 8 VCs within switch port C.

Configuration software determines the maximum number of VCs supported by each port
interface by reading its Extended VC Count field contained within the "Virtual Channel
Capability" registers. The smaller of the two values governs the maximum number of VCs
supported by this link for both transmission and reception of transactions. Figure 6-4 on page
260 illustrates the location and format of the Extended VC Count field. Software may restrict
the number of VCs configured and enabled to fewer than actually allowed. This may be done to
achieve the QoS desired for a given platform or application.

Figure 6-4. Extended VCs Supported Field


Assigning VC Numbers (IDs)

Configuration software must assign VC numbers or IDs to each of the virtual channels, except
VC0, which is always hardwired. As illustrated in Figure 6-5 on page 261, the VC Capabilities
registers include 3 DWs used for configuring each VC. The first set of registers (starting at
offset 10h) always applies to VC0. The Extended VCs Count field (described above) defines
the number of additional VC register sets implemented by this port, each of which permits
configuration of an additional VC. Note that these register sets are mapped in configuration
space directly following the VC0 registers. The mapping is expressed as an offset from each of
the three VC0 DW registers:

10h + (n*0Ch)

14h + (n*0Ch)

18h + (n*0Ch)

Figure 6-5. VC Resource Control Register


The value "n" represents the number of additional VCs implemented. For example, if the
Extended VCs Count contains a value of 3, then n=1, 2, and 3 for the three additional register
sets. Note that these numbers simply identify the register sets for each VC supported and is
not the VC ID.
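
The offset arithmetic can be illustrated with a small C sketch; the helper names for the three per-VC DWs are descriptive labels only, and the offsets are relative to the start of the Virtual Channel Capability structure.

#include <stdio.h>

/* Offsets of the three per-VC DWs within the VC Capability structure.
 * n = 0 selects the VC0 register set; each additional set follows at a
 * stride of 0Ch. */
static unsigned vc_dw0_offset(unsigned n) { return 0x10 + n * 0x0C; }
static unsigned vc_dw1_offset(unsigned n) { return 0x14 + n * 0x0C; }
static unsigned vc_dw2_offset(unsigned n) { return 0x18 + n * 0x0C; }

int main(void)
{
    /* With an Extended VCs Count of 3, register sets exist for n = 0..3. */
    for (unsigned n = 0; n <= 3; n++)
        printf("VC register set %u: %02Xh / %02Xh / %02Xh\n", n,
               vc_dw0_offset(n), vc_dw1_offset(n), vc_dw2_offset(n));
    return 0;
}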

Software assigns a VC ID for each of the additional VCs being used via the VC ID field within
the VCn Resource Control Register. (See Figure 6-5) These IDs are not required to be
assigned contiguous values, but the same VC value can be used only once.

Assigning TCs to each VC - TC/VC Mapping

The Traffic Class value assigned by a requester to each transaction must be associated with a
VC as it traverses each link on its journey to the recipient. Also, the VC ID associated with a
given TC may change from link to link. Configuration software establishes this association
during initialization via the TC/VC Map field of the VC Resource Control Register. This 8-bit field
permits any TC value to be mapped to the selected VC, where each bit position represents the
corresponding TC value (i.e., bit 0 = TC0:: bit 7 = TC7). Setting a bit assigns the corresponding
TC value to the VC ID. Figure 6-6 shows a mapping example where TC0 and TC1 are mapped
to VC0 and TC2::TC4 are mapped to VC3.
Figure 6-6. TC to VC Mapping Example
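
A small C sketch of building the TC/VC Map value for this example follows; the helper function is illustrative only.

#include <stdio.h>
#include <stdint.h>

/* Build an 8-bit TC/VC Map value: setting bit t maps TC t to the VC whose
 * Resource Control Register holds this field. */
static uint8_t tc_vc_map(const int *tcs, unsigned count)
{
    uint8_t map = 0;
    for (unsigned i = 0; i < count; i++)
        map |= (uint8_t)(1u << tcs[i]);
    return map;
}

int main(void)
{
    int vc0_tcs[] = { 0, 1 };        /* TC0 and TC1 mapped to VC0     */
    int vc3_tcs[] = { 2, 3, 4 };     /* TC2 through TC4 mapped to VC3 */

    printf("VC0 TC/VC Map = 0x%02X\n", (unsigned)tc_vc_map(vc0_tcs, 2));   /* 0x03 */
    printf("VC3 TC/VC Map = 0x%02X\n", (unsigned)tc_vc_map(vc3_tcs, 3));   /* 0x1C */
    return 0;
}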

Software is permitted a great deal of flexibility in assigning VC IDs and mapping the associated
TCs. However, the specification states several rules associated with the TC/VC mapping:

TC/VC mapping must be identical for the two ports attached to the same link.

One TC must not be mapped to multiple VCs in any PCI Express Port.

One or multiple TCs can be mapped to a single VC.

Table 6-1 on page 263 lists a variety of combinations that may be implemented. This is
intended only to illustrate a few combinations, and many more are possible.

Table 6-1. Example TC to VC Mappings

TC                VC Assignment    Comment

TC0               VC0              Default setting, used by all transactions.

TC0-TC1           VC0              VCs are not required to be assigned consecutively. Multiple TCs can be
TC2-TC7           VC7              assigned to a single VC.

TC0               VC0              Several transaction types must use TC0/VC0. (1) TCs are not required to
TC1               VC1              be assigned consecutively. Some TC/VC combinations can be used to
TC6               VC6              support an isochronous connection.
TC7               VC7

TC0               VC0              All TCs can be assigned to the corresponding VC numbers.
TC1               VC1
TC2               VC2
TC3               VC3
TC4               VC4
TC5               VC5
TC6               VC6
TC7               VC7

TC0               VC0              The VC number that is assigned need not match one of the corresponding
TC1-TC4           VC6              TC numbers.

TC0               VC0              Illegal. A TC number can be assigned to only one VC number. This example
TC1-TC2           VC1              shows TC2 mapped to both VC1 and VC2, which is not allowed.
TC2               VC2
Arbitration
Two types of transaction arbitration provide the method for managing isochronous transactions
and differentiated services:

Virtual Channel (VC) Arbitration - determines the priority of transactions being transmitted
from the same port, based on their VC ID.

Port Arbitration - determines the priority of transactions with the same VC assignment at
the egress port, based on the priority of the port at which the transactions arrived. Port
arbitration applies to transactions that have the same VC ID at the egress port; therefore, a
port arbitration mechanism exists for each virtual channel supported by the egress port.

Arbitration is also affected by the requirements associated with transaction ordering and flow
control. These additional requirements are discussed in subsequent chapters, but are
mentioned in the context of arbitration as required in the following discussions.

Virtual Channel Arbitration

In addition to supporting QoS objectives, VC arbitration should also ensure that forward
progress is made for all transactions. This prevents inadvertent split transaction time-outs. Any
device that both initiates transactions and supports two or more VCs must implement VC
arbitration. Furthermore, other device types that support more than one VC (e.g., switches)
must also support VC arbitration.

VC arbitration allows a transmitting device to determine the priority of transactions based on
their VC assignment. Key characteristics of VCs that are relevant to VC arbitration include:

Each VC supported and enabled provides its own buffers and flow control.

Transactions mapped to the same VC are issued in strict order (unless the "Relaxed
Ordering" attribute bit is set).

No ordering relationship exists between transactions assigned to different VCs.

Figure 6-7 on page 265 illustrates the concept of VC arbitration. In this example two VCs are
implemented (VC0 and VC1) and transmission priority is based on a 3:1 ratio, where 3 VC1
transactions are sent for each VC0 transaction. The device core issues transactions (that
include a TC value) to the TC/VC Mapping logic. Based on the associated VC value, the
transaction is routed to the appropriate VC buffer where it awaits transmission. The VC arbiter
determines the VC buffer priority when sending transactions.

Figure 6-7. Conceptual VC Arbitration Example

This example illustrates the flow of transactions in only one direction. The same logic exists for
transmitting transactions simultaneously in the opposite direction. That is, the root port also
contains transmit buffers and an arbiter, and the endpoint device contains receive buffers.

A variety of VC arbitration mechanisms may be employed by a given design. The method
chosen by the designer is specified within the VC capability registers. In general, there are
three approaches that can be taken:

Strict Priority Arbitration for all VCs

Split Priority Arbitration - VCs are segmented into low- and high-priority groups. The low-
priority group uses some form of round robin arbitration and the high-priority group uses
strict priority.

Round robin priority (standard or weighted) arbitration for all VCs

Strict Priority VC Arbitration


The specification defines a default priority scheme based on the inherent priority of VC IDs
(VC0=lowest priority and VC7=highest priority). The arbitration mechanism is hardware based,
and requires no configuration. Figure 6-8 illustrates a strict priority arbitration example that
includes all VCs. The VC ID governs the order in which transactions are sent. The maximum
number of VCs that use strict priority arbitration cannot be greater than the value in the
Extended VC Count field. (See Figure 6-4 on page 260.) Furthermore, if the designer has
chosen strict priority arbitration for all VCs supported, the Low Priority Extended VC Count field
of Port VC Capability Register 1 is hardwired to zero. (See Figure 6-9 on page 267.)

Figure 6-8. Strict Arbitration Priority

Figure 6-9. Low Priority Extended VC Count


Strict priority requires that VCs of higher priority get precedence over lower priority VCs based
on the VC ID. For example, if all eight VCs are governed by strict priority, transactions with a
VC ID of VC0 can only be sent when no transactions are pending transmission in VC1-VC7. In
some circumstances strict priority can result in lower priority transactions being starved for
bandwidth and experiencing extremely long latencies. Conversely, the highest priority
transactions receive very high bandwidth with minimal latencies. The specification requires that
high priority traffic be regulated to avoid starvation, and further defines two methods of
regulation:

The originating port can manage the injection rate of high priority transactions, to permit
greater bandwidth for lower priority transactions.

Switches can regulate multiple data flows at the egress port that are vying for link
bandwidth. This method may limit the throughput from high bandwidth applications and
devices that attempt to exceed the limitations of the available bandwidth.

The designer of a device may also limit the number of VCs that participate in strict priority by
specifying a split between the low- and high-priority VCs as discussed in the next section.
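
A minimal sketch of the strict priority selection, assuming the transmitter simply knows which VC
buffers currently hold a pending TLP (the function and array names are illustrative):

#include <stdbool.h>

/* Strict priority VC arbitration: scan from the highest VC ID downward
 * and service the first VC with a TLP pending. VC0 is only selected
 * when no higher-numbered VC has anything to send.                     */
int strict_priority_select(const bool vc_has_pending_tlp[], int num_vcs)
{
    for (int vc = num_vcs - 1; vc >= 0; vc--) {
        if (vc_has_pending_tlp[vc])
            return vc;
    }
    return -1;   /* nothing pending on any VC */
}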

Low- and High-Priority VC Arbitration


Figure 6-9 on page 267 illustrates the Low Priority Extended VC Count field within VC
Capability Register 1. This read-only field specifies a VC ID value that identifies the upper limit
of the low-priority arbitration group for the design. For example, if this count contains a value of
4, then VC0-VC4 are members of the low-priority group and VC5-VC7 use strict priority. Note
that a Low Priority Extended VC Count of 7 means that no strict priority is used.

As depicted in Figure 6-11 on page 269, the high-priority VCs continue to use strict priority
arbitration, while the low-priority arbitration group uses one of the other prioritization methods
supported by the device. VC Capability Register 2 reports which alternate arbitration methods
are supported for the low priority group, and the VC Control Register permits selection of the
method to be used by this group. See Figure 6-10 on page 268. The low-priority arbitration
schemes include:

Hardware Based Fixed Arbitration Scheme - the specification permits the vendor to define a
hardware-based fixed arbitration scheme that provides all VCs with the same priority
(e.g., round robin).

Weighted Round Robin (WRR) - with WRR some VCs can be given higher priority than
others because they occupy more positions within the round robin. The specification
defines three WRR configurations, each with a different number of entries (or phases).

Figure 6-11. VC Arbitration with Low- and High-Priority Implementations

Figure 6-10. Determining VC Arbitration Capabilities and Selecting the Scheme
Hardware Fixed Arbitration Scheme

This selection defines a hardware-based VC arbitration scheme that requires no additional
software setup. The specification mentions standard Round Robin arbitration as an example
scheme that the designer may choose. In such a scheme, transactions pending transmission
within each low-priority VC are sent during each pass through the round robin. The specification
does not preclude other implementation-specific schemes.

Weighted Round Robin Arbitration Scheme

The weighted round robin (WRR) approach permits software to configure the VC Arbitration
table. The number of arbitration table entries supported by the design is reported in the VC
Arbitration Capability field of Port VC Capability Register 2. The table size is selected by
writing the corresponding value into the VC Arbitration Select field of the Port VC Control
Register. See Figure 6-10 on page 268. Each entry in the table represents one phase that
software loads with a low-priority VC ID value. The VC arbiter repeatedly scans all table entries
in a sequential fashion and sends transactions from the VC buffer specified in the table entries.
Once a transaction has been sent, the arbiter immediately proceeds to the next phase.

Software can set up the VC arbitration table such that some VCs are listed in more entries than
others, thereby allowing differentiation of QoS between the VCs. This gives software
considerable flexibility in establishing the desired priority. Figure 6-12 on page 270 depicts the
weighted round robin VC arbitration concept.
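
The following is a sketch of the arbiter's scan of such a table, assuming the table has already
been loaded by software; the structure and function names are illustrative, and the per-VC
pending flag stands in for the transmitter's buffer state.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    const uint8_t *phase;    /* table entries: one low-priority VC ID per phase     */
    int            phases;   /* number of phases reported/selected for this design  */
    int            current;  /* index of the current phase                          */
} vc_arb_table_t;

/* One weighted round robin decision: advance phase by phase, returning the
 * VC named by the first phase that has a transaction pending. A VC listed
 * in more phases therefore gets a proportionally larger share of the link. */
int wrr_next_vc(vc_arb_table_t *t, const bool vc_has_pending_tlp[])
{
    for (int scanned = 0; scanned < t->phases; scanned++) {
        int vc = t->phase[t->current];
        t->current = (t->current + 1) % t->phases;
        if (vc_has_pending_tlp[vc])
            return vc;
    }
    return -1;   /* every phase was idle on this pass */
}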

Figure 6-12. Weighted Round Robin Low-Priority VC Arbitration Table Example

Round Robin Arbitration (Equal or Weighted) for All VCs

The hardware designer may choose to implement one of the round robin forms of VC
arbitration for all VCs. This is accomplished by specifying the highest VC number supported by
the device as a member of the low-priority group (via the Low Priority Extended VC Count
field). In this case, all VC priorities are managed via the VC arbitration table. Note that the VC
arbitration table is not used when the Hardware Fixed Round Robin scheme is selected. See
page 269.

Loading the Virtual Channel Arbitration Table

The VC Arbitration Table (VAT) is located at an offset from the beginning of the extended
configuration space as indicated by the VC Arbitration Table Offset field. This offset is
contained within Port VC Capability Register 2. (See Figure 6-13 on page 271.)

Figure 6-13. VC Arbitration Table Offset and Load VC Arbitration Table Fields
Refer to Figure 6-14 on page 272 during the following discussion. Each entry within the VAT is
a 4-bit field that identifies the VC ID of the virtual channel buffer that is scheduled to deliver
data during the corresponding phase. The table length is a function of the hardware design and
of the arbitration scheme selected, if the design supports a choice of schemes, as illustrated in
Figure 6-10 on page 268.

Figure 6-14. Loading the VC Arbitration Table Entries


The table is loaded by configuration software to achieve the priority order desired for the virtual
channels. Hardware sets the VC Arbitration Table Status bit when software updates any entry
within the table. Once the table is loaded, software sets the Load VC Arbitration Table bit
within the Port VC Control register. This bit causes hardware to load the new values into the VC
Arbiter. Hardware clears the VC Arbitration Table Status bit when table loading is complete,
thereby permitting software to verify successful loading.
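
A sketch of that software sequence is shown below. The register offsets, bit positions, and the
configuration-space accessors are assumptions made purely for illustration; real code would use
the offsets reported in the port's VC capability structure.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical configuration-space accessors. */
extern uint32_t cfg_read32(uint32_t offset);
extern void     cfg_write32(uint32_t offset, uint32_t value);

#define LOAD_VC_ARB_TABLE_BIT   (1u << 0)   /* in Port VC Control register (assumed position) */
#define VC_ARB_TABLE_STATUS_BIT (1u << 0)   /* in Port VC Status register (assumed position)  */

/* Write the 4-bit phase entries (packed into DWORDs), ask hardware to load
 * them into the VC arbiter, then wait for the status bit to clear as the
 * indication that the new table has taken effect.                          */
void load_vc_arbitration_table(uint32_t vat_offset, uint32_t port_vc_control,
                               uint32_t port_vc_status,
                               const uint32_t *packed_entries, size_t n_dwords)
{
    for (size_t i = 0; i < n_dwords; i++)
        cfg_write32(vat_offset + (uint32_t)(4 * i), packed_entries[i]);

    cfg_write32(port_vc_control,
                cfg_read32(port_vc_control) | LOAD_VC_ARB_TABLE_BIT);

    while (cfg_read32(port_vc_status) & VC_ARB_TABLE_STATUS_BIT)
        ;   /* hardware clears the status bit when loading is complete */
}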

VC Arbitration within Multiple Function Endpoints

The specification does not state how an endpoint should manage the arbitration of data flows
from different functions within an endpoint. However, it does state that "Multi-function
Endpoints... should support PCI Express VC-based arbitration control mechanisms if multiple
VCs are implemented for the PCI Express Link." VC arbitration when there are multiple
functions raises interesting questions about the approach to be taken. Of course, when the
device functions support only VC0, no VC arbitration is necessary. The specification leaves the
approach open to the designer.

Figure 6-15 on page 274 shows a functional block diagram of an example implementation in
which two functions are implemented within an endpoint device, each of which supports two
VCs. The example approach is based upon the goal of using a standard PCI Express core to
interface both functions to the link. The transaction layer within the core performs the TC/VC
mapping and VC arbitration. The device-specific portion of the design is the function arbiter that
determines the priority of data flows from the functions to the transaction layer of the core.
Following are key considerations for such an approach:
Rather than duplicating the TC/VC mapping within each function, the standard device core
performs the task. An important consideration for this decision is that all functions must use
the same TC/VC mapping. The specification requires that the TC/VC mapping be the same
for devices at each end of a link. This means that each function within the endpoint must
have the same mappings.

The function arbiter uses TC values to determine the priority of transactions being
delivered from the two functions, and selects the highest priority transaction from the
functions when forwarding transactions to the transaction layer of the PCI Express core.
The arbitration algorithm is hardwired based on the applications associated with each
function.

Figure 6-15. Example Multi-Function Endpoint Implementation with VC Arbitration

Port Arbitration

When traffic from multiple ports vies for the limited bandwidth associated with a common egress
port, arbitration is required. The concept of port arbitration is pictured in Figure 6-16 on page
275. Note that port arbitration exists in three locations within a system:

Egress ports of switches

Root Complex ports when peer-to-peer transactions are supported

Root Complex egress ports that lead to resources such as main memory
Figure 6-16. Port Arbitration Concept

Port arbitration requires software configuration, which is handled via PCI-to-PCI bridge (PPB)
configuration for switches and for peer-to-peer transfers within the Root Complex, and via the
Root Complex Register Block (RCRB) when accessing shared root complex resources such as
main memory. Port arbitration occurs independently for each virtual channel supported by the
egress port. In the example below, root port 2 supports peer-to-peer transfers from root ports 1
and 2; however, peer-to-peer transfer support between root complex ports is not required.

Because port arbitration is managed independently for each VC of the egress port or RCRB, a
port arbitration table is required for each VC that supports programmable port arbitration as
illustrated in Figure 6-17 on page 276. Port arbitration tables are supported only by switches
and RCRBs and are not allowed for endpoints, root ports and PCI Express bridges.

Figure 6-17. Port Arbitration Tables Needed for Each VC


The process of arbitrating between different packet streams also implies the use of additional
buffers to accumulate traffic from each port in the egress port, as illustrated in Figure 6-18 on
page 277. This example illustrates two ingress ports (1 and 2) whose transactions are routed
to an egress port (3). The actions taken by the switch include:

1. Transactions arriving at the ingress ports are directed to the appropriate flow
control buffers based on the TC/VC mapping.

2. Transactions are forwarded from the flow control buffers to the routing logic, which is
consulted to determine the egress port.

3. Transactions are routed to the egress port (3) where TC/VC mapping determines into which
VC buffer the transactions should be placed.

4. A set of VC buffers is associated with each of the egress ports. Note that the ingress port
number is tracked until transactions are placed in their VC buffer.

5. Port arbitration logic determines the order in which transactions are sent from each group of
VC buffers.

Figure 6-18. Port Arbitration Buffering


The Port Arbitration Mechanisms

The actual port arbitration mechanisms defined by the specification are similar to the models
used for VC arbitration and include:

Non-configurable hardware-fixed arbitration scheme

Weighted Round Robin (WRR) arbitration with 32 phases

WRR arbitration with 64 phases

WRR arbitration with 128 phases

Time-based WRR arbitration with 128 phases

WRR arbitration with 256 phases

Configuration software must determine the port arbitration capability for a switch or RCRB and
select the port arbitration scheme to be used for each enabled VC. Figure 6-19 on page 278
illustrates the registers and fields involved in determining port arbitration capabilities and
selecting the port arbitration scheme to be used by each VC.

Figure 6-19. Software checks Port Arbitration Capabilities and Selects the
Scheme to be Used
Non-Configurable Hardware-Fixed Arbitration

This port arbitration mechanism does not require configuration of the port arbitration table.
Once selected by software, the mechanism is managed solely by hardware. The actual
arbitration scheme is based on a round-robin or similar approach where each port has the
same priority. This type of mechanism ensures a degree of fairness and ensures that all
transactions can make forward progress. However, it does not serve the goals of
differentiated services and does not support isochronous transactions.

Weighted Round Robin Arbitration

Like the weighted round robin mechanism used for VC arbitration, software loads the port
arbitration table such that some ports can receive higher priority than others based on the
number of phases in the round robin that are allocated for each port. This approach allows
software to facilitate differentiated services by assigning different weights to traffic coming from
different ports.

As the table is scanned, each table phase specifies a port number that identifies the VC buffer
from which the next transaction is sent. Once the transaction is delivered, arbitration control
logic immediately proceeds to the next phase. For a given port, if no transaction is pending
transmission, the arbiter advances immediately to the next phase.
The specification defines four table lengths for WRR port arbitration, determined by the number
of phases used by the table. The table length selections include:

32 phases

64 phases

128 phases

256 phases

Time-Based, Weighted Round Robin Arbitration

The time-based WRR mechanism is required for supporting isochronous transactions.
Consequently, each switch egress port and RCRB that supports isochronous transactions must
implement time-based WRR port arbitration.

Time-based weighted round robin adds the element of a virtual timeslot for each arbitration
phase. Just as in WRR, the port arbiter delivers one transaction from the Ingress Port VC buffer
indicated by the Port Number of the current phase. However, rather than immediately advancing
to the next phase, the time-based arbiter waits until the current virtual timeslot elapses before
advancing. This ensures that transactions are accepted from the ingress port buffer at regular
intervals. Note that the timeslot does not govern the duration of the transfer, but rather the
interval between transfers. The maximum duration of a transaction is the time it takes to
complete the round robin and return to the original timeslot. Each timeslot is defined as 100ns.

Also, it is possible that no transaction is delivered during a timeslot, resulting in an idle timeslot.
This occurs when:

no transaction is pending for the selected ingress port during the current phase, or

the phase contains the port number of this egress port

Time-based WRR arbitration supports a maximum table length of 128 phases. The actual
number of phases implemented is reported via the Maximum Time Slots field of each virtual
channel that supports time-based WRR arbitration. See Figure 6-20 on page 280, which
illustrates the Maximum Time Slots field within the VCn Resource Capability register. See
MindShare's website for a white paper on example applications of Time-Based WRR.
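
The following sketch models one pass of a time-based WRR port arbiter for a single VC of an
egress port, under the behavior described above; the helper functions are hypothetical stand-ins
for the hardware's buffer state, forwarding path, and 100ns timeslot timer.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware hooks. */
extern bool ingress_has_pending(int ingress_port, int vc);  /* TLP waiting for this VC?     */
extern void forward_one_tlp(int ingress_port, int vc);      /* accept one TLP               */
extern void wait_for_timeslot(void);                        /* one virtual timeslot = 100ns */

/* One pass through a time-based WRR port arbitration table. Unlike plain
 * WRR, the arbiter does not advance as soon as a TLP is accepted; it waits
 * for the current timeslot to elapse, so TLPs are accepted from each
 * ingress port at regular intervals. A slot is idle when nothing is
 * pending or when the entry names the egress port itself.                */
void timed_wrr_pass(const uint8_t *phase_port, int phases,
                    int egress_port, int vc)
{
    for (int phase = 0; phase < phases; phase++) {
        int port = phase_port[phase];
        if (port != egress_port && ingress_has_pending(port, vc))
            forward_one_tlp(port, vc);
        wait_for_timeslot();   /* advance only when the 100ns slot expires */
    }
}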

Figure 6-20. Maximum Time Slots Register


Loading the Port Arbitration Tables

A port arbitration table is required for each VC supported by the egress port.

The actual size and format of the Port Arbitration Tables are a function of the number of phases
and the number of ingress ports supported by the Switch, RCRB, or Root Port that supports
peer-to-peer transfers. The maximum number of ingress ports supported by the Port Arbitration
Table is 256 ports. The actual number of bits within each table entry is design dependent and
governed by the number of ingress ports whose transactions can be delivered to the egress
port. The size of each table entry is reported in the 2-bit Port Arbitration Table Entry Size field
of Port VC Capability Register 1. The permissible values are:

00b - 1 bit

01b - 2 bits

10b - 4 bits

11b - 8 bits

Configuration software loads each table with port numbers to accomplish the desired port
priority for each VC supported. As illustrated in Figure 6-21 on page 281, the port arbitration
table format depends on the size of each entry and the number of time slots supported by this
design.
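
A sketch of how software might read back one such entry is shown below, assuming the table
has been fetched as an array of DWORDs; the packing (little-endian within each DWORD) is an
assumption for illustration.

#include <stdint.h>

/* Extract the ingress port number stored in a given phase of a Port
 * Arbitration Table, where entry_bits is 1, 2, 4, or 8 as indicated by
 * the Port Arbitration Table Entry Size field. Entries never straddle a
 * DWORD boundary because 32 is a multiple of each permitted size.       */
uint8_t port_arb_entry(const uint32_t *table_dwords, unsigned phase,
                       unsigned entry_bits)
{
    unsigned bit_offset = phase * entry_bits;
    unsigned dword      = bit_offset / 32;
    unsigned shift      = bit_offset % 32;
    uint32_t mask       = (1u << entry_bits) - 1u;

    return (uint8_t)((table_dwords[dword] >> shift) & mask);
}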

Figure 6-21. Format of Port Arbitration Table


Switch Arbitration Example

This section provides an example of a three-port switch with both Port and VC arbitration
illustrated. The example presumes that packets arriving on ingress ports 0 and 1 are moving in
the upstream direction and port 2 is the egress port facing the Root Complex. This example
serves to summarize port and VC arbitration and illustrate their use within a PCI Express
switch. Refer to Figure 6-22 on page 283 during the following discussion.

1. Packets arrive at ingress port 0 and are placed in a receiver flow control buffer
based on the TC/VC mapping associated with port 0. As indicated, TLPs carrying traffic
class TC0 or TC1 are sent to the VC0 receiver flow control buffers. TLPs carrying
traffic class TC3 or TC5 are sent to the VC1 receiver flow control buffers. No other
TCs are permitted on this link.

2. Packets arrive at ingress port 1 and are placed in a receiver flow control buffer based on
port 1 TC/VC mapping. As indicated, TLPs carrying traffic class TC0 are sent to the VC0
receiver flow control buffers. TLPs carrying traffic class TC2-TC4 are sent to the VC3 receiver
flow control buffers. No other TCs are permitted on this link.

3. The target egress port is determined from routing information in each packet. Address
routing is applied to memory or IO request TLPs, ID routing is applied to configuration or
completion TLPs, etc.

4. All packets destined for egress port 2 are subjected to the TC/VC mapping for that port. As
shown, TLPs carrying traffic class TC0-TC2 are managed as virtual channel 0 (VC0) traffic, and
TLPs carrying traffic class TC3-TC7 are managed as VC1 traffic.

5. Independent Port Arbitration is applied to packets within each VC. This may be a fixed or
weighted round robin arbitration used to select packets from all possible ingress ports.
Port arbitration ultimately results in all packets assigned to a given VC being routed to the
same VC buffer.

6. Following Port Arbitration, VC arbitration determines the order in which transactions pending
transmission within the individual VC buffers will be transferred across the link. The arbitration
algorithm may be fixed or weighted round robin. The arbiter selects transactions from the head
of each VC buffer based on the priority scheme implemented.

Note that the VC arbiter selects packets for transmission only if sufficient flow control credits
exist.

Figure 6-22. Example of Port and VC Arbitration within A Switch


Chapter 7. Flow Control
The Previous Chapter

This Chapter

The Next Chapter

Flow Control Concept

Flow Control Buffers

Introduction to the Flow Control Mechanism

Flow Control Packets

Operation of the Flow Control Model - An Example

Infinite Flow Control Advertisement

The Minimum Flow Control Advertisement

Flow Control Initialization

Flow Control Updates Following FC_INIT


The Previous Chapter
The previous chapter discussed the Traffic Class, Virtual Channel, and arbitration mechanisms
that support Quality of Service concepts in PCI Express implementations. The concept of Quality of Service
in the context of PCI Express is an attempt to predict the bandwidth and latency associated
with the flow of different transaction streams traversing the PCI Express fabric. The use of
QoS is based on application-specific software assigning Traffic Class (TC) values to
transactions, which define the priority of each transaction as it travels between the Requester
and Completer devices. Each TC is mapped to a Virtual Channel (VC) that is used to manage
transaction priority via two arbitration schemes called port and VC arbitration.
This Chapter
This chapter discusses the purposes and detailed operation of the Flow Control Protocol. This
protocol requires each device to implement credit-based link flow control for each virtual
channel on each port. Flow control guarantees that transmitters will never send Transaction
Layer Packets (TLPs) that the receiver can't accept. This prevents receive buffer over-runs and
eliminates the need for inefficient disconnects, retries, and wait-states on the link. Flow Control
also helps enable compliance with PCI Express ordering rules by maintaining separate virtual
channel Flow Control buffers for three types of transactions: Posted (P), Non-Posted (NP) and
Completions (Cpl).
The Next Chapter
The next chapter discusses the ordering requirements for PCI Express devices, as well as PCI
and PCI-X devices that may be attached to a PCI Express fabric. The discussion describes the
Producer/Consumer programming model upon which the fundamental ordering rules are based.
It also describes the potential performance problems that can emerge when strong ordering is
employed, describes the weak ordering solution, and specifies the rules defined for deadlock
avoidance.
Flow Control Concept
The ports at each end of every PCI Express link must implement Flow Control. Before a
transaction packet can be sent across a link to the receiving port, the transmitting port must
verify that the receiving port has sufficient buffer space to accept the transaction to be sent. In
many other architectures including PCI and PCI-X, transactions are delivered to a target device
without knowing if it can accept the transaction. If the transaction is rejected due to insufficient
buffer space, the transaction is resent (retried) until the transaction completes. This procedure
can severely reduce the efficiency of a bus, by wasting bus bandwidth when other transactions
are ready to be sent.

Because PCI Express is a point-to-point implementation, the Flow Control mechanism would be
ineffective if only one transaction stream were pending transmission across a link. That is, if the
receive buffer were temporarily full, the transmitter would be prevented from sending a
subsequent transaction due to transaction ordering requirements, thereby blocking any further
transfers. PCI Express improves link efficiency by implementing multiple flow-control buffers for
separate transaction streams (virtual channels). Because Flow Control is managed separately
for each virtual channel implemented for a given link, if the Flow Control buffer for one VC is full,
the transmitter can advance to another VC buffer and send transactions associated with it.

The link Flow Control mechanism uses a credit-based scheme that allows the transmitting
port to check buffer space availability at the receiving port. During initialization each receiver
reports the size of its receive buffers (in Flow Control credits) to the port at the opposite end of
the link. The receiving port continues to update the transmitting port regularly by transmitting the
number of credits that have been freed up. This is accomplished via Flow Control DLLPs.

Flow control logic is located in the transaction layer of the transmitting and receiving devices.
Both transmitter and receiver sides of each device are involved in flow control. Refer to Figure
7-1 on page 287 during the following descriptions.

Devices Report Buffer Space Available - The receiver of each node contains the Flow
Control buffers. Each device must report the amount of flow control buffer space it has
available to the device on the opposite end of the link. Buffer space is reported in units
called Flow Control Credits (FCCs). The number of Flow Control Credits within each buffer
is forwarded from the transaction layer to the transmit side of the link layer as illustrated in
Figure 7-1. The link layer creates a Flow Control DLLP that carries this credit information to the
receiver at the opposite end of the link. This is done for each Flow Control Buffer.

Receiving Credits - Notice that the receiver in Figure 7-1 also receives Flow Control
DLLPs from the device at the opposite end of the link. This information is transferred to the
transaction layer to update the Flow Control Counters that track the amount of Flow
Control Buffer space in the other device.

Credit Checks Made - Each transmitter consults the Flow Control Counters to check
available credits. If sufficient credits are available to receive the transaction pending
delivery, then the transaction is forwarded to the link layer and is ultimately sent to the
opposite device. If enough credits are not available, the transaction is temporarily blocked
until additional Flow Control credits are reported by the receiving device.

Figure 7-1. Location of Flow Control Logic


Flow Control Buffers
Flow control buffers are implemented for each VC resource supported by a PCI Express port.
Recall that devices at each end of the link may not support the same number of VC resources;
therefore, the maximum number of VCs configured and enabled by software is the largest
number of VCs the two ports have in common.

VC Flow Control Buffer Organization

Each VC Flow Control buffer at the receiver is managed for each category of transaction
flowing through the virtual channel. These categories are:

Posted Transactions - Memory Writes and Messages

Non-Posted Transactions - Memory Reads, Configuration Reads and Writes, and I/O
Reads and Writes

Completions - Read Completions and Write Completions

In addition, each of these categories is separated into header and data portions of each
transaction. Flow control operates independently for each of the six buffers listed below (also
see Figure 7-2 on page 289).

Posted Header

Posted Data

Non-Posted Header

Non-Posted Data

Completion Header

Completion Data

Figure 7-2. Flow Control Buffer Organization


Some transactions consist of a header only (e.g., read requests) while others consist of a
header and data (e.g., write requests). The transmitter must ensure that both header and data
buffer space is available, as required for each transaction, before the transaction can be sent.
Note that when a transaction is received into a VC Flow Control buffer, ordering must be
maintained when the transactions are forwarded to software, or to an egress port in the case of
a switch. The receiver must also track the order of header and data components within the
Flow Control buffer.
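
One way to picture the resulting bookkeeping is the sketch below: a per-VC structure with six
independent credit-tracking units, one per buffer listed above. The field names and widths are
illustrative (the 8-bit and 12-bit counter sizes are discussed later in this chapter).

#include <stdint.h>

/* Transmit-side credit tracking for one credit type. Header counters are
 * 8 bits wide and data counters 12 bits wide; the 12-bit values are held
 * in 16-bit fields and masked to 12 bits when used.                      */
typedef struct {
    uint8_t credit_limit;        /* CL: last value advertised by the receiver */
    uint8_t credits_consumed;    /* CC: credits used since initialization     */
} hdr_credits_t;

typedef struct {
    uint16_t credit_limit;
    uint16_t credits_consumed;
} data_credits_t;

/* Six independent flow control accounting units per virtual channel. */
typedef struct {
    hdr_credits_t  posted_hdr, nonposted_hdr, completion_hdr;
    data_credits_t posted_data, nonposted_data, completion_data;
} vc_flow_control_t;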

Flow Control Credits

Buffer space is reported by the receiver in units called Flow Control credits. The unit value of
Flow Control credits (FCCs) may differ between header and data as listed below:

Header FCCs - maximum header size + digest:

4 DWs for completions

5 DWs for requests

Data FCCs - 4 DWs (aligned 16 bytes)

Flow control credits are passed within the header of the link layer Flow Control Packets. Note
that DLLPs do not require Flow Control credits because they originate and terminate at the link
layer.
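
As a small worked sketch of these unit sizes, the helpers below compute how many credits a
pending TLP consumes: always one header credit, plus one data credit per aligned 16-byte
(4DW) block of payload.

#include <stdint.h>

/* One credit covers an entire header (request or completion). */
static inline unsigned header_credits_needed(void) { return 1; }

/* Data credits: one per 16-byte (4DW) block, rounded up. */
static inline unsigned data_credits_needed(unsigned payload_bytes)
{
    return (payload_bytes + 15u) / 16u;   /* e.g., 1024-byte payload -> 64 credits */
}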
Maximum Flow Control Buffer Size

The maximum buffer size that can be reported via the Flow Control Initialization and Update
packets for the header and data portions of a transaction is as follows:

128 credits for headers:

2,560 bytes for request headers @ 20 bytes/credit

2,048 bytes for completion headers @ 16 bytes/credit

2048 credits for data:

32KB @ 16 bytes/credit

The reason for these limits is discussed in the section entitled "Stage 1 - Flow Control Following
Initialization" on page 296, step 2.
Introduction to the Flow Control Mechanism
The specification defines the requirements of the Flow Control mechanism by describing
conceptual registers and counters along with procedures and mechanisms for reporting,
tracking, and calculating whether a transaction can be sent. These elements define the
functional requirements; however, the actual implementation may vary from the conceptual
model. This section introduces the specified model that serves to explain the concept and
define the requirements. The approach taken focuses on a single flow control example for a
non-posted header. The concepts discussed apply to all Flow Control buffer types.

The Flow Control Elements

Figure 7-3 identifies and illustrates the elements used by the transmitter and receiver when
managing flow control. This diagram illustrates transactions flowing in a single direction across
a link, but of course another set of these elements is used to support transfers in the opposite
direction. The primary function of each element within the transmitting and receiving devices is
listed below. Note that for a single direction these Flow Control elements are duplicated for
each Flow Control receive buffer, yielding six sets of elements. This example deals with non-
posted header flow control.

Figure 7-3. Flow Control Elements

Transmitter Elements
Pending Transaction Buffer - holds transactions that are pending transfer within the same
virtual channel.

Credit Consumed Counter - tracks the size of all transactions sent from the VC buffer (of
the specified type, e.g., non-posted headers) in Flow Control credits. This count is
abbreviated "CC."

Credit Limit Register - this register is initialized by the receiving device when it sends Flow
Control initialization packets to report the size of the corresponding Flow Control receive
buffer. Following initialization, Flow Control update packets are sent periodically to add
more Flow Control credits as they become available at the receiver. This value is
abbreviated "CL."

Flow Control Gating Logic - performs the calculations to determine if the receiver has
sufficient Flow Control credits to receive the pending TLP (PTLP). In essence, this check
ensures that the total CREDITS_CONSUMED (CC) plus the credits required for the next
packet pending transmission (PTLP) does not exceed the CREDIT_LIMIT (CL). The
specification defines the following equation for performing the check, with all values
represented in credits:

(CL - (CC + PTLP)) mod 2^[FieldSize] <= 2^[FieldSize]/2

For an example application of this equation, see "Stage 1 - Flow Control Following Initialization"
on page 294.
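
A minimal sketch of this gating check for a header credit type (FieldSize = 8) is shown below;
using an 8-bit unsigned type makes the subtraction wrap modulo 2^8 exactly as the equation
requires.

#include <stdint.h>
#include <stdbool.h>

/* Returns true when the receiver has advertised enough credits for the
 * pending TLP: (CL - (CC + PTLP)) mod 256 must be <= 128.               */
bool header_credits_available(uint8_t cl, uint8_t cc, uint8_t ptlp_credits)
{
    uint8_t needed = (uint8_t)(cc + ptlp_credits);   /* CC + PTLP, mod 256   */
    uint8_t margin = (uint8_t)(cl - needed);         /* CL - needed, mod 256 */
    return margin <= 128u;                           /* 2^8 / 2              */
}

For instance, with CL = 66h, CC = 00h, and one credit required (the values used in the Stage 1
example later in this chapter), margin evaluates to 65h, which is <= 80h, so the transaction may
be sent.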

Receiver Elements

Flow Control (Receive) Buffer - stores incoming header or data information.

Credit Allocated - This counter tracks the total Flow Control credits that have been
allocated (made available) since initialization. It is initialized by hardware to reflect the size
of the associated Flow Control buffer. As the buffer fills, the amount of available buffer
space decreases until transactions are removed from the buffer. The number of Flow
Control credits associated with each transaction removed from the buffer is added to the
CREDIT_ALLOCATED counter, thereby keeping a running count of new credits made
available.

Credits Received Counter (optional) - this counter keeps track of the total size of all data
received from the transmitting device and placed into the Flow Control buffer (in Flow
Control credits). When flow control is functioning properly, the CREDITS_RECEIVED count
should be the same as the CREDITS_CONSUMED count at the transmitter and be equal to or
less than the CREDIT_ALLOCATED count. If this is not true, then a flow control buffer
overflow has occurred and an error is detected. Although optional, the specification
recommends its use.

Flow control management is based on keeping track of Flow Control credits using modulo
counters. Consequently, the counters are designed to roll over when the count saturates. The
width of the counters depends on whether flow control is tracking transaction headers or data:

Header flow control uses modulo 256 counters (8-bits wide)

Data flow control uses modulo 4096 counters (12-bits wide)

In addition, all calculations are made using unsigned arithmetic. The operation of the counters
and the calculations are explained by example on page 290.
Flow Control Packets
The transmit side of a device reports flow control credit information from its receive buffers to
the opposite device. The specification defines three types of Flow Control packets:

Flow Control Init1 - used to report the size of the Flow Control buffers for a given virtual
channel

Flow Control Init2 - same as Flow Control Init1 except it is used to verify completion of flow
control initialization at each end of the link (receiving device ignores flow control credit
information)

Flow Control Update - used to update Credit Limit periodically

Each Flow Control packet carries the header and data flow control credit information for a given
virtual channel and Flow Control packet type. The packet fields that carry the header and
data Flow Control credits reflect the counter widths discussed in the previous section. Figure
7-4 pictures the format and content of these packets.

Figure 7-4. Types and Format of Flow Control Packets


Operation of the Flow Control Model - An Example
The purpose of this example is to explain the operation of the Flow Control mechanism based
on the conceptual model presented by the specification. The example uses the non-posted
header Flow Control buffer type, and spans four stages to capture the nuances of the flow
control implementation:

Stage One - Immediately following initialization, several transactions are tracked to explain
the basic operation of the counters and registers as transactions are sent across the link. In
this stage, data is accumulating within the Flow Control buffer, but no transactions are being
removed.

Stage Two - If the transmitter sends non-posted transactions at a rate faster than the receiver
can forward transactions from the buffer, the Flow Control buffer will eventually fill. Stage two
describes this circumstance.

Stage Three - The modulo counters are designed to roll over and continue counting from zero.
This stage describes the flow control operation at the point where the CREDITS_ALLOCATED
count rolls over to zero.

Stage Four - The specification describes the optional error check that can be made by the
receiver in the event of a Flow Control buffer overflow. This error check is described in this
section.

Stage 1 - Flow Control Following Initialization

The assumption made in this example is that flow control initialization has just completed and
the devices are ready for normal operation. The Flow Control buffer is presumed to be 2KB in
size, which represents 102d (66h) Flow Control units with 20 bytes/header. Figure 7-5 on page
295 illustrates the elements involved with the values that would be in each counter and register
following flow control initialization.

Figure 7-5. Flow Control Elements Following Initialization


The transmitter must check Flow Control credit prior to sending a transaction. In the case of
headers the number of Flow Control units required is always one. The transmitter takes the
following steps to determine if the transaction can be sent. For simplicity, this example ignores
the possibility of data being included in the transaction.

The credit check is made using unsigned arithmetic (2's complement) in order to satisfy the
following formula:

(CL - (CC + PTLP)) mod 2^[FieldSize] <= 2^[FieldSize]/2

Substituting values from Figure 7-5 yields:

(66h - (00h + 01h)) mod 2^8 <= 2^8/2

(66h - 01h) mod 256 <= 80h

1. The current CREDITS_CONSUMED count (CC) is added to the PTLP credits
required, to determine the CUMULATIVE_CREDITS_REQUIRED (CR), or 00h + 01h
= 01h. Sufficient credits exist if this value is equal to or less than the credit limit.

2. The CUMULATIVE_CREDITS_REQUIRED count is subtracted from the CREDIT_LIMIT
count (CL) to determine if sufficient credits are available. The following description incorporates
a brief review of 2's complement subtraction. When performing subtraction using 2's
complement, the number to be subtracted is complemented (1's complement) and 1 is added
(2's complement). This value is then added to the number being subtracted from. Any carry due
to the addition is simply ignored.
The numbers to subtract are:

CL 01100110b (66h) - CR 00000001b (01h) = n

Number to be subtracted is converted to 2's complement:

00000001b -> 11111110b (1's complement)

11111110b + 1 = 11111111b (1's complement + 1 = 2's complement)

2's complement is added (the carry out of the addition is discarded):

  01100110b
+ 11111111b
  01100101b = 65h

Is result <= 80h?

Yes, 65h <= 80h (send transaction)

The result of the subtraction must be equal to or less than 1/2 the maximum value that can be
tracked with a modulo 256 counter (128). This approach is taken to ensure unique results from
the unsigned arithmetic. For example, unsigned 2's-complement subtraction yields the same
results for both 0 - 128 and 255 - 127, as shown below.

00h (0) - 80h (128) = 80h (-128)

  00000000b - 10000000b = n
  00000000b + (01111111b + 1b)  (add 2's complement)
  00000000b + 10000000b = 10000000b (80h)

FFh (255) - 7Fh (127) = 80h (+128)

  11111111b - 01111111b = n
  11111111b + (10000000b + 1b)  (add 2's complement)
  11111111b + 10000001b = 10000000b (80h, carry discarded)

To ensure that conflicts such as the one above do not occur, the maximum number of unused
credits that can be reported is limited to 2^8/2 (128) credits for headers and 2^12/2 (2048)
credits for data. This means that the CREDITS_ALLOCATED count must never exceed the
CREDITS_CONSUMED count by more than 128 for headers and 2048 for data. This ensures
that any result equal to or less than 1/2 the maximum register count is a positive number and
represents credits available, and results greater than 1/2 the maximum count are negative
numbers that indicate credits are not available.
3. The CREDITS_CONSUMED count increments by one when the transaction is forwarded to
the link layer.

4. When the transaction arrives at the receiver, the transaction header is placed into the Flow
Control buffer and the CREDITS_RECEIVED counter (optional) increments by one. Note that
CREDIT_ALLOCATED does not change.

Figure 7-6 on page 297 illustrates the Flow Control elements following transfer of the first
transaction.

Figure 7-6. Flow Control Elements Following Delivery of First Transaction

Stage 2 - Flow Control Buffer Fills Up

This example presumes that the receiving device has been unable to move transactions from
the Flow Control buffer since initialization. This could occur if the device core was
temporarily busy and unable to process transactions. Consequently, the Flow Control buffer has
completely filled. Figure 7-7 on page 299 illustrates this scenario.

Figure 7-7. Flow Control Elements with Flow Control Buffer Filled

Again the transmitter checks Flow Control credits to determine if the next pending TLP can be
sent. The unsigned arithmetic is performed to subtract the Credits Required from the
CREDIT_LIMIT:

66h (CL) - 67h (CR) <= 80h

01100110b - 01100111b <= 10000000b (if yes, send transaction)

  CL 01100110b (66h)
  CR 10011001b (add 2's complement of 67h)
     11111111b = FFh <= 80h (not true, don't send packet)

Not until the receiver moves one or more transactions from the Flow Control buffer can the
pending transaction be sent. When the first transaction is moved from the Flow Control buffer,
the CREDIT_ALLOCATED count is increased to 67h. When the Update Flow Control packet is
delivered to the transmitter, the new CREDIT_LIMIT will be loaded into the CL register. The
resulting check will pass the test, thereby permitting the packet to be sent.

  CL 01100111b (67h)
  CR 10011001b (add 2's complement of 67h)
     00000000b = 00h <= 80h (send transaction)

Stage 3 - The Credit Limit Count Rolls Over

The receiver's CREDIT_LIMIT (CL) always runs ahead of (or is equal to) the
CREDITS_CONSUMED (CC) count. Each time the transmitter performs a credit check, it adds
the credits required (CR) for a TLP to the current CREDITS_CONSUMED count and subtracts
the result from the current CREDIT_LIMIT to determine if enough credits are available to send
the TLP.

Because both the CL count and the CC count only index up, they are allowed to roll over from
maximum count back to 0. A problem appears to arise when the CL count (which, again, is
running ahead) has rolled over and the CC has not. Figure 7-8 shows the CL and CR counts
before and after CL rollover.

Figure 7-8. Flow Control Rollover Problem

If a simple subtraction is performed in the rollover case, the result is negative. This indicates
that credits are not available. However, because unsigned arithmetic is used, the problem does
not arise. See below:

  CL 00001000b (08h)
  CR 11111000b (F8h) -> 00000111b + 1b = 00001000b (2's complement)

  CL 00001000b (08h)
  CR 00001000b (add 2's complement of F8h)
     00010000b = 10h (10h <= 80h, so credits are available)
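
Run through the same 8-bit check used earlier, the Stage 3 numbers behave correctly despite
the rollover; the small demonstration below (values taken from the example above) prints the
wrapped result.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t cl = 0x08;                    /* CREDIT_LIMIT after rolling over   */
    uint8_t cr = 0xF8;                    /* CC + PTLP, not yet rolled over    */
    uint8_t margin = (uint8_t)(cl - cr);  /* unsigned wrap: 0x08 - 0xF8 = 0x10 */

    printf("margin = %02Xh -> %s\n", margin,
           margin <= 0x80 ? "credits available" : "blocked");
    return 0;
}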

Stage 4 - FC Buffer Overflow Error Check

The specification recommends implementation of the optional FC buffer overflow error checking
mechanism. These optional elements include:

CREDITS_RECEIVED counter
Error Check Logic

These elements permit the receiver to track Flow Control credits in the same manner as the
transmitter. That is, the transmitter's CREDIT_LIMIT count should be the same as the receiver's
CREDITS_ALLOCATED count (after an Update DLLP is sent) and the receiver's
CREDITS_RECEIVED count should be the same as the transmitter's CREDITS_CONSUMED
count. If flow control is working correctly, the following will be true:

the transmitter's CREDITS_CONSUMED count should always be equal to or less than its
CREDIT_LIMIT

the receiver's CREDITS_RECEIVED count (CR) should always be equal to or less than its
CREDITS_ALLOCATED count (CA)

An overflow condition is detected when the following formula is satisfied. Note that the field size
is either 8 (headers) or 12 (data):

(CA - CR) mod 2^[FieldSize] > 2^[FieldSize]/2

If the formula is true, then the result is negative; thus, more credits have been sent to the FC
buffer than were available and an overflow has occurred. Note that the 1.0a version of the
specification defines the equation with >= rather than > as shown above. This appears to be an
error, because when CA = CR no overflow condition exists. For example, consider the case right
after initialization where the receiver advertises that it has 128 credits for the transmitter to use:
CA = 128 and CR = 0 (because nothing has been received yet), so the >= form of the equation
evaluates true, indicating an overflow when all that has actually happened is that the maximum
allowed number of credits has been advertised. If the equation evaluates for only > and not >=,
then everything works as expected.
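
A sketch of this receiver-side check for a header credit type (FieldSize = 8), using the '>'
comparison argued for above:

#include <stdint.h>
#include <stdbool.h>

/* Optional overflow check: (CA - CR) mod 256 > 128 means CREDITS_RECEIVED
 * has run past CREDITS_ALLOCATED, i.e., the transmitter sent more than it
 * was ever given credit for. Right after initialization (CA = 128, CR = 0)
 * the difference is exactly 128, which this form correctly treats as legal. */
bool header_buffer_overflowed(uint8_t credits_allocated, uint8_t credits_received)
{
    uint8_t diff = (uint8_t)(credits_allocated - credits_received);
    return diff > 128u;
}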
Infinite Flow Control Advertisement
PCI Express defines an infinite Flow Control credit value. A device that advertises infinite Flow
Control credits need not send Flow Control Update packets following initialization and the
transmitter will never be blocked from sending transactions. During flow control initialization, a
device advertises "infinite" credits by delivering a zero in the credit field of the FC_INIT1 DLLP.

Who Advertises Infinite Flow Control Credits?

It's interesting to note that the minimum Flow Control credits that must be advertised include
infinite credits for completion transactions in certain situations. See Table 7-1 on page 303.
These requirements involve devices that originate requests for which completions are expected
to be returned (i.e., Endpoints and root ports that do not support peer-to-peer transfers). They
do not apply to devices that merely forward completions (switches and root ports that support
peer-to-peer transfers). This implies a requirement that any device initiating a request must
commit buffer space for the expected completion header and data (if applicable). This
guarantees that no throttling will ever occur when completions cross the final link to the
original requester. A similar rule is required of PCI-X devices that initiate split transactions.
Multiple searches of the specification failed to reveal this requirement explicitly stated for PCI
Express devices; however, it is implied by the requirement to advertise infinite Flow Control
credits.

Note also that infinite flow control credits can only be advertised during initialization. This must
be true because the CA counter in the receiver could roll over to 00h, causing an Update FC
packet to be sent with the credit field set to 00h. If the Link is in the DL_Init state, this means
infinite credits, but if the Link is in the DL_Active state, it does not.

Special Use for Infinite Credit Advertisements.

The specification points out a special consideration for devices that do not need to implement
all the FC buffer types for all VCs. For example, the only Non-Posted writes are I/O Writes and
Configuration Writes, both of which are permitted only on VC0. Thus, Non-Posted data buffers
are not needed for VC1-VC7. Because no Flow Control tracking is needed, a device can
simply advertise infinite Flow Control credits for these buffers during initialization, thereby
eliminating the need to send needless FC_Update packets.

Header and Data Advertisements May Conflict

An infinite Flow Control advertisement might be sent for either the Data or header buffers (with
same FC type) but not both. In this case, Update DLLPs are required for one buffer but not the
other. This simply means that the device requiring credits will send an Update DLLP with the
corresponding field containing the CREDITS_ALLOCATED credit information, and the other
field must be set to zero (consistent with its advertisement).
The Minimum Flow Control Advertisement
The minimum number of credits that can be reported for the different Flow Control buffer types
is listed in Table 7-1 on page 303.

Table 7-1. Required Minimum Flow Control Advertisements

Posted Request Header (PH): 1 unit. Credit Value = one 4DW HDR + Digest = 5DW.

Posted Request Data (PD): Largest possible setting of Max_Payload_Size for the component,
divided by FC Unit Size (4DW). Example: If the largest Max_Payload_Size value supported is
1024 bytes, the smallest permitted initial credit value would be 040h.

Non-Posted Request HDR (NPH): 1 unit. Credit Value = one 4DW HDR + Digest = 5DW.

Non-Posted Request Data (NPD): 1 unit. Credit Value = 4DW.

Completion HDR (CPLH): 1 unit. Credit Value = one 3DW HDR + Digest = 4DW; for Root
Complex with peer-to-peer support and Switches. Infinite units (Initial Credit Value = all 0's);
for Root Complex with no peer-to-peer support and Endpoints.

Completion Data (CPLD): n units. Value of the largest possible setting of Max_Payload_Size or
the size of the largest Read Request (whichever is smaller), divided by FC Unit Size (4DW); for
Root Complex with peer-to-peer support and Switches. Infinite units (Initial Credit Value = all
0's); for Root Complex with no peer-to-peer support and Endpoints.
Flow Control Initialization
Prior to sending any transactions, flow control initialization must be performed. Initialization
occurs for each link in the system and involves a handshake between the devices attached to
the same link. TLPs associated with the virtual channel being initialized cannot be forwarded
across the link until Flow Control Initialization is performed successfully.

Once initiated, the flow control initialization procedure is fundamentally the same for all Virtual
Channels. The small differences that exist are discussed later. Initialization of VC0 (default VC)
must be done in hardware so that configuration transactions can traverse the PCI Express
fabric. Other VCs initialize once configuration software has set up and enabled the VCs at both
ends of the link. Enabling a VC triggers hardware to perform flow control initialization for this
VC.

Figure 7-9 pictures the Flow Control counters within the devices at both ends of the link, along
with the state of flag bits used during initialization.

Figure 7-9. Initial State of Example FC Elements

The FC Initialization Sequence

PCI Express defines two stages in flow control initialization: FC_INIT1 and FC_INIT2. Each
stage of course involves the use of the Flow Control packets (FCPs).

Flow Control Init1 - reports the size of the Flow Control buffers for a given virtual channel

Flow Control Init2 - verifies that the device transmitting the Init2 packet has completed the
flow control initialization for the specified VC and buffer type.

FC Init1 Packets Advertise Flow Control Credits Available

During the FC_INIT1 state, a device continuously outputs a sequence of 3 InitFC1 Flow Control
packets advertising its posted, non-posted, and completion receiver buffer sizes. (See Figure
7-10.) Each device also waits to receive a similar sequence from its neighbor. Once a device
has received the complete sequence and sent its own, it initializes its transmit counters, sets an
internal flag (FI1), and exits FC_INIT1. This process is illustrated in Figure 7-11 on page 306 and
described below. The example shows Device A reporting Non-Posted Buffer Credits and Device
B reporting Posted Buffer Credits. This illustrates that the devices need not be in
synchronization regarding what they are reporting. In fact, the two devices will typically not start
the flow control initialization process at the same time.

1. Each device sends InitFC1 type Flow Control packets (FCPs) to advertise the size of
its respective receive buffers. A separate FCP for posted requests (P), non-posted
requests (NP) and completion (CPL) packet types is required. The order in which
this sequence of three FCPs is sent is:

Header and Data buffer credit units for Posted Requests (P).

Header and Data buffer credit units for Non-Posted Requests (NP)

Header and Data buffer credit units for Completions (CPL)

The sequence of FCPs is repeated continuously until a device leaves the FC_INIT1
initialization state.

2. In the meantime, each device takes the credit information it receives and initializes its transmit
Credit Limit registers. In this example, Device A loads its PH transmit Credit Limit register with a
value of 4, which was reported by Device B for its posted request header FC buffer. It also loads
its PD Credit Limit register with a value of 64d credits (1024 bytes worth of data) for accompanying
posted data. Similarly, Device B loads its NPH transmit Credit Limit counter with a value of 2 for
non-posted request headers and its NPD transmit counter with a value of 32d credits (512
bytes worth of data) for accompanying non-posted data.

Note that when this process is complete, the Credits Allocated counters in the receivers and
the corresponding Credit Limit counters in the transmitters will be equal.

3. Once a device receives Init1 FC values for a given buffer type (e.g., Posted) and has
recorded them, the FC_INIT1 state is complete for that Flow Control buffer. Once all FC
buffers for a given VC have completed the FC_INIT1 state, Flag 1 (FI1) is set and the device
ceases to send InitFC1 DLLPs and advances to the FC_INIT2 state. Note that receipt of an
InitFC2 packet may also cause FI1 to be set. This can occur if the neighboring device has
already advanced to the FC_INIT2 state.
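
A rough model of the InitFC1 advertisement sequence is sketched below. The DLLP structure,
field names, and the send_dllp() hook are assumptions for illustration; only the P, NP, CPL
ordering and the header/data credit pair per packet come from the description above (a credit
value of zero advertises infinite credits).

#include <stdint.h>

typedef enum { FC_P = 0, FC_NP = 1, FC_CPL = 2 } fc_type_t;

typedef struct {
    fc_type_t type;          /* P, NP, or CPL                            */
    uint8_t   vc_id;         /* virtual channel being initialized        */
    uint8_t   hdr_credits;   /* 8-bit header credit field, 0 = infinite  */
    uint16_t  data_credits;  /* 12-bit data credit field, 0 = infinite   */
} initfc1_dllp_t;

extern void send_dllp(const initfc1_dllp_t *dllp);   /* hypothetical link layer hook */

/* Send one pass of the repeating InitFC1 sequence: P, then NP, then CPL.
 * A device keeps repeating this sequence until it leaves FC_INIT1.       */
void send_initfc1_sequence(uint8_t vc_id,
                           const uint8_t hdr_credits[3],
                           const uint16_t data_credits[3])
{
    for (int t = FC_P; t <= FC_CPL; t++) {
        initfc1_dllp_t d = { (fc_type_t)t, vc_id,
                             hdr_credits[t], data_credits[t] };
        send_dllp(&d);
    }
}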

Figure 7-10. INIT1 Flow Control Packet Format and Contents

Figure 7-11. Devices Send and Initialize Flow Control Registers

FC Init2 Packets Confirm Successful FC Initialization

PCI Express defines the InitFC2 state that is used as feedback to verify that Flow Control
initialization has been successful for a given VC. During FC_INIT2, each device continuously
outputs a sequence of 3 InitFC2 Flow Control packets; however, credit values are discarded
during the FC_INIT2 state. Note that devices are permitted to send TLPs upon entering the
FC_INIT2 state. Figure 7-12 illustrates InitFC2 behavior, which is described following the
illustration.
1. At the start of initialization state FC_INIT2, each device commences sending InitFC2
type Flow Control packets (FCPs) to indicate it has completed the FC_INIT1 state.
Devices use the same repetitive sequence when sending FCPs in this state as
before:

Header and Data buffer credit allocation for Posted Requests (P)

Header and Data buffer credit allocation for Non-Posted Requests (NP)

Header and Data buffer credit allocation for Completions (CPL)

2. All credits reported in InitFC2 FCPs may be discarded, as the transmitter Credit Limit
counters were already set up in FC_INIT1.

3. Once a device receives an FC_INIT2 packet for any buffer type, it sets an internal flag (FI2).
(It doesn't wait to receive an FC_Init2 for each type.) Note that FI2 is also set upon receipt of
an UpdateFC packet or TLP.

Figure 7-12. Devices Confirm that Flow Control Initialization is Completed for a Given Buffer

Rate of FC_INIT1 and FC_INIT2 Transmission

The specification defines the latency between sending FC_INIT DLLPs as follows:

VC0. Hardware initiated flow control of VC0 requires that FC_INIT1 and FC_INIT2 packets
be transmitted "continuously at the maximum rate possible." That is, the resend timer is set
to a value of zero.

VC1-VC7. When software initiates flow control initialization, the FC_INIT sequence is
repeated "when no other TLPs or DLLPs are available for transmission." However, the
latency from the beginning of one sequence to the beginning of the next can be no greater
than 17µs.

Violations of the Flow Control Initialization Protocol

A violation of the flow control initialization protocol can optionally be checked by a device. A
detected error can be reported as a Data Link Layer protocol error. See "Link Flow Control-
Related Errors" on page 363.
Flow Control Updates Following FC_INIT
The receiver must continually update its neighboring device to report additional Flow Control
credits that have accumulated as a result of moving transactions from the Flow Control buffer.
Figure 7-13 on page 309 illustrates an example where the transmitter was previously blocked
from sending header transactions because the Flow Control buffer was full. In the example, the
receiver has just removed three headers from the Flow Control buffer. More space is now
available, but the neighboring device has no knowledge of this. As each header is removed
from the Flow Control buffer, the CREDITS_ALLOCATED count increments. The new count is
delivered to the CREDIT_LIMIT register of the neighboring device via an update Flow Control
packet. The updated credit limit allows transmission of additional transactions.

Figure 7-13. Flow Control Update Example

FC_Update DLLP Format and Content

Recall that update Flow Control packets, like the Flow Control initialization packets, contain two
update fields: one for header credits and one for data credits for the selected credit type (P,
NP, or Cpl). Figure 7-14 on page 310 depicts the content of the update packet. The receiver's
CREDITS_ALLOCATED counts that are reported in the HdrFC and DataFC fields may have
been updated many times, or not at all, since the last update packet was sent.

Figure 7-14. Update Flow Control Packet Format and Contents


Flow Control Update Frequency

The specification defines a variety of rules and suggested implementations that govern when
and how often Flow Control Update DLLPs should be sent. The motivation includes:

Notifying the transmitting device as early as possible about new credits allocated, which
allows previously blocked transactions to continue.

Establishing worst-case latency between FC Packets.

Balancing the requirements and variables associated with flow control operation. This
involves:

the need to report credits available often enough to prevent transaction blocking

the desire to reduce the link bandwidth required to send FC_Update DLLPs

selecting the optimum buffer size

the maximum data payload size

Detecting violation of the maximum latency between Flow Control packets.

The update frequency limits specified assume that the link is in the active state (L0 or L0s).
All other link states represent more aggressive power management with longer recovery
latencies, requiring link recovery prior to sending packets.

Immediate Notification of Credits Allocated


When a Flow Control buffer has filled to the extent that maximum-sized packets cannot be sent,
the specification requires immediate delivery of an FC_Update DLLP once the deficit is
eliminated. Specifically, when additional credits are allocated by a receiver such that sufficient
space now exists to accept another maximum-sized packet, an Update packet must be sent (a
sketch of this check follows the two cases below). Two cases exist:

Maximum Packet Size = 1 Credit. When packet transmission is blocked due to a buffer
full condition for non-infinite NPH, NPD, PH, and CPLH buffer types, an UpdateFC packet
must be scheduled for transmission when one or more credits are made available
(allocated) for that buffer type.

Maximum Packet Size = Max_Payload_Size. Flow Control buffer space may decrease
to the extent that a maximum-sized packet cannot be sent for non-infinite PD and CPLD
credit types. In this case, when one or more additional credits are allocated, an Update
FCP must be scheduled for transmission.
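
In both cases the decision reduces to the question: was the transmitter starved for this credit type, and do the newly allocated credits clear the starvation? The routine below is a hedged sketch of that check for a single non-infinite credit type; max_packet_credits would be 1 for the header cases and the Max_Payload_Size expressed in data credits for the data cases, and all names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-credit-type receiver state relevant to the immediate-update rule. */
    struct fc_credit_pool {
        bool     infinite;            /* advertised infinite credits: rule does not apply */
        uint32_t credits_available;   /* credits currently allocated but not yet consumed */
    };

    /* Returns true if an UpdateFC DLLP must be scheduled right away:
     * space had dropped below one maximum-sized packet, and the credits just
     * allocated bring it back to (at least) that threshold. */
    static bool must_schedule_updatefc(const struct fc_credit_pool *pool,
                                       uint32_t credits_just_allocated,
                                       uint32_t max_packet_credits)
    {
        if (pool->infinite)
            return false;

        uint32_t before = pool->credits_available;
        uint32_t after  = before + credits_just_allocated;

        return (before < max_packet_credits) && (after >= max_packet_credits);
    }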

Maximum Latency Between Update Flow Control DLLPs

Update FCPs for each non-infinite FC credit type must be scheduled for transmission at least
once every 30 µs (-0%/+50%). If the Extended Sync bit within the Link Control register is set,
Updates must be scheduled no later than every 120 µs (-0%/+50%). Note that Update FCPs
may be scheduled for transmission more frequently than is required.

Calculating Update Frequency Based on Payload Size and Link Width

The specification offers a formula for calculating the interval at which update packets need to
be sent for a given maximum data payload size and link width. The formula, shown below,
defines the FC Update delivery interval in symbol times (4 ns each at 2.5 Gb/s):

    UpdateFC Transmission Latency = ((MaxPayloadSize + TLPOverhead) * UpdateFactor) / LinkWidth + InternalDelay

where:

MaxPayloadSize = The value in the Max_Payload_Size field of the Device Control register

TLPOverhead = the constant value (28 symbols) representing the additional TLP
components that consume Link bandwidth (header, LCRC, framing Symbols)

UpdateFactor = the number of maximum size TLPs sent during the interval between
UpdateFC Packets received. This number balances link bandwidth efficiency and receive
buffer sizes; the value varies with Max_Payload_Size and Link width

LinkWidth = The operating width of the Link negotiated during initialization

InternalDelay = a constant value of 19 symbol times that represents the internal
processing delays for received TLPs and transmitted DLLPs

The simple relationship defined by the formula shows that, for a given data payload and buffer
size, the frequency of update packet delivery becomes higher as the link width increases. This
relatively simple approach suggests a timer implementation that triggers scheduling of update
packets. Note that this formula does not account for delays associated with the receiver or
transmitter being in the L0s power management state.
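
As a rough illustration of how the terms combine, the following program evaluates the formula for a few configurations. The UpdateFactor values used here are placeholders (the specification tabulates the factor by payload size and link width), and the 4 ns symbol time assumes a 2.5 Gb/s link.

    #include <stdio.h>

    /* Evaluate the UpdateFC interval formula (in symbol times) for one configuration.
     * TLPOverhead (28 symbols) and InternalDelay (19 symbol times) are the constants
     * quoted in the text; update_factor is taken as an input here because the
     * specification tabulates it by payload size and link width. */
    static double updatefc_interval_symbols(unsigned max_payload_bytes,
                                            double update_factor,
                                            unsigned link_width)
    {
        const double tlp_overhead   = 28.0;
        const double internal_delay = 19.0;

        return ((max_payload_bytes + tlp_overhead) * update_factor) / link_width
               + internal_delay;
    }

    int main(void)
    {
        const double symbol_time_ns = 4.0;   /* Gen1: 10 bits per symbol at 2.5 Gb/s */

        /* Illustrative configurations only; the update factors are placeholders. */
        struct { unsigned payload, width; double uf; } cfg[] = {
            { 128, 1, 1.0 }, { 128, 8, 1.0 }, { 512, 4, 1.0 },
        };

        for (unsigned i = 0; i < 3; i++) {
            double symbols = updatefc_interval_symbols(cfg[i].payload, cfg[i].uf, cfg[i].width);
            printf("payload=%uB x%-2u -> %.1f symbol times (%.1f ns)\n",
                   cfg[i].payload, cfg[i].width, symbols, symbols * symbol_time_ns);
        }
        return 0;
    }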

The specification recognizes that the formula will be inadequate for many applications such as
those that stream large blocks of data. These applications may require buffer sizes larger than
the minimum specified, as well as a more sophisticated update policy in order to optimize
performance and reduce power consumption. Because a given solution is dependent on the
particular requirements of an application, no definition for such policies is provided.

Error Detection Timer - A Pseudo Requirement

The specification defines an optional time-out mechanism that is highly recommended; so much
so that the specification points out it is expected to become a requirement in future
versions of the spec. This mechanism detects prolonged absences of Flow Control packets.
The maximum latency between FC packets for a given Flow Control credit type is specified to
be no greater than 120µs. This error detection timer has a maximum limit of 200µs, and it is
reset any time a Flow Control packet of any type is received. If a time-out occurs, it
suggests a serious problem with a device's ability to report Flow Control credits. Consequently,
a time-out triggers the Physical Layer to enter its Recovery state, which retrains the link and
hopefully clears the error condition. Characteristics of this timer include the following (a
behavioral sketch is given after the list):

operational only when the link is in its active state (L0 or L0s)

maximum count limited to 200 µs (-0%/+50%)

timer is reset when any Init or Update FCP is received, or optionally the timer may be
reset by the receipt of any type of DLLP

when the timer expires, the Physical Layer enters the Link Training and Status State Machine
(LTSSM) Recovery state
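
A behavioral model of this optional timer is straightforward. In the sketch below, fc_timer_tick would be driven by the device's time base; the 200 µs limit and the reset conditions come from the characteristics listed above, while the structure, function names, and the reset_on_any_dllp option flag are purely illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define FC_TIMEOUT_US  200u   /* maximum limit: 200 us (-0%/+50%) */

    struct fc_error_timer {
        bool     link_active;         /* timer runs only in L0 / L0s         */
        bool     reset_on_any_dllp;   /* optional: any DLLP resets the timer */
        uint32_t elapsed_us;
    };

    /* Called when a DLLP is received; Init/Update FCPs always reset the timer. */
    static void fc_timer_on_dllp(struct fc_error_timer *t, bool is_flow_control_dllp)
    {
        if (is_flow_control_dllp || t->reset_on_any_dllp)
            t->elapsed_us = 0;
    }

    /* Called periodically; returns true when the Physical Layer should enter the
     * LTSSM Recovery state to retrain the link. */
    static bool fc_timer_tick(struct fc_error_timer *t, uint32_t delta_us)
    {
        if (!t->link_active)
            return false;
        t->elapsed_us += delta_us;
        if (t->elapsed_us >= FC_TIMEOUT_US) {
            t->elapsed_us = 0;
            return true;   /* time-out: trigger link Recovery */
        }
        return false;
    }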
Chapter 8. Transaction Ordering
The Previous Chapter

This Chapter

The Next Chapter

Introduction

Producer/Consumer Model

Native PCI Express Ordering Rules

Relaxed Ordering

Modified Ordering Rules Improve Performance

Support for PCI Buses and Deadlock Avoidance


The Previous Chapter
The previous chapter discussed the purposes and detailed operation of the Flow Control
Protocol. This protocol requires each device to implement credit-based link flow control for
each virtual channel on each port. Flow control guarantees that transmitters will never send
Transaction Layer Packets (TLPs) that the receiver can't accept. This prevents receive buffer
over-runs and eliminates the need for inefficient disconnects, retries, and wait-states on the
link. Flow Control also helps enable compliance with PCI Express ordering rules by maintaining
separate Virtual Channel Flow Control buffers for three types of transactions: Posted (P), Non-
Posted (NP) and Completions (Cpl).
This Chapter
This chapter discusses the ordering requirements for PCI Express devices as well as PCI and
PCI-X devices that may be attached to a PCI Express fabric. The discussion describes the
Producer/Consumer programming model upon which the fundamental ordering rules are based.
It also describes the potential performance problems that can emerge when strong ordering is
employed and specifies the rules defined for deadlock avoidance.
The Next Chapter
Native PCI Express devices that require interrupt support must use the Message Signaled
Interrupt (MSI) mechanism defined originally in the PCI 2.2 specification. The next chapter
details the MSI mechanism and also describes the legacy support that permits virtualization of
the PCI INTx signals required by devices such as PCI Express-to-PCI Bridges.
Introduction
As with other protocols, PCI Express imposes ordering rules on transactions moving through
the fabric at the same time. The reasons for the ordering rules include:

Ensuring that the completion of transactions is deterministic and in the sequence intended
by the programmer.

Avoiding deadlock conditions.

Maintaining compatibility with ordering already used on legacy buses (e.g., PCI, PCI-X,
and AGP).

Maximizing performance and throughput by minimizing read latencies and managing
read/write ordering.

PCI Express ordering is based on the same Producer/Consumer model as PCI. The split
transaction protocol and related ordering rules are fairly straightforward when restricting the
discussion to transactions involving only native PCI Express devices. However, ordering
becomes more complex when including support for the legacy buses mentioned in bullet three
above.

Rather than presenting the ordering rules defined by the specification and attempting to explain
the rationale for each rule, this chapter takes the building block approach. Each major ordering
concern is introduced one at a time. The discussion begins with the most conservative (and
safest) approach to ordering, progresses to a more aggressive approach (to improve
performance), and culminates with the ordering rules presented in the specification. The
discussion is segmented into the following sections:

1. The Producer/Consumer programming model upon which the fundamental ordering
rules are based.

2. The fundamental PCI Express device ordering requirements that ensure the
Producer/Consumer model functions correctly.

3. The Relaxed Ordering feature that permits violation of the Producer/Consumer ordering
when the device issuing a request knows that the transaction is not part of a
Producer/Consumer programming sequence.

4. Modification of the strong ordering rules to improve performance.

5. Avoiding deadlock conditions and support for PCI legacy implementations.
Producer/Consumer Model
Readers familiar with the Producer/Consumer programming model may choose to skip this
section and proceed directly to "Native PCI Express Ordering Rules" on page 318.

The Producer/Consumer model is a common methodology that two requester-capable devices
might use to communicate with each other. Consider the following example scenario:

1. A network adapter begins to receive a stream of compressed video data over the
network and performs a series of memory write transactions to deliver the stream
of compressed video data into a Data buffer in memory (in other words, the network
adapter is the Producer of the data).

2. After the Producer moves the data to memory, it performs a memory write transaction to
set an indicator (or Flag) in a memory location (or a register) to indicate that the data is ready
for processing.

3. Another requester (referred to as the Consumer) periodically performs a memory read from
the Flag location to see if there's any data to be processed. In this example, this requester is a
video decompressor that will decompress and display the data.

4. When it sees that the Flag has been set by the Producer, it performs a memory write to
clear the Flag, followed by a burst memory read transaction to read the compressed data (it
consumes the data; hence the name Consumer) from the Data buffer in memory.

5. When it is done consuming the Data, the Consumer writes the completion status into the
Status location. It then resumes periodically reading the Flag location to determine when more
data needs to be processed.

6. In the meantime, the Producer has been reading periodically from the Status location to
see if data processing has been completed by the other requester (the Consumer). This
location typically contains zero until the other requester completes the data processing and
writes the completion status into it. When the Producer reads the Status and sees that the
Consumer has completed processing the Data, the Producer then performs a memory write
to clear the Status location.

7. The process then repeats whenever the Producer has more data to be processed.

Ordering rules are required to ensure that the Producer/Consumer model works correctly no
matter where the Producer, the Consumer, the Data buffer, the Flag location, and the Status
location are located in the system (in other words, no matter how they are distributed on
various links in the system).
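
The same handshake can be written down as a simple software analogy. The sketch below assumes ordinary shared memory, a single Producer, and a single Consumer; the names data_buf, flag, and status correspond to the Data buffer, Flag, and Status locations in the scenario above, and the busy-wait loops stand in for whatever polling the real requesters perform. Its only purpose is to show why the writes to Data, Flag, and Status must be observed in program order (real shared-memory code would also need explicit memory barriers or atomics).

    #include <stdint.h>
    #include <string.h>

    #define BUF_SIZE 4096

    /* Shared locations from the scenario above. */
    static uint8_t      data_buf[BUF_SIZE];
    static volatile int flag;         /* set by the Producer when data_buf is ready   */
    static volatile int status = 1;   /* set by the Consumer when processing is done
                                         (starts at 1: the buffer is initially free)  */

    /* Producer: write the Data buffer first, then set the Flag.  If the Flag write
     * were allowed to pass the Data writes, the Consumer could read stale data. */
    static void producer_deliver(const uint8_t *src, size_t len)
    {
        while (status == 0)            /* previous block still being consumed? */
            ;                          /* poll the Status location */
        status = 0;                    /* clear Status for the next round */
        memcpy(data_buf, src, len);    /* 1: write the Data buffer */
        flag = 1;                      /* 2: then set the Flag */
    }

    /* Consumer: wait for the Flag, clear it, read the Data, then report Status. */
    static void consumer_process(uint8_t *dst, size_t len)
    {
        while (flag == 0)              /* poll the Flag location */
            ;
        flag = 0;                      /* clear the Flag */
        memcpy(dst, data_buf, len);    /* burst read of the Data buffer */
        status = 1;                    /* write the completion Status */
    }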
Native PCI Express Ordering Rules
PCI Express transaction ordering for native devices can be summarized with four simple rules:

1. PCI Express requires strong ordering of transactions (i.e., performing transactions
in the order issued by software) flowing through the fabric that have the same TC
assignment (see item 4 for the exception to this rule). Because all transactions that
have the same TC value assigned to them are mapped to a given VC, the same rules
apply to transactions within each VC.

2. No ordering relationship exists between transactions with different TC assignments.

3. The ordering rules apply in the same way to all types of transactions: memory, IO,
configuration, and messages.

4. Under limited circumstances, transactions with the Relaxed Ordering attribute bit set can be
ordered ahead of other transactions with the same TC.

These fundamental rules ensure that transactions always complete in the order intended by
software. However, these rules are extremely conservative and do not necessarily result in
optimum performance. For example, when transactions from many devices merge within
switches, there may be no ordering relationship between transactions from these different
devices. In such cases, more aggressive rules can be applied to improve performance as
discussed in "Modified Ordering Rules Improve Performance" on page 322.

Producer/Consumer Model with Native Devices

Because the Producer/Consumer model depends on strong ordering, native PCI Express
devices support this model without additional ordering rules when the following conditions
are met:

1. All elements associated with the Producer/Consumer model reside within native PCI
Express devices.

2. All transactions associated with the operation of the Producer/Consumer model traverse
only PCI Express links within the same fabric.

3. All associated transactions have the same TC values. If different TC values are used, then
the strong ordering relationship between the transactions is no longer guaranteed.

4. The Relaxed Ordering (RO) attribute bit of the transactions must be cleared to avoid
reordering the transactions that are part of the Producer/Consumer transaction series.

When PCI legacy devices reside within a PCI Express system, the ordering rules become more
involved: additional rules apply because of PCI's delayed transaction protocol. Without these
rules, the delayed transaction protocol could permit Producer/Consumer transactions to
complete out of order and cause the programming model to break.
Relaxed Ordering
PCI Express supports the Relaxed Ordering mechanism introduced by PCI-X; however, PCI
Express introduces some changes (discussed later in this chapter). The concept of Relaxed
Ordering in the PCI Express environment allows switches in the path between the Requester
and Completer to reorder some transactions just received before others that were previously
enqueued.

The ordering rules that exist to support the Producer/Consumer model may result in
transactions being blocked, when in fact the blocked transactions are completely unrelated to
any Producer/Consumer transaction sequence. Consequently, in certain circumstances, a
transaction with its Relaxed Ordering (RO) attribute bit set can be re-ordered ahead of other
transactions.

The Relaxed Ordering bit may be set by the device if its device driver has enabled it to do so
(by setting the Enable Relaxed Ordering bit in the Device Control register; see Table 24-3 on
page 906). Relaxed ordering gives switches and the Root Complex permission to move such a
transaction ahead of others, where doing so is normally prohibited.
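
As an example, a driver-level helper that grants this permission might look like the fragment below. It assumes hypothetical pcie_cap_read16/pcie_cap_write16 accessors for the device's PCI Express Capability registers; the Device Control register offset (08h within the capability structure) and the Enable Relaxed Ordering bit position (bit 4) match the register referred to above, but the accessor functions are placeholders, not a real driver API.

    #include <stdint.h>

    /* Placeholder accessors for the device's PCI Express Capability registers;
     * a real driver would use its platform's configuration-space API instead. */
    uint16_t pcie_cap_read16(void *dev, unsigned offset);
    void     pcie_cap_write16(void *dev, unsigned offset, uint16_t value);

    #define PCIE_CAP_DEVCTL_OFFSET   0x08u     /* Device Control register       */
    #define DEVCTL_ENABLE_RO         (1u << 4) /* Enable Relaxed Ordering bit   */

    /* Allow (or forbid) the device to set the RO attribute in its requests. */
    static void set_relaxed_ordering(void *dev, int enable)
    {
        uint16_t devctl = pcie_cap_read16(dev, PCIE_CAP_DEVCTL_OFFSET);

        if (enable)
            devctl |= DEVCTL_ENABLE_RO;
        else
            devctl &= (uint16_t)~DEVCTL_ENABLE_RO;

        pcie_cap_write16(dev, PCIE_CAP_DEVCTL_OFFSET, devctl);
    }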

RO Effects on Memory Writes and Messages

PCI Express Switches and the Root Complex are affected by memory write and message
transactions that have their RO bit set. Memory write and Message transactions are treated
the same in most respects: both are handled as posted operations, both are received into the
same Posted buffer, and both are subject to the same ordering requirements. When the RO bit
is set, switches handle these transactions as follows:

Switches are permitted to reorder memory write transactions just posted ahead of
previously posted memory write transactions or message transactions. Similarly, message
transactions just posted may be ordered ahead of previously posted memory write or
message transactions. Switches must also forward the RO bit unmodified. The ability to
reorder these transactions within switches is not supported by PCI-X bridges. In PCI-X, all
posted writes must be forwarded in the exact order received. Another difference between
the PCI-X and PCI Express implementations is that message transactions are not defined
for PCI-X.

The Root Complex is permitted to order a just-posted write transaction ahead of another
write transaction that was received earlier in time. Also, when receiving write requests
(with RO set), the Root Complex is required to write the data payload to the specified
address location within system memory, but is permitted to write each byte to memory in
any address order.
RO Effects on Memory Read Transactions

All read transactions in PCI Express are handled as split transactions. When a device issues a
memory read request with the RO bit set, the request may traverse one or more switches on
its journey to the Completer. The Completer returns the requested read data in a series of one
or more split completion transactions, and uses the same RO setting as in the request. Switch
and Completer behavior for this sequence is as follows:

1. A switch that receives a memory read request with the RO bit set must forward the
request in the order received, and must not reorder it ahead of memory write
transactions that were previously posted. This action guarantees that all write
transactions moving in the direction of the read request are pushed ahead of the
read. Such actions are not necessarily part of the Producer/Consumer programming
sequence, but software may depend on this flushing action taking place. Also, the
RO bit must not be modified by the switch.

2. When the Completer receives the memory read request, it fetches the requested read data
and delivers a series of one or more memory read Completion transactions with the RO bit set
(because it was set in the request).

3. A switch receiving the memory read Completion(s) detects the RO bit set and knows that it
is allowed to order the read Completion(s) ahead of previously posted memory writes moving in
the direction of the Completion. If a memory write transaction were blocked (due to flow
control), the memory read Completion would also be blocked if the RO bit were not set.
Relaxed ordering in this case improves read performance.

Table 8-1 summarizes the relaxed ordering behavior allowed by switches.

Table 8-1. Transactions That Can Be Reordered Due to Relaxed Ordering

    These Transactions (with RO=1)    Can Pass These Transactions
    ------------------------------    ---------------------------
    Memory Write Request              Memory Write Request
    Message Request                   Memory Write Request
    Memory Write Request              Message Request
    Message Request                   Message Request
    Read Completion                   Memory Write Request
    Read Completion                   Message Request
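
Expressed as a predicate that a switch might apply per egress queue, Table 8-1 reduces to the check sketched below. The enumeration and function names are illustrative; the rule itself is simply that a memory write, message, or read completion carrying RO=1 may pass previously enqueued posted write or message requests.

    #include <stdbool.h>

    enum tlp_kind {
        TLP_MEM_WRITE_REQ,     /* posted memory write request   */
        TLP_MESSAGE_REQ,       /* posted message request        */
        TLP_READ_COMPLETION,   /* completion for a memory read  */
        TLP_OTHER
    };

    /* May 'newer' (just received, with RO set as indicated) be ordered ahead of
     * 'older' (previously enqueued)?  This mirrors the rows of Table 8-1. */
    static bool ro_may_pass(enum tlp_kind newer, bool newer_ro_set, enum tlp_kind older)
    {
        if (!newer_ro_set)
            return false;   /* strong ordering applies when RO is clear */

        bool newer_eligible = (newer == TLP_MEM_WRITE_REQ ||
                               newer == TLP_MESSAGE_REQ   ||
                               newer == TLP_READ_COMPLETION);
        bool older_is_posted = (older == TLP_MEM_WRITE_REQ ||
                                older == TLP_MESSAGE_REQ);

        return newer_eligible && older_is_posted;
    }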


Summary of Strong Ordering Rules

The PCI Express specification defines strong ordering rules associated with transactions that
are assigned the same TC value, and further defines a Relaxed Ordering attribute that can be
used when a device knows that a transaction has no ordering relationship to other transactions
with the same TC value. Table 8-2 on page 322 summarizes the PCI Express ordering rules
that satisfy the Producer/Consumer model and also provides for Relaxed Ordering. The table
represents a draconian approach to ordering and does not consider issues of performance,
preventing deadlocks, etc.

The table applies to transactions with the same TC assignment that are moving in the same
direction. These rules ensure that transactions will complete in the intended program order and
eliminate the possibility of deadlocks in a pure PCI Express implementation (i.e., systems with
no PCI Bridges). Columns 2-6 represent transactions that have previously been latched by a PCI
Express device, while column 1 represents subsequently latched transactions. The ordering
relationship of the transaction in column 1 to other transactions previously enqueued is
expressed in the table on a row-by-row basis. Note that these rules apply uniformly to all
transaction types (Memory, Messages, IO, and Configuration). The table entries are defined as
follows:

No: The transaction in column 1 must not be permitted to proceed ahead of the previously
enqueued transaction in the corresponding columns (2-6).

Y/N (Yes/No): The transaction in column 1 is allowed to proceed ahead of the previously
enqueued transaction because its Relaxed Ordering bit is set (1), but it is not required to do so.

Table 8-2. Fundamental Ordering Rules Based on Strong Ordering and RO Attribute

Note that the shaded area represents the ordering requirements that ensure the
Producer/Consumer model functions correctly and is consistent with the basic rules associated
with strong ordering. The transaction ordering associated with columns 3-6 plays no role in the
Producer/Consumer model.
Modified Ordering Rules Improve Performance
This section describes how temporary transaction blocking can occur when the strong ordering
rules listed in Table 8-2 are rigorously enforced. Modification of strong ordering between
transactions that do not violate the Producer/Consumer programming model can eliminate many
blocking conditions and improve link efficiency.

Strong Ordering Can Result in Transaction Blocking

Maintaining the strong ordering relationship between transactions would likely result in instances
where all transactions would be blocked due to a single receive buffer being full. The strong
ordering requirements to support the Producer/Consumer model cannot be modified (except in
the case of relaxed ordering described previously). However, transaction sequences that do not
occur within the Producer/Consumer programming model can be modified to a weakly ordered
scheme that can lead to improved performance.

The Problem

Consider the following example, illustrated in Figure 8-1 on page 323, in which strong ordering
is maintained for all transaction sequences. This example depicts transmitter and receiver buffers
associated with the delivery of transactions in a single direction (from left to right) for a single
Virtual Channel (VC), and the transmit and receive buffers are organized in the same way. Also,
recall that each of the transaction types (Posted, Non-Posted, and Completions) has
independent flow control within the same VC. The numbers within the transmit buffers show the
order in which these transactions were issued to the transmitter. In addition, the non-posted
receive buffer is currently full. Consider the following sequence.

1. Transaction 1 (a memory read, a non-posted operation) is the next transaction that
must be sent (based on strong ordering). The flow control mechanism detects that
insufficient credits are available, so Transaction 1 cannot be sent.

2. Transaction 2 (a posted memory write) is the next transaction pending. When consulting
Table 8-2 (based on strong ordering), entry A3 specifies that a memory write must not pass a
previously enqueued read transaction.

3. Because all entries in Table 8-2 are "No", all transactions are blocked due to the non-posted
receive buffer being filled.

Figure 8-1. Example of Strongly Ordered Transactions that Results in Temporary Blocking

The Weakly Ordered Solution

As discussed previously, strong ordering is required to support the Producer/Consumer model.


This requirement is satisfied entirely by the shaded area in Table 8-2. The non-shaded area
deals with transaction sequences that do not occ