AMD’s Next Generation Microprocessor Architecture
Fred Weber
October 2001
"Hammer" Goals
Build a next-generation system architecture which serves as the foundation for future processor platforms
Enable a full line of server and workstation products
  Leading-edge x86 (32-bit) performance and compatibility
  Native 64-bit support
  Establish the x86-64 Instruction Set Architecture
  Extensive multiprocessor support
  RAS features
Provide top-to-bottom desktop and mobile processors
Agenda
x86-64™ Technology
"Hammer" Architecture
"Hammer" System Architecture
x86-64™ Technology
Why 64-Bit Computing?
Required for large-memory programs
  Large databases
  Scientific and engineering problems
  Designing CPUs ☺
But,
  Limited demand for applications which require 64 bits
  Most applications can remain 32-bit x86 code, if the processor continues to deliver leading-edge x86 performance
And,
  Software is a huge investment (tool chains, applications, certifications)
  The instruction set is first and foremost a vehicle for compatibility
    Binary compatibility
    Interpreter/JIT support is increasingly important
x86-64 Instruction Set Architecture
x86-64 mode is built on x86
  Similar to the previous extension from 16-bit to 32-bit
  Vast majority of opcodes and features unchanged
  Integer/address register files and datapaths are native 64-bit
  48-bit virtual address space, 40-bit physical address space
Enhancements
  Add 8 new integer registers
  Add PC-relative addressing
  Add full support for an SSE/SSE2-based floating-point Application Binary Interface (ABI), including 16 registers
  Additional registers and data sizes are encoded by reclaiming the one-byte increment/decrement opcodes (0x40-0x4F) for use as a single optional prefix (see the sketch below)
Public specification: www.x86-64.org
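As a rough illustration (not from the original slides), the reclaimed 0x40-0x4F byte range acts as a single optional prefix whose low four bits extend operand size and register fields; a minimal C sketch of how a decoder might interpret it (struct and function names are invented here, but the bit assignments follow the public x86-64 specification):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only: interprets a reclaimed 0x40-0x4F byte as the
 * x86-64 REX prefix. */
typedef struct {
    int w;  /* 1 = 64-bit operand size                          */
    int r;  /* extends the ModRM.reg field (reaches R8-R15)     */
    int x;  /* extends the SIB.index field                      */
    int b;  /* extends the ModRM.rm / SIB.base field            */
} rex_prefix;

int decode_rex(uint8_t byte, rex_prefix *out)
{
    if ((byte & 0xF0) != 0x40)      /* outside 0x40-0x4F: not a REX prefix */
        return 0;
    out->w = (byte >> 3) & 1;
    out->r = (byte >> 2) & 1;
    out->x = (byte >> 1) & 1;
    out->b =  byte       & 1;
    return 1;
}

int main(void)
{
    rex_prefix p;
    if (decode_rex(0x48, &p))       /* REX.W, e.g. a 64-bit register move */
        printf("W=%d R=%d X=%d B=%d\n", p.w, p.r, p.x, p.b);
    return 0;
}
```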
x86-64 Programmer’s Model
[Register-set diagram: the eight legacy 32-bit GPRs (EAX through EDI, with AH/AL sub-registers) are extended to 64 bits and joined by eight new GPRs R8-R15; the eight legacy 128-bit SSE/SSE2 registers XMM0-XMM7 are joined by XMM8-XMM15; and the 32-bit EIP program counter is extended to 64 bits. Shaded regions mark the state added by x86-64.]
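The 48-bit virtual address space from the ISA slide interacts with these 64-bit registers through canonical addressing: bits 63:48 of a virtual address must be copies of bit 47. A small check of that rule (the function name is ours; the rule itself is from the public x86-64 specification):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* x86-64 canonical-address check for a 48-bit virtual address space:
 * bits 63:47 must be all zeros or all ones (sign-extension of bit 47). */
static bool is_canonical(uint64_t va)
{
    uint64_t top = va >> 47;          /* bit 47 plus bits 63:48 */
    return top == 0 || top == 0x1ffff;
}

int main(void)
{
    printf("%d\n", is_canonical(0x00007fffffffffffULL));  /* 1: top of lower half    */
    printf("%d\n", is_canonical(0xffff800000000000ULL));  /* 1: bottom of upper half */
    printf("%d\n", is_canonical(0x0000800000000000ULL));  /* 0: non-canonical hole   */
    return 0;
}
```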
x86-64 Code Generation and Quality
Compiler and tool chain is a straightforward port
Instruction set is designed to offer the advantages of both CISC and RISC
  Code density of CISC
  Register usage and ABI models of RISC
  Enables easy application of standard compiler optimizations
SPECint2000 code generation (compared to 32-bit x86)
  Code size grows <10%, due mostly to instruction prefixes
  Static instruction count SHRINKS by 10%
  Dynamic instruction count SHRINKS by at least 5%
  Dynamic load/store count SHRINKS by 20%
  All without any x86-64-specific code optimizations
x86-64™ Summary
Processor is fully x86 capable
  Full native performance with 32-bit applications and OS
  Full compatibility (BIOS, OS, drivers)
Flexible deployment
  Best-in-class 32-bit x86 performance
  Excellent 64-bit x86-64 instruction execution when needed
Server, workstation, desktop, and mobile share the same architecture
  OS, drivers, and applications can be the same
  CPU vendor focus not split; ISV focus not split
  Support, optimization, etc. all designed to be the same
The "Hammer"
Architecture
11
The “Hammer” Architecture
[Block diagram: the “Hammer” processor core with its L1 instruction cache, L1 data cache, and L2 cache, plus an integrated DDR memory controller and HyperTransport links on the same die.]
Processor Core Overview
[Core block diagram, repeated across three slides with different regions highlighted: the front end (L1 instruction cache with instruction TLB, 2k-entry branch target array, 16k-entry history counters, return address stack) feeds a three-wide Pick / Decode 1 / Decode 2 / Pack pipeline; three 8-entry integer schedulers each issue to an AGU/ALU pair and a 36-entry scheduler issues to the FADD, FMUL, and FMISC units; loads and stores use the L1 data cache with its data TLB and ECC; an ECC-protected L2 cache (with ECC-protected tags) backs both L1 caches; and the System Request Queue (SRQ) and crossbar (XBAR) connect the core to the integrated memory controller and HyperTransport links.]
"Hammer" Pipeline
[Pipeline timeline: stages 1-7 cover fetch/decode and stages 8-12 cover execute; an L2 access extends the timeline to about stage 20, and a DRAM access to about stage 32.]
Fetch/Decode Pipeline
[Pipeline stages 1-7: Fetch 1, Fetch 2, Pick, Decode 1, Decode 2, Pack, Pack/Decode.]
Execute Pipeline
[Pipeline stages 8-12, annotated at roughly 1 ns: Dispatch, Schedule, AGU/ALU, Data Cache 1, Data Cache 2.]
L2 Pipeline
[Pipeline stages 13-20, annotated at roughly 5 ns for an L2 hit versus 1 ns for an L1 hit: L2 Request, Address to L2 Tag, L2 Tag, L2 Tag + L2 Data, L2 Data, Data From L2, Data to DC MUX, Write L1 / Forward.]
DRAM Pipeline
[Pipeline continuation beyond the L2 stages, annotated at roughly 12 ns for a local DRAM access versus 5 ns for L2 and 1 ns for L1 (a cycle-count conversion follows below): Address to NB, Clock Boundary, SRQ Load, SRQ Schedule, GART/AddrMap CAM, GART/AddrMap RAM, XBAR, Coherence/Order Check, MCT Schedule, DRAM Cmd Q Load, DRAM Page Status Check, DRAM Cmd Q Schedule, Request to DRAM Pins, ... DRAM Access, Pins to MCT, Through NB, Clock Boundary, Across CPU, ECC and MUX, Write DC.]
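For orientation, the 1 ns / 5 ns / 12 ns annotations on these pipeline slides convert to cycle counts only once a clock frequency is fixed. A trivial conversion under an assumed 2 GHz clock (the frequency is an assumption, not a figure from the slides):

```c
#include <math.h>
#include <stdio.h>

/* Convert the latency annotations on the pipeline slides into clock cycles.
 * The 2 GHz operating frequency is an assumed example, not from the slides. */
int main(void)
{
    double freq_ghz     = 2.0;
    double levels_ns[]  = { 1.0, 5.0, 12.0 };            /* L1, L2, local DRAM */
    const char *names[] = { "L1", "L2", "DRAM" };

    for (int i = 0; i < 3; i++)
        printf("%-4s ~%4.0f ns = ~%.0f cycles at %.1f GHz\n",
               names[i], levels_ns[i], ceil(levels_ns[i] * freq_ghz), freq_ghz);
    return 0;
}
```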
Large Workload Branch Prediction
[Branch-prediction diagram: fetch proceeds along sequential, predicted, and (after redirect from the execution stages) mispredicted paths. The predictor combines a global history counter of 16k 2-bit counters, a 2k-entry target array, a 12-entry return address stack (RAS), and a branch target address calculator (BTAC); branch selectors accompany evicted lines into the L2 cache so prediction state survives L1 instruction cache evictions.]
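To make the "16k 2-bit counters" concrete, here is a generic two-bit saturating-counter predictor indexed by global branch history (a gshare-style sketch; the table size is from the slide, but the hashing and update policy are assumptions, not AMD's documented scheme):

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_ENTRIES 16384              /* 16k two-bit counters, as on the slide */

static uint8_t  counters[HIST_ENTRIES]; /* 0..3: strongly/weakly not-taken/taken */
static uint32_t global_history;         /* recent branch outcomes, one bit each  */

/* Index combines the branch address with global history (gshare-style hash;
 * the real hardware's indexing function is not given on these slides). */
static uint32_t predictor_index(uint64_t pc)
{
    return (uint32_t)((pc >> 2) ^ global_history) & (HIST_ENTRIES - 1);
}

bool predict_taken(uint64_t pc)
{
    return counters[predictor_index(pc)] >= 2;
}

void train(uint64_t pc, bool taken)
{
    uint8_t *c = &counters[predictor_index(pc)];
    if (taken  && *c < 3) (*c)++;        /* saturate at strongly taken     */
    if (!taken && *c > 0) (*c)--;        /* saturate at strongly not-taken */
    global_history = (global_history << 1) | (taken ? 1u : 0u);
}
```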
Large Workload TLBs
[TLB diagram (a simplified lookup sketch follows below): a 40-entry, fully associative L1 instruction TLB and two 40-entry, fully associative L1 data TLBs (one per data cache port), each supporting 4M/2M and 4k pages and tagged with ASN, VA, and PA; 512-entry, 4-way associative L2 instruction and data TLBs; a 24-entry page descriptor cache holding PDP and PDE entries to speed table walks out of the L2 data cache; and a 32-entry flush-filter CAM that watches CR3, PDP, and PDE entries for modification, feeding TLB and PDC reloads via the table walker and tagged against the current ASN.]
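A generic sketch of the two-level lookup implied by the diagram (entry counts come from the slide; the structure layout, ASN handling, and refill policy are simplifications, not AMD's implementation):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified TLB entry: address-space number (ASN), virtual page, physical page. */
typedef struct { uint16_t asn; uint64_t vpn; uint64_t pfn; bool valid; } tlb_entry;

#define L1_ENTRIES  40        /* fully associative, per the slide */
#define L2_ENTRIES  512       /* 4-way set associative            */
#define L2_WAYS     4
#define L2_SETS     (L2_ENTRIES / L2_WAYS)

static tlb_entry l1[L1_ENTRIES];
static tlb_entry l2[L2_SETS][L2_WAYS];

/* Returns true and fills *pfn on a hit in either level; a miss would start
 * a hardware table walk (not modeled here). */
bool tlb_lookup(uint16_t asn, uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < L1_ENTRIES; i++)              /* L1: search all entries */
        if (l1[i].valid && l1[i].asn == asn && l1[i].vpn == vpn) {
            *pfn = l1[i].pfn;
            return true;
        }

    tlb_entry *set = l2[vpn % L2_SETS];               /* L2: index a set, check 4 ways */
    for (int w = 0; w < L2_WAYS; w++)
        if (set[w].valid && set[w].asn == asn && set[w].vpn == vpn) {
            *pfn = set[w].pfn;
            l1[vpn % L1_ENTRIES] = set[w];            /* naive refill into L1 */
            return true;
        }
    return false;                                     /* miss: table walk needed */
}
```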
DDR Memory Controller
Integrated memory controller details
  8- or 16-byte interface
  16-byte interface supports direct connection to 8 registered DIMMs and Chipkill ECC
  Unbuffered or registered DIMMs
  PC1600, PC2100, and PC2700 DDR memory (peak-bandwidth arithmetic sketched below)
Integrated memory controller benefits
  Significantly reduces DRAM latency
  Memory latency improves as CPU and HyperTransport™ link speed improves
  Bandwidth and capacity grow with the number of CPUs
  Snoop probe throughput scales with CPU frequency
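For orientation only (the slide names memory types, not bandwidth figures): peak transfer rate is interface width times DDR transfer rate, so a PC2700 DIMM on an 8-byte channel peaks at about 2.7 GB/s, and the 16-byte interface doubles that. A minimal check of the arithmetic; the mapping to "Hammer" configurations is this author's, not AMD's:

```c
#include <stdio.h>

/* Peak DDR bandwidth = interface width (bytes) x transfers per second.
 * Transfer rates below are the standard DDR module ratings. */
static double peak_gb_per_s(int width_bytes, double mega_transfers)
{
    return width_bytes * mega_transfers * 1e6 / 1e9;
}

int main(void)
{
    printf("PC1600 (200 MT/s),  8-byte: %.1f GB/s\n", peak_gb_per_s( 8, 200));
    printf("PC2100 (266 MT/s),  8-byte: %.1f GB/s\n", peak_gb_per_s( 8, 266));
    printf("PC2700 (333 MT/s),  8-byte: %.1f GB/s\n", peak_gb_per_s( 8, 333));
    printf("PC2700 (333 MT/s), 16-byte: %.1f GB/s\n", peak_gb_per_s(16, 333));
    return 0;
}
```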
Reliability and Availability
L1 data cache ECC protected
L2 cache AND cache tags ECC protected
DRAM ECC protected, with Chipkill ECC support (a toy error-correction example follows this list)
On-chip and off-chip ECC-protected arrays include background hardware scrubbers
Remaining arrays parity protected
  L1 instruction cache, TLBs, tags
  Generally read-only data which can be recovered
Machine Check Architecture
  Reports failures and predictive-failure results
  Mechanism for hardware/software error containment and recovery
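The ECC protection above is hardware, but the principle shows well in miniature: a Hamming(7,4) code, the simplest single-error-correcting relative of the SEC-DED and Chipkill codes the slide refers to, recovers the flipped bit from a parity syndrome, which is what a background scrubber applies line by line. A toy sketch, not AMD's implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Hamming(7,4) toy example: codeword bit i (1-based) is stored in bit i
 * of a uint8_t; bit 0 is unused.  Parity bits sit at positions 1, 2, 4. */
static int bit(uint8_t v, int i) { return (v >> i) & 1; }

static uint8_t encode(uint8_t data4)           /* 4 data bits -> positions 3,5,6,7 */
{
    int d3 = bit(data4,0), d5 = bit(data4,1), d6 = bit(data4,2), d7 = bit(data4,3);
    uint8_t c = (uint8_t)(d3<<3 | d5<<5 | d6<<6 | d7<<7);
    c |= (uint8_t)((d3^d5^d7) << 1);           /* p1 covers positions 1,3,5,7 */
    c |= (uint8_t)((d3^d6^d7) << 2);           /* p2 covers positions 2,3,6,7 */
    c |= (uint8_t)((d5^d6^d7) << 4);           /* p4 covers positions 4,5,6,7 */
    return c;
}

static uint8_t correct(uint8_t c)              /* returns corrected codeword */
{
    int s = (bit(c,1)^bit(c,3)^bit(c,5)^bit(c,7))
          | (bit(c,2)^bit(c,3)^bit(c,6)^bit(c,7)) << 1
          | (bit(c,4)^bit(c,5)^bit(c,6)^bit(c,7)) << 2;
    if (s) c ^= (uint8_t)(1 << s);             /* syndrome names the flipped bit */
    return c;
}

int main(void)
{
    uint8_t cw  = encode(0xB);                 /* store 4 data bits: 1011 */
    uint8_t bad = cw ^ (1 << 6);               /* inject a single-bit error */
    printf("stored %02x, read %02x, scrubbed %02x\n", cw, bad, correct(bad));
    return 0;
}
```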
HyperTransport™ Technology
Next-generation computing performance goes beyond the microprocessor
Screaming I/O for chip-to-chip communication
  High bandwidth (rough per-link arithmetic below)
  Reduced pin count
  Point-to-point links
  Split transaction and full duplex
Open standard
  Industry enabler for building high-bandwidth I/O subsystems: PCI-X, Gigabit Ethernet, InfiniBand, etc.
Strong industry acceptance
  100+ companies evaluating the specification and several licensing the technology through AMD (2000)
  First HyperTransport technology-based southbridge announced by nVIDIA (June 2001)
Enables scalable 2-8 processor SMP systems
  Glueless MP
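A later slide quotes 25 GB/s of aggregate I/O bandwidth across four links; as a rough cross-check under assumed link parameters (16-bit links clocked at 800 MHz with double-data-rate signaling were typical HyperTransport figures of the era, but are not values stated on these slides):

```c
#include <stdio.h>

/* Rough HyperTransport bandwidth arithmetic under assumed link parameters:
 * the width and clock below are NOT taken from the slides. */
int main(void)
{
    double width_bytes   = 2.0;            /* 16-bit link                    */
    double clock_mhz     = 800.0;          /* assumed link clock             */
    double transfers     = clock_mhz * 2;  /* double data rate -> 1600 MT/s  */
    double per_direction = width_bytes * transfers * 1e6 / 1e9;   /* GB/s    */
    double per_link      = 2 * per_direction;                     /* duplex  */

    printf("per direction: %.1f GB/s\n", per_direction);   /*  3.2 GB/s  */
    printf("per link:      %.1f GB/s\n", per_link);        /*  6.4 GB/s  */
    printf("4 links:       %.1f GB/s\n", 4 * per_link);    /* ~25.6 GB/s */
    return 0;
}
```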
CPU With Integrated Northbridge
[Diagram: four "Hammer" CPUs, each containing a CPU core, System Request Queue (SRQ), crossbar (XBAR), memory controller (MCT) with locally attached DRAM, and HyperTransport™ links; coherent HyperTransport links interconnect the CPUs while host-bridge links attach I/O. HT* = HyperTransport™ technology, HB = Host Bridge.]
Northbridge Overview
[Block diagram: the System Request Queue (SRQ) and Advanced Programmable Interrupt Controller (APIC) accept requests, probes, data, and interrupts from CPU 0 and CPU 1; the crossbar (XBAR) routes traffic among the SRQ, the three HyperTransport links, and the memory controller (MCT), which drives the DRAM controller (DCT). Internal paths are 64-bit data and 64-bit command/address; each HyperTransport link carries 16-bit data/command/address; the DCT drives the DRAM data and RAS/CAS/control pins.]
Northbridge Command Flow
[Command-flow diagram: requests from CPU 0/CPU 1 enter the 24-entry System Request Queue through the address map and GART, alongside the victim buffer (8-entry), write buffer (4-entry), instruction MAB (2-entry), and data MAB (8-entry); the three HyperTransport link inputs feed router buffers (10-, 12-, and 16-entry) in the XBAR, which routes commands to the three HyperTransport link outputs, back to the CPU, and into the 20-entry memory command queue ahead of the DCT. All buffers hold 64 bits of command/address.]
Northbridge Data Flow
[Data-flow diagram: data from CPU 0/CPU 1 (via the victim and write buffers), the host bridge, the DCT, and the three HyperTransport link inputs passes through XBAR buffers (5- and 8-entry) into a 12-entry system request data queue and an 8-entry memory data queue, then out to the CPU, the host bridge, the DCT, and the three HyperTransport link outputs. All buffers hold 64-byte cache lines.]
Coherent HyperTransport: Read Request
[Nine-step diagram sequence showing a cache-line read in a four-CPU system, each CPU with locally attached memory and I/O (a schematic model follows below):
Steps 1-2: the requesting CPU issues a RdBlk (read block) request, which is routed hop by hop toward the CPU whose memory controller owns the line.
Steps 3-4: the home node's memory controller starts the DRAM access and broadcasts probe requests (PRQ) to every CPU.
Steps 5-7: each CPU checks its caches and returns a probe response (TRSP) to the requester, while the home node returns the read response (RDRSP) carrying the data; responses are routed back hop by hop.
Steps 8-9: once the requester has the data and all probe responses, it sends Source Done (SrcDn) back to the home memory controller, completing the transaction.]
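A schematic model of that message sequence (message names follow the slide labels; the routing, ordering, and buffering details are simplified and are not AMD's implementation):

```c
#include <stdio.h>

/* Message types seen on the slides for a coherent read. */
typedef enum { RDBLK, PROBE_REQ, PROBE_RESP, READ_RESP, SRC_DONE } msg_t;

static const char *name[] = { "RdBlk", "ProbeReq", "ProbeResp", "ReadResp", "SrcDone" };

static void send(msg_t m, int from, int to)
{
    printf("node %d -> node %d : %s\n", from, to, name[m]);
}

/* One coherent read: 'req' wants a line whose home memory is on 'home',
 * in a system of 'nodes' CPUs.  Simplified flow:
 *   requester -> home: RdBlk (home also starts its DRAM access)
 *   home -> every node: ProbeReq
 *   every node -> requester: ProbeResp; home -> requester: ReadResp + data
 *   requester -> home: SrcDone once the data and all probe responses are in */
static void coherent_read(int req, int home, int nodes)
{
    send(RDBLK, req, home);
    for (int n = 0; n < nodes; n++)
        send(PROBE_REQ, home, n);
    for (int n = 0; n < nodes; n++)
        send(PROBE_RESP, n, req);
    send(READ_RESP, home, req);
    send(SRC_DONE, req, home);
}

int main(void)
{
    coherent_read(3, 1, 4);   /* e.g. CPU 3 reads a line homed on CPU 1 */
    return 0;
}
```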
"Hammer" Architecture Summary
8th Generation microprocessor core
Improved IPC and operating frequency
Support for large workloads
Cache subsystem
Enhanced TLB structures
Improved branch prediction
Integrated DDR memory controller
Reduced DRAM latency
HyperTransport™ technology
Screaming I/O for chip-to-chip communication
Enables glueless MP
"Hammer" System
Architecture
“Hammer” System Architecture
1-way
[Diagram: a single "Hammer" processor connected over a HyperTransport chain to an 8x AGP tunnel (or integrated graphics) and the southbridge.]
“Hammer” System Architecture
Glueless Multiprocessing: 2-way
[Diagram: two "Hammer" processors linked by coherent HyperTransport; HyperTransport I/O chains attach an 8x AGP tunnel, a PCI-X tunnel, and the southbridge.]
“Hammer” System Architecture
Glueless Multiprocessing: 4-way
[Diagram: four "Hammer" processors interconnected by coherent HyperTransport links; HyperTransport I/O chains attach PCI-X tunnels, an optional 8x AGP tunnel, and the southbridge.]
“Hammer” System Architecture
Glueless Multiprocessing: 8-way
[Diagram: eight "Hammer" processors interconnected by coherent HyperTransport links.]
MP System Architecture
Software view of memory is SMP
  Physical address space is flat and fully coherent
  Latency difference between local and remote memory in an 8P system is comparable to the difference between a DRAM page hit and a DRAM page conflict
  DRAM locations can be contiguous or interleaved
Multiprocessor support designed in from the beginning
  Lower overall chip count
  All MP system functions use CPU technology and frequency
8P system parameters (see the arithmetic sketch below)
  64 DIMMs (up to 128GB) directly connected
  4 HyperTransport links available for I/O (25GB/s)
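A quick check of those 8P figures under assumptions consistent with the earlier slides (8 DIMMs per integrated controller comes from the memory controller slide; the 2 GB DIMM capacity and per-link bandwidth are this author's assumptions):

```c
#include <stdio.h>

/* 8P "Hammer" system arithmetic.  DIMM count per CPU is from the memory
 * controller slide; DIMM size and per-link bandwidth are assumptions. */
int main(void)
{
    int    cpus          = 8;
    int    dimms_per_cpu = 8;                 /* 16-byte interface, 8 registered DIMMs */
    double gb_per_dimm   = 2.0;               /* assumed DIMM capacity of the era      */
    int    io_links      = 4;
    double gb_per_link   = 6.4;               /* assumed full-duplex HT link bandwidth */

    printf("DIMMs: %d\n", cpus * dimms_per_cpu);                      /* 64      */
    printf("Memory: %.0f GB\n", cpus * dimms_per_cpu * gb_per_dimm);  /* 128 GB  */
    printf("I/O bandwidth: %.1f GB/s\n", io_links * gb_per_link);     /* ~25.6   */
    return 0;
}
```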
The Rewards of Good Plumbing
Bandwidth
  4P system designed to achieve 8GB/s aggregate memory-copy bandwidth, with data spread throughout the system
  Leading-edge bus-based systems are limited to about 2.1GB/s aggregate bandwidth (3.2GB/s theoretical peak)
Latency
  Average unloaded latency in a 4P system (page miss) is designed to be 140ns
  Average unloaded latency in an 8P system (page miss) is designed to be 160ns
  Latency under load is planned to increase much more slowly than in bus-based systems, due to the available bandwidth
  Latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed
"Hammer" Summary
8th-generation CPU core
  Delivering high performance through an optimum balance of IPC and operating frequency
x86-64™ technology
  Compelling 64-bit migration strategy without any significant sacrifice of the existing code base
  Full-speed support for the x86 code base
  Unified architecture from notebook through server
DDR memory controller
  Significantly reduces DRAM latency
HyperTransport™ technology
  High-bandwidth I/O
  Glueless MP
Foundation for a future portfolio of processors
  Top-to-bottom desktop and mobile processors
  High-performance 1-, 2-, 4-, and 8-way servers and workstations
©2001 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, 3DNow!, and combinations thereof are trademarks of Advanced Micro Devices. HyperTransport is a trademark of the HyperTransport Technology Consortium. Other product names are for informational purposes only and may be trademarks of their respective companies.