AMD’s Next Generation Microprocessor Architecture
Fred Weber
October 2001
"Hammer" Goals
Build a next-generation system architecture which serves as the foundation for future processor platforms
Enable a full line of server and workstation products
  Leading-edge x86 (32-bit) performance and compatibility
  Native 64-bit support
  Establish the x86-64 Instruction Set Architecture
  Extensive multiprocessor support
  RAS features
Provide top-to-bottom desktop and mobile processors
Agenda
x86-64™ Technology
"Hammer" Architecture
"Hammer" System Architecture
x86-64™ Technology
Why 64-Bit Computing?
Required for large-memory programs
  Large databases
  Scientific and engineering problems
  Designing CPUs ☺
But,
  Limited demand for applications which require 64 bits
  Most applications can remain 32-bit x86 code, if the processor continues to deliver leading-edge x86 performance
And,
  Software is a huge investment (tool chains, applications, certifications)
  The instruction set is first and foremost a vehicle for compatibility
    Binary compatibility
    Interpreter/JIT support is increasingly important
x86-64 Instruction Set Architecture
x86-64 mode is built on x86
  Similar to the previous extension from 16-bit to 32-bit
  Vast majority of opcodes and features unchanged
  Integer/address register files and datapaths are native 64-bit
  48-bit virtual address space, 40-bit physical address space
Enhancements
  Add 8 new integer registers
  Add PC-relative addressing
  Add full support for an SSE/SSE2-based floating-point Application Binary Interface (ABI), including 16 registers
  Additional registers and data sizes are encoded by reclaiming the one-byte increment/decrement opcodes (0x40-0x4F) for use as a single optional prefix (see the sketch below)
Public specification: www.x86-64.org
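As a rough illustration (not from the original slides), the reclaimed 0x40-0x4F byte range acts as a single optional prefix whose low four bits extend operand size and register fields; a minimal C sketch of how a decoder might interpret it (struct and function names are invented here, but the bit assignments follow the public x86-64 specification):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only: interprets a reclaimed 0x40-0x4F byte as the
 * x86-64 REX prefix. */
typedef struct {
    int w;  /* 1 = 64-bit operand size                          */
    int r;  /* extends the ModRM.reg field (reaches R8-R15)     */
    int x;  /* extends the SIB.index field                      */
    int b;  /* extends the ModRM.rm / SIB.base field            */
} rex_prefix;

int decode_rex(uint8_t byte, rex_prefix *out)
{
    if ((byte & 0xF0) != 0x40)      /* outside 0x40-0x4F: not a REX prefix */
        return 0;
    out->w = (byte >> 3) & 1;
    out->r = (byte >> 2) & 1;
    out->x = (byte >> 1) & 1;
    out->b =  byte       & 1;
    return 1;
}

int main(void)
{
    rex_prefix p;
    if (decode_rex(0x48, &p))       /* REX.W, e.g. a 64-bit register move */
        printf("W=%d R=%d X=%d B=%d\n", p.w, p.r, p.x, p.b);
    return 0;
}
```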
x86-64 Programmer’s Model
[Register-set diagram: the eight legacy 32-bit GPRs (EAX through EDI, with AH/AL sub-registers) are extended to 64 bits and joined by eight new GPRs R8-R15; the eight legacy 128-bit SSE/SSE2 registers XMM0-XMM7 are joined by XMM8-XMM15; and the 32-bit EIP program counter is extended to 64 bits. Shaded regions mark the state added by x86-64.]
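The 48-bit virtual address space from the ISA slide interacts with these 64-bit registers through canonical addressing: bits 63:48 of a virtual address must be copies of bit 47. A small check of that rule (the function name is ours; the rule itself is from the public x86-64 specification):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* x86-64 canonical-address check for a 48-bit virtual address space:
 * bits 63:47 must be all zeros or all ones (sign-extension of bit 47). */
static bool is_canonical(uint64_t va)
{
    uint64_t top = va >> 47;          /* bit 47 plus bits 63:48 */
    return top == 0 || top == 0x1ffff;
}

int main(void)
{
    printf("%d\n", is_canonical(0x00007fffffffffffULL));  /* 1: top of lower half    */
    printf("%d\n", is_canonical(0xffff800000000000ULL));  /* 1: bottom of upper half */
    printf("%d\n", is_canonical(0x0000800000000000ULL));  /* 0: non-canonical hole   */
    return 0;
}
```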
x86-64 Code Generation and Quality
Compiler and tool chain is a straightforward port
Instruction set is designed to offer the advantages of both CISC and RISC
  Code density of CISC
  Register usage and ABI models of RISC
  Enables easy application of standard compiler optimizations
SPECint2000 code generation (compared to 32-bit x86)
  Code size grows <10%, due mostly to instruction prefixes
  Static instruction count SHRINKS by 10%
  Dynamic instruction count SHRINKS by at least 5%
  Dynamic load/store count SHRINKS by 20%
  All without any x86-64-specific code optimizations
x86-64™ Summary
Processor is fully x86 capable
  Full native performance with 32-bit applications and OS
  Full compatibility (BIOS, OS, drivers)
Flexible deployment
  Best-in-class 32-bit x86 performance
  Excellent 64-bit x86-64 instruction execution when needed
Server, workstation, desktop, and mobile share the same architecture
  OS, drivers, and applications can be the same
  CPU vendor focus not split; ISV focus not split
  Support, optimization, etc. all designed to be the same
The "Hammer"
Architecture
11
The “Hammer” Architecture
[Block diagram: the “Hammer” processor core with its L1 instruction cache, L1 data cache, and L2 cache, plus an integrated DDR memory controller and HyperTransport links on the same die.]
Processor Core Overview
[Core block diagram, repeated across three slides with different regions highlighted: the front end (L1 instruction cache with instruction TLB, 2k-entry branch target array, 16k-entry history counters, return address stack) feeds a three-wide Pick / Decode 1 / Decode 2 / Pack pipeline; three 8-entry integer schedulers each issue to an AGU/ALU pair and a 36-entry scheduler issues to the FADD, FMUL, and FMISC units; loads and stores use the L1 data cache with its data TLB and ECC; an ECC-protected L2 cache (with ECC-protected tags) backs both L1 caches; and the System Request Queue (SRQ) and crossbar (XBAR) connect the core to the integrated memory controller and HyperTransport links.]
"Hammer" Pipeline
[Pipeline timeline: stages 1-7 cover fetch/decode and stages 8-12 cover execute; an L2 access extends the timeline to about stage 20, and a DRAM access to about stage 32.]
Fetch/Decode Pipeline
[Pipeline stages 1-7: Fetch 1, Fetch 2, Pick, Decode 1, Decode 2, Pack, Pack/Decode.]
Execute Pipeline
[Pipeline stages 8-12, annotated at roughly 1 ns: Dispatch, Schedule, AGU/ALU, Data Cache 1, Data Cache 2.]
L2 Pipeline
[Pipeline stages 13-20, annotated at roughly 5 ns for an L2 hit versus 1 ns for an L1 hit: L2 Request, Address to L2 Tag, L2 Tag, L2 Tag + L2 Data, L2 Data, Data From L2, Data to DC MUX, Write L1 / Forward.]
DRAM Pipeline
[Pipeline continuation beyond the L2 stages, annotated at roughly 12 ns for a local DRAM access versus 5 ns for L2 and 1 ns for L1 (a cycle-count conversion follows below): Address to NB, Clock Boundary, SRQ Load, SRQ Schedule, GART/AddrMap CAM, GART/AddrMap RAM, XBAR, Coherence/Order Check, MCT Schedule, DRAM Cmd Q Load, DRAM Page Status Check, DRAM Cmd Q Schedule, Request to DRAM Pins, ... DRAM Access, Pins to MCT, Through NB, Clock Boundary, Across CPU, ECC and MUX, Write DC.]
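For orientation, the 1 ns / 5 ns / 12 ns annotations on these pipeline slides convert to cycle counts only once a clock frequency is fixed. A trivial conversion under an assumed 2 GHz clock (the frequency is an assumption, not a figure from the slides):

```c
#include <math.h>
#include <stdio.h>

/* Convert the latency annotations on the pipeline slides into clock cycles.
 * The 2 GHz operating frequency is an assumed example, not from the slides. */
int main(void)
{
    double freq_ghz     = 2.0;
    double levels_ns[]  = { 1.0, 5.0, 12.0 };            /* L1, L2, local DRAM */
    const char *names[] = { "L1", "L2", "DRAM" };

    for (int i = 0; i < 3; i++)
        printf("%-4s ~%4.0f ns = ~%.0f cycles at %.1f GHz\n",
               names[i], levels_ns[i], ceil(levels_ns[i] * freq_ghz), freq_ghz);
    return 0;
}
```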
Large Workload Branch Prediction
[Branch-prediction diagram: fetch proceeds along sequential, predicted, and (after redirect from the execution stages) mispredicted paths. The predictor combines a global history counter of 16k 2-bit counters, a 2k-entry target array, a 12-entry return address stack (RAS), and a branch target address calculator (BTAC); branch selectors accompany evicted lines into the L2 cache so prediction state survives L1 instruction cache evictions.]
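To make the "16k 2-bit counters" concrete, here is a generic two-bit saturating-counter predictor indexed by global branch history (a gshare-style sketch; the table size is from the slide, but the hashing and update policy are assumptions, not AMD's documented scheme):

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_ENTRIES 16384              /* 16k two-bit counters, as on the slide */

static uint8_t  counters[HIST_ENTRIES]; /* 0..3: strongly/weakly not-taken/taken */
static uint32_t global_history;         /* recent branch outcomes, one bit each  */

/* Index combines the branch address with global history (gshare-style hash;
 * the real hardware's indexing function is not given on these slides). */
static uint32_t predictor_index(uint64_t pc)
{
    return (uint32_t)((pc >> 2) ^ global_history) & (HIST_ENTRIES - 1);
}

bool predict_taken(uint64_t pc)
{
    return counters[predictor_index(pc)] >= 2;
}

void train(uint64_t pc, bool taken)
{
    uint8_t *c = &counters[predictor_index(pc)];
    if (taken  && *c < 3) (*c)++;        /* saturate at strongly taken     */
    if (!taken && *c > 0) (*c)--;        /* saturate at strongly not-taken */
    global_history = (global_history << 1) | (taken ? 1u : 0u);
}
```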
Large Workload TLBs
[TLB diagram (a simplified lookup sketch follows below): a 40-entry, fully associative L1 instruction TLB and two 40-entry, fully associative L1 data TLBs (one per data cache port), each supporting 4M/2M and 4k pages and tagged with ASN, VA, and PA; 512-entry, 4-way associative L2 instruction and data TLBs; a 24-entry page descriptor cache holding PDP and PDE entries to speed table walks out of the L2 data cache; and a 32-entry flush-filter CAM that watches CR3, PDP, and PDE entries for modification, feeding TLB and PDC reloads via the table walker and tagged against the current ASN.]
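A generic sketch of the two-level lookup implied by the diagram (entry counts come from the slide; the structure layout, ASN handling, and refill policy are simplifications, not AMD's implementation):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified TLB entry: address-space number (ASN), virtual page, physical page. */
typedef struct { uint16_t asn; uint64_t vpn; uint64_t pfn; bool valid; } tlb_entry;

#define L1_ENTRIES  40        /* fully associative, per the slide */
#define L2_ENTRIES  512       /* 4-way set associative            */
#define L2_WAYS     4
#define L2_SETS     (L2_ENTRIES / L2_WAYS)

static tlb_entry l1[L1_ENTRIES];
static tlb_entry l2[L2_SETS][L2_WAYS];

/* Returns true and fills *pfn on a hit in either level; a miss would start
 * a hardware table walk (not modeled here). */
bool tlb_lookup(uint16_t asn, uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < L1_ENTRIES; i++)              /* L1: search all entries */
        if (l1[i].valid && l1[i].asn == asn && l1[i].vpn == vpn) {
            *pfn = l1[i].pfn;
            return true;
        }

    tlb_entry *set = l2[vpn % L2_SETS];               /* L2: index a set, check 4 ways */
    for (int w = 0; w < L2_WAYS; w++)
        if (set[w].valid && set[w].asn == asn && set[w].vpn == vpn) {
            *pfn = set[w].pfn;
            l1[vpn % L1_ENTRIES] = set[w];            /* naive refill into L1 */
            return true;
        }
    return false;                                     /* miss: table walk needed */
}
```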
DDR Memory Controller
Integrated memory controller details
  8- or 16-byte interface
  16-byte interface supports direct connection to 8 registered DIMMs and Chipkill ECC
  Unbuffered or registered DIMMs
  PC1600, PC2100, and PC2700 DDR memory (peak-bandwidth arithmetic sketched below)
Integrated memory controller benefits
  Significantly reduces DRAM latency
  Memory latency improves as CPU and HyperTransport™ link speed improves
  Bandwidth and capacity grow with the number of CPUs
  Snoop probe throughput scales with CPU frequency
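For orientation only (the slide names memory types, not bandwidth figures): peak transfer rate is interface width times DDR transfer rate, so a PC2700 DIMM on an 8-byte channel peaks at about 2.7 GB/s, and the 16-byte interface doubles that. A minimal check of the arithmetic; the mapping to "Hammer" configurations is this author's, not AMD's:

```c
#include <stdio.h>

/* Peak DDR bandwidth = interface width (bytes) x transfers per second.
 * Transfer rates below are the standard DDR module ratings. */
static double peak_gb_per_s(int width_bytes, double mega_transfers)
{
    return width_bytes * mega_transfers * 1e6 / 1e9;
}

int main(void)
{
    printf("PC1600 (200 MT/s),  8-byte: %.1f GB/s\n", peak_gb_per_s( 8, 200));
    printf("PC2100 (266 MT/s),  8-byte: %.1f GB/s\n", peak_gb_per_s( 8, 266));
    printf("PC2700 (333 MT/s),  8-byte: %.1f GB/s\n", peak_gb_per_s( 8, 333));
    printf("PC2700 (333 MT/s), 16-byte: %.1f GB/s\n", peak_gb_per_s(16, 333));
    return 0;
}
```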
Reliability and Availability
L1 data cache ECC protected
L2 cache AND cache tags ECC protected
DRAM ECC protected, with Chipkill ECC support (a toy error-correction example follows this list)
On-chip and off-chip ECC-protected arrays include background hardware scrubbers
Remaining arrays parity protected
  L1 instruction cache, TLBs, tags
  Generally read-only data which can be recovered
Machine Check Architecture
  Reports failures and predictive-failure results
  Mechanism for hardware/software error containment and recovery
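The ECC protection above is hardware, but the principle shows well in miniature: a Hamming(7,4) code, the simplest single-error-correcting relative of the SEC-DED and Chipkill codes the slide refers to, recovers the flipped bit from a parity syndrome, which is what a background scrubber applies line by line. A toy sketch, not AMD's implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Hamming(7,4) toy example: codeword bit i (1-based) is stored in bit i
 * of a uint8_t; bit 0 is unused.  Parity bits sit at positions 1, 2, 4. */
static int bit(uint8_t v, int i) { return (v >> i) & 1; }

static uint8_t encode(uint8_t data4)           /* 4 data bits -> positions 3,5,6,7 */
{
    int d3 = bit(data4,0), d5 = bit(data4,1), d6 = bit(data4,2), d7 = bit(data4,3);
    uint8_t c = (uint8_t)(d3<<3 | d5<<5 | d6<<6 | d7<<7);
    c |= (uint8_t)((d3^d5^d7) << 1);           /* p1 covers positions 1,3,5,7 */
    c |= (uint8_t)((d3^d6^d7) << 2);           /* p2 covers positions 2,3,6,7 */
    c |= (uint8_t)((d5^d6^d7) << 4);           /* p4 covers positions 4,5,6,7 */
    return c;
}

static uint8_t correct(uint8_t c)              /* returns corrected codeword */
{
    int s = (bit(c,1)^bit(c,3)^bit(c,5)^bit(c,7))
          | (bit(c,2)^bit(c,3)^bit(c,6)^bit(c,7)) << 1
          | (bit(c,4)^bit(c,5)^bit(c,6)^bit(c,7)) << 2;
    if (s) c ^= (uint8_t)(1 << s);             /* syndrome names the flipped bit */
    return c;
}

int main(void)
{
    uint8_t cw  = encode(0xB);                 /* store 4 data bits: 1011 */
    uint8_t bad = cw ^ (1 << 6);               /* inject a single-bit error */
    printf("stored %02x, read %02x, scrubbed %02x\n", cw, bad, correct(bad));
    return 0;
}
```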
HyperTransport™ Technology
Next-generation computing performance goes beyond the microprocessor
Screaming I/O for chip-to-chip communication
  High bandwidth (rough per-link arithmetic below)
  Reduced pin count
  Point-to-point links
  Split transaction and full duplex
Open standard
  Industry enabler for building high-bandwidth I/O subsystems: PCI-X, Gigabit Ethernet, InfiniBand, etc.
Strong industry acceptance
  100+ companies evaluating the specification and several licensing the technology through AMD (2000)
  First HyperTransport technology-based southbridge announced by nVIDIA (June 2001)
Enables scalable 2-8 processor SMP systems
  Glueless MP
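A later slide quotes 25 GB/s of aggregate I/O bandwidth across four links; as a rough cross-check under assumed link parameters (16-bit links clocked at 800 MHz with double-data-rate signaling were typical HyperTransport figures of the era, but are not values stated on these slides):

```c
#include <stdio.h>

/* Rough HyperTransport bandwidth arithmetic under assumed link parameters:
 * the width and clock below are NOT taken from the slides. */
int main(void)
{
    double width_bytes   = 2.0;            /* 16-bit link                    */
    double clock_mhz     = 800.0;          /* assumed link clock             */
    double transfers     = clock_mhz * 2;  /* double data rate -> 1600 MT/s  */
    double per_direction = width_bytes * transfers * 1e6 / 1e9;   /* GB/s    */
    double per_link      = 2 * per_direction;                     /* duplex  */

    printf("per direction: %.1f GB/s\n", per_direction);   /*  3.2 GB/s  */
    printf("per link:      %.1f GB/s\n", per_link);        /*  6.4 GB/s  */
    printf("4 links:       %.1f GB/s\n", 4 * per_link);    /* ~25.6 GB/s */
    return 0;
}
```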
CPU With Integrated Northbridge
[Diagram: four "Hammer" CPUs, each containing a CPU core, System Request Queue (SRQ), crossbar (XBAR), memory controller (MCT) with locally attached DRAM, and HyperTransport™ links; coherent HyperTransport links interconnect the CPUs while host-bridge links attach I/O. HT* = HyperTransport™ technology, HB = Host Bridge.]
Northbridge Overview
[Block diagram: the System Request Queue (SRQ) and Advanced Programmable Interrupt Controller (APIC) accept requests, probes, data, and interrupts from CPU 0 and CPU 1; the crossbar (XBAR) routes traffic among the SRQ, the three HyperTransport links, and the memory controller (MCT), which drives the DRAM controller (DCT). Internal paths are 64-bit data and 64-bit command/address; each HyperTransport link carries 16-bit data/command/address; the DCT drives the DRAM data and RAS/CAS/control pins.]
Northbridge Command Flow
[Command-flow diagram: requests from CPU 0/CPU 1 enter the 24-entry System Request Queue through the address map and GART, alongside the victim buffer (8-entry), write buffer (4-entry), instruction MAB (2-entry), and data MAB (8-entry); the three HyperTransport link inputs feed router buffers (10-, 12-, and 16-entry) in the XBAR, which routes commands to the three HyperTransport link outputs, back to the CPU, and into the 20-entry memory command queue ahead of the DCT. All buffers hold 64 bits of command/address.]
Northbridge Data Flow
[Data-flow diagram: data from CPU 0/CPU 1 (via the victim and write buffers), the host bridge, the DCT, and the three HyperTransport link inputs passes through XBAR buffers (5- and 8-entry) into a 12-entry system request data queue and an 8-entry memory data queue, then out to the CPU, the host bridge, the DCT, and the three HyperTransport link outputs. All buffers hold 64-byte cache lines.]
Coherent HyperTransport: Read Request
[Nine-step diagram sequence showing a cache-line read in a four-CPU system, each CPU with locally attached memory and I/O (a schematic model follows below):
Steps 1-2: the requesting CPU issues a RdBlk (read block) request, which is routed hop by hop toward the CPU whose memory controller owns the line.
Steps 3-4: the home node's memory controller starts the DRAM access and broadcasts probe requests (PRQ) to every CPU.
Steps 5-7: each CPU checks its caches and returns a probe response (TRSP) to the requester, while the home node returns the read response (RDRSP) carrying the data; responses are routed back hop by hop.
Steps 8-9: once the requester has the data and all probe responses, it sends Source Done (SrcDn) back to the home memory controller, completing the transaction.]
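A schematic model of that message sequence (message names follow the slide labels; the routing, ordering, and buffering details are simplified and are not AMD's implementation):

```c
#include <stdio.h>

/* Message types seen on the slides for a coherent read. */
typedef enum { RDBLK, PROBE_REQ, PROBE_RESP, READ_RESP, SRC_DONE } msg_t;

static const char *name[] = { "RdBlk", "ProbeReq", "ProbeResp", "ReadResp", "SrcDone" };

static void send(msg_t m, int from, int to)
{
    printf("node %d -> node %d : %s\n", from, to, name[m]);
}

/* One coherent read: 'req' wants a line whose home memory is on 'home',
 * in a system of 'nodes' CPUs.  Simplified flow:
 *   requester -> home: RdBlk (home also starts its DRAM access)
 *   home -> every node: ProbeReq
 *   every node -> requester: ProbeResp; home -> requester: ReadResp + data
 *   requester -> home: SrcDone once the data and all probe responses are in */
static void coherent_read(int req, int home, int nodes)
{
    send(RDBLK, req, home);
    for (int n = 0; n < nodes; n++)
        send(PROBE_REQ, home, n);
    for (int n = 0; n < nodes; n++)
        send(PROBE_RESP, n, req);
    send(READ_RESP, home, req);
    send(SRC_DONE, req, home);
}

int main(void)
{
    coherent_read(3, 1, 4);   /* e.g. CPU 3 reads a line homed on CPU 1 */
    return 0;
}
```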
"Hammer" Architecture Summary
8th Generation microprocessor core
Improved IPC and operating frequency
Support for large workloads
Cache subsystem
Enhanced TLB structures
Improved branch prediction
Integrated DDR memory controller
Reduced DRAM latency
HyperTransport™ technology
Screaming I/O for chip-to-chip communication
Enables glueless MP
"Hammer" System
Architecture
“Hammer” System Architecture
1-way
[Diagram: a single "Hammer" processor connected over a HyperTransport chain to an 8x AGP tunnel (or integrated graphics) and the southbridge.]
“Hammer” System Architecture
Glueless Multiprocessing: 2-way
[Diagram: two "Hammer" processors linked by coherent HyperTransport; HyperTransport I/O chains attach an 8x AGP tunnel, a PCI-X tunnel, and the southbridge.]
“Hammer” System Architecture
Glueless Multiprocessing: 4-way
[Diagram: four "Hammer" processors interconnected by coherent HyperTransport links; HyperTransport I/O chains attach PCI-X tunnels, an optional 8x AGP tunnel, and the southbridge.]
“Hammer” System Architecture
Glueless Multiprocessing: 8-way
[Diagram: eight "Hammer" processors interconnected by coherent HyperTransport links.]
MP System Architecture
Software view of memory is SMP
  Physical address space is flat and fully coherent
  Latency difference between local and remote memory in an 8P system is comparable to the difference between a DRAM page hit and a DRAM page conflict
  DRAM locations can be contiguous or interleaved
Multiprocessor support designed in from the beginning
  Lower overall chip count
  All MP system functions use CPU technology and frequency
8P system parameters (see the arithmetic sketch below)
  64 DIMMs (up to 128GB) directly connected
  4 HyperTransport links available for I/O (25GB/s)
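A quick check of those 8P figures under assumptions consistent with the earlier slides (8 DIMMs per integrated controller comes from the memory controller slide; the 2 GB DIMM capacity and per-link bandwidth are this author's assumptions):

```c
#include <stdio.h>

/* 8P "Hammer" system arithmetic.  DIMM count per CPU is from the memory
 * controller slide; DIMM size and per-link bandwidth are assumptions. */
int main(void)
{
    int    cpus          = 8;
    int    dimms_per_cpu = 8;                 /* 16-byte interface, 8 registered DIMMs */
    double gb_per_dimm   = 2.0;               /* assumed DIMM capacity of the era      */
    int    io_links      = 4;
    double gb_per_link   = 6.4;               /* assumed full-duplex HT link bandwidth */

    printf("DIMMs: %d\n", cpus * dimms_per_cpu);                      /* 64      */
    printf("Memory: %.0f GB\n", cpus * dimms_per_cpu * gb_per_dimm);  /* 128 GB  */
    printf("I/O bandwidth: %.1f GB/s\n", io_links * gb_per_link);     /* ~25.6   */
    return 0;
}
```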
The Rewards of Good Plumbing
Bandwidth
  4P system designed to achieve 8GB/s aggregate memory-copy bandwidth, with data spread throughout the system
  Leading-edge bus-based systems are limited to about 2.1GB/s aggregate bandwidth (3.2GB/s theoretical peak)
Latency
  Average unloaded latency in a 4P system (page miss) is designed to be 140ns
  Average unloaded latency in an 8P system (page miss) is designed to be 160ns
  Latency under load is planned to increase much more slowly than in bus-based systems, due to the available bandwidth
  Latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed
"Hammer" Summary
8th-generation CPU core
  Delivering high performance through an optimum balance of IPC and operating frequency
x86-64™ technology
  Compelling 64-bit migration strategy without any significant sacrifice of the existing code base
  Full-speed support for the x86 code base
  Unified architecture from notebook through server
DDR memory controller
  Significantly reduces DRAM latency
HyperTransport™ technology
  High-bandwidth I/O
  Glueless MP
Foundation for a future portfolio of processors
  Top-to-bottom desktop and mobile processors
  High-performance 1-, 2-, 4-, and 8-way servers and workstations
©2001 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, 3DNow!, and combinations thereof are trademarks of Advanced Micro Devices. HyperTransport is a trademark of the HyperTransport Technology Consortium. Other product names are for informational purposes only and may be trademarks of their respective companies.