Architecture Overview
www.infineon.com/dsp
Never stop thinking.

CARMEL DSP Core Technical Overview Handbook

About this Document
This document was created with Adobe FrameMaker 5.5.6 at Infineon Technologies North America Corp., 1730 North First Street, San Jose, California 95112, USA. Revision number and date are shown on each page. This document is not controlled, meaning that no distribution list is maintained and the reader is responsible for ensuring that he/she is not using an obsolete version. Please e-mail your comments, corrections, and feedback to: editor@infineon.com

Revision History
Release Version   Release Date   Comments
1.0               06/01/00       Preliminary release

Copyright (c) 2000 Infineon Technologies Corp. All Rights Reserved.
V1.0 2000-06-01

Attention please!
As far as patents or other rights of third parties are concerned, liability is only assumed for components, not for applications, processes, and circuits implemented within components or assemblies. This information describes the type of component and shall not be considered as assured characteristics. Terms of delivery and rights to change design reserved. Due to technical requirements, components may contain dangerous substances. For information on the types in question, please contact your nearest Infineon office. Infineon Technologies Corp. is an approved CECC manufacturer.

Packing
Please use the recycling operators known to you. We can also help you get in touch with your nearest sales office. By agreement, we will take packing material back, if it is sorted. You must bear the cost of transport. For packing material that is returned to us unsorted or which we are not obligated to accept, we shall have the right to invoice you for any costs incurred.

Components used in life-support devices or systems must be expressly authorized for such purpose!
Critical components (1) of Infineon Technologies Corp. may only be used in life-support devices or systems (2) with the express written approval of Infineon Technologies Corp.
(1) A critical component is a component used in a life-support device whose failure can reasonably be expected to cause the failure of that life-support device or system, and/or to affect the safety or effectiveness of that device or system.
(2) Life-support devices or systems are intended: (a) to be implemented in the human body, or (b) to support and/or maintain human life. If they fail, it is reasonable to assume that the health of the user may be endangered.

CARMEL™ Technical Overview V1.0 2000-06-01

Table of Contents
1 Introduction
2 The CARMEL™ DSP Core Product Line
2.1 CARMEL Synthesizable Product Modules
    The CARMEL Core Module
        Arithmetic Functions
        Data Address Generation
        Program and System Control
    CARMEL Memory Modules
    System Interconnect Modules
    System Peripheral And Memory Modules
2.2 CARMEL Product Development Tools
    Software
    Emulation Hardware
3 The CARMEL Core Architecture
3.1 Core Functional Units
3.2 Core Interfaces
3.3 The Execution Unit (EU)
    Execution Unit Instructions
    Arithmetic Logic Units
    Multiply-Accumulate Units
    Barrel Shifter
    Exponent Unit
3.4 The Address Unit (AU)
3.5 The Program Control Unit (PCU)
3.6 The A and B Data Memory Interface (ABIF)
4 CARMEL Programming Model And Instruction Set Summary
4.1 CARMEL Core Programming Model
    Operand Data Types and Formats
    Operand Data Registers and Memory Maps
    Data Operand Addressing Modes
4.2 Instruction Set Architecture
    Instruction Formats
    Instruction Memory Map
    Instruction Addressing Modes
    Instruction Summary
    Configurability And The Instruction Set
5 The CARMEL System Architecture
5.1 Representative External Systems
5.2 A Representative CARMEL-Based System-on-a-Chip
5.3 The CARMEL Core Module
5.4 CARMEL Memory Modules
    Program Memory
    CLIW Memory
    Data Memories
5.5 System Interconnect Modules
    Core-to-System Interface
    FPI Bus, Clock and System Configuration Units
    Emulation Unit
    GPIO
5.6 System Peripheral And Memory Modules
    Interfaces
    Controllers
    FPI-Bus System Memories
6 Example Programs And Benchmarks
6.1 FIR Filters
    Single-Sample Real Non-Symmetrical FIR Filter
    Block Real Symmetrical FIR Filter
6.2 Vector Quantization

1 Introduction
This CARMEL Core is the first in a family of 16-bit, fixed-point digital signal processing (DSP) cores that target advanced communications and consumer applications. Its modular design architecture allows for complete system-on-a-chip (SoC) implementations using advanced Electronic Design Automation (EDA) methodologies in an integrated development environment.

The CARMEL Core delivers high performance and an efficient DSP instruction rate without sacrificing power dissipation and code compactness requirements. The patented Configurable Long Instruction Word (CLIW™) architecture sets the core apart by allowing Very Long Instruction Word (VLIW) performance at the low cost of traditional DSP architectures.

CARMEL's memory-oriented design is characterized by its flexible instruction set and high-performance arithmetic and addressing units. Instructions can be customized through CLIW.
CLIW provides a high degree of parallelism, with the ability to simultaneously generate four addresses and perform four arithmetic operations and two transfers.

The important design factors in modern communication and consumer products are Performance, Power and Price. These design factors apply to the system's key component, the digital signal processor. These criteria require that DSP cores and their system solutions give the designer all the choices needed to optimize those same factors in a system-on-a-chip for a particular application. The CARMEL Core has been created to enable those design choices. It can be configured for your application in instruction set as well as in hardware.

The CARMEL™ DSP Product Line provides high performance in the system's DSP tasks, control tasks and system performance because the memories, peripherals and input/output operate in a balanced way with the processing elements. High performance is not just in the core processor but throughout the system. Low system power comes through careful overall design. System price is low due to compact modular designs with just the needed functions, and shortened development times through the use of a complete suite of tools with large libraries of both software functions and synthesizable circuit modules.

The sections that follow provide a detailed summary of the CARMEL™ DSP Product Line with specifics as to how these benefits are achieved.

[Figure 1-1. The Powerful CARMEL Core Execution Unit. Blocks: Shifter, Exponent, ALU 1, MAC 1, ALU 2, MAC 2, six 40-bit accumulators]

2 The CARMEL™ DSP Core
The CARMEL brand name applies specifically to the core digital signal processor available from Infineon Technologies and others as a licensable Register Transfer Level (RTL) description or macro circuit designs. This description, when used with an RTL synthesizer, produces the central portion of a processor that executes the CARMEL Instruction Set.
The CARMEL Product Line also includes all of the shaded items shown in Figure 2-1 to provide a complete design solution for a System-on-a-Chip (SoC) integrated circuit. These items are peripheral circuit RTL description modules as well as a full set of software development tools for writing, testing and debugging programs. All modules and software are added into the design flow of the licensee for possible use with other system element designs and tools.

[Figure 2-1. The CARMEL Product Line Design Solution. Blocks: CARMEL Core RTL Description, User Design RTL Description, CARMEL Peripherals RTL, CARMEL DSP Library, User Software, CARMEL Software Development Tools, RTL Logic Synthesizer, Emulator; resulting System-on-a-Chip: Peripherals, CARMEL Core, Data Memories, Program Memories]

2.1 CARMEL Synthesizable Product Modules
The CARMEL product modules fall into the four functional groups shown in Figure 2-2. The signal processing core connects directly with its program and data memories. Connections to other system peripherals, larger memories and input/output devices are made with the Flexible Peripheral Interconnect Bus (FPI Bus). The core itself interfaces to the bus through units that also provide interrupt processing and DMA transfers. A block diagram of a representative system-on-a-chip using the CARMEL product modules is in Figure 2-3.

For this CARMEL Core, the data precision is 16 bits fixed-point in an address space of 16 bits. Instructions are 24 bits in an address space of 24 bits. For higher performance these fundamental dimensions are increased in some portions to 32 and 40 bits for data precision, to 24 bits in a data I/O space and to 48- and 144-bit instruction formats.

All the CARMEL product modules are designed to be device technology independent for static circuit implementations with all transfers on positive clock transitions of a single phase clock.
This conservative, easy-to-synthesize design style and the core's eight-stage pipeline allow high-performance clock rates of at least 250 MHz at 1.8 Volts in a 0.18 µm process for typical systems-on-a-chip.

The CARMEL Core Module
This core is a uniquely crafted match of a powerful, highly modular hardware architecture with an equally modular instruction set architecture. The modular core hardware allows for choices in the complexity of DMA, interrupt control and data memories, for example. The instruction set modularity provides the performance of longer-word multiple parallel computations when needed or the economy of short-word simple instruction execution when it is not. The core itself handles arithmetic processing of data, generating addresses, data memory interface, and program and system control including address generation for program memories and instruction issuing.

[Figure 2-2. CARMEL Product Line Functional Modules. Blocks within the CARMEL-Based System-on-a-Chip: System Peripherals and Memories (with External System Interfaces), System Interconnect, CARMEL Core, CARMEL Memories]

The following sections detail the specifications and architectures for these core functions.
Arithmetic Functions
The six computation sub-units of two 16 x 16-bit MACs, two 40-bit ALUs, a 40-bit barrel shifter and a 40-bit exponent unit, with six 40-bit accumulators, shown in Figure 1-1 provide a full function set:
- 16, double-16, 32 and 40 bit data types plus single-bit manipulations
- SIMD operations on double 16-bit data operand pairs within a single ALU
- Logical and arithmetic shifts, extract, insert and logical operations
- Minimum and maximum operations with address register and Viterbi back trace register
- Fractional and integer arithmetic and a normalizing exponent operation for block floating-point
- Limiting, saturation, automatic FFT scaling and rounding modes of nearest and convergent
- Multiply for signed and unsigned operands and superscalar parallel adder/subtractor accumulation
- Multistate conditional execution
- Iterative division support
- Two secondary accumulators for fast context switching with a bank exchange instruction

Additionally, application specific accelerators or co-processors (e.g. convolutional encoders) are easily integrated.
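
As an illustration of the multiply-accumulate, saturation and rounding features listed above, the following is a minimal Python sketch of a 16 x 16-bit fractional MAC feeding a 40-bit accumulator. It is not CARMEL code: the function names, the Q15 fractional convention (product shifted left by one bit) and the round-to-nearest step are assumptions based on common fixed-point practice, since this overview does not specify the exact arithmetic.

```python
ACC_BITS = 40  # accumulator width stated in the text


def saturate(value, bits):
    """Clamp a signed integer to the range of a two's-complement word of `bits` bits."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def mac_q15(acc, x, y):
    """Hypothetical fractional MAC: acc + x*y with 16-bit signed operands.

    Assumes the usual Q15 convention: the Q30 product is shifted left
    by one bit to Q31 before accumulation, and the sum saturates at 40 bits.
    """
    assert -32768 <= x <= 32767 and -32768 <= y <= 32767
    product = (x * y) << 1
    return saturate(acc + product, ACC_BITS)


def round_to_q15(acc):
    """Round the Q31 accumulator value to nearest and saturate to a 16-bit result."""
    return saturate((acc + 0x8000) >> 16, 16)


def fir_sample(coeffs, samples):
    """One FIR output sample computed as a chain of MAC operations."""
    acc = 0
    for c, s in zip(coeffs, samples):
        acc = mac_q15(acc, c, s)
    return round_to_q15(acc)
```

The eight bits above the 32-bit product act as guard bits, so a long sum of products can exceed the 32-bit range temporarily and is only limited when the result is written back as a 16-bit word.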
Data Address Generation
The 30 registers of the address register file, the four address ALUs and a stack ALU provide full data memory addressing capability:
- Dual data memories A and B in a single 64k words address space
- Four buses for up to four separate memory banks which may be odd and even addressed, single and/or dual port
- Four simultaneous addresses per instruction cycle with four address modifications: linear, modulo (aligned), special modulo (non-aligned) and bit-reverse
- Independent base, offset or index, displacement, limit and modulo registers including secondary registers for fast context switching with a bank exchange instruction
- Modes for direct (immediate) or indirect addressing, registers or memory, with or without post modification or indexing, and single or double word access
- Memory conflicts resolved by automatic wait-state insertion

[Figure 2-3. A Representative System-on-a-Chip Using the CARMEL Product Modules. Groups: CARMEL Core (Execution Unit, Address Unit, Program Control Unit, Memory I/F), CARMEL Memories (A RAM, B RAM, program and CLIW RAM/ROM), System Interconnect (Core-to-System I/F with DMA, Interrupts and FPI Bus I/O; Core Emulation Unit; FPI Bus), System Peripherals and Memories (Timers, CoProcessor, Peripheral Ports and Controllers, JTAG Interface, Host Interface, External Memory Interface, System Memories: SRAM, DRAM, ROM, Flash)]

Program and System Control
The program counter, instruction pipeline with decoding, loop counters and interrupt control provide the following advanced programming features:
- Single-cycle execution with an 8-stage pipeline
- 24-, 48- and 144-bit instruction words that match code density to processing load
- Configurable Long Instruction Word (CLIW™) of 96 bits defines up to six custom parallel operations for user algorithms
- CLIW operations reusable with different 48 bits of operands and different execution conditions
- Superscalar parallel execution of two 24-bit instructions (MIMD)
- Flexible multiple-state conditional execution strategies for computation instructions
- Full set of condition branching program control instructions
- Zero-overhead loop and repeat instructions with four nesting levels
- Exception processing: 240 vectored interrupts, trap and breakpoints for emulation support and full stack operations
- Fast context switching with a register bank exchange instruction and conditional execution load instruction
- Single and parallel move operations for data and program with memories and I/O spaces

CARMEL Memory Modules
The sophisticated memory control features built into the CARMEL Core mean that straightforward synchronous memory designs can be used for the core memories.
The three CARMEL memories can have the following characteristics:

Program Memory:
- SRAM or ROM
- 48-bit data

CLIW Program Memory, which is an optional memory:
- SRAM or ROM
- 96-bit data with 10-bit address selection

Data Memories:
- SRAM
- 16-bit data words with 16-bit address selection into regions that are multiples of 2k words (11 bits)
- Three configurations: Single-port, dual-port with even/odd addressing, true independent dual-port

System Interconnect Modules
System Peripheral Modules can be thought of as arrayed along a System Interconnect extending from the CARMEL Core. This interconnect is the basic Flexible Peripheral Interconnect (FPI) Bus augmented with conforming DMA and Interrupt control signals. The basic FPI Bus has its own controller while the interrupt and DMA controllers are part of the system interface to the CARMEL Core. The core emulation unit also acts as an interface controller to the core from the system interconnect for its special function in the system. The GPP general purpose port is for eight simple programmed I/Os.

The FPI Bus
- FPI Bus Control: 250 MB/second data transfers on a 16-bit data bus in a 16 M address space with master/slave operation. This enables high transfer bandwidths without slowing core program execution.
- System Configuration Control: Distributes system configuration parameters that are determined at reset time
- System Clock & Reset: Distributes system clock and reset signals based on parameters determined at reset time

The CARMEL Core-to-System Interface
- DMA Controller: A modular design with up to eight channels in three priority groups. It makes independent data transfers between core data memories and FPI-Bus peripherals as well as peripheral-to-peripheral transfers without processor intervention.
- Interrupt Controller: Up to 240 prioritized maskable vectored interrupts
- Core-to-FPI Interface (FPIU): Provides data queueing and address generation for the core I/O operations on the FPI bus

CARMEL Core Emulation Unit
- Its intimate connection to the core permits off-line breakpoint intervention and analysis with an emulation monitor program

System Peripheral And Memory Modules
These are proven designs by Infineon for common system peripherals that reside on the FPI Bus System Interconnect.

External Interfaces
- External Memory Bus Interface: Maps external memories of various types with programmable choices of timing and address selection
- Host Interface: A full function buffered interface to big or little endian microprocessors
- JTAG Emulation Interface: A companion to the core's Emulation Unit that allows an off-chip host to control emulation and the on-chip debug support

Controllers
- Programmable Parallel and Serial Interfaces: Generic designs that accommodate a wide variety of synchronous and non-synchronous data rates and configurations
- Timer: Multiple counters for real-time system control, time-outs, etc.

FPI-Bus Memories
- Larger, slower, denser memories

2.2 CARMEL Product Development Tools
Infineon's established CARMEL DSP Alliance initiative provides, through third-party suppliers, a complete suite of hardware and software tools to aid the system designer. These assist software writing and debugging as well as prototyping, testing and debugging of system hardware.

Software
The software tools are available as a complete integrated program development environment with uniform interfaces on PC workstations. A modern editor with a macro assembler is used for writing and linking programs, which are then run on the instruction-cycle-accurate and bit-true software simulator. Debugging is easy with breakpoint capability, and optimization is enhanced with profiling and resource utilization views.
As a further aid, CARMEL models run on a high-level DSP design environment tool suite. Optimized code from an ANSI C compiler takes advantage of the unique CARMEL features. Algorithm libraries are available in both C language and assembly language routines for common DSP applications and functions. A flexible Real-Time Operating System (RTOS) adds just the real-time controls needed for the multiple tasks in an individual application. This approach keeps response times fast and RTOS program size small. The RTOS also facilitates task-level debugging.

Emulation Hardware
These tools permit running programs in real time and within the actual application's hardware system. The CDEV Development Chip is a CARMEL Core with large on-chip memories, a typical DMA and interrupt controller, an emulation unit and many input/output interfaces for off-chip memories, controllers and additional peripherals. The CAREB Evaluation Board puts the CDEV chip into a plug-in, ready-to-run system with additional memories, peripherals and standard interfaces for easy test and debugging of programs. Field programmable gate arrays (FPGAs) speed prototyping of hardware designs for system elements on the FPI bus.

3 The CARMEL Core Architecture
The system aspects of the CARMEL Product Line are significant, but it is the CARMEL Core that is most important. The CARMEL Core does the processing and is at the center of the system; the system is defined largely by how the core provides for extending itself. It is the core that implements the Instruction Set Architecture (ISA) and the man-years of intellectual property that this represents in applications algorithms, library functions, design verification, test routines, development software and hardware tools.
This section describes the core's hardware architecture and its interface to the rest of the system, while the following Section 4 describes the ISA in terms of the programming model and instruction set summary.

3.1 Core Functional Units
A CARMEL system's high performance comes from a careful matching of the core's four functional units as shown, with their major components, in Figure 3-1. The Execution Unit's six computational sub-units with accumulators can perform massively parallel data operations in its two sides. The A and B Data Memory Interface Unit assures available data with up to four data streams from the full sized data memories, not just a constricted register file. The data Address Unit generates the required four address streams, created not with simple counters, but with full special function ALUs and a large register set. The Program Control Unit fetches appropriately sized instructions, decodes them, sequences them in the pipeline and issues them to the other units as needed, all while imposing little time or instruction overhead itself and remaining responsive to other real-time system events.

[Figure 3-1. The CARMEL Core's Functional Units. Execution Unit (EU): Shift, Exp, ALU 1 and MAC 1 on the Left side (EU1), ALU 2 and MAC 2 on the Right side (EU2), Accumulators. AB Data Memory Interface (ABIF). Address Unit (AU): Address ALU 0,1, Address ALU 2,3, Stack ALU, Register Set 0-3, Register Set 4-7, Stack Pointer. Program Control Unit (PCU): Interrupts, Program Counter, Loop Counters, Instruction Pipeline]

3.2 Core Interfaces
Figure 3-2 shows the primary data and address interconnections of the core's four functional units and its six interfaces with the remainder of the system. Secondary data and addresses pass between units on the G1 and G2 data buses. The IO Bus interface is for the processor's programmed input/output address space for registers and memory.
The DMA Bus interface provides direct access to the data memory address space. Forty-eight-bit memory goes on the Program Memory Bus interface with optional 96-bit memory on the CLIW Memory Bus. The four ports for the A and B data memories connect to the 136 signals of the Data Memory interface. The System and Control interface provides for program interrupts, general purpose input/output (GPIO) and emulation units that can govern program execution control. System signals are basic clocks and configuration determiners as shown in Table 3-1. Note from the signal names that the interface designs are simple and straightforward: data and addresses are clocked by transitions on enable signals or by a simple distribution of the core clock.

[Figure 3-2. CARMEL Core Functional Unit Interconnections and Interfaces. The Execution Unit, ABIF, Address Unit and Program Control Unit are linked by the G1 and G2 data buses and connect to the Data Memory Buses (A1, B1, A2, B2), DMA Bus, IO Bus, Program Memory Bus, CLIW Memory Bus, and System and Control signals]

Table 3-1. CARMEL Core Interface Signal Summary

Interface | Function | Names | Signals
I/O Bus (IO) | Address | IOA[23:0] | 24
I/O Bus (IO) | Data | IOD[15:0] | 16
I/O Bus (IO) | Control | IORE, IOWE, IOWAIT, IOREG | 4
DMA Bus (D) | Address | DA[15:0] | 16
DMA Bus (D) | Data | DD[15:0] | 16
DMA Bus (D) | Control | DRD, DWR, DREQ, DACK[1:0] | 5
Program Memory Bus (P) | Address | PMA[23:1] | 23
Program Memory Bus (P) | Data | PMD[47:0] | 48
Program Memory Bus (P) | Control | PMWE, PMRE, PMWAIT | 3
CLIW Memory Bus (CI) | Address | CIA[9:0] | 10
CLIW Memory Bus (CI) | Data | CID[95:0] | 96
CLIW Memory Bus (CI) | Control | CIRE, CIWE | 2
Data Memory Buses | A1/A2/B1/B2 Address | A1A[15:0], A2A[15:0], B1A[15:0], B2A[15:0] | 64
Data Memory Buses | A1/A2/B1/B2 Data | A1D[15:0], A2D[15:0], B1D[15:0], B2D[15:0] | 64
Data Memory Buses | A1/A2/B1/B2 Control | A1RE, A1WE, A2RE, A2WE, B1RE, B1WE, B2RE, B2WE | 8
System and Control | General Purpose I/O | GPI[3:0], GPO[3:0] | 8
System and Control | Data Memory Configuration | MEMCFG[13:0] | 14
System and Control | Interrupts | NMIREQ, NMIACK, INTVCT[7:0], VCTREQ, VCTACK, VCTEOI, ERRACK | 14
System and Control | Emulation | Breakpoint and Trace Operands and Control | 23
System and Control | General | CLK, RESET, EXTWAIT, INTWAIT, SWAP, BOOT, PROTECT, PMEM | 8

3.3 The Execution Unit (EU)
The Execution Unit is divided into two fully independent execution units: EU1 or the Left side and EU2 or the Right side, as shown in Figure 3-3. Both share equally the data sources and destinations of the accumulators, A and B data memories, immediate data from instructions and the other core units. EU1 and EU2 can operate in parallel using these shared resources.

Execution Unit 1 has four computational sub-units: an Arithmetic Logic Unit (ALU), a Multiply-Accumulate unit (MAC), a Barrel Shifter (Shifter) and an Exponent Unit (Exponent). Execution Unit 2 has only an ALU and MAC since the Shifter and Exponent operations occur less frequently in most DSP applications. All sub-units execute in a single instruction cycle.

Execution Unit Instructions
The CARMEL Core's high performance with highly efficient instruction programs comes from the unique way it keeps these six computation sub-units busy when they are needed and without large instructions when they are not all needed.
The core in general uses three distinct instruction formats to get operation codes to the sub-units:
- As one of up to six sub-instructions within a reusable 96-bit CLIW operations block with a 48-bit CLIW reference operand instruction
- As a 24-bit instruction with short direct and indirect operand references
- As a 48-bit instruction with longer direct and indirect operand references

Two CLIW sub-instructions can go to each of the execution units EU1 and EU2, with the remaining two sub-instructions used for simultaneous move operations, for a total of 144 bits of instruction. Two 24-bit instructions can be fetched together and generally execute in parallel, so both full execution units can operate in one instruction cycle with just 48 bits of instruction. A single execution unit by itself requires just a single 24- or 48-bit instruction. The following sections describe each component of these two execution units.

[Figure 3-3. The CARMEL Core Execution Unit. Block diagram: EU1 (Left) with Shifter, Exponent, ALU 1 and MAC 1; EU2 (Right) with ALU 2 and MAC 2; six 40-bit accumulators; input and output data bus switches connecting the G1 and G2 buses, I/O, immediate data and the Data Memory Interface.]

Arithmetic Logic Units

The ALUs are full 40-bit by 40-bit units with 40-bit outputs, yet they can also operate on 32-bit long data words, 16-bit words or even two unrelated 16-bit operands in a double data word for certain operations (Add, Sub, Min, Max).
Basic operations are:
- Arithmetic: Add, subtract, negate, absolute value
- Comparison: Compare, test field, minimum and maximum, including a serial back trace of results for the Viterbi algorithm and saving of the minimum/maximum data address
- Logical: And, Or, Not, Xor
- Bit: Set, clear, test, change
- Division iteration steps: High or low resolution
- Ancillary: Automatic scaling, rounding

Multiply-Accumulate Units

The 16-bit by 16-bit multipliers in combination with the 40-bit accumulators provide multiply, square and multiply-accumulate operations with these features:
- Operand signs: All four possible signed/unsigned operand combinations supported
- Formats: Integer or fractional with correct rounding
- Mixed precision: Support for 16-bit by 32-bit multiply with aligned-accumulate instruction
- ALU only operations: Add, subtract or move operations with CLIW sub-instructions
- Dual accumulators: Single 40-bit accumulator may be split into two 16-bit accumulators

Barrel Shifter

The Shifter in EU1 is also a full 40-bit unit, yet it supports 16- and 32-bit operands as well. The basic operations are:
- Logical shift and arithmetic shift with a 6-bit shift value
- Insert and extract bit fields
- Rotate-thru-carry by one bit for 16-, 32- and 40-bit operands

Exponent Unit

This unit determines the 6-bit shift value needed in the barrel shifter to normalize a 16-, 32- or 40-bit input operand. It facilitates the use of block floating-point.

3.4 The Address Unit (AU)

The Address Unit in the core generates four simultaneous 16-bit addresses for the four ports of the data memory. These provide operand memory access for the computational sub-units in the Execution Unit and implement the stack for the Program Control Unit. The AU does not create addresses for DMA transfers with the data memory.
However, it does arbitrate and resolve all memory and memory bus conflicts with wait-states, including those with the DMA.

The Address Unit works with all the operand addressing modes of the core, which cover some registers as well as data memory. The modes for a single 16-bit data address, or for 32-bit data with two sequential addresses, are:
- Direct: Immediate memory address of operand or direct access of operand in a register including post-modification
- Indirect: Generated memory address of operand including post-modification or indexing

There are four operand address modification modes, all of which execute in a single cycle:
- Linear
- Bit-Reverse for the FFT
- Modulo with aligned boundary addresses
- Special Modulo with arbitrary non-aligned boundary addresses

The data operand addresses are generated by four special ALUs operating out of a 30-entry register file. The primary registers are in eight sets where each set has a base register and an offset or index register. Every two sets have a modulo and lower boundary limit register. Six secondary registers shadow their primary equivalents and can be swapped with a single bank exchange instruction to provide a rapid context switch upon interrupts. Immediate values can be used in address modifications as well as registered values. Each group of four sets has an additional fixed displacement register for use in CLIW instructions.

3.5 The Program Control Unit (PCU)

The PCU is truly the heart of the CARMEL Core because it sends out program memory addresses, fetches instructions including CLIW, decodes them and then issues instruction commands to all of the units in the proper sequence. Sequencing is done in an eight-stage pipeline that allows all instructions to execute in a single instruction cycle regardless of whether they are a single 24- or 48-bit instruction (SISD or SIMD), two superscalar parallel 24-bit instructions (MIMD) or a 144-bit CLIW reference operand and operation block.
This sequencing includes determining whether a conditional execution instruction should execute partially, fully or not at all. In addition, the PCU takes into account that execution may be extended by various wait conditions in order to maintain synchronization.

The program counter creates a steady stream of instruction addresses, including repeated instructions and instruction loops that are repeated for a predetermined count, all without any instruction cycle overhead and with nesting up to four levels deep.

The program counter handles exception processing in the core for maskable and non-maskable interrupts, hardware exception traps and debug breakpoints. Full control and handling of the 240-interrupt vector space is done by an external-to-the-core interrupt controller, and the breakpoints are enabled by the system's emulation unit, which can also do single-step execution. Rapid handling of exceptions in the program flow is facilitated by the program stack being located in the fast data memory, by a full set of stack operations that includes conditional ones, and by the bank exchange instruction, which in a single instruction cycle changes a full data address register set and two accumulators.

3.6 The A and B Data Memory Interface (ABIF)

The data memory interface connects the four data memory buses to the other functional units of the core. It resolves data and addresses from the DMA and I/O interfaces as well as data of the Execution Unit with addresses from the Address Unit. It synchronizes the transfers with read delays and memory write-back operations so that the memories themselves can be simple un-registered synchronous designs. The interface assures that read operations use current write-back data, and it can apply saturation arithmetic to write memory data that exceeds the memory size.
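The saturation behavior mentioned for the ABIF can be illustrated with a short C model. This is only a sketch under stated assumptions: the function name `saturate_to_word` is hypothetical, the 40-bit accumulator value is modeled in a 64-bit integer, and the whole accumulator value is clamped to the signed 16-bit range. The source does not say which accumulator field is written or how the core implements the clamp.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a CARMEL API): clamp a signed 40-bit accumulator
 * value, held in an int64_t, to the range of a signed 16-bit memory word. */
int16_t saturate_to_word(int64_t acc40)
{
    /* Assumption: the caller passes a value that fits in 40-bit two's complement. */
    assert(acc40 >= -(INT64_C(1) << 39) && acc40 < (INT64_C(1) << 39));

    if (acc40 > INT16_MAX) {
        return INT16_MAX;   /* positive overflow clamps to the largest word */
    }
    if (acc40 < INT16_MIN) {
        return INT16_MIN;   /* negative overflow clamps to the smallest word */
    }
    return (int16_t)acc40;  /* in range: stored unchanged */
}
```

With this model, a value such as 100000 would be stored as 32767 instead of wrapping around, which is the usual motivation for saturating writes in signal processing code.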
4 CARMEL Programming Model And Instruction Set Summary

Section 3 described the interconnected hardware units of the CARMEL Core. This section summarizes the programmer's view of these units: the instruction set, composed of operations and operands, along with the programming model of the memory and registers that are sources and destinations for the operands.

4.1 CARMEL Core Programming Model

The operand data programming model is defined by the operand data types, the data register and memory map, and the data memory addressing modes.

Operand Data Types and Formats

The CARMEL Core has a basic data precision of 16 bits and all of the data types follow from this precision. A data Word is thus 16 bits as shown in Figure 4-1. It may be signed (two's complement) or unsigned, integer or fractional depending upon the operation. A 32-bit Double Word (data) is two unrelated 16-bit Words composed of Most- and Least-Significant Word fields used in Double Operations with two 16-bit operand sets. A 32-bit Long Word (data) is a double-precision word composed of Most- and Least-Significant Word fields used in 32-bit operations with 32-bit operands. The six accumulators have a 40-bit Accumulator Word with three fields: Guard, Most-Significant Word and Least-Significant Word.

[Figure 4-1. CARMEL Core Data Types. Data Word: 16 bits. Double Data Word: 16-bit MSW and 16-bit LSW. Long Data Word: 32 bits (MSW, LSW). Accumulator Data Word: 40 bits (Guard, MSW, LSW).]

Operand Data Registers and Memory Maps

The data operand portion of the programming model is shown in Figure 4-2. The CARMEL Core is not a RISC load/store architecture, but rather is like a conventional DSP in which the execution units operate out of the large data memories directly rather than through an intermediate register file. Thus, most operations are with the data memory and the accumulators in the top row of Figure 4-2.
Unused data address registers may be used for data, and some operations use the core's internal system registers. In addition, the CARMEL Core has data move operations with the I/O spaces shown in the bottom row of Figure 4-2, including the 512-word portion for registers that are external to the core. Move operations also perform data load and store operations with the program memory shown.

The 64k-word data memory space is divided into four distinct regions. Zone A and B memories may each have single or dual ported regions. With the two buses for both A and B, single port memories may be divided into odd and even addresses as shown to give the effect of dual ported memory transfers without the complexity. The stack can be placed arbitrarily within the data memory space.

Data Operand Addressing Modes

Data operands are specified in the instruction set in any of four modes: as immediate data in the instruction, by a direct reference to the operand register including post-modification, by a direct reference to the operand in data memory with an immediate address, or by an indirect reference to the operand in data memory by specifying the data address unit register and address modification.

[Figure 4-2. Data Operand Programming Model. Top row: the 64K by 16-bit data memory space (A Dual-Port; A Single-Port with A1 Even, A2 Odd; B Single-Port with B1 Even, B2 Odd; B Dual-Port; Control and Status; Stack), the six 40-bit core accumulators and the 30 16-bit core data address registers. Bottom row: the core system register space, the 512-word external-to-the-core register space, the 16M I/O space with 24-bit addresses on the FPI Bus, and the 16M program memory space.]

Addresses for data memory may be 16 or 32 bits where the second 16-bit address in the 32-bit address is for the higher order word in a 32-bit long or double data word.
There are three types of address modification: linear, bit-reverse and modulo, each with increment, decrement and indexed modifications.

4.2 Instruction Set Architecture

The CARMEL Instruction Set Architecture (ISA) is defined by the instruction formats, the program memory map and addressing modes, the instruction types and the instructions themselves.

Instruction Formats

Instruction code efficiency is important because smaller code directly lowers program memory cost and lowers the power consumed in operation. Small code size can, however, come at the expense of performance. The CARMEL Core's modular instruction set gives the programmer the benefits of small instructions for simple operations or longer instructions when multiple parallel operations are required.

The basic instruction Word is 24 bits as shown in Figure 4-3. The Full Word instructions of 48 bits are the same basic operations with larger immediate operand fields and direct operand references. The third size is the Configurable Long Instruction Word (CLIW™) of 144 bits. It is composed of a 48-bit reference instruction in the program memory space and a 96-bit block instruction in the CLIW memory space. The reference instruction identifies the appropriate CLIW block instruction and references up to four operands by their data Memory Address (MA) pointers. The block instruction specifies up to six parallel operations with CLIW sub-instructions that can use the two ALUs, the two MACs and perform two data moves. Note that the field separations shown in the figure are only symbolic for the maximum number of operands and operations and do not represent actual field sizes.

Instructions in the program memory space are fetched 48 bits at a time. Each fetch may be a single Full Word instruction, two separate 24-bit Word instructions that execute sequentially or a single Parallel Instruction Word composed of two 24-bit Word instructions that execute in parallel in the two execution units.
The Parallel Instruction Word format is shown in Figure 4-3 where the higher order Word is designated Left and the lower order Word is Right. In parallel execution the Left instruction operates on Execution Unit 1, designated the Left EU, and the Right instruction operates on Execution Unit 2, designated the Right EU. Instructions in the CLIW memory space are fetched singly, 96 bits at a time.

[Figure 4-3. Instruction Formats. Instruction Word: 24 bits. Parallel Instruction Word: 24-bit Left Word and 24-bit Right Word. Full Instruction Word: 48 bits. CLIW Reference Operands Instruction with CLIW Operations Block: a 48-bit CLIW Reference Operands instruction in program memory (Operands 1 to 4, given as MA 1 to MA 4) and a 96-bit CLIW Operations Block in CLIW memory (Operations 1 to 6: MAC 1, ALU 1, MAC 2, ALU 2, Move 1, Move 2).]

Instruction Memory Map

The two memory spaces for instructions in the CARMEL Core are shown in Figure 4-4. The program memory space is shown 48 bits wide because two 24-bit words are fetched together, even though addressing is for individual words as shown in the programming model of Figure 4-2. Note that the CLIW memory space is mapped into the program memory at an arbitrary location for loading and unloading/verification with move operations. The interrupt vector block's location is set arbitrarily and the bootstrap program is at the very top of the program memory space as shown.

[Figure 4-4. Program Memory Map. Program Memory Space: 48 bits wide, 0K to 16M, containing the Interrupt Vectors, the mapped CLIW Memory and the Bootstrap at the top. CLIW Memory Space: 96 bits wide, 0K to 1K.]

Instruction Addressing Modes

Program memory addresses are specified directly from the program counter register, usually with post modification.
The program counter is loaded directly in instructions with immediate jumps, interrupts, loop and repeat counters or indirectly from the stack in data memory. CLIW memory space addresses are always specified in the CLIW reference instruction in the program memory space.

Instruction Summary

The CARMEL Core instruction set is powerful and flexible, and the C-like assembly syntax makes it easy to learn and program. The 112 distinct instructions of the CARMEL Core are summarized in Table 4-1. They are arranged by operation type: arithmetic, multiply, logical and single-bit operations, program control and system operations. Within an operation type they are listed in order of size, smallest first, with the CLIW sub-instruction counted as the smallest. The majority of instructions are available in all three sizes: as a portion of a CLIW instruction, as a 24-bit instruction Word or as a 48-bit Full Word with extended operand references.

Also listed in Table 4-1 for each instruction is the execution unit side where the operation can take place, which is the same as the side that the 24-bit Instruction Word must occupy in a 48-bit Parallel Instruction Word. The choices of side are:
- ANY. Any side, Left or Right, including both Left and Right.
- ONE. Only one such Instruction Word per Parallel Instruction Word, as for example with program control. It may be in either side, just not in both.
- LEFT. Only in the Left Execution Unit (EU1) and therefore the Left side of the Parallel Instruction Word. Examples are operations that use the shifter or exponent units, which exist only in EU1.

There are five inherently conditional instructions, such as a conditional branch, where execution depends upon a single selected condition code. Most other instructions, including CLIW sub-instructions, are conditionally executable depending upon a very flexible condition mechanism.
Execution can depend on any of the eight combinations of three selected condition codes from a choice of sixteen. The condition codes can change dynamically each cycle or statically under program control by setting the conditional execution register. Execution may be suppressed completely or only partially where pointers are modified but the final results and flags are not changed.

Double operations are the SIMD (Single Instruction Multiple Data) operations using the 16-bit operands in double data word formats. The Back Trace Registers store from left to right the sequential comparison results of minimum or maximum operations for accelerating the Viterbi algorithm. The MINMAX address register holds the data address associated with minimum or maximum operations. The updating of the Shift-Right bit permits automatic FFT scaling.

Table 4-1. Instruction Set Summary

Columns: Instruction Operations | Side | Instruction Size marks (size columns: CLIW Sub-Instruction, 24 Bits, 48 Bits). A row with three marks is available in all three sizes. Rows with fewer than three marks do not show here which of the three size columns the marks belong to.

Arithmetic (56)
Absolute Value, Decrement, Increment, Limit, Negate, Round, Division Step High, Division Step Low, Add, Add With Carry, Add Absolute Value, Add Double, Add High, Add Low, Compare Signed, Compare Unsigned, Test Field For Ones, Test Field For Zeros, Max (2 & 3 Operands), Max & Update Back Trace Register (2 & 3 Operands), Min (2 & 3 Operands), Min & Update Back Trace Register (2 & 3 Operands), Subtract, Subtract With Borrow, Subtract Double, Subtract High, Subtract Low | Any | ✓ ✓ ✓
Max & Update MINMAX Address Register (2 Operands), Min & Update MINMAX Address Register (2 Operands) | One | ✓ ✓ ✓
Exponent, Extract Signed, Extract Unsigned, Insert, Rotate Left Thru Carry, Rotate Right Thru Carry, Shift Arithmetic, Shift Logical | Left | ✓ ✓ ✓
Add & Round, Add & Scale, Clear Accumulator, Subtract & Scale, Subtract & Round | Any | ✓
Max Double, Min Double | Left | ✓
Clear Back Trace Register | Any | ✓
Clear Accumulators, Clear & Round Accumulators, Update Shift-Right Bit, Trace | One | ✓
Max & Update MINMAX Address Register (3 Operands), Min & Update MINMAX (3 Operands) | One | ✓
Limit 8 Bit | - | ✓ ✓

Multiply (11)
Multiply, Multiply & Round, Multiply-Accumulate, Multiply-Accumulate & Round, Multiply-Accumulate Aligned, Multiply-Subtract, Multiply-Subtract & Round, Square, Square & Round, Square-Accumulate, Square-Accumulate & Round | Any | ✓ ✓ ✓

Logical (4)
And, Not, Or, Xor | Any | ✓ ✓ ✓

Single-Bit (4)
Change Bit, Clear Bit, Set Bit, Test Bit | One | ✓ ✓

Program Control (23)
Return, Return Conditional, Return From Interrupt, Return From Interrupt Conditional, Leave, Trap | One | ✓
Branch Absolute (Indirect), Branch Conditional, Branch Relative, Break, Continue, Call Absolute, Call Conditional (Relative), Call Relative, Link, Pop, Push, Repeat, Block Repeat | One | ✓
Nop | Any | ✓
Set Conditional Execution Flag, Clear Conditional Execution Flag | - | (marks shared with the next row)
Load Conditional Execution Flags | - | ✓ ✓ ✓ (three marks in total across this row and the previous one)

System (14)
Move | Any | ✓ ✓
Move Unsigned, Change Pointer Register | Any | ✓ ✓
Bank Exchange of Pointer Registers, Load Pointer & Mode | One | ✓ ✓
Change Groups of Pointer Registers, Interrupt Enable, Interrupt Disable | One | ✓
Move EXT, Move From Data To Program, Move From Program To Data, Move IO, Swap, Reset | - | ✓ ✓

Configurability And The Instruction Set

A major design objective for the CARMEL Core architecture was that it be configurable to optimize speed and code efficiency for a given application. This objective has clearly been met in a hardware sense with the core's extensibility in many directions and the modularity inherent in an individually configurable core design. But configurability can also be obtained in the instruction set, as the CARMEL Core confirms. The primary configuring mechanism is the use of the Configurable Long Instruction Word (CLIW) operations block.
Once selected, these 96-bit instructions, composed of up to six individual parallel sub-instructions, can be called by multiple 48-bit CLIW reference instructions in a variety of contexts and with very different operands and execution conditions. Their operation is much like a conditional subroutine with various input operands. Thus, once the algorithm design has been done, the processing core is configured to repeatedly perform the same efficient operations required by the application. Figure 4-5 illustrates how this task is achieved.

The data path itself has a high degree of configurability that tends to be constant throughout an application. Typical configurable settings are rounding methods, saturation and limiting on overflow, scaling during multiplication and memory operations, and variable scaling strategies for block floating-point and the FFT.

The CARMEL Core's unusually extensive set of conditionally executable instructions, including the CLIW and SIMD ones, provides another means of rapidly re-configuring the architecture, in a sense. Consistency of execution time is often critical in real-time digital signal processing, and using conditional execution is an established technique to ensure it. It is particularly flexible in this core because the conditions can be complex, dynamically determined or statically changed, and they can suppress only the data execution or the full operation.

[Figure 4-5. Using The CLIW Power on All CARMEL Core Execution Units and for Two Data Memory Transfers. The 48-bit reference operands instruction in program memory supplies memory addresses MA 1 to MA 4 (Operands 1 to 4) to the data memories; the 96-bit operations block in CLIW memory supplies Operations 1 to 6 (MAC 1, ALU 1, MAC 2, ALU 2, Move 1, Move 2) as EU1 instructions, EU2 instructions and transfers.]

5 The CARMEL System Architecture

A system-on-a-chip (SoC) design solution needs a complete system architecture, not just an isolated processing core architecture. The CARMEL Product Line provides this. Beyond meeting the signal processing needs of the system, the core must provide for other system functions that may or may not be on the same chip. General system partitioning issues that are important in assessing an architecture are:
- System control: with or without a host that may be an internal or external host
- Testing and emulation: both in development and production phases
- Support for external buses and standard interfaces
- Internal/external component choices: for additional processing power, large memories and large peripherals
- System functions: clock generation and synchronization, power management

5.1 Representative External Systems

For the purposes of understanding the CARMEL system architecture, consider using an external system like the one shown in Figure 5-1. It is a composite of most of the possible external components to illustrate how they can be accommodated in a CARMEL-based SoC. It is not necessarily a typical system.

Given large complex system control tasks, established bodies of proven control software, and large control peripherals, an external microcontroller host is often dictated in such a system. Thus the digital signal processing system-on-a-chip must connect effectively to common microcontrollers with standard interfaces. The host is also often the system self-test controller using JTAG serial scan paths with all system elements.

Because of sheer size, power or mixed-signal requirements, some peripheral circuits may not be able to be included on the SoC. These peripherals must be easily connected using standard interfaces and buses.
They are represented in Figure 5-1 by peripherals 1-4 on a serial and a parallel bus.

For systems with large memories, the economy of large off-chip DRAMs may be required. These memories need to be added cost-effectively with their special timing characteristics.

Additional processing power may be needed. Can the co-processor or accelerator be internal, or must it be external?

Many systems already have a system clock whose frequency is chosen. The signal processing SoC often must operate synchronously with the system clock. Normal clock rates, sleep clock rates, gated clocks and clock distribution also figure heavily in system power management. These functions must be controllable in SoC applications.

Historically, in small systems and even some large ones, just a few general purpose programmed I/O signals can handle critical timing and configuration control. They require no special interface circuits or interrupts, and are programmed as general purpose I/O.

[Figure 5-1. A Composite External-Host CARMEL-Based System. The CARMEL-based signal processing system-on-a-chip connects through its serial and parallel ports to peripherals 1-4 on a serial and a parallel peripheral bus, through the JTAG interface to the serial system test path, through the host interface to the host bus (host, host memory, host I/O, host peripherals), through the external memory interface to DRAM, and through the GPP and system interface to the system control signals and the system reset, clock and configuration.]

5.2 A Representative CARMEL-Based System-on-a-Chip

Figure 5-2 shows a CARMEL-based SoC configured to fit into the external system of Figure 5-1. Looking at each module in turn illustrates how the CARMEL architecture and standard modules can meet the representative system requirements.
[Figure 5-2. A Composite CARMEL-Based System-on-a-Chip Illustrating the System Architecture. The CARMEL Core (EU, AU, AB I/F, PCU) with its CARMEL memories (data memories A RAM and B RAM with a co-processor; instruction memories RAM and ROM on the Program Memory Bus and the CLIW Memory Bus); the Core-to-System I/F (DMA, FPIU on the IO Bus, ICU for interrupts) and the Core Emulation Unit (EMU); the FPI Bus and Controller (FBCU) with system memories (SRAM, DRAM, ROM, Flash), timers, peripheral controllers for the serial and parallel ports, the JTAG I/F, the Host I/F and the BIU external memory interface; the GPP and the system clock and configuration signals.]

5.3 The CARMEL Core Module

The core is shown fully connected in Figure 5-2 to use the extended architecture on all six interfaces of Table 3-1.

5.4 CARMEL Memory Modules

For all except the largest DRAMs, cost, speed and power are improved with on-chip memories. Thus all of the direct CARMEL memory interfaces are designed for on-chip single-cycle synchronous memories, except program memory during boot-up.

Program Memory

Program memory can be all ROM for the greatest size, power and reliability benefits. Most often, the ability to download programs into RAM to change functionality or provide an upgrade path requires a mix of the memory types. In some cases only the simplest bootstrap is in ROM. The CARMEL system architecture permits all of these combinations because it allows move operations between program memory and other system memories. Slower, byte-wide bootstrap memories can be elsewhere in the system, including off-chip, as determined by the core configuration settings.

CLIW Memory

Similarly, CLIW memory can be RAM and/or ROM since it appears in the program memory space and may also be downloaded.
This memory is optional if the benefits of the parallel CLIW functionality are not needed.

Data Memories

The choice of memories and non-memory elements in this address space is very large. Both RAM and ROM may be used as well as memory-mapped co-processors. The highest performance comes with using all four buses with separate A and B zone memories. Within each zone there can be the smaller single-port designs and the higher performance dual-port ones, including designs with separate even and odd address ports. Co-processors located in this space are typically for arithmetic acceleration on local data with deterministic processing times.

5.5 System Interconnect Modules

The designer of a system with the CARMEL Core can integrate the core in any manner they choose provided they observe the protocols at the six signal interfaces in Table 3-1. However, the CARMEL product modules are designed to use the Flexible Peripheral Interconnect (FPI) Bus to extend the architecture for other than the direct memories described above. Table 5-1 summarizes the characteristics of the bus as permitted by the FPI Bus specification, supported by the CARMEL Core, and implemented in the CDEV development chip.

The FPI specification only specifies the FPI Bus with master and slave interfaces for various types of data transfers. Additional so-called sideband bus signals for DMA and Interrupts are permitted but are allowed to be implementation dependent. Certain System clocks and resets are defined as well. For the CARMEL product modules, the FPI Bus, the DMA signals, the Interrupt signals and the distributed System signals are all defined and are shown schematically in Figure 5-2. These signals are largely determined by the CARMEL Core-to-System Interface and they can be changed as desired. For example, although the core has 16-bit data paths, it could easily be interfaced to a conforming 32-bit FPI Data Bus by redesigning the Core-to-System Interface sub-units.
Core-to-System Interface

As Figure 5-2 shows, three sub-units in the core-to-system interface extend the core's own DMA, IO and Interrupt interfaces onto the augmented FPI bus. The DMA unit has data buffering and dual address generation for eight independent channels of DMA with three priority levels. The modular Interrupt Control Unit (ICU) can be cascaded to add, in groups of sixteen, the masking, priority and vector multiplexing control for the core's interrupt interface. The FPI Unit (FPIU) buffers and sequences the core's own IO bus for normal data transfers. When the core must act as a master on the FPI Bus, as for DMA and most core-initiated I/O, it also executes the required bus protocol.

FPI Bus, Clock and System Configuration Units

The FPI Bus requires a controller and default master for its operation. This is done by the FPI Bus Control Unit (FBCU) module in the CARMEL product line. Other bus masters may be the default master that arbitrates the bus use.

Common clocks and resets are defined for a CARMEL-based system-on-a-chip. To optimize power and performance trade-offs, the FPI bus clock, the core clock and the oscillator or external reference clock frequencies can all be determined independently. Power management clock control signals are distributed in the system signals. Reset signals are defined separately for bus interfaces and peripherals and the system reset for processors like the core. Configuration signals distribute to the system a uniform set of information about the FPI bus and system memory configurations.

Table 5-1. FPI Bus Design Choices

FPI Parameter | Permitted By The FPI Specification | Supported By The CARMEL Core | Implemented In The CDEV Development Chip
Address Bus size | 16 - 32 bits | 24 bits | 24 bits
Data Bus size | 16, 32, 64 bits | 16 bits | 16 bits
Data transfer size | 8, 16, 32, 64 bits | 8, 16 bits | 8, 16 bits
Data transfer modes | Single, split, block, DMA | Single, DMA | Single, DMA
Clock Rate | - | - | 125 MHz
Resets | System and/or Interfaces | - | System and Interfaces

Remaining rows of Table 5-1 (groups Bus, DMA, Interrupt and System), with cell values in source order: Masters internal 16 - 6: DMA(3), FPIU(1), JTAG(2) - - Masters external Slaves internal 6 Slaves external DMA Channels external - - - 240 Interrupts internal 4: 5-2 4: 7,6 = Host; 1,0 = Core Interrupts external 6: DMA, EMU, JTAG, BIU, HI, SCU 5 DMA Channels internal Master(12), Slave(12) Master(4), Slave(4)

Emulation Unit

The core emulation unit is interfaced to the FPI bus rather than directly to an external interface. This permits other processor emulation units to share the single augmented JTAG external interface and allows all emulation units to work in concert to test and debug the complete system-on-a-chip. The CARMEL Core Emulation Unit's intimate connection allows trapping and breakpoint support for off-line tracing after a breakpoint or error trap condition using then-resident emulation software.
GPIO

The four programmed inputs and four programmed outputs of the core's General Purpose Port (GPP) can be distributed on the chip for direct control without using the FPI Bus. They can also be brought out as an external interface, as shown for the industry-standard General Purpose Input Output (GPIO).

5.6 System Peripheral And Memory Modules

CARMEL system product modules that go on the FPI Bus fall into three groups: those that directly provide an interface for off-chip connection to a specific device, those that provide some autonomous controller function (with or without creating an external generic bus), and those that are on-chip system memories.

Interfaces

For a variety of system partitioning reasons, such as having replaceable program ROMs or large DRAMs, it may be desirable to have off-chip memories. They are easily added in the large 24-bit FPI Bus address space supported by the core. The system-wide configuration signals can facilitate the address mapping, data alignment and timing control. DMA priorities ensure that large data movements take place at appropriate times.

External host interfaces can be added to the FPI Bus with a data format and priority for real-time control that suit the application. Transfers can be DMA, programmed or interrupt driven.

Industry-standard JTAG test and on-chip debug support connections can be made with an Infineon-specified JTAG interface. This interface helps implement a comprehensive test/debug solution, which is even more important in the SoC context.

Controllers

Many industry-standard interfaces involve operations that are autonomous, or independent of the direct timing of transfers on the FPI Bus. Some of these interfaces are generic, such as synchronous and asynchronous serial interfaces (for example the UART) and parallel interfaces. Others are application specific, such as audio codecs or LCD displays.
An asynchronous/synchronous serial interface module supports 8- or 9-bit half- or full-duplex transfers at programmable rates from less than one baud to more than one megabaud with a 25-MHz input clock. A parallel port interface module is organized as three 8-bit input/output ports for control signals and/or data transfers.

General purpose timers are a common autonomous controller in real-time DSP systems. The CARMEL system product timer module has three 32-bit counters that are very flexible in configuration and operating mode. They can count events or time intervals with a variety of clock, signal transition, reload and service request types.

FPI-Bus System Memories

The DMA and FPI Bus transfer modes easily accommodate the various timing constraints of larger, slower internal system memories. These memories may be additional SRAM and ROM, or embedded DRAM and Flash with their own controllers.

6 Programming Examples

The following two DSP functions have been programmed on the CARMEL Core. These annotated examples show how straightforward it is to program the core and use some of the features in the instruction set. They also provide performance numbers for comparison with other processor benchmarks.

6.1 FIR Filters

The Finite-Impulse-Response (FIR) filter is the most common DSP function in many applications. With simple but highly repetitive arithmetic and simple data structures, it is the primary benchmark for measuring computation-limited speed. A FIR filter is a continuing computation over time of the form:

\[ y(t) = \sum_{n=0}^{N-1} x(t-n)\, w(n) \]

where y is the filter output and x is the uniformly sampled input signal, indexed by t over time. N is the number of points in the filter impulse response, also called the filter taps or the coefficients of the weighting function w.
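As a plain reference for the FIR definition above, the sum can be sketched in Python. This is only an illustrative model, not CARMEL code: the function names `fir_output` and `fir_filter` are hypothetical, and the convention that samples before the start of the input are zero is an assumption made for the example.

```python
def fir_output(x_history, w):
    """Return y(t) = sum over n of x(t-n) * w(n) for one output point.

    x_history[n] holds x(t-n), so x_history[0] is the newest sample.
    """
    if len(x_history) < len(w):
        raise ValueError("need at least N input samples")
    return sum(x_history[n] * w[n] for n in range(len(w)))


def fir_filter(x, w):
    """Filter a whole sequence, assuming samples before x[0] are zero."""
    n_taps = len(w)
    y = []
    for t in range(len(x)):
        # Gather x(t), x(t-1), ..., x(t-N+1), substituting zero before the start.
        history = [x[t - n] if t - n >= 0 else 0 for n in range(n_taps)]
        y.append(fir_output(history, w))
    return y
```

Feeding a unit impulse through `fir_filter` returns the coefficients themselves, which is a quick way to check that the indexing matches the definition.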
Common algorithm variations cover real or complex input data, real or complex filter coefficients, N being even or odd, and the coefficients being symmetrical (w[N/2 + n] = w[N/2 - n]). Common arithmetic variations involve the signal and coefficient precisions, the amount of accumulation precision, and scaling and/or saturation strategies. Additional considerations when selecting an algorithm can be speed (i.e. maximum real-time bandwidth), minimum total program size, or minimum data memory size including accumulator usage.

Two variations are provided here: one computes a single output point on real data with a non-symmetrical impulse response, and the other generates two output points with a symmetrical impulse response. The symmetrical case in the following section shows the power of the CLIW instructions, while this example demonstrates the power of CARMEL parallelism.

Single-Sample Real Non-Symmetrical FIR Filter

Using the data memory and register utilization shown in Figure 6-1, the following program computes a single output sample for an N tap filter:

    FIR1:                                   //Single-Sample Real Non-Symmetrical FIR Filter
    {                                       //Prolog
        clr(a0,a1) || rep(N/2) single;      //Clear accumulators a0 and a1, and repeat the next
                                            //single instruction N/2 times
        {                                   //Kernel
            a0 += *r6++ * *r2++ || a1 += *r6++ * *r2++;
                                            //Accumulate in a0 and a1 the products of the even and
                                            //odd data and coefficients respectively as pointed to
        }                                   //by r6 and r2 which are sequentially post-incremented
                                            //by one
                                            //Epilog
        *(r0++) = a0 + a1;                  //Add a0 to a1 and store in location pointed to by r0
    }                                       //End

Here || specifies parallel operations, + is addition, * is multiplication, += is accumulation, and *r0++ is the data at the indirect address in address register r0 with post-incrementation by one.
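The behavior of the FIR1 kernel can be modeled in a few lines of Python. This is a hedged sketch under stated assumptions, not a cycle-accurate simulation: memories are plain Python lists, `w_mem[k]` is assumed to hold w(k), `x_mem[k]` is assumed to hold x(t-k) (newest sample first), and the shared pointers r6 and r2 are assumed to advance as though the left half of the parallel word ran before the right half. The names `fir1_model` and `pad_to_even` are hypothetical.

```python
def pad_to_even(w):
    """Append one zero-valued coefficient if the tap count is odd."""
    w = list(w)
    return w + [0] if len(w) % 2 else w


def fir1_model(w_mem, x_mem, n_taps):
    """Model FIR1 and return the value that would be stored through r0.

    Assumed layout: w_mem[k] = w(k), x_mem[k] = x(t-k).
    """
    if n_taps % 2:
        raise ValueError("N must be even; pad w with a zero coefficient")
    r6 = 0  # coefficient pointer
    r2 = 0  # data pointer
    a0 = 0  # even accumulation, cleared in the prolog
    a1 = 0  # odd accumulation, cleared in the prolog
    for _ in range(n_taps // 2):  # rep(N/2) single
        # Left half of the parallel word: even-indexed product.
        a0 += w_mem[r6] * x_mem[r2]
        r6 += 1
        r2 += 1
        # Right half: odd-indexed product, using the already advanced pointers.
        a1 += w_mem[r6] * x_mem[r2]
        r6 += 1
        r2 += 1
    return a0 + a1  # epilog: add a0 to a1
```

For an odd tap count, `pad_to_even` supplies the extra zero-valued coefficient so that the loop still runs a whole number of times; the data memory then needs one additional sample, which is multiplied by zero.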
[Figure 6-1 shows the address unit registers (r6: w pointer, r2: x pointer, r0: y pointer), the accumulators (a1: odd accumulation, a0: even accumulation), and the placement of w(0) ... w(N-1), x(t) ... x(t-N+1) and y(t) in Data Memory A and Data Memory B.]

Figure 6-1. Register and Data Memory Usage for a Real Non-Symmetrical FIR Filter

Note that two accumulations are calculated in the single parallel word that is repeated in the kernel. Because the same pointers are used in both instructions, they are incremented as though the left instruction were executed first and then the right. In Figure 6-1 these address pairs are shown as the two r6 positions and the two r2 positions. If N is not even, an additional zero-valued coefficient can be added to make it so without a loss of speed. The computation time is N/2 + 2 instruction cycles with 4 data memory accesses per cycle. The program size is 15 bytes, and 4N + 2 bytes of data memory are required.

Block Real Symmetrical FIR Filter

Using the data memory and register utilization shown in Figure 6-2, this program computes two output samples, for times t and t+1, for an N tap filter with a symmetrical impulse response:

    FIR2:                                       //Block Real Symmetrical FIR Filter
    {                                           //Prolog
        cliw firsym1 (r2++,r2++,r4--,r4--)      //Add two initial pairs of x(t) values and load delay
        {                                       //registers: ff1 becomes x(t) and ff2 becomes x(t-N+2)
            nop || a0=*ma2+*ma3 || nop || a1=*ma1+*ma4 ||
            ff1=*ma2 || ff2=*ma4;
        }
        clr(a2,a3) || rep(N/2-1) single;        //Clear accumulators a2 and a3, and repeat next
                                                //single instruction N/2 - 1 times
        {                                       //Kernel
            cliw firsym2 (r2++,r4--,r6++)       //Add two pairs of x(t) values, perform two multiply-
            {                                   //accumulates and load delay registers ff1 and ff2
                a2+=a0h**ma3 || a0=*ma1+ff2 ||
                a3+=a1h**ma3 || a1=*ma2+ff1 ||
                ff1=*ma1 || ff2=*ma2;
            }
        }
                                                //Epilog
        cliw firsym3 (r0++,r0++,r6++)
        {
            *ma1+=a0h**ma3+a2 || nop || *ma2+=a1h**ma3+a3;
                                                //Add final products and store two output points
        }                                       //y(t) and y(t+1)
    }                                           //End

CLIW instructions are of the form:

    cliw name (ma1, ma2, ma3, ma4) {mac1 || alu1 || mac2 || alu2 || move1 || move2};

[Figure 6-2 shows the address unit registers (r6: w pointer, r4: x forward pointer, r2: x reverse pointer, r0: y pointer), the accumulators (a3: y(t+1) accumulation, a2: y(t) accumulation, a1: y(t+1) data sums, a0: y(t) data sums), the execution unit registers (ff2: forward data, ff1: reverse data), and the placement of w(0) ... w(N/2-1), x(t-N+1) ... x(t+1), y(t) and y(t+1) in Data Memory A and Data Memory B.]

Figure 6-2. Register and Data Memory Usage for a Block Symmetrical FIR Filter

The successive readings of x(t) from data memory B are shown in Figure 6-3 from left to right. The brackets show the data pairs that are added from the forward and reverse progression of time. The arrows show the moves of some of the same data to the ff1 and ff2 hold registers for re-use in the next pass without another memory read. Note that forward progression of time, with increasing t, corresponds to numerically decreasing data addresses.

Three CLIW instructions, firsym1 to firsym3, are used. The firsym2 instruction takes advantage of the symmetry by using a CLIW that can do two data additions and two multiply-accumulates along with two register loads. Both firsym1 and firsym2 use the delay registers (ff1, ff2) available for parallel use with CLIW. The computation time is now N/4 + 1.5 instruction cycles per output point, with a total of three data memory accesses per cycle except for four in the prolog. The program size is now 60 bytes, and 3N + 2 bytes of data memory are required.

[Figure 6-3 shows three columns (prolog, 1st pass kernel, 2nd pass kernel), each listing x(t-N+1) ... x(t+1), with the pairs added for the 1st, 2nd and 3rd y(t) and y(t+1) sums and the values moved to ff1 and ff2.]

Figure 6-3. Successive Read Cycles for Data Memory B

6.2 Vector Quantization

Vector quantization is a common form of communications speech encoding that is computationally intensive. The best fit is found between an incoming speech sample vector and a codebook of reference coefficient vectors. This involves finding the minimum distance between the sample vector and all codebook entries. The squared difference between a sample vector x_i of length N and a codebook coefficient vector c_j of length N is of the form:

\[ \mathrm{distance}_{ij} = \sum_{n=0}^{N-1} \left[ x_i(n) - c_j(n) \right]^2 \]

Common variations combine this with the minimizing process of comparing other codebook entries, or apply a weighting function to the distances to compensate for non-uniform energies in the codebook entries. The minimum distance for a sample vector i with a codebook of M entries is of the form:

\[ \mathrm{minimum\_distance}_{i} = \min_{j=0}^{M-1} \left[ \mathrm{distance}_{ij} \right] \]

The desired result is the index j of the codebook entry with the minimum distance, not the actual minimum distance for sample vector i.

Using the data memory and register utilization shown in Figure 6-4, the following program computes the minimum squared distance between the sample vector x_i of length N and M code vectors of length N. This minimum squared distance remains in accumulator a4, and the address of the last sample of the corresponding code vector is in the MINMAX register.
    VQ1:                                        //Vector Quantization
    {                                           //Outer Loop over M code vectors
        rep(M/2) block;                         //Repeat outer block loop M/2 times
        {
            a0l = *r0++ - *r4++ ||              //Find first distance with first half code vectors
            a1l = *(r0+rn0) - *(r4+rn4);        //Find first distance with second half code vectors
            clr(a2,a3) || rep(N-1) single;      //Clear accumulators a2 and a3, and repeat next
                                                //single instruction N-1 times
            {                                   //Kernel inner loop over N-1 points
                cliw vq (r0++,r4++,r4+rn4)
                {
                    a0l = *ma1-*ma2 || a2 += sqr(a0l) ||    //Accumulate distance^2 with first half code vectors
                    a1l = *ma1-*ma3 || a3 += sqr(a1l);      //Accumulate distance^2 with second half code vectors
                }
            }
            a2 += sqr(a0l) || a3 += sqr(a1l);   //Accumulate last distance^2 with first half code
                                                //vectors and with second half code vectors
            minm(a4,a2,r4);                     //The current minimum squared distance is placed in a4
            minm(a4,a3,r4+rn4);                 //and the end address of the corresponding code vector
                                                //is stored in the MINMAX register
        }
    }                                           //End

[Figure 6-4 shows the address unit registers (rn4: M*N/2, r4: c pointer, rn0: zero, r0: x pointer), the accumulators (a4: current minimum squared distance, a3: second half distance^2 accumulation, a2: first half distance^2 accumulation, a1: second half distances, a0: first half distances), and the placement of x_i(0) ... x_i(N-1) and the code vectors c_0 ... c_(M-1), each with elements 0 to N-1, in Data Memory A and Data Memory B.]

Figure 6-4. Register and Data Memory Usage for Vector Quantization

A single CLIW instruction, vq, is used. It uses both ALUs and both MAC execution units at once in the inner loop, and needs only three data memory accesses. Note the speed gained from minm, the three-operand minimum and update instruction, which finds not only the minimum but also the corresponding address, which is the important result. The computation time is (M/2)(N+4) + 1 instruction cycles. The program size is 51 bytes, and 2(M+1)N bytes of data memory are required.
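The search performed by VQ1 can be sketched in Python to make the data flow easier to follow. This is an illustrative model under assumptions, not the CARMEL implementation: the codebook is a list of equal-length lists, the function returns the index of the best code vector rather than an end address in MINMAX, and ties are resolved in favor of the entry examined first, which the source does not specify for minm. The names `squared_distance` and `vq_search` are hypothetical.

```python
def squared_distance(x, c):
    """Sum of squared differences between two equal-length vectors."""
    return sum((xn - cn) ** 2 for xn, cn in zip(x, c))


def vq_search(x, codebook):
    """Return (best_index, min_squared_distance) for sample vector x.

    The two halves of the codebook are scanned in lock step, mirroring
    the way the vq kernel handles code vectors j and j + M/2 together.
    """
    m = len(codebook)
    if m == 0 or m % 2:
        raise ValueError("M must be even and nonzero")
    half = m // 2              # offset to the second half (role of rn4)
    best_dist = float("inf")   # role of a4
    best_index = None          # stands in for the address kept in MINMAX
    for j in range(half):      # rep(M/2) block
        a2 = 0                 # first half distance^2 accumulation
        a3 = 0                 # second half distance^2 accumulation
        for n in range(len(x)):
            a0 = x[n] - codebook[j][n]          # first half difference
            a1 = x[n] - codebook[j + half][n]   # second half difference
            a2 += a0 * a0
            a3 += a1 * a1
        # Two minimum-and-update steps, one per half (assumed strict compare).
        if a2 < best_dist:
            best_dist, best_index = a2, j
        if a3 < best_dist:
            best_dist, best_index = a3, j + half
    return best_index, best_dist
```

The result can be cross-checked against a brute-force scan with `squared_distance` over every codebook entry; both should agree on the minimum distance.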