An overview of the evolution of the ARM processor architecture is provided, with emphasis on specific instruction sets capable of executing demanding high-speed real-time multimedia applications. The basic ARM architecture is presented, together with the evolution of the hardware architecture. Specifics of the instruction set extensions based on SIMD-style processing are described. The most important performance issues and speed restriction problems are analyzed.
The ARM processor is a key component of many embedded systems. The first ARM prototype was introduced in 1985 under the name Acorn RISC Machine (afterwards renamed Advanced RISC Machine). The relative simplicity of ARM processors made them suitable for applications with low power consumption. ARM processor based solutions currently power over 90 percent of 3G baseband products.
The paper is structured as follows. Section II describes the basic ARM architecture, including the processor datapath, pipeline, register file, basic instructions and data types. Section III describes the evolution of the ARM architecture through hardware architecture changes and instruction set extensions. Section IV gives an insight into the ARM multimedia instruction sets. Section V gives an overview of the performance issues and bottlenecks encountered with multimedia instruction sets. The conclusion is given in Section VI.
II. BASIC ARM ARCHITECTURE
The basic ARM architecture can be analyzed on the ARM7 core (ARMv4 revision), which is a 32-bit Reduced Instruction Set Computer (RISC). It incorporates these typical RISC architecture features:
- A large uniform register file
- A load/store architecture, where data-processing operations only operate on register contents, not directly on memory contents.
- Simple addressing modes, with all load/store addresses being determined from register contents and instruction fields only
- Uniform and fixed-length instruction fields, to simplify instruction decode
In addition the ARM architecture provides:
- Control over both the Arithmetic Logic Unit (ALU) and shifter in most data-processing instructions to maximize the use of the ALU and the shifter
- Auto-increment and auto-decrement addressing modes to optimize program loops
- Load and Store Multiple instructions to maximize data throughput
- Conditional execution of almost all instructions to maximize execution throughput
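The combined use of the shifter and the ALU, together with conditional execution, can be illustrated with a small Python model. This is only a sketch: the function names, the reduced condition table (evaluated on the zero flag alone) and the dictionary-based register file are illustrative assumptions, not part of any ARM specification or tool.

```python
MASK32 = 0xFFFFFFFF

def lsl(value, amount):
    """Logical shift left on a 32-bit value, as a shifter would apply it."""
    return (value << amount) & MASK32

# Hypothetical subset of condition codes, evaluated on the Z (zero) flag only.
CONDITIONS = {
    "AL": lambda flags: True,            # always execute
    "EQ": lambda flags: flags["Z"],      # execute if Z is set
    "NE": lambda flags: not flags["Z"],  # execute if Z is clear
}

def add_shifted(regs, flags, cond, rd, rn, rm, shift):
    """Model of an instruction of the form ADD<cond> rd, rn, rm, LSL #shift.

    The shift and the addition are performed by one instruction. If the
    condition does not hold, the register file is left unchanged.
    """
    if CONDITIONS[cond](flags):
        regs[rd] = (regs[rn] + lsl(regs[rm], shift)) & MASK32
    return regs

regs = {"r0": 0, "r1": 100, "r2": 3}
add_shifted(regs, {"Z": False}, "NE", "r0", "r1", "r2", 2)  # r0 = r1 + (r2 << 2)
print(regs["r0"])
```

In this model a failed condition simply skips the update, so a short conditional sequence needs no branch.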
A block diagram of the basic ARM architecture dataflow is shown in Figure 1.
Figure 1. ARM core dataflow model
- Fetch – loads an instruction from memory
- Decode – identifies the instruction to be executed
- Execute – processes the instruction
Figure 2. ARMv4 pipeline
The hardware of each stage is designed to be independent so up to three instructions can be processed simultaneously.
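The overlap of the three stages can be sketched with a short Python model. It assumes an ideal pipeline in which one instruction enters per cycle and no stalls or branches occur. The function name and the sample program are hypothetical.

```python
STAGES = ("Fetch", "Decode", "Execute")

def pipeline_schedule(instructions):
    """Return, for each clock cycle, which instruction occupies each stage.

    Assumes an ideal pipeline: one instruction enters per cycle and
    there are no stalls or branches.
    """
    depth = len(STAGES)
    total_cycles = len(instructions) + depth - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            # The instruction in stage k entered the pipeline k cycles ago.
            instr_index = cycle - stage_index
            if 0 <= instr_index < len(instructions):
                row[stage] = instructions[instr_index]
        schedule.append(row)
    return schedule

program = ["ADD", "SUB", "MOV", "LDR"]
for cycle, row in enumerate(pipeline_schedule(program), start=1):
    print(cycle, row)
```

Under these assumptions, n instructions complete in n + 2 cycles instead of 3n, and from the third cycle on all three stages are busy at once.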
- r13 – Stack pointer
- r14 – Link register
- r15 – Program counter
Figure 3. ARMv4 full register set
C. Basic instructions and datatypes
- Data processing instructions such as move, logical and arithmetic shift, rotation, etc
- Branch instructions used for changing flow of execution
- Load/store instructions used for transferring data between memory and processor registers
- Software interrupt instruction used for causing software interrupt exception
- Program status register instructions used for direct control of program status register bits
The ARMv4 core supported data types are signed and unsigned words (32-bit), halfwords (16-bit) and bytes.
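Because the registers are 32 bits wide, a byte or halfword has to be widened when it is brought into a register. The sketch below assumes the usual convention that unsigned values are zero-extended and signed values are sign-extended. The helper names are illustrative only.

```python
def zero_extend(value, bits):
    """Keep the low `bits` bits; the upper bits of the result are zero."""
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    """Interpret the low `bits` bits as a two's-complement signed number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def as_register(value):
    """Show a (possibly negative) Python integer as a 32-bit register image."""
    return value & 0xFFFFFFFF

byte = 0xFF
print(zero_extend(byte, 8))                    # unsigned byte
print(sign_extend(byte, 8))                    # signed byte
print(hex(as_register(sign_extend(byte, 8))))  # register contents after sign extension
```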
D. Thumb instruction set
Code written with Thumb instructions has higher code density than code written with regular ARM instructions when operations are executed on both 32-bit and smaller data.
Every Thumb instruction corresponds to a 32-bit ARM instruction, and the Thumb instruction decoder is placed in the pipeline as shown in Figure 4.
Figure 4. Thumb instruction decoding in ARM pipeline
III. ARM ARCHITECTURE EVOLUTION
With each new revision, existing operations were enhanced, new instructions were introduced, and the architecture was changed and improved with additional elements.
ARMv4T introduced the transition from a Von Neumann to a Harvard architecture, which separates the instruction and data buses and enables efficient memory management.
With this change the number of pipeline stages also started to grow: from 3 to 5 in ARMv4T, then to 6, 7 and 15 (ARM Cortex A9), and up to 24 in the latest ARM Cortex A15 architecture.
Starting with the latest revision, ARMv7, ARM core implementations are divided into three profiles:
- ARMv7M – microcontroller profile
- ARMv7R – real-time profile
- ARMv7A – application profile
The real-time and application profiles both support the ARM and Thumb instruction sets. They differ in the memory management model: systems implemented using the real-time profile require support for physical addresses only, while systems implemented using the application profile require virtual address support (based on a memory management unit).
The most powerful ARM processor currently is the Cortex-A15. It is targeted at high-end wireless, smartphone, gaming, networking and server applications. The basic architectural updates are large physical addressing (up to 1 TB of memory), virtualization, ISA extensions (integer divide, fused MAC), multiprocessing, the AMBA 4 bus architecture and Error Correction Code (ECC) on L1 and L2 memories.
Multiprocessing is supported (quad cores), with processor coherency in the L2 cache (up to 4 MB), which is connected externally to a 128-bit AMBA 4 interface as shown in Figure 5.
Figure 5. Quad Cortex-A15 MPCore
Out-of-order execution is used to increase instruction parallelism and reduce the misprediction penalty. In addition, the execution stage is broken down into multiple clusters defined by instruction type:
- Simple cluster – single cycle integer operations
- Complex cluster – NEON and Floating point data processing operations
- Branch cluster
- Multiply and Divide cluster
- Load/Store cluster
IV. MULTIMEDIA INSTRUCTION SETS
SIMD operations have been implemented in both the Intel x86 (Intel 64, IA-32) and the ARM architectures. The MMX instruction set was the first to be introduced in the Intel architecture. Streaming SIMD Extensions (SSE, SSE2, SSE3 and SSE4) were added gradually. The most recent extension to the Intel architecture is Advanced Vector Extensions (AVX).
The first SIMD extension was included in the ARMv6 architecture and was further expanded with the NEON extension in the ARMv7 architecture.
A. ARM SIMD extension
Figure 6. ARM SIMD operation example
B. ARM NEON extension
The implementation of the Advanced SIMD extension used in ARM processors is called NEON, and this is the common terminology used outside architecture specifications. The NEON instructions perform memory accesses, data copying between NEON and general-purpose registers, data type conversions and data processing. An example of an ADD operation that performs parallel addition of four lanes of 8-bit elements is shown in Figure 7.
Figure 7. ARM NEON operation example
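The lane-wise addition described above can be modelled in a few lines of Python. This is a behavioral sketch, not NEON code: the function name is hypothetical, registers are represented as plain integers, and each lane is assumed to wrap around on overflow without passing a carry to its neighbor.

```python
def lane_add(a, b, lanes, element_bits):
    """Add two packed registers lane by lane, wrapping within each lane."""
    lane_mask = (1 << element_bits) - 1
    result = 0
    for lane in range(lanes):
        shift = lane * element_bits
        total = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Discard the carry so that it cannot leak into the next lane.
        result |= (total & lane_mask) << shift
    return result

# Four 8-bit lanes, matching the example described in the text.
a = 0x01FF7F10
b = 0x01010101
print(hex(lane_add(a, b, lanes=4, element_bits=8)))  # lane-wise addition
print(hex((a + b) & 0xFFFFFFFF))                     # ordinary 32-bit addition, for contrast
```

The two printed values differ in the top lane, because the ordinary addition lets the overflow of the third lane carry into it.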
- Sixteen 128-bit quadword registers, Q0-Q15
- Thirty-two 64-bit doubleword registers, D0-D31
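The two views listed above describe the same amount of storage, since 16 x 128 bits equals 32 x 64 bits. The sketch below assumes the usual NEON aliasing, in which quadword register Qn overlays doubleword registers D(2n) and D(2n+1), with the lower-numbered D register holding the low half. The helper names are illustrative.

```python
MASK64 = (1 << 64) - 1

def d_registers_of(q_index):
    """Return the indices of the two D registers that overlay Q<q_index>.

    Assumes the aliasing Qn = {D(2n+1), D(2n)}.
    """
    if not 0 <= q_index <= 15:
        raise ValueError("Q register index must be in the range 0..15")
    return (2 * q_index, 2 * q_index + 1)

def split_q(q_value):
    """Split a 128-bit value into its (low, high) 64-bit halves."""
    return (q_value & MASK64, (q_value >> 64) & MASK64)

print(d_registers_of(0))   # D registers under Q0
print(d_registers_of(15))  # D registers under Q15
low, high = split_q(0x0123456789ABCDEF_FEDCBA9876543210)
print(hex(low), hex(high))
```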
Data inside a 128-bit register can be packed in the following ways:
- 16 integer bytes (16 x 8 bits)
- 8 integer halfwords (8 x 16 bits)
- 8 half-precision floating point numbers (8 x 16 bits)
- 4 integer words (4 x 32 bits)
- 4 single-precision floating point numbers (4 x 32 bits)
- 2 integer doublewords (2 x 64 bits)
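Each packing in the list above is a different interpretation of the same 128 bits. As a minimal sketch, the Python standard `struct` module can reinterpret one 16-byte buffer in each of these ways. The little-endian byte order and the sample buffer contents are assumptions made only for this illustration.

```python
import struct

# A stand-in for the contents of one 128-bit register (16 bytes).
register = bytes(range(16))

# struct format strings: "<" selects little-endian with standard sizes.
VIEWS = {
    "8-bit integers": "<16B",
    "16-bit integers": "<8H",
    "half-precision floats": "<8e",
    "32-bit integers": "<4I",
    "single-precision floats": "<4f",
    "64-bit integers": "<2Q",
}

for name, fmt in VIEWS.items():
    lanes = struct.unpack(fmt, register)
    # Every view covers exactly the same 16 bytes.
    assert struct.calcsize(fmt) == len(register)
    print(f"{len(lanes):2d} x {name}")
```

The number of lanes is always 128 divided by the element width, which is why narrower elements allow more operations per instruction.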
This paper presents an overview of the ARM architecture, from the initial design with a 3-stage pipeline and Von Neumann architecture to the newest ARM Cortex based embedded systems. Special emphasis is given to the ARM extensions (SIMD and NEON) aimed at optimizing multimedia applications.
Some problems are brought to attention and could serve as a starting point for future improvements of processor architectures and instruction sets aimed specifically at multimedia applications.