Lucian Codrescu
Sr. Director, Technology
Qualcomm Technologies, Inc.
Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications
Qualcomm Technologies, Inc. All Rights Reserved
Hexagon DSP processors in Snapdragon products
aDSP: Real-time media & sensor processing
mDSP: Dedicated modem processing
[Figure: Snapdragon 800 block diagram. Blocks shown: Camera, Adreno GPU, Display, JPEG, Video, Krait CPU (x4), 2MB L2, Hexagon aDSP, Audio, Sensors, Modem, Hexagon mDSP, Connectivity, Other, Misc., Multimedia Fabric, System Fabric, Fabric & Memory Controller, LPDDR3 (x2)]
Expansion of Hexagon DSP use cases beyond audio
[Figure: Hexagon DSP use cases expanding across generations. Use cases shown: Voice, Audio, Sensors, Image Enhancement (Camera, Still, Video), Computer Vision & Augmented Reality, Video. Generation labels: HexagonV2/V3, HexagonV4 based products, HexagonV5 based products]
Hexagon DSP is evolving for use beyond voice and audio to computer vision, video and imaging features
The Hexagon DSP evolution
Generational improvements in performance and power efficiency driven by both architecture and implementation
[Figure: timeline of Hexagon DSP versions over time]
Version   Process   Date
V1        65nm      Oct 2006
V2        65nm      Dec 2007
V3M       45nm      June 2009
V3C       45nm      Aug 2009
V3L       45nm      Nov 2009
V4M       28nm      Dec 2010
V4C       28nm      Dec 2010
V4L       28nm      Apr 2011
V5A       28nm      Dec 2012
V5H       28nm      Dec 2012
Key characteristics of modem & multimedia applications
Requirements
  Require fixed real-time performance level (fps, Mbit/sec, etc.)
  Extremely aggressive power & area targets
Characteristics
  Mix of signal processing & control code
    For modem, Qualcomm does not use a split CPU/DSP architecture. All processing is done on Hexagon DSP
    Multimedia apps have significant control in the RTOS & frameworks
  Heavy L2$ misses
    Multimedia is data intensive
    Modem is code intensive
Hexagon DSP blends features targeted to modem & multimedia
VLIW
  Need multi-issue to meet performance
  Low complexity for Area & Power
Multi-Threading
  To reduce L2$ miss penalty without the need for a large L2
  Increases instructions/VLIW packet because compiler doesn't need to schedule latency
Innovate in ISA to maximize IPC
  More work/VLIW packet reduces energy/instruction
  Keep the pipelines full for MIPS/mm2
  Target both Signal Processing & Control code
VLIW: Area & power efficient multi-issue
Variable sized instruction packets (1 to 4 instructions per packet)
Dual 64-bit load/store units
  Also 32-bit ALU
Dual 64-bit execution units
  Standard 8/16/32/64bit data types
  SIMD vectorized MPY / ALU / SHIFT, Permute, BitOps
  Up to 8 16b MAC/cycle
  2 SP FMA/cycle
Unified 32x32bit General Register File is best for compiler
  No separate Address or Accum Regs
  Per-Thread
[Figure: core block diagram. Blocks shown: Instruction Cache, Instruction Unit, two Data Units (Load/Store/ALU), two Execution Units (64-bit Vector), one Register File per thread, Data Cache, L2 Cache / TCM, Device, DDR Memory]
Maximizing the signal processing code work/packet
Example from inner loop of FFT: Executing 29 simple RISC ops in 1 cycle
{ R17:16 = MEMD(R0++M1)             // 64-bit load with post-update addressing
  MEMD(R6++M1) = R25:24             // 64-bit store with post-update addressing
  R20 = CMPY(R20, R8):<<1:rnd:sat   // complex multiply with round and saturation
  R11:10 = VADDH(R11:10, R13:12)    // vector 4x16-bit add
}:endloop0                          // zero-overhead loop: dec count, compare, jump top
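To make the "Vector 4x16-bit Add" in the FFT packet above concrete, here is a minimal C model of a lane-wise halfword add on a 64-bit value. It is a sketch for illustration only: the names `get_lane` and `vaddh_model` are hypothetical, and the model assumes the non-saturating form of the instruction, where each 16-bit lane wraps independently.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit lane 0..3 from a 64-bit packed value. */
static uint16_t get_lane(uint64_t v, int lane) {
    assert(lane >= 0 && lane < 4);
    return (uint16_t)(v >> (16 * lane));
}

/* Hypothetical model of a 4x16-bit vector add (assumption: no saturation).
   Each lane is summed modulo 2^16; no carry crosses a lane boundary. */
static uint64_t vaddh_model(uint64_t a, uint64_t b) {
    uint64_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint16_t sum = (uint16_t)(get_lane(a, lane) + get_lane(b, lane));
        out |= (uint64_t)sum << (16 * lane);
    }
    return out;
}
```

A single 64-bit operation of this kind does the work of four scalar 16-bit adds, which is one way a packet's count of simple RISC operations grows well beyond its four instruction slots.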
Maximizing the control code work/packet
Hexagon DSP ISA improves control code efficiency over traditional VLIW
Example C code
void example(int *ptr, int val) {
  if (ptr!=0) {
    *ptr = *ptr + val + 2;
  }
}
Traditional VLIW Assembly Code
{ p0 = cmp.eq(r0,#0) }
{ if (!p0) r2=memw(r0)
  if (p0) jumpr:nt r31 }
{ r2 = add(r2,#2) }
{ r1 = add(r1,r2) }
{ memw(r0) = r1
  jumpr r31 }
Instr/Packet = 7 instr/5 packets = 1.4
Hexagon DSP: Dot-New Predication
{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r2 = add(r2,#2) }
{ r1 = add(r1,r2) }
{ memw(r0) = r1
  jumpr r31 }
Hexagon DSP: Compound ALU
{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r1 = add(r1,add(r2,#2)) }
{ memw(r0) = r1
  jumpr r31 }
Hexagon DSP: New-Value Store
{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r1 = add(r1,add(r2,#2))
  memw(r0) = r1.new
  jumpr r31 }
Instr/Packet = 7 instr/2 packets = 3.5
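The slide states the instructions-per-packet ratio only for the traditional VLIW listing and the final new-value-store listing. The short C sketch below works the same arithmetic for all four listings. The packet counts for the two intermediate steps (4 and 3) are counted from the listings rather than stated on the slide, so treat them as assumptions; the function names are hypothetical. The instruction count stays at 7 throughout because a compound instruction counts as 2.

```c
#include <assert.h>
#include <stdio.h>

/* Average instructions per VLIW packet. */
static double instr_per_packet(int instructions, int packets) {
    assert(packets > 0);
    return (double)instructions / packets;
}

static void print_packet_density(void) {
    /* 7 instructions in every case: a compound ALU op counts as 2. */
    static const struct { const char *name; int packets; } variants[] = {
        { "Traditional VLIW",    5 },  /* stated on the slide */
        { "Dot-new predication", 4 },  /* counted from the listing (assumption) */
        { "Compound ALU",        3 },  /* counted from the listing (assumption) */
        { "New-value store",     2 },  /* stated on the slide */
    };
    for (size_t i = 0; i < sizeof variants / sizeof variants[0]; ++i)
        printf("%-20s %.2f\n", variants[i].name,
               instr_per_packet(7, variants[i].packets));
}
```

Calling `print_packet_density()` lists the ratio for each step, so the contribution of each ISA feature to the move from 1.4 to 3.5 can be read off directly.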
High avg. instructions/packet for targeted use cases
[Chart: Average Instructions/VLIW Packet (y-axis 0 to 5) for Computer Vision, Video, Imaging, Control, and Audio workloads. Compound instructions count as 2]
Source: Qualcomm internal measurements
Programmer's view of Hexagon DSP HW multi-threading
Hexagon V5 includes three hardware threads
Architected to look like a multi-core with communication through shared memory
[Figure: Thread 0, Thread 1, and Thread 2, each drawn with its own Register File, two Data Units (DU), and two Execution Units (XU), sharing the Instruction Cache, the Data Cache, and the L2 Cache / TCM]
Hexagon DSP V1-V4: Interleaved multi-threading
Simple round-robin thread scheduling
Number of threads matches execution pipe depth (three threads, three execute stages)
All instructions complete before next packet dispatch
Compiler schedules for zero-latency, which helps to increase instructions/VLIW packet
[Figure: pipeline timing diagram. Thread 0, Thread 1, and Thread 2 dispatch in rotation (T0, T1, T2, T0, T1, T0, ...). Example packets shown: T0 { Ld Ld Add Cmp }, T1 { St Ld Mpy Add }, T2 { Ld Add Jump }]
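The simple round-robin rule described on this slide can be sketched in a few lines of C. This is a simplified model, not a description of the actual hardware: it assumes three threads and exactly one dispatch slot per clock, and the function names are hypothetical.

```c
#include <assert.h>

enum { NUM_HW_THREADS = 3 };  /* three threads, per the slide */

/* Interleaved multi-threading: the thread that dispatches on a given
   clock is determined by the cycle number alone. */
static int imt_thread_for_cycle(unsigned cycle) {
    return (int)(cycle % NUM_HW_THREADS);
}

/* Count the dispatch slots one thread receives in `cycles` clocks,
   starting from cycle 0. */
static unsigned imt_slots_for_thread(int thread, unsigned cycles) {
    assert(thread >= 0 && thread < NUM_HW_THREADS);
    unsigned slots = 0;
    for (unsigned c = 0; c < cycles; ++c)
        if (imt_thread_for_cycle(c) == thread)
            ++slots;
    return slots;
}
```

In this model a thread's next packet dispatches three clocks after its previous one. With three execute stages, the previous packet has finished by then, which is the property that lets the compiler schedule as if latency were zero.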
Hexagon DSP V5: Dynamic HW multi-threading
Recover some performance when threads idle or stalled
Remove a thread from IMT rotation
  On L2 cache misses
  When in wait-for-interrupt or off mode
Additional forwarding to support 2-cycle packets
  VLIW packets with dependencies between long latency instructions will stall
  But many VLIW packets with simple instructions can complete in 2 processor clocks
[Chart: Coremarks/MHz and Dhrystone DMIPS/MHz, comparing IMT and DMT]
Source: Qualcomm internal measurements
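The idea of removing a thread from the IMT rotation can be sketched as a round-robin pick that skips threads currently out of rotation. This is a simplified illustration under stated assumptions, not the actual hardware policy: the `in_rotation` flags stand in for "not waiting on an L2 miss, not in wait-for-interrupt, not off", the function name is hypothetical, and the model ignores the 2-cycle packet timing and dependency stalls mentioned on the slide.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_HW_THREADS = 3 };  /* three threads, per the slide */

/* Pick the next thread after `last` in round-robin order, skipping any
   thread that is currently out of rotation. Returns -1 if no thread
   is in rotation. */
static int dmt_next_thread(int last, const bool in_rotation[NUM_HW_THREADS]) {
    assert(last >= 0 && last < NUM_HW_THREADS);
    for (int step = 1; step <= NUM_HW_THREADS; ++step) {
        int candidate = (last + step) % NUM_HW_THREADS;
        if (in_rotation[candidate])
            return candidate;
    }
    return -1;
}
```

With all three flags set, this reduces to the plain round-robin order. When one thread drops out, its slots go to the remaining threads instead of being left empty, which is the performance the slide describes recovering.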
Hexagon DSP instructions per cycle
[Chart: Average Instructions / Cycle (y-axis 0 to 4.5), IPC_IMT vs. IPC_DMT, for Multi-Threaded Apps and Single-Threaded Apps]
Source: Qualcomm internal measurements
Qualcomm Hexagon DSP architecture
Highly efficient mobile application processor designed for more performance per MHz
[Chart: DSP Performance per MHz (BDTImark2000/MHz), y-axis 0 to 20]
                               Clock Rate (MHz)   DSP Performance (BDTImark2000)
Mobile Competitor              430-520            4730-5720
Qualcomm HXGN V4 (1 thread)    100-233            1810-4220
Qualcomm HXGN V4 (3 threads)   300-700            5440-12660*
* - Projected best case score for 3-threads
Source: BDTI - For more detailed information see [Link]. All scores 2013 BDTI
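The per-MHz figure that the chart plots can be recomputed from the clock and score ranges on this slide. The C sketch below does that arithmetic. It assumes each processor's clock range pairs with its score range in the order they appear on the slide; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

struct bdti_row {
    const char *name;
    double mhz_lo, mhz_hi;      /* clock rate range, MHz */
    double score_lo, score_hi;  /* BDTImark2000 range */
};

/* Values copied from the slide; the row pairing is an assumption. */
static const struct bdti_row rows[] = {
    { "Mobile Competitor",            430, 520, 4730,  5720 },
    { "Qualcomm HXGN V4 (1 thread)",  100, 233, 1810,  4220 },
    { "Qualcomm HXGN V4 (3 threads)", 300, 700, 5440, 12660 },  /* projected */
};

/* BDTImark2000 score divided by clock rate in MHz. */
static double score_per_mhz(double score, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return score / clock_mhz;
}

static void print_per_mhz(void) {
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; ++i)
        printf("%-30s %.1f to %.1f per MHz\n", rows[i].name,
               score_per_mhz(rows[i].score_lo, rows[i].mhz_lo),
               score_per_mhz(rows[i].score_hi, rows[i].mhz_hi));
}
```

Under that pairing, the competitor works out to about 11 BDTImark2000 per MHz and both Hexagon V4 rows to about 18, which fits the chart's 0 to 20 axis.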
Hexagon DSP Power Benefits
MP3 playback power for competitive smartphones
[Chart: Power (lower is better) for the Qualcomm / Hexagon-based phone and Competitors A through G]
Power measured at the battery for various phones
Includes everything: DSP, CPU, memory, analog components, etc
Source: Qualcomm internal measurements
Computer vision offload ARM/neon to Hexagon DSP
Augmented Reality Java App finding objects in image using FastCV Feature Detect
Comparison of Feature Detect run on App CPU (ARM/Neon) vs. App DSP (Hexagon):
  CPU Utilization (%): 52% Less CPU
  Detection Time (%): 7% Less Time
  Total Device Power (%): 32% Less Power*
Source: Qualcomm internal measurements. * Power measured at the device battery
Hexagon DSP power for different thread utilizations
Excellent near-linear power scalability
(as threads go idle, power used by the thread is nearly eliminated)
Achieved through optimized clock tree design & clock gating
[Charts: "Dhrystone Power, IMT Mode" and "FIR Power, IMT Mode"; y-axis 0% to 100%; each compares Actual against Ideal]
Source: Qualcomm internal measurements
Hexagon DSP Software Development
Independent Algorithm Developers on Hexagon DSP
Announcing the Hexagon DSP SDK
See the Hexagon DSP SDK in action at Uplinq2013 ([Link])
Visit [Link] for more information.
Thank you
Follow us on:
For more information on Qualcomm, visit us at:
[Link] & [Link]/blog
2013 Qualcomm Technologies, Inc.
Qualcomm and Hexagon are trademarks of QUALCOMM Incorporated, registered in the United States
and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other
product and brand names may be trademarks or registered trademarks of their respective owners.
Hexagon is a product of Qualcomm Technologies, Inc.