Multi-core Architectures
Rakesh Kumar
[email protected]
Progress of processor technology/architecture
[Figure: SPECint2000 performance (log scale, 1 to 10000) by year, 1985 to 2005, for Intel, Alpha, Sparc, Mips, HP PA, PowerPC, and AMD processors]
Price being paid
[Figure: Watts/Spec (log scale, 0.01 to 1) versus Spec2000 (log scale, 1 to 10000) for Intel, Alpha, Sparc, Mips, HP PA, PowerPC, and AMD processors]
Lessons learned
Marginal utility of transistors decreasing
If n is the number of transistors
Power and Area are O(n)
Performance is O(sqrt(n))
• Wrong side of square law
Increasingly difficult to squeeze performance
Not enough exploitable ILP in programs
Easy ILP already extracted
More transistors available than we know how to make
use of when applied to a single processor
Clearly, we have a problem!
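The square-law argument above can be illustrated with a small sketch. It assumes performance scales exactly as sqrt(n) and cost (power, area) scales exactly as n; the function names and the unit-less transistor budgets are illustrative and not from the talk.

```python
import math

def relative_performance(n_transistors: float) -> float:
    """Assumed model: single-core performance grows as sqrt(n)."""
    return math.sqrt(n_transistors)

def relative_cost(n_transistors: float) -> float:
    """Assumed model: power and area grow linearly with n."""
    return n_transistors

def performance_per_cost(n_transistors: float) -> float:
    return relative_performance(n_transistors) / relative_cost(n_transistors)

# Quadruple the transistor budget of a single core (hypothetical budgets).
base, bigger = 1.0, 4.0
speedup = relative_performance(bigger) / relative_performance(base)
efficiency_ratio = performance_per_cost(bigger) / performance_per_cost(base)
print(f"speedup: {speedup:.2f}x, performance per unit cost: {efficiency_ratio:.2f}x")
```

Under these assumptions, four times the transistors buy only twice the performance, so performance per unit of power or area is halved.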
One way of handling a problem is...
...instead of confronting the problem, try skipping
to a simpler one
Change the focus from single-thread performance to
throughput
Don’t have increasingly complex uniprocessors
Have multiple simple processors on the same die
instead [Olukotun et al, ASPLOS96]
Each on-chip processor (called core) can execute a
program now
We can now jump to the right side of the
square law
If n is the number of transistors on a die:
Area = O(n)
Performance = O(n^(1-x))
Roughly O(sqrt(n))
More aggregate performance (throughput) can be had using a large
number of small cores than a small number of large cores
At the expense of single-thread performance
For example,
In terms of area:
1 EV6 ≈ 5 EV5 cores
In terms of throughput:
1 EV6 ≈ 2.0-2.2 EV5 cores
5 EV5 cores >= 2 EV6 cores
• Performance doubled just by having multiple cores!
The main motivation for having multi-core architectures
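The EV5/EV6 comparison above can be worked through as a short calculation. This is a sketch that takes the area and throughput ratios quoted on the slide and assumes throughput adds up linearly across cores (enough independent threads, no contention for shared resources); the constant and function names are illustrative.

```python
# Ratios quoted in the talk.
EV5_CORES_PER_EV6_AREA = 5                  # one EV6 takes the area of about 5 EV5 cores
EV6_THROUGHPUT_IN_EV5_UNITS = (2.0, 2.2)    # one EV6 delivers 2.0-2.2x an EV5's throughput

def ev5_throughput_in_ev6_units(num_ev5: int, ev6_in_ev5_units: float) -> float:
    """Throughput of num_ev5 EV5 cores expressed in EV6 equivalents.

    Assumes throughput scales linearly with the number of cores.
    """
    return num_ev5 / ev6_in_ev5_units

low = ev5_throughput_in_ev6_units(EV5_CORES_PER_EV6_AREA, EV6_THROUGHPUT_IN_EV5_UNITS[1])
high = ev5_throughput_in_ev6_units(EV5_CORES_PER_EV6_AREA, EV6_THROUGHPUT_IN_EV5_UNITS[0])
print(f"5 EV5 cores deliver about {low:.2f} to {high:.2f} EV6 cores' worth of throughput")
```

In the area of one EV6, five EV5 cores deliver at least twice the throughput of that EV6 under these assumptions, which is the doubling the slide refers to.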
Multi-core Architecture: Definition
A multi-core architecture (or a chip multiprocessor) is a
general-purpose processor that consists of multiple
cores on the same die and can execute programs
simultaneously
Multi-core architecture: Advantages
(Relatively) High performance/watt
(Relatively) High performance/area
Simpler core
Possibility of lower cycle time, better optimisation etc.
Ease of design, verification etc.
So, the next question to ask obviously is…
How should one design a multi-core architecture?
This is the question I address in my thesis research
A Naive methodology for Multi-core Design
Goals of my thesis research
Demonstrate that the prior methodology is highly
inefficient in terms of area and power
Demonstrate the need to do holistic design of multi-core
architectures
Subsystem design should be aware of the multi-core
architecture it is going to be a part of
Propose and evaluate novel and efficient multi-core
architecture design methodologies that follow a holistic
approach
Assumptions inherent to the naïve approach
All cores have to be the same
Each core is distinct
Core/memory and interconnect can be
designed in isolation
I will talk about the first assumption today
Before scrutinizing the “identical cores” assumption...
…let’s consider characteristics of typical workloads
There is enormous diversity among applications
Implication of diversity on multi-core design
If all cores are identical, the design cannot address
diverse workload demands
E.g., one must decide beforehand whether the core targets gcc or
mcf
Either way one application loses
Underutilization or low performance
An example multi-core architecture
Processors and Program diversity
Some applications will run much faster on an EV6 than
on an EV5
Others will take little advantage of the larger processor
and run at the same speed on either
With a homogeneous architecture,
you either have the former running very slowly on small
processors,
or the latter unnecessarily wasting the capabilities of the large
processor.
An alternate multi-core architecture
Single-ISA Heterogeneous Multi-core Architectures
Have multiple heterogeneous cores on the same
die
Each core type represents a different point in the
power-performance space
i.e., one core type might be small, low-performance, and
low-power, while another might be big, high-performance,
and high-power
Each core capable of executing the same ISA
Unlike SoCs/embedded heterogeneous multi-core
architectures
Such an architecture will be highly efficient on workloads with diverse applications
Another Performance Advantage: Adjusts to varying TLP
Comparing Single-ISA Heterogeneous
Architectures against Conventional CMPs
[Figure: weighted speedup versus number of threads (1 to 20) for 4EV6 and 20EV5]
A choice has to be made between throughput and single-thread (ST) performance
[Figure: weighted speedup versus number of threads (1 to 20) for 4EV6, 20EV5, and 3EV6 & 5EV5 (static best)]
Best of both worlds!
Then there is intra-program diversity as well!
[Figure: IPS versus committed instructions (in millions) for EV8-, EV6, EV5, and EV4]
Dynamic scheduling results
[Figure: dynamic scheduling results for 1 to 8 threads, comparing 4EV6, 3EV6 & 5EV5 (random), 3EV6 & 5EV5 (static best), and 3EV6 & 5EV5 (bounded-global-event)]
To sum up….
Single-ISA heterogeneous architectures are a good design
point for throughput as well as performance:
Efficient use of die area for a given thread-level parallelism
Provides low latency for a few applications on powerful cores
A large number of applications can be hosted at once on simple cores
Efficient adaptation to application diversity
Enables it to approach the performance of an architecture with a large
number of complex cores
Provides higher performance in the same area than a conventional chip
multiprocessor
Talk Outline
All cores have to be the same
Single-ISA heterogeneous multi-core
architectures
Performance Benefits
Power Benefits
Reducing power for a conventional multi-core architecture
Done at the core-level
Each core optimised for power and then replicated
multiple times
Multi-core oblivious
Processor power reduction typically involves V/f scaling,
gating, etc. for the core
Power reduction techniques applied at single-core level have
limited effectiveness
Have multiple heterogeneous cores on the same die
Match workload (or workload phase) to core that
achieves best efficiency according to some objective
function
Power down the unused cores completely
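The matching step above can be sketched as a small selection function. This is a minimal illustration, not the scheduler used in the research: the objective function chosen here is energy per instruction subject to a bound on performance loss, the peak-power figures quoted in the talk are used as a stand-in for actual power draw, and the IPS values for the two phases are hypothetical.

```python
from typing import Dict, List, NamedTuple

class CoreSample(NamedTuple):
    name: str
    ips: float      # instructions per second for this phase (hypothetical values below)
    power_w: float  # power draw in watts (peak power used as a simplification)

def energy_per_instruction(sample: CoreSample) -> float:
    """Joules per instruction for one core on one phase."""
    return sample.power_w / sample.ips

def pick_core(samples: List[CoreSample], max_perf_loss: float = 0.10) -> CoreSample:
    """Pick the least-energy core whose IPS is within max_perf_loss of the fastest core."""
    best_ips = max(s.ips for s in samples)
    eligible = [s for s in samples if s.ips >= (1.0 - max_perf_loss) * best_ips]
    return min(eligible, key=energy_per_instruction)

# Peak power per core type as quoted in the talk (watts).
PEAK_POWER_W: Dict[str, float] = {"EV4": 4.97, "EV5": 9.83, "EV6": 17.80, "EV8-": 92.88}

def phase(ips_by_core: Dict[str, float]) -> List[CoreSample]:
    return [CoreSample(name, ips, PEAK_POWER_W[name]) for name, ips in ips_by_core.items()]

# Hypothetical phases: one gains little from the big core, one gains a lot.
memory_bound = phase({"EV4": 0.40e9, "EV5": 0.60e9, "EV6": 0.95e9, "EV8-": 1.00e9})
compute_bound = phase({"EV4": 0.50e9, "EV5": 0.80e9, "EV6": 1.20e9, "EV8-": 2.00e9})

memory_choice = pick_core(memory_bound).name
compute_choice = pick_core(compute_bound).name
print(memory_choice, compute_choice)
```

All cores other than the chosen one would then be powered down. A different objective function (for example energy-delay) only requires swapping the `key` passed to `min`.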
An example Single-ISA heterogeneous multi-core architecture
Processor   Peak power (W)   Core area (mm^2)
EV4               4.97                  3
EV5               9.83                  5
EV6              17.80                 24
EV8-             92.88                260
The processor is only marginally bigger than EV8-!
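The size claim can be checked with simple arithmetic on the core areas listed above. This sketch assumes the example processor contains exactly one core of each type and ignores shared caches and interconnect, neither of which is stated in the surviving slide text.

```python
# Core areas in mm^2, taken from the talk's table.
core_area_mm2 = {"EV4": 3, "EV5": 5, "EV6": 24, "EV8-": 260}

# Assumption: one core of each type on the die, cores only.
total_area = sum(core_area_mm2.values())
overhead = total_area / core_area_mm2["EV8-"] - 1.0

print(f"total core area: {total_area} mm^2, {overhead:.1%} larger than EV8- alone")
```

Under that assumption the three smaller cores together add 32 mm^2 to the 260 mm^2 of EV8-.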
Choosing Dynamically the Core with Least Energy
(perf. loss<10%)
[Figure: IPS versus committed instructions (in millions) for EV8-, EV6, EV5, and EV4, with the Best-path curve overlaid]
[Summary of results]
          Energy savings (%)   Performance degradation (%)
Maximum         77.3                      8.5
Minimum          0.1                      0.1
Mean            38.5                      3.4
Results “verified” by other researchers using real prototypes
[Grochowski ICCD2004, Ghiasi CF2005]
Realistic heuristics
[Figure: energy, performance (1/execution time), and energy-delay, normalized to EV8-, for the neighbor, neighbor-global, random, all, and dynamic oracle heuristics]
To sum up…
A single-ISA heterogeneous multi-core architecture offers
enormous potential for power savings as well
Realistic heuristics can achieve much of the savings
potential
Beats chip-wide voltage scaling handsomely (50.6% ED^2
improvement)
Subsequent research has shown this technique to be better than
dynamic V/f scaling, gating, adaptive optimizations etc.
[Grochowski et al ICCD2004]
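For readers unfamiliar with the energy-delay metrics mentioned above, the following sketch shows how ED and ED^2 are computed and compared. The input values are hypothetical and chosen only for illustration; they are not results from the talk and do not reproduce the 50.6% figure.

```python
def energy_delay(energy: float, delay: float, delay_exponent: int = 1) -> float:
    """Energy-delay product; delay_exponent=2 gives the ED^2 metric."""
    return energy * delay ** delay_exponent

def improvement(baseline: float, candidate: float) -> float:
    """Fractional improvement of candidate over baseline (lower metric is better)."""
    return 1.0 - candidate / baseline

# Hypothetical configuration: 40% less energy, 5% longer execution time,
# both normalized to a baseline of 1.0.
baseline_energy, baseline_delay = 1.0, 1.0
candidate_energy, candidate_delay = 0.60, 1.05

ed_gain = improvement(
    energy_delay(baseline_energy, baseline_delay),
    energy_delay(candidate_energy, candidate_delay),
)
ed2_gain = improvement(
    energy_delay(baseline_energy, baseline_delay, delay_exponent=2),
    energy_delay(candidate_energy, candidate_delay, delay_exponent=2),
)
print(f"ED improvement: {ed_gain:.1%}, ED^2 improvement: {ed2_gain:.1%}")
```

Because ED^2 squares the delay term, it penalizes slowdowns more heavily than ED does, so the same configuration shows a smaller improvement under ED^2.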
Bottom line
All cores do not have to be the same
In fact, they should not be the same
Summary of talk
Decreasing marginal utility of transistors is
leading us to multi-core architectures
Conventional multi-core architectures have
identical cores
Heterogeneous architectures lead
to higher performance and lower power