Efficient C code for ARM devices


Chris Shore, ARM

11/3/2010 3:57 PM EDT


Here is an overview of some of the techniques for optimizing C-code for use on the ARM architecture.
(Paper from class ATC-152 at ARM TechCon 2010.)

Our industry moves incredibly quickly. The hot topic of last year is very rarely so important this year – either
we will have solved it or some other issue will have risen to even greater prominence. The issue of
efficiency, however, has been a relatively constant theme through my time in the industry, now some 24
years, but efficiency has worn many faces over that time.

In the early days, I can recall developing local area networking software
on small 8-bit microcontrollers where RAM was the scarce resource. We
worked long and hard to come up with cleverer and cleverer ways of
storing, compressing, reusing and encoding our necessary data structures
to make them fit.

Later on, I can recall a controller for a touch-screen radio operator station
in which the scarce commodity was code space. Limited to 32k and
writing in C, the first version or two didn’t present any problems.

But by the third and fourth versions, with the customer requesting added
functionality all the time, we were banging hard on the ceiling of the
ROM. Again, we put in the nights and weekends working out better ways
of reusing common code sequences, re-coding some functions by hand in assembler to make them smaller,
and repartitioning the C library to remove even more unused sections.

Then there were the systems which needed to respond so quickly. There is nothing like the demand of
hitting the frame rate of a time-division multiplexed cellular phone system to focus the mind on interrupt
latency and processing deadlines.

These days, it seems that we very often have RAM and ROM coming out of our ears and we have
processing power to burn. So what is the pressing issue of today? So often, it is power-efficiency or,
perhaps more exactly, energy-efficiency.

A huge proportion of today’s electronics is portable and battery-powered. We have got used to our cell
phones running for a week between charges, and our expectations of what they will do on that limited
amount of energy go up year on year.

Hardware engineers have been in the business of saving energy for longer than we software engineers but,
increasingly, it is our job to utilize hardware features to the maximum and then realize even greater
savings by writing efficient software.

A double-sided coin
One of the fundamental changes in our field in recent years has been the increasing use of platform
operating systems in the embedded world. The majority of ARM-based systems these days, large and
surprisingly small, will use either one of the many proprietary real-time operating systems available or will
use a home-brewed scheduler of some kind.

Except at the small end, the people who develop the applications are generally not the same people as
those who develop the operating systems under which they run. Both groups of engineers, though, need
increasingly to focus on being energy-efficient.

There is some common ground between the two – indeed, much of what can be said about writing
efficient applications applies equally to operating system development – but each contains specialist areas
and focuses of its own (Table 1, below).

Table 1. The two sides of efficiency

Some basic stuff


There are some things which are so basic I hope I don’t have to say them. But, just in case, I will get them
out of the way early on!

We have a very nice training course on optimizing embedded software on ARM platforms. In a series of
optimization steps over 3 days, we take a standard piece of video codec software and progressively apply
varying optimizations to it. Some are more effective than others but the cumulative effect is to improve
performance by between 200% and 250%.

We then note that all of this would be completely wasted work without correctly enabling and configuring
the data cache, as that simple omission degrades performance by approximately 5000% – a factor of
around 50. Modern ARM cores are specifically designed and optimized to run from caches as much as
possible. Neglecting this simple fact is catastrophic. We shall have more to say about cache-friendly
programming later on but, at this point, it is simply sufficient to say “Turn it on!”

Memory use is key


The fact that cache is so important is a direct result of the fact that memory access in general is costly –
both in terms of time and energy. Almost all systems have some kind of memory hierarchy. It might be as
simple as some on-chip scratchpad RAM and some external off-chip RAM; it might be as complex as two
or three levels of cache between the core and the external memory system (Figure 1 below).


Figure 1. Typical System Memory Architecture

A rough rule of thumb might be that if an L1 cache access takes one time period, an L2 cache access
might take around 10 and an external memory access something of the order of 100. There is an increasing
energy cost with this as well. The increasing expense of memory accesses, the further they are removed
from the core, is something you need to bear in mind at all times when writing efficient code.

Figure 2 below shows the measured energy cost of memory accesses [1], benchmarked against the cost
of executing an instruction.

Figure 2. Estimated cost of memory accesses.

An external memory access is typically going to take 100 times longer than accessing cache and cost
50-60 times the energy.

So, message number two, following on from simply turning on the cache, is to make sure that you use it as
much as possible.


Instructions count, too


In terms of energy and time cost, instruction execution comes next after memory access. Put simply, the fewer
instructions you execute, the less energy you consume and the less time you take. We are
assuming here that instructions are, in the main, executed from cache. For most applications, with a sensible
cache configuration and a reasonable cache size, this will generally be the case.

So, message number three is “optimize for speed”. The efficiency comes in two ways: firstly, the fewer
instructions you execute, the less energy that takes; secondly, the faster you accomplish your task, the
sooner you can sleep the processor and save power. We will have more to say about dynamic and static
power management later, when we get to operating systems but, for now it suffices to say that fast code is,
in general, power efficient code.


Further, notice that executing instructions is cheap compared to accessing memory. This leads us to
conclude that algorithms which favor computation (data processing) over communication (data
movement) will tend to be more power-efficient. Another way of looking at this is to say that programs
which are CPU-bound tend to be more efficient than those which are memory-bound.

There is one complication which we would do well to mention at this point. Code which is compiled for
size (i.e. smallest code size) may, in some circumstances, be more efficient than code which is compiled
for speed.

This will occur if, by virtue of being smaller, it uses the cache better. This may be regarded as an
exception to the general rule discussed above but, where it does apply, the greatly increased efficiency of
cache accesses means its effect could be huge.

Good coding practice


A good rule of thumb here is “Make things Match.” So, match data types to the underlying architecture,
code to the available instruction set, memory use to the configuration of the platform, coding conventions
to the available tools.

Similarly “Write sensible code” and make sure you know at least something about how the tools work. If
you know even a little about how procedure calls and global data are implemented on the ARM
architecture, there is a lot you can do to make your code more efficient.

To some extent, be driven by the way the tools work, rather than trying to drive the tools to work the way
you do. Very often, the tools work the way they do for very valid architectural reasons!

Data type and size


ARM cores are 32-bit machines. That is to say that they have 32-bit registers, a 32-bit ALU, and 32-bit
internal data paths. Unsurprisingly, they are good at doing things with 32-bit quantities. Some cores do
implement wider buses than this between, say, the core and the cache but this does not change the fact
that the core is fundamentally a 32-bit machine.

For instance, as shown below, simple ALU operations on 16-bit data (signed or unsigned short) will
involve extra instructions to either truncate or sign-extend the result in the 32-bit register.
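
As an illustration, consider a minimal pair of functions (hypothetical names); the short version typically
picks up an extra sign-extension instruction (SXTH) after the ADD, while the int version does not:

short add_short(short a, short b)
{
    return a + b;    /* compiler must sign-extend the 32-bit result back to 16 bits */
}

int add_int(int a, int b)
{
    return a + b;    /* a single 32-bit ADD, no extra work */
}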

The core can sometimes hide these extra operations. For instance, truncating a 32-bit value to 16 bits
can be accomplished at the same time as storing it to memory if a halfword store instruction is used. This
does indeed accomplish two operations for the price of one, but it depends on a store being what you
actually want to do with the item.

Later versions of the architecture provide sub-word SIMD instructions which can get around some of
these inefficiencies but in most cases these instructions are hard, if not impossible, to use from C.

Remember, too, that local variables, regardless of size, always take up an entire 32-bit register when held
in the register bank and an entire 32-bit word in memory when spilled on to the stack.

There is very little advantage in using sub-word types on an ARM core (as compared with many other
architectures in which it can save both time and storage). That’s not to say that it doesn’t work or isn’t
sometimes the right thing to do – just that there is little or no efficiency advantage in doing so.


Data alignment
In keeping with this, the ARM core has historically imposed very strict alignment requirements on both
code and data placement. It is still true today that ARM cores cannot execute unaligned instructions.
However, more recent cores, from architecture ARMv6 onwards, have relaxed somewhat the
requirements for data accesses.

These cores support access to unaligned words and halfwords. It is tempting, therefore, to abandon all
consideration of alignment in your code, enable unaligned support in the hardware and declare everything
as unaligned.

This would be a mistake! For instance, loading an unaligned word may indeed take one instruction where,
on earlier cores, it would have taken three or four but there is still a performance penalty due to the need
for the hardware to carry out multiple bus transactions.

Those transactions are hidden from the program and from the programmer but, at hardware level, they are
still there and they still take time.

Structures and arrays


If storage is not at too much of a premium, there are good reasons for choosing carefully the size of array
elements. Firstly, making the length of each element a power of 2 simplifies the offset calculation when
accessing an individual element. This is a result of the fact that the ARM instruction set allows shift
operations to be combined with ALU operations and to be built into addressing modes.

For an array of elements of size 12, based at address r3, accessing the element at index r1 would take a
sequence like this:

ADD r1, r1, r1, LSL #1    ; r1 = 3 * r1

LDR r0, [r3, r1, LSL #2]  ; r0 = *(r3 + 4 * r1) = *(r3 + 12 * index)

By comparison, if the element size is 16, the address calculation is much simpler:

LDR r0, [r3, r1, LSL #4] ; r0 = *(r3 + 16 * r1)

Although we will look at cache-friendly data access in more detail later, it is worth noting at this point that
elements which fit neatly into cache lines make more efficient use of the automatic prefetching behavior
of cache linefills.
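
As a sketch of this idea (hypothetical structure), padding a 12-byte element up to 16 bytes lets the
compiler reach any element with a single shifted offset; on a core with 64-byte cache lines, four such
elements also fill a line exactly:

typedef struct
{
    int x;
    int y;
    int z;
    int pad;              /* pads the element from 12 bytes up to 16 */
} element_t;

element_t table[64];

int get_y(unsigned int i)
{
    return table[i].y;    /* address = base + (i << 4) + 4 : one LDR, no multiply */
}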

Efficient parameter passing


Each parameter of word size or smaller will be passed to a function using either a single 32-bit register or a 32-bit
word on the stack. This information is buried inside a specification called the Procedure Call Standard for
the ARM Architecture (AAPCS).

This, in turn, is part of a larger document called the ARM Application Binary Interface (ARM ABI). The
latest version of both can be found on ARM’s website and, while the ABI document is aimed largely at
tools developers, the AAPCS information is directly relevant to applications programmers.

According to the AAPCS, we have four registers available for passing parameters. So, up to four
parameters can be passed very efficiently simply by loading them into registers prior to the function call.
Similarly, a single word-sized (or smaller) return value can be passed back in a register and a doubleword
value in two registers.

It is plain to see that trying to pass more than four parameters involves placing the remainder on the stack
with the attendant cost in memory accesses, extra instructions and, therefore, energy and time. The simple
rule “keep parameters to four or fewer” is well worth keeping in mind when coding procedures.

Further, there are “alignment” restrictions on the use of registers when passing doublewords as
parameters. In essence, a doubleword is passed in two registers and those registers must be an even-odd
pair, i.e. r0 and r1, or r2 and r3.

A call to the following function will pass ‘a’ in r0, ‘b’ in r2:r3 and ‘c’ on the stack. r1 cannot be used
because of the alignment restrictions which apply to passing the doubleword.

fx(int a, double b, int c)    /* 'a' in r0, 'b' in r2:r3, 'c' on the stack */

Re-declaring the function as shown below is functionally identical but passes all the parameters in
registers.

fx(int a, int c, double b)    /* 'a' in r0, 'c' in r1, 'b' in r2:r3 */


The C compiler is not omniscient


Simply put, the C compiler cannot read your mind! Much as you might wish, it cannot divine your intentions
simply by looking at the code you write. Further, it is restricted to examining the single compilation unit
with which it is presented. In order to guarantee correct program execution, it is programmed for
maximum paranoia in all circumstances.

So it must make “worst-case” assumptions about everything. The most obvious and best known example
of this is “pointer aliasing.” This means that the compiler must make the assumption that any write
through any pointer can change the value of any item in memory whose address is known to the program.
This frequently has severe repercussions for compiler optimization.

Other examples would be that the compiler must assume that all global data is volatile across external
function boundaries, loop counters may take any value, loop tests may fail first time round. There are
many other examples.

The good news is that it is, in most cases, very easy for the programmer to provide extra information
which helps the compiler out. In other cases, you can rewrite your code to express your intentions better
and to convey the particular conditions which prevail. For instance, if you know that a particular
loop will always execute at least once, a do-while loop is a much better choice than a for loop.

The for loop in C always applies the termination test before the first iteration of the loop. The compiler is
therefore forced either to place duplicate tests at the top and the bottom of the loop, to place a branch at the
top down to a single test at the bottom, or to place a branch at the bottom back up to a single test at the top
which is executed first time round.

All three approaches involve extra code, extra branches or both. Yes, you can argue that modern
sophisticated branch prediction hardware reduces the penalty of this but it is, in my view, still lazy code
and you should be doing better!
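
As a minimal sketch (hypothetical buffer-clearing routine whose caller guarantees at least one element),
the rewrite looks like this:

void clear_buffer(int *buf, unsigned int n)    /* caller guarantees n > 0 */
{
    unsigned int i = 0;

    /* for (i = 0; i < n; i++) would force a test before the first iteration */
    do
    {
        buf[i] = 0;
        i++;
    } while (i < n);    /* single test at the bottom of the loop, no extra branch */
}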

As shown below, the ARM compiler also provides several keywords for giving the
compiler “metainformation” about your code at various points.

__pure: Indicates that a function has no side-effects and accesses no global data. In other words, its result
depends only on its parameters and calling it twice with the same parameters will always return the same
result. This makes the function a candidate for common subexpression elimination.

__restrict: When applied to a pointer declaration, indicates that no write through this pointer changes the
value of an item referenced by any other pointer. This is of particular benefit in loop optimization where it
increases the freedom of the compiler to apply transformations such as rotation, unrolling and inversion
etc.

__promise: Indicates that, at a particular point in a program, a particular expression holds true.


Consider the following example:
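
(A minimal sketch, assuming the ARM compiler’s __promise intrinsic and a hypothetical processing loop.)

void scale_block(int *data, unsigned int n)
{
    unsigned int i;

    __promise((n > 0) && ((n % 8) == 0));    /* n is non-zero and a multiple of 8 */

    for (i = 0; i < n; i++)
    {
        data[i] = data[i] * 2;
    }
}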

The __promise intrinsic here informs the compiler that the loop counter, at this particular point, will be
greater than zero and divisible by eight. This allows the compiler to treat the for loop as a do-while loop,
omitting the test at the top, and also to unroll the loop by any factor up to eight without having to worry
about boundary conditions.

This kind of optimization is particularly important when combined with the vectorization capability of the
NEON engine.

The C compiler is not omnipotent


As well as not knowing everything, the compiler cannot do everything either!

There are many instructions, particularly in more recent versions of the architecture, which cannot be
generated by the C compiler at all. This is usually because they have no equivalent in the semantics of the
C language.

Using SIMD instructions to operate on complex numbers stored as packed signed halfwords is a good
example. The C compiler has no complex type, certainly not one which fits this storage method and so it
cannot use these very efficient instructions for carrying out very straightforward operations.

The proficient programmer could write hand-coded assembly functions to implement these operations.
However, it is often much easier to use the rich set of intrinsic functions provided by the ARM C
compiler.

The following example shows a short function implementing a complex multiplication using the SMUSD
and SMUADX instructions provided in architecture ARMv6.
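
A sketch of such a function, assuming the compiler’s __smusd and __smuadx intrinsics (the exact
intrinsic names are an assumption here) and complex numbers packed with the real part in the bottom
halfword and the imaginary part in the top:

int cmplx_mul(int a, int b)    /* each argument holds one packed complex number */
{
    int re = __smusd(a, b);    /* SMUSD:  re = (a.re * b.re) - (a.im * b.im) */
    int im = __smuadx(a, b);   /* SMUADX: im = (a.re * b.im) + (a.im * b.re) */

    /* repack, assuming the products fit back into signed halfwords */
    return (im << 16) | (re & 0xFFFF);
}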

When compiled, the SMUSD and SMUADX instructions appear directly in the output.

If the compiler is able to inline the function, then there is no function call overhead either. The obvious
advantages of this approach over coding in assembly are increased portability and readability.


NEON support in the compiler


The C compiler can be also used to access the functionality of the NEON Media Processing Engine via a
comprehensive set of intrinsic functions and built-in data types.

Here is a straightforward implementation of an array multiplication function. The C code is shown on the
left and the resulting assembly code on the right. Only the body of the loop is shown and this executes a
total of eight times.
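
A minimal sketch of the plain C version (hypothetical names; the element count is passed in for generality):

void array_mul(int *dst, const int *src1, const int *src2, int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        dst[i] = src1[i] * src2[i];    /* two loads, one multiply, one store per element */
    }
}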

The next pair of sequences shows the same loop implemented using NEON intrinsics. The key points to
note are that the loop has been unrolled by a factor of four to reflect the fact that the NEON load, store
and multiplication instructions are dealing with four 32-bit words each time. This greatly lowers the
instruction count. Due to the reduced number of iterations, the loop overhead is also reduced.
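
A corresponding sketch using the NEON intrinsics from arm_neon.h, assuming the length is a multiple of four:

#include <stdint.h>
#include <arm_neon.h>

void array_mul_neon(int32_t *dst, const int32_t *src1, const int32_t *src2, int n)
{
    int i;

    for (i = 0; i < n; i += 4)                /* four 32-bit elements per iteration */
    {
        int32x4_t a = vld1q_s32(&src1[i]);    /* VLD1.32 - load four words */
        int32x4_t b = vld1q_s32(&src2[i]);
        vst1q_s32(&dst[i], vmulq_s32(a, b));  /* VMUL.I32 then VST1.32 */
    }
}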

If you look a little more closely, you can see that the compiler output does not exactly correspond to the C
source code. The order of the instructions has been changed. The compiler has done this to minimize
interlocks and thus maximize throughput. This is another advantage of using intrinsics over hand coding in
assembly – the compiler is able to optimize the machine instructions in the context of the surrounding code
and in response to the target architecture.


Intelligent data cache use


Most application programmers, often quite rightly, view the caches as something which is the province of
the operating system. Certainly, configuration and management of the cache is something which the
operating system takes care of and applications are not generally permitted to meddle with them.

That is not to say, however, that the application programmer should completely ignore the fact that there
is a cache in the system and that it is capable of providing huge performance benefits. Those benefits can
be improved if the application code is written with the cache operation and architecture in mind.

Aligning data structures to cache line boundaries in memory can make better use of the pre-loading effect
of cache line fills, making it more likely that wanted data is loaded as part of the line and unwanted data is
not. The compiler provides a data attribute for doing this:

int myarray[16] __attribute__((aligned(64)));

Some very commonplace algorithms can also benefit from being written in a cache-friendly way. It is
well known that caches perform much better when data is accessed sequentially – the data automatically
loaded as part of a line fill is then automatically reused.

Operations like matrix multiplication pose an interesting problem in this regard. The following example is
a simplistic implementation of a matrix multiplication function.
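
A minimal sketch of such a function (hypothetical names and dimension), written so that the result goes
into ‘b’, ‘a’ is read along its rows and ‘c’ down its columns:

#define N 16    /* hypothetical matrix dimension */

void matrix_mul(float b[N][N], const float a[N][N], const float c[N][N])
{
    int i, j, k;

    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            b[i][j] = 0.0f;
            for (k = 0; k < N; k++)
            {
                b[i][j] += a[i][k] * c[k][j];    /* 'a' row-wise, 'c' column-wise */
            }
        }
    }
}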

You can see that the ‘a’ array is accessed sequentially, because the rightmost index varies most quickly –
this is cache-friendly. In contrast, the ‘c’ array is accessed by column – the leftmost index varies most
quickly. This is decidedly not cache-friendly. Each access will result in the loading of a complete cache
line of data, only one element of which is likely to be used before being evicted.

Since cache linefills can often happen in the background, there may not be a speed penalty associated with
this, but there will definitely be an energy cost due to the unnecessary external memory accesses.

Various techniques are well known for avoiding this behavior – strip mining and tiling being the most
common.
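
As a hedged sketch of strip mining applied to the loop nest above (strip size chosen to suit the cache,
N as in the sketch above), the k loop is split so that a strip of ‘c’ is reused many times while it is still
resident in the cache:

#define STRIP 4    /* hypothetical strip size */

void matrix_mul_strips(float b[N][N], const float a[N][N], const float c[N][N])
{
    int i, j, k, kk;

    for (kk = 0; kk < N; kk += STRIP)          /* walk 'c' in strips of STRIP rows */
    {
        for (i = 0; i < N; i++)
        {
            for (j = 0; j < N; j++)
            {
                float sum = (kk == 0) ? 0.0f : b[i][j];

                for (k = kk; k < kk + STRIP; k++)
                {
                    sum += a[i][k] * c[k][j];  /* each strip of 'c' is reused N times */
                }
                b[i][j] = sum;
            }
        }
    }
}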

Certain access patterns are also pathologically bad in terms of cache reuse and are best avoided
completely. For instance, in the case of a write-allocate cache, writing a large amount of data which is not
to be immediately re-used will result in large areas of the cache being populated with useless data.

Some recent ARM cache controllers take note of this and, in the case of the Cortex-A9, will temporarily
disable write-allocation if they detect certain unfavorable access patterns. This behavior is automatic and
transparent to the programmer but awareness of the cache allocation strategy can be a useful guide when
writing code.

Global data access


It is a feature of the ARM architecture that you cannot specify a full 32-bit address in a memory access
instruction. Accessing a variable held in memory requires its address – or, at least, a base address from
which it can be reached via a simple offset – to be placed in a register.

This gives rise to problems with accessing external global variables in that the compiler must store at
compile time and load at run time a base pointer for each and every such variable. If a function accesses a
number of external globals, the compiler must assume that they are in separate compilation units and
therefore not guaranteed to be co-located at run-time. A separate base pointer is required for each one.

If you are able to allow the compiler to infer that a group of globals are co-located in the memory address
map then they can be accessed as different offsets from a single base pointer.

The easiest way of achieving this is simply to declare globals in the module in which they are used.
However, applications which do not need items with global scope are few and far between so a more
pragmatic solution is required.

The most common solution is to place globals, or more usually groups of related globals, into structures.
These structures are then guaranteed to be placed as a single unit at link-time and can be accessed using a
single base pointer.
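
A sketch of the idea (hypothetical globals): instead of several separately declared externals, which may
each need their own base pointer, the group is wrapped in a single structure reached through one base address:

typedef struct
{
    int error_count;
    int state;
    int retry_limit;
} app_globals_t;

extern app_globals_t globals;      /* one link-time object, one base pointer */

void note_error(void)
{
    globals.error_count++;         /* base of 'globals' loaded once, members at fixed offsets */

    if (globals.error_count > globals.retry_limit)
    {
        globals.state = 0;
    }
}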


Tools features like multi-file compilation can go some way to getting round this problem but they lack
flexibility in other ways.


System power management


We turn now to wider system issues which are generally regarded as the domain of the operating system.
In most systems, it is the operating system which has control of things like clock frequency, operating
voltage, power state of individual cores etc. The application programmer is not generally aware of these.

There is an ongoing debate over the most fundamental issue: is it better to complete the necessary
computing work at maximum speed and then sleep for the longest possible time, or is it better to wind
down voltage and/or clock frequency to reduce power consumption while completing the computation just
in time?

The major factor these days in this debate is the growing contribution of static power consumption in a
system. Historically, static power consumption (primarily leakage) has been much smaller than dynamic
power consumption.

As chip geometries become ever smaller, leakage increases, making static consumption
more of a contributor to overall energy consumption. These days, this leads to the conclusion that it may
be better to “race to completion” and then power down completely (to eliminate leakage), rather than to
keep executing for longer, albeit at a lower frequency.

A suitable metric

What is needed is a metric which somehow combines energy consumption with an idea of how long a
particular computation takes. Such a metric is often referred to as “Energy Delay Product” or EDP
(Figure 3 below).

Although such metrics have been widely used in circuit design for many years, there is as yet no generally
accepted methodology for deriving or using such a metric in the field of software development.
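
In its simplest form, the metric is just the product of the two quantities:

EDP = energy consumed × time taken

Variants such as ED²P, which weight the delay term more heavily, are also sometimes used.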

Figure 3. Energy Delay Product

The example above [2] shows one instance where the EDP metric assists in making a decision as to the
optimum size of cache for a specific processor undertaking a specific task. It is clear that a larger cache
increases performance but that this is at the expense of higher power consumption.

However, the EDP metric shows that there is a “sweet spot” with a cache size of 64k at which the energy
saving due to higher performance is maximized. At greater cache sizes, performance still increases but
energy usage per unit of computation also increases, negating the benefit.

Should we turn it on?


Figure 4 below shows the impact on energy consumption of Instruction Cache and a Floating Point Unit.


The energy consumption is shown for all four possible configurations and for two different clock speeds.

Figure 4. Energy consumption of instruction cache vs floating point unit[3]

You can see that both units decrease overall energy consumption when carrying out the same amount of
computation. However, the benefit is essentially independent of clock speed. In this particular system,
therefore, it is clearly better to run to completion at maximum clock speed and then power down as early
as possible.

Re-visiting the cost of memory access


We have already established that external memory access is one of the most energy-costly operations
carried out by any system. The following is a list of techniques which can be used to minimize the cost of
this.

1) Use registers as much as possible. Minimize live variables, limit parameter count, don’t take the
address of automatic variables, inline where possible to reduce stack usage.

2) Make best use of cache. Write-Allocate/Write-Back is generally the best configuration for stack, heap
and global data; Write-Back/Read-Allocate is often better for buffers (as this kind of data is often written
by one agent and then read by another). Multi-level cache systems are best configured with L1 as Write-
Through/Read Allocate and L2 as Write-Back/Write-Allocate. Beware of “false sharing” in systems with
cache coherency management hardware.

Managing sub-systems
In a single-core system, we must ensure that additional computation engines (such as NEON) and external
peripherals (serial ports and the like) are only powered when required. This is a system decision for the
operating system developer and individual chips will provide features for managing this.

The operating system will almost always need customizing to target specific devices. For instance,
Freescale’s i.MX51 range of chips incorporates a NEON hardware monitor which can be programmed
to turn off the NEON engine automatically if it is unused for a configurable period of time. Powering it on
again is handled in the Undefined Instruction exception which results from trying to use it when it is not
powered.

In multi-core systems, we have the option of turning individual cores off and on in order to match the
system capability with current workload. The cost of power-cycling an individual core is often significant,
so the decision as to what circumstances trigger this will always be system-dependent. Current
implementations of ARM SMP Linux support the following features:

1) CPU hotplug

2) Load-balancing and dynamic re-prioritization

3) Intelligent and cache-friendly scheduling


4) Dynamic Voltage and Frequency Scaling (DVFS) per CPU

5) Individual CPU power state management

The kernel incorporates an interface to a generic external power management controller and this interface
will need targeting for individual platforms to make use of available features.

By way of example, TI’s OMAP4 platforms incorporate a range of voltage and frequency scaling options.
They define a number of “Operating Performance Points” from which the operating system can select the
most appropriate option at any particular time. The power consumption of the device can be varied from
600 microwatts to 600 mW depending on load.

The programmer’s job


In multi-core systems it is easy to focus on the features provided by the hardware and think that this is the
end of the story – that would be delinquent on the part of any serious programmer! It is still vital to bear in
mind the following properties when writing software and configuring operating systems.

1) System efficiency. Intelligent and dynamic task prioritization; load-balancing; power-conscious
spinlocks.

2) Computation efficiency. Parallelism at data, task and function level; minimized synchronization
overhead.

3) Data efficiency. Efficient use of memory system features; careful use of coherent caches to avoid
thrashing and false sharing.

Conclusions
The elephants in the room here are the following: 1) Configure the tools and the platform properly; 2)
Minimize external memory access by writing code carefully and configuring caches correctly; and 3)
Minimize instruction count by optimizing for speed and making effective use of computation engines.

Chris Shore is Training Manager at ARM headquarters in Cambridge, UK. In that role for the past ten
years he has been responsible for ARM’s worldwide customer technical training team, coordinating the
delivery of over 80 courses per year to ARM licensees all over the globe. He is passionate about ARM
technology and, as well as teaching ARM’s customers, regularly presents papers and workshops at
engineering conferences. Starting out as a software engineer in 1986, his career has included software
project management, consultancy, engineering management and marketing. Chris holds an MA in
Computer Science from Cambridge University.

References:
[1] Mario Mattei, “Reducing Energy Consumption by Balancing Caches and Scratchpads on ARM Cores,”
University of Siena.

[2] Brooks et al., “Wattch: A Framework for Architectural-Level Power Analysis and Optimizations,”
Princeton University.

[3] Max Domeika, “Evaluating the performance of multi-core processors – Part 1,” Embedded.com.

